Preventing YAML parsing vulnerabilities with snakeyaml in Java

What is YAML?

YAML is a human-readable data serialization language that’s commonly used for config files. The name is a recursive acronym for “YAML ain’t a markup language”, and the language was first released in 2001. You can compare YAML to JSON or XML, as all of them are text-based structured formats.

How are YAML, JSON, and XML different?

While similar to those languages, YAML is designed to be more readable than JSON and less verbose than XML. All three handle structure and nesting with different syntax: JSON uses braces and brackets, XML uses paired tags, and YAML uses whitespace indentation.

YAML files are often used to configure applications, application servers, or clusters. It is a very common format in Spring Boot applications and, of course, to configure Kubernetes. However, similarly to JSON and XML, you can use YAML to serialize and deserialize data.

Although YAML looks like an excellent alternative to XML and JSON, many people aren’t big fans of the structure. Since the language is line-based and uses indentation to represent structure and nesting, indentation often causes problems in complex data structures. A single missing (or extra) space in a complex, data-heavy structure can make parsing fail or silently change the meaning, and finding that one space in a YAML file is difficult.

More importantly, parsing YAML in your Java application with an outdated version of snakeyaml might get you into trouble.

TL;DR

Versions of snakeyaml before 1.26 contain a Denial of Service vulnerability.
We highly recommend that you update snakeyaml to version 1.26 or higher to prevent this problem.

Parsing YAML files in Java with snakeyaml

To parse YAML files in your Java application, you can use the well-known library snakeyaml. This is a lightweight, straightforward library that you can use to convert YAML to objects and the other way around.

Let’s focus on reading YAML into our Java program. You can basically do this in two different ways. The first way is the generic way of reading YAML input with snakeyaml. In the snippet below I will read a YAML file from my resources folder that is on my classpath.

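// Needs java.io.InputStream, java.util.Map and org.yaml.snakeyaml.Yaml
InputStream is = getClass().getClassLoader().getResourceAsStream(filename);
Yaml yaml = new Yaml();
// A YAML mapping at the root is returned as a LinkedHashMap
Map<String, Object> data = yaml.load(is);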

By loading the YAML like this, the result will be a LinkedHashMap (typically with String keys and Object values, assuming the root of the document is a mapping) representing the YAML file in a structured way. The values can be of any type, such as strings, numbers, lists, or nested maps, because it is a generic structure not bound to a specific class.

The second way of reading YAML is more specific. You can parse your YAML input to a particular object. Snakeyaml will try to bind the YAML keys to the object’s fields by naming convention. This will end in an exception if the YAML file doesn’t fit the structure of the target class. In the snippet below I will parse my YAML input to a type Person.

InputStream is = getClass().getClassLoader().getResourceAsStream("person.yaml");
Yaml yaml = new Yaml(new Constructor(Person.class));
Person person = yaml.load(is);

Person.java

public class Person {
   private String firstname;
   private String lastname;
   // getters and setters
}

person.yaml

firstname: "Matt"
lastname: "Murdock"

Both ways of parsing YAML work perfectly fine. If you are absolutely sure about what the input should be, you can convert your YAML input to a specific object. If this is not the case, you might prefer the more generic way and search the resulting map manually.
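Searching the generic result manually mostly comes down to checking types as you go, since every value is only known to be an Object. Below is a minimal sketch of that idea. The class GenericYamlLookup and the method stringAt are hypothetical names of mine, and the map is built by hand as a stand-in for what yaml.load would return for person.yaml, so the snippet runs without snakeyaml on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class GenericYamlLookup {

    // Returns the String stored under `key`, or empty if the key is
    // missing or holds a value of another type (number, list, map, ...).
    static Optional<String> stringAt(Map<String, Object> root, String key) {
        Object value = root.get(key);
        return value instanceof String ? Optional.of((String) value) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hand-built stand-in (assumption) for the map yaml.load(...) would produce.
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("firstname", "Matt");
        root.put("lastname", "Murdock");

        System.out.println(stringAt(root, "firstname").orElse("<missing>"));
        System.out.println(stringAt(root, "age").orElse("<missing>"));
    }
}
```

Returning an Optional instead of casting directly keeps unexpected input from surfacing as a ClassCastException deep inside your code.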

Billion laughs attack in YAML

One feature of YAML is that you can create anchors. You can reuse these anchors in different places so you do not have to repeat yourself. In the simplified example below, I create two variables: var1 and var2. By using anchors, var2 has the same value as var1.

var1: &anchor value
var2: *anchor

Let’s take this to the extreme and create the famous billion laughs attack for YAML. By applying this concept in a nested way, I can actually make a billion laughs.

lol1: &lol1 ["lol","lol","lol","lol","lol","lol","lol","lol","lol"]
lol2: &lol2 [*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1]
lol3: &lol3 [*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2]
lol4: &lol4 [*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3]
lol5: &lol5 [*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4]
lol6: &lol6 [*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5]
lol7: &lol7 [*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6]
lol8: &lol8 [*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7]
lol9: &lol9 [*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8]
lolz: &lolz [*lol9]

As you can see, lol1 is a list of nine "lol" strings, and lol2 is a list of nine references to lol1. By repeating this principle up to lol9, fully expanding lolz yields 9^9, roughly 387 million, copies of "lol". Give every list ten entries instead of nine and you reach 10^9: a literal billion laughs.
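The growth is plain exponentiation: with a fixed number of entries per list and a fixed number of nested levels, following every alias reaches entries-to-the-power-of-levels strings. The small sketch below does that arithmetic for the file above, which has nine entries per list and nine levels. The class and method names are hypothetical and only serve the illustration.

```java
final class YamlBombMath {

    // Number of leaf strings reached once every alias is followed:
    // each of the `levels` nested lists holds `fanOut` entries.
    static long expandedLeaves(int fanOut, int levels) {
        long total = 1;
        for (int i = 0; i < levels; i++) {
            total = Math.multiplyExact(total, fanOut); // throws on long overflow
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(expandedLeaves(9, 9));  // prints 387420489
        System.out.println(expandedLeaves(10, 9)); // prints 1000000000
    }
}
```

A file of about ten short lines therefore describes hundreds of millions of values, which is what makes the input so cheap to send and so expensive to process.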

With anchors, you can create a YAML bomb! The tremendous amount of (nested) objects that such a YAML input creates will cause a memory overload.

With snakeyaml versions below 1.26, this can be a problem. If you parse YAML in the generic way, as described in the first example, this YAML bomb will end in a java.lang.OutOfMemoryError on the Java heap space. This typically means your application crashes and is no longer available: a Denial of Service.

If you parse your YAML to a specific object, as in the second example, this might seem like less of an issue. The snakeyaml library tries to match each key to a field in your class. Because Person has no fields named lol1, lol2, and so on, the match fails and you get a YAML exception. Although binding to a type might seem like a good solution, it is not foolproof.

Say we have a type Person with a firstname and lastname like before. But besides that, it can also contain children, represented by a collection of type Person.

public class Person {
   private String firstname;
   private String lastname;
   private List<Person> children;
   //getters and setters
}

Now I have the same problem as the original billion laughs attack. I can create a similar YAML file using anchors that goes through several layers. The YAML file could look like this. Note that you are not obligated to fill in all the fields of a type.

firstname: X
lastname: XX
children:
   - children: &a [{firstname: a},{firstname: a ..]
   - children: &b [{children: *a},{children: *a} ..]
   - children: &c [{children: *b},{children: *b} ..]
   - children: &d [{children: *c},{children: *c} ..]
   - children: &e [{children: *d}, {children: *d} ..]
...

I created multiple children of the root parent, and each child’s children point to the previous anchor, creating a snowball effect. So regardless of whether I parse the YAML file generically or bind it to the Person object, I will end up with a heap overload.

Fixing a billion laughs YAML attacks in Java

The solution to this problem is way easier than you think and much less painful than finding the missing whitespace in a YAML file. You only have to update your snakeyaml version to 1.26 or higher. The folks at snakeyaml did a great job of fixing this issue by limiting the number of aliases for non-scalar nodes (lists and maps) to a maximum of 50 by default. When you parse a YAML bomb like the ones described earlier with the newer version of snakeyaml, you will just get an exception reporting that the number of aliases for non-scalar nodes exceeds that maximum. This also means your heap will not overflow, and your application keeps running.
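The idea behind such a limit can be illustrated without the library: count the aliases in a document and refuse it when there are too many. The sketch below is a deliberately crude, hypothetical pre-check (AliasBudget and its methods are my own names), not how snakeyaml implements the limit. It works on raw text, so it also counts aliases that point to scalars and any `*word` that happens to sit inside a quoted string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AliasBudget {

    // Crude approximation of a YAML alias token: '*' followed by an anchor name.
    private static final Pattern ALIAS = Pattern.compile("\\*[A-Za-z0-9_-]+");

    // Counts every alias-looking token in the raw YAML text.
    static int countAliases(String yamlText) {
        Matcher matcher = ALIAS.matcher(yamlText);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Throws if the document uses more aliases than the budget allows.
    static void requireWithinBudget(String yamlText, int maxAliases) {
        int found = countAliases(yamlText);
        if (found > maxAliases) {
            throw new IllegalArgumentException(
                    "Too many aliases: " + found + " > " + maxAliases);
        }
    }

    public static void main(String[] args) {
        String doc = "a: &a [x, y]\nb: [*a, *a, *a]\n";
        System.out.println(countAliases(doc)); // prints 3
        requireWithinBudget(doc, 50);          // within budget, no exception
    }
}
```

The billion laughs file shown earlier contains 73 aliases, so a budget of 50 rejects it long before any expansion happens. In snakeyaml itself the limit lives in the loader configuration (LoaderOptions.setMaxAliasesForCollections, available from 1.26), so upgrading remains the actual fix and a check like this is at most an extra safety net.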

This once again shows how important it is to keep track of the libraries you depend on. Updating to the newer version solves the problem in this case. As you might know, Snyk can help you with this if you connect your code repository. In addition, keep scanning the libraries you depend on with Snyk Open Source so you will not be surprised by such a vulnerability.
