Discussions on improving security through chaos engineering

August 3, 2023

When you rely on a tool to support you in an intense situation, you probably want reassurance that it was tested under extreme conditions. For example, if you’re about to go skydiving, you’d want to know that the parachute strapped to your back underwent rigorous testing and will perform when it’s needed most.

The same is true with the systems supporting our security initiatives. What happens when those systems are under high pressure in an emergency?

Chaos engineering answers this question by testing systems to their limits in a controlled, systematic way. By running targeted experiments, teams determine if their system is ready to withstand a high-pressure situation or needs fixes to get there.

What is chaos engineering?

Chaos engineering draws on chaos theory: the idea that we can identify patterns in seemingly random and unexpected events. By running strategic experiments to see how a system holds up against these events, teams can be better prepared if the same circumstances arise in real life.

Security chaos engineering specifically focuses on testing cybersecurity systems that would be the “first responders” in emergencies such as data breaches or attacks. Teams inject these possible events into the system, then observe how it behaves.

Aaron Rinehart, CTO and co-Founder at Verica, spoke with Guy Podjarny, Founder and President of Snyk, about chaos engineering on The Secure Developer podcast, episode #67. He summarized security chaos engineering as “proactively injecting the conditions that you expect will fire or trigger your security.”

Kelly Shortridge, VP of product strategy at Capsule8, also spoke with Guy on episode #63. They discussed how security chaos engineering tests the limits of a system and reminds us that real-life security programs do not exist in a vacuum. According to Kelly, “Security chaos engineering basically says, ‘No, that is not how the world works. You need to… have this really intellectually honest model where you embrace any information that failure gives you and just view it as a learning opportunity.’”

Why is chaos engineering necessary for security?

Chaos engineering is vital to security because it tests whether your controls can withstand the unexpected and unpredictable. Ultimately, these experiments lead to a more resilient product.

Aaron mentioned a few reasons why security chaos engineering is invaluable for development and security teams. For one, it proactively strengthens systems against unexpected events, building confidence that the system will work as intended no matter what. For another, chaos engineering empowers security teams to find system issues before they break publicly.

He compares it to a controlled burn: “I grew up on a farm in Missouri, so it’s kind of like we did a controlled burn of a field. You don’t just light a match and go to town. You notify the county, bring out the EMT and the fire department; everybody’s there in case something happens, right? That’s when you do chaos engineering… you’re not freaking out, because during [real outages], people freak out. Their cognitive load is consumed by, ‘This could be a breach. The CEO is on the phone telling me I got to get this thing back up and running; somebody’s going to lose their job over this.’”

How to start security chaos engineering at your organization

If your security team is thinking about performing some chaos engineering experiments, Aaron and Kelly have a few words of advice:

Focus on engineering errors

Start with targeted experiments that will uncover manual engineering errors. For example, a security chaos engineering experiment could focus on exploiting a misconfigured port or an overly permissive account: two common issues caused by human error.

But Aaron reminds us that “[these human errors] aren’t any engineer’s fault; it’s the fact that the size, scale, the complexity, and speed that we’re building things today is very difficult for humans to model mentally.”
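To make the misconfigured-port example concrete, here is a minimal Python sketch of one such experiment. It is an illustration under stated assumptions, not a method described in the podcast: the function names (`find_unapproved_listeners`, `run_open_port_experiment`) and the idea of an approved-port allowlist are hypothetical, and the "misconfiguration" is simply a listener opened on a free local port. The experiment injects that one fault, then checks whether a simple detector flags it.

```python
import socket
from contextlib import closing


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as probe:
        probe.settimeout(timeout)
        return probe.connect_ex((host, port)) == 0


def find_unapproved_listeners(host, candidate_ports, approved_ports):
    """Hypothetical detector: ports that accept connections but are not approved."""
    return [
        port
        for port in candidate_ports
        if port not in approved_ports and is_listening(host, port)
    ]


def run_open_port_experiment(approved_ports=frozenset()):
    """Inject one fault (an unexpected listener) and report whether it was detected."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as injected:
        injected.bind(("127.0.0.1", 0))  # the OS picks a free port for the injected fault
        injected.listen(1)
        port = injected.getsockname()[1]
        flagged = find_unapproved_listeners("127.0.0.1", [port], approved_ports)
    # The injected listener is closed again once the with-block exits.
    return {"injected_port": port, "detected": port in flagged}


if __name__ == "__main__":
    print(run_open_port_experiment())
```

In a real environment, the detector would be whatever control you actually rely on (a scanner, an alerting rule, a policy check), and the experiment's job is to confirm that the control fires when the condition is present.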

Don’t try to “boil the ocean”

Aaron also warns against trying to accomplish too much in a single experiment (aka “boiling the ocean”). Chaos engineering experiments must focus on injecting a single failure in a targeted, systematic way. These strategic injections differ from attack simulations because they focus on testing one issue at a time.

Aaron said, “When you start stepping through multiple attack points and sending a lot of data to the system, it is very difficult to sift through that data: what broke, what didn’t, what worked, what didn’t… You lose visibility because of all the noise.”
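The one-failure-at-a-time idea can be sketched as a small harness. This is a hedged illustration only: the `Experiment` class, the toy `baseline` state, and the two controls are all hypothetical stand-ins, not a real tool or anything described by Aaron. Each experiment starts from a fresh baseline, injects exactly one fault, and records whether the matching control noticed it, so every result traces back to a single cause.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

APPROVED_PORTS = {443}  # assumed allowlist for this toy system
MAX_ADMINS = 2          # assumed (deliberately weak) threshold for the admin control


@dataclass(frozen=True)
class Experiment:
    name: str
    inject: Callable[[Dict], None]    # applies exactly one fault to the state
    detected: Callable[[Dict], bool]  # did the relevant control notice it?


def baseline() -> Dict:
    """Return a fresh copy of the known-good state for each experiment."""
    return {"open_ports": {443}, "admin_accounts": {"ops"}}


def port_control(state: Dict) -> bool:
    """Fires when any port outside the allowlist is open."""
    return bool(state["open_ports"] - APPROVED_PORTS)


def admin_control(state: Dict) -> bool:
    """Fires only when the number of admin accounts exceeds MAX_ADMINS."""
    return len(state["admin_accounts"]) > MAX_ADMINS


def run_isolated(experiments: List[Experiment]) -> Dict[str, bool]:
    """Run each experiment against its own fresh baseline, one fault at a time."""
    results = {}
    for experiment in experiments:
        state = baseline()
        experiment.inject(state)
        results[experiment.name] = experiment.detected(state)
    return results


EXPERIMENTS = [
    Experiment("unexpected_port", lambda s: s["open_ports"].add(23), port_control),
    Experiment("extra_admin", lambda s: s["admin_accounts"].add("temp-admin"), admin_control),
]

if __name__ == "__main__":
    print(run_isolated(EXPERIMENTS))
    # prints {'unexpected_port': True, 'extra_admin': False}
```

Because each fault runs in isolation, the undetected `extra_admin` case points directly at the weak admin control rather than getting lost in the noise of a combined run.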

Expect failure

As you start experimenting, it’s also important not to get discouraged when your system doesn’t hold up to the challenge. Failure is one of the most critical tools in chaos engineering. Kelly said, “It’s a philosophical shift that is necessary for security, which is just embracing that failure is inevitable and that it can really be powerful.”

The future of chaos engineering

As businesses continue to turn to cloud transformation and their systems grow in complexity, it will become more critical than ever for teams to use chaos engineering. Aaron explains that cloud transformation shouldn’t be a prerequisite for chaos engineering. It should be the reverse: chaos engineering prepares organizations for cloud transformation. By testing the effectiveness of their new cloud capabilities from the start, teams are better prepared to maintain them moving forward. In Aaron’s words: “The only way to understand a complex system is to interact with it.”

Want to dive deeper into security chaos engineering and learn more about how it could work for you and your business? Tune in to the podcast episodes featuring Aaron and Kelly:

Episode 63: Container Security, Microservices, and Chaos Engineering with Kelly Shortridge
Episode 67: Security Chaos Engineering - What is it, and why should you care? With Aaron Rinehart

And for more content on developer security, AppSec, and DevSecOps, subscribe to The Secure Developer today!