SREs bring ORDER(R) to CHAOS
October 19, 2022
Categorizing the challenges and duties of your trusted friend, the site reliability engineer (SRE).
From Snyk Ambassador Keith McDuffee, DevSecOps and founder of StackRef.com.
"What's the difference between a DevOps engineer and a site reliability engineer?"
It's a question I hear all the time — and one I've heard (and sometimes asked) in job interviews. But is there a correct answer? It all depends on who you ask. I believe there is a line, but it's so blurry you’d have to squint to see it.
For now, let's talk about the SRE. What are their responsibilities? What do they do? What challenges do they face, day-to-day?
I'm going to (try to) be clever here and group their responsibilities into what I'm calling ORDERR, and their challenges as CHAOS.
CHAOS
What are the things keeping SREs on their toes and restless at night? They fall into the following categories:
Critical component failure
The underlying components of a computer system are many, but we're not just talking about hardware here. While hardware (along with its redundancies) can fail, there are services and applications to consider as well. Applications crash, services become overloaded, and hardware flat-out fails.
Human intervention
Lack of effective automation is akin to an SRE being woken up to type 4, 8, 15, 16, 23, and 42 into a keyboard at 2:00 AM every night. Relying on a flesh-and-blood human to interact with tasks that could be easily left for a program to do is just asking for exasperation, exhaustion, and fat-fingering the wrong code. What's more, when those automated systems fail — and they can and do fail — not having a way to replicate the process quickly and easily becomes another problem itself.
Application code bugs
There's only so much an SRE can do. Some can code and know an application inside and out, but that's not usually what they've been hired for. An SRE can, however, help pinpoint the responsible parties and get the right engineers on the case. SREs should also have a bit of precognition with their tools of observability since, without it, there's a ticking time bomb waiting to wake them. Again.
Open source issues
Sometimes, your developers aren’t entirely at fault for code that is crashing or is insecure (though, ideally, they should know better). Many applications are built upon open source dependencies — some of which approach black boxes of wizardry and complexity that no single team of internal software developers could fully understand. A module imported into a project to enable a single feature might introduce a thousand other components that are unneeded, leading to potentially yet-undiscovered vulnerabilities and performance issues.
Open source dependencies (Log4j, anyone?) are not the only potential culprit. Paid, third-party inclusions can also be at fault, which sometimes can be worse: you're now at the mercy of that company's priorities, not of the greater community.
Security events
An SRE's true nightmare. If and when it happens, it's either not quickly or easily remedied, or it's left undiscovered for longer than you'd like to admit. Denial of service (DoS) events are included here and are one thing to mitigate, but an active, hostile presence in your systems is another. Things are "remedied" only once the active situation is under control and has subsided, and you're treating everything that could have been compromised as nuclear waste.
ORDERR
I tried — I really tried — to get this down to just ORDER, but you simply cannot ignore or separate the two last Rs on the list.
Observability
What good is attempting to deal with the CHAOS if you're blind to it? Observability is simply how you're keeping tabs on it all — availability, performance, activity.
But it's not just about having a view into this information. There must be actionable, reliable alerting in place for events that call for attention. "Actionable" and "reliable" go hand-in-hand here. If an SRE is woken up at night by a false alarm — and those false alarms continue — they're apt to get fatigued and possibly run into a "crying wolf" situation, where an alert thought to be false is actually something to be concerned about.
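One common way to cut down on false alarms is to page only after several consecutive failed checks rather than on the first blip. The sketch below is a minimal illustration of that idea, not any particular monitoring tool's behavior; the `DebouncedAlert` class, its `threshold` parameter, and the sample check results are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DebouncedAlert:
    """Hypothetical helper: page only after `threshold` consecutive failures."""

    threshold: int = 3
    consecutive_failures: int = 0
    firing: bool = False

    def observe(self, check_passed: bool) -> bool:
        """Record one health-check result; return True when a page should go out."""
        if check_passed:
            # Any successful check resets the streak and clears the alert.
            self.consecutive_failures = 0
            self.firing = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.firing:
            # Page once per incident, not once per failed check.
            self.firing = True
            return True
        return False


# Assumed sample data: one transient blip, then a sustained failure.
alert = DebouncedAlert(threshold=3)
results = [True, False, True, False, False, False, False]
pages = [alert.observe(passed) for passed in results]
print(pages)
```

With this sample data, the single failed check is ignored and exactly one page is sent, on the third consecutive failure. The trade-off is slower detection, so the threshold should reflect how long the service can tolerate being down before someone is woken up.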
While meaningful, timely alerting is important, an SRE can't rely on alerts alone. Having a view into monitoring, where SREs can continually take the pulse of what's running, is the first step toward setting up those alerts! Some SRE teams use third-party tools, while others roll their own. It all depends on budgets and whether there’s a solution that meets your unique situation.
There are too many tools to list for this category, but here are a few: Datadog, New Relic, ELK Stack (can be supported and hosted free), SolarWinds, and Splunk.
Reliability
This is a wide-ranging topic. Many things could keep systems and applications from running reliably. From the ground up, the entire stack (and beyond!) is at play: geolocation, hosting facility, servers and hardware, networking, operating system, services, application code, and how your provided services are delivered. Tired yet? SREs are.
I would have added an "A" here for availability, but I believe they are closely related enough that they can mean the same thing. Ensuring you have the things in place to make your systems available — redundancy, automated and/or manual methods of failover, security, and denial-of-service incident handling — is part of the reliability topic.
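As a rough sketch of what automated failover can look like at its simplest, the snippet below walks a priority-ordered list of endpoints and picks the first one that passes a health check. The `pick_healthy` function, the host names, and the hard-coded health statuses are assumptions made for illustration; a real setup would probe the endpoints over the network and would usually lean on a load balancer or orchestrator.

```python
from typing import Callable, Optional, Sequence


def pick_healthy(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Assumed health statuses; the primary is down, so traffic should fail over.
status = {
    "primary.example.internal": False,
    "replica-1.example.internal": True,
    "replica-2.example.internal": True,
}
endpoints = list(status)  # dicts preserve insertion order, so this is priority order

active = pick_healthy(endpoints, lambda host: status[host])
print(active)
```

Returning `None` when nothing is healthy keeps the "everything is down" case explicit, which is the point where manual failover and the disaster response plan take over.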
Except for application code, SREs are solely responsible for the care and feeding of what keeps an organization's systems reliable. For application code, SREs are in many cases merely a means of observability for the software developers who built it, not the ones fixing what's broken. This is partially achieved with static code analysis tools that raise issues as they are found, or via the monitoring and logging in place. Once found, it's the SRE's responsibility to ensure those issues are triaged to the appropriate team(s) and addressed promptly.
Snyk offers tools to help with this category.
Disaster response
"Disaster" is a broad term for SREs. It can cover anything from critical infrastructure failing to (God forbid) a plane crashing with your lead software engineer in it. Of course, it also covers anything and everything security related.
Having a functional means of response to these disasters is fundamental to an SRE's duties, which makes having available, updated runbooks (covered in the E below) key, as one can't always rely on grey matter alone. Tabletop exercises and regular disaster recovery testing can be difficult to make time for, but they are critical to ensuring that responses to disastrous situations are fast and well organized.
Some paid (and limited/free) tools for this category are Mattermost and PagerDuty.
An effective response also means ...
Effective communication
Methods of communication are many, especially when many teams are separated via remote work and distributed time zones. Having a well-established method of communicating during any troublesome issues is a must — not just in the where, but also in the how and when.
Proper means of communication not only help SREs maintain their observability of given situations, but also help ensure that situations are being handled properly and promptly, and that work is not being duplicated.
There is also communication to both internal and external stakeholders. If something is being observed as abnormal, communication starts with internal teams before external customers are made aware of issues that impact their use of your services. SREs are the starting point for the if, when, and how of external communication.
Here are some common communication tools I’m sure you know about: Slack, Microsoft Teams, and Atlassian StatusPage.
Recovery
All of the above leads to the ultimate goal (beyond nothing going wrong) — recovering everything to a stable, secure state. All of the aforementioned items will determine how quickly this can be achieved.
Observe what is going on.
Have a reliable means of failover and redundancy.
Ensure your disaster response action plan is followed.
Put your effective communication to use.
SREs are responsible for meeting a set MTTR (mean time to recovery), and it can't be reached with blind luck. Once you've recovered, it's back to effective communication to bring the stakeholders' alarms down. But for the SRE, it's not over yet. Onto ...
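To make the mean time to recovery target concrete, here is a small sketch that averages the time between detection and recovery across a handful of incidents and compares it to a target. The incident timestamps, the one-hour target, and the `mean_time_to_recovery` helper are invented for illustration; real numbers would come from your incident tracker.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, recovered_at) pairs.
incidents = [
    (datetime(2022, 10, 1, 2, 0), datetime(2022, 10, 1, 2, 45)),     # 45 minutes
    (datetime(2022, 10, 9, 14, 30), datetime(2022, 10, 9, 16, 0)),   # 90 minutes
    (datetime(2022, 10, 17, 23, 15), datetime(2022, 10, 18, 0, 0)),  # 45 minutes
]


def mean_time_to_recovery(records):
    """Average of (recovered - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident to compute a mean")
    total = sum(
        (recovered - detected for detected, recovered in records),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
target = timedelta(hours=1)  # assumed target, not taken from the article
within_target = mttr <= target
print(f"Mean time to recovery: {mttr}, within target: {within_target}")
```

An average hides outliers, so one long outage can be masked by several quick recoveries. Teams often track the worst case or a percentile alongside the mean.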
Retrospective
Once the dust has settled, SREs can take a well-earned breather, but not for too long. Something bad happened. And left unchecked, it's likely to happen again.
This leads us back to effective communication. We have to understand the when, what, where, why, and how, in addition to what we are going to do to prevent it from happening again.
Lessons learned must be documented along with the incident, with an action plan that doesn't simply fly off into the ether. Any of the prior items above might need tweaking — better observability, hardened systems and security, additional layers of reliability, or even improved communication. Then reasonable timelines must be set to ensure those things get done, and the SREs should be standing over the responsible parties to ensure they're addressed.
DevOps vs. SRE
So, then, what do DevOps folks do, if they are not also SREs? I won't get into it all, but in relation to SREs, DevOps folks help make all of those things happen between the SREs and the developers, empowering both sides to do what they need to do without having to throw too many things over the wall. They support and provide all of the ORDERR that SREs depend on to handle the CHAOS, while empowering non-SREs to be as involved as they need to be.
Good luck out there …
- K