What I Wish I Knew About Incident Management

Nov 7, 2020 11 min read Reliability

I gave this talk last year at LinkedIn’s internal SRE conference and thought I’d share it here as well.

Why I am writing this post

Like every Software Engineer / SRE, I’ve had my share of troubleshooting software. However, I had never been oncall before I joined LinkedIn, and the impact of a system outage that affects thousands of engineers made my first week of oncall pretty overwhelming.

first week of me handling production issues

But things got better over time.

In this post, I would like to share the incident management practices I have picked up over the years as an SRE at LinkedIn that help me keep calm under pressure and effectively drive incidents to resolution.

present me handling production issues

What this post is not about

In this post, I am not going to talk about how to debug Linux or distributed systems, or about the various debugging tools. (For stories from the frontlines, check out Software Misadventures Podcast!)

First oncall week

My first few weeks at LinkedIn were great! I was meeting smart engineers and learning new things. It wasn’t until my first oncall rotation that I started thinking: what if there’s an outage and I need to fix it?

anxiety before first week of oncall

Now, don’t get me wrong. LinkedIn has really good systems in place for monitoring/alerting, triaging issues and a very well defined process for incident response. Those are absolutely critical. And to prepare, I had shadowed our oncall the week prior and even had an experienced team member shadow me to guide and help me out, but still, I was anxious.

Here’s what I wish I had known.

Before oncall starts
- Oncall handoff
- Organizing Slack during an oncall week
Signal vs Noise
- Trust, but verify - not all alerts are created equal
- Declaring an incident
Communication during an incident
Incident response is a collaborative process
Towards a resolution
- Looking at changes
- Keep calm and carry on - one step at a time
Learning from the incident

Before Oncall Starts

Oncall handoff

Before an oncall week starts, I talk to the person who is currently oncall to get context on any incidents that happened during the week or any weird bugs that were discovered in our stack. It gives me perspective on issues that could carry over from the previous week and on any critical changes I should be aware of. Although you can’t plan for everything that happens during an oncall week, a proper handoff helps you prepare for it.

Organizing Slack during an oncall week

I am an SRE on LinkedIn’s container scheduler and deployment infrastructure team, so our users are engineers at LinkedIn. We use Slack for internal communication and we have certain channels where users share issues they are experiencing with the tooling or try to get our oncall’s attention. During an oncall week, I star these channels and organize my sidebar so that I can easily notice messages from our users and distinguish them from the other messages I receive. As an optional tip - I also like to mute/leave channels that I am not actively participating in to reduce clutter.

Compared to a normal week, I spend more time on Slack when I am oncall - responding to people, answering support requests, helping users with tooling. These Slack notifications can be a bit of a distraction. However, since all our users are internal, a rise in messages on our channels can also indicate that something is wrong with our system and our alerting hasn’t caught it yet.

slack pings during oncall

Signal vs Noise

Trust, but verify - not all alerts are created equal

trust-but-verify-you-must

This is something I learned during my interview at LinkedIn and it has been very applicable in my experience since.

In the early days of oncall and incident management, it is very common to feel as if things are on fire whenever an alert gets triggered. Over time, I have realized that gaining context and verifying that the alert actually indicates an issue helps with choosing the right next steps. It is important to distinguish signal from noise because:

- An alert could be non-actionable because it was recently configured with a threshold that makes it too noisy
- The monitoring stack could be down, with the alert triggering because its configuration treats a lack of data points as an issue
- Your service could be operating perfectly fine while the traffic tier routing requests to it has an issue
- The timer on a deliberately silenced alert could have expired, so the alert started triggering again

I have experienced all of the above at some point in time. Now when I either receive an alert or a user reports an issue, I check our services (metrics, logs, reproduce the reported problem etc.) to verify that the issue is real.
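As a rough illustration of that verification habit, here is a minimal Python sketch that walks through the four cases above before treating an alert as real. Everything in it is hypothetical: the `AlertContext` fields, the `triage` function and the returned labels are names I made up for this example, not part of any real alerting tool, and in practice each boolean would come from a human (or a script) actually checking dashboards, logs and a manual repro.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    """Answers gathered while verifying an alert (all fields are hypothetical)."""
    monitoring_healthy: bool     # is the monitoring stack itself up and reporting?
    upstream_healthy: bool       # is the traffic tier in front of the service fine?
    recently_reconfigured: bool  # was the alert threshold changed recently?
    silence_just_expired: bool   # did a deliberate silence on this alert just run out?
    service_checks_pass: bool    # do metrics, logs and a manual repro look normal?


def triage(ctx: AlertContext) -> str:
    """Return a coarse label suggesting where to look first."""
    if not ctx.monitoring_healthy:
        # Missing data points may be what triggered the alert.
        return "check-monitoring"
    if not ctx.upstream_healthy:
        # The service may be fine while requests never reach it.
        return "check-upstream"
    if ctx.service_checks_pass:
        if ctx.recently_reconfigured or ctx.silence_just_expired:
            return "likely-noise"
        # Nothing looks wrong yet; keep watching before escalating.
        return "unconfirmed"
    return "real-issue"
```

The ordering is a judgment call: ruling out the monitoring stack and the traffic tier first avoids debugging a healthy service.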

Declaring an incident

Not all actionable alerts result in an incident. To be effective at identifying the ones that do, it is crucial to think about the bigger picture of mitigating the issue rather than be overwhelmed by the technical task of resolving the alert.

Some qualitative measures that help make this distinction are whether the issue requires coordinating the fix with other teams, whether it is impacting customers, and whether it is violating an SLO. If any of these conditions is true, declare an incident. It is always better to declare an incident early in the process than to wait too long.

At LinkedIn, we have defined guidelines for all teams about what warrants an incident along with different levels of severity. This takes guesswork out of the picture, and provides a shared understanding to every team member.
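The three qualitative checks described above boil down to an "any of these is true" rule. The sketch below writes that rule out in Python. It is only an illustration of the reasoning in this post, not LinkedIn's actual guidelines; `IssueAssessment` and `should_declare_incident` are hypothetical names, and severity levels are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class IssueAssessment:
    """A quick, qualitative read of an issue (hypothetical structure)."""
    needs_other_teams: bool   # does the fix require coordinating with other teams?
    impacts_customers: bool   # are customers affected right now?
    violates_slo: bool        # is an SLO being violated?


def should_declare_incident(assessment: IssueAssessment) -> bool:
    """Declare an incident as soon as any one of the conditions holds."""
    return (
        assessment.needs_other_teams
        or assessment.impacts_customers
        or assessment.violates_slo
    )
```

Because a single true condition is enough, the rule leans towards declaring early rather than waiting for more evidence.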

Communication during an incident

The incident title and scoping the impact

Every incident gets a title. I didn’t realize it initially, but giving an incident a title forces one to define the problem and communicate it to the stakeholders very succinctly. When communicating to stakeholders, scoping the incident is very important. By scoping, I mean identifying how big the impact is - which environment is impacted, whether the impact is limited to a region or is global, how many customers are impacted, etc. For instance, “the login feature on the site is not working” and “the login feature on the site is not working for traffic originating from Asia Pacific” say two very different things.

Establish communication channels and an incident lead

We rely heavily on Slack to communicate during an incident. A dedicated Slack channel helps focus all the energy and inputs from everyone in one place. It helps the incident lead collect data about symptoms that the users are experiencing and also captures a log of considered/discarded hypotheses and any changes made to the system. Once the Slack channel is created, establish an incident lead and let everyone know who is driving the incident forward.

Establishing explicit comms channels for reporting and identifying issues, and establishing an incident lead, reduces the delay in action and clears up any confusion.

Communicate changes to the system

If there’s a fix you’d like to try out, let the others know and encourage everyone to do the same. This ensures that the potential fix doesn’t make an already bad situation worse and helps catch any blind spots early. If you do end up making a change to the system after getting consensus, let others know and follow up on its effects.

Provide regular updates

While working towards a resolution for the incident, it is very easy to get overwhelmed by the technical details and forget to communicate an update. This leads to angry leadership and annoyed customers who have no insight into what’s happening.

An update shows customers that the issue is being worked on and lets leadership identify whether they can help out with anything. Depending on the severity of the incident, an update every 15-30 minutes serves pretty well. The update doesn’t have to be extremely detailed; a brief summary describing the current state and immediate next steps is sufficient. An example update:

[UPDATE] Our hypothesis about the memory leak checks out and we have validated the fix in the staging environment. We have pushed out the change and as soon as it goes through the CI pipeline, we’ll canary the new release, monitor metrics and promote the change after verifying the fix.

Incident response is a collaborative process

Get help early

In my earlier days, I used to think that it was solely my responsibility to mitigate the issue, find the root cause, and roll out the fix. If I couldn’t do it, it wouldn’t reflect well on me. In reality, incident management, like much of software development, is a very collaborative process.

One of the big differences in how I approach it now is that I focus on actively pulling in other engineers who could help with debugging or resolving the issue early in the process. This change in perspective has relieved me of a lot of unnecessary stress and also made me more effective at resolving incidents.

When requesting help, be specific about the task as well as the urgency. It helps others calibrate their response and manage things they might have at hand.

Working with others

One of the most important things while handling incidents is working with others. Considering the multi-faceted nature of an SRE role, it’s one of the most important yet underrated skills.

With multiple people involved, various possibilities get shared and one of the responsibilities of the incident lead is to guide these discussions in a productive direction while filtering out the noise. One should be cautious about going down any rabbit holes and focus on stopping the bleeding first. If there’s a probable cause, share it early - it helps rule out possibilities. And it’s okay to ask questions that seem obvious, but specific questions are more helpful.

Being fearlessly curious and keeping an open mind enables one to consider possibilities that could otherwise be overlooked. Remember, you are working together as a team on sharing and ruling out hypotheses to solve a challenging problem.

Video Conferencing is your friend

We work with engineers in geographically distributed locations (even when it’s not a pandemic) and some discussions are much better conducted synchronously than async. In such scenarios, start a video call early in the process. Either a shared conference room or a video call has almost always helped us figure out the path forward sooner and coordinate next steps.

Towards a resolution

Looking at changes

Things mostly break when something changes. If we froze our code, configurations and infrastructure, we’d have far fewer outages. Code and config deployments are two of the most common actions that result in change, so rule those out first. A few things to check for change are:

- Code or config deployment targeting the application experiencing the issue
- Global configuration changes of the underlying infrastructure (network, OS, etc.)
- Anomalous traffic patterns

The intention here is to identify what is unusual about the system that’s resulting in an issue. I must mention that it isn’t always a single change; sometimes a combination of changes results in the perfect storm.
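One way to make "look at what changed" concrete is to list every change that landed shortly before the incident started, newest first. The Python sketch below assumes you can export changes (deployments, infrastructure config pushes) into a simple list. The `Change` record, the `changes_before` helper and the six-hour default lookback are all hypothetical choices for this example, not how any particular deployment system exposes its history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """One recorded change to the system (hypothetical record format)."""
    kind: str             # e.g. "code-deploy", "config-deploy", "infra-config"
    target: str           # the application or infrastructure component changed
    applied_at: datetime  # when the change took effect


def changes_before(changes, incident_start, lookback=timedelta(hours=6)):
    """Return changes applied within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)
```

Returning every change in the window, rather than only the latest one, keeps combinations of changes visible, since it is not always a single change that causes the issue.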

Keep calm and carry on - one step at a time

While all of this is going on, it’s pretty common to be stressed knowing that users are unable to use your system as promised. If you start to feel panicked or overwhelmed, call for support. Panic results in a scrambled path forward, which in turn makes the incident take longer to resolve. And for many people, the stress they feel grows with the length of the incident.

In addition to calling for support, take a step back to re-evaluate the issue holistically and think about the upstreams/downstreams of your system to rule out possibilities in a structured way. Structuring your thoughts and reminding yourself that you are not alone in this helps you keep calm and handle the pressure.

Learning from the incident

I have gone from dreading incidents to treating them as an opportunity to learn something new. I have learnt the most about our software and the underlying infrastructure when something breaks. It makes the task of incident response exciting and fun - like a detective solving a difficult case. And when the incident is finally resolved, it brings a great sense of satisfaction, accomplishment and pride for the team involved in resolving the incident.

When you are not oncall, be available to help your teammates when there’s an incident. They will appreciate it and you’ll learn something new in the process. Last but not least, ensure that not just the postmortem but also the incident resolution is blameless. It helps foster a culture of ownership, results in incidents being resolved faster and helps improve the entire team’s performance over time.

Thanks to Guang Yang and Aishanee Shah for reviewing this post and providing very valuable feedback.

If you have any specific incident management practices that have helped you over time, I’d love to know. Feel free to comment on the post or reach out to me via Twitter / Email. I look forward to hearing from you!
