Discover more from Hartley's Handbook
Ultimate Guide to Incident Response Plans for Product Engineering Orgs [Free Template]
For improving or implementing incident response, look no further.
Unless you’re in a company that never releases software, you will, at some point in the future, deal with an incident. No one knows the day or the hour, but an incident is lurking around the corner. Are you prepared?
If you don’t have a plan in place already, have no fear. I’ve got an easy-to-use template/reference here (Notion) and below I’ll go into way more detail on each part. This template is built from 10+ years of experience in engineering organizations of various sizes, witnessing good and bad on both the engineering and management side of things.
The very basics
Incident response can be extremely complicated, but at its core, here’s what you need to do.
Identify that it is an incident along with its priority
Start an incident
Stabilize the incident
Resolve the incident
Review the incident in a postmortem
You now hold the power in your hands, but before we dive a little deeper and get more detailed, here’s my guiding principle for running incidents in product engineering organizations: Fail With Transparency
Failing With Transparency
One of my guiding principles when running incidents in a product engineering org is to fail with transparency. Too often engineering teams try to hide their mistakes, sweeping bad bugs under the rug so other parts of the organization don’t worry about the software. You can sweep bugs and incidents under the rug, but if you’re looking to build trust with stakeholders, incidents give you a chance to cement the relationship.
Think of it like the status pages for AWS or the old-school Fail Whale with Twitter. Once folks know something is up, they really want to know what’s wrong and what’s being done to fix it.
Are stakeholders happy about incidents? Absolutely not, but they do expect incidents to occur from time to time. Instead of hiding, share your incidents, postmortems, next steps, and learnings. Pull your stakeholders in and call it what it is, an incident.
Starting with something simple like the following is totally fine:
Hey there, we got a notification that Y service is down. We’re taking a look and hope to resolve in X amount of time. We’ll keep you updated, but please let us know what impact this has on your team. You can follow along here (link to channel).
This lets them follow along, inform their teams, create contingency plans, etc. Failing out in the open is better than trying to fix something before stakeholders notice. They’re going to find out one way or another, so why not make them part of the solution?
Also, when in doubt, over-communicate. Do this in a concise way, but give updates that everyone can understand even if the update is “no update at this time.”
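If you find yourself retyping that first message at 4:30am, it can help to template it. Here is a minimal Python sketch of the sample message above plus a follow-up update; the function names and parameters (initial_update, heartbeat_update, channel_link) are my own illustrative assumptions and not part of any tool.

```python
def initial_update(service: str, eta: str, channel_link: str) -> str:
    """Build the first stakeholder message for an incident."""
    return (
        f"Hey there, we got a notification that {service} is down. "
        f"We're taking a look and hope to resolve in {eta}. "
        "We'll keep you updated, but please let us know what impact "
        f"this has on your team. You can follow along here: {channel_link}"
    )


def heartbeat_update(minutes_since_start: int, progress: str = "") -> str:
    """Build a follow-up message; an empty progress note still yields an update."""
    # Hypothetical format: silence is replaced by an explicit "no update" line.
    status = progress or "no update at this time"
    return f"[{minutes_since_start} min in] {status}"


if __name__ == "__main__":
    print(initial_update("checkout service", "30 minutes", "#inc-checkout"))
    print(heartbeat_update(15))
```

The point of the second function is that an update always goes out, even when there is nothing new to say.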
Now that that’s out of the way, let’s take a look at each part of our template and tie it to the five points above.
Identify that it is an incident along with its priority
This is where it all starts. A monitor goes off, an alert is triggered, or someone sends a sad tweet saying they can’t get their free disc golf disc because the form is busted. You know the drill.
Whatever problem was noted, you now must determine what the real problem is. Understand that this may shift as the incident continues, but don’t be fooled by easy answers. “The validation is broken on this form” seems innocuous enough, but if you don’t uncover why the validation is truly broken, you may solve one piece of the problem but not all of it. As the team begins to form, continue to question the core problem you are addressing.
The output of problem identification should be a Problem Statement. This is likely the only time you’ll see me link out to anything related to Six Sigma, but this is a great overview of writing a clear problem statement. Your problem statement will evolve as you learn more but use it as the guiding light for the incident.
Severity is an area of incidents where I like to follow the KISS method (Keep It Simple, Stupid). No need to get cute or fancy here: use an increasing number for decreasing priority. SEV-1, SEV-2, and SEV-3 should work just fine, and the table below outlines the way Atlassian determines severity level for their incidents. No need to reinvent the wheel unless it truly needs to be customized.
To say it a different way:
SEV-1: Hold everything, this is a critical process that’s down and we should not move forward anywhere until this is resolved. There’s a big ol’ fire.
SEV-2: Serious stuff, but the world’s not on fire. It will make folks grumpy and should be resolved by the end of the day. Contained fire, but could spread.
SEV-3: Eh, bigger than a bug, but not super severe. You should probably fix this within a week if not within 2-3 days. Smoke, but nothing else.
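If you want the scale to live somewhere other than people’s heads, the three levels above can be sketched in a few lines of Python. This is a rough illustration, not an official standard: the hour values are my loose reading of “end of the day” and “within a week,” and the two-question classify helper is a hypothetical simplification of the fire analogy.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher priority, matching SEV-1 / SEV-2 / SEV-3."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Target resolution windows in hours. None means "hold everything until fixed".
# These numbers are illustrative assumptions drawn from the descriptions above.
TARGET_HOURS = {
    Severity.SEV1: None,
    Severity.SEV2: 8,       # roughly "by the end of the day"
    Severity.SEV3: 7 * 24,  # "within a week"
}


def classify(critical_process_down: bool, could_spread: bool) -> Severity:
    """Hypothetical triage: big fire, contained fire that could spread, or smoke."""
    if critical_process_down:
        return Severity.SEV1
    if could_spread:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    sev = classify(critical_process_down=False, could_spread=True)
    print(sev.name, TARGET_HOURS[sev])
```

Your real triage questions will be more nuanced, but keeping the mapping this small is the whole idea behind KISS.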
Start an Incident
There are many ways to start an incident, whether through DataDog, PagerDuty, FreshService, or any other on-call SaaS. Whatever the triggering mechanism, or system you’re using, starting an incident involves making a broad declaration and shouting from the rooftops, “*cough cough* SOMETHING’S WRONG AND WE WANT TO FIX IT!”
My preferred method of starting an incident (when remote) is to get all the appropriate parties in a channel and onto a call. Defining “the appropriate parties” can be broken down thusly:
Who’s going to fix it?
Who cares that it gets fixed?
Who can identify impact?
Who fixed it last time (or knows about the area that needs fixin’)?
Those four questions will usually surface the people you need to work on an incident. With your crew in tow, you may now move about the incident. Think of it as a very boring (or exciting to some) way of forming The Fellowship of the Ring or Dom Toretto assembling a team for whatever the plot is for Fast & The Furious 15.
Once the team is assembled, you can fill out the first major portion of the template.
Incident Management Team aka “The Crew” (with explanations)
Who is handling the communication for this incident? As incident commander, you are pulling folks together and ensuring folks stay on task. You are also responsible for keeping stakeholders and incident-watchers up-to-date. This is especially critical in SEV-1 and SEV-2 incidents. I like to ensure there is an update every 15 minutes or so in the early goings of an incident, providing clarity about what is being tried, what was identified, and how the team is proceeding.
This role is also critical as a facilitator. View yourself less as a “hey can I get an update” requester and more as the captain of the ship, steering conversation if it veers off course, or asking “dumb” questions to open new lines of thinking. This is especially helpful when stuck. Some questions to consider:
What changed recently?
What is the simplest reason this might not be working anymore?
Any third-party vendors we’re relying on for this work?
What happens if we reset/revert?
I put “dumb” in quotes for a reason. It isn’t that these are bad questions; they’re just obvious. Problem is, when an incident is going on, sometimes individuals forget to think through the obvious and hop straight to more complex explanations. Helping the team slow down and think about the simplest solution can go a long way.
Remember Occam’s Razor. Paraphrasing: “the simplest explanation is usually the right one.” Essentially, start small and work big with dependencies and complexities, eliminating reasons along the way.
The Technical Lead for the incident is the individual, typically an engineer, that is leading the technical resolution of the incident itself. Whether a subject matter expert in the area or simply the first to respond to the incident, the Technical Lead tends to make final decisions on what order solutions are tried.
If no one is signing up for the role, the incident commander should assign it to whoever appears to be taking the lead role at the moment. This is not intended to add additional pressure, but to identify one individual that can provide updates and make decisions throughout the remainder of the incident.
Assessing the impact, or “blast radius” as I and others like to call it, is key at the beginning of an incident. Blast radius refers to how broad the issue is and how many individuals it affects.
The Impact Assessor is the number cruncher, trying to determine the overall cost or opportunity cost of the incident itself. They will likely, or should, ask questions like:
How many users is this affecting?
What dollars or time lost can we associate with the area that is down?
Where can we get the data from? (database, dashboard, etc)
What will the impact continue to be if not resolved?
This can be an engineer, product manager, engineering manager, data analyst, or some other party not focused on resolving the problem.
The rest of the crew is anyone who is actively working on the resolution to the incident. Whether through scripts, code updates, or reverts, whoever has their hands on the keyboard goes on this list. These tend to be the individuals that are more heads-down and focused during the incident.
The expectation for these individuals is to divvy up work (if appropriate), talk through solutions, and reconvene as needed to work through the next steps. For a SEV-1 incident, I typically like everyone to stay on the call until the incident is stable, but give space to the engineers that need some quiet to fix the problem.
Set a check-in time, usually based on how often updates are needed for stakeholders, and agree on how you will communicate through the silence. The worst thing you can have in an incident is silence that no one understands. When quiet, bystanders and stakeholders will assume the worst and wonder if anyone cares or if anyone is working on the issue.
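To make the crew and the check-in cadence concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the class names (IncidentCrew, UpdateClock), the field names, and the 15-minute default, which simply mirrors the cadence I mentioned for the early part of an incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentCrew:
    """The roles described above; names are plain strings for simplicity."""
    commander: str
    technical_lead: str
    impact_assessor: str
    responders: List[str] = field(default_factory=list)


@dataclass
class UpdateClock:
    """Tracks whether the next stakeholder update is due."""
    interval: timedelta = timedelta(minutes=15)  # assumed default cadence
    last_update: Optional[datetime] = None

    def record_update(self, when: datetime) -> None:
        self.last_update = when

    def is_overdue(self, now: datetime) -> bool:
        # No update sent yet counts as overdue, so silence never gets a head start.
        if self.last_update is None:
            return True
        return now - self.last_update >= self.interval


if __name__ == "__main__":
    crew = IncidentCrew("Ana", "Bo", "Cy", responders=["Di", "Eli"])
    clock = UpdateClock()
    start = datetime(2024, 1, 1, 4, 30)
    clock.record_update(start)
    print(crew.commander, clock.is_overdue(start + timedelta(minutes=20)))
```

A timer like this is only a nudge for the incident commander. The agreement on how to communicate through the silence still has to come from the people on the call.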
Stabilize the Incident
Once started, the immediate goal is not to resolve; the immediate goal is to stabilize. Think about your incident as a leaky bucket. As you put water in for the day, you notice the bucket has one or several leaks. Your first thought should not be, “how do I rebuild this entire bucket so the leaks stop,” but instead, “how do I stop these leaks?”
More specifically, if the incident is preventing all of your customers from checking out, the first goal is to get them checking out again. Once you stabilize the flow, you can begin working on the root cause and solving the larger issue. You can think of it this way:
Stabilization gets the business running again, and resolution keeps the business running in the future.
Keep the team focused on the solutions that will bring about the fastest stabilization, but understand how to weigh them appropriately. If there are two stabilization solutions with one taking fifteen minutes and the other taking twenty minutes, understand what additional benefit the twenty-minute solution provides and how costly those five minutes will be.
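The fifteen-versus-twenty-minute comparison is just arithmetic, and it can help to write it down. Below is a rough Python sketch, assuming you can put a dollar figure on each minute of downtime and on the added benefit of the slower option. Both figures, and the function names, are hypothetical inputs you would have to estimate yourself.

```python
def extra_delay_cost(cost_per_minute: float, fast_minutes: float, slow_minutes: float) -> float:
    """Cost of the additional downtime incurred by choosing the slower option."""
    return cost_per_minute * max(slow_minutes - fast_minutes, 0)


def prefer_slower_option(
    cost_per_minute: float,
    fast_minutes: float,
    slow_minutes: float,
    added_benefit: float,
) -> bool:
    """True when the slower option's estimated benefit outweighs the extra downtime."""
    return added_benefit > extra_delay_cost(cost_per_minute, fast_minutes, slow_minutes)


if __name__ == "__main__":
    # Assumed figures: $1,000 lost per minute, 15 vs. 20 minutes to stabilize.
    print(extra_delay_cost(1000, 15, 20))
    print(prefer_slower_option(1000, 15, 20, added_benefit=8000))
```

With those assumed figures the extra five minutes cost $5,000, so the slower option only makes sense if its added benefit is worth more than that.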
Look at your monitors or alerts to determine that the incident is indeed stable, mark it as such, update relevant parties, and move on to the Resolution phase.
Resolve the Incident
In some cases, you can skip this section entirely because the stabilization was also the resolution, but even in those cases an important question to ask is, “how do we prevent this from happening again?”
It’s a question that may be better served in the postmortem, but if the stabilization is not going to last for days at a time, the team must determine a more steadfast solution. If nothing else, it is likely helpful to reassess the monitors or trigger points to determine if they were loud enough or too loud.
In cases where the stabilization was a small patch, continue working as an incident team to determine the long-term fix. Any additional logic that needs to be sorted? A broader sweep through the codebase to eliminate the opportunity for this to happen again? There are plenty of ways you might need to resolve the incident, but the point of stabilizing first is it gives everyone a chance to breathe. By taking a moment to look at the broader picture, you can see what fringe areas may be affected, additional avenues that need assessing, or more tests that need to be written.
Resolving is essentially putting a stamp on the incident saying, “This incident is stable, as complete as we expect it to be in the short term, and we don’t expect to see this again.” The resolution allows all watching parties to go back to their own work and stop worrying about the incident.
Review the Incident in a Postmortem
This is probably a whole post on its own, but no incident is officially over until the team holds a postmortem to review what happened. Postmortems are extremely common, not just in the tech world but in organizations of all kinds. For a good primer on postmortems, check out PagerDuty’s overview.
In the template I’ve provided, you can continue using the same document for the postmortem. Key items to review:
Steps used to resolve / Overall Timeline
Root Cause Analysis
Action Items for further prevention
With the incident resolved, the team should have a good understanding of how broad the impact was, who was impacted, and what the associated costs are with the impact. Ensure your stakeholders have all of their impact questions answered, as they will need to surface the impact to their teams and leaders. Use the individual designated as the Impact Assessor to finish these details. It is also helpful if you or the team can tie the impact back to “what does this mean for the business?” but that is of secondary importance outside of the primary impact.
How did we know there was an incident? Use two primary categories, manual or automated, for the detection method.
If detected manually, is that a stakeholder or customer that said something was broken? Or did we accidentally stumble upon it ourselves?
For automated detection, was it an early alert or did we only get alerted once the incident got real bad?
Secondary for automated would be, did anyone notice the automated alert or do we need to increase the volume of those alerts?
Do we need to reset expectations about how to identify automated alerts?
All of these are great questions about detection methods and how you might improve for future incidents.
Steps Used to Resolve / Overall Timeline
Review the timeline and the steps used to resolve the incident and ask some of the following questions:
What went well
What didn’t go so well
What did we miss initially that could have helped us resolve faster
What will we change next time
Do we feel the response time was adequate
If you’re using toolsets from PagerDuty or DataDog for your incidents it is likely that you have a full timeline built out from Slack or your company’s chat tool. If not, be sure the incident commander is taking good notes during the incident for review.
Fill in any gaps as you review with the team. During a postmortem, no detail is too small. You are attempting to paint the full picture of the incident to improve for the future.
Root Cause Analysis
How you get to the root cause is up to you and the team, but dig deep. The root cause is unlikely to be a simple answer, so spend at least five minutes trying to understand if it’s a people, process, or program problem.
The Five Whys is typically treated as the gold standard for root cause analysis (RCA); here’s a good overview from Kanbanize. At its core, the Five Whys prods you to keep uncovering layers until you get to the true root of the problem. Explore the root and explore solutions for the root problem before moving on in the postmortem.
Action Items for further prevention
Show me a postmortem with no action items and I’ll show you a team that hasn’t thought critically enough about the problem. There is always something to do after an incident. Whether it is additional research into the problem area, assessment of triggers and alerts, or whole projects dedicated to refactoring the problem area, there are always action items. Determine who will take the action items and the expected timeframe to complete them.
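One lightweight way to hold yourself to “who and by when” is to treat an action item as incomplete until it has both. Here is a minimal Python sketch; the ActionItem fields and the unready_items helper are hypothetical names for illustration, not something from the template.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def unready_items(items: List[ActionItem]) -> List[ActionItem]:
    """Return action items that are still missing an owner or a due date."""
    return [item for item in items if item.owner is None or item.due is None]


if __name__ == "__main__":
    items = [
        ActionItem("Reassess alert thresholds", owner="Bo", due=date(2024, 2, 1)),
        ActionItem("Research the problem area"),
    ]
    for item in unready_items(items):
        print("Needs an owner and/or due date:", item.description)
```

If that list is not empty at the end of the postmortem, the postmortem is not quite done.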
Bonus: Remember to keep it blameless
Regardless of how many times we’ve been through an incident or postmortem, I like to remind the team that the postmortem is intended to be blameless. In other words, no finger-pointing allowed. We are all working against the problem together. Remind the team of this at the beginning of the retro and remind them again if blame is getting assigned to an individual or group.
Do not tolerate blame assignment.
Tips For Managers During Incidents
Have a plan in place
By reading this guide you’re already further ahead than most organizations. Please do not be a company that says, “well, if we don’t talk about it, then we’ll never have an incident,” as if it were some weird superstition. Cliche quote incoming but:
Those who fail to plan, plan to fail.
The first version of your incident response plan does not need to be, and likely should not be, the last version. After several incidents using the framework, hold a retrospective with the teams that took part and identify what went well, what didn’t go well, and what confused them. Make some changes, communicate those changes, and keep on tweaking.
With a plan in place, you can now set clear expectations about how to run incidents and what you expect from each member of your team. By setting clear expectations, you can ensure each incident is moving forward appropriately, or measure and review when things are off. Reset expectations as you go if needed, but continue working with your team to help them through incidents.
Especially in the early goings I like to be in most of the incidents my team is on so I can lend a helping hand on the facilitation side. This might take the form of being the Incident Commander, or just riding along, but either way, it helps you set the behaviors you want to see continue. It’s also a good way to test the process you helped build to see what feels off.
Have a fast build pipeline, or a way to sneak around it
Unpopular opinion, but I don’t want all of my tests to run if I’m trying to prevent $2 million from leaking out of the business. I need a way to get the fix in there now and not ten minutes from now. While build pipelines are created to help with resiliency of a codebase and not letting bugs creep in, there must be a way to get hotfixes in quickly.
If your codebase and pipeline don’t currently allow for this, work with your teams to figure out how you get there. There’s nothing worse than having a solution available to stop the leaking, then watching CircleCI buffer for five to twenty minutes.
Be Helpful, Not Helicopter
Check in with the Incident Commander to see if you can be helpful at all. If they say no, stay out of the way. Watch the incident and offer suggestions or thoughts where it makes sense, but if you’re not a main contributor, hold off.
Being a helicopter and hovering or constantly asking for updates is a great way to distract the team from solving the incident. If you absolutely feel like you must be involved, spend more time on the data assessment or stakeholder communications. I’d also ask that you take a good look at why you feel you must be involved, and sort that out so you aren’t a bottleneck in the future.
No swooping, no pooping, and no seagull managers during incidents. Let your people work.
Know how incidents impact work and productivity in the 24 hours after
Most organizations with on-call rotations or folks who have to hop into incidents don’t give any additional incentive for doing so. It tends to be “part of the job” which is fine! Some companies pay overtime for these sorts of situations, but I’ve never been in a company where that is true.
As a manager, don’t be tone-deaf about the effects of incidents in the following 24 hours. There are follow-ups, next steps for resolution or clean-up, and likely more meetings to discuss further. Give that individual space and for the love of all things, don’t expect them to be on your 9:00am standup if they were up at 4:30am to resolve the incident. More importantly, tell them that. Make it the norm that those on incidents overnight are given leeway as long as they communicate what to expect.
If you’re looking for your engineers to burn out quickly, force them to be at all meetings and on time the day following an overnight incident.
Run an incident before you need the Incident Response
As much as you want to believe that everyone read the documents and understands them, have a gameday where you run through a fake incident. Schedule time for everyone to run through the appropriate roles from identification to resolution (you can skip the postmortem). Note: Be sure to let your stakeholders know you are running a test so you don’t spook them.
This sets a comfort level with the tools and helps you understand where your team might falter a bit in a real incident. True incidents don’t have the luxury of looking around with the toolset to try and remember which buttons create which part of the incident response.
Keep calm, and be a calming presence
Which group would you rather work on a problem with: one where everyone is yelling and has their hair on fire, or one where everyone is calm? Especially as a manager, you must stay calm and cool things down if they get heated. There’s no sense in getting all riled up in these conversations because it’s not going to help anyone.
Honestly, it’s also okay to be a little silly during an incident. Just because the work is serious doesn’t mean everyone’s demeanor needs to be serious as well. This is not to say it’s time for someone to bust out the joke book or try out their tight five, but it’s alright to keep things lighthearted. Don’t force this by any means, but don’t shy away from it if that’s the personality of your product engineering team.
Tips For Handling Incidents In A Product Engineering Org
When I think about how I typically handle incidents, the following comes to mind. Not all tips will be applicable for every incident, but each boils down to:
Fail with transparency
Show you are working the problem and care. This is more just putting on paper what I’ve done in the past and is not commentary on recently handled incidents.
Pulling in the right folks initially
Use domain documentation, broad slack channels, whatever makes the most sense
Stakeholder awareness is top priority.
Helps stem the bleeding, and while sometimes awkward for them to see how the sausage is made, it’s better than staying quiet
Pull in the core stakeholders at the top of the hierarchy so they can appropriately communicate with their folks
Get engineers to stabilize, then work on root cause
Eliminate what it’s not: low-hanging fruit, the things that would feel stupid if they turned out to be the cause, Occam’s Razor, etc.
Updates every 15-30 minutes, even if the update is “there is no update” or as progress is being made
Helps stakeholders and others know that items are still being worked on
Helps the timeline for postmortems
If pausing on work, clearly lay out the next steps for the next day
Hold all conversations in the incident channel
DMs are the enemy! It’s easy to lose track of who is saying what and who has what information if it’s not done in the open
Being clear about impact or possible impact
Determine if additional data is required from a team outside of your scope
Link impact/outage to cost wherever possible
Next day follow-up
Is the incident still occurring? How do we know it’s stopped if stable?
If fixed, do we need support to resolve anything outstanding?
Letting your leader know
I generally let my boss know severity, how much chaos it’s causing, and overall temperature as soon as it starts
Ticket creation for appropriate teams to ensure we are tracking the work somewhere
Checklist For During the Incident
[ ] Create a channel for the incident
[ ] Pull in the appropriate subject matter experts (or on call folks if after hours)
[ ] Pull in the appropriate stakeholders
[ ] Let your leaders know (may be in the stakeholder group already)
[ ] State the issue clearly along with current state and what that means for those affected
[ ] Identify root cause or eliminate what it is not if the root cause is elusive
[ ] Submit updates every 15-30 minutes, even if the update is “there is no update” or as progress is being made
[ ] If pausing on work (or incident is stable), clearly lay out the next steps for the next day
[ ] Hold all conversations in the incident channel (avoid DMs)
[ ] Be clear about impact or possible impact
[ ] Determine if additional data is required from a team outside of our scope
[ ] Link impact/outage to cost wherever possible
[ ] Next day follow-up
[ ] Is the incident still occurring? How do we know it’s stopped if stable?
[ ] If fixed, do we need support to resolve anything outstanding?
[ ] Ticket creation for appropriate teams to ensure we are tracking the work somewhere
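If your team lives in a chat tool more than in a document, the checklist above can be kept as plain data and re-posted as items get ticked off. The Python sketch below uses a shortened, paraphrased subset of the items, and the render and remaining helpers are hypothetical. It is only meant to show the shape of the idea.

```python
from typing import List, Set

# A shortened subset of the checklist above, for illustration only.
CHECKLIST: List[str] = [
    "Create a channel for the incident",
    "Pull in the appropriate subject matter experts",
    "Pull in the appropriate stakeholders",
    "Let your leaders know",
    "State the issue clearly",
]


def render(done: Set[str]) -> str:
    """Render the checklist with [x] for completed items and [ ] otherwise."""
    lines = []
    for item in CHECKLIST:
        mark = "x" if item in done else " "
        lines.append(f"[{mark}] {item}")
    return "\n".join(lines)


def remaining(done: Set[str]) -> List[str]:
    """Return checklist items not yet completed, in their original order."""
    return [item for item in CHECKLIST if item not in done]


if __name__ == "__main__":
    done = {"Create a channel for the incident"}
    print(render(done))
    print(len(remaining(done)), "items left")
```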