What I learned watching a principal engineer restore a service in less than 10 minutes while I was still setting up my local test environment

I’ll never forget the incident that changed how I think about handling high-severity incidents in production. I was three months into a new role, shadowing the primary on-call engineer, when multiple alerts fired indicating a spike in data ingestion and indexing errors for a service my team managed. The primary posted in a company-wide Slack channel notifying folks that something was awry and could be customer-impacting.
My engineering instincts kicked in immediately; I wanted to find the root cause of the problem. I pulled up the error logs, started tracing the stack, and began spinning up my local environment to step through the code. I was going to find this bug, write a fix, and be the hero that saved the day. That’s what good engineers do, right?
Meanwhile, a principal engineer I deeply admired entered the incident Slack channel and took a completely different approach. I started explaining where I thought the problematic code might be based on my limited experience.
While he was courteous, he didn’t want to go deeper into the codebase at that moment. Instead, he:
1. Asked the support team whether we had a status page up to inform customers about the impact.
2. Pulled up our deployment dashboard and cross-referenced recent changes with the error spike.
3. Found a correlation with a deploy from earlier in the day.
4. Executed a rollback to the last known stable version.
Fewer than 10 minutes passed between his entering the incident channel and service restoration. I was still running `git blame` and stepping through my debugger.
That incident taught me something important: for customers, mitigation speed matters more than a root-cause fix. The fundamental fixes can come later; first, you stop the bleeding. Every minute of degraded performance or downtime erodes customer trust. Customers will leave your platform, and acquiring new ones is far more expensive than keeping the ones you have.
There’s a lot to be said about preventing incidents in the first place, but once they occur, mitigation and communication should be the focus. You need multiple people so that customer communication, mitigation, and impact analysis can all happen in parallel. Save root-cause analysis for after the temperature dies down and customers are no longer experiencing immediate pain.
Several distinct roles, working in parallel, help reduce MTTR (mean time to resolution):
- Incident Commander — this person ensures the right people and resources are available and coordinates across all of them.
- Internal Communications — this person should be relaying information to internal stakeholders from engineering to support to leadership. This gives everyone timely updates while allowing others to focus on problem solving.
- External Communications — this person updates public status pages and also communicates with customers who are submitting support tickets.
- Primary on-call engineer — this person is focused on mitigating the issue as quickly as possible. They should set up a live video call for others to join so that communications can be updated without pulling focus from mitigation.
- Secondary on-call engineer(s) — this may be multiple people, depending on the scope and impact of the incident. These engineers should support the primary engineer asynchronously or on the live video call. Think of it as pair programming under time pressure: speed is critical, but a second set of eyes on CLI commands, deployments, and the like is crucial.
Rolling back code isn’t always the solution to a problem. But when time is of the essence, short-term mitigation should be the focus.