Failures are inevitable. But when they occur, our goal should be to resolve them as quickly and efficiently as possible. PagerDuty has developed an open-source Incident Response framework based on the Incident Command System (ICS). That product-independent process has helped many organizations set up Incident Response processes that resolve technical issues as quickly and effectively as possible. There’s a lot of documentation you can follow, but how do incidents actually play out in real-time when they happen? In this talk, you get to peek behind the curtain to see what happens inside the walls of the company that’s known for waking you up to tell you there’s a technical incident you need to deal with. We will walk through a cascading failure that happened when unexpected errors were found happening in one of our Kafka clusters. I share all the gritty details of this actual incident, as they occurred, and use that as a way to demonstrate the structure of how we apply ICS to resolving technical problems. Attendees of this talk will walk away with an understanding of how to effectively manage complex technical incidents across multiple teams, the additional roles that are necessary to support technical responders, how to structure a blameless post-mortem, and how to start developing a similar process in their own organizations. This talk delves into technical concepts due to the nature of the problem, but it is mostly focused on the mechanics of managing any technical incident.
Peeking Behind the Curtain: The Anatomy of a Real Major Incident George Miranda
September 12, 2019