Netflix relies on Apache Cassandra as a critical source-of-truth database, and while Cassandra is remarkably resilient, it does, ever so occasionally, break. This talk explores how complex Cassandra deployments fail in production and, more importantly, the techniques, tools, and approaches our distributed systems engineers use to debug and mitigate those failures.

We will first cover software-based failure modes originating either in our own software or in the software Cassandra builds upon. For example, retry storms or unbounded queues can effectively overwhelm the database. Cassandra and Linux provide many tools for detecting these problems, and there are numerous strategies for avoiding them. Along the way we will also visit common JVM failure modes and how to assess and remediate them.

Next, we will cover hardware-based failure modes typical of a cloud environment: the broad classes of drive and network failures we cope with every day, and the metrics and tests we use to detect them. With this understanding, we will learn how to heal these failures automatically and safely meet our latency SLOs.

Finally, we will cover systemic failure modes arising from complex interactions of multiple components and systems, illustrated by a number of concrete failures we observed that combined multiple bugs, system failures, or data modeling issues.

By the end of the talk, we hope the audience leaves understanding not only how large distributed databases can fail, but also how to use DevOps skills and tools to debug, mitigate, and automate away these failure modes going forward.
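One of the software failure modes above, the retry storm, has a widely used client-side mitigation: capped exponential backoff with full jitter, so that synchronized clients spread their retries out instead of hammering a struggling node in lockstep. The sketch below is a hedged illustration of that general idea, not code from the talk or from Netflix; the function name and parameters are invented for this example.

```python
import random


def backoff_delays(base=0.05, cap=2.0, attempts=4):
    """Capped exponential backoff with full jitter (illustrative sketch).

    For each retry attempt, draw a random delay from
    [0, min(cap, base * 2**attempt)] seconds. The exponential ceiling
    spaces retries out as failures persist, the cap bounds the worst-case
    wait, and the jitter decorrelates clients that failed at the same
    moment, avoiding a synchronized retry storm against the database.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays
```

A caller would sleep for each returned delay between successive retries, and give up once the attempt budget is exhausted rather than retrying unboundedly.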
How Netflix debugs and fixes Apache Cassandra when it breaks
Joey Lynch
September 13, 2019