Cloud database offerings have expanded over the past decade, encompassing everything from virtualized machines in the cloud to entirely serverless databases. Amid this pace of innovation, Cassandra is uniquely positioned to serve its users with advantages no other system offers, but it must also compete with the ongoing innovation in the cloud. Internal architecture and storage mechanisms aside, developers in the community weigh several other operability aspects of, and the ecosystem around, a service before committing to it for the long term. At this juncture, it is vital for the Cassandra dev community to build muscle around supporting the ecosystem the community is looking for, to stay on par with the rest of the cloud services and competing offerings in the industry. Specifically, it is important for us to innovate and focus on improving these areas along with the rest of the product:
– Ease of use
– Simplified operability in the cloud, especially in hybrid and multi-cloud deployments
– Pluggability with other infrastructure services such as metrics, discovery, and monitoring
– Painless rollouts of version and protocol upgrades
– Elegant developer experience in polyglot environments
– Dev education of Cassandra best practices
– Unified access across complex systems
As Cassandra stands today, operating the database requires considerable labor, complex automation, or both. Some of this complexity is an unavoidable result of operating a distributed system, but much of it stems from properties of C* itself. As a result, C* operators spend too much time dealing with issues the database should solve on its own, and are unable to reap the full benefit of Cassandra’s powerful distributed data model. In this talk, we will focus on how to address these challenges and keep Cassandra competitive among cloud offerings and database services, with simplified operability and an elegant developer and operator experience. We also hope to get Apache Cassandra 4.0 up and running in any cloud without hassle with the help of the sidecar, and to leave the Cassandra dev community with food for thought about these areas for upcoming releases.
In eventually consistent systems, when a node failure or network partition occurs, we’re presented with a trade-off: execute a request and sacrifice consistency, or reject it and sacrifice availability. In such systems, quorums (overlapping node subsets guaranteeing that at least one node holds the most recent value) can be a good middle ground: we can tolerate failures and loss of connectivity for some nodes while still serving the latest results. However, quorum-based replication schemes incur high storage costs: we have to store redundant values on several nodes to guarantee that enough copies are available in case of failure. It turns out that we do not have to store data on every replica. We can reduce storage and compute resources by storing the data on only a subset of nodes, and use the remaining nodes (Transient Replicas) for redundancy only in failure scenarios. In this talk, we discuss Witness Replicas, a replication scheme used in Spanner and Megastore, and the Apache Cassandra implementation of this concept, called Transient Replication and Cheap Quorums.
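As a rough illustration of the arithmetic involved, the following sketch computes quorum sizes and the storage saved by marking some replicas transient. The numbers and function names are illustrative assumptions, not the actual Cassandra implementation:

```python
# Sketch: majority-quorum sizes and transient-replica storage savings.
# Illustrative only; not taken from the Cassandra codebase.

def quorum(replicas: int) -> int:
    """Smallest subset size guaranteeing any two quorums overlap."""
    return replicas // 2 + 1

def full_replicas(replicas: int, transient: int) -> int:
    """With Transient Replication, only some replicas keep all the data."""
    return replicas - transient

rf = 3         # hypothetical replication factor
transient = 1  # replicas that hold data only during failure scenarios
print(quorum(rf))                    # 2: any two quorums of size 2 overlap
print(full_replicas(rf, transient))  # 2 full copies stored instead of 3
```

The key property is that any two quorums of size `replicas // 2 + 1` share at least one node, so a read quorum always intersects the most recent write quorum.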
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies that enable applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that automates application deployment and the scaling of application clusters. In this presentation, we will reveal how we architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example anomaly detection application running on a Kubernetes cluster, generating and processing massive numbers of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, and system health monitoring. When such applications operate at massive scale, generating millions or billions of events, they impose significant computational, performance and scalability challenges on anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the anomaly detection application to scale to 19 billion anomaly checks per day.
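To make "anomaly check" concrete, here is a minimal sketch of the kind of per-event test such a pipeline might run: flag a value that deviates from a recent window's mean by more than k standard deviations. The class, window size and threshold are hypothetical; the talk does not specify the application's actual algorithm:

```python
# Sketch: a rolling z-score anomaly check, one simple per-event test a
# streaming pipeline could apply. All names and parameters are illustrative.
from collections import deque
from statistics import mean, pstdev

class RollingAnomalyDetector:
    def __init__(self, window: int = 100, k: float = 3.0):
        self.values = deque(maxlen=window)  # recent history only
        self.k = k                          # deviation threshold

    def check(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the window."""
        anomalous = False
        if len(self.values) >= 10:  # need some history before judging
            mu, sigma = mean(self.values), pstdev(self.values)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        self.values.append(value)
        return anomalous

d = RollingAnomalyDetector(window=50)
for v in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]:
    d.check(v)                # warm up with "normal" values
print(d.check(100.0))         # True: far outside the recent distribution
```

At billions of checks per day, the interesting engineering is not this arithmetic but partitioning the event stream (Kafka), persisting per-key history (Cassandra), and scaling the workers (Kubernetes).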
This presentation will describe the initial experience building and using CassKop, an operator developed for running Cassandra on top of Kubernetes. CassKop works as a usual K8s controller (reconciling the real state with a desired state) and automates Cassandra operations through JMX. All operations are launched by calling standard K8s APIs (kubectl apply …) or by using a K8s plugin (kubectl casskop …). CassKop is developed in Go, based on the CoreOS operator-sdk framework. Among the main features:
- deploying a rack-aware cluster (or AZ-aware cluster)
- scaling up & down (including cleanups)
- setting and modifying configuration parameters (C* and JVM parameters)
- adding / removing a datacenter in Cassandra
- rebuilding nodes
- removing or replacing a node (in case of hardware failure)
- upgrading C* or Java versions (including upgradesstables)
- monitoring (using Prometheus/Grafana)
- …
By using local and persistent volumes, it is possible to handle failures or stop/start nodes for maintenance operations with no transfer of data between nodes. Moreover, cassandra-reaper is deployed in K8s and used for scheduling repair sessions (thanks to the Spotify and TheLastPickle teams). The Cassandra exporter for Prometheus and the backup/restore tooling developed by Instaclustr are also used (thanks to the Instaclustr team). During this session, we will delve into the architecture and implementation of this operator, and share our learnings.
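The reconcile pattern mentioned above can be sketched in a few lines: compare the desired spec against the observed state and emit the actions that close the gap. This is a hedged illustration of the controller pattern only; the real CassKop is written in Go on operator-sdk, and the state fields and action strings here are hypothetical:

```python
# Sketch of a reconcile loop as used by K8s controllers such as CassKop.
# Field names and action strings are illustrative assumptions.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired."""
    actions = []
    if observed["nodes"] < desired["nodes"]:
        actions.append(f"scale-up to {desired['nodes']} nodes")
    elif observed["nodes"] > desired["nodes"]:
        actions.append("scale-down (with cleanup)")
    if observed["version"] != desired["version"]:
        # after a C* binary upgrade, SSTables must be rewritten too
        actions.append(f"upgrade to {desired['version']} (then upgradesstables)")
    return actions

print(reconcile({"nodes": 6, "version": "3.11"},
                {"nodes": 3, "version": "3.0"}))
# ['scale-up to 6 nodes', 'upgrade to 3.11 (then upgradesstables)']
```

Running the loop periodically (or on watch events) is what lets `kubectl apply` of a new spec translate into concrete Cassandra operations.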
Netflix relies on Apache Cassandra as a critical source of truth database, and while Cassandra is a remarkably resilient database, it does, ever so occasionally, break. This talk explores how complex Cassandra deployments fail in production, but more importantly, the techniques, tools, and approaches our distributed systems engineers use to debug and mitigate these failures. We will first cover software-based failure modes that come either from our software or the software that Cassandra builds upon. For example, retry storms or unbounded queues can effectively overwhelm the database. Cassandra and Linux give you many tools to detect these, and there are numerous strategies you can use to avoid them. Along the way we will also visit common JVM failure modes and how to assess and remediate these. Next, we will cover hardware-based failure modes that come from a typical cloud environment. We will cover the broad classes of drive and network failures that we cope with every day, as well as the metrics and tests we use to detect them. With these understandings, we will learn how to automatically heal these failures and safely meet our latency SLOs. Finally, we will cover systemic failure modes, including complex interactions of multiple components and systems. This will involve a number of concrete failures we observed that involved multiple bugs, system failures, or data modeling issues in combination. At the end of the talk, we hope that the audience leaves understanding how large distributed databases can fail, but also how to use DevOps skills and tools to debug, mitigate, and automate away these failure modes going forward.
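One of the software failure modes named above, unbounded queues, has a simple structural defense: bound the queue and shed load instead of letting work pile up. A minimal sketch, with illustrative class names and limits (not Netflix's or Cassandra's actual implementation):

```python
# Sketch: a bounded work queue that rejects requests under overload,
# a basic defense against queue growth and retry storms.
# Capacity and names are illustrative.
from queue import Queue, Full

class BoundedExecutorQueue:
    def __init__(self, capacity: int = 128):
        self.queue = Queue(maxsize=capacity)
        self.rejected = 0

    def submit(self, request) -> bool:
        """Enqueue a request, or reject it immediately when full."""
        try:
            self.queue.put_nowait(request)
            return True
        except Full:
            self.rejected += 1  # fail fast instead of queueing forever
            return False

q = BoundedExecutorQueue(capacity=2)
print([q.submit(i) for i in range(4)])  # [True, True, False, False]
print(q.rejected)                       # 2
```

Rejected requests surface overload to callers early, which is what prevents a slow node from accumulating latency-amplifying backlogs; pairing this with a client-side retry budget keeps retries from turning a brownout into an outage.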
We have been providing a stream-based data pipeline built with Google Cloud Dataflow and Apache Beam since fall 2018. It collects event logs from microservices running on GKE, then transforms and forwards the logs to GCS and BigQuery for analytics and other uses. As you know, implementing and operating streaming jobs is challenging, and we’ve encountered various issues during that time.
I’d like to share our knowledge from both the development and operations perspectives. There are three topics in the development part: 1) implementing stream jobs using spotify/scio, 2) how to debug the jobs, especially those with a DynamicDestination, and 3) how to load test to ensure our jobs are stable. The topics in the operations part are: 1) how to deploy new jobs safely (avoiding data loss), 2) how to monitor the jobs and surrounding systems, and miscellaneous tips.
Learning a streaming framework like Apache Beam is exciting, but ‘Hello World!’ examples aren’t fun. There aren’t a lot of free and interesting streaming data sources for beginners to play with, and it’s very easy to give up learning if it’s boring. To keep myself learning something new, I find I need incentives and accomplishments to continue. If you, like me, need some motivation to keep learning, this talk will give you some inspiration.
In this talk, I will share my experience of learning Apache Beam by showing you demos of how I create streaming data with my custom ‘Marvel Fights Stream Producer’. I will discuss how I went through the Apache Beam Programming Guide and replaced the official examples with the Marvel streaming data I produced. I will also talk about what I learned from creating my custom data stream producer and how that helped me learn Apache Beam better.