
Apache Cassandra Resources

Key Features of Apache Cassandra

Several key features of Apache Cassandra have historically made it an appealing choice for enterprises, including fast writes, high scalability, the Cassandra Query Language, fault tolerance, and tunable consistency.

Fast Writes. Cassandra handles large volumes of data, structured or unstructured, while allowing fast writes to the data store or database.

High Scalability. Cassandra clusters scale horizontally with ease: nodes can be added on demand as needs change, including across geographic regions.

Cassandra Query Language (CQL). CQL offers a familiar, SQL-like syntax. But in contrast to Structured Query Language (SQL) and the relational databases it serves, which typically scale vertically, Cassandra is a NoSQL database that distributes data horizontally across clusters for massive scalability. Use CQL to define a simple primary key and complete other basic tasks in Apache Cassandra.
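For example, a table with a simple (single-column) primary key can be defined in CQL like this (the table and column names here are hypothetical):

```sql
-- A simple primary key: a single partition key column
CREATE TABLE users (
    user_id uuid PRIMARY KEY,  -- partition key
    name    text,
    email   text
);
```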

Fault Tolerance. All Cassandra nodes are peers, so the failure of a single node is generally not catastrophic. It is easy to replicate data across enough Cassandra nodes to avoid total failure and downtime.

Tunable Consistency. Cassandra lets you tune the consistency level of each read and write (for example, ONE, QUORUM, or ALL), trading stronger consistency against latency and availability as each workload requires. Along with typical JVM performance tuning, Cassandra also offers configurable table-level compression options and more.
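As a sketch of tunable consistency in practice, cqlsh lets you set the consistency level for subsequent requests (the keyspace, table, and columns here are hypothetical):

```sql
-- In cqlsh: require a majority of replicas to acknowledge each operation
CONSISTENCY QUORUM;

-- This write now succeeds only once a quorum of replicas has accepted it
INSERT INTO shop.users (user_id, name) VALUES (uuid(), 'Ada');
```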

At a high level, Apache Cassandra's data model components include keyspaces, tables, and columns. Keyspaces are data containers similar to schemas in a relational database. Keyspaces contain tables, also called column families in earlier versions of Cassandra, where data is stored in rows composed of columns. Columns define the data structure within a table.
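These components fit together as in the following CQL sketch (the keyspace, table, and column names are hypothetical, and SimpleStrategy replication is used only for illustration):

```sql
-- A keyspace: the container for tables, with replication configured
CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- A table (column family) inside the keyspace; columns define the row structure
CREATE TABLE shop.orders (
    customer_id uuid,       -- partition key: determines which nodes store the row
    order_id    timeuuid,   -- clustering column: orders rows within a partition
    total       decimal,
    PRIMARY KEY (customer_id, order_id)
);
```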

Key Challenges of Apache Cassandra

Stability, performance at scale, maintenance overhead, and total cost of ownership (TCO) are commonly cited Apache Cassandra challenges.

For example:

  • Fanatics, official retailer of virtually every major sports league and event, faced Cassandra struggles such as node sprawl, frequent garbage collection (GC) pauses, and CPU spikes during compactions that led to timeouts.
  • Expedia faced 4 main challenges with Cassandra: Garbage Collection, burst traffic, infrastructure costs, and infrequent releases.
  • Rakuten, a global online retailer with 1.5B worldwide members, faced JVM issues, long Garbage Collection pauses, and timeouts – plus they learned the hard way that a single slow node can bring down the entire cluster.
  • Throughout their years of working with Cassandra, web browser maker Opera faced constant issues including enormous load, Cassandra processes being killed, and many GC-related issues that required hundreds of hours to debug.
  • For 7 years, television ad leader Steelhouse (now MNTN) relied on Cassandra clusters that habitually generated reams of timeouts and prevented them from meeting their expected SLA.
  • Dstillery, a leading custom audience solutions provider, needs to be able to read and write at scale to the tune of hundreds of billions of requests a day with a timeout of 25 milliseconds. No matter how many nodes they added to the Cassandra cluster, they could not lower the failure rate below 0.1% (which, at that request volume, amounts to more than a thousand failures every second).

Read more about how teams approached these common Apache Cassandra challenges.

Apache Cassandra Documentation

Apache Cassandra documentation is open source and available online. It outlines Apache Cassandra system requirements and other information you need to use this kind of database. The Apache Cassandra Quickstart guide is the recommended starting point.

Apache Cassandra is a free, open source, distributed, wide-column NoSQL database and data store intended to handle massive amounts of big data across many servers with no single point of failure. Its documentation covers Cassandra's support for high availability and low-latency operations for client clusters across data centers.

Use Apache Cassandra statistics for monitoring and to set appropriate alerting thresholds. For Apache Cassandra use cases, see the case studies in the Apache Cassandra documentation. Read our two-part blog series for a deep dive into Apache Cassandra benchmarks versus ScyllaDB benchmarks.

According to Apache, the Apache feather logo and Apache Cassandra are trademarks of the Apache Software Foundation. Trademarks aside, Cassandra itself remains an open source product, released under the terms of the Apache License 2.0.

Apache Cassandra System Requirements

Memory. In a virtualized environment, such as one running large EC2 instances, 4GB is the typical recommended minimum. On dedicated hardware, there is little reason to use less than 8GB to 16GB, and clusters with 32GB or more per node are common.

CPU. 8-core machines currently offer the best balance of price and performance for raw hardware. If you're running on virtualized machines, a public cloud provider that allows CPU bursting may be a better fit.

Disk. At least 2 disks are necessary: one for the CommitLogDirectory and one for the DataFileDirectories. The device backing the CommitLogDirectory must be fast enough to absorb all of the writes, whatever its size. The device backing the DataFileDirectories must be big enough to house all of your data and fast enough both to keep up with compaction and flushing and to satisfy reads that are not cached in memory.
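In cassandra.yaml, these locations are configured with the commitlog_directory and data_file_directories options. A minimal sketch, assuming the two devices are mounted at the (hypothetical) paths shown:

```yaml
# cassandra.yaml -- point each directory at a separate physical device
commitlog_directory: /mnt/fastdisk/cassandra/commitlog
data_file_directories:
    - /mnt/bigdisk/cassandra/data
```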

Apache Cassandra Hardware Requirements

Apache Cassandra throughput improves with more RAM, more CPU cores, and faster disks. A minimal production server for Cassandra requires at least 2 cores and at least 8GB of RAM. Typical production servers have 8 cores or more, plus 32GB of RAM or more.

CPU. Cassandra is highly concurrent, and the write path in particular tends to be CPU bound, so adding CPU cores increases both read and write throughput. With more cores available, Cassandra can handle many simultaneous requests using multiple threads.

Memory. Cassandra runs inside a Java VM. Beyond the heap, Cassandra uses significant amounts of RAM for bloom filters, compression metadata, counter caches, and other structures. Clusters should be tuned for their individual workloads, yet basic guidelines suggest several best practices. For example, always use ECC RAM, because Cassandra lacks internal safeguards against bit-level corruption. And the Cassandra heap should be at least 2GB while consuming no more than 50% of system RAM.
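For example, on a dedicated server with 16GB of RAM, the heap might be capped at 8GB. Depending on the Cassandra version, heap size is set in conf/jvm-server.options (4.x) or in conf/jvm.options and cassandra-env.sh (3.x); a sketch:

```properties
# conf/jvm-server.options (Cassandra 4.x) -- pin min and max heap to the same
# value to avoid resize pauses; here 8GB, i.e. 50% of a 16GB machine
-Xms8G
-Xmx8G
```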

Disks. Cassandra persists each new write to a commitlog on one disk so that writes can be replayed after a system shutdown or crash. It persists data to the data directory on a second disk when memtables exceed configured thresholds and are flushed. Cassandra performs well on both solid state disks and spinning hard drives; however, when using spinning disks, the commitlog and the data files must be on separate physical disks. Since Cassandra is generally designed to provide redundancy via many low-cost, independent servers, using a SAN or NFS for data directories should typically be avoided.

Cloud Deployments. Many large users of Cassandra run in Azure, AWS, GCP, and other public clouds. Running Cassandra on-premises requires similar hardware. Network and disk performance typically increase with instance generation and size, so newer and larger instance types often outperform older, smaller alternatives.

Apache Cassandra Metrics

Monitoring Apache Cassandra metrics and performance is important: it allows your team to identify pressing resource limitations and ongoing slowdowns, and to rapidly take action to remedy them. Cassandra's standard metrics include exponentially weighted moving averages for request rates over one-, five-, and fifteen-minute intervals.

Other key areas for analyzing and capturing metrics include:

  • All throughput, especially read and write requests. One-minute read and write throughput rates in particular offer a near-real-time glimpse into Cassandra.
  • Compaction performance, to determine when to add cluster capacity. Increases in pending tasks in the thread pool statistics may indicate that additional capacity is needed.
  • Latency of all kinds, read and write latency in particular.
  • Disk usage and disk space on every node.
  • Duration and frequency of garbage collection.
  • Overruns and errors, particularly unavailable exceptions arising from Cassandra nodes that are unavailable in the cluster.
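Many of these metrics can be sampled from the command line with nodetool, the admin tool bundled with Cassandra (a sketch; each command must be run against a live node):

```
nodetool tpstats           # thread pool statistics, including pending tasks
nodetool tablestats        # per-table latency, disk usage, and more
nodetool proxyhistograms   # read/write latency percentiles at the coordinator
nodetool gcstats           # GC duration and frequency since the last call
```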

Apache Cassandra 3.x vs. Apache Cassandra 4.x Comparison

Apache Cassandra recently incremented its major version from 3 to 4 after nearly six years of work. Six years encompasses almost an entire technology cycle, with new Java virtual machines, new system kernels, new hardware, new libraries and even new algorithms. Progress in these areas presented the engineers behind Cassandra with an unprecedented opportunity to achieve new levels of performance. Did they seize it?

As engineers behind ScyllaDB, a Cassandra-compatible open source database designed from the ground up for extremely high throughput and low latency, we were curious about the performance of Cassandra 4.0. Specifically, we wanted to understand how far Cassandra 4.0 performance advanced versus Cassandra 3.11, and against ScyllaDB Open Source 4.4.3. So we put them all to the test.

Cassandra 4.0 is an advancement from Cassandra 3.11. It is clear that Cassandra 4.0 has aptly piggy-backed on advancements to the JVM, and upgrading from Cassandra 3.11 to Cassandra 4.0 will benefit many use cases.

In our test setup, Cassandra 4.0 showed a 25% improvement for a write-only disk-intensive workload and a 33% improvement for read-only workloads with either a low or high cache hit rate. Otherwise, the maximum throughput between the two Cassandra releases was relatively similar.

However, most workloads won't run at maximum utilization, and tail latency at maximum utilization is usually poor. In our tests, we measured throughput under a service-level agreement of under 10 milliseconds at P90 and P99 latency. At this service level, Cassandra 4.0, powered by new JVM garbage collectors, can deliver twice the throughput of Cassandra 3.11. Beyond sheer performance, we tested a wide range of administrative operations under emulated production load, from adding nodes and doubling a cluster to removing nodes and running compaction. Cassandra 4.0 improves these admin operation times by up to 34%.

But for data-intensive applications that require ultra-low latency with extremely high throughput, consider other options such as ScyllaDB, the fastest NoSQL Database. ScyllaDB provides the same Cassandra Query Language (CQL) interface and queries, the same drivers, even the same on-disk SSTable format, but with a modern architecture designed to eliminate Cassandra performance issues, limitations and operational barriers. ScyllaDB consistently and significantly outperformed Cassandra 4.0 on our benchmarks. On identical hardware, ScyllaDB withstood up to 5x greater traffic and offered lower latencies than Apache Cassandra 4.0 in almost every tested scenario. ScyllaDB also completed admin tasks 2.5 to 4 times faster than Cassandra 4.0.

Moreover, ScyllaDB’s feature set goes beyond Cassandra’s in many respects. The bottom line: Cassandra’s performance has improved since its initial release in 2008, but ScyllaDB has leapt ahead of Cassandra with its shared-nothing, shard-per-core architecture that takes full advantage of modern infrastructure and networking capabilities.

Apache Cassandra Compatibility Testing

For a project to be compatible with Apache Cassandra, it should pass the entire suite of Apache Cassandra compatibility testing, including both distributed tests and unit tests covering the CQL query language.

For example, ScyllaDB, which is API-compatible with Apache Cassandra, undergoes this kind of testing, plus other types:

Jepsen distributed systems testing. Jepsen is a flexible distributed systems safety research tool that subjects databases to network partitions, hard-to-handle outages, and delays; it has been run against multiple distributed database systems, including Apache Cassandra and ScyllaDB.

CharybdeFS database filesystem error testing. CharybdeFS injects filesystem errors into operations under the control of test scripts, turning rare error conditions into repeatable test cases.

Project Gemini data integrity testing. Open source under an Apache license, Gemini runs automatic random abuse tests against a ScyllaDB system under test (SUT) and compares the results with a test oracle known to be correct, detecting a range of hidden data corruption bugs.

Distributed testing. An extended version of the Cassandra project's dtest suite, ScyllaDB/Apache Cassandra dtests validate Apache Cassandra cluster operation.

Longevity testing. Longevity testing detects problems in long-running cluster functions that emerge only over time after deployment.

NoSQL Masterclasses: Advance Your NoSQL Knowledge

Looking for extensive training on data modeling, database migration, and high performance for NoSQL databases? Our experts offer 3-hour masterclasses that assist practitioners who want to migrate from SQL to NoSQL or advance their understanding of NoSQL data modeling. These free, self-paced classes cover techniques and best practices that will help you steer clear of mistakes that could inconvenience any engineering team.