Scylla Open Source vs. Apache Cassandra: Getting the Most out of AWS’s New Bare Metal Instances

Summary

In this evaluation, we did our best to define a workload that reflects the conditions users encounter in production deployments. Based on that workload, we drew on our familiarity with real-world deployments to select the optimal machine type for each database, making the benchmark as realistic and practical as possible.

These benchmark results provide a throughput and latency comparison of the Scylla Open Source 2.2 release vs. the Cassandra 3.11 release. The results, obtained from extensive testing, show that Scylla outperforms Cassandra at the 99th and 99.9th percentiles, smoothing out Cassandra’s more egregious latency spikes. Furthermore, condensing the cluster from 40 nodes to only four makes system management far more efficient and cost-effective.

Summary of Findings:

  • 2.5X reduction in AWS EC2 costs
  • Dramatic improvement in reliability through 10X reduction in cluster size
  • Up to 11X improvement in 99th percentile latency
  • Amazon EC2 Bare Metal Instances are a great platform for Scylla

Test Methodology and Configuration

Why Bare Metal? The Basics of Scylla’s Architecture

Scylla is unique among databases in its ability to fully leverage powerful modern hardware. Its shared-nothing, lock-free and shard-per-core architecture allows Scylla to scale up with hardware resources: cores, RAM and I/O devices. AWS EC2’s strongest instance available, i3.metal, is the perfect choice for our Scylla cluster.

Cassandra, in contrast, is limited in its ability to scale up due to its reliance on the Java virtual machine (JVM). The JVM prevents Cassandra from interacting directly with server hardware, its threads and locks are heavy, and they actually slow down as the core count grows. In the absence of NUMA awareness, performance suffers. As a result, the i3.metal instance would not yield optimal results for Cassandra and would be a waste of money. To keep the comparison fair, we selected i3.4xlarge instances for Cassandra, which are roughly a quarter the size of i3.metal but also a quarter of its cost.

Benchmark Workload

The following workload was defined for each test:

  • ~40 billion partitions
  • Replication factor of 3
  • 50:50 read/write ratio
  • Latency of up to 10 milliseconds for the 99th percentile
  • Throughput requirements of 300k, 200k, and 100k Operations per Second.
  • Workload that spans most of the data, so that data is served mostly from disk; a real-life scenario that is more taxing on the database software. The standard deviation of the access distribution was ~6.4B partitions (see the sketch below).
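
To illustrate why this access pattern defeats in-memory caching, here is a rough back-of-the-envelope sketch (illustrative only, not part of the benchmark tooling) that estimates how much data the gaussian key distribution touches, using the same distribution parameters and the ~310-byte partition size quoted later in this post:

import math

# Gaussian population distribution used by cassandra-stress in these tests
total_partitions = 38_850_000_000           # keys 1..38.85B
mean, sigma = 19_425_000_000, 6_475_000_000
partition_size_bytes = 310                  # approximate partition size (see dataset section)

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Fraction of requests that fall within +/- 2 standard deviations of the mean
frac_2sigma = normal_cdf(2) - normal_cdf(-2)        # ~95% of accesses
hot_partitions = 4 * sigma                          # width of the +/- 2 sigma band
hot_bytes = hot_partitions * partition_size_bytes   # raw data, before replication

print(f"~{frac_2sigma:.0%} of requests hit ~{hot_partitions / 1e9:.1f}B partitions")
print(f"that is roughly {hot_bytes / 1e12:.1f} TB of raw data")

Even the “hot” two-sigma band works out to roughly 8 TB of raw data, several times larger than the combined RAM of either cluster, so most reads must be served from disk.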

The load generator, cassandra-stress, was used to populate each database using the default cassandra-stress schema. Once population and compaction completed, the service was restarted on all nodes, ensuring that the tests ran against a cold cache and with all compactions complete. Each test ran for 90 minutes to ensure a steady state.
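
As a rough sketch of how the population phase can be divided across loaders (the distinct per-client sequence ranges here are an assumption for illustration; the representative cassandra-stress commands actually used appear in Appendix A), each client writes an equal slice of the 38.85B-partition keyspace:

# Split the 38.85B-partition population evenly across loader clients.
# The per-client counts match the n= values in Appendix A; the distinct
# seq ranges per client are an assumption for illustration.
TOTAL_PARTITIONS = 38_850_000_000

def population_commands(num_clients, nodes="[IPs]"):
    per_client = TOTAL_PARTITIONS // num_clients
    cmds = []
    for i in range(num_clients):
        start = i * per_client + 1
        end = (i + 1) * per_client
        cmds.append(
            f"cassandra-stress write no-warmup n={per_client} cl=one "
            f"-mode native cql3 -node {nodes} -rate threads=200 "
            f"-pop seq={start}..{end}"
        )
    return cmds

# Scylla was populated with 8 clients (4.86B partitions each),
# Cassandra with 16 clients (2.43B partitions each).
for cmd in population_commands(8)[:2]:
    print(cmd)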

Why a 40-node Cassandra Cluster?

To achieve SLAs against this benchmark with Cassandra, we calculated that a 40-node cluster, running on i3.4xlarge instances, would be required. As you will see in the results below, even this setup did not meet the SLA for most workloads; ultimately, an even larger cluster would have been required.

Setup and Configuration
Scylla cluster:

  • EC2 instance type: i3.metal (36 physical cores | 512 GiB RAM)
  • Storage (ephemeral disks): 8 NVMe drives, 1900 GB each
  • Cluster size: 4 nodes in a single DC
  • Total CPU and RAM: CPU count 288 | RAM size 2 TB
  • DB SW version: Scylla Open Source version 2.2
  • Latency test loaders: 7 x i3.8xlarge (up to 10 Gb network), 14 cassandra-stress clients (2 per instance)

Cassandra cluster:

  • EC2 instance type: i3.4xlarge (8 physical cores | 122 GiB RAM)
  • Storage (ephemeral disks): 2 NVMe drives, 1900 GB each
  • Cluster size: 40 nodes in a single DC
  • Total CPU and RAM: CPU count 640 | RAM size ~4.76 TB
  • DB SW version: Cassandra 3.11.2 (OpenJDK build 1.8.0_171-b10)
  • Latency test loaders: 8 x i3.8xlarge (up to 10 Gb network), 16 cassandra-stress clients (2 per instance)

Cassandra Optimizations

Cassandra requires detailed tuning to achieve optimal results. For these tests, tuning settings were derived from the official DataStax guide to Cassandra tuning. We invite readers to suggest other tuning settings that might produce an alternative perspective on the relative performance characteristics of these two databases.

Originally, we applied the changes listed below only to cassandra.yaml and the jvm.options file, which yielded poor performance results. After applying the IO tuning settings, Cassandra performed much better. Still, despite multiple attempts with varying numbers of cassandra-stress clients and threads per client, we could not push Cassandra beyond roughly 300K operations per second.

cassandra.yaml:

buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
row_cache_size_in_mb: 10240
concurrent_compactors: 16
compaction_throughput_mb_per_sec: 960

jvm.options:
-Xmx48G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=16
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc.log
#-XX:+UseParNewGC
#-XX:+UseConcMarkSweepGC
#-XX:+CMSParallelRemarkEnabled
#-XX:SurvivorRatio=8
#-XX:MaxTenuringThreshold=1
#-XX:CMSInitiatingOccupancyFraction=75
#-XX:+UseCMSInitiatingOccupancyOnly
#-XX:CMSWaitDuration=10000
#-XX:+CMSParallelInitialMarkEnabled
#-XX:+CMSEdenChunksRecordAlways

IO tuning:

echo 1 > /sys/block/md0/queue/nomerges
echo 8 > /sys/block/md0/queue/read_ahead_kb
echo deadline > /sys/block/md0/queue/scheduler

Scylla Optimizations

In contrast to Cassandra, Scylla dynamically tunes itself for optimal performance. No specific optimizations were required.

Dataset Used and Disk Space Utilization

Latency tests strive to stress disk access. To achieve this, each database was populated with a large dataset of around 11 TB, consisting of 38.85B partitions using the default cassandra-stress schema, with each partition ~310 bytes in size. A replication factor of 3 was used. Each Cassandra node holds a data set 5.5 times larger than its RAM, whereas each Scylla node holds a data set 16.25 times larger than its RAM.

The total storage used by Cassandra is smaller than Scylla’s due to the Cassandra 3.0 file format used in Cassandra 3.11. (As a side note, Scylla 3.0, planned for a Q4 2018 release, will provide native support for the newer file format.)

                          Scylla              Cassandra
Total used storage        ~32.5 TB            ~27 TB
/dev/md0 usage (avg.)     ~8.18 TB / node     ~692 GB / node
Data size : RAM ratio     ~16.25 : 1          ~5.5 : 1
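
The ratios in the table follow directly from the per-node storage and RAM figures; a quick back-of-the-envelope check (a sketch using the rounded numbers above, so the results differ slightly from the quoted 16.25:1 and 5.5:1):

# Per-node data-to-RAM ratio, using the rounded figures from the table above.
scylla_data_per_node_gb    = 8_180   # ~8.18 TB on /dev/md0
scylla_ram_gb              = 512     # i3.metal
cassandra_data_per_node_gb = 692     # ~692 GB on /dev/md0
cassandra_ram_gb           = 122     # i3.4xlarge

print(f"Scylla:    {scylla_data_per_node_gb / scylla_ram_gb:.2f} : 1")        # ~16 : 1
print(f"Cassandra: {cassandra_data_per_node_gb / cassandra_ram_gb:.2f} : 1")  # ~5.7 : 1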

 

Benchmark Test Results for Scylla Open Source 2.2 vs Apache Cassandra 3.11


Performance Results (Raw Data)

The following summarizes the results of each latency test, comparing Scylla Open Source version 2.2 (4 nodes x i3.metal) against Cassandra 3.11 (40 nodes x i3.4xlarge).

Latency Test (300K)

Workload A (CL=Q)
50% RD/WR

Range:
38.85B partitions

Distribution: gaussian(1..38850000000, 19425000000, 6475000000)

Duration: 90 min.

Scylla Open Source version 2.2 (4 nodes x i3.metal):
Overall Throughput (ops/sec): ~304K
Avg Load (scylla-server): ~70%
————————————————————————-
READ partitions (Avg.): ~58.62M
Avg. 95% Latency (ms): 8.0
Avg. 99% Latency (ms): 12.0
Avg. 99.9% Latency (ms): 36.0
————————————————————————-
UPDATE partitions (Avg.): ~58.62M
Avg. 95% Latency (ms): 3.2
Avg. 99% Latency (ms): 4.8
Avg. 99.9% Latency (ms): 9.9

* 14 x i3.8xlarge used as loaders (10 Gb network)

Cassandra 3.11 (40 nodes x i3.4xlarge):
Overall Throughput (ops/sec): ~306K
Avg Load: ~50%
————————————————————————-
READ partitions (Avg.): ~51.61M
Avg. 95% Latency (ms): 5.3
Avg. 99% Latency (ms): 131.3
Avg. 99.9% Latency (ms): 483.2
————————————————————————-
UPDATE partitions (Avg.): ~51.61M
Avg. 95% Latency (ms): 2.4
Avg. 99% Latency (ms): 11.9
Avg. 99.9% Latency (ms): 474.4

* 16 x i3.8xlarge used as loaders (10 Gb network)

Latency Test (200K)

Workload A (CL=Q)
50% RD/WR

Range:
38.85B partitions

Distribution: gaussian(1..38850000000, 19425000000, 6475000000)

Duration: 90 min.

Scylla Open Source version 2.2 (4 nodes x i3.metal):
Overall Throughput (ops/sec): ~200K
Avg Load (scylla-server): ~50%
————————————————————————-
READ partitions (Avg.): ~38.58M
Avg. 95% Latency (ms): 5.3
Avg. 99% Latency (ms): 7.5
Avg. 99.9% Latency (ms): 10.2
————————————————————————-
UPDATE partitions (Avg.): ~38.58M
Avg. 95% Latency (ms): 2.3
Avg. 99% Latency (ms): 3.1
Avg. 99.9% Latency (ms): 5.3

* 14 x i3.8xlarge used as loaders (10 Gb network)

Cassandra 3.11 (40 nodes x i3.4xlarge):
Overall Throughput (ops/sec): ~200K
Avg Load: ~35%
————————————————————————-
READ partitions (Avg.): ~33.87M
Avg. 95% Latency (ms): 1.6
Avg. 99% Latency (ms): 85.8
Avg. 99.9% Latency (ms): 423.6
————————————————————————-
UPDATE partitions (Avg.): ~33.87M
Avg. 95% Latency (ms): 0.9
Avg. 99% Latency (ms): 3.1
Avg. 99.9% Latency (ms): 422.5

* 16 x i3.8xlarge used as loaders (10 Gb network)

Latency Test (100K)

Workload A (CL=Q)
50% RD/WR

Range:
38.85B partitions

Distribution: gaussian(1..38850000000, 19425000000, 6475000000)

Duration: 90 min.

Scylla Open Source version 2.2 (4 nodes x i3.metal):
Overall Throughput (ops/sec): ~100K
Avg Load (scylla-server): ~25%
————————————————————————-
READ partitions (Avg.): ~19.29M
Avg. 95% Latency (ms): 2.6
Avg. 99% Latency (ms): 5.6
Avg. 99.9% Latency (ms): 8.4
————————————————————————-
UPDATE partitions (Avg.): ~19.29M
Avg. 95% Latency (ms): 0.8
Avg. 99% Latency (ms): 2.4
Avg. 99.9% Latency (ms): 3.5

* 14 x i3.8xlarge used as loaders (10 Gb network)

Cassandra 3.11 (40 nodes x i3.4xlarge):
Overall Throughput (ops/sec): ~100K
Avg Load: ~18%
————————————————————————-
READ partitions (Avg.): ~16.88M
Avg. 95% Latency (ms): 1.2
Avg. 99% Latency (ms): 8.6
Avg. 99.9% Latency (ms): 286.6
————————————————————————-
UPDATE partitions (Avg.): ~16.88M
Avg. 95% Latency (ms): 0.8
Avg. 99% Latency (ms): 1.0
Avg. 99.9% Latency (ms): 283.7

* 16 x i3.8xlarge used as loaders (10 Gb network)

Summary of Results

Compared to the 40-node Cassandra cluster, the Scylla 4-node cluster provided:

  • 10X reduction in administration overhead.
    – All operations, including upgrades, logging, monitoring, and so forth, require a fraction of the effort.
  • 2.5X reduction in AWS EC2 costs.
    – In fact, had we been stricter with Cassandra, we would have had to increase its cluster size further to meet the required latency.
  • Scylla met the SLA of 99% latency < 10ms in all but one tested workload
    – One exception of 12ms in the read workload under 300K OPS
  • Cassandra met the SLA of 99% latency < 10ms only for the 100K OPS workload.
    – The Cassandra 99% latency at 300k OPS was 131ms
  • Scylla demonstrated superior 99.9% latency in all test cases
    – In some cases the improvement was as much as 45X (see the sketch after this list).
  • In the one case where Cassandra came out ahead (writes at 100K OPS), both databases demonstrated very low single-digit latency.
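
The “up to 11X” and “45X” figures above can be checked against the raw-results tables earlier in this post. A minimal sketch, using the read-side 99th and 99.9th percentile values (update-side ratios can be computed the same way):

# 99th / 99.9th percentile read latencies (ms) from the raw results above,
# keyed by target throughput in ops/sec.
results = {
    "300K": {"scylla_p99": 12.0, "cassandra_p99": 131.3,
             "scylla_p999": 36.0, "cassandra_p999": 483.2},
    "200K": {"scylla_p99": 7.5,  "cassandra_p99": 85.8,
             "scylla_p999": 10.2, "cassandra_p999": 423.6},
    "100K": {"scylla_p99": 5.6,  "cassandra_p99": 8.6,
             "scylla_p999": 8.4,  "cassandra_p999": 286.6},
}

for workload, r in results.items():
    p99_gain  = r["cassandra_p99"]  / r["scylla_p99"]
    p999_gain = r["cassandra_p999"] / r["scylla_p999"]
    print(f"{workload} OPS reads: {p99_gain:.1f}x better p99, {p999_gain:.1f}x better p99.9")
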
Auto-tuning Compaction to Achieve Superior Performance

Scylla version 2.2 introduces several new capabilities, including the Compaction Controller for the Size-Tiered Compaction Strategy (STCS). This new controller gives Scylla an understanding of exactly how many CPU shares it can allocate to compactions. In all the tested workloads (300K, 200K, and 100K OPS), the incoming traffic load on the CPU-reactor averaged 70%, 50%, and 25% respectively. The compaction controller determines whether there are enough unused CPU shares that can be allocated to compactions. This enables Scylla to complete compactions quickly and aggressively while ensuring that the foreground load is maintained and throughput is unaffected. The spikes you see in the CPU-reactor graph for each workload correspond exactly to compaction job execution, as can be seen in the compaction graph.

When the workload is bigger (300K OPS), SSTables are created faster and more frequent compactions are needed, which is why we see more frequent CPU-reactor spikes to 100%.

When the workload is smaller (100K OPS), SSTables are created more slowly and compactions are needed less frequently, resulting in very few CPU-reactor spikes during that run.


Latency Test (300K OPS): Mixed 50% WR/RD Workload (CL=Q)
 


Latency Test (200K OPS): Mixed 50% WR/RD Workload (CL=Q)
 


Latency Test (100K OPS): Mixed 50% WR/RD Workload (CL=Q)
 
Reliability

The hardware failure rate is a function of the number of nodes. While SSD manufacturers cite a theoretical Mean Time Between Failures (MTBF) of a million or more hours, a more realistic figure we received from one of the three large vendors is that, in production, SSD drives fail roughly every two years. We can make these factors more empirical with a calculation drawn from reliability and failure analysis, specifically the Probability of Concurrent Double Failures (PCDF). The more nodes a cluster has, the higher the chance of a single failure, and even of a double failure. Both Scylla and Cassandra can cope with the loss of a node when the replication factor is 3 and quorum consistency is used. However, in the event of a second, concurrent fault, the affected data ranges will not be accessible at quorum consistency until replacement nodes are installed and the remaining replicas stream the data into the new nodes.

With this in mind, we can compare the PCDF for Cassandra versus Scylla, as tested in this benchmark using the following formula:

PCDF = NC * (MTTR/MTBF)^2 * DIY

Where:

  • NC = Node Count
  • DIY = Days In Year = 365
  • MTBF = Mean Time Between Failures = 2 years = 2 * 365 = 730 days
  • MTTR = Mean Time To Repair/Recover = 1 day
  • PCDF = Probability of Concurrent Double Failures

The results show that Scylla provides significantly lower PCDF than Cassandra, along with a commensurate reduction in the overhead of downtime and manual intervention:

Cassandra’s PCDF = 40 * (1/730)^2 * 365 = 2.7% for a double failure
Scylla’s PCDF = 4 * (1/730)^2 * 365 = 0.27% for a double failure
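
A minimal sketch of the same calculation in code, using the formula and constants defined above:

# Probability of Concurrent Double Failures (PCDF), per the formula above.
MTBF_DAYS = 2 * 365   # mean time between SSD failures: ~2 years in production
MTTR_DAYS = 1         # mean time to repair/recover a node
DAYS_IN_YEAR = 365

def pcdf(node_count):
    """Yearly probability of two nodes being down at the same time."""
    return node_count * (MTTR_DAYS / MTBF_DAYS) ** 2 * DAYS_IN_YEAR

print(f"Cassandra (40 nodes): {pcdf(40):.1%}")   # ~2.7%
print(f"Scylla     (4 nodes): {pcdf(4):.2%}")    # ~0.27%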

Conclusion: Calculating TCO

To conclude, we can put these performance numbers into the larger context of total cost of ownership (TCO). As we have shown, in terms of platform provisioning, Scylla’s TCO is 2.5x better than Cassandra.

Additionally, there is a huge difference in latency at the upper percentiles. Had we required Cassandra to achieve the same latencies, the cost advantage would have at least doubled, to 5x or more.


Scylla Open Source version 2.2:

  • One-year term estimated cost: ~$112K
  • 4 x i3.metal: $112,100 (1-year contract, all upfront payment)

Cassandra 3.11:

  • One-year term estimated cost: ~$278.6K
  • 40 x i3.4xlarge: $278,560 (1-year contract, all upfront payment)
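
The 2.5X cost figure cited earlier follows directly from these two numbers (a quick sketch):

# One-year, all-upfront EC2 costs from the table above.
scylla_cost    = 112_100   # 4 x i3.metal
cassandra_cost = 278_560   # 40 x i3.4xlarge

print(f"EC2 cost ratio: {cassandra_cost / scylla_cost:.2f}x")  # ~2.5x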

Appendix A: Schemas and Commands

Scylla Schema (RF=3)

Cassandra Schema (RF=3)

cassandra-stress commands – Scylla

  • Population (~11TB | 38.85B partitions | CL=ONE) x 8 clients
  • nohup cassandra-stress write no-warmup n=4856250000 cl=one -mode native cql3 -node [IPs] -rate threads=200 -log file=[log_file] -pop seq=1..4856250000 &

Latency tests: Mixed 50% WR/RD workload (CL=Q) x 14 clients

  • 7 clients x: nohup taskset -c 1-15 cassandra-stress mixed ratio\(write=1,read=1\) no-warmup duration=90m cl=quorum -pop dist=gaussian\(1..38850000000,19425000000,6475000000\) -mode native cql3 -node [IPs] -log file=[log_file] -rate threads=200 limit=21650/s | 14285/s | 7142/s & (300K | 200K | 100K Ops)
  • 7 clients x: nohup taskset -c 17-31 cassandra-stress mixed ratio\(write=1,read=1\) no-warmup duration=90m cl=quorum -pop dist=gaussian\(1..38850000000,19425000000,6475000000\) -mode native cql3 -node [IPs] -log file=[log_file] -rate threads=200 limit=21650/s | 14285/s | 7142/s & (300K | 200K | 100K Ops)

cassandra-stress commands – Cassandra

Population (~11TB | 38.85B partitions | CL=ONE) x 16 clients

  • nohup cassandra-stress write n=2428125000 cl=one -mode native cql3 -node [IPs] -rate threads=200 -log file=[log_file] -pop seq=0..2428125000 &

Latency test: Mixed 50% WR/RD workload (CL=Q) x 16 clients

  • nohup cassandra-stress mixed ratio\(write=1,read=1\) no-warmup duration=90m cl=quorum -pop dist=gaussian\(1..38850000000,19425000000,6475000000\) -mode native cql3 -node [IPs] -log file=[log_file] -rate threads=200 limit=19000/s | 12500/s | 6250/s & (300K | 200K | 100K Ops)
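
The per-client limit= values above follow from dividing each cluster-wide throughput target by the number of loader clients. A quick sketch (the limits actually used were rounded so the aggregate lands at or slightly above the target):

# Derive per-client cassandra-stress rate limits from the cluster-wide targets.
targets_ops = [300_000, 200_000, 100_000]

def per_client_limits(num_clients):
    return [round(t / num_clients) for t in targets_ops]

print("Scylla loaders    (14 clients):", per_client_limits(14))  # ~[21429, 14286, 7143]; 21650/14285/7142 used
print("Cassandra loaders (16 clients):", per_client_limits(16))  # [18750, 12500, 6250]; 19000/12500/6250 used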

Let’s do this

Getting started takes only a few minutes. Scylla has an installer for every major platform and is well documented. If you get stuck, we’re here to help.