New Benchmark: Scylla 2.2 (i3.Metal x 4-nodes) vs Cassandra 3.11 (i3.4xlarge x 40-nodes)
Benchmarking is no easy task, especially when comparing databases with different “engines” under the hood. You want your benchmark to be fair, to run each database on its optimal setup and hardware, and to keep the comparison as apples-to-apples as possible. (For more on this topic, see our webinar on the “Do’s and Don’ts of Benchmarking Databases.”) We kept this in mind when conducting this Scylla versus Cassandra benchmark, which compares Scylla and Cassandra on AWS EC2, using cassandra-stress as the load generator.
Most benchmarks compare different software stacks on the same hardware and try to max out the throughput. However, that is often the wrong approach. A better way is to think about the comparison from the user’s perspective. That is, start by considering the volume, throughput, and latency a database user needs in order to meet his or her business needs. The goal here is to gauge the resources each database needs to meet established requirements and to see how Scylla and Cassandra handle the challenge.
Scylla and its underlying Seastar infrastructure fully leverage powerful modern hardware. With Scylla’s linear scale-up capabilities, the more cores and RAM the better, so we chose AWS EC2’s strongest instance available, i3.metal, for our Scylla cluster. Cassandra, on the other hand, is limited in its ability to scale up due to its reliance on the JVM. Its threads and locks are too heavy and slow down as the number of cores grows, and with no NUMA awareness, performance will suffer. As a result, using the i3.metal instance would not yield optimal results for Cassandra (read more about Cassandra’s sweet spot here). To make a fair comparison with Cassandra, we created a 40-node Cassandra cluster on i3.4xlarge instances. This recreates a use case we saw for of one of our Fortune-50 accounts.
We defined a workload of 38.85 billion partitions, replicated 3 times, 50:50 read/write ratio, a latency of up to 10 msec for the 99th percentile, and throughput requirements of 300k, 200k, and 100k IOPS.
We used our load generator, cassandra-stress, to populate each database with approximately 11TB (38.85B partitions), using the default cassandra-stress schema and RF=3 (see Appendix-A for the schema and cassandra-stress commands used in this benchmark). Once population and compaction completed, we restarted the Scylla /Cassandra service on all nodes, making sure the actual tests started with a cold cache and after all compactions were finished.
We tested the latency for various throughputs to have a more comprehensive comparison. We used a Gaussian distribution (38.85B partitions, Median: 19.425B and Standard Deviation: 6.475B) in all latency tests to achieve as much disk access as possible instead of reading from RAM. Each of the tests ran for 90 minutes to make sure we were in a steady state.
Conclusions and Summary
- Compared to the 40-node Cassandra cluster, the Scylla 4-node cluster provided:
- 10X reduction in administration overhead. This means ALL operations such as upgrade, logging, monitoring and so forth will take a fraction of the effort.
- 2.5X reduction in AWS EC2 costs. In fact, had we been more strict with Cassandra, we would have increased its cluster size to meet the required latency.
- Scylla was able to meet the SLA of 99% latency < 10ms in all tested workloads (with the one exception of 12ms in the read workload under 300K OPS).
- Cassandra was able to meet the SLA of 99% latency < 10ms only for the 100K OPS workload.
- Scylla demonstrated superior 99.9% latency in ALL cases, in some cases showing an improvement of up to 45X.
- Scylla also demonstrated better 99% latency in all cases but one. In the one case where Cassandra was better (write @100K OPS) both clusters demonstrated very low single-digit latency.
- Cassandra demonstrated better 95% latency in most cases. This is an area for improvement for Scylla. Read more about this in the monitoring section below.
Comparison of AWS Server Costs
|Scylla 2.2||Cassandra 3.11|
|Year term Estimated cost: ~$112K
Year term Estimated cost: ~$278.6K
Setup and Configuration
|Scylla Cluster||Cassandra Cluster|
|EC2 Instance type||i3.Metal (72 vCPU | 512 GiB RAM)||i3.4xlarge (16 vCPU | 122 GiB RAM)|
|Storage (ephemeral disks)||8 NVMe drives, each 1900GB||2 NVMe drives, each 1900GB|
|Network||25Gbps||Up to 10Gbps|
|Cluster size||4-node cluster on single DC||40-node cluster on single DC|
|Total CPU and RAM||CPU count: 288 | RAM size: 2TB||CPU count: 640 | RAM size: ~4.76TB|
|DB SW version||Scylla 2.2||Cassandra 3.11.2
(OpenJDK build 1.8.0_171-b10)
|Scylla Loaders||Cassandra Loaders|
|Population||4 x m4.2xlarge (8 vCPU | 32 GiB RAM)
8 c-s clients, 2 per instance
|16 x m4.2xlarge (8 vCPU | 32 GiB RAM)
16 c-s clients, 1 per instance
|Latency tests||7 x i3.8xlarge (up to 10Gb network)
14 c-s clients, 2 per instance
|8 x i3.8xlarge (up to 10Gb network)
16 c-s clients, 2 per instance
It’s no secret that Cassandra’s out-of-the-box performance leaves much to be desired. Cassandra requires quite a bit of tuning to get good results. Based on recommendations from Datastax and Amy Tobey’s guide to Cassandra tuning, we applied the following optimizations to the Cassandra cluster.
Originally we applied the changes listed below to only the
cassandra.yaml and the
jvm options files — that yielded poor performance results. Despite multiple attempts using various amounts of cassandra-stress clients and threads per client, we could not get more than 30K operations per second throughput. After applying the IO tuning setting, Cassandra started performing much better.
There is no need to optimize the Scylla configuration. Scylla automatically configures the kernel, the OS, and itself to dynamically adjust to the best setup.
Dataset Used and Disk Space Utilization
Because we wanted the latency tests to be based primarily on disk access, we populated each database with a large dataset of ~11TB consisting of 38.85B partitions using the default cassandra-stress schema, where each partition size was ~310 bytes. A replication factor of 3 was used. Each Cassandra node holds a data set that is 5.5 times bigger than its RAM, whereas each Scylla node holds 16.25-times the data size of RAM. Note that the Cassandra 3.x file format consumes less disk space. The September/October release of Scylla will include full compatibility with Cassandra’s 3.x format, bringing further improvements to volume and performance.
|Total used storage||~32.5 TB||~27 TB|
||~8.12 TB / node||~690.9 GB / node|
|/dev/md0 (Avg.)||~8.18 TB / node||~692 GB / node|
|Data size / RAM ratio||~16.25 : 1||~5½ : 1|
Performance Results (Graphs)
Performance Results (Data)
The following table summarizes the results for each of the latency tests conducted.
Scylla Monitoring Screenshots
Scylla version 2.2 introduces several new capabilities, including the Compaction Controller for the Size Tiered Compaction Strategy (STCS). This new controller provides Scylla with an understanding of just how much CPU shares it can allocate for compactions. In all the tested workloads (300K Ops, 200K Ops, and 100K Ops) the incoming traffic load on the CPU-reactor was on average 70%, 50%, and 25% respectively. The compaction controller understands if there are enough unused/free CPU shares to be allocated for compactions. This enables Scylla to complete the compactions in a fast and aggressive manner while ensuring that the foreground load is maintained and the throughput is unaffected. The spikes you see in the CPU-reactor graph in each of the workloads correspond exactly to compaction jobs execution, as can be seen, the in the compaction graph.
When the workload is bigger (300K OPS), SSTables are created faster and more frequent compactions are needed, which is why we see more frequent CPU-reactor spikes to 100%. When the workload is smaller (100K OPS), SSTables are created more slowly and compactions are needed less frequently, resulting in very few CPU-reactor spikes during that run.
Latency test (300K Ops): Mixed 50% WR/RD workload (CL=Q)
Latency test (200K OPS): Mixed 50% WR/RD workload (CL=Q)
Latency test (100K OPS): Mixed 50% WR/RD workload (CL=Q)
Scylla’s compaction controller code is new, as is the CPU scheduler. Looking at graphs we see that it’s possible to smooth compaction automatically and reduce the latency to 1/3 the size and thus push more throughput while still meeting the SLA of this use case (10ms for 99%).
Scylla Schema (RF=3)
Cassandra Schema (RF=3)
C-S commands – Scylla
- Population (~11TB | 38.85B partitions | CL=ONE) x 8 clients
nohup cassandra-stress write no-warmup n=4856250000 cl=one -mode native cql3 -node [IPs] -rate threads=200 -log file=[log_file] -pop seq=1..4856250000 &
- Latency tests: Mixed 50% WR/RD workload (CL=Q) x 14 clients
7 clients X
nohup taskset -c 1-15 cassandra-stress mixed ratio\(write=1,read=1\) no-warmup duration=90m cl=quorum -pop dist=gaussian\(1..38850000000,19425000000,6475000000\) -mode native cql3 -node [IPs] -log file=[log_file] -rate threads=200 limit=7142/s | 14285/s | 21650/s &(300K | 200K | 100K Ops)
7 clients X
nohup taskset -c 17-31 cassandra-stress mixed ratio\(write=1,read=1\) no-warmup duration=90m cl=quorum -pop dist=gaussian\(1..38850000000,19425000000,6475000000\) -mode native cql3 -node [IPs] -log file=[log_file] -rate threads=200 limit=21650/s | 14285/s | 7142/s &(300K | 200K | 100K Ops)
C-S commands – Cassandra
- Population (~11TB | 38.85B partitions | CL=ONE) x 16 clients
nohup cassandra-stress write n=2428125000 cl=one -mode native cql3 -node [IPs] -rate threads=200 -log file=[log file] -pop seq=0..2428125000 &
- Latency test: Mixed 50% WR/RD workload (CL=Q) x 16 clients
nohup cassandra-stress mixed ratio\(write=1,read=1\) no-warmup duration=90m cl=quorum -pop dist=gaussian\(1..38850000000,19425000000,6475000000\) -mode native cql3 -node [IPs] -log file=[log_file] -rate threads=200 limit=19000/s | 12500/s | 6250/s &(300K | 200K | 100K Ops)