Benchmarking Scylla 1.6 Vs. Cassandra 3.0.9 on Amazon Web Services i2.8xlarge instances
Benchmarking database systems helps users determine the best database solution for their application. Scylla, an Apache Cassandra drop-in replacement database, uses novel techniques to deliver best-in-class throughput and latency to users of wide column NoSQL databases. These benchmark results compare the throughput and latency of the Scylla 1.6 release against the Cassandra 3.0.9 release.
The results obtained from extensive testing show Scylla outperforming Cassandra by factors of 2.3 to 5.1, depending on the workload tested. This document reviews the testing process, the tuning applied, and the throughput and latency measurements.
The Cassandra testing process
Our initial target version for benchmarking Cassandra was 3.9.0 from the Apache Cassandra repository. However, we ended up using the DataStax Community (DSC) distribution for the benchmark below. Tuning and operating Cassandra and DSC is a lengthy process. We went through several iterations and more than 80 hours of false starts before we were able to set up a stable Cassandra cluster. We learned that the DSC out-of-the-box experience will not provide users with the expected performance or stability for a production environment. The settings applied to the Cassandra installation are listed at the end of this document.
While testing Cassandra 3.9.0, we found that extensive tuning is required to get the system stable, and the results obtained from Cassandra were far below our expectations.
After installing Cassandra 3.9.0, the first test we created was the data insertion test (write). During the test, insertion clients came to a halt and reported data insertion failures, to the point that the job ended in a FAILURE state. We added tuning to make sure we had enough file handles and reduced the number of loaders. Still, the tests showed poor performance results: write throughput ran at a small percentage of the Scylla throughput rate. At that point, we decided to switch to the stable 3.x release (the DSC version) rather than the cutting-edge Cassandra 3.9 release. DSC is based on Cassandra version 3.0.9. While its installation process is more straightforward than the Apache version's, the testing yielded some insight into the ways to operate Cassandra.

The first test, inserting 11,600,000,000 partitions, started smoothly. Our first configuration was naive: we didn't change the JVM settings or the number of compactors used, so with the default 4GB heap size we tried to inject over 1TB of data. Once the initial writes failed, we increased the JVM heap size to 24GB, and the data insertions went well until we reached ~50% of the data on the servers. At the halfway point, all 3 Cassandra servers crashed with OOM errors. To save time, we decided to stick with half the data set and proceed with the comparison to Scylla. Later research yielded another configuration setting: increasing the number of memory-mapped files allowed by the kernel from 64k to 1M solves the dataset size limitation.
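The memory-mapped-file limit mentioned above is controlled by the kernel's `vm.max_map_count` parameter. A minimal sketch of the change, assuming the typical kernel default of ~64k and the ~1M value recommended for Cassandra production deployments:

```shell
# Check the current limit (the kernel default is typically 65530, i.e. ~64k)
sysctl vm.max_map_count

# Raise it to ~1M for the running system
sudo sysctl -w vm.max_map_count=1048575

# Persist the setting across reboots
echo "vm.max_map_count = 1048575" | sudo tee -a /etc/sysctl.conf
```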
System under test
Both Scylla and Cassandra were deployed as 3-node clusters, each on three i2.8xlarge AWS servers.
Write and read loads are generated through 10 m4.2xlarge instances deployed in the same VPC.
Installation, settings, and configurations of the servers and loaders are described later in the Installation and Setup section of this document.
About the measurements
The Cassandra stress tool reports the number of operations per second (OPS) as measured by the client and provides 6 latency measurements: mean, median, 95%, 99%, 99.9% and max. All latency measurements are reported in milliseconds.
About the tests conducted
In each test scenario, Scylla and Cassandra used their own 3 nodes of i2.8xlarge machines, provisioned in the AWS EC2 us-west-2 region.
Both the Scylla and Cassandra setups included 3 database servers and 10 loaders.
Write workload test
To determine the performance of a write-only workload, each loader created unique partitions in the stress keyspace. A test schema using a replication factor of 3 was generated in each database.
Table 1: Write workload results
| Measurement | Scylla 1.6 | Cassandra 3.0.9 |
|---|---|---|
| Row rate [OPS] | 164,610 | 70,771 |
| Latency mean [ms] | 30 | 71 |
| Latency median [ms] | 24 | 20 |
| Latency 95th percentile [ms] | 51 | 143 |
| Latency 99th percentile [ms] | 66 | 478 |
| Latency 99.9th percentile [ms] | 296 | 1,384 |
| Latency max [ms] | 3,607 | 32,601 |
Observations from the write throughput workload test
Scylla provides 2.3x the throughput of Cassandra while keeping 99.9th-percentile latency at roughly one fifth of Cassandra's. Looking into the bounding factors of the test, we observe that Scylla's throughput is bounded by CPU capacity; the load on each server is nearly 100%.
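The throughput factor follows directly from the row rates in Table 1; a quick back-of-the-envelope check:

```shell
# Write throughput ratio from Table 1: Scylla OPS / Cassandra OPS
awk 'BEGIN { printf "%.1fx\n", 164610 / 70771 }'   # prints "2.3x"
```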
One note about latency: these results were measured under 100% CPU load, since this is a throughput benchmark, which explains the high latency figures. If either Scylla or Cassandra were run in a normal working range (30%-50% load), the latency numbers would be dramatically lower.
Figure 2 (above) shows Scylla's behavior during dataset insertion. CPU utilization is at its peak. As can be seen, the CPU returns to idle roughly 30 minutes after the last data insertion; this additional 30 minutes of CPU processing is attributed to the compaction process.
For Cassandra, the CPU utilization shows that compaction continued for an additional ~37 hours for our 1.3TB data set. The short period of low utilization seen in Figure 3 (below) is a period in which the cluster crashed and restarted.
Read workload tests
For the read workload tests, we generated 2 scenarios. The first scenario (large spread) explores a use case where most reads are fetched from disk. This pattern occurs when most of the data is not in cache, or the dataset is larger than the server's RAM.
The second scenario (small spread) explores a use case where most reads are fetched from memory. This pattern occurs when the dataset is smaller than the server's RAM, or the application reads a very narrow range of data.
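The "spread" of the two scenarios comes from the standard deviation of the gaussian key distribution passed to cassandra-stress (the exact commands appear at the end of this document). A rough illustration of the difference, using the ~5.8B-partition dataset and the two sigma values from those commands: since roughly 99.7% of a gaussian falls within 3 sigma of the mean, the "hot" key window is about 6 * sigma keys wide.

```shell
# Hot-window width (~6 sigma) for the large-spread and small-spread sigmas
for sigma in 966666666 100; do
  awk -v s="$sigma" 'BEGIN { printf "sigma=%d -> ~99.7%% of reads hit a window of %.0f keys\n", s, 6 * s }'
done
```

With sigma = 966,666,666 the window spans essentially the full 5.8B-key dataset (defeating the cache), while with sigma = 100 nearly all reads land in a window of only 600 keys, which trivially fits in memory.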
Table 2: Large spread read results
| Measurement | Scylla 1.6 | Cassandra 3.0.9 |
|---|---|---|
| Row rate [OPS] | 100,128 | 19,091 |
| Latency mean [ms] | 50 | 281 |
| Latency median [ms] | 5 | 55 |
| Latency 95th percentile [ms] | 138 | 618 |
| Latency 99th percentile [ms] | 569 | 4,027 |
| Latency 99.9th percentile [ms] | 2,344 | 9,465 |
| Latency max [ms] | 5,975 | 32,017 |
Table 3: Small spread read results
| Measurement | Scylla 1.6 | Cassandra 3.0.9 |
|---|---|---|
| Row rate [OPS] | 344,051 | 107,598 |
| Latency mean [ms] | 15 | 50 |
| Latency median [ms] | 7 | 30 |
| Latency 95th percentile [ms] | 14 | 122 |
| Latency 99th percentile [ms] | 176 | 173 |
| Latency 99.9th percentile [ms] | 466 | 242 |
| Latency max [ms] | 1,138 | 389 |
Observations from the read throughput tests
As mentioned before, the large spread read workload is expected to be served mostly from the storage media, in our case the 8x800GB SSD ephemeral volumes. Reviewing the load on the Scylla servers, we observe that the CPU is not the bottleneck; instead, the media is working at peak IOPS. The next order of work is to repeat the tests with the newly available i3 instances on AWS. Stay tuned for the next benchmarking reports.
Figure 4 (above) shows Scylla behavior during the large dataset read test. CPU utilization and per-server requests are described in the lower part of the graph.
Figure 5 (above) shows Scylla disk and cache behavior during the large data set read process and illustrates the high rate of reads from disk and the expected high rate of cache misses.
In the small spread read scenario, media activity is minimal to non-existent and the requests are served from the server's cache, driving the CPU load to its peak of nearly 100%.
Figure 6 (above) shows Scylla behavior during the small spread read test. CPU utilization is maximized and each server delivered more than 130K operations per second.
Figure 7 (above) shows Scylla’s disk and cache behavior during the small spread read test. Nearly all requests are served from Scylla’s cache.
Installation, configuration, and workload creation
In the following section, we will review the methods used to install Scylla 1.6 and Cassandra 3.0.9 servers and loaders, which configuration settings were applied, and the method of creating the write, large spread read, and small spread read workloads.
Cassandra 3.0.9 Installation
For the Cassandra installation, we used DataStax Cassandra Community release, based on Apache Cassandra 3.0.9.
We used the CentOS 7.2 AMI (ami-d2c924b2) as the operating system for the instances.
sudo yum install java-1.8.0-openjdk -y
sudo vi /etc/yum.repos.d/datastax.repo

[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0

sudo yum install dsc30 -y
sudo yum install cassandra30-tools -y
After installation we changed the mounting directories as follows:
sudo mdadm --create /dev/md0 --level 0 --raid-devices 8 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
sudo mkfs.xfs -f /dev/md0
sudo mount /dev/md0 /var/lib/cassandra/
Cassandra 3.0.9 Configuration settings
We used the G1GC garbage collector and applied a JVM heap size of 24GB:
-Xms24G
-Xmx24G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=32
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
In the cassandra.yaml file, we enabled SSD support and increased the number of compactors to match the number of CPUs in the machines:
buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
row_cache_size_in_mb: 10240
concurrent_compactors: 32
Scylla 1.6 Installation
We used the Scylla AMI ScyllaDB 1.6.0 (ami-930bb0f3); a 3-node orchestration was completed on i2.8xlarge machines.
No further settings or changes were made to the deployment.
Load and test creation
The following Perl scripts helped create the workloads used.
The project of creating the workload can be found here: https://github.com/eyalgutkind/Scylla_Cassandra_stress
Here are some examples of the scripts' output:
Write workload
nohup cassandra-stress write n=580000000 cl=QUORUM -schema "replication(factor=3)" -mode native cql3 -pop seq=4060000001..4640000000 -node 172.31.1.79,172.31.4.45,172.31.7.248 -rate threads=500 -log file=testwrite_scylla16_halfsize_feb242017.data
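Each of the 10 loaders wrote a disjoint 580M-key slice of the sequential keyspace (the command above shows the slice 4060000001..4640000000, i.e. loader index 7). The actual workloads were generated by the Perl scripts in the linked repository; a hypothetical shell sketch of how such per-loader `-pop seq` ranges could be derived:

```shell
# Split a 5.8B-key sequential keyspace into 10 disjoint 580M-key slices,
# one per loader (loader 7's slice matches the command above)
n=580000000
for i in 0 1 2 3 4 5 6 7 8 9; do
  start=$(( i * n + 1 ))
  end=$(( (i + 1) * n ))
  echo "loader $i: -pop seq=${start}..${end}"
done
```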
Read large spread workload
nohup cassandra-stress read duration=60m cl=QUORUM -mode native cql3 -pop dist=gaussian\(1..5800000000,2900000000,966666666\) -node 172.31.2.8,172.31.3.167,172.31.11.4 -rate threads=500 -log file=testreadlarge_cass309_i28xlhalfsize_feb282017.data
Read small spread workload
nohup cassandra-stress read duration=60m cl=QUORUM -mode native cql3 -pop dist=gaussian\(1..5800000000,2900000000,100\) -node 172.31.1.79,172.31.4.45,172.31.7.248 -rate threads=500 -log file=testreadsmall_scylla1p6_i28xlhalfsize_feb252017.data