Join us at Scylla Summit 2017 in San Francisco Oct 24-25 - Register now!

Benchmarking Scylla 1.6 vs Cassandra 3.0.9 on Amazon Web Services i2.8xlarge Instances

Throughput of Scylla vs Cassandra Using 2TB of Data for Different YCSB Workloads

Executive summary

Benchmarking database systems helps users determine the best database solution for their application. Scylla, an Apache Cassandra drop-in replacement database, uses novel techniques to deliver best-in-class throughput and latency to users of wide column NoSQL databases. This benchmark results provide throughput and latency measurements comparison of Scylla 1.6 release vs. Cassandra 3.0.9 release.

The results obtained from extensive testing show Scylla outperforms Cassandra by factors of 2.3 to 5.1 based on the different workload tested. This document reviews the testing process, tuning and throughput and the latency measurements.

The Cassandra testing process

Our initial target version to benchmark Apache Cassandra was 3.9.0 from the Apache Cassandra repository. However, we ended up using DataStax Community (DSC) version for the benchmark below. Tuning and operating Apache Cassandra and DSC is a lengthy process. We had several iterations and more than 80 hours of false starts before we were able to set up a stable Cassandra cluster. We learned that the DSC out-of-the-box experience will not provide users with the expected performance or stability for a production environment. Please refer to the settings applied to the Apache Cassandra installation at the end of this document.

While testing Apache Cassandra 3.9.0, we found that extensive tuning is required to get the system stable, and the results obtained from Cassandra were far below our expectations.

After installing Apache Cassandra 3.9.0, the first test we created was the data insertion test (write). During the test, insertion clients came to a halt and reported data insertion failures to the point that the job ended with a FAILURE set. We added tuning to make sure we had enough file handlers and reduced the number of loaders. Still, the tests showed miserable performance results. We experienced write throughput in rates that were a small percentage of the Scylla throughput rate. At that point, we decided to switch to the 3.x stable release (DSC version) and not use the cutting-edge Cassandra 3.9 release. The DSC is based on Apache  Cassandra 3.0.9 version. While the installation process is more straightforward than the Apache version, the testing yielded some insight about the ways to operate Apache Cassandra. The first testing of data insertion of 11,600,000,000 partitions started smoothly. Our first configuration was naive—we didn’t change the JVM settings or the number of compactors used. So, with a default 4GB heap size, we tried to inject over 1TB of data. Once the initial writes failed we increased the JVM heap size to 24GB, and the data insertions went well until we reached ~50% of data in the servers. At the halfway point, all 3 Cassandra servers crashed with OOM errors. To save time we decided to stick with half the data set testing and proceed with the comparison to Scylla. Later research yielded another configuration setting—increase the number of memory mapped files in the kernel from 64k to 1M to solve the dataset size limitation.

System under test

Both Scylla and Cassandra use three i2.8xlarge AWS servers to cluster a 3-node database.

Write and read loads are generated through 10 m4.2xlarge instances deployed in the same VPC.

Installation, settings, and configurations of the servers and loaders are described later in the Installation and Setup section of this document.

About the measurements

The Cassandra stress tool reports the number of operations per second (OPS) as measured by the client and provides 6 latency measurements: mean, median, 95%, 99%, 99.9% and max. All latency measurements are reported in milliseconds.

About the tests conducted

In each test scenario, Scylla and Cassandra used their own 3 nodes of i2.8xlarge machines, provisioned on AWS EC2 west-2 region.

Scylla setup included 3 database servers and 10 loaders. The Cassandra setup also included 3 database servers and 10 loaders.

Write workload test

To determine the performance of a write-only workload, each loader created unique partitions in the stress keyspace. A test schema using a replication factor of 3 was generated in each database.

Table 1: Write workload results

MeasurementScylla 1.6Cassandra 3.0.9
Row rate [OPS]164,61070,771
Latency mean [ms]3071
Latency median [ms]2420
Latency 95th percentile [ms]51143
Latency 99th percentile [ms]66478
Latency 99.9th percentile [ms]2961,384
Latency max [ms]3,60732,601

Observations from the write throughput workload test

Scylla provides 230% better throughput while maintaining 99.9% latency at roughly one fifth of Cassandra latency. Looking into the bounding factors of the test, we observe that Scylla throughput is bounded by CPU capacity, and the load on each server is nearly 100%.

One note about latency: These are latency results under 100% CPU workload, and this is a throughput benchmark and thus the high latency results. If Scylla or even Cassandra were run for working range (30%-50% load), the latency numbers would be dramatically lower.

Figure 2 (above) shows Scylla behavior during dataset insertion. CPU utilization is at peak. As can be seen, the CPU workload is back to idle roughly 30 minutes after the last data insertion. The short 30-minutes of additional CPU processing is attributed to the compaction process.

For Cassandra, when we look at the CPU utilization, we see that the utilization process continued for an additional ~37 hours, for our data set of 1.3TB. The short period of low utilization seen in Figure 3 (below) is a time period in which the cluster crashed and restarted.

Read workload tests

For the read workload tests, we generated 2 scenarios. The first scenario (large spread) explores a use case when most of the reads are fetched from disk. This scenario describes patterns where most of the data is not stored in cache, or the dataset is larger than the server’s RAM.

The second scenario (small spread) explores a use case when most of the reads are fetched from memory. This scenario describes patterns, where data size is smaller than server’s RAM, or a very narrow range of data is read by the application.

Table 2: Large spread read results

MeasurementScylla 1.6Cassandra 3.0.9
Row rate [OPS]100,12819,091
Latency mean [ms]50281
Latency median [ms]555
Latency 95th percentile [ms]138618
Latency 99th percentile [ms]5694,027
Latency 99.9th percentile [ms]2,3449,465
Latency max [ms]5,97532,017

Table 3: Small spread read results

MeasurementScylla 1.6Cassandra 3.0.9
Row rate [OPS]344,051107,598
Latency mean [ms]1550
Latency median [ms]730
Latency 95th percentile [ms]14122
Latency 99th percentile [ms]176173
Latency 99.9th percentile [ms]466242
Latency max [ms]1,138389

Observations from the read throughput tests

As mentioned before, we expect the large spread data set read is expected to be served mostly from the storage media, in our case the 8x800GB SSD ephemeral volumes. Reviewing the load on Scylla servers, we observe that the CPU is not bounded. However, the media is working at peak IOPs. The next order of work is to repeat the tests with newly available i3 instances on AWS—stay tuned for the next benchmarking reports.

Figure 4 (above) shows Scylla behavior during the large dataset read test. CPU utilization and per-server requests are described in the lower part of the graph.

Figure 5 (above) shows Scylla disk and cache behavior during the large data set read process and illustrates the high rate of reads from disk and the expected high rate of cache misses.

In the small spread read scenario, the media operation is minimal to non-existent, and the requests are served from server’s cache, sending the CPU load to its peak of nearly 100%.

Figure 6 (above) shows Scylla behavior during the small spread read test. CPU utilization is maximized and each server delivered more than 130K operations per second.

Figure 7 (above) shows Scylla’s disk and cache behavior during the small spread read test. Nearly all requests are served from Scylla’s cache.

Installation, configuration, and workload creation

In the following section, we will review the methods used to install Scylla 1.6 and Cassandra 3.0.9 servers and loaders, which configuration settings were applied, and the method of creating the write, large spread read, and small spread read workloads.

Cassandra 3.0.9 Installation

For the Cassandra installation, we used DataStax Cassandra Community release, based on Apache Cassandra 3.0.9.

Using Centos 7.2 AMI: ami-d2c924b2 as operating system for the instances.

sudo yum install java-1.8.0-openjdk -y
sudo vi /etc/yum.repos.d/datastax.repo
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
sudo yum install dsc30 -y
sudo yum install cassandra30-tools -y

After installation we changed the mounting directories as follows:

sudo mdadm --create /dev/md0 --level 0 --raid-devices 8 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
sudo mkfs.xfs -f /dev/md0
sudo mount /dev/md0 /var/lib/cassandra/

Cassandra 3.0.9 Configuration settings

Jvm.options modifications

We used G1GC garbage collection method, applied a JVM heap size of 24GB.

-Xms24G
-Xmx24G
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=32
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10

In the cassandra.yaml file, we enabled the ssd support and increased the number of compactors to the number of CPUs in the machines.

buffer_pool_use_heap_if_exhausted: true
disk_optimization_strategy: ssd
row_cache_size_in_mb: 10240
concurrent_compactors: 32

Scylla 1.6 Installation

We used Scylla AMI ScyllaDB 1.6.0 (ami-930bb0f3), a 3 node orchestration completed on i2.8xlarge machines.

No further settings or changes were made to the deployment.

Load and testing creations

The following perl scripts helped create the workloads used.

The project of creating the workload can be found here: https://github.com/eyalgutkind/Scylla_Cassandra_stress

Here are some examples of the scripts output:

Write workload

nohup cassandra-stress write n=580000000 cl=QUORUM -schema "replication(factor=3)" -mode native cql3 -pop seq=4060000001..4640000000 -node 172.31.1.79,172.31.4.45,172.31.7.248 -rate threads=500 -log file=testwrite_scylla16_halfsize_feb242017.data

Read large spread workload

nohup cassandra-stress read duration=60m cl=QUORUM -mode native cql3 -pop dist=gaussian\(1..5800000000,2900000000,966666666\) -node 172.31.2.8,172.31.3.167,172.31.11.4 -rate threads=500 -log file=testreadlarge_cass309_i28xlhalfsize_feb282017.data

Read small spread workload

nohup cassandra-stress read duration=60m cl=QUORUM -mode native cql3 -pop dist=gaussian\(1..5800000000,2900000000,100\) -node 172.31.1.79,172.31.4.45,172.31.7.248 -rate threads=500 -log file=testreadsmall_scylla1p6_i28xlhalfsize_feb252017.data

Let’s do this

Getting started takes only a few minutes. Scylla has an installer for every major platform and is well documented. If you get stuck, we’re here to help.

Apache®, Apache Cassandra®, are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.