Scylla vs. Cassandra benchmark (cluster)

The following benchmark compares Scylla and Cassandra on a small cluster with replication factor 3 and statement consistency level QUORUM.

Test Bed

The test was executed on SoftLayer servers. Configuration included:

3 DB servers (Scylla / Cassandra):

  • SoftLayer bare metal server
  • CPU: 2x 12-core Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, with hyperthreading
  • RAM: 128 GB
  • Networking: 10 Gbps
  • Disk: MegaRAID SAS 9361-8i, 4x 960 GB SSD
  • OS: Fedora 22 chroot running in CentOS 7, Linux 3.10.0-229.11.1.el7.x86_64
  • Java: Oracle JDK 1.8.0_60-b27

22 load servers (cassandra-stress):

  • SoftLayer VM server
  • XEN hypervisor
  • CPU: Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz, 16 logical cores
  • RAM: 16 GB
  • Networking: 1 Gbps
  • OS: CentOS 7, Linux 3.10.0-229.7.2.el7.x86_64

All machines were located in the same data center.

Server tuning

The DB servers were tuned for better performance:

  • The bond interface was dropped and the bare NIC interface was used
  • NIC RX interrupts were each bound to a separate CPU
  • NIC TX queues were each bound to a separate CPU
  • Open file limit was increased: ulimit -n 1000000
  • 128 huge pages were configured
  • cpufreq’s scaling_governor was changed to performance

See configure.sh
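
For illustration, here is a minimal sketch of what such tuning can look like; the interface name eth0, the IRQ discovery pattern and the CPU numbering are assumptions, and the exact steps used in the test are the ones in configure.sh:

    # Raise the open file limit
    ulimit -n 1000000

    # Reserve 128 huge pages
    echo 128 > /proc/sys/vm/nr_hugepages

    # Switch every core's cpufreq scaling governor to "performance"
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done

    # Spread NIC queue interrupts across CPUs, one CPU per queue
    # (IRQ discovery below is hypothetical; check /proc/interrupts on the actual machine)
    cpu=0
    for irq in $(awk '/eth0-/ {sub(":","",$1); print $1}' /proc/interrupts); do
        echo $cpu > /proc/irq/$irq/smp_affinity_list
        cpu=$((cpu + 1))
    done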

Client tuning

The client VMs were tuned for better performance:

  • NIC IRQs were bound to CPU 0
  • cassandra-stress was limited to CPUs 1-15 using taskset
  • tsc was set as the clock source.
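
A similar sketch for the client side, again with a hypothetical interface name and IRQ discovery:

    # Pin all NIC IRQs to CPU 0
    for irq in $(awk '/eth0/ {sub(":","",$1); print $1}' /proc/interrupts); do
        echo 0 > /proc/irq/$irq/smp_affinity_list
    done

    # Use the TSC as the clock source
    echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

    # Keep cassandra-stress off CPU 0
    taskset -c 1-15 cassandra-stress write cl=QUORUM duration=15min -mode native cql3 -rate threads=700 -node $SERVERS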

Scylla

  • version: 0498cebc58b9fbadb25a7b018cebf95d965d88da
  • command line:

    scylla --data-file-directories=/var/lib/scylla/data --commitlog-directory=/var/lib/scylla/commitlog
    --collectd=1 --collectd-address=$COLLECTD_HOST:25827 --collectd-poll-period 3000 --collectd-host=\"$(hostname)\"
    --log-to-syslog=1 --log-to-stdout=1 -m80G --options-file=./scylla.yaml
  • scylla.yaml

Note that in this benchmark Scylla was using the default kernel networking stack (POSIX), not DPDK.

Scylla was compiled and run in a Fedora 22 chroot, which is equipped with gcc 5.1.1-4. We observed that this gives about a 10% performance improvement over the standard CentOS Scylla RPM running outside the chroot. Future Scylla RPMs will be based on gcc 5.1 so that default installations can also benefit.

Cassandra

The differences between the cassandra.yaml used in the test and the distribution-provided one are:

  • Replaced IP addresses
  • concurrent_writes and concurrent_reads were increased to 384
  • native_transport_max_threads was increased to 256
  • concurrent_compactors was increased to 8
  • memtable_flush_writers was increased to 12
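
For reference, the changed settings would appear in cassandra.yaml roughly as follows (an illustrative excerpt, not the actual file used in the test):

    concurrent_reads: 384
    concurrent_writes: 384
    native_transport_max_threads: 256
    concurrent_compactors: 8
    memtable_flush_writers: 12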

Workloads

Three workloads were tested:

  • Write only: cassandra-stress write cl=QUORUM duration=15min -mode native cql3 -rate threads=700 -node $SERVERS
  • Read only: cassandra-stress read cl=QUORUM 'ratio(read=1)' duration=15min -pop 'dist=gauss(1..10000000,5000000,500000)' -mode native cql3 -rate threads=700 -node $SERVERS
  • Mixed (50/50 read/write): cassandra-stress mixed cl=QUORUM 'ratio(read=1,write=1)' duration=15min -pop 'dist=gauss(1..10000000,5000000,500000)' -mode native cql3 -rate threads=700 -node $SERVERS

The $SERVERS variable was set to contain all server addresses, comma separated.
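
For example, with hypothetical node addresses, the write workload invocation would look like:

    SERVERS=10.0.0.1,10.0.0.2,10.0.0.3
    cassandra-stress write cl=QUORUM duration=15min -mode native cql3 -rate threads=700 -node $SERVERS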

Before starting each workload, schema was created using the following command:

cassandra-stress write n=1 cl=ALL -schema "replication(factor=3)" -mode native cql3 -node $SERVERS

The cassandra-stress from the scylla-tools-0.9-20150924.86671d1.el7.centos.noarch RPM was used. It is based on version 2.1.8 of cassandra-stress. The main difference from the stock cassandra-stress is that it was changed not to use the Thrift API in the setup phase, since Thrift is currently disabled in Scylla. It was not modified in any way that would favor Scylla over Cassandra performance-wise.

Methodology

Each cassandra-stress process was placed on a separate loader machine. Each test was executed on an empty database (sstables and commitlogs deleted). Before each workload the page cache was dropped.
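
Dropping the page cache was done with the standard Linux mechanism, presumably along these lines:

    # Flush dirty pages, then drop the page cache, dentries and inodes
    sync
    echo 3 > /proc/sys/vm/drop_caches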

Before the “read” and “mixed” workloads, the database was populated with 100,000,000 partitions. The population was performed by all client VMs writing in parallel; each client VM was responsible for populating a sub-range of partitions, adjacent to the other sub-ranges:

cassandra-stress write n=$range_size cl=ALL -pop "seq=$range_start..$range_end"  -mode native cql3 -rate threads=500 -node $SERVER

The cassandra-stress processes were started simultaneously.
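
A sketch of how the partition range might be split into 22 adjacent sub-ranges, one per loader VM; $LOADER_ID (0-based, set per client) is a hypothetical variable, and handling of the division remainder is omitted:

    TOTAL=100000000
    LOADERS=22
    range_size=$((TOTAL / LOADERS))
    range_start=$((LOADER_ID * range_size + 1))
    range_end=$((range_start + range_size - 1))

    cassandra-stress write n=$range_size cl=ALL -pop "seq=$range_start..$range_end" -mode native cql3 -rate threads=500 -node $SERVER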

Reported throughput is the average operation rate as perceived by the servers, summed over all servers in the cluster. The throughput was sampled at a fixed rate during the test using monotonic operation counters. For Scylla, the transport-*/total_requests-requests_served collectd counter was used, sampled once every 3 seconds by collectd; it was then aggregated using riemann and fed into graphite. For Cassandra, the StorageProxy MBean counters “WriteOperations” and “ReadOperations” were used, exposed via the mx4j plugin; they were sampled by collectd once every 10 seconds and also fed into graphite.
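
Since the counters are monotonic, the rate at each sample point is simply the counter delta divided by the sampling interval; for example (sample values are made up):

    # ops/sec between two samples of a monotonic request counter
    prev_count=123456; prev_ts=1000    # hypothetical earlier sample
    curr_count=129456; curr_ts=1003    # hypothetical sample taken 3 seconds later
    echo $(( (curr_count - prev_count) / (curr_ts - prev_ts) ))    # -> 2000 ops/sec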

The first 60 seconds and last 15 seconds of the results were ignored to allow for warmup and teardown.

After each test cassandra-stress logs were collected and analyzed. It was verified that the server-perceived throughput matched the client-perceived throughput and that cassandra-stress reported no errors.

Each server instance (Scylla or Cassandra) was running on a separate node. The first instance was designated as the seed node. Servers were started one after another; a server was deemed up once it bound to the CQL port.
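
The “up” check can be approximated with a simple probe of the default CQL port 9042 (the exact check used in the test may have differed):

    # Wait until the node accepts connections on the CQL port
    until nc -z "$node_ip" 9042; do
        sleep 1
    done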

Both Scylla and Cassandra were running with the same durability settings.

Scylla results

The following table contains statistics of total operations per second by workload in the whole cluster:

Workload     Average     Stdev          Min          Max
write      1,930,833     3,190    1,650,625    2,010,829
read       1,951,835     1,873    1,943,209    1,955,225
mixed      1,552,604    68,185    1,094,988    1,651,162

Below you can find the dashboards showing statistics obtained during all three workloads.

All graphs, unless otherwise mentioned in the title, contain separate curves for each node. The nodes are labelled “m1”, “m2” and “m3”, as can be seen in the legend of the load graph. Not every graph has a legend next to it, but the same legend applies to all of them.

The “load” graph shows Scylla’s aggregated reactor-*/gauge-load collectd metric per server. The next two graphs show CQL transactions per second: the first per server, the second accumulated for the whole cluster.

scylla-dash-1

I/O dashboard:

scylla-dash-2

Scylla memory dashboard:

scylla-dash-3

Cassandra results

The following table contains statistics of total operations per second by workload in the whole cluster:

Workload     Average     Stdev          Min          Max
write        125,224    12,382      105,436      166,314
read          48,291    19,238       26,631       98,353
mixed         65,950    20,179        1,592       88,847

Below you can find dashboard snapshots taken during each workload.

Write

cass-dash-write-tps cass-dash-write-io

Here’s a dashboard from another, much longer run, which shows that the throughput does not change over time:

cass-dash-write-tps-long

Read

Note that the first part of the test includes the population phase.

cass-dash-read-tps cass-dash-read-io

Note that the throughput is slowly rising. This can be attributed to compaction reducing the number of sstables needed to serve queries. Eventually compaction ceases and the cluster throughput stabilizes at about 173K tps after 45 minutes. Here’s what the throughput looks like for a longer (55 min) run:

cass-dash-read-tps-long
cass-dash-read-io-long

Mixed

Note that the first part of the test includes the population phase.

cass-dash-mixed-tps cass-dash-mixed-io

And for a longer run (same commentary as for the “read” workload):

cass-dash-mixed-io

Summary

Average throughput, Scylla vs. Cassandra 2.1.9. Results were rounded to the nearest 10K.