
Performance Improvements in Scylla 2.3


Scylla 2.3 was recently released, and you can read more about it here. Aside from the many interesting feature developments, like improved support for materialized views, and hardware enablement, like native support for the AWS i3.metal bare-metal instance, Scylla 2.3 also delivers even more performance improvements on top of our already industry-leading performance.

Most of the performance improvements center around three pillars:

  • Improved CPU scheduling, with more work being tagged and isolated
  • The result of a diligent search for latency-inducing events, known as reactor stalls, particularly in the Scylla cache and in the process of writing SSTables
  • A new, redesigned I/O Scheduler that aims to provide lower latencies at peak throughput for user requests

In this article we will look at performance benchmark comparisons between Scylla 2.2 and Scylla 2.3 that highlight those improvements. The results show that Scylla 2.3 improves tail latencies for CPU-intensive workloads by about 20% compared to Scylla 2.2, while I/O-bound read latencies are up to 85% better in Scylla 2.3.

CPU-Intensive mixed workload

Read Latency | Scylla 2.2 | Scylla 2.3 | Improvement
average      | 1.5ms      | 1.4ms      | 7%
p95          | 2.7ms      | 2.5ms      | 7%
p99          | 4.1ms      | 3.3ms      | 20%
p99.9        | 7.7ms      | 5.8ms      | 25%

Table 1: Client-side read latency comparison between Scylla 2.2 and Scylla 2.3. The average and 95th percentile latencies are nearly identical, but the higher percentiles are consistently better.

Write Latency | Scylla 2.2 | Scylla 2.3 | Improvement
average       | 1.2ms      | 1.2ms      | 0%
p95           | 2.0ms      | 2.0ms      | 0%
p99           | 3.1ms      | 2.8ms      | 10%
p99.9         | 6.3ms      | 5.6ms      | 11%

Table 2: Client-side write latency comparison between Scylla 2.2 and Scylla 2.3. Writes are not fundamentally faster, so there is no improvement in average and 95th percentile latencies. But as we make progress with the Scylla CPU Scheduler and fix latency-inducing events, higher percentiles like the 99th and 99.9th are consistently better.

For this benchmark, we created two clusters of three nodes each using AWS i3.8xlarge (32 vCPUs, 244GB RAM). In each cluster, we populated 4 separate tables with 1,000,000,000 keys each, using a replication factor of three. The data payload consists of 5 columns of 64 bytes each, yielding a total data size of approximately 300GB per table, or 1.2TB for the four tables.
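As a quick back-of-the-envelope check, assuming negligible per-cell overhead: 1,000,000,000 keys × 5 columns × 64 bytes ≈ 320GB of raw payload per table, which lines up with the roughly 300GB per table observed on disk.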

After population, we ran a full compaction to make the read comparison fair, then ran a cache warmup session of 1 hour and started four pairs of clients (8 in total). Each pair of clients works on a different table and consists of:

  • a reader client issuing QUORUM reads at a fixed throughput of 20,000 reads/s using a gaussian distribution
  • a writer client issuing QUORUM writes at a fixed throughput of 20,000 writes/s using the same gaussian distribution as the reads.

The gaussian distribution is chosen so that after the warm-up period we achieve high cache hit ratios (> 96%), making sure that we exercise the CPU more than we exercise the storage. Over the four tables, this workload generates 80,000 reads/s and 80,000 writes/s for a total of 160,000 ops/s.
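For illustration, here is a condensed sketch of how the four reader/writer pairs could be launched from a single loader machine (in the actual benchmark each client ran as its own loader); the full per-table commands are listed at the end of this post:

for i in 1 2 3 4; do
  cassandra-stress read no-warmup cl=QUORUM duration=180m \
    -schema "keyspace=keyspace$i replication(factor=3)" \
    -mode cql3 native \
    -rate 'threads=200 limit=20000/s' \
    -pop 'dist=gauss(1..1000000000,500000000,5000000)' \
    -errors ignore \
    -node $NODES &
  cassandra-stress write no-warmup cl=QUORUM duration=180m \
    -schema "keyspace=keyspace$i replication(factor=3)" \
    -mode cql3 native \
    -rate 'threads=200 limit=20000/s' \
    -pop 'dist=gauss(1..1000000000,500000000,5000000)' \
    -errors ignore \
    -node $NODES &
done
wait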

To showcase the improvement, we can focus on one of the read-write pairs operating on a particular table and look at their client-side latencies. The results are summarized in Table 1 for reads and Table 2 for writes. As we would expect, the underlying cost of read and write requests didn't change between Scylla 2.2 and Scylla 2.3, so there is not a lot of difference in the lower percentiles. But as we deliver improvements in CPU scheduling and hunt down latency-inducing events, those improvements are reflected in the higher percentiles.

The clients report the latency percentiles over the entire one-hour run, but it is more enlightening to look at how the tail latencies behave over time. In Figure 1 we can see the 99.9th percentile read latencies, measured every 30 seconds for the entire run. Scylla 2.3 yields consistently lower latencies than Scylla 2.2.

Figure 2 shows a similar scenario for writes. While the 99.9th percentile write latencies fluctuate between 4ms and 6ms for Scylla 2.2, they sit tightly at the 4ms mark for Scylla 2.3. Since Scylla's write path issues no I/O operations and happens entirely in memory for this workload, we can be sure that the improvements seen in this run are due entirely to changes in CPU scheduling and fixes for latency-inducing events.

Figure 1: Server-side, 99.9th percentile read latency over the entire execution during constant throughput workload served from cache. Scylla 2.3 (in green) exhibits consistently lower latencies than Scylla 2.2 (in yellow)

Figure 2: Server-side, 99.9th percentile write latency over the entire execution during constant throughput workload. As I/O operations are not present in the write foreground path, the improvement in latency seen in Scylla 2.3 (in green) as compared to Scylla 2.2 (in yellow) is entirely due to CPU improvements.

I/O-bound read latency workload during major compactions

Read Latency | Scylla 2.2 | Scylla 2.3 | Improvement
average      | 8.4ms      | 6.0ms      | 40%
p50          | 5.6ms      | 3.1ms      | 80%
p95          | 15ms       | 8ms        | 85%
p99          | 67ms       | 63ms       | 6.3%

Table 3: Read latency distributions in Scylla 2.2 and Scylla 2.3 during a major compaction. The tail latencies are about the same, as we expect those to be ultimately determined by the storage technology. But the lower percentiles improve by up to 85% in Scylla 2.3, as the I/O Scheduler is better able to uphold latencies for foreground operations in the face of background work.

In this benchmark, we are interested in understanding how read latencies behave in the face of background operations, like compactions, that can bottleneck the disk. Because the disks are relatively slow, peaking at 162MB/s, compactions can easily reach that number by themselves. Read requests that need to access the storage can end up sitting behind compaction requests in the device queue and incur added latency. This is the scenario that the new I/O Scheduler in Scylla 2.3 improves.

We created two clusters of three nodes each, using AWS c5.4xlarge (16 vCPUs, 32GB RAM) with an attached 3TB EBS volume providing 9,000 IOPS and 162MB/s, to mimic a situation where the user is bound to slower-than-NVMe storage. This is a fairly common situation in environments doing medium-size deployments on premises with older SSD devices.

We populated both the Scylla 2.2 and the Scylla 2.3 clusters with a key-value schema in a single table with 220,000,000 keys, using a replication factor of three. The data payload consists of values that are 1kB in size, yielding a total data size of approximately 200GB, much larger than the memory of the instances.

After starting a major compaction, we issue reads at a fixed throughput of 5,000 reads/s with consistency level of ONE for both Scylla 2.2 and Scylla 2.3. The keys to be read are drawn from a uniform distribution, and since the data set is larger than memory the cache hit ratio is verified to be at most 3%, meaning that those reads are served from disk.

Table 3 summarizes the results seen from the perspective of the client. For this run, we don't expect much change in the high tail latencies, as those are largely determined by the storage characteristics. But Scylla 2.3 shows stark improvements in the lower percentiles, with the 95th percentile 85% better than Scylla 2.2 and the average 40% better.

In Figure 3, we can see the comparative 75th percentile latency over time for Scylla 2.2 and Scylla 2.3. One interesting thing to notice is that some of the compactions finish a bit earlier for Scylla 2.2. With some of the reads now touching fewer SSTables and no longer competing for bandwidth with compactions, both versions see a decrease in latency. But the decrease is less pronounced for Scylla 2.3, meaning it was already much closer to the baseline latencies during the compaction. As the reads are not, themselves, cheaper, both versions eventually reach the same latencies.

Figure 3: 75th percentile read latency for Scylla 2.2 (in green) vs Scylla 2.3 (in yellow). Scylla 2.3 performs much better during compactions.

Aside from the performance difference between the two versions, it is worth noting that the new I/O Scheduler in Scylla 2.3 solves a long-standing configuration nuisance and, by itself, simplifies deployments considerably. The setup utility scylla_io_setup is known to produce results that vary considerably between disks, even when the disks have the same specifications. This confuses users and adds an unnecessary manual step to get those files to agree. As we show in Table 4, this is no longer the case with Scylla 2.3: the differences between the nodes are minimal.

Scylla 2.2 (io.conf):
  Node1: --max-io-requests=1308
  Node2: --max-io-requests=12 --num-io-queues=3
  Node3: --max-io-requests=8 --num-io-queues=2

Scylla 2.3 (io_properties.yaml):
  Node1: read_iops: 9001, read_bandwidth: 160MB, write_iops: 9723, write_bandwidth: 162MB
  Node2: read_iops: 9001, read_bandwidth: 160MB, write_iops: 9723, write_bandwidth: 162MB
  Node3: read_iops: 9001, read_bandwidth: 160MB, write_iops: 9720, write_bandwidth: 162MB

Table 4: I/O configuration properties in Scylla 2.2 and Scylla 2.3. For Scylla 2.2, this is /etc/scylla.d/io.conf. For Scylla 2.3, this is /etc/scylla.d/io_properties.yaml. Scylla 2.2 uses a statistical process that is prone to errors and differences. Oftentimes those parameters disagree even if the disks are similar, and users have to either live with it or manually adjust the parameters. In Scylla 2.3, the process generates more detailed and consistent results, and the differences are minimal.
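For reference, here is a minimal sketch of how the I/O properties can be regenerated and inspected on a Scylla 2.3 node. The exact numbers depend on the disk, and the file layout shown in the comments is an assumption based on the values in Table 4:

# Re-run the I/O tuning step (normally executed once by scylla_setup)
sudo scylla_io_setup

# Inspect the generated properties on Scylla 2.3
cat /etc/scylla.d/io_properties.yaml
# disks:
#   - mountpoint: /var/lib/scylla
#     read_iops: 9001
#     read_bandwidth: 160MB
#     write_iops: 9723
#     write_bandwidth: 162MB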

Since the Scylla 2.2 nodes don't agree on their I/O properties, we manually changed them so that they all look the same. The configuration shown in Table 4 for Node2 was chosen as the one to use, being the one in the middle of the three.
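For completeness, a minimal sketch of that manual alignment on the Scylla 2.2 nodes, assuming the usual io.conf format where the flags are passed through the SEASTAR_IO variable:

# Apply the Node2 values from Table 4 on every Scylla 2.2 node, then restart
echo 'SEASTAR_IO="--num-io-queues=3 --max-io-requests=12"' | sudo tee /etc/scylla.d/io.conf
sudo systemctl restart scylla-server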

Conclusion

Scylla 2.3 ships with many feature improvements like large partition detection, enhancements to materialized views and secondary indexes, and more. But it also brings performance improvements for common-case workloads, the result of our ongoing effort to deliver consistently low latencies for our users.

We demonstrated in this article that latencies are improved in Scylla 2.3 for disk-bound workloads thanks to the new I/O Scheduler, with up to 85% better latencies. Workloads that are not disk-bound, such as those running on fast NVMe storage like the AWS i3 instances, also show latency improvements, mostly visible in the higher percentiles, as a result of incremental improvements in the CPU scheduler and diligent work to identify and fix latency-inducing events.

Test commands used in the benchmarks:

For the CPU-intensive mixed workload

Population:
First we create the schema – four keyspaces to be targeted in parallel:

for i in 1 2 3 4; \
do cassandra-stress write \
    no-warmup \
    cl=QUORUM \
    n=1 \
    -schema "replication(factor=3) keyspace=keyspace$i" \
    -mode cql3 native \
    -errors ignore \
    -node $NODES; \
done

Population is done using 4 loaders in parallel, with each loader targeting its own $KEYSPACE_ID (keyspace1..4)

cassandra-stress write \
  no-warmup \
  cl=QUORUM \
  n=1000000000 \
  -schema "replication(factor=3) keyspace=$KEYSPACE_ID" \
  -mode cql3 native \
  -rate 'threads=200' \
  -pop 'seq=1..1000000000' \
  -node $NODES \
  -errors ignore

Reads and writes start once all automatic compactions finish.
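One simple way to wait for that is to poll nodetool on each node until no compactions are pending; a minimal sketch, assuming the usual "pending tasks: N" output of compactionstats:

# Poll until no compactions remain before starting the measured workload
while nodetool compactionstats | grep -q 'pending tasks: [^0]'; do
  sleep 60
done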

Read load command, per table:

cassandra-stress read \
  no-warmup \
  cl=QUORUM \
  duration=180m \
  -schema "keyspace=$KEYSPACE_ID replication(factor=3)" \
  -mode cql3 native \
  -rate 'threads=200 limit=20000/s' \
  -errors ignore \
  -pop 'dist=gauss(1..1000000000,500000000,5000000)' \
  -node $NODES

Write load command, per table:

cassandra-stress write \
  no-warmup \
  cl=QUORUM \
  duration=180m \
  -schema "keyspace=$KEYSPACE_ID replication(factor=3)" \
  -mode cql3 native \
  -rate 'threads=200 limit=20000/s' \
  -errors ignore \
  -pop 'dist=gauss(1..1000000000,500000000,5000000)' \
  -node $NODES

For the I/O-intensive workload:

Population
First we created the schema, then we used four loaders to write the population into the database, with each loader writing a sequential 55-million-key slice of the entire 220-million-key population:

Schema creation

cassandra-stress write no-warmup \
  n=1 \
  cl=QUORUM \
  -schema 'replication(factor=3)' \
  -mode cql3 native \
  -errors ignore \
  -node $NODES

Population commands

Loader 1

cassandra-stress write no-warmup \
  CL=ALL \
  n=55000000 \
  -pop 'seq=1..55000000' \
  -col 'size=FIXED(1024) n=FIXED(1)' \
  -schema 'replication(factor=3)' \
  -mode cql3 native \
  -errors ignore \
  -node $NODES \
  -rate 'threads=250'

Loader 2

cassandra-stress write no-warmup \
  CL=ALL \
  n=55000000 \
  -pop 'seq=55000001..110000000' \
  -col 'size=FIXED(1024) n=FIXED(1)' \
  -schema 'replication(factor=3)' \
  -mode cql3 native \
  -errors ignore \
  -node $NODES \
  -rate 'threads=250'

Loader 3

cassandra-stress write no-warmup \
  CL=ALL \
  n=55000000 \
  -pop 'seq=110000001..165000000' \
  -col 'size=FIXED(1024) n=FIXED(1)' \
  -schema 'replication(factor=3)' \
  -mode cql3 native \
  -errors ignore \
  -node $NODES \
  -rate 'threads=250'

Loader 4

cassandra-stress write no-warmup \
  CL=ALL \
  n=55000000 \
  -pop 'seq=165000001..220000000' \
  -col 'size=FIXED(1024) n=FIXED(1)' \
  -schema 'replication(factor=3)' \
  -mode cql3 native \
  -errors ignore \
  -node $NODES \
  -rate 'threads=400'

Reads start once all automatic compactions finish.

Major compactions are triggered with nodetool compact concurrently on all Scylla cluster nodes.
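A minimal sketch of how that can be done from a control machine, assuming SSH access to the nodes in the comma-separated $NODES list:

# Kick off a major compaction on all nodes at (roughly) the same time
for node in $(echo "$NODES" | tr ',' ' '); do
  ssh "$node" nodetool compact &
done
wait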

Read command:

cassandra-stress read no-warmup \
  CL=ONE \
  duration=1h \
  -pop 'dist=uniform(1..220000000)' \
  -col 'size=FIXED(1024) n=FIXED(1)' \
  -schema 'replication(factor=3)' \
  -mode cql3 native \
  -errors ignore \
  -node $NODES \
  -rate 'threads=500 limit=5000/s'

About Dan Yasny

About Glauber Costa

Glauber Costa is VP of Field Engineering at ScyllaDB. He shares his time between the engineering department working on upcoming Scylla features and helping customers succeed.

Before ScyllaDB, Glauber worked with Virtualization in the Linux Kernel for 10 years, with contributions ranging from the Xen Hypervisor to all sorts of guest functionality and containers.


Tags: 2.3, benchmark, Benchmarks, performance, scylla, ScyllaDB