fbpx

Join us at P99 CONF, the virtual event on all things performance. Oct 19 & 20. Registration is free

See all blog posts

Benchmarking ScyllaDB 5.0 on AWS i4i.4xlarge

Earlier this year, we compared ScyllaDB 4.4.3’s performance on early-access instances of AWS’s storage-optimized I4i instances for IO-heavy workloads. Running ScyllaDB 4.4.3 on both the i4i.16xlarge and i3.16xlarge, we observed up to 2.7x higher throughput per vCPU for reads. With an even mix of reads and writes, we observed 2.2x higher throughput per vCPU on the new i4i series, with a 40% reduction in average latency than I3 instances. [Read the results from this i4i.16xlarge benchmark].

Since then, the complete family of I4i instances went into general availability – opening up access to the i4i.4xlarge instances that are closely related to the i3.4xlarge instances used in our recent ScyllaDB vs Cassandra benchmarks. Also, a major new release of ScyllaDB went to general availability with performance optimizations such as a new IO scheduler that’s optimized for mixed read/write workloads.

Curious to quantify the combined performance impact of the new I4is plus ScyllaDB’s latest performance optimizations, we recently kicked off a new series of benchmarks. This blog shares the first results from that project: ScyllaDB Open Source 5.0 on i4i.4xlarge vs. ScyllaDB 4.4.3 on the i3.4xlarge we used in previous benchmarks. We are working on expanding these benchmarks – stay tuned to our blog for additional comparisons and findings.

What’s So Interesting About the AWS I4i Instances?

The new I4i instances are powered by the latest generation Intel Xeon Scalable (Ice Lake) Processors with an all-core turbo frequency of 3.5 GHz. Compare this to the I3’s Intel Xeon E5-2686 v4 (Broadwell) @ 2.3 GHz (2.7 GHz). Moreover, they use AWS’s directly-attached, NVMe-based Nitro SSD devices that were designed to minimize latency and maximize transactions per second. AWS found that they offer up to 30% better compute price performance, 60% lower storage I/O latency, and 75% lower storage I/O latency variability compared to I3 instances. Another difference: there’s also a new, larger instance size (i4i.32xlarge) that features 2x the RAM of the i3.metal as well as 2x the SSD — a whopping 128 vCPUs backed by a terabtye of RAM and 30 terabtyes of storage per node.

Here are the I4i technical specs:

Our Setup

The benchmarking was a series of simple invocations of Cassandra-stress with CL=QUORUM. The procedure was:

  1. Set up the cluster from intact AMIs and spin up powerful loaders.
  2. Load the cluster with 3TB of data.
  3. Warm up the cache by ~3 hours of reads.
  4. Apply the data distributions (described below) in gradual increments of 10k ops/s.
  5. Stop when p90 latency exceeds 1000 ms.

See the earlier ScyllaDB vs Cassandra benchmark for all the setup details

For this round of benchmarking, we ran a total of 10 scenarios spanning different data distributions and workloads. The data distributions used were:

  • Uniform distribution, with a close-to-zero cache, hit ratio
  • “Real-life” (Gaussian) distribution, with sensible cache-hit ratios of ~30%

The workloads were read only, write only, and mixed (50% read, 50% write). For each of these 3 variations, we ran disk-intensive workloads as well as memory-intensive workloads (e.g., the entire dataset could fit in RAM).

Benchmark Results: i4i.4xlarge vs i3i.4xlarge

Here’s a summary of the results (maximum throughput in k/s with single-digit ms P99 latencies).

Disk-intensive write-only workload

With a disk-intensive workload of 100% writes, ScyllaDB 5.0 on the i4i achieved 2X (100%) greater throughput with single-digit ms P99 latencies: 350k vs 170k.

Disk-intensive read-only workload

For 100% reads, ScyllaDB 5.0 on the i4i achieved 67% greater throughput with single-digit ms P99 latencies: 100k vs 60k.

Disk-intensive mixed workload – write latencies

With a 50% read / 50% write workload, ScyllaDB 5.0 on the i4i achieved 37% greater throughput with single-digit ms P99 latencies for writes: 110k vs 80k.

Disk-intensive mixed workload – read latencies

With a 50% read / 50% write workload, ScyllaDB 5.0 on the i4i achieved 62% greater throughput with single-digit ms P99 latencies for reads: 110k vs 68k.

Disk-intensive Gaussian mixed workload – write latencies

With the Gaussian distribution mixed workload, ScyllaDB 5.0 on the i4i achieved 12.5% greater throughput with single-digit ms P99 latencies for writes: 90k vs 80k.

Disk-intensive Gaussian mixed workload – read latencies

With the Gaussian distribution mixed workload, ScyllaDB 5.0 on the i4i achieved 28% greater throughput with single-digit ms P99 latencies for reads: 90k vs 70k.

Memory-intensive write-only workload

Now, over to the memory-intensive workloads, where the entire dataset could fit in RAM.

With a disk-intensive workload of 100% writes, ScyllaDB 5.0 on the i4i achieved 2X (100%) greater throughput with single-digit ms P99 latencies: 400k vs 200k. This 2X improvement in throughput matches the gain from the disk-intensive write-only workload.

Memory-intensive read-only workload

For 100% reads, ScyllaDB 5.0 on the i4i achieved 83% greater throughput with single-digit ms P99 latencies: 440k vs 240k.

Memory-intensive mixed workload – write latencies

With a 50% read / 50% write workload, ScyllaDB 5.0 on the i4i achieved 112% greater throughput with single-digit ms P99 latencies for writes: 340k vs 160k.

Memory-intensive mixed workload – read latencies

With a 50% read / 50% write workload, ScyllaDB 5.0 on the i4i achieved 112% greater throughput with single-digit ms P99 latencies for reads: 340k vs 160k.

Surprising Latencies for Administrative Operations

Finally, we measured the latencies during “Add Node” and “Major Compaction” operations. We used 3x i4i.4xlarge for “Add Node,” 1x i4i.4xlarge for “Major Compaction, ScyllaDB 5.0, ~50% ScyllaDB load of background workload, and mixed disk-intensive workload. We aimed at 30% CPU load but overshot.

Here are the results for adding a node:

Write latency (in ms) Read latency (in ms)
Mean 1.17 2.41
p90 2.28 4.41
p99 2.72 5.52

And here are the results for major compaction:

Write latency (in ms) Read latency (in ms)
Mean 1.11 3.79
p90 1.35 5.21
p99 1.52 6.06/span>

Yes – compaction unexpectedly improved the P99 write latencies ever so slightly. There are cases where it’s better for ScyllaDB to process data than to go back to the idle loop.

Final Thoughts

Of course, your mileage will vary. We strongly encourage you to perform your own benchmarks with your own workloads. In the meantime, we’ll be performing more benchmarking as well.

We want to delve into some of the more interesting results (for example the discrepancy across the read and write latencies on the Gaussian distribution and the dip in read latencies for the memory-intensive workload). Additionally, we want to compare ScyllaDB’s performance on I4is vs that of other databases. We are also considering more matchups like our “4 x 40” benchmark: ScyllaDB on 4 nodes vs. Cassandra on 40 nodes.

VIEW MORE SCYLLADB BENCHMARKS

Juliusz Stasiewicz

About Juliusz Stasiewicz

Juliusz Stasiewicz is a Software Engineer at ScyllaDB. He has also worked at a number of companies on C++ programming, including Samsung and DataStax.

Piotr Grabowski

About Piotr Grabowski

From a young age, Piotr participated in many competitive programming contests. Always striving to stay on top of the latest business and tech news, he never finishing a day without checking his RSS reader! Piotr holds a BSc in Computer Science from the University of Warsaw and is now pursuing an MSc.