Operating at Monstrous New Scales: Benchmarking Petabyte Workloads on ScyllaDB

30 minutes

In This NoSQL Presentation

ScyllaDB is a distributed database designed to scale horizontally and vertically — in theory. What about in practice? ScyllaDB’s Benny Halevy will take you through the process and results of benchmarking our NoSQL database at the petabyte level, showing how you can use advanced features like workload prioritization to control priorities of transactional (read-write) and analytic (read-only) queries on the same cluster with smooth and predictable performance.

Benny Halevy, Director, Software Engineering, ScyllaDB

Benny leads the storage software development team at ScyllaDB. Benny has been working on operating systems and distributed file systems for over 20 years. Most recently, Benny led software development for GSI Technology, providing a hardware/software solution for deep learning and similarity search using in-memory computing technology. Previously, Benny co-founded Tonian (later acquired by Primary Data) and led it as CTO, developing a distributed file server based on the pNFS protocol delivering highly scalable performance and dynamic, out-of-band data placement control. Before Tonian, Benny was the lead architect in Panasas of the pNFS protocol.

Video Transcript

Hello, guys. I’m Benny Halevy. I’m going to present to you a monstrous benchmark that we did here in ScyllaDB demonstrating how workloads accessing petabytes of information work in ScyllaDB. So let’s start with introducing myself. I’m the Director of Software Engineering here in ScyllaDB, and I lead specifically the Storage Software Development team here in Israel.

I’ve been working on operating systems, storage and file systems throughout my career. In particular, I worked on distributed file systems and NFS, if you’re interested. Let me give you some background about the motivation for doing this benchmark. In the last year, we’ve been seeing more and more applications moving to public and private clouds, and they’re accessing increasingly larger data sets that are collected and analyzed, so there is an increasing need to support these petabyte-scale applications. These applications serve billions of users, and when you multiply the total number of users by the size of the entities that represent them, you get petabytes of data that you need to store and analyze. Data collection itself is rapid: users generate hundreds of events per day that need to be collected quickly, and this data also needs to be accessed by multiple applications, often at the same time, combining both online transaction processing and analytics. For the benchmark, we wanted to model these applications, so we chose to run two concurrent workloads at the same time. The first workload is a large user data set that contains per-user data. The data set itself is read-mostly, and it’s updated regularly. Typically, this data set is used by analytics applications. At the same time, we wanted to also run a smaller but real-time-oriented application data set, representing, for example, online bidding data for advertisement placement that is accessed in an online transaction processing manner. This workload requires low latency to meet real-time deadlines and to maximize the algorithm’s efficiency. For example, in AdTech, the algorithms calculate heuristics in real time; they only have a few milliseconds to do so, so the responsiveness of the database is critical to the application’s efficiency. So let’s do a back-of-the-envelope sizing of the data set.
Say we have 1 billion users, and for each user we keep about 10,000 records. Each record could be a URL that represents a click by this user, and it takes about 100 bytes to store. This leads us to 1 petabyte of storage. And for the application, say we are accessing 10 million auctions at any given point in time. We keep 1,000 records per auction, and each record is about 1,000 bytes. We easily get to several terabytes of storage that we need to access. So let’s drill in and look at the petabyte-scale benchmark that we ran here in ScyllaDB. First, let’s reiterate the goals we had in mind when running the benchmark. We first wanted to construct a petabyte-scale ScyllaDB cluster. It turns out that it’s fairly easy, but not trivial. We wanted to load the database with data and, while doing that, measure the loading throughput and the latency we get. The data consists of on the order of 1 petabyte of user data and on the order of 1 terabyte of application data. Then, after loading the database, we wanted to run concurrent workloads over the user and application data sets and measure their throughput and latency. Roughly speaking, the user workload was around 5 million transactions per second, and we measured two variants of it. One is read-only, and the other is 80 percent reads and 20 percent writes, and this represented a high-throughput workload simulating online analytics. At the same time, we ran a smaller, 200,000-transactions-per-second application workload that was 50 percent reads and 50 percent writes, and we cared about low latency for this workload, which represented online transaction processing. On top of that, we wanted to demonstrate the use of workload prioritization to balance these two workloads. All right? So let’s get down to it. What was the bill of materials? How did we build this cluster? We provisioned 20 virtual machines on AWS. The instance type was i3en.metal.
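The back-of-the-envelope sizing above can be checked with a few lines of Python. The figures are the ones quoted in the talk, and decimal units (10^12 and 10^15 bytes) are assumed:

```python
# Back-of-the-envelope sizing for the two data sets described above.
# All figures come from the talk; "TB"/"PB" here are decimal units.

def data_set_size(entities, records_per_entity, bytes_per_record):
    """Total uncompressed size in bytes."""
    return entities * records_per_entity * bytes_per_record

# User data set: 1 billion users x 10,000 records x ~100 bytes each.
user_bytes = data_set_size(1_000_000_000, 10_000, 100)
print(user_bytes / 10**15)  # -> 1.0 petabyte

# Application data set: 10 million auctions x 1,000 records x ~1,000 bytes.
app_bytes = data_set_size(10_000_000, 1_000, 1_000)
print(app_bytes / 10**12)  # -> 10.0 terabytes ("several terabytes")
```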
Each instance is pretty big: it has 96 virtual CPUs, 768 gigabytes of memory, 60 terabytes of fast NVMe storage and a 100-gigabits-per-second network interface. To be able to drive such a large cluster, we needed 50 load generators, each of type c5n.9xlarge, having 36 vCPUs, 96 gigabytes of memory and a 50-gigabit network interface. The software we used to operate the cluster was, first, ScyllaDB Enterprise version 2021.1.6 running on the cluster nodes. To generate the load, we used the industry-standard cassandra-stress tool over the ScyllaDB shard-aware Java driver, which is essential to get the best performance from a ScyllaDB cluster. We also used the ScyllaDB monitoring stack on the side for gathering metrics from the ScyllaDB cluster and presenting them; the ScyllaDB monitoring stack uses Prometheus and Grafana. Let’s describe the workloads that we generated. First, the user keyspace was constructed as a key/value data set with 500 billion keys and a mean value size of 600 bytes. This represents 1 petabyte of uncompressed text data with a 3 1/3 compression ratio, meaning we compressed about 2,000 bytes per record down to 600 bytes. We used the LZ4 compression algorithm, and we replicated the data twice in the cluster: each key was replicated to two nodes. For accessing the data from the user workload, we used a consistency level of ONE. The keys themselves were randomly selected in a uniform distribution. For the 80/20 read/write query workload, we also used cassandra-stress, where each of the 50 load generators used a normal distribution to draw random keys out of its assigned one-fiftieth of the key range, and each load generator used 100,000 threads at a fixed transaction rate of 100,000 transactions per second. This totals 4 million read transactions per second plus 1 million write transactions per second. The workload itself ran for 3 hours, and we set it up with a 5-minute warm-up time.
To generate the application workload, we used cassandra-stress as well. Each of the 50 load generators used a normal distribution to draw its random keys out of its assigned one-fiftieth of the whole keyspace. We used 1,000 threads per load generator and a fixed transaction rate of 4,000 transactions per second. This totals 100,000 read transactions per second and the same amount of writes per second. This workload, too, ran for 3 hours with a 5-minute warm-up time. Okay, so are we ready to show the results? First, let’s talk about data ingestion. We demonstrated that we can insert the data at a rate of 7 1/2 million inserts per second using those 50 concurrent load generators, and with this high throughput, we saw a 4-millisecond 99th percentile write latency. That meant that we could load the 1-petabyte cluster in roughly 20 hours. I find it quite impressive. What was the CPU load during ingestion? We measured around 90 percent CPU utilization on average. It was interesting to see, if you look here at the htop output, that at the top there are four cores that are loaded at 90 percent, but they all do kernel work. These cores were serving interrupts from the network. In these large clusters, we’ve assigned these four cores statically to serve interrupts while the other cores run the ScyllaDB software. What were the storage demands during data ingestion? It turns out that today’s disks are able to handle multi-gigabyte-per-second workloads, and if we focus here on the I/O queue throughput for commitlog versus compaction, we see that the instances each used about 900 megabytes per second for commitlog writes. It’s interesting to see that if you sum it up, at about 3,000 bytes per record with a replication factor of two, it lines up with the 7 1/2 million inserts per second that we set up in cassandra-stress.
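For reference, the per-generator rates above multiply out as follows. This is a quick sketch using only the figures quoted in the talk:

```python
# Sanity-checking the aggregate rates quoted above (figures from the talk).

GENERATORS = 50

# User workload: each generator runs at a fixed 100,000 transactions/s.
user_tps = GENERATORS * 100_000        # 5,000,000 total

# Application workload: each generator runs at a fixed 4,000 transactions/s,
# split evenly between reads and writes.
app_tps = GENERATORS * 4_000           # 200,000 total
app_reads = app_writes = app_tps // 2  # 100,000 each

# Ingestion: 500 billion records loaded at 7.5 million inserts/s.
load_hours = 500_000_000_000 / 7_500_000 / 3600
print(f"{load_hours:.1f} hours")  # about 18.5, i.e. "roughly 20 hours"
```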
Now, this roughly 1 gigabyte per second of commitlog traffic generated about 6 gigabytes per second per instance of compaction I/O, shown at the bottom. Overall, 20 cluster nodes times 6 gigabytes per second gets us a total throughput of 120 gigabytes per second. It was also interesting to see the behavior of compaction when loading the disks. A few years ago, ScyllaDB introduced a compaction algorithm called incremental compaction strategy, or ICS for short, which creates and deletes equal-sized sstables, in contrast to the legacy size-tiered compaction strategy that creates increasingly larger sstables. When working with these equal-sized sstables, we can dramatically reduce the requirement for temporary space during compaction, and if you look at the graph, you can see the disk usage being reduced in small increments whenever compaction is able to delete temporary sstables. That’s in complete contrast to the size-tiered compaction strategy, with which you see huge fluctuations in disk usage. Okay, so let’s get to the meat of the benchmark. Let’s talk about throughput. First, we wanted to see how much we could load the ScyllaDB petabyte cluster and still provide single-digit-millisecond latency at the 99th percentile. In the graph, the bottom two lines, in blue and red, show the p50 read and write latency respectively, which is well under 1 millisecond, while we loaded the cluster with 4 million transactions per second on the left side, up to 7 million transactions per second on the right side. And if we look at the p99 latencies, again, the read p99 latency is in blue, and the write p99 latency is in red. We can see that as we increase the throughput inflicted on the cluster, the p99 latency climbs, and if we continued to increase the throughput, it would go sky-high and the cluster would become unusable.
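Summed across the 20 cluster nodes, the per-instance storage figures above work out as follows (a quick check, using only the numbers quoted in the talk):

```python
# Aggregate storage bandwidth during ingestion,
# from the per-instance figures quoted in the talk.

NODES = 20

commitlog_mb_per_s = 900     # commitlog write bandwidth per instance
compaction_gb_per_s = 6      # compaction I/O per instance

total_commitlog_gb_s = NODES * commitlog_mb_per_s / 1000  # cluster-wide
total_compaction_gb_s = NODES * compaction_gb_per_s       # cluster-wide

print(total_commitlog_gb_s)   # -> 18.0 GB/s of commitlog traffic
print(total_compaction_gb_s)  # -> 120 GB/s of compaction I/O
```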
But still, we demonstrated that at 7 million transactions per second we can get pretty decent mean and p99 latencies out of the cluster. The numbers themselves are presented here, where we summed the application workload, presenting about 300,000 read-write operations on the cluster, while the user workload was 7 million read-only operations per second. And just to brag about it, the p99 write latency for the application was a little over 2 milliseconds, while the read latency was a bit higher, 6.8 milliseconds, and the user workload had a quite similar p99 latency of 6.4 milliseconds. For the rest of the benchmark, we used a slightly lower throughput, and we’ll get to it later. So let’s drill in a little bit on the system internals and look at the cache efficiency. On the top graph, we can see partition hits, around 6,000 to 8,000 reads per second, versus the partition misses on the bottom that are close to 400,000 reads per second, meaning that the hit rate was only a little over 1 percent. This happened due to the randomness of the key/value reads, which hardly utilized the cache at all. So this shows us there is potential for using the ScyllaDB extension to the CQL query language called BYPASS CACHE, which asks the server to disregard the cache and execute the query directly from disk, and by doing so we save the overhead of updating the cache. We kept that as a future work item, and we will get back to it later on. Just a note that previous tests we ran show that BYPASS CACHE can improve performance by up to 70 percent. And on the other side of the spectrum, if the whole data set fits in cache, we can see an improvement of up to a factor of four. So this is also an interesting case to be tested. All right.
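As a sketch, a BYPASS CACHE read looks like this in CQL. The keyspace, table, and column names here are invented for illustration; only the BYPASS CACHE clause itself is the ScyllaDB extension being described:

```cql
-- Hypothetical key/value table standing in for the user data set.
-- BYPASS CACHE tells ScyllaDB not to consult or populate the row cache,
-- which helps for uniformly random reads over a data set far larger than RAM.
SELECT value
FROM user_keyspace.user_data
WHERE key = 'some-key'
BYPASS CACHE;
```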
So getting back to the workload, these are the results for the 5 million read-only user transactions per second and 200,000 application transactions per second that were distributed evenly between 50 percent reads and 50 percent writes. In this case, we can see that the application latency was between about 1.4 and 2.3 milliseconds for write and read respectively. Next, we wanted to compare this read-only workload with a maybe more realistic 80/20 workload for the user data set. In this case, the 5 million transactions per second were divided into 4 million transactions per second for reads and 1 million transactions per second for writes. The interesting thing is that we saw a significant increase in p99 latency, both for the application writes and for the application reads, which might breach the contract for an application that requires latencies of, say, up to 2 or 3 milliseconds. So in order to balance that, we deployed a mechanism we call workload prioritization, and with it we kept the number of shares granted to the application at 1,000 while we reduced the number of shares granted to the user workload from 1,000 to 500. And this table really nicely shows that we were able to reduce the application p99 latencies significantly: the write latency went down from around 2 1/2 milliseconds to 1.2 milliseconds, and the read latency went down from 4 1/2 milliseconds to 3.7 milliseconds, at the expense, of course, of the user workload, which showed increased latencies. The bottom line here is that using workload prioritization, we can dynamically balance the system resources and divide them between different workloads that share the same hardware infrastructure. I’d also like to show some graphs that demonstrate how these service levels look in action. It’s important to know that each service level has its own queues per shard for consuming CPU and I/O.
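The share assignments described above are expressed through ScyllaDB Enterprise service levels. A rough sketch in CQL, with made-up service level and role names; the exact syntax should be checked against the ScyllaDB workload prioritization documentation:

```cql
-- Illustrative service levels; names are invented for this example.
CREATE SERVICE LEVEL application WITH shares = 1000;
CREATE SERVICE LEVEL user_workload WITH shares = 1000;
ATTACH SERVICE LEVEL application TO app_role;
ATTACH SERVICE LEVEL user_workload TO analytics_role;

-- The rebalancing described above: drop the user workload to 500 shares
-- while the application keeps its 1,000.
ALTER SERVICE LEVEL user_workload WITH shares = 500;
```

Because shares can be altered at runtime, the balance between the two workloads can be shifted without restarting nodes or clients.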
So the top chart shows the application workload’s I/O queue and the distribution of bandwidth per shard for this type of queue, and at the bottom we see 20 times higher bandwidth consumption for the user-workload queues. It’s really important to show the separation of resources, because these two workloads are significantly different, and managing each of them using different queues allows us to balance the priorities and serve each workload according to the number of shares granted to it. As I mentioned before, running this benchmark wasn’t a walk in the park. As expected, setting up and tuning a petabyte-scale database wasn’t trivial, but that said, it didn’t take any unreasonable effort. Personally, I was assigned to run this task just 2 or 3 weeks ago, and I had to get over a few hurdles to get it done. The first hurdle was merely provisioning: it took some time to find an availability zone in AWS that had enough of the instance types that we needed for the benchmark. So take note of that, and if you plan to deploy such a large cluster, make sure to provision your resources well ahead. The second hurdle was tuning the hardware for such high workloads. If you remember the chart showing the CPU utilization and the four I/O-queue-handling CPUs at the top, it turned out that our default assignment of cores to I/O queue handling wasn’t optimal, and we overwhelmed the default two cores with interrupts and couldn’t get to the level of throughput that we wanted to demonstrate. To get it, I had to manually assign four CPUs to I/O queue handling. We took note of that, and a fix for this problem will be merged into the out-of-the-box machine images in the near future. Another nitpick was setting the CPU power governor on each node to performance in order to maximize the performance of the system.
As for the benchmarking framework itself, it turned out that cassandra-stress wasn’t really built and designed to work at this scale. For example, just anecdotally, to create 500 billion keys we had to use a nondefault setting, since the default distribution for keys draws keys only from a space of 100 billion keys. When we realized that we just couldn’t create 1 petabyte of data, it turned out the problem was the default key distribution in cassandra-stress, but this was easily overcome by changing this setting. We also ran into some issues with the data-collection library that needed to be fixed to support such a large number of parallel load generators. As for the way we configured the ScyllaDB nodes, for the record, we used the following nondefault configuration. At the node level, as I mentioned, we used four I/O-queue-serving CPUs rather than the two that are set up by default. We also used a mount option that’s nondefault in the version of ScyllaDB that we tested, the discard option, and if you’re interested, you can read more about it. It’s about letting the NVMe drives know when blocks on the disk are discarded and no longer used. Setting this option allows the file system to continually update the block layer about discarded blocks, and this allowed much more stable compaction performance. It’s worth mentioning that this is now the default in our open-source releases. As for the ScyllaDB configuration, we used a compaction_static_shares setting of 100 shares. We also configured compaction to enforce the minimum threshold. The schema itself used the incremental compaction strategy with a 10-gigabyte sstable size and a space-amplification goal of 1.25. As I mentioned before, we used the LZ4 compressor for compressing the data. To summarize, I would like to mention some future work we have in front of us. First is a white paper that’s coming up, based on this benchmark and expanding on it.
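Put together, the table-level settings described above might look roughly like this in CQL. The keyspace and table names are invented, and the ICS option names are my reading of the ScyllaDB Enterprise documentation, so treat this as a sketch rather than the exact schema used in the benchmark:

```cql
-- Illustrative schema; names are made up, settings are the ones quoted above.
ALTER TABLE user_keyspace.user_data
WITH compaction = {
    'class': 'IncrementalCompactionStrategy',
    'sstable_size_in_mb': '10240',         -- 10-gigabyte sstables
    'space_amplification_goal': '1.25'
}
AND compression = {'sstable_compression': 'LZ4Compressor'};
```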
In addition, we plan to test two more workloads. The first is a random-read workload using the BYPASS CACHE option that was previously mentioned, in order to optimize cache utilization when doing random reads. In addition to that, we are also going to test a data set that fits entirely in the nodes’ cache and demonstrate the maximum performance you can get out of the cluster using smaller data sets. Just bear in mind that each instance in this example, the i3en.metal, has 768 gigabytes of memory, so these smallish data sets are not so small. Thank you very much. Stay in touch. You can contact me at the e-mail address below.
