Hello, my name is Pavel. Today, I’m going to talk about solid state drives, the way they seem to handle IO workload and the approach we use in ScyllaDB to get good IO latencies from those drives and hopefully not only from those. So first things first, when talking about managing server workload, we always mean schedulers. If there are different components that want to get access to some limited resource like CPU time, memory or, as in this talk, disk IOPS or disk bandwidth, then we need a scheduler to manage this access.
For example, in ScyllaDB, there is SQL Server that might want to read a row from disk. There is a memtable that might want to write itself back to disk, and there is compaction that is when both reading from and writing to this disk all at the same time and at huge portions. So deciding which components should be given which portion of the resource is heavily tied with another important duty of the scheduler. That is preventing the resource from being overloaded. Let’s see what it means for disks. If you would try to find out how fast a disk can serve one to four and more requests running simultaneously, you will most likely measure something like this. When the request concurrency is low, and the concurrency here is how many requests are submitted at a time, then disk would complete them at more or less the same time of few microseconds. This behavior is due to internal parallelism all the modern disks have. As the concurrency grows, the disk or maybe the kernel driver will start queueing those requests, thus increasing their latencies. At this part of the measurement, you’ll face this so-called Little’s law, and this part of the plot is stepped on when trying to measure disks’ peak throughput. Apparently, the IO scheduler should be aware of this behavior. If there is a long queue of requests to be served, it’s safe to send several of them into the disk, of course. On the other hand, sending too many requests will hurt fairness because the disk doesn’t know about request priorities, and it will serve them in order, so more urgent requests will have to wait. Since it’s often hard to pin the parallelism value, the good strategy of the scheduler is to specify the desirable IO latency goal, that is a dashed orange line over there on the plot, and limit the concurrency based on that number. Having said that, the approach to IO scheduling can be as simple as, configuring the maximum concurrency value, and don’t put more requests into the disk than that. Oh, and don’t forget about priorities, of course. Since its early days, ScyllaDB had been using this approach, elaborating the idea of how to get this maximum concurrency value. First, it was literally the number of requests. Later, it was changed to take into account maximum IOPS and maximum bandwidths of the disk. But since all disks have different IOPS and bandwidths for reads versus writes, this actually was four numbers, and we have a nice blog post about this, by the way. The sheer mathing nature of this ScyllaDB core also required this limit to be evaluated in advance with the help of a tool called iotune. However, this approach didn’t show results we wanted, and we wanted to have reads latency to be smaller than 1 millisecond and to have write latencies, no matter how large really, for writes, we are are more interested in the throughput, and larger throughput typically means larger latencies. The biggest mistake in treating the disk like I mentioned is the idea that if the disk has to handle both reads and writes at the same time, we call it mixed workload. Then it won’t make any difference between the IO direction, and the scheduler can assume that our available IOPS and bandwidths for reads and writes can be exchanged to one another. Here is the disproving graph. The rightmost bar, called peak, is the maximum IOPS for reads and writes when they run on their own. The next, the pure bar, is the same. Disk is doing just reads or just writes, but the workload has the concurrency of one, that is doing one request at a time. The measured pure IOPS is lower than the peak one because disk doesn’t exploit its internal parallelism, but it’s still expected and stays within the initial model, but next comes the continuous bar. This is the IOPS of mixed workload that consists of both reads and writes with concurrency of one. You see? Disk clearly prefers serving writes even if it means getting the resource from reads. The read IOPS drop more than two times, but writes just few percent. It’s quite opposite to what we needed. Next four bars show the same mixed IOPS measurements with the right flow rate limited below its peak values. Read flow remains the same, one by one. Rate limiting writes does make the disk somewhat more fair towards reads but also somewhat slower, and this observation led us to the idea of what we later called rate-limited scheduling. To get better understanding of how disks handle mixed IO, we tried to generate a complete rate profile of a disk, that is, measure what latencies would we observe if loading the disk with mixed workload with different reads and writes intensities. There is a whole lot of measurements, indeed, let alone the fact that there are four independent parameters to change, two IOPS values and two request sizes, so the results had to be also somehow represented. All those tasks were solved with the help of a tool called Diskplorer, and here is how that profile looks for one of the popular cloud instances. We didn’t actually try to draw four-dimensional space in Google Presentations. Sorry about that. Here is the two-dimensional slice of it. X axis is the right bandwidth. Y axis is the read bandwidth. Well, it’s read IOPS actually, but since requests are of a fixed size in this area, it can be called bandwidth. Colored squares represent the observed latencies of reads, and the two roles of scale to the right was exact value. What’s important here is the more bluish the color is, the smaller the latency is. Here, the scheduler should stay if it wants to provide good latency. This picture shows that the safety area is tricky, but since we’re not parsing the minimal latency but rather want to have some fixed one, it’s enough to stay below the diagonal line in this read/write area. Oh, and why we are here: That’s not how all disks look like apparently. Different disk models on different cloud instances, types, behave differently, but pretty much all the disks we’ve checked show the profile that can be described by the mentioned diagonal line, so let’s move on, and we’ll close that line. On the language of operations, we can get back to the four-dimensional space and describe the safety area like this. The scheduler should take the real bandwidths and real IOPS it sends to the disk, normalize them by their maximum values, sum up the resulting components and make sure it doesn’t exceed some constant, likely 1.0 but not necessarily. The idea actually remains the same: Scheduler should limit itself in the amount of requests it puts into the disk, but the cutoff decision becomes a little bit more complicated. This math looks pretty simple, but note that both bandwidth and IOPS are something-per-second values that are hard to measure instantly. Requests arrive into the queue at irregular intervals, so to get the idea of their rate, only some middle or long-term estimation is possible. And, of course, this estimation would reflect some history of the measurements, and this history might easily get out of that safety area for some time. It takes more efforts to put this equation into code. Fortunately, to somehow control the units-per-second growth, there’s a nice algorithm called token bucket. Briefly, the algorithm can turn a chaotic input flow of requests into some output flow with its rates staying below the configured threshold. The illustration, and actually the implementation too of this algo, is usually a bucket that’s filled with tokens at the desired fixed rate, and when a request wants to be served, it should carry away some tokens from that bucket. If no token is available, the request should wait or should be dropped, and also the bucket is limited with the maximum amount of tokens it can hold. From the mathematician point of view, what this algorithm does, it accumulates the total number of requests and makes sure the speed of the growth of this accumulator doesn’t exceed the configured limit, and that’s almost exactly what we need. With some effort, it’s possible to demonstrate that original scheduling configuration can be converted into something that A, matches the token bucket equation and B, contains some easy-to-calculate value, just the total number of request costs where the cost of a request is one divided by some number plus the request length divided by some other number. These costs an be calculated for every request. The accumulated cost, which is the sum of costs of all requests insofar, can also be calculated instantly without the need to apply sliding averaging, approximation or any other long-term estimation math. At the end of the day, the latency-friendly scheduling strategy looks like this. Get a token bucket. Fill it with tokens at 1 hertz rate. Measure each incoming request cost to be one divided by maximum IOPS value plus its size divided by the maximum bandwidth value. Token self-loading points number here, but that’s not a big problem. Voila, here is what we call rate-limited scheduler. Some funny parts about this approach is that the dimension of request cost here is neither bytes nor item nor bytes per second nor anything like this but just duration. That is, seconds or milliseconds or even microseconds, and it’s a bit counterintuitive, actually, to read logs or analyze metrics by requests measured in seconds, but still. So boring theory is over. How about practice? As for today, the described algorithm is implemented in the operating system called Seastar on top of which ScyllaDB is built. ScyllaDB itself is going to have that Seastar version in the next release. In order to make it work, it’s only necessary to provide the io_properties.yaml file, even generated by the iotune tool from the older ScyllaDB version. However, since some disks’ profile isn’t linear, like I mentioned both, but for those disks, manual update of the io_properties file might be needed, but we are going to automate this as well. Other than the scheduler itself, we are also going to explore those weird accumulated costs in seconds via Prometheus metrics. The reason it’s not yet there is actually one of the hardest things programmers have had to deal with. We are selecting good names for these metrics. What else? One of the most notable longer-term plans is to make the scheduler tune itself on the fly by changing the constant I imagined in the math above not to be 1.0 but to dynamically change with the load. The testing period is still going, but we already have some early results. For example, here are the query-class latencies during the Cassandra stress testing. The loader’s profile was configured to generate both select queries and update queries, and the update flow was intensive enough to make memtable flushing and compaction up and running most of the time because we wanted to have mixed workload on the disk. Pure workload works like charm even without those changes. So the leftmost measurement is a ScyllaDB version with its older IO scheduler onboard. Next comes the measurement with the new scheduler. As you can see, it has notably lower latencies, and the last one is how the new scheduler behaves when being legacy configured with the io_properties.yaml file taken from the older iotune run. The results look promising, and we hope they will stay the same. At least we have some clear evidence that the new scheduler is smart enough not to lock up the queue completely. So this is it. Thank you for watching. Please stay in touch. If you are interested, you can try to reach me most preferably via the e-mail. I have a Twitter, but I am not actually using it, and I didn’t manage to remove the Twitter bird from the slide, so here it is. Thanks again, and have fun.