ScyllaDB strives to offer its users predictable low latencies. However, in real life, things do not always go according to plan, and sometimes predictable low latencies become unpredictable big latencies. When that happens, it’s time to go into detective mode and go figure out what’s going on.
Sources of latencies can be anywhere, and it pays to have deep knowledge of all the components that comprise the solution, including the Linux Kernel. Thankfully, with a big percentage of our team being former Linux Kernel hackers, we have that covered.
While big latencies are themselves undesirable, they make for good stories! In this particular example, we’ll show you how we have recently tracked and fixed a bug affecting the Linux CFQ I/O elevator–what methods and tools we used, and what were the consequences for Scylla users. This particular problem was fixed by us and merged to the upstream Linux Kernel. Starting on the 1.4 release, Scylla also starts defaulting to the noop I/O elevator, which bypasses this problem.
It all starts with write timeouts. With the default configuration of ScyllaDB, timeouts occur if requests are not serviced within two seconds, which means the requests were taking longer. So that’s slow. Worse, the workload is mostly fine with the usual low latencies according to the expectation except for those spiky requests, so that’s unpredictable.
So why? To acknowledge a request, all ScyllaDB needs to do is write it to a commitlog memory buffer, and to a memtable memory buffer. At some point we need to write those requests back to disk and fsync them, and if that starts to take a long time, then we slow down on serving requests.
Those requestswere fairly large – over 2MB in size. That is a challenging workload, but since my SSD can do around 500 MB/s, even if we were writing everything synchronously and the commitlog could only have 1% of the bandwidth, it should take only 200 milliseconds to write.
If writing is not the problem, and we need to write and flush, what’s our next candidate? You guessed correctly: flushes. Flushing something to disk is known to be a costly operation, but certainly not that costly. We inserted ad-hoc code into Scylla to take timestamps before and after a flush is requested. And there we had it: flushes were taking a lot of time. Depending on the run, anywhere from 3 seconds to 30 seconds. But always consistent and in perfect correlation with the timeouts.
Could the Linux Kernel be at fault?
Since flush is mostly kernel work, I wrote a simple systemtap script to measure how much time the fdatasync syscall was taking. It usually didn’t take much. However, since those syscalls are blocking, and ScyllaDB serializes blocking syscalls in its syscall thread, it could be that another syscall was taking a long time and was causing the flush issues.
To investigate this possibility further, I augmented that simple systemtap script inserting probes at points of interest in the entry and return points of some Linux VFS functions of interest, like open, rename, and others in addition to fdatasync; basically any blocking operation:
TIME COMM FUNC LAT(ms) 15:02:49 scylla do_sys_open 1989 15:02:49 scylla SyS_open 1989 15:02:49 scylla do_sys_open 1975 15:02:49 scylla SyS_open 1975 15:03:45 scylla xfs_rename 27849 15:03:45 scylla vfs_rename 27849 15:03:45 scylla SyS_rename 27849 15:03:45 scylla do_sys_open 27806 15:03:45 scylla SyS_open 27806 15:03:45 scylla do_sys_open 27689 15:03:45 scylla SyS_open 27689 15:03:45 scylla do_sys_open 27416 15:03:45 scylla SyS_open 27416
That’s bad. Opening a file takes 27 seconds. That’s almost enough time for Usain Bolt to fetch 3 gold medals. Flush itself is still not there, but because those operations may end up serializing each other in our syscall thread, a file open may very well be causing flush to wait. And indeed it was.
Now we need to figure out why. It’s obviously unlikely that this is sheer computation so we can safely bet that there are locks involved.
Those locks, however, could be anywhere. So it would be nice to know things like all processes that spend more than a certain amount of time waiting on a lock, and those respective stack traces so we get insight into what’s going on.
We can keep using systemtap, but since we’ll be collecting a lot of stack traces which requires significant overhead, this is a great time to try the newer, eBPF filters. The iovisor project maintains a very nice set of tools in their github page. One of them, is the offcputime script, that will collect stack traces for processes that stay out of the CPU. We have contributed back to bcc options to start collecting stack traces only if the wait time is larger than a certain user-defined amount. After those changes, finding the problematic issues becomes relatively easy.
So now we run the workload again, and fire offcputime collecting stack traces for everything that stays out of the processor for more than 10 seconds.
This comprehensive stack trace immediately pops out:
ffffffffb90c57b1 finish_task_switch ffffffffb97dffb5 schedule ffffffffb97e310c schedule_timeout ffffffffb97e1f12 __down ffffffffb90ea821 down ffffffffc046a9dc xfs_buf_lock ffffffffc046abfb _xfs_buf_find ffffffffc046ae4a xfs_buf_get_map ffffffffc046babd xfs_buf_read_map ffffffffc0499931 xfs_trans_read_buf_map ffffffffc044a561 xfs_da_read_buf ffffffffc0451390 xfs_dir3_leaf_read.constprop.16 ffffffffc0452b90 xfs_dir2_leaf_lookup_int ffffffffc0452e0f xfs_dir2_leaf_lookup ffffffffc044d9d3 xfs_dir_lookup ffffffffc047d1d9 xfs_lookup ffffffffc0479e53 xfs_vn_lookup ffffffffb925347a path_openat ffffffffb9254a71 do_filp_open ffffffffb9242a94 do_sys_open ffffffffb9242b9e sys_open ffffffffb97e42b2 entry_SYSCALL_64_fastpath 00007fb0698162ed [unknown] - scylla (10327)
What we see here is that we try acquiring xfs_buf_lock and go to sleep, waking up more than 10 seconds later. At this point, knowledge of how XFS handles its buffers really helps. XFS has a limited number of metadata buffers that can be in-flight for a given allocation group. If it runs out of such buffers, it needs to wait until some are made available. Since we’re blocking in the buffer lock, that’s very likely the case. The open syscall obviously need to handle metadata so it will block.
When buffers are dirty they need to be flushed back to be made available. It makes sense that XFS will wait for ScyllaDB requests to complete. But in a modern SSD, 10 seconds is enough for 5GB worth of requests to complete. If that amount is split into multiple requests, then XFS should be able to sneak their requests in between. If that’s a single request, then maybe not. But surely we don’t send 5GB requests at any time, so something smells bad here.
Tracing and understanding the Linux Block layer
Could it be that the block layer is playing tricks on us? One of the most important aspects of the block layer is its I/O Elevator. Originally, it was named “elevator” since the idea was that block requests will get picked up in the direction of the spindle, just like an elevator picks up its passengers. In modern terminology it is mostly called the I/O Scheduler. We are using the former terminology here to avoid confusion with Scylla’s Userspace Disk I/O Scheduler. Even without rotating media, the job left for such elevators is basically to modify the the pool of requests – by merging requests, reordering requests, deferring requests, or whatever it wants to do to achieve a specific goal.
There are three elevators available in Linux:
noop: which as the name implies, doesn’t really do anything,
deadline: which aims at guaranteeing a maximum service time for requests,
and CFQ, the Complete Fair Queue, which tries to service various process equally.
CentOS7, which we recommend as the preferred OS to run ScyllaDB, ships by default with the CFQ enabled, and up until our 1.4 release, we haven’t touched that configuration.
So what happens if we just change the elevator? Like magic, the bug disappears! We could have stopped here, but that just leaves a possibility that something else is at fault and we just don’t understand it. We can’t leave any stone unturned! The CFQ can and does reorder requests, but… seconds? Something is wrong here.
At this point, the right tool for the job™ is blktrace. It allow us to have very good visibility into what’s happening within the block layer. If we go back to CFQ and ask blktrace to generate a summary at the end, we can see that the xfsaild thread, that manages XFS’ Active Items List has “Dispatch Wait” times in the dozens of seconds.
xfsaild/sda1 (804) Reads Queued: 0, 0KiB Writes Queued: 133, 1,245KiB Read Dispatches: 3, 768KiB Write Dispatches: 9, 3,334KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 4, 512KiB Writes Completed: 15, 1,792KiB Read Merges: 0, 0KiB Write Merges: 67, 705KiB IO unplugs: 29 Timer unplugs: 0 Allocation wait: 0 Allocation wait: 50 Dispatch wait: 0 Dispatch wait: 15,626,265 Completion wait: 0 Completion wait: 999 xfsaild/sda2 (805) Reads Queued: 0, 0KiB Writes Queued: 1, 8KiB Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 3, 384KiB Writes Completed: 5, 1,024KiB Read Merges: 0, 0KiB Write Merges: 0, 0KiB IO unplugs: 1 Timer unplugs: 0 Allocation wait: 0 Allocation wait: 0 Dispatch wait: 0 Dispatch wait: 15,562,754 Completion wait: 0 Completion wait: 0
Things are consistent so far. We know that XFS is involved in the problem, and we know that the problem is that metadata buffers are being depleted. The xfsaild thread will hold at least some metadata buffers until it periodically flushes them, but it can take many seconds before the requests xfsaild sends to the disk can complete.
Dispatch Wait is the time a request waits after inserted into the queue until it is dispatched. So we need to find a such request that is actually affected by that.
The problem, of course, is that there are thousands of them. They all look more or less like this:
8,0 11 152 81.092472813 804 A WM 141698288 + 8 <- (8,1) 141696240
Most of those columns can be ignored. The interesting ones are the 4th, 6th and 8th, which tell us:
4th column: (81.092472813) the timestamp in seconds.
6th column: (A) the request status. A means accepted.
8th column: (141698288 + 8) the request address, in base + offset form.
Since blktrace generates a dump with millions of requests, in the order in which they happen, a little bit of script-fu comes in handy, We are trying to group requests that have the same address, and flag the ones that have a very large difference between being accepted and being dispatched, since we know from the summary that we had requests with very big Dispatch Wait times.
In that big haystack of requests, we can find this needle. We’ll remove the columns we are not interested in for simplicity:
Timestamp Status Request Info 81.092472813 A 141698288 + 8 <- (8,1) 141696240 81.092472889 Q 141698288 + 8 [xfsaild/sda1] 81.092496118 I 141698288 + 8 ( 22911) [xfsaild/sda1] 96.718761435 D 141698288 + 8 (15626265317) [swapper/0]
Look at the status: A means accepted, Q means queued, and I means inserted. In this context, when the requests is inserted, that means that the control is passed to the I/O Elevator. We can learn from this trace that all aspects of the block layer prior to the I/O Elevator seem to be behaving. They all happened within 2 milliseconds. But the I/O Elevator took 15 seconds to dispatch it.
So that’s conclusive:
- ScyllaDB is doing everything right but is being harmed by XFS taking many seconds to open a file
- XFS code code is doing everything right but it is being harmed by one of its own threads not being able to flush fast enough
- The XFS xfsaild kernel thread is doing everything right but it is being harmed by the I/O elevator always deferring its requests
Deferring, or more precisely, reordering requests is part of the job description of an I/O elevator. However, this is acceptable only when it happens for a few milliseconds. When deferring them for dozens of seconds, we have a classical starvation problem. My math may be indeed rusty, but last time I checked, 15 seconds is greater than hundreds of milliseconds. This is a bug in the Linux I/O elevator!
Now that we know what the problem is, the hard work is done. Patching the Linux Kernel is, believe it or not, the easiest part. We wrote a patch to fix this, and we can verify the results by using systemtap to collect a histogram on the xfs_buf_lock acquisition times. We now see most requests taking 2 microseconds to complete, with the odd one off sitting at around 131 milliseconds. The bug is gone.
Latency histogram of xfs_buf_lock acquisition (microseconds): value |-------------------------------------------------- count 0 | 11 1 |@@@@ 161 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966 4 |@ 54 8 | 36 16 | 7 32 | 0 64 | 0 ~ 1024 | 0 2048 | 0 4096 | 1 8192 | 1 16384 | 2 32768 | 0 65536 | 0 131072 | 1 262144 | 0 524288 | 0
This fix was included in upstream Linux in commit 3932a86b4b9d1. It will be backported to the stable kernel in versions 4.8.4, 4.7.10, 4.4.27.
However, there is no need to worry: while those versions are not out or if you don’t choose to upgrade, ScyllaDB users can avoid this problem by simply changing their I/O elevator to noop or deadline.