Qualifying Filesystems for Seastar and ScyllaDB

Welcome to the asynchronous world

In the networking world, asynchronous processing is so common that it is practically taken for granted. An application sends a message to a remote server, and goes off to do something else. When the response arrives, it is picked up and the application continues from where it left off. There are many application frameworks and even languages that make this easy.

In the storage world, the situation is different. Applications use synchronous I/O and many threads to allow concurrent operation, or else they use mmap and hope that they don’t need to block. But if your disk supports hundreds of thousands of I/O operations per second (IOPS), as many modern SSDs do, this approach is not viable.

A database like ScyllaDB, of course, needs to combine both networking and storage. It uses Seastar as an I/O framework, one of the few frameworks that allows both asynchronous networking and asynchronous storage and, moreover, provides a unified API for both.

The devil is in the details

Under the hood, Seastar uses libaio, which in turn uses the io_submit(2) and io_getevents(2) system calls to submit I/O commands to the kernel for asynchronous processing and to gather command completions, respectively.

Seems simple, right? But there’s a catch. Linux AIO support is fairly limited: you cannot use buffered I/O; instead you must open files with O_DIRECT so that the disk transfers data directly to and from your buffers (DMA). This happens to be just fine for ScyllaDB, which prefers to do its own caching anyway. In addition, you must use a filesystem that has good support for AIO. Today, and for the foreseeable future, this means XFS.
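
To make the mechanics concrete, here is a minimal sketch of a single asynchronous write with libaio. It is not Seastar or ScyllaDB code, just the bare system-call flow under the assumptions noted in the comments: open with O_DIRECT, submit with io_submit, reap the completion with io_getevents.

    // A minimal sketch (not Seastar or ScyllaDB code) of one asynchronous write
    // through Linux AIO with libaio. Buffered I/O is not supported, so the file
    // is opened with O_DIRECT; "testfile" and the 4096-byte size/alignment are
    // assumptions for illustration. Build with something like: g++ aio.cpp -laio
    #include <libaio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(128, &ctx) < 0) { perror("io_setup"); return 1; }

        void* buf = nullptr;
        posix_memalign(&buf, 4096, 4096);      // DMA requires an aligned buffer
        memset(buf, 'x', 4096);

        iocb cb;
        io_prep_pwrite(&cb, fd, buf, 4096, 0); // 4 KiB write at offset 0
        iocb* cbs[1] = { &cb };
        if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

        // ... the application is free to do other work here ...

        io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr); // reap the completion
        printf("write completed, res=%lld\n", (long long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
    }

The key property is that io_submit returns as soon as the request is queued, leaving the calling thread free to do other work.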

XFS and appending writes

XFS has had good support for asynchronous reads and writes for a long time. But one aspect was only fixed relatively recently: appending writes, which became supported in Linux 3.15 (other fixes were needed on top of that patch). ScyllaDB, which writes immutable SSTables, relies heavily on appends, so we were glad to see the patch and the follow-up fixes.

Because the fix is relatively recent, there are still many installations that do not support asynchronous appending writes. If an appending write on such a system resulted in an error, we could simply report it to the user. However, that is not what happens: Linux silently converts the asynchronous request into a synchronous one. While ScyllaDB thinks it handed off a write to the kernel to process in the background, in reality it is waiting for the write to hit the disk and complete. While this is happening, the core that submitted the request is idle (because of the thread-per-core design).

Qualifying a filesystem for ScyllaDB

As it is impossible to detect at runtime whether a filesystem is suitable for ScyllaDB, we take a different approach: qualification. To do this I wrote a tool, fsqual, that checks whether a filesystem supports asynchronous appending writes.

fsqual works by issuing a number of appending writes, in the same way that ScyllaDB would, while measuring the number of context switches that occur during io_submit (recall that before the XFS fix, appending writes still worked, but were processed synchronously, which means the kernel switches contexts while it waits for the I/O to complete). We measure this by calling the getrusage(2) system call and looking at the difference between the values of the ru_nvcsw (voluntary context switch) field before and after each io_submit call; a rough sketch of the measurement appears below.
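
The following is a simplified sketch of that measurement, not the actual fsqual source; the file name, write size, and iteration count are arbitrary. Each write starts at the current end of the file, so it is an appending write, and the voluntary context switches around each io_submit are counted.

    // Simplified sketch of fsqual's idea (not the actual tool): issue appending
    // writes via Linux AIO and count voluntary context switches per io_submit.
    // RUSAGE_THREAD is Linux-specific (g++ defines _GNU_SOURCE by default).
    #include <libaio.h>
    #include <sys/resource.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    static long voluntary_ctx_switches() {
        rusage usage;
        getrusage(RUSAGE_THREAD, &usage);
        return usage.ru_nvcsw;
    }

    int main() {
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        io_context_t ctx = 0;
        io_setup(128, &ctx);

        void* buf = nullptr;
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 'x', 4096);

        const int nr_writes = 1000;
        long ctx_switches = 0;
        for (int i = 0; i < nr_writes; ++i) {
            iocb cb;
            // Each write starts at the current end of file, i.e. it appends.
            io_prep_pwrite(&cb, fd, buf, 4096, (long long)i * 4096);
            iocb* cbs[1] = { &cb };

            long before = voluntary_ctx_switches();
            io_submit(ctx, 1, cbs);        // should not block on a good filesystem
            ctx_switches += voluntary_ctx_switches() - before;

            io_event ev;
            io_getevents(ctx, 1, 1, &ev, nullptr);
        }

        printf("context switch per appending io: %g\n",
               double(ctx_switches) / nr_writes);
        io_destroy(ctx);
        close(fd);
        free(buf);
    }

On a kernel and filesystem with genuine asynchronous append support, the printed figure should be close to zero; when the kernel falls back to synchronous processing, each io_submit blocks and the figure approaches one.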

Results

Let’s run fsqual on a few kernels and filesystems and observe the results:

  • ext4 on 4.3.3:
    # ./fsqual 
    context switch per appending io: 1 (BAD)
    

    Not unexpected; ext4 doesn’t have good AIO support.

  • xfs on Ubuntu’s 3.13.0-24-generic:
    # ./fsqual
    context switch per appending io: 1 (BAD)
    

    Again, expected because the bug was fixed in 3.15, and apparently not backported to this kernel.

  • xfs on Ubuntu’s 3.19.0-25-generic:
    # ./fsqual
    context switch per appending io: 0.001 (GOOD)
    

    Good! The bug was fixed in 3.15.

  • xfs on CentOS 7.1’s default kernel (3.10.0-229.el7):
    # ./fsqual
    context switch per appending io: 1 (BAD)
    

    Bad news: CentOS 7.1’s default kernel is not suitable for running ScyllaDB.

  • xfs on CentOS 7.2’s latest kernel (3.10.0-327.4.5.el7):
    # ./fsqual
    context switch per appending io: 0 (GOOD)
    

    Excellent! The Red Hat XFS team backported the fix, and CentOS duly mirrored it in their kernel.

Conclusions and future directions

Exploiting today’s and tomorrow’s SSDs is hard with traditional approaches, but the newer approaches have pitfalls of their own. fsqual allows a ScyllaDB administrator to be sure they are running on a qualified filesystem implementation.

The ScyllaDB team will work to integrate fsqual with the ScyllaDB distribution so that ScyllaDB will self-qualify the filesystem it is running on.

Thanks

Thanks to the XFS team for continuing to improve asynchronous I/O in Linux. Without XFS, reaching ScyllaDB’s performance targets would have been much, much harder.

About Avi Kivity

Avi Kivity, CTO of ScyllaDB, is known mostly for starting the Kernel-based Virtual Machine (KVM) project, the hypervisor underlying many production clouds. He worked at Qumranet and Red Hat as KVM maintainer until December 2012. Avi is now CTO of ScyllaDB, bringing high throughput to the NoSQL world.