This is the third part in our blog series on ScyllaDB testing. Part 1 covers Apache Cassandra compatibility testing, and Part 2 covers Jepsen tests.
Some of the hardest-to-test code in any database is the code that handles filesystem errors. Such errors are rare, but handling them is critical. Using a simulated filesystem to inject errors is a good way to make sure that the error-handling code is correct. A good example of an existing fault-injecting filesystem is PetardFS. PetardFS supports configuration using an XML file, but is not scriptable at runtime.
CharybdeFS is our new FUSE-based error-injecting pass-through filesystem. It lives between the program under test and a real filesystem like XFS, and alters the real filesystem’s behavior in order to test the program. It’s controlled by a Thrift RPC interface for easy scripting. Using it is straightforward: it’s simply a mount point that is proxied to a location in the real filesystem.
To script CharybdeFS, you can use a Python module that communicates with the filesystem over Thrift; CharybdeFS tests are written in Python. An example of a test that kills ScyllaDB with SIGKILL (kill -9) when it issues a sync or flush call, then checks that the database is intact on restart, would be:
```python
print "setting flush/sync to kill ScyllaDB"
client.set_fault(['flush', 'fsync', 'fsyncdir'], False, 0, 100000, "", True, 0, False)
print "Waiting for ScyllaDB to die"
while is_running("scylla"):
    time.sleep(1)
print "ScyllaDB died"
```
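The `is_running` helper above is not part of CharybdeFS. A minimal, Linux-only sketch that scans `/proc` for a matching process name might look like this:

```python
import os

def is_running(name):
    """Return True if a process whose name matches `name` is alive.

    Linux-only sketch: compares `name` against each /proc/<pid>/comm.
    """
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % pid) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            continue  # the process exited between listdir() and open()
    return False
```

A production test harness would more likely shell out to `pgrep` or use a library such as psutil; the sketch only shows the idea.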
The set_fault method instructs CharybdeFS to inject the given error under the given conditions.
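For instance, to fail every read() with EIO, a call would look like the following (a sketch: the stand-in class below only mimics the call shape of the Thrift-generated client, whose full set_fault parameter list — methods, random, err_no, probability, regexp, kill_caller, delay_us, auto_delay — appears in the prototype comment further down):

```python
import errno

class FakeClient(object):
    """Stand-in for the Thrift-generated CharybdeFS client.

    A real test would connect to the CharybdeFS Thrift server instead;
    this fake only records the faults it is asked to set.
    """
    def __init__(self):
        self.faults = []

    def set_fault(self, methods, random, err_no, probability,
                  regexp, kill_caller, delay_us, auto_delay):
        self.faults.append((methods, err_no, probability))

    def clear_all_fault(self):
        self.faults = []

client = FakeClient()
# Fail every read() with EIO; probability 100000 means 100%.
client.set_fault(['read'], False, errno.EIO, 100000, "", False, 0, False)
```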
Filesystem failures can also be matched to specific filenames or paths using regular expressions, and a failure can be triggered randomly with a configurable probability. Another feature of CharybdeFS is that it can kill -9 the caller process (ScyllaDB). The idea is to make CharybdeFS randomly kill the database on sync system calls, and then see if the data is still consistent at the next ScyllaDB startup.
Some examples of real-life filesystem errors that CharybdeFS can simulate are:
- disk IO error (EIO)
- driver out of memory error (ENOMEM)
- file already exists (EEXIST)
- disk quota exceeded (EDQUOT)
Basically, any error defined in <errno.h> can be returned by any of the filesystem syscalls that ScyllaDB uses. This includes operations done with the Linux kernel asynchronous I/O (AIO) interface, which ScyllaDB uses extensively.
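The errors listed above correspond directly to constants in Python's standard errno module, so a test script can refer to them by name when filling in set_fault's err_no argument:

```python
import errno

# The example errors above, as <errno.h> constants; any of these
# values can be passed as the err_no argument of set_fault.
injectable = [errno.EIO, errno.ENOMEM, errno.EEXIST, errno.EDQUOT]

# errno.errorcode maps each numeric value back to its symbolic name.
names = [errno.errorcode[e] for e in injectable]
# names == ['EIO', 'ENOMEM', 'EEXIST', 'EDQUOT']
```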
Another use case for CharybdeFS is delaying some filesystem operations. Stalling reads and writes is useful for simulating storage that shows latency spikes, so we can be sure that the database keeps working under these bad conditions. SSD RAID arrays exhibit such latency issues because, in some array configurations, the latency of a request is that of the slowest SSD.
Adding a 50 ms delay to read and write operations on a subset of ScyllaDB files, with a 10% probability, during a 100-second run would look like the following:
```python
# set_fault prototype:
# void set_fault(list<string> methods, bool random, i32 err_no,
#                i32 probability, string regexp, bool kill_caller,
#                i32 delay_us, bool auto_delay)
client.set_fault(["read", "write"], False, 0, 10000, ".*data/.*", False, 50000, False)
time.sleep(100)
client.clear_all_fault()
```
Note that 100% probability is expressed as 100000, since we may want to simulate very rare events.
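Since probabilities are expressed on a 0–100000 scale, a tiny conversion helper (hypothetical, not part of the CharybdeFS API) avoids off-by-a-factor mistakes in test scripts:

```python
def probability(percent):
    """Convert a percentage to CharybdeFS's 0-100000 probability scale."""
    scaled = int(round(percent * 1000))
    if not 0 <= scaled <= 100000:
        raise ValueError("percentage out of range: %r" % percent)
    return scaled

# probability(100)   -> 100000  (always inject)
# probability(10)    -> 10000   (the 10% used in the delay example above)
# probability(0.001) -> 1       (one request in 100000)
```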
ScyllaDB bugs found by CharybdeFS
ScyllaDB was missing a consistent strategy for handling disk errors; thanks to CharybdeFS, we are now adding one. Early development releases of ScyllaDB only printed debug messages when a filesystem error happened. Since ScyllaDB cannot repair the disk for the user, the database must be shut down properly on I/O errors to preserve data. Filesystem testing found many examples of this case, and we are now deploying the shutdown-on-error feature throughout the code. Raphael, another ScyllaDB developer, is using the same technique to check how ScyllaDB reacts when the disk is full.
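The shutdown-on-error policy can be sketched as follows (ScyllaDB itself is written in C++; this Python sketch only illustrates the policy, and `write`, `sync`, and `shutdown` are hypothetical injected callbacks, not ScyllaDB APIs):

```python
import errno
import os

# Errors the database cannot recover from on its own: the only safe
# reaction is a clean shutdown that preserves on-disk data.
FATAL_ERRNOS = (errno.EIO, errno.ENOSPC, errno.EDQUOT)

def append_or_shutdown(write, sync, data, shutdown):
    """Append `data` to the log via write()/sync(); on a fatal I/O
    error, request a clean shutdown instead of limping along."""
    try:
        write(data)
        sync()
    except OSError as e:
        if e.errno in FATAL_ERRNOS:
            shutdown("I/O error on write: %s" % os.strerror(e.errno))
        raise
```

Injecting the write/sync/shutdown callables is what makes this policy easy to exercise with a fault injector: a test simply supplies a write that raises EIO and checks that shutdown was requested.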
The log structured design of ScyllaDB makes it resistant to filesystem problems
The LSM-tree design that ScyllaDB adopted from Cassandra never performs random writes, only appends. This makes it easier to keep data consistent than having to order and synchronize multiple random writes all over the place. However, CharybdeFS is not limited to testing Cassandra-like disk access; it’s generally useful for all kinds of server software. Other use cases could be testing other databases and storage servers, such as Ceph and Gluster, or even a mail server.
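Why append-only writes are easier to keep consistent can be shown with a toy log (a sketch, not ScyllaDB's actual commitlog format): records are length-prefixed and only ever appended, so after a crash, recovery reads until the first truncated record and discards the torn tail.

```python
import struct

def append_record(buf, payload):
    """Append a length-prefixed record; the log only ever grows."""
    return buf + struct.pack("<I", len(payload)) + payload

def recover(buf):
    """Replay records until the log ends or a record is truncated."""
    records, off = [], 0
    while off + 4 <= len(buf):
        (n,) = struct.unpack_from("<I", buf, off)
        if off + 4 + n > len(buf):
            break  # torn write at the tail: drop it
        records.append(buf[off + 4:off + 4 + n])
        off += 4 + n
    return records

log = b""
for p in (b"alpha", b"beta"):
    log = append_record(log, p)
# Simulate a crash mid-append: the last record is only half-written.
torn = log + struct.pack("<I", 5) + b"ga"
assert recover(torn) == [b"alpha", b"beta"]
```

With random in-place writes, by contrast, a crash can leave older data half-overwritten, so there is no such simple "valid prefix" to fall back to.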
Learn more about how to use CharybdeFS in your own projects: Fault-injecting filesystem cookbook.
CharybdeFS is open source and available on GitHub.
Coming soon: Distributed tests and longevity tests for ScyllaDB.