Project Circe is ScyllaDB’s year-long initiative to make ScyllaDB, already the best NoSQL database, even better. For the month of April we are going to take a look inside the organization and code base to see what it takes to bring major new features into a project as dynamic as ScyllaDB. Currently there are nearly a half-million lines of code in the scylladb/scylla repository on Github (482.7k as of this writing). Of those thousands of source lines of code so far are dedicated to the library implementing the new Raft consensus protocol.
Raft and the Logical Clock
We’ve already covered what Raft is and how we plan to use it in ScyllaDB. The hardest part, of course, is actually getting it to work. So I had a recent chat with Kostja Osipov, who leads development on this key infrastructure. He gave me an overview of some of the supporting efforts going into making Raft ready for release.
While Raft is currently being wired up to work with RPC to permit topology changes (add/remove nodes) the breadth of testing in order to make Raft truly resilient goes far beyond the source lines of code of the database commits themselves. “We added over a hundred test cases,” Kostja noted, “unit tests, functional tests using the concept of nemesis, or failure injections, plus a randomized test inspired by the Jepsen approach to testing.”
If you have a keen eye for scouring Github, you may already have come across the scylla/test/raft subdirectory, which includes another 2,853 source lines of code. The most basic of all tests is fsm_test.cc (finite state machine test), which treats Raft server as a “device under test.” It models specific chains of events which must not lead to protocol failure, e.g. receiving an outdated message from a deposed leader. The next level of testing includes a mock network and mock logical clock that is in replication_test. It allows us to test how different combinations of Raft options, such as pre-voting, non-voting members and gracious leader step down work together over a potentially slow network or non-synchronized clocks In etcd_raft.cc you’ll find a port of etcd Raft implementation unit tests. The team studied lots of Raft implementation and found the etcd testing effort one of the most thorough.
This sort of aggressive testing led to some interesting results. “One test found that our library crashes when the network reorders packets and there is a failure of one of the members. Another crash was when the leader fails while trying to bring on board a new cluster member — so there is a leader change — but it shouldn’t lead to a crash.”
Raft, despite being widely considered a simple protocol, has infinitely many protocol states. To be able to radically increase the amount of states our testing explores, the ScyllaDB engineering team came up with an implementation of a logical clock. “It’s a clock that is ticking with the speed at which the computer executes the test, not the speed of a wall clock.” Using this logical clock the team was able to squeeze a lot of things into a single test that runs at CPU speed — millions of events per second.
In ScyllaDB’s implementation, “Every state machine has [its] own instance of logical clock; this enables tests when different state machines run at different clock speeds.”
Recent Interesting Commits
Every week our CTO Avi Kivity produces a roundup of changes to our codebase sent out on our user mailing list entitled “Last week in scylla.git master.” You can read the latest month’s roundups here:
- Last week in scylla.git master (issue #72; 2021-04-04)
- Last week in scylla.git master (issue #73; 2021-04-11)
- Last week in scylla.git master (issue #74; 2021-04-19)
- Last week in scylla.git master (issue #75; 2021-04-25)
Here’s a few of the more salient commits mentioned this past month:
Compaction Strategies Reshaped
- Leveled Compaction Strategy (LCS) was reshaped to work better with repair-based operations.
- Time Window Compaction Strategy (TWCS) was modified to reduce write amplification and avoid unbounded memory usage.
Better Memory Allocation to Further Reduce Latencies and Stalls
- The SSTable parser was modified by adding new methods for parsing byte strings. This avoids creating large memory allocations for some cells, reducing related latencies.
- More code paths can now work with non-contiguous memory for table columns and intermediate values: comparing values, and the CQL write path. This reduces CPU stalls due to memory allocation when large blobs are present.
- Continuing on the path of allowing non-contiguous allocations for large blobs, memory linearizations have been removed from Change Data Capture (CDC). This reduces CPU stalls when CDC is used in conjunction with large blobs.
ScyllaDB Monitoring Stack 3.7
We recently released ScyllaDB Monitoring Stack 3.7. One feature in particular (#1258) provides visibility of the accumulation and sending of Hinted Handoffs — updates that are maintained during transient node failures, and sent when nodes come back online. While Hinted Handoffs have been around in ScyllaDB for many years, this new dashboard provides immediate observability into this aspect of extra load due to transient node failures.
Learn More in ScyllaDB University
While much of what we’ve been talking about here is deep in the heart of ScyllaDB’s source code, many readers might be new to ScyllaDB, or even new to NoSQL in general. For you, we’ve created ScyllaDB University. It is an entirely free online resource for users to build your NoSQL database skills. It’s your first step into the journey of mastering the monstrously-fast, monstrously-scalable database which is ScyllaDB.
You can start with an overview of ScyllaDB and then, at your own pace, move up into advanced architectural concepts like consensus protocols, learning how these power user-oriented features like Lightweight Transactions and how they work under the hood.