At ScyllaDB, we take data consistency seriously. To ensure the correctness of our lightweight transaction implementation we developed a series of tests using Scylla’s existing and new tools, such as Scylla Cluster Test, Gemini and lightest. These tools evaluate data correctness in the face of acute failure scenarios: I/O errors, running out of memory, node crashes and network partitions. We published some of this work in our blog (which has a great explanation of the difference between our normal eventually consistent operations and lightweight transactions, if you need a refresher).
We have also run Jepsen tests in the past. We focused on regular operations since LWT wasn’t supported at the time. Jepsen is an open source project and a service run by Kyle “Aphyr” Kingsbury. Established in 2013, it has by now become the industry’s standard for distributed systems testing, with many SQL and NoSQL vendors using it as a checkmark that their consistency claims are production grade.
Having run our own tests, we had confidence we were going to pass the Jepsen test with flying colors. The result turned out to be surprising.
Finding consistency violations in databases is an NP-complete problem – the cost grows exponentially with the number of transactions. Sometimes it is outright impossible to do for an external observer. Multiple violations can interact and produce the right results, while masking the underlying problems. ScyllaDB tests create scenarios with data constraints and check these constraints after running a workload with fault injections. Jepsen takes a more holistic approach. It observes not only whether or not there was a consistency fault, but also, with a high degree of certainty, the steps that brought it about. One tool that proved invaluable in particular was Elle, which finds anomalies in linear time.
But Jepsen testing, as a process, goes beyond just its powerful toolset — Kyle’s personal contribution into the testing is invaluable. Where else can you find such an opinionated, independent expert and also the author of one of the most cited Cassandra evaluations? With Kyle, we had an objective set of eyes.
Kyle’s testing was so thorough it ended up going beyond testing lightweight transactions. Scylla documentation for the eventual consistency model had grey areas in semantics of INSERTs (#7082), conflict resolution in case of duplicate timestamp (#7170), safe practices for topology changes, and the correct use of nodetool. All of these topics now have better documentation.
The vast majority of Scylla (and Cassandra) use cases are absolutely satisfied with eventual consistency. Regular operations enjoy better availability and better performance, thus conflict-free replicated data types, last-write-wins and even lack of row isolation are the right compromises for these workloads. QUORUM writes ensure durability, while QUORUM reads guarantee the application never reads dirty data. Background repair processes are responsible for eventually weeding out conflicts. Lightweight transactions are still invaluable when it’s necessary to not just guarantee durability, but linearizability — a specific (strict) order of changes, conditioning a change on the state of the database.
Some nasty LWT consistency bugs were found, too. In one instance (#7116), a typo in the data checksum loop introduced in Scylla 2.2 led to slight differences in data staying invisible to repair. LWT, as well as standard repair and streaming, depend greatly on the correctness of this code. Changing even one letter in the column name would make the bug disappear, meaning Kyle’s special talent for breaking systems had to play a role. Indeed, our own tests used null values in just the same way, but did not observe any errors just because the column holding nulls had a different name! This issue is fixed in Scylla 4.2 and the fix is now backported to all Scylla Enterprise versions.
Another ancient bug (#7611) was revealed in list append and prepend code, which historically generated its own timestamps for append/prepend values, ignoring Paxos timestamps. Using the Paxos timestamps was essential to preserve linearizability of list operations. And while the code was and still is working this way in Cassandra, that’s not good enough for us. The Scylla fix will appear in an upcoming release.
Truthful to the eventual consistency philosophy, Scylla does not run a strong coordination during topology changes. If you know what you’re doing, you can repair a cluster split across multiple availability zones, adding and removing nodes at will. This also allows you to shoot oneself in the foot.
The problem that manifested itself in the test was that Jepsen could start a new topology change after the previous one had failed – and by failure I also mean an operation timeout. Network partitioning and/or node failures, accompanied by a still ongoing node removal and a fresh node addition confused Paxos’s idea of what replicas constitute a quorum.
Despite our objections that this is a wrong way to operate Scylla, Kyle was nonplussed. There are circumstances in which even a human operator could not know if the previous operation has ended. Besides, the rules were never explicit in the docs.
Scylla guarantees that LWT operations will be strictly serializable during a single membership change. We refreshed the standard operational procedures with a new documentation page. If multiple changes are concurrent, the cluster is vulnerable to operator errors – a problem that is, fortunately, less important if one uses Scylla Cloud.
Jepsen testing spurred us to redouble our efforts to switch topology changes from eventual to strong consistency (#1247, #1207). With topology changes based on Raft, Scylla will be able to add and remove multiple nodes at a time, as well as detect and reject unsafe topology transitions requested by a DBA. I’ll talk more about it at our upcoming Scylla Summit.
Our confidence in LWT quality had some basis – the testing found no issues in the trickiest domain of lightweight transactions implementation, the Paxos algorithm. With Jepsen we learned yet again that database quality has to be all-round, and maturity is achieved through testing across many product features and their interactions, rather than focusing on a single piece.
To conclude, let us restate ScyllaDB’s commitment to ultimate product quality – specifically, to fixing all of the consistency issues found by Jepsen testing, and further improving the product and documentation to avoid usability traps. As of today, all the found issues have been resolved, and Scylla documentation clarified. We run Jepsen on a regular basis to ensure no regressions in upcoming releases, and plan to revisit and extend the coverage once strongly consistent topology changes are in.
Learn More at Scylla Summit
Both Kyle Kingsbury and Konstantin Osipov will be presenting at Scylla Summit 2021, coming this January 12th – 14th, 2021. The event is free and online. Kyle will present his Jepsen test findings in depth, and Konstantin will explain our future plans for Raft. These are just two of many great sessions we have planned at our Summit. Make sure you register today!