With more than 30,000 commits in ScyllaDB and Seastar, 78 and 113 committers, respectively, and customers running ScyllaDB in the wild supporting hundreds of millions of users, powering many mission-critical services we all rely on, it’s fair to say that ScyllaDB is now a mature database.
However, as engineers, as much as we’re proud of our results so far, we feel that there is still much more to be done in nearly every dimension of ScyllaDB.
“Cloud native” and “elastic” are attractive concepts, and many bandy these terms around freely. “How fast can your cloud-native solution scale?” “What’s the cost of your cloud-native solution?” “How easy is it to maintain such an elastic deployment without committing to a certain vendor’s cloud?” But when you get to implementation, you have to design the architecture and write the code that delivers on those buzzwords. How do you restore a database backup to a different number of nodes? Can you run DDL while scaling your cluster? Can you add a million ops/sec of capacity to your cluster with ease? Currently, Apache Cassandra, DynamoDB and even ScyllaDB do not have perfect answers to all of these questions.
What is Project Circe?
In Greek mythology, Scylla was originally a beautiful nymph who was turned into a monster by the witch Circe. Circe’s enchantment gave her 12 tentacles (as well as a cat’s tail and six dogs’ heads around her waist). We now wish to similarly cast our spell on the modern ScyllaDB NoSQL database, and make it into an even bigger and better monster.
Project Circe is a one-year effort to improve many aspects of ScyllaDB, bringing stronger consistency, elasticity and ease of use to our community. Each month in 2021, we’ll reveal at least one “tentacle,” each representing a significant feature. Some of the new concepts are quite transformative, so we will share available details ahead of time.
The Metamorphosis of ScyllaDB
ScyllaDB began by following the eventual consistency model of Cassandra and DynamoDB. Eventual consistency allows the user to prefer availability and partition tolerance over consistency, and also to tune the consistency for performance; and it is a perfectly valid model for many real-world use cases.
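The trade-off behind tunable consistency can be stated as a simple rule: with a replication factor of N replicas, a read at consistency level R is guaranteed to see the latest write at consistency level W only when the two quorums must overlap, i.e. R + W > N. A minimal sketch of that rule (the function name is ours, purely for illustration; this is not a ScyllaDB API):

```python
# Hypothetical illustration of tunable consistency: with replication factor
# n_replicas, a read quorum of read_cl and a write quorum of write_cl are
# guaranteed to intersect on at least one replica only when their sum
# exceeds the replica count.

def is_strongly_consistent(n_replicas: int, write_cl: int, read_cl: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return read_cl + write_cl > n_replicas

# RF=3 with QUORUM (2) writes and QUORUM (2) reads: quorums overlap.
print(is_strongly_consistent(3, 2, 2))  # True
# RF=3 with ONE writes and ONE reads: a read may miss the latest write.
print(is_strongly_consistent(3, 1, 1))  # False
```

Choosing lower levels on either side trades this guarantee away for latency and availability, which is exactly the tuning knob the eventual consistency model exposes.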
However, lack of consistency comes at the price of complexity. It has often bothered us as database developers that our lives became more complicated with advanced features such as materialized views, indexes, change data capture and more. For example, materialized views receive asynchronous updates from the base table; in the case of certain server failures, a view table can go permanently out of sync with its base table and require a full rebuild.
Other issues were observed during our recent Jepsen test of Lightweight Transactions (LWT): traditionally, schema and topology changes weren’t transactional, forcing users to carefully serialize them and making maintenance complicated. It is time for a change!
The Raft Consensus Protocol
Agreement between nodes, or consensus, in a distributed system is complicated but desirable. ScyllaDB gained lightweight transactions (LWT) through Paxos, but that protocol costs three round trips per operation. Raft allows us to execute consistent transactions without this performance penalty. Unlike LWT, we plan to integrate Raft with most aspects of ScyllaDB, making a leap forward in manageability and consistency.
Raft will become our default option for data manipulation. We will preserve regular operations as well as the Paxos protocol. As of December 2020, the core Raft protocol is implemented in ScyllaDB. The first user-visible value will come in the form of transactional schema changes. Until now, ScyllaDB has tracked its schema changes using gossip and automatically consolidated schema differences. However, there was no way to heal a conflicting Data Definition Language (DDL) change. Transactional schema changes will eliminate schema conflicts and allow full automation of DDL changes under any conditions.
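The core idea that makes a schema change transactional under Raft is simple: the DDL statement becomes a replicated log entry, and it only takes effect once a majority of nodes have accepted it, so every node applies schema changes in the same order and conflicts cannot arise. A toy sketch of that majority-commit step (all class and method names here are illustrative, not ScyllaDB internals, and the real protocol uses AppendEntries RPCs, terms and leader election):

```python
# Toy model of Raft-style majority commit for a DDL statement.

class Follower:
    def __init__(self, up=True):
        self.up = up
        self.log = []

    def append(self, entry) -> bool:
        if self.up:
            self.log.append(entry)
            return True
        return False                        # node down: no acknowledgment

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []                       # replicated log of DDL entries
        self.commit_index = 0               # entries below this are committed

    def propose_ddl(self, statement: str) -> bool:
        self.log.append(statement)
        # Replicate; the leader counts as one acceptance of its own entry.
        acks = 1 + sum(f.append(statement) for f in self.followers)
        cluster_size = len(self.followers) + 1
        if 2 * acks > cluster_size:         # majority reached -> committed
            self.commit_index = len(self.log)
            return True
        return False                        # no majority: entry not applied

# 3-node cluster with one node down: 2 of 3 is still a majority,
# so the schema change commits and will be replayed to the dead node later.
leader = Leader([Follower(up=True), Follower(up=False)])
assert leader.propose_ddl("ALTER TABLE ks.t ADD col int")
```

Because commitment requires a majority rather than every node, a DDL statement can proceed even with a node down, yet no two conflicting statements can both commit.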
The next user-visible improvement is to make topology changes transactional using Raft. Currently, ScyllaDB and Cassandra can scale only one node at a time. ScyllaDB can utilize all of the available machine resources and stream data at 10 GB/s, so new nodes can be added quite quickly. However, it can still take a long time to double or triple the whole cluster’s capacity. That’s obviously not the elasticity you’d expect.
Once node range ownership becomes transactional, it will give us many degrees of freedom, and we plan to improve more aspects of range movement, working toward tablets and dynamic range splitting for load balancing.
Beyond crucial operational advantages, end users will be able to use the new Raft protocol, gaining strong transaction consistency at the price of a regular operation!
We expect to improve many more aspects of database maintenance. For example, materialized views can be made consistent with their base tables under all scenarios, repairs can be performed faster and more cheaply, and so forth. We welcome you to follow our progress with Raft on the ScyllaDB users mailing list and to participate in the discussions as well.
Maintainability

Many ScyllaDB maintenance operations require significant data movement between the database nodes in a cluster. It is not easy to make management operations efficient while minimizing their impact on the workload at all times. Moreover, the longer a maintenance window stays open while data moves between nodes, the less resilient the database may be during the operation.
With its existing IO and CPU schedulers and automatic tuning, ScyllaDB was designed to support smooth maintenance operations. This year we would like to simplify your deployments by removing concepts like seed nodes, which break the homogeneous-node model, allowing you to restart nodes or change the cluster size without risk or complexity.
Compaction is a solved problem in ScyllaDB. The compaction controller automatically sets dynamic priorities for the compaction queue, and the user doesn’t need to do anything but choose the compaction algorithm. However, we identified cases where this promise isn’t met, especially around repair, node changes and migration. A new internal state called off-strategy compaction will automatically reshape your data to comply with the existing compaction strategy. Such a reshape will also allow you to load SSTables on any node, regardless of key range ownership; the node will chauffeur the data to the proper replica set. These and many more changes will allow super smooth administration without any glass jaws.
Oh, and last but not least, the repair process finally becomes a solved problem for ScyllaDB and ScyllaDB Manager. More changes are coming to repair, from an IO bandwidth limiter for poor IO environments to a new IO scheduler and maintenance windows. A key maintainability improvement comes from Repair Based Node Operations (RBNO), currently disabled by default. Previously, a node operation such as adding a new node or a decommission could take several hours, and if there was a failure, the whole process needed to be restarted from scratch. With RBNO the restart is immediate and the operation resumes from its last position, a big improvement. In addition, since the data is streamed by the repair protocol, the stream is consistent across all replicas; previously, a repair was required at the end of streaming, leaving a window of time with reduced consistency. And since we already use repair for streaming, the process is simpler too, and saves the additional repair execution time.
See the Performance section below for how this will also improve latency under maintenance operations.
Elasticity

ScyllaDB was modeled after Cassandra, which was created in a pre-cloud-native environment. With more workloads running as-a-service, spiky workloads can flip overnight from low usage to peak and back again in moments, requiring a modern distributed database to be as elastic as possible.
One of the basic elasticity improvements we’re now finalizing is the ability to add, remove and replace multiple nodes at a time with full consistency and greater simplicity. This is important not only for database-as-a-service deployments such as ScyllaDB Cloud but for all deployments. When cluster manipulation is complex, database admins refrain from changing the cluster size often. Soon this will no longer be the case. As an example, the remove-node operation used to be hard and complex to execute safely; beginning with release 4.1, it is safe by default. We automatically hold off compaction during streaming, wait for the streaming operation to conclude, then compact the new data just once at the end.
Related to performance, we’re removing many tiny stalls from cluster node operations so that the cluster size can be changed constantly with no effect on latency. Clusters with 25% idle time should suffer no P99 latency increase during these operations.
Lastly, by adopting Raft we’re making key range changes transactional. This major shift will allow us to detach range movement from the actual data movement, which will become asynchronous. In this way, we will be able to add or remove multiple nodes at once. Furthermore, key ranges will be represented as tablets, which will automatically be split or merged as a function of the load on the shard.
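The split-and-merge behavior of tablets can be sketched with a toy load balancer: a tablet whose measured load crosses a threshold is split at its midpoint, and adjacent cold tablets are merged back. Thresholds, tuple layout and function names below are ours for illustration only, not ScyllaDB's implementation:

```python
# Toy tablet rebalancer: split hot ranges, merge adjacent cold ones.
SPLIT_LOAD = 100.0   # hypothetical ops/sec threshold for splitting
MERGE_LOAD = 20.0    # hypothetical combined threshold for merging

def rebalance(tablets):
    """tablets: list of (start_token, end_token, load); returns a new list."""
    out = []
    for start, end, load in tablets:
        if load > SPLIT_LOAD and end - start > 1:
            mid = (start + end) // 2
            # Assume load divides roughly evenly between the two halves.
            out.append((start, mid, load / 2))
            out.append((mid, end, load / 2))
        else:
            out.append((start, end, load))
    # Merge adjacent cold tablets pairwise.
    merged = [out[0]]
    for start, end, load in out[1:]:
        pstart, pend, pload = merged[-1]
        if pend == start and pload + load < MERGE_LOAD:
            merged[-1] = (pstart, end, pload + load)
        else:
            merged.append((start, end, load))
    return merged

tablets = [(0, 1000, 240.0), (1000, 2000, 5.0), (2000, 3000, 8.0)]
print(rebalance(tablets))
# The hot tablet is split in two; the two cold neighbors are merged:
# [(0, 500, 120.0), (500, 1000, 120.0), (1000, 3000, 13.0)]
```

In practice such a pass would run repeatedly, so a still-hot half gets split again on the next round, keeping per-tablet load bounded regardless of how skewed the workload is.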
Performance

Performance and price/performance are fundamental motivations in developing ScyllaDB. Our emphasis on performance this year is around several key improvements:
Using our stall detector, our scheduler can locate stalls as small as 500 microseconds. We’re eliminating them to ensure a smooth experience under any administrative operation, from elasticity to availability.
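Why a 500-microsecond stall matters: in a shared-nothing, run-to-completion design, any single task that holds a shard's CPU too long delays every other task queued on that shard. A simplified, offline sketch of the detection idea (the real detector samples the running reactor; only the threshold value comes from the text above):

```python
# Offline stall detection sketch: flag tasks that held the CPU longer than
# the threshold. ScyllaDB's real detector works on the live reactor loop.
STALL_THRESHOLD_US = 500

def find_stalls(task_durations_us):
    """Return (index, duration) for every task that ran past the threshold."""
    return [(i, d) for i, d in enumerate(task_durations_us)
            if d > STALL_THRESHOLD_US]

# Mostly short tasks, with a 2.3 ms outlier at index 2 and a borderline
# 510 us task at index 4; both delay everything behind them on the shard.
durations = [120, 80, 2300, 45, 510]
print(find_stalls(durations))  # [(2, 2300), (4, 510)]
```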
Large partition performance improvements
Our goal is to make large partitions perform close to small ones, with almost no overhead for clustering key scans. Better performance frees developers from working around their data model and provides consistent performance without hiccups. We recently implemented a binary search through the SSTable index file, which boosts the speed of reading rows within a large partition. The final step is to cache the index file itself, so we don’t need to read it from disk at all.
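The gain comes from replacing a linear scan of a large partition's index entries with a logarithmic search. A sketch of the idea using Python's `bisect` (the actual code is C++ over on-disk index entries; the index layout and offsets below are invented for illustration):

```python
# Binary search over a (sorted) clustering-key index of one large partition:
# find the file offset from which to start reading for a given key in
# O(log n) instead of scanning index entries linearly.
import bisect

# Hypothetical index entries: (clustering_key, file_offset), sorted by key,
# as SSTable promoted-index entries are.
index = [("a", 0), ("f", 4096), ("m", 9216), ("s", 15360), ("x", 20480)]
keys = [k for k, _ in index]

def start_offset(clustering_key: str) -> int:
    """Offset of the last index entry whose key is <= clustering_key."""
    i = bisect.bisect_right(keys, clustering_key) - 1
    return index[max(i, 0)][1]

# A row with clustering key "n" lies somewhere after entry "m",
# so reading begins at that entry's offset rather than at the start.
print(start_offset("n"))  # 9216
```

For a partition with millions of rows, this turns index traversal from millions of comparisons into a couple of dozen, which is why caching the index file (eliminating the disk reads those comparisons trigger) is the natural final step.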
New IO scheduler
Controlled IO is one of the key principles of ScyllaDB’s Seastar engine. Until now, all of the shards in the system would divide the global capacity of the disks among themselves and schedule their IO within that static budget. This design worked well for very fast disks, but slower disks, or workloads with extreme latency requirements mixed with large blobs of IO, suffered when a shard’s IO was maxed out. The new scheduler, which recently landed in the ScyllaDB master branch, uses a new algorithm with a global view of the system: shards receive a budget and ask for more as it comes close to being drained.
We expect the new IO scheduler to improve latency across the board, improve LWT operations and improve behavior in environments with slow disks, such as cloud network drives.
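The global-budget idea described above can be sketched with a toy model (ours, not the actual Seastar scheduler): one pool sized to the disk's capacity, from which shards draw chunks of budget and return for more only when running low, so no shard can push more IO than the disk as a whole can absorb. All names and watermark values here are illustrative:

```python
# Toy model of a globally budgeted IO scheduler.

class GlobalIOPool:
    def __init__(self, disk_capacity_bytes: int):
        self.remaining = disk_capacity_bytes

    def grab(self, chunk: int) -> int:
        granted = min(chunk, self.remaining)
        self.remaining -= granted
        return granted

class Shard:
    CHUNK = 1024          # how much budget to request at a time
    LOW_WATERMARK = 256   # refill before running completely dry

    def __init__(self, pool: GlobalIOPool):
        self.pool = pool
        self.budget = pool.grab(self.CHUNK)

    def submit_io(self, nbytes: int) -> bool:
        if self.budget < self.LOW_WATERMARK:
            self.budget += self.pool.grab(self.CHUNK)
        if nbytes <= self.budget:
            self.budget -= nbytes
            return True
        return False       # over budget: the IO stays queued this round

pool = GlobalIOPool(disk_capacity_bytes=4096)
shards = [Shard(pool), Shard(pool)]
print(shards[0].submit_io(512), pool.remaining)  # True 2048
```

Contrast this with the old static split, where each shard owned a fixed fraction of capacity forever: an idle shard's budget went unused while a busy shard was starved, which is exactly the case the global view fixes.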
Coroutines

Coroutines are a C++ programming abstraction, recently introduced to ScyllaDB with C++20 and the Clang compiler. Without diving too deep: coroutines allow ScyllaDB to replace some of its future-promise asynchronous code, which allocates memory extensively, with code that allocates in larger chunks where appropriate. We expect to broaden the use of coroutines in 2021.
Stability

Many of the above improvements can also be considered stability improvements. For example, making schema changes transactional eliminates a whole family of schema disagreement issues. The same goes for the elimination of seed nodes: many ScyllaDB users are not fans of reading the docs, and there are times when seeds aren’t configured correctly; when the time comes, not having enough seeds causes availability issues.
There are, however, many improvements directly related to stability. We’re improving the tracking of memory used by queries in order to block or slow down new queries when memory is limited. Drafts for cancellable IO were published, so when a node becomes suddenly overloaded, we can cancel pending IO and provide immediate relief. There is ongoing work around tombstones and range tombstones. ScyllaDB will gain a safe mode, a new mode of operation that enforces better defaults and reduces the chance of errors. Workload shedding is a family of algorithms we’re developing to prevent overload situations, especially when parallelism is unbounded and the existing mechanisms that slow down clients via backpressure are not effective.
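The essence of workload shedding is to bound parallelism explicitly on the server side: once the concurrency limit is hit, new requests fail fast with an overload error instead of queueing until memory runs out. A minimal sketch (illustrative only, not ScyllaDB's actual mechanism or limits):

```python
# Minimal admission-control sketch: admit up to max_in_flight concurrent
# requests, shed (reject fast) everything beyond that.

class Shedder:
    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.shed = 0

    def try_admit(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            self.shed += 1          # fail fast with an "overloaded" error
            return False
        self.in_flight += 1
        return True

    def finish(self):
        self.in_flight -= 1         # a request completed; capacity freed

s = Shedder(max_in_flight=2)
results = [s.try_admit() for _ in range(3)]
print(results, s.shed)  # [True, True, False] 1
```

Rejecting early is kinder than queueing here: a client that gets an immediate overload error can back off and retry, while an unbounded queue only converts overload into memory exhaustion and timeouts for everyone.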
Deployment

We’re finally ready to move forward and announce our Kubernetes Operator as a fully supported environment. Expect it to work well on all K8s environments, starting with GKE.
ScyllaDB Cloud (our NoSQL DBaaS) is itself expanding from AWS-only to GCP, and from there to Azure. We will deploy to more instance types and develop many more features. We will also add standard machine images for GCP and Azure, and add marketplace options for the DBaaS, Enterprise and open source deployment options. An open source Ansible project already exists, and we’re adding provisioning options to it.
There are many more features (such as CDC going GA, together with supporting libraries) and improvements we’re excited about. We will boost observability, implement more protocols, improve traceability and much more. Check out the talks at our ScyllaDB Summit and the Project Circe page.