Raft After ScyllaDB 5.2: Safe Topology Changes

Konstantin Osipov, 32 minutes

ScyllaDB’s drive towards strongly consistent features continues, and in this talk I will cover the upcoming implementation of the safe topology changes feature: our rethinking of how nodes are added to and removed from a Scylla cluster.
Quickly assembling a fresh cluster, performing topology and schema changes concurrently, quickly restarting a node with a different IP address or configuration – all of this has become possible thanks to a centralized - yet fault-tolerant - topology change coordinator, the new algorithm we implemented for Scylla 5.3. The next step would be automatically changing data placement to adjust to the load and distribution of data - our future plans which I will touch upon as well.

Video Transcript

Hi, my name is Kostja Osipov and I have become a regular guest at these events. Don’t know why, maybe because I’m such a fan of strong consistency.

Today I am going to give you an update on our work bringing strongly consistent features to ScyllaDB. Namely, I’ll talk about:

– Scylla 5.2, where Raft finally goes GA
– our work under the --experimental-features=raft switch in the trunk, bringing in strongly consistent topology changes (still in experimental mode)
– an outlook into the future, i.e. what we plan to use Raft for next

Ah, you may be new to these topics and wondering why strong consistency is important, what the heck Raft is, and what it has to do with strong consistency.

For full context, you should see the earlier talks by Tomek Grabiec and me, from Scylla Summit ’21 and ’22.

Here, let me quickly repeat the basics and the definitions, and we won’t go back to this any more, I promise.

Understanding a strongly consistent system is easy. It’s like observing two polite people talking – there is only one person talking at a time and you can clearly make sense of who talks after who and the flow of events. In an eventually consistent system changes, such as database writes, are allowed to happen concurrently and independently, and the database guarantees that eventually there will be an order in which they all line up.

Eventually consistent systems can accept writes to nodes partitioned away from the rest of the cluster. Strongly consistent systems require a majority of the nodes to
acknowledge an operation, such as the database write, to accept it.

So the trade off between strong and eventual consistency is in requiring the majority of the participants to be available to make progress.

Scylla, as a member of the Cassandra family, started as an eventually consistent system, and it made perfect business sense: in a large cluster, we want our writes to be available even if a link to the other data center is down.

Apart from storing the user data, the database maintains additional information about it, called metadata – it consists of topology (nodes, data distribution) and schema (table format) information.

A while ago we recognized in Scylla that there is little business value in using the eventually consistent model for metadata: metadata changes are infrequent, so we do not need to demand extreme availability for them. Yet we want to reliably change the metadata in automatic mode to bring elasticity, which is hard to do with the eventually consistent model underneath.

This is when we embarked on a journey which brought in Raft – an algorithm and a library we implemented to replicate any kind of information across multiple nodes.

So what’s the idea of a replicated state machine? Suppose you had a program or an application that you wanted to make reliable. One way to do that is to execute that program on a collection of machines and ensure they execute it in exactly the same way. So a state machine is just a program or an application that takes inputs and produces outputs.

A replicated log can help to make sure that these state machines execute exactly the same commands. Here’s how it works.

A client of the system that wants to execute a command passes it to one of these machines.

That command, let’s call it X then gets recorded in the log of the local machine, and then, in addition, the command is passed to the other machines and recorded in their logs as well. Once the command has been safely replicated in the logs, then it can be passed to the state machines for execution. And when one of the state machines is finished executing the command, the result can be returned back to the client program.

And you can see that as long as the logs on the state machines are identical, and the state machines execute the commands in the same order, we know they are going to produce the same results. So it’s the job of the consensus module to ensure that the command is replicated and then pass it to the state machine for execution.

The system makes progress as long as any majority of the servers are up and can communicate with each other. (2 out of 3, 3 out of 5).
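The flow just described can be illustrated with a toy model. This is only a sketch under loud assumptions: it is not Raft (no leader election, no terms, no catch-up for servers that were down), and the names `Server`, `submit` and `majority` are invented for illustration. It shows only the two properties mentioned above: a command is executed after it has been recorded in the logs of a majority, and identical logs yield identical state.

```python
from dataclasses import dataclass, field


def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


@dataclass
class Server:
    name: str
    up: bool = True
    log: list = field(default_factory=list)    # replicated log of commands
    state: dict = field(default_factory=dict)  # the state machine: a key-value map
    applied: int = 0                           # how many log entries were executed

    def apply_committed(self) -> None:
        """Execute every logged command not yet executed, in log order."""
        while self.applied < len(self.log):
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1


def submit(servers: list, command: tuple) -> None:
    """Replicate a (key, value) command; execute it only if a majority is up."""
    reachable = [s for s in servers if s.up]
    if len(reachable) < majority(len(servers)):
        raise RuntimeError("no majority reachable: command rejected")
    for s in reachable:
        s.log.append(command)   # 1. record the command in the logs
    for s in reachable:
        s.apply_committed()     # 2. only then pass it to the state machines


servers = [Server("a"), Server("b"), Server("c")]
submit(servers, ("x", 1))       # all three servers are up
servers[2].up = False
submit(servers, ("y", 2))       # 2 out of 3 is still a majority
```

With a second server down, `submit` raises instead of accepting the command, which is the availability cost of strong consistency discussed earlier.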

Again, I refer you to the earlier talks for details.

So big news, in Scylla 5.2 our first use of Raft – for propagation of schema changes – goes GA.

As I am preparing this talk, we’ve branched off Scylla 5.2 with the --consistent-cluster-management feature ON in new clusters.

We also implemented an online upgrade to Raft for existing clusters, and machinery to recover existing clusters from a disaster.

What are the benefits of --consistent-cluster-management in Scylla 5.2?

It is now safe to perform concurrent schema change statements: they are not lost, they do not shadow each other’s effects, and they do not lead to data loss. Schema propagation happens much faster, since the leader of the cluster is actively pushing it to the nodes. In a healthy cluster you can expect the nodes to learn about a new schema within a few milliseconds (it used to be a second or two).

If a node is partitioned away from the cluster, it can’t perform schema changes. This is the main difference from pre-Raft clusters, or limitation compared to them, that you should keep in mind. You can still perform other operations with such nodes, such as reads and writes, so data availability is unaffected. We see the results of the change not only in simple regression tests but also in our longevity tests which execute DDL: there are fewer errors in the log, and the systems running on Raft are more stable when DDL is running.

Going GA means that this option is ON by default in our scylla.yaml – ON in all new clusters.

If you’re upgrading an existing Scylla installation, set --consistent-cluster-management ON in all your scylla.yaml files, perform a rolling restart, and the cluster will switch to Raft. You can watch the progress of the upgrade in the Scylla log: look for raft_upgrade markers. As soon as the upgrade finishes, the system.local key for group0_upgrade_state starts saying “use_post_raft_procedures”.

Before going GA, we implemented two critical features of Raft which I would like to mention here.

First, we added support for disaster recovery – a way to salvage data when you permanently lost the majority of your cluster. Here’s how we did it.

We added a way to drop the state of consistent cluster management on each node and establish it from scratch in a new cluster. The detailed procedure is described in our manual. The key takeaway now is that your Raft based clusters, even though they don’t accept writes in a minority, are still usable even if your majority is permanently lost. We viewed it as a big concern for the users who are performing upgrades from their existing clusters.

The second big change we added is IP address change support. To operate seamlessly in Kubernetes environments, nodes need to be able to start with existing data directories but different IP addresses. We added support for that in Raft mode as well.

If the IP address of just one or a few of your nodes changes, you can just restart these nodes with the new IP address and this will work. There is no need to change the configuration files (e.g. seed lists).

You can even restart the cluster with all node IPs changed. Then, of course, you need to somehow tell the existing nodes the new IP addresses of each other. To discover the cluster, Scylla will use the pre-recorded IP information in system.peers as well as the seeds: parameter from scylla.yaml. So when all IP addresses have changed and system.peers is useless, it will find the nodes using the information provided in scylla.yaml. The information in system.peers is updated automatically on each node.

I hope you’ll find time to try out Scylla 5.2 with these shiny new features.

Now let’s talk about our new work, which is still experimental but has already materialized in the form of a feature branch and is actively maturing. We hope to ship it in experimental mode in the upcoming Scylla 5.3 release.

This is, as planned, consistent and centralized topology management.

Well, I’ve just talked about schema ownership, let me describe what token ownership provides.

Right now, the ScyllaDB documentation says that:
* you can’t perform more than one topology operation at a time, even when the cluster is empty
* a failure during topology change may lead to lost writes
* topology operations take time.

The existing topology change algorithm tries to make sure (it’s only best effort, not a guarantee) that a topology change has propagated through the cluster by sleeping. It also needs to ensure that all reads and writes which are based on the old topology have quiesced. For both of these, the existing algorithm just sleeps for ring_delay_timeout (30 seconds).

I am very happy that the Cassandra community has also recognized the problem, and there is Cassandra work addressing the same issues.


For the rationale of our work, I would like to quote this CEP proposal.

“Probabilistic propagation of cluster state is suboptimal and not necessary, given the scale that Cassandra typically operates at.
Not to mention it is (by definition) unreliable and, since there is no strict order for changes propagated through gossip, there
can be no universal “reference” state at any given time. We aim to add a mechanism for establishing the order of state change events,
and replace gossip as the primary mechanism of their dissemination in order to enable nodes which are participating in state-changing
operations to detect when these intersect with other ongoing operations, such as reads and writes. The first two areas we plan
to apply these changes to are Schema and Token Ownership. Establishing global order for the events requires us to add
transactional semantics to changes in the state.”

Here’s what we achieve by moving topology data to Raft:
* the Raft group includes all cluster members
* token metadata is replicated using Raft
* no stale topology

In our centralized topology implementation we were able to address all of the above to a significant extent.

The key change that allowed us to do the transition was a turnaround in the way topology operations are performed.

Before, the node that was joining the cluster would drive the topology change forward. If anything happened to that node, the failure would need to be addressed by the DBA. In the new implementation, a centralized process, which we call the topology change coordinator and which runs alongside the Raft cluster leader node, will drive all topology changes.

Before, the state of the topology, such as tokens, was propagated through gossip and eventually consistent. Now, the state is propagated through raft, replicated to all nodes and is strongly consistent. A snapshot of this data is always available locally, so that starting nodes can quickly obtain the topology information without reaching out to the majority of the cluster.

Even if nodes start concurrently, token metadata changes are now linearized. Also, the view of token metadata at each node does not depend on the availability of the owner of the tokens. When node C bootstraps, it is fully aware of node B’s tokens, even if node B is down.

But what if the coordinator fails? Topology changes are multi-step sagas. If the coordinator dies, the saga could stall or the topology change would remain incomplete. Locks could be held, blocking other topology changes. Automatic topology changes (e.g. load balancing) cannot wait for admin intervention. We need fault tolerance.

The new coordinator follows the Raft group0 leader and takes over from where the previous one left off.

The state of the process is in a fault-tolerant linearizable storage. As long as quorum is alive, we can make progress.
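To make the takeover idea concrete, here is a minimal sketch of a resumable multi-step operation. Everything in it is an assumption made for illustration: the step names, the `DurableState` class (a stand-in for the fault-tolerant linearizable storage) and the `crash_after` knob are hypothetical and do not reflect Scylla’s actual implementation.

```python
# Hypothetical step names; a real topology change has its own stages.
STEPS = ["allocate_tokens", "stream_data", "switch_reads", "finalize"]


class DurableState:
    """Stand-in for replicated storage that survives a coordinator crash."""

    def __init__(self):
        self.next_step = 0   # index of the first step not yet completed
        self.history = []    # (coordinator name, step) pairs, for inspection


class Coordinator:
    def __init__(self, name: str, state: DurableState):
        self.name = name
        self.state = state

    def run(self, crash_after=None) -> bool:
        """Run the remaining steps; simulate a crash after `crash_after` steps.

        Returns True if the saga finished, False if this coordinator 'died'.
        """
        done = 0
        while self.state.next_step < len(STEPS):
            if crash_after is not None and done == crash_after:
                return False                     # coordinator dies mid-saga
            step = STEPS[self.state.next_step]
            self.state.history.append((self.name, step))
            self.state.next_step += 1            # progress is recorded durably
            done += 1
        return True


state = DurableState()
finished_first = Coordinator("node1", state).run(crash_after=2)  # old leader dies
finished_second = Coordinator("node2", state).run()              # new leader resumes
```

Because progress lives in the shared state rather than in the coordinator, the second coordinator continues with the third step instead of starting over. In a real system each step would also need to be safe to repeat, since a crash can happen between performing a step and recording it.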

We introduced a new feature on the read/write path, which we call fencing.

Each read or write in Scylla now is signed with the current topology version of the coordinator performing the write. If the replica has a newer, incompatible topology information, it responds with an error, and the coordinator refreshes its ring and re-issues the query to new, correct replicas. If the replica has older information, which is incompatible with the coordinator’s version, it refreshes its
topology state before serving the query.
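The version check can be sketched as follows. This is a simplified illustration, not Scylla’s code: the `Topology`, `Node` and `StaleTopology` names are hypothetical, any version difference is treated as incompatible, and the retry goes to the same replica, whereas a real coordinator would recompute the replica set from the refreshed ring.

```python
class StaleTopology(Exception):
    """Raised by a replica when the coordinator's topology version is too old."""


class Topology:
    """Stand-in for the Raft-replicated topology state; tracks only a version."""

    def __init__(self):
        self.version = 1


class Node:
    def __init__(self, topology: Topology):
        self.topology = topology
        self.known_version = topology.version   # this node's cached view
        self.data = {}

    def refresh(self) -> None:
        """Catch up with the latest replicated topology version."""
        self.known_version = self.topology.version

    def handle_write(self, key, value, fence_version: int) -> None:
        """Replica side: compare the request's version with our own."""
        if fence_version < self.known_version:
            raise StaleTopology(self.known_version)  # coordinator is behind
        if fence_version > self.known_version:
            self.refresh()                           # replica is behind
        self.data[key] = value

    def coordinate_write(self, replica: "Node", key, value) -> None:
        """Coordinator side: stamp the write, refresh and retry if fenced."""
        try:
            replica.handle_write(key, value, self.known_version)
        except StaleTopology:
            self.refresh()
            replica.handle_write(key, value, self.known_version)


topo = Topology()
coord, replica = Node(topo), Node(topo)

topo.version = 2
replica.refresh()                            # replica is now ahead of coord
coord.coordinate_write(replica, "k", "v1")   # fenced once, then retried

topo.version = 3
coord.refresh()                              # coord is now ahead of replica
coord.coordinate_write(replica, "k", "v2")   # replica refreshes, then serves
```

The point of the design is that neither side ever acts on a topology older than the other’s: a mismatch is detected on the request itself, so no waiting period is needed to let the change spread.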

We plan to extend the fencing all the way down to Scylla drivers, which will then be able to protect their queries with the most recent fencing token they know. This will make sure that drivers never perform writes based on an outdated topology, and, for schema changes, will make sure that a write into a newly created table never fails with “no such table” because the schema didn’t propagate yet.

With fencing in place, we are not only able to avoid waiting for ring delay during topology operations, making them quicker, more testable and more reliable, but we also resolve consistency anomalies that can otherwise occur during topology changes.

Indeed, “using gossip for state changes leads to short-term replication factor violations. For example, a gossip-backed bootstrap process allows streaming to be started before there’s a guarantee that any coordinator will make a write towards the bootstrapping replica; similarly, during bootstrap it is possible for the effects of the write operation, followed by the read to the same replica set, not to be visible. While a problem with streaming is a relatively minor reduction in durability guarantees, a problem with disagreeing read/write sets is a transient data loss, a consistency violation that likely goes unnoticed on a regular basis. Worse yet, there is no guaranteed upper bound for a difference in how much two coordinators may diverge from each other, which means that a coordinator may try to read from, or write to, replicas that do not own the range at all.”

This is now all gone.

So let me summarize what we plan to deliver in 5.3.

While you still can’t perform multiple topology operations concurrently, *requesting* multiple operations is now safe. The centralized coordinator will queue the operations if they are compatible.

Incorrect operations, such as removing and replacing the same node at the same time, will be rejected.

If a node that is joining the cluster dies, the coordinator will notice and abort the join automatically.
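As a rough illustration of queueing compatible requests and rejecting conflicting ones, consider the sketch below. The compatibility rule used here (at most one pending operation per node) is an assumption chosen to keep the example small, and `TopologyRequestQueue` is a hypothetical name; the actual rules in Scylla may differ.

```python
from collections import deque


class TopologyRequestQueue:
    """Toy model of a central coordinator's request queue."""

    def __init__(self):
        self.pending = deque()   # (operation, node) pairs, in arrival order

    def request(self, operation: str, node: str) -> None:
        """Accept a request unless it conflicts with one already queued."""
        for queued_op, queued_node in self.pending:
            if queued_node == node:
                raise ValueError(
                    f"cannot {operation} {node}: {queued_op} is already pending"
                )
        self.pending.append((operation, node))

    def run_next(self):
        """Execute operations strictly one at a time, oldest first."""
        return self.pending.popleft() if self.pending else None


queue = TopologyRequestQueue()
queue.request("remove", "node1")
queue.request("bootstrap", "node4")      # compatible: different node, queued
try:
    queue.request("replace", "node1")    # conflicts with the pending remove
    rejected = False
except ValueError:
    rejected = True
```

Requests from different operators can arrive at any time, but execution stays serialized because only `run_next` removes work from the queue.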

In a nutshell, our main achievement is that topology changes are now safe and fast: we are no longer bound by ring delay, and user or operator errors do not corrupt the cluster.

As for strongly consistent data tables, for now they are a tiny experimental feature of 5.2: broadcast tables.

These are strongly consistent tables, replicated everywhere and based on Raft.

They are going to serve as the foundation for our “immediate consistency” project, which is based on Raft.

Long term road to tablets.

The advantages
– future: mention drivers and fencing; how drivers react today; how we would
like them to react in the future;
– road to tablets
– mention strongly consistent tables; how we extend them further