Hi, everyone, and welcome to my talk about the future of consensus in ScyllaDB. My name is Thomasz Grabiec. I’m a software engineer working at ScyllaDB on the core of the database. ScyllaDB 5.0 comes with a new technology for consensus called Raft, which we use to implement safe schema changes. About that you can learn from the talk of my colleague Konstantin Osipov.
In this talk I will tell you about other amazing things which we plan to do with Raft, but first let me tell you real quick about what Raft is. Raft is a protocol for state machine replication over unreliable network. It maintains a state machine at every replica where state is changed by commands, which can mean anything, and there is a single leader which decides on the order of commands. Protocol creates an illusion that there’s a single sequence of commands, and everyone agrees about, what are the commands? And what is the order of commands? And every state machine paused this sequence and eventually ends up in the same state because they execute the commands in the same order. Protocol is also called tolerant and highly available. It requires only quorum of the nodes to be alive to make progress, and because it has a single leader, then it follows that it must have automatic failover of the leader, and that’s exactly what it does. It has a detection algorithm for leader failures and runs a direction algorithm to make sure that there is always an active leader. This will be very useful, and I will explain how, why during this talk. Another cool thing which Raft causes to do is to build linearizable fault-tolerant storage on top of it. It’s the kind of storage where everyone that uses it serves the same history of changes, which matches the real-time ordering of requests. This will also be very useful in building distributed algorithms. The first thing on our road map which we plan to improve are topology changes. What do we mean by topology? It’s a pretty broad term in this context. It’s defined of the following. It’s the set of notes which are the members of the cluster, their location in DCs in racks but also assignment of ownership of data so data partitioning and distribution. Example of the triggers for topology changes in ScyllaDB include node operations like when you add a new node or decommission a node but also when you change replication strategy of a keyspace. For example, when you change the replication factor, this affects data distribution. This changes the location of data, and this is also what we call a topology change. So let me briefly describe how the current design works. Topology information is mostly stored in something which is called token metadata, so it’s information about members of the cluster, about data partitioning and distribution so assignment of data through replicas, and it’s based on partitioning, which is called token partitioning where every partition key is hashed to a value, which is called token, and the space of all tokens is called a token ring, and that represents the whole data set, all partition keys, and token ranges represent subsets of that so subsets of partition keys. And every node is assigned a set of tokens during bootstrap, and when you combine the tokens from all nodes, you get a combined ring, which is building the token metadata. The tokens from all nodes split the ring into token ranges, which are then assigned replicas according to the owning node of the newest token. Token metadata doesn’t determine fully replication metadata, meaning doesn’t fully describe where all the replicas are. For that, you also need the replication strategy, which is stored in the schema because the number of replicas depend on replication factor, for example, and token metadata combined with replication strategy gives you the actual replication metadata, which describes fully where the data is. And every node in the cluster has its own local view on topology, which includes token metadata and replication strategy from the schema, and this is used by coordinators to route requests, so the access for topology by coordinators is local. It doesn’t make any coordinated calls to other nodes, and token metadata changes propagate across the cluster through the gossip protocol. The gossip protocol is an eventually consistent protocol. Every node owns its tokens and broadcasts the information about owned tokens to other nodes, and eventually this information should be known by everyone, and everyone should have the same view on topology, so what’s the problem with this design? It boils down to eventual consistency. To ensure data consistency, even in eventually inconsistent system like ScyllaDB, all coordinators have to agree about the location of data so on topology, and if we use eventually consistent propagation for the topology, we may get state topology in certain scenarios. Let me give you an example of such a scenario. First, we have a cluster with three nodes. Initially every node is up and almost has the same view on topology, on token metadata. Then suppose the cluster goes down and then goes up but except node C, which remains down. The token metadata view which you have inside gossip will not include node C because node C stays silent because it’s down. Still, the local view of every node contains the latest information about topologies just that the gossip doesn’t see, and then suppose you are bootstrapping a new node, node D. It has to learn about the topology and does so using gossip, and it will learn about an incomplete topology because gossip contains incomplete information, and this will become its local view of topology, which is different than the local view of the old nodes, and this is not good because different token metadata means that we get different replica sets for token range, so coordinators will use different quorums, which will cause inconsistent reads, and writes may also go through replica sets temporarily until node C goes up and resynchronizes the view on topology with the new node. And this is of course bad, but there is no reason to freak out yet because this scenario shouldn’t really happen. It shouldn’t happen because our documentation says that you cannot bootstrap a new node if there is any node down in the cluster, so if node C is down, you shouldn’t bootstrap node D, and you should check the status using nodetool status command and do something with node C first. Sure, but it’s risky. Right? You can start by checking the statuses and see that everything is fine and then proceed with bootstrapping, but in the middle, the cluster somehow reboots itself with some nodes down, and no one notices. This is theoretically possible although seems very unlikely, but anyhow, it’s there. Another problem is human factor like administrators are humans and not machines. We can be easily programmed with documentation. When working under stress, they can cut corners, forget things and especially if things seem irrelevant most of the time like checking the statuses. Every time you get status, and everything is fine, it seems as if this doesn’t matter, so you may cut this out when you’re in haste. To solve all those problems, we plan to make the database responsible for consistency under all conditions, which should make an administrator’s life much better, but also because we need this as a prerequisite for something which we plan in the future, which is automatic topology changes. This is going to be part of the elasticity effort, which is adding features like autoscaling, which is about automatically increasing, changing cluster’s capacity by adding or removing nodes and dynamic data partitioning and load balancing, which is about moving data around to optimize data location for changing workloads. For example, to eliminate a hot shard which is overloaded. Those features will require automatic topology changes, which is like having an administrator inside the database itself, so every responsibility which is currently on the administrator would have to be maintained by the database, so it pushes the responsibility back to us, and these automatic topology changes have to be really solid because they can run concurrently with other events in the system like node restarts, manual topology changes. They will work on smaller increments than current topology changes. They will be more frequent, so they have to keep the system consistent at running conditions. They have to be fault-tolerant because we don’t have an administrator to fix things when there is temporary availability loss, for example, and they have to be fast. So as a first step towards this goal, we are moving topology information in a strongly consistent, fault-tolerant storage, and it will be based on Raft. We’ll have a Raft group which includes all the cluster members, which is called raft_group0. We already have it for schema changes, and token metadata will be the state machine, which is managed or replicated by Raft. Changes of token metadata will be translated to Raft commands. This way topology changes can do linearizable reads and writes of the of token metadata, and there is no problem of state topology. We will be able to get rid of gossip as a way to propagate token metadata across the cluster because Raft eagerly replicates, just takes to every node, so it’s like an replication factor all table with auto repair. Request coordinators will still have to use the local view on topology because we don’t want extra costs of coordination when executing those requests, but that’s fine, and topology changes will use linearizable access through Raft for learning and modification of topology information. This way we will be able to get rid of the sleeps, which we currently have to do, to wait, for example, until gossip propagates the information to other nodes or to learn about topology from other nodes. This will make topology changes faster. You can see on this diagram how it looks like the bootstrapping nodes will contact the consistent storage for token metadata first to learn about the current topology and then make changes to it, and because storage is linearizable, it’s ensured that later bootstrapping will see changes made by previous bootstrappings, so this diagram shows how things interact now after the changes, and bootstrapping nodes will learn about topology by contacting their linearizable storage and make changes to it also by contacting the linearizable storage, and because it’s linearizable, it ensures that topology changes which start later see all the effects of topology changes which completed earlier. Another thing which we need to change is to make a database responsible for serializing topology changes. Topology changes cannot be made concurrently. They will mess up the system state if they execute concurrently, and currently it’s a responsibility of the administrator to make sure that there are no concurrent changes, and we want the database to automatically ensure this and to do that by adding logs, which we will take using linearizable compare-and-set on topology lock registers. When the topology change starts, we will acquire the lock, possibly blocking until it’s released and release when the topology change completes or is aborted. Another thing which we need to do and will do using Raft is automatic transaction failover. Topology changes are multistep processes called sagas. They have many steps like taking the lock, changing token metadata, sending data and so on until it’s complete, and the log is released, and this saga needs to be orchestrated by someone, by some process in the cluster. And if the orchestrator dies, this saga will stall, leaving the topology change in an incomplete state with logs held, which may block other topology changes, which is not a desirable scenario. Automatic topology changes cannot wait for admin intervention in this case, so we need to make orchestration fault-tolerant so that it resumes or aborts automatically in such scenarios, and we can do that by using Raft and features it provides. You can keep the state of the saga inside a fault-tolerant linearizable storage and make sure the orchestrator runs where the Raft leader of the topology change group runs. So as long as there is a quorum of nodes aligned in the cluster, their orchestrator will be able to make progress, and the topology change will be able to proceed. So for examples of the current leader on which the orchestrator runs dies. Raft will make sure that this is detected, and it will elect a new leader, and this will then cause the new orchestrator to be started on that node, and the new orchestrator takes over the topology change saga by contacting the storage to read the current state and takes over from there and completes the transaction. Another thing on our road map is to invest in immediate consistency, which brings the technology of Raft to user tables and allows users to create strongly consistent tables which are based on Raft. We already provide strong consistency in the form of lightweight transactions, which are Paxos-based, but they have several drawbacks. Generally they are slow. They require three rounds to replicas for every request, and they have poor scaling if there are conflicts between transactions. If there are concurrent conflicting requests, the protocol will retry due to conflict, and as a result, they may not be able to make progress, and this will not scale good, and Raft doesn’t suffer from this issue. First and foremost, it requires only one round to replicas when you’re on leader or less because it can batch multiple commands in a single request, and also it supports pipeline, meaning that it can keep sending commands without waiting for preview commands to be acknowledged, and the pipelining goes down to a single CPU on which every following state machine runs, which means that the whole thing then is supposed to have high throughput. But it’s not all roses. Raft has its drawbacks. Because there is a single leader, Raft tables may experience latency when the leader dies because the leader has to undergo a failover. Most of the delay is actually due to detection latency because Raft doesn’t switch leader back and forth so easily. It waits for 1 second until it decides to elect a new leader. Lightweight transactions don’t have this, so they theoretically are more highly available. Another problem with Raft is that you have to have an extra hop to the leader when the request starts executing not on the leader, but this can be remedied by improving drivers to make them leader-aware and route requests to the leader directly. Another drawback of Raft tables would be that they need a lot of Raft groups to distribute load among shards evenly. That’s because every request has to go through a single CPU, the leader, and you have to have many such leaders noted to have even load. Lightweight transactions are much easier to distribute, so let’s have a closer look at this problem. This is how load is distributed currently using our standard partitioning, the token partitioning, which also applies to tables, which is lightweight transactions. Replication metadata, which is per keyspace, determines the set of replicas for a given key. The request then is routed to every replica, and on that replica there is a sharding function which picks up the CPU in which the request is served, which owns the data for a given key. The sharding function makes sure that the keys are evenly distributed among CPUs, and this provides good load distribution. The story with Raft is a bit different because there is no sharding function which is applied on the replica. Every request which goes to a given Raft group will go to a fixed set of route state machines and route leader, and their location of CPUs is fixed. They have a fixed route, so the log distribution is not as good as with sharding function with standard tables. We could remedy the situation by creating more tokens inside the replication metadata so that we have more ranges and more narrow ranges, but this on the other hand creates a lot of Raft groups, which may lead to explosion of metadata and management overhead because of route groups. The solution to this problem will depend on another technology which we’re going to introduce, which is called tablet partitioning. In tablet partitioning, replication metadata is not for keyspace. Every table has a separate replication metadata, and for every table, the range of keys lightweight token partitioning is divided into ranges, and those ranges are called tablets. Every tablet has a replica according to replication strategy, and the replica lives on a particular shard of every node, so unlike with token metadata, requests will not be routed to nodes but will be routed to specific shards. This will give us more control over where data is. This gives us more final control over distribution node data, so the system will aim to keep the tablets at manageable size. They can’t be too small so that we have a few of them so that we have low metadata overhead associated with having tablets, but they will not be too large so that we have many of them, enough to be able to balance the load by moving tablets around, and tables will start with just a few tablets, small tables. We don’t have a lot of data on this. May end there, and this is a good thing because unlike with the token partitioning, the data will not be fragmented into many tiny fragments, which is management overhead and also impacts negatively performance. Data will be localized in large chunks which are easy to consume. As tables grow, as they accumulate data, eventually they hit a threshold, and they have to be split. For example, the amount of disk space which is used by a tablet grows above some threshold, or the tablet becomes popular with requests hitting it, and it’s beneficial to split it and move it so that the load is distributed. So the tablet partitioning load balancer will decide to split it, and you also may decide to move it either within the same node to balance the shards or across the nodes to balance the global load in the cluster. This will help to relieve overall the shards and balance utilization in the cluster. It also depends on those fast, reliable, solid, fault-tolerant topology changes, which I talked about earlier because this will be automatic and work on smaller commands and will be happening maybe frequently, and it helps with Raft tables. Raft tables will piggyback on tablets. Every route will be associated with exactly one tablet, and the Raft servers will be associated with route replicas, so moving the tablet also moves the Raft server. Turns out the tablets will also help with other things. For example, resharding will be very cheap. SSTables are split at tablet boundary, so resharding is only a logical operation which reassigns tablets to shards. Cleanup is also cheap because cleaning up all data after a topology change is just about deleting the SSTables. Okay. This is all I have for you. I hope you enjoyed it, and if you have questions, feel free to reach out.