Hey. I’m Kostja. Today I’m going to talk about making schema changes go through Raft. So what are schema changes, and what is Raft? Let’s discuss that. The easiest way to think about a database schema comes from the world of relational databases, where a schema is a collection of relational tables. Each column has a column name and a data type, meaning all cells in this column conform to the constraints of the type.
The need for a schema comes from a desire to save on storage, specifically to avoid storing the same name and type information in each cell, as well as to make relational algebra possible. Join conditions are based on cell equality, something difficult to define for cells with different types. Just like a relational database, ScyllaDB requires column types and enforces type constraints. It also supports complex types, such as sets, maps, lists and user-defined types. This makes data representation more compact compared to document databases, but it does require the clients to think through their schemas when designing an application. A less commonly considered part of a database schema are replication and storage properties, indexes, views and access rights. Finally, there is the power of data definition verbs to change the existing schema: it can be possible to add or drop columns, change column types, add unique, check and foreign key constraints. Replicating schema statements across nodes has to use its own path. A schema change impacts all nodes in the cluster and requires coordinated error handling: you can’t fail an operation on one node and succeed on another. Some data definition operations require a complete scan or even a rebuild of the data in a table. Building a unique index must check that the table has no duplicates in the given column. Changing a column type might require converting each cell from one physical representation to another. ScyllaDB currently avoids rewriting data in schema change operations. Instead, it transforms data on the fly to make it conform with the client schema. For example, if the current schema doesn’t contain a column, but the actual row still stores it, the column is removed from the row before it’s sent to the user. The advantage of this approach is that the existing schema change statements are lightweight and less error-prone.
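The on-the-fly transformation idea can be sketched in a few lines of Python. This is a toy model, not ScyllaDB's actual code; names like `conform` and `CURRENT_SCHEMA` are invented for illustration:

```python
# Instead of rewriting stored data when the schema changes, each row is
# made to conform to the current schema at read time.

CURRENT_SCHEMA = {"id", "name"}  # hypothetical set of live column names

def conform(row, schema):
    """Drop columns that are no longer part of the schema."""
    return {col: val for col, val in row.items() if col in schema}

# A stored row still carries a column that was dropped from the schema;
# it is filtered out before the row is sent to the user.
stored = {"id": 1, "name": "Ada", "nickname": "ad"}
assert conform(stored, CURRENT_SCHEMA) == {"id": 1, "name": "Ada"}
```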
Some complex data transformations, however, have to be done by the client, which has to concern itself with the operation’s consistency. In a distributed environment, each node can have a slightly different version of the schema. To be able to return consistent results in this environment, ScyllaDB tags data retrieval operations passed between nodes with a schema version. The receiving node must make sure that the returned data is in the format required by the client. If it’s a newer format that the node is not yet aware of, it will request the schema information from the sender, transform the row format and then send back the transformed rows. OK, that’s about database schema. So what can you do with schema in ScyllaDB? You can create tables, views and user-defined types, add or drop columns, truncate data, define roles and set access rights. There are things you can’t do. You can’t change column types: you can do compatible changes like a text-to-BLOB promotion, but you cannot change an integer to a string. There are no unique secondary keys, triggers or constraints, and you cannot rename a column unless it’s a primary key column. Some of the unsupported operations are hard to implement in an eventually consistent environment. It would be difficult to support unique secondary keys unless the definition of uniqueness is the same on all nodes: a duplicate might slip through the cracks when a node has been added and data is being moved from one node to another. Being a multi-consistency-model database, it’s not impossible that ScyllaDB will eventually get these advanced schema features. So we’d better begin building the foundation for them now. To see why ScyllaDB supported some schema changes and not others, let’s consider how schema changes may fail and how ScyllaDB recovers from such a failure. One obvious kind of failure is a down node. Schema changes are allowed to proceed even if all but one node in the cluster are down. What happens then?
When the nodes come back up, they’ll learn through gossip that the cluster has received a schema update and fetch it from the node that has it. But what if a node is not down but is partitioned away? It’s possible that two subsets of the cluster operate on different versions of the schema. Eventually, when the partitioning heals, nodes learn about the changes and run compaction, and their data will be converted to the most recent format. A more complex scenario is when two concurrent changes conflict. Imagine one part of the cluster adds a column to a table, and another part adds a different column to the same definition. The schema with the most recent timestamp is going to win, but if the column received any updates corresponding to the shadowed definition, they will be lost after schema reconciliation. There are many other ways in which eventual schema consistency may backfire. You may not see the tables you just created. Schema agreement happens via gossip, so it can easily take seconds; if the node receiving the write is not the same node which received the schema change, it does not learn about it immediately. There is no prevention against duplicate attempts to create any object, not just columns but also keyspaces, tables, views and indexes: concurrent attempts might both succeed, with the object which has the newer creation timestamp shadowing the older one with the same name, along with its entire contents. Clients may get incorrect errors, since schema changes use the local node’s view of the schema, which might not be fully up to date. Not all versions of the schema can be merged with eventual consistency rules. Dropping a user-defined type while a schema relies on it renders the table unusable: the contents of the table become inaccessible. Concurrent changes to the same object can be lost, such as concurrently added columns, a long-standing problem in Cassandra as well.
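To make the lost-update problem concrete, here is a toy Python model of timestamp-based reconciliation. The shape of the definitions is invented for illustration; it is not how ScyllaDB actually stores schema:

```python
# Last-write-wins reconciliation: the whole definition with the newer
# timestamp wins, shadowing everything in the older one.

def reconcile(def_a, def_b):
    return def_a if def_a["ts"] >= def_b["ts"] else def_b

# Two partitions each add a different column to the same table.
def_a = {"ts": 100, "columns": ["id", "email"]}  # partition A added "email"
def_b = {"ts": 101, "columns": ["id", "phone"]}  # partition B added "phone"

winner = reconcile(def_a, def_b)
# "email" vanishes from the merged schema, so updates written to it under
# partition A's definition are lost after reconciliation.
assert "email" not in winner["columns"]
assert winner["columns"] == ["id", "phone"]
```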
Interaction of schema changes and topology changes produces another bouquet of issues. Changing a keyspace replication factor does not take immediate effect and, if done during a topology change, can lead to data loss. A failure during a topology change can violate lightweight transaction linearizability. A write might be lost during topology changes. To sum up, the only dubious advantage of the old algorithm is total liveness: the cluster is able to make progress with data definition verbs even in the presence of a majority failure. The client is responsible for making schema changes always through the same node, thus manually enforcing schema linearizability. Schema reconciliation algorithms are not a feature. Their specific behavior is not documented. Rather, they are a best effort to patch up otherwise undefined behavior. I’m spending such a significant amount of time on the problem because it is quite big, and it has been acknowledged in the Cassandra and ScyllaDB communities as far back as 2015, with duplicate bugs constantly piling up. Strongly consistent features of ScyllaDB such as lightweight transactions and the upcoming tablets aren’t really consistent unless DDL and topology changes are consistent as well. So welcome Raft, the base algorithm used for strong consistency in ScyllaDB 5.0. Let’s talk about it in more detail. Raft is called a protocol for state machine replication. A database is a kind of state machine, and replicating a database means having the same copy of the data on every node. For schema, the replicated state is the keyspace, table and view definitions. By means of Raft, we can make sure each cluster node not just has the same copy of the data but applies all the state changes, that is, data definition commands, in the same order. Moreover, if nodes restart, join or leave, the order must stay the same. System liveness must be preserved as long as the majority of the cluster is up.
Handling of node failures, joining and leaving is part of the Raft protocol, and it was one of the big reasons why ScyllaDB settled on using it for schema consistency. Let’s look into the protocol in a little more detail. For a deep dive into Raft, I recommend “Raft Study,” a video lecture by John Ousterhout, as well as the Raft PhD thesis; the key chapters are one through four. If you are looking into writing your own implementation, having studied many, I encourage you to look into ScyllaDB Raft. In my opinion, it’s highly isolated, well commented and carefully tested. Besides, it was just written in 2021, so it’s very fresh. So what’s the idea of a replicated state machine? Suppose you had a program or an application you wanted to make reliable. One way to do that is to execute that program on a collection of machines and ensure they all execute it in exactly the same way. A state machine is just a program or an application that takes inputs and produces outputs. A replicated log can help make sure that these state machines execute the same commands. Here’s how it works. A client of the system that wants to execute a command passes it to one of the state machines. You can see them in these boxes. The command, let’s call it X, then gets recorded in the log of the local state machine, and then, in addition to that, the command is passed to all the other machines and recorded in their logs as well. Once the command has been safely replicated in the logs, it can be passed to the state machines for execution, and when one of the state machines has finished executing the command, the result can be returned back to the client program. And you can see that as long as the logs of the state machines are identical and the state machines execute the commands in the same order, we know they are going to produce the same results.
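The replicated state machine idea boils down to determinism plus a shared log. A minimal Python sketch, under the assumption of a trivial key-value state machine:

```python
# If every replica applies the same log of commands in the same order,
# all replicas end up in the same state. Raft's job is guaranteeing
# that the log is identical everywhere.

def apply_log(log):
    state = {}
    for op, key, value in log:  # a deterministic state machine
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]
replicas = [apply_log(log) for _ in range(3)]  # three identical replicas
assert replicas[0] == replicas[1] == replicas[2] == {"y": 2}
```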
The consensus module, which you can see on the left, ensures that the command is replicated and then passed to the state machine for execution. The system makes progress as long as any majority of the servers are up and can communicate with each other. So if you have a cluster of three nodes, you need two nodes up to be able to do schema changes; with five nodes, three nodes up, and so on. In Raft, the servers are not equal at any given point in time. Clients communicate through the leader, and the leader communicates with other servers to replicate commands. This decomposes the consensus algorithm into two problems: normal operation when there is a running leader, and what you do when a leader crashes and the cluster needs to elect a new leader. Being an otherwise asymmetric, leader-based algorithm, Raft falls back to symmetry for leader election. Any follower that doesn’t get updates from the leader for the duration of an election time-out (an election time-out is just a period of time defined in the configuration) is able to become a candidate and request the other members of the cluster to vote for it. The candidate which gets the majority of votes declares itself the leader and begins replicating logs to all other members of the cluster. Like in any kind of voting, split votes, situations when no candidate gets the majority of votes to become a leader, are possible with Raft. In November 2020, Cloudflare recorded a few-hour outage due to a prolonged failure to elect a leader in the presence of an asymmetric network failure. An asymmetric network failure is when packets are routed from one part of the network to the other but not back. So nodes would repeatedly time out and request votes, and this would upset the existing leader.
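The majority rule mentioned above is simple arithmetic; a one-line sketch:

```python
# With n servers, any majority of floor(n/2) + 1 must be up and able to
# communicate for the cluster to make progress.

def quorum(n):
    return n // 2 + 1

assert quorum(3) == 2  # a 3-node cluster tolerates 1 failure
assert quorum(5) == 3  # a 5-node cluster tolerates 2 failures
```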
Newer versions of Raft implement a special extension called prevoting, mandating that fresh candidates dry-run an election before starting a real one and disrupting the existing leader. Nodes which get pings from the current leader vote negatively during the prevote. If prevoting does not collect a majority of votes, the candidate doesn’t start the real election. ScyllaDB has prevoting implemented and always on. It turns out this allows us to simplify other parts of Raft, specifically to drop the Raft rules for sticky leadership. This was made possible thanks to our decision to implement Raft from scratch on top of Seastar rather than adopt an existing implementation. Another reason Raft is so valuable for ScyllaDB is that cluster configuration changes, known to us as topology changes, are a part of Raft core. In order to add or remove a node in Raft, the client applies a special command to the log, and once it’s replicated to the current majority, a new majority is formed. ScyllaDB uses an extended two-step configuration change procedure, allowing you to add or remove more than one node at a time, or to add and remove nodes in a single configuration change. This allows us to replace a node without the risk of rendering the cluster unusable if the replace fails. Imagine you have just three nodes. You want to replace one node, and another node fails while you’re replacing. Technically, you have lost the majority. With a single-step replace, when you swap one node for another in one configuration change, this is impossible. Another advanced feature of ScyllaDB Raft is being able to add a nonvoting member to the cluster. Nonvoting members act as normal Raft nodes, but they can neither vote nor get elected. In ScyllaDB, new nodes join the cluster as nonvoting members; thus a join failure doesn’t impact the Raft quorum, the rules for determining the majority and thus making progress. A node that failed to join can be easily removed,
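Here is a toy model of the prevote round (function and field names are invented, not ScyllaDB's API): the candidate dry-runs an election, and followers that still hear from the leader vote "no", so a flaky node cannot disrupt a healthy leader.

```python
def prevote_passes(followers, quorum_size):
    """Count dry-run votes; a real election starts only on a majority."""
    votes = 1  # the candidate votes for itself
    votes += sum(1 for f in followers if not f["hears_leader"])
    return votes >= quorum_size

# 5-node cluster: a candidate plus 4 followers, 3 of which still get
# pings from the leader, so they vote negatively in the prevote.
followers = [{"hears_leader": True}] * 3 + [{"hears_leader": False}]
assert prevote_passes(followers, quorum_size=3) is False  # no real election
```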
even if some nodes in the cluster are down. Once a node has completed advertising tokens and transferring data, it becomes a full voting member of the ring. There are other ways in which ScyllaDB Raft is special. Given that ScyllaDB clusters can be quite large, we paid special attention to making sure elections are swift. ScyllaDB randomizes each node’s election time-out, the interval after which the node starts an election if it doesn’t hear from the current leader, and spreads this interval proportionally to the cluster size. Even in a 1,000-node cluster, each node will start an election roughly in its own slot, in its own time, allowing it to request votes from followers without interference from other candidates. Thanks to this, the typical time to elect a new leader is 1 to 3 seconds, even in large clusters, and split votes are rare. We also added support for a Raft read barrier and automatic forwarding of commands to the leader. This made ScyllaDB Raft more symmetric, allowing followers to execute major parts of schema changes, offloading the leader, which only needs to log the mutations. Mutations are the changes of system table rows that contain the essence of the schema change. Finally, we greatly reduced the cost of failure detection. We allow multiple Raft instances, we call these instances Raft groups, on a single node, served by a single failure detector. We plan to run a mini Raft cluster, essentially one per replica set, for each tablet, so reduced failure detection overhead on a single replica is very important to us. And there is another talk at this conference by Tomasz Grabiec about our future plans with Raft, where we discuss tablets and strong consistency and how we plan to build them on top of Raft. The ScyllaDB Raft implementation also gets very special testing. We isolated the core of the implementation from disk and network.
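The time-out spreading can be sketched like this. The constants are made up for illustration; ScyllaDB's actual values and randomization scheme differ:

```python
import random

BASE_MS = 1000  # illustrative base election time-out
SLOT_MS = 10    # illustrative width of one node's slot

def election_timeout(cluster_size):
    # Randomize over a window proportional to cluster size, so each
    # candidate tends to wake up in its own slot and can gather votes
    # without interference from other candidates.
    return BASE_MS + random.randrange(cluster_size) * SLOT_MS

timeouts = [election_timeout(1000) for _ in range(100)]
assert all(BASE_MS <= t < BASE_MS + 1000 * SLOT_MS for t in timeouts)
```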
It’s like a state machine which we test extensively without having to create physically large clusters. We create virtual 1,000-node clusters, and a single minute-long test runs 1,000 configuration changes and hundreds of thousands of events, along with network failures, starts and stops of instances, node failures and packet drops. This was all possible because of the way we approached the implementation. Raft addresses many issues but provides only basic initial cluster setup. In ScyllaDB, it has long been a rule that nodes need to be added to the cluster one at a time. An operator has to wait for the joining node to advertise its tokens and complete streaming before joining the next node. Apart from slowing down the overall join procedure and limiting the elasticity of the cluster, this procedure is also inherently unsafe. There is no safety net for operator error. For example, starting two new nodes at once could lead to some ranges being replicated incorrectly. Joining relied on gossip to propagate tokens, which added constant unpleasant delays to the procedure. With the transition to Raft, we wanted to address this problem as well. We introduced a new protocol for cluster assembly which we call cluster discovery. The idea of the protocol is as follows. If a joining node is able to contact any cluster node which is already part of an existing Raft group, it will use that group’s Raft configuration to join itself. But if you start several fresh nodes, there is no cluster yet, and we want to start multiple nodes at a time, so we try to build a transitive closure of all of the nodes in the cluster by contacting peers. The node continuously discovers its peers until it can build a complete map of the cluster, and as soon as it is able to do that, it may start a new Raft group. So for example, consider a fresh ring of five nodes which haven’t formed a cluster yet.
Imagine each node in the ring has information about initial contact points, say taken from the configuration file, the seeds section. The node contacts these initial peers and finds out there are other nodes in the cluster. It continues doing so until no new members appear and all existing members have responded. Then a new Raft configuration can be formed, because the node knows that it has discovered all of the nodes in the cluster. Distributed protocols are not considered safe unless proved correct. To validate the correctness of this discovery protocol, we created a TLA+ specification and ran it to completion for all reasonable cluster sizes. TLA+ is a special language for formal program validation, and it has been used to validate such distributed protocols as Paxos, Raft and so on. So we just applied the same tool for our needs. In 5.0, we must support pre-Raft nodes, so the discovery protocol lives side by side with gossip. In the future we’ll be able to switch to it completely and make a ScyllaDB cluster boot in just subsecond time. Well, with such a strong foundation implemented, schema changes should be a breeze, right? Well, maybe. So let’s now talk about the core of the feature in 5.0 and see if it’s a simple implementation or not, and what we have actually changed. As you perhaps know, ScyllaDB stores all schema definitions in system tables. Each node has a copy, and the node which received the mutation, the change, like create table or drop table, propagates this mutation to all of the nodes. So the node that makes the change is responsible for propagating the mutation, and the biggest change in 5.0 is that this propagation now happens not through the migration manager, the legacy component responsible for it, but through Raft, the reliable component we just coded. All cluster nodes become part of a Raft group. We call it Group Zero internally, because it’s the first group ever.
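Going back to the discovery protocol for a moment, its transitive-closure loop can be sketched in Python. `PEERS` is a stand-in for real RPC calls to live nodes; the topology is hypothetical:

```python
# Starting from the seeds, keep asking every newly learned node for its
# peers until the set stops growing.

PEERS = {  # who each node initially knows about (hypothetical)
    "n1": {"n2"},
    "n2": {"n3", "n4"},
    "n3": {"n1"},
    "n4": {"n5"},
    "n5": set(),
}

def discover(seed):
    known, to_contact = {seed}, [seed]
    while to_contact:
        node = to_contact.pop()
        for peer in PEERS[node]:  # ask the node for its peer list
            if peer not in known:
                known.add(peer)
                to_contact.append(peer)
    return known  # complete map: safe to form the initial Raft group

assert discover("n1") == {"n1", "n2", "n3", "n4", "n5"}
```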
The group has a single leader which is actively pushing all changes to all members. Any member willing to make a change forwards it to the leader, which commits it on the majority before materializing it as a new schema version. If a node is disconnected from the leader or from the majority, it can no longer make a schema change. For a connected node, the steps are as follows. Before a node executes a command, it reads the latest schema by issuing a special Raft read barrier. Why do you need the read barrier? Imagine you want to drop a table. You need to make sure that the table exists, and if it doesn’t, you need to return the proper error to the client. Just dropping the table optimistically, or dropping it based on an outdated state, would be wrong. Similar validations are relevant for all DDL commands, and as a result of them a change to the system schema is built, which is recorded in the Raft log. But what if two nodes try to make changes at the same time? Both of them might be able to record their changes in the log. Raft will pick some order for them, but we don’t know which order, and we would like to prevent this. Also, if the connection to the leader is flickering, we may end up in a state of uncertainty, when the command is executed but the caller of the command doesn’t know the outcome. Maybe the leader changed during the command, and we still need to know whether it has executed or not. To protect against double execution or execution in the wrong order, each command is signed with an old and a new schema ID. Schema IDs are just UUIDs we maintain for every schema change. When the command is applied, the current state of the schema must match the old ID, and the new ID is recorded in the schema version history so that conflicting commands may fail. In cases of uncertainty, this change ID, the schema ID, is used to validate that the command was indeed applied to the state machine, because we can look it up in the schema history table.
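A toy model of the old/new schema-ID check (names like `apply_ddl` are invented; this is a sketch of the idea, not ScyllaDB internals): each command carries the ID of the schema state it was built against, and if another DDL slipped in first, the IDs no longer match.

```python
import uuid

state = {"schema_id": uuid.uuid4(), "history": []}

def apply_ddl(old_id, new_id):
    state["history"].append(new_id)    # the ID is recorded either way
    if state["schema_id"] != old_id:   # a race: schema changed under us
        return False                   # no-op; the caller must retry
    state["schema_id"] = new_id
    return True

observed = state["schema_id"]          # what the read barrier returned
first, second = uuid.uuid4(), uuid.uuid4()
assert apply_ddl(observed, first) is True    # applies cleanly
assert apply_ddl(observed, second) is False  # built against stale state
```

On a `False` result the caller issues a fresh read barrier, revalidates the statement against the new schema and resubmits, which is exactly the restart described next.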
If a race is detected, we still record the ID in the schema history, but the application of the command, the actual DDL, turns into a no-op, and the entire procedure is restarted. So if we try to apply the DDL after another DDL has sneaked in, we just resubmit the DDL to Raft; there is no way we could apply it on top of a state which has changed since the read barrier we issued. This is why we need the read barriers and these IDs. So this makes all DDL statements go in a single, defined order on all of the nodes, but it doesn’t say anything about the interaction between DDL and DML, data manipulation statements like insert, update and delete. Did the switch to Raft have any impact on performance or availability of data manipulation statements? The way insert, delete or update works with the schema in 5.0 is similar to 4.0. The coordinator signs the mutation. The mutation is something the coordinator creates for the actual statement, to store in memtables and then in SSTables, and every mutation is still signed with the schema version. If a mutation with an older version arrives at a node with a newer schema, it’s automatically converted to the newer version. If a mutation with a newer version arrives at a node with an older schema, that node will fetch the schema from the coordinator. So this is just like before. What has changed, then? Whatever this older node fetches now comes in a strict order, so whenever this fetch happens, the nodes get the changes in the same order. That’s the biggest change, and it makes sure that many anomalies are removed. We still don’t require the schema fetch to go through the leader; this is necessary to keep the whole procedure live even in the presence of a majority loss. So this preserves DML availability guarantees during Raft leader changes or network partitions. During these events, we can still write to the cluster.
DML continues to be eventually consistent and available during these events. You just can’t do DDL statements; you can do DML. Besides, the Raft leader is constantly pushing schema updates to all nodes, so when you restore connectivity, the cases of outdated schema are much more rare. Conflicts between DDL and topology changes are still resolved eventually. Specifically, imagine you change a keyspace replication factor. That change will still take place without actual data replication to conform to the new replication factor, and if that happens when adding a new node, we still use gossip to wait for schema agreement, so we might not replicate all of the ranges. So let’s recap. Starting from ScyllaDB 5.0, concurrent DDL is safe, safe in the sense that DDL is protected from DDL. Anomalies such as spurious errors, shadowed keyspaces, or tables and columns that disappear are impossible now. Schema propagation happens much faster, making it easier to write day-to-day applications. The feature still requires an experimental switch, and once the switch is enabled, there is no downgrade path, because we begin storing the data differently: we persist this feature in the system tables, so you cannot downgrade. We are actively working on weeding out all of the bugs and removing the experimental switch, and we hope to do that this year. Heterogeneous clusters, like 4.0 and 5.0 mixed, continue to work, but they will use the old gossip-style communication to propagate schema. Starting from the next major release, Raft schema management will become the default, making the problems we discussed today strictly legacy. Now for the bad news: not all changes introduced by Raft are rosy. There are cases when Raft’s preference for consistency over availability may impact production deployments. Raft preserves liveness as long as you have a majority of nodes. A split-brain in a 2-DC setup is one notable case when the majority can easily be lost.
A 2-DC setup is often symmetric: you have the same number of nodes in one DC as in the other. ScyllaDB 5.0 with Raft will not admit any DDL statements in case of a split-brain; DML will continue to work. We are looking into introducing a nodetool command to promote such an isolated part of the cluster into a new one, but if your network split is temporary, this might not be the solution for your problem. So what are the next steps? The use of Raft doesn’t stop with schema changes. There is a nice talk by Tomasz Grabiec about the future of consensus in ScyllaDB 5.0 and beyond, which I invite you to attend to see how we plan to implement topology changes on Raft and the strongly consistent tables which we call tablets. Thanks. Thanks for attending. This was a session about schema changes on Raft in ScyllaDB 5.0.