Hello, and welcome to this series of lightning talks by various ScyllaDB developers. We’ll begin with Asias, with a talk about repair-based node operations. Just a reminder: a node operation is when you bootstrap or decommission a node, and we’ve done some work to move that to using repair instead of streaming. Asias will explain how and why.
So here is Asias. Hi, everyone. Welcome to my talk. This is Asias here. In this talk, I will cover repair-based node operations in ScyllaDB and two more related features: off-strategy compaction and gossip-free node operations. First, what are repair-based node operations? It is a new feature that uses row-level repair as the underlying mechanism to sync data between nodes, instead of streaming, for node operations like adding or removing a node. There are multiple benefits to using repair-based node operations. It brings significant improvements in both performance and data safety. For example, we can resume from a failed bootstrap operation without re-transferring data already synced to the node, which speeds up the operation. It also guarantees that a replacing node has the latest replicas, to avoid data-consistency issues. It also simplifies the procedures: there is no need to run repair before or after operations like replace or remove node. Another benefit is that we can use the same mechanism for all the node operations. We have enabled repair-based node operations by default for the replace operation, and we will enable more operations by default in the future. All node operations are supported, and we have added options to turn on this feature for specific node operations; for instance, below we enable both replace and bootstrap to use it. In addition, we are improving the IO scheduler to make sure node operations have as little latency impact as possible, so we can enable more operations by default. You’ll hear more about the IO scheduling work in Pavel and Avi’s talk. The idea of off-strategy compaction is to make compaction during these operations more efficient, to speed up the node operations. It keeps the SSTables generated by an operation in a separate data set, compacts them together only when the operation is done, and then integrates the new SSTables into the main set.
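For reference, the per-operation switches Asias mentions might look like this in scylla.yaml. This is a hedged sketch: the option names below (`enable_repair_based_node_ops`, `allowed_repair_based_node_ops`) match ScyllaDB’s documented configuration around this release, but check your version’s docs before relying on them.

```yaml
# Use repair instead of streaming as the data-sync mechanism
# for node operations:
enable_repair_based_node_ops: true

# Limit the feature to specific operations -- here replace and
# bootstrap, as in the talk's example:
allowed_repair_based_node_ops: replace,bootstrap
```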
It is more efficient this way than adding the SSTables into the main set as soon as they are generated. As a result, we have less compaction work during node operations, which makes them faster to complete. We have enabled off-strategy compaction for all the node operations. For repair, we wait a while for more repairs to come before triggering off-strategy compaction; for other operations, we trigger it immediately at the end of the operation. This is all automatic; no action is needed by users to enjoy this feature.

Last but not least, I will talk about gossip-free node operations. It is a new feature that uses RPC verbs instead of gossip status to perform operations like adding or removing a node. The feature brings multiple benefits. First, it is safe to run node operations by default, because it requires all the nodes in the cluster to participate in topology changes. This avoids data-consistency issues if some nodes are network-partitioned. For example, if a node is partitioned and not aware of a bootstrap operation, this node will use a wrong token ring to route traffic, which can cause data-consistency issues. In case some nodes are really dead and gone and the user does want to run node operations anyway, we have introduced the ignore-dead-nodes option to ignore them. Another important improvement is that, with this feature, all the nodes in the cluster can revert to the previous state automatically in case of an error during an operation. For example, when a replace operation fails, the existing nodes will remove the replacing node as a pending node immediately and automatically, as if the replace operation had never been performed. At the moment, it is not allowed to add multiple nodes in parallel or perform multiple node operations at the same time; with this feature, we can reliably detect whether there are any pending operations, to avoid operation mistakes.
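As an illustration of the ignore-dead-nodes option, a replace operation that skips permanently dead peers might be configured like this in scylla.yaml. The option names follow ScyllaDB’s documentation, but the host IDs are hypothetical placeholders; treat this as a sketch, not an authoritative reference.

```yaml
# Host ID of the dead node being replaced (hypothetical value):
replace_node_first_boot: 675ed9f4-6564-4dbd-95c4-b1db5ad3aa31

# Other nodes that are permanently down and should be ignored
# during the topology change (hypothetical value):
ignore_dead_nodes_for_replace: 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c
```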
Each operation is now assigned a unique ID, to make it easier to identify node operations cluster-wide. We have enabled this feature for the bootstrap, replace, decommission and removenode operations. No new action is needed to use it. Thank you.

So thanks, Asias, and the next feature is quite an exciting one: repair-based tombstone garbage collection. This is exciting because repair can be a stressful feature to manage. If you lag behind your repairs, you run the risk of data resurrection, and nobody wants that, so we came up with repair-based tombstone garbage collection, which allows the database to track the times when you ran repair and to manage tombstone purging by itself. So here is Asias again, with an explanation of repair-based tombstone garbage collection.

In this talk, I will first cover the background of tombstones in ScyllaDB, explain the timeout-based tombstone GC method, and introduce the new repair-based GC method. Tombstones in ScyllaDB are used to delete data. Both the data and the tombstones are replicated to multiple nodes. Another thing to note is that we cannot keep tombstones forever; otherwise, if we have a lot of deletes, the database will accumulate an unbounded amount of tombstones. The solution to this problem is to drop tombstones when they are not necessary anymore. Currently, we drop a tombstone when the following happens: first, the data covered by the tombstone and the tombstone itself can be compacted away together; second, the tombstone is old enough, that is, older than the gc_grace_seconds option, which is 10 days by default. However, the tombstones might be missing on some replica nodes. For instance, a node was down and no repair was performed within gc_grace_seconds. As a result, data resurrection could happen. For example, the nodes with the tombstones GC them, so the tombstones no longer cover the data; the node that did not receive the tombstones still has the data that was deleted; in the end, a read could return deleted data to the user.
I will call the current GC method timeout-based tombstone GC. With this method, users have to run a full cluster-wide repair within gc_grace_seconds. This GC method is not robust, because correctness depends on cluster operations, and nothing guarantees that repair can finish in time. Since repair is a maintenance operation, it has the lowest priority: if there are more important tasks like user workload and compaction, repair will be slowed down. This adds pressure on the people who operate ScyllaDB to finish repair within gc_grace_seconds. In practice, users may want to avoid repair to reduce performance impact and serve more user workload during critical periods, for instance during holidays. So we need a more robust solution. We implemented a repair-based tombstone GC method to solve this problem. The idea is that we GC a tombstone only after repair is performed. This guarantees that all replica nodes have the tombstone, no matter whether the repair finished within gc_grace_seconds or not. This feature brings multiple benefits. There is no need to figure out a proper gc_grace_seconds number; in fact, it’s very hard to find one that works for all workloads. There is no more data resurrection even if a repair was not performed in time, so there is less pressure to run repair in a timely manner. As a result, we can throttle repair intensity even more to reduce latency impact on the user workload, since there is no more hard requirement to finish repair in time. If repair is performed more frequently than gc_grace_seconds, tombstones can be GCed faster, which gives better read performance. We can use ALTER TABLE and CREATE TABLE to turn on this feature. We have introduced a new option, tombstone_gc, which can be set to one of four modes. The first one is timeout, which means GC after gc_grace_seconds; this is the same behavior as before this feature.
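Concretely, the tombstone_gc option is set per table; the syntax below follows ScyllaDB’s documentation, though the keyspace and table names are hypothetical.

```cql
-- Switch an existing table to repair-based tombstone GC:
ALTER TABLE ks.events WITH tombstone_gc = {'mode': 'repair'};

-- Revert to the pre-feature behavior (GC after gc_grace_seconds):
ALTER TABLE ks.events WITH tombstone_gc = {'mode': 'timeout'};
```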
Repair means GC after a repair is performed; disabled means never GC; and immediate means GC immediately, without waiting for repair or timeout. When to use the immediate mode? It is mostly useful for time-window compaction strategy with no user deletes. It’s much safer than setting gc_grace_seconds to 0; we are even considering rejecting user deletes if the mode is immediate, to be on the safe side. The disabled mode is useful when loading data into ScyllaDB, since the tools can generate out-of-order writes or writes in the past, and we do not want to run GC when only parts of the data are available. In such cases, we can set the mode to disabled while the tools are still loading data. In extreme cases, if repair cannot be finished for some reason, we provide a new API to fake a repair history, as an emergency measure to allow GC. A new gossip feature flag is introduced: a cluster will not use this feature until all the nodes have been upgraded. Also, the default mode is timeout; users have to set the mode to repair explicitly to turn on this feature. That’s it. Thank you so much.

Thanks, Asias, and next up is Raphael with some new improvements that we made to compaction: changes to time-window compaction strategy to simplify data modeling for time series, and also improvements to incremental compaction strategy, allowing you to manage space amplification with incremental compaction. Raphael?

Hello, everyone, and welcome to my talk about SSTable compaction enhancements. My name is Raphael Carvalho, and I have been working on ScyllaDB’s storage layer since its early days. Enough about me; let’s move on to the interesting part. In this session, we describe a space optimization for incremental compaction that will allow the storage density of nodes to increase.
Additionally, we’ll cover how ScyllaDB makes it much easier to model time-series data without having to rely on old techniques like data bucketing, which was commonly used to avoid running into large-partition performance issues. Last but not least, we’ll talk about upcoming improvements that will make compaction even better for you. Now let’s take a look back. Incremental compaction strategy, or ICS, was introduced back in 2019 to solve the large space overhead that impacted most users. Before its existence, users were left with no choice but to reserve 50 percent of free disk space for compaction. But was it enough? Well, the aforementioned space overhead was efficiently fixed by incremental compaction. However, it still suffered from bad space amplification when facing write-heavy workloads. That’s because the compaction strategy wasn’t efficient at removing the data redundancy that accumulated across the tiers. We use a theoretical model called the RUM Conjecture to reason about compaction efficiency. It states that a compaction strategy cannot be optimal in all three efficiency goals at once: read, write and space. That’s why we have different strategies available; each suits a particular use case better. If we look at the three-dimensional efficiency space which represents the RUM Conjecture trade-offs, we’ll see that incremental and size-tiered strategies cover a similar region. It turns out incremental compaction can do much better than fixing the aforementioned space-overhead problem. We know for a fact that leveled and size-tiered strategies cover completely different regions in the efficiency space. Also, we know that interesting regions cannot be reached with either of them. However, very interesting regions in the efficiency space can be reached by combining the concepts of both strategies. We call it a hybrid approach. What do we actually want to accomplish with this hybrid approach? Let’s set a few high-level goals.
Firstly, we want to optimize space efficiency for overwrite workloads while ensuring that write and read latency meet service-level requirements. In other words, performance must be sufficient to meet the needs, but space efficiency should be as good as possible, to allow for scale. That’s the space-amplification goal for you: the feature that will help you increase storage density per node, therefore reducing costs. Who doesn’t like that? It’s only available in ScyllaDB Enterprise and can be used with incremental compaction only. Everything was carefully implemented to ensure latency will meet service-level requirements. Compaction will dynamically adapt to the workload: under heavy write load, the compaction strategy will enter a write-optimized mode to make sure the system can keep up with the write rate; otherwise, the strategy will be continuously working to optimize space efficiency. The coexistence of both modes is the reason we call this a hybrid approach, and the adaptive approach combined with the hybrid one is what makes this feature so unique in the compaction world. Let’s get to a bit of action. How do you enable this space optimization? It’s simply a matter of specifying a value between one and two for the strategy option named space_amplification_goal. The lower the value, the lower the space amplification, but the higher the write amplification; 1.5 is a good value to start with. In order to optimize space efficiency, we are willing to trade off extra write amplification. However, the adaptive approach minimizes the impact of the extra write amplification, given that the strategy will switch between the available modes, write-optimized and space-optimized, whenever it has to. To summarize, this nicely gives the user control to reach interesting regions in the efficiency space, allowing this strategy to perform better for your particular use case. Now comes the interesting part: the optimization in action.
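As a sketch of what enabling this looks like (ICS is ScyllaDB Enterprise only; the keyspace and table names are hypothetical, and the option names follow the Enterprise documentation):

```cql
-- space_amplification_goal takes a value between 1 and 2; lower means
-- less space amplification at the cost of more write amplification.
ALTER TABLE ks.readings WITH compaction = {
    'class': 'IncrementalCompactionStrategy',
    'space_amplification_goal': '1.5'
};
```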
We can clearly see in the graph that the lower the configured value, the lower the disk usage will be. In the example with a value of 1.25, the space amplification reached a maximum of 100 percent, but eventually it went below the 50 percent mark. If the system isn’t under heavy write load, the space amplification will meet the goal faster. As expected, the system is optimizing for space while performance objectives are achieved. Now let’s switch gears and talk about how ScyllaDB made time series less painful for application developers. Please look at the CREATE TABLE statement: that’s our usual schema for time-series data. Note how the date field composes the partition key, along with the sensor ID. That’s a technique called data bucketing. This bucketing technique is mainly used to prevent partitions from becoming too large, as large partitions are known to create all sorts of performance issues. For example, ScyllaDB was very inefficient when reading from the middle of a large partition; also, repair was an enemy of large partitions, and time-window compaction strategy wasn’t optimized for reading a large partition spanning multiple time windows. In the picture, each individual line represents a partition. In the bucketed case, the time series is split into multiple smaller partitions, while in the unbucketed case, the time series is kept in a single partition. The main problem with bucketing is that it pushes lots of complexity to the application side. The application has to keep track of all the partitions that belong to a particular time series. Also, aggregation is more complex, as the application has to figure out which partitions store a particular time range, query each one of them individually and finally aggregate the results. Fortunately, those bad days are gone: ScyllaDB fixed the aforementioned problems. Large partitions can now be efficiently indexed.
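To make the bucketed-versus-unbucketed comparison concrete, here is a sketch of both schemas; the table and column names are illustrative, not taken from the slides.

```cql
-- Bucketed: the date is part of the partition key, so one logical
-- time series is split into one partition per day.
CREATE TABLE ks.readings_bucketed (
    sensor_id uuid,
    date      date,       -- the bucket
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, date), ts)
);

-- Unbucketed: with the large-partition fixes Raphael describes, the
-- whole series can live in a single partition per sensor.
CREATE TABLE ks.readings (
    sensor_id uuid,
    ts        timestamp,
    value     double,
    PRIMARY KEY (sensor_id, ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy'};
```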
Row-level repair was introduced to solve the problem of repairing large partitions, and time-window compaction strategy can now efficiently read large partitions stored across multiple time windows. When a table uses time-window compaction strategy, its SSTable files are not going to overlap in timestamp range, so a specialized reader was implemented that discards irrelevant files and efficiently reads the relevant ones. This reduces resource consumption and read amplification, making the queries much more efficient. With those problems fixed, the schema for time series can now look much simpler. Queries become simpler; your application becomes simpler. Just make sure you have enough time-series partitions to avoid hot spots, where a subset of shards may be processing much more data than their counterparts. For example, if your application is monitoring millions of devices where each has its own time series, then you’ll not run into any imbalance issue, but if you only have a few time series in your application, it’s better to rely, unfortunately, on the old bucketing technique to guarantee proper balancing. As for the upcoming improvements: cleanup and major compaction will now be more resilient when the system is running out of disk space. Additionally, the compaction manager will now dynamically control the compaction fan-in, which is essentially a threshold on the minimum number of input files for a compaction, to increase the overall compaction efficiency, and that translates into lower write amplification. This decision is based on the observation that compaction efficiency is a function of the number of input files for compaction and also their relative sizes; that’s why size-tiered compaction wants similar-sized files to be compacted together. Essentially, we’ve increased the overall efficiency by not diluting it with compaction jobs that are less efficient than the ongoing ones.
The system becomes more stable in performance as a result of this change. Last but not least, the IO scheduler is being nicely enhanced by Pavel Emelyanov. The enhancements will make the system more stable, allowing compaction to have less impact on other activity in the system, like user queries, streaming and so on. To learn more about this, you can watch Pavel’s talk. Off-strategy compaction was written with the goal of making compaction less aggressive for node operations like bootstrap and repair, allowing them to complete faster; consequently, the system will be able to scale faster, making ScyllaDB’s elasticity even better. Thank you for watching. See you around.

Thanks, Raphael. So that’s it for this series of lightning talks about ScyllaDB 5.0. We hope you enjoyed them, and we’ll be working on more features for next year so we can present them at ScyllaDB Summit 2023. Thanks, everyone.