
ScyllaDB 5.2 and Beyond

Avi Kivity, 32 minutes

ScyllaDB co-founder and CTO Avi Kivity covers 2022 accomplishments and deliveries. This year we enabled full repair-based node operations, brought distributed aggregation into ScyllaDB as a default, focused on goodput in the face of overload, and made many other changes.


Video Transcript

In this talk I will describe what we’ve been working on for the past year and the things that we’re preparing for the following years. I’ll be focusing on the ScyllaDB core. Quite a lot of the action is happening outside the core, with serverless and with drivers, but I will leave that to others and just talk about the core database.

So let’s get started. Here is the agenda, the list of things we’ll talk about; I won’t read out the slide. First, we’ve enabled repair-based node operations. That’s a bit of a mouthful, so what are node operations? Those are all the operations that deal with topology: changing how data is laid out across nodes, adding nodes, removing nodes, and rebuilding nodes that have had faults. What we’ve done is make all node operations be based on repair, and that allowed us to consolidate some code and add a lot of useful properties to those operations, so bootstrap and decommission become resumable.

And when we stream data from one node to another, we can stream from the primary replica, which was the case before we had repair-based node operations, but we can also stream from a quorum of replicas if the primary is missing, and that allows us to perform node operations in the face of faults. As a result we have better resilience, better fault tolerance, and correctness is also better.
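For reference, repair-based node operations are governed by a configuration flag; here is a minimal scylla.yaml sketch, assuming the enable_repair_based_node_ops option (check your version’s documentation for the exact default):

# scylla.yaml (sketch): turn repair-based node operations on
enable_repair_based_node_ops: true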

Another thing that we’ve worked on is automatically parallelized queries. The majority of queries in ScyllaDB only have a limited amount of parallelism. For example, if you read a partition, the query will go to a particular node, and a particular shard within that node, so it will not be distributed across the entire cluster. Within the node there will be some parallelism, as you read from multiple SSTables in parallel in order to reduce the latency, but otherwise it’s limited to a single node and a single shard, or perhaps a quorum of replicas. But there is a class of queries that can benefit from parallelization, and that is the aggregation queries that scan over the entire table, or maybe a part of a table. Previously these queries would be done via Spark or perhaps custom code, and of course you can still do that, but now you can also use a simpler route, which is to use a CQL query. The query optimizer will recognize certain patterns and will automatically dispatch the query first to all nodes within the cluster, or all relevant nodes that hold the data, and then each node will dispatch the query to all of the vCPUs within that node, so all of your disks and all of your vCPUs will be working to get the answer as quickly as possible. That can easily result in a hundred-fold reduction in query latency. These are usually analytical queries, not OLTP queries, but you still want the answer sooner rather than later, so this is a nice improvement, and we’re looking to increase the scope of this to recognize more and more patterns, so more queries can be automatically parallelized.
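To make that concrete, here is a hedged CQL sketch; the keyspace, table, and column names are hypothetical:

-- A full-table aggregation like this matches the patterns the query
-- optimizer can fan out to every node and every shard holding data:
SELECT count(*) FROM ks.events;

-- By contrast, a single-partition read is served by one node and one
-- shard (plus its replicas), so there is little to parallelize:
SELECT * FROM ks.events WHERE user_id = 42;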

Another feature, one that works very well with the previous one, with automatically parallelized queries, is user-defined functions and user-defined aggregates using WebAssembly. In conjunction with parallelized queries you can distribute the computation across the entire cluster: instead of pulling data into your compute nodes and then performing the computation in your client, you can push just the code that performs the computation into the database and have it run close to the data, so there is no data movement; the data is operated on in place. And because we use WebAssembly, you can use any language you like. There are some limitations, and right now the limitation is that the language must be Rust, but because it’s so popular I guess no one will mind. We may add more bindings in the future, because the infrastructure does allow any language to be used, but right now we focused on Rust. The computations run in a sandbox, which means that they cannot affect other operations on the database node, which is important for safety. They are also protected in terms of latency: even if you have a long computation, a long-running loop, or you use a lot of memory, it’s isolated from other queries and will not impact their latency. The use case is usually analytics, so if you want to reduce the data in some way that’s not provided by the built-in CQL functions, you can write your own reduction function, push it into the database, and have the database perform the computation for you.
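Here is a hedged sketch of what this can look like in CQL, assuming ScyllaDB’s WebAssembly binding for user-defined functions; the names are hypothetical and the function body (the compiled Rust module in WebAssembly text format) is elided:

-- A scalar state function whose body is Rust compiled to WASM:
CREATE FUNCTION ks.accumulate_len(acc bigint, s text)
    RETURNS NULL ON NULL INPUT
    RETURNS bigint
    LANGUAGE wasm
    AS '(module ...)';

-- A user-defined aggregate built on it; an automatically parallelized
-- scan can then evaluate it close to the data on every shard:
CREATE AGGREGATE ks.total_len(text)
    SFUNC accumulate_len
    STYPE bigint
    INITCOND 0;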

Another feature that became available recently is per-partition rate limits. This allows you to have a rate limit imposed on partitions, with different rate limits for reads and for writes. The primary use case is to prevent bot accounts or runaway APIs from spamming the database and overloading a particular vCPU, which in turn causes a reduction in service level from the entire database. Now, this is possible to do on your own on the client side, but it’s often quite difficult, because the client side is distributed and usually has no central point of coordination; or rather, its central point of coordination is the database itself. So the natural place to add rate limits on partitions is the database itself, and that’s what we’ve added. The chart at the bottom explains how it works, but it’s a little bit too involved here; if you want to understand it deeply, you can watch the talk it came from (that’s where I stole this slide), so I recommend watching that presentation if you’re interested.
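As a sketch, the limit is declared per table through a schema option; the table name and numbers here are illustrative, assuming ScyllaDB’s per_partition_rate_limit extension:

-- Reject reads beyond 100/s and writes beyond 50/s on any one partition:
ALTER TABLE ks.events WITH per_partition_rate_limit = {
    'max_reads_per_second': 100,
    'max_writes_per_second': 50
};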

Next, we’ve updated our Alternator implementation. Alternator is our implementation of the DynamoDB API; DynamoDB, of course, is a NoSQL database provided by Amazon, and we have a compatible implementation that implements exactly the same API, only it’s much less expensive and faster. One of the features we had been missing was TTL, time to live. Now, ScyllaDB has TTL in its CQL mode, but it has different semantics, so we needed to implement TTL for Alternator in a different way, and this is what we did. We also improved its performance. Alternator performance lags behind CQL performance, but we’re determined to make the performance difference as small as we can, so that you can use Alternator without thinking that you’re losing performance. We had a bunch of bug fixes and small improvements as well, but I will not go into the details here.
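To illustrate why the semantics differ: in CQL, a TTL in seconds is attached to individual writes or set as a table default, while DynamoDB-style TTL, which Alternator implements, expires items based on a per-item expiration-timestamp attribute. A hedged CQL sketch with hypothetical names:

-- CQL TTL: this row expires one hour after the write:
INSERT INTO ks.sessions (id, data) VALUES (1, 'x') USING TTL 3600;

-- Or as a table-wide default:
ALTER TABLE ks.sessions WITH default_time_to_live = 3600;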

Another major topic for us is consistent schema and topology. With the current ScyllaDB schema and topology management, the operator has to be very careful not to make certain errors, because the result can be pretty bad: if you perform concurrent schema changes on the same table, or you perform concurrent topology operations, you will usually be rewarded with a nasty failure mode. So we’ve added Raft-based coordination to make sure that this doesn’t happen. The first implementation, which is already available in 5.2, does not add new capabilities, but it does protect you from mistakes.

Of course, we want to gain advantages from Raft, not just protection from errors. One advantage you will see from Raft-based topology is that you will be able to bootstrap and decommission nodes concurrently; this means that you can scale up and scale down much faster than you can right now. We will also support tablets, which are a different way of distributing data across the cluster. With tablets, the data layout can be different for different tables. With the current strategy, which is called vnodes, data is distributed the same way for all tables, at least all tables in a keyspace. This is sub-optimal when some tables have a large amount of data and so need to be distributed across a large number of vCPUs and nodes, while other, smaller tables need less distribution. Tablets allow each table its own control over how it is laid out across the cluster. And finally, we will be adding a strong consistency mode. Actually, we already have an experimental strong consistency mode, but it is oriented more towards database developers than database users; we will be making it available for users as well, for users who want something different from the eventual consistency model. There is a talk by Kostja that explains all this in much finer detail, with nicer graphics too.

Next, we’ve revamped our I/O scheduler, and the I/O scheduler now has a much better model of how the disk operates. In this graph you can see measurements that we made across a wide variety of workloads. On the x-axis we scale the write bandwidth from zero to one gigabyte per second, and on the y-axis we scale the number of random read operations per second, from zero to 200,000 or 250,000 IOPS. So you have a matrix of different workloads, and the color indicates the latency we measured on those workloads; on the top chart it’s the 50th percentile, the median latency, and on the bottom chart it’s the 95th percentile. Of course we measured the latency of the read operations, because that’s what’s important to us. The lighter the color, the lower the latency. You can see that in the top right half of the chart, where the color is white, the disk simply isn’t able to sustain that workload, and near the boundary, this diagonal line, the latency becomes higher; this garish purple color indicates that the latency is around five milliseconds, which is too high. The changes that we’ve made allow the I/O scheduler to know where in this chart the disk is operating, and to keep the write bandwidth and the read I/O operations per second in an area where the latency is acceptable. If we reach an area where the latency is too high, it will throttle the write bandwidth down in order to keep it acceptable. We’ve seen very good results from the scheduler; it’s available in ScyllaDB 5.0, an improved version is in 5.1, and we will keep fine-tuning it so that you will see the best latencies possible, even under the most demanding workloads.
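For context, the scheduler works from a measured profile of the disk recorded when the node is provisioned; here is a minimal sketch, assuming the io_properties.yaml file produced by ScyllaDB’s iotune/scylla_io_setup tooling, with purely illustrative numbers:

# io_properties.yaml (sketch; numbers are illustrative)
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 150000
    read_bandwidth: 1500000000
    write_iops: 90000
    write_bandwidth: 1000000000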

Another area that we’ve worked on is trying to eliminate corner cases. Those corner cases usually affect only five percent or so of the users, and the other 95 percent never see them, but if you’re part of that five percent, it’s incredibly annoying, because the behavior in those corner cases is usually pretty bad. So let’s look at a few examples.

One example is reversed queries. All ScyllaDB versions supported reversed queries, and if you had small partitions, a few megabytes or so, they would work perfectly well; and of course if you didn’t use them, they also worked perfectly well. But if you were a user that used reversed queries and you had moderately large partitions, or larger, then they would be very slow and consume a lot of memory. In 4.6 we added reversed queries that were much faster than before, but they could not use the cache, and in 5.0 we made them fast and also made them serve from the cache. So they now work well both for small partitions, usually from cache, and for large partitions, usually from disk, and they work well with driver paging. On the right you can see an example of a reversed CQL query; the DESC keyword tells the database to read the partition from back to front (a minimal sketch also follows at the end of this section).

Another example of a corner case is handling large runs of tombstones. I guess most of you are familiar with tombstones in the context of NoSQL databases, but to refresh your memory anyway: tombstones are used to mark deleted data. Instead of finding the data that needs to be deleted and removing it from the disk structure, we insert a marker, called a tombstone, that indicates that the data is removed. One problem is that if you have a delete-intensive workload, you can get a large number of tombstones in the database, and when you query, the database will read those tombstones and filter them out, so they don’t return any results. The problem we had is that if you had a large run of tombstones, a large section of the data that was nothing but tombstones, then the performance would be very low, and in certain cases it might have consumed large amounts of memory, even to the point of running out of memory. We’ve fixed that. Usually the recommendations for NoSQL databases tell you to avoid workloads with lots of deletes, but real life sometimes requires it, and now ScyllaDB will support this use case well.

Another area that we’ve improved is range tombstones. Range tombstones are deletes that delete a large number of rows at once, in a single statement. Previously, if you had many such range tombstones within a partition, you would experience very bad performance, and we’ve improved that. We’re also working on improving the handling of tombstones within the cache: the cache is learning to evict tombstones when it can get rid of them, so that it will recover memory and improve performance.
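Here is the reversed-query pattern mentioned above as a hedged CQL sketch; the table and column names are hypothetical:

-- A table whose clustering order is ascending by ts:
CREATE TABLE ks.events (
    id int,
    ts timestamp,
    payload text,
    PRIMARY KEY (id, ts)
);

-- Reading against the clustering order makes this a reversed query;
-- since 5.0 it is served efficiently from cache as well as from disk:
SELECT * FROM ks.events WHERE id = 1 ORDER BY ts DESC;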

Another area is out-of-memory handling, and I’ll say a few words about why it’s needed in the first place. The database has conflicting requirements. On the one hand, it wants to run queries in parallel in order to be able to saturate the disk, and disks really like having multiple read requests running in parallel in order to extract the best performance. But on the other hand, it has a limited amount of memory, and each query can take wildly different amounts of memory. A query might require just a few hundred bytes, for example if it hits cache directly, so it just copies the data from the cache. Or it might require multiple megabytes, if you’re reading large rows and you need to merge multiple SSTables in order to construct the result. So you can have queries that require holding tens of megabytes in memory, and if you multiply tens of megabytes by high concurrency, you’re starting to reach a situation where you might run out of memory.

The way to cope with that is to have countermeasures that apply as memory usage grows. When we reach one threshold, we simply prevent new queries from starting; we’ve had that for a long time now, it’s not new. But the problem is that a query can continue to grow its memory consumption after it has started, so even if we prevent new queries from starting, the existing queries can still grow in size and consume more and more memory, and eventually run out. So we’ve added another threshold, and when we reach that threshold, we pause all of the queries except one. This way they do not consume more memory; you allow just one query to allocate more memory, which reduces the rate at which memory is consumed, and eventually that query will complete, return its memory to the system, and things will resolve. But there’s a possibility that even that is not enough, so if we reach yet another threshold, then we start canceling queries, all queries except one. This is a kind of emergency measure that allows the system to recover: that one query completes, and the canceled queries can be retried. It’s not good when that happens, but it’s a lot better than having the database run out of memory and start misbehaving or crashing.

Yet another feature is repair-based tombstone garbage collection, yet another mouthful. The goal here is to eliminate something called gc_grace_seconds. gc_grace_seconds is the amount of time we keep a tombstone on disk, usually 10 days by default, and the idea is that you keep the tombstone around in order to ensure it gets propagated, via repair, to all replicas; that’s why you don’t lose the tombstone. If the tombstone is lost, then you might have data resurrection. Now, this has two downsides. One downside is that it might be more than necessary: by keeping tombstones for 10 days you might accumulate a lot of them, and that reduces performance and consumes disk space. The opposite problem is that it might not be enough: if the tombstone hasn’t propagated to the other replicas within 10 days, then you might end up with data resurrection, which is bad. So our fix was to tie tombstone garbage collection to the repair functionality. This way, if you run repair frequently, and repairs complete, and everything works, then we can garbage-collect those tombstones early, as soon as the last repair completes. And if for some reason you did not run repair, maybe because there were infrastructure issues or because you did not have the capacity for it, then it will keep correctness and not garbage-collect those tombstones. That does impact performance, but correctness is often more important than performance.
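As a hedged CQL sketch of the two behaviors (the table name is hypothetical; tombstone_gc is the schema option ScyllaDB uses for this):

-- The classic time-based behavior: tombstones are kept for
-- gc_grace_seconds (10 days by default) before they may be purged:
ALTER TABLE ks.events WITH gc_grace_seconds = 864000;

-- The repair-based mode: tombstones become purgeable as soon as the
-- last repair that covers them completes:
ALTER TABLE ks.events WITH tombstone_gc = {'mode': 'repair'};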

Now let’s take a look at some things that we’re preparing.

We’re working on morphing the CQL grammar towards its older cousin, SQL. The two dialects are very similar, but they are still different, and that causes some confusion for people who come from SQL. One improvement is that we’re relaxing a lot of constraints. There are a lot of examples where CQL is very limited: you can write "a = 3" but you cannot write "3 = a", the grammar will just reject it; and you cannot write "a = b" where a and b are two columns in your table. So one of our goals is to make it easier to write expressions and not have them rejected by the grammar seemingly randomly, unless you’re very familiar with the grammar. There are also some places where CQL is semantically different. The most annoying example is the lightweight-transaction IF clause, which compares nulls in a different fashion than the rest of the system: the lightweight-transaction IF clause will consider two nulls equal, whereas standard SQL considers them unequal. That can be surprising, and we’ll be working to resolve it and make the semantics coherent. Of course we will provide an upgrade path, so that we will not break the code of existing users. And we’ll be working to make automatically parallelized queries recognize more and more patterns, to make them more usable.
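To illustrate the kind of restriction being relaxed, a hedged CQL sketch with hypothetical names:

-- Accepted today:
SELECT * FROM ks.events WHERE a = 3;

-- Equivalent in meaning, but rejected by the current grammar:
SELECT * FROM ks.events WHERE 3 = a;

-- Comparing two columns, also rejected today:
SELECT * FROM ks.events WHERE a = b;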

Another topic is object storage. There are lots of different kinds of storage available. There is RAM, which is extremely fast but also extremely expensive. There’s NVMe storage, which is very fast, around 100 microseconds response time, but still relatively expensive, although it’s getting cheaper all the time. And there’s hard disk storage, which is very slow but also very inexpensive. To add to the mix, in cloud environments there is also object storage. It’s even slower than hard disk; it’s usually based on hard disks underneath, but there’s also a network hop. On the other hand, it is very cheap and you can expand it very easily: it’s just an API call away, rather than acquiring a disk and installing it. And I guess what’s most compelling about it is that it’s very easy to manipulate. You can easily copy it, easily back it up, and you can access it from different nodes very easily, so it’s shared by default.

That leads to scenarios where it’s very easy to grow your cluster, since when you add new nodes they already have access to the storage; they don’t need to copy or stream the data from one node to another. That’s a very interesting capability, and we’re working on making it available. We see two use cases for it. One use case is very dense databases that manage very large amounts of data with a small number of nodes; this will work where latency is not very important. The other use case is tiered storage, where you have a mix of NVMe storage and object storage, and you mix the latencies as well: you try to get most of the queries responding with low latency, while the bulk of the storage, and therefore the bulk of the cost, sits on the less expensive object storage. This is a way to try to gain the best of both worlds.

And that’s it. Thanks to everyone. My contact details are on the slide, so you’re welcome to follow me or contact me directly. [Applause]
