To Serverless and Beyond

Dor Laor37 minutesFebruary 15, 2023

This year, a set of technologies we’ve been developing for years, graduated together to bring you uncompromised performance, availability and elasticity. From Raft consensus to serverless and goodput. In this session Dor Laor, ScyllaDB’s CEO will cover the major innovations coming to ScyllaDB core and ScyllaDB Cloud.

Share this

Video Slides

Video Transcript

We’ve  been developing ScyllaDB for a very long time and we’re excited about the content that we have to present in the following two days. We’ll be speaking about serverless but we have more announcements to make and also in general.

We’re going to overview what we’ve done over multiple years but we’ll obviously Focus over the last year in 2022 and describe what will going to happen in 2023. now outside of this keynote we have wonderful content probably deeper than my session together with amazing customers and news of ours who contribute to the success of ScyllaDB so stay tuned and check the agenda don’t miss multiple Keynotes by customers and by amazing Engineers of ours so without further Ado let’s go to serverless and Beyond so for the newbies and we have thousands of people joining us um so why let’s uh be upfront about it so instead of uh this coming up for me uh we asked the smartest entity on Earth why is it a DP which entity is that obviously

When we asked we got a poem in return that almost brought me into terms uh almost not not really but uh this is really cool and uh AI in general uh is making big strides into the industry and we have an announcement to make as you know uh we’re moving into serverless and we thought about what about AI as many companies adopted so we have an announcement unlike many other companies we like to go ahead and innovate and everybody today embed several embed Ai and we thought let’s go directly into the as AI less world so ScyllaDB wouldn’t uh embed chat GPD in its database that’s the first announcement we have to make

Seriously we’re very excited with uh our users and and why cldp instead of coming from us see the experience that many other companies have done over the years and how ScyllaDB contribute to their success with uh feature stores with the cryptocurrency with very quick key Value Store e-commerce and so forth um really ScyllaDB was born to be a operation per second crunching machine and this is a primary use of scyllaDB and I hope that you’re going to learn how to use the best in the following following two days um now if I’ll concentrate on really technology we used to have a phrase that ScyllaDB provides the power of Cassandra at the speed of redis and really we give you the best price performance in the industry the bunch of other things that we do well that we um got the the design from Cassandra like uh you can run as many replicas as we like as many data centers as as you like uh and as many nodes and as many CPUs as you like within a single node uh everything comes out of the box automatically tuned with compatibility with Cassandra and Dynamo no lock-in uh either in protocols or in implementation we can write kubernetes on-prem or in the cloud and it’s open source software as well so uh let’s uh proceed into actual content the first part is very quick recap for newbies and then more detailed journey of 2022 and what the future holds so in terms of recap as I said ScyllaDB is a person per second machine we have a unique design of sharper core this is a picture from the Linux top where you can see all of the cores working parallel we actually love to see cores working for us and getting cores to 100 CPU it’s a healthy environment we and we have a scheduler that can control the allocation of what every core which we call The Shard is doing this was the first generation of ScyllaDB and obviously we knew that there is more factors to a database in the series database then um then performance there’s a efficiency of the different components there is the the ability to scale and to scale fast and the influence of administrative operations that you do while you perform so latency is supposed to be low as possible and there is the freedom score of providing different Tools in uh open source or open code manner this was the first generation of ScyllaDB the second one was a full Cassandra API with transactions lightweight transactions and we proceeded to develop cassette seller Beyond Cassandra with change data capture and indexes and new compaction algorithms and the alternator which is our dynamodb compatible API so there’s a lot of fancy features in syrup today and a year ago we met around the same time frame where we launched ScyllaDB 5.0 and we said okay ScyllaDB 5.0 gives you extreme efficiency and manageability and bunch of things easy to consume and we talked about the future about multi-denancy that multi-tenancy is coming which will allow you to separate computed storage and do a bunch of other things so I’m glad to say that the future is here and we’re going to cover it so this was super fast recap for newbies about what ScyllaDB is um and obviously there’s plenty of other sessions uh that you should visit and and this quick recap is really fast um let’s uh cover what we have done over 2022.

In 2022 I like to to capture it as we expanded uh the capabilities of seal up to make it in all weather not just a winter or summer database but all weather all terrain high throughput low latency this is what we promise um and how we delivered we improved performance maintenance and uh worked on serverless and bunch of more so let’s Dive In in terms of performance and this is what you can achieve with for example nine nodes doing million requests per second with the P99 of one to two milliseconds um very few databases if any can get to this level of performance um I must say that this is what you get with the simplified data models the more the data model complex then you’ll get great performance but maybe not the million with this particular setup but this is why sale was invented and this is the type of workloads that we like the most and we’ve done improvements with different components that you can see on the left starting with the i o scheduler so we have an i o scheduler because every Shard which is mapped to a virtual CPU need to run reads and writes but also need to run a synchronous administrative workload like streaming and repair and compaction in different other workloads so that’s why we have an i o scheduler running within the context of every shard in the first generation to control and to give your queries a priority over a workload like streaming that needs to send multiple terabytes from one node to the next this was the first generation and we measured the disk performance and and we we automatically gained the maximum i o request and we divide them equally among all of the i o queues which map to every shard in a one-to-one ratio later on we figured that its life are more complicated and sometimes if you divide the maximum i o request but by the number of shards every Shard will get two uh too few of the bandwidth so uh if it will have big tasks like compactions of large blocks then it some other queries May starve within that chart so we allowed multiple charts to consume the same ioq and then it becomes the scheduler work will became more complicated to provide fairness um later on uh we moved uh through several improvements as well and I jumped forward to version five uh Pavel and Avi who developed the io scheduler uh realized that life is more complicated than we thought before and actually that this performance is the function of also the ratio between reads and rights. We have a detailed presentation about it but we took it into consideration in version five we we’ve made created a token bucket protocol and we also added i o class throttling that in very unique cases can come into play but usually you do not have to touch it and it just works automatically under the hood it’s just nice to know what’s in there but things supposed to just work out of the box um we’ll have a presentation about per partition limit that can assist you also we worked on 30 CPU gain coming in ScyllaDB Enterprise uh it’s it’s work that we have done uh in 2022 and it will come into play this year uh we also have a super nice uh presentation about distributed Aggregates so let’s say if you need to do a select count it means a full scan over the entire data set if you’ll do that assistance by a single coordinator everything will be serialized and that’s not efficient so distributed Aggregates allows all of the nodes to become coordinators of their own regions and then all of the sub results are collected by by single coordinator a super coordinator so really nice Improvement and this is just select count which is enabled in open source 5.1 release but will propagate it into a bunch of other capabilities in the future we also eliminated exceptions from the iopath improving good puts a bunch of other things and improve the representation of range tombstones which on certain cases uh where uh glass jaw of ours so we improved that making performance more stable even in the face of ranged tombstones it’s most of the work for ranged tombstones is committed to the master branch and is propagating in a into a proper release these days

um so that’s the performance work that we’ve been doing uh this is an example of how we test performance internally by our quality assurance team so uh we’re trying to constantly test a case where we have a constants throughput sent to the cluster and in parallel we do a set of administration administrative operations like replacing node adding multiple nodes the commission nodes failing nodes and so forth and we compute the steady state without admin operation of uh of of the throughput of the cluster so in in add node for example the steady state or in without operation the steady state gives you a P99 latency of 2 milliseconds and we add how much the operation adds to the latency of the adding a node operation adds to the latency of the the operations that run and you can see here that uh it increases the latency by half a millisecond I’m looking at the second line cycle number one node number one latent latency increased from 209 to 2.79 like really you you barely notice that so we meet the promise of having maintenance operations while you run without any change uh in without going through a management window there are cases with where it’s not that perfect but we consider these cases a bug and we constantly try to iron out these cases and this is why we develop more generations of our i o scheduler which uh V5 which is uh in in which is part of 5.1 release and Enterprise 2020.2 is the best scheduler available in the market so that’s performance uh in terms of Maintenance uh also important uh we improved the option of a tombstones so tombstones are erased when you run repair but uh there is a timer uh and the ba race whether you run repair or not so just like that you were forced to run repair on a weekly basis but sometimes if there’s an issue and you forgot to run repair or you didn’t schedule it or whatever happened you you’re in a risk for to have data Resurrection automatic Tombstone garbage collection allowed you to forget about that and once there’s repair will run We’ll erase tombstones and if repair doesn’t run we wouldn’t erase them as simple as that so this is one example of improvement another one is load and stream if you need to load data that resides on another cluster maybe from migration to ScyllaDB from another technology or maybe in restoration from a cluster that was running on a different amount of nodes and different node sizes then those key ranges original key ranges may not map the topology that you have on the new cluster so we developed load and stream for it and basically you can throw any SS table on any new node and the new node will automatically go and scan through the DSs table and send the partitions to the right node owning them so this is a big Improvement in maintenance for migration and restore of a backup and data restoration for backup is part of sealer manager 3.1 which is another Improvement previously backup was automated but not restored and this simplified the restore so today with sealer manager 3.1 everything is supported

um we’ve very very still node operation so uh it’s it’s a very important feature that we release long time ago but uh it took us a while to make sure that the feature uh works as expected uh both fully stable and also with minimal effect on performance so um we measured this the way that I demonstrated before by QA um so repair based node operations uh makes all operations like adding nodes removing nodes repairing nodes uh all of these uh using the repair protocol um so repair is a resumable protocol so if you added the node and transfer 10 terabytes to the new node and then you had a failure it’s okay to just or you had to restart or something then you can just go back to the operation restart it and the 10 terabytes that you already sent would be needed to be sent again so it’s improves your maintenance ability uh once you you do these operations life sometimes it’s more complicated than reality uh it’s also makes everything consistent um sometimes they were special cases in in configuration that you had to configure in order to be consistent and now it relieves you from any special knowledge you just run the operation and repurpose operation take care of that it’s also much simpler because some of the operations had to run after they ended you had to run repair now since repair base operation uses repair no need to run repair so it’s simpler and also faster because there’s only one step and repair is running anyway uh lastly there is a consistent scheme of changes but I’ll talk about it in a separate section uh we also support top K queries using user-defense functions and awesome and distributed Aggregates uh but I won’t cover it but you can actually find how many people follow Elon Musk if you have it in your database um and there’s uh um investment in our database drivers we heavily invest in multiple drivers primarily the go the rest that we created from Strack scratch SD plus driver which is based on the RAS driver and we’re going to release in ga uh in in the first half of this year it’s actually ready for usage in multiple cases we will improve the load balancing options we added a serverless support and also performance was improved

um okay let’s talk about consistent metadata now a a metadata consistency is made gone by raft raft is a protocol for statement machine replication so if you have multiple changes to the states by distributed amount group of nodes it makes everything consistent and linearize all of these changes so obviously it’s it’s something good that everybody every database should have and before that we had the lightweight transaction is is a similar protocol but rough doesn’t have the overhead that lies where transaction has no the main thing roughly is just a protocol how we use it so we we are using it for multiple goals uh the first one which I’m very happy to announce that uh it’s it’s getting to the default case uh it’s safe schema changes if you have parallel schema changes previously uh you you you it was this forbidden to use uh multiple schema changes at once uh it can drive schema out of sync and and uh a manual operation would be required this is why when we pass the Jefferson test we forbid to use the schema in parallel and also we forbid to use multiple topology changes at once with raft these things are can be run in parallel and rough linearize them so in the upcoming 5.2 which is in a release candidate mode it’s default by its default on previously it was an experimental mode a year ago it was very complicated project and we’re in the last mile of delivering this by default next is already implemented but we will only be delivered uh in 5.3 release is safe topology changes so you’ll be able to add and remove multiple nodes in parallel and this is just a beginning with this base we can transition to more sophisticated type of changes with consistent tables and tablets and I’ll go into the provide more data about these

um lastly in 2022 we worked on ScyllaDB Cloud serverless but we haven’t released it we are we did release it this year um so this takes me to the future of data of fast data and it contains serverless and uh changes to ScyllaDB core that’s the last part in this presentation

um so serverless if you’re not familiar with this syntax uh it’s time to go back to school and so we’re switching from um instances uh in servers to units of virtual CPUs um so we basically abstract the servers servers obviously are still there but forget about what you know about instance types about server sizes about too much sizing prior to selection of the servers that you need to use for ScyllaDB now with serverless the building block is virtual CPU and we also decouple to some extent the relationship between storage and compute we still use local nvme so it’s the coupling is not uh it’s just a beginning of the decoupling but you do not need to think about which instances you’re going to select whether I3 or I3 en with more storage we’re doing it automatically for you with our management which is based on multi-tenant kubernetes deployment and I’m happy to say that it’s already implemented in the ScyllaDB free tier so if you go to ScyllaDB Cloud today and you select free tier you get two virtual CPUs and an amount of storage immediately we’re going to over the course of this year we’re going to propagate it to production later on this year and it will be stage deployment with more and more value coming to you in terms of value because technology is one thing but the value is more important it gives you beyond like uh no need to plan much uh the ability to be flexible with annual commitment uh with the hardware because uh we also today with VMS we buy a one-to-one ratio of the hardware for the cluster and we need to annual commit in order to make it cost effective to the cloud provider but in this mode with multi-tenancy there’s no need to commit to the the cloud provider we take the commitment and the storage and the server management so you do not need to think about that we also add the encryption at rest and bring our own key so it will be absolutely safe to run multi-tenance multi-tenancy over the same servers elasticity gets a big boost uh everything is API driven and I’m also happy to say that we released an API and terraform support for ScyllaDB cloud and not just serverless but actually Sera Cloud VMS also get API and terraform this was long anticipated uh other things are also coming in terms of more flexibility for Network usage and self-service for user management and billing and metering not directly related to serverless but we lead with serverless because serverless makes everything more elastic and dynamic so it’s it’s more required there

so we are entering the serverless world and previously our vision was or the articulation of ScyllaDB was the power of Cassandra at the speed of redis now we’re switching to a new era and we also assist many companies to move from dynamodb to ScyllaDB DB and while definitely they get much better TCO today with Sera DB we also like to make sure that the experience experience and the usability is at least as good as the anodp in certain cases it’s already better than dhamma DB there’s no rate limiting per partition that diamond ADP always does in scaling some can be faster than dynamodb in many cases but in many other cases uh dynamodb is better and we like to change this equation how we change it um so today ScyllaDBs supports eventual consistency with regular operations unless you use LWT with eventual consistency everything goes to three replicas usually and they may not be identical because we don’t use transaction and storage is bound to compute within the same node unless you use serverless but it’s still even with serverless we use nvme and real servers so it’s somehow bounds you have more flexibility but some planning and you need to repair the data once in a while and ScyllaDB is extremely competitive uh when the situation where you have high throughput in low latency but if you have a high volume and not necessarily high enough throughput then it becomes more expensive or together with all of the components and we wanted to solve that we wanted to Revolution revolutionize this so the future coming to ScyllaDB is based on raft both the metadata and the data will be consistent so there wouldn’t be need to repair and since there is no need to repair data is consistent we can just keep a single replica we like to separate the replica from the compute from the storage so we like to keep it on S3 but we need fast performance so we’re going to Cache the the data that we keep in S3 and synchronize it to S3 once in a while this will also enable point in time recovery in cluster parking if you don’t use and if you don’t query it and also fast disaster recovery backup I’ll I’ll explain how it is done so how we support this Vision First it starts with uh changes to key ranges so key ranges is a range of partitions and that we manage and every uh node has multiple sets of tokens token is uh the hash of a partition um so this this is what the token ownership map is for every every node in the cluster uh we we call it every token range basically it’s a virtual node and if assigned statically on bootstrap uh we manage the replication metadata by considering the replication strategy if you use replication of three or different one this is the replication fact uh metadata but the problem is it’s quite static it only changes one once you add and remove nodes we wanted to make it more flexible today every V node covers tens of terabytes terabytes or tens of terabytes in large cases and we want to make it more flexible so we’re moving into tablets tablets cover also ranges of keys but smaller ones from gigabytes to tens of gigabytes and they’re dynamically controlled and dynamically mapped so you can think about them in using this diagram so these boxes and in colors are small tablets with data up to 10 gigabytes and they can be mapped into shards the shards are the vertical lines here now a tablet can be splitted because we do not want to have a too big of tablets in order to balance uh storage and also to balance uh CPU and not too small so we wouldn’t have too many so tablets are Dynamic you can shot can balance or scheduler can balance and move tablets away which is really trivial because tablets have their own SS tables every tablets so it’s very quick to move we just move the ownership of the usage of the SS table within the node uh you can also re-shard so you can run a server with different amount of shards and we’ll just coalesce This Ss tables into fewer shards or or split the assist tables and we can also clean up uh and cleanup is just removing the assist table we don’t need to do any manipulation of the SS table today and it’s not the case in the past with virtual nodes so this is one major aspect that will support the revolution the second one is the future of storage we are extremely happy with local nvmes but uh if you need to be very elastic it takes time to scale uh so nvmes are great but uh but way better than EBS and some but uh let’s say if you don’t use them also um and it’s a high volume use usage then they they can become expensive so we went to the customers and listened to what they have to say they actually ask for shark with lasers which we mapped it into this is from Dr Evo if you’re not familiar with this movie also a big hole that you have to fail um so uh we’re obsessed with customers and also obsessed with technology and we transitioned this uh ask for shark with lasers into a better generation of software assisted by Object Store specifically S3 so previously we had all of these nodes and now we have S3 and we like to have data reach the nodes and then copy it into S3 with a single copy so on the right this is what you get by switching to S3 S3 is almost free because compared to nvme it’s completely decoupled from the compute it’s fully consistent because we’re going to keep one copy we’re speaking about transactions or data was that that was repaired before we sent the SS table to S3 it’s accessible by all nodes so if you need to have an elastic deployment all nodes can reach S3 and we can map the the tablet to any node uh we support we’re going to support tiered storage and since S3 is obviously slow we’re going to save copies in local nvme’s for Speed

um so you can think about their rights coming to the local uh nodes uh we will continue to have three replicas and they’ll receive the rights fast but once they received it they can send the commit log and upload it to S3 which will help also with the point in time recovery uh the if it’s consistent nothing is needed if it’s not uh consistent and eventually consistent we can run repair in order to in compaction to send the data to S3 and then that that would it be it with the right path um there are many things that can improve compaction for example can happen on another node we’re going to add caching for replicas and we can have flexibility and trade-off between the amount of nvme caches it doesn’t necessarily even need to contain the entire uh data set it can make contain just a working set size uh there are lots of interesting choices to make and we’re going to make it available for all users not just the database as a service a portion of delivery of ScyllaDB but also all of the other Enterprise and open source flavors um so if we think about all of these Concepts established S3 and rough metadata you get something which is greatest of all times uh that’s the future so data uh is on the cheapest layer you have caches to mitigate the S3 speed elasticity becomes fabulous removing nodes you just need to flash depending its request on the nodes and then you can just kill the node and transfer the tablet ownership to some other node if you add nodes adding Hardware is decoupled from ownership so you can add as many other as you like and then the ownership of tablets propagates letter and once it the first node gets ownership of the first table it can serve request immediately once it receives the the first gigabyte or so without waiting for terabytes of data so extremely extremely elastic configuration um also uh you can back up to another data center just by backing up S3 so backup that um disaster recovery backup becomes almost free and you can park the cluster last but not least um I haven’t forgotten also the consistency of the data path itself today we’re moving either eventual consistency we’re moving or moving to immediate consistency with raft so the configuration will be trivial we actually have it running uh in in a mode where it’s not uh productized yet but we have it running and this allows all operations to be consistent and not just the use of LWT which is main caveat is slow in latency with raft is Fast and the overhead is uh almost not existing in many many cases so this is where ScyllaDB is going I hope that you enjoyed the presentation we covered the 2022 achievements which make ScyllaDB already an extremely good database but now we have serverless and the future it looks extremely interesting I hope you enjoyed and stay with us for the next session as well thank you [Applause] foreign

Read More