Scaling ScyllaDB's Storage Engine with State-of-the-Art Compaction

Raphael Carvalho
21 minutes

Log Structured Merge (LSM) tree storage engines are known for very fast writes. ScyllaDB uses this LSM tree structure to write immutable Sorted String Tables (SSTables) to disk. These fast writes come with a tradeoff in terms of read and space amplification. While compaction processes can help mitigate this, the RUM conjecture states that only two of the read, update (write), and memory (space) amplification factors can be optimized at the expense of the third. Learn how ScyllaDB leverages the RUM conjecture and control theory to deliver state-of-the-art LSM-tree compaction for its users.



Video Transcript

Hello, my name is Raphael. A little bit about me: in the past I worked on a couple of projects such as OSv, and now I work on ScyllaDB. That's enough about me, so let's move on to the interesting part.

Okay, so I would like to start with this great quote: "In order to make good use of computer resources, one must organize files intelligently, making the retrieval process efficient." This is a great quote from a paper written in the 70s. Of course, the choice of a good file organization depends on the kind of retrieval to be performed. Database programmers have been working on storage engines for many decades already, since the 70s, when many papers were written on the topic.

There is a great, short, and precise definition in the aforementioned paper of what a storage engine is: it allows users to store, update, and retrieve data. There are two approaches for handling updates in a storage engine. One of them is in-place update structures, such as the B-tree.

In the example, each entry contains a key prefixed with the letter K and a value prefixed with the letter V. In-place update structures directly overwrite old records to store new updates: K1 is updated in place. This structure sacrifices write performance, as updates translate into random I/O. Also, consecutive updates and deletions lead to fragmentation in the index pages, reducing space efficiency.

The other option for handling updates is out-of-place update structures, such as the LSM tree. Same scenario as before: we want to update K1. An out-of-place structure always writes updates to new locations instead of overwriting old entries; immutability is the key attribute. This design improves write performance, since it can exploit sequential I/O, allowing for faster update ingestion. The major problem of this design is that read performance is sacrificed, since a record may be stored in multiple locations; for correctness, the latest version of the data has to be returned on the read path. It also requires a background operation, what we know as compaction, to reorganize the data incrementally in order to improve storage and query efficiency.
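To make the out-of-place idea concrete, here is a minimal, illustrative Python sketch (my own toy model, not ScyllaDB code): writes only append to an in-memory table that is flushed into immutable sorted runs, and reads reconcile the newest version of a key by scanning components from newest to oldest.

import bisect

class TinyLSM:
    """Toy out-of-place store: a memtable plus immutable sorted runs ("SSTables")."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}              # mutable in-memory component (C0)
        self.runs = []                  # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Out-of-place update: never touch old data, just record the new version.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Seal the memtable into an immutable sorted run (a sequential write on disk).
        run = sorted(self.memtable.items())
        self.runs.insert(0, run)        # newest run first
        self.memtable = {}

    def get(self, key):
        # Read path: check the memtable, then runs from newest to oldest,
        # returning the most recent version of the key (reconciliation).
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = TinyLSM()
db.put("K1", "V1")
db.put("K1", "V2")      # nothing is overwritten, a newer version is simply added
print(db.get("K1"))     # -> "V2", the latest version wins on the read path

Without compaction, old versions of K1 would pile up in the runs forever, which is exactly the read and space amplification the talk describes.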

So, I would like to say that out-of-place updates are not new. In the 70s, a paper called "Differential Files" showed their applicability in the real world. The paper explains very well the high-level concept behind LSM, where writes are lazily deferred and batched. There is actually a good analogy presented in the paper: a differential file for a database is analogous to an errata list for a book. Rather than printing a new edition of a book each time a change in the text is desired, a publisher identifies corrections by page and line number, collecting them into an errata list which is then shipped with each book. This procedure significantly reduces publication costs.

The LSM-tree paper, the official one, was published in the 90s, in 1996 to be more precise. It addressed many problems with a merge process which is integrated into the structure itself, providing high write performance with bounded query performance and space utilization. The original LSM-tree design contains a sequence of components C0, C1, up to Ck, where C0 resides in memory, whereas the remaining components reside on disk. Incoming writes go to a memory table, which is C0, and once it's full, it's merged into C1. When the disk component C1 is full, it's merged into C2, and so on. This merge procedure, also known as compaction, is crucial for query efficiency and space utilization. The paper also demonstrated that, under a stable workload where the number of levels remains static, write performance is optimized when the size ratio between all adjacent levels is the same. I would like to say that this principle has had a big impact on the implementation of Cassandra and, later, ScyllaDB.

One of the good properties of LSM trees is how it simplifies concurrency control and recovery. That's because multiple disk components are merged together into a new one without modifying the existing components. This simplifies crash recovery a lot: if the database crashes in the middle of a merge, all the input components will be intact, and crash recovery can just discard the partial output components. Modern databases such as ScyllaDB implement each disk component as a run of SSTables.

Now for the read path, the life of a query in an LSM tree: it will search all the components, which include the memory table and the SSTables sitting on disk, and of course it has to reconcile the results, that is, find the latest version of the key. In this example, the query for K1 reads the key from both the memory and the disk component and returns the record with the most up-to-date information.

Now, about LSM-tree compaction policies, which is how we drive the shape of the LSM tree. The policy determines the space, read, and write efficiency of the database, so it has to be carefully chosen based on the workload's requirements and performance goals. A policy is composed of four primitives; basically, like I said, it drives how we are going to do compaction in the LSM tree. The four primitives are the trigger, the file picking policy, the granularity, and the layout: the trigger is about when to compact, the picking policy decides which data to compact, the granularity is how much data we are going to compact at once, and the layout is how we are going to lay out the data on disk. Tiered and leveled are the most famous compaction policies. In the original LSM-tree design, there is only one disk component per level.
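As an illustration of those four primitives, here is a small, hypothetical Python sketch of a compaction policy interface (the names are mine, not ScyllaDB's internal API, and the "level" object with its "runs" and "size" attributes is assumed for the example); a tiered or leveled policy would just fill in the primitives differently.

from abc import ABC, abstractmethod

class CompactionPolicy(ABC):
    """Illustrative interface: the four primitives that define a compaction policy."""

    @abstractmethod
    def should_compact(self, level):
        """Trigger: when to compact (e.g. a run-count or level-size threshold)."""

    @abstractmethod
    def pick(self, level):
        """Picking: which SSTables / runs to compact."""

    @abstractmethod
    def granularity(self, level):
        """Granularity: how much data to compact at once."""

    @abstractmethod
    def layout(self, level):
        """Layout: how the output data is laid out on disk."""

class ToyTiered(CompactionPolicy):
    def __init__(self, threshold=4):
        self.threshold = threshold                    # K similar-sized runs per level

    def should_compact(self, level):
        return len(level.runs) >= self.threshold      # trigger on run count

    def pick(self, level):
        return level.runs[:self.threshold]            # merge a batch of whole runs

    def granularity(self, level):
        return sum(r.size for r in self.pick(level))  # whole runs at once

    def layout(self, level):
        return "one new run appended to the next level"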
One of the problems with having a single disk component per level is the write stall: the database stops accepting writes until C0 is merged into C1, the first disk component, to avoid running out of memory. This compaction policy was implicitly assumed when people talked about LSM trees; it is the pure leveled policy, which is implicit in the original LSM design.

To avoid the write stall, modern LSM-tree implementations are more flexible with the leveled policy: they allow more than one component in the first disk level. That's done to reduce the write amplification: the memory component is not merged with a pre-existing disk component, but rather flushed into a new, adjacent component on disk. That allows the write stalls to be reduced, because we reduce the write amplification, i.e. the write cost of flushing the memory table to disk. Once the first level reaches its capacity, which is defined by the size ratio between adjacent levels, the level is merged directly into the next level; there is no flexibility from level 1 onwards. The input data is only removed on successful merge completion, to guarantee crash recovery, which we talked about previously.

Now it's very important to talk about the partitioning optimization for leveled, which came later in time. It's basically about people realizing that merging entire levels, known as a full merge, is problematic. So the idea is to partition the data of each level, each LSM-tree component, into fixed-size fragments. When compacting, we are not going to do a full merge, but instead a partial merge. This technique allows compaction to have bounded operation time and bounded temporary disk space, that is, the amount of disk space required for compaction to complete. Each fragment is represented in practice as an SSTable, which is short for Sorted String Table, the file format used in ScyllaDB and the main other databases implementing an LSM-tree storage engine. Each level now has its data partitioned into SSTable files which don't overlap in key range. There is no longer a single component per level, but rather many components that can be logically treated as a single one. When level 1 exceeds its capacity, a single file in level 1 can be merged with the overlapping files in level 2. This partial merge is what bounds the operation time and the temporary disk space required for the compaction to complete successfully without running out of disk space.

And this is the cost analysis for the leveled policy: an SSTable at a given level will be compacted T-1 times until the level fills up and that SSTable is merged into the next level, where T is the size ratio between adjacent levels. The worst case occurs when all the data in all but the largest level, which together contain approximately one tenth of the total data, are updates to entries at the largest level. That's the worst case for the leveled policy.
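As a back-of-the-envelope restatement of that cost analysis in LaTeX (the notation is mine): with size ratio T between adjacent levels, first-level size M, and total data size N,

\[
  L \approx \left\lceil \log_T \frac{N}{M} \right\rceil, \qquad
  W_{\text{leveled}} \approx L\,(T - 1), \qquad
  S_{\text{leveled}} \lesssim 1 + \frac{1}{T}
\]

Here L is the number of disk levels, W is roughly how many times an entry is rewritten on its way to the largest level (about T-1 times per level, as stated above), and S is the worst-case space amplification: with T = 10, the levels above the largest hold about one tenth of the data, so at most around 10 percent of the disk is stale duplicates.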

So, later in time, after the original LSM-tree paper was presented, a new paper published a few years later, still in the nineties, actually introduced a new approach to the LSM data layout. The actual difference is that it allows the LSM tree to have multiple disk components per level, so it trades off space for write efficiency. Quoting the paper, the goal is to design a technique that supports both insertions and queries with reasonable efficiency, without the delays of periodic batch processing. And that gives birth to the tiered compaction policy.

So that's how it works: over time, SSTables are created in the first on-disk level. The tiered policy allows the accumulation of K SSTables before merging them into the next level. The K parameter can usually be controlled by the user; it's referred to internally as the threshold. Compaction takes place by reading those SSTables from disk and performing a k-way merge on them. The k-way algorithm is made possible because the SSTables are sorted on disk, so we are able to iterate through the input SSTables in order; compaction permanently uses this technique. Finally, the output of the merge is written to a single larger SSTable that will belong to the next level, of course assuming very few to no overwrites.

For the tiered policy, an SSTable at level zero will be compacted L-1 times until it reaches the last level, where L is the number of levels. The worst case occurs when updates to the keys are redundantly stored across all the components on disk, meaning all the updates to a given key but the latest one are wasted disk space, because the updates are redundantly stored throughout the components of the tree. It's a very important policy for write-heavy workloads where updates to a given key are not so frequent, allowing the read and space efficiency to be reasonable.

So now the ScyllaDB journey begins. The database inherited all the LSM-tree improvements described so far, but it turned out they were not enough. As explained previously, the input files can only be released on compaction completion, after it succeeds. Therefore, the temporary space overhead for compaction is the size of the input data, given that the output can be as large as the input. In the worst case, that means the space requirement is the size of the entire dataset, so users of the tiered policy are forced to leave 50 percent of disk space free. This makes it very expensive in terms of storage density.

What can we do about it? Remember the partitioning optimization applied to leveled? It turns out we can do the same for the tiered policy: we are going to partition the data of each SSTable into fixed-size fragments, allowing the data of each SSTable to be incrementally released during the compaction process, not after. When compacting the SSTables in level zero, we will release the exhausted fragments of the input SSTables as soon as their data is safely stored in the fragments of the output SSTable. With this approach, temporary disk space is now bounded by the fragment size, not the dataset size, so it bounds the temporary space overhead like it was done for the leveled policy. It allows disk utilization to go way beyond the previous limits that haunted us, to 80 percent and beyond. It's available in ScyllaDB as the Incremental Compaction Strategy; you can read more about it in the documentation.
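To illustrate the k-way merge at the heart of tiered compaction, and the fragment idea behind Incremental Compaction Strategy, here is a simplified Python sketch (not ScyllaDB's implementation): it merges K sorted runs with heapq.merge, reconciles duplicate keys so the newest version wins, and emits the output as fixed-size fragments. A real incremental scheme would also delete exhausted input fragments as soon as output fragments are sealed; this sketch only seals the output fragments.

import heapq
from itertools import groupby

def k_way_compact(runs, fragment_size=4):
    """Merge K sorted runs of (key, value) pairs, newest run first.

    For duplicate keys, the entry from the newest run wins (reconciliation).
    The merged output is emitted as fixed-size fragments.
    """
    # Tag each entry with the run's age so equal keys sort newest-first,
    # then let heapq.merge do the k-way merge over the sorted inputs.
    merged = heapq.merge(*[
        [((key, age), value) for key, value in run]
        for age, run in enumerate(runs)
    ])
    fragment, fragments = [], []
    for key, group in groupby(merged, key=lambda kv: kv[0][0]):
        _, value = next(group)            # first occurrence = newest version
        fragment.append((key, value))
        if len(fragment) >= fragment_size:
            fragments.append(fragment)    # "seal" an output fragment
            fragment = []
    if fragment:
        fragments.append(fragment)
    return fragments

# Two overlapping runs; K1 from the newer run (index 0) wins reconciliation.
new_run = [("K1", "V2"), ("K3", "V3")]
old_run = [("K1", "V1"), ("K2", "V0")]
print(k_way_compact([new_run, old_run]))
# -> [[('K1', 'V2'), ('K2', 'V0'), ('K3', 'V3')]]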
Was that enough? We didn't think so. So now let me represent the efficiency space of LSM-tree compaction policies as a grayscale spectrum, with space efficiency on one side and write efficiency on the other, leaving read efficiency out for simplicity; it's not strictly needed for what I'm going to explain next. At one side of this spectrum you have pure leveled, which brings the most space efficiency while sacrificing writes. At the other side you have pure tiered, which brings the most write efficiency while sacrificing space. But you know, the world is not black and white: there are shades of gray in between.

So we came up with a new hybrid LSM-tree data layout, which is the result of combining properties of those two policies, leveled and tiered, something in between, therefore a hybrid approach: the largest level is space optimized (it behaves more like leveled) and the others are write optimized (more like tiered). It addresses the space amplification problem of tiered for overwrite-intensive workloads. When compacting levels other than the largest one, the behavior is exactly as in the tiered policy; for example, two SSTables in level zero are written to a new SSTable in the next level, and unlike leveled, no merge is done with the pre-existing data in level one. The story is different when compacting into the largest level: leveled mode kicks in, and SSTables in the second-largest level are merged with the data in the largest level, which is a single SSTable run, exactly like in the leveled policy.

This hybrid approach allows compaction to dynamically adapt to the workload, so we have hybrid plus dynamic behavior in this mode. Under heavy write load, compaction works to meet the write latencies; otherwise, it optimizes for space efficiency. Under low write load, space efficiency can be almost as good as that of pure leveled, but as soon as the write load becomes heavy, the policy optimizes for writes, seeking a write amplification close to that observed in tiered, allowing the database to keep up with the write rate and minimize the impact on P99 latency. So the hybrid's write and space efficiency have boundaries at pure leveled and pure tiered, meaning the hybrid cannot be worse or better than them, but somewhere in between. The main advantage is that it allows write and space efficiency to fluctuate between those boundaries, creating a new opportunity that those primitive policies were not able to provide: reduced space amplification for overwrite-intensive workloads. Less write and space amplification means increased storage density per node and more money in your pocket. So that's great, right? It's available as the Space Amplification Goal option of Incremental Compaction Strategy; again, you can read more about that in our documentation.

Now let's talk about one of the most difficult problems in the LSM-tree domain, which is handling tombstones efficiently. A little bit of background: when deleting data in an LSM tree, the out-of-place nature is not disrespected. Say we want to delete key KA: a tombstone, which is a record of deletion, is written. The deleted data will only be purged once it's compacted together with the tombstone; until then, disk space is wasted by both the tombstone and the data it shadows.
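As a simplified illustration of that purge rule (my own toy model, not ScyllaDB's logic): a tombstone can only be dropped once the compaction is sure it has seen, and discarded, every older version of the key it shadows; here that certainty is approximated with a single flag.

def compact_with_tombstones(entries_newest_first, sees_all_shadowed_data):
    """Toy reconciliation with deletes.

    'entries_newest_first' holds (key, value_or_None) pairs, newest version
    first; a value of None marks a tombstone. A tombstone may only be purged
    when this compaction sees all the data it shadows (e.g. a cross-level
    garbage-collection compaction), modeled by 'sees_all_shadowed_data';
    otherwise it must be kept on disk.
    """
    latest = {}
    for key, value in entries_newest_first:
        latest.setdefault(key, value)          # newest version wins
    output = []
    for key, value in latest.items():
        if value is None:                      # a tombstone
            if not sees_all_shadowed_data:
                output.append((key, None))     # keep it: older data may live elsewhere
            # else: drop both the tombstone and the data it shadowed
        else:
            output.append((key, value))
    return output

entries = [("KA", None), ("KA", "V1"), ("KB", "V2")]   # newest first: KA was deleted
print(compact_with_tombstones(entries, sees_all_shadowed_data=False))
# -> [('KA', None), ('KB', 'V2')]   tombstone kept, shadowed V1 discarded
print(compact_with_tombstones(entries, sees_all_shadowed_data=True))
# -> [('KB', 'V2')]                 tombstone purged as well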
One of the problems still inherited from Apache Cassandra is that garbage collection happens on a single file of a single level at a time, which I would say is far from optimal. For example, if we try to collect the garbage in that file in level zero, the tombstone cannot be purged because the data shadowed by it is in level one, so both remain on disk until they are merged by the regular compaction procedure, which can take an indefinite amount of time. Not having an efficient garbage collection approach means that not only is space efficiency reduced, but also read efficiency, because the database has more work to do on the read path. With that in mind, ScyllaDB now implements a garbage collection mode that can merge data across levels, allowing tombstones to be purged efficiently and quickly. It builds on incremental compaction to make sure temporary disk space is bounded, allowing this cross-level compaction to take place without running out of disk space. We have gone beyond the improvements available in the literature, and we intend to keep improving our storage engine. Thanks a lot for listening, and have fun at the ScyllaDB Summit. Bye!
