P99 CONF is the event on all things performance. Join us online Oct 23-24 — Registration is free

Making the Most Out of ScyllaDB’s Awesome Concurrency at Optimizely

Brian Taylor20 minutes

How Optimizely takes full advantage of ScyllaDB's concurrency while also guaranteeing correctness and protecting quality of service.

Share this

Video Slides

Video Transcript

Hello, I am Brian Taylor, principal engineer at Optimizely and I’m excited to share with you today how at Optimizely we take great advantage of ScyllaDB’s incredible concurrency. First a little bit about me, I’m married, I have three very young children and the course of my career I’ve had the opportunity to create two programming languages and two databases for legitimate business reasons.

I point that out because all of my computer science professors promised me no one ever does that I’m very happy that they were wrong that’s a blast the reason they were wrong in my case is the third point I love finding some property in the solution space that if you can maintain it everything gets simpler that is the reason that one of those languages in one of those databases exists

so what we’re going to talk about today is how much Cilla loves concurrency and once I’ve driven that home and it’s shown you kind of the boundaries of that love I want to teach you how to find more concurrency in your problem so let’s talk about concurrency as we talk through concurrency I want you to hold this conceptual model in your head so this is the life cycle of a right a client makes a request via the driver and starts the clock named r that request enters a queue traverses the network eventually arrives at Cilla enters another queue eventually Cilla picks it up and starts the timer s it doesn’t work eventually commits that right to disk and then it stops the timer s drops the answer in another queue the answer traverses the network arrives at the client queue the client picks it up and actually gives the response back to your client and that stops the timer

so let’s zoom out from that conceptual model look at it more abstractly big idea to keep in your mind is about is R is about 10 times longer than S in time so asking Cilla to do more work concurrently looks like the picture below and the payoff is as we Stack Up concurrent requests that round trip time that is making up a lot of Art gets to parallelize even if Cilla executes s sequently it’s kept it busy more often and in the limit as concurrency approaches Infinity our theoretical effective average our time gets closer and closer to S remember R is 10 times bigger than S so that is awesome

let’s define some terms first term throughput that is measured in number of things per second you’re doing this is probably the number you care most about in your application second definition concurrency this is the number of independent things that are happening at the same time that is the tool that this talk is teaching you how to use to influence throughput service Time s is how long solidity took to do a thing and request time is how long it took to do that thing from the client’s perspective so from the previous mental model R is S Plus some round trip time in a world where you only do one thing at a time your client makes requests waits for an answer to come back the best throughput you can hope for is one over R that’s the pure sequential world if you’re willing to introduce concurrency Into Your solution X becomes a lot more interesting I’m going to talk about that more but to set the stage to talk about that more we need to talk about load testing classical load testing is what’s called closed loop load testing the model of the world is we stand up some number of virtual users they all make requests they wait for the answer to come back and when the answer comes back they make another request this is very common this is the model that syllabinch uses I claim it is not very useful for modern capacity planning and I’m going to flesh out that argument over the next few slides but it is very useful for applying a tool called the universal scaling law so Universal scaling law was discovered by Dr Neil Gunther it’s a generalization of amdel’s Law and it says that any practical system it predicts will have a relationship between concurrency in and throughput X that looks like this plot this plot is built from real measurements from a real small silver cluster so the universal scaling wall the USL predicts that there is a region of concurrency that gives you linear Improvement in throughput that’s the left region eventually any practical system will saturate some real component and that is the saturation region which is the middle part of the plot and most realistic systems eventually reach a point where driving them with more concurrency actually hurts your throughput and that is the retrograde region now again this is the USL plotted against Real solid DB data all right so I claim type of load testing you should be doing for capacity planning is open loop load testing in contrast to closely testing where you choose a number of users and they just make requests one after another as soon as the answer comes back open loop load testing models the load as a constant throughput meaning at all times it’s just driving your system with a thousand requests per second or something like that in this world concurrency is theoretically completely unbounded but I claim this is more useful because you probably all have internet applications does anyone know how many users they have I have no idea closely test closed loop load testing made sense in a world where you were planning the size of a Mainframe and all of the users you could have were in the room next door that’s not the world we live in these two models are related so closed loop load testing is choose end choose concurrency that is the x-axis of the plot you see on the right open loop load testing is choose X that is the y-axis of the plot on the right that distinction is interesting because the USL predicts that the performance of your system is a function of n meaning a function meaning there is one value for any n but it is not a function of X notice that there are x’s in this plot that have multiple valid values for n and that is predicting some instability in your system if it is driven by a load that looks like an open loop load which is what the internet is driving your system with

so let’s do some open loop load testing this is that same ScyllaDB cluster that I showed you the USL plotted against being driven with an open loop load generator at 4000 operations per second and then jumping up to 40 000 operations per second what’s really interesting about this picture you see is that s the service time of ScyllaDB didn’t change even though I increased throughput by a factor of 10. that’s interesting also note that R the request time experienced by the client didn’t change much 1.5 milliseconds to 2.5 milliseconds even though I increased throughput by a factor of 10. note also that the band of concurrency is pretty tight we’re operating in the linear region of Silas scalability the key property that makes this region linear is the fact that we still have excess SSD throughput available so remember under the hood ScyllaDB is prioritizing two workloads one is the compaction workload that’s constantly running in the background keeping your data nice and tight then the other is the flush workload and the foreground committing your rights to disk if there’s idle space left as those two workloads are competing the system is nice and stable you’re in the linear region you should take full advantage of the linear region of scaling Priscilla in your design

now let’s talk about the saturation region so the USL predicted that my cluster would saturate at 100 or 100 000 operations per second this test is driving at 100 000 operations per second and it got lucky it actually is stable in this particular one notice even though it’s stable s according to ScyllaDB is increasing over time and is way bigger than what we observed in the linear region notice also that R is ten times bigger what we saw in the linear region and concurrency is covering a huge band it’s all over the map this is exactly what the USL was predicting running up in the layer running up at the saturation region there’s now many valid values for concurrency performance will be everywhere so engineer your system to stay out of saturation that’s very important now again remember the mechanism for being in linear versus saturation is how much capacity how much throughput is left in your SSD so saturation now the SSD is pegged compaction and foreground flushing are consuming all of the available throughput if either of those things shift around a little bit it affects the other and we have instability for example this is exactly the same test just run again and now we have instability this is 100 000 records per second this time something perturbed the system maybe a compaction needed a little more throughput s the amount of time still it takes to do its work jumped way off the map it’s now a thousand times larger than it was when we were in the linear region performance and r went from remember one to two milliseconds in the linear region it’s now 55.8 seconds that’s incredible concurrency has also shot way up I actually artificially pegged it at 4096. the first time I ran this test and hit this instability my test consumed all available memory running more and more things concurrently and crashed so I had to force my system to stop scaling after 4096 concurrent operations

we have now entered the dreaded retrograde region this is the place where as we ask for more throughput by increasing our concurrency we get less so the more we push the less we get we’re on the road to hell if there is not some limiting mechanism in your system your system will consume all resources and fail usually with an out of memory okay what did we learn from all of that first of all ah has a awesome linear region and if you can operate if there is more capacity in that linear region and you have more work to do you should do it that is free throughput take full advantage of the linear region it is awesome saturation is a very dangerous place because if you’re running at saturation you’re one small system perturbation from being in retrograde which again is the room to help so stay out of saturation stay in the linear region so where are we going to get all of that concurrency from hopefully I’ve convinced you that Sila can make great use of the concurrency within some boundaries but where do you get it there’s a boring answer to that question which is concurrency just use threads that’s boring but there is a nugget of value here so there are cheap mechanisms for concurrency and there are expensive mechanisms for concurrency so for example a Java thread costs about a megabyte of Stack low hundreds of threads in Java or reasonable they’re cheaper and go but they’re darn cheap if you use a reactor-style concurrency system like c-star or rusts Tokyo in that regime tens of thousands of things happening concurrently is perfectly reasonable and feasible but that is the extent of the code answer being interesting to this question a far more interesting answer to where does concurrency come from is data dependency data dependency is the relationship between a unit of work and the previous units of work and they didn’t dependency is ultimately what completely decides what it’s safe to do concurrently and what it’s not safe to do concurrently so the easy case is when you have no data concurrency or no data dependency classic example of that is a write-only workload where every right is to a different slot in that world you can do it as concurrently as you want to it doesn’t change the answer

a more challenging world is when you have partitionable concurrency that means there’s some aspect of the work you’re trying to do where if you group certain things by say a key a and other things with a key B it’s safe to do things with key a and key B at the same time but everything within a needs to be sequential and everything within B needs to be sequential a much harder world to operate in and what I want to dig into in this talk is arbitrary happens before concurrency and that’s that’s the world where you need to actually understand the operations themselves and find ways that you can be safely concurrent let’s define some more terms first a command is an atomic unit of work it’s one thing your system is being asked to do it contains iops iops are work according to the database it’s a read or a write a batch is a group of iops that it is safe to execute concurrently [ __ ] is a cubby for data it’s a place you put data it has a name in somebody B it’s your partition key plus your clustering key a concurrency strategy is correct if the the it groups iops into batches such that the final state of the system with respect to the slots is the same as what the system would have been if you executed all of the commands sequentially

so in the world where there’s no data dependency your concurrency strategy isn’t reads doesn’t matter what you do group your commands into batches however you want so in this example the line I drew back between command of three and command four to divide my batches was arbitrary could have put it anywhere and the final result will be correct

a more difficult world is where you’re doing read modify write operations this is a world where you need to read an old value from a slot do something and then write a value back either to that slot or to a different slide this is the problem that most streaming platforms are solving when they use partition for currency in this world again we’ve got commands coming in each command has a partition in this world I’m using in this example I’m using tenet as my partitioning key there’s two partitions A and B and I can execute commands that are coming from tenant B concurrently with commands coming from 10 day but everything from 10 today means to happen sequentially if you want to do better than this you need to start digging into the iops themselves and to do that we need to understand the golden rules of data dependency and that is a read from a slot must be able to observe all of the rights to that slot that happened before and the second rule is that the final value of a slot must represent the last right to that slide so if we apply those rules remember the previous example I showed you with strict partition concurrency had four batches we can get that down to three so if we apply the golden rules we recognize that re-day read C read D those don’t interact they can be concurrent they can happen at the same time however write B in the First Command and right B in the second command you need to sequence so that the final right B is the one that impacts the final state we can take this even further though because all of the golden rules imply simplification rules yeah so you’ve got the golden rule that reads must be able to observe any rights that came before them we don’t have to observe those reads from the database we could for example synthesize those REITs from the batch itself the second Golden Rule says that the final value of a slot needs to represent the last right to that slot in the sequence we don’t have to write we don’t have to actually execute the rights before the last right we could just skip those and still observe the final Golden Rule so let’s apply that rule to our problem no you know now we’ve eliminated another batch because we’ve recognized we don’t actually have to sequence those rights between command 1 and command two if we just delete the right and command one and that doesn’t change the result so that’s really cool we went from four batches to two batches with far more concurrency what’s the payoff so I actually run in production two systems that use radically different concurrency models the first one uses normal partition concurrency it is a classic streaming platform it has it’s highly resourced at 64 cores its maximum throughput is about 8 000 commands per second and it has a flaw that torments us constantly that we call the data shape problem but in reality it’s the fact that partition cardinality changes over time if you get more partitioning and your partition key has more cardinality at one point in time you have more concurrency if it has less you get less and your throughput’s a function of concurrency the second system much less resourced 15 cores versus 64 and it uses happens before concurrency it actually inspects the iops and understands how tightly it can pack things it uses cheaper concurrency reactor versus threads and with those less resources it can push a throughput of 30 000 records per second what’s awesome about this system what I love the most isn’t the throughput it’s the fact it’s not sensitive to partition key cardinality you know so drive that home in a recent Black Friday we discovered that carnality can jump extremely high and in fact push us into the retrograde region and now we have a new limiter that we didn’t know we needed to keep us out of the retrograde region relying on partition Keys is very scary so I hope you’ve learned something valuable from this talk and I hope you stay in touch now there’s lots of ways to get a hold of me I’d love to chat with people um I hope you found this valuable have a great day

Read More