In the run-up to Scylla Summit 2018, we’ll be featuring our speakers and providing sneak peeks at their presentations. The first interview in this series is with ScyllaDB’s own Glauber Costa.
Glauber, your talk is entitled “Keeping Your Latency SLAs No Matter What!” How did you come up with this topic?
Last year I gave a talk at Scylla Summit where I unequivocally stated that we view latency spikes as a bug. If they are a bug, that means we should fix them, so I also talked about some of the techniques that we used to do that.
But you know, there are some things that are by design never truly finished and the more you search, the more you find. This year my team and I poured a lot more work into finding even more situations where latency creeps up, and we fixed those too. So I figured the Summit would be a great time to update the community on the improvements we have made in this area.
What do you believe are the hardest elements to maintain in latency SLAs?
The hardest part of keeping latencies under control is that events that cause latency spikes can occur anytime, anywhere. For Java-based systems we are already familiar with the infamous garbage collection, which Scylla gracefully avoids by being written in C++. But the hardware can introduce latencies, the OS kernel can introduce latencies, and even in the database itself they can come from the most unpredictable of places. It’s a battle against everyone.
What is a war story you’re free to share of a deployment that just wasn’t keeping its SLAs?
That’s actually a good question and a nice opportunity to show that SLA violations are indeed complex beasts that can come from anywhere. We have a customer that had very strict SLAs for their p99 latency, and those weren’t always being met. After much investigation we determined that the source of those latencies was not Scylla itself, but the Linux kernel (in particular the XFS filesystem). Thankfully we have on our bench a lot of people who know Linux deeply, having contributed to the kernel for more than a decade. We were then able to understand the problem and work on a solution ourselves.
That’s interesting! Is that the kind of thing people are expected to learn by coming to the talk?
Yes, I will cover that. I want to show people a 360° view of the work we do in Scylla to keep latencies low and predictable. Not all of that work is done in the Scylla codebase. The majority of course is, but it ends up sprawling down all the way to the Linux kernel. But this won’t be a Linux talk! We have many interesting pieces of technology that help us keep our latencies consistently low in Scylla, and I will be talking about them as well. For example, we redesigned our I/O Scheduler, finalized our CPU Scheduler, added full controllers for all compaction strategies, and took a methodical approach to finding sources of latency spikes and getting rid of them. It will certainly be a very extensive talk!
Thanks Glauber! We’re looking forward to your talk at the Summit!
It’s my pleasure! By the way, if anyone reading this article hasn’t registered yet, you can register with the code glauber25monster to get 25% off the current price.