Choosing the right compaction strategy is crucial for having optimal performance on a Scylla cluster. In this interview with Nadav Har’El of ScyllaDB, we will learn more about compaction and his upcoming talk at Scylla Summit 2017. Let’s begin the interview.
Please tell us about yourself and what you do at ScyllaDB?
My name is Nadav Har’El, and I have been a software developer at ScyllaDB for the last four years. We started by writing an operating system from scratch, named OSv, and I’m still the maintainer of that project. Today, however, we have shifted our focus to the Scylla database and the Seastar asynchronous programming framework that powers it, and I have contributed to both projects in many different areas. Before joining ScyllaDB, I enjoyed working on an eclectic range of software projects including virtualization, networking, information retrieval, computational linguistics, and scientific computing.
How did you get started in programming?
In 1985, when I was 10 years old, my family joined my father on sabbatical at Bell Labs in Murray Hill, New Jersey. This is where C and UNIX and a lot of other cool stuff had been invented. At Bell Labs, my father got a home UNIX terminal and books to learn UNIX and C from. I was fascinated and began teaching myself C and UNIX using his books and by experimenting with random commands from /bin. In the beginning, it wasn’t easy. For example, after a few months, one wrong command resulted in me deleting half my files. But today, 32 years later, I think I have a pretty good handle on both C and UNIX.
I also learned the C++ language, in which Scylla is written, fairly early, in 1988. However, unlike C, which has changed only a little in three decades, C++ has changed almost beyond recognition over the same period. At ScyllaDB we tend to use last month’s (or next month’s!) C++ features, so it’s been a real roller-coaster ride keeping up with the changing C++ standard. But as you probably know, roller-coasters are fun!
Another important lesson I learned from the Bell Labs mainframe was the usefulness of source code. I had access not only to the running UNIX system, but also to its full source code. At some point I started not only to write my own silly programs, but also to look at the source code of various system software and try to improve it (or at least I thought I was improving it…). Following this experience, I always looked for the source code of the software I was using, and unsurprisingly I became a big fan of free software (a.k.a. Open Source software). In 1993, when Linux 0.99 was finally able to run the X Window System well, I switched our home PC from AT&T SVr4 to Linux. I have been using Linux and programming on it ever since.
What will you be talking about at Scylla Summit 2017?
In my talk I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed based on the lessons learned from the existing strategies, picking their best features while avoiding their problems.
What type of audience will be interested in your talk?
Users of Scylla and Apache Cassandra know (and if they don’t, this talk will vividly show them) that picking the wrong compaction strategy can absolutely ruin their workload’s performance. My talk will help these users better understand the tradeoffs between the different compaction strategies and the pitfalls of each. It will also introduce a new compaction strategy unique to Scylla, which will be the best choice for a wider variety of workloads, so I’m sure users will be happy to learn about it.
How does Scylla do compaction differently from Apache Cassandra?
Apache Cassandra’s four traditional compaction strategies (Size-Tiered, Leveled, Date-Tiered, and Time-Window) have been implemented in Scylla using the same heuristics that Apache Cassandra uses. Making the same heuristics available to users allows those already familiar with these compaction strategies to switch more easily from Apache Cassandra to Scylla. Additionally, these four compaction strategies have proven themselves genuinely useful for different workloads, as I’ll demonstrate in my talk, so we wanted to support them. As I mentioned above, Scylla also implements a fifth, new compaction strategy, which is not available in Apache Cassandra.
But beyond the compaction strategy (which SSTables to compact and when), there is a more fundamental difference between the way Apache Cassandra and Scylla perform the compaction itself. In Apache Cassandra, the rate of compaction is tuned by the user, who needs to control its concurrency and throughput. If compaction is too quick, query performance during compaction goes down. If compaction is too slow, SSTables start to pile up and affect read performance. Moreover, the compaction process increases the tail latency of queries performed in parallel with the compaction. Scylla, on the other hand, emphasizes what we call “workload conditioning”, and the rest of the world calls “automatic tuning” or “zero configuration”. Once the compaction strategy decides that some SSTables should be compacted, the user does not need to control the pace of that compaction. Rather, Scylla automatically picks the best pace: the “Goldilocks” pace (not too slow and not too quick). At the same time, Scylla breaks up the background compaction into small pieces of work to ensure that request latency, even tail latency (99th or higher percentile), does not go up when compaction is in progress.
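To make the pacing tradeoff described above concrete, here is a toy Python sketch. It is only an illustration under stated assumptions, not Scylla's or Apache Cassandra's actual algorithm: the function names (`simulate`, `fixed_rate`, `proportional`), the rates, and the idea of pacing compaction in proportion to the backlog are all hypothetical. It compares a hand-picked compaction rate that is slightly too slow with a pace that adapts to the amount of pending work.

```python
def simulate(flush_rate, compaction_rate_fn, steps):
    """Toy model: return the SSTable backlog remaining after each time step."""
    backlog = 0.0
    history = []
    for _ in range(steps):
        backlog += flush_rate  # new data flushed to disk this step
        # Compact as much as the chosen pace allows, never more than exists.
        work = min(backlog, compaction_rate_fn(backlog))
        backlog -= work
        history.append(backlog)
    return history


def fixed_rate(rate):
    """Hypothetical user-tuned pace: the same amount of work every step."""
    return lambda backlog: rate


def proportional(gain):
    """Hypothetical self-tuning pace: work harder as the backlog grows."""
    return lambda backlog: gain * backlog


# A fixed pace set slightly below the write rate: the backlog grows forever.
too_slow = simulate(flush_rate=4.0, compaction_rate_fn=fixed_rate(3.0), steps=50)

# A backlog-driven pace: the backlog levels off without any manual tuning.
adaptive = simulate(flush_rate=4.0, compaction_rate_fn=proportional(0.5), steps=50)

print(f"fixed pace backlog after 50 steps:    {too_slow[-1]:.1f}")
print(f"adaptive pace backlog after 50 steps: {adaptive[-1]:.1f}")
```

In this model the fixed pace falls one unit further behind on every step, while the backlog-driven pace settles at a steady level where compaction work matches the incoming write rate.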
How can people get in touch with you?
Feel free to write me with any questions about my talk, compaction, or Scylla in general. If you are really interested in Scylla and its development, I would recommend that you join me on the Scylla developers mailing list where all the Scylla developers hang out and would be happy to answer your questions.
Thank you very much, Nadav. We cannot wait to see your talk in person and learn more. If you want to attend Scylla Summit 2017 and enjoy more talks like this one, please register here.
Scylla Summit is taking place in San Francisco, CA on October 24-25. Check out the current agenda on our website to learn about the rest of the talks—including technical talks from the Scylla team, the Scylla roadmap, and a hands-on workshop where you’ll learn how to get the most out of your Scylla cluster.