
How to Choose the Right Compaction Strategy for Workload Performance


Choosing the right compaction strategy is crucial to getting optimal performance from a ScyllaDB cluster. In this interview with Nadav Har’El of ScyllaDB, we will learn more about compaction and his upcoming talk at ScyllaDB Summit 2017. Let’s begin the interview.

Please tell us about yourself and what you do at ScyllaDB.

My name is Nadav Har’El, and I have been a software developer at ScyllaDB for the last four years. We started by writing an operating system from scratch, named OSv, and I’m still the maintainer of this project. But we have since shifted our focus to the ScyllaDB database and the Seastar asynchronous programming framework which powers it, and I have contributed to both projects in many different areas. Before joining ScyllaDB, I enjoyed working on an eclectic range of software projects including virtualization, networking, information retrieval, computational linguistics, and scientific computing.

How did you get started in programming?

In 1985, when I was 10 years old, my family joined my father on sabbatical at Bell Labs in Murray Hill, New Jersey. This is where C and UNIX and a lot of other cool stuff had been invented. At Bell Labs, my father got a home UNIX terminal and books to learn UNIX and C from. I was fascinated and began teaching myself C and UNIX using his books and by experimenting with random commands from /bin. In the beginning, it wasn’t easy. For example, after a few months, one wrong command resulted in my deleting half my files. But today, 32 years later, I think I have a pretty good handle on both C and UNIX.

I also learned the C++ language, in which ScyllaDB is written, fairly early, in 1988. However, unlike C, which has changed only a little in three decades, C++ has changed almost beyond recognition over the same period. In ScyllaDB we tend to use last month’s (or next month’s!) C++ features, so it’s been a real roller-coaster ride keeping up with the changing C++ standard. But as you probably know, roller-coasters are fun!

Another important lesson I learned from the Bell Labs mainframe was the usefulness of source code. I had access not only to the running UNIX system, but also to its full source code. At some point I started to not only write my own silly programs, but also to look at the source code of various system software and try improving it (or at least I thought I was improving it…). Following this experience, I always looked for the source code of the software I was using, and unsurprisingly I became a big fan of free software (a.k.a. Open Source software). In 1993, when Linux 0.99 was finally able to run the X Window System well, I switched our home PC from AT&T SVr4 to Linux. I have been using Linux and programming on it ever since.

What will you be talking about at ScyllaDB Summit 2017?

In my talk I will present the different compaction strategies that ScyllaDB provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed based on lessons learned from the existing strategies, keeping their best features while avoiding their problems.

What type of audience will be interested in your talk?

Users of ScyllaDB and Apache Cassandra know (and if they don’t, this talk will make it vividly clear) that picking the wrong compaction strategy can absolutely ruin their workload’s performance. My talk will help these users better understand the tradeoffs involved between the different compaction strategies and the pitfalls of each. It will also introduce a new compaction strategy unique to ScyllaDB, which will be the best choice for a wider variety of workloads, so I’m sure users will be happy to learn about it.

How does ScyllaDB do compaction differently from Apache Cassandra?

Apache Cassandra’s four traditional compaction strategies, Size-Tiered, Leveled, Date-Tiered, and Time-Window, have been implemented in ScyllaDB using the same heuristics that Apache Cassandra uses. Making the same heuristics available to users allows those already familiar with these compaction strategies to more easily switch from Apache Cassandra to ScyllaDB. Additionally, these four compaction strategies have proven themselves genuinely useful for different workloads, as I’ll demonstrate in my talk, so we wanted to support them. As I mentioned above, ScyllaDB also implements a fifth, new compaction strategy, which is not available in Apache Cassandra.

But beyond the compaction strategy (which SSTables to compact and when), there is a more fundamental difference between the way Apache Cassandra and ScyllaDB perform the compaction itself. In Apache Cassandra, the rate of compaction is tuned by the user, who needs to control its concurrency and throughput. If compaction is too quick, query performance during compaction goes down. If compaction is too slow, SSTables start to pile up and affect read performance. Moreover, the compaction process increases the tail latency of queries performed in parallel with the compaction. ScyllaDB, on the other hand, emphasizes what we call “workload conditioning”, and the rest of the world calls “automatic tuning” or “zero configuration”. Once the compaction strategy decides that some SSTables should be compacted, the user does not need to control the pace of the compaction. Rather, ScyllaDB automatically picks the best pace, the “Goldilocks” pace (not too slow and not too quick). At the same time, ScyllaDB breaks up the background compaction into small pieces of work to ensure that request latency, even tail latency (99th or higher percentile), does not go up when compaction is in progress.
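To make the idea of "small pieces of work" concrete, here is a toy Python sketch. It is not ScyllaDB's implementation (ScyllaDB is written in C++ on top of Seastar); the function names, the `slice_size` parameter, and the in-memory row format are all assumptions made up for illustration. It merges sorted SSTables, keeps the newest row per key, and hands back the result in small batches so that a scheduler could serve queries between batches.

```python
import heapq
from itertools import groupby, islice


def compact_in_slices(sstables, slice_size=2):
    """Merge sorted SSTables and yield the result in small batches.

    Hypothetical model: each SSTable is a list of (key, timestamp, value)
    tuples sorted by key. For every key, only the row with the newest
    timestamp survives the compaction.
    """
    # Lazily merge all inputs into one stream ordered by key.
    merged = heapq.merge(*sstables, key=lambda row: row[0])
    # Collapse each run of equal keys to its newest row.
    newest = (
        max(rows, key=lambda row: row[1])
        for _, rows in groupby(merged, key=lambda row: row[0])
    )
    while True:
        batch = list(islice(newest, slice_size))
        if not batch:
            return
        # Yielding here is the point where foreground work could run.
        yield batch


def run_compaction(sstables, slice_size=2):
    """Drive the compaction to completion and count the slices used."""
    output, slices = [], 0
    for batch in compact_in_slices(sstables, slice_size):
        output.extend(batch)
        slices += 1
    return output, slices


if __name__ == "__main__":
    older = [("a", 1, "x"), ("c", 1, "y")]
    newer = [("a", 2, "z"), ("b", 2, "w")]
    rows, slices = run_compaction([older, newer], slice_size=2)
    print(rows, slices)
```

A smaller `slice_size` means more frequent yield points and therefore less time that any single piece of compaction work can delay a query, at the cost of more scheduling overhead. In this sketch the slice size is fixed by the caller; picking the pace automatically, as described above, is the part this toy does not attempt to model.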

How can people get in touch with you?

Feel free to write to me with any questions about my talk, compaction, or ScyllaDB in general. If you are really interested in ScyllaDB and its development, I would recommend that you join me on the ScyllaDB developers mailing list, where all the ScyllaDB developers hang out and will be happy to answer your questions.

Thank you very much, Nadav. We cannot wait to see your talk in person and learn more.

ScyllaDB Summit is taking place in San Francisco, CA on October 24-25. Check out the current agenda on our website to learn about the rest of the talks—including technical talks from the ScyllaDB team, the ScyllaDB roadmap, and a hands-on workshop where you’ll learn how to get the most out of your ScyllaDB cluster.