After all the announcements and innovations presented at the recent ScyllaDB Summit (now available on-demand), veteran tech journalist George Anadiotis caught up with ScyllaDB CEO Dor Laor. George shared his analysis in a great article on ZDNet: Data is going to the cloud in real-time, and so is ScyllaDB 5.0.
George also captured the complete conversation in his podcast series, Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis, which is now in its third season.
Here’s the ScyllaDB podcast episode in its entirety:
And for those of you who prefer to read rather than listen, here’s a transcript of the first part of the conversation. The focus here is on what ScyllaDB has been up to since Dor and George last connected in 2020. We’ll cover the second part, focused on what’s next for ScyllaDB and NoSQL, in a later blog.
Update: Part 2 is now published.
In this blog:
- ScyllaDB’s Roots (0:23)
- What ScyllaDB Has Been Working on Over the Past Two Years (4:18)
- Change Data Capture and Event Streaming (20:28)
- Database Benchmarking (28:28)
The interview has been edited for brevity and clarity.
George Anadiotis: Could you start by saying a few words about yourself and ScyllaDB?
Dor Laor: I’m the CEO and co-founder at ScyllaDB, a NoSQL database startup. I’m now a CEO, but I’m still involved with the technology. Engineering is in my roots. My first job in the industry was to develop a terabit router, which was a super exciting project at the time. I met my existing co-founder, Avi Kivity, at another startup called Qumranet. That startup pivoted several times. The second pivot was around the open source Xen hypervisor, and then we switched to another solution that Avi invented, the KVM hypervisor. I believe that many are familiar with KVM. It powers Google Cloud and AWS…it was really a fantastic project. After that, Avi and I co-founded ScyllaDB.
There are a lot of parallels between that hypervisor startup and this database startup. With both, we had the luxury of starting after existing solutions were established, and we took an open source first approach.
We stumbled across Apache Cassandra many years ago in 2014, and we saw many things that we liked in the design. Cassandra is modeled after DynamoDB and Google Cloud Bigtable. But, we recognized that there was a big potential for improvement: Cassandra is written in Java and it’s not utilizing many modern optimizations that you can apply in operating systems. We wanted to apply our KVM experience to the database domain, so we decided to rewrite Cassandra from scratch. That was late 2014, and this is what we’ve been doing since. Over time, we completed implementing the Cassandra API. We also added the DynamoDB API, which we call Alternator. And we also added many additional features on top of that.
George Anadiotis: Great, thanks for the recap and introduction. As a side comment, I think your effort was the first one that I was aware of in which someone set out to reimplement an existing product, maintaining API compatibility and trying to optimize the implementation. After that, especially lately, I’ve seen others walking down this route as well. Apparently, it’s becoming somewhat of a trend, and I totally see the reasoning behind that.
George Anadiotis: The last time we spoke was about two years ago, around the time that you released ScyllaDB 4.0. Now, I think you’re about to release version 5.0. Could you tell us what you’ve been up to over those two years, on the technical side and also on the business side?
Dor Laor: Let’s start with the technical side, and we can also mix it up because many times, the technical side is a response to the business problem.
Two years ago we launched ScyllaDB 4.0, which achieved full API compatibility with Cassandra, including the lightweight transaction API, which used to have Paxos under the hood. We also completed the Jepsen test that certified that API. We were glad to have all of these abilities and we’re proud to be a better solution across the board, both with API compatibility and with performance improvements and operational improvements. Also, we introduced unique Change Data Capture on top of the Cassandra capabilities and we released the first version of Alternator, the DynamoDB API.
Ever since, we’ve continued to develop in multiple areas. We’ve continued to improve all aspects of the database. A database is such a complex product. After years of work, we’re still amazed about the complexity of it, the wide variety of use cases, and what’s going on in production.
Dor Laor: In terms of performance, we improved things like IO scheduling. We have a unique IO scheduler. We guarantee low latency for queries on one hand, but on the other hand, we have gigantic workloads generated from streaming, etc., so lots of data needs to be shuffled around. It’s very intense and that can hurt latency. That’s why we have an IO scheduler. We’ve been working on this IO scheduler for the past six years, and we’re continuing to optimize it.
Over time, we’ve made it increasingly complex. It matches underlying hardware more and more. That scheduler, for example, controls for every shard, every CPU core – we have a unique design of a shard per core architecture. Every shard and every CPU core is independent, and it’s doing its own IO for networking and for storage. That’s why every shard needs to be isolated from the other shards. If there is pressure on one shard, the other shards shouldn’t feel it.
Disks are built in such a way that if you do only writes, then you get one level of performance. If you do only reads, you’ll get a similar level of performance, but not identical. Several years back, we had a special cap for writes and for reads in the IO scheduler. But if you do mixed IO, then it’s not a simple function of mixing those two with the same proportion. It can be a greater hit with mixed IO.
We just released the new IO scheduler with better control over mixed IO – and most workloads have a certain amount of mixed IO. This greatly improves performance, improves the latency of our workloads and reduces the level of compactions, which is important for data stored in a log-structured merge tree. It’s also better with repair and streaming operations.
Dor Laor: Another major performance improvement is that we improved large partitions. In ScyllaDB, Cassandra, and other wide column stores, a partition is divided into many cells, many columns that can be indexed. But, it can reach millions of cells, or even more, in a single partition. It can be tricky for the database, and for the end user, to control these big partitions. So, we improved the indexing around these partitions and we cached those indexes.
We already had indexes; now, we added caching of those indexes. We basically solved the problem of large partitions. Cassandra had this problem; it’s half-solved in Cassandra, but it can still be a challenge. In ScyllaDB, it was half-solved as well. We have users with a hundred gigabyte partition (a single partition). We knew about those users because they reported problems. Now with the new solution, even a hundred gigabyte partition will just work, so all of the operations will be smoother.
Dor Laor: In addition to those performance improvements, we’ve also been working on plenty of operational improvements. The major one is what we call repair based node operations. So node operations are when you add nodes, you decommission nodes, you replace nodes. All of these operations need to stream data back and forth from the other replicas, so it’s really heavyweight. And after those operations, you also need to run repair, which basically compares the hashes in the source and the other replicas to see that the data matches.
The simplification that we added is called repair based node operations. So we’re doing repair, and repair fixes all of the differences and takes care of the streaming too. The first advantage is that there’s only one operation. It’s not streaming and repair, there’s just one repair. This means simplification and elimination of more actions. And the second advantage, which is even bigger, is that a repair is stateful and can be restarted. If you’re working with really large nodes, say with 30 terabytes nodes, and something happened to a node in the middle or you just need to reboot or whatever, then you can just continue from the previous state and not restart the whole operation and spend two hours again for nothing.
Dor Laor: Another major improvement is around consistency: our shift from being an eventually consistent database to an immediately consistent database.
Agreement between nodes, or consensus, in a distributed system is complicated but desirable. ScyllaDB gained Lightweight Transactions (LWT) through Paxos but this protocol has a cost of 3X round trips. Raft allows us to execute consistent transactions without a performance penalty. Unlike LWT, we’re integrating Raft with most aspects of ScyllaDB, making a leap forward in manageability and consistency.
Now, the first user-visible value is in the form of transactional schema changes. Before, ScyllaDB tracked its schema changes using gossip and automatically consolidated schema differences. However, there was no way to heal a conflicting Data Definition Language (DDL) change. Having transaction schema changes eliminates schema conflicts and allows full automation of DDL changes under any condition.
Next is making topology changes transactional using Raft. Currently, ScyllaDB and Cassandra can scale only one node at a time. ScyllaDB can utilize all of the available machine resources and stream data at 10GB/s, thus new nodes can be added quite quickly. However, it can take a long time to double or triple the whole cluster capacity. That’s obviously not the elasticity you’d expect. Transactional node range ownership will allow many levels of freedom, and we plan on improving more aspects of range movements towards tablets and dynamic range splitting for load balancing.
Beyond crucial operational advantages, end users will be able to use the new Raft protocol, gaining strong transaction consistency with zero performance penalty.
We’ve been working on this for quite some time, and we have more to come. We covered Raft in great length at ScyllaDB Summit – I encourage you to go and watch those presentations with much better speakers than myself. 🙂
George Anadiotis: If there was any doubt when you mentioned initially that you still like to be technical and hands-on, I think that that goes to prove it. Let me try and take you a little bit more to things that may catch the attention of less technically inclined people.
One important thing is Change Data Capture (CDC). I saw that you seem to have many partners that are actively involved in streaming: you work with Kafka, Redpanda, Pulsar, and I think a number of consultancies as well. The use case from Palo Alto Networks seemed very interesting to me because they basically said, “We’re using ScyllaDB for our streaming needs, for our messaging needs instead of a streaming platform.” I was wondering if you could say a few words on the CDC feature, how it has evolved, and how it’s being used in use cases.
Dor Laor: CDC, Change Data Capture, is a wonderful feature that allows users to collect changes to their data in a relatively simple manner. You can figure out what was written recently (for example, what is the highest score in the most recent hour) or consume those changes in a report without traversing through the entire data set. Change data capture is implemented in a really nice, novel way where you can enable change capture on a table and all of the changes within a certain period will be written to a new table that will have just these changes. The table will TTL itself; it erases itself after the period expires. We have client libraries that know how to consume this in several languages, which we developed over time. This is a really simple way to connect to Kafka, Redpanda, and others. Our DynamoDB API also implements the DynamoDB streaming in a similar way on top of our streaming solution.
You mentioned Palo Alto Networks. I’m not all that familiar with their details, but I think that they’re not using change data capture because they know more about the pattern of the data pattern that they use. They wanted to eliminate streaming due to cost and complexity. They manage their solution on their own, and they have an extreme scale: they have thousands of clusters, and each cluster has its own database and needs its own servers… so it’s expensive. You can understand why they want to eliminate so many additional streaming clusters.
They decided to use ScyllaDB directly, and they shared how they do it in their ScyllaDB Summit presentation. I believe that their pattern is such that instead of having ScyllaDB automatically create a change data capture table, they expect changes within a certain time limit, just recently, and they decided that they can just query those time changes on their own. I don’t know enough about the details to say what’s better, using off-the-shelf CDC or implementing this on your own. In their case, it’s probably possible to do it either way. Maybe they made the best decision to do it directly. If you know your data pattern, that’s the best. Usually, CDC will be implemented for users who don’t know what was written to the database. That’s why they need change data capture to figure out what happened across an entire half a petabyte data set.
George Anadiotis: Or potentially in cases where there is really no pattern in terms of how the data is coming in. It’s irregular, so CDC can help trigger updates in that scenario.
Dor Laor: Exactly. Otherwise, it’s really impossible other than to do the full scan. And even if you do a full scan, CDC allows you to know what was the previous value of the data – so it’s even more helpful than a regular full scan.
George Anadiotis: All right. Another topic that caught my eye from the ScyllaDB Summit agenda was around benchmarking. There was one presentation where the presenter was detailing the type of benchmarking that you do, how you did it, and the results. Could you summarize how the progress that you have outlined on the technical front has translated to the performance gains and what was the process of verifying that?
Dor Laor: So benchmarking is hard and we’re sometimes doing more than one flavor of benchmarking. We have an open source project that automates things. We need to use multiple clients when we hit the database and it’s important to compute the latency correctly. That tool automatically builds a histogram for every client combined in the right way.
We did several benchmarks over the past two years. At ScyllaDB Summit, we presented the process and results of benchmarking our database at the petabyte level. We also presented a benchmark that compared the i3, Intel’s x86 solution, with the Arm instances by AWS to figure out what’s more cost effective. Also, AWS has another instance family based on newer x86 machines: the i4is. The bottom line is that the new i4is are more than twice as performant than the i3s, so that’s a huge boost.
The second part of this blog series, covering ScyllaDB’s roadmap and Dor’s insights on what’s next for distributed databases, is now available.