Lee Atchison – software architect, podcast host, and author of Architecting for Scale (O’Reilly Media) – recently hosted ScyllaDB co-founder and CEO Dor Laor on the Software Engineering Daily podcast. After kicking it off by talking about the origins of ScyllaDB and its “sweet spot” of solving problems at scale, the conversation took a natural turn toward why and when teams reach the tipping point on SQL vs NoSQL databases.
Advice for Teams Debating SQL vs NoSQL
Lee Atchison (LA): The database storage space is an interesting space. “What type of database to use” is a fundamental decision that must be made very early in product development. You need to decide whether you need a SQL-based database or a NoSQL database – and then key-value, wide table, whatever type of structure you need – quite early, often before you really know your scaling needs. What advice do you give to someone who’s trying to decide if they should build their application using SQL or NoSQL?
Dor Laor (DL): At the end of the day, it’s about the workload and your anticipated usage patterns. If the workload isn’t that big and it can fit SQL types of databases, go ahead and pick SQL. They’re the best databases. They give the best flexibility and the best query capabilities. If the amount of data isn’t large, that’s the easiest choice.
Beyond the amount of data, you also want to consider the latency, throughput, and high availability requirements. If you need active-passive, maybe relational databases can still work. If you need active-active and you need to sustain operations if a region goes down, then SQL is usually not the best solution.
If you’re planning to compete with the Discords, Spotifys, and Disneys of the world, then you’d better plan for having a 24/7 mission-critical application with lots of data and lots of users. If you’re a small startup doing credit card transactions, you might be OK with SQL. But when fraud detection comes into play, you might need NoSQL for that use case.
If you need scaling from the very beginning, then go ahead and pick the most scalable database you need. If initially you’re more concerned with the speed of development and you don’t even have product-market fit yet, you might start in one direction then later shift to another. For example, Discord (one of our users) started off with MongoDB because they needed the agility. Once business picked up, they moved to Cassandra (back then, we weren’t yet GA). A couple of years later, they moved to ScyllaDB. It is possible to change while you’re growing.
Where’s the Tipping Point for SQL vs NoSQL?
LA: Yes, you’re right. You certainly can change technologies while you grow. But moving from MongoDB to Cassandra to ScyllaDB is a lot easier than moving from, let’s say, MySQL to Cassandra or MySQL to ScyllaDB because it’s a fundamentally different model in how you think about data, and how it works. I love what you said about active-active versus active-passive. I liked that because that’s a very concrete thing that you can evaluate at the beginning of your lifecycle and say, yes active-active is going to be important to us someday, so we want to build an architecture that allows us to easily have an active-active multi-region setup. That leads you in a certain direction, in a direction towards someone like ScyllaDB.
But I think what some people struggle with is high-performance, high-scalability. Yes, NoSQL is better for that. But how much better? My application is going to be very, very, very big someday, but I don’t know how big, and I don’t know exactly when. Can you say that NoSQL really starts to shine if you’re above this size of an application, or this size of a data set, or this size of a number of transactions per second, et cetera? Or is that just too open ended of a question?
DL: I’ll try. Number one is the volume. With relational, most of the time, you’re going to fit it into a single machine that needs to have a certain amount of volume. Let’s say, a terabyte, or 10 terabytes. Even if the machines have more capacity, it’s really hard for a relational database to deal with more than a terabyte, roughly speaking. Again, every object and every partition can be different. But more than a terabyte.
You can also think about whether your data is shardable or not. Let’s say you have multiple clients, and you don’t really care to cross-match all of the different clients. All of your queries will always have a single client ID, and you won’t join multiple clients together. Then you can say, Okay, I need my biggest client to fit into a single computer. That’s also possible. But we can also see it as sometimes a single client becomes worse, and they have sub-clients. So, it can get trickier and trickier over time.
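To make the sharding idea above concrete, here is a minimal Python sketch, not anything taken from the interview. It assumes every query carries a single client ID, routes each client to one shard with a stable hash, and checks whether the largest client still fits on one machine. The function names, the shard count, and the capacity figures are all hypothetical.

```python
import hashlib
from typing import Mapping


def shard_for_client(client_id: str, num_shards: int) -> int:
    """Map a client ID to a shard index using a stable hash.

    Assumption: queries never join across clients, so all of a
    client's rows can live on the same shard.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def largest_client_fits(client_sizes_gb: Mapping[str, float],
                        node_capacity_gb: float) -> bool:
    """Return True if the biggest client fits on a single node."""
    return max(client_sizes_gb.values(), default=0.0) <= node_capacity_gb


if __name__ == "__main__":
    sizes = {"acme": 200.0, "globex": 900.0}  # hypothetical sizes in GB
    print(shard_for_client("acme", 4))
    print(largest_client_fits(sizes, node_capacity_gb=1000.0))
```

The second check is the one that tends to fail over time: once a single client outgrows a node, client-level sharding alone no longer helps, which is the "trickier and trickier" situation described above.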
One terabyte and, let’s say, 10,000 to 50,000-ish operations per second is good for relational, and anything above it is good for NoSQL. We have small customers who use tens of thousands of operations per second and bigger ones who can do multimillion operations per second and keep petabytes.
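The rough thresholds mentioned in this conversation can be written down as a small Python sketch. This is only an illustration of the rule of thumb, not a sizing tool: the class and function names are hypothetical, the cutoffs use the upper end of the quoted "10,000 to 50,000-ish" range as an assumption, and real workloads will vary with schema and access patterns.

```python
from dataclasses import dataclass

# Assumed cutoffs, taken loosely from the figures quoted in the interview.
RELATIONAL_MAX_TB = 1.0
RELATIONAL_MAX_OPS_PER_SEC = 50_000


@dataclass
class Workload:
    data_tb: float                     # anticipated data volume in terabytes
    ops_per_sec: int                   # anticipated peak operations per second
    needs_active_active: bool = False  # must survive the loss of a region


def suggest_database_family(w: Workload) -> str:
    """Return "sql" or "nosql" according to the rough rule of thumb."""
    if w.needs_active_active:
        # Active-active multi-region is described as a poor fit for SQL.
        return "nosql"
    if w.data_tb > RELATIONAL_MAX_TB:
        return "nosql"
    if w.ops_per_sec > RELATIONAL_MAX_OPS_PER_SEC:
        return "nosql"
    return "sql"


if __name__ == "__main__":
    print(suggest_database_family(Workload(data_tb=0.2, ops_per_sec=5_000)))
    print(suggest_database_family(Workload(data_tb=40.0, ops_per_sec=2_000_000)))
```

A workload that is easy to shard by client could reasonably stay relational above these numbers, so the output is best read as a starting point for discussion.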
Blurring the Lines with SQL and NoSQL: NewSQL and Distributed SQL
LA: To be clear, and you’ve been very clear here, it depends on the application, it depends on the use case, and there are lots of variables. I was pushing for numbers simply because that’s what people want. But these numbers vary enormously. Are there ever cases where you would consider using SQL for an extremely large application with large amounts of data and lots of transactions? Or alternatively, consider using a NoSQL database for a very small application?
DL: You could use SQL for a large application that you can easily shard – that’s example number one. The size wouldn’t necessarily matter and you’ll be able to do multiple shards. Also, queries may be relatively cheap, even without sharding, depending on the data set and your schema. There’s also NewSQL and Distributed SQL. We compared ScyllaDB NoSQL vs a hot Distributed SQL option – a great technology and product. We tested it versus ScyllaDB to figure out what Distributed SQL can do. We used the same three nodes, the same hardware, and we compared ScyllaDB and that other technology using one billion keys. The other database just crashed when we used one billion. So, we reduced that to 100 million for the other database and still had ScyllaDB using one billion. Then, we compared the two. Even though ScyllaDB had 10X the data, we managed to do 9X the throughput and the latency was 4X better. There is a big advantage in going to NoSQL.