How to Tell When a Database is Right for Your Project
When you are selecting databases for your latest use case (or replacing one that’s not meeting your current needs), the good news these days is that you have a lot of options to choose from. Of course, that’s also the bad news. You have a lot to sort through.
There are far more databases to consider and compare than ever before. In December 2012, the end of the first year DB-Engines.com first began ranking databases, they had a list of 73 systems (up significantly from the 18 they first started their list with). As of December 2022, they are just shy of 400 systems. This represents a Cambrian explosion of database technologies over the past decade. There is a vast sea of options to navigate: SQL, NoSQL, and a mix of “multi-model” databases that can be a mix of both SQL and NoSQL, or multiple data models of NoSQL (combining two or more options: document, key-value, wide column, graph and so on).
Further, users should not confuse outright popularity with fitness for their use case. While network effects definitely have advantages (“Can’t go wrong with X if everyone is using it”), it can also lead to groupthink, stifling innovation and competition.
In a recent webinar, my colleague Arthur Pesa and I took users through a consideration of five factors that users need to keep foremost when shortlisting and comparing databases.
The Five Factors
Let’s get straight into the list.
- Software Architecture — Does the database use the most efficient data structures, flexible data models, and rich query languages to support your workloads and query patterns?
- Hardware Utilization — Can it take full advantage of modern hardware platforms? Or will you be leaving a significant amount of CPU cycles underutilized?
- Interoperability — How easy is it to integrate into your development environment? Does it support your programming languages, frameworks and projects? Was it designed to integrate into your microservices and event streaming architecture?
- RASP — Does it have the necessary qualities of Reliability, Availability, Scalability, Serviceability and, of course, Performance?
- Deployment — Does this database only work in a limited environment, such as only on-premises, or only in a single datacenter or a single cloud vendor? Or does it lend itself to being deployed exactly where and how you want around the globe?
Any such breakdown is subjective. You may have your own list of 4 factors, or 12 or 20 or 100 criteria. And, of course, each of these factors like software architecture break down into subcategories, such as “storage engine,” “distributed processing architecture,” and even “query language.” But this is how I’d bucketize them into general categories.
The critical consideration here is “does the database use the most efficient data structures, flexible data models and rich query languages to support your specific workloads and query patterns?”
Workload — Do you need to do a write-heavy or mixed read-write transactional workload? Or are you going to do a mostly-read analytical workload? Do you need to have a hybrid workload with a mix of transactions and analytics? Is that workload real-time, batched or a mix? Is it a steady stream of events per second, or are there smooth, regular intraday predictable rises and falls? Or maybe do you need to plan to deal with stochastic shocks of sudden bursts of traffic (for example, breaking news, or any other sudden popularity of a record)?
Data Model — Are you dealing with key-value pairs? Wide column stores (row-based “key-key-value” data)? A column store (columnar data)? Document? Graph? RDBMS (with tables and JOINs)? Or something else entirely. Do you really have the time and need to do fully normalized data, or will you be absorbing so much unstructured data so quickly that normalization is a fool’s errand, and you’d be better served with a denormalized data model to start with? There’s no singular “right” answer here. “It depends” should be embraced as your mantra.
Query Language — Here there is definitely more of a bias. Because while your data engineering team may be able to mask or hide the back-end query model, many of your users have their own biases and preferences. This is one of the main reasons why SQL remains such a lock-in. At the same time, there are new query languages that are available. Some are SQL-like, such as the Cassandra Query Language (CQL) that is used by Cassandra and ScyllaDB. It has a passing familiarity to SQL users. But don’t be fooled – there are no table JOINs! Then there are a series of new school query languages which may use, for example JSON. This is how Amazon DynamoDB queries work. Again, here, ScyllaDB supports such a JSON query model using our Alternator interface, which is compatible with DynamoDB. Regardless of which way you lean, query language should not be an afterthought in your consideration.
Transactions / Operations / CAP — Which is more important to you? Fully consistent ACID transactions? Or highly performant, highly available basic CRUD operations? The CAP theorem says you can have any two of three: consistency, availability or partition tolerance. Considering that distributed databases always need to be partition-tolerant, that leaves you with the choice between so-called “CP”-mode consistency-oriented systems, or “AP”-mode availability-oriented systems. And within these modes, there are implementation details to consider. For example, how you achieve strong consistency within a distributed system can vary widely. Consider even the choice of various consensus algorithms to ensure linearizability, like Paxos, Raft, Zookeeper (ZAB) and so on. Besides the different algorithms, each implementation can vary significantly from another.
Data Distribution — When you say “distributed system,” what do you mean exactly? Are we talking about a local, single-datacenter cluster? Or are we talking multi-datacenter clustering? How do cross-cluster updates occur? Is it considered all one logical cluster, or does it require inter-cluster syncs? How does it handle data localization and, for example, GDPR compliance?
We are in the middle of an ongoing revolution in underlying hardware that continues to push the boundaries of software. A lot of software applications, and many databases in particular, are still rooted in decades-old origins, designs and presumptions.
CPU Utilization / Efficiency — A lot of software is said to be running poorly if CPU utilization goes up beyond, say, 40% or 50%. That means you are supposed to run that software inefficiently, leaving half of your box underutilized on a regular basis. In effect, you’re paying for twice the infrastructure (or more) than you actually need. So it behooves you to look at the way your system handles distributed processing.
RAM Utilization / Efficiency — Is your database consistently memory-bound? Is its caching too aggressive, or too bloated (such as having multiple layers of caching), keeping unneeded data in memory? How does it optimize its read and write paths?
Storage Utilization / Efficiency — What storage format does your database use? Does it have compact mutable tables that may require heavyweight file locking mechanisms? Or does it use immutable tables that may produce fast writes, but come at a cost of space and read amplification? Does it allow for tiered storage? How does it handle concurrency? Are files stored row-wise (good for transactional use cases) or column-wise (good for analytics on highly repetitive data)? Note that there isn’t just one “right” answer. Each solution is optimizing for different use cases.
Network Utilization / Efficiency — Here you should think both about the efficiency of client-server cluster communications, as well as intra-cluster communications. Client/server models can be made more efficient with concurrency, connection pooling, and so on. Intra-cluster communications span from typical operational/transactional chatter (replicating data in an update or a write), as well as administrative tasks such as streaming and balancing data between nodes during a topology change.
No database is an island. How easy is it to integrate into your development environment? Does it support your programming languages, frameworks and projects? Was it designed to integrate into your microservices and event streaming architecture?
Programming Languages / Frameworks — Over and over you hear “We’re an X shop,” where X stands for your preferred programming language or framework. If your database doesn’t have the requisite client, SDK, library, ORM, and/or other packages to integrate it into that language, it might as well not exist. To be fair, the massive explosion of databases is concurrent to the massive explosion in programming languages. Yet it pays to look at programming language support for the client. Note that this is not the same as what language the database may be written in itself (which may factor into its software architecture and efficiency). This is purely about what languages you can write apps in to connect to that back end database.
Event Streaming / Message Queuing — Databases may be a single source of truth, but they are not the only systems running in your company. In fact, you may have different databases all transacting, analyzing and sharing different slices of your company’s overall data and information space. Event streaming is the increasingly common media for modern enterprises to avoid data silos, and these days your database is only as good as its integration with real time event streaming and message queuing technologies. Can your database act as both a sink and a source of data? Does it have Change Data Capture (CDC)? Does it connect to your favorite event streaming and message queuing technologies such as Apache Kafka, or Apache Pulsar or RabbitMQ?
APIs — To facilitate your database integration into your application and microservices architecture, does your database support one or more APIs, such as a RESTful interface, or GraphQL? Does it have an administrative API so you can programmatically provision it rather than do everything via a GUI interface? Using the GUI might seem convenient at first, until you have to manage and automate your deployment systems.
Other Integrations — What about CI/CD toolchains? Observability platforms? How about using your database as a pluggable storage engine or underlying element of a broader architecture? How well does it serve as infrastructure, or fit into the infrastructure you already use?
This acronym goes back decades and generally is used in a hardware context. It stands for Reliability, Availability, Serviceability (or Scalability) and Performance. Basically these “-ilities” are “facilities” — things that make it easy to run your system. In a database, they are vital to consider how much manual intervention and “plate-spinning” you might need to perform to keep your system up and stable. They represent how much the database can take care of itself under general operating conditions, and even mitigate failure states as much as possible.
Typical platform engineer spinning up a bunch of new nodes.
Reliability —How much tinkering do you need to put in to keep this thing from collapsing, or from data disappearing? Does your database have good durability capabilities? How survivable is it? What anti-entropy mechanisms does it include to get a cluster back in sync? How good are your backup systems? Even more important, how good are your restore systems? And are there operational guardrails to keep individuals from bringing the thing down with a single “Oops!”
Availability — What does your database do when you have short term network partitions and transient node unavailability? What happens when a node fully fails? What if that network failure stretches out to more than a few hours?
Serviceability — These days the common buzzword is “observability,” which generally encompasses the three pillars of logging, tracing and metrics. Sure, your database needs to have observability built-in. Yet serviceability goes beyond that. How easy is it to perform upgrades without downtime? How pain-free are maintenance operations?
Scalability — Some databases are great to get started with. Then… you hit walls. Hard. Scalability means you don’t have to worry about hitting limits either in terms of total data under management, total operations per second, or geographic limits — such as going beyond a single datacenter to truly global deployability. Plus, there’s horizontal scalability — the scale out of adding more nodes to a cluster — as well as vertical scalability — putting your database on servers that have ever increasing numbers of CPUs, ever more RAM and more storage (refer back to the hardware section above).
Performance — Bottom line: if the database just can’t meet your latency or throughput SLAs, it’s just not going to fly in production. Plus, linked to scalability, many databases seem like they’ll meet your performance requirements at small scale or based on a static benchmark using test data but, when hit with real-world production workloads, just can’t keep up with the increasing frequency, variability and complexity of queries. So performance requires a strong correlation to linear scale.
All of the above then needs to run where you need it to. Does this database only work in a limited environment, such as only on-premises, or only in a single datacenter or a single cloud vendor? Or does it lend itself to being deployed exactly where and how you want around the globe? Ask yourself these question:
Lock-ins — Can this run on-premises? Or, conversely, is it limited to only on-premises deployments? Is it limited to only a certain public cloud vendor, or can this run in the cloud vendor of your choice? What are your hybrid cloud or multicloud deployment options?
Management / Control — Similarly, is this only available as a self-managed database, or can it be consumed as a fully-managed Database-as-a-Service (DBaas)? The former allows teams full control of the system, and the latter relieves teams of administrative burden. Both have their tradeoffs. Can you select only one, or does the database allow users to switch between these two business models?
Automation and Orchestration — Does it have a Kubernetes Operator to support it in production? Terraform and Ansible scripts? While this is the last itemn the list, rest assured, this should not be an afterthought in any production consideration.
So How Do You Evaluate Databases on This Rubric?
With this as a general rubric, you can watch the on-demand webinar to learn how ScyllaDB compares to such consideration criteria. ScyllaDB was architected from the ground up to take advantage of modern software paradigms and hardware architecture. ScyllaDB is based on the underlying Seastar framework, designed for building shard-per-core, shared nothing, highly asynchronous applications. There’s a lot going on under the hood, so I hope you watch the video to get the inside scoop!
Also, if you want to evaluate your current database against this criteria and discuss with us how your existing technology stacks up, feel free to contact us privately, or sign up for our user forums or Slack community to ask more questions and compare your results with your data engineering peers.