Virtual Workshops
Twice-monthly interactive sessions with our NoSQL solution architects.
Join Our Next Session >

CQL vs. DynamoDB API

You may be interested in moving to a scalable, reliable NoSQL database platform like Scylla, but aren’t sure which interface to use: our Apache Cassandra® compatible Cassandra Query Language (CQL) interface or our newer Amazon DynamoDB® compatible interface. Or maybe you are familiar with one interface already and want to understand more about the other.

Bottom line:

  • We recommend you use our CQL interface except in the following case
  • If you currently use DynamoDB, we recommend you perform a “lift-and-shift” migration to Scylla using our DynamoDB-compatible API, Project Alternator
CQL-and-Alternator-Interfaces-

While we generally recommend our CQL interface, let’s examine the details of each of these interfaces, their capabilities, and show which one is best for your use case.

Scylla’s CQL Interface

Central to Scylla’s design principles is conformance with the Cassandra Query Language (CQL). This interface combines an SQL-like query syntax, a corresponding data model, and its own binary transport protocol.

For example, here is a CQL SELECT query:

SELECT time, value
FROM events
WHERE event_type = 'myEvent'
AND time > '2011-02-03'
AND time <= '2012-01-01'

However, CQL is definitely a NoSQL interface. It does not support all SQL features such as foreign keys and JOINs across tables. It also has some commands unique to itself. You can read our full documentation on CQL here.

CQL Data Model

The CQL data model is referred to as a wide-column or column-oriented NoSQL database, though there is no well-established name for it. Data is organized into keyspaces, a top-level container that determines replication strategies. Within a keyspace you can have one or more tables, but, as stated above, there are no foreign keys or JOINs across tables. Instead, data is denormalized. For those familiar with SQL looking to learn CQL, you may want to look at this blog on best practices. Within CQL tables data is organized into partitions, with primary keys comprised of a partition key and clustering key.

Scylla’s DynamoDB-compatible Interface

Scylla announced Project Alternator to provide an open source alternative to the popular Amazon DynamoDB database, designed around cost, performance, and openness. It utilizes the same underlying storage engine, but conforms to the unique differences of DynamoDB: its JSON query language, its key-value and document data models, as well as its RESTful API using an HTTP/HTTPS transport mechanism.

DynamoDB Queries: JSON

DynamoDB structures queries using a JavaScript Object Notation (JSON) document model. But don’t be confused: you can use JSON in conjunction with many programming languages, not just JavaScript. JSON’s text syntax facilitates structured data interchange across programming languages. Also note that JSON is not a complete implementation of data interchange. Instead, it is a form of data encapsulation.

DynamoDB queries and statements can get quite complex, including data types, conditions, expressions, filters, consistency levels and comparison operators. Here is an example of a simple query written in JSON:

{
TableName: "Table_X"
KeyConditionExpression: "LastName = :a",
ExpressionAttributeValues: {
:a = "smith"z
}
}

DynamoDB Data Models: Key-Value and Document

Amazon positions DynamoDB as a key-value and document database. While DynamoDB uses a JSON document model for its queries, it does not store data natively in JSON format. Data in DynamoDB is written in one table and is distributed across partitions, which enables it to scale horizontally. As a partition is filled, another one is added. As in CQL, these partitions are lists of items, sorted by a sort key (analogous to the CQL clustering key).

Project-Alternator-Basics

Transport: Proprietary vs. Standard

A very visible difference between the CQL and DynamoDB APIs is the transport protocol used to send requests and responses. CQL uses its own binary protocol for transporting SQL-like requests, while DynamoDB uses a JSON-formatted RESTful API for requests and responses over HTTP or HTTPS (secured via SSL/TLS).

DynamoDB’s use of popular standards — JSON and HTTP — has an obvious benefit: They are familiar to most users, and many programming languages have built-in tools for them. This, at least in theory, makes it easier to write client code for this protocol compared to CQL’s non-standard protocol, which requires special client libraries (a.k.a. “client drivers”). Moreover, the generality of JSON and HTTP also makes it easy to add more features to this API over time, unlike rigid binary protocols that need to change and further complicate the client libraries. In the last 8 years, the CQL binary protocol underwent four major changes (it is now in version 5), while the DynamoDB API added features gradually without changing the protocol at all.

However, closer inspection reveals that special client libraries are needed for DynamoDB as well. The problem is that beyond the standard JSON and HTTP, several other non-standard pieces are needed to fully utilize this API. Among other things, clients need to sign their requests using a non-standard signature process, and DynamoDB’s data types are encoded in JSON in non-obvious ways.

So DynamoDB users cannot really use JSON or HTTP directly. Just like CQL users, they also use a client library that Amazon or the AWS community provided. For example, the Python SDK for AWS (and for DynamoDB in particular) is called boto3. The bottom line is that both CQL and DynamoDB ecosystems have client libraries for many programming languages, with a slight numerical advantage to CQL. For users who use many AWS services, the ability to use the same client library to access many different services — not just the database — is a bonus. But we want to limit ourselves just to the database, and DynamoDB’s use of standardized protocols ends up being of no real advantage to the user.

Drawbacks to Using DynamoDB’s JSON and HTTP vs CQL

    1. Request multiplexingDynamoDB supports only HTTP/1. One of the features missing in HTTP/1.1 is support for multiplexing many requests (and their responses) over the same connection. As a result, clients that need high throughput, and therefore high concurrency, need to open many connections to the server. These extra connections consume significant networking, CPU, and memory resources.At the same time the CQL binary protocol allows sending multiple (up to 32,768) requests asynchronously and potentially receiving their responses in a different order.
      The DynamoDB API had to work around its transport’s lack of multiplexing by adding special multi-request operations, namely BatchGetItem and BatchWriteItem — each can perform multiple operations and return reordered responses. However, these workarounds are more limited and less convenient than generic support for multiplexing.
      Had DynamoDB supported HTTP/2, it would have gained multiplexing as well, and would even gain an advantage over CQL for large requests or responses: The CQL protocol cannot break up a single request or response into chunks, but HTTP/2 does do it, avoiding large latencies for small requests which happen to follow a large request on the same connection. We predict that in the future, DynamoDB will support HTTP/2. Some other Amazon services already do.
    2. Verbose binary formatDynamoDB’s requests and responses are encoded in JSON. The JSON format is verbose and textual, which is convenient for humans but longer and slower to parse than CQL’s binary protocol. Because there isn’t a direct mapping of DynamoDB’s types to JSON, the JSON encoding is quite elaborate. For example, the string “hello” is represented in JSON as {“S”: “hello”},  and the number 3 is represented as {“N”: “3”}. Binary blobs are encoded using base64, which makes them 33% longer than necessary. The request parsing code then needs to parse JSON, decode all these conventions and deal with all the error cases when a request deviates from these conventions. This is all significantly slower than parsing a binary protocol.

Protocol: Cluster Awareness vs. One Endpoint

A distributed database is, well… distributed. It is a cluster of many different servers, each responsible for part of the data. When a client needs to send a request, the CQL and DynamoDB APIs took very different approaches on where the client should send its request:
    1. In CQL, clients are “topology aware” — they know the cluster’s layout. A client knows which Scylla or Cassandra nodes exist, and is expected to balance the load between them. Moreover, the client usually knows which node holds which parts of the data (this is known as “token awareness”), so it can send a request to read or write an individual item directly to a node holding this item. Scylla clients can even be shard aware,” meaning they can send a request directly to the specific CPU (in a node with many CPUs) that is responsible for the data.
    2. In the DynamoDB API, clients are not aware of the cluster’s layout. They are not even aware how many nodes there are in the cluster. All that a client gets is one “endpoint”, a URL like https://dynamodb.us-east-1.amazonaws.com — and directs all requests from that data-center to this one endpoint. DynamoDB’s name server can then direct this one domain name to multiple IP addresses, and load balancers can direct the different IP addresses to any number of underlying physical nodes. One benefit of the DynamoDB approach is the simplicity of the clients.
But CQL’s more elaborate approach comes with an advantage — reducing the number of network hops, and therefore reducing both latency and cost: In CQL, a request can often be sent on a socket directed to a specific node or even a specific CPU core that can answer it. With the DynamoDB API, the request needs to be sent at a minimum through another hop — e.g., in Amazon’s implementation the request first arrives at a front-end server, which then directs the request to the appropriate back-end node. If the front-end servers are themselves load-balanced, this adds yet another hop (and more cost). These extra hops add to the latency and cost of a DynamoDB API implementation compared to a cluster-aware CQL implementation.

Query Language: Prepared Statements

We already noted above how the DynamoDB API uses the HTTP transport in the traditional way: Even if requests are sent one after another on the same connection (i.e., HTTP Keep-Alive), the requests are independent, signed independently and parsed independently.

In many intensive (and therefore, expensive) use cases, the client sends a huge number of similar requests with slightly different parameters. In the DynamoDB API, the server needs to parse these very similar requests again and again, wasting valuable CPU cycles.

The CQL query language offers a workaround: prepared statements. The user asks to prepare a parameterized statement — e.g., “insert into tb (key, val) values (?, ?)”  with two parameters marked with question-marks. This statement is parsed only once, and further requests to the same node can reuse the pre-parsed statement, passing just the two missing values. The benefit of prepared statements are especially pronounced in small and quick requests, such as reads that can frequently be satisfied from cache, or writes. In this blog post we demonstrated in one case that using prepared statements can increase throughput by 50% and lower latency by 33%.

Write Model: CRDT vs. Read-Modify-Write

Amazon DynamoDB came out in 2012, two years after Cassandra. DynamoDB’s data model was inspired by Cassandra’s. In both DynamoDB and Cassandra, database rows are grouped together in a partition by a partition key, with the rows inside the partition sorted by a clustering key. The DynamoDB API and CQL use different names for these same concepts, though:

 CQL termDynamoDB term
Smallest addressable contentrowitem
Sub-entitiescolumnattribute
Unit of distributionpartition (many rows)partition (many items)
Key for distributionpartition keypartition key (also hash key)
Key within partitionclustering keysort key (also range key)

Despite this basic similarity in how the data is organized — and therefore in how data can be efficiently retrieved — there is a significant difference in how DynamoDB and CQL model writes. The two APIs offer different write primitives and make different guarantees on the performance of these primitives:

Cassandra is optimized for write performance. Importantly, ordinary writes do not require a read of the old item before the write: A write just records the change, and only read operations (and background operations known as compaction and repair) will later reconcile the final state of the row. This approach, known as Conflict-free Replicated Data Types (CRDT), has implications on Cassandra’s implementation, including its data replication and its on-disk storage in a set of immutable files called sstables (comprising an LSM). But more importantly for this post, the wish that ordinary writes should not read also shaped Cassandra’s API, CQL:

    • Ordinary write operations in CQL — both UPDATE and INSERT requests — just set the value of some columns in a given row. These operations cannot check whether the row already existed beforehand, and the new values cannot depend on the old values. A write cannot return the old pre-modification values, nor can it return the entire row when just a part of it was modified.
    • These ordinary writes are not always serialized. They usually are: CQL assigns a timestamp (with microsecond resolution) to each write and the latest timestamp wins. However two concurrent writes which happen to receive exactly the same timestamp may get mixed up, with values from both writes being present in the result.
      For more information on this problem, see this Jepsen post.
    • CQL offers a small set of additional CRDT data structures for which writes just record the change and do not need to involve a read of the prior value. These include collections (maps, sets and lists) and counters.
    • Additionally, CQL offers special write operations that do need to read a row’s previous state before a write. These include materialized views and lightweight transactions (LWT). However, these write operations are special; Users know that they are significantly slower than ordinary writes because they involve additional synchronization mechanisms – so users use them with forethought, and often sparingly.

Contrasting with CQL’s approach, DynamoDB’s approach is that every write starts with a read of the old value of the item. In other words, every write is a read-modify-write operation. Again, this decision had implications on how DynamoDB was implemented. DynamoDB uses leader-based replication so that a single node (the leader for the item) can know the most recent value of an item and serialize write operations on it. DynamoDB stores data on disk using B-trees which allow reads together with writes. But again, for the purpose of this post, the interesting question is how making every write a read-modify-write operation shaped DynamoDB’s API:

    • In the DynamoDB API, every write operation can be conditional: A condition expression can be added to a write operation, and uses the previous value of the item to determine whether the write should occur.
    • Every write operation can return the full content of the item before the write (ReturnValues).
    • In some update operations (UpdateItem), the new value of the item may be based on the previous value and an update expression. For example, “a = a + 1” can increment the attribute a (there isn’t a need for a special “counter” type as in CQL).
    • The value of an item attribute may be a nested JSON-like type, and an update may modify only parts of this nested structure. DynamoDB can do this by reading the old nested structure, modifying it, and finally writing the structure back.
      CQL could have supported this ability as well without read-modify-write by adding a CRDT implementation of a nested JSON-like structure (see, for example, this paper) — but currently doesn’t and only supports single-level collections.
    • Since all writes involve a read of the prior value anyway, DynamoDB charges for writes the same amount whether or not they use any of the above features (condition, returnvalues, update expression). Therefore users do not need to use these features sparingly.
    • All writes are isolated from each other; two concurrent writes to the same row will be serialized.

This comparison demonstrates that the write features that the DynamoDB and CQL APIs provide are actually very similar: Much of what DynamoDB can do with its various read-modify-write operations can be done in CQL with the lightweight transactions (LWT) feature or by simple extensions of it.  In fact, this is what Scylla does internally: We implemented CQL’s LWT feature first, and then used the same implementation for the write operations of the DynamoDB API.

However, despite the similar write features, the two APIs are very different in their performance characteristics:

    • CQL makes an official distinction between write-only updates and read-modify-write updates (LWT), and the two types of writes must not be done concurrently to the same row. This allows a CQL implementation to implement write-only updates more efficiently than read-modify-write updates.
    • A DynamoDB API implementation, on the other hand, cannot do this optimization: With the DynamoDB API, write-only updates and (for example) conditional updates can be freely mixed and need to be isolated from one another.

This requirement has led Scylla’s implementation of the DynamoDB API to use the heavyweight (Paxos-based) LWT mechanism for every write, even simple ones that do not require reads, making them significantly slower than the same write performed via the CQL API. We will likely optimize this implementation in the future, perhaps by adopting a leader model (like DynamoDB), but an important observation is that in every implementation of the DynamoDB API, writes will be significantly more expensive than reads. Amazon prices write operations at least 5 times more expensive than reads (this ratio reaches 40 for weakly-consistent reads of 4 KB items). 

Amazon prices a write unit (WCU) at 5 times the cost of a read unit (RCU). Additionally, for large items, each 1 KB is counted as a write unit, while a read unit is 4 KB. Moreover, eventually-consistent reads are counted as half a read unit. So writing a large item is 40 (=5*4*2) times more expensive than reading it with an eventually-consistent read.

Scylla provides users with an option to do DynamoDB-API write-only updates (updates without a condition, update expression, etc.) efficiently, without LWT. However, this option is not officially part of the DynamoDB API, and as explained above — cannot be made safe without adding an assumption that write-only updates and read-modify-write updates to the same item cannot be mixed.

Data Items: Schema vs. (Almost) Schema-less

Above we noted that both APIs — DynamoDB and CQL — model their data in the same way: as a set of partitions indexed by a partition key, and each partition is a list of items sorted by a sort key.

Nevertheless, the way that the two APIs model data inside each item is distinctly different:

    1. CQL is schema-full: A table has a schema — a list of column names and their types – and each item must conform with this schema. The item must have values of the expected types for all or some of these columns.
    2. DynamoDB is (almost) schema-less: Each item may have a different set of attribute names, and the same attribute may have values of different types. (The value types supported by the two APIs also differ, and notably the DynamoDB API supports nested documents. But these are mostly superficial differences and we won’t elaborate on this further here.) DynamoDB is “almost” schema-less because the items’ key attributes — the partition key and sort key — do have a schema. When a table is created, one must specify the name and type of its partition-key attribute and its optional sort-key attribute.

It might appear, therefore, that DynamoDB’s schema-less data model is more general and more powerful than CQL’s schema-full data model. But this isn’t quite the case. When discussing CRDT above, we already mentioned that CQL supports collection columns. In particular, a CQL table can have a map column, and an application can store in that map any attributes that aren’t part of its usual schema. CQL’s map does have one limitation, though: The values stored in a map must all have the same type. But even this limitation can be circumvented, by serializing values of different types into one common format such as a byte array. This is exactly how Scylla implemented the DynamoDB API’s schema-less items on top of its existing schema-full CQL implementation.

So it turns out that the different approaches that the two APIs took to schemas have no real effect on the power or the generality of the two APIs. Still, this difference does influence what is more convenient to do in each API:

    • CQL’s schema makes it easier to catch errors in a complex application ahead of time – in the same way that compile-time type checking in a programming language can catch some errors before the program is run.
    • The DynamoDB API makes it natural to store different kinds of data in the same database. Many DynamoDB best-practice guides recommend a single-table design, where all the application’s data resides in just a single table (plus its indexes). In contrast, in CQL the more common approach is to have multiple tables each with a different schema. CQL introduces the notion of “keyspace” to organize several related tables together.
    • The schema-less DynamoDB API makes it somewhat easier to change the data’s structure as the application evolves. CQL does allow modifying the schema of an existing table via an “ALTER TABLE” statement, but it is more limited in what it can modify. For example, it is not possible to start storing strings in a column that used to hold integers (however, it is possible to create a second column to store these strings).

Summary

We surveyed some of the major differences between the Amazon DynamoDB API and Apache Cassandra’s CQL API — both of which are now supported in one open-source database, Scylla. Despite a similar data model and design heritage, these two APIs differ in the query language, in the protocol used to transport requests and responses, and somewhat even in the data model (e.g., schema-full vs. schema-less items).

We showed that in general, it is easier to write client libraries for the DynamoDB API because it is based on standardized building blocks such as JSON and HTTP. Nevertheless, we noted that this difference does not translate to a real benefit to users because such client libraries already exist for both APIs, in many programming languages. But the flip side is that the simplicity of the DynamoDB API costs in performance: The DynamoDB API requires extra network hops (and therefore increased costs and latency) because clients do not need to be aware of which data belongs to which node. The DynamDB API also requires more protocol and parsing work and more TCP connections.

We discussed some possible directions in which the DynamoDB API could be improved (such as adopting HTTP/2 and a new mechanism for prepared statements) to increase its performance. We don’t know when, if ever, such improvements might reach Amazon DynamoDB. But as an experiment, we could implement these improvements in the open-source Scylla and measure what kinds of benefits an improved API might bring. When we do, look forward to a follow-up on this post detailing these improvements. 

We also noted the difference in how the two APIs handle write operations. The DynamoDB API doesn’t make a distinction between writes that need the prior value of the item and writes that do not. This makes it easier to write consistent applications with this API, but also means that writes with the DynamoDB API are always significantly more expensive than reads. Conversely, the CQL API does make this distinction — and does forbid mixing the two types of writes. Therefore, the CQL API allows write-only updates to be optimized, while the read-modify-write updates remain slower. For some applications, this can be a big performance boost.

The final arbiter in weighing these APIs is subjective in answer to the question: “Which will be better for your use case?” Due to many of the points made above, and due to other features which at this point are only available through Scylla’s CQL implementation, ScyllaDB currently advises users to adopt the CQL interface for all use cases apart from “lift and shift” migrations from DynamoDB.

Next Steps

Now that you’ve learned the differences, it’s time to try your workloads on Scylla to realize the performance benefits it can offer. Here’s some articles for how to implement your migration: