
Using Change Data Capture (CDC) in ScyllaDB

Change Data Capture (CDC) allows users to track data updates in their ScyllaDB database. While it is similar to the feature with the same name in Apache Cassandra and other databases, how it is implemented in ScyllaDB is unique, more accessible, and more powerful. We recently added a lesson in ScyllaDB University so users can understand how to get the most out of the changing nature of their data stored in ScyllaDB. This lesson is based on a talk we gave at our ScyllaDB Summit last year.

Overview

So what is Change Data Capture? It’s a record of modifications that happen for one or more tables in the database. We want to know when we write and when we delete. Obviously this could be done as triggers in an application, but the key feature is that we want to be able to consume it asynchronously, and we want the consumer to see the changes from a data standpoint rather than an application standpoint.

CDC is one of the key features in ScyllaDB in 2020 and one of the major things that we’re working on.

Some of the prominent use cases include:

  • Fraud detection — you want to analyze transactions in some sort of batch manner to see if credit cards are being used from many places at the same time
  • Streaming Data — you want to plug it into your Kafka pipeline, you want some sort of analysis done on a transaction level and not on an aggregated level
  • Data duplication — you want to mirror a portion of your database without using multi-datacenter replication
  • Data replication — you want to somehow transform the data on a per-transaction basis to some other storage medium

How does ScyllaDB’s CDC work?

CDC is enabled per table. You don’t want to store changes for everything; you want it to be as isolated as possible. For ScyllaDB the granularity is a CQL row: each record describes the modifications made to a single CQL row.

Optionally, we want to capture pre-image data: the current state of a row, if it exists, limited to the columns affected by the change. On top of that, we want a log of the modifications made:

  • The pre-image (original state of data before the change)
  • Changes per column (delta)
  • The post-image (final, current state after the change)

Or any combination thereof. We might only want one of the three or all three, etc. It’s all optional. It’s all controlled by you.
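
For example, CDC can be enabled (or reconfigured) on an existing table with whichever of those record types you need. The sketch below uses a hypothetical table named my_table, and the option names ('enabled', 'preimage', 'postimage') reflect ScyllaDB’s CDC options as we understand them; check the CDC documentation for your version for the exact set:

> ALTER TABLE my_table WITH cdc = { 'enabled': true, 'preimage': true, 'postimage': true };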

Also, unlike the commitlog-like implementation in Apache Cassandra, CDC in ScyllaDB is just another CQL table. It’s enabled per table, and it creates another data table that is distributed across the cluster, sharing all the same properties you’re used to with your normal data. There’s no need for a separate specialized application to consume it.

It’s ordered by the timestamp of the original operation and the associated CQL batch sequence number. (Note that a CQL operation can break down into more than one modification.)

You have the columns for the pre-image and the delta records for the change. Every column records information about how we modified it and what TTL you put into it.

The topology of the CDC log is meant to match the original data so you get the same data distribution. You potentially get the same consistency (depending on the options you select) as for your normal data writes. It’s synchronized with your writes. It shares the same properties and distribution all to minimize the overhead of adding the log, and also to make sure that logs and actual data in the end result match each other as close as possible.

Everything in a CDC log is transient, representing a window back in time that you can consume and look at. The default time-to-live (TTL) is 24 hours, but again it’s up to you. This ensures that if you’re not consuming the log and it builds up, it’s not going to kill your database. It will add to your total storage, but it won’t take down your systems.
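
If the default window doesn’t fit your consumers, retention can be adjusted through the same cdc options map. This is a sketch against the hypothetical my_table from above; the 'ttl' option (expressed in seconds) is an assumption based on ScyllaDB’s documented CDC options, so verify the name and unit for your version:

> ALTER TABLE my_table WITH cdc = { 'enabled': true, 'ttl': 43200 };  -- keep roughly 12 hours of changes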

The downsides of this approach are that we add a read-before-write if you want pre-image data, and/or a read-after-write if you want a post-image. That has a performance impact, and that is why we don’t want users to just generally enable CDC on all database transactions.

The log shares the same consistency as everything else in ScyllaDB, which is to say it is eventually consistent. It represents the change as viewed by the client. Because ScyllaDB is a distributed database, there is no single truth about a change of data, so it’s all based on how the coordinator, or the client talking to the coordinator, views the state and the change. (We’ll allow you to use different consistency levels for the base table and the CDC table, although that is a special use case.)

Also note, it’s the change, not what happened afterwards. For example, if you lost a node at some point in time, then depending on your consistency level what you read back might not be what you last wrote, because you lost the only replica that held that data or you can’t reach the consistency level needed to get that value back. Obviously you could get partial logs in the case of severe node crashes.

Consuming the Data

How do you consume CDC table data? Well, that’s up to you. But it’s also up to us on how we implemented CDC.

It’s going to be consumed at the lowest level through normal CQL. It’s just a table. Very easy to read. Very easy to understand. The data is already deduplicated. You don’t need to worry about aggregating data from different replicas and reconciling it. It’s been done already for you. Everything is normal CQL data. Easy to parse. Easy to manipulate. You don’t need to know how the server works at the internal level. Everything is made of well-known, well-documented components.
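
As a minimal sketch, here’s what that looks like for the hypothetical my_table. The log table name assumes ScyllaDB’s convention of appending _scylla_cdc_log to the base table name; in practice a consumer would read per stream and per time window rather than scanning the whole log:

> SELECT * FROM my_table_scylla_cdc_log LIMIT 100;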

This allows for a layered approach. We can build adapters. We can build integrations on top of this for more advanced use cases and standardized connectors. We can adapt it to “push” models and we can integrate it with Kafka. We can provide an API that emulates the DynamoDB API via our Alternator project. And there’s going to be more.

We’re investing heavily in this and we’re going to be presenting a lot of integration projects to make sure that you can use CDC without relying on third-party connectors or appliances. It’s all going to come directly from ScyllaDB.

Under the Hood

Here’s a brief example of what happens when you’re using this given a very simple table. We create a table and we turn on CDC:

> CREATE TABLE base_table (
     pk text,
     ck text,
     val1 text,
     val2 text,
     PRIMARY KEY (pk, ck)
) WITH cdc = { 'enabled': true, 'preimage': true };

In this case we have a primary key, we have a clustering key, we have two values and we enabled CDC. We also add that we want the pre-image.

Now we do two inserts. You can see we’re inserting two CQL rows that share the same primary key.

> insert into base_table(pk, ck, val1, val2) values('foo', 'bar', 'val1', 'val2');

> insert into base_table(pk, ck, val1, val2) values('foo', 'baz', 'vaz1', 'vaz2');

So we get two update records in the CDC stream.

Stream_id |   time | batch_seq |  operation | ttl |   _pk |    _ck | _val1(op, value, ttl)|           _val2(...)
----------+--------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |        |         0 |     UPDATE |     | "foo" |  "bar" |  (ADD, "val1", null) |  (ADD, "val2", null)
    UUID1 |        |         0 |     UPDATE |     | "foo" |  "baz" |  (ADD, "vaz1", null) |  (ADD, "vaz2", null)

Again there’s no pre-image because nothing existed previously, it’s just two inserts.

Now we modify it by updating one of the CQL rows in the table:

> update base_table set val1 = 'val3' where pk = 'foo' and ck = 'bar';

Then we get a pre-image that only includes the value we’re changing. Notice we limit the change log entries to the affected columns to keep the performance impact as low as possible. Here’s the resultant change:

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         0 |   PREIMAGE |     | "foo" |  "bar" |  (ADD, "val1", null) |
    UUID1 |         |         1 |     UPDATE |     | "foo" |  "bar" |  (ADD, "val3", null) |

To explain that tuple: the operation is ADD, the new value is "val3", and there is no TTL.

Performing a delete of one of the values is similar:

> delete val2 from base_table where pk = 'foo' and ck = 'bar';

We’re deleting one of the values. We’re not deleting the row, we’re setting a value to NULL. So we get a pre-image for the column that’s affected and then we get a delete delta.

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         0 |   PREIMAGE |     | "foo" |  "bar" |                      | (ADD, "val2", null)
    UUID1 |         |         1 |     UPDATE |     | "foo" |  "bar" |                      | (DEL, null, null)

For a row delete we get a pre-image and the values that exist:

> delete from base_table where pk = 'foo' and ck = 'bar';

And then we get a delete record for the row:

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         0 |   PREIMAGE |     | "foo" |  "bar" | (ADD, "val3", null)  |
    UUID1 |         |         1 |    ROW_DEL |     | "foo" |  "bar" |                      |

Of course the delete record for a row doesn’t contain any deltas for the columns because it’s gone.

Whereas for a partition delete we only get the delete of the partition; there’s no pre-image here:

> delete from base_table where pk = 'foo';

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         1 |   PART_DEL |     |       |        |                      |

Summary

CDC in ScyllaDB is going to be very easy to integrate and very easy to consume. Everything is plain old CQL tables. Nothing magic, nothing special. Just performant.

It’s going to be robust. Everything is replicated the same way normal data is replicated. It’s going to share all the same properties as your normal cluster. If your cluster works, CDC is going to work.

Our CDC implementation comes with a reasonable overhead. We coalesce the log writes into the same replica ranges, which means that, yes, you’re doing two writes, but they share the same request so it doesn’t add very much.

Data is TTL’d. It has a limited lifetime. Which means the CDC log is not going to overwhelm your system. You’re not going to get any node crashes because you enabled CDC. You’re just going to get some more data that you may or may not consume.

Here’s a short comparison with some NoSQL competitors:

                        |     Cassandra |            DynamoDB |             MongoDB |            ScyllaDB
------------------------+---------------+---------------------+---------------------+---------------------
 Consumer location      |       on-node |            off-node |            off-node |            off-node
 Replication            |    duplicated |        deduplicated |        deduplicated |        deduplicated
 Deltas                 |           yes |                  no |             partial |                 yes
 Pre-image              |            no |                 yes |                  no |       yes, optional
 Post-image             |            no |                 yes |                 yes |       yes, optional
 Slow consumer reaction | Table stopped | Consumer loses data | Consumer loses data | Consumer loses data
 Ordering               |            no |                 yes |                 yes |                 yes

Where can you put the consumer? In Cassandra you have to do it on-node. For the other three, including ScyllaDB, the consumer can sit somewhere separate from the database cluster. It’s just over the wire.

ScyllaDB data is deduplicated, so you don’t have to worry about reconciling different data streams from different replicas. It’s already done for you.

Do you get deltas? Yes. With ScyllaDB you get deltas. You know the changes.

Do you get pre-image? With ScyllaDB, optionally yes you do. You know the state before the delta and you know the delta itself.

Do you get post-image? With ScyllaDB, optionally again, yes, we can give you that.

What happens if you don’t consume data? Yes, it is lost but it’s no catastrophe. If you didn’t consume it you probably didn’t want it. You have to take responsibility for consuming your data before you hit your TTLs.

And the ordering? With ScyllaDB, yes. It’s just CQL. You’re going to be able to consume it the way you want. It’s ordered by timestamp — again, the perceived timestamp of the client. It’s a client view. It’s the database description of how your application perceived the state of data as it modified it.

Next Steps

Sign up for ScyllaDB University to take the CDC lesson and get credit. You can also watch the session from ScyllaDB Summit in full below or at our Tech Talks page. Also check out our CDC documentation. If you have more questions, make sure to bring them to our Slack community, or attend one of our live online Virtual Workshops.


About Calle Wilund

Calle Wilund is a Software Developer at ScyllaDB, as well as co-founder of Appeal Virtual Machines and one of the principal architects behind the JRockit JVM. He has an extensive background in software development, specializing in virtual machines, compiler technologies and high performance computing as well as systems manageability.