ScyllaDB Change Data Capture

Easily Track Updates, Migrate, or Stream Data from ScyllaDB Clusters

Overview

ScyllaDB Change Data Capture (CDC) is an optional feature that automatically records modifications made to one or more tables in your ScyllaDB cluster (database). CDC is primarily used for:

- Streaming data: you want to plug changes into your Kafka pipeline, or you want analysis done at the transaction level rather than at an aggregated level.
- Data duplication: you want to mirror a portion of your database, optionally with ETL, without using multi-datacenter replication.
- Data replication: you want to transform the data on a per-transaction basis into another storage format and repository.

Built on these functions, CDC also lets ScyllaDB users build streaming data pipelines for real-time data processing and analysis. More and more use cases require an almost immediate reaction to modifications occurring in the database. In a fast-moving world, the volume of new information is so large that it has to be processed on the spot. For example, it might be desirable to send an SMS to a user if a login attempt is made from a country the user does not live in, a typical fraud detection operation. Change Data Capture is also useful for replicating data stored in the database to other systems that build indexes or aggregations over it.

The write flow for a table with CDC already enabled has three steps: (1) a client writes to the CDC-enabled table, for example with INSERT INTO base_table(…)…, (2) if required, the coordinator node reads the existing row data to generate the pre-image and post-image, and (3) the coordinator generates the CDC log table writes and piggybacks them on the base table write.
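The write flow can be pictured with a toy, in-memory model. This is only a sketch for intuition, not ScyllaDB code: the function `cdc_write`, the dict-based "table", and the list-based "log" are hypothetical stand-ins for the coordinator, the base table, and the CDC log table.

```python
def cdc_write(base_table, cdc_log, key, changes, preimage=False, postimage=False):
    """Toy model of one write to a CDC-enabled table (illustrative only)."""
    log_rows = []

    # Step 2: read the existing row only when a pre-image was requested.
    if preimage and key in base_table:
        log_rows.append({"op": "preimage", "key": key, "row": dict(base_table[key])})

    # Step 3: the delta records exactly what this write changed.
    log_rows.append({"op": "delta", "key": key, "row": dict(changes)})

    # Merge the change into the current row to obtain the new state.
    new_row = {**base_table.get(key, {}), **changes}
    if postimage:
        log_rows.append({"op": "postimage", "key": key, "row": dict(new_row)})

    # The log rows travel together with the base table write.
    base_table[key] = new_row
    cdc_log.extend(log_rows)
    return log_rows


base = {1: {"name": "alice", "city": "Oslo"}}
log = []
rows = cdc_write(base, log, 1, {"city": "Rome"}, preimage=True, postimage=True)
```

In this model the pre-image read happens only when it was asked for, which mirrors the "if required" in step (2).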

How does ScyllaDB’s CDC Work?

Change data capture is enabled per table and records modifications made to CQL rows, so you can query the history of all changes made to CDC-enabled tables in your ScyllaDB database. When CDC is enabled on a table, a corresponding CDC log table is created. Its mutations (or CDC records, if you prefer) are ordered by the timestamp of the original operation and by a sequence number within the associated CQL batch. This ordering matters because it tells consumers how to retrieve data and continuously poll the table. The batch sequence number helps ensure that writes in a batch are processed together, improving performance and assuring correctness. The log table can optionally record one or more of the following:

- Changes per column (delta, what was changed)
- The pre-image (original state of data before the change)
- The post-image (final, current state after the change)
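As a rough illustration of enabling these options, the sketch below builds the CQL statement as a string. The helper names are hypothetical, and the option names (`enabled`, `preimage`, `postimage`) and the `_scylla_cdc_log` suffix reflect our reading of the ScyllaDB CDC documentation; check them against the version you run.

```python
def cdc_alter_statement(keyspace, table, preimage=False, postimage=False):
    """Build an ALTER TABLE statement that turns on CDC for one table."""
    options = ["'enabled': true"]
    if preimage:
        options.append("'preimage': true")
    if postimage:
        options.append("'postimage': true")
    body = ", ".join(options)
    return "ALTER TABLE %s.%s WITH cdc = {%s};" % (keyspace, table, body)


def cdc_log_table_name(table):
    """Name of the log table created next to the base table (assumed suffix)."""
    return table + "_scylla_cdc_log"


# "ks" and "base_table" are placeholder names.
statement = cdc_alter_statement("ks", "base_table", preimage=True, postimage=True)
print(statement)
```

The resulting string can be run with cqlsh or any CQL driver.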

When writing applications, you can use our language-specific, shard-aware libraries (Rust, Go, and Java) to simplify writing applications that read from ScyllaDB CDC. These libraries feature:

- A simple callback-based interface for consuming changes from CDC streams.
- Automatic retries in case of errors.
- Transparent handling of the complexities related to topology changes.
- Optional checkpointing: the library can be configured to save progress so that it can continue from the saved point after a restart.
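To make the features above concrete, here is a small sketch of the pattern they describe: a callback, retries, and a checkpoint. The function `consume`, its parameters, and the dict-based checkpoint are hypothetical names chosen for illustration; this is not the API of the Rust, Go, or Java libraries.

```python
def consume(changes, on_change, checkpoint, max_retries=3):
    """Deliver each change to on_change, retrying failures and saving progress.

    changes    -- iterable of (position, change) pairs in log order
    checkpoint -- dict holding the last processed position (hypothetical store)
    """
    for position, change in changes:
        # Skip anything already processed before a restart.
        if position <= checkpoint.get("position", -1):
            continue
        for attempt in range(max_retries):
            try:
                on_change(change)
                break
            except Exception:
                # Give up only after the final attempt fails.
                if attempt == max_retries - 1:
                    raise
        # Save progress only after the callback succeeded.
        checkpoint["position"] = position


seen = []
calls = {"count": 0}

def flaky_callback(change):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient error")
    seen.append(change)

saved = {"position": 0}  # position 0 was handled before the restart
consume([(0, "a"), (1, "b"), (2, "c")], flaky_callback, saved)
```

Saving the checkpoint only after a successful callback means a crash can cause a change to be delivered again, but not skipped.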

ScyllaDB CDC data tables are stored and distributed in the cluster, sharing all the same properties as you’re used to with your normal data. All considerations related to partition and clustering keys apply to CDC log tables. There’s no need for a separate specialized application to consume it.

The topology of the CDC log is meant to match that of the original data, so you get the same data distribution. The log is synchronized with your writes and shares the same properties and distribution as the base table, all to minimize the overhead of adding the log.

Everything in a CDC log is transient: it represents a window back in time that you can consume and inspect before the records are automatically erased. The default time-to-live (TTL) is 24 hours, so the CDC log will not overwhelm your storage system.
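The retention window has a practical consequence for consumers, sketched below. The helper names are hypothetical; the only value taken from the text is the 24-hour default TTL. A consumer that last processed data longer ago than the TTL may have missed records that have already expired.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_CDC_TTL = timedelta(hours=24)  # default stated above, i.e. 86400 seconds


def oldest_retained(now, ttl=DEFAULT_CDC_TTL):
    """Earliest write time whose log records have not yet expired."""
    return now - ttl


def consumer_has_gap(last_processed, now, ttl=DEFAULT_CDC_TTL):
    """True if records written after last_processed may already be gone."""
    return last_processed < oldest_retained(now, ttl)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)  # example timestamp
stale = consumer_has_gap(now - timedelta(hours=25), now)
fresh = consumer_has_gap(now - timedelta(hours=1), now)
```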

Consuming ScyllaDB CDC Data

As described above, ScyllaDB CDC tables are normal tables, so at the lowest level they are consumed through normal CQL operations. The data is already deduplicated, so there is no need to aggregate records from different replicas: CDC records always reside on the same shards as the base table replicas.
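As a sketch of what a low-level consumer does with rows read from a log table, the code below groups rows that belong to the same change. The column names (`cdc$time`, `cdc$batch_seq_no`, `cdc$operation`) and the operation codes follow our reading of the ScyllaDB CDC documentation and should be verified against your version. The rows are hand-written dicts with placeholder time values, not real query results.

```python
from itertools import groupby

# Subset of cdc$operation codes (assumed from the documentation).
CDC_OPERATIONS = {
    0: "pre-image",
    1: "update",
    2: "insert",
    3: "row delete",
    4: "partition delete",
    9: "post-image",
}


def group_changes(stream_rows):
    """Group rows of one stream, already in clustering order, by cdc$time."""
    changes = []
    for time, rows in groupby(stream_rows, key=lambda r: r["cdc$time"]):
        ops = [CDC_OPERATIONS.get(r["cdc$operation"], "other") for r in rows]
        changes.append((time, ops))
    return changes


rows = [
    {"cdc$time": "t1", "cdc$batch_seq_no": 0, "cdc$operation": 0},
    {"cdc$time": "t1", "cdc$batch_seq_no": 1, "cdc$operation": 1},
    {"cdc$time": "t1", "cdc$batch_seq_no": 2, "cdc$operation": 9},
    {"cdc$time": "t2", "cdc$batch_seq_no": 0, "cdc$operation": 2},
]
grouped = group_changes(rows)
```

Rows that share a `cdc$time` value describe one write, and the batch sequence number orders them within it.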

You can also build integrators on top of CDC for more advanced use cases that can be grouped into two categories: integration with other systems and real-time processing of data. Real-time processing can be done for example with Kafka Streams or Spark and is useful for triggers and monitoring.

ScyllaDB CDC Source Connector

ScyllaDB CDC Source Connector is a Debezium connector, compatible with Kafka Connect (with Kafka 2.6.0+). The connector reads the CDC log for specified tables and produces Kafka messages for each row-level INSERT, UPDATE, or DELETE operation. The connector is fault-tolerant: it retries reading data from ScyllaDB in case of failure. It periodically saves the current position in the ScyllaDB CDC log using Kafka Connect offset tracking. Each generated Kafka message contains information about the source, such as the timestamp and the table name.
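A connector like this is typically registered by sending a JSON payload to a Kafka Connect worker. The sketch below only builds that payload. The helper name and the example values are hypothetical, and the property names (`connector.class`, `scylla.name`, `scylla.cluster.ip.addresses`, `scylla.table.names`) reflect our reading of the connector's README, so verify them against the release you install.

```python
import json


def scylla_cdc_connector_payload(connector_name, cluster_name, contact_points, tables):
    """Build the JSON body used to register the connector with Kafka Connect."""
    return {
        "name": connector_name,
        "config": {
            "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
            "scylla.name": cluster_name,
            # Comma-separated host:port pairs of ScyllaDB nodes.
            "scylla.cluster.ip.addresses": ",".join(contact_points),
            # Comma-separated keyspace.table names with CDC enabled.
            "scylla.table.names": ",".join(tables),
        },
    }


payload = scylla_cdc_connector_payload(
    "scylla-cdc-source",        # placeholder connector name
    "MyScyllaCluster",          # placeholder logical cluster name
    ["127.0.0.1:9042"],         # placeholder contact point
    ["ks.base_table"],          # placeholder table
)
print(json.dumps(payload, indent=2))
```

The printed JSON is the body you would POST to the `/connectors` endpoint of the Kafka Connect REST API.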

Confluent and Kafka Connect

Confluent is a full-scale data streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams. It expands the benefits of Apache Kafka with enterprise-grade features. Confluent makes it easy to build modern, event-driven applications, and gain a universal data pipeline, supporting scalability, performance, and reliability.

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to define connectors that move large data sets in and out of Kafka. It can ingest entire databases or collect metrics from application servers into Kafka topics, making the data available for stream processing with low latency.

Kafka Connect includes two types of connectors:

- Source connector: Source connectors ingest entire databases and stream table updates to Kafka topics. Source connectors can also collect metrics from application servers and store the data in Kafka topics, making the data available for stream processing with low latency.
- Sink connector: Sink connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or batch systems, such as data warehouses and data lakes, for offline analysis.

Comparing ScyllaDB CDC with Other NoSQL Options

Here’s a short comparison with other major NoSQL options:

	Cassandra	DynamoDB	MongoDB	ScyllaDB
Consumer location	on-node	off-node	off-node	off-node
Replication	duplicated	deduplicated	deduplicated	deduplicated
Deltas	yes	no	partial	yes
Pre-image	no	yes	no	optional
Post-image	no	yes	yes	optional
Slow consumer reaction	Table stopped	Consumer loses data	Consumer loses data	Consumer loses data
Ordering	no	yes	yes	yes


Resources

Learn More about ScyllaDB Change Data Capture

How to

Get Started

How to Use Change Data Capture with Apache Kafka and ScyllaDB

Documentation

Change Data Capture

Read the documentation for additional information to get started.

Training

ScyllaDB University

ScyllaDB CDC Source Connector with Kafka – Lab

Apache® and Apache Cassandra® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Amazon DynamoDB® and Dynamo Accelerator® are trademarks of Amazon.com, Inc. No endorsements by The Apache Software Foundation or Amazon.com, Inc. are implied by the use of these marks.
