ScyllaDB Change Data Capture

Easily Track Updates, Migrate, or Stream Data from ScyllaDB Clusters


Overview

ScyllaDB Change Data Capture (CDC) is an optional feature that automatically records modifications made to one or more tables in your ScyllaDB cluster. CDC is primarily used for:

    • Streaming data — you want to plug changes into your Kafka pipeline, or run analysis at the level of individual transactions rather than on aggregates.
    • Data duplication — you want to mirror a portion of your database, optionally with ETL, but without using multi-datacenter replication.
    • Data replication — you want to transform the data on a per-transaction basis into some other storage format and repository.

With these building-block functions, CDC also allows ScyllaDB users to build streaming data pipelines for real-time data processing and analysis. More and more use cases require an almost immediate reaction to modifications occurring in the database. In a fast-moving world, the amount of new information is so large that it has to be processed on the spot. For example, it might be desirable to send an SMS to a user if a login attempt is made from a country the user does not live in – a typical fraud-detection operation. Change Data Capture is also useful for replicating data stored in the database to other systems that build indexes or aggregations on top of it.

The following three diagrams depict the write flow for a table with CDC already enabled: (1) a client write arrives, for example an INSERT INTO base_table(…), (2) if pre- or post-images are required, the coordinator node reads the existing row data, and (3) the coordinator generates the CDC log table writes and piggybacks them on the base table write.

Combined CDC Figure Diagram

How does ScyllaDB’s CDC Work?

Change data capture is enabled per table and records modifications made to CQL rows, letting you query the history of all changes made to CDC-enabled tables in your ScyllaDB database. When CDC is enabled on a table, a corresponding CDC log table is created, and mutations (or CDC records, if you prefer) are ordered by the timestamp of the original operation and the associated CQL batch sequence number. This ordering is important so that consumers know how to retrieve the data and continuously poll the table. The batch sequence number helps ensure that writes belonging to the same batch are processed together, improving performance and preserving correctness. Log tables can optionally record one or more of the following (see the sketch after this list):

    • Changes per column (delta, what was changed)
    • The pre-image (original state of data before the change)
    • The post-image (final, current state after the change)
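
As a concrete illustration, here is a minimal sketch that enables CDC on an existing table using the gocql driver, turning on deltas plus the optional pre- and post-images. The ks keyspace, the orders table, and the contact point are hypothetical placeholders; the cdc option names follow the ScyllaDB CDC documentation.

    package main

    import (
        "log"

        "github.com/gocql/gocql"
    )

    func main() {
        // Connect to the cluster (the address and keyspace are example placeholders).
        cluster := gocql.NewCluster("127.0.0.1")
        cluster.Keyspace = "ks"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatalf("connect: %v", err)
        }
        defer session.Close()

        // Enable CDC on an existing table, recording deltas plus the optional
        // pre- and post-images. A log table named orders_scylla_cdc_log is
        // created automatically alongside the base table.
        const enableCDC = `ALTER TABLE ks.orders
            WITH cdc = {'enabled': true, 'preimage': true, 'postimage': true}`
        if err := session.Query(enableCDC).Exec(); err != nil {
            log.Fatalf("enable cdc: %v", err)
        }
    }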

When writing applications, you can now use our language-specific, shard-aware libraries (Rust, Go, and Java), which simplify writing applications that read from ScyllaDB CDC. These libraries feature the following (a sketch of the callback pattern follows the list):

    • A simple callback-based interface for consuming changes from CDC streams.
    • Automatic retries in case of errors.
    • Transparent handling of the complexities related to topology changes.
    • Optional checkpointing functionality – the library can be configured to save progress so that it can later continue from the saved point after a restart.
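
The real scylla-cdc-go, scylla-cdc-java, and scylla-cdc-rust libraries expose their own APIs; the sketch below only illustrates the general callback-based consumption pattern they implement, and every type, function, and table name in it is hypothetical.

    package main

    import (
        "context"
        "fmt"
    )

    // Change is a hypothetical, simplified view of a single CDC record.
    type Change struct {
        Table     string
        Operation string                 // e.g. INSERT, UPDATE, DELETE
        Columns   map[string]interface{} // the changed columns (the delta)
    }

    // ChangeConsumer is the callback invoked for every change read from a CDC stream.
    type ChangeConsumer func(ctx context.Context, c Change) error

    // consume hands each incoming change to the callback. It stands in for the
    // libraries' internal stream readers, retry logic, and checkpointing.
    func consume(ctx context.Context, changes <-chan Change, cb ChangeConsumer) error {
        for {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case c, ok := <-changes:
                if !ok {
                    return nil
                }
                if err := cb(ctx, c); err != nil {
                    return err
                }
            }
        }
    }

    func main() {
        changes := make(chan Change, 1)
        changes <- Change{Table: "ks.orders", Operation: "INSERT",
            Columns: map[string]interface{}{"id": 42}}
        close(changes)

        // The application supplies only the callback; the library drives the rest.
        _ = consume(context.Background(), changes, func(ctx context.Context, c Change) error {
            fmt.Printf("%s on %s: %v\n", c.Operation, c.Table, c.Columns)
            return nil
        })
    }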

ScyllaDB CDC data tables are stored and distributed in the cluster, sharing all the same properties as you’re used to with your normal data. All considerations related to partition and clustering keys apply to CDC log tables. There’s no need for a separate specialized application to consume it.

The topology of the CDC log is meant to match the original data, so you get the same data distribution. It is synchronized with your writes and shares the same properties and distribution, all to minimize the overhead of adding the log.

Everything in a CDC log is transient, representing a window back in time that you can consume and inspect before it is automatically erased. The default time-to-live (TTL) is 24 hours, which means the CDC log is not going to overwhelm your storage system. The retention window can be tuned per table, as sketched below.
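
For instance, here is a minimal sketch of shortening the retention window to one hour, again assuming the hypothetical ks.orders table and an already-open gocql session as in the earlier example; the 'ttl' option takes a value in seconds.

    // setCDCTTL keeps CDC log rows for one hour instead of the 24-hour default.
    // session is an already-open *gocql.Session (see the earlier example).
    func setCDCTTL(session *gocql.Session) error {
        return session.Query(
            `ALTER TABLE ks.orders WITH cdc = {'enabled': true, 'ttl': 3600}`,
        ).Exec()
    }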

Consuming ScyllaDB CDC Data

As described above, ScyllaDB CDC tables are normal tables, and thus can be consumed at the lowest level through normal CQL operations. The data is already deduplicated, so there is no need to aggregate results from different replicas: the CDC data records always reside on the same shards as the base replicas.
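
For example, the sketch below polls the CDC log of the hypothetical ks.orders table with the gocql driver. Per the ScyllaDB documentation, the log table is named after the base table with a _scylla_cdc_log suffix, and its metadata columns carry a cdc$ prefix (quoted in CQL because of the $ character).

    package main

    import (
        "fmt"
        "log"

        "github.com/gocql/gocql"
    )

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // example contact point
        cluster.Keyspace = "ks"                  // hypothetical keyspace
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatalf("connect: %v", err)
        }
        defer session.Close()

        // Read recent entries from the auto-created CDC log table with a plain
        // CQL query; no special API is required.
        iter := session.Query(
            `SELECT "cdc$stream_id", "cdc$time", "cdc$operation", "cdc$batch_seq_no"
             FROM ks.orders_scylla_cdc_log LIMIT 100`,
        ).Iter()

        row := map[string]interface{}{}
        for iter.MapScan(row) {
            fmt.Printf("stream=%x time=%v op=%v seq=%v\n",
                row["cdc$stream_id"], row["cdc$time"],
                row["cdc$operation"], row["cdc$batch_seq_no"])
            row = map[string]interface{}{}
        }
        if err := iter.Close(); err != nil {
            log.Fatalf("read cdc log: %v", err)
        }
    }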

You can also build integrations on top of CDC for more advanced use cases, which fall into two categories: integration with other systems and real-time processing of data. Real-time processing can be done, for example, with Kafka Streams or Spark, and is useful for triggers and monitoring.

ScyllaDB CDC Source Connector

ScyllaDB CDC Source Connector is a Debezium connector, compatible with Kafka Connect (with Kafka 2.6.0+). The connector reads the CDC log for specified tables and produces a Kafka message for each row-level INSERT, UPDATE, or DELETE operation. The connector is fault-tolerant, retrying reads from ScyllaDB in case of failure. It periodically saves its current position in the ScyllaDB CDC log using Kafka Connect offset tracking. Each generated Kafka message contains information about the source, such as the timestamp and the table name.

Confluent and Kafka Connect

Confluent is a full-scale data streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams. It expands the benefits of Apache Kafka with enterprise-grade features. Confluent makes it easy to build modern, event-driven applications, and gain a universal data pipeline, supporting scalability, performance, and reliability.

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to define connectors that move large data sets in and out of Kafka. It can ingest entire databases or collect metrics from application servers into Kafka topics, making the data available for stream processing with low latency.

Kafka Connect includes two types of connectors:

    • Source connector: Source connectors ingest entire databases and stream table updates to Kafka topics. Source connectors can also collect metrics from application servers and store the data in Kafka topics, making the data available for stream processing with low latency.
    • Sink connector: Sink connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or batch systems, such as data warehouses and data lakes, for offline analysis.

Comparing ScyllaDB CDC with Other NoSQL Options

Here’s a short comparison with other major NoSQL options:

                        Cassandra        DynamoDB             MongoDB              ScyllaDB
Consumer location       on-node          off-node             off-node             off-node
Replication             duplicated       deduplicated         deduplicated         deduplicated
Deltas                  yes              no                   partial              yes
Pre-image               no               yes                  no                   yes, optional
Post-image              no               yes                  yes                  yes, optional
Slow consumer reaction  Table stopped    Consumer loses data  Consumer loses data  Consumer loses data
Ordering                no               yes                  yes                  yes

Resources

Learn More about ScyllaDB Change Data Capture

How to Use Change Data Capture with Apache Kafka and ScyllaDB

Read the documentation for additional information to get started.

ScyllaDB CDC Source Connector with Kafka – Lab

Observing data changes with Change Data Capture (CDC)

Tracking Data Updates in Real Time with Change Data Capture (CDC)

GitHub

Get the Code

Get the latest CDC drivers for: Java, Rust and Go.