How We Boosted ScyllaDB Data Streaming by 25x

Extreme scale engineering

Asias He

In This NoSQL Presentation

Streaming, the process of moving data to other nodes when a cluster scales out or in, used to process every partition one by one. It was slow, and its cost depended on the schema. File-based streaming is a new feature that significantly speeds up tablet movement. It streams entire SSTable files without deserializing them into mutation fragments and re-serializing them into SSTables on the receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells.

Asias He, Principal Software Engineer, ScyllaDB

Asias He is a long-time open source developer who previously worked on the Debian Project, the Solaris kernel, KVM virtualization for Linux, and the OSv unikernel. He now works on Seastar and ScyllaDB.

Additional Details

Summary: Asias He explains ScyllaDB’s internal data streaming, used for adding or removing nodes and migrating tablets. Traditional mutation‑based streaming reads mutations from multiple SSTables, serializes them, sends them over the network, and writes them back into new SSTables on the receiving node. The new file‑based approach streams whole SSTables directly, which is possible because each SSTable belongs to a single tablet. This eliminates parsing, cuts CPU use, lowers network bytes, and makes streaming up to 25x faster with roughly 10x higher streaming bandwidth.
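The difference between the two paths can be illustrated with a toy model. The sketch below is not ScyllaDB code: the JSON "SSTable" format and the names `write_sstable`, `read_sstable`, `mutation_based_stream`, and `file_based_stream` are all hypothetical, chosen only to show where parsing and re-serialization happen in one path and not in the other.

```python
import json
from typing import Dict, List, Tuple

# Toy stand-in for an SSTable: one immutable blob of already-encoded rows.
# Real SSTables are a multi-component binary format; JSON is an assumption
# made here only to keep the sketch runnable.
Row = Tuple[str, str]  # (partition key, value)


def write_sstable(rows: List[Row]) -> bytes:
    """Encode rows, sorted by key, into a single blob (hypothetical format)."""
    return json.dumps(sorted(rows)).encode("utf-8")


def read_sstable(blob: bytes) -> List[Row]:
    """Decode a blob back into rows. This is the parsing step."""
    return [tuple(r) for r in json.loads(blob.decode("utf-8"))]


def mutation_based_stream(sstables: List[bytes]) -> Tuple[List[bytes], Dict[str, int]]:
    """Sender parses every file and re-encodes each row for the wire;
    the receiver decodes each row and writes a brand-new file."""
    stats = {"rows_parsed": 0, "rows_serialized": 0}
    merged: Dict[str, str] = {}
    for blob in sstables:
        for key, value in read_sstable(blob):
            stats["rows_parsed"] += 1
            merged[key] = value  # toy merge rule: the later file wins
    wire: List[bytes] = []
    for key, value in sorted(merged.items()):
        wire.append(json.dumps([key, value]).encode("utf-8"))
        stats["rows_serialized"] += 1
    received_rows = [tuple(json.loads(message)) for message in wire]
    return [write_sstable(received_rows)], stats


def file_based_stream(sstables: List[bytes]) -> Tuple[List[bytes], Dict[str, int]]:
    """Sender ships each file's bytes untouched; the receiver stores them as-is."""
    stats = {"rows_parsed": 0, "rows_serialized": 0}
    return [bytes(blob) for blob in sstables], stats
```

In this model both paths leave the receiver with the same rows, but only the mutation-based path touches individual rows, so its work grows with the number of rows rather than the number of files.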

Topics discussed

What streaming does inside ScyllaDB to add or decommission nodes and migrate tablets
How mutation‑based streaming pulls individual mutations from SSTables, serializes them, and rebuilds them remotely
Why per‑mutation parsing and serialization limit throughput and raise CPU use
How file‑based streaming ships complete SSTables thanks to tablet ownership, skipping parsing and serialization
What performance gains appear: up to 25x shorter streaming time, about 10x higher streaming bandwidth, roughly one‑third of the data sent, and markedly fewer CPU cycles
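As a rough sanity check on how these figures relate, one can assume that streaming time is approximately the bytes on the wire divided by the wire throughput. That model is an assumption of this sketch, not something stated in the talk summary, and the helper name `expected_speedup` is hypothetical.

```python
def expected_speedup(bytes_reduction: float, throughput_gain: float) -> float:
    """Speedup in transfer time under the assumed model
    time = bytes_on_wire / throughput."""
    return bytes_reduction * throughput_gain


# Rounded inputs taken from the summary: about one-third of the bytes,
# about ten times the bandwidth.
estimate = expected_speedup(bytes_reduction=3.0, throughput_gain=10.0)
```

Under that assumption the estimate is 30x, which is in the same range as the reported speed-up of up to 25x given that both inputs are rounded.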

Takeaways

File‑based streaming turns the SSTable itself into the transfer unit, avoiding costly mutation decoding and re‑assembly. This lowers CPU and memory pressure and frees cycles for user workloads.
Because each SSTable belongs to a single tablet, whole‑file transfers never mix unrelated data. This lets ScyllaDB stream without filtering or rewriting, simplifying the code path and reducing I/O contention.
Field tests show streaming bandwidth improves by an order of magnitude while total bytes on the wire drop to roughly one‑third, making cluster expansion or rebalancing far quicker even on constrained networks.
The new method’s speed‑up of up to 25x shortens maintenance windows and shrinks the noisy‑neighbor effect during node operations, which directly benefits latency‑sensitive applications.
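The point that each SSTable belongs to a single tablet can be sketched as a file-selection step. The `SSTableFile` record and the `files_to_stream` helper below are hypothetical illustrations, not ScyllaDB types; they only show that picking data for a migrating tablet becomes a choice among whole files, with no per-row filtering.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SSTableFile:
    name: str        # hypothetical file name
    tablet_id: int   # the single tablet this file belongs to (assumed invariant)
    size_bytes: int


def files_to_stream(files: List[SSTableFile], tablet_id: int) -> List[SSTableFile]:
    """Select whole files for a migrating tablet; no row is inspected."""
    return [f for f in files if f.tablet_id == tablet_id]


def bytes_to_stream(files: List[SSTableFile], tablet_id: int) -> int:
    """Total payload for the migration is the sum of the selected file sizes."""
    return sum(f.size_bytes for f in files_to_stream(files, tablet_id))
```

If a file could hold rows from several tablets, the sender would have to open it and filter rows, which is the work the one-tablet-per-SSTable property avoids.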

Top takeaway: File‑based streaming lets ScyllaDB move entire SSTables between nodes, cutting CPU work and network bytes and slashing streaming time by up to 25×.

Monster Scale Summit
