ScyllaDB Open Source 5.1

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 5.1, a production-ready release of our open-source NoSQL database.

ScyllaDB 5.1 introduces per-partition rate limiting, distributed SELECT COUNT(*), and more functional, performance, and stability improvements.

Only the last two minor releases of the ScyllaDB Open Source project are supported. Now that ScyllaDB Open Source 5.1 is officially released, only ScyllaDB Open Source 5.1 and ScyllaDB 5.0 will be supported, and ScyllaDB 4.6 will be retired. Users are encouraged to upgrade to the latest and greatest release. Upgrade instructions are here.

New Features

Distributed SELECT COUNT(*)

ScyllaDB will now automatically run SELECT COUNT(*) statements on all nodes and all shards in parallel, bringing a considerable speedup, up to 100x in larger clusters.

This feature is limited to queries that do not use GROUP BY or filtering.
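
For illustration, with a hypothetical table ks.t that has a regular column v:

SELECT COUNT(*) FROM ks.t;                               -- parallelized
SELECT COUNT(*) FROM ks.t WHERE v = 1 ALLOW FILTERING;   -- not parallelized (uses filtering)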

The implementation includes a new level of coordination: a Super-Coordinator node splits an aggregation query into sub-queries, distributes them across a group of coordinators, and merges the results. Like the regular coordinator, the Super-Coordinator is a per-operation function, not a dedicated node.

Example results:

  • A 3-node cluster running on powerful desktops (3×32 vCPUs)
  • The cluster was filled with ~2 * 10^8 rows using scylla-bench, and the following was run:

> time cqlsh <ip> <port> --request-timeout=3600 -e "select count(*) from scylla_bench.test using timeout 1h;"

  • Before Distributed Select: 68s
  • After Distributed Select: 2s

You can disable this feature by setting the enable_parallelized_aggregation configuration parameter to false.
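
For example, a minimal scylla.yaml snippet that restores the previous single-coordinator behavior:

enable_parallelized_aggregation: false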

#1385 #10131

Limit partition access rate

It is now possible to limit read and write rates into a partition with the new WITH per_partition_rate_limit clause for the CREATE TABLE and ALTER TABLE statements. This is useful for preventing hot-partition problems when high-rate reads or writes are bogus (for example, arriving from spam bots). #4703

Limits are configured separately for reads and writes. Some examples:

Limit both reads and writes:

ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 100,
    'max_writes_per_second': 200
};

Limit reads only, no limit for writes:

ALTER TABLE t WITH per_partition_rate_limit = {
    'max_reads_per_second': 200
};
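
The same clause can be used when creating a table. A minimal sketch, with an illustrative schema:

CREATE TABLE t2 (
    id int PRIMARY KEY,
    v int
) WITH per_partition_rate_limit = {
    'max_writes_per_second': 100
};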

Learn More

Load and Stream

This feature extends nodetool refresh to allow loading arbitrary sstables that do not belong to a particular node into the cluster. It loads the sstables from disk, calculates the data’s owning nodes, and automatically streams the data to the owning nodes. In particular, this is useful when restoring a cluster from backup.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes.

One can copy the sstables from the old cluster to the new nodes and trigger the load and stream process.

This can make restores and migrations much easier:

  • You can place sstables from any node from the old cluster to any node in the new cluster
  • No need to run nodetool cleanup to remove data that does not belong to the local node after refresh

The load_and_stream option also updates the relevant Materialized Views. #9205

> curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"

Note there is an open bug, #282, for the nodetool refresh --load-and-stream operation. Until it is fixed, use the REST API above.

Materialized Views: Prune

A new CQL extension, the PRUNE MATERIALIZED VIEW statement, can now be used to remove inconsistent rows, known as ghost rows, from materialized views.

A ghost row is a row in a materialized view that does not correspond to any row in the base table. Such inconsistencies should be prevented altogether, and ScyllaDB strives to avoid them, but if they do happen, this statement can be used to restore a materialized view to a fully consistent state without rebuilding it from scratch.

Example usages:

PRUNE MATERIALIZED VIEW my_view;
PRUNE MATERIALIZED VIEW my_view WHERE token(v) > 7 AND token(v) < 1535250;
PRUNE MATERIALIZED VIEW my_view WHERE v = 19;

Alternator Updates

Alternator, ScyllaDB’s implementation of the DynamoDB API, has also received several improvements in this release.

Materialized Views: Synchronous Mode (Experimental)

There is now a synchronous mode for materialized views. In the ordinary, asynchronous mode, operations on the base table return before the view is updated. In synchronous mode, operations do not return until the view is updated. This enhances consistency but reduces availability, as in some situations all nodes might be required to be functional.

Example:

CREATE MATERIALIZED VIEW main.mv
  AS SELECT * FROM main.t
  WHERE v IS NOT NULL
  PRIMARY KEY (v, id)
  WITH synchronous_updates = true;

ALTER MATERIALIZED VIEW main.mv WITH synchronous_updates = true;

Learn More

Performance: Eliminate exceptions from the read and write path

When a coordinator times out, it generates an exception, which is then caught in a higher layer and converted to a protocol message. Since exceptions are slow, this can make a node that experiences timeouts become even slower. To prevent that, the coordinator write and read paths have been converted not to use exceptions for timeout cases, treating them as another kind of result value instead. Further work on the read path and on the replica reduces the cost of timeouts, so that goodput is preserved while a node is overloaded.

Improvement results, including the commands used to generate them, are below:

Before Exceptions Elimination

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write --stop-on-error=false --timeout=0s

 57231.89 tps ( 62.1 allocs/op,   6.1 tasks/op,  138768 insns/op,  1000000 errors)

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --stop-on-error=false --flush --timeout=0s

22694.66 tps ( 50.5 allocs/op,  13.1 tasks/op,  366645 insns/op,   983824 errors)

After Exceptions Elimination

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --write --stop-on-error=false --timeout=0s

163772.65 tps ( 52.1 allocs/op,   5.1 tasks/op,   40748 insns/op,  1000000 errors)

./build/release/test/perf/perf_simple_query --smp=1 --operations-per-shard=1000000 --stop-on-error=false --flush --timeout=0s

193391.25 tps ( 39.9 allocs/op,  12.1 tasks/op,   32881 insns/op,   974292 errors)

Results for the further read path and replica work will be shared soon.

Raft Updates

While strongly consistent schema management remains experimental in 5.1, work on the Raft consensus algorithm continues toward more use cases, such as safe topology updates, and toward improved traceability and stability. Here are selected updates in this release:

  • A Data Definition Language (DDL) statement will fail under Raft-managed schema if a concurrent change is applied. Scylla will now automatically retry executing the statement, and if there was no conflict between the two concurrently executing statements (for example, if they apply to different tables) it will succeed. This reduces disruption to user workloads.
  • ScyllaDB will retry a DDL statement if a transient Raft failure has occurred, e.g. due to a leader change.
  • When Raft is enabled, nodes join as non-voting members until bootstrap is complete.
  • When node A sends a read or write request to node B, it will also send the schema version of the relevant table. If B is using a different version, it will ask A for the schema corresponding to that version. If A doesn’t have it, the request will fail. Consequently, the cache expiration for schema versions has been increased to reduce the probability of such events.
  • Raft now persists its peer nodes in a local table.
  • The system could deadlock if a schema-modifying statement was executed in Raft mode during shutdown. This is now fixed.
  • A new failure detector was merged for Raft.
  • There is a new test suite for testing topology changes under Raft.

Web Assembly (WASM) based UDA/UDF – Experimental

ScyllaDB 5.1 brings experimental support for Wasm-based User Defined Functions and User Defined Aggregates.

The CQL syntax is compatible with Apache Cassandra. Examples:

CREATE FUNCTION sample ( arg int ) ...;
CREATE FUNCTION sample ( arg text ) ...;

A full example of using Rust to create a UDF will be shared soon.

To enable WASM UDF in ScyllaDB 5.1:

Use the command line options:

--enable-user-defined-functions true --experimental-features udf

Or in scylla.yaml:

experimental_features:
    - udf

Issues fixed in this release:

  • Memory management for User Defined Functions (UDF) in WebAssembly has been significantly improved. There is now a protocol that allows ScyllaDB to allocate memory within the function’s memory space to store function parameters. This means that within a WebAssembly function, normal memory management can be used (e.g. Rust’s “new” methods).
  • In WebAssembly UDFs, RETURNS NULL ON NULL INPUT is now honored.
  • User-defined aggregates can now have a REDUCEFUNC defined, allowing them to run in parallel on all shards and nodes (see the sketch after this list). In addition, all native aggregates can now be parallelized (not just COUNT(*)), and multiple aggregations in a single SELECT statement are supported.
  • A compiled WebAssembly function will now be reused across multiple invocations. Creating a WebAssembly function instance is very expensive, so we amortize the creation across many invocations.
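
As an illustration, a minimal sketch of a parallelizable aggregate; the aggregate name, state type, and scalar functions (accumulate, reduce, divide) are hypothetical, and the scalar functions are assumed to have been created beforehand:

CREATE AGGREGATE custom_avg(int)
SFUNC accumulate          -- folds one value into the (sum, count) state
STYPE tuple<bigint, int>
REDUCEFUNC reduce         -- merges two partial states, enabling parallel execution
FINALFUNC divide          -- converts the final state into the result
INITCOND (0, 0);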

Updates in this Release

Deployment and Packaging

  • ScyllaDB now has an Azure snitch for inferring rack/datacenter from the instance metadata.
  • The docker image has received some fixes for regressions: the locale is now set correctly, the supervisord service name has been restored for compatibility with scylla-operator, and the service runs as root to avoid interfering with permissions outside the container. #10310 #10269 #10261

CQL API updates

A list of CQL bug fixes and extensions:

  • Integer columns now accept scientific notation JSON numbers, as long as those numbers are integers. This improves interoperability with JSON libraries (and fixes a regression).  Fixes #10100 #10114 #10115
  • Empty strings are now allowed in materialized view partition keys. #9352 #9364 #9375 #10178
  • The LIKE operator on descending order clustering keys now works. #10183
  • ScyllaDB would incorrectly use an index with some IN queries, leading to incorrect results. This is now fixed.
  • In CREATE AGGREGATE statements, the INITCOND and FINALFUNC clauses are now optional, defaulting to NULL and the identity function respectively (see the sketch after this list).
  • CREATE KEYSPACE now has a WITH STORAGE clause, allowing users to customize where data is stored. For now, this is only a placeholder for future extensions.
  • When talking to drivers using the older v3 protocol, ScyllaDB did not serialize timeout exceptions correctly, resulting in the driver complaining about protocol violations. This is now fixed. #5610
  • The CQL grammar was relaxed to allow bind markers in collection literals, e.g. UPDATE tab SET my_set = { ?, 'foobar', :variable }.
  • ScyllaDB now validates collections for NULLs more carefully. #10580 After this change, a query such as INSERT INTO ks.t (list_column) VALUES (?); with the driver sending a list containing a null (for example, [1, 2, null, 4]) as the bound value results in an invalid_request_exception instead of an obscure marshaling error.
  • Support for lists containing NULLs in IN relations (“WHERE x IN (1, 2, NULL, 3)”) has been removed. This is useless (since NULL doesn’t match anything) and conflicts with how lists work elsewhere.
  • WHERE clause processing has been relaxed to allow filtering on a list element: “WHERE my_list[:index] = :value ALLOW FILTERING”. Previously this was only allowed on maps.
  • Lightweight transaction behavior with regard to static rows was adjusted to conform to Cassandra’s behavior.
  • If a named bind variable appeared twice in a statement (“WHERE a = :my_var AND b = :my_var”), ScyllaDB treated this as two separate variables. This is now fixed. #10810
  • The token() built-in function will now correctly return NULL if given NULL inputs, instead of failing the request. #10594
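
For instance, a minimal sketch of an aggregate relying on the new defaults; the state function sum_int is hypothetical and assumed to already exist:

CREATE AGGREGATE simple_sum(int)
SFUNC sum_int
STYPE int;
-- INITCOND defaults to NULL; FINALFUNC defaults to the identity function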

Stability and Performance Improvements

  • If a file page was inserted into the cache, but the insertion required a memory allocation, it could cause memory corruption. This is now fixed. File caching is part of the index caching feature introduced in ScyllaDB 4.6. #9915
  • A recent commitlog regression has been fixed: we might have created two commitlog segment files in parallel, confusing the queue that holds them. The problem was only present in ScyllaDB 4.6. #9896
  • When shutting down, compaction tasks that are sleeping while awaiting a retry are now aborted, so the shutdown is not delayed. #10112
  • The compaction manager’s definition of compaction backlog for size-tiered compaction strategy has changed, reducing write amplification. The compaction backlog is used to determine how much resources will be devoted to compaction. See the results graphed here.
  • An accounting bug in sstable index caching that could lead to running out of memory has been fixed. #10056
  • ScyllaDB will shut down gracefully (without a core dump) if it encounters a permission or space problem on startup. #9573
  • sstables are created in a temporary directory to avoid serializing on an XFS lock. If the sstable writing failed, this directory would be left behind. It is now deleted. #9522
  • When using the spark migrator, Scylla might see TTL’ed cells which it thought needed repair, but could not actually find a difference in, leading to repair not resolving the difference and detecting it again on the next run. This has been fixed. #10156
  • Prepared batch statements are now correctly invalidated when a referenced table changes. #10129
  • A race in the prepared statement cache could cause invalidations (due to a schema change) to be ignored. This is now fixed. #10117
  • When populating the row cache with data from sstables, we will first compact the data in order to avoid populating the cache with data that will be later ignored. #3568
  • When reading, we now notice if data exists in sstables/cache but not in memtables, or vice-versa. In either case there is no need to perform a merge, yielding substantial performance improvements.
  • Truncate now disables compaction temporarily on the table and its materialized views, in order to complete faster.
  • Cleanup compactions, used after bootstrapping a new node to reclaim space in the original nodes, now have reduced write amplification.
  • SSTable index file reads now use read-ahead. This improves performance in workloads that frequently skip over small parts of the partition (e.g. full scans with clustering key restrictions).
  • When ScyllaDB generates a name for a secondary index, it will avoid using special characters (if they were present in the table name). #3403
  • Cleanup compactions, used to discard data that was moved to a new node, now have reduced write amplification on Time Window Compaction Strategy tables.
  • Reads from cache are now upgraded to use the new range tombstone representation. This completes the conversion of the read pipeline, and nets a nice performance improvement as detailed in the commit message.
  • A concurrent DROP TABLE while streaming data to another node is now tolerated. Previously, streaming would fail, requiring a restart of the node add or decommission operation. #10395
  • A crash where a map subscript was NULL in certain CQL expressions was fixed. Fixes #10361 #10399 #10401
  • A regression that prevented partially-written sstables from being deleted was fixed.
  • A race condition that allowed queries to be processed after a table was dropped (causing a crash) was fixed. #10450
  • The prepared statement cache was recently split into two sections, one for reused statements and one for single-use statements. This was done for flood protection – so that a bunch of single-use statements won’t evict reused statements from the cache. However, this created a regression when the size of the single-use section was shrunk so that it was too small for statements to promote into the reused section. This is now fixed by maintaining a minimum size for each section. #10440
  • Level selection for Leveled Compaction Strategy was improved, reducing write amplification.
  • Reconciliation is the process that happens when two replicas return non-identical results for a query. Some reactor stalls were removed, reducing latency effects on concurrent queries. #2361 #10038
  • A crash in some cases where an sstable index cursor was at the end of the file was fixed. #10403
  • A compaction job that is waiting in queue is now aborted immediately, rather than waiting to start and then getting aborted.
  • Repair-based node operations use repair to move data between nodes for bootstrap/decommission and similar operations (currently enabled by default only for replacenode). The iteration order has been changed from an outer iteration on vnodes and an inner iteration on tables to an outer iteration on tables and an inner iteration on vnodes, allowing tables to be completed earlier. This in turn allows compaction to reduce the number of sstables earlier, reducing the risk of having too many sstables open.
  • A “promoted index” is the intra-partition index that allows seeking within a partition using the clustering key. Due to a quirk in the sstable index file format, this has to be held in memory when it is being created. As a result, huge partitions risk running out of memory. ScyllaDB will now automatically downscale the promoted index to protect itself from running out of memory.
  • Change Data Capture (CDC) tables are no longer removed when CDC is disabled, to allow the still-queued data to be drained. #10489
  • Until now, a deletion (tombstone) that hit data in memtable or cache did not remove the data from memtable or cache; instead both the data and the tombstone coexisted (with the data getting removed during reads or memtable flush). This was changed to eagerly apply tombstones to memtable/cache data. This reduces write amplification for delete-intensive workloads (including the internal Raft log). #652
  • Recently, repair was changed to complete one table before starting the next (instead of repairing by vnodes first). We now perform off-strategy compaction immediately after a table was completed. This reshapes the sstables received from different nodes and reduces the number of sstables in the node.
  • The Leveled Compaction Strategy was made less aggressive. #10583
  • Compaction now updates the compaction history table in the background, so if the compaction history table is slow, compaction throughput is not affected.
  • Memtable flushes will now be blocked if flushes generate sstables faster than compaction can clear them. This prevents running out of memory during reads. This reduces problems with frequent schema updates, as schema updates cause memtable flushes for the schema tables. #4116
  • A recent regression involving a lightweight transaction conditional on a list element was fixed. #10821
  • A race condition between the failure detector and gossip startup was fixed.
  • Node startup for large clusters was sped up in the common case where there are no nodes in the process of leaving the cluster.
  • An unnecessary copy was removed from the memtable/cache read path, leading to a nice speedup.
  • A Seastar update reduces the node-to-node RPC latency.
  • Internal materialized view reads incorrectly read static columns, confusing the following code and causing it to crash. This is now fixed. #10851
  • Staging sstables are used when materialized views need to be built after token range is moved to a different node, or after repair. They are now compacted regularly, helping control read and space amplification.
  • Dropping a keyspace while a repair is running used to kill the repair; now this is handled more gracefully.
  • A large amount of range tombstones in the cache could cause a reactor stall; this is now fixed.
  • Adding range tombstones to the cache or a memtable could cause quadratic complexity; this is also fixed.
  • Under certain rare and complicated conditions, memtables could miss a range tombstone. This could result in temporary data resurrection. This is now fixed. Fixes #10913 #10830
  • Previously, a schema update caused a flush of all memtables containing schema information (i.e. in the system_schema keyspace). This made schema updates (e.g. ALTER TABLE) quite slow. This was because we could not ensure that commitlog replay of the schema update would come before the commitlog replay of changes that depend on it. Now, however, we have a separate commitlog domain that can be replayed before regular data mutations, and so we no longer flush schema update mutations, speeding up schema updates considerably. #10897
  • Gossip convergence time in large clusters has been improved by disregarding frequently changing state that is not important to cluster topology – cache hit rate and view backlog statistics.
  • Reads and writes no longer use C++ exceptions to communicate timeouts. Since C++ exceptions are very slow, this reduces CPU consumption and allows an overloaded node to retain throughput for requests that do succeed (“goodput”).
  • Major compaction will now happen using the maintenance scheduling group (so its CPU and I/O consumption will be limited), and regular compaction of sstables created since it was started will be allowed. This will reduce read amplification while a major compaction is ongoing. #10961
  • The Seastar I/O scheduler was adjusted to allow higher latency on slower disks, in order to avoid a crash on startup or just slow throughput. #10927
  • ScyllaDB will now clean up the table directory skeleton when a table is dropped. #10896
  • The repair watchdog interval has been increased, to reduce false failures on large clusters.
  • ScyllaDB propagates cache hit rate information through gossip, to allow coordinators to send less traffic to a newly started node. It will now spend less effort doing so on large clusters. #5971
  • Improvements to token calculation mean that large cluster bootstrap is much faster.
  • An automatically parallelized aggregation query now limits the number of token ranges it sends to a remote node in a single request, in order to reduce large allocations. #10725
  • Accidentally quadratic behavior when a large number of range tombstones is present in a partition has been fixed. #11211
  • Fixed an issue where the row cache would miss a row if the upper bound of a population range was evicted and had an adjacent dummy row. #11239

Tooling

  • scylla-api-cli is a lightweight command line tool for interfacing with the Scylla REST API.
    The tool can be used to list the available API functions and their parameters, and to print detailed help for each function. When invoking a function, scylla-api-cli performs basic validation of the function arguments and prints the result to standard output. Note that JSON results may be pretty-printed using commonly available command line utilities. It is recommended to use scylla-api-cli for interactive usage of the REST API, rather than plain HTTP tools such as curl, to prevent human errors.
  • The sstable utilities now emit JSON output. See example output here.
  • There are two new sstable tools, validate-checksums and decompress, allowing more offline inspection options for sstables:
    • scylla-sstable validate-checksums: helps identify whether an sstable is intact by checking the digest and the per-chunk checksums against the data on disk.
    • scylla-sstable decompress: helps when one wants to manually examine the content of a compressed sstable.
  • The SSTableLoader code base has been updated to support “me” format sstables.
  • The sstable parsing tools usually need the schema to interpret an sstable’s data. For the special case of system tables, the tools can now use well-known schemas.
  • Nodetool was updated to fix IPv6-related errors (occurring even when IPv4 is used) with updated JVMs. #10442
  • Cassandra-derived tooling such as cqlsh and cassandra-stress was synchronized with Cassandra 3.11.3.
  • The bundled Prometheus node_exporter, used to report OS-level metrics to the Scylla Monitoring Stack, was upgraded to version 1.3.1.
  • Repairs that were in their preparation stage previously could not be aborted. This is now fixed.
  • ScyllaDB documentation has been moved from the scylla-docs.git repository to scylla.git. This will allow us to provide versioned documentation.
  • The sstable tools gained a write operation that can convert a json dump of an sstable back into an sstable.

Storage

  • “me” format sstables are now supported, and are the default format.
  • ScyllaDB will now store the Scylla version and build-id used to generate an sstable. This is helpful in tracking down bugs that altered persisted data.

Configuration

It is now possible to limit, and control in real time, the bandwidth used by streaming and compaction.

These and more configuration updates are listed below:

  • It is now possible to limit I/O for repair and streaming to a user-defined bandwidth limit, using the new stream_io_throughput_mb_per_sec config value. The value throttles streaming I/O to the specified total throughput (in MiB/s) across the entire system. Streaming I/O includes that performed by repair and by both RBNO and legacy topology operations, such as adding or removing a node. Setting the value to 0 disables stream throttling (the default). The value can be updated in real time via the config virtual table or via configuration file hot-reload. It is recommended not to change this configuration from its default value, which dynamically determines the best bandwidth to use.
  • compaction_throughput_mb_per_sec: throttles compaction to the specified total throughput across the entire system. The faster you insert data, the faster you need to compact in order to keep the sstable count down. The recommended value is 16 to 32 times the rate of write throughput (in MB/s). Setting the value to 0 disables compaction throttling. It is recommended not to change this configuration from its default value, which dynamically determines the best bandwidth to use. An illustrative scylla.yaml snippet follows this list.
  • It is now possible to disable updates to node configuration via the configuration virtual table. This is aimed at Scylla Cloud, where users have access to CQL but not the node configuration. #9976
  • EC2MultiRegionSnitch will now honor broadcast_rpc_address if set in the configuration file.#10236
  • The permissions cache configuration is now live-updatable (via SIGHUP); and there is now an API to clear the authorization cache.
  • The compaction_static_shares and memtable_flush_static_shares configuration items, used to override the controllers, can now be updated without restarting the server.
  • column_index_auto_scale_threshold_in_kb was added to the configuration (defaults to 10 MB). When the promoted index (serialized) size reaches this threshold, it is halved by merging each two adjacent blocks into one and doubling the desired_block_size.
  • commitlog_flush_threshold_in_mb: Threshold for commitlog disk usage. When used disk space goes above this value, Scylla initiates flushes of memtables to disk for the oldest commitlog segments, removing those log segments. Adjusting this affects disk usage vs. write latency.
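
As an illustration, these options might be set in scylla.yaml as follows; the numbers are arbitrary examples, not recommendations:

stream_io_throughput_mb_per_sec: 100
compaction_throughput_mb_per_sec: 128
commitlog_flush_threshold_in_mb: 2048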

Monitoring and Tracing

Below is a list of monitoring and tracing related work in this release:

  • Slow query tracing only considered local time – the time from when a request first hit the replica – to determine whether a request needs to be traced. This could cause some parts of slow query tracing to be missed. To fix that, slow queries on the replicas are now determined using the start time on the coordinator.
  • The system.large_partitions and similar system tables will now hold only the base name of the sstable, not the full path. This is to avoid confusion if the large partition is reported while the sstable is in one directory, but later moved to another, for example from staging to the main directory after view building is done or into the quarantine subdirectory if they are found to be inconsistent with scrub. #10075
  • There are now metrics showing each node’s idea of how many live nodes and how many unreachable nodes there are. This aids understanding problems where failure detection is not symmetric. #10102
  • The system.clients table has been virtualized. This is a refactoring with no UX impact.
  • Aggregated queries that use an index are now properly traced.
  • The number of per-table metrics has been reduced by sending metric summaries instead of histograms and by not sending unused metrics.

Bug Fixes

For a full list of fixed issues, see the git log and the 5.1 release candidate notes.

01 Dec 2022