ScyllaDB Open Source 5.0

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 5.0, a production-ready release of our open source NoSQL database.

ScyllaDB Open Source 5.0 introduces safe schema changes with Raft, automatic management of tombstone garbage collection, and more functional, performance and stability improvements.

In particular, a new IO scheduler further reduces the effect of admin operations like repair on online request latency.

ScyllaDB Open Source 5.0 supports the latest and greatest AWS instances im4gn, is4gen, and i4i, with significant performance per core improvement compared to i3. More below.

Only the latest two minor releases of the ScyllaDB Open Source project are supported. With this release, only ScyllaDB Open Source 5.0 and 4.6 are supported. Users running ScyllaDB Open Source 4.5 and earlier are encouraged to upgrade to one of these two releases.

New Features

New I/O Scheduler

A new I/O scheduler measures and uses more granular storage properties to optimize and create a better balance between Scylla IO queues.

Scylla keeps a matrix of relevant combinations of the storage properties:

write/read bandwidth (bytes per second)
write/read IOPS (ops per second)

This matrix is part of the Scylla cloud image (for example, AMI for AWS), or generated at setup.

The results are used as an input for the real-time I/O scheduler.

This new I/O scheduler brings significant improvement and further reduces the effect of admin operations like repair or scale-out on online request latency.

For more information, see Linux Foundation recorded webinar Understanding Storage I/O Under Load with Avi Kivity and Pavel “Xemul” Emelyanov and a soon-to-be-released blog post with much more details and results.

Strongly Consistent Schema Management (experimental)

Schema management operations are DDL operations that modify the schema, like CREATE, ALTER, DROP for KEYSPACE, TABLE, INDEX, UDT, MV, etc.

Unstable schema management has been a problem in all Apache Cassandra and ScyllaDB versions. The root cause is the unsafe propagation of schema updates over gossip, as concurrent schema updates can lead to schema collisions.

ScyllaDB Open Source 5.0 is the first version with Raft-based schema management. With the implementation of the Raft consensus algorithm, ScyllaDB can now use it to implement data and cluster-level operations, starting with schema changes, and making concurrent schema updates safe.

To enable experimental safe schema update with Raft use:

--experimental-features=raft

Updates in this release:

ScyllaDB now sets up “Raft Group 0“, a central Raft group that will be used for topology and schema coordination. It is not used by default.
Schema synchronization across the cluster can now be done via Raft. Note it is still disconnected from the CQL statements that generate schema updates.
When Raft is enabled (in experimental mode), all schema management (i.e. CREATE TABLE / ALTER TABLE, etc.) will be done via Raft. This prevents incompatible schema changes from completing successfully on different nodes.

For more information, see “Making Schema Changes Safe with Raft” by Konstantin Osipov in Scylla Summit 2022.

Complete documentation of Raft and how to enable it is available here. Note that part of the failure recovery process will change in upcoming releases.

Automatic Management of Tombstone Garbage Collection (experimental)

ScyllaDB inherited the gc_grace_seconds option from Cassandra. The option allows you to specify the wait time (in seconds) before deletion tombstones are considered to be purged by compaction. This option assumes that you run repairs during the specified time. Failing to run repairs during the wait time may result in the resurrection of deleted data.

The new tombstone_gc schema option allows you to prevent data resurrection. With the repair mode configured, tombstones are considered for removal by compaction (if they may not shadow live data in any SSTtable) after repair is performed successfully on the affected token-range and after the tombstone was written. Unlike gc_grace_seconds, tombstone_gc has no time constraints — when the repair mode is on, tombstones garbage collection will wait until repair is run.

CREATE TABLE ks.cf (key blob PRIMARY KEY,  val blob) WITH tombstone_gc = {'mode':'repair'};

ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'} ;

Available options are:

Timeout (default): Tombstone GC is performed after the wait time specified with gc_grace_seconds (same as in ScyllaDB 4.6). Default in ScyllaDB 5.0.
Repair: Tombstone GC is performed after repair has run successfully.
Disable: Tombstone GC is never performed. This mode may be useful when loading data to the database, to avoid tombstone GC when part of the data is not yet available.
Immediate: Tombstone GC is immediately performed. There is no wait time or repair requirement. This mode is useful for tables using TWCS compaction strategy with no user deletes. After data is expired after TTL, Scylla node can perform compaction to drop the expired data immediately.

Source: ScyllaDB Documentation

For more, read “Preventing Data Resurrection with Repair Based Tombstone Garbage Collection” or watch “Repair Based Tombstone GC” by Asias He at Scylla Summit 2022.

#3560

Virtual Table for Configuration

A new virtual table, system.config, allows querying and updating configuration over CQL.

Note that only a subset of the configuration parameters can be updated. Also note these updates are not persistent, and will be reverted to the scylla.yaml values in case the ScyllaDB server restarts.

Virtual Tables for Retrieving nodetool-level Information

Several virtual tables have been added for providing information usually obtained via nodetool:

system.snapshots – replacement for nodetool listsnapshots;
system.protocol_servers – replacement for nodetool statusbinary as well as Thrift active and Native Transport active from nodetool info;
system.runtime_info – replacement for nodetool info; but not an exact match: some fields were removed and some were refactored to make sense for scylla;
system.versions – replacement for nodetool version, prints all versions, including build-id;

Unlike nodetool, which is blocked for remote access by default, virtual tables allow remote access over CQL, including for Scylla Cloud users.

For information about available virtual tables, see here

Deployment

The AWS im4gn and is4gen instance families are now pre-tuned and supported out of the box.
The AWS I4i instance types are now supported and pre-tuned and supported out of the box. Benchmark results show 2x improvement in throughput compared to the i3 instances on the same number of vCPUs! More detailed benchmarks will be shared soon.
Debian 11 is now supported
Ubuntu 16.04 and Debian 9, both announced deprecated in ScyllaDB 4.6, are no longer supported in ScyllaDB 5.0
Due to a bug in CentOS 8 mdadm, we now pin mdadm’s version to a known-good release. #9540

Improvements

Reverse Queries

A reverse query is a query SELECT that uses a reverse order compared to the one used in the table schema. If no order was defined, the default order is ascending (ASC).

For example, the following table schema orders the rows in a partition by “time” in an ascending order:

CREATE TABLE heartrate (
   pet_chip_id uuid,
   owner uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id, time)
);

The following SELECT worked in Scylla 4.5 but might be very inefficient:

SELECT * FROM heartrate LIMIT 1 ORDER BY time DESC

Note that if the primary key uses multiple columns in mixed order (some are ascending and some are descending), the reverse order needs to invert the sorting order of all clustering columns participating in the primary key.

Reverse Queries were improved in 4.6, and are further improved in ScyllaDB 5.0, as follows:

ScyllaDB honors the page size requested by the client driver, but can also return short pages to limit its memory consumption. With the older implementation of reversed queries, the ability to return shorter pages was not available for reversed queries. With native reversed queries now enabled, ScyllaDB will also return short pages for reversed queries.
The row cache can now serve reversed queries (with query clustering order opposite from the schema definition). Previously, reversed queries automatically bypassed the row cache.
ScyllaDB can now fast-forward when reading a partition backwards. Fast-forwarding is used to skip over unneeded data when several subranges of clustering keys are wanted, for example in the query SELECT * FROM tab WHERE pk = ? AND ck1 IN (?, ?, ?) ORDER BY ck1 DESC, ck2 DESC. #9427

Security

There is now support for a certificate revocation list for TLS encrypted connections. This makes it possible to deny a client with a compromised certificate. #9630
The default Prometheus listener address is now localhost. Note that you may need to update this configuration item to expose the Prometheus port to your collector. #8757
The recently-changed Prometheus listen address configuration has been refined. ScyllaDB will now bind to the same host as the internal RPC address if not configured. This will reduce misconfigurations during upgrades, as typically configuration will not require any changes. See #8757 above, #9701

CQL

One can now omit irrelevant clustering key columns from ORDER BY clauses.

You can now replace

SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c1 ASC, c2 ASC)

With the shorter

SELECT * FROM t WHERE p = 0 AND c1 = 0 ORDER BY (c2 ASC)

#2247

Alternator

Alternator, ScyllaDB’s implementation of the DynamoDB API, now provides the following improvements:

Supports DELETE operations that remove an element from a set. #5864
Initial support for time-to-live (TTL) expiration. #9624 #9787 #9624 To enable use --experimental-features=alternator-ttl.
Added missing support for UpdateItem DELETE operation with value parameter #5864
Avoid large allocations while streaming. A similar change was made to the BatchGetItems API. This reduces latency spikes. #8522
Deleting a nested path with non-existent leaf should work #10043
Query with ScanIndexForward = false is inefficient and limited to 100MB #7586
Stream JSON response object — not convert it to a contiguous string #8522

Performance Improvements

Seastar has disabled Nagle’s algorithm for the http server, preventing 40ms latency spikes. #9619
XFS filesystems created by scylla_setup now have online discard enabled. This can improve SSD performance. Upgrades should consider manually adding online discard, providing they run a recent kernel. #9608
ScyllaDB precalculates replication maps (these contain information about which replicas each token maps to). We now share replication maps among keyspaces with similar configuration, to save memory and time.
ScyllaDB will now compact memtables when flushing them into SSTables. This results in smaller SSTables in delete-heavy workloads. Memtable compaction is automatically disabled when there are no relevant tombstones in the memtable. #7983
The internal cache used for (among other things) prepared statements now has pollution resistance. This means that a workload that uses unprepared statements will not interfere with a workload that properly prepares and reuses its statements. #8674 #9590
Leveled compaction strategy (LCS) will now be less aggressive in promoting SSTables to higher levels, reducing write amplification.
Data in memtables is now considered when purging tombstones. While it’s very unlikely to have data in memtable that is older than a tombstone in an SSTable, it’s still better to protect against it. #1745
When using Time Window Compaction Strategy (TWCS), ScyllaDB will now compact tombstones across buckets, so they can remove the deleted data. Note that using DELETE with TWCS is still not recommended. #9662
Time window compaction strategy (TWCS) major compactions will now serialize behind other compactions, in order to to include all SSTables in the compaction input set. #9553
Repair will now prefer closer nodes for pulling in missing data, when there is a choice. This reduces cross-datacenter traffic. PR#9769
ScyllaDB performs a reshape compaction to bring SSTables into the structure required by the compaction strategy, for example after a repair. It will now require less free space while doing so.
Reshape of Time Window Compaction Strategy (TWCS) now tries to compact SSTables of similar size, like Size Tiered Compaction Strategy, to reduce reshape time. This helps reduce reshape time if one accidentally creates tiny time windows and then has to increase them dramatically to avoid an explosion in SSTable counts.
Generally, if a node finds it needs to reshape SSTables while starting up, it will remain offline while doing so, in order to reduce read amplification. However, for the case of repair SSTables, remaining offline can be avoided and the node can start up immediately. #9895

Tooling and API

A new tool scylla-sstable allows you to examine the content of SStables by performing operations such as dumping the content of SStables, generating a histogram, validating the content of SStables, and more.
The main ScyllaDB executable can now run subtools by supplying a subcommand: scylla sstables and scylla types, to inspect SSTables and schema types. Additional commands will be added over time (see scylla-sstable above as an example) #7801
The scylla tool sub-subcommands have changed from switch form (‘scylla sstable --validate‘) to subcommand form (‘scylla sstable validate‘).
When the user requests to stop compactions, ScyllaDB will now only stop regular compaction, not user-request compactions like CLEANUP.
It is now possible to stop compaction on a particular set of tables. #9700
A replace operation in repair-based-node-operations mode will now ignore dead nodes, allowing the replacement of multiple dead nodes.
There is now a configuration flag to disable the new reversed reads code, in case an unexpected bug is found after release. PR#9908

Bug Fixes and Stability

A bug where incorrect results were returned from queries that use an index was fixed. The bug was triggered when large page sizes were used, and the primary key columns were also large, so that the page size multiplied by the key size exceeded a megabyte. #9198
ScyllaDB implements upgrades by having nodes negotiate “features” and only enabling those features when all nodes support them. The negotiated features are now persistent to disallow some illegal downgrades.
An exception safety problem leading to a crash when inserting data to memtables was fixed. #9728
A crash in ScyllaDB memory management code, triggered by index file caching, was fixed. The bug was caused by an allocation from within the memory allocator causing cache eviction in order to free memory. Freeing the evicted items re-enters the memory allocator, in a way that was not expected by the code. Fixes #9821 #9192 #9825 #9544 #9508 #9573
The INSERT JSON statement now correctly rejects empty string partition keys. #9853
Change Data Capture (CDC) preimage now works correctly for COMPACT STORAGE tables. #9876
Performance: Cleanup compaction / rewrite SSTable block behind regular compaction for each SSTable #10175. This fix replaces the earlier reverted fix for #10060, which was not complete.
Stability: a bug in LSA allocator might cause a coredump, for example after snapshot operations #10056
Build: compilation error on Fedora 34 with recent libstdc++ #10079
Docker: previous versions of Docker image run scylla as root. A regression introduced in 4.6 accidentally modified it to scylla user. #10261
Docker: Incompatible change of supervisor service name inside ScyllaDB container image #10269
Docker: incorrect locale value in docker build script #10310
Docker: Scylla Docker doesn’t have pidof, required by seastar-cpu-map.sh #10238
Fix a regression induce in 5.0rc0: range_tombstone_list: insert_from: incorrect range_tombstone applied to reverter in non-overlapping case #10326
CQL: Incorrect results returned with IN clause in a partition and index restriction and ALLOW FILTERING #10300
Stability: compare_atomic_cell_for_merge doesn’t compare TTL if expiry is the same. This is not consistent with the way the cell hash is computed, and may cause repair to keep trying to repair discrepancies caused by the TTL being different. #10156, #10173
Stability: A crash in Scylla memory management code, triggered by index file caching, was fixed. The bug was caused by an allocation from within the memory allocator causing cache eviction in order to free memory. Freeing the evicted items re-enters the memory allocator, in a way that was not expected by the code. Fixes #9573
Stability: JSON formatter issues may cause API calls fail in some cases, for example top partition on a CDC table #9061
Seastar: Intensive mixed (e.g. compaction + query) IO could sporadically break through the configured limitations causing the disk to be overloaded which would result in increased IO latencies #10233
Stability: Truncating a table during memtable flush might trigger an assert failure in ~flush_memory_accounter #10423
Stability: Compaction manager error: "bad_optional_access (bad optional access): retrying" when triggering maintenance compaction after disable autocompaction #10378
Stability: A wrong auth opcode, for example, a Driver bug, might crashes node #10487
Stability: Prepared statements cache: Batch statements don’t get invalidated when it appears that they should. #10129
Stability: OOM on memtable flush causes Scylla to abort #10027
CDC: dropping a column from the base table does not update system_schema.dropped_columns for the CDC log table, preventing queries from CDC log SSTables which contain this column #10473
Regression: batch with prepared stmts can cause (Java) driver to infinite loop re-preparing #10440
Stability: prepared_statements_cache / loading_cache load vs invalidate race can result in stale prepared statements #10117
Stability: database::drop_column_family(): has a window where new queries can be created for the dropped cf #10450
Stability: nodetool complains about malformed ipv6 addresses #10442
“Could not create SSTable component” exceptions while Scylla service is down #10218
Stability: system.config table is not usable for inquiring experimental_features setting #10047
Stability: managed_chunked_vector crashes in certain cases if reserve() is called twice without the reservation being consumed #10364
Stability: chunked_vector crashes in certain cases if reserve() is called twice without the reservation being consumed #10363
UX: Unreadable error response in a large partition test case #5610
Stability: Segmentation fault in sstables::partition_index_cache::get_or_load, after upgrade to 4.6 Crashes in SSTable reader which touch SSTables which have partition index pages with more than 1638 partition entries. Introduced in 4.6.0 #10290
Docker: Scylla container image stopped logging to stdout #10270
Stability: Seastar: some counters need to be reported as floating point numbers. https://github.com/scylladb/scylla-seastar/commit/6745a43c10d5df4ab90519fdc499964443246c72
Stability: Messaging: default tenants (user and system queries) are not isolated. As a result one node latency might increase a peer node latency #9505
Stability: Too many open files during terminate and replace node (scale test) #10462
Stability: Alternator – AWS Terraform API for Alternator fails to create table #10660
Setup: scylla_sysconfig_setup: does not able to handle ≥32 CPUs cpuset correctly #10523
Setup: scylla image setup fails with ValueError, cause by a warning return from perftune.py #10082
Storage: files created by Scylla ignore XFS project inheritance bit #10667
REST API major compaction request stuck for 30 minutes and don’t get response from scylla. #10485
Stability: Abort on bad_alloc during page loading #10617
Performance: write only workload which is throttled and uses only 20-30% CPU can’t be sustained #10704
Stability: coredump when calling describe_ring and a node is being shut down #10592

Refactoring

ScyllaDB used to have different internal representations of SELECT-clause expressions (“selectors”) and WHERE-clause expressions (“terms”). They are now unified into a single expression class, paving the way to a more regular and richer CQL grammar.
The source base has been migrated to Software Package Data Exchange (SPDX) to reduce license information boilerplate. More than 27,000 lines have been removed. PR#9937

See git log for a full (long) list of fixed issues.

30 Jun 2022