ScyllaDB Open Source 4.4

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 4.4, a production-ready release of our open source NoSQL database.

ScyllaDB is an open source NoSQL database with superior performance and consistently low latencies.

ScyllaDB 4.4 includes performance, stability improvements and bug fixes (below).

Find the ScyllaDB Open Source 4.4 repository for your Linux distribution here. ScyllaDB 4.4 is also available as Docker, EC2 AMI and GCP image.

Please note that only the last two minor releases of the ScyllaDB Open Source project are supported. Starting today, only ScyllaDB Open Source 4.4 and ScyllaDB 4.3 are supported, and ScyllaDB 4.2 is retired.

New Features

Timeout per Operation

There is now new syntax for setting timeouts for individual queries with “USING TIMEOUT”. #7777

This is particularly useful when one has queries that are known to take a long time. Till now, you could either increase the timeout value for the entire system (with request_timeout_in_ms), or keep it low and see many timeouts for the longer queries. The new Timeout per Operation allows you to define the timeout in a more granular way. Conversely, some queries might have tight latency requirements, in which case it makes sense to set their timeout to a small value. Such queries would get time out faster, which means that they won’t needlessly hold the server’s resources.

You can use the new TIMEOUT parameters for both queries (SELECT) and updates (INSERT, UPDATE, DELETE).

Examples:

SELECT * FROM t USING TIMEOUT 200ms;

INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;

Working with prepared statements works as usual — the timeout parameter can be explicitly defined or provided as a marker:

SELECT * FROM t USING TIMEOUT ?;

INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;

More Active Client Info

The system.clients table includes information on connected actively connected clients (drivers). ScyllaDB 4.4 add more fields to the table:

connection_stage
driver_name
driver_version
protocol_version

It also improves:

client_type – distinguishes CQL from Thrift just in case
username – now it displays the correct username if `PasswordAuthenticator` is configured.

#6946

Deployment and Packaging

ScyllaDB is now available for Ubuntu 20
We now include node-exporter in rpm/deb/tar packages. For deb/rpm, this is the optional scylla-node-exporter subpackage. This simplifies installation, especially for air-gapped systems (without an Internet connection) #2190
Downgrading on Debian derivatives is now easier: installing the scylla metapackage with a specific version will pull in subpackages with the same version. #5514

Documentation

The project developer oriented, in-tree documentation is now published to a new doc site using the ScyllaDB Sphinx Theme. With time we plan to move, and open source all of ScyllaDB documentation from the current doc site to the ScyllaDB project.

CDC

Change Data Capture (CDC) is production ready from ScyllaDB 4.3.

CDC API has changed from ScyllaDB 4.3 to ScyllaDB 4.4. If you are using CDC with ScyllaDB 4.3 please refer to ScyllaDB Docs for more info.

#8116

New reference example of a CDC consumer implementation see:

Note that only the Go consumer currently uses the end-of-record column.

Alternator

Now supports nested attribute paths in all expressions #8066

Additional Features

It is now possible to ALTER some properties of system tables, for example update the speculative_retry for the system_auth.roles table #7057
ScyllaDB will still prevent you from deleting columns that are needed for its operation, but will allow you to adjust other properties.
The removenode operation has been made safe. ScyllaDB will now collect and merge data from the other nodes to preserve consistency with quorum queries.

Tools and APIs

API: There is now a new API to force removal of a node from gossip. This can be used if information about a node lingers in gossip after it is gone. Use with care as it can lead to data loss. #2134

Performance Optimizations – I/O Scheduler 2.0

The Seastar I/O scheduler is used to maximize the requests throughput from all shards to the storage. Till now, the scheduler was running in a per shard scope: each shard runs its own scheduler, balanced between its I/O tasks, like reads, updates and compactions. This works well when the workload between shards is approximately balanced; but when, as often happened, one shard was more loaded, it could not take more I/O, even if other shards were not using their share.

I/O scheduler 2.0 included in ScyllaDB 4.4 fixes this. As storage bandwidth and IOPS are shared, each shard can use the whole disk if required.

More Performance Optimizations

Repair works by having the repairing node (“repair master”) merge data from the other nodes (“repair followers”) and write the local differences to each node. Until now, the repair master calculated the difference with each follower independently and so wrote an SStable corresponding to each follower. This creates additional work for compaction, and is now eliminated as the repair master writes a single SStable containing all the data. #7525
When aggregating (for example, SELECT count(*) FROM tab), ScyllaDB internally fetches pages until it is able to compute a result (in this case, it will need to read the entire table). Previously, each page fetch had its own timeout, starting from the moment the page fetch was initiated. As a result, queries continued to be processed long after the client gave up on them, wasting CPU and I/O bandwidth. This was changed to have a single timeout for all internal page fetches for an aggregation query, limiting the amount of time it can be alive. #1175
ScyllaDB maintains system tables tracking SSTables that have large cells, rows, or partitions for diagnostics purposes. When tracked SSTables were deleted, ScyllaDB deleted the records from the system tables. Unfortunately, this uses range tombstones which are not (yet) well supported in ScyllaDB. A series was merged to reduce the number of range tombstones to reduce impact on performance. #7668
Queries of Time Window Compaction Strategy tables now open sstables in different windows as the query needs them, instead of all at once. This greatly reduces read amplification, as noted in the commit. #6418
For some time now, ScyllaDB reshards (rearranges SSTables to contain data for one shard) on boot. This means we can stop considering multi-shard SSTables in compaction, as done here. This speeds up compaction a little. A similar change was done to cleanup compactions. #7748
During bootstrap, decommission, compaction, and reshape ScyllaDB will separate data belonging to different windows (in Time Window Compaction Strategy) into different SSTables (to preserve the compaction strategy invariant). However, it did not do so for memtable flush, requiring a reshape if the node was restarted. It will now flush a single memtable into multiple SSTables, if needed. #4617

The following updates eliminate latency spikes:

Large SSTables with many small partitions often require large bloom filters. The allocation for the bloom filters has been changed to allocate the memory in steps, reducing the chance of a reactor stall. #6974
The token_metadata structure describes how tokens are distributed across the cluster. Since each node has 256 tokens, this structure can grow quite large, and updating it can take time, causing reactor stalls and high latency. It is now updated by copying it in the background, performing the change, and swapping the new variant into the place of the old one atomically. #7220 #7313
A cleanup compaction removes data that the node no longer owns (after adding another node, for example) from SSTables and cache. Computing the token ranges that need to be removed from cache could take a long time and stall, causing latency spikes. This is now fixed. #7674
When deserializing values sent from the client, ScyllaDB will no longer allocate contiguous space for this work. This reduces allocator pressure and latency when dealing with megabyte-class values. #6138
Continuing improvement of large blob support, we now no longer require contiguous memory when serializing values for the native transport, in response to client queries. Similarly, validating client bind parameters also no longer requires contiguous memory. #6138
The memtable flush process was made more preemptible, reducing the probability of a reactor stall. #7885
Large allocation in mutation_partition_view::rows() #7918
Potential reactor stall on LCS compaction completion #7758

Configuration

Hinted handoff configuration can now be changed without restart. #5634

Monitoring and Log Collection

ScyllaDB Monitoring 3.6 adds support for log collection using Grafana Loki.
A new setup utility in ScyllaDB allows users to configure rsyslog to send logs to Loki or any other log collection server supporting the rsyslog format. #7589
New metrics report CQL errors #5859
There are now metrics for the different types of transport-level messages:
- STARTUP
- AUTH_RESPONSE
- OPTIONS
- QUERY
- PREPARE
- EXECUTE
- BATCH
- REGISTER
and for overload indicators:
- Reads_shed_due_to_overload – The number of reads shed because the admission queue reached its max capacity. When the queue is full, excessive reads are shed to avoid overload
- Writes_failed_due_to_too_many_in_flight_hints – number of CQL write requests which failed because the hinted handoff mechanism is overloaded and cannot store any more in-flight hints

#4888

For all metric update from ScyllaDB 4.3 to ScyllaDB 4.4 see here

Build

The compiler used to build ScyllaDB has been changed from gcc to clang. With clang we get a working implementation of coroutines, which are required for our Raft implementation. #7531

Debugging

ScyllaDB will now log a memory diagnostic report when it runs out of memory. Among other data the report includes free and used memory of the LSA, Cache, Memtable and system, as well as memory pools . See example report in the commit log #6365
A new tool, scylla-sstable-index, is now available to examine sstable -Index.db files. The tool is not yet packaged.
ScyllaDB uses a type called managed_bytes to store serialized data. The data is stored fragmented in cache and memtable, but contiguous outside it, with automatic conversion to contiguous storage on access. This automatic conversion made it difficult to detect when this conversion happens (so we can eliminate it), so a large patch series made all the conversions explicit and removed automatic conversion. #7490

Bugs Fixed in the Release

For a full list use git log

CQL: Invalid aggregation result on table with index: When using aggregates (COUNT(*)) on table with index on clustering key and filtering on clustering key, wrong results are returned #7355
CQL: Keyspaces have a durable_writes attribute that says whether to use the commitlog or not. Now, when changing it, the effects take place immediately. #3034
Stability: Cleanup compaction in KA/LA SSTables may crash the node in some cases #7553 (also part of ScyllaDB 4.2.2)
Stability: ‘ascii’ type column isn’t validated, and one could create illegal ascii values by using CQL without bind variables #5421
Alternator AttributesToGet breaks QueryFilter #6951 (also part of ScyllaDB 4.2.2)
Stability: From time to time, ScyllaDB needs to move SSTables from one directory to another. If ScyllaDB crashed at exactly the wrong moment, it could leave valid SSTables in both the old and the new places. #7429
Install: ScyllaDB does not start when kernel inotify limits are exceeded #7700 (also part of ScyllaDB 4.2.2)
Install: missing /etc/systemd/system/scylla-server.service.d/dependencies.conf on scylla-4.4 rpm #7703 (also part of ScyllaDB 4.2.2)
Stability: Query with multiple indexes, one date one boolean, fail with ServerError #7659
Thrift: The ability to write to non-thrift tables (tables which were created by CQL CREATE TABLE) from thrift has been removed, since it wasn’t supported correctly. Use CQL to write to CQL tables. #7568
Stability: SSTable reshape for Size-Tiered Compaction Strategy has been fixed. Reshape happens when the node starts up and finds its STTables do not conform to the strategy invariants, or on import. It then compacts the data so the invariants are preserved, guaranteeing reasonable read amplification. A bug caused reshape to be ignored for this compaction strategy. #7774
Stability: When joining a node to a cluster, ScyllaDB will now verify the correct snitch is used in the new node. #6832
Stability: Wild pointer dereferenced when try_flush_memtable_to_sstable on shutdown fails to create sstable in a dropped keyspace #7792
Stability: ScyllaDB records large data cells in the system.large_rows table. If such a large cell was part of a static rows, ScyllaDB would crash #6780
CQL: CQL prepared statements incomplete support for “unset” values #7740
Stability: Crash in row cache if a partition key is larger than 13k, with a managed_bytes::do_linearize() const: Assertion `lc._nesting' failed error #7897
CQL: min/max aggregate functions are broken for timeuuid #7729
CQL: paxos_grace_seconds table option: prevent applying a negative value #7906
LWT: Provide default for serial_consistency even if not specified in a request #7850
Redis: redis: parse error message is broken #7861 #7114
CQL: Restrictions missing support for “IN” on tables with collections, added in Cassandra 4.0 #7743 #4251
Stability: Possible schema discrepancy while updating secondary index computed column #7857
Stability: filesystem error: open failed: Too many open files and coredump during resharding with 5000 tables #7439
Stability: leveled_compaction_strategy: fix boundary of maximum SSTable level #7833
Packaging: seastar-cpu-map.sh missing from PATH #6731
Stability: ScyllaDB crashes on SELECT by indexed part of the pkey with WHERE like "key = X AND key = Y" #7772
Stability: a race condition can cause coredump in truncating a table after refresh #7732
Gossip ACK message in the wrong scheduling group cause latency spikes in some cases #7986
Stability: nodetool repair and nodetool removenode commands failed during repair process running #7965
Stability: [CDC] nodes aborting with coredump when writing large row #7994
Stability: Unexpected partition key when repairing with different number of shards may resulting in error similar to: “WARN 2020-11-04 19:06:41,168 [shard 1] repair - repair_writer: keyspace=ks, table=cf, multishard_writer failed: invalid_mutation_fragment_stream”
This error may occur when running repair between nodes with a different core number. #7552 Seastar #867
ScyllaDB on AWS: When psutil.disk_paritions() reports / is /dev/root, aws_instance mistakenly reports root partition is part of ephemeral disks, and RAID construction will fail. #8055
Alternator: unexpected ValidationException in ConditionExpression and FilterExpression #8043
Stability: stream_session::prepare is missing a string format argument #8067
ScyllaDB on GCE: scylla_io_setup did not configure pre-tuned GCE instances correctly #7341
install: dist/debian: node-exporter package does not install systemd unit #8054
Stability: Compaction Backlog controller will misbehave for time series use cases #6054
Monitoring: False 1 second latency reported in I/O queue delay metrics #8166
Stability: TWCS reshape was silently ignoring windows which contain at least min_threshold SSTables (can happen with data segregation). #8147
Packaging: dist/debian: node-exporter files mistakenly packaged in scylla-conf package #8163
scylla_setup failed to setup with error /etc/bashrc: line 99: TMOUT: readonly variable #8049
Performance: SSTables being compacted by Cleanup, Scrub, Upgrade can potentially be compacted by regular compaction in parallel #8155
Stability: Failing to start ScyllaDB after reboot of machine “Startup failed: seastar::rpc::timeout_error (rpc call timed out)” #8187
Stability: a regression introduced in 4.4 in cdc generation query for Alternator (Stream API) #8210
Stability: Split CDC streams table partitions into clustered rows #8116
Install: scylla_raid_setup returns ‘mdX is already using‘ even it’s unused #8219
Stability: a regression on cleanup compaction’s space requirement introduced in ScyllaDB 4.1, due to unlimited parallelism #8247
Stability: ScyllaDB crash when the IN marker is bound to null #8265
Stability: segfault due to corrupt _last_status_change map in cql_transport::cql_server::event_notifier::on_down during shutdown #8143

25 Mar 2021