ScyllaDB Open Source 4.5

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 4.5.0, a production-ready release of our open source NoSQL database.

ScyllaDB 4.5 includes Alternator support for Cross-Origin Resource Sharing (CORS), as well as many other performance and stability improvements and bug fixes (below). Find the ScyllaDB Open Source 4.5 repository for your Linux distribution here. ScyllaDB 4.5 Docker is also available.

Only the latest two minor releases of the ScyllaDB Open Source project are supported. From now on, only ScyllaDB Open Source 4.5 and 4.4 are supported. Users running ScyllaDB Open Source 4.3 and earlier are encouraged to upgrade to these two releases.

We dedicate this release to the memory of Alberto José Araújo, a coworker and a friend.

New Features in ScyllaDB 4.5

Ubuntu base EC2 AMI

Starting from this release, ScyllaDB Enterprise EC2 AMI (and soon GCP images) will be based on Ubuntu 20.04. You should now use the “scyllaadm” user to login to your instance.

Example:

  ssh -i your-key-pair.pem scyllaadm@ec2-public-ip

For backward compatibility, the old “centos” login will be supported for 4.5 releases.

Alternator

Alternator is the ScyllaDB DynamoDB-compatible API (learn more)

Support for Cross-Origin Resource Sharing (CORS). This allows client browsers to access the database directly via JavaScript, avoiding the middle tier. #8025
Support limiting the number of concurrent requests with a scylla.yaml configuration value max_concurrent_requests_per_shard. In case the limit is crossed, Alternator will return RequestLimitExceeded error type (compatible with DynamoDB API) #7294
Alternator now fully supports nested attribute paths. Nested attribute processing happens when an item’s attribute is itself an object, and an operation modifies just one of the object’s attributes instead of the entire object. #5024 #8043
Alternator now supports slow query logging capability. Queries that last longer than the specified threshold are logged in system_traces.node_slow_log and traced. #8292

Example trace:

cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;

parameters                                                                                  | duration

---------------------------------------------------------------------------------------------+----------

{'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} |    75732
[{
     'start_time': 'b7d42b37-a661-11eb-a391-3d2009e69e44',
     'node_ip': '127.0.0.1',
     'shard': '0',
     'command': '{"TableName": "Pets", "Key": {"p": {"S": "dog"}}}',
     'date': '2021-04-26T07:33:38.903000',
     'duration': '94',
     'parameters': '{alternator_op : DeleteItem}, {query : {"TableName": "Pets", "Key": {"p": {"S": "dog"}}}}',
     'session_id': 'b7b47e70-a661-11eb-a391-3d2009e69e44',
     'source_ip': '::',
     'table_names': 'alternator_Pets.Pets',
     'username': '<unauthenticated request>'
}, {
     'start_time': 'b7b44416-a661-11eb-a391-3d2009e69e44',
     'node_ip': '127.0.0.1',
     'shard': '0',
     'command': '{"TableName": "Pets", "Item": {"p": {"S": "dog"}}}',
     'date': '2021-04-26T07:33:38.901000',
     'duration': '130',
     'parameters': '{alternator_op : PutItem}, {query : {"TableName": "Pets", "Item": {"p": {"S": "dog"}}}}',
     'session_id': 'b7b43050-a661-11eb-a391-3d2009e69e44',
     'source_ip': '::',
     'table_names': 'alternator_Pets.Pets',
     'username': '<unauthenticated request>'
}]

Alternator was changed to avoid large contiguous allocations for large requests. Instead, the allocation will be broken up into smaller chunks. This reduces stress on the allocator, and therefore latency. #7213
sstableloader now work with Alternator tables #8229
Support attribute paths in ConditionExpression, FilterExpression
Support attribute paths in ProjectionExpression
New metrics:
- requests_shed metrics

CDC

The Change Data Capture (CDC) facility used a collection to store information about CDC log streams. Since large clusters can have many streams, this violation of ScyllaDB guidelines caused many latency problems. Two steps were taken to correct it: the number of streams were limited (with some loss in efficiency on large clusters), and a new format was introduced (with automatic transition code) that uses partitions and clustering rows instead of collections. #7993

For reference example of a CDC consumer implementation see:

For CDC and Kafka integration see

Raft

We are building up an internal service in ScyllaDB, useful for this and other applications. The changes have no visible effect yet. Among other, the following was added:

ScyllaDB stores the database schema in a set of tables. Previously, these tables were sharded across all cores like ordinary user tables. They are now maintained by shard 0 alone. This is a step towards letting Raft manage them, since Raft needs to atomically modify the schema tables, and this can’t be done if the data is distributed on many cores. #7947
Raft can now store its log data in a system table. Raft is implemented in a modular fashion with plug-ins implementing various parts; this is a persistence module.
Raft Joint Consensus has been merged. This is the ability to change a raft group from one set of nodes to another, needed to change cluster topology or to migrate data to different nodes.
Raft now integrates with the ScyllaDB RPC subsystem; Raft itself is modular and requires integration with the various ScyllaDB service providers.
The Raft implementation gained support for non-voting nodes. This is used to make membership changes less disruptive.
The Raft implementation now has a per-server timer, used for Raft keepalives
The Raft implementation gained support for leader step down. This improves availability when a node is taken down for planned maintenance

Deployment and Packaging

The setup utility now uses chrony instead of ntp for timekeeping on all Linux distributions. This makes the setup more regular. #7922
Dynamic setting of aio-max-nr based on the number of cpus, mostly needed for large machines like EC2 i3en.24xlarge #8133

Additional Features

Lightweight (fast) slow-queries logging mode. New, low overhead tracing facility for slow queries. When enabled, it will work in the same way slow query tracing does besides that it will omit recording all the tracing events. So that it will not populate data to the system_traces.events table but it will populate trace session records for slow queries to all the rest: system_traces.sessions, system_traces.node_slow_log, etc. #2572

More here

Tools and APIs

It is now possible to perform a partial repair when a node is missing, by using the new ignore_nodes option. Repair will also detect when a repair range has no live nodes to repair with and short-circuit the operation #7806 #8256
The Thrift API disable by default. As it is less often used, users might not be aware Thrift is open and might be a security risk. #8336. To enable it, add “start_rpc: true” to scylla.yaml. In addition, Thrift now have
- partial admission control
- support for max_concurrent_requests_per_shard
- counters for in-flight requests and blocked requests
Nodetool Top Partitions extension. nodetool toppartitions allow you to find the partitions with the highest read and write access in the last time window. Till now, nodetool toppartitions only supported one table at a time. From ScyllaDB 4.5, nodetool toppartitions allows specifying a list of tables, or keyspaces. #4520
nodetool stop now supports more compaction types: Supported types are: COMPACTION, CLEANUP, SCRUB, UPGRADE. For example: nodetool stop SCRUB.Note that reshard and reshape start automatically on boot or refresh, if needed. Compaction, Cleanup, Scrub, and Upgrade are started with nodetool command. The others: RESHAPE, RESHARD, VALIDATION, INDEX_BUILD are unsupported by nodetool stop.
scylla_setup option to retry the RAID setup #8174
New system/drop_sstable_caches RESTful API. Evicts objects from caches that reflect sstable content, like the row cache. In the future, it will also drop the page cache and sstable index caches. While exiting BYPASS CACHE affects the behavior of a given CQL query on per-query basis, this API clears the cache at the time of invocation, later queries will populate it.
REST API: add the compaction id to the response of GET compaction_manager/compactions

Performance Optimizations

Improve flat_mutation_reader::consume_pausable #8359. Combined reader microbenchmark has shown from 2% to 22% improvement in median execution time while memtable microbenchmark has shown from 3.6% to 7.8% improvement in median execution time.
Significant write amplification when reshaping level 0 in a LCS table #8345
The Log-Structured Allocator (LSA) is the underlying memory allocator behind ScyllaDB’s cache and memtables. When memory runs out, it is called to evict objects from cache, and to defragment free memory, in order to serve new allocation requests. If memory was especially fragmented, or if the allocation request was large, this could take a long while, causing a latency spike. To combat this, a new background reclaim service is added which evicts and defragments memory ahead of time and maintains a watermark of free, non-fragmented memory from which allocations can be satisfied quickly. This is somewhat similar to kswapd on Linux. #1634
To store cells in rows, ScyllaDB used a combination of a vector (for short rows) and red-black tree (for wide rows), switching between the representations dynamically. The red-black is inefficient in memory footprint when many cells are present, so the data storage now uses a radix tree exclusively. This both reduces the memory footprint and also improves efficiency.
SSTables: Share partition index pages between readers. Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. For IO-bound, we needed 2 I/O per read before, and 1 (amortized) now. The throughput is ~70% higher. More here.
Switch partition rows onto B-tree. The data type for storing rows inside a partition was changed from a red-black tree to a B-tree. This saves space and spares some cpu cycles. More here.
The sstable reader will now allow preemption at row granularity; previously, sstables containing many small rows could cause small latency spikes as the reader would only preempt when an 8k buffer was filled. #7883

Repair-Based Node Operations (experimental)

Repair-Based Node Operations (RBNO) was introduced as an experimental feature in ScyllaDB 4.0, intending to use the same underlying implementation for repair and node-operations such as bootstrap, decommission, removenode, and replace. While still considered experimental, we continue to work on this feature.

Repair is oriented towards moving small amounts of data, not an entire node’s worth. This resulted in many sstables being created in the node, creating a large compaction load. To fix that, offstrategy compaction is now used to compact these sstables without impacting the main workload efficiently. #5226

To enable repair-bases node operations, add the following to scylla.yaml:

enable_repair_based_node_ops: true

Configuration

Ignore enable_sstables_mc_format: User can no longer disable MC format for older SSTable formats.

Other bugs fixed in this release

Stability: Optimized TWCS single-partition reader opens sstables unnecessarily #8432
Stability: TimeWindowCompactionStrategy not using specialized reader for single partition queries #8415
Stability: ScyllaDB will exit when accessed with a LOCAL_QUORUM to a DC with zero replication (one can define different numbers of replication per DC). #8354
Tools: sstableloader: partition with old deletion and new data handled incorrectly #8390
Stability: Commitlog pre-fill inner loop condition broken #8369
aws: aws_instance.ebs_disks() causes traceback when no EBS disks #8365
Thrift: handle gate closed exception on retry #8337
Stability: missing dead row marker for KA/LA file format #8324. Note that the KA/LA SSTable formats are legacy formats that are not used in latest ScyllaDB versions.
inactive readers unification caused lsa OOM in toppartitions_test #8258
Thrift: too many accept attempts end up in segmentation fault #8317
Stability: Failed SELECT with tuple of reversed-ordered frozen collections #7902
Stability: Certain combination of filtering, index, and frozen collection, causes “marshalling error” failure #7888
build : tools/toolchain: install-dependencies.sh causes error during build Docker image, and ignoring it #8293
Stability: Use-after-free in simple_repair_test #8274
Monitoring: storage_proxy counters are not updated on cql counter operations #4337
Security: Enforce dc/rack membership iff required for non-tls connections #8051
Stability: ScyllaDB tries to keep enough free memory ahead of allocation, so that allocations don’t stall. The amount of CPU power devoted to background reclaim is supposed to self-tune with memory demand, but this wasn’t working correctly. #8234
Nodetool cleanup failed because of “DC or rack not found in snitch properties” #7930
Stability: a possible race condition in MV/SI schema creation and load may cause inconsistency between base table and view table #7709
Thrift: Regression in thrift_tests.test_get_range_slice dtest: query_data_on_all_shards(): reverse range scans are not supported #8211
Stability: mutation_test: fatal error: in “test_apply_monotonically_is_monotonic“: Mutations differ #8154
Stability: Node was overloaded: Too many in flight hints during Enospc nemesis #8137
Stability: Make untyped_result_set non-copying and retain fragments #8014
Stability: Requests are not entirely read during shedding, which leads to invalidating the connection once shedding happens. Shedding is the process of dropping requests to protect the system, for example, if they are too large or exceeding the max number of concurrent requests per shard. #8193
Stability: Versioned sstable_set #2622
UX: Improve the verbosity of errors coming from the view builder/updater #8177
Tools: Incorrect output in nodetool compactionstats #7927
Stability: cache-bypassing single-partition query from TWCS table not showing a row (but it appears in range scans). Introduce after ScyllaDB 4.4 #8138
CQL: unpaged query is terminated silently if it reaches global limit first. The bug was introduced in ScyllaDB 4.3 #8162
Stability: The multishard combining reader is responsible for merging data from multiple cores when a range scan runs. A bug that is triggered by very small token ranges (e.g. 1 token) caused shards that have no data to contribute to be queried, increasing read amplification. #8161
Stability: Repairing a table with TWCS potentially cause high number of parallel compaction #8124
Stability: Run init_server and join_cluster inside maintenance scheduling group #8130
Install: scylla_create_devices fails on EC2 with subprocess.CalledProcessError: Command /opt/scylladb/scripts/scylla_raid_setup... returned non-zero exit status 1 #8055
Stability: CDC: log: use-after-free in process_bytes_visitor #8117
Stability: Repair task from manager failed due to coredumpt on one of the node #8059
CQL: NetworkTopologyStrategy data center options are not validated #7595
Stability: no local limit for non-limited queries in mixed cluster may cause repair to fail #8022
Debug: Make scylla backtraces always print in one line #5464
Init: perftune.py fails with TypeError: 'NoneType' object is not iterable #8008
Stability: using experimental UDF can lead to exit #7977
Stability: Make commitlog accept N mutations in bulk #7615
Stability: transport: Fix abort on certain configurations of native_transport_port(_ssl) #7866 #7783
Debug: add sstable origin information to scylla metadata component #7880
Install: dist/offline_installer/redhat: causes “scylla does not work with current umask setting (0077)” #6243
Alternator: nodetool cannot work on table with a dot in its name #6521
Stability: During replace node operation – replacing node is used to respond to read queries #7312
Install: ScyllaDB doesn’t use /etc/security/limits.d/scylla.conf #7925
Stability: multishard_combining_reader uses smp::count in one place instead of _sharder.shard_count() #7945
Stability: Failed fromJson() should result in FunctionFailure error, not an internal error #7911
Stability: List append uses the wrong timestamp with LWT #7611
Stability: currentTimeUUID creates duplicates when called at the same point in time #6208
Build: dbuild fails with an error on older kernels (without cgroupsv2) #7938
Stability: Error: “seastar - Exceptional future ignored: sstables::compaction_stop_exception” after node drain #7904
UX: ScyllaDB reports broken pipe and connection reset by peer errors from the native transport, although it can happen in normal operation. #7907
Redis: Redis ‘exists’ command fails with lots of keys #7273
UX: Make scylla backtraces always print in online #5464
Stability: A mistake in Time Window Compaction Strategy logic could cause windows that had a very large number of SSTables not to be compacted at all, increasing read amplification. #8147
Stability: missing dead row marker for KA/LA file format #8324. Note that the KA/LA SSTable formats are legacy formats that are not used in latest ScyllaDB versions.
Commitlog: allocation pattern which leaves large parts of segments “wasted”, typically because the segment has empty space, but cannot hold the mutation being added, can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall. #8270 This is a followup to PR #7879: make commitlog disk limit a hard limit.
Commitlog: ScyllaDB hangs on shutdown, if failed to allocate a new segment. #8577
Stability: Cassandra stress fails to achieve consistency during replacing node operation #8013 (followup to #7132 from 4.5rc1)
Stability: use-after-move when handling view update failures #8830
Alternator: incorrect set equality comparison inside a nested document #8514
Alternator: incorrect inequality check of two sets #8513
Alternator: ConditionExpression wrong comparison of two non-existent attributes #8511
Install: install.sh set aio conf during installation #8650
Install: scylla_io_setup failed with error: seastar - Could not setup Async I/O on aws instances (r5, r5b) and gp3 ebs volumes #8587
Alternator: Alternator’s health-check request doesn’t work properly with HTTPS #8691
Install: scylla_raid_setup may fails when mounting a volume with: “can't find UUID” error #8279
install: Unified Installer: Incorrect file security context cause scylla_setup to fail #8589
install: nonroot installation broken #8663
Stability: Exceptions in resharding and reshaping are being incorrectly swallowed #8657
Stability: TWCS: in some cases, SSTables are not compacted together when time window finishes #8569
Stability: materialized views: nodes may pull old schemas from other nodes #8554
Commitlog: handle commitlog recycle errors #8376
Commitlog: Commitlog can get stuck after reaching disk size limit, causing writes to time out #8363
Stability: a disks with tiny request rate may cause ScyllaDB to get stuck while upgrading from 4.3 to 4.4 #8378
Stability : Optimized TWCS single-partition reader opens SSTables unnecessarily #8411 #8435
Stability: `time_series_sstable_set::create_single_key_sstable_reader` may return an empty reader even if the queried partition exists (in some other SSTable) #8447
Stability: `clustering_order_reader_merger` may immediately return end-of-stream if some (but not necessarily all) underlying readers are empty #8445
CQL: Mismatched types for base and view columns id: timeuuid and timeuuid, generating ”Unable to complete the operation against any hosts” error #8666
Trace: Tracing can remain not shut down if start is aborted #8382
Tools: sstableloader doesn’t work with Alternator tables if “-nx” option is used #8230
When scylla-server was stopped manually, a few minutes later, the scylla-fstrim service starts it again #8921
Stability: Segfault in commit log shutdown, introduced in 4.5 #8952
Install: dist/redhat: scylla-node-exporter causes error while executing scriptlet on install #8966
Uninstall: removing /etc/systemd/system/*.mount on package uninstall can delete settings (like coredump setting) during upgrade #8810
Stability: Some of appending_hash<> instantiations are throwing operator() #8983
Stability: Off-strategy compaction with LCS keeps reshaping the last remaining SSTable #8573
Stability: Reshape may ignore overlapping in level L where L > 0 #8531
A new config “commitlog_use_hard_size_limit” sets whether or not to use a hard size limit for commitlog disk usage. Default is false. Enabling this can cause latency spikes, whereas the default can lead to occasional disk usage peaks as seen in #9053
Upscale (adding cores): On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor does not exist even if it supported CPU scaling. Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is used. #9191
Stability: load-and-stream fails: Assertion `!sst->is_shared()‘ failed and aborting on shard. #9173
Stability: excessive compaction of a fully expired TWCS table when running repair #8710
API uses incorrect plus<int> to sum up cf.active_memtable().partition_count(), which can result with
CQL: Creating a table that looks like a secondary index breaks the secondary index creation mechanism #8620. This fix accidentally broke CREATE INDEX IF NOT EXISTS #8717
Stability: repair does not consider memory bloat which may cause repair to use more memory and cause std::bad_alloc. #8641
hints: using gossiper info for node state may lead to race condition when a node is drained #5087
A bug in BTree, introduced in 4.5 to index partition rows, may cause ScyllaDB to abort #9248
A bug in load_and_stream, introduced in 4.5 to distribute load SSTable to other nodes, may cause ScyllaDB to abort #9278
Stability: Accidental cache line invalidation in compact-radix-tree kills performance on large machines #9252
New explicit experimental flag for Raft: --experimental-options=raft
Install: perftune.py fails on bond NICs #9225
setup: scylla_cpuscaling_setup: On Ubuntu, scaling_governor becomes powersave after reboot #9324
RPM packaging: dependency issues related to python3 relocatable rpm #8829
Stability: evictable_reader: _drop_static_row can drop of the static row from an unintended partition #8923
Stability: evictable_reader: self validation triggers when a partition disappears after eviction #8893

13 Oct 2021