ScyllaDB Open Source 4.5

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 4.5.0, a production-ready release of our open source NoSQL database.

ScyllaDB 4.5 includes Alternator support for Cross-Origin Resource Sharing (CORS), as well as many other performance and stability improvements and bug fixes (below). Find the ScyllaDB Open Source 4.5 repository for your Linux distribution here. ScyllaDB 4.5 Docker is also available.

Only the latest two minor releases of the ScyllaDB Open Source project are supported. From now on, only ScyllaDB Open Source 4.5 and 4.4 are supported. Users running ScyllaDB Open Source 4.3 and earlier are encouraged to upgrade to these two releases.

We dedicate this release to the memory of Alberto José Araújo, a coworker and a friend.

Related Links

New Features in ScyllaDB 4.5

Ubuntu base EC2 AMI

Starting from this release, ScyllaDB Enterprise EC2 AMI (and soon GCP images) will be based on Ubuntu 20.04. You should now use the “scyllaadm” user to login to your instance.

Example:

  ssh -i your-key-pair.pem scyllaadm@ec2-public-ip

For backward compatibility, the old “centos” login will be supported for 4.5 releases.

Alternator

Alternator is the ScyllaDB DynamoDB-compatible API (learn more)

  • Support for Cross-Origin Resource Sharing (CORS). This allows client browsers to access the database directly via JavaScript, avoiding the middle tier. #8025
  • Support limiting the number of concurrent requests with a scylla.yaml configuration value max_concurrent_requests_per_shard. In case the limit is crossed, Alternator will return RequestLimitExceeded error type (compatible with DynamoDB API) #7294
  • Alternator now fully supports nested attribute paths. Nested attribute processing happens when an item’s attribute is itself an object, and an operation modifies just one of the object’s attributes instead of the entire object. #5024 #8043
  • Alternator now supports slow query logging capability. Queries that last longer than the specified threshold are logged in system_traces.node_slow_log and traced. #8292

Example trace:

cqlsh> select parameters, duration from system_traces.node_slow_log where start_time=b7a44589-8711-11eb-8053-14c6c5faf955;

parameters                                                                                  | duration

---------------------------------------------------------------------------------------------+----------

{'alternator_op': 'DeleteTable', 'query': '{"TableName": "alternator_Test_1615979572905"}'} |    75732
[{
     'start_time': 'b7d42b37-a661-11eb-a391-3d2009e69e44',
     'node_ip': '127.0.0.1',
     'shard': '0',
     'command': '{"TableName": "Pets", "Key": {"p": {"S": "dog"}}}',
     'date': '2021-04-26T07:33:38.903000',
     'duration': '94',
     'parameters': '{alternator_op : DeleteItem}, {query : {"TableName": "Pets", "Key": {"p": {"S": "dog"}}}}',
     'session_id': 'b7b47e70-a661-11eb-a391-3d2009e69e44',
     'source_ip': '::',
     'table_names': 'alternator_Pets.Pets',
     'username': '<unauthenticated request>'
}, {
     'start_time': 'b7b44416-a661-11eb-a391-3d2009e69e44',
     'node_ip': '127.0.0.1',
     'shard': '0',
     'command': '{"TableName": "Pets", "Item": {"p": {"S": "dog"}}}',
     'date': '2021-04-26T07:33:38.901000',
     'duration': '130',
     'parameters': '{alternator_op : PutItem}, {query : {"TableName": "Pets", "Item": {"p": {"S": "dog"}}}}',
     'session_id': 'b7b43050-a661-11eb-a391-3d2009e69e44',
     'source_ip': '::',
     'table_names': 'alternator_Pets.Pets',
     'username': '<unauthenticated request>'
}]
  • Alternator was changed to avoid large contiguous allocations for large requests. Instead, the allocation will be broken up into smaller chunks. This reduces stress on the allocator, and therefore latency. #7213
  • sstableloader now work with Alternator tables #8229
  • Support attribute paths in ConditionExpression, FilterExpression
  • Support attribute paths in ProjectionExpression
  • New metrics:
    • requests_shed metrics

CDC

The Change Data Capture (CDC) facility used a collection to store information about CDC log streams. Since large clusters can have many streams, this violation of ScyllaDB guidelines caused many latency problems. Two steps were taken to correct it: the number of streams were limited (with some loss in efficiency on large clusters), and a new format was introduced (with automatic transition code) that uses partitions and clustering rows instead of collections. #7993

For reference example of a CDC consumer implementation see:

For CDC and Kafka integration see

Raft

We are building up an internal service in ScyllaDB, useful for this and other applications. The changes have no visible effect yet. Among other, the following was added:

  • ScyllaDB stores the database schema in a set of tables. Previously, these tables were sharded across all cores like ordinary user tables. They are now maintained by shard 0 alone. This is a step towards letting Raft manage them, since Raft needs to atomically modify the schema tables, and this can’t be done if the data is distributed on many cores. #7947
  • Raft can now store its log data in a system table. Raft is implemented in a modular fashion with plug-ins implementing various parts; this is a persistence module.
  • Raft Joint Consensus has been merged. This is the ability to change a raft group from one set of nodes to another, needed to change cluster topology or to migrate data to different nodes.
  • Raft now integrates with the ScyllaDB RPC subsystem; Raft itself is modular and requires integration with the various ScyllaDB service providers.
  • The Raft implementation gained support for non-voting nodes. This is used to make membership changes less disruptive.
  • The Raft implementation now has a per-server timer, used for Raft keepalives
  • The Raft implementation gained support for leader step down. This improves availability when a node is taken down for planned maintenance

Deployment and Packaging

The setup utility now uses chrony instead of ntp for timekeeping on all Linux distributions. This makes the setup more regular. #7922
Dynamic setting of aio-max-nr based on the number of cpus, mostly needed for large machines like EC2 i3en.24xlarge #8133

Additional Features

  • Lightweight (fast) slow-queries logging mode. New, low overhead tracing facility for slow queries. When enabled, it will work in the same way slow query tracing does besides that it will omit recording all the tracing events. So that it will not populate data to the system_traces.events table but it will populate trace session records for slow queries to all the rest: system_traces.sessions, system_traces.node_slow_log, etc. #2572

More here

Tools and APIs

  • It is now possible to perform a partial repair when a node is missing, by using the new ignore_nodes option. Repair will also detect when a repair range has no live nodes to repair with and short-circuit the operation #7806 #8256
  • The Thrift API disable by default. As it is less often used, users might not be aware Thrift is open and might be a security risk. #8336. To enable it, add “start_rpc: true” to scylla.yaml. In addition, Thrift now have
  • Nodetool Top Partitions extension. nodetool toppartitions allow you to find the partitions with the highest read and write access in the last time window. Till now, nodetool toppartitions only supported one table at a time. From ScyllaDB 4.5, nodetool toppartitions allows specifying a list of tables, or keyspaces. #4520
  • nodetool stop now supports more compaction types: Supported types are: COMPACTION, CLEANUP, SCRUB, UPGRADE. For example: nodetool stop SCRUB.Note that reshard and reshape start automatically on boot or refresh, if needed. Compaction, Cleanup, Scrub, and Upgrade are started with nodetool command. The others: RESHAPE, RESHARD, VALIDATION, INDEX_BUILD are unsupported by nodetool stop.
  • scylla_setup option to retry the RAID setup #8174
  • New system/drop_sstable_caches RESTful API. Evicts objects from caches that reflect sstable content, like the row cache. In the future, it will also drop the page cache and sstable index caches. While exiting BYPASS CACHE affects the behavior of a given CQL query on per-query basis, this API clears the cache at the time of invocation, later queries will populate it.
  • REST API: add the compaction id to the response of GET compaction_manager/compactions

Performance Optimizations

  • Improve flat_mutation_reader::consume_pausable #8359. Combined reader microbenchmark has shown from 2% to 22% improvement in median execution time while memtable microbenchmark has shown from 3.6% to 7.8% improvement in median execution time.
  • Significant write amplification when reshaping level 0 in a LCS table #8345
  • The Log-Structured Allocator (LSA) is the underlying memory allocator behind ScyllaDB’s cache and memtables. When memory runs out, it is called to evict objects from cache, and to defragment free memory, in order to serve new allocation requests. If memory was especially fragmented, or if the allocation request was large, this could take a long while, causing a latency spike. To combat this, a new background reclaim service is added which evicts and defragments memory ahead of time and maintains a watermark of free, non-fragmented memory from which allocations can be satisfied quickly. This is somewhat similar to kswapd on Linux. #1634
  • To store cells in rows, ScyllaDB used a combination of a vector (for short rows) and red-black tree (for wide rows), switching between the representations dynamically. The red-black is inefficient in memory footprint when many cells are present, so the data storage now uses a radix tree exclusively. This both reduces the memory footprint and also improves efficiency.
  • SSTables: Share partition index pages between readers. Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. For IO-bound, we needed 2 I/O per read before, and 1 (amortized) now. The throughput is ~70% higher. More here.
  • Switch partition rows onto B-tree. The data type for storing rows inside a partition was changed from a red-black tree to a B-tree. This saves space and spares some cpu cycles. More here.
  • The sstable reader will now allow preemption at row granularity; previously, sstables containing many small rows could cause small latency spikes as the reader would only preempt when an 8k buffer was filled. #7883

Repair-Based Node Operations (experimental)

Repair-Based Node Operations (RBNO) was introduced as an experimental feature in ScyllaDB 4.0, intending to use the same underlying implementation for repair and node-operations such as bootstrap, decommission, removenode, and replace. While still considered experimental, we continue to work on this feature.

Repair is oriented towards moving small amounts of data, not an entire node’s worth. This resulted in many sstables being created in the node, creating a large compaction load. To fix that, offstrategy compaction is now used to compact these sstables without impacting the main workload efficiently. #5226

To enable repair-bases node operations, add the following to scylla.yaml:

enable_repair_based_node_ops: true

Configuration

Other bugs fixed in this release

  • Stability: Optimized TWCS single-partition reader opens sstables unnecessarily #8432
  • Stability: TimeWindowCompactionStrategy not using specialized reader for single partition queries #8415
  • Stability: ScyllaDB will exit when accessed with a LOCAL_QUORUM  to a DC with zero replication (one can define different numbers of replication per DC). #8354
  • Tools: sstableloader: partition with old deletion and new data handled incorrectly #8390
  • Stability: Commitlog pre-fill inner loop condition broken #8369
  • aws: aws_instance.ebs_disks() causes traceback when no EBS disks #8365
  • Thrift: handle gate closed exception on retry #8337
  • Stability: missing dead row marker for KA/LA file format #8324. Note that the KA/LA SSTable formats are legacy formats that are not used in latest ScyllaDB versions.
  • inactive readers unification caused lsa OOM in toppartitions_test #8258
  • Thrift: too many accept attempts end up in segmentation fault #8317
  • Stability: Failed SELECT with tuple of reversed-ordered frozen collections #7902
  • Stability: Certain combination of filtering, index, and frozen collection, causes “marshalling error” failure #7888
    build : tools/toolchain: install-dependencies.sh causes error during build Docker image, and ignoring it #8293
  • Stability: Use-after-free in simple_repair_test #8274
  • Monitoring: storage_proxy counters are not updated on cql counter operations #4337
  • Security: Enforce dc/rack membership iff required for non-tls connections #8051
  • Stability: ScyllaDB tries to keep enough free memory ahead of allocation, so that allocations don’t stall. The amount of CPU power devoted to background reclaim is supposed to self-tune with memory demand, but this wasn’t working correctly. #8234
  • Nodetool cleanup failed because of “DC or rack not found in snitch properties#7930
  • Stability: a possible race condition in MV/SI schema creation and load may cause inconsistency between base table and view table #7709
  • Thrift: Regression in thrift_tests.test_get_range_slice dtest: query_data_on_all_shards(): reverse range scans are not supported #8211
  • Stability: mutation_test: fatal error: in “test_apply_monotonically_is_monotonic“: Mutations differ #8154
  • Stability: Node was overloaded: Too many in flight hints during Enospc nemesis #8137
  • Stability: Make untyped_result_set non-copying and retain fragments #8014
  • Stability: Requests are not entirely read during shedding, which leads to invalidating the connection once shedding happens. Shedding is the process of dropping requests to protect the system, for example, if they are too large or exceeding the max number of concurrent requests per shard. #8193
  • Stability: Versioned sstable_set #2622
  • UX: Improve the verbosity of errors coming from the view builder/updater #8177
  • Tools: Incorrect output in nodetool compactionstats #7927
  • Stability: cache-bypassing single-partition query from TWCS table not showing a row (but it appears in range scans). Introduce after ScyllaDB 4.4 #8138
    CQL: unpaged query is terminated silently if it reaches global limit first. The bug was introduced in ScyllaDB 4.3 #8162
  • Stability: The multishard combining reader is responsible for merging data from multiple cores when a range scan runs. A bug that is triggered by very small token ranges (e.g. 1 token) caused shards that have no data to contribute to be queried, increasing read amplification. #8161
  • Stability: Repairing a table with TWCS potentially cause high number of parallel compaction #8124
  • Stability: Run init_server and join_cluster inside maintenance scheduling group #8130
  • Install: scylla_create_devices fails on EC2  with subprocess.CalledProcessError: Command /opt/scylladb/scripts/scylla_raid_setup... returned non-zero exit status 1 #8055
  • Stability: CDC: log: use-after-free in process_bytes_visitor #8117
  • Stability: Repair task from manager failed due to coredumpt on one of the node #8059
  • CQL: NetworkTopologyStrategy data center options are not validated #7595
  • Stability: no local limit for non-limited queries in mixed cluster may cause repair to fail #8022
  • Debug: Make scylla backtraces always print in one line #5464
  • Init: perftune.py fails with TypeError: 'NoneType' object is not iterable #8008
  • Stability: using experimental UDF can lead to exit #7977
  • Stability: Make commitlog accept N mutations in bulk #7615
  • Stability: transport: Fix abort on certain configurations of native_transport_port(_ssl) #7866 #7783
  • Debug: add sstable origin information to scylla metadata component #7880
    Install: dist/offline_installer/redhat: causes “scylla does not work with current umask setting (0077)#6243
  • Alternator: nodetool cannot work on table with a dot in its name #6521
  • Stability: During replace node operation – replacing node is used to respond to read queries #7312
  • Install: ScyllaDB doesn’t use /etc/security/limits.d/scylla.conf #7925
    Stability: multishard_combining_reader uses smp::count in one place instead of _sharder.shard_count() #7945
  • Stability: Failed fromJson() should result in FunctionFailure error, not an internal error #7911
  • Stability: List append uses the wrong timestamp with LWT #7611
  • Stability: currentTimeUUID creates duplicates when called at the same point in time #6208
  • Build: dbuild fails with an error on older kernels (without cgroupsv2) #7938
  • Stability: Error: “seastar - Exceptional future ignored: sstables::compaction_stop_exception” after node drain #7904
  • UX: ScyllaDB reports broken pipe and connection reset by peer errors from the native transport, although it can happen in normal operation. #7907
  • Redis: Redis ‘exists’ command fails with lots of keys #7273
  • UX: Make scylla backtraces always print in online #5464
  • Stability: A mistake in Time Window Compaction Strategy logic could cause windows that had a very large number of SSTables not to be compacted at all, increasing read amplification. #8147
  • Stability: missing dead row marker for KA/LA file format #8324. Note that the KA/LA SSTable formats are legacy formats that are not used in latest ScyllaDB versions.
  • Commitlog: allocation pattern which leaves large parts of segments “wasted”, typically because the segment has empty space, but cannot hold the mutation being added, can have a disk usage that is below threshold, yet still get a disk footprint that is over limit causing new segment allocation to stall. #8270 This is a followup to PR #7879: make commitlog disk limit a hard limit.
  • Commitlog: ScyllaDB hangs on shutdown, if failed to allocate a new segment. #8577
  • Stability: Cassandra stress fails to achieve consistency during replacing node operation #8013 (followup to #7132 from 4.5rc1)
  • Stability: use-after-move when handling view update failures #8830
  • Alternator: incorrect set equality comparison inside a nested document #8514
  • Alternator: incorrect inequality check of two sets #8513
  • Alternator: ConditionExpression wrong comparison of two non-existent attributes #8511
  • Install: install.sh set aio conf during installation #8650
  • Install: scylla_io_setup failed with error: seastar - Could not setup Async I/O on aws instances (r5, r5b) and gp3 ebs volumes #8587
  • Alternator: Alternator’s health-check request doesn’t work properly with HTTPS #8691
  • Install: scylla_raid_setup may fails when mounting a volume with: “can't find UUID” error #8279
  • install: Unified Installer: Incorrect file security context cause scylla_setup to fail #8589
  • install: nonroot installation broken #8663
  • Stability: Exceptions in resharding and reshaping are being incorrectly swallowed #8657
  • Stability: TWCS: in some cases, SSTables are not compacted together when time window finishes #8569
  • Stability: materialized views: nodes may pull old schemas from other nodes #8554
  • Commitlog: handle commitlog recycle errors #8376
  • Commitlog: Commitlog can get stuck after reaching disk size limit, causing writes to time out #8363
  • Stability: a disks with tiny request rate may cause ScyllaDB to get stuck while upgrading from 4.3 to 4.4 #8378
  • Stability : Optimized TWCS single-partition reader opens SSTables unnecessarily #8411 #8435
  • Stability: `time_series_sstable_set::create_single_key_sstable_reader` may return an empty reader even if the queried partition exists (in some other SSTable) #8447
  • Stability: `clustering_order_reader_merger` may immediately return end-of-stream if some (but not necessarily all) underlying readers are empty #8445
  • CQL: Mismatched types for base and view columns id: timeuuid and timeuuid, generating ”Unable to complete the operation against any hosts” error  #8666
  • Trace: Tracing can remain not shut down if start is aborted #8382
  • Tools: sstableloader doesn’t work with Alternator tables if “-nx” option is used #8230
  • When scylla-server was stopped manually, a few minutes later, the scylla-fstrim service starts it again #8921
  • Stability: Segfault in commit log shutdown, introduced in 4.5 #8952
  • Install: dist/redhat: scylla-node-exporter causes error while executing scriptlet on install #8966
  • Uninstall: removing /etc/systemd/system/*.mount on package uninstall can delete settings (like coredump setting) during upgrade #8810
  • Stability: Some of appending_hash<> instantiations are throwing operator() #8983
  • Stability: Off-strategy compaction with LCS keeps reshaping the last remaining SSTable #8573
  • Stability: Reshape may ignore overlapping in level L where L > 0 #8531
  • A new config “commitlog_use_hard_size_limit” sets whether or not to use a hard size limit for commitlog disk usage. Default is false. Enabling this can cause latency spikes, whereas the default can lead to occasional disk usage peaks as seen in #9053
  • Upscale (adding cores): On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor does not exist even if it supported CPU scaling. Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is used. #9191
  • Stability: load-and-stream fails: Assertion `!sst->is_shared()‘ failed and aborting on shard. #9173
  • Stability: excessive compaction of a fully expired TWCS table when running repair #8710
  • API uses incorrect plus<int> to sum up cf.active_memtable().partition_count(), which can result with
    CQL: Creating a table that looks like a secondary index breaks the secondary index creation mechanism #8620. This fix accidentally broke CREATE INDEX IF NOT EXISTS #8717
  • Stability: repair does not consider memory bloat which may cause repair to use more memory and cause std::bad_alloc. #8641
  • hints: using gossiper info for node state may lead to race condition when a node is drained #5087
  • A bug in BTree, introduced in 4.5  to index partition rows, may cause ScyllaDB to abort #9248
  • A bug in load_and_stream, introduced in 4.5 to distribute load SSTable to other nodes, may cause ScyllaDB to abort #9278
  • Stability: Accidental cache line invalidation in compact-radix-tree kills performance on large machines #9252
  • New explicit experimental flag for Raft: --experimental-options=raft
  • Install: perftune.py fails on bond NICs #9225
  • setup: scylla_cpuscaling_setup: On Ubuntu, scaling_governor becomes powersave after reboot #9324
  • RPM packaging: dependency issues related to python3 relocatable rpm #8829
  • Stability: evictable_reader: _drop_static_row can drop of the static row from an unintended partition #8923
  • Stability: evictable_reader: self validation triggers when a partition disappears after eviction #8893

13 Oct 2021