ScyllaDB Open Source Release 4.6

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 4.6, a production-ready release of our open source NoSQL database.

ScyllaDB 4.6 includes ARM support, new replace-node streaming, a new restore (load and stream) operation, and other performance and stability improvements and bug fixes (below).

Find the ScyllaDB Open Source 4.6 repository for your Linux distribution here. ScyllaDB 4.6 Docker is also available.

Only the latest two minor releases of the ScyllaDB Open Source project are supported. From now on, only ScyllaDB Open Source 4.6 and 4.5 are supported. Users running ScyllaDB Open Source 4.4 and earlier are encouraged to upgrade to one of these two releases.

Many of the new features below will be discussed at the upcoming virtual ScyllaDB Summit 2022, February 9-10.

New Features in ScyllaDB 4.6

ARM Support

ScyllaDB 4.6 supports the ARM architecture, including:

  • EC2 ARM base AMI, ready for Graviton2
  • Running ScyllaDB Docker on ARM, including Mac M1

Repair Based Node Operations (RBNO)

Repair Based Node Operations (RBNO) was introduced as an experimental feature in ScyllaDB 4.0. It uses repair to stream data for node operations such as replace, bootstrap, and others. While still considered experimental, we continue to work on this feature.

Repair is oriented towards moving small amounts of data, not an entire node’s worth. This resulted in many SSTables being created in the node, creating a large compaction load. To fix that, offstrategy compaction is now used to efficiently compact these SSTables without impacting the primary workload. #5226

In 4.6, RBNO is enabled by default only for replace node operation.

Example from scylla.yaml:

enable_repair_based_node_ops: true
allowed_repair_based_node_ops: replace

To enable other operations (experimental), add them as a comma-separated list to allowed_repair_based_node_ops. Available operations are: bootstrap, replace, removenode, decommission, and rebuild.

#8013 PR#9197
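As a rough illustration of how the two settings combine, here is a minimal Python sketch. It is not ScyllaDB code: the helper names parse_allowed_ops and uses_rbno are hypothetical, and the only facts taken from the text above are the two option names and the five operation names.

```python
# Operation names as listed in the release notes.
KNOWN_OPS = {"bootstrap", "replace", "removenode", "decommission", "rebuild"}


def parse_allowed_ops(value):
    """Split a comma-separated allowed_repair_based_node_ops value and validate it."""
    ops = {item.strip() for item in value.split(",") if item.strip()}
    unknown = ops - KNOWN_OPS
    if unknown:
        raise ValueError(f"unknown node operation(s): {sorted(unknown)}")
    return ops


def uses_rbno(operation, enabled, allowed):
    """Hypothetical helper: RBNO applies only if it is enabled AND the operation is listed."""
    return enabled and operation in allowed


allowed = parse_allowed_ops("replace,bootstrap")
print(uses_rbno("replace", True, allowed))       # True
print(uses_rbno("decommission", True, allowed))  # False
```

With the 4.6 defaults (enabled, and only replace allowed), every operation other than replace would fall back to regular streaming in this model.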

For more about “Repair Based Node Operations” see the ScyllaDB Summit 2022 session by Asias He.

Service Level Properties

Service levels allow the user to attach attributes to roles and users. These attributes apply to each session the user opens to ScyllaDB, enabling granular control of session properties such as timeouts and load shedding (overload handling).

So far, service levels have been used to implement Workload Prioritization in ScyllaDB Enterprise.

In this release, service levels are merged to ScyllaDB Open Source to implement two features:

  • Per service level timeouts
  • Workload types

Note that Workload Prioritization will remain an Enterprise-only feature.

Per service level timeouts

You can now create service levels with customized read and write timeouts and attach them to roles and users. This is useful when some workloads, like ETL, are less sensitive to latency than others.

For example:

CREATE SERVICE LEVEL sl2 WITH timeout = 500ms;
ATTACH SERVICE LEVEL sl2 TO scylla;
ALTER SERVICE LEVEL sl2 WITH timeout = null;

#7913, PR#7617, PR#8763

Workload types

You can declare a workload type for a service level. Three values are currently available:

  1. unspecified – generic workload without any specific characteristics; default
  2. interactive – workload sensitive to latency, expected to have high/unbounded concurrency, with dynamic characteristics, OLTP; example: users clicking on a website and generating events with their clicks
  3. batch – workload for processing large amounts of data, not sensitive to latency, expected to have fixed concurrency, OLAP, ETL; example: processing billions of historical sales records to generate useful statistics

Declaring a workload type provides more context for ScyllaDB to decide how to handle the sessions. For instance, if a coordinator node receives requests with a rate higher than it can handle, it will make different decisions depending on the declared workload type:

  • For batch workloads it makes sense to apply back pressure – the concurrency is assumed to be fixed, so delaying a reply will likely also reduce the rate at which new requests are sent;
  • For interactive workloads, backpressure would only waste resources – delaying a reply does not decrease the rate of incoming requests, so it’s reasonable for the coordinator to start shedding surplus requests.

Example

ALTER SERVICE LEVEL sl WITH workload_type = 'interactive';
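The overload handling described above can be sketched in a few lines of Python. This is an illustrative model only, not the coordinator's actual logic: the function name on_overload and the returned labels are hypothetical, and the behavior shown for the unspecified type is a placeholder, since the text does not describe it.

```python
from enum import Enum


class WorkloadType(Enum):
    UNSPECIFIED = "unspecified"
    INTERACTIVE = "interactive"
    BATCH = "batch"


def on_overload(workload):
    """Pick an overload reaction for a coordinator receiving more requests than it can handle."""
    if workload is WorkloadType.BATCH:
        # Fixed concurrency: delaying replies slows down the sender as well.
        return "backpressure"
    if workload is WorkloadType.INTERACTIVE:
        # Unbounded concurrency: delaying replies does not slow the sender, so drop surplus requests.
        return "shed"
    # Placeholder: the release notes do not say what happens for 'unspecified'.
    return "default"


print(on_overload(WorkloadType("interactive")))  # shed
print(on_overload(WorkloadType("batch")))        # backpressure
```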

Reverse Queries

A reverse query is a SELECT query that uses the reverse of the clustering order defined in the table schema. If no order was defined, the default order is ascending (ASC).

For example, the following table schema orders the rows in a partition by “time” in an ascending order:

CREATE TABLE heartrate (
  pet_chip_id uuid,
  owner uuid,
  time timestamp,
  heart_rate int,
  PRIMARY KEY (pet_chip_id, time)
);

The following SELECT worked in ScyllaDB 4.5 but might be very inefficient:

SELECT * FROM heartrate WHERE pet_chip_id = ? ORDER BY time DESC LIMIT 1;

Improving the performance of reverse queries is an ongoing process, with the following updates in ScyllaDB 4.6:

  • The internal layer for managing queries now supports reversed queries natively. This lays the groundwork for reversed reads in memtables, cache, and SSTables, so that reversed queries will perform efficiently.
  • The mx SSTable reader (reading mc and md format SSTables) can now read partitions in reversed order. This is a step towards supporting reversed reads of large partitions.
  • Memtables now efficiently support reversed reads (for CQL WITH CLUSTERING ORDER). Together with the already merged SSTable reversed reader, reversed reads with BYPASS CACHE are now more efficient, especially in terms of memory consumption.

SSTable Index Caching

Up to this release, ScyllaDB only cached data from SSTables.

As a result, if the data was not in cache, readers had to touch the disk while walking the index. This was inefficient, especially for large partitions: it increased the load on the disk and added latency.

In ScyllaDB 4.6, index blocks can be cached in memory, between readers, populated on access, and evicted on memory pressure – reducing the IO and decreasing latency. #7079

For more information, see Tomasz Grabiec’s ScyllaDB Summit session, “SSTable Index Caching”.
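To make the idea concrete, here is a minimal Python sketch of a cache that is shared between readers, populated on access, and evicted when it grows too large. It is a toy model, not ScyllaDB's implementation: the class name IndexBlockCache, the fixed block-count capacity, and the least-recently-used eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class IndexBlockCache:
    """Toy cache of SSTable index blocks, shared by all readers."""

    def __init__(self, capacity_blocks, read_block):
        self.capacity = capacity_blocks
        self.read_block = read_block  # callable standing in for a disk read
        self.blocks = OrderedDict()   # (sstable, block_no) -> block data
        self.disk_reads = 0

    def get(self, sstable, block_no):
        key = (sstable, block_no)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        # Cache miss: populate on access.
        self.disk_reads += 1
        data = self.read_block(sstable, block_no)
        self.blocks[key] = data
        # Simulated memory pressure: evict least recently used blocks.
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return data


cache = IndexBlockCache(2, lambda sstable, block_no: f"{sstable}:{block_no}".encode())
cache.get("sst1", 0)
cache.get("sst1", 0)     # second reader is served from memory
print(cache.disk_reads)  # 1
```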

ScyllaDB 4.6 Updates

Install, deployment and packaging

  • Deprecated:
    • Ubuntu 16 support. Ubuntu 16.04 is EOL and will not be supported following 4.6
    • Debian 9 support. Debian 9 is EOL and will not be supported following 4.6
  • Newer Linux distributions use systemd-timesyncd instead of ntp or chrony for time synchronization. This is now supported. ScyllaDB will now detect if you already have time synchronization set up and leave it alone if so. #8339
  • scylla_setup script now supports disabling the NVMe write-back cache on disks that allow it. This is useful to reduce latency on Google Cloud Platform local disks. The machine images built using scylla-machine-image will do this automatically.
  • The Unified (tarball) Installer now works correctly when SELinux is enabled. #8589
  • The docker image base has been switched from CentOS 7 to Ubuntu 20.04, similar to ScyllaDB AMI, and GCP images. PR#8849
  • The installer now offers to set up RAID 5 on the data disks in addition to RAID 0; this is useful when the disks can have read errors, such as on GCP local disks. #9076
  • The install script now supports supervisord in addition to systemd. This was brought in from the container image, where systemd is not available, and is useful in some situations where root access is not available.
  • Automatic I/O configuration during setup now supports AWS ARM instances. #9493
  • The setup utility now recognizes Persistent Disks on Google Cloud Platform and Azure. PR#9395 PR#9417

Raft

We are building up Raft, a consensus protocol, as an internal service in ScyllaDB that other features can build on. The changes have no visible effect yet. Among others, the following changes were made:

  • The Raft implementation now updates RPC about new and removed nodes.
  • It is now required to enable Raft with a configuration item (experimental: raft). This lets the implementation mature without requiring backwards compatibility efforts. #9239
  • The schema for storing Raft snapshots has been updated to avoid blobs.

CDC

  • Change Data Capture (CDC) now fills in the cdc$deleted_columns column in pre-image correctly

Alternator

Alternator is ScyllaDB’s implementation of the DynamoDB API.

  • Rudimentary implementation of the TTL expiration service. PR#9624
  • Fixed: ConditionExpression compared two non-existent attributes incorrectly. #8511
  • Fixed: incorrect set equality comparison inside a nested document. #8514
  • Fixed: incorrect inequality check of two sets. #8513
  • Alternator now includes the username in trace records. #9613
  • Alternator now supports the UpdateItem ADD operation. #5893

Guardrails

We continue to add default limitations (guardrails) to ScyllaDB, making it harder for users to use non-production settings by mistake. Each new configuration option added to restriction mode (tri_mode_restriction) has three possible values:

  • True: restricted, disable risky feature
  • False: non restricted, enable risky feature
  • Warn: non restricted, log warning about risky feature
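The three modes can be modeled with a short Python sketch. It only mirrors the semantics listed above; the names TriMode and check_guardrail are hypothetical, and the warning text is made up for the example.

```python
from enum import Enum


class TriMode(Enum):
    TRUE = "true"    # restricted: the risky feature is disabled
    FALSE = "false"  # not restricted: the risky feature is enabled
    WARN = "warn"    # not restricted, but a warning is logged


def check_guardrail(mode, feature, log):
    """Return True if the risky feature may be used; append a warning to `log` in warn mode."""
    if mode is TriMode.TRUE:
        return False
    if mode is TriMode.WARN:
        log.append(f"warning: {feature} is not recommended for production use")
    return True


log = []
print(check_guardrail(TriMode.WARN, "DateTieredCompactionStrategy", log))  # True
print(len(log))                                                            # 1
print(check_guardrail(TriMode.TRUE, "DateTieredCompactionStrategy", log))  # False
```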

Additions in this release are:

  • It’s now possible to prevent users from using SimpleReplicationStrategy, using config parameter restrict_replication_simplestrategy
    The goal is to first default to warning and then default to actual prevention. SimpleReplicationStrategy can make it hard to later grow the cluster by adding data centers. #8586
  • DateTieredCompactionStrategy has long been deprecated in favor of TimeWindowCompactionStrategy. A new warning will let you know if you are still using it. If you are nostalgic for the old strategy, use “restrict_dtcs” to disable this warning. #8914

CQL

  • SELECT statements that used an index, and also restricted the token (e.g. SELECT ... WHERE some_indexed_column = ? AND token(pk) = ?) incorrectly ignored the token restriction. The issue was found by using spark connector filtering on a secondary index. This is now fixed. #7043
  • User-Defined Aggregates (UDA) have been implemented. They allow the creation of custom aggregate functions, similar to the built-in count, min, and max. Note that UDA, like UDF, is experimental and can be enabled with the enable_user_defined_functions parameter in scylla.yaml. #7201
  • Selecting a partition range with a slice/equality restriction on clustering keys (e.g. SELECT * FROM tab WHERE ck=?, with no partition key restrictions) now demands ALLOW FILTERING again (since this query can potentially discard large amounts of data without returning anything). To avoid breaking applications that accidentally did not specify ALLOW FILTERING, it will only generate a warning for now. #7608
  • ScyllaDB now correctly rejects CQL insert/update statements with NULLs in key columns. #7852
  • Queries that are performed using an index can now select Static Columns. #8869
  • User Defined Functions and Aggregates (UDF/UDA) now support WebAssembly in experimental mode. The bindings to ScyllaDB data types will likely change, but this is sufficient to play with.

Hinted Handoff API

Hinted Handoff is an anti-entropy mechanism to replay mutations to a node which was unreachable for some time.

A new HTTP API allows waiting for hinted handoff replay to complete. This can be used to reduce repair work.

  • /hints_manager/waiting_point (POST) –  Create a sync point: given a set of target hosts, creates a sync point at the end of all HH queues pointing to any of the hosts.
  • /hints_manager/waiting_point (GET) –  Wait for or check the sync point: given a description of a sync point, checks if the sync point was already reached. If you provide a non-zero `timeout` parameter and the sync point is not reached yet, this endpoint will wait until the point is reached or the timeout expires.
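The two calls could be driven from a script along the following lines. Treat this as a sketch under stated assumptions rather than a reference client: the endpoint path and the `timeout` parameter come from the text above, while the base address and the query parameter names `target_hosts` and `id` are assumptions that should be checked against the REST API documentation of your version. The unit of `timeout` is not stated in the text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "http://localhost:10000"        # assumption: address of the node's REST API
ENDPOINT = "/hints_manager/waiting_point"  # path as given in the release notes


def create_sync_point_request(target_hosts):
    """Build the POST request that creates a sync point for the given hosts."""
    # "target_hosts" is an assumed parameter name.
    query = urlencode({"target_hosts": ",".join(target_hosts)})
    return Request(f"{API_BASE}{ENDPOINT}?{query}", method="POST")


def wait_sync_point_request(sync_point, timeout):
    """Build the GET request that checks (or waits for) a sync point."""
    # "id" is an assumed parameter name; "timeout" is named in the release notes.
    query = urlencode({"id": sync_point, "timeout": timeout})
    return Request(f"{API_BASE}{ENDPOINT}?{query}", method="GET")


def send(request):
    """Send a prepared request and return the response body as text."""
    with urlopen(request) as response:
        return response.read().decode()


request = create_sync_point_request(["10.0.0.1", "10.0.0.2"])
print(request.get_method(), request.full_url)
```

A typical flow would be to create the sync point, wait on it with a non-zero timeout, and only then start repair, so that hint replay has already filled in most of the missing data.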

scylla-sstable

scylla-sstable is a tool for examining the content of SSTables and executing various operations on them. The currently supported operations are:

  • dump – dumps the content of the sstable(s), similar to sstabledump;
  • dump-index – dumps the content of the sstable index(es), replacing scylla-sstable-index;
  • writetime-histogram – generates a histogram of all the timestamps in the sstable(s);
  • custom – a hackable operation for the expert user (until scripting support is implemented);
  • validate – validate the content of the sstable(s) with the mutation fragment stream validator, same as scrub in validate mode;

Performance Improvements

  • Streams of mutation data are represented in ScyllaDB by a flat_mutation_reader, which provides a means for a function to consume a stream. This was made faster, which improves operations such as flushing a memtable. See PR#8359 for micro benchmark results.
  • When ScyllaDB receives SSTables from outside the replica (such as via a repair operation, or after restoring a snapshot) it first reshapes them to conform to the compaction strategy.
    Reshape was improved for:

  • More code paths can now work with non-contiguous memory for table columns and intermediate values: comparing values, the CQL write path. This reduces CPU stalls due to memory allocation when large blobs are present. PR#8357
  • SSTable parser will avoid large allocations, reducing latency spikes. #6376, #7457
  • Repair is now delayed until hints for that table are replayed. This reduces the amount of work that repair has to do, since hint replay can fill in the gaps that a downed node misses in the data set. #8102
  • SSTables will now automatically choose a buffer size that is compatible with achieving good latency, based on disk measurements by iotune.
  • The setup scripts will now format the filesystem with 1024 byte blocks if possible. This reduces write amplification for lightweight transaction (LWT) workloads.
  • Fixed a read latency increase after deleting a high percentage of the data: reads going through the row cache were very slow when many rows were covered by a single range tombstone. #8626
  • Authentication had a 15-second delay to work around dependency problems. The delay has long been unnecessary and is now removed, speeding up node start.
  • Unintended quadratic behavior in the log-structured allocator (which manages ScyllaDB memtable and cache memory) has been fixed. #8542
  • Off-strategy compaction is now enabled for repair. After repair completes, the SSTables generated by repair will first be merged together, then incorporated into the set of SSTables used for serving data. This reduces read amplification due to the large number of SSTables that repair can generate, especially for range queries where the bloom filter cannot exclude those SSTables. #8677
  • Off-strategy compaction, a method by which SSTables are reshaped to fit the compaction strategy, is now enabled for bootstrap and replace operation using standard streaming. #8820
  • The read path has been optimized to remove unnecessary work, leading to a small performance increase.
  • The common case of single-partition query was treated as an IN query with a 1-element tuple. This case is now specialized to avoid the extra post-processing work.
  • SSTable index files are now cached, both at the page level and at an object level (index entry). This improves large partition workloads as well as intermediate size workloads where the entire SSTable index can be cached. #7079
  • The row cache behavior was quadratic in certain cases where many range tombstones were present. This has been fixed. #2581
  • Recently the SSTable index has gained the ability to use the cache to reduce I/O; but it did so even when BYPASS CACHE was requested in the CQL statement. The index now respects BYPASS CACHE like data access.
  • After adding a node, a cleanup process is run to remove data that was “moved” to the new node. This is a compaction process that compacts only one SSTable at a time. This fact was used to optimize cleanup. In addition, the check for whether a partition should be removed during cleanup was also improved. #6807
  • ScyllaDB uses reader objects to read sequential data. It caches those readers so they can be reused across multiple pages of the result set, eliminating the overhead of starting a new sequential read each time. However, this optimization was missed for internal paging used to implement aggregations (e.g. SUM(column)). ScyllaDB now uses the optimization for aggregates too. #9127
  • There is now an effective replication map structure, which contains the application of a replication strategy and its parameters to a topology (node->token mapping). This reduces the amount of run-time computation needed by the coordinator.
  • Time Window Compaction strategy reshape gained two optimizations. Reshape happens when changing compaction strategies or after streaming data to a new node. The optimizations reduce write amplification and therefore the time spent when adding a new node.
  • In Time Window Compaction Strategy compactions, fully expired SSTables will be compacted separately, since that can be done by just dropping them. #9533
  • Scrub compaction, which re-sorts unsorted SSTables, will now use memtables as a sorting mechanism instead of generating many small SSTables. PR#9548
  • Usually a memtable is flushed into one SSTable. In some cases, however, it can be flushed into several SSTables. We now adjust the partition estimate for these SSTables so the bloom filters allocated by these SSTables will not occupy too much memory. #9581
  • Size-tiered compaction strategy will prefer compactions with larger fan-in in order to improve efficiency. Moreover, once a compaction with large fan-in is started, compactions with lower fan-in will be delayed in order to improve overall write amplification.
  • Major compaction will now process tables from smallest to largest, to increase the probability of success in case the node is running low on space.
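The last item in the list above (major compaction going from the smallest table to the largest) can be illustrated with a toy Python model. The accounting is a deliberate simplification and an assumption of this sketch, not ScyllaDB's actual logic: compacting a table is assumed to need free space equal to the table's size, and to release a given number of bytes once it finishes.

```python
def major_compaction(tables, free_space, smallest_first=True):
    """Return the tables that could be compacted, in processing order.

    `tables` maps a table name to (size, reclaimed): the temporary space the
    compaction needs and the space it frees afterwards (illustrative numbers).
    """
    order = sorted(tables, key=lambda name: tables[name][0], reverse=not smallest_first)
    compacted = []
    for name in order:
        size, reclaimed = tables[name]
        if size > free_space:
            continue  # not enough room: this table's compaction fails
        free_space += reclaimed
        compacted.append(name)
    return compacted


tables = {"a": (10, 8), "b": (20, 10), "c": (30, 5)}
print(major_compaction(tables, free_space=12))                        # ['a', 'b', 'c']
print(major_compaction(tables, free_space=12, smallest_first=False))  # ['a']
```

In this model, starting with the small tables frees space early, which is what lets the larger tables succeed later.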

Stability Improvements

  • Thrift now has partial admission control, to reduce the chance of server overload.
  • A recent regression caused requests to data centers where the local replication factor is zero to crash. This is now fixed. #8354
  • A bug in Time Window Compaction Strategy’s selection of SSTables for single-partition reads caused SSTables that did not have data for the key to be consulted, reducing performance. This is now fixed. #8415
  • Continuing on the path of allowing non-contiguous allocations for large blobs, memory linearizations have been removed from Change Data Capture (CDC). This reduces CPU stalls when CDC is used in conjunction with large blobs. #7506
  • The underlying json parser used by alternator, rapidjson, has been hardened against out-of-memory errors. #8521
  • A bug in the row cache that can cause large stalls on schemas with no clustering key has been fixed.
  • ScyllaDB uses a log-structured memory allocator (LSA) for memtable and cache. Recently, unintentional quadratic behavior in LSA was discovered, so as a workaround the memory reserve size is decreased. Since the quadratic cost is in terms of this reserve size, the bad behavior is eliminated. Note the reserves will automatically grow if the workload really needs them. #8542
  • Repair allocates working memory for holding table rows, but did not consider memory bloat and could over-allocate memory. It is now more careful. #8641
  • Failure detection is now done directly by nodes pinging each other rather than through the gossip protocol. This is more reliable and the information is available more rapidly. Impact on networking is low, since ScyllaDB implements a fully connected mesh in all clusters smaller than 256 nodes per datacenter, much larger than the typical cluster. #8488
  • Change Data Capture (CDC) uses a new internal table for maintaining the stream identifiers. The new table works better with large clusters. #7961
  • ScyllaDB will now close the connection when a too-large request arrives; previously ScyllaDB would read and discard the request. The new behavior protects against having to read and discard potentially gigabytes of data. #8798
  • A SELECT statement could result in unbounded concurrency (leading to an out-of-memory error) in some circumstances; this is now fixed. #8799
  • Speculative retry will now no longer consider failed responses when calculating the 99th percentile of a request. Considering them leads to the latency threshold being continuously raised, since failed (timed out) requests have very long latency; so the effectiveness of the feature is much reduced. Now it will only consider successful responses. #3746 #7342
  • The bootstrap process was made more robust over failures to communicate the token ranges to the new node. Such problems will be resolved more fully with Raft. #8889
  • A regression in how ScyllaDB computes compaction backlog has been fixed. Compaction backlog estimates the total number of bytes that remain to be compacted (which can be greater than the table size due to write amplification). ScyllaDB then uses the backlog to decide how to allocate resources to compaction. The bug caused the backlog to be overestimated, thus devoting too many resources (CPU and disk bandwidth) to compaction. #8768
  • Logging of corrupted SSTables has been improved to report the file name and offset of the corrupted chunk.
  • A problem with COMPACT STORAGE tables representation in Change Data Capture has been corrected. #8410
  • The reader concurrency semaphore is ScyllaDB’s main replica-side admission control mechanism. It tracks query memory usage and timeouts, and delays or rejects queries that will overload memory. Previously, it was engaged only on a cache miss. It is now engaged before the cache is accessed. This prevents some cache intensive workloads from overloading memory. These workloads typically have queries that take a long time to execute and consume a lot of memory, even from cache, so their concurrency needs to be limited. #4758 #5718
  • A limitation of 10,000 connections per shard has been lifted to 50,000 connections per shard, and made tunable with max-networking-io-control-blocks seastar config. #9051
  • The currenttime() and related functions were incorrectly marked as deterministic. This could lead to incorrect results in prepared statements. They are now marked as non-deterministic. #8816
  • If ScyllaDB stalls while reclaiming memory, it will now log memory-related diagnostics so it is easier to understand the root cause.
  • Repair now reads data with a very long timeout (instead of infinite timeout). This is a last-resort defense against internal deadlocks. #5359
  • The cache abstraction that is used to implement the prepared statement cache has been made more robust, eliminating cases where the entry expired prematurely. #8920
  • When streaming data from a Time Window Compaction Strategy table, ScyllaDB segregates the data into the correct time window to follow the compaction strategy rules. This however causes a large number of SSTables to be created, overloading the node in extreme cases. We now postpone data segregation to a later phase, off strategy compaction, which happens after the number of SSTables has been reduced. #9199
  • Scrub compactions no longer purge tombstones. This is in order to give a tombstone hiding in a corrupted SSTable the chance to get compacted with data.
  • Scrub compactions are now serialized with regard to other compactions, to prevent an SSTable undergoing compaction from not being scrubbed. #9256
  • The CQL protocol server now responds with an overloaded exception if it decides to shed requests. This avoids a resource leak at the client driver side, when the driver may be waiting for the response indefinitely. #9442
  • A scrub compaction takes an SSTable with corruption problems and splits it into non-corrupted SSTables. This can result in an SSTable being split into a large number of new SSTables. In this case the bloom filter of the new SSTables could occupy too much memory and run the node out of memory. We now reduce the bloom filter size dynamically to avoid this. #9463
  • When rewriting SSTables (e.g. for the nodetool upgradesstables command), ScyllaDB will now abort the rewrite if shutting down.
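The speculative retry fix in the list above is easy to see with numbers. The following Python sketch computes a nearest-rank 99th percentile, once over all responses and once over successful responses only. The function name p99 and the sample latencies are made up for illustration, and ScyllaDB's actual percentile estimation may differ.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latencies."""
    ordered = sorted(latencies_ms)
    if not ordered:
        return None
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]


# (latency in ms, succeeded) pairs: 100 successes and 5 timed-out requests.
samples = [(2.0, True)] * 95 + [(5.0, True)] * 5 + [(10000.0, False)] * 5

with_failures = p99([latency for latency, _ in samples])
successes_only = p99([latency for latency, ok in samples if ok])

print(with_failures)   # 10000.0
print(successes_only)  # 5.0
```

With the timed-out requests included, the threshold jumps to the timeout itself, so a speculative retry would almost never fire; with successes only, it stays near the real tail latency.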

Infrastructure

  • Infrastructure for a new style of virtual tables has been merged. While ScyllaDB already supported virtual tables, it was hard to populate them with data. The new infrastructure reuses memtables as a simple way to populate a virtual table. #8343
  • The code to write ka/la format SSTables has been removed. ScyllaDB can now only write mc/md format SSTables. Note that ScyllaDB can still read ka/la format SSTables. #8352
  • The commitlog code has been converted to coroutines in order to simplify further work. PR#8954
  • ScyllaDB uses a home-grown Interface Definition Language (IDL) to help preserve compatibility during upgrades. It now supports generating RPC verbs, not just data types.
  • There is a new internal keyspace, system_distributed_everywhere, which is used to propagate internal information that needs higher consistency and bandwidth than gossip. The first user will be Change Data Capture internal data. PR#8457
  • The Redis protocol server relied on code that was copy-pasted from the CQL transport server. The two implementations are now unified into a generic tcp server.
  • The internal representation of range tombstones is changing. Instead of a {start, end} pair, after which rows that may be affected by the range tombstone (and rows that are not) are emitted, we emit a {start} marker, then all affected rows, then the {end} marker. This fits well with ‘m’ format SSTables, and once the work is complete will improve workloads which have many range tombstones.
  • The SSTable parsers have been converted from a state machine to a C++ coroutine. There should be no user-visible effect, but the code complexity has been significantly reduced, from a tangle of switches and gotos to structured loops.
  • The system.status and system.describe_ring virtual tables have been renamed to the more descriptive and conventional names system.cluster_status and system.token_ring respectively. Since the tables are new, users are not affected.

Security

Thrift listen port (start_rpc) is disabled by default, as it is not often used. Thrift is still fully supported. #8336

Configuration

New and updated configuration changes:

  • experimental: deprecated. This flag was used to enable all experimental features, which is not very useful. Instead, you should enable experimental features one by one. Experimental features in this release include these possible values: ‘lwt’, ‘cdc’, ‘udf’, ‘alternator-streams’, ‘raft’. #9467
  • enable_sstables_mc_format: parameter is now ignored; mc or md format is mandatory. ScyllaDB continues to be able to read older ka and la format SSTables. #8352
  • batch_size_warn_threshold_in_kb: from 5 to 128, batch_size_fail_threshold_in_kb: from 50 to 1024
    The batch size warning and failure thresholds have been increased to 128 KiB and 1 MiB respectively. The original limits (5 and 50 KiB) were unnecessarily restrictive. #8416
  • max_hinted_handoff_concurrency: when a node is down, writes to that node are collected in hint files. When the node rejoins, hint files are replayed. It turned out that hint replay was too fast and interfered with the normal workload, so its concurrency was reduced and made configurable.
  • restrict_dtcs(false): see Guardrails above
  • experimental::alternator-ttl see above
  • reversed_reads_auto_bypass_cache(true) : Use new implementation of reversed reads in SSTables when performing reversed queries
  • replace_node and replace_token: these options were ignored and are now removed
  • flush_schema_tables_after_modification(true): Flush tables in the system_schema keyspace after schema modification. This is required for crash recovery, but slows down tests and can be disabled for them
  • strict_allow_filtering(warn): Match Cassandra in requiring ALLOW FILTERING on slow queries. Can be true, false, or warn. When false, ScyllaDB accepts some slow queries even without ALLOW FILTERING that Cassandra rejects. Warn is the same as false, but with a warning.
  • commitlog_use_hard_size_limit(false): Whether or not to use a hard size limit for commitlog disk usage. Default is false. Enabling this can cause latency spikes, whereas the default can lead to occasional disk usage peaks.
  • enable_repair_based_node_ops(true): Set to true to use repair based node operations instead of streaming based ones (more on RBNO above)
  • allowed_repair_based_node_ops(replace): A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild (more on RBNO above)
  • sanitizer_report_backtrace(false): In debug mode, report log-structured allocator sanitizer violations with a backtrace. Do not use this in production, as it slows ScyllaDB significantly.
  • restrict_replication_simplestrategy(false): see Guardrails above
  • failure_detector_timeout_in_ms(20,000ms): Maximum time, in milliseconds, between two successful echo messages before gossip marks a node down
  • wait_for_hint_replay_before_repair(true): If set to true, ScyllaDB first waits until hints are sent to the nodes participating in the repair before proceeding with the repair itself. This reduces the amount of data that needs to be transferred during repair.
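As a worked example of the new batch size thresholds listed above, the following Python sketch classifies a batch by its size. The threshold values are the new defaults from this release; the function name check_batch_size is hypothetical, and the exact boundary behavior (strictly greater than the threshold) is an assumption.

```python
# New defaults in 4.6, in KiB (previously 5 and 50).
BATCH_SIZE_WARN_THRESHOLD_IN_KB = 128
BATCH_SIZE_FAIL_THRESHOLD_IN_KB = 1024


def check_batch_size(size_bytes,
                     warn_kb=BATCH_SIZE_WARN_THRESHOLD_IN_KB,
                     fail_kb=BATCH_SIZE_FAIL_THRESHOLD_IN_KB):
    """Classify a batch as 'ok', 'warn', or 'fail' (assumes strict '>' comparisons)."""
    if size_bytes > fail_kb * 1024:
        return "fail"
    if size_bytes > warn_kb * 1024:
        return "warn"
    return "ok"


print(check_batch_size(60 * 1024))        # ok (would have failed with the old 50 KiB limit)
print(check_batch_size(200 * 1024))       # warn
print(check_batch_size(2 * 1024 * 1024))  # fail
```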

Testing

  • The performance benchmark perf_simple_query now returns the number of allocations performed and tasks executed per query. These metrics are less noisy than raw performance and so can be used to quantify improvements more easily.
  • There is now rudimentary support for code-coverage reports in unit tests.
  • The perf_simple_query benchmark now reports how many instructions were executed by the CPU per query.
  • ScyllaDB comes with extensive integration with gdb, to make inspecting core dumps easier. This infrastructure now has a unit test.
  • There is a new type of compaction: validation. A validation compaction will read all SSTables and perform some checks, but write nothing. This is useful to make sure all SSTables can be read and pass sanity checks. #7736

Tooling

  • sstablemetadata: add support to inspect SSTable using zstd compression #8887
  • CQLsh uses the ScyllaDB Python Driver instead of the Cassandra driver
  • There is now a crawling SSTable reader, that does not use the SSTable index file. This is useful in scenarios such as scrub, when the index file is suspected to be corrupted.

Additional Bug Fixes

  • Stability: LWT sporadic failure if a non-deterministic function is used to assign partition key #8604
  • Monitoring: API uses incorrect plus<int> to sum up cf.active_memtable().partition_count(), which can result in the value wrapping around if it is bigger than 2^32, returning the wrong metric. #9090
  • CQL: Multi column restrictions ignored on indexed queries #9085
  • Stability: Reads which page due to memory and need to reconcile may miss writes in the results #9119
  • Packaging: Docker run arguments not passed to scylla #9247
  • A performance regression in the memtable reader was fixed. The regression was introduced when adding support to reverse queries. #9502
  • UX: large data warnings do not contain the SSTable name #9524
  • More SSTable code need a close() method to avoid use-after-free bugs #1076
  • Make result of trichotomic comparison not convertible to bool #1449
  • Cache reads generate more range tombstones than necessary #2581
  • Change flat_mutation_reader::is_end_of_stream() to reflect the state of the consumer side #3067
  • Speculative reads based on read latency profile may cease to be effective temporarily #3746
  • Request for enhancement nodetool toppartitions – need to be more general #4520
  • cache-mostly read workloads can overload memory #4758
  • hints_manager – Exception when draining 10.0.63.76: std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/var/lib/scylla/hints/3/10.0.63.76]) #5087
  • reader_concurrency_semaphore: admit new reads only when all current ones are stalled on I/O #5718
  • Alternator – missing support for UpdateItem ADD operation #5893
  • Error applying view update (Operation timed out for mview...) #6187
  • Large allocation when calculating schema digest with 5000 tables #6376
  • Optimize partition filtering of cleanup compaction #6807
  • ScyllaDB vs Cassandra compatibility issue (spark connector filtering on secondary index) #7043
  • UDA creation error #7201
  • Transform mutation data into result_row_view when invoking expr::is_satisfied_by() #7215
  • cassandra-stress latency peaks up to 12 s, with OPS falling from 80k to 300, when ScyllaDB is killed on one node. #7342
  • Avoid large contiguous allocation for large cells in the SSTable parser / writer #7457
  • CDC code linearizes all values #7506
  • Restriction on clustering-key only should require ALLOW FILTERING #7608
  • New need_filtering() returns false for a query that needs filtering #7708
  • nodetool scrub: add dry run flag #7736
  • Provide an option to bypass reshape on boot #7738
  • Add metrics for the amount of range tombstones processed during read #7749
  • Azure Ls v2 local disk setup #7807
  • Inserting a row with a null key column should be forbidden #7852
  • Frozen nested set may be created out-of-order via a prepared statement #7856
  • Introduce service levels #7867
  • Add per-service-level timeouts #7913
  • docs: moved latest_version to conf.py #7957
  • large allocation in untyped_result_set reading cdc generations (seastar_memory – oversized allocation: 2023424 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues) #7961
  • Cassandra stress fails to achieve consistency during replace node operation #8013
  • scylla_io_setup on GKE: found more than one disk mounted at root #8032
  • gossip: Introduce direct failure detector #8036
  • Replaying hints before repair #8102
  • Some IDL compiler cleanups #8148
  • scylla_setup: set XFS block size to 1k #8156
  • cql3: remove linearizations in the write path #8160
  • SSTables: remove large allocations when parsing cells #8182
  • If ALLOW FILTERING excludes a long string of rows, the scan can stop prematurely #8203
  • expr: give a name to a tuple of columns #8208
  • sstableloader doesn’t work with Alternator tables if “-nx” option is used #8230
  • The LWT stress is stopped during the prepare phase #8270
  • var-lib-scylla.mount mount process failed: can’t find UUID #8279
  • CDC generations: refactors and improvements #8283
  • storage_service/removenode: update gossiper state before excise #8284
  • dist/common/scripts/scylla_util.py doesn’t support Azure images #8325
  • repair: Switch to use NODE_OPS_CMD for replace operation #8330
  • scylla_ntp_setup: Use systemd-timesyncd instead of chrony when it’s installed by default #8339
  • Add infrastructure for virtual tables #8343
  • scylla_ntp_setup: support systemd-timesyncd #8344
  • Retire la/ka writer #8352
  • Unable to initialize new nodes on ScyllaDB 4.3.2: what():  Not an IPv4 address #8354
  • config: ignore enable_sstables_mc_format flag #8360
  • Commitlog can get stuck after reaching disk size limit, causing writes to time out #8363
  • dist: hardcoded product name on scylla_setup #8367
  • commitlog: Fix race and edge condition in delete_segments #8372
  • Handle commitlog recycle errors #8376
  • Stuck server while upgrading to 4.4 from 4.3.2 #8378
  • listen-address flag description saying `Never specify 0.0.0.0; it is always wrong.` is wrong #8381
  • Tracing can remain not shut down if start is aborted #8382
  • scylla_setup: stop hardcode product name on scylla_setup #8384
  • nonroot: when /var/log/journal exists, install.sh does not generate scylla_sysconfdir.py correctly #8385
  • nonroot: generate scylla_sysconfdir.py correctly #8386
  • hints: use token_metadata to tell if node has left the ring #8387
  • Unify CQL and Redis server code #8388
  • install.sh: document pathname components #8389
  • Spurious EINVAL errors from TLS sockets #8391
  • utils: fragment_range: fix FragmentedView utils for views with empty fragments #8397
  • Infinite loop in write_fragmented for empty fragments in non-empty views #8398
  • composite: replace enable_if with constraints #8399
  • treewide: don’t include “db/system_distributed_keyspace.hh” from headers #8403
  • types: convert has_empty predicate to a concept #8404
  • utils: rjson: convert enable_if to concept #8405
  • caching_options: detemplate from_map() #8406
  • caching_options.hh: move code to .cc #8408
  • config: correct description of listen_address #8409
  • coredump during requesting cdc log table #8410
  • “Reader was forwarded before returning partition start” exception using TWCS in 4.4 #8411
  • Add a (failing) test for picking secondary indexes in order #8414
  • TimeWindowCompactionStrategy not using specialized reader for single partition queries #8415
  • Annoying warnings for small batches #8416
  • config: relax batch size warning and failure thresholds #8417
  • Reject replacing a node that has left the ring #8419
  • storage_service: Reject replacing a node that has left the ring #8420
  • test: perf: perf_fast_forward: report allocation rate and tasks #8422
  • test: perf: perf_simple_query: collect allocation and task statistics #8425
  • cdc: log: avoid linearizations #8429
  • build: drop lld from install-dependencies.sh on s390x #8430
  • scylla_ntp_setup: detect already installed ntp client #8431
  • Optimized TWCS single-partition reader opens SSTables unnecessarily #8432
  • SSTables: fix TWCS single key reader SSTable filter #8433
  • `cache_flat_mutation_reader` position-fast-forwards readers that are not inside a partition, if they immediately return eof #8435
  • stream_session uses escaped curly braces in format strings #8436
  • Make sure that cache_flat_mutation_reader::do_fill_buffer does not fast forward finished underlying reader #8437
  • Name resolution with tcp fails #8442
  • dist/common/scripts/scylla_util.py: GCP nvme: fail silently if you cannot figure out nvme count, or cannot figure out their size and fallback to iotune #8444
  • clustering_order_reader_merger may immediately return end-of-stream if some (but not necessarily all) underlying readers are empty #8445
  • clustering_order_reader_merger: handle empty readers #8446
  • time_series_sstable_set::create_single_key_sstable_reader may return an empty reader even if the queried partition exists (in some other SSTable) #8447
  • time_series_sstable_set: return partition start if some SSTables were ck-filtered out #8448
  • Avoid two compaction passes when reshaping STCS table for repair-based operations #8449
  • utils: data_input: replace enable_if with tightened concept #8450
  • hints: delay repair until hints are replayed #8452
  • Add a ninja help build target #8454
  • artifacts-centos7-test: scylla_util.py fails with – ModuleNotFoundError: No module named 'scylla_sysconfdir' #8456
  • sys_dist_ks: new keyspace for system tables with Everywhere strategy #8457
  • tracing: test/boost/tracing: fix use after free #8461
  • mislicensed source files #8465
  • treewide: correct mislicensed source files #8466
  • service level controller is used after stop #8468
  • qos: make sure to wait for service level updates on shutdown #8470
  • Switch to use NODE_OPS_CMD for decommission operation #8471
  • Switch to use NODE_OPS_CMD for bootstrap operation #8472
  • High latency during bootstrap, background reclaim ineffective #8473
  • main: start background reclaim before bootstrap #8474
  • commitlog_test: Add test for deadlock in shutdown with segment wait #8475
  • gossip: Relax failure detector update #8476
  • Can’t install using site’s default install pkg #8479
  • Switch to use NODE_OPS_CMD for decommission and bootstrap operation #8481
  • Ubuntu AMI: ScyllaDB server isn’t up after reboot #8482
  • unified: abort install when non-bash shell detected #8484
  • Improve validation of “enable”, “postimage” and “ttl” CDC options #8486
  • Introduce direct failure detector #8488
  • reader_concurrency_semaphore: readmission leaks resources #8493
  • scylla_raid_setup: Ubuntu is not able to enable mdmonitor.service #8494
  • dist: add DefaultDependencies=no to .mount units #8495
  • bootstrap_with_repair badly formatted exception: expected 1 node losing range but found more nodes={} #8503
  • repair: Handle everywhere_topology in bootstrap_with_repair #8505
  • main: add a debug symbol for service level controller #8506
  • Potential data corruption when reading range tombstone from MC sstable #8507
  • database::get_reader_concurrency_semaphore() uses system semaphore as the catch-all semaphore #8508
  • Materialized views: fix possibly old views coming from other nodes #8509
  • alternator ConditionExpression wrong comparison of two non-existent attributes #8511
  • scylla_io_setup: configure “aio-max-nr” before iotune #8512
  • Alternator incorrect inequality check of two sets #8513
  • Alternator – incorrect set equality comparison inside a nested document #8514
  • doc: Update bootstrap with everywhere_topology #8518
  • rapidjson’s default allocator does not handle allocation failures properly #8521
  • tools: toolchain: dbuild: define die() earlier #8523
  • treewide: remove inclusions of storage_proxy.hh from headers #8524
  • Calculate partition ranges from expr::expression #8525
  • dist: Add support for disabling writeback cache #8526
  • dist/debian: when PRODUCT != scylla, scylla-node-exporter.default is not added with node exporter package #8527
  • dist/debian: rename .default file correctly #8528
  • rjson: Add throwing allocator #8529
  • scylla_raid_setup: enabling mdmonitor.service on Debian variants #8530
  • Reshape may ignore overlapping in level L where L > 0 #8531
  • Cannot bootstrap a node in presence of Everywhere strategy tables with RBNO enabled #8534
  • token_metadata: Fix get_all_endpoints to return nodes in the ring #8536
  • repair: remove partition_checksum and related code #8537
  • test: make rjson allocator test working in debug mode #8539
  • Forward-port service level fixes #8540
  • logalloc: reclaim_from_evictable evicts up to 3(N^2) segments when asked to reclaim N segments #8542
  • Added r5b to ena instance_class. #8546
  • Alternator tracing does not include valid user info #8547
  • Add username to alternator tracing #8548
  • test: perf: don’t truncate allocation/req and tasks/req report #8550
  • gdb: Fix heapprof() dereferencing of backtrace #8551
  • Huge reactor stalls in token_metadata during topology changes when `EverywhereStrategy` tables are present #8555
  • SSTables: Add debug info when create_sharding_metadata generates zero ranges #8557
  • messaging service: be more verbose when shutting down servers and clients #8560
  • test: perf: report instructions retired per operations #8563
  • Tests for Alternator’s TTL feature #8564
  • cql3: Check if partition-key restrictions are all EQ at preparation time #8565
  • repeated message: service_level_controller – service level default was deleted #8567
  • cdc: log: fill cdc$deleted_ columns in pre-images #8568
  • TWCS: sometimes SSTables are not compacted together when time window finishes #8569
  • logalloc: reduce minimum lsa reserve in allocating_section to 1 #8572
  • Off-strategy compaction with LCS keeps reshaping the last remaining SSTable #8573
  • workload prioritization: Reduce the logging sensitivity to “glitches” in #8574
  • Service level controller: fix wrong default service level removal log #8576
  • ScyllaDB shutdown process stuck when Shutting down drain storage proxy #8577
  • commitlog: make_checked_file for segments, report and ignore other errors on shutdown #8578
  • Reactor stall in consume_mutation_fragments_until #8579
  • flat_mutation_reader: consume_mutation_fragments_until: maybe yield for each popped mutation_fragment #8580
  • cql/cdc_batch_delete_postimage_test – rename test files + fix result #8581
  • STCS get_buckets insertion algorithm may break the bucket invariant #8584
  • dist: scylla_raid_setup: reduce xfs block size to 1k #8585
  • Add option to forbid SimpleStrategy in CREATE KEYSPACE #8586
  • scylla_io_setup failed with error: seastar – Could not setup Async I/O on aws instances (r5, r5b) and gp3 ebs volumes #8587
  • Unified Installer: Incorrect file security context cause scylla_setup to fail #8589
  • TWCS: initialize _highest_window_seen #8590
  • storage_proxy::get_max_result_size() uses the unlimited max-size for service levels #8591
  • storage_proxy: use small_vector for vectors of inet_address #8592
  • Support Microsoft Azure snitch #8593
  • scylla_util.py: Fix Azure support for machine-image #8596
  • node doesn’t switch status to UP and normal after scylla had started on node, which was previously drained #8597
  • IndexInfo system table lists MV name instead of index name #8600
  • install.sh: apply correct file security context when copying files #8602
  • LWT sporadic failure if a non-deterministic function is used to assign partition key #8604
  • storage_proxy: place unique_response_handler:s in small_vector instead of std::vector #8606
  • repair: Wire off-strategy compaction for decommission #8607
  • wild pointer dereference during shutdown after failing to create SSTable component #8609
  • storage_service isolate log message should be an error #8610
  • logalloc: fix quadratic behavior of reclaim_from_evictable #8611
  • phased_barrier move-assignment may leak a gate entry #8613
  • storage_service: Delay update pending ranges for replacing node #8614
  • gossip: add local application state excess logging #8616
  • Handle excess logging of non existing local endpoint #8617
  • scylla-gdb unit test #8618
  • Creating a table which looks like a secondary index breaks the secondary index creation mechanism #8620
  • cql-pytest: add nodetool flush feature and use it in a test #8622
  • Overlapping range tombstones can lead to very slow reads and OOMs #8625
  • Reads of many rows covered by a single range tombstone which go through row cache are very slow #8626
  • additional sstable stats (#251) #8630
  • Fix index name conflicts with regular tables #8632
  • Description in scylla-fstrim.timer is incorrect #8633
  • db/virtual tables: Add infrastructure + system.status example table #8634
  • exception phased_barrier::advance_and_await is not handled #8636
  • consistency_level: deinline assure_sufficient_live_nodes() #8637
  • repair: Consider memory bloat when calculate repair parallelism #8641
  • cdc: use a new internal table for exchanging generations #8643
  • scylla-fstrim.timer: fix wrong description from ‘daily’ to ‘weekly’ #8644
  • Fix service level negative timeouts #8645
  • hints: make hints concurrency configurable and reduce the default #8646
  • abstract_replication_strategy: avoid reactor stalls in `get_address_ranges` and friends #8647
  • ScyllaDB unconditionally waits 15 seconds for auth to be initialized, which is wasteful #8648
  • auth: remove the fixed 15s delay during auth setup #8649
  • install.sh: set aio conf during installation #8650
  • Abort restore_replica_count when node is removed from the cluster #8651
  • repair: Consider memory bloat when calculate repair parallelism #8652
  • storage_service: Abort restore_replica_count when node is removed from the cluster #8655
  • perf: add alternator frontend to perf_simple_query #8656
  • Exceptions in resharding and reshaping are being incorrectly swallowed #8657
  • perf_fast_forward: report instructions per fragment #8660
  • unified/uninstall.sh: simplify uninstall.sh, delete all files correctly #8662
  • nonroot installation broken at 4.5.rc1 #8663
  • install.sh: fix no such file or directory on nonroot #8664
  • tree-wide: comments on deprecated functions to access global variables #8665
  • Mismatched types for base and view columns id: timeuuid and timeuuid #8666
  • Fix type checking in index paging #8667
  • cql3: represent lists as chunked_vector instead of std::vector #8668
  • row_cache_test: test_partition_range_population_with_concurrent_memtable_flushes fails with segmentation fault #8671
  • gdb: bypass unit test on non-x86 #8672
  • Ninja build feature request: Set packages to be created with architecture name #8675
  • Wire off-strategy compaction for regular repair #8677
  • repair: Wire off-strategy compaction for regular repair #8678
  • cql3: result_set: switch rows to chunked_vector #8679
  • Introduce per-service-level workload types and their first use-case – shedding in interactive workloads #8680
  • scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem #8681
  • Enable -Wunused-private-field warning #8683
  • types: remove some dead code #8684
  • types: fix a case of type punning via union #8685
  • keys, compound: eliminate some careless copies of shared pointers #8686
  • alternator: executor: eliminate some pointless reserializations #8687
  • keys, compound: take the argument to from_single_value() by reference #8688
  • types: don’t linearize values in abstract_type::hash #8689
  • collection_mutation: don’t linearize collection values #8690
  • Alternator’s health-check request doesn’t work properly with HTTPS #8691
  • test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment #8692
  • Segfault during forced repair termination #8693
  • Stability: a memory leak in the memory allocator code causes ScyllaDB to exit in different places:
    • During backup #9192 #9544;
    • During regular compaction #9508;
    • After manager repair and full scan range #9821;
    • During removenode #9825
  • Install: scylla_raid_setup: failed due to the “mdmonitor.service: Failed with result 'protocol'” on CentOS8 #9540
  • Install: wrong permissions for scylla-housekeeping related files #9683
  • Stability: unhandled exception from B-tree’s insert_before() #9728
  • Alternator: Inappropriate error message for UpdateTable on non-existent table #9747
  • Stability: ScyllaDB crashed while ‘nodetool stop SCRUB’ running #9766
  • Stability: storage_service: Future from seastar::get_units is not waited in node_ops_update_heartbeat and friends #9767
  • Stability: commitlog uses get_units() without waiting for it #9770
  • Stability: coredump during compaction on node with cdc. The root cause is a regression introduced in 4.6 as part of cache improvement #9915
  • Performance under low free memory condition: Segment allocation is slow when low on free memory due to current_backtrace() from maybe_dump_memory_diagnostics() #9982
  • Stability: a regression in range_tombstone_list causes an error: “Mutation read doesn't match any expected version” #9661
  • Stability: a regression in 4.6 led to a crash in commitlog: Segfault after failed update schema_version: mutation_write_timeout_exception #9955
  • Stability: a race condition in repair while updating RF at the same time #9751
  • Stability: encryption in transit between nodes may fail when one node is behind a NAT #9653
  • Stability: sstables/partition_index_cache: get_or_load() error handling triggers assert #9887
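
The Monitoring fix listed above (#9090) concerns a sum accumulated in a type too narrow for its total. The following is a minimal C++ sketch of that failure mode, not ScyllaDB's actual code: the function names and the sample counts are hypothetical, and it uses an unsigned 32-bit accumulator (rather than `int`) so that the wrap-around is well-defined and reproducible.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for per-shard partition counts; not ScyllaDB code.
using counts_t = std::vector<std::uint64_t>;

// Sums with a 32-bit accumulator: every partial sum is reduced modulo 2^32,
// so a large total silently wraps around.
std::uint32_t sum_narrow(const counts_t& counts) {
    return std::accumulate(counts.begin(), counts.end(), std::uint32_t{0},
                           std::plus<std::uint32_t>());
}

// Sums with a 64-bit accumulator, wide enough for the totals used here.
std::uint64_t sum_wide(const counts_t& counts) {
    return std::accumulate(counts.begin(), counts.end(), std::uint64_t{0},
                           std::plus<std::uint64_t>());
}
```

With two hypothetical shards reporting 3,000,000,000 partitions each, `sum_wide` returns 6,000,000,000, while `sum_narrow` returns 1,705,032,704 (6,000,000,000 minus 2^32), which is the kind of wrong metric the fix addresses.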

About Tzach Livyatan

Tzach Livyatan has a B.A. and MSc in Computer Science (Technion, Summa Cum Laude), and has had a 15 year career in development, system engineering and product management. In the past he worked in the Telecom domain, focusing on carrier grade systems, signalling, policy and charging applications.