ScyllaDB Open Source Release 4.6

By Tzach Livyatan

February 11, 2022

The ScyllaDB team is pleased to announce the release of ScyllaDB Open Source 4.6, a production-ready release of our open source NoSQL database.

ScyllaDB 4.6 includes ARM support, new replace-node streaming, a new restore (load and stream) operation, and other performance and stability improvements and bug fixes (below).

Find the ScyllaDB Open Source 4.6 repository for your Linux distribution here. ScyllaDB 4.6 Docker is also available.

Only the latest two minor releases of the ScyllaDB Open Source project are supported. From now on, only ScyllaDB Open Source 4.6 and 4.5 are supported. Users running ScyllaDB Open Source 4.4 and earlier are encouraged to upgrade to one of these two releases.

Many of the new features below will be discussed at the upcoming virtual ScyllaDB Summit 2022, Feb 9-10.

New Features in ScyllaDB 4.6

ARM Support

ScyllaDB 4.6 supports the ARM architecture, including:

EC2 ARM-based AMI, ready for Graviton2
Running ScyllaDB Docker on ARM, including Mac M1

Repair Based Node Operations (RBNO)

Repair Based Node Operations (RBNO) was introduced as an experimental feature in ScyllaDB 4.0. It uses repair to stream data for node operations like replace, bootstrap, and others. While still considered experimental, we continue to work on this feature.

Repair is oriented towards moving small amounts of data, not an entire node’s worth. This resulted in many SSTables being created in the node, creating a large compaction load. To fix that, off-strategy compaction is now used to efficiently compact these SSTables without impacting the primary workload. #5226

In 4.6, RBNO is enabled by default only for the replace node operation.

Example from scylla.yaml:

enable_repair_based_node_ops: true
allowed_repair_based_node_ops: replace

To enable other operations (experimental), add them as a comma-separated list to allowed_repair_based_node_ops. Available operations are: bootstrap, replace, removenode, decommission, and rebuild.

#8013 PR#9197

For more about “Repair Based Node Operations” see the ScyllaDB Summit 2022 session by Asias He.

Service Level Properties

Service Levels allow the user to attach attributes to roles and users. These attributes apply to each session the user opens to ScyllaDB, enabling granular control of session properties, like timeouts and load shedding (overload handling).

So far, service levels have been used to implement Workload Prioritization in ScyllaDB Enterprise.

In this release, service levels are merged into ScyllaDB Open Source to implement two features:

Per service level timeouts
Workload types

Note that Workload Prioritization will remain an Enterprise-only feature.

Per service level timeouts

You can now create service levels with customized read and write timeouts and attach them to roles and users. This is useful when some workloads, like ETL, are less sensitive to latency than others.

For example:

CREATE SERVICE LEVEL sl2 WITH timeout = 500ms;
ATTACH SERVICE LEVEL sl2 TO scylla;
ALTER SERVICE LEVEL sl2 WITH timeout = null;

#7913, PR#7617, PR#8763

Workload types

It’s possible to declare a workload type for a service level. Three values are currently available:

unspecified – generic workload without any specific characteristics; default
interactive – workload sensitive to latency, expected to have high/unbounded concurrency, with dynamic characteristics, OLTP; example: users clicking on a website and generating events with their clicks
batch – workload for processing large amounts of data, not sensitive to latency, expected to have fixed concurrency, OLAP, ETL; example: processing billions of historical sales records to generate useful statistics

Declaring a workload type provides more context for ScyllaDB to decide how to handle the sessions. For instance, if a coordinator node receives requests with a rate higher than it can handle, it will make different decisions depending on the declared workload type:

For batch workloads it makes sense to apply back pressure – the concurrency is assumed to be fixed, so delaying a reply will likely also reduce the rate at which new requests are sent;
For interactive workloads, backpressure would only waste resources – delaying a reply does not decrease the rate of incoming requests, so it’s reasonable for the coordinator to start shedding surplus requests.
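The decision described above can be sketched as a small Python model. This is only an illustration of the reasoning, not ScyllaDB's actual coordinator code: the function name, the string results, and the handling of the unspecified type are all assumptions made for the example.

```python
from enum import Enum


class WorkloadType(Enum):
    UNSPECIFIED = "unspecified"
    INTERACTIVE = "interactive"
    BATCH = "batch"


def overload_action(workload: WorkloadType, overloaded: bool) -> str:
    """Hypothetical model of how a coordinator might react to one request."""
    if not overloaded:
        return "serve"
    if workload is WorkloadType.BATCH:
        # Fixed concurrency: delaying the reply slows the sender down.
        return "delay"
    if workload is WorkloadType.INTERACTIVE:
        # Unbounded concurrency: delaying does not reduce the incoming
        # rate, so surplus requests are dropped instead.
        return "shed"
    # Assumption: no workload-specific handling for the unspecified type.
    return "default"


if __name__ == "__main__":
    for wt in WorkloadType:
        print(wt.value, "->", overload_action(wt, overloaded=True))
```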

Example

ALTER SERVICE LEVEL sl WITH workload_type = 'interactive';

Reverse Queries

A reverse query is a SELECT query that reads rows in the reverse of the clustering order defined in the table schema. If no order was defined, the default order is ascending (ASC).

For example, the following table schema orders the rows in a partition by “time” in an ascending order:

CREATE TABLE heartrate (
  pet_chip_id uuid,
  owner uuid,
  time timestamp,
  heart_rate int,
  PRIMARY KEY (pet_chip_id, time)
);

The following SELECT worked in ScyllaDB 4.5 but might be very inefficient:

SELECT * FROM heartrate WHERE pet_chip_id = ? ORDER BY time DESC LIMIT 1;

Improving the performance of reverse queries is an ongoing process, with the following updates in ScyllaDB 4.6:

The internal layer for managing queries now supports reversed queries natively. This lays the groundwork for reversed reads in memtables, cache, and SSTables, so that reversed queries will perform efficiently.
The mx SSTable reader (reading mc and md format SSTables) can now read partitions in reversed order. This is a step towards supporting reversed reads of large partitions.
Memtables now efficiently support reversed reads (for CQL WITH CLUSTERING ORDER). Together with the already merged SSTable reversed reader, reversed reads with BYPASS CACHE are now more efficient, especially in terms of memory consumption.

SSTable Index Caching

Up to this release, ScyllaDB only cached data from SSTables.

As a result, if the data was not in cache, readers had to touch the disk while walking the index. This was inefficient, especially for large partitions, increasing the load on the disk and adding latency.

In ScyllaDB 4.6, index blocks can be cached in memory, shared between readers, populated on access, and evicted on memory pressure, reducing I/O and decreasing latency. #7079

For more information, see Tomasz Grabiec's ScyllaDB Summit session, “SSTable Index Caching”.
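As a rough illustration of "populated on access, evicted on memory pressure", here is a minimal Python sketch of a shared block cache. It assumes a simple LRU policy with a fixed block budget, which is a simplification chosen for the example. The class name, the `read_block_from_disk` callback, and the block keys are hypothetical and do not reflect ScyllaDB's implementation.

```python
from collections import OrderedDict


class IndexBlockCache:
    """Toy cache of index blocks shared by all readers (LRU is an assumption)."""

    def __init__(self, capacity_blocks, read_block_from_disk):
        self._capacity = capacity_blocks
        self._read = read_block_from_disk  # hypothetical disk-read callback
        self._blocks = OrderedDict()
        self.disk_reads = 0

    def get(self, sstable, block_no):
        key = (sstable, block_no)
        if key in self._blocks:
            # Cache hit: no disk I/O, just mark the block as recently used.
            self._blocks.move_to_end(key)
            return self._blocks[key]
        # Cache miss: populate on access.
        self.disk_reads += 1
        block = self._read(sstable, block_no)
        self._blocks[key] = block
        # Stand-in for memory pressure: evict least recently used blocks.
        while len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)
        return block


if __name__ == "__main__":
    cache = IndexBlockCache(2, lambda s, b: f"{s}:{b}")
    cache.get("sst-1", 0)  # first reader: goes to disk
    cache.get("sst-1", 0)  # second reader: served from memory
    print("disk reads:", cache.disk_reads)
```

Two readers asking for the same block cause a single disk read; a third, distinct block beyond the budget evicts the least recently used one.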

ScyllaDB 4.6 Updates

Install, deployment and packaging

Deprecated:
- Ubuntu 16 support. Ubuntu 16.04 is EOL and will not be supported following 4.6.
- Debian 9 support. Debian 9 is EOL and will not be supported following 4.6.
Newer Linux distributions use systemd-timesyncd instead of ntp or chrony for time synchronization. This is now supported. ScyllaDB will now detect if you already have time synchronization set up and leave it alone if so. #8339
scylla_setup script now supports disabling the NVMe write-back cache on disks that allow it. This is useful to reduce latency on Google Cloud Platform local disks. The machine images built using scylla-machine-image will do this automatically.
The Unified (tarball) Installer now works correctly when SELinux is enabled. #8589
The Docker image base has been switched from CentOS 7 to Ubuntu 20.04, similar to the ScyllaDB AMI and GCP images. PR#8849
The installer now offers to set up RAID 5 on the data disks in addition to RAID 0; this is useful when the disks can have read errors, such as on GCP local disks. #9076
The install script now supports supervisord in addition to systemd. This was brought in from the container image, where systemd is not available, and is useful in some situations where root access is not available.
Automatic I/O configuration during setup now supports AWS ARM instances. #9493
The setup utility now recognizes Persistent Disks on Google Cloud Platform and Azure. PR#9395 PR#9417

Raft

We are building up an internal Raft service in ScyllaDB, for use by several internal applications. The changes have no visible effect yet. Among other changes, the following were added:

The Raft implementation now updates RPC about new and removed nodes.
It is now required to enable Raft with a configuration item (experimental: raft). This lets the implementation mature without requiring backwards compatibility efforts. #9239
The schema for storing Raft snapshots has been updated to avoid blobs.

CDC

Change Data Capture (CDC) now fills in the cdc$deleted_columns column in the pre-image correctly.

Alternator

Alternator is ScyllaDB’s implementation of the DynamoDB API.

Alternator now has a rudimentary, experimental implementation of the TTL expiration service. PR#9624
Fixed: ConditionExpression wrongly compared two non-existent attributes. #8511
Fixed: incorrect set equality comparison inside a nested document. #8514
Fixed: incorrect inequality check of two sets. #8513
Alternator now includes the username in trace records. #9613
Alternator now supports the ADD operation in UpdateItem. #5893

Guardrails

We continue to add default limitations (guardrails) to ScyllaDB, making it harder for users to use non-production settings by mistake. Each new restriction parameter (of type tri_mode_restriction) has three options:

True: restricted, disable risky feature
False: non restricted, enable risky feature
Warn: non restricted, log warning about risky feature

Additions in this release are:

It’s now possible to prevent users from using SimpleStrategy replication, using the config parameter restrict_replication_simplestrategy.
The goal is to first default to a warning and later default to actual prevention. SimpleStrategy can make it hard to later grow the cluster by adding data centers. #8586
DateTieredCompactionStrategy has long been deprecated in favor of TimeWindowCompactionStrategy. A new warning will let you know if you are still using it. If you are nostalgic for the old strategy, use “restrict_dtcs” to disable this warning. #8914
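The three-way behavior of a restriction can be sketched as follows. This is a minimal Python model of the true / false / warn semantics listed above, not ScyllaDB code; the function name and the log message are assumptions made for illustration.

```python
import logging

log = logging.getLogger("guardrails-sketch")


def feature_allowed(mode: str, feature: str) -> bool:
    """Model of a tri-mode restriction: returns True if the feature may be used."""
    mode = mode.lower()
    if mode == "true":
        # Restricted: the risky feature is disabled.
        return False
    if mode == "warn":
        # Not restricted, but the use is logged.
        log.warning("%s is discouraged for production use", feature)
        return True
    if mode == "false":
        # Not restricted: the risky feature is enabled silently.
        return True
    raise ValueError(f"unknown restriction mode: {mode!r}")


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    for mode in ("true", "false", "warn"):
        print(mode, "->", feature_allowed(mode, "SimpleStrategy"))
```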

CQL

SELECT statements that used an index, and also restricted the token (e.g. SELECT ... WHERE some_indexed_column = ? AND token(pk) = ?) incorrectly ignored the token restriction. The issue was found by using Spark connector filtering on a secondary index. This is now fixed. #7043
User-Defined Aggregates (UDA) have been implemented. They allow the creation of custom aggregate functions, similar to the built-in count, min, and max. Note that UDA, like UDF, is experimental and can be enabled with the enable_user_defined_functions parameter in scylla.yaml. #7201
Selecting a partition range with a slice/equality restriction on clustering keys (e.g. SELECT * FROM tab WHERE ck=?, with no partition key restrictions) now demands ALLOW FILTERING again (since this query can potentially discard large amounts of data without returning anything). To avoid breaking applications that accidentally did not specify ALLOW FILTERING, it will only generate a warning for now. #7608
ScyllaDB now correctly rejects CQL insert/update statements with NULLs in key columns. #7852
Queries that are performed using an index can now select Static Columns. #8869
User Defined Functions and Aggregates (UDF/UDA) now support WebAssembly in experimental mode. The bindings to ScyllaDB data types will likely change, but this is sufficient to play with.

Hinted Handoff API

Hinted Handoff is an anti-entropy mechanism to replay mutations to a node which was unreachable for some time.

A new HTTP API allows waiting for hinted handoff replay to complete. This can be used to reduce repair work.

/hints_manager/waiting_point (POST) – Create a sync point: given a set of target hosts, creates a sync point at the end of all HH queues pointing to any of the hosts.
/hints_manager/waiting_point (GET) – Wait or check the sync point: given a description of a sync point, checks if the sync point was already reached. If you provide a non-zero `timeout` parameter and the sync point is not reached yet, this endpoint will wait until the point is reached or the timeout expires.
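To make the sync point idea concrete, here is a small Python model of the semantics described above. It does not call the HTTP API, and all names (`HintQueue`, `create_sync_point`, `sync_point_reached`) and the integer positions are hypothetical. It only shows that a sync point records the end of each hint queue at creation time, and is reached once replay has passed those positions, regardless of hints written later.

```python
from dataclasses import dataclass


@dataclass
class HintQueue:
    written: int = 0   # position of the last hint written to the queue
    replayed: int = 0  # position up to which hints were replayed


def create_sync_point(queues, target_hosts):
    """Record the current end of every hint queue that targets the given hosts."""
    return {host: queues[host].written for host in target_hosts if host in queues}


def sync_point_reached(queues, sync_point):
    """True once replay has caught up with every recorded position."""
    return all(queues[host].replayed >= pos for host, pos in sync_point.items())


if __name__ == "__main__":
    queues = {"node1": HintQueue(written=10, replayed=4)}
    point = create_sync_point(queues, ["node1"])
    print(sync_point_reached(queues, point))  # replay still behind the point
    queues["node1"].written = 15              # later hints do not move the point
    queues["node1"].replayed = 10
    print(sync_point_reached(queues, point))
```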

scylla-sstable

scylla-sstable is a tool for examining the content of SSTables and executing various operations on them. The currently supported operations are:

dump – dumps the content of the sstable(s), similar to sstabledump;
dump-index – dumps the content of the sstable index(es), replacing scylla-sstable-index;
writetime-histogram – generates a histogram of all the timestamps in the sstable(s);
custom – a hackable operation for the expert user (until scripting support is implemented);
validate – validates the content of the sstable(s) with the mutation fragment stream validator, same as scrub in validate mode.

Performance Improvements

Streams of mutation data are represented in ScyllaDB by a flat_mutation_reader, which provides a means for a function to consume a stream. This was made faster, which improves operations such as flushing a memtable. See PR#8359 for micro benchmark results.
When ScyllaDB receives SSTables from outside the replica (such as via a repair operation, or after restoring a snapshot) it first reshapes them to conform to the compaction strategy.
Reshape was improved for:
- Leveled Compaction Strategy, by checking first if the SSTables happen to be disjoint.
- Time Window Compaction Strategy, by reducing write amplification.
More code paths can now work with non-contiguous memory for table columns and intermediate values: comparing values, the CQL write path. This reduces CPU stalls due to memory allocation when large blobs are present. PR#8357
SSTable parser will avoid large allocations, reducing latency spikes. #6376, #7457
Repair is now delayed until hints for that table are replayed. This reduces the amount of work that repair has to do, since hint replay can fill in the gaps that a downed node misses in the data set. #8102
SSTables will now automatically choose a buffer size that is compatible with achieving good latency, based on disk measurements by iotune.
The setup scripts will now format the filesystem with 1024 byte blocks if possible. This reduces write amplification for lightweight transaction (LWT) workloads.
Fixed a read latency increase after deleting a high percentage of the data: reads through the row cache of many rows covered by a single range tombstone were very slow. #8626
Authentication had a 15 second delay at startup, working around dependency problems. It has long been unnecessary and is now removed, speeding up node start.
Unintended quadratic behavior in the log-structured allocator (which manages ScyllaDB memtable and cache memory) has been fixed. #8542
Off-strategy compaction is now enabled for repair. After repair completes, the SSTables generated by repair will first be merged together, then incorporated into the set of SSTables used for serving data. This reduces read amplification due to the large number of SSTables that repair can generate, especially for range queries where the bloom filter cannot exclude those SSTables. #8677
Off-strategy compaction, a method by which SSTables are reshaped to fit the compaction strategy, is now enabled for bootstrap and replace operations using standard streaming. #8820
The read path has been optimized to remove unnecessary work, leading to a small performance increase.
The common case of single-partition query was treated as an IN query with a 1-element tuple. This case is now specialized to avoid the extra post-processing work.
SSTable index files are now cached, both at the page level and at an object level (index entry). This improves large partition workloads as well as intermediate size workloads where the entire SSTable index can be cached. #7079
The row cache behavior was quadratic in certain cases where many range tombstones were present. This has been fixed. #2581
Recently the SSTable index has gained the ability to use the cache to reduce I/O; but it did so even when BYPASS CACHE was requested in the CQL statement. The index now respects BYPASS CACHE like data access.
After adding a node, a cleanup process is run to remove data that was “moved” to the new node. This is a compaction process that compacts only one SSTable at a time. This fact was used to optimize cleanup. In addition, the check for whether a partition should be removed during cleanup was also improved. #6807
ScyllaDB uses reader objects to read sequential data. It caches those readers so they can be reused across multiple pages of the result set, eliminating the overhead of starting a new sequential read each time. However, this optimization was missed for internal paging used to implement aggregations (e.g. SUM(column)). ScyllaDB now uses the optimization for aggregates too. #9127
There is now an effective replication map structure, which contains the application of a replication strategy and its parameters to a topology (node->token mapping). This reduces the amount of run-time computation needed by the coordinator.
Time Window Compaction Strategy reshape gained two optimizations. Reshape happens when changing compaction strategies or after streaming data to a new node. The optimizations reduce write amplification and therefore the time spent when adding a new node.
In Time Window Compaction Strategy compactions, fully expired SSTables will be compacted separately, since that can be done by just dropping them. #9533
Scrub compaction, which re-sorts unsorted SSTables, will now use memtables as a sorting mechanism instead of generating many small SSTables. PR#9548
Usually a memtable is flushed into one SSTable. In some cases, however, it can be flushed into several SSTables. We now adjust the partition estimate for these SSTables so the bloom filters allocated by these SSTables will not occupy too much memory. #9581
Size-tiered compaction strategy will prefer compactions with larger fan-in in order to improve efficiency. Moreover, once a compaction with large fan-in is started, compactions with lower fan-in will be delayed in order to improve overall write amplification.
Major compaction will now process tables from smallest to largest, to increase the probability of success in case the node is running low on space.
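The size-tiered fan-in preference mentioned in the list above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual strategy code: buckets are plain lists of SSTable names, `min_threshold` is a hypothetical minimum bucket size, and "delaying" is modeled by returning `None`.

```python
def pick_compaction(buckets, running_fan_ins, min_threshold=2):
    """Pick the next bucket to compact, preferring the largest fan-in.

    buckets: list of buckets, each a list of similarly sized SSTables.
    running_fan_ins: fan-ins of compactions already in progress.
    Returns the chosen bucket, or None if the job should wait.
    """
    candidates = [b for b in buckets if len(b) >= min_threshold]
    if not candidates:
        return None
    best = max(candidates, key=len)
    # While a compaction with larger fan-in is running, delay smaller ones.
    if running_fan_ins and len(best) < max(running_fan_ins):
        return None
    return best


if __name__ == "__main__":
    buckets = [["a", "b"], ["c", "d", "e", "f"]]
    print(pick_compaction(buckets, running_fan_ins=[]))
    print(pick_compaction([["a", "b"]], running_fan_ins=[4]))
```

Merging more SSTables in one pass means each byte is rewritten fewer times overall, which is why larger fan-in lowers write amplification.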

Stability Improvements

Thrift now has partial admission control, to reduce the chance of server overload.
A recent regression caused requests to data centers where the local replication factor is zero to crash. This is now fixed. #8354
A bug in Time Window Compaction Strategy’s selection of SSTables for single-partition reads caused SSTables that did not have data for the key to be consulted, reducing performance. This is now fixed. #8415
Continuing on the path of allowing non-contiguous allocations for large blobs, memory linearizations have been removed from Change Data Capture (CDC). This reduces CPU stalls when CDC is used in conjunction with large blobs. #7506
The underlying JSON parser used by Alternator, RapidJSON, has been hardened against out-of-memory errors. #8521
A bug in the row cache that can cause large stalls on schemas with no clustering key has been fixed.
ScyllaDB uses a log-structured memory allocator (LSA) for memtable and cache. Recently, unintentional quadratic behavior in LSA was discovered, so as a workaround the memory reserve size is decreased. Since the quadratic cost is in terms of this reserve size, the bad behavior is eliminated. Note the reserves will automatically grow if the workload really needs them. #8542
Repair allocates working memory for holding table rows, but did not consider memory bloat and could over-allocate memory. It is now more careful. #8641
Failure detection is now done directly by nodes pinging each other rather than through the gossip protocol. This is more reliable and the information is available more rapidly. Impact on networking is low, since ScyllaDB implements a fully connected mesh in all clusters smaller than 256 nodes per datacenter, much larger than the typical cluster. #8488
Change Data Capture (CDC) uses a new internal table for maintaining the stream identifiers. The new table works better with large clusters. #7961
ScyllaDB will now close the connection when a too-large request arrives; previously ScyllaDB would read and discard the request. The new behavior protects against having to read and discard potentially gigabytes of data. #8798
A SELECT statement could result in unbounded concurrency (leading to an out-of-memory error) in some circumstances; this is now fixed. #8799
Speculative retry will now no longer consider failed responses when calculating the 99th percentile of a request. Considering them leads to the latency threshold being continuously raised, since failed (timed out) requests have very long latency; so the effectiveness of the feature is much reduced. Now it will only consider successful responses. #3746 #7342
The bootstrap process was made more robust over failures to communicate the token ranges to the new node. Such problems will be resolved more fully with Raft. #8889
A regression in how ScyllaDB computes compaction backlog has been fixed. Compaction backlog estimates the total number of bytes that remain to be compacted (which can be greater than the table size due to write amplification). ScyllaDB then uses the backlog to decide how to allocate resources to compaction. The bug caused the backlog to be overestimated, thus devoting too many resources (CPU and disk bandwidth) to compaction. #8768
Logging of corrupted SSTables has been improved to report the file name and offset of the corrupted chunk.
A problem with COMPACT STORAGE tables representation in Change Data Capture has been corrected. #8410
The reader concurrency semaphore is ScyllaDB’s main replica-side admission control mechanism. It tracks query memory usage and timeouts, and delays or rejects queries that will overload memory. Previously, it was engaged only on a cache miss. It is now engaged before the cache is accessed. This prevents some cache intensive workloads from overloading memory. These workloads typically have queries that take a long time to execute and consume a lot of memory, even from cache, so their concurrency needs to be limited. #4758 #5718
The limit of 10,000 connections per shard has been raised to 50,000 connections per shard, and made tunable with the max-networking-io-control-blocks Seastar configuration option. #9051
The currenttime() and related functions were incorrectly marked as deterministic. This could lead to incorrect results in prepared statements. They are now marked as non-deterministic. #8816
If ScyllaDB stalls while reclaiming memory, it will now log memory-related diagnostics so it is easier to understand the root cause.
Repair now reads data with a very long timeout (instead of infinite timeout). This is a last-resort defense against internal deadlocks. #5359
The cache abstraction that is used to implement the prepared statement cache has been made more robust, eliminating cases where the entry expired prematurely. #8920
When streaming data from a Time Window Compaction Strategy table, ScyllaDB segregates the data into the correct time window to follow the compaction strategy rules. This, however, causes a large number of SSTables to be created, overloading the node in extreme cases. We now postpone data segregation to a later phase, off-strategy compaction, which happens after the number of SSTables has been reduced. #9199
Scrub compactions no longer purge tombstones. This is in order to give a tombstone hiding in a corrupted SSTable the chance to get compacted with data.
Scrub compactions are now serialized with regard to other compactions, to prevent an SSTable undergoing compaction from not being scrubbed. #9256
The CQL protocol server now responds with an overloaded exception if it decides to shed requests. This avoids a resource leak at the client driver side, when the driver may be waiting for the response indefinitely. #9442
A scrub compaction takes an SSTable with corruption problems and splits it into non-corrupted SSTables. This can result in an SSTable being split into a large number of new SSTables. In this case the bloom filter of the new SSTables could occupy too much memory and run the node out of memory. We now reduce the bloom filter size dynamically to avoid this. #9463
When rewriting SSTables (e.g. for the nodetool upgradesstables command), ScyllaDB will now abort the rewrite if shutting down.
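The speculative retry fix listed above is easy to see with numbers. The following Python sketch uses a nearest-rank percentile and made-up latency samples (both assumptions for the example) to show how a few timed-out requests inflate the 99th percentile, and how filtering them out keeps the threshold meaningful.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def speculative_threshold(samples, pct=99):
    """Latency threshold computed from successful responses only.

    samples: iterable of (latency_ms, succeeded) pairs.
    Returns None if there is no successful sample yet.
    """
    successful = [latency for latency, ok in samples if ok]
    return percentile(successful, pct) if successful else None


if __name__ == "__main__":
    # 98 successful requests (1..98 ms) and 2 timeouts at 5000 ms.
    samples = [(ms, True) for ms in range(1, 99)] + [(5000, False)] * 2
    print("all responses:", percentile([ms for ms, _ in samples], 99))
    print("successful only:", speculative_threshold(samples))
```

With the timeouts included, the threshold jumps to the timeout value itself, so a speculative retry would almost never fire.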

Infrastructure

Infrastructure for a new style of virtual tables has been merged. While ScyllaDB already supported virtual tables, it was hard to populate them with data. The new infrastructure reuses memtables as a simple way to populate a virtual table. #8343
The code to write ka/la format SSTables has been removed. ScyllaDB can now only write mc/md format SSTables. Note that ScyllaDB can still read ka/la format SSTables. #8352
The commitlog code has been converted to coroutines in order to simplify further work. PR#8954
ScyllaDB uses a home-grown Interface Definition Language (IDL) to help preserve compatibility during upgrades. It now supports generating RPC verbs, not just data types.
There is a new internal keyspace, system_distributed_everywhere, which is used to propagate internal information that needs higher consistency and bandwidth than gossip. The first user will be Change Data Capture internal data. PR#8457
The Redis protocol server relied on code that was copy-pasted from the CQL transport server. The two implementations are now unified into a generic tcp server.
The internal representation of range tombstones is changing. Instead of a {start, end} pair, after which rows that may be affected by the range tombstone (and rows that are not) are emitted, we emit a {start} marker, then all affected rows, then the {end} marker. This fits well with ‘m’ format SSTables, and once the work is complete will improve workloads which have many range tombstones.
The SSTable parsers have been converted from a state machine to a C++ coroutine. There should be no user-visible effect, but the code complexity has been significantly reduced, from a tangle of switches and gotos to structured loops.
The system.status and system.describe_ring virtual tables have been renamed to the more descriptive and conventional names system.cluster_status and system.token_ring respectively. Since the tables are new, users are not affected.
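The change in range tombstone representation described above can be illustrated with a toy stream. In this Python sketch, clustering keys are integers, bounds are inclusive, and the fragment names (`rt_start`, `rt_end`, `row`) are hypothetical; it only shows the new ordering, where the affected rows sit between a start and an end marker.

```python
def to_marker_stream(tombstone, row_keys):
    """Emit rows in key order, bracketing affected rows with start/end markers.

    tombstone: (start, end) pair with inclusive bounds (an assumption).
    row_keys: clustering keys of the rows in the partition.
    """
    start, end = tombstone
    stream = []
    opened = closed = False
    for key in sorted(row_keys):
        if not opened and key >= start:
            stream.append(("rt_start", start))
            opened = True
        if opened and not closed and key > end:
            stream.append(("rt_end", end))
            closed = True
        stream.append(("row", key))
    # Close the markers if the rows ended before the range did.
    if not opened:
        stream.append(("rt_start", start))
    if not closed:
        stream.append(("rt_end", end))
    return stream


if __name__ == "__main__":
    for fragment in to_marker_stream((3, 5), [1, 3, 4, 7]):
        print(fragment)
```

A consumer of this stream knows a row is covered by the tombstone simply because it arrives between the two markers, without comparing each row against a stored {start, end} pair.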

Security

Thrift listen port (start_rpc) is disabled by default, as it is not often used. Thrift is still fully supported. #8336

Configuration

New and updated configuration changes:

experimental: deprecated. This flag was used to enable all experimental features, which is not very useful. Instead, enable experimental features one by one. Possible values in this release include: 'lwt', 'cdc', 'udf', 'alternator-streams', 'raft'. #9467
enable_sstables_mc_format: parameter is now ignored; mc or md format is mandatory. ScyllaDB continues to be able to read older ka and la format SSTables. #8352
batch_size_warn_threshold_in_kb: from 5 to 128, batch_size_fail_threshold_in_kb: from 50 to 1024
The batch size warning and failure thresholds have been increased to 128 KiB and 1 MiB respectively. The original limits (5 and 50 KiB) were unnecessarily restrictive. #8416
max_hinted_handoff_concurrency: when a node is down, writes to that node are collected in hint files. When the node rejoins, hint files are replayed. It turned out that hint replay was too fast and interfered with the normal workload, so its concurrency was reduced and made configurable.
restrict_dtcs(false): see Guardrails above
experimental::alternator-ttl see above
reversed_reads_auto_bypass_cache(true) : Use new implementation of reversed reads in SSTables when performing reversed queries
replace_node and replace_token: these options were ignored and are now removed
flush_schema_tables_after_modification(true): Flush tables in the system_schema keyspace after schema modification. This is required for crash recovery, but slows down tests and can be disabled for them
strict_allow_filtering(warn): Match Cassandra in requiring ALLOW FILTERING on slow queries. Can be true, false, or warn. When false, ScyllaDB accepts some slow queries even without ALLOW FILTERING that Cassandra rejects. Warn is the same as false, but with a warning.
commitlog_use_hard_size_limit(false): Whether or not to use a hard size limit for commitlog disk usage. Default is false. Enabling this can cause latency spikes, whereas the default can lead to occasional disk usage peaks.
enable_repair_based_node_ops(true): Set to true to use repair based node operations instead of streaming based ones (more on RBNO above)
allowed_repair_based_node_ops(replace): A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild (more on RBNO above)
sanitizer_report_backtrace(false): In debug mode, report log-structured allocator sanitizer violations with a backtrace. Do not use this in production, as it slows ScyllaDB significantly.
restrict_replication_simplestrategy(false): see Guardrails above
failure_detector_timeout_in_ms(20,000ms): Maximum time, in milliseconds, between two successful echo messages before gossip marks a node down
wait_for_hint_replay_before_repair(true): If set to true, the cluster will first wait until the cluster sends its hints towards the nodes participating in repair before proceeding with the repair itself. This reduces the amount of data needed to be transferred during repair.
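The effect of the new batch size thresholds listed above can be checked with a short Python sketch. The function name and the strict "greater than" comparison are assumptions for illustration; only the threshold values (128 and 1024 KiB now, 5 and 50 KiB before) come from the text.

```python
KIB = 1024


def classify_batch(size_bytes, warn_kb=128, fail_kb=1024):
    """Classify a batch by size against warn/fail thresholds given in KiB."""
    if size_bytes > fail_kb * KIB:
        return "fail"
    if size_bytes > warn_kb * KIB:
        return "warn"
    return "ok"


if __name__ == "__main__":
    batch = 60 * KIB
    print("4.6 defaults:", classify_batch(batch))
    print("old defaults:", classify_batch(batch, warn_kb=5, fail_kb=50))
```

A 60 KiB batch, which would have been rejected under the old limits, now passes without even a warning.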

Build

The command “ninja help” will now list available targets when building ScyllaDB. PR#8454
It is now possible to build ScyllaDB under the Nix package manager.
There is a new process for building docker images, based on the buildah tool. The process makes it easier to build outside the continuous integration pipeline (e.g. for individual developers). PR#8730
The toolchain used to build ScyllaDB is now based on Fedora 34, with the clang 12 compiler.

Testing

The performance benchmark perf_simple_query now returns the number of allocations performed and tasks executed per query. These metrics are less noisy than raw performance and so can be used to quantify improvements more easily.
There is now rudimentary support for code-coverage reports in unit tests.
The perf_simple_query benchmark now reports how many instructions were executed by the CPU per query.
ScyllaDB comes with extensive integration with gdb, to make inspecting core dumps easier. This infrastructure now has a unit test.
There is a new type of compaction: validation. A validation compaction reads all SSTables and performs some checks, but writes nothing. This is useful to make sure all SSTables can be read and pass sanity checks. #7736

Tooling

sstablemetadata: add support to inspect SSTable using zstd compression #8887
CQLsh uses the ScyllaDB Python Driver instead of the Cassandra driver
There is now a crawling SSTable reader, that does not use the SSTable index file. This is useful in scenarios such as scrub, when the index file is suspected to be corrupted.

Additional Bug Fixes

Stability: LWT sporadic failure if a non-deterministic function is used to assign partition key #8604
Monitoring: API uses incorrect plus<int> to sum up cf.active_memtable().partition_count(), which can result in the value wrapping around if bigger than 2^32, and return the wrong metric. #9090
CQL: Multi column restrictions ignored on indexed queries #9085
Stability: Reads which page due to memory and need to reconcile may miss writes in the results #9119
Packaging: Docker run arguments not passed to scylla #9247
A performance regression in the memtable reader was fixed. The regression was introduced when adding support to reverse queries. #9502
UX: large data warnings do not contain the SSTable name #9524
More SSTable code need a close() method to avoid use-after-free bugs #1076
Make result of trichotomic comparison not convertible to bool #1449
Cache reads generate more range tombstones than necessary #2581
Change flat_mutation_reader::is_end_of_stream() to reflect the state of the consumer side #3067
Speculative reads based on read latency profile may cease to be effective temporarily #3746
Request for enhancement nodetool toppartitions – need to be more general #4520
cache-mostly read workloads can overload memory #4758
hints_manager – Exception when draining 10.0.63.76: std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/var/lib/scylla/hints/3/10.0.63.76]) #5087
reader_concurrency_semaphore: admit new reads only when all current ones are stalled on I/O #5718
Alternator – missing support for UpdateItem ADD operation #5893
Error applying view update (Operation timed out for mview...) #6187
Large allocation when calculating schema digest with 5000 stables #6376
Optimize partition filtering of cleanup compaction #6807
ScyllaDB vs Cassandra compatibility issue (spark connector filtering on secondary index) #7043
UDA creation error #7201
Transform mutation data into result_row_view when invoking expr::is_satisfied_by() #7215
C-s latency peaks up to 12 s with OPS fall down from 80k to 300 when ScyllaDB killed on one node. #7342
Avoid large contiguous allocation for large cells in the SSTable parser / writer #7457
CDC code linearizes all values #7506
Restriction on clustering-key only should require ALLOW FILTERING #7608
New need_filtering() returns false for a query that needs filtering #7708
nodetool scrub: add dry run flag #7736
Provide an option to bypass reshape on boot #7738
Add metrics for the amount of range tombstones processed during read #7749
Azure Ls v2 local disk setup #7807
Inserting a row with a null key column should be forbidden #7852
Frozen nested set may be created out-of-order via a prepared statement #7856
Introduce service levels #7867
Add per-service-level timeouts #7913
docs: moved latest_version to conf.py #7957
large allocation in untyped_result_set reading cdc generations (seastar_memory – oversized allocation: 2023424 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues) #7961
Cassandra stress fails to achieve consistency during replace node operation #8013
scylla_io_setup on GKE: found more than one disk mounted at root #8032
gossip: Introduce direct failure detector #8036
Replaying hints before repair #8102
Some IDL compiler cleanups #8148
scylla_setup: set XFS block size to 1k #8156
cql3: remove linearizations in the write path #8160
SSTables: remove large allocations when parsing cells #8182
If ALLOW FILTERING excludes a long string of rows, the scan can stop prematurely #8203
expr: give a name to a tuple of columns #8208
sstableloader doesn’t work with Alternator tables if “-nx” option is used #8230
The LWT stress is stopped during the prepare phase #8270
var-lib-scylla.mount mount process failed: can’t find UUID #8279
CDC generations: refactors and improvements #8283
storage_service/removenode: update gossiper state before excise #8284
dist/common/scripts/scylla_util.py doesn’t support Azure images #8325
repair: Switch to use NODE_OPS_CMD for replace operation #8330
scylla_ntp_setup: Use systemd-timesyncd instead of chrony when it’s installed by default #8339
Add infrastructure for virtual tables #8343
scylla_ntp_setup: support systemd-timesyncd #8344
Retire la/ka writer #8352
Unable to initialize new nodes on ScyllaDB 4.3.2: what(): Not an IPv4 address #8354
config: ignore enable_sstables_mc_format flag #8360
Commitlog can get stuck after reaching disk size limit, causing writes to time out #8363
dist: hardcoded product name on scylla_setup #8367
commitlog: Fix race and edge condition in delete_segments #8372
Handle commitlog recycle errors #8376
Stuck server while upgrading to 4.4 from 4.3.2 #8378
listen-address flag description saying `Never specify 0.0.0.0; it is always wrong.` is wrong #8381
Tracing can remain not shut down if start is aborted #8382
scylla_setup: stop hardcode product name on scylla_setup #8384
nonroot: when /var/log/journal exists, install.sh does not generate scylla_sysconfdir.py correctly #8385
nonroot: generate scylla_sysconfdir.py correctly #8386
hints: use token_metadata to tell if node has left the ring #8387
Unify CQL and Redis server code #8388
install.sh: document pathname components #8389
Spurious EINVAL errors from TLS sockets #8391
utils: fragment_range: fix FragmentedView utils for views with empty fragments #8397
Infinite loop in write_fragmented for empty fragments in non-empty views #8398
composite: replace enable_if with constraints #8399
treewide: don’t include “db/system_distributed_keyspace.hh” from headers #8403
types: convert has_empty predicate to a concept #8404
utils: rjson: convert enable_if to concept #8405
caching_options: detemplate from_map() #8406
caching_options.hh: move code to .cc #8408
config: correct description of listen_address #8409
coredump during requesting cdc log table #8410
“Reader was forwarded before returning partition start” Exception using TWCS in 4.4 #8411
Add a (failing) test for picking secondary indexes in order #8414
TimeWindowCompactionStrategy not using specialized reader for single partition queries #8415
Annoying warnings for small batches #8416
config: relax batch size warning and failure thresholds #8417
Reject replacing a node that has left the ring #8419
storage_service: Reject replacing a node that has left the ring #8420
test: perf: perf_fast_forward: report allocation rate and tasks #8422
test: perf: perf_simple_query: collect allocation and task statistics #8425
cdc: log: avoid linearizations #8429
build: drop lld from install-dependencies.sh on s390x #8430
scylla_ntp_setup: detect already installed ntp client #8431
Optimized TWCS single-partition reader opens SSTables unnecessarily #8432
SSTables: fix TWCS single key reader SSTable filter #8433
`cache_flat_mutation_reader` position-fast-forwards readers that are not inside a partition, if they immediately return eof #8435
stream_session uses escaped curly braces in format strings #8436
Make sure that cache_flat_mutation_reader::do_fill_buffer does not fast forward finished underlying reader #8437
Name resolution with tcp fails #8442
dist/common/scripts/scylla_util.py: GCP nvme: fail silently if you cannot figure out nvme count, or cannot figure out their size and fallback to iotune #8444
clustering_order_reader_merger may immediately return end-of-stream if some (but not necessarily all) underlying readers are empty #8445
clustering_order_reader_merger: handle empty readers #8446
time_series_sstable_set::create_single_key_sstable_reader may return an empty reader even if the queried partition exists (in some other SSTable) #8447
time_series_sstable_set: return partition start if some SSTables were ck-filtered out #8448
Avoid two compaction passes when reshaping STCS table for repair-based operations #8449
utils: data_input: replace enable_if with tightened concept #8450
hints: delay repair until hints are replayed #8452
Add a ninja help build target #8454
artifacts-centos7-test: scylla_util.py fails with – ModuleNotFoundError: No module named 'scylla_sysconfdir' #8456
sys_dist_ks: new keyspace for system tables with Everywhere strategy #8457
tracing: test/boost/tracing: fix use after free #8461
mislicensed source files #8465
treewide: correct mislicensed source files #8466
service level controller is used after stop #8468
qos: make sure to wait for service level updates on shutdown #8470
Switch to use NODE_OPS_CMD for decommission operation #8471
Switch to use NODE_OPS_CMD for bootstrap operation #8472
High latency during bootstrap, background reclaim ineffective #8473
main: start background reclaim before bootstrap #8474
commitlog_test: Add test for deadlock in shutdown with segment wait #8475
gossip: Relax failure detector update #8476
Cant install using sites default install pkg #8479
Switch to use NODE_OPS_CMD for decommission and bootstrap operation #8481
Ubuntu AMI: ScyllaDB server isn’t up after reboot #8482
unified: abort install when non-bash shell detected #8484
Improve validation of “enable”, “postimage” and “ttl” CDC options #8486
Introduce direct failure detector #8488
reader_concurrency_semaphore: readmission leaks resources #8493
scylla_raid_setup: Ubuntu does not able to enable mdmonitor.service #8494
dist: add DefaultDependencies=no to .mount units #8495
bootstrap_with_repair badly formatted exception: expected 1 node losing range but found more nodes={} #8503
repair: Handle everywhere_topology in bootstrap_with_repair #8505
main: add a debug symbol for service level controller #8506
Potential data corruption when reading range tombstone from MC sstable #8507
database::get_reader_concurrency_semaphore() uses system semaphore as the catch-all semaphore #8508
Materialized views: fix possibly old views coming from other nodes #8509
alternator ConditionExpression wrong comparison of two non-existent attributes #8511
scylla_io_setup: configure “aio-max-nr” before iotune #8512
Alternator incorrect inequality check of two sets #8513
Alternator – incorrect set equality comparison inside a nested document #8514
doc: Update bootstrap with everywhere_topology #8518
rapidjson’s default allocator does not handle allocation failures properly #8521
tools: toolchain: dbuild: define die() earlier #8523
treewide: remove inclusions of storage_proxy.hh from headers #8524
Calculate partition ranges from expr::expression #8525
dist: Add support for disabling writeback cache #8526
dist/debian: when PRODUCT != scylla, scylla-node-exporter.default does not added with node exporter package #8527
dist/debian: rename .default file correctly #8528
rjson: Add throwing allocator #8529
scylla_raid_setup: enabling mdmonitor.service on Debian variants #8530
Reshape may ignore overlapping in level L where L > 0 #8531
Cannot bootstrap a node in presence of Everywhere strategy tables with RBO enabled #8534
token_metadata: Fix get_all_endpoints to return nodes in the ring #8536
repair: remove partition_checksum and related code #8537
test: make rjson allocator test working in debug mode #8539
Forward-port service level fixes #8540
logalloc: reclaim_from_evictable evicts up to 3(N²) segments when asked to reclaim N segments #8542
Added r5b to ena instance_class. #8546
Alternator tracing does not include valid user info #8547
Add username to alternator tracing #8548
test: perf: don’t truncate allocation/req and tasks/req report #8550
gdb: Fix heapprof() dereferencing of backtrace #8551
Huge reactor stalls in token_metadata during topology changes when `EverywhereStrategy` tables are present #8555
SSTables: Add debug info when create_sharding_metadata generates zero ranges #8557
messaging service: be more verbose when shutting down servers and clients #8560
test: perf: report instructions retired per operations #8563
Tests for Alternator’s TTL feature #8564
cql3: Check if partition-key restrictions are all EQ at preparation time #8565
repeated message: service_level_controller – service level default was deleted #8567
cdc: log: fill cdc$deleted_ columns in pre-images #8568
TWCS: sometimes SSTables are not compacted together when time window finishes #8569
logalloc: reduce minimum lsa reserve in allocating_section to 1 #8572
Off-strategy compaction with LCS keeps reshaping the last remaining SSTable #8573
workload prioritization: Reduce the logging sensitivity to “glitches” in #8574
Service level controller: fix wrong default service level removal log #8576
ScyllaDB shutdown process stuck when Shutting down drain storage proxy #8577
commitlog: make_checked_file for segments, report and ignore other errors on shutdown #8578
Reactor stall in consume_mutation_fragments_until #8579
flat_mutation_reader: consume_mutation_fragments_until: maybe yield for each popped mutation_fragment #8580
cql/cdc_batch_delete_postimage_test – rename test files + fix result #8581
STCS get_buckets insertion algorithm may break the bucket invariant #8584
dist: scylla_raid_setup: reduce xfs block size to 1k #8585
Add option to forbid SimpleStrategy in CREATE KEYSPACE #8586
scylla_io_setup failed with error: seastar – Could not setup Async I/O on aws instances (r5, r5b) and gp3 ebs volumes #8587
Unified Installer: Incorrect file security context cause scylla_setup to fail #8589
TWCS: initialize _highest_window_seen #8590
storage_proxy::get_max_result_size() uses the unlimited max-size for service levels #8591
storage_proxy: use small_vector for vectors of inet_address #8592
Support Microsoft Azure snitch #8593
scylla_util.py: Fix Azure support for machine-image #8596
node doesn’t switch status to UP and normal after scylla had started on node, which was previously drained #8597
IndexInfo system table lists MV name instead of index name #8600
install.sh: apply correct file security context when copying files #8602
LWT sporadic failure if a non-deterministic function is used to assign partition key #8604
storage_proxy: place unique_response_handler:s in small_vector instead of std::vector #8606
repair: Wire off-strategy compaction for decommission #8607
wild pointer dereference during shutdown after failing to create SSTable component #8609
storage_service isolate log message should be an error #8610
logalloc: fix quadratic behavior of reclaim_from_evictable #8611
phased_barrier move-assignment may leak a gate entry #8613
storage_service: Delay update pending ranges for replacing node #8614
gossip: add local application state excess logging #8616
Handle excess logging of non existing local endpoint #8617
scylla-gdb unit test #8618
Creating a table which looks like a secondary index breaks the secondary index creation mechanism #8620
cql-pytest: add nodetool flush feature and use it in a test #8622
Overlapping range tombstones can lead to very slow reads and OOMs #8625
Reads of many rows covered by a single range tombstone which go through row cache are very slow #8626
additional sstable stats (#251) #8630
Fix index name conflicts with regular tables #8632
Description in scylla-fstrim.timer is incorrect #8633
db/virtual tables: Add infrastructure + system.status example table #8634
exception phased_barrier::advance_and_await is not handled #8636
consistency_level: deinline assure_sufficient_live_nodes() #8637
repair: Consider memory bloat when calculate repair parallelism #8641
cdc: use a new internal table for exchanging generations #8643
scylla-fstrim.timer: fix wrong description from ‘daily’ to ‘weekly’ #8644
Fix service level negative timeouts #8645
hints: make hints concurrency configurable and reduce the default #8646
abstract_replication_strategy: avoid reactor stalls in `get_address_ranges` and friends #8647
ScyllaDB unconditionally waits 15 seconds for auth to be initialized, which is wasteful #8648
auth: remove the fixed 15s delay during auth setup #8649
install.sh:set aio conf during installation #8650
Abort restore_replica_count when node is removed from the cluster #8651
repair: Consider memory bloat when calculate repair parallelism #8652
storage_service: Abort restore_replica_count when node is removed from the cluster #8655
perf: add alternator frontend to perf_simple_query #8656
Exceptions in resharding and reshaping are being incorrectly swallowed #8657
perf_fast_forward: report instructions per fragment #8660
unified/uninstall.sh: simplify uninstall.sh, delete all files correctly #8662
nonroot installation broken at 4.5.rc1 #8663
install.sh: fix not such file or directory on nonroot #8664
tree-wide: comments on deprecated functions to access global variables #8665
Mismatched types for base and view columns id: timeuuid and timeuuid #8666
Fix type checking in index paging #8667
cql3: represent lists as chunked_vector instead of std::vector #8668
row_cache_test: test_partition_range_population_with_concurrent_memtable_flushes fails with segmentation fault #8671
gdb: bypass unit test on non-x86 #8672
Ninja build feature request: Set packages to be created with architecture name #8675
Wire off-strategy compaction for regular repair #8677
repair: Wire off-strategy compaction for regular repair #8678
cql3: result_set: switch rows to chunked_vector #8679
Introduce per-service-level workload types and their first use-case – shedding in interactive workloads #8680
scylla_raid_setup: use /dev/disk/by-uuid to specify filesystem #8681
Enable -Wunused-private-field warning #8683
types: remove some dead code #8684
types: fix a case of type punning via union #8685
keys, compound: eliminate some careless copies of shared pointers #8686
alternator: executor: eliminate some pointless reserializations #8687
keys, compound: take the argument to from_single_value() by reference #8688
types: don’t linearize values in abstract_type::hash #8689
collection_mutation: don’t linearize collection values #8690
Alternator’s health-check request doesn’t work properly with HTTPS #8691
test: serialized_action_test: prevent false-positive timeout in test_phased_barrier_reassignment #8692
Segfault during forced repair termination #8693
Stability: a memory leak in the memory allocator code causes ScyllaDB to exit in different places:
- During backup #9192 #9544;
- During regular compaction #9508;
- After manager repair and full scan range #9821;
- During removenode #9825
Install: scylla_raid_setup: failed due to the “mdmonitor.service: Failed with result ‘protocol’” on CentOS8 #9540
Install: wrong permissions for scylla-housekeeping related files #9683
Stability: unhandled exception from B-tree’s insert_before() #9728
Alternator: Inappropriate error message for UpdateTable on non-existent table #9747
Stability: ScyllaDB crashed while ‘nodetool stop SCRUB’ running #9766
Stability: storage_service: Future from seastar::get_units is not waited in node_ops_update_heartbeat and friends #9767
Stability: commitlog uses get_units() without waiting for it #9770
Stability: coredump during compaction on node with cdc. The root cause is a regression introduced in 4.6 as part of cache improvement #9915
Performance under low free memory condition: Segment allocation is slow when low on free memory due to current_backtrace() from maybe_dump_memory_diagnostics() #9982
Stability: a regression in range_tombstone_list causes an error: “Mutation read doesn't match any expected version” #9661
Stability: a regression in 4.6 led to a crash in commitlog: Segfault after failed update schema_version: mutation_write_timeout_exception #9955
Stability: a race condition in repair while updating RF at the same time #9751
Stability: encryption in transit between nodes may fail when one node is behind a NAT #9653
Stability: sstables/partition_index_cache: get_or_load() error handling triggers assert #9887
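
The Monitoring fix #9090 listed near the top of this section is a typical narrow-accumulator bug: summing large counts into a 32-bit integer silently wraps. The sketch below emulates that effect in Python. The helper names and the sample counts are made up for illustration, and the actual fix in ScyllaDB is C++ code that is not shown here.

```python
def to_int32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def sum_narrow(counts):
    """Emulate accumulating into a 32-bit signed int, as plus<int> would."""
    total = 0
    for count in counts:
        total = to_int32(total + count)
    return total


def sum_wide(counts):
    """Accumulate without a width limit, which is what the metric needs."""
    return sum(counts)


# Hypothetical per-shard partition counts whose sum exceeds 2^32.
counts = [3_000_000_000, 1_500_000_000]
print(sum_wide(counts))    # 4500000000
print(sum_narrow(counts))  # 205032704
```

The narrow sum is off by exactly 2^32, which is why the reported metric looked plausible but was wrong.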
