See all blog posts

ScyllaDB Open Source 5.2: With Raft-Based Schema Management

ScyllaDB Open Source 5.2 is now available. It is a feature-based minor release that builds on the ScyllaDB Open Source 5.0 major release from 2022. ScyllaDB 5.2 introduces Raft-based strongly consistent schema management, DynamoDB Alternator TTL, and many more improvements and bug fixes. It also resolves over 100 issues, bringing improved capabilities, stability and performance to our NoSQL database server and its APIs.

DOWNLOAD SCYLLADB OPEN SOURCE 5.2 NOW

In this blog, we’ll share the TL;DR of what the move to Raft cluster management means for our users (two more detailed Raft blogs will follow), then highlight additional new features that our users have been asking about most frequently. For the complete details, read the release notes.

Strongly Consistent Schema Management with Raft

Consistent Schema Management is the first Raft based feature in ScyllaDB, and ScyllaDB 5.2 is the first release to enable Raft by default. It applies to DDL operations that modify the schema – for example, CREATE, ALTER, or DROP for KEYSPACE, TABLE, INDEX, UDT, MV, etc.

Unstable schema management has been a problem in past Apache Cassandra and ScyllaDB versions. The root cause is the unsafe propagation of schema updates over gossip: concurrent schema updates can lead to schema collisions.

Once Raft is enabled, all schema management operations are serialized by the Raft consensus algorithm. Quickly assembling a fresh cluster, performing concurrent schema changes, updating node’s IP addresses – all of this is now possible with ScyllaDB 5.2.

More specifically:

  • It is now safe to perform concurrent schema change statements. Change requests don’t conflict, get overridden by “competing” requests, or risk data loss. Schema propagation happens much faster since the leader of the cluster is actively pushing it to the nodes. You can expect the nodes to learn about new schema in a healthy cluster in under a few milliseconds (used to be a second or two).
  • If a node is partitioned away from the cluster, it can’t perform schema changes. That’s the main difference, or limitation, from the pre-Raft clusters that you should keep in mind. You can still perform other operations with such nodes (such as reads and writes) so data availability is unaffected. We see results of the change not only in simple regression tests, but in our longevity tests which execute DDL. There are fewer errors in the log and the systems running on Raft are more stable when DDL is running.

For more details, see the related Raft blog.

Moving to Raft-Based Clusters: Important User Impacts

Starting with the 5.2 release, all new clusters will be created with Raft enabled by default from 5.2 on. Upgrading from 5.1 will use Raft only if you explicitly enable it (see the upgrade docs). As soon as all nodes in the cluster opt-in to using Raft, the cluster will automatically migrate those subsystems to using Raft (you should validate that this is the case).

Once Raft is enabled, every cluster-level operation – like updating schema, adding and removing nodes, and adding and removing data centers – requires a quorum to be executed. For example, in the following use cases, the cluster does not have a quorum and will not allow updating the schema:

  • A cluster with 1 out of 3 nodes available
  • A cluster with 2 out of 4 nodes available
  • A cluster with two data centers (DCs), each with 3 nodes, where one of the DCs is not available

This is different from the behavior of a ScyllaDB cluster with Raft disabled.

Nodes might be unavailable due to network issues, node issues, or other reasons. To reduce the chance of quorum loss, it is recommended to have 3 or more nodes per DC, and 3 or more DCs, for a multi-DCs cluster. To recover from a quorum loss, it’s best to revive the failed nodes or fix the network partitioning. If this is impossible, see the Raft manual recovery procedure.

The docs provide more details on handling failures in Raft.

TTL for ScyllaDB’s DynamoDB API (Alternator)

ScyllaDB Alternator is an Amazon DynamoDB-compatible API that allows any application written for DynamoDB to be run, unmodified, against ScyllaDB. ScyllaDB supports the same client SDKs, data modeling and queries as DynamoDB. However, you can deploy ScyllaDB wherever you want: on-premises, or on any public cloud. ScyllaDB provides lower latencies and solves DynamoDB’s high operational costs. You can deploy it however you want via Docker or Kubernetes, or use ScyllaDB Cloud for a fully-managed NoSQL database-as-a-service solution.

In ScyllaDB 5.0, we introduced Time To Live (TTL) to Alternator as an experimental feature. In ScyllaDB 5.2, we promoted it to production ready. Like in DynamoDB, Alternator items that are set to expire at a specific time will not disappear precisely at that time but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable – it defaults to 24 hours but can be set with the –alternator-ttl-period-in-seconds configuration option.

MORE ON SCYLLADB AS A DYNAMODB ALTERNATIVE

Large Collection Detection

ScyllaDB has traditionally recorded large partitions, large rows, and large cells in system tables so they can be identified and addressed. Now, it also records collections with a large number of elements since they can also cause degraded performance.

Additionally, we introduced a new configurable warning threshold:

compaction_collection_elements_count_warning_threshold – how many elements are considered a “large” collection (default is 10,000 elements).

The information about large collections is stored in the large_cells table, with a new collection_elements column that contains the number of elements of the large collection.

Automating Away the gc_grace_seconds Parameter

Optional automatic management of tombstone garbage collection, replacing gc_grace_seconds, is now promoted from an experimental feature to production ready. This drops tombstones more frequently if repairs are made on time, and prevents data resurrection if repairs are delayed beyond gc_grace_seconds. Tombstones older than the most recent repair will be eligible for purging, and newer ones will be kept. The feature is disabled by default and needs to be enabled via ALTER TABLE.

For example:

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

MORE ON REPAIR BASED TOMBSTONE GARBAGE COLLECTION

Preventing Timeouts When Processing Long Tombstone Sequences

Previously, the paging code required that pages had at least one row before filtering. This could cause an unbounded amount of work if there was a long sequence of tombstones in a partition or token range, leading to timeouts. ScyllaDB will now send empty pages to the client, allowing progress to be made before a timeout. This prevents analytics workloads from failing when processing long sequences of tombstones.

Secondary Index on Collections

Secondary indexes can now index collection columns. This allows you to index individual keys as well as values within maps, sets, and lists.

For example:

CREATE TABLE test(int id, somemap map<int, int>, somelist<int>, someset<int>, PRIMARY KEY(id));
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);

SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;

Additional Improvements

The new release also introduces numerous improvements across:

  • CQL API
  • Amazon DynamoDB Compatible API (Alternator)
  • Correctness
  • Performance and stability
  • Operations
  • Deployment and installations
  • Tools
  • Configuration
  • Monitoring and tracing

For complete details, see the release notes.

READ THE RELEASE NOTES

About Tzach Livyatan

Tzach Livyatan has a B.A. and MSc in Computer Science (Technion, Summa Cum Laude), and has had a 15 year career in development, system engineering and product management. In the past he worked in the Telecom domain, focusing on carrier grade systems, signalling, policy and charging applications.