See all blog posts

Large Partitions Support in ScyllaDB 2.3 and Beyond

Large partitions, although supported by ScyllaDB, are also well known for causing performance issues. Fortunately, release 2.3 comes with a helping hand for discovering and investigating large partitions present in a cluster — system.large_partitions table.

Large partitions

CQL, as a data modeling language, aims towards very good readability and hiding unneeded implementation details from users. As a result, sometimes it’s not clear why a very simple data model suffers from unexpected performance problems. One of the potential suspects might be large partitions. Our blog entry on large partitions contains a detailed explanation on why coping with large partitions is important. We’ll use some of the same example tables from this article below.

The following table could be used in a distributed air quality monitoring system with multiple sensors:

CREATE TABLE air_quality_data (
   sensor_id text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY (sensor_id, time)
);

With time being our table’s clustering key, it’s easy to imagine that partitions for each sensor can grow very large – especially if data is gathered every couple of milliseconds. Given that there is a hard limit on the number of clustering rows per partition (2 billion) this innocent looking table can eventually become unusable. In this example, in about 50 days.

A standard solution is to amend the data model to reduce the number of clustering keys per partition key. In this case, let’s take a look at amended table air_quality_data:

CREATE TABLE air_quality_data (
   sensor_id text,
   date text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY ((sensor_id, date), time)
);

After the change, one partition holds the values gathered in a single day, which makes it less likely to overflow.

system.large_partitions table

Amending the data model may help with large partition issues. But sometimes you have such issues without realizing it. It’s useful to be able to see which tables have large partitions and how many of them exist in a cluster.

In order to track how many large partitions are created and to which table they belong, one can use the system.large_partitions table, which is implicitly created with the following schema:

CREATE TABLE system.large_partitions (
keyspace_name text,
table_name text,
sstable_name text,
partition_size bigint,
partition_key text,
compaction_time timestamp,
PRIMARY KEY ((keyspace_name, table_name), sstable_name, partition_size, partition_key)

) WITH CLUSTERING ORDER BY (sstable_name ASC, partition_size DESC, partition_key ASC);

How it works

Partitions are written to disk during memtable flushes and compaction. If, during any of this action, a large partition is written, an entry in system.large_partitions will be created (or updated). It’s important to remember that large partition information is updated when a row is actually written to disk; changes might not be visible immediately after acknowledging a write operation by the user, since data could still reside in a memtable for some time.

Each entry in system.large_partitions table represents a partition written to a given sstable. Note that large_partitions table is node-local – querying it will return large partition information only for the node that serves the request.

Listing all local large partition info can be achieved with:

SELECT * FROM system.large_partitions;

Checking large partitions for a specific table:

SELECT * FROM system.large_partitions WHERE keyspace_name = 'ks' and table_name = 'air_quality_data';

Listing all large partitions in a given keyspace that exceeded 140MB:

SELECT * FROM system.large_partitions WHERE partition_size > 146800640 ALLOW FILTERING; *

  • Note: ALLOW FILTERING support is not part of 2.3; it will be present in the next release

Listing all large partitions compacted today:

SELECT * FROM system.large_partitions WHERE compaction_time >= toTimestamp(currentDate()) ALLOW FILTERING; *

  • Note: ALLOW FILTERING support is not part of 2.3; it will be present in the next release

Since system.large_partitions can be read just like a regular CQL table, there are many more combinations of queries that return helpful results. Remember that keyspace_name and table_name act as the partition key, so some more complex queries, like the last example above, may involve filtering (hence the appended ALLOW FILTERING keywords). Filtering support for such queries is not part of 2.3 and will be available in the next release.

Aside from table name and size, system.large_partitions contains information on the offending partition key, when the compaction that led to the creation of this large partition occurred, and its sstable name (which makes it easy to locate its filename).

Configuration

For both readability and performance reasons, not all partitions are registered in system.large_partitions table. The threshold can be configured with an already existing parameter in scylla.yaml:

compaction_large_partition_warning_threshold_mb: 100

Previously, this configuration option was used to trigger a warning and logged each time a large-enough partition was written.

The large partition warning threshold defaults to 100MiB, which implies that each larger partition will be registered into system.large_partitions table the moment it’s written, either because of memtable flush or as a result of compaction.

If the default value is not sufficient for a specific use case, e.g. even 1MiB partitions are considered “too big” or, conversely, virtually every partition is bigger than 100MiB, it you can modify compaction_large_partition_warning_threshold_mb accordingly.

Disabling system.large_partitions can effectively be done by setting the threshold to an extremely high value, say, 500GiB. However, it’s highly recommended to leave it at a reasonable level. Better safe than sorry.

In order to prevent stale data from appearing in system.large_partitions, each record is inserted with time-to-live of 30 days.

Conclusion

We promised back in 2016 for Release 1.3 that we’d continue to improve support for large partitions. This improvement for 2.3 is a follow-through on that commitment. As you can see, we already have some next-steps planned out with future support for ALLOW FILTERING.

For now, we’d like for you to try system.large_partitions, and let us know what you find. Are you already aware of large partitions in your database, or did it help you discover anything about your data you didn’t already know?

If large partitions are critical to you, feel free to contact us with your war stories and requirements, or bring them up when you see us at ScyllaDB Summit this November.

About Piotr Sarna

Piotr is a software engineer who is keen on open-source projects and the Rust and C++ languages. He previously developed an open-source distributed file system and had a brief adventure with the Linux kernel during an apprenticeship at Samsung Electronics. He's also a long-time contributor and maintainer of ScyllaDB, as well as libSQL. Piotr graduated from University of Warsaw with an MSc in Computer Science. He is a co-author of the "Database Performance at Scale" book.