ScyllaDB 1.3 has introduced better support for large partitions. It is an important feature which simplifies data modeling so that it can be more focused on the client’s needs and less on the server limitations and ways to work around them. Moreover, issues related to large partitions are not just failed requests and server crashes caused by the node running out of memory, before reaching that point cluster may experience various performance problems, something much harder to diagnose.
Large and small in the wild (source: wikipidia)
Large partitions and the problems they create
In order to better demonstrate how easy it is to create large partitions and why it is important for the database to handle them well we will use an example with ScyllaDB storing air quality information, such as carbon monoxide concentration in parts per million, coming from multiple sensors. The table looks like this:
CREATE TABLE air_quality_data ( sensor_id text, time timestamp, co_ppm int, PRIMARY KEY (sensor_id, time) );
In this model each sensor has its own partition to which CO readings are being added periodically as clustering rows. Since
time is a clustering key we automatically get such entries sorted by the database. The problem is that such partitions may grow very large, well above the hard limit of 2 billion clustering rows per partition. The standard solution in such cases is to split the partition into smaller fragments. For example:
CREATE TABLE air_quality_data ( sensor_id text, date text, time timestamp, co_ppm int, PRIMARY KEY ((sensor_id, date), time) );
Each partition stores only entries for a single day and therefore the number of clustering rows was limited. Unfortunately, in previous ScyllaDB releases this still didn’t guarantee that the size of the partition won’t cause any issues. Partitions several megabytes big were enough to cause many undesirable effects, ranging from excessive cache eviction and limited concurrency to memory allocation failures in the most extreme cases. It is not an easy task to determine the maximum safe size of a partition as it varies greatly on many external factors such as the available memory on a node, the processes happening in the background and the characteristics of the load the cluster is under. Moreover, embedding such implementation details in data model should be avoided as it is quite fragile and may easily change.
In this particular example, just storing more information, like NO2 concentration or humidity, or increasing the resolution or readings will significantly change the size of created partitions. Not to mention that such fragmentation of data requires appropriate logic on the client side to handle it properly.
One can also easily imagine an example, when instead of periodical readings, unpredictable events are stored in the database. For example, we could have sensors detecting lightning strikes. Attempting, to limit partition size by remodelling data is inelegant and may affect performance, it is especially wasteful if the majority of partitions are small and few grow large enough to cause problems.
Improvements in ScyllaDB 1.3
While it is still recommended to keep the size of partitions small, we understand that it is not always possible to achieve without serious trade-offs. That is why ScyllaDB 1.3 has gained the proper support of large partitions. This was achieved by changing the database engine so that most of the internal operations are performed with a clustering row granularity and do not require the whole partition to fit in memory anymore. The fact that majority of internal processes need to keep only fixed number of clustering rows in memory means that SSTable compaction, streaming and repair can quite easily cope even with partitions that are larger than the available memory. Queries still need to be able to store their result in memory but as long as paging is used or they are limited either by providing row limit or start and end bound they are not going to cause troubles regardless of the partition size, although there is still room for improvement in that area. Moreover, during queries only data relevant for the actual result is being read which is going to improve performance in all cases when only part of the partition is returned.
Unfortunately, there are still certain limitations to the ability of handling such partitions. They are going to be addressed in the future releases. Some of them are direct consequences of guarantees ScyllaDB strives to provide, such as special handling of partitions larger than 128 kB while streaming data between nodes in a cluster. That special handling may cause the receiving side to create a large number of small SSTables and incur performance penalty until they are compacted away (issue #1440). In addition to that, as of now, cache still works on full partitions and in order to enable large partition support partitions which size crosses certain, configurable (default: 10 MB) threshold are not kept in cache at all (issue #960). Consequently, all accesses to such partitions would result IO operations which may be especially visible when read page size is much smaller than the number of clustering rows in the partition. Finally, at the moment reversed queries (i.e. queries with
ORDER BY clause specifying different order than the one during column family creation) need to keep whole partition in the memory which means that their ability to support large partitions is severely limited (issue #1413).
No matter how good computers get at performing tasks given to them, it is always a good thing to keep an eye on them. Useful tools to estimate the size of partitions are
nodetool cfstats and
nodetool cfhistograms. They provide estimates of the sizes of the partition thus giving better insight into the kind of data ScyllaDB is dealing with. There is also a counter
cache/uncached_wide_partitions which provides information about partitions that were considered too large to be put into the cache. This information may be very useful to ensure that the threshold above which partitions are not cached is configured properly.
Good support for large partitions is important since in many cases they are the most natural way to model the data and cannot be avoided without significant effort and additional complexity. Release 1.3 introduces that to ScyllaDB, which now is able to perform much even when facing partitions larger than the available memory. There are still some limitations though, but the future releases will see even more improvements in that area.
Since this article was originally published, a great deal more work has gone into large partition support in ScyllaDB. Read more: