
Enhancements to ScyllaDB’s Filtering Implementation

The ScyllaDB Open Source 3.0 release comes with enhanced filtering support. One of the reasons we're making this enhancement is the Spark-Cassandra connector's reliance on ALLOW FILTERING support when generating CQL queries. The first part of this post provides a quick overview of what filtering is and why it can be useful. Then we will discuss why it can hurt performance and recommend alternatives. Finally, we'll cover the caveats of ScyllaDB's filtering implementation.


Queries that may potentially hurt a ScyllaDB cluster’s performance are, by default, not allowed to be executed. These queries include those that restrict:

  • Non-key fields (e.g. WHERE v = 1)
  • Parts of primary keys that are not prefixes (e.g. WHERE pk = 1 and c2 = 3)
  • Partition keys with anything other than an equality relation (e.g. WHERE pk >= 1)
  • Clustering keys restricted by a range followed by other conditions (e.g. WHERE pk = 1 and c1 > 2 and c2 = 3)

ScyllaDB is expected to be compatible with Cassandra in qualifying queries for filtering.

ALLOW FILTERING is a CQL keyword that can override this rule, but for performance reasons, please use it with caution. Let’s take a look at an example scenario – a database designed for a restaurant that wants to keep all of their menu entries in one place.

You can use the following code snippets to build a sample restaurant menu. This example will serve as a reference in the following sections.
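The snippets below are a sketch of such a table, reconstructed from the queries used throughout this post; the sample rows are illustrative assumptions:

```cql
-- Partition key: category; clustering key: position
CREATE TABLE menu (
    category text,
    position int,
    name text,
    price float,
    PRIMARY KEY (category, position)
);

-- Illustrative sample data
INSERT INTO menu (category, position, name, price) VALUES ('starters', 1, 'pierogi', 9.0);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 3, 'herring in oil', 10.5);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 1, 'sour rye soup', 10.5);
```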

Now, with the test table initialized, let’s see which SELECT statements are potential filtering candidates. Queries based on primary key prefixes will work fine and filtering is not needed:

SELECT * FROM menu WHERE category = 'starters' and position = 3;

SELECT * FROM menu WHERE category = 'soups';

Now let’s take a look at queries below.

For an affordable meal:

SELECT * FROM menu WHERE price <= 10.5;
For one specific dish:

SELECT * FROM menu WHERE name = 'sour rye soup';

For all dishes that are listed first, but with a very specific price:

SELECT * FROM menu WHERE position = 1 and price = 10.5;

For cheap starters:

SELECT * FROM menu WHERE category = 'starters' and price <= 10;

Trying the queries above will result in an error message:

“Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.”

This error occurs because either a non-key field was restricted or a non-prefix part of the primary key was used in the statement. Some readers may spot the capitalized ALLOW FILTERING in the error message and deduce that appending it to the query is a great solution. But those who read the error message carefully will notice that the truly important words are not uppercase at all: they are performance unpredictability.

Filtering is performed only after all potentially matching rows are fetched. So, if a table contains 10 million rows and all of them are potential candidates, they will all be fetched. If filtering rules are then applied and only 9 million rows fit the filter, 10% of the fetched rows were read in vain. And if the filtering restrictions are very selective and only a single row matches, 9,999,999 rows were read for nothing. Querying several partitions makes it even worse: of the example queries above, only the last one (cheap starters) restricted the partition key, which makes it a little more efficient. All of the others involve fetching all of the partitions, which is very likely to be slow and to create unnecessary load for the whole cluster. That said, low selectivity queries can benefit from filtering compared to a secondary index-based search (more on this topic below), because a sequential scan of all of the values is faster than a huge number of random index-based seeks.

Filtering can be allowed by simply appending the ALLOW FILTERING keyword to queries:


SELECT * FROM menu WHERE name = 'sour rye soup' ALLOW FILTERING;

SELECT * FROM menu WHERE position = 1 and price = 10.5 ALLOW FILTERING;

SELECT * FROM menu WHERE category = 'starters' and price <= 10 ALLOW FILTERING;


Simply appending ALLOW FILTERING to queries should never be treated as a “rule of thumb” solution. It might be the best option for low selectivity queries, but it also might hurt performance cluster-wide if used incorrectly. The alternative ways described below should also be considered each time ALLOW FILTERING is discussed.

Schema Change

The first obvious thing to consider after seeing the “ALLOW FILTERING” error message is to change the data model. Let’s take a look at one of the example queries:

SELECT * FROM menu WHERE name = 'sour rye soup';

If the majority of queries involve looking up the name field, maybe it should belong to the key? With the table schema changed to:

CREATE TABLE menu (name text, category text, position int, price float, PRIMARY KEY(name));

queries that use the name field will not require filtering anymore.

Changing table schemas is usually not easy and sometimes it makes no sense at all because it would deteriorate performance for other important queries. That’s where secondary indexing may come to the rescue.

Secondary Indexes

Creating a secondary index on a field allows non-partition keys to be queried without filtering. Secondary indexing has its boundaries, e.g. it only works with equality restrictions (WHERE price = 10.5).

More information about secondary indexing can be found in the ScyllaDB documentation.

Creating an index on a name field makes it possible to execute our soup query without problems:

CREATE INDEX ON menu(name);
SELECT * FROM menu WHERE name = 'sour rye soup';

It’s worth noting that indexes come with their own performance costs: keeping them requires additional space, and querying them is not as efficient as querying by primary key. A secondary index needs to be queried first, and only then is a query against the base table constructed and executed, which means we end up with two queries instead of one. Writes to an indexed table are also slower, because both the original table and all of its indexes need to be updated. Still, if changing the data model is out of the question, indexing can be much more efficient than filtering, especially for highly selective queries, i.e. when only a few rows are read.

Finally, indexes and filtering do not exclude each other – it’s perfectly possible to combine both in order to optimize your queries. Let’s go back to another example:

SELECT * FROM menu WHERE position = 1 and price = 10.5;

If we suspect that not many dishes have the same cost, we could create an index on the price:

CREATE INDEX ON menu(price);

Now, in the first stage of query execution, this index will be used to fetch all of the rows with the specific price. Then, all of the rows with a position different than 1 will be filtered out. Note that ALLOW FILTERING needs to be appended to this query because filtering is still involved in its execution.
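With the index on price in place, the combined query is simply the earlier one with the keyword appended:

```cql
SELECT * FROM menu WHERE position = 1 AND price = 10.5 ALLOW FILTERING;
```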

Materialized Views

Another notable alternative to filtering is to use materialized views to speed up certain SELECT queries at the cost of more complicated table updates. A comprehensive description of how materialized views work (with examples) can be found here.
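As a sketch of this approach, a materialized view serving the by-name lookup from the earlier examples could be defined as follows (the view name is illustrative, and the base menu table is assumed to be keyed by (category, position)):

```cql
-- Every column of the base table's primary key must appear in the
-- view's primary key, and all view key columns must be non-null.
CREATE MATERIALIZED VIEW menu_by_name AS
    SELECT * FROM menu
    WHERE name IS NOT NULL AND category IS NOT NULL AND position IS NOT NULL
    PRIMARY KEY (name, category, position);

SELECT * FROM menu_by_name WHERE name = 'sour rye soup';
```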


A quick local test performed on the queries below shows the performance impact of filtering and secondary indexes when query selectivity is high. The test cluster consisted of 3 nodes with replication factor RF=1, and caches were disabled to ensure that rows would be read from an NVMe SSD instead of RAM. The example table was filled with 10,000 rows, and the following queries were run against it:


  • A – Based on partition key p1: SELECT * FROM TMCR WHERE p1 = 15534
  • B – Based on regular column r1: SELECT * FROM TMCR WHERE r1 = 15538
  • C – Based on regular column r2: SELECT * FROM TMCR WHERE r2 = 9
  • D – Based on regular column r2, sliced: SELECT * FROM TMCR WHERE r2 > 10000
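
The schema of the TMCR test table isn’t shown above; a minimal sketch consistent with these queries (the column types are assumptions) would be:

```cql
-- Hypothetical reconstruction of the benchmark table
CREATE TABLE TMCR (
    p1 int PRIMARY KEY,  -- partition key, used by query A
    r1 int,              -- regular column with distinct values, used by query B
    r2 int               -- regular low-cardinality column (always 9), used by queries C and D
);
```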

The table below shows the results of running 100 queries of each type:

Configuration/Query             A (p1 = x)   B (r1 = x)   C (r2 = x)   D (r2 > x)
Filtering, no indexes           0.14s        2.96s        3.63s        275K rows/s
With index on r1                N/A          0.14s        3.79s        3.10s
With index on r2                N/A          3.11s        13.75s       73K rows/s
With materialized view for r1*  N/A          0.14s        N/A          N/A
With materialized view for r2*  N/A          N/A          1.15s        869K rows/s

The first obvious conclusion is that the queries based on primary key are much faster than fetching and filtering all of the rows.

Another interesting observation is that the low selectivity query C (WHERE r2 = 9), which effectively fetches all rows, is much faster with filtering than indexes. At first glance, it may look like an anomaly, but it is actually expected – sequential reading and filtering of all of the rows are faster than a random index lookup.

Also, creating a specialized materialized view can be faster than indexing, since querying a materialized view doesn’t involve double lookups.

Finally, indexing a low cardinality column (query C, configuration with index on r2) is heavily discouraged, because it creates a single huge partition: in our example, all r2 values are equal to 9, and r2 becomes the partition key of the underlying index table. This local test shows that it is already slower than the other configurations, and the situation would get even worse on a real distributed cluster.

What’s Not Here Yet

ScyllaDB’s filtering implementation does not yet cover the following functionalities:

  • Support for CONTAINS restrictions on collections
  • Support for multi-column restrictions (WHERE (category, name) = ('soups', 'sour rye soup'))


The newly implemented filtering support allows certain queries to be executed once the ALLOW FILTERING keyword is appended to them. Filtering comes with a performance burden and is usually a symptom of data model design flaws, so the alternative solutions described in this blog post should be considered first.

Allow Filtering at ScyllaDB Summit 2018

At ScyllaDB Summit 2018, ScyllaDB engineer Piotr Sarna gave an in-depth presentation on how Materialized Views, Secondary Indexes, and ALLOW FILTERING were implemented, and how they can be used. Below is his ScyllaDB Summit video. You can also go to our Tech Talks section to see his presentation slides and all of the other videos and slides from the event.

Next Steps

  • Learn more about ScyllaDB from our product page.
  • Learn more about ScyllaDB Open Source Release 3.0.
  • See what our users are saying about ScyllaDB.
  • Download ScyllaDB. Check out our download page to run ScyllaDB on AWS, install it locally in a Virtual Machine, or run it in Docker.

Editor’s Note: This blog has been revised to reflect that this feature is now available in ScyllaDB Open Source Release 3.0.

About Piotr Sarna

Piotr is a software engineer who is keen on open-source projects and the Rust and C++ languages. He previously developed an open-source distributed file system and had a brief adventure with the Linux kernel during an apprenticeship at Samsung Electronics. He's also a long-time contributor and maintainer of ScyllaDB, as well as libSQL. Piotr graduated from University of Warsaw with an MSc in Computer Science. He is a co-author of the "Database Performance at Scale" book.