Materialized views at Scylla Summit
We are excited to be hosting our first Scylla Summit on September 6th in San Jose. Many topics will be covered, including materialized views.
About materialized views
In Cassandra and Scylla, data is divided into partitions, which can be found by a partition key. Sometimes, the application needs to find a partition – or partitions – by the value of another column. Doing this efficiently, without scanning all the partitions requires indexing. Materialized Views, which we explain here, is one of the three indexing options supported in Cassandra 3. The other two indexing options are:
The biggest problem with most kinds of indexes, which are local to a node, is scalability: When a search is expected to yield just one (or few) partitions, we need to send it to all the nodes because we cannot know which one might hold the data (the location of the data is determined by the partition key, which we don’t know, not the column value we are searching).
So, “Materialized Views” is not a local index, but rather a global index: There is one big index which is distributed to the different nodes using the normal distribution scheme. Writes becomes more complicated (the data, and the index entry, will usually be on different nodes) but reading is more scalable.
But the “Materialized Views” feature is more than just an index: It doesn’t just list partition keys matching a column value (as you would expected from an index); Rather, it also keeps a subset of columns on those partitions. In essence, Materialized Views builds a new table with the indexed column as a partition key, and a user-chosen subset of columns as values.
People have been doing this—creating additional tables containing other “views” into the same data (and calling them “Materialized Views” or “denormalization”)—for a long time. The new feature enables the Scylla server to do it for you, without application support, automatically, safely and efficiently. In other words, the application just updates the base table, and the additional materialized-view tables gets updated automatically as needed.
There is no free lunch, though. Ensuring correctness and atomicity, as the materialized views are updated together with the tables, makes updates to a table with materialized views significantly slower than normal updates. This means that the additional performance provided by Scylla is not just a matter of incremental cost savings. Scylla performance allows application developers to take advantage of higher data consistency and speed of development. Reads, however, are just regular fast reads, so both read and write operations are scalable.
Cassandra’s “Materialized Views” feature was developed in CASSANDRA-6477 and explained in this blog entry and in the design document. Another good explanation of materialized views can be found in this blog entry. The Scylla version is compatible, but, as usual, faster. We’ll be discussing performance of materialized views at Scylla Summit.
Let’s look at a table of buildings: the key is the building’s name, and additional columns are each building’s location, date of building, and height in meters:
CREATE TABLE buildings ( name text, city text, built int, meters int, PRIMARY KEY (name) );
Let’s insert a few famous builds into this list. Each of the buildings in this example held the title of tallest-building-in-the-world when it was built:
INSERT INTO buildings (name, city, built, meters) VALUES ('Burj Khalifa', 'Dubai', 2010, 828); INSERT INTO buildings (name, city, built, meters) VALUES ('Shangai World Financial Center', 'Shanghai', 2008, 487); INSERT INTO buildings (name, city, built, meters) VALUES ('Taipei 101', 'Taipei', 2004, 449); INSERT INTO buildings (name, city, built, meters) VALUES ('Sears Tower', 'Chicago', 1974, 442); INSERT INTO buildings (name, city, built, meters) VALUES ('World Trade Center', 'New York City', 1972, 417); INSERT INTO buildings (name, city, built, meters) VALUES ('Empire State Building', 'New York City', 1931, 381); INSERT INTO buildings (name, city, built, meters) VALUES ('Chrysler Building', 'New York City', 1930, 283);
The table looks like this:
SELECT * FROM buildings; name | built | city | meters --------------------------------+-------+---------------+-------- Chrysler Building | 1930 | New York City | 283 Burj Khalifa | 2010 | Dubai | 828 World Trade Center | 1972 | New York City | 417 Shangai World Financial Center | 2008 | Shanghai | 487 Sears Tower | 1974 | Chicago | 442 Empire State Building | 1931 | New York City | 381 Taipei 101 | 2004 | Taipei | 449
Now, from this “base table”, let’s ask Scylla to automatically maintain a second table, a materialized view, for finding buildings by city. The new table will have the city as the partition key. But the city cannot be the entire key for each record (a building), because we can have multiple buildings in the same city – in the above example we have multiple buildings in New York. So we will have (city, name) as the primary key of the new materialized-view table: city is the partition key, and name is a clustering key:
CREATE MATERIALIZED VIEW building_by_city AS SELECT * FROM buildings WHERE city IS NOT NULL PRIMARY KEY(city, name);
One might observe that in this example data (city, built), or even just (built) could also have also served as a key because it also uniquely determines a building in this data set. However, Scylla has no way to guarantee that this remains so as more data is added. For example, one might add another building built in 1930 to the list (we’ll even do this below). The only key which is really guaranteed to be unique – and remain unique as additional data is added – is the original table’s key. So, as a rule, in a materialized view all the components of the original primary key of the table MUST also appear in the materialized view’s key. This is why we added name in to the view’s key in this example.
WHERE city IS NOT NULL filter in the snippet above ensures if a building includes a null value for “city”, it would not be added to the view table, because doing so would be illegal (A key component – either partition or clustering key – cannot be null). Adding this filter is mandatory. If you don’t, you will not be able to create the materialized view:
CREATE MATERIALIZED VIEW building_by_city AS SELECT * FROM buildings PRIMARY KEY(city, name); InvalidRequest: code=2200 [Invalid query] message="Primary key column 'city' is required to be filtered by 'IS NOT NULL'"
As expected, the new table looks like this:
SELECT * FROM building_by_city; city | name | built | meters ---------------+--------------------------------+-------+-------- New York City | Chrysler Building | 1930 | 283 New York City | Empire State Building | 1931 | 381 New York City | World Trade Center | 1972 | 417 Shanghai | Shangai World Financial Center | 2008 | 487 Chicago | Sears Tower | 1974 | 442 Dubai | Burj Khalifa | 2010 | 828 Taipei | Taipei 101 | 2004 | 449
This view contains all the columns in the base table because of the
SELECT * command we used to define the view.
Advantages for developers
If you’re used to making your data model more complicated in order to support a variety of queries, now materialized views can make your life much easier. Instead of doing multiple INSERTs, you can rely on Scylla to populate the materialized views for you. However materialized views are real tables, and do take up storage space.
And since materialized views are separate tables, they can be repaired separately from the base table.
Materialized views, like any other tables, can have their properties (such as compaction, compression, etc.) tuned with the
ALTER MATERIALIZED VIEW command.
You can even create a materialized view for an already populated base table. In that case, a background operation is started to populate the materialized view. While this happens, there will be a period during which queries against the materialized view may not return all results. When the build is complete, the system.built_materializedviews table on each node will be updated with the view’s name.
See you at Scylla Summit
Come to Scylla Summit on September 6th, in San Jose, California, to learn more about materialized views and other upcoming Scylla features—along with info on how companies like IBM, Outbrain, Samsung SDS, Appnexus, Hulu, and Mogujie are using Scylla for better performance and faster development. Meet Scylla developers and devops users who will cover Scylla design, best practices, advanced tooling and future roadmap items.
Going to Cassandra Summit? Add another day of NoSQL Scylla Summit takes place the day before Cassandra Summit begins and takes place at the Hilton San Jose, adjacent to the San Jose convention Center. Lunch and refreshments are provided.