
Thrift is the original Cassandra protocol. It’s a software framework for back-end services developed at Facebook, and now an Apache project. New Cassandra applications use the CQL query language, but because Thrift was heavily used in Apache Cassandra’s early years, several important integrations depend on it. Some examples are:
- KairosDB
- Presto (which uses both CQL and Thrift)
- TitanDb
- Hector (a Java Cassandra client used in KairosDB and elsewhere)
Newer integrations such as Apache Spark are all CQL. However, some integrations, and in-house applications at some companies also depend on Thrift. The ScyllaDB team has therefore decided not to require future ScyllaDB users to update their applications to CQL. We will instead support Thrift natively in ScyllaDB 1.3 and future versions.
ScyllaDB 1.3 is scheduled for release soon, and brings Thrift support to ScyllaDB.
Implementation
ScyllaDB is internally designed in a CQL-centric way. Instead of changing the internal model, the Thrift API is implemented on top of CQL: Thrift column families are mapped onto CQL tables, just as Thrift operations are implemented as operations on those tables.
To illustrate this, let’s consider a simple example of data that you might want to have in ScyllaDB: a list of stories for a social reader application. In Thrift, using the deprecated cassandra-cli
, we could define a story column family as follows:
create keyspace ks;
use ks;
create column family story
with key_validation_class = UTF8Type
and comparator = UTF8Type
and column_metadata = [
{column_name: date, validation_class: DateType },
{column_name: content, validation_class: UTF8Type }
];
Listing 1: A Thrift column family for news stories in a simple “social reader” application
This is a static column family, as it defines all columns up-front, almost like you would when defining a CQL table. Using cqlsh
we can describe the CQL table ScyllaDB creates from this column family:
cqlsh> desc ks.story;
CREATE TABLE ks.story (
key1 text PRIMARY KEY,
content text,
date timestamp
) WITH COMPACT STORAGE
Listing 2: The equivalent CQL table, created by ScyllaDB
This table maps precisely to the Thrift column family. Since we didn’t specify a key_alias
, ScyllaDB names the primary key key1
by default. We can use CQL to update the column’s name:
cqlsh> alter table ks.story rename key1 to url;
cqlsh> desc ks.story;
CREATE TABLE ks.story (
url text PRIMARY KEY,
content text,
date timestamp
) WITH COMPACT STORAGE
Listing 4: Renaming default column names with cqlsh
Not all data can be represented with static column families. Another column family is the feed (which, in the social reader, can be RSS feeds, lists of URLs posted to a social site such as Twitter, or lists of stories submitted by a user). A feed has a feed URL, a fetch time, and a story URL.
create column family feed
with key_validation_class = UTF8Type
and comparator = DateType
and default_validation_class = UTF8Type;
Listing 5: A column family for news feeds
This is a dynamic column family, as each row may contain completely different sets of columns. For example, a feed for a particular user will look different from another user’s. The underlying CQL table is the following:
cqlsh> desc ks.feed;
CREATE TABLE ks.feed (
key1 text,
column1 timestamp,
value text,
PRIMARY KEY (key1, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
Listing 6: The equivalent CQL table, created by ScyllaDB
In addition to having a partition key, the CQL table for a dynamic column family has a clustering key (something that doesn’t exist in Thrift) and a regular column, which respectively hold the names and values of dynamically inserted columns. Their names can be renamed as well, transparently to their usage from the Thrift API.
We can use cassandra-cli
to insert and query some data over Thrift:
set story['fakeurl']['content'] = 'fake post';
Value inserted.
Elapsed time: 7.85 msec(s).
set story['fakeurl']['date'] = '2016-07-25 02:26:04';
Value inserted.
Elapsed time: 1.21 msec(s).
list story;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: fakeurl
=> (name=content, value=fake post, timestamp=1469055076976000)
=> (name=date, value=2016-07-25 02:26Z, timestamp=1469055131379000)
The same results can be obtained by querying that data through CQL:
cqlsh:ks> select * from story;
url | content | date
---------+-----------+--------------------------
fakeurl | fake post | 2016-07-25 02:26:04+0000
While all tables created through the Thrift API are consumable from CQL, not all CQL tables are consumable from Thrift. Any CQL table not created with the with compact storage
option cannot be used from Thrift.
Performance
Thrift performance is on par with CQL. The following benchmark consists of running cassandra-stress
workloads against ScyllaDB — write, read and mixed workloads, in that order — each executing 10M ops (with 100 threads). We establish a baseline by running cassandra-stress
in CQL3 mode, and then repeat the workloads in Thrift mode. Note that cassandra-stress
ran on the same machine as ScyllaDB, with the ScyllaDB process restricted to 4 CPUs with taskset
and the cassandra-stress
one to 12 CPUs; while the numbers don’t reflect what would be expected in a proper deployment, they nevertheless present a valid comparison of the two APIs.
Write workload:
+--------+--------+
| CQL3 | Thrift |
+-------------------+--------+--------+
| Rate | 135865 | 129880 |
+-------------------+--------+--------+
| Latency 99th %ile | 2.8 | 2.9 |
+-------------------+--------+--------+
| Latency max | 251.9 | 314.9 |
+-------------------+--------+--------+
Read workload:
+--------+--------+
| CQL3 | Thrift |
+-------------------+--------+--------+
| Rate | 115270 | 116725 |
+-------------------+--------+--------+
| Latency 99th %ile | 1.7 | 1.7 |
+-------------------+--------+--------+
| Latency max | 71.0 | 28.9 |
+-------------------+--------+--------+
Mixed workload:
+-------+--------+
| CQL3 | Thrift |
+-------------------+-------+--------+
| Rate | 88077 | 89306 |
+-------------------+-------+--------+
| Latency 99th %ile | 5.3 | 5.9 |
+-------------------+-------+--------+
| Latency max | 197.6 | 201.7 |
+-------------------+-------+--------+
The specs of the server are:
- Fedora 24
- Core I7-5960X @ 3Ghz (C-states disabled, XMP and Turbo enabled)
- 64GB DDR4 2800Mhz
- Samsung 950 Pro PCIe 256GB SSD
There are no independent benchmarks yet. Watch this space for more info on ScyllaDB’s Thrift performance.
Limitations
ScyllaDB supports the commonly used features of Thrift, but some functionality is not present in this release. Namely, we do not support:
- Super columns;
- Mixed column families, where you dynamically add columns to a static column family, are not supported;
- Compressed query strings for the
execute_cql3_query
andprepare_cql3_query
verbs; - Composite regular column names and values (for static column families).
Thrift-related issues are tagged [thrift-api] on the ScyllaDB project on GitHub, and a full list of currently open issues can be seen here.
In the next installment of this series, we cover KairosDB, an open source time series database, and how it works with ScyllaDB’s Thrift support. (Part 2)
Thrift at ScyllaDB Summit
Thrift is one of many topics to be covered at the upcoming ScyllaDB Summit. Come to ScyllaDB Summit on September 6th, in San Jose, California, to learn more about Thrift and other new and upcoming ScyllaDB features—along with info on how companies like IBM, Outbrain, Samsung SDS, Appnexus, Hulu, and Mogujie are using ScyllaDB for better performance and faster development. Meet ScyllaDB developers and devops users who will cover ScyllaDB design, best practices, advanced tooling and future roadmap items.
Going to Cassandra Summit? Add another day of NoSQL ScyllaDB Summit takes place the day before Cassandra Summit begins and takes place at the Hilton San Jose, adjacent to the San Jose convention Center. Lunch and refreshments are provided.