Stop Wasting Scylla’s CPU Time by Not Being Prepared

Subscribe to Our Blog


What’s the deal with prepared statements?

A query itself is just a string of text. For example:

INSERT INTO tb (key,val) VALUES (“key”, “value”)

In this simple example, we inserted two strings in a two-column table. Before that can happen, the CQL statement string (INSERT INTO…) needs to be sent to Scylla, parsed, and assuming no errors in the query, executed. It’s the parsing part that we are concerned with here. Parsing a CQL query is a compute-intensive operation that consumes resources just like anything else you would have a computer do. What if we could do the parsing part once and keep that parsed output around somewhere for future use?

How does Scylla handle prepared statements?

When asked to prepare a CQL statement, a client library will send a CQL statement to Scylla. Scylla will then create a unique fingerprint for that CQL statement by MD5 hashing the CQL statement. Scylla will use this hash to check its query cache to see if it has already cached that CQL statement. If Scylla had seen that CQL statement, it will send back a reference to that cached CQL statement. If Scylla does not have that unique query hash in its cache, it will then proceed to parse the query and insert the parsed output into its cache.

INSERT INTO tb (key, val) VALUES (?,?)

The client will then be able to send an execute request specifying the statement id and providing the (bound) variables.

Let’s take a look at a few code sample excerpts for Java/Python/Go for preparing a statement, binding it, and executing it.

In Java (using the DataStax Java Driver):

In Python (using the DataStax Python Driver):

In Go (using the GoCQL Driver):

Here is how Prepared Statements help:

  • The preprocessing of the statement is done only once
  • The meta information returned with the identifier allows the client (driver) to direct the executed statement to a node that owns that part of the data (saving a hop)
  • Aside of performance, prepared statements also provide better security—they protect against SQL (CQL) injection

Let’s test it out:

  • Start Scylla
  • Setup for test:
    • Run “cassandra-stress write n=1000000 -pop seq=1..1000000”
    • Run “cassandra-stress read n=1000000 -pop seq=1..1000000”
  • Non-prepared test:
    • Run “cassandra-stress read duration=120s -mode native unprepared cql3 -pop seq=1..1000000 -rate threads=20”
      Prepared test:
    • Run “cassandra-stress read duration=120s -pop seq=1..1000000 -rate threads=20”

This is what I got on my laptop:

Results Unprepared Prepared Gain
op rate 13037 19331 48.28%
partition rate 13037 19331 48.28%
row rate 13037 19331 48.28%
latency mean 1.5 1 33.33%
latency median 1.5 1 33.33%
latency 95th percentile 1.8 1.2 33.33%
latency 99th percentile 2.1 1.4 33.33%
latency 99.9th percentile 3.6 2.8 22.22%

You can check the current status of your application using Scylla exposed metrics via our monitoring stack. In a healthy system with a lot of traffic, scylla_query_processor_statements_prepared (the number of statements that have been prepared) the rate of increase over time should be mostly 0, changing only when there is a new statement or connection. Compared to the number of scylla_transport_requests_served (the number of requests processed), the scylla_query_processor_statements_prepared should be low.

Sample graphs:

This graph aligns with the Unprepared column in the table above where each request (green line) prior to execution had to be parsed and prepared (yellow line) in Scylla.

This graph aligns with the Prepared column in the table above where prior to executing the load, the statements have been prepared (the yellow line is at 0).

Preparing of the statements can be seen when displaying only the scylla_query_processor_statements_prepared (it’s a scale issue). The “read” statements are prepared once for each open connection and then used throughout the load.

Another method to check the current status of your application is to get a tcp dump and run it via wireshark—you can check if your EXECUTE/QUERY statements are passing a prepared statement id or have complete strings.

Sample Unprepared Statement Packet:

Sample Prepared Statement Packet:

Scylla has contributed parts of the CQL dissector support for Wireshark. Wireshark 2.2 and onwards support all Scylla versions to date.

Please note that we have many enhancements queued to further improve the performance of prepared statements. To set up your code for future improvements, use prepared statements whenever you execute a query more than once.

This a recap of a talk I gave at Scylla Summit 2017. You can find this talk and others on our Tech Talk page.

Shlomi LivneAbout Shlomi Livne

Shlomi Livne, VP of R&D at ScyllaDB, has 15 years of experience in software development of large-scale systems. Previously he has led the research and development team at Convergin, which was acquired by Oracle. Shlomi holds a BA and MSc in Computer Science from the Technion-Machon Technologi Le' Israel.

Tags: cql, deep-dive, performance, tutorial