
CQL tracing in ScyllaDB

Tracing is a ScyllaDB tool that is meant to help developers and administrators analyze internal data flows in a ScyllaDB cluster. One example of such a flow is CQL request processing. Tracing works much as it does in Apache Cassandra, so Cassandra tracing advice that you find online should be helpful on ScyllaDB, too.

When to use CQL tracing to resolve a problem

If your replication factor is too low, or your queries require data transfers from a remote data center, you can see that with traces. Tracing shows exactly what kind of problem is present; the next step is to decide how to fix it.

Another example is network issues: If the network between two nodes is slow, the timestamps will show where the delay is happening.

[Photo: a bloodhound puppy]

Example of using tracing to identify a problem

Let’s trace a query on a cluster with 6 nodes, 3 per data center.
In this example, 127.0.0.1, .2, and .3 are in one data center, and .4, .5, and .6 are in the other. The trace shows that an entire query can be processed locally, without time-consuming cross-data-center traffic.

./bin/nodetool -p 7100 status
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID                               Rack
UN  127.0.0.4  164.73 KB  256     ?       95ff89a3-ea99-4266-bdf4-4d0a417b5232  RAC1
UN  127.0.0.5  151.16 KB  256     ?       654df685-f308-4b8c-802b-311971023dad  RAC1
UN  127.0.0.6  164.53 KB  256     ?       e6a4c46d-f744-4ec3-a6b9-c4852fa59030  RAC1
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID                               Rack
UN  127.0.0.1  178.73 KB  256     ?       05eb380a-a73e-487f-bce2-88c791d57c2c  RAC1
UN  127.0.0.2  151.15 KB  256     ?       a7945022-e1c6-47e2-95fd-e2c7142c649a  RAC1
UN  127.0.0.3  178.73 KB  256     ?       f0be2b5c-99c7-4b89-a0c5-7412bb361739  RAC1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless


cqlsh> CREATE KEYSPACE my_ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2 };
cqlsh> CREATE TABLE my_ks.my_cf1 (c1 text PRIMARY KEY, c2 text);
cqlsh> INSERT INTO my_ks.my_cf1 (c1, c2) VALUES ('key1', 'value1');
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> CONSISTENCY LOCAL_QUORUM ;
Consistency level set to LOCAL_QUORUM.
cqlsh> SELECT * from my_ks.my_cf1 where c1='key1';

 c1   | c2
------+--------
 key1 | value1

(1 rows)

Tracing session: e8adc0b0-58a8-11e6-a3d6-000000000000

 activity                                                                | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                      Execute CQL3 query | 2016-08-02 15:01:56.283000 | 127.0.0.1 |              0
                                           Parsing a statement [shard 0] | 2016-08-02 15:01:56.283494 | 127.0.0.1 |             --
                                        Processing a statement [shard 0] | 2016-08-02 15:01:56.283540 | 127.0.0.1 |             47
              Creating read executor ... repair decision: NONE [shard 0] | 2016-08-02 15:01:56.283591 | 127.0.0.1 |             97
                               sending a message to /127.0.0.3 [shard 0] | 2016-08-02 15:01:56.283605 | 127.0.0.1 |            112
                                   read_data: querying locally [shard 0] | 2016-08-02 15:01:56.283650 | 127.0.0.1 |            156
                 read_digest: message received from /127.0.0.1 [shard 0] | 2016-08-02 15:01:56.283768 | 127.0.0.3 |             10
read_digest handling is done, sending a response to /127.0.0.1 [shard 0] | 2016-08-02 15:01:56.283829 | 127.0.0.3 |             70
                     read_digest: got response from /127.0.0.3 [shard 0] | 2016-08-02 15:01:56.283987 | 127.0.0.1 |            494
                          Done processing - preparing a result [shard 0] | 2016-08-02 15:01:56.284036 | 127.0.0.1 |            542
                                                        Request complete | 2016-08-02 15:01:56.283557 | 127.0.0.1 |            557

(output trimmed at …)

For reads, we use LOCAL_QUORUM where possible, to avoid going to the remote data center. But now try it with the consistency level set to QUORUM:

cqlsh> CONSISTENCY QUORUM ;
Consistency level set to QUORUM.
cqlsh> SELECT * from my_ks.my_cf1 where c1='key1';

 c1   | c2
------+--------
 key1 | value1

(1 rows)

Tracing session: ef573a40-58a8-11e6-a3d6-000000000000

 activity                                                                | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                      Execute CQL3 query | 2016-08-02 15:02:07.460000 | 127.0.0.1 |              0
                                           Parsing a statement [shard 0] | 2016-08-02 15:02:07.460326 | 127.0.0.1 |             --
                                        Processing a statement [shard 0] | 2016-08-02 15:02:07.460373 | 127.0.0.1 |             47
              Creating read executor ... repair decision: NONE [shard 0] | 2016-08-02 15:02:07.460427 | 127.0.0.1 |            101
                  read_digest: sending a message to /127.0.0.3 [shard 0] | 2016-08-02 15:02:07.460433 | 127.0.0.1 |            107
                  read_digest: sending a message to /127.0.0.5 [shard 0] | 2016-08-02 15:02:07.460452 | 127.0.0.1 |            126
                                   read_data: querying locally [shard 0] | 2016-08-02 15:02:07.460455 | 127.0.0.1 |            130
                 read_digest: message received from /127.0.0.1 [shard 0] | 2016-08-02 15:02:07.460579 | 127.0.0.3 |             11
                 read_digest: message received from /127.0.0.1 [shard 0] | 2016-08-02 15:02:07.460595 | 127.0.0.5 |              8
read_digest handling is done, sending a response to /127.0.0.1 [shard 0] | 2016-08-02 15:02:07.460644 | 127.0.0.3 |             75
read_digest handling is done, sending a response to /127.0.0.1 [shard 0] | 2016-08-02 15:02:07.460647 | 127.0.0.5 |             60
                     read_digest: got response from /127.0.0.5 [shard 0] | 2016-08-02 15:02:07.460813 | 127.0.0.1 |            487
                     read_digest: got response from /127.0.0.3 [shard 0] | 2016-08-02 15:02:07.460840 | 127.0.0.1 |            515
                          Done processing - preparing a result [shard 0] | 2016-08-02 15:02:07.460874 | 127.0.0.1 |            548
                                                        Request complete | 2016-08-02 15:02:07.460573 | 127.0.0.1 |            573

(output trimmed at …)

The trace information shows traffic to the other data center. This was an artificial example to demonstrate tracing, but real-world ScyllaDB users run into similar problems. If a client driver is not configured properly, tracing will show excessive node-to-node (or worse, DC-to-DC) messaging. When we saw this client-side problem in production, the solution was to fix the client.

Probabilistic tracing

Tracing imposes a significant performance penalty on a cluster while it is enabled. Therefore, if we want to trace some ongoing workload (with many similar queries), we don't want to enable tracing for every request, but only for a (small) portion of them. This can be achieved with “probabilistic tracing”, which randomly chooses requests to trace with some defined probability.
For instance, if we want to trace 0.01% of all queries in the cluster, we set the tracing probability to 0.0001:

nodetool settraceprobability 0.0001
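
To turn probabilistic tracing off again, set the probability back to 0. Assuming your nodetool also ships the companion gettraceprobability command (Apache Cassandra's nodetool does), you can use it to check the currently configured value:

nodetool gettraceprobability
nodetool settraceprobability 0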

Cleaning up trace data

Trace data persists for 24 hours by default. If you want to clean it up sooner, you can truncate the system_traces tables using the TRUNCATE command:

TRUNCATE system_traces.sessions;

and

TRUNCATE system_traces.events;

Where tracing data goes

CQL tracing can generate large quantities of data, because every event on every node of the cluster can be logged. The good news is that ScyllaDB can handle the resulting heavy insert load. Trace data is inserted into the system_traces keyspace on your ScyllaDB cluster.

If you enable tracing and run a query, there will be at least one trace point at the beginning and one at the end, and most likely more events in between. Any query that requires communication with other nodes will hit trace points about messages being sent elsewhere and generate events there. Traces are generated on a node whenever anything happens there in the context of a query, and no trace information is sent from node to node; it is simply inserted into the system_traces keyspace.

The system_traces keyspace has two tables, and a default replication factor of 2.
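
You can inspect the full schema yourself from cqlsh with DESCRIBE KEYSPACE (the table definitions shown below omit the table options):

cqlsh> DESCRIBE KEYSPACE system_traces;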

CREATE TABLE system_traces.events (
    session_id uuid,
    event_id timeuuid,
    activity text,
    source inet,
    source_elapsed int,
    thread text,
    PRIMARY KEY (session_id, event_id)
)

CREATE TABLE system_traces.sessions (
    session_id uuid PRIMARY KEY,
    client inet,
    command text,
    coordinator inet,
    duration int,
    parameters map<text, text>,
    request text,
    started_at timestamp
)

Traces are created in the context of a tracing session. For instance, if we trace an INSERT CQL command, a tracing session with a unique ID (the session_id column in the tables above) will be created, and all trace points hit during execution will be stored in the context of this session. This defines the format in which tracing data is stored:

  • The sessions table contains a single row for each tracing session
  • The events table contains a single row for each trace point.

A tracing session always corresponds to a single CQL query.

The session ID of the query can be used to query the keyspace to get all tracepoints for that query, created by all replica nodes. The format of those tables is the same as in Apache Cassandra. With new versions of ScyllaDB, more traceable events are being added.

The events table has multiple rows per session, and event_id is the clustering key within a session. You can sort by event_id to get all events for a query, from all nodes, in chronological order.

Each row of events has a column called thread, which contains the shard number, so you can see on which shard an event happened. (ScyllaDB nodes are sharded, and each shard can handle its portion of a complex query independently while the other shards keep working.)
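
For example, using the session ID from the LOCAL_QUORUM trace above, a single query returns every event of that session, already ordered by event_id (and therefore roughly by time), along with the node and shard that produced it:

cqlsh> SELECT event_id, activity, source, source_elapsed, thread FROM system_traces.events WHERE session_id = e8adc0b0-58a8-11e6-a3d6-000000000000;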

Data stored for each event

The events table’s columns are quite straightforward. The most interesting among them are:

  • session_id: ID of the session this trace entry belongs to
  • event_id: ID of this specific trace entry
  • activity: a trace message
  • source: address of the ScyllaDB node where the trace entry was created
  • source_elapsed: the number of microseconds elapsed since the beginning of the tracing session on a specific node (see the examples above)
  • thread: currently this contains the number of the shard on which this trace point was taken

Data stored for each session

The columns of sessions are:

  • session_id: ID of this tracing session
  • command: currently this may only have the value “QUERY”
  • client: address of the client that sent this query
  • coordinator: address of the coordinator that received this query from the client
  • duration: the total duration of this tracing session
  • parameters: a map of string pairs that describe the query, such as the query string and the consistency level
  • request: a short string describing the current query, like “Execute CQL3 query”
  • started_at: a timestamp taken when the tracing session begins
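
Putting the two tables together: a quick, if crude, way to review what has been traced recently is to scan the sessions table and look at the duration column, then feed an interesting session_id into the events query shown earlier to get the full per-node timeline. A minimal sketch:

cqlsh> SELECT session_id, started_at, request, duration FROM system_traces.sessions LIMIT 20;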

Tracing at ScyllaDB Summit

Come to ScyllaDB Summit in San Jose on September 6th, to learn more about CQL tracing and other valuable info for ScyllaDB developers and administrators—along with success stories from companies like IBM, Outbrain, Samsung SDS, Appnexus, Hulu, and Mogujie, all using ScyllaDB for better performance and faster development. Meet ScyllaDB developers and devops users who will cover ScyllaDB design, best practices, advanced tooling and future roadmap items.

Going to Cassandra Summit? Add another day of NoSQL, right next door! ScyllaDB Summit takes place the day before Cassandra Summit begins, at the Hilton San Jose, adjacent to the San Jose Convention Center. Lunch and refreshments are provided.

Register for ScyllaDB Summit

Photo of a bloodhound puppy by Asilverstein (Own work) CC BY-SA 3.0, via Wikimedia Commons.

About Vlad Zolotarov

Vlad specializes in networking, mostly L2. He has worked on projects for Mellanox, on the bnx2x Linux device driver for Broadcom, and on the ScaleMP Virtual Device System for network interfaces. Vlad studied at the Israel Institute of Technology (Technion) and holds a B.Sc. in Computer Science.