Testing part 4: Distributed tests

Subscribe to Our Blog

This is the fourth part in our blog series on Scylla testing. Part 1 covers Apache Cassandra compatibility testing, Part 2 covers Jepsen tests, and Part 3 covers the CharybdeFS fault-injecting filesystem.

In this installment, we’ll cover the Scylla distributed tests, or dtest, which is an extended version of the dtest project originally developed for Apache Cassandra. Cassandra dtests are functional black boxes that test Cassandra cluster operation. Thus dtest provides a test suite covering how a Cassandra server should operate, and allows us at ScyllaDB to validate that Scylla is compatible in functionality with Cassandra. Philip Thompson recently wrote a good article called How to Write a Dtest that covers how to add and document new tests. A non-inclusive list aimed at showing the variety of items tested by dtest includes:

  • batch operations: unlogged_batch_accepts_regular_mutations_test, logged_batch_accepts_regular_mutations_test (
  • paging: test_with_less_results_than_page_size, test_with_more_results_than_page_size, test_with_equal_results_to_page_size (
  • ttl: default_ttl_test, insert_ttl_has_priority_on_defaut_ttl_test (
  • commitlog: test_commitlog_replay_on_startup, test_commitlog_replay_with_alter_table (
  • server to client notifications: restart_node_test (restarting a node should generate exactly one DOWN and one UP notification) (
  • snapshot and restore: test_basic_snapshot_and_restore (
  • bootstrap: simple_bootstrap_test (adding a node to a cluster, token range is updated and data is streamed), read_from_bootstrapped_node_test (test bootstrapped node sees existing data) (
  • consistency level with simple strategy and network topology strategy : TestAvailability (Test that we can read and write depending on the number of nodes that are alive and the consistency levels.) (
  • upgrades: parallel_upgrade_test (Test upgrading cluster all at once (requires cluster downtime), rolling_upgrade_test (Test rolling upgrade of the cluster, so we have mixed versions part way through) (

Many events can happen in a software system as complex as a Scylla cluster, and dtest is necessary in order for us to get good test coverage. For us at ScyllaDB, Scylla unit tests and dtest, in that order, are our most commonly used test tools. Some numbers about DTest and Scylla (as of today)

  • We have written ~120 new tests
  • More than 10% of the reported issues on the Scylla project reference a dtest—many as the source of the bugs, others for reproducing or regression testing.
  • More than 60% of our engineering team have contributed tests to the project

An example of a problem found using dtest is Issue 484, where we discovered that the cache was not cleared when returning ownership on a key range following a decomission. Other bugs we found caused Scylla crashes, including Issue 489, and another, Issue 774 that only failed one out of 5 runs. As with filesystem testing, automation helps us see rare errors frequently.

Structure of a test

A good example of a dtest test is this one, which does three steps. Create a single-node cluster, insert some data with a replication factor of 2, insert some data, then add a new node and insert some more data – then finally test that Scylla is replicating the data correctly and each of the two nodes have all of it.

def simple_add_node_1_test(self):
    cluster = self.cluster
    # create a cluster with a single node and start the nodes 
    node1 = cluster.nodelist()[0]

    # connect to the node
    session = self.patient_cql_connection(node1)

    # create keyspace (RF=2), column family, insert data
    self.create_ks(session, 'ks', 2)
    self.create_cf(session, 'cf', read_repair=0.0, columns={'c1': 'text', 'c2': 'text'})
    insert_c1c2(session, keys=range(1000), consistency=ConsistencyLevel.ONE)

    # create a new node, start it and wait for it to connect to the cluster
    # when a new node is added to the cluster data is streamed to it based on RF.
    # In our case all the data will be streamed
    node2 = new_node(cluster)
    session = self.patient_exclusive_cql_connection(node2,'ks')
    # insert an additional 1000 rows with CL=2
    insert_c1c2(session, keys=range(1000, 2000), consistency=ConsistencyLevel.TWO)

    # shutdown node1, check all data exists on node2, start node1
    result = session.execute("SELECT * FROM cf LIMIT 2001")
    self.assertEqual(len(result), 2000, len(result))

    # shut down node2, check all data exists on node1
    session = self.patient_cql_connection(node1,'ks')
    result = session.execute("SELECT * FROM cf LIMIT 2001")
    self.assertEqual(len(result), 2000, len(result))

When calling cluster.populate() you set the number of nodes for the test. You can then add nodes to a cluster with new_node. Different tests use different numbers of nodes. For example, the test starts 9 nodes: 3 data centers with 3 nodes in each one.

Dtest and Jepsen testing

Jepsen tests can also test multiple nodes, and some Scylla functionality is covered by both. However, it’s easier to test some things in one rather than the other. Dtest contains tests for more specific functionality, including tests for administration operations with nodetool. Dtest also covers complete procedures:

  • Snapshots
  • Backup / Restore
  • Replace node
  • Expand cluster
  • Upgrades

Jepsen, on the other hand, does things that dtest is not set up to do:

  • Play with the clocks so that different nodes see different times
  • Do network partitioning, dividing the cluster

However, both the Jepsen tests and the dtest test suite are adding more tests all the time, so some overlap will remain and even grow.

Adding tests for basic functionality

Dtest has been with the Scylla project since the beginning. While we were adding functionality to Scylla, we have found that many of the original tests build on a lot of base functionality which was a chicken and egg problem. In such cases we ended up writing simpler tests that aim at testing specific functionality. The dtest project started in 2011, after Apache Cassandra had already existed for a while (March 14, 2010) and had a lot of functionality. What we had to do is test all functionality as we implemented it. We did test-driven development, writing new dtests and Scylla code in parallel, which made it necessary to test each new piece of functionality as we wrote it. The original, more complex dtests are still there, and now that Scylla has Cassandra compatibility it is passing those tests along with the simple ones that we needed for TDD.

Testing bootstrap is one such example. In the original version of dtest, the bootstrap test not only checks that the node is added and data is streamed, but also checks the data was split between the nodes almost equally by doing a major compaction after bootstrap. In our simpler case we have added a node and verified all the data was received without checking the sizes. Scylla now passes both sets of tests: the original, more complex tests, and the simpler tests we created for test-driven development.

The simple_add_node_1_test sample above is one that we added to have a simpler bootstrap test. This test does not require a compaction. Likewise, we added “decomposed” versions for other tests. A rough breakdown of the tests we have added is:

  • Operations (nodetool, backup restore) : ~25 tests
  • Cluster management (add node, remove node, decommission node, repair): ~60 tests
  • Data migration (copying over information from a cassandra 2.1.X cluster to scylla): ~15 tests
  • Other (cql , limits, etc.) : ~20 tests

To our surprise we have found that some nodetool commands are not covered at all, including removenode, gossipinfo, clearsnapshot, and listsnapshots. In our view every operation in nodetool needs at least one test. Some are not covered in the original dtest, and we had to add them.

Future work

We’re planning to add tests for upgrades and migrations, including Cassandra to Scylla and Scylla to Scylla. As with other test code, we’re planning to contribute our changes upstream to the original dtest project.

Coming soon: Longevity tests for Scylla in the public cloud.


Subscribe to this blog’s RSS feed for automatic updates. Or follow ScyllaDB on Twitter. If you’re interested in getting involved in Scylla development, or using Scylla as the database in your own project, see our community page for source code and mailing list info.

Shlomi LivneAbout Shlomi Livne

Shlomi Livne, VP of R&D at ScyllaDB, has 15 years of experience in software development of large-scale systems. Previously he has led the research and development team at Convergin, which was acquired by Oracle. Shlomi holds a BA and MSc in Computer Science from the Technion-Machon Technologi Le' Israel.

Tags: deep-dive, testing