Testing part 5: Longevity testing
This is the fifth part in our blog series on Scylla testing. Part 1 covers Apache Cassandra compatibility testing, Part 2 covers Jepsen tests, Part 3 covers the CharybdeFS fault-injecting filesystem, and Part 4 covers distributed testing.
Testing software often requires you to employ several approaches in order to increase coverage. A well-maintained project will typically have:
- Unit tests. These tests cover the functionality of specific parts of the code (functions and classes) by checking the inputs and outputs of those functions, which helps verify quality up close in the code. We already covered Scylla unit tests in a previous blog post.
- Functional tests. These tests verify how an entire program works by executing operations a user would typically perform on that program. One functional test suite used in Scylla is our modified version of dtest, covered in a previous article.
- Integration tests. These are a special case of functional testing, where we want to test an entire stack of programs through its entire life cycle. This blog post is dedicated to our integration suite, called scylla-longevity-tests.
For the sake of completeness, we also need:
- Performance tests. These aim to verify that the program performs the required operations in a reasonable amount of time. We’ll cover them in another blog post.
Scylla is able to use tests from several sources: our own tests, the Apache Cassandra tests, and third-party tests. As we get closer to the GA release of Scylla, the testing gets both harder and more automated. The Scylla longevity test is an integration test, which tests the product as a whole, as deployed, rather than by parts, as unit tests do.
dtest, covered in a previous article, is also an integration test, but it only tests local Scylla processes, instead of Scylla nodes running on machines spread across the network. Jepsen tests are focused on another aspect of distributed testing, which is CAP theorem correctness as a Scylla cluster undergoes network partitions and delays.
Longevity tests create a Scylla cluster in AWS and exercise it by running cassandra-stress and other clients specified in the test scripts. This testing is meant to find problems in the operation of long-running clusters, which is how existing Scylla deployments run. The system uses Boto to control AWS from Python.
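As a rough illustration of what driving AWS from Python looks like, the sketch below uses the boto3 EC2 resource to request a group of identical instances. The helper names, the region, and the AMI ID are placeholders invented for this example; they are not taken from the longevity suite.

```python
def build_launch_request(n_nodes, instance_type, image_id):
    """Build keyword arguments for boto3's EC2 create_instances call."""
    if n_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "ImageId": image_id,            # placeholder AMI, not a real Scylla image
        "InstanceType": instance_type,
        "MinCount": n_nodes,            # ask for exactly n_nodes instances
        "MaxCount": n_nodes,
    }


def launch_nodes(request, region_name="us-east-1"):
    """Launch the instances described by `request` and wait until they run.

    Needs boto3 and valid AWS credentials, and will incur AWS charges.
    """
    import boto3  # imported lazily so build_launch_request works without boto3

    ec2 = boto3.resource("ec2", region_name=region_name)
    instances = ec2.create_instances(**request)
    for instance in instances:
        instance.wait_until_running()
    return instances


# Only the request is built here; nothing is sent to AWS.
request = build_launch_request(6, "c3.large", "ami-00000000")
print(request)
```

Calling launch_nodes(request) would actually create the instances, so it is left out of the example run.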
The typical workflow of a longevity test is:
1. Create a Scylla cluster (referred to as the DB cluster from now on), with the number of nodes and the node size defined by the user in a config file.
2. Create a number of loader nodes, that will stress the DB cluster, typically by running cassandra-stress (or “c-s” for short) on it. We can also run arbitrary CQL operations from a loader node.
3. The cluster object hosts a Nemesis (terminology borrowed from Jepsen), a thread that randomly picks a non-seed node and disrupts it in some way. The Nemesis is a member of the cluster object and has five disruption functions, all named after monkeys. (Thanks to Cory Bennett and Ariel Tseitlin at Netflix for starting the monkey naming tradition for testing tools.)
• CorruptThenRebuildMonkey/CorruptThenRepairMonkey destroys data on a given node by removing data files and killing the Scylla process.
• StopStartMonkey: stop and restart the AWS instance
• DrainerMonkey drainer: execute nodetool drain, then stop and restart the node
• DecommissionMonkey decommission: decommission a node, create a new instance, and add it to the cluster
• ChaosMonkey: call a random disruption function (one of the 5 above).
4. Nemesis will sleep for a specified time interval, then execute the disruption function, then sleep again, rinse and repeat.
5. Cassandra-stress starts. The default command line looks like:
cassandra-stress write cl=QUORUM duration=1440m -schema 'replication(factor=3)' -mode cql3 native -rate threads=4 -node 172.31.27.192
6. At all times, we have enough nodes active to keep c-s going given a replication factor of 3. The disruption functions operate on a single node, and operations that remove a node permanently from the cluster add a new node to replace it as soon as the removal finishes. The node is added right away so that we won’t see c-s errors caused by the cluster running with fewer nodes than replication requires.
7. Given all that, we expect c-s to survive for the whole specified time period (24 hours in the runs reported here). If it doesn’t, chances are that we have a bug.
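The arithmetic behind step 6 can be sketched in a few lines. With a replication factor of 3, a QUORUM operation needs 2 of the 3 replicas, so each piece of data can lose one replica at a time without failing the workload. The function names are illustrative only.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


rf = 3
print(quorum(rf))              # 2
print(tolerated_failures(rf))  # 1
```

This is why the disruption functions target a single node at a time: a second simultaneous failure among the replicas of the same data would drop below quorum and make cassandra-stress fail for reasons unrelated to a bug.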
Most of the tests are done with 6 nodes, on c3.large instances. Development of the longevity test suite itself was done on t2.micro instances, to save on the Amazon bill. The results are in, and longevity testing has found some important issues:
- Cassandra stress does not survive stop/start nemesis
- Node runs out of space on drainer nemesis
- Nemesis fails to decommission a node, core dump generated
- Failure to add new node to a cluster due to seed node unavailable
- Scylla-JMX is issuing a java.lang.NullPointerException in the thread “Dropped messages”
- Nodes report they are not getting CPU time for as long as 128 seconds
Anatomy of a longevity test
You can see the principles discussed earlier in one example longevity test:
    from avocado import main

    from sdcm.tester import ClusterTester


    class LongevityTest(ClusterTester):
        """
        Test a Scylla cluster stability over a time period.

        :avocado: enable
        """

        def test_custom_time(self):
            """
            Run cassandra-stress with params defined in data_dir/scylla.yaml
            """
            self.db_cluster.add_nemesis(self.get_nemesis_class())
            self.db_cluster.start_nemesis(interval=self.params.get('nemesis_interval'))
            self.run_stress(duration=self.params.get('cassandra_stress_duration'))
This is an excerpt of an existing test, the one that originated the suite. Let’s break it down:
- Add a nemesis to the db_cluster object
- Start the nemesis thread
- Run cassandra-stress on it
If cassandra-stress survives the nemesis operations (the run_stress function checks that), then the test passed.
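To make the three steps above concrete, here is a toy model of a cluster that owns a nemesis thread. It is a minimal sketch, not the sdcm implementation: the Node, Cluster, and StopStartNemesis classes are stand-ins written for this example, the interval is assumed to be in seconds, and a short sleep takes the place of the cassandra-stress run.

```python
import random
import threading
import time


class Node:
    def __init__(self, name, is_seed=False):
        self.name = name
        self.is_seed = is_seed
        self.restarts = 0

    def restart(self):
        # Stand-in for stopping and restarting a real instance.
        self.restarts += 1


class StopStartNemesis:
    def __init__(self, cluster):
        self.cluster = cluster
        self._stop_event = threading.Event()
        self._thread = None

    def disrupt(self):
        # Only non-seed nodes are eligible targets.
        candidates = [n for n in self.cluster.nodes if not n.is_seed]
        random.choice(candidates).restart()

    def _run(self, interval):
        # wait() returns False on timeout and True once stop() sets the event,
        # so the loop sleeps, disrupts, and repeats until asked to stop.
        while not self._stop_event.wait(interval):
            self.disrupt()

    def start(self, interval):
        self._thread = threading.Thread(
            target=self._run, args=(interval,), daemon=True)
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()


class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes
        self.nemesis = []

    def add_nemesis(self, nemesis_class):
        self.nemesis.append(nemesis_class(self))

    def start_nemesis(self, interval):
        for nemesis in self.nemesis:
            nemesis.start(interval)

    def stop_nemesis(self):
        for nemesis in self.nemesis:
            nemesis.stop()


cluster = Cluster([Node("seed", is_seed=True), Node("node-1"), Node("node-2")])
cluster.add_nemesis(StopStartNemesis)
cluster.start_nemesis(interval=0.01)
time.sleep(0.3)  # stands in for the cassandra-stress run
cluster.stop_nemesis()
print({node.name: node.restarts for node in cluster.nodes})
```

Using an Event for the sleep means the thread can be stopped promptly at the end of the test instead of waiting out a full interval.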
Where is the db_cluster attribute defined? In the parent class’s setUp() method:
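    @clean_aws_resources
    def setUp(self):
        self.credentials = None
        self.db_cluster = None
        self.loaders = None
        # The assigned value is missing from this excerpt; an empty list is assumed.
        self.connections = []
        # Silence the chatty AWS client loggers.
        logging.getLogger('botocore').setLevel(logging.CRITICAL)
        logging.getLogger('boto3').setLevel(logging.CRITICAL)
        # Request the resources, then wait until loaders and DB nodes are ready.
        self.init_resources()
        self.loaders.wait_for_init()
        self.db_cluster.wait_for_init()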
The setUp method talks to AWS, requests the appropriate nodes, and does the needed setup on them, so you have a Scylla or Cassandra cluster ready to go at the beginning of your test function. Each Node has a pointer to the AWS instance (Node.instance) and a remoter object (Node.remoter) that you can use to execute SSH commands. Example:
    for node in self.db_cluster:
        result = node.remoter.run('ls -l')
        self.log.info(result)
        node.instance.terminate()
This would execute ls -l on each DB cluster node, then terminate the AWS instance associated with it. The base libraries and test class are flexible, and you can override the setUp method of the base test class to better suit your test workflow. Let’s see the code of one of our nemesis classes:
    class StopStartMonkey(Nemesis):

        @log_time_elapsed
        def disrupt(self):
            self.disrupt_stop_start()
Here, disrupt_stop_start is inherited from the base nemesis class:
    def disrupt_stop_start(self):
        self.log.info('Stop %s then restart it', self.target_node)
        self.target_node.restart()
If you are curious about the implementation of the Node.restart() function, you can always refer to the source code.
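The same pattern extends to the other monkeys: each subclass picks one disruption helper from the base class, and a ChaosMonkey-style class picks one at random. The sketch below illustrates that idea under stated assumptions. The base class, the FakeNode, and the disrupt_drain helper are written for this example and only mimic the names used in the article; they are not the suite's code.

```python
import random


class FakeNode:
    """Records the operations a nemesis performs on it."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def restart(self):
        self.events.append("restart")

    def drain(self):
        self.events.append("drain")


class Nemesis:
    """Toy base class: holds the target node and the disruption helpers."""

    def __init__(self, target_node):
        self.target_node = target_node

    def disrupt_stop_start(self):
        self.target_node.restart()

    def disrupt_drain(self):
        # Drain first, then stop and restart, as DrainerMonkey is described.
        self.target_node.drain()
        self.target_node.restart()

    def disrupt(self):
        raise NotImplementedError("subclasses choose a disruption")


class StopStartMonkey(Nemesis):
    def disrupt(self):
        self.disrupt_stop_start()


class ChaosMonkey(Nemesis):
    def disrupt(self):
        # Collect every disrupt_* helper and call one of them at random.
        helpers = [getattr(self, name) for name in dir(self)
                   if name.startswith("disrupt_")]
        random.choice(helpers)()


node = FakeNode("node-1")
StopStartMonkey(node).disrupt()
ChaosMonkey(node).disrupt()
print(node.events)
```

Keeping the helpers on the base class means a new monkey is usually a few lines long, and the random variant picks up new helpers without being edited.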
The next generation of longevity tests will cover multiple machine providers in addition to just AWS. We plan to add an abstraction layer on top of AWS management code, then add support for other clouds. Multi-datacenter testing support is also planned.
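One possible shape for such an abstraction layer is sketched below. This is speculation for illustration only: the NodeProvider interface, its method names, and the in-memory backend are all hypothetical and do not describe the planned design.

```python
from abc import ABC, abstractmethod


class NodeProvider(ABC):
    """Hypothetical provider interface; not part of the current suite."""

    @abstractmethod
    def create_nodes(self, count, node_type):
        """Return a list of provider-specific node identifiers."""

    @abstractmethod
    def destroy_node(self, node_id):
        """Release the resources behind node_id."""


class InMemoryProvider(NodeProvider):
    """Stand-in backend so the example runs without any cloud account."""

    def __init__(self):
        self.nodes = {}
        self._next_id = 0

    def create_nodes(self, count, node_type):
        created = []
        for _ in range(count):
            node_id = "node-%d" % self._next_id
            self._next_id += 1
            self.nodes[node_id] = node_type
            created.append(node_id)
        return created

    def destroy_node(self, node_id):
        del self.nodes[node_id]


def replace_node(provider, node_id, node_type):
    """Provider-agnostic version of 'remove a node, then add a new one'."""
    provider.destroy_node(node_id)
    return provider.create_nodes(1, node_type)[0]


provider = InMemoryProvider()
ids = provider.create_nodes(3, "small")
new_id = replace_node(provider, ids[0], "small")
print(sorted(provider.nodes))
```

With an interface like this, test logic such as replace_node would stay the same while an AWS-backed or other cloud-backed class supplies the provider-specific calls.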
Source code is available on GitHub.