Feb20

Mutant Monitoring Systems (MMS) Day 4 – We Are Under Attack!



This is part 4 of a series of blog posts that provides a story arc for Scylla Training.

If you got this far, you have your Mutant Monitoring System set up and running with the Mutant Catalog and Tracking keyspaces populated with data. At Division 3, our mutant data centers are experiencing more and more cyber attacks by evil mutants, and sometimes we experience downtime and cannot track our IoT sensors. We must now plan for disaster scenarios so that we know we can survive an attack. In this exercise, we will go through a node failure scenario and learn about consistency levels.

Environment Setup

We will need to set up a fresh cluster by building the Docker containers using the files in the scylla-code-samples GitHub repository.

git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/mms

There is a new feature to automatically import the data with the keyspaces from the first couple of blog posts for you. Edit docker-compose.yml and add IMPORT=IMPORT under the environment section for scylla-node1 so the existing mutant data will be imported when the container starts:

environment:
- DC1=DC1
- IMPORT=IMPORT
- SEEDS=scylla-node1,scylla-node2
- CQLSH_HOST=scylla-node1

Build and start the containers:

docker-compose build
docker-compose up -d

After about 60 seconds, the catalog and tracking keyspaces will be imported.
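Rather than waiting a fixed 60 seconds, the check can be polled. The sketch below is one possible approach, not part of the MMS repository: wait_for is a hypothetical helper name, and the commented usage assumes that cqlsh exits with a non-zero status while the catalog keyspace does not exist yet.

```shell
# wait_for TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds after each failed attempt.
# Returns 0 as soon as CMD succeeds, 1 if every attempt failed.
wait_for() {
  tries="$1"; delay="$2"; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage against the cluster from this post (not run here):
# wait_for 30 5 docker exec mms_scylla-node1_1 cqlsh -e "DESCRIBE KEYSPACE catalog" \
#   && echo "catalog keyspace is ready"
```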

Simulating the Attack

Let’s use the nodetool command to examine the status of the nodes in our cluster:

docker exec -it mms_scylla-node1_1 nodetool status


We can see that all three nodes are currently up and running because the status is set to UN (Up/Normal). Now let’s forcefully remove node 3 from the cluster:

docker pause mms_scylla-node3_1

Now use nodetool to check the status of the cluster again after about 30 seconds:

docker exec -it mms_scylla-node1_1 nodetool status


We can now see that node 3 is missing from the cluster because it is in a DN (Down/Normal) state. The data is safe for now because we created each keyspace with a replication factor of three. With a replication factor set to three, there will be a replica of the data on each node.
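If you want to script this status check instead of reading the table by eye, a small filter over the nodetool output is one option. This is a minimal sketch: count_state is a hypothetical helper, and it assumes each node row in the nodetool status output begins with its two-letter state (such as UN or DN), as described above.

```shell
# count_state STATE: read `nodetool status` output on stdin and print how many
# node rows start with the given two-letter state (for example UN or DN).
count_state() {
  awk -v state="$1" '$1 == state { n++ } END { print n + 0 }'
}

# Hypothetical usage (drop -it when piping so that no TTY is allocated):
# docker exec mms_scylla-node1_1 nodetool status | count_state DN
```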

We should still be able to run queries on Scylla:

docker exec -it mms_scylla-node1_1 cqlsh
select * from catalog.mutant_data;

All of the Mutant data is still there and accessible even though there are only two replicas remaining because we are using a replication factor of three (RF=3).

Consistency Levels

The data is still accessible because the Consistency Level is still being met. The Consistency Level (CL) determines how many replicas in a cluster must acknowledge a read or write operation before it is considered successful. With a Consistency Level of QUORUM, the request is honored when a majority of the replicas respond. Since we are using a replication factor of 3, only 2 replicas need to respond. QUORUM can be calculated using the formula (n/2 + 1), where n is the Replication Factor and the division is rounded down. Note that cqlsh itself uses a Consistency Level of ONE unless you change it with the CONSISTENCY command.
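To make the formula concrete, here is a small sketch of the arithmetic. The function name quorum is only illustrative; it relies on shell integer division to round n/2 down.

```shell
# quorum RF: print the number of replicas that form a majority for RF.
# Shell arithmetic uses integer division, so RF / 2 is rounded down.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Print the quorum size for a few replication factors.
for rf in 1 2 3 5; do
  echo "RF=$rf -> QUORUM=$(quorum "$rf")"
done
```

For RF=3 this gives 2, which is why the cluster keeps serving QUORUM requests with one replica down.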

Let’s test how consistency levels work with the cqlsh client. In this exercise, we will try to write data to the cluster using a Consistency Level of ONE, which means that if one replica responds, the read or write request is honored.

docker exec -it mms_scylla-node1_1 cqlsh

CONSISTENCY ONE;

insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Jobs','1 Apple Road', 'http://www.facebook.com/jobs') ;

select * from catalog.mutant_data;

The query was successful!

Now let’s test using a Consistency Level of ALL. This means that all of the replicas must respond to the read or write request; otherwise, the request will fail.

docker exec -it mms_scylla-node1_1 cqlsh
CONSISTENCY ALL;

insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Wozniak','2 Apple Road', 'http://www.facebook.com/woz') ;

select * from catalog.mutant_data;

The query failed with the error “NoHostAvailable” because only two of the three nodes are online and, with a Replication Factor (RF) of 3, a Consistency Level of ALL requires all three replicas to respond.
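The outcome of the ONE and ALL experiments can be summarized as a simple rule: a request succeeds only if the number of live replicas is at least the number the Consistency Level requires. The sketch below models just that rule for ONE, QUORUM, and ALL. The function names required_replicas and can_satisfy are hypothetical, and the model ignores other levels and multi-datacenter behavior.

```shell
# required_replicas LEVEL RF: print how many replicas must respond.
# Fails for any level other than ONE, QUORUM, or ALL.
required_replicas() {
  case "$1" in
    ONE)    echo 1 ;;
    QUORUM) echo $(( $2 / 2 + 1 )) ;;
    ALL)    echo "$2" ;;
    *)      return 1 ;;
  esac
}

# can_satisfy LEVEL RF LIVE: succeed if LIVE replicas are enough for LEVEL.
can_satisfy() {
  needed=$(required_replicas "$1" "$2") || return 1
  [ "$3" -ge "$needed" ]
}

# Our situation: RF=3 with one node paused, so 2 live replicas.
for level in ONE QUORUM ALL; do
  if can_satisfy "$level" 3 2; then
    echo "$level: ok with 2 of 3 replicas"
  else
    echo "$level: unavailable with 2 of 3 replicas"
  fi
done
```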

Adding a Node Back and Repairing the Cluster

We need to get the cluster back into a healthy state with 3 replicas by adding a new node.

cd scylla-code-samples/mms/scylla
docker build -t newnode .
docker run -d --name newnode --net=mms_web -e DC1=DC1 -e SEEDS=scylla-node1 -e CQLSH_HOST=newnode newnode

Next, we need to get the IP address of the node that is down:

docker inspect mms_scylla-node3_1 | grep IPAddress

Now we can log in to the new node and begin the node replacement procedure by adding replace_address to scylla.yaml. This setting tells Scylla that the new node will be replacing the failed node. It also tells Scylla to remove the old node, so it will no longer be visible in the nodetool status output.

docker exec -it newnode bash

Run the command below to add replace_address to scylla.yaml. Replace the IP address with the correct IP of mms_scylla-node3_1:

echo "replace_address: '172.18.0.3'" >> /etc/scylla/scylla.yaml
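The lookup and the edit can also be combined into one step run from the Docker host. The following is only a sketch under a few assumptions: replace_address_line is a hypothetical helper that performs a loose sanity check on the address (digits and dots only), and the commented usage assumes the container names used in this post.

```shell
# replace_address_line IP: print the scylla.yaml line for a replaced node,
# or fail if IP is empty or contains characters other than digits and dots.
replace_address_line() {
  case "$1" in
    ""|*[!0-9.]*) return 1 ;;
  esac
  printf "replace_address: '%s'\n" "$1"
}

# Hypothetical usage from the Docker host (not run here):
# ip=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' mms_scylla-node3_1)
# replace_address_line "$ip" | docker exec -i newnode sh -c 'cat >> /etc/scylla/scylla.yaml'
```

Validating the address first avoids appending an empty or malformed replace_address entry if the inspect command returns nothing.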

Finally, exit the container and restart the new node:

docker restart newnode

After a minute or so, we should see all three nodes online in the cluster:

docker exec -it mms_scylla-node1_1 nodetool status

Great, now we have three nodes. Is everything ok now? Of course not! Until we run a repair operation, existing data will only be on two of the nodes.

We can repair the cluster with the following command:

docker exec -it mms_scylla-node1_1 nodetool repair

When the repair finishes, the nodes will be in sync and the cluster will be able to handle additional failures. Since we no longer need the paused container named mms_scylla-node3_1, we can delete it:

docker rm -f mms_scylla-node3_1

Conclusion

We are now more knowledgeable about recovering from node failure scenarios and should be able to recover from a real attack from the mutants. With a Consistency Level of QUORUM in a three-node Scylla cluster, we can afford to lose one of the three nodes while still being able to access our data. With our datacenter intact, we can begin looking at new ways to analyze data and run SQL queries with Apache Zeppelin. Stay tuned for the Zeppelin post coming soon and be safe out there!

 


Tags: consistency, failure, mms, repair