See all blog posts

Preventing Data Resurrection with Repair Based Tombstone Garbage Collection

Delayed repairs can lead to a data resurrection problem, where deleted data come back to life. In this post, I explain a novel way that ScyllaDB removes this limitation: deleting tombstones based on repair execution, not a strict duration (known as gc_grace_seconds). This will be available in ScyllaDB Open Source 5.0.

Background: Tombstones in ScyllaDB and Other CQL-Compatible Databases

You might wonder what a tombstone is. In ScyllaDB, as well as in Apache Cassandra and other CQL-compatible databases, a tombstone is a special data marker written into the database to indicate that a user has deleted some data.

This marker is needed because ScyllaDB and these other databases use immutable SSTable files. So when you need to delete data, rather than alter a database record itself, you leave a marker to let the database know that data has been marked as deleted. In future reads, the database will spot the tombstone and ensure that the deleted data is not returned as part of a query result.

Eventually, after a compaction process, the deleted rows and the tombstones will be discarded. Why not just keep all the tombstone markers? It leads to unbounded size. We could not keep the deleted data and tombstones forever. Otherwise, if there are a lot of deletes, the database will contain an ever-increasing amount of deleted data and tombstones. So you generally want tombstones purged to save disk space.

The solution to this problem is to drop tombstones when they are no longer necessary. Currently, we drop the tombstones when the following happens:

  • First, the data covered by the tombstone and tombstone can compact away together.
  • Second, when the tombstone is old enough. It is older than the gc_grace_seconds option, which is 10 days by default.

All of this becomes a non-trivial issue when data and tombstones are replicated to multiple nodes. There are cases where the tombstones might be missing on some of the replica nodes – for instance, if the node was down during a deletion and no repair was performed within gc_grace_seconds.  As a result, data resurrection could happen.

Here is an example:

  1. In a 3-node cluster, a deletion with consistency level QUORUM is performed.
  2. One node was down but the other two were up, so the deletion succeeds and tombstones are written on the two up nodes.
  3. Eventually the downed node rejoins the cluster. However, if it rejoins beyond the Hinted Handoff window, it does not get the message that one of its records was marked for deletion.
  4. When gc_grace_seconds is exceeded, the two nodes with tombstones do GC, so tombstones and the covered data are gone.
  5. The node that did not receive the tombstone still has the data that was supposed to be deleted.

When the user performs a read query, the database could return deleted data since one of the nodes still has the deleted data.

Timeout-based Tombstone GC

Let’s call the current tombstone GC method “timeout based tombstone GC.”

With this method, users have to run full cluster-wide repair within gc_grace_seconds to make sure tombstones are synced to all the nodes in the cluster.

However, this GC method is not robust since the correctness depends on user operations. And there is no guarantee that the repair can be finished in time, since repair is a maintenance operation and it has lowest priority. If there are important tasks like user workload and compaction, repair will be slowed down and might miss the deadline.

This adds pressure on ScyllaDB admins to finish repair within gc_grace_seconds. Note we encourage admins to use ScyllaDB Manager, which can help with automation of backups, repairs and compaction.

In practice, users may want to avoid repair to reduce performance impacts and serve more user workload during critical periods, e.g., holiday shopping.

So, we need a more robust solution.

Repair-based Tombstone GC

We implemented the repair based tombstone GC method to solve this problem.

The idea is that we automatically remove tombstones — perform GC — only after repair is performed. This guarantees that all replica nodes have the tombstones, whether or not the repair is finished within gc_grace_seconds.

This provides multiple benefits. There is no need to figure out a proper gc_grace_seconds number with this feature. In fact, it is very hard to find a proper one that works for different workloads. Further, there is no more data resurrection if repair is not performed in time. This means less pressure to run repairs in a timely manner. Since there is no more hard requirement to finish repairs in time, we can throttle repair intensity even more to reduce the latency impact on user workload. Conversely, if repair is performed more frequently than the gc_grace_seconds setting, tombstones can be GC’ed faster, which results in better read performance.

Usage

You can use ALTER TABLE and CREATE TABLE to turn on this feature using a new option: tombstone_gc.

For example,

ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'} ;

CREATE TABLE ks.cf (key blob PRIMARY KEY,  val blob) WITH tombstone_gc = {'mode':'repair'};

You can specify the following modes: timeout, repair, disabled and immediate.

  • The timeout mode is the default mode. It runs GC after gc_grace_seconds. It is the same as without this feature.
  • The repair mode is the one users should use. It will conduct GC after a repair is performed.
  • The disabled mode is useful when loading data into ScyllaDB. We do not want to run GC when only part of the data is available, since tools can generate out of order writes or writes in the past. In such cases, we can set the mode to disabled when the tools are loading data.
  • The immediate mode is mostly useful for Time Window Compaction Strategy (TWCS) with no user deletes. It is much safer than setting gc_grace_seconds to zero. We are even considering rejecting user deletes if the mode is immediate,  to be on the safe side.

A new gossip feature bit TOMBSTONE_GC_OPTIONS is introduced. A cluster will not use this new feature until all the nodes have been upgraded. To keep maximum compatibility and avoid surprising users, the default mode is still timeout. Users have to enable experimental features and then set the mode to repair explicitly to turn on this feature.

Summary

Repair based tombstone GC is a new ScyllaDB experimental feature which provides a more robust solution to ensuring that your dead data stays dead. It guarantees data consistency even if repair is not performed within gc_grace_seconds.  It makes it easier to operate ScyllaDB and makes the database safer in terms of data consistency. This feature will be available as an experimental new feature in ScyllaDB Open Source 5.0.

LEARN MORE ABOUT SCYLLADB OPEN SOURCE

About Asias He

Asias He is a software developer with over 10 years of experience in programming. In the past, he worked on Debian Project, Solaris Kernel, KVM Virtualization for Linux, OSv unikernel. He now works on Seastar and ScyllaDB.