Scylla Enterprise Release 2018.1.3 | May 13, 2021

Today we released Scylla Enterprise 2018.1.3, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.3 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise.

More about Scylla Enterprise here.

Critical Patch

Scylla Enterprise 2018.1.3 fixes possible data loss when using Leveled Compaction Strategy (LCS). The issue causes Scylla to miss a small fraction of data in a full table scan. It was originally observed during decommission (which performs a full table scan internally), where some data (<1% in a test) was not streamed.

In addition to full-scan queries, scans are used internally as part of compaction and streaming, including decommissioning a node, adding a node, and repairs. Our investigation concluded that Scylla can lose data while running any of these operations. The issue is limited to tables using LCS and does not affect tables using other compaction strategies.

If you are using LCS, you should upgrade to Scylla Enterprise 2018.1.3 ASAP.
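
To check whether any of your tables use LCS, the following is a minimal sketch, not an official Scylla tool. It assumes you have saved the output of cqlsh's DESCRIBE SCHEMA as text; the function name tables_using_lcs and the sample schema are hypothetical and exist only for illustration. It scans only CREATE TABLE statements, so materialized views are not covered.

```python
import re

# Matches "CREATE TABLE <name> ( ... ;" in schema text. Assumption: each
# statement ends at the first semicolon, which holds for typical
# DESCRIBE SCHEMA output but not for definitions whose comment strings
# contain a semicolon.
_CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([\w\".]+)\s*\((.*?);",
    re.IGNORECASE | re.DOTALL,
)


def tables_using_lcs(schema_text):
    """Return the names of tables whose definition mentions LeveledCompactionStrategy."""
    hits = []
    for match in _CREATE_TABLE.finditer(schema_text):
        name, body = match.group(1), match.group(2)
        # A substring match covers both the short class name and a fully
        # qualified one.
        if "leveledcompactionstrategy" in body.lower():
            hits.append(name)
    return hits


# Hypothetical schema used only to exercise the function.
SAMPLE_SCHEMA = """
CREATE TABLE ks.events (
    id uuid PRIMARY KEY,
    payload text
) WITH compaction = {'class': 'LeveledCompactionStrategy'};

CREATE TABLE ks.users (
    id uuid PRIMARY KEY,
    name text
) WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
"""

if __name__ == "__main__":
    for table in tables_using_lcs(SAMPLE_SCHEMA):
        print(table)
```

Any table this reports should be treated as potentially affected until the cluster is upgraded.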

Action to Take

The problem may be mitigated by restoring backups of the relevant table. If you are using LCS and have relevant backups, please contact our support team for additional information on how to run the restore procedure.

How This Happened

We take data integrity very seriously and are investigating why this issue was not identified earlier. Our initial findings are that a low-level optimization around disjoint SSTable merging introduced the bug in the 2.1 release. It surfaced only in our 2.2 testing because it occurred very rarely with 2.1-based code. The Scylla cluster test suite did detect the issue, but two things masked it: quorum persistence was still met, and the test suite itself misattributed the errors. One of the roles of this suite is to run disruptors (corruption emulation, node and data center failures) against the cluster and to trigger corruptions and repairs, so the suite incorrectly concluded that the errors were part of its own disruptor activity. We are now working to improve the cluster test suite’s ability to detect errors.

Please contact us with any questions or concerns. We will publish a full root cause analysis report as soon as possible and disclose enhancements to prevent such a case in the future.

Related Links

Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.3 in coordination with the Scylla support team.

Additional Issues Solved in This Release (with Open Source Issue Reference When Applicable)

  • Ec2MultiRegionSnitch does not always honor the preference for the local DC, which results in redundant requests to remote DCs #3454
  • When using TLS for interconnect connections, shutting down a node generates system_error errors (error system:32, Broken pipe) on other nodes #3461
