Today we released Scylla Open Source 2.1.4, a bugfix release of the Scylla 2.1 stable branch. Release 2.1.4, like all past and future 2.x.y releases, is backward compatible and supports rolling upgrades.
Scylla 2.1.4 fixes possible data loss when using Leveled Compaction Strategy (LCS) #3513. The issue causes Scylla to miss a small fraction of data in a full table scan. This was originally observed in decommission (which performs a full table scan internally), where some data (<1% in a test) was not streamed.
In addition to a full scan query, scans are used internally as part of compaction and streaming, including decommission, adding a node, and repairs. Our investigation into the matter concluded that Scylla can cause data loss while running any of these actions.
The issue is limited to tables using LCS and does not affect tables using other compaction strategies. If you are using LCS, you should upgrade to Scylla 2.1.4 ASAP.
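To check whether any of your tables use LCS, one option is to inspect each table's compaction class. The following is a minimal sketch, not an official tool. It assumes you have already fetched rows of keyspace name, table name, and compaction options (for example from the `system_schema.tables` schema table); the function name `tables_using_lcs` and the sample rows are hypothetical and only for illustration.

```python
from typing import Iterable, List, Mapping, Tuple

# Hypothetical row shape: (keyspace_name, table_name, compaction options map).
Row = Tuple[str, str, Mapping[str, str]]


def tables_using_lcs(rows: Iterable[Row]) -> List[str]:
    """Return 'keyspace.table' names whose compaction class is LCS."""
    affected = []
    for keyspace, table, compaction in rows:
        # The class may appear as a short name or a fully qualified one,
        # so compare only the last dotted component.
        cls = compaction.get("class", "").rsplit(".", 1)[-1]
        if cls == "LeveledCompactionStrategy":
            affected.append(f"{keyspace}.{table}")
    return affected


# Sample rows, made up for illustration only.
sample_rows = [
    ("ks1", "events", {"class": "LeveledCompactionStrategy"}),
    ("ks1", "users", {"class": "SizeTieredCompactionStrategy"}),
    ("ks2", "metrics",
     {"class": "org.apache.cassandra.db.compaction.LeveledCompactionStrategy"}),
]
print(tables_using_lcs(sample_rows))
```

Any table this reports is a candidate for the issue described above. Tables using other compaction strategies are not affected.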
Action to Take
The problem may be mitigated by restoring backups of the relevant table. If you are using LCS and have relevant backups, please contact our support team for additional information on how to run the restore procedure.
How This Happened
We take data integrity very seriously and are investigating why this issue was not identified earlier. Our initial findings are that a low-level optimization around disjoint SSTable merging introduced the bug in the 2.1 release. Because it occurred very rarely with 2.1-based code, it surfaced only in our 2.2 testing. The Scylla cluster test suite did detect the issue, but two things papered over it: quorum persistence was still met, and the test suite itself masked the symptom. One of the roles of this suite is to run disruptors (corruption emulation, node and data center failures) against the cluster and to trigger corruptions and repairs, and the suite incorrectly concluded that the missing data was part of its own disruptor activity. We are now working to improve the cluster test suite's ability to detect errors.
Please contact us with any questions or concerns. We will publish a full root cause analysis report as soon as possible and disclose enhancements to prevent such a case in the future.
Additional bugs fixed in this release
- Scylla AMI error: “systemd: Unknown lvalue ‘Ambient / Unknown lvalue ‘AmbientCapabilities’”. The issue is solved by moving to a new CentOS 7.4.1708 base image #3184
- Upgrading to the latest version of the RHEL kernel causes Scylla to lose access to the RAID 0 data directory #3437 (a detailed notice has been sent to all relevant customers)
- Incorrect commit log error handling may cause a core dump #3440
- Closing a secure connection (TLS) may cause a core dump #3459
- When using TLS for interconnect connections, shutting down a node generates errors on other nodes: “system_error (error system:32, Broken pipe)” #3461
- Ec2MultiRegionSnitch does not always honor or prefer the local DC, which results in redundant requests to the remote DC #3454