ScyllaDB V brings new performance, resilience, and ecosystem advantages that resolve longstanding challenges of legacy NoSQL databases.ScyllaDB V is Here
Piotr is a software engineer very keen on open-source projects, C++ and Rust. He previously developed an open-source distributed file system (LizardFS) and had a brief adventure with Linux kernel during an apprenticeship at Samsung Electronics. Piotr graduated from the University of Warsaw with MSc in Computer Science.
Hi everyone, my name is Piotr Sarna, and today I’ll talk about libSQL.
I’m currently a Staff Software Engineer at ChiselStrike, but I’ve spent a few precious years contributing to ScyllaDB, part of it as its maintainer. I’m also partially responsible for the ScyllaDB Rust Driver Project, and right now my main responsibility is libSQL.
What is libSQL? It’s a fork of SQLite – a famous library implementing a full-featured, lightweight SQL engine, coded in C language. SQLite is a rock-solid project, deployed to billions of devices, even if you count smartphones only. It also has a very specific view on open-source. It’s code is open and available as public domain, but the project is explicitly “not open-contribution”. Instead it has a small set of experts taking care of the whole project. That improves stability, but contributions have value as well, that’s why we decided to fork the rock-solid foundation and start maintaining it, being fully open to external contributions. In particular, we’re interested in using SQLite in cases it wasn’t originally designed for. By the way, there’s an official webpage called “Appropriate Uses For SQLite”, and it makes sense on its own, but we still want to go beyond it. That includes using it in a distributed environment, particularly edge computing. We also push Rust into the codebase, as it is not only a safer language, but also one that makes contributions more likely to happen, due to Rust’s ever raising popularity.
A fair part of this presentation is actually going to be an introduction to SQLite architecture, because this knowledge is simply needed to fully understand our motives for creating a fork, and why we decided to extend it with certain features.
SQLite, contrary to other database systems offering SQL interface, is not client-server architecture. Instead, it’s simply a library of functions one can use to maintain a local database, stored in a regular file, or sometimes a few files, which we’ll learn in the next few slides. SQLite is a zero-configuration system – while it allows you to tweak tons of settings to optimize for your particular use case, it also works out of the box. This great developer experience is one of the things that made it so popular. The other thing is its extensive test coverage, which makes the database very reliable.
SQLite has modular architecture that fits well on a single slide – this image is taken straight from SQLite docs. It’s extremely self-contained – it ships with its own tokenizer and parser, which translate SQL source code to a dedicated bytecode, also invented at SQLite. This bytecode runs on a virtual machine, implemented in SQLite. This virtual machine uses the backend implementation for input/output – a B-Tree layer, which lays on top of a pager layer, which lays on top of the operating system interface. Independently, there’s a set of helper programs, various utilities, and lots of tests.
The top back-end layer is a b-tree implementation, based on the algorithm’s description from the art of computer programming by Donald Knuth. There are actually two implementations, one specialized for being indexed by an integer, and one for secondary indexes, where a key can be just arbitrary bytes.
Pager is the only mechanism B-Tree uses for storing data. Pager is conceptually a simple structure – you can use it to read a page of data, or write pages of data. Internally it gets more complex, because it also includes an implementation of page cache, journaling, locking, and so on. But from a distant perspective, it’s a simple module that gives you access to pages of fixed size.
VFS stands for virtual file system. It’s not the one from Linux kernel, but the naming is not coincidental. SQLite’s VFS is an abstraction for operating on files. Initially it was created simply to provide support for multiple operating systems – Linux, Windows, Mac, and so on, because each of those has different ways of accessing hardware and different file system implementations. Later it turned out that this layer of abstraction is useful for many other things – for instance providing encryption, compression, auditing, backups, checksumming, replication, you name it. Virtual file systems are pluggable in SQLite, so users are free to load their own implementations dynamically, without having to recompile the whole project.
VDBE, the Virtual DataBase Engine, implements the virtual machine that runs the bytecode, to which all SQL queries get compiled. Internally, VDBE is register-based, and it simply executes database operations one by one, using the registers just like a CPU would do it.
SQLite also has an EXPLAIN statement, that works a little differently than what you could expect, based on other database systems. It doesn’t show you a high level query plan, and insteads prints the human-readable version of the bytecode. From the output you can clearly see what the virtual machine is going to do and when, which makes debugging quite easy.
Now, most database systems have a notion of “transactions”, which is an atomic unit of execution, consisting of one or more statements. SQLite has multiple types of transactions in various dimensions. There are read transactions and write transactions. Also, each transaction can be deferred, immediate or exclusive, and that defines how soon the locks are obtained. An important thing to note is that a statement is always executed in a transaction, there’s no such thing as running a statement outside of a transaction. Instead, when a user requests a single statement to be executed, this statement is wrapped in an implicit transaction.
Contrary to what other database systems may offer, transactions in SQLite can’t be nested. There exists a mechanism for rolling back to a previously saved point in a transaction, it’s called a “savepoint”. Then, instead of a full ROLLBACK, one can rollback to a predefined point inside a transaction and try executing it again.
Now we’re diving into more interesting bits. Historically, the only journaling implementation in SQLite was a rollback journal. The way it worked was as follows. During a write transaction, before any page is written to the main database file, it first gets copied to a rollback journal. Then, if the transaction commits, the journal can be deleted or ignored, and during rollback, pages from the journal can be copied back to the database file – they would simply overwrite whatever the transaction tried to write before. Later, a new journaling implementation was added – a write-ahead log. It uses an opposite idea – pages are never written directly to the main database file, but instead land into a new file – the write-ahead log. On rollback, the main database file is left intact, and pages are evicted from the log. On commit, nothing really happens immediately, because reads are performed both from the main database file and the log.
That’s neat, but there’s a price – the write-ahead log would keep growing if it’s never truncated. To remedy that, there’s a checkpoint operation which compacts the log back into the main database file. There are multiple types of checkpoints – you can make a partial checkpoint that doesn’t block readers, or a full checkpoint which does – but is in turn guaranteed to reclaim all of the storage space. That makes the flow a little more complicated, but checkpoints are generally done automatically, and their frequency can be configured.
A quite important feature of WAL journaling mode is concurrency. The original rollback journal forced the writer to take an exclusive lock, preventing any readers from reading the database. WAL, however, allows readers to coexist with a writer – because the main database file is never written to directly. The log only grows, with new pages getting appended to it, so readers can always remember a snapshot of the database by remembering a particular offset of the write-ahead log and never read past that offset. This is a huge benefit for concurrency, especially for read intensive workloads that only occasionally get a potentially long write transaction. An even more interesting feature, that isn’t merged upstream yet, is a new flavor of transaction – BEGIN CONCURRENT. That only works in WAL journaling mode and lets multiple writers continue in parallel, using optimistic locking. It basically means that once a concurrent write transaction wants to commit, it needs to check if the pages it touched were not modified by any other concurrent transaction – if they weren’t, it’s safe to commit. If they were, the transaction needs to roll back, unfortunately.
SQLite also has an interesting way of dealing with transaction conflicts – the decision is pushed to the user. Transactions don’t block – they return a special error code SQLITE_BUSY and give control back to the user. Alternatively, users can register their own callbacks which define how to react when it’s not possible to obtain a needed lock. It can be a simple busy timeout or any other kind of complex logic that users just implement themselves.
SQLite is quite opinionated in their documentation on what constitutes a great use case for SQLite, and what might not be the best idea. Still, it’s so tempting to try and distribute this architecture, that multiple projects already tried to accommodate the library for distributed systems. Rqlite and Dqlite tried with Raft, the consensus algorithm. The first one treats each SQL statement as a separate log entry and distributes that, and the second one distributes pages, by performing a few mild hacks to distribute the write-ahead log.
Then, there’s litestream, a solution for replicating the database in the background to S3-compatible storage. There’s also a flavor of it trying to leverage a userspace file system for the purpose. There’s also verneuil, another solution that replicates to S3 and allows having distributed read replicas. There’s also mvSQLite, which injects the power of FoundationDB, a distributed key-value store, to SQLite by replicating the pages, introducing multi-version concurrency control and optimistic locking. Finally, we’re working on schooled – distributed libSQL with multiple backends, including one that cooperates with mvSQLite, and with support for read replicas.
Finally, a few slides on actual libSQL! We’ve already made progress to add features that facilitate using libSQL in distributed systems, and already got valuable contributions from the community, as well as lots of feature requests and discussions.
One big thing introduced in libSQL is virtual write-ahead log. Similarly to how VFS is a virtual interface, we wanted to hook into the WAL journaling mode. Let’s see on a diagram what can be overridden when implementing a virtual WAL.
The write path is quite straightforward – when pages need to be stored, it’s done within a write transaction. Pages are appended to the write-ahead log, and then the transaction can be committed or rolled back. With virtual WAL, when pages are appended, we could do virtually anything with them – cache them, replicate them to remote storage, compress, encrypt. Overwriting the callbacks for starting and ending a write transaction allows the developers to implement their own locking mechanisms, including forms of distributed locks.
The read path is a little more complicated. Here’s one of the examples – a read transaction is started, and then the first call to WAL is a question whether page number N exists in WAL. If it does, WAL returns an identifier of a newest frame that holds data for this page. This identifier can later be used to access page contents. After that, a transaction ends.
If the page is *not* found in WAL, it’s instead read from the main database file. This opens a possibility for keeping the main database file as a small local cache – where some of recent pages are stored, while the rest of the pages are kept in remote storage, for instance something compatible with S3.
We’ve already started working on a few projects based on virtual WAL. One of them is a use case similar to litestream, which offers continuous replication of the database file and its write-ahead log to S3-compatible storage. We’re also planning to add distributed read replicas on top of that, also based on the same virtual WAL implementation. That feature is implemented as part of schooled, our distributed layer.
One of the concurrency schemes applied to distributed databases is optimistic locking. In this model, multiple writers are allowed to continue their operation in parallel, but before they commit the changes, a conflict check is performed. If a transaction can be committed without conflicting with any other concurrent write, it’s simply done. If there’s a potential conflict, a transaction fails and should either be retried or rolled back.
SQLite was not designed with optimistic locking in mind, at least not from the start, so there are some issues with it.
First of all, SQLite allocates database pages in a freelist implementation, which often needs to touch lots of pages just to update the metadata. That creates unnecessary contention and makes it more likely for concurrent transactions to fail. In order to prevent that, libSQL is going to allow overriding the allocation method for free pages, so that users can provide their own mechanisms with better characteristics.
Secondly, SQLite tables usually have a special ROWID column which acts as the primary key. That column can be explicitly omitted, but if it’s present, then it is a rather simple autoincrementing counter. That’s also a major source of contention, because if two parallel transactions want to add a new row, they are likely to get the same rowid number allocated, and that creates a conflict. In order to avoid hacky solutions, libSQL adds a new keyword – RANDOM ROWID – which allows users to create a table with rowids allocated in a pseudorandom fashion, making conflicts considerably more rare.
Finally, we also introduced dynamic user-defined functions to libSQL. SQLite also has user-defined functions, but they need to be coded via programmatic API’s, so they are not as portable and secure as a good old CREATE FUNCTION statement. We run the functions on WebAssembly, which is already becoming standard for embedded languages. This talk is way too short to include details about our Wasm integration, so if you’re curious, follow the link to our blog post, which contains more information.
There are more features we’re after – introducing multi-version concurrency control, exploring how we can introduce conflict-free replicated data types, so called CRDTs, provide official Rust bindings to make it easier to integrate with existing Rust projects, and countless other ideas.
libSQL is open-source and open-contribution, so I cordially invite everyone to join the community. All kinds of contributions are welcome.
That’s it, thanks everyone!