Scylla is a new approach to NoSQL data store design, optimized for modern hardware. The typical design of NoSQL data stores (left) consists of a JVM that runs on top of Linux, relies on the page cache, and uses complex memory allocation strategies to “trick” the JVM garbage collector into avoiding stop-the-world pauses. Such a design suffers from sudden latency hiccups, expensive locking, and low throughput due to low processor utilization.
The Scylla design (right) is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU, and multi-queue NIC. We can easily reach 1 million CQL operations per second on a single commodity server. In addition, Scylla targets consistent low latency, under 1 ms, for inserts, deletes, and reads.
Development: Performance improvements enable not just a reduction in hardware resources, but also a streamlined development process, with less architecture, debugging, and devops labor. Scylla gives you the power to build a data model that fits the application, rather than twisting your data model to fit your NoSQL system.
Complexity slows down development. Scylla’s far more efficient design allows architects to eliminate:
- Extra tables
- External caches
- Complex OS-level configuration
Hardware on which modern workloads must run is remarkably different from the hardware on which current programming paradigms depend, and for which current software infrastructure is designed.
Scylla uses the Seastar framework for extreme performance on multicore hardware.
Core counts grow, clock speeds stay constant
Clock speeds of individual cores have stopped increasing. Because core counts keep growing instead, performance now depends on coordination across multiple cores, no longer on the throughput of a single core.
On new hardware, the performance of standard workloads depends more on locking and coordination across cores than on performance of an individual core. Software architects face two unattractive alternatives: coarse-grained locking, which will see application threads contend for control of the data and wait instead of producing useful work, and fine-grained locking, which, in addition to being hard to program and debug, sees significant overhead even when no contention occurs, due to the locking primitives themselves.
Meanwhile, I/O continues to increase in speed
The networking and storage devices available on a modern system have also continued to grow in speed. However, although processors have gained more cores, the ability to process packets on any one core has not.
A 2 GHz processor handling 1024-byte packets at wire speed on a 10 Gbps network has only 1670 clock cycles per packet. (source: Intel DPDK Overview)
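The cycle budget can be reproduced with a short calculation. The sketch below is illustrative only: the function name and parameters are made up for this example, and it assumes each 1024-byte packet also occupies 20 bytes of Ethernet preamble and inter-frame gap on a 10 Gbps link. That overhead is an assumption, chosen because it matches the quoted figure; the text itself does not state it.

```python
def cycles_per_packet(clock_hz, link_bps, packet_bytes, wire_overhead_bytes=20):
    """Clock cycles one core can spend per packet at full line rate.

    wire_overhead_bytes is an assumed per-packet cost (preamble plus
    inter-frame gap); set it to 0 to ignore framing entirely.
    """
    bits_on_wire = (packet_bytes + wire_overhead_bytes) * 8
    # packets per second = link_bps / bits_on_wire, so
    # cycles per packet  = clock_hz / packets per second
    return clock_hz * bits_on_wire / link_bps


budget = cycles_per_packet(2_000_000_000, 10_000_000_000, 1024)
print(round(budget))  # 1670
```

Under the same assumptions, a 64-byte packet leaves only about 134 cycles, which is why per-packet locking overhead becomes so visible at small packet sizes.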
Software design approaches that were valid and safe when many clock cycles were available per packet are no longer sustainable.
The Seastar model: shared-nothing
Because sharing of information across cores requires costly locking, Scylla uses a shared-nothing model that shards all requests onto individual cores.
Scylla runs one application thread per core, and depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces.
Any sharing of resources across cores must be handled explicitly. For example, when two requests are part of the same session, and two CPUs each get a request that depends on the same session state, one CPU must explicitly forward the request to the other. Either CPU may handle either response. The Seastar framework provides facilities that limit the need for cross-core communication, but when communication is unavoidable, it provides high-performance, non-blocking communication primitives so that performance does not degrade.
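The session example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Scylla or Seastar code: the `Core` class, `owner_of`, and the CRC32-modulo ownership rule are all hypothetical, and a direct method call stands in for the explicit message that a real system would pass between cores.

```python
import zlib
from collections import defaultdict


def owner_of(session_id, core_count):
    """Map a session to exactly one core (hypothetical rule: CRC32 modulo)."""
    return zlib.crc32(session_id.encode("utf-8")) % core_count


class Core:
    def __init__(self, core_id, cores):
        self.core_id = core_id
        self.cores = cores                  # used only to forward requests
        self.sessions = defaultdict(list)   # private state, never shared
        self.forwarded = 0

    def handle(self, session_id, payload):
        owner = owner_of(session_id, len(self.cores))
        if owner != self.core_id:
            # Not ours: forward explicitly instead of touching shared state.
            self.forwarded += 1
            return self.cores[owner].handle(session_id, payload)
        self.sessions[session_id].append(payload)
        return self.core_id


def make_cores(count):
    cores = []
    for core_id in range(count):
        cores.append(Core(core_id, cores))
    return cores


cores = make_cores(4)
first = cores[0].handle("session-42", "req-1")   # arrives on core 0
second = cores[1].handle("session-42", "req-2")  # arrives on core 1
print(first == second)  # True: both requests ended up on the owning core
```

Because only the owning core ever reads or writes a session's state, no lock is needed around it; the cost is moved into the forwarding step.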
Lockless inter-core communication
Threaded applications require inherently expensive locking operations, while the Seastar model can completely avoid locks for cross-CPU communications.
From the programmer’s point of view, Seastar uses futures, promises, and continuations (f/p/c). Where conventional event-driven programming using epoll and userspace libraries such as libevent has made it very difficult to write complex applications, f/p/c makes it easier to write complex asynchronous code.
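To give a feel for the style, here is a minimal sketch of chaining continuations onto a future. It uses Python's standard `concurrent.futures.Future`, not Seastar's C++ API, and the `then` helper is a hypothetical name introduced for this example.

```python
from concurrent.futures import Future


def then(future, continuation):
    """Return a new future holding continuation(result of `future`)."""
    chained = Future()

    def run(done):
        try:
            chained.set_result(continuation(done.result()))
        except Exception as exc:
            # A failure anywhere in the chain is passed on to the caller.
            chained.set_exception(exc)

    future.add_done_callback(run)
    return chained


promise = Future()  # the producer keeps this and fulfills it later
value = then(promise, lambda row: row["value"])
doubled = then(value, lambda v: v * 2)

promise.set_result({"value": 21})  # fulfilling the promise runs the chain
print(doubled.result())  # 42
```

The code reads in the order the work happens, even though nothing blocks while waiting for the promise to be fulfilled.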
For example, the following interaction between a sender core, C0, and a receiver core, C1, can take place with no locking required.
- C0: sender -> wait for queue entry (usually immediate) -> enqueue request, allocate promise.
- C1: dequeue request; execute it -> move result to request object -> enqueue request on response queue
- C0: dequeue request; extract response, use it to fulfill promise; destroy request
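The three steps above can be sketched as a single-threaded simulation. This only illustrates the order of operations: the names are hypothetical, `collections.deque` stands in for the per-pair pointer queues, the queues are unbounded (so the "wait for queue entry" step is omitted), and the sketch makes no claim about how Seastar implements its queues.

```python
from collections import deque
from concurrent.futures import Future
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Request:
    work: Callable[[], Any]  # what C1 should execute
    promise: Future          # fulfilled later, back on C0
    result: Any = None


request_queue = deque()   # C0 -> C1
response_queue = deque()  # C1 -> C0


def c0_submit(work):
    """C0: allocate a promise and enqueue the request."""
    request = Request(work=work, promise=Future())
    request_queue.append(request)
    return request.promise


def c1_poll():
    """C1: dequeue, execute, store the result, enqueue on the response queue."""
    while request_queue:
        request = request_queue.popleft()
        request.result = request.work()
        response_queue.append(request)


def c0_poll():
    """C0: dequeue the response and use it to fulfill the promise."""
    while response_queue:
        request = response_queue.popleft()
        request.promise.set_result(request.result)
        # The request object is dropped here, on the core that created it.


future = c0_submit(lambda: 2 + 3)
c1_poll()
c0_poll()
print(future.result())  # 5
```

Each queue has exactly one producer and one consumer, which is the property that lets a real implementation avoid locks.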
Each actual queue, one for requests and a return queue for fulfilled requests, is a simple queue of pointers.
There is one request queue and one return queue per pair of CPU cores on the system. Because a core does not pair with itself, a 16-core system will have 240 request queues and 240 return queues.
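The count follows from treating each (sender, receiver) pair as ordered: N cores give N × (N − 1) pairs. A short check, with a hypothetical helper name:

```python
from itertools import permutations


def queue_counts(core_count):
    """Return (request_queues, return_queues) for ordered pairs of distinct cores."""
    pairs = list(permutations(range(core_count), 2))
    return len(pairs), len(pairs)


print(queue_counts(16))  # (240, 240)
```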
The result is a unique NoSQL data store that can scale to large numbers of cores per node with minimal overhead.