The typical design of NoSQL data stores consists of a JVM which runs on top of Linux, utilizes the page cache, and uses complex memory allocation strategies to “trick” the JVM garbage collector to avoid stop-the-world pauses. Such a design suffers from sudden latency hiccups, expensive locking, and low throughput due to low processor utilization.
The Scylla design is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. We can easily reach one million CQL operations on a single commodity server. In addition, Scylla targets consistent low latency, under one millisecond, for inserts, deletes, and reads.
Performance improvements enable not just reduction in hardware resources, but also a streamlined development process for architecture, debugging, and DevOps labor. Scylla gives you the power to build a data model that fits the application, rather than twisting your data model to fit your NoSQL system.
Complexity slows down development time. Scylla’s efficient design means architects eliminate extra tables, external caches, and complex OS-level configuration.
The hardware that modern workloads run on differs from both the hardware that programming paradigms depend on and the hardware on which current software infrastructure is designed.
Scylla uses the Seastar framework for extreme performance on multi-core hardware.
With the recent plateau of clock speeds, CPU manufacturers can no longer rely on increasing those clock speeds to drive their performance gains. Instead, modern chips are driving performance improvements by increasing the number of cores per CPU. With this change, real-world application performance has become less about efficiency and throughput on a single core, and more about how the software coordinates across multiple cores.
On new hardware, the performance of standard workloads depends on locking and coordination across cores rather than on performance of an individual core. Software architects face two unattractive alternatives: coarse-grained locking, which will see application threads contend for control of the data and wait instead of producing useful work, and fine-grained locking, which, in addition to being hard to program and debug, sees significant overhead even when no contention occurs, due to the locking primitives themselves.
The networking and storage devices available on a modern system have also continued to grow in speed. However, although processors have gained more cores, the ability to process packets on any one core has not.
A 2GHz processor handling 1024-byte packets at wire speed on a 10GBps network has only 1670 clock cycles per packet. (source: Intel DPDK Overview)
Software design approaches that were valid and safe when many clock cycles were available per packet are no longer sustainable.
Because sharing of information across cores requires costly locking, Scylla uses a shared-nothing model that shards all requests onto individual cores.
Scylla runs one application thread-per-core, and depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces.
Any sharing of resources across cores must be handled explicitly. For example, when two requests are part of the same session, and two CPUs each get a request that depends on the same session state, one CPU must explicitly forward the request to the other. Either CPU may handle either response. The Seastar framework provides facilities that limit the need for cross-core communication, but when communication is inevitable, it provides high performance non-blocking communication primitives to ensure performance is not degraded.
Threaded applications require inherently expensive locking operations, while the Seastar model completely avoids locks for cross-CPU communications.
From the programmer’s point of view, Seastar uses futures, promises, and continuations (f/p/c). Where conventional event-driven programming using epoll and userspace libraries such as libevent has made it very difficult to write complex applications, f/p/c makes it easier to write complex asynchronous code.
For example, the following interaction between a sender core, C0, and a receiver core, C1, can take place with no locking required.
Each actual queue, one for requests and a return queue for fulfilled requests, is a simple queue of pointers.
There is one request queue and one return queue per pair of CPU cores on the system. Because a core does not pair with itself, a 16-core system will have 240 request queues and 240 return queues.
The result is a unique NoSQL data store that can scale to large numbers of cores per node with minimal overhead.