Dedicated fast network stack for modern hardware
ScyllaDB networking is designed to squeeze the most out of modern hardware. As with every other Scylla component, we optimize aggressively: the CPU, memory management, and, in the same spirit, the network.
Scylla supports two different networking modes—our own native Seastar network stack and the traditional Linux stack.
Seastar’s highly optimized network stack
Seastar networking is a full stack which runs in userspace. It utilizes Intel’s DPDK driver framework and has a TCP/IP stack implemented on top. Like the rest of the database, the network stack itself is transparently sharded, so an independent TCP/IP instance runs on each core. This mode provides low-latency, high-throughput networking. No system calls are required for communication, and no data copying ever occurs. This is the preferred choice for best performance. Seastar’s DPDK mode works on physical machines, virtual machines (either with device assignment or with paravirtual devices) and containers.
Linux standard socket APIs
In this mode ScyllaDB consumes the ordinary Linux networking APIs. Scylla preserves this mode for ease of application development and trivial deployment. Users can switch between the two modes and pick whichever works best, whether for first-time usage or a production installation.
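As an illustration, mode selection in Seastar-based applications is typically a startup option. The exact flag names below are taken from Seastar's option set and should be treated as assumptions; check `scylla --help` on your build.

```shell
# Default: POSIX sockets via the Linux kernel stack.
scylla --network-stack posix

# Native Seastar TCP/IP stack running on DPDK
# (flag names are assumptions; verify against your build).
scylla --network-stack native --dpdk-pmd
```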
Why a new network stack?
While conventional networking on Linux is remarkably full-featured, mature, and performant, it has some limitations for extremely networking-intensive applications at scale:
Separation of the network stack into kernel space means that costly context switches are needed to perform network operations, and that data copies must be performed to transfer data from kernel buffers to user buffers and vice versa.
Linux is a time-sharing system, and so must rely on slow, expensive interrupts to notify the kernel that there are new packets to be processed.
Linux is heavily threaded, so all data structures are protected with locks. While a huge effort has made Linux quite scalable, contention still occurs at large core counts. Even without contention, the locking primitives themselves are relatively slow and impact networking performance.
Bypassing the Linux stack wasn’t a NIH (not-invented-here) decision. Although our developer team is well experienced with OS development, we wanted to focus on the database. However, early in Seastar development, we noticed that our sample Memcache application spent most of its time in the kernel. We couldn’t resist putting Seastar’s sharding and reactor design into action, and we quickly coded our very own TCP/IP stack. As the Memcached benchmark shows, Seastar’s native TCP performs remarkably well compared with Linux: Memcached Benchmark.
DPDK, the user-space network toolkit, is designed specifically for fast packet processing, usually in less than 80 CPU cycles per packet. It integrates seamlessly with Linux in order to take advantage of high-performance hardware. Traditionally, L2/L3 software appliances use DPDK as a means to replace custom hardware with off-the-shelf x86 servers. To the best of our knowledge, ScyllaDB and Seastar are the first products to integrate a higher-layer application with DPDK kernel bypass. The Scylla team intends to leverage this unique property and apply it to areas well beyond the core database.