Building Next Generation Drivers: Optimizing Performance in Go and Rust

Piotr Grabowski (21 minutes)

Optimizing shard-aware drivers for ScyllaDB has taken multiple initiatives, often requiring a complete rewrite from scratch. Learn about the work undertaken to improve the performance of ScyllaDB drivers for both Go and Rust, plus how the Rust code base will be used as the core for drivers with other language bindings going forward. The session highlights performance increases obtained using techniques available in the respective programming languages, including outperforming Google's B-tree implementation with Go generics and using the asynchronous Tokio framework as the basis of a new Rust driver.


Video Transcript

Today we’ll talk about building the next generation of drivers, specifically the Go and the Rust drivers, and how we optimized performance in those drivers. A few words about me: I’m Piotr Grabowski, a software team leader at ScyllaDB. My team is responsible for all ScyllaDB drivers as well as the Kafka connectors, such as the Sink Connector and the CDC Source Connector. I have been at ScyllaDB for the past two and a half years.

Today we’ll take a short tour through the different drivers, focusing on small stories of how we optimized different components of those drivers and what we learned while doing so. First we’ll start with a short description of what a driver is and its responsibilities. Then we’ll move to the ScyllaDB Go Driver, our new Go driver that was developed by a group of students supervised by ScyllaDB. Next we’ll move to the ScyllaDB Rust Driver, our own Rust driver written from scratch that was meant to be used with ScyllaDB first, a project we started from the beginning. Finally, we’ll talk about an interesting new initiative to implement bindings to the ScyllaDB Rust Driver, so that its performant core can be used from other languages.

Let’s start with an introduction to what a driver is and its role. Drivers, as presented in this talk, are the libraries and frameworks that allow you to send queries to ScyllaDB from your language of choice. The primary protocol the drivers use is the CQL protocol. It’s TCP-based, and ScyllaDB supports the fourth version of this protocol. The protocol is based around frames and supports multiple streams, so multiple concurrent queries can be in flight at the same time over a single connection. Additionally, both queries and results can be sent in compressed form, reducing the amount of data sent over the network and the bandwidth needed. ScyllaDB drivers also support shard awareness: as you may know, a ScyllaDB node is divided into multiple shards, where a single shard corresponds to a physical core of the CPU, and the driver is smart enough to connect to a specific shard and to establish connections to each shard of a ScyllaDB node.

So when we think of a driver, what are its components, all the things a driver is responsible for? The first role is handling the protocol format: serialization and deserialization of the different CQL frames as well as ScyllaDB types. Next, the driver is responsible for maintaining metadata about the different tables and nodes. A big responsibility of the driver is routing each request to the correct node and, in the case of ScyllaDB drivers, routing the request to the correct shard. Additionally, the driver handles all the low-level networking. Moving from the low level to the high level, the driver lets you use convenient high-level constructs to build and execute queries. For example, we offer the gocqlx package, which provides many useful constructs for the Go language to use with gocql, and the Java driver offers a mapper interface that makes it easy to build and execute queries in an object-oriented fashion.

So how can the driver improve performance? First of all, there’s shard awareness: sending requests directly to the correct shard (sketched in code below). There’s the partitioner, the algorithm that determines which node to send a query to; ScyllaDB’s Change Data Capture (CDC) implements a custom partitioner, so the drivers had to be modified to take that into account. Another important optimization is the LWT optimization: ScyllaDB offers very powerful lightweight transaction operations that let you perform different operations atomically, and consistently sending such a query to a single replica helps avoid Paxos conflicts. The main focus of this presentation is optimizing all of the components shown on the previous slide, all the hot paths in the drivers, such as serialization and deserialization and the code that routes queries; this code is very important, as it runs on every query. There are also some lower-level optimizations, like avoiding unnecessary copies, allocations, and locks.
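To make shard awareness concrete, here is a minimal sketch of the kind of shard calculation a shard-aware driver performs: the signed Murmur3 token is biased into an unsigned range, and the shard is taken from the high bits of its product with the shard count. This is a simplified illustration, not the drivers' exact code; the real drivers negotiate the sharding parameters (the number of shards and the ignore-msb setting, omitted here) with each node during connection setup.

```rust
/// Simplified sketch of ScyllaDB-style shard selection. Real drivers also
/// apply the node's `ignore_msb` shift, which is omitted here.
fn shard_of(token: i64, nr_shards: u64) -> u64 {
    // Bias the signed token into the full unsigned 64-bit range.
    let biased = (token as u64).wrapping_add(1u64 << 63);
    // The high 64 bits of the 128-bit product spread tokens evenly
    // across shards.
    ((biased as u128 * nr_shards as u128) >> 64) as u64
}

fn main() {
    // Hypothetical token routed to one of 8 shards.
    let token: i64 = -4_611_686_018_427_387_904;
    println!("token {token} lives on shard {}", shard_of(token, 8));
}
```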

So let’s move on to the ScyllaDB Go Driver.

The ScyllaDB Go Driver is our new ScyllaDB driver for the Go language. The project had an interesting start: it was developed by a team of University of Warsaw students for their bachelor’s thesis, and it was a great fit for such a program, as the implementation of the Go driver was a bit of an experimental project. The main focus of the project was performance; this was practically its sole focus. When comparing with the existing Go driver, gocql, we saw that many mutexes were used, so avoiding mutexes was a big focus of the project, along with avoiding unnecessary allocations, because the Go garbage collector can reduce performance if lots of garbage is generated at runtime. Another interesting part was implementing a faster B-tree using the new Go generics feature. Starting with mutexes, here is a comparison of how many mutexes and lock calls are used throughout the code of gocql and the ScyllaDB Go Driver. In the ScyllaDB Go Driver only a single mutex is used in the entire project, and only a couple of calls to lock are performed, compared to gocql’s 37 mutexes. Looking at the project history, it turned out those mutexes were added gradually as the project matured; over all the years of developing gocql, they added up to this large number, and unfortunately many of them are called on the hot path of the driver’s execution.

When thinking about how to avoid this problem in our new driver, we were inspired by ScyllaDB’s shared-nothing architecture. This architecture means that each component, in ScyllaDB’s case each shard, has its own local data and avoids sharing data between components as much as possible. When we started the Go driver from scratch, we first of all had a better overall picture of the architecture, based on the existing gocql driver. Thanks to this, we were able to design clearer and more efficient communication between components, and the driver code heavily uses idiomatic channels and atomics, which make it possible to avoid mutexes. The result is that the driver uses only a single mutex, which is local to a TCP connection, and a nice side effect of not using mutexes is a reduced possibility of race conditions.
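The Go driver expresses this design with goroutines and channels. As a minimal sketch of the same shared-nothing idea, here is the pattern in Rust with Tokio: one task owns the connection state outright, and everything else talks to it over a channel, so no mutex is needed. The `Request` type and its fields are hypothetical, for illustration only.

```rust
use tokio::sync::mpsc;

// Hypothetical request handed to the connection's single writer task.
struct Request {
    stream_id: i16,
    body: Vec<u8>,
}

// The writer task owns all per-connection state, so no lock is needed:
// other components communicate with it exclusively through the channel.
async fn writer_task(mut rx: mpsc::Receiver<Request>) {
    let mut frames_written = 0u64; // task-local state, never shared
    while let Some(req) = rx.recv().await {
        // ... serialize `req` as a CQL frame and write it to the socket ...
        frames_written += 1;
        let _ = (req.stream_id, req.body, frames_written);
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1024);
    let writer = tokio::spawn(writer_task(rx));
    tx.send(Request { stream_id: 1, body: b"QUERY ...".to_vec() })
        .await
        .expect("writer task alive");
    drop(tx); // closing the channel lets the writer task finish
    writer.await.unwrap();
}
```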

The initial benchmarks of gocql showed that its main performance issue was the Go garbage collector, so while developing the ScyllaDB Go Driver we were very careful not to do any allocations that weren’t needed. The end result is that the driver performs five times fewer allocations compared to gocql. We had to meticulously design each component to avoid allocations; it wasn’t just a decision made in one piece of code. We used our own faster B-tree with generics, which reduced the allocation count, and an additional optimization was reusing the buffers owned by readers and writers. Our B-tree implementation used generics instead of interfaces; interfaces unfortunately have the problem of variables escaping to the heap, and our implementation yielded a 40% performance improvement compared to Google’s B-tree implementation. And actually, after the project, we saw that the maintainers of Google’s B-tree implemented a new version that uses generics.
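The escape-to-the-heap problem is specific to Go interfaces, but the underlying trade-off of static versus dynamic dispatch can be sketched in Rust as well: a generic function is monomorphized and takes its value without boxing, while a trait object forces a heap allocation, much like storing a Go value behind an interface. This is an analogy, not the Go driver’s actual code.

```rust
use std::fmt::Display;

/// Monomorphized generic: `T` is compiled into the function and the value
/// is passed without boxing, similar in spirit to the Go driver's
/// generics-based B-tree keys.
fn render_generic<T: Display>(v: T) -> String {
    v.to_string()
}

/// Dynamic dispatch through a trait object: the value must live behind a
/// heap allocation, analogous to a Go value escaping to the heap when it
/// is stored as an interface.
fn render_boxed(v: Box<dyn Display>) -> String {
    v.to_string()
}

fn main() {
    assert_eq!(render_generic(42), render_boxed(Box::new(42)));
}
```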

Another interesting thing we saw during development was that different drivers had different characteristics in the number of syscalls they perform; we were especially intrigued by the number of syscalls and packets sent over the network. Some of the drivers executed fewer send() or similar syscalls, but they sent larger buffers. From that we saw that instead of sending each CQL frame through a separate system call, it’s better to wait a bit, gather a larger pool of buffers, and send them in a single syscall, and this drastically reduced the runtime when using our driver. As an example, our driver batches many more queries into a single send() syscall compared to gocql, a difference of about five times.
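Here is a minimal sketch of the coalescing idea, assuming the frames are already serialized: queue several frames and flush them with one vectored write, rather than one syscall per frame. A real writer loop must also handle partial writes, which is omitted here.

```rust
use std::io::{IoSlice, Result, Write};

/// Flush a batch of already-serialized CQL frames with a single vectored
/// write instead of one write() syscall per frame. Partial-write handling
/// is omitted for brevity.
fn flush_frames<W: Write>(sock: &mut W, frames: &[Vec<u8>]) -> Result<usize> {
    let slices: Vec<IoSlice<'_>> = frames.iter().map(|f| IoSlice::new(f)).collect();
    sock.write_vectored(&slices)
}

fn main() -> Result<()> {
    let mut out: Vec<u8> = Vec::new(); // stand-in for a TCP socket
    let frames = vec![vec![0x04, 0x00], vec![0x00, 0x01], vec![0x07]];
    let written = flush_frames(&mut out, &frames)?;
    assert_eq!(written, 5);
    Ok(())
}
```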

We also ran some benchmarks of our driver. In one example of sending many insert queries, the ScyllaDB Go Driver was about three times faster compared to gocql.

Now let’s move on to the ScyllaDB Rust Driver. The idea for the driver was born during a hackathon in 2020, and over the last three years we have developed the driver far beyond the initial hackathon implementation. The driver uses the Tokio framework, a well-known asynchronous programming framework for Rust, and we now consider the driver feature-complete: it supports shard awareness, an asynchronous interface with support for high concurrency, compression, the CQL types, and much more.
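As a quick taste of that asynchronous interface, here is roughly what a minimal session looks like with the scylla crate. This reflects the API of its 0.x releases, so method names may differ in your version, and the keyspace and table are hypothetical.

```rust
use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect; shard awareness and per-shard connections are handled
    // internally by the driver.
    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .build()
        .await?;

    // Hypothetical keyspace/table; values are bound from a Rust tuple.
    session
        .query("INSERT INTO ks.t (pk, v) VALUES (?, ?)", (1i32, "hello"))
        .await?;
    Ok(())
}
```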

One interesting issue we saw during development was actually raised by a member of the community, the author of latte, a benchmarking tool for ScyllaDB and Cassandra. The author tried to use the ScyllaDB Rust Driver, but he saw that the driver had some problems with high concurrency of requests. We looked into the problem in depth and identified that the root cause was the implementation of FuturesUnordered, a utility that gathers many futures and waits for them all to finish (a sketch of this usage pattern appears at the end of this section). We saw that in some cases, due to Tokio’s cooperative scheduling, FuturesUnordered had to iterate over all the futures it contained many times: for example, if there were 100 futures in a FuturesUnordered, we would go through this list many times. A fix was merged upstream that limits the number of futures iterated in a single poll operation, so we don’t have to look through the whole array every time it is polled.

There are also some ongoing efforts in the ScyllaDB Rust Driver. The first thing we are actively working on is rack-aware load balancing. The problem when running the driver in a cloud environment is that you want your nodes to be in different availability zones, but traffic between zones can incur additional latency and additional cost. We are implementing changes for the driver to pick the nearest rack, the nearest availability zone, so that latency improves and cost drops, since traffic between nodes within a single availability zone is free.

The second optimization we are working on is iterator-based deserialization. Currently, when the driver receives the result of, for example, a SELECT query, it tries to deserialize all the received data into a single Vec of rows. This is problematic because we have to allocate memory for each row in this data: if the result is 10,000 rows long, we have to perform 10,000 allocations. The solution we are working on skips materializing all the rows into a single vector, instead offering an iterator that lets you deserialize the data received from ScyllaDB on the fly.

The third ongoing effort, which is quite a big one, is refactoring our load balancing code. The load balancing code is responsible for determining which node to send a query to based on the partition key, and we thought that by re-implementing this functionality we could reduce the number of allocations and atomic operations performed while building the query plan. This problem was especially visible on the happy path, the path most often taken while executing a query.
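Returning to the FuturesUnordered issue described above, this is a minimal sketch of the usage pattern that exercised it: push a large number of futures into the collection and drain them as they complete. The future bodies here are stand-ins, not actual driver calls.

```rust
use futures::stream::{FuturesUnordered, StreamExt};

#[tokio::main]
async fn main() {
    let mut in_flight = FuturesUnordered::new();

    // Queue many concurrent "requests" (stand-ins for driver queries).
    for i in 0..10_000u64 {
        in_flight.push(async move {
            tokio::task::yield_now().await; // simulate awaiting a response
            i
        });
    }

    // Drain completions in whatever order they finish. Before the
    // upstream fix, each wakeup could re-poll far more entries than
    // necessary under cooperative scheduling.
    let mut completed = 0u64;
    while let Some(_result) = in_flight.next().await {
        completed += 1;
    }
    assert_eq!(completed, 10_000);
}
```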

Now, moving on to the bindings to the ScyllaDB Rust Driver: this is an interesting new effort we are working on. While benchmarking the ScyllaDB Rust Driver, we saw that its performance was really great, so we had an idea: why not develop a way to use this great driver from other languages? In particular, we saw that the Rust driver was faster than the C++ driver, and both languages are compiled, so it should be fairly easy to call from C++ into Rust. We also saw other benefits of such an approach: high performance, easier maintenance of the solution, and fewer bugs, because we only have to fix things in a single core driver. So we started development for the C++ language. The C++ bindings we have developed use the same API as the original C++ driver, so it’s a drop-in replacement, and the bindings call into the Rust driver. The resulting project is very small compared to the original C++ driver, as it is only a layer between C++ and the Rust driver, and we saw that the solution has better stability and fewer problems compared to the original C++ driver. Thank you for attending the presentation, and I hope you’ll try out our driver projects.
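To sketch how such bindings work at the boundary, the Rust core exposes C ABI entry points that a thin C++ shim wraps in the original driver’s API. The function name and signature below are hypothetical, not the project’s actual symbols.

```rust
use std::ffi::{c_char, CStr, CString};

// Hypothetical C ABI entry point; a real binding layer returns handles to
// driver objects and forwards errors as status codes.
#[no_mangle]
pub extern "C" fn driver_connect(contact_point: *const c_char) -> i32 {
    // Safety: the C++ caller must pass a valid NUL-terminated string.
    let node = unsafe { CStr::from_ptr(contact_point) }.to_string_lossy();
    // ... hand `node` to the Tokio runtime that owns the real session ...
    println!("connecting to {node}");
    0 // status code crossing the FFI boundary
}

fn main() {
    // Exercising the entry point from Rust for demonstration purposes.
    let cp = CString::new("127.0.0.1").unwrap();
    assert_eq!(driver_connect(cp.as_ptr()), 0);
}
```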
