Building a 100% ScyllaDB Shard-Aware Application Using Rust

Yassir BarchiAlexys JacobJoseph Perez20 miuntesFebruary 15, 2023

At Numberly we designed an entire data processing application on ScyllaDB's low-level internal sharding using Rust.

Starting from what seemed like a crazy idea, our application design actually delivers amazing strengths like idempotence, distributed and predictable data processing with infinite scalability thanks to ScyllaDB.

Having ScyllaDB as our only backend, we managed to reduce operational costs while benefiting from core architectural paradigms like:

  • predictable data distribution and processing capacity
  • idempotence by leveraging deterministic data sharding
  • optimized data manipulation using consistent shard-aware partition keys
  • virtually infinite scaling along ScyllaDB

This talk will walk you through this amazing experience. We will share our thought process, the roadblocks we overcame, and the numerous contributions we made to ScyllaDB to reach our goal in production.

Guaranteed 100% made with love in Paris using ScyllaDB and Rust!

Share this

Video Slides

Video Transcript

Thanks for your interest in this presentation where I will present why and how we built that 100 ScyllaDB Shadow air application using Rust. I’m Alexys CTO of Numberly and if you ever attended the ScyllaDB Submit before you might remember me almost probably my French accent today I’m delighted to an honored to share the stage this year with two wonderful colleagues of mine Joseph and Yasir.

before Yasir tells you how it works for real and Joseph shares our learnings and future plans let me introduce you to how we got there in the first place at normally the omnichannel delivery team has the ownership of all the types of messages we support and operate for our clients from the well-known and established email to the steel emerging RCS without forgetting the Ott platforms such as WhatsApp the team recently got the chance to build the platform to rule them all with the goal of streamlining how all our components send and track messages whatever therefore the general logic is as follows clients or programmatic platforms send messages or batch of the messages using rest API gateways which are responsible for validating and rendering the message payload then those gateways will all converge towards the central message routing platform which will Implement full feature scheduling accounting tracing and of course routing of the messages using the right platform or operator connectors as of today our architecture has this General pipeline dedicated to each message Channel which is a burden to keep up to date and most importantly makes it hard to bring feature parity amongst all of them Here Comes our Central messaging routine platform taking care of everything with an omnichannel attitude now putting all your eggs in one basket is always risky right making this kind of move puts a lot of constraints to our applied form requirements it has to be very reliable first as being highly available and resilient second has been able to scale fast to match the growth of one or multiple channels at once High availability and scale rather easy Parts when compared to our observability and idempotence requirements when you imagine all your messages going through a single place the ability to trace what happens to every single one of them or group of them becomes a real challenge even worse one of the greatest challenge I was there even more in a distributed system is the idempotence guarantee that will lacked so far on the other Pipelines guaranteeing that the message cannot be sent twice is easier said than done at this point I’m gonna work you through our design process we split up our objectives into three main Concepts that we seek to strictly respect the first one is reliability we wanted to design something that is simple with a lot with something like shared almost nothing architecture components then we strive for low coupling keeping remote dependency to its minimum then it was about the coding language we needed a performance with explicit pattern and strict power headings language then comes the scale on the application layer we wanted to rely on something that makes it easy to deploy and scale with a lot of resilience obviously the database has a lot of things entitled to high throughput highly resilient horizontally scalable time and Order pressure being obviously capabilities message boost that’s a lot on on the database side on the data querying as well we wanted low latency for one or many query support last but not least idempotence we wanted a strong processing isolation our workload distribution should be deterministic to scale and be efficient from the start to the end when you try to combine all those Concepts into an architecture happen early you could end up with something like this you would deploy your application on kubernetes and then you would use maybe Kafka as a message bus to transfer the message to your central message routing platform which in turn will might be might be using Kafka to hold some kind of ordered cube of the message it has to process then to keep on the state and all the things that you need to be able to store and retrieve we would use ScyllaDB and since you want this to be efficient everywhere you would maybe need radius as a hot cache all right but this apparently simple go-to architecture as caveats that breaks too much of the concepts that we promise to stick with let’s look let’s look at the reliability one tree that architectures and Technologies means three factors to design reliability upon each of them could fail for different reasons that our platform logic should handle on the scalability side we can immediately see that we are lucky to have a data Tech to match each scalability constraint but the combination of the tree does not match reliability and it importance this this adds too much complexity and point of failures to be efficiently implemented eat importance as expected becomes as well a nightmare when you imagine achieving it on such a complex ecosystem so we decided to be bold and make a big statement we’ll only use one data technology and all everything together with it ScyllaDB was the best widget to face the challenge it’s highly available of course it scales amazingly it offers ridiculous fast queries for both single and range queries which means that it can also be thought as a distributed cache efficiently replacing redis now replacing Kafka as an ordered database is not so trivial using ScyllaDB doable somehow the big the biggest piece of the cake that we had to tackle was how can we get a deterministic workload distribution if possible for free that’s where I got what turned out to be not so crazy idea after all what if I used ScyllaDB’s sharper Co architecture inside my own application for those not familiar with the amazing sharper architecture of ScyllaDB I’ll try my best to make it clear avian door please forgive me if I say something wrong so the main idea is that the partition key of your data table design determines not only which node is responsible for copy of the data but also which CPU core IO scheduler gets to under its processing you got it right ScyllaDB distributes the data in a deterministic fashion down to a single CPU core so my naive idea was to distribute our own messaging platform processing using the exact same logic of cldb the expected effect would be to actually align ScyllaDBs perco processing with our own applications and benefit from all the latency scaling reliability that comes with it that’s how we effectively created 100 percent Shard aware application and if you set aside the hard work it took to make it happen it brings amazing properties on the table deterministic workload distribution super optimized data processing capacity a line from the application down to the storage layer strong latency and isolation guarantees per application instance and of course infinite scale following ScyllaDB’s scale yes yeah we have an enthusiastic crowd eager to know how we build that 100 Shadow application can you work them through that please of course Alexi so now that we got our action actor inspiration it was time to answer the Perpetual question which language should we use well we knew that we need a modern language who should be available secure safe and efficient to build our platform but also we knew that our Chardon algorithm will need and requires a performance and blessing performance for hashing and should have a good Synergy with the ScyllaDB driver so that said it was no longer a dilemma the answer was so obvious for us rust will fill our needs Beth okay between you and me the truth is that also Alexi was already sold to silly and to rescue at that point our stock was born so let’s get into the platform internals and how we achieve our objectives with crdb in our abstract architecture in common messages are handled by a component that we called the industry for each message we receive after the usual verifications and validations we calculate the star to which the message belongs and will be stored in crdb more exactly we compute a partition key that matches the city’s storage replica nodes and the CPU core for our messages partition key effectively align our applications processing with cldb CPU code once this partition key is calculated too much csdb storage layer we persist this message with all his data in the message table and at the same time as its metadata to the table named buffer with the calculating partition key the buffer table is a corner store in our distribution design the partition key itself is a Chapel of the channel and the short number this allows us to use a dedicated worker per Channel type and chart copper the clustering key is a Time solve of the to preserve the insertion time order this time stop is the current time stop as calculated by ScyllaDB this is the important detail that I’ll get back to it later now the data is stored in crdb let’s talk about the second component which is call the scheduler schedulers will simply consume the ordered data from the buffer table and effectively proceed with the message routing logic following the Chart 2 component architecture a scheduler will exclusively consume the messages of the specific chart just like a CPU core is assigned to slice of cldb data a scheduler will fetch slice of the data that is noted it is responsible for from the buffer table this is the crucial part of our flow we will see it in details in the next slides at this point scheduler will have the IDS of all the messages it should process then it fetch the message details from the message table the scheduler then process and send the messages to the right channel is responsible for as you can see each component of the platform is responsible of just a slice of the messages per Channel by leveraging oncilla DB’s amazing Hardware algorithm Alexi told you earlier that replacing Kafka as a order database is not so trivial using ScyllaDBs GB but was surely doable this is how we are doing it so let’s get a deep view on how it works from the scheduler component perspective as I said before we store the messages metadata as a Time series in the buffer table ordered by their time of the interest let me remember that this time of the interesting is the current timestamp calculated by ScyllaDB each scheduler keeps the timestamp offset at for the last message it successfully proceed this offset is stored in dedicated table so when scheduler starts it’s fetch the time stop offsets of the charts of the data is assigned group so now a scheduler to is simply an infinite Loop featuring the messages it’s assigned to within a certain configurable time window in fact schedule will not fetch data strictly starting from the last time stop offset but instead from an oldest time stop it does mean that message maybe some messages and for sure some messages will be fetched multiple times but this is handled by our Aidan potency business Logics and optimizes memory cache overlapping this previous time launch allow us to prevent any possible and handled messages caused maybe potential right latency or times Q between nodes well reaching our goal was not easy we failed it many times but finally made it so let’s take a look back at our achievement and perspective with Joseph thank you yes yeah so it works pretty well at this time but of course it took some iteration with their success and failures and of course we learned a lot so first thing we want to emphasize is that load testing is more than useful quickly enough during the development we set up load tests sending dozens of thousands messages per second our goal was to test our data schema design at scale and in the importance guarantee it allowed us to spot some multiple issues sometimes non-trivial like when the execution delay between the statement of our insertion batch was greater than our fetch time window debug by the way our first workload was a naive insert and delete and load testing met large partitions appear very fast hopefully we also learned about compaction strategies and especially time Windows compaction strategy which we are using now and which allowed us to get rid of large partition issues to make this project possible we contributed to the ScyllaDB ecosystem especially especially to the rust driver with a few issues and pull requests for example we added the code to compute the replica nodes of a primary key as we needed it to compute the shot of a message we hope it will help you following this session if you want to use this cool chatting pattern in the future we also discovered some cdb bugs so of course we worked with the cldbc port to have them fixed again thank you very much cldbc port for your activity and your amazing work as in all system everything is not perfect and we have some points we wish we could do better is not a message queuing platform and we miss Kafka long pulling currently our architecture does regular fetching of each shot buffer so that’s a lot of useless bandwidth consumed but we are working on optimizing this also we encountered some memory issues where we did suspect the syllabus driver we didn’t take so much time to investigate but it made us dig into the driver code well we spotted a lot of memory allocations as a side project we started to think about some optimization actually we did more than things because we want a world prototype of an allocation free driver almost we will maybe make it the subject of a feature blog post with a rest driver out there for me back the go driver yeah makes the ScyllaDB rush driver great again so we bet on crdb and that’s a good thing because it has a lot of other features that we want to benefit from for example change data capture using the CDC Kafka Source connector we can stream our message events to the rest of the infrastructure without touching applicative code observability made easy and crdb advertised about strongly consistent tables with raft as an alternative to LWT currently while using LWT in a few places especially for dynamic shot workload a duration so we can’t wait to test this feature so thank you for attending this session we hope you had a nice and inspiring time with us and we will be happy to have a chat with you in the speaker Lounge so see you there bye [Applause]

Read More