Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline

20 minutes


Enter your email to watch this video and access the slide deck from the ScyllaDB Summit 2022 livestream. You’ll also get access to all available recordings and slides.

In This NoSQL Presentation

Numberly operates business-critical data pipelines and applications where failure and latency mean "lost money" in the best case scenario. Most of those data pipelines and applications are deployed on Kubernetes and rely on Kafka and ScyllaDB, where Kafka acts as the message bus and ScyllaDB as the source of some data enrichment. The availability and latency of both systems are thus very important to us because they mix and match data in the early stage of our pipelines to be consumed by our platforms. Most of our applications are developed using Python. But we always felt that we could benefit from a lower-level programming language to squeeze the performance of our hardware even further for some of the most demanding applications. So, when an important part of our data pipeline was to be adjusted to reflect some important changes in our platforms, we thought it was a great opportunity to rewrite it in Rust! Moving to Rust was hard, not only because of the language itself, but because being at a lower level allowed us to see, test, and demonstrate things that we could not pinpoint or explain that well using Python. We spent a lot of time analyzing the latency impacts of code patterns and client driver settings and ended up contributing to Apache Avro as we went down the rabbit hole. This session will share our experience transitioning from Python to Rust while meeting the expectations of a business-critical application mixing data from Confluent Kafka and ScyllaDB. There will be code snippets, graphs, numbers, tears, pull requests, grins, latency results, smiles, rants of frustration, and a lot of fun!

Alexys Jacob, CTO, Numberly

Alexys is CTO at Numberly. He is an open-source contributor, a Gentoo Linux developer, and a PSF contributing member. He enjoys sharing his experience on architecture design, distributed systems, fault tolerance, and scaling Python.

Video Transcript

Hello, everyone, and thank you for attending this presentation, where I’ll share my experience of learning Rust the hard way for a Kafka and ScyllaDB data pipeline. Let me first introduce myself. As you can judge by my accent, I’m French and I live in Paris. If you have attended a ScyllaDB Summit before, there’s a chance that you recognize me. I’ve been around in the community for a while now.

I’ve contributed to open-source ScyllaDB, and I contribute to a variety of other open-source projects too, Apache data-related projects among them. I’m a tech speaker and writer. I’m also involved in the Gentoo Linux community as a developer, and I’m a Python Software Foundation contributing member, which means that I spend a fair amount of my free time working on or helping with various Python projects.

Here is what your next 20 minutes look like, if you decide to stay. I will start with the thought process that made me move from Python to Rust. Then I will share my experience of learning Rust the hard way, and what I mean by the hard way. And I will conclude on whether it was worth it, and share some production numbers with you as well. So let’s start.

The fact that I could even consider coding in a language other than Python was shocking to some of my colleagues and relatives, so let me take some time to go through the opportunity that offered itself to me and why I chose to go wild. If you have attended a ScyllaDB Summit before, you may remember that Numberly is a digital data marketing expert helping brands connect with their customers using all digital channels available. As a data company, we operate on a lot of fast-moving data. This stream-driven approach drives our technological choices toward platforms that allow us to process and react to streams as close to real time as possible. As such, we combine Kafka and ScyllaDB superpowers extensively, thanks to streams and specialized pipeline applications that we call data processors. Each of those data processor applications prepares and enriches the incoming data so that it is useful to the downstream business partner or client applications.
A data-driven decision is most relevant when it is made close to the event’s time of occurrence, which means that availability and latency are business critical to us. Latency and resilience are the pillars upon which we have to build our platforms to make our business reliable in the eyes of our clients and partners. Those data processor apps, Kafka and, of course, ScyllaDB itself can’t fail. If they do, we get angry partners and clients that look like this, and nobody wants to deal with an angry client.

The data industry and its ecosystems are always changing. Last fall we had to adapt three of the most demanding data processors written in Python. Those processor applications had been doing the job for more than 5 years. They were battle-tested and trustworthy. We knew them by heart. But I had also been following Rust’s maturation for a while, and I was curious, and it always felt less intimidating to me than C++. Sorry, Avi. So when this opportunity came, I went to my colleagues and told them something that looked like this: “Hey, why not rewrite those three Python applications that we know work very well as one Rust application, in a language we don’t even know, okay?” I must admit that I lost my CTO badge for a few seconds when I saw their faces.

So this statement deserves a rationale rather than just a crazy stance. Rust makes promises that some people seem to agree with. It’s intriguing. It’s supposed to be secure, easy to deploy, make few or no compromises, and it also plays well with Python. But furthermore, their marketing motto speaks to the marketer inside me: “A language empowering everyone to build reliable and efficient software.” Wow. Everyone? That’s me. Reliable and efficient software? Yeah. Why not? That’s what my new processor app needs. Seems appealing. That being said, careful attendees will notice that I did not mention speed in the Rust promises. Isn’t Rust supposed to be “Oh, my god” fast? No.
They talk about efficiency, and it’s not the same. Efficient software does not always mean faster software. Brett Cannon, a Python core developer, advocates that selecting a programming language for being faster on paper is a form of premature optimization. I agree with him in the sense that the word fast has different meanings depending on your objectives. To me, Rust can be said to be faster as a consequence of being efficient, which does not cover all the items on the list here.

Let’s apply them to my context. Is it fast to develop in Rust? Compared to Python, no way it can be faster. I have been doing Python for more than 15 years now. Is it faster to maintain Rust than Python? Nobody at Numberly does Rust yet, at least not professionally in production, so it can’t be faster to maintain. Is it fast to prototype in Rust? Oh, no. Code must be complete to compile and run. Is it fast to process data in Rust? Sure, on paper, but to prove it, measure it. Is it fast to cover all failure cases? Yeah, definitely. Exhaustivity is mandatory in the language itself, and the error handling primitives are very strong.

But overall, as we can see, in my case choosing Rust over Python means that I will definitely lose time. So I did not choose Rust to be faster. Our Python code was fast enough for our pipeline processing. So why would I want to lose time? The short answer is innovation. Innovation cannot exist if you don’t accept losing time. The question is to know when, and on what project. So the gist of my decision was that I was sure this project was the right one at the right time.

Now, what will I gain from losing time, other than the pain of using semicolons and brackets? Supposedly, more reliable software thanks to Rust’s unique design and paradigms. In other words, what makes me slow is also an opportunity to make my software stronger. So: a low-level paradigm, ownership, borrowing, lifetimes, all those crazy words.
At least if it compiles, it’s supposed to be safe. Strong type safety: coming from Python, yeah, well, it can make your code more predictable, readable, and maintainable. Compilation: what is that, coming from Python, right? The compiler is very helpful, and it’s still more helpful to have a compiler error that is explained very well than a random Python exception. Dependency management: oh, yes. This is where Rust shines as well, to me. It looks similar to what I’m used to in Python. Exhaustive pattern matching: that brings confidence that you’re not forgetting something when you code and when you compile. Error management primitives, failure handling right in the language syntax: that, I want.

So I chose Rust because it provided me with a programming language at the right level of abstraction and with the right paradigms, and this is what I needed to finally understand and better explain the reliability and performance of an application. It made this way easier than what I was used to with Python.

Okay, so let’s learn Rust the hard way, with high stakes and going straight to production. Here is another view of all the aspects and all the technological stacks that I had to deal with. Obviously, learning the syntax itself, and learning how to handle errors everywhere and properly. Then I had to connect and interact with Confluent Kafka and deal with the Schema Registry to decode Avro messages. Then I could start playing with asynchronous, latency-optimized application design and see what it looks like in Rust. Then I had to connect and play with the ScyllaDB multi-datacenter production cluster that we operate at Numberly. I also had to connect and interact with MongoDB. Everything runs and is deployed on Kubernetes at Numberly, so I also had to deal with this.
Of course, we measure and check everything, and everything is graphed, so that means we have to export metrics using Prometheus and then graph them using Grafana. We also use Sentry to be able to go back in time and see what happened if something goes wrong. Since I only have 20 minutes, I will skip through this list to highlight the most insightful parts.

Let’s start with the first wall I hit, right from the start: consuming messages from Kafka. We use Confluent Kafka Community Edition with Schema Registry to handle our Avro-encoded messages. Confluent Schema Registry adds a magic byte to Kafka message payloads, and this breaks vanilla Apache Avro schema deserialization. Luckily for me, Gerard Klijs (and I’m sorry, Gerard, I’m sure I’m pronouncing your name very badly) had done the heavy lifting in this crate, which helped me a lot before I discovered some performance problems. So we are working on improving that, and I hope to switch back to this helpful project once we are done. Until then, I decided to use the manual approach that I’m showing you here and decode Avro messages myself, while respecting the schema, of course.

Then I hit the second wall when, even though my reading of the Avro payload was done right, I could not deserialize the messages. As a total Rust newbie, I blamed myself for days before even daring to open the Apache Avro source code and look at it. I eventually found out that Apache Avro was broken for complex schemas like this. It made me wonder if anyone in the whole world was actually using Avro with Rust in production, especially knowing that the project had been given to the Apache Foundation without a committer able to merge PRs. Anyway, here I am contributing fixes to Apache Avro Rust, which eventually got merged 3 months later, in January 2022. Thank you, Martin.

Anyway, another unexpected fact that Rust allowed me to prove is that deserializing Avro is faster than deserializing JSON in our case of rich and complex data structures.
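The manual decoding step mentioned above starts by peeling off the Confluent framing before the bytes reach an Avro decoder. The sketch below is a minimal illustration of that first step only, not the code shown on the slide. It relies on the Confluent wire format (one magic byte equal to 0, then a 4-byte big-endian schema ID, then the Avro binary body). The names `split_confluent_frame` and `FramingError` are hypothetical.

```rust
/// Confluent wire format: 1 magic byte (0), a 4-byte big-endian schema ID,
/// then the Avro binary body.
const MAGIC_BYTE: u8 = 0;
const HEADER_LEN: usize = 5;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum FramingError {
    /// The payload is shorter than the 5-byte header (actual length attached).
    TooShort(usize),
    /// The first byte is not the expected magic byte (actual byte attached).
    BadMagic(u8),
}

/// Splits a Kafka message payload into (schema_id, avro_body).
/// The body can then be handed to an Avro decoder together with the
/// writer schema registered under `schema_id`.
fn split_confluent_frame(payload: &[u8]) -> Result<(u32, &[u8]), FramingError> {
    if payload.len() < HEADER_LEN {
        return Err(FramingError::TooShort(payload.len()));
    }
    if payload[0] != MAGIC_BYTE {
        return Err(FramingError::BadMagic(payload[0]));
    }
    let schema_id = u32::from_be_bytes([payload[1], payload[2], payload[3], payload[4]]);
    Ok((schema_id, &payload[HEADER_LEN..]))
}
```

Returning a borrowed slice keeps the step allocation-free, which matters when it runs once per message in the consumer loop.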
My colleague Atmin was sure of it, and I could finally prove it. This one is for you, Atmin.

Once I was able to consume messages from Kafka, I started looking for the best pattern to process them. I found the Tokio asynchronous runtime very intuitive, coming from Python asyncio. I played a lot with various code patterns to optimize consuming messages from Kafka and make its latency stable and reliable. One of the interesting findings was to not defer the decoding of our messages to a green-thread, but to do it right in the consumer loop. Deserialization is a CPU-bound operation, which benefits from not being cooperative with other green-thread tasks. Similarly, allowing and controlling your parallelism will help stabilize your I/O-bound operations.

Let’s see a real example of that with graphs. Deferring the rest of my processing logic, which is I/O bound, to green-threads helped me absorb tail latencies without affecting my Kafka consuming speed. The graph on the dashboard you see here shows that around 9 in the morning something made ScyllaDB slower than usual. ScyllaDB select and insert P95 latencies went up by 16. That’s where the parallelism load also started to increase, and you see this little bump in the first graph, as I had more active green-threads processing messages in the background. But it only hit my Kafka consuming latency and speed by a factor of two at P95, and this proved that the design had effectively absorbed the tail latency that was due to some overload. This is the typical example of something that was harder to pinpoint and demonstrate in Python, but became clear with Rust.

Now on to ScyllaDB. I found the ScyllaDB Rust driver to be intuitive and well featured. Congratulations to the team, which is also very helpful on their dedicated channel on the ScyllaDB Slack server. Join us there. The new caching session is very handy to cache your prepared statements, so you don’t have to do it yourself like I did at first.
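The consumer design described above (decode in the consumer loop, defer the I/O-bound work, keep parallelism bounded) can be sketched as follows. This is a dependency-free stand-in, not the talk's code: it uses OS threads and a bounded `std::sync::mpsc::sync_channel` where the real application uses Tokio green-threads. `decode`, `enrich`, and `run_pipeline` are hypothetical names, and the two steps are placeholders for Avro deserialization and ScyllaDB queries.

```rust
use std::sync::mpsc::{channel, sync_channel};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical CPU-bound step; stands in for Avro deserialization.
fn decode(raw: &[u8]) -> u64 {
    raw.iter().map(|b| u64::from(*b)).sum()
}

/// Hypothetical I/O-bound step; stands in for ScyllaDB lookups and writes.
fn enrich(value: u64) -> u64 {
    value * 2
}

/// Decodes every message in the consumer loop and hands the I/O-bound step
/// to a fixed number of workers through a bounded queue.
/// Returns the enriched values, sorted because completion order is not fixed.
fn run_pipeline(messages: Vec<Vec<u8>>, workers: usize, queue_depth: usize) -> Vec<u64> {
    // Bounded queue: the consumer blocks when all workers are busy and the
    // queue is full, which caps how much work is in flight.
    let (work_tx, work_rx) = sync_channel::<u64>(queue_depth);
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (done_tx, done_rx) = channel::<u64>();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let work_rx = Arc::clone(&work_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // The lock is held only while receiving, not during the I/O step.
                let next = work_rx.lock().unwrap().recv();
                match next {
                    Ok(value) => done_tx.send(enrich(value)).unwrap(),
                    Err(_) => break, // queue closed: the consumer loop is done
                }
            })
        })
        .collect();
    drop(done_tx);

    // Consumer loop: the CPU-bound decode happens inline, not in a worker.
    for raw in &messages {
        let decoded = decode(raw);
        work_tx.send(decoded).unwrap();
    }
    drop(work_tx);

    for handle in handles {
        handle.join().unwrap();
    }
    let mut results: Vec<u64> = done_rx.iter().collect();
    results.sort_unstable();
    results
}
```

The queue depth and the worker count are the two knobs that bound parallelism here. In an async runtime the same role would typically be played by a bounded channel or a semaphore around spawned tasks.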
Beware: prepared queries are not paged; use paged queries instead, with execute_iter. I almost got caught by this one. And here I am showcasing a code example of a production connection to ScyllaDB using SSL, multi-datacenter awareness, and a caching session. You don’t have to read it and understand it right now. It’s there for later reference if you need it.

Now to Prometheus, which, even if it comes late in this presentation, is actually one of the first things I set up on my application. For all the experiments I did, I measured the latency and throughput thanks to Prometheus. For a test to be meaningful, those measurements must be made right and then graphed right. ScyllaDB people know this by heart, but it’s usually harder for marketers like us. So, yeah, keep this in mind.

So here is an example of how I measure ScyllaDB insert query latency. The first and important gotcha is to set up your histogram buckets correctly for your expected graphing granularity. Here I expect ScyllaDB latency to vary between 15 microseconds and 15 seconds, which is the maximal server time-out I’m allowing for writes. Then I use it like this: I start the timer on the histogram, record its duration on success, and drop it on failure so that my metrics are not polluted by possible errors. It looks really sane and pure, thanks to the Rust syntax, I think.

One of my best time investments in this project was to create a detailed and meaningful Grafana dashboard so I could see and compare the results of my Rust application experiments. Make sure you graph as many things as possible: cache sizes, rates, and occurrences of everything. Discern between the two: rates and occurrences are not the same. Make error metrics meaningful by using levels, et cetera, et cetera. I’m linking a great article that the folks at Grafana wrote on how to visualize Prometheus histograms right in Grafana. It’s not as obvious as one might think.

So was it worth it?
Did innovation make up for the time lost? Well, the real question is, do I feel that I lost time at all? Short answer, sorry: hell no. The syntax was surprisingly simple and intuitive to adopt, even coming from Python. In the end, I have to confess that Rust made me want to test and analyze everything at a lower level, and that I absolutely failed to resist the temptation. So most of my time was spent on testing, graphing, analyzing, and trying to come up with a decent and insightful explanation of what I was seeing. In short, this surely does not look like wasting time to me.

For the number-hungry among you in the audience, here are some taken from the application in production. Kafka consumer max throughput with processing? 200K messages per second on 20 partitions. Avro decoding P50 latency? 75 microseconds. ScyllaDB SELECT P50 latency on roughly 2 million rows worth of a table, 250. INSERT P50, two ms.

In the end, I can fairly say that it went way better than expected. The Rust crate ecosystem is really mature and really similar to what I’m used to with the Python Package Index. The ScyllaDB Rust driver is stable and efficient. It took me a while to understand and accept that Apache Avro was broken, but this is done now. I could replace three Python applications totaling 54 pods with one Rust application totaling 20 pods, which makes green IT happy. And this feels like the most reliable and efficient software that I have ever written. So even if it was my first Rust application, I felt confident during the development process, which transformed into confidence in predictable and resilient software. After weeks in production, the new Rust pipeline processor proves to be very stable and resilient. So now I can fairly say that, yes, Rust’s promises are living up to expectations.

I thank you very much for your attention. I really hope you enjoy ScyllaDB Summit and that you learned something today, and let’s keep in touch.
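For later reference, the latency measurement described in the Prometheus part of the talk (buckets spanning 15 microseconds to 15 seconds, duration recorded on success and dropped on failure) can be sketched without any dependency. This is a hand-rolled stand-in, not the Prometheus client API and not the code from the slide. `LatencyHistogram`, `exponential_buckets`, and the factor of 10 between buckets are assumptions made for illustration. Unlike Prometheus buckets, the counts here are per bucket rather than cumulative.

```rust
use std::time::{Duration, Instant};

/// Bucket upper bounds in seconds, starting at `start` and multiplied by
/// `factor` at each step, `count` bounds in total.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut bounds = Vec::with_capacity(count);
    let mut bound = start;
    for _ in 0..count {
        bounds.push(bound);
        bound *= factor;
    }
    bounds
}

/// Hypothetical minimal histogram: one counter per bucket plus an overflow slot.
struct LatencyHistogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
}

impl LatencyHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    /// Counts the observation in the first bucket whose bound is not exceeded.
    fn observe(&mut self, elapsed: Duration) {
        let secs = elapsed.as_secs_f64();
        let idx = self
            .bounds
            .iter()
            .position(|bound| secs <= *bound)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    fn total(&self) -> u64 {
        self.counts.iter().sum()
    }

    /// Times `op`, recording the duration only when it succeeds so that
    /// failed operations do not pollute the latency metric.
    fn time<T, E>(&mut self, op: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        let started = Instant::now();
        let outcome = op();
        if outcome.is_ok() {
            self.observe(started.elapsed());
        }
        outcome
    }
}
```

With `exponential_buckets(0.000015, 10.0, 7)` the bounds run from 15 microseconds up to roughly 15 seconds. A smaller factor with more buckets gives finer resolution on the graphs at the cost of more time series.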
