How to Migrate a Cassandra Counter Table for 68 Billion Records

22 minutes

In This NoSQL Presentation

Happn's war story about migrating a Cassandra 2.1 cluster containing more than 68 Billion records in a counter table to ScyllaDB Open Source.

Robert Czupioł, Senior Platform Engineer, Happn

Robert Czupioł is a Senior Platform Engineer for one of the world’s top dating apps, Happn. He is responsible for reducing costs on GCP while running databases and ops. Robert was previously a Principal Expert Architect and Java developer in ING Bank Slaski – Poland. He has 6+ years of experience in Cassandra/ScyllaDB/DSE in production – he’s an architect, developer, and DevOps in a single ninja person. He was certified in Cassandra during Cassandra Summit (San Jose) and has been a speaker at DataStax Accelerate, Premier Apache Cassandra Conference (Washington DC), and many domestic conferences like JDD Cracow, SpreadIT Katowice, IBM Think, and meetups. Robert is also a trainer at Cassandra workshops.

Video Transcript

Hello, folks. I’m glad that you are spending the next 20 minutes with me during this presentation at the amazing ScyllaDB Summit 2022, the online edition. I have the pleasure to present the topic “Migrate the Counter Table which contains 68 Billion Records.” Yeah. So my name is Robert Czupioł, and I’m a Senior Platform Engineer at Happn. I have been certified on Cassandra since 2015, so that was 7 years ago.

Yeah, I had almost forgotten that. And what is funniest, I attended the first ScyllaDB Summit in 2016. I even have a photo which proves that. Yeah.

So Happn, find the people you’ve crossed paths with. This is a dating application, top three in Western Europe, and we have more than 100 million customers, and we are still growing this number. I hope it will be 200, since we are just a little behind Tinder, yeah, but maybe we will catch up with them. We have nine ScyllaDB clusters, and in the past we had 16 Cassandra clusters, because we are close to finishing the migration. We have more than 200 terabytes of data in those ScyllaDB clusters, and average traffic is around 300K requests per second. For people like, for example, developers from [Indistinct] or something like that, it’s not a big number, but for us, for a top-three Western Europe dating application, it’s a big one.

So in May 2021, we made the big decision to migrate to ScyllaDB. We were working on Cassandra previously, and, like everyone, we had some targets and goals to achieve. Reduce TCO, okay. If you ask the CIO, your manager, your mother, “Do you want to pay less for the same and even for something better?” everyone says, “No, God no, no. We don’t need that.” No, normally everyone says, “Yes, we want that.” So we also had this discussion with our CIO that we wanted to migrate to ScyllaDB, also to reduce the technical debt, because we were previously on Cassandra 2.1. That is a version from 7 years ago, a little messy, a little old. Also to reduce the data volume, which we’ll talk about later, and to improve our monitoring stack. As you know, ScyllaDB has the Grafana dashboards and the whole monitoring stack.

So one of the biggest clusters in Happn was the crossing counter.
We count whether someone meets someone else on the street when they walk or ride a bus or something like that, go along the Champs-Élysées or something like that, and we count that. In the past, we had 48 Cassandra nodes. Normally, we are working on Google Cloud, so it’s an n2-highmem machine, so four CPUs, 32 gigabytes of memory, and a 1-terabyte network disk. This cluster carried a big technical debt, like Debian 8 and an uptime of 580 days, and that’s not the uptime of Linux. It’s the uptime of the Cassandra nodes, so you can imagine how we patch Cassandra .. . Yeah, we don’t do that. And this is Cassandra 2.1, and there is one big table, crossing_count, and this table contains the number from the beginning of the presentation, so it’s 68 billion records. I used DSBulk to count them. I know that normally we can use HyperLogLog or something like that to estimate the number of partitions, but I wanted a more accurate number of how much data we are handling.

So, migrating a table with 68 billion records. Normally, when we talk about migration, we have two different types. It can be offline or online. I know that everyone knows that, and offline is not our case because we are working 24/7, so our business cannot have even 1 millisecond of downtime, so we had to choose the online migration. Normally, online is three very simple steps: do the doublewrite in the microservice, then migrate the legacy data, and at the end, open the bottle of champagne. Of course, Happn is a French company, so we have a lot of bottles of champagne to open, and, yeah, that is the easiest step for us.

So, the legacy data. There are also three different possibilities to do that. The first one is by flat file, like a CSV file. You could use cqlsh or DSBulk, but this approach has a writetime issue. It contains and [Indistinct], so you could, for example, add or remove the newest data while doing that, so, yeah.
It’s not a good idea, to be honest, but maybe it works if you have a small amount of data or some dictionary, something like that.

The next one is by SSTable, and to migrate SSTables we can use, for example, a ready tool like sstableloader. Of course, Cassandra and ScyllaDB versions are a little different inside, but it generates normal traffic based on the existing SSTables, so it could stress our cluster a little. The other option is to use nodetool refresh. In my opinion, it is the best option if you are working on something like Google Cloud and you have network disks, because we could just clone the data written on the nodes to a lot of ScyllaDB nodes, copy [Indistinct], just run nodetool refresh, and, after resharding the data, it’s ready.

Okay, and the third, the most complicated way, is to make a dual connection. So you connect to the Cassandra cluster and to the ScyllaDB cluster, and for that there are ready applications like the ScyllaDB Migrator or Spark, or you can write something on your own.

Okay, so the counter table. The most important thing is that for counting the people you meet during a walk down the Champs-Élysées, we are using a counter, okay? And a counter is outside the idempotency rule. Normal writes to a table in Cassandra or ScyllaDB are idempotent. Counters aren’t. You can only update. There is also the possibility to delete, but after a delete you create a tombstone, which covers the next updates as long as the tombstone exists, so it’s a little weird approach. There were different implementations of counters in the past. In Cassandra 2.0 there were local and remote shards, in 2.1 there are just shards, but we’ll talk about it later. It is impossible to use your own timestamp or a TTL, and they are not accurate, as usual, and my favorite attribute of a counter is the range of a long value. So it’s very easy to overflow the counter and get a negative value.
So, for example, if we are a dating application, we count the people with which we have had segues, for example, and we have segues with a lot of people, like nine gazillion, we could, after that, have a negative value. Maybe it’s not possible, to be honest, but, for example, when counting the people who are irritated by the new Polish tax, it could be.

Okay. So how counters work, basically. On each node in our cluster which contains our value based on the partition key, so it’s the same story with the token ring, we create the node-counter-id, so it’s like a shard. For example, here it’s just replication factor two, and we have node A and node B, and inside we have the counter node ids A_1 and B_1 because we have two nodes. Each shard has a logical clock, which is increased when we update the value, and the actual shard value. For example, during an update operation on node B, so we choose node B as our coordinator, we increment the value by two on top of the previous value. After that, we generate the new logical clock and save the new value on this node and on the other replicas, as you see on the picture. Okay. If we, for example, decrement by five, we do exactly the same. We read the previous shard value. We decrement .. . We create the new logical clock value and insert the new value into the shard’s value, like on the screen.

And where is the funny stuff? When we read the value of our counter, we have to merge the values from each shard. So we have, for example, the value minus five for shard A_1 and the value three on shard B_1, so our current value, when we sum them up on a select of the data, is minus two.

Okay, so how to migrate that stuff? Basically, if we go with the doublewrite approach, we’ll have a situation like this. On Cassandra, we have 20, 21, then 22.
We increment the value also on the ScyllaDB side, so we are adding plus one to ScyllaDB, so it’s one, two, three, and after migrating the legacy data, we have 28, 29, 30 on the ScyllaDB side, which is wrong data. Okay? Genius. Okay. It’s very smart. Okay. It’s not.

So we wrote our own application to avoid that problem. Basically, we called it counter-migrator, but it could be anything. It could be ABC. It’s like one of the most important problems in computer science, next to clearing the cache: how to name applications, okay. So we named it counter-migrator. It was written in Java. Okay. Why in Java? Because everyone knows this language. It’s well-known, and simply because all our microservices are also written in that language. We split the token ring. As everyone knows, in Cassandra and ScyllaDB we have a consistent hash ring, and we split it into 6,144 ranges, and the select looks more or less like this. After that, when we get the value from Cassandra, we compare it with the value on ScyllaDB and set the newest value, the true value of this counter.

And doing that migration, we had, of course, some pitfalls, because it shouldn’t be as easy as a piece of cake. The first one is out of memory, of course, as always. We had to split the ranges further, because if you divide those 68 billion records by 6,144, you get around 11 million records per range, so, yeah. It’s a little too much for a normal Java heap, so we split the ring into 600,000 ranges. And, of course, after that, one of the most important things: we shuffled those ranges so that they are not in the normal order. They are shuffled randomly. After that, we are .. . because remember that ScyllaDB has shards, and values which are close to each other could be executed on the same shard of ScyllaDB. So one CPU can be hot, and the rest are a little frozen.

Okay, so the next thing was Java and Spring. Okay, so Spring takes 30 seconds to start the context. Yeah.
If we start the context for each shard to have this process [Indistinct] or something like that, it takes a little more time than we expect. We have, of course, also the JVM heap, and we need more machines to handle that, or we could just switch to Golang. Yeah, in Golang it’s simpler to write this type of tool.

Okay. After that, we had a problem with alerting, because on one node there was another process, and to be honest, it’s [Indistinct] but never mind. It’s very close to our basic VM template, and this process took 5 gigabytes of memory on one ScyllaDB node, and ScyllaDB went down because it ran out of memory, and we were, of course, focused on the migration, not on the alerting stuff. So after 2 days with one node down, we had a lot of hints to fight with the values which were wrong, because the hints are aggressive and create an aggressive workload to fight entropy. So we stressed our cluster for 2 days, but, yeah, after that it got better.

So the next step is to tune the driver and the consistency level, and the default is always wrong. Yeah. I know that a lot of people like to take Spring and Spring Data Cassandra and just type in the URL of a node, and expect it to work perfectly. No, it doesn’t. You have something like speculative execution. You have something like the Netty connections and a lot of things to tune, and the next thing is to think twice about the consistency level. Even during migration, the consistency level .. . because it’s a migration process, even if some node goes down, it’s just during the migration process, so you should stop the migration and look into this node. What happened? Then start it again, rather than generating a lot of entropy in the cluster. And switch to the ScyllaDB driver. Yeah, it gives about 20 percent better performance, more or less.

Okay, and, yeah, this is the last one. Okay? I swear. It’s the last. So avoid the network latency.
So during that migration, do not send one piece of information per request, because we have 68 billion records to send. We used batches. Of course, it was the unlogged batch, so after that, the most important thing was increasing the warning threshold in [Indistinct] so as not to overload the journal. Yeah.

Okay, and the final result. Migrating the data from Cassandra to ScyllaDB takes … Yeah, we were using two n2-standard machines, and it took 5 days. Finally.

So what did we achieve? We achieved, of course, better latency. Yeah. I know that, for example, from my perspective one of the most important metrics is the TCO. It’s not metrics [Indistinct] it was good because from the customer’s perspective, even half a second is okay, but we are down to 20 milliseconds. Okay, the disk space. We reduced the final disk space. And as you see on this chart, on Cassandra we need 38-terabyte disk, and on ScyllaDB, 4.4 18. Yeah. We checked that all the data exists, that we did not skip some data. Why? Because ScyllaDB uses the MD format. It’s more aggressive during compaction, and we also changed the compression from LZ4 to Zstandard.

Okay, so the really final result. We cut the number of nodes from 48 Cassandra to 6 ScyllaDB. We also improved the cost of data with incremental snapshots, and we have a commitment for N2 and Local SSD, so finally, the TCO was reduced by 75 percent. There’s only one company that cries. It’s Google, because we pay less.

Okay, lessons learned. The most important one before migration: remember to clean up and repair the cluster. Yeah, I know it’s a very important topic, and I could talk for the next 20 minutes about what happens if you, for example, bootstrapped some node and forgot about cleanup. Sometimes even compact, to have less data to migrate. Remember about table properties, yeah.
As always, adjust the scylla.yaml file, yeah, the hints window time, that’s the next one, and keep the changes in Infrastructure as Code, like an Ansible playbook, [Indistinct] or something else. Yeah, it’s very important, so I think it’s the most important lesson learned. Okay, so thanks for your time. See you, and bye.
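
The range splitting, shuffling, and counter reconciliation described in the talk can be illustrated with a small sketch. This is not Happn's counter-migrator (which was written in Java and is not shown in the talk); it is a minimal Python illustration under stated assumptions: the Murmur3 token bounds, the function names `split_token_ring` and `migrate`, and the plain dictionaries standing in for the Cassandra and ScyllaDB tables are all hypothetical. The talk says only that the value from Cassandra is compared with the value on ScyllaDB and the true value is set; applying the difference as an increment is one assumed way to do that, since a counter cannot be overwritten directly.

```python
import random

# Token bounds of Murmur3Partitioner (assumption: the cluster uses the default partitioner).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(n):
    """Split the whole ring into n contiguous (start, end] token ranges."""
    width = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + width * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def migrate(ranges, source, target):
    """Bring every counter in `target` up to the value held in `source`.

    `source` and `target` are plain dicts {token: counter_value}; they are
    hypothetical stand-ins for the two tables, no driver is involved.
    """
    for start, end in ranges:
        # Stand-in for a query filtered by token(pk) > start AND token(pk) <= end.
        rows = {t: v for t, v in source.items() if start < t <= end}
        for token, source_value in rows.items():
            # A counter can only be incremented, so apply the difference.
            delta = source_value - target.get(token, 0)
            if delta:
                target[token] = target.get(token, 0) + delta


# Rough rows per range for 68 billion records, assuming an even spread.
rows_per_range_coarse = 68_000_000_000 // 6_144
rows_per_range_fine = 68_000_000_000 // 600_000

ranges = split_token_ring(600)        # small count to keep the demo fast
random.Random(2022).shuffle(ranges)   # visit ranges in random order

cassandra = {-9 * 10**18: 22, 0: 7, 9 * 10**18: 3}
scylla = {-9 * 10**18: 3}             # only the double-written increments so far
migrate(ranges, cassandra, scylla)
print(rows_per_range_coarse, rows_per_range_fine, scylla == cassandra)
```

With 6,144 ranges each range holds roughly 11 million rows, while 600,000 ranges bring that down to roughly 113 thousand, which is the heap-size argument made in the talk. Shuffling means neighbouring token ranges are not processed back to back, so the load is spread across ScyllaDB shards. The sketch ignores increments that reach only one cluster between the read and the write; a real tool would have to deal with that window.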
