Rakuten’s Catalog Platform Migration from Cassandra to ScyllaDB

Hitesh Shah13 minutesFebruary 9, 2022

The RCP/Rakuten Catalog Platform has been growing at a brisk speed over the last couple of years. Our original backbone was Cassandra. However, as we continued our growth, we internally started realizing that it was not suitable for our next stage of growth.
As such, we started looking into ScyllaDB as a better ROI solution as well as a much more stable backend. Our migration itself was challenging since this has to be done for a production live data processing pipeline with minimal impact on customers.
In this talk, we will dive deeper into challenges and takeaways.

Share this

Video Slides

Video Transcript

Hello everyone. This is Hitesh from Rakuten. Today I’m here to talk about Rakuten’s Catalog Platform Migration from Cassandra to ScyllaDB.

What are some of the challenges we were running into cassandra and why we opted for seradvi so brief introduction about me i have been with rakuten for almost five plus years now by now and i have seen as well as participated into the growth of the catalog platform here at rakuten uh at heart i’m truly a passionate engineer i have predominantly worked with various technology firms and i i pride myself into you know focused on development as well as operational aspects of the product development uh pushing the analog of the technology is what i love on a day-to-day basis and that’s what i strive for doing it at rakuten as well my current interests include distributed systems as well as dynamic dynamically scalable platforms so quick overview in terms of the agenda i’m going to start with overview of rcp what kind of use cases we are supporting and what kind of services we are providing uh then i will have a quick overview a very high level overview of the pipeline architecture we are using as well as some of the transaction volumes we are we are supporting uh on on our platform then we will dive deeper into uh you know our experience as well as challenges with cassandra and why we started looking into sila and what value added sila was providing at the end i will spend some time and to share some of the kpi improvements we have seen after we migrated to silha db so as the name implies uh we are a product catalog platform team so if you are into online e-commerce shopping and if you are into cashback and let’s say you go to rakuten.com and you’re searching for you know you’re looking for a new iphone so you so you search for iphone on rokuren.com and you will get the search results uh saying that you know iphone from jcpenney at certain price iphone from targeted certain price iphone from walmart at certain price so those search results are powered by this platform uh and we are a truly global team we are responsible for a lot of the global assets across rakuten worldwide so on the right uh with if you go to let’s say you know rakuten japan rakuten cu.co.jp and let’s say again you’re looking for an iphone and you search for iphone and you get the search results saying that iphone from this merchant in the marketplace at this price and this deal as well as iphone from some other merchant in the same marketplace at a different price so again those search results are powered by this platform and the way in which we do that is that we have these offline ingestion platform uh to to power this these results and this product catalog we source the catalog data from multiple sources and once it is downloaded downloaded into the system there is a core uh data processing engine which which will kind of like normalize the normalize the product data validate the data uh and transform the data and at the end of the pipeline the catalog data will get stored into the sylla db so cell db is at the core of this platform to to to power the platform as well as support the platform and in between uh the the platform teams there is a queueing base uh you know i will say queuing based integration of different various micro services so you know for example the machine learning enrichment will happen via this pipeline and the once the enrichment happens the data will get stored into civil db and the the the catalog data will be sent out in the bottom to our partners as well as internal customers so in terms of the uh core uh technology stack uh pretty much i will say spark sila radius kafka we are also on google cloud platform so you know pretty much pops up data proc as well as data from in terms of the programming model uh we are kind of like uh using uh it’s pretty much matching uh as well as micro matching and some streaming also we have so in terms of the data processing volume we are handling right now the total catalog size is close to 700 million plus and pretty soon approaching 1 billion plus items or listings in the catalog the number of daily items we are processing is close to 250 million plus and again you know given the growth is continued to go up so all being said and done after all these optimizations and you know by the time the transaction reaches to sila db in terms of the rate qps vc at sila level is close to 10k to 15k uh qps per node as well as the right qps is in the range of anywhere from you know 3k to 5k uh transactions per node so here is a quick uh graph on a chart a recent chart of some of the sila nodes basically what we are monitoring internally so you can pretty much see that you know on the left hand side is the 8 kps pretty much in the you know 10 k plus minus range as well as on the right bottom right is the right qps and i mean you can pretty much see that servers are super super busy basically all right so let’s dive deeper into some of the issues we have seen with cassandra uh so one of the main reason was the inconsistent performance or what i like to call the volatile latencies so let’s say if i have a select query which is returning you know several returning results back in 70 milliseconds that exact same query will take let’s say you know 120 millisecond at a different time of the day or could even take 140 or 180 millisecond or a different day so it was very hard to predict the performance of the system as well as it was very hard to commit to the srs uh to our partners cassandra being a java based system uh it also had its own baggage of jvm issues and you know definitely we had our own fair share of auto family issues as well uh we also used to run into a lot of the long gc pulses which will come you know which will freeze the node right in the middle of all this activity and kind of result into server side connection timeout to for many of our clients one of the major issue which was which was very very prevalent with us was a single slow node can actually bring down the whole cassandra cluster and this is very common with a lot of the distributed systems but much more prevalent in cassandra is and the reason for these is because a lot of these distributed systems has this concept of sharding and the leaders and replica so there is a there is a lot of you know peer to peer network chatter going on behind the scene so for some reason if one of the node is performing slower then it will not be able to process all the request it is receiving as well as well as it will kind of like you know not be able to propagate those requests so this eventually basically it will start slowing down the other nodes in the cluster and eventually the whole cluster will come down uh the other some of the other issues where definitely cassandra requires lot of the lot of manual operations lot of babysitting we pretty much had to have a full-time dba to make sure that you know we are on top of all the repairs as well as all the jobs are running fine so having said all these things we were able to scale up the cassandra cassandra was definitely naturally horizontally scalable but what we realized that it was coming at a stiff cost so just around approximately two years around the same time kind of like we started internally realizing that uh cassandra was not the answer to our next stage of growth of the catalog platform so enter sila that’s where we bumped into sila db uh and we did a physically initial poc of couple of months to to to test out our primary use cases and all behold basically you know we kind of like saw that out of the box there was uh five times to six times performance improvement uh in cilavi so we we kind of like you know went ahead and with the migration so again so let me just quickly run through what are some of the other uh benefits we have seen with the sila so again as i said five to six is 6x performance improvement much more improved total cost of ownerships because you know we went down from 21 to 24 cassandra nodes to six nodes in sila as well as much better roi because any code which is running in c plus plus versus java on the same set of hardware we definitely give you a better roi and one of the biggest i would say peace of mind was that we started having this consistent latencies the product so queries are performing pretty much in a predictable fashion and we started committing to sls to our uh our internal partners as well as customers uh one other reason we were kind of like you know attracted to sila was the underlying sea star framework uh and you know the much better improved parallelism where basically you know the shared nothing architecture or you know short per core approach which reduces the kernel level contention substantially and improves your performance and again one you know last but not least one of the biggest advantage was it was just a drop in replacement so we could basically you know spin out a new zela cluster migrate our data and start pointing our application to sila and boom you know we were we were running in production uh why enterprise so you know just quickly run through some of the benefits of the enterprise version at least what we saw uh so we once we started doing poc we kind of like started realizing because of the complexity of the technology as well that enterprise version will make sense and one of the main reason is that with the enterprise version you know it comes with the private slack channel for the customer support and end with their own sls and we have you know we have benefited tremendously from this private slack channel and with the immediate turnaround of the asus the one other advantage was that for the enterprise customer the sela comes with custom shard aware driver uh which actually which you know improves the utilization of the node at the cluster substantially and the incremental complexion strategy or ics is one of one other benefit which actually really fitted really well with our use case uh where we have a you know schema or a table uh which is generating a lot of tombstones as well as it is you know it is heavily transaction oriented a lot of lots of reads and write so that’s where ics made a lot of sense for us to basically balance out the disk space usage as well as the cpu usage and again you know some other key things are workload priorities prioritizing which we are evaluating uh to integrate into our use cases as well as sila manager which is one of the biggest benefit because we know that i had to babysit the cluster okay so quick uh quick diagram on the shardwater driver so this is our this is uh taken the snapshot from our internal poc so on the top left is the data sex driver uh you know running against sila so you you know and uh you can see you know how the director is serving the request and on the bottom right is basically the sila custom shadow driver and you can see that the the parallel parallelism has improved substantially and you know you are getting much better utilization of the nodes as well as the director is busy basically and we actually internally so at least close to 15 percent to 20 improvements in our query because of the driver so last but not least let me quickly share some of the kpi improvements uh we have seen uh so our average feed ingestion times improved by you know 30 to 35 percent which resulted into improved sls for our internal partners and customers and improved customer satisfaction uh we also you know once the feeds are processed we also do out on publishing of the feeds which resulted into you know almost 2.5 times to five times improvements which resulted into you know much more satisfaction for our merchants because now they can quickly turn around turn around the price changes as well as availability changes to the site and one of the biggest benefit was basically the the the dollar dollar savings in the savings in the hardware as well as processing cost so you know as i said we went from 21 or 24 nodes in cassandra to six nodes and sila so substantial dollar signs in hardware not only that but just by the sheer fact that we are running a google cloud platform because sila is performing close to you know five times to six times faster our processing times radius which resulted into you know the reduced cost for the data proc and data flow jobs so so definitely lot of improvements over there so that’s pretty much what i wanted to cover uh thank you for joining my uh talk you can stay in touch with me i have my linkedin profile over here also i have my rakuten email address uh feel free to you know get in touch with me or shoot me any questions and enjoy rest of your conference thank you

Read More