Rakuten’s Catalog Platform Migration from Cassandra to ScyllaDB

13 minutes


Enter your email to watch this video and access the slide deck from the ScyllaDB Summit 2022 livestream. You’ll also get access to all available recordings and slides.

In This NoSQL Presentation

The Rakuten Catalog Platform (RCP) has been growing at a brisk pace over the last couple of years. Our original backbone was Cassandra. However, as we continued to grow, we started realizing internally that it was not suitable for our next stage of growth. As such, we started looking into ScyllaDB as a solution with better ROI as well as a much more stable backend. The migration itself was challenging, since it had to be done on a live production data processing pipeline with minimal impact on customers. In this talk, we will dive deeper into the challenges and takeaways.

Hitesh Shah, Engineering Manager, Rakuten

Hitesh is a passionate engineering leader with extensive experience predominantly with SF Bay area-based tech companies, with solid depth in development as well as operations. Lately, he has focused on distributed systems with dynamically scalable platforms on Cloud Computing environments.

Video Transcript

Hello, everyone. This is Hitesh Shah from the Rakuten Catalog Platform team at Rakuten. Today I’m here to talk about our journey from Cassandra to ScyllaDB: some of the challenges we faced with Cassandra and why we opted for ScyllaDB. So, a brief introduction about me. I have been with Rakuten for five-plus years by now, and I have seen as well as participated in the growth of the Catalog Platform here at Rakuten.

At heart, I’m truly a passionate engineer. I have predominantly worked with Bay Area technology firms, and I pride myself on focusing on the development as well as the operational aspects of product development. Pushing the envelope of technology is what I love on a day-to-day basis, and that’s what I strive to do at Rakuten as well. My current interests include distributed systems as well as dynamically scalable platforms.

So, a quick overview of the agenda. I’m going to start with an overview of RCP: what kind of use cases we are supporting and what kind of services we’re providing. Then I will give a very high-level overview of the pipeline architecture we are using, as well as some of the transaction volumes we are supporting on our platform. Then we will dive deeper into our experience and challenges with Cassandra, why we started looking into ScyllaDB, and what value-add ScyllaDB was providing. At the end, I will spend some time sharing some of the KPI improvements we have seen after we migrated to ScyllaDB.

So, as the name implies, we are a product catalog platform team. Say you are into online shopping and cashback, you go to Rakuten.com, and you’re looking for a new iPhone. You search for iPhone on Rakuten.com, and you get search results showing the iPhone from JCPenney at a certain price, the iPhone from Target at a certain price, the iPhone from Walmart at a certain price. Those search results are powered by this platform. And we are a truly global team: we are responsible for a lot of the global assets across Rakuten worldwide. So on the right, if you go to, let’s say, Rakuten Japan, Rakuten.co.jp, and again you search for an iPhone, you get search results showing the iPhone from one merchant in the marketplace at this price and this deal, as well as the iPhone from some other merchant in the same marketplace at a different price.
So again, those search results are powered by this platform. The way we do that is that we have this offline ingestion platform to power these results and this product catalog. We source the catalog data from multiple sources, and once it is downloaded into the system, there is a core data processing engine which will normalize the product data, validate the data and transform the data. At the end of the pipeline, the catalog data gets stored in ScyllaDB, so ScyllaDB is at the core of this platform. And between the platform teams there is a queuing-based integration with various microservices. So for example, machine learning enrichment happens via this pipeline, and once the enrichment happens, the data gets stored in ScyllaDB, and the catalog data is sent out, at the bottom, to our partners as well as internal customers.

So in terms of the core technology stack, it is pretty much Spark, ScyllaDB, Redis and Kafka. We’re also on Google Cloud Platform, so pretty much Pub/Sub, Dataproc as well as Dataflow. In terms of the programming model, it’s pretty much batching as well as microbatching, and we have some streaming also.

So in terms of the data processing volume we are handling right now, the total catalog size is 700 million plus items, and pretty soon approaching 1 billion plus items listed in the catalog. The number of items we are processing daily is close to 250 million plus, and again, the growth is continuing. So after all is said and done, after all these optimizations, by the time the transactions reach ScyllaDB, the read QPS we see at the ScyllaDB level is close to 10K to 15K per node, and the write QPS is in the range of 3K to 5K transactions per node.
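
As an illustration of the normalize, validate and transform stages described in this part of the talk, here is a minimal Python sketch. It is not Rakuten's code: the record fields (`merchant`, `title`, `price`), the validation rules and the output shape are all hypothetical, chosen only to show how the three stages chain together before a row would be written to the database.

```python
from typing import Iterable, Iterator


def normalize(item: dict) -> dict:
    """Clean up a raw feed record (hypothetical fields)."""
    return {
        "merchant": item.get("merchant", "").strip().lower(),
        # Collapse runs of whitespace in the product title.
        "title": " ".join(item.get("title", "").split()),
        "price": item.get("price"),
    }


def validate(item: dict) -> bool:
    """Accept only records with a merchant, a title and a positive price."""
    price = item["price"]
    return (
        bool(item["merchant"])
        and bool(item["title"])
        and isinstance(price, (int, float))
        and price > 0
    )


def transform(item: dict) -> dict:
    """Shape a validated record into the row that would be stored."""
    return {
        "merchant": item["merchant"],
        "title": item["title"],
        "price_cents": round(item["price"] * 100),
    }


def process_feed(raw_items: Iterable[dict]) -> Iterator[dict]:
    """Run each raw record through normalize -> validate -> transform."""
    for raw in raw_items:
        item = normalize(raw)
        if validate(item):
            yield transform(item)
```

In a real pipeline each stage would more likely be a Spark or Dataflow step operating on batches, and the rows yielded at the end would be written to ScyllaDB rather than collected in memory.
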
So here’s a recent chart of some of the ScyllaDB nodes that we are monitoring internally. You can see that on the left-hand side is the read QPS, pretty much in the 10K plus/minus range, and on the bottom right is the write QPS. And you can pretty much see that the servers are super, super busy, basically.

All right. So let’s dive deeper into some of the issues we have seen with Cassandra. One of the main ones was inconsistent performance, or what I like to call volatile latencies. Let’s say I had a select query which returned results inside a millisecond. That exact same query would take, let’s say, 120 milliseconds at a different time of the day, or 140 or 180 milliseconds on a different day. So it was very hard to predict the performance of the system, and it was very hard to commit to SLAs with our global partners.

Cassandra being a Java-based system, it also had its own baggage of JVM issues, and we definitely had our fair share of those as well. We used to run into a lot of long GC pauses, which would freeze a node right in the middle of all this activity and result in client-side connection timeouts for many of our clients.

One of the major issues, which was very prevalent for us, was that a single slow node can actually bring down the whole Cassandra cluster. This is common with a lot of distributed systems but much more prevalent in Cassandra. The reason is that a lot of those distributed systems have these concepts of sharding and horizontal replicas, so there is a lot of peer-to-peer chatter going on behind the scenes. So if for some reason one of the nodes is performing slower, it will not be able to process all the requests it is receiving, and it will not be able to propagate those requests.
So eventually it will start slowing down the other nodes in the cluster, and eventually the whole cluster will come down. Some of the other issues: Cassandra definitely requires a lot of babysitting. You pretty much had to have a full-time babysitter to make sure that we were on top of all of the repairs and that all of the jobs were running fine.

Having said all these things, we were able to scale up Cassandra. Cassandra was definitely naturally horizontally scalable, but what we realized was that it was coming at a stiff cost. So approximately 2 years ago, we started realizing internally that Cassandra was not the answer for the next stage of growth of the catalog platform.

So enter ScyllaDB. That’s when we bumped into ScyllaDB, and we did an initial POC of a couple of months to test out our primary use cases. And lo and behold, we saw that out of the box there was a five to six times performance improvement with ScyllaDB, so we went ahead with the migration.

So let me quickly run through some of the other benefits we have seen with ScyllaDB. Again, as I said, a five to six times performance improvement. A much improved total cost of ownership, because we went down from 24 Cassandra nodes to six nodes in ScyllaDB. And much better ROI, because any code which is running in C++ versus Java on the same set of hardware will definitely give you better ROI. And one of the biggest sources of, I would say, peace of mind was that we started having consistent latencies. Our queries were performing in a pretty much predictable fashion, and we started committing to SLAs with our internal partners as well as customers.
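
One way to make "volatile" versus "consistent" latencies concrete is to compare the median (p50) with the tail (p99) of a set of query timings and check the tail against an SLA budget. The sketch below is only an illustration: the helper names, the sample distributions and the 20 ms budget are assumptions, and only the 120, 140 and 180 millisecond outliers echo figures mentioned in the talk.

```python
import statistics


def latency_summary(samples_ms):
    """Return the p50 and p99 of the samples and the p99/p50 spread."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {"p50": p50, "p99": p99, "ratio": p99 / p50}


def meets_sla(samples_ms, p99_budget_ms):
    """True when the 99th percentile latency stays within the budget."""
    return latency_summary(samples_ms)["p99"] <= p99_budget_ms


# Illustrative data: a steady workload, and one with occasional slow queries.
steady = [1.0] * 100
volatile = [1.0] * 95 + [120.0, 140.0, 180.0, 120.0, 140.0]

for name, samples in (("steady", steady), ("volatile", volatile)):
    print(name, latency_summary(samples), meets_sla(samples, p99_budget_ms=20.0))
```

The median is identical for both sample sets, which is why an average or median alone can hide the problem; it is the tail percentile that decides whether an SLA can be committed to.
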
One other reason why we were attracted to ScyllaDB was the underlying Seastar framework and the much improved parallelism from the shared-nothing architecture, or shard-per-core approach, which reduces kernel-level overhead substantially and improves performance. And last but not least, one of the biggest advantages was that it was just a drop-in replacement, so we could spin up a new ScyllaDB cluster, migrate our data, start pointing our applications to ScyllaDB, and, boom, we were running in production.

Why ScyllaDB Enterprise? Let me quickly run through some of the benefits of the Enterprise version, at least as we saw them. Once we started doing the POC, we started realizing, because of the complexity of the technology as well, that Enterprise would make sense. One of the main reasons is that Enterprise comes with a private Slack channel for customer support, with its own SLAs, and we have benefited tremendously from this private Slack channel and the immediate turnaround on issues. One other advantage was that, for Enterprise customers, ScyllaDB comes with the custom shard-aware driver, which improves the utilization of the nodes of the cluster substantially. And the incremental compaction strategy, or ICS, is one of the benefits which fitted really well with our use case, where we have a table which is generating a lot of tombstones and is also heavily transaction-oriented, with lots of reads and writes. That’s where ICS made a lot of sense for us, to balance out disk space usage as well as CPU usage. Some other key things are workload prioritization, which we are evaluating for integration into our use cases, and ScyllaDB Manager, which is one of the biggest benefits because we no longer had to babysit the cluster.

So, a quick diagram on the shard-aware driver.
So this is a snapshot taken from our internal POC. On the top left is the DataStax driver running against ScyllaDB, and you can see how the requests are being served. On the bottom right is the ScyllaDB custom shard-aware driver, and you can see that the parallelism has improved substantially, and you are getting much better utilization of the nodes. And we actually saw internally close to 15 to 20 percent improvement in our queries because of the driver.

So last but not least, let me quickly share some of the KPI improvements we have seen. Our feed ingestion times improved by 30 to 35 percent, which resulted in improved SLAs for our internal partners and customers and improved customer satisfaction. Once the feeds are processed, we also do outbound publishing of the feeds, which saw almost 2.5 times to five times improvement. That resulted in much more satisfaction for our merchants, because now they can quickly turn around price changes as well as availability changes to the site.

And one of the biggest benefits was the dollar savings in hardware as well as processing costs. As I said, we went from 21 or 24 nodes in Cassandra to six nodes in ScyllaDB, so substantial dollar savings in hardware. Not only that, but because we are running on Google Cloud Platform and ScyllaDB is performing close to five to six times faster, our processing times are reduced, which results in reduced costs for the Dataproc and Dataflow jobs, so definitely a lot of improvements there as well.

So that’s pretty much what I wanted to cover. Thank you for joining my talk. You can stay in touch with me. I have my LinkedIn profile over here, and also my Rakuten e-mail address. Feel free to get in touch with me or shoot me any questions, and enjoy the rest of your conference. Thank you.
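
To illustrate the shard-per-core idea behind the shard-aware driver discussed in the talk, the following Python sketch maps a partition token to a shard (CPU core) index. It follows the token-to-shard mapping as I understand it from ScyllaDB's published driver protocol notes, but it is a simplified illustration and not driver code. The function name, the default of 12 ignored most-significant bits and the example tokens are assumptions, and a real driver would first derive the token from the partition key with a Murmur3 hash.

```python
MASK64 = (1 << 64) - 1  # keep intermediate values within 64 bits


def shard_of(token: int, nr_shards: int, ignore_msb: int = 12) -> int:
    """Map a signed 64-bit partition token to a shard index in [0, nr_shards)."""
    # Shift the signed token range [-2^63, 2^63) onto the unsigned range [0, 2^64).
    biased = (token + (1 << 63)) & MASK64
    # Discard the most significant bits (assumed default: 12).
    biased = (biased << ignore_msb) & MASK64
    # Scale the 64-bit value down to a shard index.
    return (biased * nr_shards) >> 64


# Hypothetical tokens; a shard-aware client would send each request over a
# connection that is owned by the shard computed here.
for token in (-(2**63), -1, 0, 123456789012345, 2**63 - 1):
    print(token, "-> shard", shard_of(token, nr_shards=8))
```

A driver that is not shard-aware can still reach the right node, but the request may land on a core that does not own the data and has to be handed over internally, which is one plausible reason for the better utilization the speaker describes.
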
