Grab is one of the most frequently used mobile platforms in Southeast Asia, providing the everyday services that matter most to consumers. Its customers commute, eat, arrange shopping deliveries, and pay with one e-wallet. Grab believes that every Southeast Asian should benefit from the digital economy, and the company provides access to safe and affordable transport, food and package delivery, mobile payments and financial services. Grab currently offers services in Singapore, Indonesia, the Philippines, Malaysia, Thailand, Vietnam, Myanmar and Cambodia.
When handling operations for more than 6 million on-demand rides per day, a lot must happen in near real time. Any latency issue could result in millions of dollars in losses.
Like many other on-demand transportation companies, Grab relies on Apache Kafka, the data streaming technology underlying all of Grab’s systems. The engineering teams within Grab aggregate multiple Kafka streams, or a subset of them, to meet various business use cases. Doing so calls for reading the streams, using a powerful, low-latency metadata store to perform aggregations, and then writing the aggregated data into another Kafka stream.
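The read, aggregate, and write-back flow described above can be sketched roughly as follows. This is only an illustration, not Grab's implementation: plain Python iterables stand in for the Kafka streams, a dictionary stands in for the metadata store, and the `(key, value)` event shape and the function name `aggregate_streams` are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (aggregation key, numeric value).
Event = Tuple[str, int]


def aggregate_streams(streams: Iterable[Iterable[Event]],
                      store: Dict[str, int]) -> List[Event]:
    """Read events from several input streams, keep running totals in
    `store` (a stand-in for the metadata store), and return the
    aggregated records that would be written to the output stream."""
    touched: Dict[str, None] = {}  # keys updated in this pass, in first-seen order
    for stream in streams:
        for key, value in stream:
            # Read-modify-write against the store for each incoming event.
            store[key] = store.get(key, 0) + value
            touched[key] = None
    # One aggregated record per key that changed.
    return [(key, store[key]) for key in touched]
```

In a real deployment the store lookups and updates would be database reads and writes, which is why the latency of the metadata store matters so much for this kind of pipeline.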
The Grab development team initially used Redis as its aggregation store, only to find that it couldn’t handle the load. “We started to notice lots of CPU spikes,” explained Aravind Srinivasan, Software Engineer at Grab. “So we kept scaling it vertically, kept adding more processing power, but eventually we said it’s time to look at another technology and that’s when we started looking at Scylla.”
In deciding on a NoSQL database, Grab evaluated Scylla, Apache Cassandra, and other solutions. They performed extensive tests with a focus on read and write performance and fault tolerance. Their test environment was a 3-node cluster that used basic AWS EC2 machines.
“Most of our use cases are write heavy,” said Srinivasan. “So we launched different writer groups to write to the Scylla cluster with 1,000,000 records and looked at the overall TPS and how many errors occurred. Scylla performed extremely well. Read performance was one of the major bottlenecks we had when using Redis, so we wanted to test this thoroughly. We launched multiple readers from the Scylla cluster and evaluated the overall throughput and how long it took to scan the entire table. We’d populate the table with 1,000,000 rows and then figure out how long the entire table scan took.”
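A test of the kind described here could be structured along the following lines. This is a minimal sketch under stated assumptions, not the harness Grab used: `write` and `scan` are hypothetical callables that a real test would back with database client calls, and the function names and result fields are made up for the example.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def run_write_test(write: Callable[[int, str], None],
                   n_records: int) -> Dict[str, float]:
    """Write `n_records` rows through `write`, counting errors and
    computing attempted writes per second (TPS)."""
    errors = 0
    start = time.perf_counter()
    for i in range(n_records):
        try:
            write(i, f"value-{i}")
        except Exception:
            # Count the failure and continue, so the run reports an error total.
            errors += 1
    elapsed = time.perf_counter() - start
    tps = n_records / elapsed if elapsed > 0 else float("inf")
    return {"records": n_records, "errors": errors,
            "elapsed_s": elapsed, "tps": tps}


def time_full_scan(scan: Callable[[], Iterable]) -> Tuple[int, float]:
    """Iterate over every row returned by `scan` and report the row
    count and the time the scan took in seconds."""
    start = time.perf_counter()
    rows = sum(1 for _ in scan())
    return rows, time.perf_counter() - start
```

For a quick dry run, an in-memory dictionary can stand in for the table: `run_write_test(table.__setitem__, 1_000_000)` followed by `time_full_scan(lambda: iter(table.items()))`. Running several such writers or readers concurrently would approximate the writer groups and multiple readers mentioned in the quote.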
“Running the same workload on other solutions would have cost us more than three times as much as Scylla.”
– Aravind Srinivasan, Software Engineer, Grab
“For fault-tolerance, we had a 5-node cluster and we’d bring down a node at the same time we were adding another node and doing other things to the cluster to see how it behaves. Scylla was able to handle everything we threw at it. On the operational side, we tested adding a new node to an existing cluster and that was good as well.”
Scylla came out on top in these performance tests and is now in production at Grab. “Scylla is working really well as our aggregation metadata store,” said an enthusiastic Srinivasan. “It’s handling our peak load of 40K operations per second. It’s write-heavy right now but the latency numbers on both reads and writes are very, very impressive.”
The Grab team points to a few things that they especially like about Scylla:
Grab is now looking to extend its use of Scylla. Other teams at Grab are hearing about the success of using Scylla as an aggregation store and are looking to migrate additional use cases to it, such as statistics tracking and time-series storage.