
CASE STUDY

How a Real-Time Switchover from Cassandra to Scylla Saved the Day

About SteelHouse

SteelHouse provides a self-service advertising platform for companies of all sizes. The SteelHouse Advertising Suite provides marketers with tools to build their own ads and then launch retargeting and prospecting campaigns through display, mobile, native, connected TV (CTV) and social media. With more than 700 customers worldwide, including industry leaders such as Virgin America, TUMI and Staples, SteelHouse solutions give advertisers transparency and control over their campaigns, with the fastest go-live in the industry.

SteelHouse’s ‘secret sauce’ lies in the ability to collect, personalize, and deliver data that’s relevant to the interests of the online audience, helping to ensure viewers engage with the ads and return to the advertiser’s site.

The Challenge

AdTech platforms like SteelHouse operate under very stringent service-level agreements (SLAs). Consistently meeting those SLAs is critical to success in AdTech; miss them, and the window of opportunity to place an ad closes. To meet the company’s SLAs, SteelHouse’s systems must serve high-volume request traffic with predictably low latencies, making them highly sensitive to glitches in the data layer.

SteelHouse initially built their services against Apache Cassandra. While Cassandra was fast, its performance was inconsistent, and it generated a high rate of timeouts in the services that depended on it. On top of that, instability in the Java Virtual Machine (JVM) resulted in many unplanned outages. Having experienced serious performance breakdowns in their Cassandra cluster, SteelHouse decided to evaluate alternatives.

The Solution

According to Kevaughn Bullock, SteelHouse Application Architect, the company went straight to Scylla. “We never really dove that deeply into other options, such as Riak or HBase,” Bullock said. “We went with Scylla because we knew it’s a 100% match with Cassandra. We were a Cassandra shop with performance issues. So Scylla provided us a Cassandra-compatible API, and it resolved our performance issues.”

SteelHouse never ran a formal POC or performance benchmarks against Scylla. Their migration occurred as a live-fire exercise; a severe Cassandra performance issue proved to be the last straw. “When we installed that first Scylla cluster, it was in the middle of a huge crisis for us in terms of Cassandra performance,” said Bullock. “Around 10:00 AM, we decided to pull the trigger to do a hot swap from Cassandra to Scylla that morning.”

With all their apps running, the team allowed the data to load into the system and then switched over to Scylla. Over the course of six hours, SteelHouse went from running a production Cassandra cluster with crippling performance issues to running a production Scylla cluster with no issues. “It was a sink-or-swim situation,” said Bullock. “Cassandra was sinking, and so we just said, ‘Hey, tear off the Band-Aid. Let’s try this out.’ It was a little intense but it worked out really well for us.”

With the first cluster set up, the team discovered that Scylla’s auto-tuning was a vast improvement over the Cassandra and JVM tweaks they had previously been burdened with. “Knowing that we could install the system, it would auto-tune itself, and be ready for a production workload, that’s pretty impressive,” said Bullock.

The team first tried Scylla on just one of their systems and found that performance was as good as advertised; they also saw an instant improvement in stability, with far fewer timeouts being logged. With careful monitoring, the team saw similar response times, but now they were meeting their SLAs 100% of the time. Even better, the network timeouts they had seen with Cassandra vanished after the switch.

The biggest driver of data on SteelHouse’s platform is the pixel server, which receives billions of requests a month from customer sites; it is also the service most sensitive to latency. “For every request that comes to the pixel server, there are multiple reads and writes,” according to Bullock. “So you’re talking about nine billion requests interacting with our Scylla backend.”

“Had we known beforehand how well it was going to work, we would have obviously made the switch even sooner!”

– Kevaughn Bullock, Application Architect, SteelHouse

The full benefits of migrating to Scylla were realized over the weekend of Black Friday and Cyber Monday, which is SteelHouse’s peak season for traffic. Over that weekend, SteelHouse typically sees three to four times its normal volume for days at a time. Scylla performed flawlessly under the load. The technical team wasn’t the only group satisfied with Scylla’s performance during that interval. “A lot of people on the business side even commented that it was our smoothest holiday season on record,” noted Bullock. “It was also our largest holiday season on record.”

Scylla helped SteelHouse reduce their hardware footprint by consolidating workloads onto fewer clusters. The result is less hardware to manage and fewer machines vulnerable to failure. Today, SteelHouse runs about 48 Scylla nodes. The team now spends far less time each week managing Scylla than it did Cassandra, yielding a corresponding reduction in operational costs.

According to Bullock, Scylla saves the team the three to four hours of unscheduled work per week that they had been spending on Cassandra maintenance. “We’re getting back roughly a 20% improvement in our team’s productivity,” said Bullock.

“The fact that you can drop Scylla in and let it handle production workloads without much optimization is hugely beneficial to us. After Cassandra driver compatibility, that was the biggest benefit for us.”

SteelHouse services run on IBM Cloud and AWS, with Kubernetes providing portability, making their platform effectively cloud-agnostic.

Bullock summed up SteelHouse’s experience with Scylla: “Had we known beforehand how well it was going to work, we would have obviously made the switch even sooner!”