Headquartered in Jacksonville, Florida, Fanatics is an online retailer of licensed sportswear, sports equipment and merchandise. Fanatics operates the ecommerce websites of major professional sports leagues (MLB, NASCAR, NBA, NFL, NHL, PGA, MLS, and UFC), major media brands (CBS Sports, FOX Sports, and NBC Sports), and over 150 sports teams. They are also the exclusive online distributor for the United States Olympic Team.
With over 40 locations worldwide and 10,000 employees, Fanatics has a product catalog of over 600 unique products and two million SKUs for sale.
Fanatics follows a mobile-first development philosophy. More than 70% of Fanatics’ platform traffic and more than 50% of its revenue come from mobile. During peak times, mobile alone accounts for up to 80% of revenue. For this reason, all new features, updates and bug fixes are tested first on mobile, and then on other platforms.
In 2015, Fanatics had to reckon with the limited ability of their on-premises deployment to scale. In response, they decided not only to move to the cloud, but also to refactor their architecture with an all-new technology stack. Fanatics decided to adopt a NoSQL database to replace their old stack, which was based on SQL Server and MySQL. They initially opted for Cassandra (at the time, Scylla was relatively new).
“This not only reduced TCO, but also reduced the pain that the database engineering team was taking to actually maintain the cluster in a healthy state.”
– Niraj Konathi, Director of Platforms Engineering, Fanatics
Fanatics’ Cassandra use cases encompassed a range of functions associated with order capture, including order creation, cart modification, loyalty, promotional codes, order visibility and rate limiting, among others. But the Fanatics team ran into a range of issues with Cassandra. For example, shopping cart mutations proved to be a significant pain point.
Cart mutation data provides a window into the reasons users abandon their carts and leave the site, so Fanatics stored every mutation. This eventually became a problem because Cassandra runs on the JVM, which pauses frequently for garbage collection (GC). GC pauses forced the team to scale out the Cassandra cluster by a significant multiple, all the way to 55 nodes. Since the mutations were all in a wide row, compaction caused the CPU to spike whenever a new modification happened. This resulted in numerous timeouts on the client side.
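To make the wide-row growth concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Fanatics' schema or any Cassandra API: the class name, field names and the sample cart are all assumptions made for illustration. It shows only that when every mutation of a cart is appended under one partition key, that partition keeps growing with each user action.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CartMutation:
    # Hypothetical fields; the source does not describe the real columns.
    seq: int
    action: str
    item: str


class WideRowCartStore:
    """Toy stand-in for a table partitioned by cart id (illustrative only)."""

    def __init__(self):
        # One list per cart id plays the role of one wide row.
        self._partitions = defaultdict(list)

    def record(self, cart_id, action, item):
        # Every mutation is appended; nothing is ever overwritten or trimmed.
        row = self._partitions[cart_id]
        row.append(CartMutation(seq=len(row) + 1, action=action, item=item))

    def partition_size(self, cart_id):
        # Number of mutations stored under this cart's partition.
        return len(self._partitions.get(cart_id, []))


store = WideRowCartStore()
for action, item in [
    ("add", "jersey"),
    ("add", "cap"),
    ("remove", "cap"),
    ("change_payment", "card-2"),
]:
    store.record("cart-42", action, item)

print(store.partition_size("cart-42"))  # prints 4
```

In a real cluster, the cost shows up when the database has to compact and read such ever-growing partitions, which is the behavior the team describes.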
Niraj Konathi, the Director of Platforms Engineering at Fanatics, explained the problem: “We did not realize how bad it could get with such large wide rows. Every time a user comes online, they change their payment cards, add new items to their cart, and so on.”
The only way to address the problem was to scale Cassandra out, in the end creating a giant 55-node cluster. Running such a large cluster increased EC2 costs, but more importantly increased maintenance overhead. For example, rolling restarts took four or five hours on Cassandra.
The Fanatics team discovered that Scylla offered the solution to their issues. As Konathi said, “Moving to Scylla solved our problems, because with Scylla there are no GC pauses and the latencies were really good.”
With Scylla, Fanatics has reduced node count from 55 to 12 nodes, resulting in a drastic reduction in EC2 expenses, all while virtually eliminating the timeouts experienced by end users.
“During a recent peak minute, we saw nearly 280,000 IOPS for a solid minute. With a 3-node Scylla cluster, we registered zero timeouts. Because of this we had happier customers and application teams.”
Fanatics has built a number of tools to manage Scylla. The team owns part of the pipeline for deploying applications in the cloud, and uses a homegrown tool to automate deployment.
Konathi described the Fanatics environment: “We have homegrown tools to actually do all the rolling restarts maintenance on Scylla and Cassandra clusters, and we can also apply patches using this tool, similar to what AWS has, but that’s a paid solution. Scylla Monitor is connected to PagerDuty. Scylla Manager is used for data repairs.”
The result: rolling restarts across the cluster now take less than an hour, whereas they used to take 5 to 8 hours.
“This not only reduced TCO, but it also reduced the pain that the database engineering team was taking to actually maintain the cluster in a healthy state.”
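The rolling restarts mentioned above follow a simple pattern: restart one node, wait until it reports healthy, then move on to the next. The sketch below is a generic illustration of that pattern in Python, not Fanatics' homegrown tool. The `restart` and `is_healthy` callables are hypothetical hooks that a real tool would back with its own node control and health checks.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    poll_seconds: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart nodes one at a time, waiting for each to become healthy."""
    done: List[str] = []
    for node in nodes:
        restart(node)
        # Poll until the node is healthy or the poll budget runs out.
        for _ in range(max_polls):
            if is_healthy(node):
                break
            sleep(poll_seconds)
        else:
            # Stop early so that at most one node is down at any time.
            raise RuntimeError(
                f"{node} did not become healthy; restarted so far: {done}"
            )
        done.append(node)
    return done


# Demo with fake hooks: every node is immediately healthy and nothing sleeps.
log: List[str] = []
restarted = rolling_restart(
    ["node-1", "node-2", "node-3"],
    restart=log.append,
    is_healthy=lambda node: True,
    sleep=lambda seconds: None,
)
print(restarted)  # prints ['node-1', 'node-2', 'node-3']
```

Because nodes are handled strictly one after another, total duration grows with node count, which is one reason a smaller cluster shortens the maintenance window.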
The Fanatics team is glad to endorse Scylla. “I highly recommend if someone is getting into Scylla for the first time, taking the Enterprise version, because you get a lot of support around it. So far we have been really happy with the support we have gotten from the Scylla team.”