A Tale of Database Performance Woes: Patrick’s Unlucky Green Fedoras

By Piotr Sarna

March 12, 2024

Database performance is serious business, but why not have a little fun exploring its challenges and complexities? 😉 Here’s a rather fanciful story we presented in Chapter 1 of Database Performance at Scale, a free Open Access book.

The technical topics covered here are expanded on throughout the book. But this is the one and only time we talk about poor Patrick. Let his struggles bring you some valuable lessons, solace in your own database performance predicaments… and maybe a few chuckles as well.

***

After losing his job at a ~~FAANG~~ MAANG (MANGA?) company, Patrick decided to strike off on his own and founded a niche online store dedicated to trading his absolute favorite among headwear, green fedoras. Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack.

After some experimentation with the offering’s free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to the customers. To Patrick’s disappointment, fewer than ten customers visited the site daily. At the same time, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed.

Patrick’s Diary of Lessons Learned, Part I

The lessons started right away:

Although some databases advertise themselves as universal, most of them perform best for certain kinds of workloads. The analysis before selecting a database for your own needs must include estimating the characteristics of your own workload:
- Is it likely to be a predictable, steady flow of requests (e.g., updates being fetched from other systems periodically)?
- Is the variance high and hard to predict, with the system being idle for potentially long periods of time, with occasional bumps of activity? Database-as-a-service offerings often let you pick between provisioned throughput and on-demand purchasing. Although the former is more cost-efficient, it incurs a certain cost regardless of how busy the database actually is. The latter costs more per request, but you only pay for what you use.
Give yourself time to evaluate your choice and avoid committing to long-term contracts (even if lured by a discount) before you see that the setup works for you in a sustainable way.

The First Spike

March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders starting from the early morning. But as the number of active customers skyrocketed around noon, Patrick’s mood started to deteriorate. This was strictly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders.

After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious (and quite expensive) database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today’s expected traffic.

“Provisioned throughput strikes again,” Patrick groaned to himself, while scrolling through thousands of “throughput exceeded” error messages that started appearing around 11am.

Patrick’s Diary of Lessons Learned, Part II

This is what Patrick learned:

If your workload is susceptible to spikes, be prepared for it and try to architect your cluster to be able to survive a temporarily elevated load. Database-as-a-service solutions tend to allow configuring the provisioned throughput in a dynamic way, which means that the threshold of accepted requests can occasionally be raised temporarily to a previously configured level. Or, respectively, they allow it to be temporarily decreased to make the solution slightly more cost-efficient.
Always expect spikes. Even if your workload is absolutely steady, a temporary hardware failure or a surprise DDoS attack can cause a sharp increase in incoming requests.
Observability is key in distributed systems. It allows the developers to retrospectively investigate a failure. It also provides real-time alerts when a likely failure scenario is detected, allowing people to react quickly and either prevent a larger failure from happening, or at least minimize the negative impact on the cluster.

The First Loss

Patrick didn’t even manage to recover from the trauma of losing most of his potential income on the only day throughout the year during which green fedoras experienced any kind of demand, when the letter came. It included an angry rant from a would-be customer, who successfully proceeded with his order and paid for it (with a receipt from the payment processing operator as proof), but is now unable to either see any details of his order—and he’s still waiting for the delivery!

Without further ado, Patrick browsed the database. To his astonishment, he didn’t find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick’s initial executive decisions was to save time and money by not scheduling any periodic backup procedures.

How did data loss happen to him, of all people? After studying the consistency model of his database of choice, Patrick realized that there’s consensus to make between consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizabilityFootnote7 at the cost of decreased throughput, or reduce the consistency guarantees and increase performance accordingly. Higher throughput capabilities were a no-brainer for Patrick a few days ago, but ultimately customer data landed on a single server without any replicas distributed in the system. Once this server failed—which happens to hardware surprisingly often, especially at large scale—the data was gone.

Patrick’s Diary of Lessons Learned, Part III

Further lessons include:

Backups are vital in a distributed environment, and there’s no such thing as setting backup routines “too soon.” Systems fail, and backups are there to restore as much of the important data as possible.
Every database system has a certain consistency model, and it’s crucial to take that into account when designing your project. There might be compromises to make. In some use cases (think financial systems), consistency is the key. In other ones, eventual consistency is acceptable, as long as it keeps the system highly available and responsive.

The Spike Strikes Again

Months went by and Patrick’s sleeping schedule was even beginning to show signs of stabilization. With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th to scale up the cluster to manage elevated traffic, he felt moderately safe.

If only he knew that a ten-second video of a cat dressed as a leprechaun had just gone viral in Malaysia… which, taking time zone into account, happened around 2am Patrick’s time, ruining the aforementioned sleep stabilization efforts.

On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing Internet trends. Patrick was able to observe a real-life instantiation of Little’s Law, which he vaguely remembered from his days at the university. With a beautifully concise formula, L = λW, the law can be simplified to the fact that concurrency equals throughput times latency.

TIP: For those having trouble with remembering the formula, think units. Concurrency is just a number, latency can be measured in seconds, while throughput is usually expressed in 1/s. Then, it stands to reason that in order for units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/s). You’re welcome!

Throughput depends on the hardware and naturally has its limits (e.g., you can’t expect a NVMe drive purchased in 2023 to serve the data for you in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future!) Once the limit is hit, you can treat it as constant in the formula. It’s then clear that as concurrency rises, so does latency. For the end-users—Malaysian teenagers in this scenario—it means that the latency is eventually going to cross the magic barrier for the average human perception of a few seconds. Once that happens, users get too frustrated and simply give up on trying altogether, assuming that the system is broken beyond repair. It’s easy to find online articles quoting that “Amazon found that 100ms of latency costs them 1 percent in sales”; although it sounds overly simplified, it is also true enough.

Patrick’s Diary of Lessons Learned, Part IV

The lessons continue:

Unexpected spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the negative effects of excessive concurrency. Expecting the database to handle it properly is not without merit, but not every database is capable of that. If possible, limit the concurrency in your system as early as possible. For instance, if the database is never touched directly by customers (which is a very good idea for multiple reasons) but instead is accessed through a set of microservices under your control, make sure that the microservices are also aware of the concurrency limits and adhere to them.
Keep in mind that Little’s Law exists—it’s fundamental knowledge for anyone interested in distributed systems. Quoting it often also makes you appear exceptionally smart among peers.

Backup Strikes Back

After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable.

Unfortunately, the next March 17th didn’t go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and capable of handling the load of customers, with a healthy safe margin. But then the dashboards stopped, kindly mentioning that the disks became severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load…

Patrick’s Diary of Lessons Learned, Part V
Concluding thoughts:

Database systems are hardly ever idle, even without incoming user requests. Maintenance operations often happen and you must take them into consideration because they’re an internal source of concurrency and resource consumption.
Whenever possible, schedule maintenance options for times with expected low pressure on the system.
If your database management system supports any kind of quality of service configuration, it’s a good idea to investigate such capabilities. For instance, it might be possible to set a strong priority for user requests over regular maintenance operations, especially during peak hours. Respectively, periods with low user-induced activity can be utilized to speed up background activities. In the database world, systems that use a variant of LSM trees for underlying storage need to perform quite a bit of compactions (a kind of maintenance operation on data) in order to keep the read/write performance predictable and steady.

The end.