
Database Performance Questions from Google Cloud Next

Spiraling cache costs, tombstone nightmares, old Cassandra pains, and more — what people were asking about at Google Cloud Next

You’ve likely heard that what happens in Vegas stays in Vegas…but we’re making an exception here.

Last week at Google Cloud Next in Las Vegas, my ScyllaDB colleagues and I had the pleasure of meeting all sorts of great people. And among all the whack-a-monster fun, there were lots of serious questions about database performance in general and ScyllaDB in particular.

In this blog, I’ll share some of the most interesting questions that attendees asked and recap my responses.

Cache

We added Redis in front of our Postgres but now its cost is skyrocketing. How can ScyllaDB help in this case?

We placed Redis in front of DynamoDB because DAX is too expensive, but managing cache invalidation is hard. Any suggestions?

Adding a cache layer in front of a slower database is a very common pattern. After all, if the cache layer delivers response times in the low-millisecond range while the backend database serves requests in the three-digit millisecond range, the decision might seem like a no-brainer.

However, the tradeoffs often turn out to be steeper than people initially anticipate:

  • First, you need to properly size the cache so its cost doesn’t outweigh its usefulness. Learning the intricacies of the workload (e.g., which pieces of data are accessed more often than others) is essential for deciding what to cache and what to pass through to the backend database. If you underestimate the required cache size, the performance gain of having a cache might be less than ideal: since only part of the data fits in the cache, the database still gets hit frequently, which elevates latencies across the board.
  • Deciding what to keep in cache is also important. The eviction policy you define for cached data can make or break the data lifecycle in that layer, greatly affecting its impact on long-tail latency. The application is also responsible for caching responses, which means additional code that must be maintained to ensure consistency, synchronicity, and high availability of those operations.
  • Another issue that pops up really often is cache invalidation: how to keep a cache that is separate from the backend database up to date. Once a piece of data is deleted or updated, the change has to be synchronized with the cache, and any failure along the way means serving stale data (a pattern sketched in the example below). Integrated solutions such as DAX for DynamoDB help because they provide a pass-through caching layer: the database is updated first, then the system takes care of reflecting the change in the cache layer. The tradeoff is cost: you end up paying more for DAX than you would for simply running a similarly sized Redis cluster.
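
To make that maintenance burden concrete, here is a minimal cache-aside sketch in Python, assuming a hypothetical `users` table in Postgres with Redis in front (the table, key format, and TTL are illustrative, not taken from any particular deployment):

```python
import json

import psycopg2  # Postgres driver
import redis     # Redis client

cache = redis.Redis(host="localhost", port=6379)
db = psycopg2.connect("dbname=app user=app")

CACHE_TTL_SECONDS = 300  # sizing and eviction decisions live in application code


def get_user(user_id: int):
    """Read path: try the cache first, fall back to Postgres on a miss."""
    cached = cache.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)

    with db.cursor() as cur:
        cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    if row is None:
        return None

    user = {"id": row[0], "name": row[1], "email": row[2]}
    # Populating the cache (and deciding what is worth caching) is on the application.
    cache.setex(f"user:{user_id}", CACHE_TTL_SECONDS, json.dumps(user))
    return user


def update_user_email(user_id: int, email: str) -> None:
    """Write path: update the database, then invalidate the cached entry.
    If the process dies between these two steps, readers keep seeing stale
    data until the TTL expires; that window is the invalidation problem."""
    with db.cursor() as cur:
        cur.execute("UPDATE users SET email = %s WHERE id = %s", (email, user_id))
    db.commit()
    cache.delete(f"user:{user_id}")
```

All of that application-side logic, and the stale-data window it creates, is exactly what teams hope to retire when the database itself can serve reads at cache-like latency.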

ScyllaDB’s performance characteristics have allowed many teams to replace both their cache and database layers. By bypassing the Linux page cache and caching data at the row level, ScyllaDB makes cache space utilization far more efficient. Relying on that efficient internal cache, ScyllaDB can provide single-digit millisecond p99 read latency while still reducing the overall infrastructure required to run workloads. Its design also allows for extremely fast access to data on disk.

Even beyond that caching layer, ScyllaDB efficiently serves data from disk at very predictable, ultra-low latency. ScyllaDB’s IO scheduler is optimized to maximize disk bandwidth while still delivering predictable low latency for operations. You can learn more about our IO scheduler in this blog post.

ScyllaDB maintains cache performance by using a Least Recently Used (LRU) eviction algorithm: keys that were not recently accessed may be evicted to make room for other data to be cached. However, evicted keys are still persisted on disk (and replicated!) and can be read efficiently at any time. This is a big advantage over Redis, where relying on a persistent store outside of memory is challenging.

Read more in our Cache Internals blog, cache comparison page, and blog on replacing external caches.

Tombstones

I’ve had tons of issues with tombstones in the past with Cassandra… Performance issues, data resurrection, you name it. It’s still pretty hard dealing with the performance impact. How does ScyllaDB handle these issues?

In the LSM (Log-Structured Merge tree) model, deletes are handled just like regular writes. The system accepts the delete command and creates what is called a tombstone: a marker for a deletion. The deletion marker is later merged with the rest of the data, either in a background process called compaction or in memory at read time.

Tombstone processing has historically posed a couple of challenges. One of them is handling what are known as range deletes: a single deletion that covers multiple rows. For instance, “DELETE … WHERE … ClusteringKey < X” deletes all records whose clustering key is lower than X. To serve a read, the system may have to scan through an unknown amount of data before it reaches the tombstone, and then discard all of that data from the result set. If the number of rows is small, the read is still very efficient. But if the tombstone covers millions of rows, reading data just to discard it becomes very expensive.
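
To make this concrete, here is roughly what such a range delete looks like against a hypothetical time-series table, issued through the Python driver (the keyspace, table, and column names are made up for the example):

```python
import datetime
import uuid

from cassandra.cluster import Cluster  # the same driver works for ScyllaDB

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.events (
        device_id uuid,
        ts        timestamp,
        value     double,
        PRIMARY KEY (device_id, ts)  -- one partition per device, rows ordered by time
    )
""")

device_id = uuid.uuid4()
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)

# A single statement covering every row older than the cutoff. It writes one
# range tombstone; until compaction removes the shadowed rows, reads of this
# partition must merge them against the tombstone and discard them.
session.execute(
    "DELETE FROM metrics.events WHERE device_id = %s AND ts < %s",
    (device_id, cutoff),
)
```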

Tombstones are also the source of another concern in distributed systems: data resurrection. Since Cassandra’s (and originally ScyllaDB’s) tombstones were kept only up to the grace period (a.k.a. gc_grace_seconds, 10 days by default), a repair had to be run on the cluster within that time frame. Skipping this step could lead to tombstones being purged, and previously deleted data no longer covered by a tombstone could come back to life (a.k.a. “data resurrection”).

ScyllaDB recently introduced a number of improvements in how it handles tombstones, from repair-based garbage collection to expired-tombstone thresholds that trigger early compaction of SSTables. Tombstone processing is now much more efficient than Cassandra’s (and even earlier versions of ScyllaDB’s), especially in workloads that are prone to accumulating tombstones over time.

ScyllaDB’s repair-based garbage collection also helps prevent data resurrection by ensuring tombstones only become eligible for purging after a repair has completed. This means workloads can get rid of tombstones much faster and keep reads efficient.
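
ScyllaDB exposes this behavior through the `tombstone_gc` table option. A minimal sketch, reusing the hypothetical metrics.events table from the earlier example, might look like this:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# With mode 'repair', tombstones only become purgeable once the data they cover
# has been repaired, instead of after a fixed gc_grace_seconds window.
session.execute("ALTER TABLE metrics.events WITH tombstone_gc = {'mode': 'repair'}")
```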

Learn more about this functionality on our blog Preventing Data Resurrection with Repair Based Tombstone Garbage Collection.

Bigtable and Friends

When would you recommend ScyllaDB over Spanner/Bigtable/BigQuery?

Questions about how our product compares to the conference host’s own cloud databases are unsurprisingly common, and Google Cloud Next was no exception. Attendees shared a lot of use cases currently built on Google Cloud databases and were curious about alternatives aligned with their goals (scalability, global replication, performance, cost). Could ScyllaDB help them, or should they move on to another booth? It really depends on how they’re using their database, as well as the nature of their workload.

Let’s review the most commonly asked-about Google Cloud databases:

  • Spanner is highly oriented towards relational workloads at global scale. While it can still perform well for distributed NoSQL workloads, its performance and cost may pose challenges at scale.
  • BigQuery is a high-performance analytical database. It can run really complex analytical queries, but it’s not a good choice for NoSQL workloads that require high throughput and low latency at scale.
  • Bigtable is Google Cloud’s wide-column NoSQL database. It is the most similar to ScyllaDB’s design, with a focus on scalability and high throughput.

From the descriptions above, it’s easy to see: if the use case is inherently relational or heavy on complex analytical queries, ScyllaDB might not be the best choice. However, just because a team is currently using a relational or analytics database doesn’t mean they’re leveraging the best tool for the job. If the application relies on point queries that fetch data from a single partition (even one containing multiple rows), then ScyllaDB might be an excellent choice.

ScyllaDB also implements advanced features such as Secondary Indexes (Local and Global) and Materialized Views, which give users very efficient indexes and table views that still provide the same performance as their base table.
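
As a quick sketch (with a hypothetical keyspace and table, not a reference schema), here is how a global secondary index and a materialized view might be created and queried from the Python driver:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")  # assumes a 'shop' keyspace exists

session.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id uuid PRIMARY KEY,
        customer text,
        total    decimal
    )
""")

# Global secondary index: query by a non-key column without a full scan.
session.execute("CREATE INDEX IF NOT EXISTS orders_customer_idx ON orders (customer)")

# Materialized view: a server-maintained view of the base table, keyed by customer.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS orders_by_customer AS
        SELECT * FROM orders
        WHERE customer IS NOT NULL AND order_id IS NOT NULL
        PRIMARY KEY (customer, order_id)
""")

rows = session.execute(
    "SELECT * FROM orders_by_customer WHERE customer = %s", ("alice",)
)
```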

Cloud databases are usually very easy to adopt: just a couple of clicks or an API call, and they are ready to serve your environment. Their performance is usually fine for general use. However, for use cases with strict latency or throughput requirements, it’s worth considering performance-focused alternatives. ScyllaDB has a track record of being extremely efficient and fast, providing predictable low tail latency at p99.

Cost is another factor. Scaling a workload to millions of operations per second might be technically feasible on some databases, but it can incur a surprisingly high cost. ScyllaDB’s inherent efficiency allows workloads to run at scale at a greatly reduced cost.

Another downside of using a cloud vendor’s managed database solution is ecosystem lock-in. If you decide to leave the vendor’s platform, you usually can’t take the same service with you, whether to another cloud provider or on-premises. If a team needs to migrate to another deployment model, ScyllaDB provides robust support for moving to any cloud provider or running in an on-premises datacenter.

Read our ScyllaDB vs Bigtable comparison.

Schema Mismatch

How does ScyllaDB handle specific problems such as schema mismatch?

This user shared a painful Cassandra incident in which an old node, initially set up as part of a sandbox cluster, was misconfigured. That mistake, possibly caused by IP overlap resulting from infrastructure drift over time, led the old node to join the production cluster, where it essentially garbled the production schema and broke it.

Since Cassandra relies on the gossip protocol (an epidemic peer-to-peer protocol), the broken schema was replicated to the whole cluster and left it in an unusable state.

That mistake ended up costing this user hours of troubleshooting and caused a production outage that lasted for days. Ouch!

After they shared their horror story, they inquired: Could ScyllaDB have prevented that?

With the introduction of consistent schema changes built on the Raft distributed consensus algorithm, ScyllaDB made schema changes safe and consistent in a distributed environment. Raft routes all changes through a leader node, which ensures that any change not agreed upon by the leader is rejected rather than applied to the cluster.

The issue this user reported simply would not exist in a Raft-enabled ScyllaDB cluster. Schema management would reject the rogue schema version and the node would fail to join the cluster, which is exactly what needed to happen to prevent the problem!

Additionally, ScyllaDB transitioned from identifying nodes by IP address to identifying them by host UUID, effectively removing any chance that a node at a recycled IP address reconnects to a cluster it was never part of.

Read the Consistent Schema Changes blog and its follow-up. Additionally, learn more about the change to Strongly Consistent Topology Changes.

Old Cassandra Pains

I have a very old, unmaintained Cassandra cluster running a critical app. How do I safely migrate to ScyllaDB?

That is a very common question. First, let’s unpack it a bit.

Let’s start with what “old” means. Cassandra 2.1 was released 10 years ago, but it is still supported by the ScyllaDB Spark connector… and that means it can be easily migrated to a shiny ScyllaDB cluster (as long as its schema is compatible).

“Unmaintained” can also mean a lot of things. Did the cluster just miss some upgrade cycles? Or is it also behind on maintenance steps such as repairs? Even if that’s the case, no problem: our Spark-based ScyllaDB Migrator supports tunable consistency for reads and writes, meaning it can be configured to use LOCAL_QUORUM or even ALL consistency if required. Although that’s not recommended in most cases (for performance reasons), it would ensure consistent reads as data is migrated over to the new cluster.

Now, let’s discuss migration safety. To maintain consistency throughout the migration, the application should be configured to dual-write to both the source and destination clusters: it sends parallel writes to each and ensures that any failures are retried. It’s also a good idea to collect metrics or logs on errors so you can keep track of inconsistencies across the clusters.
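
Here is a minimal dual-write sketch in Python, assuming a hypothetical events table and made-up cluster addresses; a production version would typically issue the two writes asynchronously and in parallel rather than sequentially:

```python
import logging

from cassandra.cluster import Cluster

log = logging.getLogger("dual_writes")

# Hypothetical contact points for the old and new clusters.
cassandra_session = Cluster(["cassandra.old.internal"]).connect("app")
scylla_session = Cluster(["scylla.new.internal"]).connect("app")

INSERT_CQL = "INSERT INTO events (id, payload) VALUES (%s, %s)"


def write_event(event_id, payload, retries: int = 3) -> None:
    """Write to both clusters, retrying each and logging permanent failures."""
    for name, session in (("cassandra", cassandra_session), ("scylla", scylla_session)):
        for attempt in range(1, retries + 1):
            try:
                session.execute(INSERT_CQL, (event_id, payload))
                break
            except Exception:
                log.exception("write to %s failed (attempt %d)", name, attempt)
        else:
            # Record which cluster missed the write so it can be reconciled later.
            log.error("giving up on %s for event %s", name, event_id)
```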

Once dual writes are enabled, historical data can be migrated using the ScyllaDB Migrator. Since it’s based on Spark, the migrator can easily scale to as many workers as needed to speed up the migration process.

After migrating the historical data, you can run a read validation process: reading from both sources and comparing the results until you are confident in the consistency of the migrated data.
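
Continuing the hypothetical sketch above, a simple validation pass might sample a set of keys and compare the row each cluster returns:

```python
def validate(keys) -> int:
    """Read each sampled key from both clusters and count mismatches."""
    mismatches = 0
    for key in keys:
        old_row = cassandra_session.execute(
            "SELECT * FROM events WHERE id = %s", (key,)
        ).one()
        new_row = scylla_session.execute(
            "SELECT * FROM events WHERE id = %s", (key,)
        ).one()
        if old_row != new_row:
            mismatches += 1
            log.warning("mismatch for key %s: %r vs %r", key, old_row, new_row)
    return mismatches
```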

Once you are confident that all data has been migrated, you can finally get rid of the old cluster and have your application run solely on the new one.

If the migration process still seems daunting, we can help. ScyllaDB has a team available to guide you through the migration, from planning to best practices at every step. Reach out to Support if you are considering migrating to ScyllaDB!

We also have tons of resources to help users migrate.

Wrap

These conversations are only a select few of the many good discussions the ScyllaDB team had at Google Cloud Next. Every year, we are amazed at the wide variety of stories shared by the people we meet, and conversations like these are exactly what keep us coming back.

If you’d like to reach out, share your story, or ask questions, here are a couple of resources you can leverage:

If you are wondering if ScyllaDB is the right choice for your use cases, you can reach out for a technical 1:1 meeting.

About Guilherme da Silva Nogueira

Guilherme Nogueira is an IT professional with more than 15 years of experience, specializing in large-scale database solutions. He enjoys tackling unique challenges that only NoSQL technologies can solve and helping users navigate complex and ever-scaling data environments. Outside of work, he is an avid Linux gamer. Guilherme is a technical director at ScyllaDB.
