Access patterns determine your data model, your I/O costs, and which database is the best fit for your workload
I’ve been part of enough key-value database evaluations to recognize the pattern. When the conversation starts with benchmarks, the evaluation inevitably ends with regret. The benchmark answers “which is faster?” It doesn’t tell you which model fits how your application actually reads and writes data – and that’s what matters.
Every data modeling decision should begin with access patterns, regardless of the technology on the table. What does your application read? At what granularity? What does it write? How often? How large? Let those answers drive the data model, then pick the technology. Flip that order and you pay for it. A fast database like ScyllaDB amplifies schema decisions: good models perform well, bad ones break faster.
Edgar Codd introduced First Normal Form (1NF) in 1970, when disk space was scarce and expensive. A terabyte of NVMe now costs about the same as lunch, so the rule has outlasted the constraint that helped justify it, yet we still teach it. That’s partly why so many teams expect to normalize their data with ScyllaDB the way they would a relational schema. But if they don’t get the order right (access patterns > data model > technology), they won’t get the performance that the engine was built to deliver.
A lot of the confusion comes down to terminology. “Key-value” is one of the most overloaded labels in the database industry. We use it to describe both:
- A system that maps a string to an opaque blob
- A system that maps a partition key plus a clustering key to typed, individually addressable columns with partial-update semantics.
Lumping these together hides the architectural decisions that determine your I/O patterns and your infrastructure costs.
In practice, the “key-value” label covers three very different data models: K-V, K-V Wide Table, and KKV Wide Table. They differ in capability and in how deeply you can address your data. Pick the wrong one for your access patterns and you pay for it in I/O overhead, infrastructure cost, and write throughput.
ScyllaDB can operate across multiple levels of this hierarchy, so the level you select is a schema decision that shapes your I/O patterns, your update costs, and your infrastructure spend.
Key-Value vs Wide-Column: Four Levels of Access Pattern Depth
Instead of looking at feature lists, it’s better to compare these models by access pattern depth: at what level can you address, read, and write your data?
Level 1: Key level. One key maps to one value. The value is opaque: the database has no knowledge of what is inside it. You put it in and you get it back. This is K-V, the model behind most caching layers and session stores. Redis is the canonical example. The ceiling is the value boundary – you can replace the value, but you cannot address inside it.
Level 2: Row level. A primary key maps to a set of named bins. Each bin holds a schemaless value. You can address individual bins by name, project specific bins in a read, and update bins independently. This is K-V Wide Table: one key, multiple named fields, no schema enforcement on values. This model adds meaningful structure over K-V without requiring upfront schema design. Aerospike is the canonical example here. The ceiling is the bin boundary – you can update a bin, but you cannot address inside one.
Level 3: Column level. A partition key combined with a clustering key addresses a row. Each column in that row is individually typed, so the database understands the type of every value it stores. This is KKV Wide Table; the two-key model is what puts the second K in KKV. Typed columns enable the database to make smarter decisions about storage layout, compression, and update semantics. Cassandra reaches this level. The ceiling is the column boundary – columns are typed and addressable, but complex values inside a column must be declared frozen. In other words, the entire value is serialized as a single blob that the engine cannot see into.
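To make the first three levels concrete, here is a minimal Python sketch that models each one with plain dictionaries. It is a toy, not how Redis, Aerospike, Cassandra, or ScyllaDB store data; the names `kv_store`, `bin_store`, `kkv_store`, `EventRow`, and the sample keys are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Level 1: one key -> one opaque value. The store never looks inside the bytes.
kv_store = {}
kv_store["session:42"] = b'{"user": "ana", "cart": [1, 2, 3]}'

# Level 2: one key -> named bins holding schemaless values.
bin_store = {}
bin_store["user:42"] = {"name": "ana", "visits": 7, "prefs": {"theme": "dark"}}


def read_bins(key, names):
    """Project only the requested bins instead of returning the whole record."""
    record = bin_store[key]
    return {name: record[name] for name in names}


# Level 3: (partition key, clustering key) -> a row of typed columns.
@dataclass
class EventRow:
    event_ts: datetime   # doubles as the clustering key in this toy model
    event_type: str
    amount_cents: int


kkv_store = {}  # partition key -> {clustering key -> EventRow}


def put_row(partition_key, row):
    """Store a typed row under its partition key and clustering key."""
    kkv_store.setdefault(partition_key, {})[row.event_ts] = row


def read_column(partition_key, clustering_key, column):
    """Address one typed column of one row."""
    return getattr(kkv_store[partition_key][clustering_key], column)


ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
put_row("user:42", EventRow(ts, "purchase", 1999))
print(read_bins("user:42", ["name", "visits"]))
print(read_column("user:42", ts, "amount_cents"))
```

Notice where each model stops: Level 1 hands back bytes, Level 2 hands back a bin, and Level 3 hands back a typed column from a specific row.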
Level 4: Within-column level. This is a key differentiator for KKV Wide Table. The engine starts working at a granularity that the other models can’t reach.
A KKV Wide Table column can hold a collection: a map, a set, a list, a user-defined type, or nested combinations of these. Whether the database can address what’s inside that collection determines your actual access pattern depth. A frozen collection is serialized as a single blob. The engine stores it, retrieves it, and replaces it, but cannot see inside it. An unfrozen collection is stored element by element. Each entry is individually addressable. That distinction is the central architectural argument at this level.
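The frozen versus unfrozen distinction can be sketched in a few lines of Python. This is a conceptual model only: JSON stands in for the engine's real serialization, and the class names are hypothetical. In CQL terms, the two classes roughly correspond to declaring a column as `frozen<map<text, text>>` versus `map<text, text>`.

```python
import json
import time


def now_micros():
    """Write timestamp in microseconds, in the spirit of CQL write timestamps."""
    return time.time_ns() // 1_000


class FrozenColumn:
    """Whole collection stored as one serialized cell with one timestamp."""

    def __init__(self):
        self.cell = None  # (write_timestamp, serialized_bytes)

    def write(self, value):
        # The entire value is replaced on every write.
        self.cell = (now_micros(), json.dumps(value, sort_keys=True).encode())

    def read(self):
        return json.loads(self.cell[1])


class UnfrozenColumn:
    """Each element stored as its own cell with its own timestamp."""

    def __init__(self):
        self.cells = {}  # element key -> (write_timestamp, value)

    def put(self, key, value):
        # Only the addressed element is touched; no other element is read.
        self.cells[key] = (now_micros(), value)

    def read(self):
        return {key: value for key, (_, value) in self.cells.items()}


frozen = FrozenColumn()
frozen.write({"theme": "dark", "lang": "en"})

unfrozen = UnfrozenColumn()
unfrozen.put("theme", "dark")
unfrozen.put("lang", "en")
unfrozen.put("theme", "light")  # element-level update
```

The frozen column carries one timestamp for the whole value, while the unfrozen column carries one per element. That is the trade the rest of this section keeps coming back to.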
Cassandra touches this level but can’t reliably live here. Unfrozen collections exist in Cassandra, but tombstone accumulation makes them a liability in production.
In ScyllaDB, Level 4 becomes practical. With an unfrozen collection, ScyllaDB stores each element individually. Whether you add an entry to a map, append to a list, or remove an element from a set, no read is required first; the database operates at element level.
With a frozen collection, ScyllaDB serializes the entire value as a single cell. The engine can’t address inside it. For whole-value access patterns, that’s not a limitation; it’s an optimization. With a frozen collection:
- There’s no per-element metadata.
- Reads pull one contiguous cell.
- Writes replace one contiguous cell.
ScyllaDB’s UDT performance benchmarks show frozen collections outperforming unfrozen ones by up to 228% on write throughput and 162% on read throughput for 50-field UDTs. For the right access pattern, frozen is the faster choice.
Don’t focus on frozen vs unfrozen; look at access pattern first and the right tool should follow from there.
Frozen itself is never the problem; the mismatch between schema and access pattern is what causes the performance difference. An engineer who needs element-level updates and chooses frozen UDTs has, for those columns, given back Level 4 access. The operation degrades to read-modify-write: read the entire value, apply the change in memory, write it back as a whole. That is the same pattern a K-V Wide Table bin requires. The technology supports Level 4, but the schema choice has opted out of it.
The opposite mistake is also a problem. An engineer who uses large unfrozen collections for values they always access as a whole pays for per-element TTL and timestamp metadata on every element in the collection, and keeps paying for it at compaction time. A map with 10K entries carries 10K individual metadata records. That overhead snowballs over time.
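A quick back-of-the-envelope calculation shows how that overhead scales. The 16 bytes per record used below is a placeholder I picked for illustration (think of a fixed-width timestamp plus expiry information), not a measured ScyllaDB figure; real on-disk encodings are more compact and variable. The function names are hypothetical too.

```python
def unfrozen_metadata_bytes(elements, bytes_per_metadata_record):
    """Metadata grows linearly with the number of elements in the collection."""
    return elements * bytes_per_metadata_record


def frozen_metadata_bytes(bytes_per_metadata_record):
    """A frozen collection is one cell, so it carries one metadata record."""
    return bytes_per_metadata_record


ASSUMED_RECORD_BYTES = 16  # placeholder value, not a measurement

unfrozen_total = unfrozen_metadata_bytes(10_000, ASSUMED_RECORD_BYTES)
frozen_total = frozen_metadata_bytes(ASSUMED_RECORD_BYTES)
print(unfrozen_total, frozen_total, unfrozen_total // frozen_total)
```

Whatever the real per-record size turns out to be, the ratio is the point: the unfrozen map carries 10,000 times the metadata of the frozen one, and every compaction has to process it again.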
Choose frozen collections when you access the value as a whole. Choose small unfrozen collections when you need element-level updates. Large unfrozen collections are their own design smell, regardless of access pattern.
How Access Pattern Depth Meets Memory: Three Scenarios
The relationship between your dataset size and available memory determines which architecture is working with its strengths and which one is working against them.
Scenario 1: Everything Fits in Memory
When the entire dataset lives in RAM, a memory-resident hash index is fast. Point lookups are a hash computation and a pointer dereference. This is where K-V and K-V Wide Table architectures shine for read latency.
But “what’s fast?” and “what’s cost-effective?” are different questions. If your dataset is 2 TB, you are paying for 2 TB of RAM across your cluster. An architecture designed around SSDs with efficient memory-resident metadata can deliver reads in the low hundreds of microseconds while your data lives on storage that costs a fraction of RAM per gigabyte. Although the access pattern performance difference on reads may be negligible, the infrastructure cost difference is not.
This is also the scenario where honesty matters. If your access pattern is truly “put blob, get blob” on ephemeral data with simple lookups, a K-V store is the right tool. The operational simplicity is a genuine advantage. There are fewer moving parts and fewer things to misconfigure. If your values are small and your queries never need to reach inside them, a K-V store will serve you well and be easy to operate.
Scenario 2: Keys Fit in Memory, Values Do Not
This is what K-V Wide Table architectures market as their sweet spot. Here, you have a primary index in memory, records on SSD, and fast key lookups that pull values from disk.
For simple reads, bin projection works well here. Request three specific bins, get three bins back. You are not forced to read the entire record on every read.
The problem surfaces at Level 4. Assume one bin holds a serialized map of user preferences and you need to update a single entry in that map. In this case, the system must:
- Read the entire bin from disk
- Deserialize the collection structure in memory
- Apply the modification
- Serialize the updated structure
- Write the entire bin back.
That is a read-modify-write cycle on every collection update, regardless of how small the change is. The K-V Wide Table model has no path to Level 4 access. The bin is the floor.
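The five steps above can be sketched as a toy bin store that counts the bytes it touches. This is an illustration under simplifying assumptions: JSON stands in for the real serialization format, a dictionary stands in for the SSD, and `BinStore` is a hypothetical class, not any vendor's API.

```python
import json


class BinStore:
    """Toy stand-in for a K-V Wide Table: each bin is an opaque serialized value."""

    def __init__(self):
        self.disk = {}          # (key, bin_name) -> serialized bytes
        self.bytes_read = 0
        self.bytes_written = 0

    def write_bin(self, key, bin_name, value):
        blob = json.dumps(value).encode()
        self.disk[(key, bin_name)] = blob
        self.bytes_written += len(blob)

    def update_map_entry(self, key, bin_name, entry_key, entry_value):
        blob = self.disk[(key, bin_name)]            # 1. read the entire bin
        self.bytes_read += len(blob)
        collection = json.loads(blob)                # 2. deserialize
        collection[entry_key] = entry_value          # 3. apply the modification
        new_blob = json.dumps(collection).encode()   # 4. serialize
        self.disk[(key, bin_name)] = new_blob        # 5. write the entire bin back
        self.bytes_written += len(new_blob)


store = BinStore()
prefs = {f"pref_{i}": "off" for i in range(500)}
store.write_bin("user:42", "prefs", prefs)
full_size = len(store.disk[("user:42", "prefs")])

# Reset the counters, then change a single entry.
store.bytes_read = 0
store.bytes_written = 0
store.update_map_entry("user:42", "prefs", "pref_7", "on")
print(full_size, store.bytes_read, store.bytes_written)
```

Changing one short string forces the store to read and rewrite the whole 500-entry bin. The I/O cost tracks the size of the collection, not the size of the change.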
A KKV Wide Table model with unfrozen collections handles the same update without a read. The new map entry goes directly to the commit log (the write-ahead log) and the memtable (the in-memory table). There’s no deserialization and no full-bin read. The merge with existing data is deferred: reads reconcile the element cells by timestamp, and compaction consolidates them in the background without blocking the write path.
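Here is a minimal sketch of that write path, assuming a drastically simplified LSM design: a Python list plays the commit log, a dictionary plays the memtable, and flushed dictionaries play SSTables. `TinyLsmMap` and its method names are hypothetical; real engines add durability, tombstones, TTLs, and much more.

```python
class TinyLsmMap:
    """Toy LSM store for the elements of one unfrozen map column."""

    def __init__(self):
        self.commit_log = []   # append-only (timestamp, key, value) records
        self.memtable = {}     # key -> (timestamp, value)
        self.sstables = []     # immutable flushed tables, oldest first

    def put(self, key, value, timestamp):
        """Blind write: append to the log, update the memtable, read nothing from disk."""
        self.commit_log.append((timestamp, key, value))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)

    def flush(self):
        """Turn the memtable into an immutable on-disk table."""
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def compact(self):
        """Background merge: keep only the newest cell per element."""
        merged = {}
        for table in self.sstables:
            for key, (ts, value) in table.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        self.sstables = [merged] if merged else []

    def get(self, key):
        """Reads reconcile every source and return the newest value by timestamp."""
        candidates = [t[key] for t in self.sstables + [self.memtable] if key in t]
        if not candidates:
            return None
        return max(candidates, key=lambda cell: cell[0])[1]


prefs = TinyLsmMap()
prefs.put("theme", "light", timestamp=1)
prefs.put("lang", "en", timestamp=2)
prefs.flush()                            # old values now live "on disk"
prefs.put("theme", "dark", timestamp=3)  # update touches only this one element
print(prefs.get("theme"), len(prefs.sstables))
prefs.flush()
prefs.compact()
print(prefs.get("theme"), len(prefs.sstables))
```

The `put` call never looks at the flushed table, which is the whole point. Reconciliation is paid at read time and during compaction rather than on the write path.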
Compression: typed columns vs. schemaless bins. K-V Wide Table bins are schemaless. Within an SSTable block, different records interleave bin data without type information. That limits what a compressor can do across records.
A KKV Wide Table stores typed column data within the same partition contiguously in SSTable blocks. For example, ScyllaDB writes all values for the event_ts column across rows in a partition together. Because those values share the same type, a dictionary-based compressor like zstd has much more to work with. This is not columnar storage in the analytics sense. ScyllaDB is an LSM-tree row-based engine at the partition level, not Parquet. The compression benefit comes from typed column homogeneity within SSTable blocks rather than a columnar storage layout.
Frozen vs. unfrozen compression tradeoffs. Frozen UDTs compress well for a specific reason. A frozen UDT is a single cell with a consistent serialized layout. The same 50-field structure appears as the same byte sequence across records, which dictionary compression handles efficiently.
Unfrozen collections are a different story. Each element carries its own TTL and timestamp metadata. ScyllaDB groups column values within SSTable blocks, which helps the element values themselves compress, but the metadata overhead scales with collection cardinality. For small unfrozen collections, it’s negligible. For large unfrozen collections, it can negate a meaningful portion of the compression gain. The compression advantage of typed columns applies most cleanly to simple typed columns and small unfrozen collections.
Data locality. In a shard-per-core architecture (e.g., ScyllaDB’s), all columns within a partition live on the same CPU core. A read that touches three columns in a single partition involves zero cross-core coordination. This avoids locking and message passing between threads. This data locality might not be significant at low throughput. However, it matters a lot at hundreds of thousands of operations per second.
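The locality argument follows from routing on the partition key alone. The sketch below assumes a simple hash-modulo mapping with SHA-256, which is a stand-in and not ScyllaDB's actual token and sharding algorithm; `shard_for` is a hypothetical helper.

```python
import hashlib


def shard_for(partition_key, shard_count):
    """Map a partition key to one shard (CPU core) using a stand-in hash."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    token = int.from_bytes(digest[:8], "big")
    return token % shard_count


SHARDS = 8  # assumed core count for the example

# Every column and every row of a partition routes on the same key,
# so a multi-column read of one partition is served by a single shard.
columns = ["event_ts", "event_type", "amount_cents"]
owners = {column: shard_for("user:42", SHARDS) for column in columns}
print(owners)
```

Because the column name never enters the hash, all three lookups land on the same shard by construction. No cross-core coordination is needed to assemble the row.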
Scenario 3: Neither Keys Nor Values Fit in Memory
This is where memory-dependent index architectures hit a wall. If your architecture puts the primary index in RAM and your keyspace outgrows available memory, you are either:
- Adding nodes to hold the index, or
- Paging index entries to disk, which adds a disk read in front of every data read
An architecture built for disk-resident data from the start does not have this problem. ScyllaDB (and to a degree Cassandra) uses Bloom filters to determine probabilistically whether a partition exists in a given SSTable without loading a full index into memory. Partition index summaries provide efficient lookup with a small, fixed memory footprint regardless of key count. And compaction strategies manage on-disk data organization to keep read amplification bounded.
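If you have not worked with Bloom filters before, this minimal sketch shows the idea. It is a textbook version with assumed parameters (1,024 bits, three hash functions derived from SHA-256), not the implementation or sizing that ScyllaDB or Cassandra uses.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array that answers 'definitely absent' or 'possibly present'."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


sstable_filter = BloomFilter()
for i in range(50):
    sstable_filter.add(f"user:{i}")

if sstable_filter.might_contain("user:7"):
    print("possibly in this SSTable: go read the index and data")
```

The filter uses 128 bytes here no matter how long the keys are. A negative answer lets the read path skip an SSTable entirely, while a positive answer only means the SSTable is worth checking.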
This is all strategic design for an architecture that assumes data will not fit in memory. Don’t just think about whether a system can handle disk-resident data; consider whether it was designed for it.
The Update Path: Where Access Depth Becomes I/O Pattern
Most evaluations obsess over reads. However, the update path is where access pattern depth differences tend to surface at scale.
Consider updating a single element in a collection: one value in a map with 500 entries.
In a K-V Wide Table architecture, collection updates require a full read-modify-write cycle: read the entire bin from disk, deserialize the collection structure in memory, apply the modification, serialize the updated structure, then write the entire bin back. Under concurrent updates to the same record, this becomes a serialization bottleneck. Under write-heavy workloads, write throughput is gated by read throughput.
In a KKV Wide Table architecture with unfrozen collections, the same update is a single write of the new value for that map entry. The entry lands in the write-ahead log and the memtable, with no read, no deserialization, and no re-serialization of the rest of the collection. The merge with existing data happens later, during compaction, as a background operation.
This is where access pattern honesty matters most. The append-only unfrozen update is fast for element-level changes to bounded collections. When your access pattern is whole-value, you write the entire UDT atomically and read it back as a unit. Here, frozen is the right choice. There is no read penalty and no per-element overhead. The ScyllaDB UDT benchmark shows 228% write throughput improvement for frozen UDTs in exactly this scenario: a 50-field UDT accessed and written as a whole. The frozen cell is one write operation. The equivalent unfrozen collection is 50 element writes plus 50 metadata records.
The difference at 1,000 operations per second is negligible. But at 100,000 operations per second, with large collections and concurrent writes, the wrong frozen/unfrozen choice becomes the bottleneck in either direction.
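To see how the gap widens with throughput, here is a rough calculation for the 500-entry map used above. The 64 bytes per entry is an assumed figure for illustration, and the model ignores caching, compaction write amplification, and per-element metadata, so treat the output as an order-of-magnitude sketch rather than a prediction.

```python
def rmw_bytes_per_second(ops_per_second, entries, bytes_per_entry):
    """Read-modify-write moves the whole collection twice: once in, once out."""
    collection_bytes = entries * bytes_per_entry
    return ops_per_second * collection_bytes * 2


def element_write_bytes_per_second(ops_per_second, bytes_per_entry):
    """An element-level write moves only the changed entry."""
    return ops_per_second * bytes_per_entry


ENTRIES = 500          # from the example above
BYTES_PER_ENTRY = 64   # assumed size, not a measurement

for ops in (1_000, 100_000):
    rmw = rmw_bytes_per_second(ops, ENTRIES, BYTES_PER_ENTRY)
    element = element_write_bytes_per_second(ops, BYTES_PER_ENTRY)
    print(f"{ops:>7} ops/s: read-modify-write {rmw:,} B/s, element write {element:,} B/s")
```

Under these assumptions, read-modify-write moves 64 MB/s at 1,000 operations per second and 6.4 GB/s at 100,000, while element-level writes move 64 KB/s and 6.4 MB/s. The ratio is constant, but only the higher rate pushes the read-modify-write path toward the limits of the hardware.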
Choosing Honestly: Key-Value, K-V Wide Table, or KKV Wide Table
These three models exist because different access patterns have different requirements.
K-V is the right model for caching, session storage, and any workload where the access pattern is “put blob, get blob.” Its simplicity is a real advantage: you end up with fewer moving parts and fewer things to misconfigure.
K-V Wide Table adds meaningful capability for workloads that need to address individual fields without upfront schema design. It’s a pragmatic choice for moderate-scale applications where operational simplicity matters, bin-level read projection is sufficient, and collection updates are infrequent or small. It sits at Level 2 access depth and does that job well.
KKV Wide Table earns its complexity when your access patterns require Level 3 or 4 depth: frequent updates to large collections, datasets that will outgrow available memory, workloads where typed column compression meaningfully reduces storage cost, or write-heavy workloads that cannot afford read-modify-write on every collection update. The richer data model requires upfront schema design and demands that you get frozen versus unfrozen semantics right.
Don’t rely on your intuition; choose strategically, based on your actual access pattern:
- Use frozen when you always read or write the whole value. A 50-field profile UDT that you always write and read back as a unit is a frozen candidate. The performance data supports it.
- Use small unfrozen collections when you need element-level updates. Append to a list. Update one key in a map. This is what unfrozen exists for.
- Use large unfrozen collections only if your access pattern is genuinely element-granular and your collection cardinality stays bounded. Per-element metadata overhead compounds. It affects both compaction cost and compression ratios.
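The three rules above can be written down as a small decision helper. This is only a sketch of the reasoning: the function name is hypothetical, and the threshold of 100 elements for "small" is a number I made up for the example, so substitute whatever bound your own testing supports.

```python
def recommend_collection_kind(needs_element_updates, expected_elements, small_threshold=100):
    """Suggest frozen or unfrozen based on access pattern first, cardinality second."""
    if not needs_element_updates:
        # Whole-value reads and writes: one cell, no per-element metadata.
        return "frozen"
    if expected_elements <= small_threshold:
        # Element-level updates on a small collection: what unfrozen exists for.
        return "unfrozen"
    # Element-level updates on a large collection: workable only if the
    # cardinality stays bounded, because per-element metadata compounds.
    return "unfrozen-with-caution"


print(recommend_collection_kind(needs_element_updates=False, expected_elements=50))
print(recommend_collection_kind(needs_element_updates=True, expected_elements=20))
print(recommend_collection_kind(needs_element_updates=True, expected_elements=10_000))
```

The order of the checks mirrors the order of the argument: the access pattern is the first question, and collection size only matters once you know you need element-level updates.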
Don’t focus on which model is “best.” Think about which model best matches the access patterns your workload will experience in production.
- Start with the access patterns.
- Let the data model follow.
- Then pick the technology that supports that model at the depth you need.
Get that order right and the database works with you. Get it wrong, and you spend your time working around it.
***
If your use case requires low latencies at scale, and you’re frustrated with fighting your current database, ScyllaDB Cloud might be worth a look. Find me on LinkedIn – I’m always happy to talk data models.
