Bloom Filter

What Is a Bloom Filter?

A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. Conceived by Burton Howard Bloom in 1970, it can determine that an element is definitely not in the set, or that it may be in the set. “Definitely not in the set” is always certain, while “may be in the set” may be a false positive. Bloom filters guarantee no false negatives: if an element was added to the filter, it will always be found. False positives are possible but controllable through configuration.

What makes bloom filters valuable in database systems is their extraordinary space efficiency. A bloom filter requires fewer than 10 bits per element to achieve a 1% false positive rate — specifically ~9.6 bits/element — regardless of the size of the elements themselves. This means a filter covering one billion elements needs roughly 1.2 GB of memory, far less than any structure that stores the actual element values. In LSM-tree databases like Apache Cassandra and ScyllaDB, this space efficiency makes bloom filters a key mechanism for eliminating unnecessary disk reads at scale.

bloom-filter

Bloom Filter FAQs

How Do Bloom Filters Work?

A bloom filter data structure consists of two components: an m-bit array (initialized entirely to 0) and k independent hash functions. The mechanics of insertion and lookup are both O(k) — linear in the number of hash functions, constant regardless of how many elements are stored.

Insert operation. To add an element to a bloom filter, feed it through each of the k hash functions. Each function produces a position (index) in the m-bit array. Set the bit at each of those k positions to 1. Elements cannot be removed from a standard bloom filter — clearing a bit risks invalidating other elements that hashed to the same position.

Lookup operation. To test whether an element exists, feed it through the same k hash functions and check the bit at each resulting position. If any bit is 0, the element is definitely not in the set — this result is certain and requires no further lookup. If all bits are 1, the element is probably in the set — but this could be a false positive caused by other elements that collectively set those bit positions.

False positive rate. The false positive probability ε for n inserted elements, m bits, and k hash functions is approximately: ε ≈ (1 − e^(−kn/m))^k. The optimal number of hash functions that minimizes ε for a given m and n is k = (m/n) × ln 2. The memory required to achieve a target false positive rate follows m/n ≈ −1.44 × log₂(ε). In practice: at ε = 1%, a filter needs ~9.6 bits per element; at ε = 10%, ~4.8 bits per element. Halving the false positive rate costs roughly 1.4 additional bits per element — a modest price for meaningful accuracy improvements.

What Are Bloom Filters Used For?

Bloom filter applications span distributed systems, databases, networking, and big data pipelines. The common thread across all use cases is the same: use a tiny, fast filter to rule out elements that definitely aren’t present, avoiding the cost of a full lookup for the majority of queries.

SSTable read path optimization. In LSM-tree databases, each SSTable has a corresponding bloom filter built at write time. When a read request arrives, the database consults each SSTable’s bloom filter before performing any disk I/O. A “definitely not present” result means that SSTable is skipped entirely — no disk read. A “probably present” result triggers the actual lookup. Bloom filters are consulted for point lookups by partition key; they are not used during range scans, which must consult SSTables regardless. In clusters with many SSTables, bloom filters are a key mechanism for controlling read amplification.

Bloom filter deduplication. Bloom filters are widely used in streaming and data pipeline systems for deduplication: before processing an event or record, check the filter to see if it has been seen before. If the filter returns “definitely not seen,” process and add it. If it returns “probably seen,” perform a more expensive exact check. This two-stage approach dramatically reduces the load on the exact-match store for high-volume streams.

Bloom filtering in big data. In distributed query engines (Spark, Hive, Presto), bloom filters are embedded in query execution plans to perform “bloom filter joins”: before shuffling rows across the network for a join operation, apply a bloom filter built from the smaller table to filter out rows from the larger table that cannot possibly match. This reduces shuffle volume and speeds up large joins significantly. The technique is especially effective for star-schema queries against fact tables with billions of rows.

What Is a Bloom Filter in Cassandra?

Apache Cassandra is built on a Log-Structured Merge (LSM) tree architecture. Writes go first to an in-memory memtable, which periodically flushes to disk as an immutable SSTable. Over time, many SSTables accumulate per table. Without bloom filters, a read for any partition key would require checking every SSTable during the read path — an I/O operation that grows linearly with the number of SSTables. Bloom filter cassandra usage solves this: each SSTable has its own bloom filter, and the read path consults every filter before touching disk.

How bloom filter works in Cassandra

When a read request arrives for a partition key, Cassandra iterates through its SSTables and consults each one’s bloom filter. For each SSTable where the filter returns “definitely not present,” that SSTable is skipped with no disk I/O. For each SSTable where the filter returns “probably present,” Cassandra performs the actual SSTable lookup — reading the partition index and, if needed, the data file. A false positive means an unnecessary disk read. In environments with many SSTables and elevated false positive rates, this read amplification directly degrades p99 read latency.

Bloom filter memory in Cassandra

In Cassandra’s JVM-based architecture, bloom filters are stored off-heap in native memory — outside the JVM heap managed by the garbage collector. This is intentional: keeping bloom filters off-heap avoids adding their memory footprint to GC pressure. When capacity planning Cassandra nodes, total RAM must account for JVM heap + off-heap (bloom filters, key cache, row cache) + OS page cache. A table with one billion partitions at the default 1% false positive rate requires approximately 1.2 GB of off-heap bloom filter memory.

Tuning bloom_filter_fp_chance in Cassandra

The per-table CQL property bloom_filter_fp_chance (a float between 0 and 1) controls the target false positive rate. Default values differ by compaction strategy: 0.01 (1%) for Size-Tiered Compaction Strategy, and 0.1 (10%) for Leveled Compaction Strategy and Time Window Compaction Strategy. The higher default for LCS and TWCS reflects the fact that these strategies produce a bounded, small number of SSTables, so occasional false positives have limited impact — the filter can be coarser and use less memory without meaningful performance cost.

Lowering bloom_filter_fp_chance reduces false positive rates and unnecessary disk reads at the cost of more off-heap memory. Raising it saves memory at the cost of more false positives and higher read amplification. Going from 0.1 to 0.01 requires approximately 2x more memory (~4.8 bits/element vs. ~9.6 bits/element) but delivers a 10x improvement in false positive accuracy. After altering bloom_filter_fp_chance, existing SSTables must be rebuilt to apply the change — run nodetool scrub or nodetool upgradesstables -a. No Cassandra restart is required.

ScyllaDB Bloom Filter Implementation

ScyllaDB supports bloom filters with the same bloom_filter_fp_chance parameter and the same defaults as Cassandra — existing Cassandra configurations migrate without changes. The architectural differences in how ScyllaDB implements bloom filters reflect its shard-per-core design and C++ foundation, and address limitations that affect Cassandra’s JVM-based approach.

Shard-per-core ownership

In ScyllaDB, each CPU core (shard) manages its own subset of SSTables and their corresponding bloom filters. Bloom filter lookups are CPU-local: a shard checks only the filters for its own SSTables without cross-core coordination or locking. This eliminates contention on bloom filter reads under high concurrency — a structural advantage over Cassandra’s node-level SSTable management, where multiple threads may compete to read the same filter structures.

Automatic memory reclamation

ScyllaDB 5.4.6 introduced automatic bloom filter memory reclamation: when bloom filters across a shard exceed a configurable ratio of available shard memory (default threshold: approximately 10% of shard memory), ScyllaDB automatically evicts filter data to free RAM. Evicted filters behave as if they always return “probably present” — every SSTable read requires a disk lookup, equivalent to having no filter for those SSTables. Teams hitting this threshold can raise the memory ratio, accept higher false positive rates by increasing bloom_filter_fp_chance, or add memory to their nodes. This feature resolved critical memory pressure problems in deployments with very high partition counts.

No JVM off-heap management

In Cassandra, bloom filters occupy off-heap native memory that must be budgeted separately from the JVM heap. Operators must size both dimensions correctly and monitor both independently. In ScyllaDB, there is no JVM heap — ScyllaDB manages all memory, including bloom filters, through the Seastar memory allocator. There is no off-heap vs. on-heap split to reason about. Bloom filter memory is part of ScyllaDB’s unified memory pool, and the automatic reclamation mechanism described above handles pressure automatically rather than requiring manual headroom planning.

Additional resources on Bloom Filters in ScyllaDB:

  • Understanding Bloom Filters in ScyllaDB: A Simple Guide: ScyllaDB user Shubham Sharma (engineer at Verve) wrote a beginner-friendly guide that explains how ScyllaDB uses Bloom filters as a space-saving, probabilistic mechanism to optimize read performance by instantly identifying if a partition key is definitely not present in an SSTable. It provides a step-by-step breakdown of how a bit array and multiple hash functions map data, illustrating how this prevents unnecessary, costly disk I/O. Finally, the article explores practical optimization tips, focusing on balancing RAM consumption and false-positive rates using the database’s `bloom_filter_fp_chance` configuration setting.
  • How JioCinema Uses ScyllaDB Bloom Filters for Personalization: This article explains how the JioCinema (now JioHotstar) streaming platform manages user state tracking at massive scale by leveraging ScyllaDB’s native architecture instead of deploying a separate external solution like Redis Bloom. It explains how they efficiently handle high-concurrency watch-history checks by relying on ScyllaDB’s internal Bloom filters to quickly skip irrelevant SSTables during the read path. By optimizing partition keys and schema parameters like the false-positive chance, they utilize these filters to drastically cut down on unnecessary disk I/O and maintain ultra-low P99 latencies during peak live-streaming traffic.

Trending NoSQL Resources