Shaving 40% Off Google’s B-Tree Implementation with Go Generics

By Michał Matczuk

April 27, 2022

There are many reasons to be excited about generics in Go. In this blog post I’m going to show how, using generics, we got a 40% performance gain in an already well-optimized package, the Google B-Tree implementation.

A B-Tree is a kind of self-balancing tree. For the purpose of this blog post it’s sufficient to say that it is a collection: you can add, remove, get or iterate over its elements. The Google B-Tree is well optimized, and measures are taken to keep memory consumption in check. There is a benchmark for every exported method, and the benchmark results show zero allocations in the B-Tree code for all operations but cloning. It would probably be hard to optimize it further using traditional techniques.

ScyllaDB and the University of Warsaw

We have had a longstanding relationship with the Computer Science department at the University of Warsaw. You may remember some of our original projects, including those integrating Parquet, an async userspace filesystem, or a Kafka client for Seastar, or more recent ones like a system for linear algebra in ScyllaDB or a design for a new Rust driver.

This work was also part of our ongoing partnership with the University of Warsaw.

Making Faster B-Trees with Generics

While working on a new Scylla Go Driver with students of the University of Warsaw, we ported the B-tree code to generics. (If you’re not familiar with generics in Go, check out this tutorial.)

The initial result: the generics code is faster by 20 to 30 percent according to the Google benchmarks (see the issue we opened). The full benchmark comparison was done with benchstat.

This is great, but within those numbers there is a troubling detail: zero allocations is not something you would normally see, given that the functions accept an interface as a parameter.

For the rest of the blog post we’ll focus on benchmarking the ReplaceOrInsert function responsible for ingesting data. Let’s consider a simplified benchmark.

The results show an even greater improvement (31% vs. 27%), and allocations drop from 1 in the interface-based implementation to 0 with generics.
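The allocation difference can be reproduced without the B-tree at all. The following is a minimal sketch, not the benchmark used for the numbers above: `putBoxed`, `putTyped` and the two package-level slices are hypothetical stand-ins for an interface-based and a generic insert path, and `testing.AllocsPerRun` is used in place of a full `go test -bench` run.

```go
package main

import (
	"fmt"
	"testing"
)

// Item mirrors the shape of an interface-based element type.
type Item interface {
	Less(than Item) bool
}

// Int is the element type used by both variants in this sketch.
type Int int

func (a Int) Less(b Item) bool { return a < b.(Int) }

// Package-level storage, so stored values must outlive each call.
var (
	boxed = make([]Item, 64)
	typed = make([]Int, 64)
)

// putBoxed accepts an interface, like a pre-generics API would.
func putBoxed(i int, it Item) { boxed[i%len(boxed)] = it }

// putTyped is the generic counterpart: it stores T directly in the slice.
func putTyped[T any](dst []T, i int, v T) { dst[i%len(dst)] = v }

// allocsBoxed reports the average allocations per call on the interface path.
func allocsBoxed() float64 {
	n := 1000 // start above the small values the runtime can box without allocating
	return testing.AllocsPerRun(1000, func() {
		n++
		putBoxed(n, Int(n)) // Int -> Item conversion happens here
	})
}

// allocsTyped reports the average allocations per call on the generic path.
func allocsTyped() float64 {
	n := 1000
	return testing.AllocsPerRun(1000, func() {
		n++
		putTyped(typed, n, Int(n)) // no conversion, the value is copied
	})
}

func main() {
	fmt.Println("interface allocs/op:", allocsBoxed())
	fmt.Println("generic allocs/op:  ", allocsTyped())
}
```

The only difference between the two paths is the conversion of `Int` to `Item` at the call site, which is where the extra allocation is expected to come from.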

Let’s try to understand what happens here.

The Additional Allocation

The Google benchmarks operate on a B-tree of integers hidden by an Item interface. They use pre-generated random data in a slice. When an Item is passed to the ReplaceOrInsert function, the underlying integer is already on the heap, so technically we are only copying a pointer. This is not the case when a plain integer needs to be converted to an Item interface: the parameter values start “escaping to heap”.

Go decides whether a variable you initialize should live on the function’s stack or on the heap. Traditionally the compiler was very “conservative”: when it saw a function like func bind(v interface{}), anything you wanted to pass as v had to go to the heap first. This is referred to as a variable escaping to the heap. Over the years the compiler has gotten smarter, and calls to local functions or functions in other packages in your project can be optimized, preventing the variables from escaping. You can check for yourself by running go build -gcflags="-m" . in a Go package.

In the example below, Go can figure out that it’s safe to take a pointer to the main function’s stack.

As you can see the compiler informs us that variables do not escape to heap.
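The original listing is not reproduced here. As a rough sketch of the kind of code that behaves this way, assume a hypothetical `ToString` helper that only inspects its argument and never stores it or passes it on:

```go
package main

import (
	"os"
	"strconv"
)

// ToString is a hypothetical helper for this sketch. It only inspects v
// through a type switch, so the compiler can see that v is not retained.
func ToString(v any) string {
	switch x := v.(type) {
	case int:
		return strconv.Itoa(x)
	case string:
		return x
	default:
		return ""
	}
}

func main() {
	n := 12345
	// Writing through os.Stdout keeps fmt out of the picture, since
	// fmt's own interface parameters would add escapes of their own.
	os.Stdout.WriteString(ToString(n) + "\n")
	os.Stdout.WriteString(ToString("hello") + "\n")
}
```

Building this with `go build -gcflags="-m" .` should report that `v` and the arguments passed from `main` do not escape, though the exact wording of the diagnostics depends on the Go version.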

By changing the ToString implementation to

we see that the variables and literal values do start escaping.
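The changed listing is also missing from this text. One way to provoke the escape, offered here only as an assumption about what such a change could look like, is to have the hypothetical `ToString` hand its argument to `fmt.Sprint`:

```go
package main

import (
	"fmt"
	"os"
)

// ToString is a hypothetical variant for this sketch. It forwards v to
// fmt.Sprint, which takes ...any and inspects the value via reflection,
// so the compiler can no longer prove that v stays on the caller's stack.
func ToString(v any) string {
	return fmt.Sprint(v)
}

func main() {
	n := 12345
	os.Stdout.WriteString(ToString(n) + "\n")
	os.Stdout.WriteString(ToString("hello") + "\n")
}
```

With this version, `go build -gcflags="-m" .` is expected to report the parameter as leaking and the arguments as escaping to the heap; again, the exact messages vary between compiler releases.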

In practical examples, when calling a function that accepts an interface as a parameter, the value almost always escapes to the heap. When this happens, the allocation not only slows down the function call but also increases GC pressure. Why is this important? The generics approach enables a truly zero-allocation API with no added GC pressure, as we will learn in the remainder of this blog post.

Why is it faster?

The B-tree, being a tree, consists of nodes. Each node holds a list of items.

When the Item is a pre-generics plain old interface, the value it holds must live separately somewhere on the heap, because the compiler cannot tell the size of an Item. From the runtime’s perspective an interface value is an unsafe pointer to data (word), a pointer to its type definition (typ) and a pointer to the interface definition (ityp); see the definitions in the reflect package, which are easier to digest than those in the runtime package. In that case items is effectively a slice of int pointers.
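The size difference is easy to check. The sketch below assumes an `Item` interface and an `Int` type shaped like the ones discussed here, and uses `unsafe.Sizeof` to compare one slot of a slice of `Item` with one slot of a slice of `Int`:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Item and Int are shaped like the interface-based element types in the text.
type Item interface {
	Less(than Item) bool
}

type Int int

func (a Int) Less(b Item) bool { return a < b.(Int) }

// sizes returns the size of an interface slot, of a plain Int slot,
// and of a machine word, all in bytes.
func sizes() (itemSize, intSize, wordSize uintptr) {
	var it Item
	var n Int
	return unsafe.Sizeof(it), unsafe.Sizeof(n), unsafe.Sizeof(uintptr(0))
}

func main() {
	itemSize, intSize, wordSize := sizes()
	fmt.Printf("Item slot: %d bytes (%d words)\n", itemSize, itemSize/wordSize)
	fmt.Printf("Int slot:  %d bytes (%d words)\n", intSize, intSize/wordSize)
}
```

Each `Item` slot holds two words, a type pointer and a data pointer, and the integer itself lives elsewhere; an `Int` slot holds the value in place.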

On the other hand, with generic interface

and a generic type definition

items are a slice of ints — this reduces the number of small heap objects by a factor of 32.
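The generic definitions referred to above are not reproduced in this text. A minimal sketch of how such definitions could look, assuming a self-referential `Item[T]` constraint and an `items[T]` slice type (the `replaceOrInsert` method is a simplified stand-in for a single node, not the B-tree algorithm):

```go
package main

import (
	"fmt"
	"sort"
)

// Item is a generic ordering constraint: T must be comparable with itself.
type Item[T any] interface {
	Less(than T) bool
}

// Int satisfies Item[Int] without any interface conversion at call sites.
type Int int

func (a Int) Less(b Int) bool { return a < b }

// items stores values of T inline, so items[Int] is laid out like []int.
type items[T Item[T]] []T

// find returns the index where item belongs and whether it is already present.
func (s items[T]) find(item T) (int, bool) {
	i := sort.Search(len(s), func(i int) bool { return item.Less(s[i]) })
	if i > 0 && !s[i-1].Less(item) {
		return i - 1, true
	}
	return i, false
}

// replaceOrInsert keeps s sorted, overwriting an equal element if one exists.
func (s *items[T]) replaceOrInsert(item T) {
	i, found := s.find(item)
	if found {
		(*s)[i] = item
		return
	}
	var zero T
	*s = append(*s, zero)
	copy((*s)[i+1:], (*s)[i:]) // shift the tail right by one slot
	(*s)[i] = item
}

func main() {
	var s items[Int]
	for _, v := range []Int{5, 1, 3, 3, 2} {
		s.replaceOrInsert(v)
	}
	fmt.Println(s)
}
```

Because `T` is known at instantiation time, the slice elements are the integers themselves rather than pointers to separately allocated values.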

Enough of theory, let’s try to examine a concrete usage. For the purpose of this blog I wrote a test program that is a scaled up version of my benchmark code.

We are adding 100 million integers, and the degree of the B-tree (number of items in a node) is 1k. There are two versions of this program: one uses generics, the other a plain old interface. The difference in code is minimal: it’s literally changing btree.New(degree) to btree.New[btree.Int](degree) in line 13. Let’s compare data gathered by running both versions under `/usr/bin/time -l -p`.

	generics	interface	delta
real	11.49	16.59	-30.74%
user	11.27	18.61	-39.44%
sys	0.24	0.6	-60.00%
maximum resident set size	2334212096	6306217984	-62.99%
average shared memory size	0	0
average unshared data size	0	0
average unshared stack size	0	0
page reclaims	142624	385306	-62.98%
page faults	0	0
swaps	0	0
block input operations	0	0
block output operations	0	0
messages sent	0	0
messages received	0	0
signals received	600	843	-28.83%
voluntary context switches	25	48	-47.92%
involuntary context switches	1652	2943	-43.87%
instructions retired	204760684966	288827272312	-29.11%
cycles elapsed	37046278867	60503503105	-38.77%
peak memory footprint	2334151872	6308147904	-63.00%
HeapObjects	236884	50255826	-99.53%
HeapAlloc	2226292560	6043893088	-63.16%

Here using generics solves a version of the N+1 problem for slices of interfaces. Instead of one slice and N integers on the heap, we now have just the slice of ints. The results are profound: the new code behaves better in every respect. Wall time is down by about 31%, user CPU time by about 39%, context switches by more than 40%, and system time and peak memory by 60% or more, all thanks to a 99.53% reduction in the number of heap objects.
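The N+1 effect can be observed at a much smaller scale without the B-tree. The sketch below is not the test program behind the table; it assumes a hypothetical `measure` helper that fills a slice of `Int` and a slice of `Item` with the same values and compares `runtime.MemStats.HeapObjects` before and after each fill.

```go
package main

import (
	"fmt"
	"runtime"
)

type Item interface {
	Less(than Item) bool
}

type Int int

func (a Int) Less(b Item) bool { return a < b.(Int) }

// liveObjects forces a collection and reports the number of live heap objects.
func liveObjects() int64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return int64(m.HeapObjects)
}

// measure returns how many heap objects are added by holding n values
// in a []Item (boxed) and in a []Int (flat).
func measure(n int) (boxed, flat int64) {
	base := liveObjects()
	ints := make([]Int, n)
	for i := range ints {
		ints[i] = Int(i + 1000)
	}
	flat = liveObjects() - base
	runtime.KeepAlive(ints)

	base = liveObjects()
	items := make([]Item, n)
	for i := range items {
		items[i] = Int(i + 1000) // each conversion places the value on the heap
	}
	boxed = liveObjects() - base
	runtime.KeepAlive(items)
	return boxed, flat
}

func main() {
	boxed, flat := measure(1 << 20)
	fmt.Println("heap objects added by []Item:", boxed)
	fmt.Println("heap objects added by []Int: ", flat)
}
```

The runtime may pack several small values into one allocation block, so the count reported for the slice of `Item` need not equal N exactly; it should still be far larger than the count for the slice of `Int`.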

I’d like to conclude by taking a look at top CPU utilization.

go tool pprof -top cpu.pprof

generics

Type: cpu
Time: Apr 5, 2022 at 10:23am (CEST)
Duration: 11.61s, Total samples = 11.05s (95.18%)
Showing nodes accounting for 10.77s, 97.47% of 11.05s total
Dropped 52 nodes (cum <= 0.06s)
      flat  flat%   sum%        cum   cum%
     4.96s 44.89% 44.89%      4.96s 44.89%  runtime.madvise
     4.61s 41.72% 86.61%      4.61s 41.72%  runtime.memclrNoHeapPointers
     0.64s  5.79% 92.40%      0.64s  5.79%  github.com/google/btree.items[...].find.func1
     0.19s  1.72% 94.12%      0.83s  7.51%  sort.Search
     0.08s  0.72% 94.84%      5.82s 52.67%  github.com/google/btree..insert
     0.08s  0.72% 95.57%      0.08s  0.72%  runtime.mmap
     0.07s  0.63% 96.20%      0.90s  8.14%  github.com/google/btree.items[...].find
     0.05s  0.45% 96.65%      5.88s 53.21%  github.com/google/btree..ReplaceOrInsert
     0.05s  0.45% 97.10%      4.19s 37.92%  github.com/google/btree..insertAt (inline)
     0.04s  0.36% 97.47%      0.61s  5.52%  github.com/google/btree..maybeSplitChild
         0     0% 97.47%      0.57s  5.16%  github.com/google/btree..split

interface

Type: cpu
Time: Apr 5, 2022 at 10:31am (CEST)
Duration: 16.69s, Total samples = 18.65s (111.74%)
Showing nodes accounting for 17.94s, 96.19% of 18.65s total
Dropped 75 nodes (cum <= 0.09s)
      flat  flat%   sum%        cum   cum%
     9.53s 51.10% 51.10%      9.53s 51.10%  runtime.madvise
     2.62s 14.05% 65.15%      2.62s 14.05%  runtime.memclrNoHeapPointers
     1.09s  5.84% 70.99%      1.31s  7.02%  github.com/google/btree.items.find.func1
     0.93s  4.99% 75.98%      2.73s 14.64%  runtime.scanobject
     0.67s  3.59% 79.57%      0.67s  3.59%  runtime.heapBits.bits (inline)
     0.44s  2.36% 81.93%      1.75s  9.38%  sort.Search
     0.30s  1.61% 83.54%      0.30s  1.61%  runtime.markBits.isMarked (inline)
     0.27s  1.45% 84.99%      2.03s 10.88%  github.com/google/btree.items.find
     0.27s  1.45% 86.43%      3.35s 17.96%  runtime.mallocgc
     0.26s  1.39% 87.83%      0.26s  1.39%  runtime.(*mspan).refillAllocCache
     0.25s  1.34% 89.17%      0.60s  3.22%  runtime.greyobject
     0.24s  1.29% 90.46%      0.26s  1.39%  runtime.heapBits.next (inline)
     0.23s  1.23% 91.69%      0.23s  1.23%  github.com/google/btree.Int.Less
     0.20s  1.07% 92.76%      0.20s  1.07%  runtime.memmove
     0.20s  1.07% 93.83%      0.20s  1.07%  runtime.mmap
     0.15s   0.8% 94.64%      2.47s 13.24%  github.com/google/btree.(*items).insertAt (inline)
     0.12s  0.64% 95.28%      0.27s  1.45%  runtime.findObject
     0.08s  0.43% 95.71%      5.44s 29.17%  github.com/google/btree.(*node).insert
     0.03s  0.16% 95.87%      5.48s 29.38%  github.com/google/btree.(*BTree).ReplaceOrInsert
     0.02s  0.11% 95.98%      0.84s  4.50%  github.com/google/btree.(*node).maybeSplitChild
     0.02s  0.11% 96.09%      0.45s  2.41%  runtime.convT64
     0.01s 0.054% 96.14%      9.83s 52.71%  runtime.(*mheap).allocSpan
     0.01s 0.054% 96.19%      2.82s 15.12%  runtime.gcDrain
         0     0% 96.19%      0.78s  4.18%  github.com/google/btree.(*node).split

You can literally see how much messier the interface profile is, and how GC kicks in and drags it down. It’s even more evident when we focus on GC.

go tool pprof -focus gc -top cpu.pprof

generics

Type: cpu
Time: Apr 5, 2022 at 10:23am (CEST)
Duration: 11.61s, Total samples = 11.05s (95.18%)
Active filters:
   focus=gc
Showing nodes accounting for 0.29s, 2.62% of 11.05s total
      flat  flat%   sum%        cum   cum%
     0.19s  1.72%  1.72%      0.19s  1.72%  runtime.memclrNoHeapPointers
     0.02s  0.18%  1.90%      0.02s  0.18%  runtime.(*mspan).refillAllocCache
     0.01s  0.09%  1.99%      0.02s  0.18%  runtime.(*fixalloc).alloc
     0.01s  0.09%  2.08%      0.01s  0.09%  runtime.(*mheap).allocNeedsZero
     0.01s  0.09%  2.17%      0.01s  0.09%  runtime.(*mspan).init (inline)
     0.01s  0.09%  2.26%      0.01s  0.09%  runtime.heapBits.bits (inline)
     0.01s  0.09%  2.35%      0.01s  0.09%  runtime.markrootSpans
     0.01s  0.09%  2.44%      0.01s  0.09%  runtime.recordspan
     0.01s  0.09%  2.53%      0.02s  0.18%  runtime.scanobject
     0.01s  0.09%  2.62%      0.01s  0.09%  runtime.stkbucket

interface

Type: cpu
Time: Apr 5, 2022 at 10:31am (CEST)
Duration: 16.69s, Total samples = 18.65s (111.74%)
Active filters:
   focus=gc
Showing nodes accounting for 6.06s, 32.49% of 18.65s total
Dropped 27 nodes (cum <= 0.09s)
      flat  flat%   sum%        cum   cum%
     2.62s 14.05% 14.05%      2.62s 14.05%  runtime.memclrNoHeapPointers
     0.93s  4.99% 19.03%      2.73s 14.64%  runtime.scanobject
     0.67s  3.59% 22.63%      0.67s  3.59%  runtime.heapBits.bits (inline)
     0.30s  1.61% 24.24%      0.30s  1.61%  runtime.markBits.isMarked (inline)
     0.27s  1.45% 25.68%      3.35s 17.96%  runtime.mallocgc
     0.26s  1.39% 27.08%      0.26s  1.39%  runtime.(*mspan).refillAllocCache
     0.25s  1.34% 28.42%      0.60s  3.22%  runtime.greyobject
     0.24s  1.29% 29.71%      0.26s  1.39%  runtime.heapBits.next (inline)
     0.12s  0.64% 30.35%      0.27s  1.45%  runtime.findObject
     0.08s  0.43% 30.78%      0.08s  0.43%  runtime.spanOf (inline)
     0.06s  0.32% 31.10%      0.06s  0.32%  runtime.(*mspan).base (inline)
     0.06s  0.32% 31.42%      0.06s  0.32%  runtime.(*mspan).init (inline)
     0.06s  0.32% 31.74%      0.06s  0.32%  runtime.heapBitsSetType
     0.04s  0.21% 31.96%      0.04s  0.21%  runtime.(*mSpanStateBox).get (inline)
     0.04s  0.21% 32.17%      0.04s  0.21%  runtime.pthread_kill
     0.04s  0.21% 32.39%      0.04s  0.21%  runtime.usleep
     0.01s 0.054% 32.44%      0.10s  0.54%  runtime.(*mheap).allocSpan
     0.01s 0.054% 32.49%      2.82s 15.12%  runtime.gcDrain

The generic version spent 0.29s (2.62%) in GC while the interface version spent 6.06s accounting for, hold your breath, 32.49% of the total time!

Generics: CPU profile flame graph focused on GC-related functions

Interface: CPU profile flame graph focused on GC-related functions

Conclusion

By shifting the implementation from one using interfaces to one using generics, we were able to significantly improve performance, minimize garbage collection time, and reduce CPU and other resource utilization, such as heap size. In particular, we reduced the number of heap objects (HeapObjects) by 99.53% and the allocated heap (HeapAlloc) by 63.16%.

The future of Go generics is bright, especially in the domain of slices.

Want to be a ScyllaDB Monster?

We’re definitely proud of the work we do with the students at the University of Warsaw. Yet ScyllaDB is a growing company with a talented workforce drawn from all over the world. If you enjoy writing high performance generic Go code, come join us. Or if you specialize in other languages or talents, check out our full list of careers at ScyllaDB:

CAREERS AT SCYLLADB