Binary Trie Group-Depth Benchmark — Narrower Than Expected

§1

Executive Summary

The binary trie is on the Ethereum protocol strawman as a future state tree replacement. No binary trie implementation has been benchmarked at scale — not group depth, not anything else. With this transition on the roadmap, assessing performance characteristics is a prerequisite for informed prototyping. The geth implementation (EIP-7864) exposes a --bintrie.groupdepth parameter that controls how binary levels are packed into on-disk nodes; this study benchmarks eight configurations to determine the optimal setting.

Bottom line: The sweet spot is GD-5 or GD-6, depending on workload. GD-5 is the write champion (6.94 Mgas/s, +7% over GD-4). GD-6 leads reads (6.39 Mgas/s) and mixed workloads (6.27 Mgas/s). GD-7 confirms the inflection: performance degrades past GD-6 on all benchmarks. Since Ethereum is read-heavy, GD-6 may be the preferred default.

What we tested

Eight group-depth configurations (GD-1 through GD-8) on identical 360 GB databases with ~400 million state entries. Five benchmark types — two synthetic (raw SLOAD/SSTORE) and three ERC20 contract workloads — each run 9 times under a cold-cache protocol (verified cache drops, 120+ successful, 0 failures). All results use medians with Mann-Whitney U significance tests. Mgas/s is the primary comparison metric throughout, as it normalizes for block composition differences across configs.

What we found

Reads confirm the intuition (§4): Wider trees read faster. GD-8 achieves more than double the read throughput of GD-1 (5.59 vs 2.65 Mgas/s). GD-3 through GD-8 range from 5.2 to 6.4 Mgas/s by throughput, with GD-6 leading at 6.39 Mgas/s.
Writes reveal a sharper optimum (§5): GD-5 is the write champion at 629 ms (6.94 Mgas/s) — 7% faster than GD-4 (678 ms) and 55% faster than GD-8 (982 ms) by throughput. GD-7 (5.81 Mgas/s) confirms the inflection past GD-6.
Node size hits the Pebble block boundary at GD-7 (§5): Each GD-7 node serializes to ~4 KB (128 × 32 bytes) — exactly the Pebble block size. Below this boundary (GD-6: ~2 KB), each node fits within one block. Above it, reads may require two blocks per node. Gary Rong’s NVMe benchmarks show random 8 KB reads cost 54% more latency than 4 KB at QD=1 (77.8 vs 50.6 μs). This per-node I/O penalty compounds across ~37 path nodes, explaining why GD-7 reads slower than GD-6 despite a shorter path.
The sweet spot is GD-5 or GD-6 (§6): GD-5 wins writes by 7% over GD-4 and GD-6 leads reads (6.39 Mgas/s) and mixed (6.27 Mgas/s). GD-7 data validates the inflection — all three benchmarks degrade past GD-6. The optimal balance lies at 5–6 bits per node.

How to read this post

§2 Background covers the binary trie and group depth concept. §3 Methodology details the benchmark setup (collapsible). §4–§6 present the results in a narrative arc: reads, writes, then the trade-off. §7 Patterns examines cross-cutting observations, and §8 Conclusions gives the recommendation and open questions. Short on time? Start with §4’s “ERC20 Reads: Where Depth Matters” — it’s the section where group depth differences become most visible and sets up the rest of the analysis.

§2

Background

What is the Binary Trie?

EIP-7864 proposes replacing Ethereum's Merkle Patricia Trie (MPT) with a binary trie. The binary trie unifies the account trie and all storage tries into a single tree, uses SHA-256 for hashing instead of Keccak-256, and stores 32-byte stems that map to groups of 256 values. This design simplifies witness generation for stateless clients and enables more efficient proofs.

The transition from MPT to a binary trie is one of the most consequential changes to Ethereum's state layer. Performance characteristics of the new structure will directly affect block processing time, sync speed, and validator economics.

What is Group Depth?

The trie is always binary at the fundamental level — every internal node has exactly two children (left for bit 0, right for bit 1). Group depth controls how many binary levels are bundled into a single on-disk node. At GD-N, each stored node encapsulates an N-level binary subtree, so it appears to have 2^N children when viewed from the outside:

GD-1: 1 binary level per node → 2 child pointers, 256 nodes on the path to a leaf
GD-2: 2 binary levels per node → 4 child pointers, 128 nodes on path
GD-3: 3 binary levels per node → 8 child pointers, ~86 nodes on path
GD-4: 4 binary levels per node → 16 child pointers, 64 nodes on path
GD-5: 5 binary levels per node → 32 child pointers, ~52 nodes on path
GD-6: 6 binary levels per node → 64 child pointers, ~43 nodes on path
GD-7: 7 binary levels per node → 128 child pointers, ~37 nodes on path
GD-8: 8 binary levels per node → 256 child pointers, 32 nodes on path

Think of it like a zip code: GD-1 reads your address one digit at a time (256 steps), while GD-8 reads 8 digits at once (32 steps). Fewer steps means fewer disk reads — but each "bundled node" is larger and more expensive to update, because the binary subtree inside it must be rehashed.

Interactive — Group Depth Explorer

Group depth 4

Path

64

nodes

Children

16

per node

Rehash

15

ops / node

Node size

512

bytes

Each dashed box = one on-disk node bundling 4 binary levels

The trade-off is straightforward in theory: reads benefit from shallow trees (fewer disk I/O operations to reach a leaf), while writes suffer from wide nodes (more internal hashing when a node is modified). The question is where the crossover point lies.

§3

Methodology

    Methodology Details (click to collapse)
  

Benchmark Setup

Parameter	Value
Machine	QEMU VM — 8 vCPUs, 30 GB RAM, 3.9 TB SSD, Ubuntu 24.04 LTS
Database	~360 GB, ~400M accounts + slots
Configs	GD-1, GD-2, GD-3, GD-4, GD-5, GD-6, GD-7, GD-8 (Pebble, the LSM-tree storage engine used by geth) 4KB block size
Protocol	Cold cache (OS + Pebble caches dropped between runs)
Runs	10 per benchmark, run 1 excluded as warm-up
Gas target	100M gas per block

Statistical Approach

Per-run block medians aggregated across 9 retained runs
Mann-Whitney U test for pairwise comparisons (non-parametric)
Effect sizes reported as percentage difference from baseline (GD-1)
Coefficient of variation (CV%) for consistency assessment

Benchmark Taxonomy

a) Synthetic benchmarks — sstore_variants (writes) and sload_benchmark (reads). These use EIP-7702 delegations with sequential storage slots. Keys are numerically sequential, causing heavy prefix sharing in the trie.

b) ERC20 benchmarks — balanceof (reads), approve (writes), and mixed. These use real ERC-20 contract code. Storage keys are keccak hashes of random addresses, producing uniformly distributed access patterns across the trie.

Figure 2 — Sequential keys share trie prefixes and benefit from caching. Keccak-hashed keys scatter uniformly, forcing cold reads at every level.

Block Composition Note

The execution-specs harness sends all benchmark transactions to geth’s mempool at once (within <1 second). Geth’s dev mode miner (dev.period=10) then processes transactions sequentially from this pool during a 10-second block-building window — when the timer expires, it mines whatever it managed to process. The bottleneck is trie operation time, not gas capacity: each ERC20 approve tx uses only ~4.4M gas (the 100M block gas limit could fit ~22), but trie operations (traverse, update, hash, commit) consume nearly the entire window for just 1 tx. With verified cold-cache drops in Phase 3, all configs process 1 tx per block (median tx_count=1). Setup transactions (simple ETH transfers) process 7 txs in 77ms, confirming that trie cost — not transaction overhead — is the limiting factor.

Mgas/s (throughput) is the correct comparison metric as it normalizes for any gas differences across benchmarks. Where raw ms is shown, it reflects the actual block processing time. Note that GD-7 mixed workloads use 8.84M gas/block vs 11.80M for all other configs, so Mgas/s comparisons remain valid but raw ms is not directly comparable for that benchmark.

§4

Act I: Reads Confirm the Intuition

Wider trees should mean faster reads. And they do.

Synthetic Reads: The Baseline

Per-block read latency for a representative run per config, confirming stable measurements with no degradation over time.

Group Depth	Median Read (ms)	vs GD-1
GD-1	53	baseline
GD-2	48	−9%
GD-4	48	−10%
GD-8	no data

Note: Sequential reads and writes show near-identical performance across all configs. Since group depth primarily affects random-access patterns (keccak-hashed storage slots), only ERC20 benchmarks were collected for all 8 configurations.

Only ~10% improvement from GD-1 to GD-4. Sequential access doesn't differentiate group depths because shared prefixes keep the working set small and cache-friendly regardless of tree shape.

ERC20 Reads: Where Depth Matters

Q4: ERC20 read latency boxplot — the hero chart

GD	state_read (ms)	total (ms)	Mgas/s	vs GD-1 (Mgas/s)
1	5,878	6,284	2.65	baseline
2	3,840	4,230	3.95	+49%
3	2,866	3,213	5.20	+96%
4	2,677	3,067	5.46	+106%
5	2,370	2,733	6.11	+131%
6	2,248	2,623	6.39	+141%
7	2,339	2,693	6.04	+128%
8	2,598	2,977	5.59	+111%

~2.4× read throughput from GD-1 to GD-6. Throughput (Mgas/s) is the primary comparison metric as it normalizes for any gas differences across configs. GD-6 achieves the highest read throughput (6.39 Mgas/s), followed by GD-5 (6.11) and GD-7 (6.04). Performance improves monotonically from GD-1 through GD-6, with GD-4 (5.46) beating GD-3 (5.20) as expected from the shorter path.

Why the dramatic difference from synthetic? Keccak scatters keys uniformly, forcing a full traversal from root to leaf. GD-1 must descend 256 levels; GD-8 only 32. Every level is a potential disk seek. Random access exposes the full depth penalty.

GD-3 (3,213 ms, 5.20 Mgas/s) and GD-4 (3,067 ms, 5.46 Mgas/s) show close performance despite different path lengths (~86 vs 64 nodes). GD-3’s smaller node serialization (~256 bytes vs ~512 bytes for GD-4) may interact more favorably with Pebble’s 4KB block size, partially offsetting the longer path. This near-parity supports exploring the GD × block-size matrix (Open Question #3).

Per-slot cost: synthetic reads cost ~0.02 ms/slot. ERC20 reads cost ~0.4–1.0 ms/slot — a 40× penalty from random access patterns.

This 40× ratio aligns closely with raw NVMe I/O measurements: Gary Rong’s disk page-read benchmarks show random 4KB reads at 77 MB/s vs 3,306 MB/s sequential (43×), suggesting the penalty is dominated by I/O access patterns rather than Pebble overhead.

Why random access is the baseline, not the exception. The binary trie unifies all accounts and storage into a single tree. Every key — whether an account balance, a storage slot, or a code chunk — is SHA256-hashed into the 256-bit keyspace. A single contract's storage slots scatter across completely different tree paths. This makes random access the fundamental access pattern of the binary trie, not a pathological case. The synthetic sequential benchmarks represent an unrealistic best case that cannot occur in a unified trie deployment.

So far, wider is better for reads. Then we tested writes.

§5

Act II: The Write Surprise

This is the most important finding in the study.

Q5: ERC20 write latency boxplot — the money graph

GD	state_read	trie_updates	commit	Write Cost	Total (ms)	Mgas/s
1	812	690	76	762	1,645	2.67
2	483	393	61	457	993	4.42
3	349	287	44	331	732	5.95
4	313	254	53	308	678	6.47
5	271	242	57	299	629	6.94
6	272	283	76	359	679	6.41
7	264	328	103	431	745	5.81
8	313	433	158	603	982	4.47

trie_updates = state_hash_ms (AccountHashes + AccountUpdates + StorageUpdates) — this metric covers the full trie mutation and rehash phase, not just hashing. Write Cost is the median of per-block (trie_updates + commit); this may differ slightly from the sum of the component medians due to properties of the median function. Phase 3 data uses verified cold-cache drops (120+ successful, 0 failures, buffers/cache 297–334M after each drop). All configs process 1 tx/block.

GD-5 is the write champion. 6.94 Mgas/s — 7% faster than GD-4 (6.47 Mgas/s, p < 1e-9) and 55% faster than GD-8 (4.47 Mgas/s). GD-6 (6.41 Mgas/s) is close behind, while GD-7 (5.81 Mgas/s) confirms that write performance degrades past GD-6. The inflection from GD-5 to GD-6 is moderate: hash cost rises 17% (242→283 ms) and commit rises 33% (57→76 ms).

✓ Reads: GD-5 271 ms, GD-6 272 ms, GD-7 264 ms — all close, favoring wider trees

✓ Trie updates: GD-5 242 ms vs GD-4 254 ms (−5%). GD-6 rises moderately to 283 ms (+17% over GD-5)

✗ Beyond GD-6: GD-7 328 ms hash, GD-8 433 ms hash — confirming the inflection is a gentle slope, not a cliff

Why? The Internal Subtree

Each trie node at group depth g contains an internal binary subtree with 2^g−1 nodes that must be rehashed on every write. This is the mechanism behind the surprise.

Figure 3 — Each trie node contains an internal binary subtree. Wider nodes mean exponentially more hashing per write.

Figure 4 — For reads, only the downward traversal matters (favors GD-8). For writes, rehashing and commit dominate (favors GD-4).

GD-8's path is 2× shorter than GD-4 (32 vs 64 levels), but each node is ~17× more expensive to rehash (255 vs 15 internal operations). GD-5 hits the sweet spot: 31 internal nodes per node × ~52 nodes on path = ~1,612 total ops, compared to GD-4's 960 and GD-8's 8,160. GD-5 finds a minimum because its path is short enough to benefit from fewer I/Os, while its per-node rehashing cost hasn't yet exploded.

Note: The 17× ratio (255 vs 15 internal hash operations) is the theoretical upper bound from the data structure definition. Our benchmarks support the mechanism: GD-8 trie update costs are 1.71× more than GD-4 (433ms vs 254ms), consistent with random writes modifying a fraction of each node's internal subtree. The geth implementation (go-ethereum/trie) is the authoritative source for the exact rehashing algorithm.

Node Serialization Size

Each trie node stores up to 2^N child pointers (32 bytes each). A GD-4 node: 16 × 32 = ~512 bytes. A GD-7 node: 128 × 32 = ~4 KB — exactly the Pebble block size. A GD-8 node: 256 × 32 = ~8 KB. The size differences have cascading effects:

Pebble block boundary: GD-6 nodes (~2 KB) fit within a single 4 KB Pebble block. GD-7 nodes (~4 KB) saturate the block — with key overhead, they likely span two blocks, potentially doubling I/O per node fetch. This partly explains GD-7’s read reversal: despite 14% fewer path nodes than GD-6 (37 vs 43), GD-7 reads ~148 KB total per lookup (37 × 4 KB) vs GD-6’s ~86 KB (43 × 2 KB).
Pebble cache efficiency: Fewer GD-8 nodes fit in a given cache budget
Write amplification: Larger serialized nodes increase LSM compaction overhead
Commit cost: Commit costs scale with total bytes serialized per write path. GD-8 nodes (~8 KB each × 32 path nodes ≈ 256 KB per write) cost 158 ms vs GD-4’s 53 ms (~512 bytes × 64 nodes ≈ 32 KB). GD-6 commit (76 ms) and GD-7 commit (103 ms) show moderate increases over GD-5 (57 ms), consistent with larger serialized nodes.

§6

Act III: The Trade-off

The Verdict

Criterion	GD-4	GD-5	GD-6	GD-7	GD-8	Winner
Reads (Mgas/s)	5.46	6.11	6.39	6.04	5.59	GD-6 by 5% vs GD-5
Writes (Mgas/s)	6.47	6.94	6.41	5.81	4.47	GD-5 by 7% vs GD-4
Mixed (Mgas/s)	5.13	6.09	6.27	5.87	5.43	GD-6 by 3% vs GD-5
Wins	0/3	1/3	2/3	0/3	0/3	GD-6 leads reads + mixed; GD-5 leads writes

GD-6 wins reads and mixed; GD-5 wins writes. By Mgas/s, GD-6 leads reads (6.39) and mixed (6.27), while GD-5 leads writes (6.94, +7% over GD-4). GD-7 is past the inflection — worse than GD-6 on all three benchmarks — confirming the sweet spot is GD-5 or GD-6 depending on workload. Since Ethereum is read-heavy, GD-6 may be the preferred default.

Interactive — Cost Breakdown by Group Depth

State reads (traversal)

Trie updates (rehash)

Commit

Total

GD-5 is the write champion at 629 ms — shortest path with manageable rehash cost

Mixed Workloads

GD	state_read (ms)	trie_updates (ms)	commit (ms)	total (ms)	Mgas/s
1	4,711	345	53	5,363	2.18
2	3,003	217	44	3,518	3.35
3	1,981	145	39	2,375	4.90
4	1,893	138	43	2,302	5.13
5	1,512	124	48	1,910	6.09
6	1,440	141	54	1,851	6.27
7	1,055	218	73	1,529	5.87
8	1,612	221	87	2,145	5.43

Note: GD-7’s mixed benchmark processed fewer transactions per block (8.84M vs ~11.80M gas). Mgas/s normalizes for this difference, so throughput comparisons remain valid. Raw ms values for GD-7 mixed are not directly comparable to other configs.

GD-6 leads mixed workloads by Mgas/s: 6.27 vs 6.09 for GD-5 and 5.13 for GD-4. GD-5 is 19% faster than GD-4 (6.09 vs 5.13 Mgas/s, p < 1e-3). GD-6’s read advantage (1,440 ms state_read vs 1,512 ms for GD-5) outweighs its moderate trie update increase (141 ms vs 124 ms). GD-7 (5.87 Mgas/s) confirms the inflection: its raw state_read (1,055 ms) is lower, but GD-7 processes less gas per block (8.84M vs 11.80M), so Mgas/s is the fair comparison. Note that when showing ms breakdowns, Mgas/s is the primary metric since some configs process different gas amounts in similar wall-clock time.

Open question: The optimal group depth ultimately depends on the read/write ratio of real Ethereum blocks. While state reads clearly dominate block processing time in our benchmarks, the exact mainnet split has not been systematically measured. A historical analysis of mainnet read vs write access patterns would further refine this recommendation.

§7

Cross-Cutting Patterns

Where Does Time Go?

State reads dominate ERC20 block processing time across all group-depth configurations, accounting for 50–85% of total time. Trie update and commit costs are negligible for read-only benchmarks (balanceOf) but become the dominant cost component for writes (approve) — especially at higher group depths where internal subtree rehashing is most expensive.

Overall Throughput

Q8: Mgas/s overview across all benchmarks

Hardware Context: NVMe I/O Characteristics

Disk page-read benchmarks by Gary Rong on NVMe (Samsung 990 Pro) provide hardware context for interpreting our results:

Random vs sequential: 4KB random reads = 77 MB/s (50.6 μs) vs 3,306 MB/s sequential — a 43× gap that closely matches our 40× per-slot penalty.
Page size matters: Random 16KB reads achieve 2.3× throughput of 4KB (174 vs 77 MB/s) with only 1.8× latency. Wider group depths produce larger nodes that could benefit from larger Pebble block sizes.
Queue depth is transformative: QD=1→QD=8 improves random 4KB throughput by 8.7× (77→673 MB/s). Parallel EVM could unlock this, compressing read latency differences between configs.
Sublinear latency growth: 4KB→64KB = 16× data, but only 2.3× latency (50.6→117.3 μs). NVMe’s internal parallelism means larger I/O requests are disproportionately efficient.

These measurements use direct page reads bypassing filesystem caches — analogous to our cold-cache protocol. The benchmark hardware (Samsung 990 Pro) differs from our QEMU VM’s virtual disk, so absolute numbers won’t match, but the ratios reveal fundamental NVMe characteristics relevant to group depth optimization.

§8

Conclusions

GD-5 or GD-6, depending on workload

Default / read-heavy / mixed: GD-6 (wins reads by 5%, mixed by 3% over GD-5).

Write-heavy: GD-5 (wins writes by 7% over GD-4, 8% over GD-6).

GD-7 is past the inflection — worse than GD-6 on all benchmarks — confirming the sweet spot is 5–6 bits. GD-8 is suboptimal on all workloads, demonstrating diminishing returns past the sweet spot.

Root Causes

The recommendation rests on a three-step mechanism that governs every state access in the binary trie:

Traverse — descend from root to leaf. Cost proportional to tree depth. Favors wider trees (GD-8: 32 levels vs GD-5: ~52 vs GD-4: 64 levels).
Rehash — recompute the internal subtree of every node on the path back to root. Cost proportional to 2^g−1 per node. Favors narrower trees (GD-4: 15 ops/node, GD-5: 31 ops/node, GD-8: 255 ops/node).
Commit — serialize and write modified nodes to disk. Cost proportional to node size. Favors narrower trees.

GD-5 finds the write minimum of the traversal × rehashing trade-off. Its path is 19% shorter than GD-4 (~52 vs 64 nodes), and each node's 31 internal operations remain manageable. At GD-6, hash cost rises moderately to 283 ms (+17% over GD-5’s 242 ms) — a gentle slope, not a cliff. GD-7 data (328 ms hash) confirms the trend continues. The write inflection point lies between GD-5 and GD-6 (hash/read ratio crosses 1.0), but reads and mixed workloads still improve at GD-6, making the inflection write-specific rather than universal.

Five Patterns That Hold Across All Group Depths

1. State reads dominate. 50–85% of block processing time is spent reading state from disk, regardless of group depth or benchmark type.

2. Random access is ~40× more expensive per slot. Keccak-hashed keys (ERC20) cost 0.4–1.0 ms/slot vs 0.02 ms/slot for sequential access (synthetic).

Independent NVMe page-read benchmarks confirm that random vs sequential I/O explains nearly all of this penalty (see §7 Hardware Context for details).

3. Cache hit rates plateau at ~37–38% for random access. Despite increasing group depth, the storage cache hit rate stays within 28–38% for all keccak-hashed workloads. The 256-bit keyspace is too sparse for meaningful cache reuse beyond shared upper-level trie nodes.

4. Trie updates are negligible for reads, dominant for writes. Read benchmarks show <1% time in trie updates. Write benchmarks at GD-7/8 spend 40–60% of time rehashing internal subtrees. GD-6 is the transition point where hash cost begins to exceed read cost.

5. Run-to-run CV mostly < 10% on Mgas/s. Phase 3’s verified cold-cache drops (120+ successful, 0 failures) produce reproducible results across all 8 configs. The improvement over earlier phases validates the methodology fix.

On the Snapshot Layer

These benchmarks ran without a flat snapshot layer. Binary trie path-based reads via Pebble’s BTree index are already efficient, so snapshots may offer limited additional benefit for narrow group depths. However, wider group depths (GD-6, GD-8) could benefit more from snapshots, as their larger nodes incur higher per-read I/O cost that a flat key-value lookup would bypass (see Open Question #2).

Resolved & Open Questions

01 ✓ RESOLVED

Non-power-of-2 group depths: GD-5 or GD-6 is the optimum, depending on workload. Testing all 8 configs (GD-1 through GD-8) confirmed that 5 bits wins writes (+7% over GD-4) and 6 bits wins reads (+5% over GD-5) and mixed (+3% over GD-5). GD-7 validates the inflection.

02

Snapshot layer validation: empirically confirm that snapshots further favor GD-5/GD-6 as our analysis predicts.

03

Pebble block size interaction: this study fixed block size at 4KB. NVMe page-read benchmarks show that random 16KB reads achieve 2.3× the throughput of 4KB reads (174 vs 77 MB/s) with only 1.8× the latency (89.7 vs 50.6 μs). Larger Pebble blocks could significantly benefit wider group depths whose serialized nodes exceed 4KB.

04

~~Mainnet state distribution.~~ Largely resolved. The binary trie SHA256-hashes all keys into the 256-bit keyspace, making trie access patterns inherently random regardless of real-world state distribution. Hot contracts (DEXes, bridges) don’t cluster in the trie. The only caveat is stem-influenced access patterns, where keys sharing a stem may share upper path nodes — but this affects caching, not the group depth trade-off.

05

Concurrent block processing: parallel EVM execution may change the read/write ratio and shift the optimal group depth. NVMe queue depth benchmarks show that QD=8 improves random 4KB throughput 8.7× over QD=1 (673 vs 77 MB/s). If parallel execution increases effective I/O queue depth, read latency differences between group depths may compress significantly.

06 ✓ RESOLVED

GD-6 data quality: Phase 3 re-run with verified cold-cache drops confirmed all measurements. All 8 configs have clean data with CVs mostly <10% on Mgas/s. GD-6 writes: 679 ms, 6.41 Mgas/s. GD-7 fills the gap and confirms the inflection past GD-6.