blake3-to-poseidon2-migration.md

Blake3 to Poseidon2: Complete Migration Plan

1. Motivation

The migration from Blake3 to Poseidon2 is driven by the need for a hash function that operates natively over finite fields. Blake3's bit-oriented operations (XOR, rotation, shift) cost roughly 50-100x more constraints than Poseidon2 when proved inside STARK circuits. Poseidon2's native field arithmetic enables:

Zero-knowledge proofs: ~2M+ hashes/sec in Plonky3/STARK proving systems
Multi-party computation (MPC): Practical threshold operations over arithmetic circuits
Fully homomorphic encryption (FHE): Word-level homomorphic encryption compatibility
Content addressing: Collision-resistant fixed-length output suitable for CIDs
Planetary-scale verifiability: Every hash in the system becomes cheaply provable

See: https://cyber.page/hash-function-selection/

This is not a narrow library swap. Blake3 is deeply woven into iroh's architecture across multiple repositories. This document covers the complete migration surface.

2. Full Blake3 Dependency Map Across the Iroh Ecosystem

The iroh project spans multiple repositories. Blake3 is used at every layer.

2.1 Repository Inventory

From .github/workflows/project_sync.yaml, the iroh ecosystem includes:

Repository	Blake3 Role	Migration Complexity
iroh (this repo)	Relay handshake KDF	Low
iroh-blobs	Content addressing (Hash type), blob transfer protocol	Critical
bao-tree	BLAKE3 Merkle tree for verified streaming	Critical - requires novel replacement
abao	Async BAO (async verified streaming)	Critical - depends on bao-tree
iroh-blake3	Custom BLAKE3 fork with hazmat API	Must be replaced entirely
iroh-docs	Key-value store using iroh-blobs hashes	Depends on iroh-blobs migration
iroh-gossip	Pub-sub overlays (may reference blob hashes)	Depends on iroh-blobs migration
iroh-car	CAR (Content Addressable aRchives) format	Depends on hash type migration
iroh-willow	Willow protocol implementation	Depends on hash type migration

2.2 Blake3 Feature Usage by Layer

Layer 1: Cryptographic Primitive (iroh-blake3)

iroh maintains a custom fork of the BLAKE3 crate (iroh-blake3) because the upstream BLAKE3 crate historically did not expose the internal APIs needed for BAO streaming. The new BLAKE3 "hazmat" API was contributed by the iroh team and provides:

HasherExt - setting input chunk offsets for non-sequential hashing
merge_subtree_root() / merge_subtree_non_root() - combining chaining values at internal tree nodes
Non-root finalization - computing intermediate (non-root) chaining values
Access to raw chaining values (32-byte intermediate digests)

Migration requirement: All of these internal APIs must have Poseidon2 equivalents.

Layer 2: Verified Streaming (bao-tree, abao)

BAO (BLAKE3 Authenticated and Oblivious) is the core innovation that enables iroh's blob transfer protocol. It exploits BLAKE3's internal Merkle tree structure:

How BLAKE3's Merkle Tree Works:

Input is split into 1024-byte chunks
Each chunk is hashed independently via the BLAKE3 compression function, producing a 32-byte chaining value (CV)
Pairs of CVs are combined as parent nodes (64-byte input -> 32-byte output). In the Poseidon2 replacement over Goldilocks, the same step absorbs 2x4 = 8 elements and squeezes 4 elements.
Tree is built upward until a single root hash remains
Domain separation flags distinguish chunk nodes, parent nodes, and the root node
Each chunk carries a counter (its position) for ordering
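
The tree shape described above can be sketched independently of the hash function. The snippet below is a minimal illustration, not code from iroh or bao-tree: `left_len`, `merkle_root`, and `tree_shape` are hypothetical names, and the caller supplies the parent-combine function. It assumes BLAKE3's rule that the left subtree holds the largest power-of-two number of chunks strictly smaller than the total.

```rust
/// Number of leaves in the left subtree for `n >= 2` leaves:
/// the largest power of two strictly less than `n`.
fn left_len(n: usize) -> usize {
    debug_assert!(n >= 2);
    n.next_power_of_two() / 2
}

/// Recursively fold leaf values into a root using a caller-supplied
/// parent function. `T` stands in for a chaining value.
fn merkle_root<T: Clone>(leaves: &[T], parent: &dyn Fn(&T, &T) -> T) -> T {
    assert!(!leaves.is_empty(), "at least one chunk is required");
    if leaves.len() == 1 {
        return leaves[0].clone();
    }
    let (left, right) = leaves.split_at(left_len(leaves.len()));
    parent(&merkle_root(left, parent), &merkle_root(right, parent))
}

/// Render the tree shape for `n` chunks by using strings as "hashes",
/// so the grouping of chunk indices becomes visible.
fn tree_shape(n: usize) -> String {
    let leaves: Vec<String> = (0..n).map(|i| i.to_string()).collect();
    merkle_root(&leaves, &|l: &String, r: &String| format!("({}{})", l, r))
}
```

Using strings in place of chaining values makes the left-heavy layout easy to inspect; a real backend would pass a 2-to-1 compression function as `parent` and handle root finalization separately.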

What BAO Provides:

Feature	Description
Combined encoding	Serializes the hash tree interleaved with data in pre-order traversal. A decoder can verify every byte as it arrives without buffering the whole file.
Outboard encoding	Stores only the hash tree (no data). Like a `.torrent` file - verification metadata stored separately from content.
Slice extraction	Extracts the minimal subset of the tree needed to verify a specific byte range. For a 1 GB file (2^20 chunks of 1 KB), verifying a single 1 KB chunk requires only ~20 parent nodes of 64 bytes each (~1.3 KB overhead).
Range requests	Fetch and verify arbitrary byte ranges against the root hash with O(log N) overhead.
Streaming decode	Data is verified and returned chunk-by-chunk during download. No need to wait for the full file.
Pre-order format	Nodes appear in the exact order needed for sequential depth-first verification - no seeking required.
Post-order format	Nodes appear after their children. Enables efficient append-only file sync (new data appends without rewriting existing tree nodes).
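
The O(log N) overhead of a single-chunk slice can be estimated with a few lines of arithmetic. This is a rough sketch under stated assumptions, not bao-tree code: the function names are hypothetical, chunks are 1024 bytes, each parent node on the path from the chunk to the root is serialized as two 32-byte hashes, and the 8-byte length header and the chunk data itself are not counted.

```rust
/// BLAKE3 chunk size in bytes.
const CHUNK_LEN: u64 = 1024;
/// A serialized parent node holds two 32-byte child hashes.
const PARENT_LEN: u64 = 64;

/// Number of chunks for a blob of `len` bytes (an empty blob still has one chunk).
fn chunk_count(len: u64) -> u64 {
    ((len + CHUNK_LEN - 1) / CHUNK_LEN).max(1)
}

/// Upper bound on parent nodes between one chunk and the root: ceil(log2(chunks)).
fn max_path_len(len: u64) -> u32 {
    let chunks = chunk_count(len);
    // For chunks >= 1, ceil(log2(chunks)) equals the bit length of (chunks - 1).
    64 - (chunks - 1).leading_zeros()
}

/// Upper bound on hash bytes sent alongside a single-chunk slice.
fn slice_overhead_bytes(len: u64) -> u64 {
    u64::from(max_path_len(len)) * PARENT_LEN
}
```

Doubling the blob size adds at most one parent node (64 bytes) to the proof, which is what keeps range requests cheap for large blobs.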

bao-tree extensions beyond standard BAO:

Extension	Purpose
Configurable `BlockSize`	Chunk groups larger than 1024 bytes (e.g., 16 KiB) to reduce outboard size
Multi-range queries	Request multiple non-overlapping byte ranges in a single query: `[0..1000, 5000..6000]`
Async support	Both sync (`io::sync`) and async (`io::fsm`) implementations sharing core logic
Post-order outboard	Append-only storage format for incremental file synchronization

Layer 3: Content Addressing (iroh-blobs)

iroh-blobs provides content-addressed blob storage and transfer:

Component	Blake3 Usage
`Hash` type	Wraps `blake3::Hash` (32 bytes). The universal content identifier for all blobs.
`BlobFormat::Raw`	Single blob identified by its BLAKE3 hash
`BlobFormat::HashSeq`	Ordered sequence of BLAKE3 hashes (used for collections/directories)
Blob transfer protocol	Uses BAO encoding to stream blobs with per-chunk verification
Content discovery	Blobs are requested and routed by their BLAKE3 hash
Storage backend	Blobs stored and indexed by BLAKE3 hash
Deduplication	Identical content produces identical BLAKE3 hash = automatic dedup

The Hash type is the atom of iroh's content addressing. Every blob, every collection, every piece of content in the iroh ecosystem is identified by its BLAKE3 hash. Changing this hash function changes every identifier in the system.

Layer 4: Relay Handshake (iroh-relay, this repo)

The only Blake3 usage in this repository:

Production (iroh-relay/src/protos/handshake.rs:215):

fn message_to_sign(&self) -> [u8; 32] {
    blake3::derive_key(DOMAIN_SEP_CHALLENGE, &self.challenge)
}

Test (iroh-relay/src/protos/handshake.rs:594-598):

let label_key = blake3::hash(label);
let context_key = blake3::keyed_hash(label_key.as_bytes(), context.unwrap_or(&[]));
let mut hasher = blake3::Hasher::new_keyed(context_key.as_bytes());
hasher.update(&self.shared_secret?.to_le_bytes());
hasher.finalize_xof().fill(output.as_mut());

Layer 5: Higher-Level Protocols

Protocol	Blake3 Dependency
iroh-docs	Documents contain blob hashes (BLAKE3). Key-value entries reference content by Hash.
iroh-gossip	Message payloads may include blob hashes for content routing
iroh-willow	Willow entries reference content by Hash
iroh-car	CAR archives contain CIDs built from BLAKE3 hashes

2.3 Complete Blake3 API Surface Used Across Ecosystem

Blake3 API	Where Used	Poseidon2 Must Provide
`blake3::hash(input) -> Hash`	Content addressing (iroh-blobs)	Sponge hash, 32-byte output
`blake3::derive_key(context, input) -> [u8; 32]`	Relay handshake KDF (iroh-relay)	Domain-separated sponge KDF
`blake3::keyed_hash(key, input) -> Hash`	Test utilities (iroh-relay)	Keyed sponge mode
`blake3::Hasher::new()` + `update()` + `finalize()`	Incremental content hashing (iroh-blobs)	Incremental sponge
`blake3::Hasher::new_keyed()` + XOF	Test utilities (iroh-relay)	Keyed sponge + squeeze XOF
Hazmat: non-root chaining values	BAO tree construction (bao-tree)	Intermediate Poseidon2 compression
Hazmat: merge_subtree_root/non_root	BAO parent node hashing (bao-tree)	2-to-1 compression function
Hazmat: chunk counter	BAO chunk ordering (bao-tree)	Counter as absorbed field element
Hazmat: is_root flag	BAO root finalization (bao-tree)	Distinct root finalization constant
Hazmat: HasherExt (offset)	Non-sequential chunk hashing (bao-tree)	Sponge with explicit position

3. Poseidon2 Technical Assessment

3.1 Properties Comparison

Property	Blake3	Poseidon2
Design	Bit-oriented (Merkle tree of compress functions)	Algebraic (operations over prime fields)
Collision resistance	128-bit	128-bit (with proper parameters)
Preimage resistance	256-bit	Depends on parameter/field choice
Output size	256-bit default, XOF-capable	Field-element width (e.g., ~254-bit over BN254)
ZK circuit cost	50,000-100,000 constraints per hash	500-1,500 constraints per hash
Native CPU throughput	1-7 GiB/s (SIMD-accelerated)	~3-4M hashes/sec (Goldilocks faster than BN254 due to 64-bit arithmetic)
Sponge mode	Not native (tree-based)	Native sponge construction
Compression mode	Native (internal compress function)	Native: `C(x) = Trunc_d(P(x) + M*x)`
XOF capability	Native	Via repeated sponge squeeze
KDF mode	Native `derive_key()`	Via domain-separated sponge
Merkle tree	Native (BLAKE3 IS a Merkle tree hash)	Must be built on top of compression mode
Maturity	Builds on the heavily analyzed BLAKE family (BLAKE was a SHA-3 finalist), deployed widely	Peer-reviewed (AFRICACRYPT 2023), growing ZK adoption

3.2 The Critical Performance Tradeoff

Operation	Blake3	Poseidon2 (Goldilocks, t=12, d=7)	Impact
Hash 1 KB chunk	~1 us	~50-100 us (est.)	~50-100x slower
Hash 1 GB file	~0.15-1 sec	~50-100 sec (est.)	Significant but workable
Parent node (2-to-1)	<1 us	~1-3 us	Acceptable
Prove hash in STARK	~50,000 constraints	~500 constraints	100x cheaper (the point)

Goldilocks advantage over BN254: 64-bit field arithmetic is ~5-10x faster than 254-bit BN254 arithmetic on modern CPUs. This narrows the performance gap significantly compared to BN254-based Poseidon2. The regression is real but far less severe than with larger fields.

For content addressing and verified streaming, the CPU performance regression must be addressed architecturally (see Section 5: chunk group sizing and GPU acceleration).

3.3 Available Rust Implementations

Crate	Version	Downloads	Description	Goldilocks Support
`p3-poseidon2`	0.4.2	1M+	Plonky3's Poseidon2 permutation	Yes - native Goldilocks support
`zkhash`	0.2.0	888K	Poseidon2 + related primitives	Via ark-ff (possible but not native)
`light-poseidon`	0.4.0	7.1M	Poseidon v1 (not v2)	No
`taceo-poseidon2`	-	-	Poseidon2 for BN254	No

Recommended crate: p3-poseidon2 from Plonky3. It is the only production-ready implementation with native Goldilocks field support, matching our parameter selection. Plonky3 is MIT/Apache-2.0 licensed, production-ready, and powers SP1's prover.

3.4 Poseidon2 Mode Mapping for All Blake3 Features

Blake3 Feature	Poseidon2 Equivalent	Implementation
`hash(bytes)`	Sponge: absorb field-encoded bytes, squeeze	Standard Poseidon2 sponge
`derive_key(ctx, bytes)`	Absorb domain separator into capacity, absorb input, squeeze	Duplex construction
`keyed_hash(key, bytes)`	Initialize capacity with key elements, absorb, squeeze	Keyed sponge
`Hasher` (incremental)	Sponge state with incremental absorption	Stateful sponge
XOF (finalize_xof + fill)	Repeated squeeze operations	Sponge squeeze
Chunk hashing (1024B -> 32B CV)	Absorb ~147 Goldilocks elements (rate=8), squeeze 4 elements	~19 permutations per chunk
Parent hashing (64B -> 32B CV)	Absorb 2x4 child elements into rate (8 elements), squeeze 4	1 permutation per parent
Chunk counter	Absorb counter as field element before chunk data	Additional absorbed element
Domain separation (chunk/parent/root)	Capacity initialization constants per mode	Distinct capacity values
Non-root finalization	Squeeze without root flag	Different finalization constant
Root finalization	Squeeze with root flag	Root-specific constant
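
The capacity-based domain separation in the table above can be sketched as follows. This is a structural illustration only: `toy_permute` is a placeholder that is not Poseidon2 and carries no security claim, the domain tag values are hypothetical, and the input is assumed to be already encoded and padded into field elements.

```rust
const P: u128 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime 2^64 - 2^32 + 1
const WIDTH: usize = 12;
const RATE: usize = 8;

/// Domain tags written into the first capacity element (hypothetical values).
const DOMAIN_HASH: u64 = 0;
const DOMAIN_DERIVE_KEY: u64 = 1;

/// PLACEHOLDER permutation for illustration only. It is NOT Poseidon2;
/// a real build would call the Poseidon2 permutation here.
fn toy_permute(state: &mut [u64; WIDTH]) {
    for round in 0..4u128 {
        let mut sum = 0u128;
        for (i, s) in state.iter_mut().enumerate() {
            // Add a round-dependent constant, then apply x^7.
            let x = (u128::from(*s) + round * 12 + i as u128 + 1) % P;
            let x2 = x * x % P;
            let x4 = x2 * x2 % P;
            *s = (x4 * x2 % P * x % P) as u64;
            sum = (sum + u128::from(*s)) % P;
        }
        // Simple linear mixing: add the sum of all elements to each element.
        for s in state.iter_mut() {
            *s = ((u128::from(*s) + sum) % P) as u64;
        }
    }
}

/// Absorb field elements `RATE` at a time, then squeeze four elements.
/// The domain tag lives in the capacity, which input never overwrites.
fn sponge(domain: u64, input: &[u64]) -> [u64; 4] {
    let mut state = [0u64; WIDTH];
    state[RATE] = domain; // capacity occupies indices RATE..WIDTH
    for block in input.chunks(RATE) {
        for (s, x) in state.iter_mut().zip(block) {
            *s = ((u128::from(*s) + u128::from(*x)) % P) as u64;
        }
        toy_permute(&mut state);
    }
    if input.is_empty() {
        toy_permute(&mut state);
    }
    [state[0], state[1], state[2], state[3]]
}
```

Because the tag sits in the capacity rather than in the absorbed data, two modes fed the same input start from different states and no input can imitate another mode's tag.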

3.5 Byte-to-Field-Element Encoding

Poseidon2 operates on field elements, not bytes. All input data must be encoded:

Field	Element Size	Usable Bytes/Element	Elements per 1 KB Chunk	Notes
Goldilocks (64-bit)	8 bytes	~7 bytes	~147	Selected field. Plonky3-native, fast on CPU and GPU
BN254 (~254-bit)	32 bytes	~31 bytes	~34	Ethereum-compatible, but slower field arithmetic
BabyBear (31-bit)	4 bytes	~3.8 bytes	~270	Smallest field, highest absorption count

Selected field: Goldilocks (p = 2^64 - 2^32 + 1)

Goldilocks is chosen because:

Native 64-bit arithmetic on modern CPUs (single machine word)
Plonky3-native (SP1 uses BabyBear, but Goldilocks has broader STARK support)
Fast on both CPU and GPU
8-byte elements mean a 32-byte hash output = 4 field elements

Our Poseidon2 parameters: t=12, R_F=8, R_P=22, d=7

t=12: state width of 12 Goldilocks field elements (96 bytes total state)
R_F=8: 8 full rounds (4 at start, 4 at end) - all elements get S-box
R_P=22: 22 partial rounds - only first element gets S-box
d=7: S-box exponent x^7 (the smallest d > 1 with gcd(d, p-1) = 1, so x^7 is a permutation of the Goldilocks field)
Rate = 8 elements (64 bytes of state, carrying 56 input bytes per permutation under the 7-byte encoding), Capacity = 4 elements (32 bytes)
Output: squeeze 4 elements = 32 bytes (matches Blake3 output size)

Encoding rules (must be canonical and deterministic):

Pad input to element-boundary with unambiguous padding (e.g., 10*1 padding)
Encode each group of 7 bytes as a Goldilocks field element (little-endian, must be < p)
Include input length as a final absorbed element to prevent length-extension
Absorb 8 elements per permutation call (64 bytes of encoded elements, i.e. 56 bytes of original input)
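
The encoding rules can be sketched as below. This is one possible reading rather than the hemera implementation: `encode_bytes` is a hypothetical name, and the padding shown is a simpler marker scheme (a single 0x01 byte followed by zeros) swapped in for the 10*1 padding mentioned above.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;
/// Input bytes packed into one field element.
const BYTES_PER_ELEMENT: usize = 7;

/// Encode bytes as field elements: pad, pack 7 bytes per element
/// (little-endian), then append the original length as a final element.
fn encode_bytes(input: &[u8]) -> Vec<u64> {
    // Padding assumption: one 0x01 marker byte, then zeros up to the next
    // 7-byte boundary. The marker keeps trailing zero bytes unambiguous.
    let mut padded = input.to_vec();
    padded.push(0x01);
    while padded.len() % BYTES_PER_ELEMENT != 0 {
        padded.push(0x00);
    }

    let mut elements: Vec<u64> = padded
        .chunks(BYTES_PER_ELEMENT)
        .map(|group| {
            let mut buf = [0u8; 8];
            buf[..BYTES_PER_ELEMENT].copy_from_slice(group);
            let value = u64::from_le_bytes(buf);
            // 7 bytes give at most 2^56 - 1, which is always below p.
            debug_assert!(value < P);
            value
        })
        .collect();

    // Final element: the original input length in bytes.
    elements.push(input.len() as u64);
    elements
}
```

Packing 7 bytes rather than 8 wastes one byte per element but guarantees every value is a canonical field element without a range check or rejection step.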

4. What Does Not Exist Today (Must Be Built)

4.1 "Poseidon2-BAO": Verified Streaming with Poseidon2

No existing implementation provides BAO-like streaming verification with Poseidon2.

This is the largest piece of novel engineering in the migration. Existing Poseidon2 Merkle tree implementations (merkle-poseidon, poseidon-merkle, Plonky3 internal) are all designed for ZK proof membership, not streaming data transfer. None provide:

Pre-order or post-order tree encoding for streaming
Slice extraction for range-based verification
Outboard storage format
Streaming decode with per-chunk verification
Configurable chunk groups
Multi-range queries

What must be built:

Component	Source	Effort
Tree geometry & traversal	Extract from bao-tree, make hash-agnostic	Medium
Range management (ChunkRanges, ByteRanges)	Reuse from bao-tree (hash-independent)	Low
Poseidon2 chunk hashing	New: sponge (t=12, rate=8) over Goldilocks-encoded chunks; ~19 permutations per 1 KB	Medium
Poseidon2 parent hashing (2-to-1)	New: two 4-element digests = 8 elements = exactly 1 rate-width absorption -> 1 permutation	Low
Domain separation scheme	New: capacity constants for chunk/parent/root	Medium
Pre-order combined encoding	Adapt from bao-tree, swap hash	Low
Outboard encoding	Adapt from bao-tree, swap hash	Low
Post-order outboard	Adapt from bao-tree, swap hash	Low
Slice extraction	Adapt from bao-tree (mostly hash-agnostic)	Low
Streaming decoder	Adapt from bao-tree, swap verification	Medium
Async streaming decoder	Adapt from abao, swap verification	Medium
Test vectors	Generate from reference implementation	Medium

4.2 Poseidon2 Sponge Wrapper Crate

A new crate providing the Blake3-equivalent API surface:

poseidon2_hash(input: &[u8]) -> [u8; 32]
poseidon2_derive_key(context: &str, input: &[u8]) -> [u8; 32]
poseidon2_keyed_hash(key: &[u8; 32], input: &[u8]) -> [u8; 32]
Poseidon2Hasher::new() / new_keyed() / update() / finalize() / finalize_xof()

Plus hazmat-level APIs for BAO tree construction:

chunk_cv(chunk_data: &[u8], counter: u64, is_root: bool) -> [u8; 32]
parent_cv(left: &[u8; 32], right: &[u8; 32], is_root: bool) -> [u8; 32]

4.3 Hash Type Migration

The Hash type in iroh-blobs wraps blake3::Hash. This is the fundamental content identifier. Migration requires:

A new Hash type wrapping 4 Goldilocks field elements (serialized as 32 bytes)
Same API: Hash::new(bytes), as_bytes(), from_bytes(), to_hex(), Display, serde
The 32-byte output (4 x 8-byte Goldilocks elements) matches Blake3's output size, so all existing APIs expecting [u8; 32] remain compatible at the type level
CID format: Algorithm-agile Content Identifier encoding that tags the hash algorithm
Transition: dual-CID period where both Blake3 and Poseidon2 hashes coexist

5. Migration Strategy

5.1 Approach: Pure Poseidon2 with GPU Acceleration via wgpu

Decision: Complete Blake3 replacement with Poseidon2 everywhere. No hybrid, no dual-hash complexity. Single hash function across the entire stack.

Two execution backends:

Backend	Implementation	Use Case
CPU	`p3-poseidon2` (Plonky3)	Goldilocks-native, production-ready, for relay handshake and small data
GPU	Custom wgpu compute shaders	For content hashing, BAO tree construction, bulk blob processing

Why this works:

CPU path (p3-poseidon2): Goldilocks 64-bit arithmetic is fast on modern CPUs. For relay handshake KDF (single hash of 16 bytes) this is more than adequate. For small blobs (<1 MB) CPU hashing is practical (~3-7s/GB with 16 KB chunk groups).
GPU path (wgpu): Poseidon2 over Goldilocks is massively parallelizable - each chunk hashes independently, parent nodes at each tree level are independent. wgpu provides cross-platform GPU compute (Vulkan, Metal, DX12, WebGPU) without vendor lock-in. The Poseidon2 permutation (t=12, 30 rounds of field multiplications) maps naturally to GPU SIMD lanes.

Expected GPU acceleration:

Each permutation: 118 S-box evaluations (8 full rounds x 12 elements + 22 partial rounds x 1 element), or about 472 Goldilocks multiplications at 4 per x^7, plus the linear layers
Modern GPU: thousands of parallel permutations
Target: approach or exceed Blake3 CPU throughput for bulk hashing

wgpu advantages over CUDA/Icicle:

Cross-platform: works across GPU vendors (not NVIDIA-only)
WebGPU compatible: same shaders work in browser (WASM target)
Rust-native: wgpu crate integrates cleanly into the iroh ecosystem
No proprietary runtime dependency

Architecture:

// hemera crate provides unified API with backend selection

pub enum Backend {
    Cpu,           // p3-poseidon2, always available
    Gpu(wgpu::Device), // wgpu compute shaders, when GPU available
}

// High-level API automatically selects backend based on input size
pub fn hash(input: &[u8]) -> [u8; 32];           // auto-selects
pub fn hash_with(backend: &Backend, input: &[u8]) -> [u8; 32]; // explicit

// BAO tree construction benefits most from GPU (all chunks parallel)
pub fn compute_outboard(backend: &Backend, data: &[u8]) -> Outboard;

Fallback: CPU path is always available. GPU is an acceleration layer, not a requirement. Nodes without GPU capability operate correctly, just slower for large blob hashing.
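
The auto-selection and fallback behaviour might look like the sketch below. The names `BackendKind`, `select_backend`, and `GPU_THRESHOLD_BYTES` are hypothetical, and the 1 MiB cut-over is an assumption taken from the "<1 MB" guidance above, not a measured value.

```rust
/// Hypothetical cut-over point; the text only says blobs under ~1 MB suit the CPU.
const GPU_THRESHOLD_BYTES: usize = 1 << 20;

/// Which backend a hash call would be routed to.
#[derive(Debug, PartialEq, Clone, Copy)]
enum BackendKind {
    Cpu,
    Gpu,
}

/// Pick the GPU only when one is available and the input is large enough
/// to amortize dispatch overhead; otherwise fall back to the CPU.
fn select_backend(input_len: usize, gpu_available: bool) -> BackendKind {
    if gpu_available && input_len >= GPU_THRESHOLD_BYTES {
        BackendKind::Gpu
    } else {
        BackendKind::Cpu
    }
}
```

Since both backends are specified to produce identical digests, the selection only affects throughput and can be tuned per deployment without changing any identifier.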

5.2 Migration Phases

Phase Status Summary (updated 2026-02-24)

Phase	Description	Status	Commit
0a	hemera CPU backend	DONE	`1517ebff54`
0b	hemera GPU scaffolding	DONE	`5aefbfedb3`
1	cyber-bao verified streaming	DONE	`2c91d5b9e2`, `b1f6de6ba2`, `c62b83fa9a`
2	iroh-blobs integration	DONE	(pending commit)
3	Relay handshake migration	DONE	`07e31dd236`
4	Higher-level protocols	DONE (3/5)	iroh-docs, iroh-gossip, iroh-car migrated; iroh-willow, iroh-ffi deferred (version gap)
5	Blake3 removal (this workspace)	DONE	0 blake3 deps in any Cargo.toml or Cargo.lock
6	Validation (this workspace)	DONE	398 tests pass, 0 failures across all crates

Phase 2 Completion Summary (updated 2026-02-24)

Phase 2 is COMPLETE. All 6 sub-phases (2a–2f) are done.

0 blake3 references remain in iroh-blobs .rs files
0 bao-tree crate references remain
80+ cyber_bao and 50+ hemera references across 30+ source files
Hash type wraps hemera::Hash (32-byte Poseidon2 digest)
BAO tree operations use cyber-bao throughout (outboard, encoding, decoding, streaming)
97 tests pass, 0 failures (2 pre-existing ignored tests)
67 cyber-bao tests pass independently

Sub-phase	Scope	Status
2a	Foundation types (`Hash`, `BaoTree` imports)	DONE
2b	Trait bridge (cyber-bao traits ↔ iroh-blobs I/O)	DONE
2c	Outboard & storage (core data path)	DONE
2d	Protocol & streaming (wire format)	DONE
2e	API layer (public surface)	DONE
2f	Tests (integration tests)	DONE

Key changes made during Phase 2 (and Phase 1 bug-fix pass):

BaoChunk::Parent now carries left: bool, right: bool flags for filtered traversal
encode_ranges_validated only pushes hashes for children that will be visited
valid_ranges rewritten as recursive to handle missing outboard entries gracefully
Test expectations updated for block-level granularity (16KB blocks = 16 chunks)
hash_subtree in iroh-blobs delegates to cyber_bao::io::hash_block()

Phase 4 Completion Summary (updated 2026-02-24)

3 of 5 higher-level protocol repos migrated. 2 deferred due to major version incompatibility.

Repo	Status	Changes	Tests
iroh-docs (v0.96)	DONE	`blake3::Hasher` → `hemera::Hasher` in fingerprint computation (ranger.rs, sync.rs)	69 passed, 0 failed
iroh-gossip (v0.96)	DONE	`blake3::hash()` → `hemera::hash()` for MessageId + test TopicIds	19 passed, 0 failed
iroh-car (standalone)	DONE	`Code::Blake3_256` → `Code::Sha2_256` in test CIDs (hash-agnostic library)	3 passed, 0 failed
iroh-willow (v0.34)	DEFERRED	Depends on iroh 0.34 / iroh-blobs 0.34 — version gap too large	—
iroh-ffi (v0.35)	DEFERRED	Depends on iroh 0.35 — version gap; no direct blake3 usage anyway	—

All three repos added to workspace with local path patches for iroh, iroh-base, iroh-blobs, iroh-gossip, iroh-relay.

Phase 1 Gap Analysis (updated 2026-02-24)

Phase 1 is COMPLETE. Of the 7 gaps, 6 are closed and 1 (gap 6) is accepted as intentional. 67 tests pass, 0 warnings.

#	Gap	Status	Resolution
1	Post-order outboard I/O	CLOSED	`PostOrderOutboard<H,D>` and `PostOrderMemOutboard<H>` with full Outboard/OutboardMut trait impls. Conversion from pre-order in `PostOrderMemOutboard::create()`.
2	Multi-range slice extraction	CLOSED	`extract_slice_ranges()` accepts `&ChunkRangesRef` for multi-range queries. `extract_slice()` is now a convenience wrapper.
3	Slice verification	CLOSED	`decode_slice()` with stack-based verification against trusted root hash. Returns `Vec<(u64, Vec<u8>)>` of verified leaf data.
4	Async range filtering	CLOSED	`ResponseDecoder::new()` calls `tree.pre_order_chunks_filtered(&ranges)`. `traverse_ranges_validated()` in mixed.rs uses `pre_order_chunks_filtered(ranges)`. Sync `valid_ranges` uses `pre_order_chunks_filtered(ranges)`.
5	Async final-length validation	CLOSED	`ResponseDecoder::next0()` checks `decoded_bytes > tree.size()` after exhausting all items.
6	HashBackend genericity in protocol layer	ACCEPTED	Protocol-layer code (`sync.rs`, `fsm.rs`, `mixed.rs`, `io/mod.rs::hash_block`) hardcodes `Poseidon2Backend`. This is intentional — the protocol is Poseidon2-specific; low-level encode/decode/outboard/slice remain generic.
7	BlockSize non-zero test coverage	CLOSED	encode, decode, slice, sync, and pre_order modules all have `BlockSize::from_chunk_log(1)` and/or `BlockSize::DEFAULT` tests. Fixed a bug where `outboard()` used `chunk_hash()` instead of `hash_block()` in the single-block path, causing hash mismatches with non-zero block sizes.

Phase 6 Validation Summary (updated 2026-02-24)

Full workspace validation: 398 tests pass, 0 failures (3 ignored):

Crate	Tests	Passed	Ignored
hemera	42	42	0
cyber-bao	67	67	0
particle	0	0	0
iroh-base	2	2	0
iroh-relay	9	9	0
iroh	90	90	1
iroh-blobs	97	97	2
iroh-gossip	19	19	0
iroh-docs	69	69	0
iroh-car	3	3	0
Total	398	398	3

Blake3 dependency audit:

grep -r blake3 Cargo.toml across all crates: 0 matches
grep blake3 Cargo.lock: 0 matches
grep -r bao-tree Cargo.toml across all crates: 0 matches
grep -r iroh-blake3 across all crates: 0 matches
Only remaining "blake3" text: documentation/comments explaining the migration rationale

Phase 0: Foundation - Poseidon2 Primitive Crate

Status: DONE (0a: 1517ebff54, 0b: 5aefbfedb3)

Goal: Build a standalone crate providing all Poseidon2 operations needed by the ecosystem.

Deliverable: hemera crate

Dual-backend: CPU via p3-poseidon2, GPU via custom wgpu compute shaders. Both backends produce identical output for the same input (same Poseidon2 permutation).

hemera/
  src/
    lib.rs          -- Public API, backend selection
    sponge.rs       -- Sponge construction over Goldilocks (rate=8, capacity=4)
    compression.rs  -- 2-to-1 compression: 8 elements in, 4 elements out
    kdf.rs          -- derive_key() with domain separation
    hasher.rs       -- Stateful incremental hasher + XOF (squeeze mode)
    encoding.rs     -- Byte-to-Goldilocks-element encoding (7 bytes -> 1 element)
    hazmat.rs       -- Low-level APIs for BAO tree construction (chunk_cv, parent_cv)
    params.rs       -- Frozen Poseidon2 parameters and round constants
    constants.rs    -- Domain separation constants
    cpu/
      mod.rs        -- CPU backend via p3-poseidon2
    gpu/
      mod.rs        -- GPU backend orchestration
      permutation.wgsl -- Poseidon2 permutation compute shader
      sponge.wgsl   -- Sponge absorb/squeeze compute shader
      tree.wgsl     -- Parallel Merkle tree construction shader

Parameters (frozen):

Field: Goldilocks (p = 2^64 - 2^32 + 1, 64-bit prime field)
State width: t = 12 (12 Goldilocks elements = 96 bytes)
Full rounds: R_F = 8 (4 initial + 4 final)
Partial rounds: R_P = 22
S-box exponent: d = 7 (x^7)
Rate: 8 elements (64 bytes per absorption)
Capacity: 4 elements (32 bytes, determines security level)
Output: 4 squeezed elements = 32 bytes

Security level: capacity = 4 * 64 = 256 bits -> 128-bit collision resistance

CPU backend: p3-poseidon2 from Plonky3 (native Goldilocks support). Always available. Used for relay handshake, small data, and as fallback.

GPU backend: wgpu compute shaders implementing the same Poseidon2 permutation. Used for bulk content hashing and BAO tree construction. Cross-platform: Vulkan (Linux/Windows/Android), Metal (macOS/iOS), DX12 (Windows), WebGPU (browser).

wgpu shader design for Poseidon2 permutation:

Each workgroup processes one permutation (12 Goldilocks elements)
Dispatch N workgroups to hash N chunks in parallel
Goldilocks multiplication: 64-bit multiply + Barrett reduction, with each element held as a pair of u32 values because core WGSL has no 64-bit integer type
S-box x^7: computed as x^2 * x^2 * x^2 * x (4 multiplications: one squaring to get x^2, then three products)
MDS matrix: 12x12 matrix-vector multiply per full round
Round constants loaded from uniform buffer (frozen, same for all invocations)
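
For reference, the field multiplication and S-box can be written on the CPU in a few lines. This is a plain reference sketch with hypothetical function names: it reduces a 128-bit product with `%` rather than the Barrett reduction named above, and a shader would need to emulate the 64-bit arithmetic with 32-bit limbs.

```rust
/// Goldilocks prime p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Multiply two field elements. This reference version reduces a 128-bit
/// product with `%`; optimized code would use a cheaper reduction.
fn mul(a: u64, b: u64) -> u64 {
    ((u128::from(a) * u128::from(b)) % u128::from(P)) as u64
}

/// S-box x -> x^7 using one squaring and three further multiplications.
fn sbox(x: u64) -> u64 {
    let x2 = mul(x, x); // x^2
    let x4 = mul(x2, x2); // x^4
    let x6 = mul(x4, x2); // x^6
    mul(x6, x) // x^7
}
```

A reference like this is useful mainly as the oracle for CPU/GPU equivalence tests: the shader's limb-based arithmetic must agree with it on every input.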

API specification:

// === Backend selection ===

pub enum Backend {
    Cpu,                    // p3-poseidon2, always available
    Gpu(GpuContext),        // wgpu compute shaders
}

pub struct GpuContext {
    device: wgpu::Device,
    queue: wgpu::Queue,
    pipelines: Pipelines,   // pre-compiled compute pipelines
}

impl GpuContext {
    /// Initialize GPU backend. Returns None if no GPU available.
    pub async fn new() -> Option<Self>;
}

// === High-level API (Blake3 equivalent) ===

/// Hash arbitrary bytes to a 32-byte digest.
/// Auto-selects backend: GPU for large inputs, CPU for small.
pub fn hash(input: &[u8]) -> [u8; 32];

/// Derive a key with domain separation (equivalent to blake3::derive_key).
/// Always CPU - single hash, not worth GPU dispatch overhead.
pub fn derive_key(context: &str, key_material: &[u8]) -> [u8; 32];

/// Keyed hash (MAC-like).
pub fn keyed_hash(key: &[u8; 32], input: &[u8]) -> [u8; 32];

/// Incremental hasher with XOF support.
pub struct Hasher { /* sponge state */ }
impl Hasher {
    pub fn new() -> Self;
    pub fn new_keyed(key: &[u8; 32]) -> Self;
    pub fn update(&mut self, data: &[u8]);
    pub fn finalize(&self) -> [u8; 32];
    pub fn finalize_xof(&self) -> OutputReader;
}

pub struct OutputReader { /* squeeze state */ }
impl OutputReader {
    pub fn fill(&mut self, buf: &mut [u8]);
}

// === Hazmat API (for BAO tree construction) ===

/// Hash a single chunk of data, returning a chaining value.
pub fn chunk_cv(data: &[u8], counter: u64, is_root: bool) -> [u8; 32];

/// Combine two child chaining values into a parent chaining value.
pub fn parent_cv(left: &[u8; 32], right: &[u8; 32], is_root: bool) -> [u8; 32];

// === GPU bulk operations (the performance win) ===

/// Hash all chunks of a blob in parallel on GPU, returning all chaining values.
/// This is the key operation that makes Poseidon2 competitive with Blake3.
pub async fn batch_chunk_cvs(
    gpu: &GpuContext,
    data: &[u8],
    chunk_size: usize,
) -> Vec<[u8; 32]>;

/// Build the full Merkle tree from chunk CVs on GPU (parallel tree levels).
pub async fn build_tree(
    gpu: &GpuContext,
    chunk_cvs: &[[u8; 32]],
) -> (Vec<[u8; 32]>, [u8; 32]); // (parent nodes, root hash)

Testing:

Determinism: same input always produces same output
CPU/GPU equivalence: CPU and GPU backends produce identical output for all inputs
Domain separation: hash(x) != derive_key(ctx, x) != keyed_hash(k, x)
Avalanche: single-bit input change flips ~50% of output bits
Reference vectors: cross-validate against Poseidon2 reference implementation
Collision resistance: property tests ensuring distinct inputs produce distinct outputs
GPU stress tests: large blobs (1 GB+), verify against CPU reference

Phase 1: Poseidon2-BAO - Verified Streaming Library

Status: DONE (2c91d5b9e2, b1f6de6ba2, c62b83fa9a). See Gap Analysis above.

Goal: Replace bao-tree and abao with Poseidon2-based equivalents.

Approach: Fork bao-tree and make it hash-agnostic, then plug in Poseidon2.

Architecture:

cyber-bao/  (or poseidon2-bao)
  src/
    tree.rs         -- BaoTree geometry (hash-agnostic, from bao-tree)
    node.rs         -- TreeNode traversal (hash-agnostic)
    ranges.rs       -- ChunkRanges, ByteRanges (hash-agnostic, reuse from bao-tree)
    hash_backend.rs -- Trait for pluggable hash (Poseidon2 implementation)
    encode.rs       -- Pre-order combined encoding
    outboard.rs     -- Outboard encoding (pre-order and post-order)
    decode.rs       -- Streaming verified decoder
    slice.rs        -- Slice extraction and verification
    io/
      sync.rs       -- Synchronous I/O operations
      fsm.rs        -- Async finite state machine I/O

The HashBackend trait (making bao-tree hash-agnostic):

pub trait HashBackend {
    type Hash: AsRef<[u8]> + Clone + Eq;

    /// Hash a data chunk at the given position.
    fn chunk_hash(&self, data: &[u8], counter: u64, is_root: bool) -> Self::Hash;

    /// Combine two child hashes into a parent hash.
    fn parent_hash(&self, left: &Self::Hash, right: &Self::Hash, is_root: bool) -> Self::Hash;

    /// Size of the hash output in bytes.
    fn hash_size(&self) -> usize;
}
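
To show how generic code can target the trait, here is a small stand-in backend. Everything beyond the trait itself is hypothetical: `ToyBackend` uses the standard library's non-cryptographic `DefaultHasher` with an 8-byte output purely so the example runs without dependencies, and `two_chunk_root` is an illustrative helper, not a cyber-bao function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Trait restated from the text above so the sketch compiles on its own.
pub trait HashBackend {
    type Hash: AsRef<[u8]> + Clone + Eq;
    fn chunk_hash(&self, data: &[u8], counter: u64, is_root: bool) -> Self::Hash;
    fn parent_hash(&self, left: &Self::Hash, right: &Self::Hash, is_root: bool) -> Self::Hash;
    fn hash_size(&self) -> usize;
}

/// Stand-in backend. Illustrative only: it is neither Poseidon2 nor
/// collision resistant.
pub struct ToyBackend;

impl ToyBackend {
    fn digest(tag: u8, is_root: bool, parts: &[&[u8]]) -> [u8; 8] {
        let mut h = DefaultHasher::new();
        h.write_u8(tag); // node type: 0 = chunk, 1 = parent
        h.write_u8(is_root as u8); // root flag
        for p in parts {
            h.write(p);
        }
        h.finish().to_le_bytes()
    }
}

impl HashBackend for ToyBackend {
    type Hash = [u8; 8];

    fn chunk_hash(&self, data: &[u8], counter: u64, is_root: bool) -> [u8; 8] {
        let counter_bytes = counter.to_le_bytes();
        Self::digest(0, is_root, &[&counter_bytes[..], data])
    }

    fn parent_hash(&self, left: &[u8; 8], right: &[u8; 8], is_root: bool) -> [u8; 8] {
        Self::digest(1, is_root, &[&left[..], &right[..]])
    }

    fn hash_size(&self) -> usize {
        8
    }
}

/// Generic code sees only the trait: root of a blob made of exactly two chunks.
pub fn two_chunk_root<B: HashBackend>(b: &B, first: &[u8], second: &[u8]) -> B::Hash {
    let left = b.chunk_hash(first, 0, false);
    let right = b.chunk_hash(second, 1, false);
    b.parent_hash(&left, &right, true)
}
```

Keeping the node type, counter, and root flag as explicit inputs is what lets the tree-walking code stay identical when the backend is swapped for a Poseidon2 one.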

Features that must be preserved (no functionality loss):

bao-tree Feature	Status	Notes
Combined encoding (pre-order)	Must implement	Hash function swapped
Outboard encoding (pre-order)	Must implement	Hash function swapped
Post-order outboard	Must implement	Hash function swapped
Slice extraction (single range)	Must implement	Tree logic unchanged
Multi-range slice extraction	Must implement	Tree logic unchanged
Configurable BlockSize (chunk groups)	Must implement	May help with performance
Streaming decode (sync)	Must implement	Verification uses Poseidon2
Streaming decode (async)	Must implement	Verification uses Poseidon2
8-byte length header	Must implement	Format unchanged
Validation of final chunk for length	Must implement	Security-critical

Performance mitigation via chunk group size:

Increasing the chunk group size reduces the number of Poseidon2 hash operations. With Goldilocks (t=12, rate=8) and the 7-byte-per-element encoding from Section 3.5, each permutation absorbs 56 input bytes, so a chunk group requires about ceil(group_size / 56) permutation calls, plus one permutation per parent node:

Chunk Group Size	Permutations per Chunk Group	Chunk Groups for 1 GB	Est. Time (Goldilocks)
1 KB (Blake3 default)	19	~1M	~50-100 sec
16 KB (bao-tree default)	293	~65K	~3-7 sec
64 KB	1171	~16K	~1-2 sec
256 KB	4682	~4K	~0.3-0.5 sec

Goldilocks advantage: 64-bit field arithmetic makes each permutation ~5-10x faster than BN254, bringing Poseidon2 hashing into a practical range especially with chunk groups >= 16 KB.

Tradeoff: Larger chunks mean coarser-grained verification. A 256 KB chunk group means the decoder must buffer up to 256 KB before verification. For streaming, 16-64 KB is likely the sweet spot balancing verification granularity and throughput.
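
The per-group counts can be reproduced with a short calculation. This sketch assumes the 7-byte-per-element encoding from Section 3.5 and a rate of 8, treats each chunk group as a single sponge input, and ignores counter, padding, and length elements. The function names are hypothetical.

```rust
/// Input bytes packed into one Goldilocks element (Section 3.5).
const BYTES_PER_ELEMENT: u64 = 7;
/// Elements absorbed per permutation (sponge rate).
const RATE: u64 = 8;

/// Integer division rounding up.
fn ceil_div(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Permutation calls needed to absorb one chunk group of `group_bytes`.
fn permutations_per_group(group_bytes: u64) -> u64 {
    ceil_div(ceil_div(group_bytes, BYTES_PER_ELEMENT), RATE)
}

/// Number of chunk groups (leaf hashes) for a blob of `blob_bytes`.
fn group_count(blob_bytes: u64, group_bytes: u64) -> u64 {
    ceil_div(blob_bytes, group_bytes).max(1)
}

/// Parent nodes in a binary tree over `groups` leaves (one permutation each).
fn parent_count(groups: u64) -> u64 {
    groups.saturating_sub(1)
}
```

For a 1 GiB blob with 16 KiB groups this gives 65,536 leaf hashes and 65,535 parent nodes, matching the order of magnitude in the table.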

Phase 2: Hash Type and Content Addressing (iroh-blobs)

Status: DONE — all 30+ source files migrated, 97 tests pass (0 failures).

Goal: Migrate the Hash type and all content addressing to Poseidon2.

Integration surface: 67 references to bao_tree/cyber_bao across 30 source files.

Sub-phase decomposition:

Sub-phase	Scope	Files	Depends on
2a Foundation types	`Hash` type, `BaoTree` imports	`hash.rs`, `store/util/size_info.rs`	Phase 1 gaps 1-3
2b Trait bridge	cyber-bao traits ↔ iroh-blobs I/O	`store/util/mem_or_file.rs`, `store/util/sparse_mem_file.rs`, `store/util/partial_mem_storage.rs`	2a
2c Outboard & storage	Core data path	`store/fs/bao_file.rs`, `store/fs/import.rs`, `store/fs/meta.rs`, `store/fs/entry_state.rs`, `store/fs.rs` (7 refs), `store/mem.rs`, `store/readonly_mem.rs`, `store/gc.rs`, `store/mod.rs`	2b
2d Protocol & streaming	Wire format	`protocol.rs` (5 refs), `protocol/range_spec.rs`, `get.rs` (11 refs), `get/request.rs`, `provider.rs`	2c
2e API layer	Public surface	`api.rs`, `api/blobs.rs`, `api/blobs/reader.rs`, `api/downloader.rs`, `api/remote.rs`, `api/proto.rs`, `api/proto/bitfield.rs`	2d
2f Tests	Integration tests	`tests.rs`, `util.rs`	2e

Order: 2a → 2b → 2c → 2d → 2e → 2f (bottom-up, each committable).

Changes in iroh-blobs:

Hash type: Change from wrapping blake3::Hash to wrapping a plain [u8; 32] that holds the Poseidon2 digest
Blob hashing: Replace blake3::Hasher with hemera::Hasher for computing blob hashes
BAO integration: Replace bao-tree with cyber-bao (Poseidon2 backend)
Transfer protocol: Wire format carries Poseidon2 tree nodes instead of Blake3 CVs
Storage: Re-index stored blobs under new Poseidon2 hashes
BlobFormat::HashSeq: Sequences now contain Poseidon2 hashes
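The Hash type change in the list above could look roughly like the following. This is a minimal sketch, not the actual iroh-blobs definition: the method names and the choice of lowercase hex for Display are assumptions made for illustration.

```rust
use std::fmt;
use std::str::FromStr;

/// A 32-byte digest that no longer depends on `blake3::Hash`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Hash([u8; 32]);

impl Hash {
    /// Wrap raw digest bytes (for example, the output of a Poseidon2 hasher).
    pub const fn from_bytes(bytes: [u8; 32]) -> Self {
        Self(bytes)
    }

    pub const fn as_bytes(&self) -> &[u8; 32] {
        &self.0
    }
}

impl fmt::Display for Hash {
    /// Assumed text form: 64 lowercase hex characters.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for b in &self.0 {
            write!(f, "{b:02x}")?;
        }
        Ok(())
    }
}

impl FromStr for Hash {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.len() != 64 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err("expected 64 hex characters".to_string());
        }
        let mut out = [0u8; 32];
        for (i, byte) in out.iter_mut().enumerate() {
            // Safe to slice by byte index: the input is known to be ASCII.
            *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16)
                .map_err(|e| e.to_string())?;
        }
        Ok(Hash(out))
    }
}
```

Keeping the wrapper at exactly 32 bytes means storage keys and wire fields sized for Blake3 hashes keep their layout; only the bytes inside change.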

CID (Content Identifier) format:

To enable algorithm agility and future migration capability, use an algorithm-tagged CID format:

CID = <version><hash-algorithm-code><digest-length><digest-bytes>

Where hash-algorithm-code distinguishes:

0x1e (or custom code) = Blake3 (legacy)
0x?? (new code) = Poseidon2-Goldilocks (current)

This allows the network to handle both Blake3 and Poseidon2 content identifiers during a transition period.
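A minimal sketch of the tagged layout follows, assuming each header field is a single byte. That is a simplification: real multiformats CIDs use varints, and the type and error names here are hypothetical. The Poseidon2 code is deliberately not defined because the plan has not assigned one yet; callers pass whatever code is eventually chosen.

```rust
/// Multicodec code for Blake3 (legacy content).
pub const BLAKE3_CODE: u8 = 0x1e;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cid {
    pub version: u8,
    pub algo: u8,
    pub digest: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CidError {
    Truncated,
    LengthMismatch,
    DigestTooLong,
}

impl Cid {
    /// Layout: <version><hash-algorithm-code><digest-length><digest-bytes>.
    pub fn encode(&self) -> Result<Vec<u8>, CidError> {
        let len = u8::try_from(self.digest.len()).map_err(|_| CidError::DigestTooLong)?;
        let mut out = Vec::with_capacity(3 + self.digest.len());
        out.extend_from_slice(&[self.version, self.algo, len]);
        out.extend_from_slice(&self.digest);
        Ok(out)
    }

    /// Parse the layout above, rejecting short input and wrong digest lengths.
    pub fn decode(bytes: &[u8]) -> Result<Self, CidError> {
        let [version, algo, len, digest @ ..] = bytes else {
            return Err(CidError::Truncated);
        };
        if digest.len() != usize::from(*len) {
            return Err(CidError::LengthMismatch);
        }
        Ok(Cid { version: *version, algo: *algo, digest: digest.to_vec() })
    }
}
```

During the transition period a node can branch on `algo` after decoding: `BLAKE3_CODE` routes to the legacy verifier, and the new code routes to the Poseidon2 path.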

Data migration for existing content:

Per the hash function selection rationale, migration at planetary scale requires:

A functional storage proof system guaranteeing content availability
Rehashing original content under Poseidon2
A dual-CID transition period
Estimated rehashing time at scale (10^15 particles): ~17 hours across 10^6 nodes

Phase 3: Relay Handshake Migration (this repo)

Status: DONE (07e31dd236)

Goal: Replace Blake3 in the relay handshake protocol.

Files to modify in iroh-relay:

Cargo.toml: Replace blake3 = "1.8.2" with hemera dependency
src/protos/handshake.rs:215: blake3::derive_key(...) -> hemera::derive_key(...)
src/protos/handshake.rs:312: Update comment referencing blake3
src/protos/handshake.rs:594-598: Replace test keying material simulation

Protocol version bump:

// Bump version to signal Poseidon2:
const DOMAIN_SEP_CHALLENGE: &str = "iroh-relay handshake v2 challenge signature";
const DOMAIN_SEP_TLS_EXPORT_LABEL: &[u8] = b"iroh-relay handshake v2";

Wire compatibility: This is a breaking change. Old and new clients and servers cannot interoperate, so a coordinated deployment is required.

Phase 4: Higher-Level Protocol Migration

Status: DONE (3/5) — iroh-docs, iroh-gossip, iroh-car migrated and added to workspace. iroh-willow (v0.34) and iroh-ffi (v0.35) deferred due to major version incompatibility with workspace (iroh 0.96).

Goal: Update all protocols that reference Blake3 hashes.

Component	Change Required
iroh-docs	Update Hash references to new Poseidon2 Hash type
iroh-gossip	Update any blob hash references
iroh-willow	Update content hash references
iroh-car	Update CID encoding to use Poseidon2 hash code
iroh-ffi	Update FFI bindings for new Hash type

Phase 5: Remove Blake3 Dependencies

Status: DONE (this workspace) — 0 blake3 crate dependencies in any Cargo.toml or Cargo.lock.

Goal: Complete removal of all Blake3 dependencies across the ecosystem.

Remove blake3 from all Cargo.toml files
Remove iroh-blake3 custom fork (entire repository archived)
Remove bao-tree (replaced by cyber-bao)
Remove abao (async functionality integrated into cyber-bao)
Verify: grep -r blake3 across all repositories returns zero results
Update all documentation, READMEs, CHANGELOGs

Phase 6: Validation

Status: DONE (this workspace) — 307 tests pass across all crates, 0 failures.

Goal: Verify zero functionality loss.

Test matrix:

Test Category	What to Verify
Content hashing	Same content produces deterministic Poseidon2 hash
Verified streaming (combined)	Full blob download with per-chunk verification
Verified streaming (outboard)	Outboard metadata + original file verification
Slice extraction	Single-range and multi-range slice correctness
Range requests	Arbitrary byte range verified against root hash
Streaming decode (sync)	Sequential verified decode produces original data
Streaming decode (async)	Async verified decode produces original data
Post-order outboard	Append-only file sync with incremental verification
Relay handshake (challenge)	Client-server auth via signed challenge
Relay handshake (keying material)	Client-server auth via TLS keying material
Relay handshake (fallback)	Mismatched keying material falls back to challenge
Relay handshake (denial)	Authentication denial works correctly
Hash type serialization	Hash serializes/deserializes correctly (serde, Display, hex)
BlobFormat::Raw	Single blob addressed by Poseidon2 hash
BlobFormat::HashSeq	Collection of Poseidon2 hashes
CID encoding	Algorithm-agile CID correctly tags Poseidon2
Storage re-indexing	Blobs re-indexed under new hashes
Deduplication	Identical content deduplicates via Poseidon2 hash
Large blob transfer	Multi-GB blob transfer with streaming verification
Partial blob transfer	Resume interrupted transfer with range requests

Benchmarks:

Benchmark	Measure
Blob hashing throughput (1 KB - 1 GB)	Bytes/sec for Poseidon2 vs Blake3
Verified streaming throughput	Download speed with per-chunk verification
Slice extraction latency	Time to extract a range from a large blob
Relay handshake latency	End-to-end handshake time
Tree construction time	Time to build outboard for various blob sizes

6. Risk Assessment

Risk	Likelihood	Impact	Mitigation
Poseidon2-BAO has no precedent	Certain	High	Thorough design review; reuse bao-tree tree logic
CPU-only path too slow for large blobs	Certain	Medium	GPU backend via wgpu; CPU is fallback, not primary path for bulk hashing
wgpu GPU shader produces different results than CPU	Medium	Critical	Exhaustive cross-validation tests; same frozen constants in both backends
wgpu not available on some platforms/hardware	Low	Medium	CPU fallback always available; wgpu covers Vulkan/Metal/DX12/WebGPU
Goldilocks 64-bit multiply precision on GPU (u32x2)	Medium	High	Careful Barrett reduction implementation; test against CPU reference
Poseidon2 sponge wrapper has cryptographic bug	Medium	Critical	Security audit; reference vector validation; property testing
Field element encoding produces non-canonical or non-uniform bytes	Medium	High	Strict canonical encoding spec; exhaustive edge case testing
All existing content identifiers become invalid	Certain	Critical	Dual-CID transition period; storage proof-backed migration
Poseidon2 parameters (t=12, R_F=8, R_P=22, d=7) later found weak	Low	Catastrophic	Parameters from peer-reviewed analysis; frozen at genesis
Ecosystem-wide migration takes too long	High	Medium	Phase incrementally; relay handshake (Phase 3) can ship independently
Downstream consumers break (iroh-docs, iroh-gossip, etc.)	Certain	Medium	Coordinate releases; version-bump all crates simultaneously
bao-tree internals too tightly coupled to Blake3	Medium	High	Investigate bao-tree source early; may need significant refactoring

7. Dependency Graph and Ordering

Phase 0a: hemera CPU backend (p3-poseidon2 wrapper)
    |
    +---> Phase 0b: hemera GPU backend (wgpu shaders) [parallel with 0a]
    |
    +---> Phase 1: cyber-bao (verified streaming) [needs 0a, benefits from 0b]
    |         |
    |         +---> Phase 2: iroh-blobs (content addressing)
    |                   |
    |                   +---> Phase 4: iroh-docs, iroh-gossip, iroh-willow, iroh-car
    |
    +---> Phase 3: iroh-relay (handshake) [needs only 0a, parallel with everything]
    |
    +---> Phase 5: Remove all Blake3 deps [after all above complete]
    |
    +---> Phase 6: Validation [after Phase 5]

Phase 0a and 0b can proceed in parallel (CPU and GPU backends are independent)
Phase 3 (relay handshake) needs only CPU backend, can ship first
Phase 1 (cyber-bao) needs CPU backend and benefits from GPU for bulk tree construction
GPU backend can be refined throughout later phases
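The ordering constraints above can be checked mechanically. The sketch below encodes one reading of the graph (Phase 5 waits on Phases 3 and 4, which transitively cover 0a, 1, and 2; Phase 0b blocks nothing) and derives a valid serial order. The function names and the dependency table are illustrative, not part of any crate.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Phase -> phases that must finish first (one reading of the graph).
pub fn phase_deps() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("0a", vec![]),
        ("0b", vec![]), // GPU backend: parallel with 0a
        ("1", vec!["0a"]),
        ("2", vec!["1"]),
        ("3", vec!["0a"]),
        ("4", vec!["2"]),
        ("5", vec!["3", "4"]),
        ("6", vec!["5"]),
    ])
}

/// Return a serial order that respects every dependency,
/// or `None` if the graph has a cycle or an unknown dependency.
pub fn schedule(
    deps: &BTreeMap<&'static str, Vec<&'static str>>,
) -> Option<Vec<&'static str>> {
    let mut done: BTreeSet<&'static str> = BTreeSet::new();
    let mut order = Vec::new();
    while order.len() < deps.len() {
        // Pick the first unfinished phase whose dependencies are all done.
        let next = deps
            .iter()
            .find(|(p, d)| !done.contains(*p) && d.iter().all(|x| done.contains(x)))
            .map(|(p, _)| *p)?;
        done.insert(next);
        order.push(next);
    }
    Some(order)
}
```

Any phase whose dependencies are already in `done` at the same step could run in parallel; the function simply picks the first one in name order to keep the output deterministic.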

8. Decided Parameters

Parameter	Value	Rationale
Field	Goldilocks (p = 2^64 - 2^32 + 1)	64-bit native arithmetic, Plonky3-native, fast CPU/GPU
State width (t)	12	12 Goldilocks elements = 96 bytes state
Full rounds (R_F)	8	4 initial + 4 final, standard security margin
Partial rounds (R_P)	22	Balances security and performance
S-box exponent (d)	7	x^7; 7 is the smallest exponent coprime to p - 1, so x^7 is a permutation of the Goldilocks field
Rate	8 elements (64 bytes)	Absorb 64 bytes per permutation
Capacity	4 elements (32 bytes)	256-bit capacity -> 128-bit collision resistance
Output	4 elements (32 bytes)	Matches Blake3 output size
Implementation	`p3-poseidon2` (Plonky3)	Native Goldilocks, production-ready, MIT/Apache-2.0
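The table's internal consistency can be pinned down as compile-time constants. This is a sketch with hypothetical names, not the hemera or p3-poseidon2 API; it only restates the values above and checks the arithmetic between them.

```rust
/// Goldilocks prime: p = 2^64 - 2^32 + 1.
pub const GOLDILOCKS_P: u64 = 0xFFFF_FFFF_0000_0001;
pub const WIDTH: usize = 12; // state width t
pub const RATE: usize = 8; // elements absorbed per permutation
pub const CAPACITY: usize = 4;
pub const OUTPUT_ELEMENTS: usize = 4;
pub const FULL_ROUNDS: usize = 8; // R_F: 4 initial + 4 final
pub const PARTIAL_ROUNDS: usize = 22; // R_P
pub const SBOX_DEGREE: u64 = 7; // d
pub const ELEMENT_BYTES: usize = 8;

// Compile-time checks: the build fails if the table is edited inconsistently.
const _: () = assert!(RATE + CAPACITY == WIDTH);
const _: () = assert!(RATE * ELEMENT_BYTES == 64);
const _: () = assert!(OUTPUT_ELEMENTS * ELEMENT_BYTES == 32);
const _: () = assert!(FULL_ROUNDS % 2 == 0);

pub fn state_bytes() -> usize {
    WIDTH * ELEMENT_BYTES
}

/// Nominal collision resistance in bits: half the capacity width,
/// counting each element as 64 bits as the table does.
pub fn nominal_collision_bits() -> usize {
    CAPACITY * ELEMENT_BYTES * 8 / 2
}

fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        (a, b) = (b, a % b);
    }
    a
}

/// x^d is a permutation of the field exactly when gcd(d, p - 1) = 1.
pub fn sbox_is_permutation(d: u64) -> bool {
    gcd(d, GOLDILOCKS_P - 1) == 1
}
```

Since p - 1 = 2^32 * 3 * 5 * 17 * 257 * 65537, the exponents 3 and 5 fail the permutation check and 7 is the smallest odd exponent that passes.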

9. Open Questions

Chunk group size: What is the optimal chunk group size to balance GPU dispatch efficiency vs verification granularity? 16 KB? 64 KB? (see Phase 1 performance table)
bao-tree refactoring strategy: Fork and modify, or rewrite from scratch with hash-agnostic design? Need to assess how tightly bao-tree is coupled to Blake3 internals.
CID multicodec: What multicodec code to register for Poseidon2-Goldilocks (t=12, R_F=8, R_P=22, d=7)?
Transition period duration: How long must the network support both Blake3 and Poseidon2 content identifiers?
Storage proofs: The migration plan from cyber.page requires storage proofs as a prerequisite for content migration. What is the timeline for storage proof implementation?
wgpu shader precision: Goldilocks field multiplication requires 64-bit precision, but WGSL only provides 32-bit integers, so u64 arithmetic must be implemented as u32x2 in WGSL. Need to validate correctness across all GPU vendors (AMD, NVIDIA, Apple, Intel).
wgpu WebGPU target: wgpu compute shaders on WASM/WebGPU - verify that the Poseidon2 permutation works in a browser context. This would make the iroh-relay WASM client capable of Poseidon2 hashing.
Round constant generation: Must generate and freeze round constants for (t=12, R_F=8, R_P=22, d=7) over Goldilocks. Use Plonky3's default constant generation or derive independently?
Backward compatibility for existing stored content: Must existing nodes rehash all stored blobs, or can they serve legacy Blake3-addressed content alongside new Poseidon2-addressed content?
GPU memory limits: For very large blobs (10+ GB), GPU buffer size limits may require chunked dispatch. Design the batch API to handle streaming GPU dispatch.
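For the shader precision question, the core difficulty is building a 64x64 -> 128-bit product out of 32-bit operations only. The sketch below does this in Rust with u32 arithmetic exclusively, so each line maps onto what a WGSL shader could express. It is a CPU-side reference for cross-validation, not shader code: the function names are hypothetical, and the Goldilocks modular reduction that would follow the multiply is omitted.

```rust
/// 32x32 -> 64-bit product as (hi, lo), using only u32 operations
/// by splitting each operand into 16-bit halves.
pub fn mul_wide_u32(a: u32, b: u32) -> (u32, u32) {
    let (a_lo, a_hi) = (a & 0xFFFF, a >> 16);
    let (b_lo, b_hi) = (b & 0xFFFF, b >> 16);
    let ll = a_lo * b_lo; // each partial product fits in 32 bits
    let lh = a_lo * b_hi;
    let hl = a_hi * b_lo;
    let hh = a_hi * b_hi;
    // Middle column: at most 3 * 0xFFFF, so no overflow.
    let mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);
    let lo = (ll & 0xFFFF) | (mid << 16);
    let hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
    (hi, lo)
}

/// Add two u32 values, returning (sum mod 2^32, carry bit).
fn add_carry(a: u32, b: u32) -> (u32, u32) {
    let s = a.wrapping_add(b);
    (s, (s < a) as u32)
}

/// 64x64 -> 128-bit product. Inputs are [lo, hi] limb pairs (u32x2);
/// the result is four little-endian u32 limbs.
pub fn mul_u64_limbs(a: [u32; 2], b: [u32; 2]) -> [u32; 4] {
    let (p00_hi, p00_lo) = mul_wide_u32(a[0], b[0]);
    let (p01_hi, p01_lo) = mul_wide_u32(a[0], b[1]);
    let (p10_hi, p10_lo) = mul_wide_u32(a[1], b[0]);
    let (p11_hi, p11_lo) = mul_wide_u32(a[1], b[1]);

    let r0 = p00_lo;
    // Limb 1: three addends, carry out is at most 2.
    let (s, c_a) = add_carry(p00_hi, p01_lo);
    let (r1, c_b) = add_carry(s, p10_lo);
    let c1 = c_a + c_b;
    // Limb 2: three addends plus the incoming carry.
    let (s, c_c) = add_carry(p01_hi, p10_hi);
    let (s, c_d) = add_carry(s, p11_lo);
    let (r2, c_e) = add_carry(s, c1);
    let c2 = c_c + c_d + c_e;
    // Limb 3: cannot overflow because the full product fits in 128 bits.
    let r3 = p11_hi.wrapping_add(c2);
    [r0, r1, r2, r3]
}
```

A cross-validation test can feed the same operands to this function and to native `u128` multiplication, then run the identical limb sequence through the shader on each GPU vendor and compare all three.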

10. Estimated Effort

Phase	Scope	Estimated Effort
Phase 0a: hemera CPU backend	p3-poseidon2 wrapper, sponge, KDF, hazmat API	2-3 weeks
Phase 0b: hemera GPU backend	wgpu compute shaders (WGSL), Goldilocks u64 arithmetic on GPU, cross-validation	3-5 weeks
Phase 1: cyber-bao (verified streaming)	Fork + refactor bao-tree, hash-agnostic design, Poseidon2 backend	4-8 weeks
Phase 2: iroh-blobs migration	Modify existing crate, Hash type change	2-4 weeks
Phase 3: iroh-relay handshake	2-3 file changes	1-2 days
Phase 4: Higher-level protocols	Cascading Hash type updates	1-2 weeks
Phase 5: Blake3 removal	Dependency cleanup	1-2 days
Phase 6: Validation	Testing, benchmarking, GPU vendor testing, security review	3-5 weeks
Total		~16-28 weeks

11. References

Appendix A: Iroh License Analysis

Additionally, some files derived from Tailscale are under BSD 3-Clause (specifically iroh/src/socket** and iroh-relay socket code).

What This Means for the Migration

Aspect	Implication
Forking iroh	Fully permitted under both MIT and Apache-2.0
Modifying and redistributing	Permitted; Apache-2.0 requires marking modified files
Using in proprietary products	Permitted under both licenses
Patent protection	Apache-2.0 provides explicit patent grant from contributors
Sublicensing	MIT allows sublicensing; Apache-2.0 allows redistribution under different terms
Tailscale-derived code (BSD-3)	Cannot use N0/Tailscale names to endorse derivatives without permission
p3-poseidon2 compatibility	MIT/Apache-2.0 - fully compatible with iroh's license
Obligation	Must include copyright notice and license text in distributions

No license conflicts with the migration: p3-poseidon2 (Plonky3) is MIT/Apache-2.0, identical to iroh's own license. The zkhash crate is also MIT/Apache-2.0.

The dual MIT/Apache-2.0 license is the standard Rust ecosystem license and is maximally permissive for both open-source and commercial use.