GPU backend — wgpu compute shaders for hemera
full Poseidon2 permutation implemented in WGSL compute shaders. batch operations dispatch thousands of hashes in parallel via wgpu. the GPU backend is optional — CPU is always the fallback.
challenge: u64 in WGSL
WGSL has no native 64-bit integers. Goldilocks arithmetic (p = 2^64 - 2^32 + 1) requires u64. the solution: emulate u64 as vec2<u32> (lo, hi) with custom add, sub, mul, and modular reduction.
```wgsl
// Goldilocks field element = vec2<u32> (lo, hi)
fn gl_add(a: vec2<u32>, b: vec2<u32>) -> vec2<u32>
fn gl_sub(a: vec2<u32>, b: vec2<u32>) -> vec2<u32>
fn gl_mul(a: vec2<u32>, b: vec2<u32>) -> vec2<u32>
fn gl_inv(a: vec2<u32>) -> vec2<u32>        // field inversion via Fermat
fn gl_pow7(a: vec2<u32>) -> vec2<u32>       // x⁷ S-box
fn gl_reduce(lo: u32, hi: u32) -> vec2<u32> // modular reduction
```
multiplication uses the schoolbook 64×64→128-bit method on 32-bit limbs, then reduces mod p using the Goldilocks structure: 2^64 ≡ 2^32 - 1 (mod p).
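the reduction can be sanity-checked on the CPU. the sketch below uses a native u128 intermediate instead of the shader's 32-bit limbs, but applies the same identities (2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 mod p); the function names mirror the WGSL ones, but this is not the shader code:

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks: 2^64 - 2^32 + 1
const EPSILON: u64 = 0xFFFF_FFFF;     // 2^32 - 1, i.e. 2^64 mod p

// reduce a 128-bit product into the Goldilocks field
fn gl_reduce(x: u128) -> u64 {
    let lo = x as u64;        // bits 0..64
    let hi = (x >> 64) as u64;
    let hi_lo = hi & EPSILON; // bits 64..96:  weight 2^64  ≡ 2^32 - 1 (mod p)
    let hi_hi = hi >> 32;     // bits 96..128: weight 2^96  ≡ -1      (mod p)

    // t = lo - hi_hi (mod p)
    let (mut t, borrow) = lo.overflowing_sub(hi_hi);
    if borrow {
        t = t.wrapping_sub(EPSILON); // the wrap added 2^64 ≡ EPSILON; take it back
    }
    // r = t + hi_lo * (2^32 - 1) (mod p); the product fits in u64
    let (mut r, carry) = t.overflowing_add(hi_lo * EPSILON);
    if carry {
        r = r.wrapping_add(EPSILON); // the wrap dropped 2^64 ≡ EPSILON; add it back
    }
    if r >= P {
        r -= P;
    }
    r
}

// 64×64→128-bit multiply, then reduce
fn gl_mul(a: u64, b: u64) -> u64 {
    gl_reduce((a as u128) * (b as u128))
}
```

the WGSL version performs the same arithmetic with vec2<u32> limbs and explicit carry propagation, since neither u64 nor u128 exists there.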
compute pipelines
| pipeline | entry point | workgroup size | purpose |
|---|---|---|---|
| permute | hemera_permute | 64 | raw Poseidon2 permutation |
| hash_leaf | hemera_hash_leaf | 64 | sponge + leaf domain flags |
| hash_node | hemera_hash_node | 64 | parent node hashing |
| hash_chunk | hemera_hash_chunk | 64 | sponge hash (no tree domain) |
| keyed_hash | hemera_keyed_hash | 64 | keyed sponge hash |
| derive_key_material | hemera_derive_key_material | 64 | key derivation phase 2 |
| hash_node_nmt | hemera_hash_node_nmt | 64 | namespace-aware node hashing |
all pipelines share the same bind group layout: input buffer, output buffer, round constants buffer, diagonal buffer, params buffer.
GPU tree hashing
root_hash() computes a full Merkle tree on GPU:
- dispatch `hash_leaf` for all chunks in parallel → leaf hashes
- iteratively dispatch `hash_node` to merge pairs, level by level
- the final single hash is the root
for a 1 GB file (~262,144 chunks): 19 GPU dispatches (1 leaf dispatch + 18 merge levels, since 262,144 = 2^18).
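the count follows from the tree height: one `hash_leaf` dispatch plus one `hash_node` dispatch per merge level, assuming the chunk count is padded up to a power of two (a sketch; the function is hypothetical, not part of the crate):

```rust
// one hash_leaf dispatch, plus one hash_node dispatch per merge level;
// a tree over n chunks (padded to a power of two) has log2(n) merge levels
fn dispatch_count(n_chunks: u64) -> u32 {
    assert!(n_chunks > 0);
    1 + n_chunks.next_power_of_two().trailing_zeros()
}
```

for 262,144 = 2^18 chunks this gives 1 + 18 = 19 dispatches.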
batch operations
```rust
let gpu = GpuContext::new().await?;

// batch hash — thousands of chunks in one dispatch
let hashes = gpu.batch_hash(&chunks).await;

// batch keyed hash
let macs = gpu.batch_keyed_hash(&key, &messages).await;

// batch key derivation
let keys = gpu.batch_derive_key(&master_key, &contexts).await;

// batch proof verification
let results = gpu.batch_verify_proofs(&proofs).await;

// batch XOF squeeze
let outputs = gpu.batch_squeeze(&seeds, out_len).await;
```

(argument names are illustrative)
outboard on GPU
outboard() uses the GPU for leaf hashing (the parallelizable part), then builds the tree structure on CPU:
```rust
let outboard = gpu.outboard(&data).await;
```
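the CPU half can be sketched as follows. this is a minimal sketch with a stand-in XOR node hash in place of the real Poseidon2 compression; it assumes leaf hashes arrive from the GPU as a flat vector, and odd-node promotion is an assumption here, not a documented property of the tree:

```rust
type Hash = [u8; 32];

// stand-in for the real Poseidon2 parent-node hash (XOR, for illustration only)
fn hash_node_stub(l: &Hash, r: &Hash) -> Hash {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = l[i] ^ r[i];
    }
    out
}

// build all interior tree levels on the CPU from GPU-produced leaf hashes;
// an unpaired node at the end of a level is promoted unchanged
fn build_levels(leaves: Vec<Hash>) -> Vec<Vec<Hash>> {
    let mut levels = vec![leaves];
    while levels.last().unwrap().len() > 1 {
        let prev = levels.last().unwrap();
        let next: Vec<Hash> = prev
            .chunks(2)
            .map(|p| if p.len() == 2 { hash_node_stub(&p[0], &p[1]) } else { p[0] })
            .collect();
        levels.push(next);
    }
    levels
}
```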
progress reporting
root_hash_with_progress reports completion percentage:
```rust
let root = gpu
    .root_hash_with_progress(&data, |pct| eprintln!("{pct}%"))
    .await;
```

(the progress-callback shape is illustrative)
shader modules
| file | responsibility |
|---|---|
| params.wgsl | constants — field, sponge, tree parameters, capacity indices |
| field.wgsl | Goldilocks arithmetic — u64 emulation via vec2<u32> |
| encoding.wgsl | bytes ↔ field element conversion (7-byte chunks) |
| permutation.wgsl | Poseidon2 permutation — full/partial rounds, S-box dispatch |
| sponge.wgsl | sponge hash, keyed hash, derive-key material |
| tree.wgsl | leaf/node hashing with domain flags |
| entry_points.wgsl | workgroup dispatch for all operations |
fallback behavior
GpuContext::new() returns None if no compute-capable GPU adapter is found. the CLI detects this and falls back to CPU. both backends produce identical output — cross-verified by 45 tests.
implementation
- `wgsl/src/lib.rs` — GpuContext, all batch methods
- `wgsl/src/shaders/` — WGSL compute shaders
- `wgsl/tests/` — 39 GPU integration tests
- CLI: `--gpu` / `--cpu` flags
see inversion-sbox for the x⁻¹ partial S-box implemented in WGSL