partial-round collapse

states [1..15] evolve linearly during all 16 partial rounds. precompute the composed linear map. break the sequential dependency for 15 of 16 state elements. ~4× prover wall-clock speedup.

the observation

in partial rounds, only state[0] gets the nonlinear x⁻¹ S-box. states [1..15] only get linear mixing via M_I:

partial round i:
  state[0] = (state[0] + RC[i])⁻¹       ← nonlinear (sequential)
  state = M_I × state                    ← linear for [1..15]

the linear evolution of states [1..15] across all 16 rounds can be expressed as:

state_final[1..15] = L₁₆ × state_initial[1..15] + f(state[0] values at each round)

where:
  L₁₆ = M_I^16 restricted to indices [1..15]   (15×15 constant matrix)
  f = contribution of state[0] at each round     (depends on x⁻¹ values)

prover advantage

current: 16 sequential rounds, each touching all 16 elements
  prover must evaluate M_I × state at every round
  16 × 16 multiplications per round = 4,352 total field muls

collapsed: precompute L₁₆ (constant matrix), apply once
  L₁₆ is a 15×15 matrix — precomputed at compile time
  ONE matrix-vector multiply: 15 × 15 = 225 field muls
  + 16 sequential x⁻¹ on state[0]: 16 × 64 = 1,024 field muls
  + 16 injection steps: 16 × 15 = 240 field muls
  total: ~1,489 field muls

prover speedup: 4,352 / 1,489 ≈ 2.9×

the constraint count is similar (linear constraints in both encodings), but prover wall-clock drops because the sequential dependency is broken for 15 of 16 elements.

SIMD optimization

the L₁₆ matrix-vector multiply is embarrassingly parallel. 15 independent dot products of length 15. maps perfectly to SIMD (4 lanes) or GPU (15 parallel threads).

sequential: 225 field muls, 15 sequential accumulations
SIMD (4-wide): ~60 field muls wall-clock
GPU: ~15 field muls wall-clock

interaction with batched proving

partial-round collapse composes with batched proving (batched-proving): for N hemera calls, the L₁₆ application is N independent matrix-vector multiplies. all N can run in parallel. the sequential bottleneck is only the N × 16 x⁻¹ applications on state[0].

open questions

L₁₆ numerical stability: the 15×15 matrix is a 16th power of M_I restricted to [1..15]. need to verify that the entries remain in canonical Goldilocks range (they will, since M_I entries are small)
constraint encoding: does the collapsed form produce different constraint structure than round-by-round? the verifier still checks the same relation, but the CCS layout may need adaptation

see inversion-sbox for the x⁻¹ S-box, batched-proving for multi-instance optimization

Dimensions

partial-round-collapse

hemera/reference/props/partial-round-collapse.md