binary jets — F₂/Binius prover acceleration
eight jets for the Bt binary prover. the base operations (XOR, AND, NOT, SHL) are already 1 constraint each in F₂ — they don't need jets. jets target composite operations that appear millions of times in quantized inference and tri-kernel SpMV.
the core loop
both primary Bt workloads — quantized AI inference and tri-kernel SpMV — reduce to the same kernel:
binary matrix-vector multiply:
for each row r of weight matrix W (binary):
result[r] = popcount(W[r] AND input_vector)
popcount = count set bits = the accumulation step
XOR and AND are free in F₂ (1 constraint). popcount is the bottleneck: summing bits requires carry propagation, which is NOT free in F₂. a naive n-bit popcount requires a tree of full adders — each full adder is 5 binary constraints (2 XOR + 2 AND + 1 XOR for carry). total: ~5n constraints for n-bit popcount.
jets compress this.
jet 0: popcount
popcount(x) → count
input: x ∈ F₂^n (packed binary vector, n ≤ 128)
output: count ∈ Z_{log₂(n)+1} (integer, log₂(n)+1 bits wide)
count the number of set bits in a packed binary word.
algorithm: parallel prefix popcount (SWAR)
step 1: sum pairs (n/2 partial sums, 2 bits each)
step 2: sum quads (n/4 partial sums, 3 bits each)
step 3: sum octets (n/8 partial sums, 4 bits each)
...
step k: final sum (1 result, log₂(n)+1 bits)
total: ~n constraints (each step is n/2 additions at growing width)
constraint encoding
naive full-adder tree:
n-1 full adders × 5 constraints each = ~5n constraints
for n=128: ~640 constraints
jet parallel prefix:
log₂(n) stages × n/2 wired additions = ~n constraints
for n=128: ~128 constraints
but the additions grow wider at each stage. careful encoding:
stage 1: n/2 × 1-bit additions (XOR + AND = 2 constraints each) = n
stage 2: n/4 × 2-bit additions (4 constraints each) = n
...
stage k: 1 × (k-1)-bit addition = ~2k constraints
total across all stages: ~n × log₂(n) / 2 constraints
for n=128: ~448 constraints
improvement over naive: ~640/448 ≈ 1.4× (modest)
the real win is not constraint count but PROVER EFFICIENCY: the parallel prefix structure maps perfectly to Binius's packed SIMD operations. the prover evaluates all n/2 partial sums in one packed u128 operation.
prover speedup:
naive: 640 sequential constraint evaluations
jet: 7 packed u128 SIMD operations (log₂(128) stages)
prover speedup: ~90× wall-clock
- pure equivalent: ~640 patterns (full-adder tree)
- jet cost: ~128 (7 SIMD stages)
- binary constraints: ~448
- accelerates: every binary accumulation — matmul, hamming distance, threshold
the single most important Bt jet. every binary matmul row ends with popcount.
jet 1: packed_inner_product
packed_inner_product(a, b) → count
input: a, b ∈ F₂^n (packed binary vectors)
output: count = popcount(a AND b) ∈ Z_{log₂(n)+1}
inner product of two binary vectors. this IS the matmul kernel: one call per matrix row.
implementation:
1. c = a AND b (n constraints, but AND is free in F₂ = 0 constraints)
2. count = popcount(c) (jet 0: ~448 constraints for n=128)
total: ~448 constraints
AND is literally free (it IS multiplication in F₂). the entire cost is popcount.
- pure equivalent: n AND + ~5n popcount = ~5n constraints
- jet cost: ~128
- binary constraints: ~448
- accelerates: binary matmul kernel, hamming distance, correlation
jet 2: binary_matvec
binary_matvec(W, x) → y
input: W ∈ F₂^{m×n} (weight matrix), x ∈ F₂^n (input vector)
output: y ∈ Z^m where y[i] = popcount(W[i] AND x)
batched matrix-vector multiply. the top-level jet for inference and tri-kernel.
naive: m separate packed_inner_product calls
constraint cost: m × 448 = 448m
jet optimization — shared structure:
1. commit x once (shared across all m rows)
2. W rows packed into m × (n/128) u128 words
3. AND is free → m×n binary constraints = 0
4. m popcount calls with shared commitment
constraint savings from shared input commitment:
without sharing: m × (n + 448) ≈ m × 576 (for n=128)
with sharing: n + m × 448 ≈ m × 449 (for n=128)
marginal savings: ~22% (from amortized input commitment)
the bigger win is prover efficiency:
4096×4096 binary matmul (one inference layer):
naive patterns: 4096 × 4096 × 5 = ~84M binary constraints
with popcount jet: 4096 × 448 = ~1.8M binary constraints
with matvec jet: 4096 × 449 + 4096 = ~1.84M binary constraints
equivalent F_p cost (without Binius): 84M × 32 = ~2.7B F_p constraints
binary jet cost: ~1.84M binary constraints
total speedup vs F_p: ~1,400×
- pure equivalent: m × ~5n constraints
- jet cost: m × ~128
- binary constraints: m × ~448 + n
- accelerates: quantized inference layers, tri-kernel SpMV iterations
jet 3: quantize
quantize(v, k) → bits
input: v ∈ F_p (Goldilocks field element), k ∈ Z (bit width, k ≤ 64)
output: bits ∈ F₂^k (binary representation)
the bridge from Goldilocks to binary. fires at every F_p → F₂ algebra crossing.
constraint encoding:
prove: Σ_{i=0}^{k-1} bits[i] × 2^i = v (range check in F_p)
prove: each bits[i] ∈ {0, 1} (free in F₂, 1 constraint each in F_p)
the range check is an F_p constraint (weighted sum = field element)
the bit constraints are F₂ constraints (trivially satisfied)
cross-algebra constraint: ~k F_p constraints for the weighted sum
+ k trivial F₂ constraints for bit membership
the weighted sum Σ bits[i] × 2^i = v must be verified in F_p (where 2^i makes algebraic sense). the bits themselves live in F₂. this jet spans the algebra boundary.
- pure equivalent: ~k² patterns (bit decomposition via repeated halving)
- jet cost: ~k
- cross-algebra constraints: k F_p + k F₂
- accelerates: every Goldilocks → binary transition (model weight loading, input quantization)
jet 4: dequantize
dequantize(bits, k) → v
input: bits ∈ F₂^k (binary representation), or accumulated integer from popcount
output: v ∈ F_p (Goldilocks field element)
the bridge from binary back to Goldilocks. fires at every F₂ → F_p crossing.
constraint encoding:
prove: v = Σ_{i=0}^{k-1} bits[i] × 2^i (reconstruction in F_p)
same cross-algebra structure as quantize, reversed direction.
for popcount results: k = log₂(n)+1 (small, ~7 bits for 128-bit vectors)
- pure equivalent: ~k² patterns
- jet cost: ~k
- cross-algebra constraints: k F_p + k F₂
- accelerates: result collection after binary computation, accumulation into F_p
jet 5: activation_lut
activation_lut(table, x) → table[x]
input: table ∈ F₂^{2^k × m} (lookup table, 2^k entries of m bits each)
x ∈ F₂^k (k-bit index)
output: result ∈ F₂^m
small lookup table for activation functions. quantized inference uses tables for nonlinearities (binary ReLU, sign, step, clipped linear).
mechanism: multilinear extension (MLE)
the table is a function f: {0,1}^k → F₂^m
MLE: f̃(x₁,...,xₖ) = Σ_{b∈{0,1}^k} f(b) × ∏ᵢ (xᵢ·bᵢ + (1-xᵢ)(1-bᵢ))
evaluation at binary point x: f̃(x) = f(x) (MLE agrees on Boolean hypercube)
constraint encoding:
commit table as MLE polynomial (one-time, 2^k constraints)
each lookup = evaluate MLE at x (k multiplications = k constraints)
prove evaluation matches claimed output (1 equality constraint)
per-lookup cost after table commitment: ~k+1 constraints
example: 8-bit activation table (k=8, 256 entries)
table commitment: 256 constraints (one-time per table)
per-lookup: 9 constraints
1000 lookups in an inference layer: 256 + 9000 = 9,256 constraints
naive (enumerate all entries): 1000 × 256 = 256,000 constraints
speedup: ~28×
- pure equivalent: ~2^k × m patterns per lookup (enumeration)
- jet cost: ~k+1 per lookup (after table commitment)
- binary constraints: 2^k (table) + (k+1) per lookup
- accelerates: activation functions, S-boxes, small nonlinearities
jet 6: gadget_decompose
gadget_decompose(a, base, digits) → (a₀, a₁, ..., a_{digits-1})
input: a ∈ R_q coefficient (Goldilocks field element)
base ∈ Z (decomposition base, typically 2)
digits ∈ Z (number of digits)
output: digit sequence where a = Σ aᵢ × base^i
the first phase of TFHE bootstrapping. decomposes each ring coefficient into binary digits for processing in the binary domain.
for base=2 (binary decomposition):
identical to quantize but semantically distinct:
quantize converts a value for binary computation
gadget_decompose prepares FHE ciphertext for bootstrapping
constraint encoding: same as quantize (~k cross-algebra constraints)
for base=4 or base=8 (multi-bit decomposition):
each digit is 2 or 3 bits
k/2 or k/3 digits instead of k
per-digit range check: prove 0 ≤ aᵢ < base
total: ~(k/log₂(base)) × log₂(base) = ~k constraints
the jet recognizes the gadget decomposition pattern (specific to FHE) and applies the optimized constraint encoding. the prover knows the decomposition base from the formula hash.
- pure equivalent: ~k² patterns (repeated modular reduction)
- jet cost: ~k
- cross-algebra constraints: k F_p + k F₂
- accelerates: TFHE blind rotation, key switching, all FHE bootstrapping phases
jet 7: barrel_shift
barrel_shift(x, amount, direction) → shifted
input: x ∈ F₂^n (packed word)
amount ∈ F₂^{log₂(n)} (shift amount, variable)
direction ∈ F₂ (0=left, 1=right)
output: shifted ∈ F₂^n
variable-amount shift or rotate. pattern 14 (shl) is fixed-amount only. barrel shift handles the general case.
algorithm: log₂(n) conditional shift stages
stage i: if amount[i] = 1, shift by 2^i positions
each stage: n MUX operations (select original or shifted)
each MUX: 3 binary constraints (select = a·s + b·(1-s))
total: log₂(n) × n × 3 constraints
for n=128, log₂(n)=7: 7 × 128 × 3 = 2,688 constraints
naive enumeration of all shift amounts:
n possible amounts × n constraints per shift = n² constraints
for n=128: 16,384 constraints
speedup: ~6×
- pure equivalent: ~n² patterns (enumerate shift amounts)
- jet cost: ~3n·log₂(n)
- binary constraints: ~3n·log₂(n)
- accelerates: cipher round functions, permutation-based crypto, binary NTT butterfly
summary
jet input naive jet speedup primary workload
─────────────────────────────────────────────────────────────────────────────────────────
0: popcount F₂^128 → Z ~640 ~128 5× all accumulation
1: packed_inner_prod F₂^128² → Z ~5n ~128 5× matmul kernel
2: binary_matvec F₂^{m×n} → Z^m m×5n m×128 5× inference, tri-kernel
3: quantize F_p → F₂^k ~k² ~k k× F_p → F₂ boundary
4: dequantize F₂^k → F_p ~k² ~k k× F₂ → F_p boundary
5: activation_lut F₂^k → F₂^m ~2^k/lookup ~k/lookup 2^k/k× activation functions
6: gadget_decompose F_p → F₂^k ~k² ~k k× FHE bootstrapping
7: barrel_shift F₂^n → F₂^n ~n² ~3n·log(n) n/3log× crypto, permutations
constraint savings for primary workloads
workload: 4096×4096 quantized inference layer (BitNet 1-bit)
naive F₂ patterns: ~84M constraints
with Bt jets: ~1.84M constraints (binary_matvec jet)
vs F_p (no Binius): ~2.7B constraints
total speedup: ~1,400× over F_p
workload: tri-kernel SpMV (quantized 4-bit, 10^6 × 10^6 sparse)
per iteration (1% density): ~10^6 × 128 × 4 = ~512M binary constraints (naive)
with Bt jets: ~10^6 × 449 × 4 = ~1.8M constraints
vs F_p: ~16B constraints
speedup: ~8,900× over F_p
workload: FHE gadget decomposition (n=1024 coefficients, 64-bit)
naive: 1024 × 64² = ~4.2M constraints
with gadget_decompose jet: 1024 × 64 = ~65K constraints
speedup: ~64×
prover efficiency (wall-clock)
constraint count understates the jet advantage. Binius prover operates on packed u128 words — 128 F₂ elements per machine operation. jets are designed to exploit packing:
popcount: 7 SIMD stages (vs 640 sequential) ~90× prover speedup
packed_inner_prod: 7 SIMD stages (AND is free) ~90× prover speedup
binary_matvec: m × 7 SIMD stages (shared commitment) ~90× prover speedup
constraint count × prover speedup = effective acceleration:
4096×4096 inference layer:
constraint reduction: 46× (84M → 1.84M)
prover speedup: 90× (SIMD packing)
effective: ~4,100× faster to prove than naive binary
vs F_p prover: ~1,400× × 90× = ~126,000× total advantage
open questions
- popcount output width: popcount of n bits produces a log₂(n)+1 bit result. this result typically crosses to F_p for accumulation (matrix row results are summed as integers). is the dequantize cost per row acceptable, or should popcount output directly to F_p?
- table size limit for activation_lut: k=8 (256 entries) is practical. k=16 (65K entries) requires 65K constraints for table commitment — may need hierarchical decomposition
- multi-bit quantization: quantize/dequantize handle arbitrary k. for 4-bit quantized models (GPTQ-style), k=4 per weight. should there be a specialized 4-bit jet?
- cross-algebra constraint accounting: jets 3, 4, 6 span both F_p and F₂. how does the universal CCS handle constraints that reference both algebras? does each cross-algebra constraint count as one F_p constraint?
see jets for API and recognition mechanism, verifier-jets for Goldilocks jets, binius-pcs for PCS backend, zheng-2 for cross-algebra composition