F₂ vs F_p: when to use which

kuro (binary fields, F₂ tower) and nebu (prime fields, Goldilocks) are complementary — not competing. each dominates a different class of computation. choosing correctly yields order-of-magnitude performance gains. choosing wrong costs 32× or more.

two fields, two algebras

property F_p (Goldilocks, nebu) F₂ tower (kuro)
characteristic p = 2⁶⁴ − 2³² + 1 2
element size 64 bits (one u64) 1 to 128 bits (tower level)
addition integer add mod p (~3 cycles) XOR (~1 cycle)
multiplication integer mul + reduction (~5 cycles) Karatsuba tower (~20 cycles)
representation one integer in [0, p) n bits packed in a register

nebu's strength is dense arithmetic: field multiplications, polynomial evaluations, NTT butterflies. kuro's strength is bitwise parallelism: XOR, AND, popcount, bit manipulation.

the 32× rule for bitwise operations

in a prime field, a single field element holds a 64-bit number. to access individual bits, you must decompose: extract each bit with division and modular arithmetic. in a STARK constraint system, bit decomposition costs approximately 32 constraints (one range check per bit, amortized).

in GF(2), each bit is already a field element. XOR is addition (1 constraint). AND is multiplication (1 constraint). no decomposition needed.

                    F_p cost    F₂ cost    ratio
XOR of two bits     ~32          1         32×
AND of two bits     ~32          1         32×
128 parallel ANDs   ~4,096       1         4,096×
bit shift           ~32          0         ∞

the 32× factor is not an implementation detail — it is the algebraic distance between characteristic p and characteristic 2. no amount of engineering can close it.

the 4× rule for field multiplication

general field multiplication goes the other direction. Goldilocks multiplication is one u64×u64 multiply plus a fast reduction — about 5 cycles on modern hardware. a multiplication in F₂¹²⁸ requires Karatsuba decomposition through 7 tower levels — about 20 cycles.

                    F_p cost    F₂ cost    ratio
field multiply      ~5 cycles   ~20 cycles  0.25×
field add           ~3 cycles   ~1 cycle    3×
field inversion     ~320 cycles ~1000 cycles 0.3×

for workloads dominated by field multiplications (polynomial evaluation, NTT, Poseidon2 rounds), Goldilocks wins by 4×. the binary tower's recursive structure is elegant but cannot match a single hardware multiply instruction.

the cost matrix

workload dominant operations winner margin
hash function (Poseidon2) field mul, S-box F_p
NTT / polynomial ops field mul, add F_p 3-4×
AES verification XOR, AND, S-box F₂ 10-30×
SHA-256 verification XOR, AND, rotations F₂ 20-30×
BitNet inference AND, popcount F₂ 64×
boolean circuit XOR, AND F₂ 32×
range check bit decomposition F₂ 32×
lookup argument field inversion, mul F_p

the cross-algebra boundary

a proof system that uses both fields must cross between them. this crossing has a cost.

to embed an F₂ computation result into an F_p constraint system: the binary value must be range-checked (is each component actually 0 or 1?) and converted to an F_p element. estimated cost: ~766 F_p constraints per crossing.

to go the other direction (embed an F_p result into F₂): the 64-bit value must be decomposed into bits and imported as 64 GF(2) elements. this costs ~64 F₂ constraints for the decomposition.

the boundary cost is fixed per crossing, not per element. for large batches (processing thousands of binary operations before crossing back), the per-operation overhead is negligible. for frequent crossings, the overhead dominates.

rule of thumb: cross the boundary as rarely as possible. batch binary work together, batch arithmetic work together.

decision guide

use kuro (F₂) when:

  • the computation is inherently bitwise (XOR, AND, bit shifts, rotations)
  • you are proving correctness of binary algorithms (AES, SHA, boolean circuits)
  • the data is naturally binary (1-bit weights, boolean flags, bit vectors)
  • parallelism matters: 128 operations per instruction
  • the workload is BitNet inference or quantized SpMV

use nebu (F_p) when:

  • the computation is inherently arithmetic (polynomial evaluation, matrix multiply with large entries)
  • you are computing hashes (Poseidon2 runs over Goldilocks)
  • the workload involves NTT or polynomial commitment (FRI/WHIR)
  • field inversion is needed frequently (lookup arguments)
  • the data represents large numbers (balances, counters, coordinates)

use both when:

  • a program mixes arithmetic and bitwise operations
  • the STARK has both polynomial constraints and boolean constraints
  • you need to verify binary computations within an arithmetic proof
  • the system naturally partitions into binary and arithmetic sub-circuits

the architecture

the proof system deploys both fields in their natural domains:

                    ┌─────────────────────┐
                    │   arithmetic zone    │
                    │   nebu (Goldilocks)  │
                    │   NTT, Poseidon2,    │
                    │   polynomial ops     │
                    └────────┬────────────┘
                             │
                    ~766 constraints per crossing
                             │
                    ┌────────┴────────────┐
                    │    binary zone       │
                    │    kuro (F₂ tower)   │
                    │    XOR, AND, BitNet, │
                    │    boolean circuits  │
                    └─────────────────────┘

the Goldilocks accumulator (top) and binary field operations (bottom) each run in their native algebra. the cross-algebra boundary is crossed only when necessary — typically once per sub-circuit, not once per operation.

see also

Dimensions

f2-vs-fp

Local Graph