F₂ vs F_p: when to use which

kuro (binary fields, F₂ tower) and nebu (prime fields, Goldilocks) are complementary — not competing. each dominates a different class of computation. choosing correctly yields order-of-magnitude performance gains. choosing wrong costs 32× or more.

two fields, two algebras

property	F_p (Goldilocks, nebu)	F₂ tower (kuro)
characteristic	p = 2⁶⁴ − 2³² + 1	2
element size	64 bits (one u64)	1 to 128 bits (tower level)
addition	integer add mod p (~3 cycles)	XOR (~1 cycle)
multiplication	integer mul + reduction (~5 cycles)	Karatsuba tower (~20 cycles)
representation	one integer in [0, p)	n bits packed in a register

nebu's strength is dense arithmetic: field multiplications, polynomial evaluations, NTT butterflies. kuro's strength is bitwise parallelism: XOR, AND, popcount, bit manipulation.

the 32× rule for bitwise operations

in a prime field, a single field element holds a 64-bit number. to access individual bits, you must decompose: extract each bit with division and modular arithmetic. in a STARK constraint system, bit decomposition costs approximately 32 constraints (one range check per bit, amortized).

in GF(2), each bit is already a field element. XOR is addition (1 constraint). AND is multiplication (1 constraint). no decomposition needed.

                    F_p cost    F₂ cost    ratio
XOR of two bits     ~32          1         32×
AND of two bits     ~32          1         32×
128 parallel ANDs   ~4,096       1         4,096×
bit shift           ~32          0         ∞

the 32× factor is not an implementation detail — it is the algebraic distance between characteristic p and characteristic 2. no amount of engineering can close it.

the 4× rule for field multiplication

general field multiplication goes the other direction. Goldilocks multiplication is one u64×u64 multiply plus a fast reduction — about 5 cycles on modern hardware. a multiplication in F₂¹²⁸ requires Karatsuba decomposition through 7 tower levels — about 20 cycles.

                    F_p cost    F₂ cost    ratio
field multiply      ~5 cycles   ~20 cycles  0.25×
field add           ~3 cycles   ~1 cycle    3×
field inversion     ~320 cycles ~1000 cycles 0.3×

for workloads dominated by field multiplications (polynomial evaluation, NTT, Poseidon2 rounds), Goldilocks wins by 4×. the binary tower's recursive structure is elegant but cannot match a single hardware multiply instruction.

the cost matrix

workload	dominant operations	winner	margin
hash function (Poseidon2)	field mul, S-box	F_p	4×
NTT / polynomial ops	field mul, add	F_p	3-4×
AES verification	XOR, AND, S-box	F₂	10-30×
SHA-256 verification	XOR, AND, rotations	F₂	20-30×
BitNet inference	AND, popcount	F₂	64×
boolean circuit	XOR, AND	F₂	32×
range check	bit decomposition	F₂	32×
lookup argument	field inversion, mul	F_p	3×

the cross-algebra boundary

a proof system that uses both fields must cross between them. this crossing has a cost.

to embed an F₂ computation result into an F_p constraint system: the binary value must be range-checked (is each component actually 0 or 1?) and converted to an F_p element. estimated cost: ~766 F_p constraints per crossing.

to go the other direction (embed an F_p result into F₂): the 64-bit value must be decomposed into bits and imported as 64 GF(2) elements. this costs ~64 F₂ constraints for the decomposition.

the boundary cost is fixed per crossing, not per element. for large batches (processing thousands of binary operations before crossing back), the per-operation overhead is negligible. for frequent crossings, the overhead dominates.

rule of thumb: cross the boundary as rarely as possible. batch binary work together, batch arithmetic work together.

decision guide

use kuro (F₂) when:

the computation is inherently bitwise (XOR, AND, bit shifts, rotations)
you are proving correctness of binary algorithms (AES, SHA, boolean circuits)
the data is naturally binary (1-bit weights, boolean flags, bit vectors)
parallelism matters: 128 operations per instruction
the workload is BitNet inference or quantized SpMV

use nebu (F_p) when:

the computation is inherently arithmetic (polynomial evaluation, matrix multiply with large entries)
you are computing hashes (Poseidon2 runs over Goldilocks)
the workload involves NTT or polynomial commitment (FRI/WHIR)
field inversion is needed frequently (lookup arguments)
the data represents large numbers (balances, counters, coordinates)

use both when:

a program mixes arithmetic and bitwise operations
the STARK has both polynomial constraints and boolean constraints
you need to verify binary computations within an arithmetic proof
the system naturally partitions into binary and arithmetic sub-circuits

the architecture

the proof system deploys both fields in their natural domains:

                    ┌─────────────────────┐
                    │   arithmetic zone    │
                    │   nebu (Goldilocks)  │
                    │   NTT, Poseidon2,    │
                    │   polynomial ops     │
                    └────────┬────────────┘
                             │
                    ~766 constraints per crossing
                             │
                    ┌────────┴────────────┐
                    │    binary zone       │
                    │    kuro (F₂ tower)   │
                    │    XOR, AND, BitNet, │
                    │    boolean circuits  │
                    └─────────────────────┘

the Goldilocks accumulator (top) and binary field operations (bottom) each run in their native algebra. the cross-algebra boundary is crossed only when necessary — typically once per sub-circuit, not once per operation.

Dimensions

f2-vs-fp

kuro/docs/explanation/f2-vs-fp.md