NTT theory

the Number Theoretic Transform is the finite field analogue of the Fast Fourier Transform. it converts between coefficient and evaluation representations of polynomials in O(N log N) operations, enabling polynomial multiplication in O(N log N) instead of O(N²).

polynomial representations

a polynomial f(x) of degree < N can be represented in two ways:

coefficient form: f(x) = c₀ + c₁x + c₂x² + ... + c_{N−1}x^{N−1}. stored as a vector of N coefficients.

evaluation form: f(x) described by its values at N distinct points: {(x₀, f(x₀)), (x₁, f(x₁)), ..., (x_{N−1}, f(x_{N−1}))}. stored as a vector of N values.

the NTT converts coefficient form → evaluation form (at the N-th roots of unity). the inverse NTT converts evaluation form → coefficient form.

why evaluation form

in evaluation form, polynomial operations are pointwise:

operation	coefficient form	evaluation form
addition	O(N) — add coefficients	O(N) — add values
multiplication	O(N²) — convolution	O(N) — multiply values
division	O(N²) — polynomial long division	O(N) — divide values

multiplication is the critical case. multiplying two degree-N polynomials in coefficient form requires computing the convolution of their coefficients: O(N²) operations. in evaluation form, it is N independent field multiplications: O(N).

the catch: converting between forms costs O(N log N). so the full pipeline for multiplication is:

NTT(a) → â          O(N log N)
NTT(b) → b̂          O(N log N)
â · b̂ → ĉ           O(N)        (pointwise)
INTT(ĉ) → c         O(N log N)

total: O(N log N). this is the Schönhage-Strassen insight — polynomial (and integer) multiplication via transform.

the DFT matrix

the NTT at length N evaluates f at the N-th roots of unity ω⁰, ω¹, ..., ω^{N−1}. in matrix form:

    ┌                              ┐   ┌    ┐       ┌    ┐
    │ 1    1      1      ...  1    │   │ c₀ │       │ f₀ │
    │ 1    ω      ω²     ... ω^{N-1}│ │ c₁ │       │ f₁ │
    │ 1    ω²     ω⁴     ... ω^{2(N-1)}│ c₂ │   =   │ f₂ │
    │ ⋮    ⋮      ⋮           ⋮    │   │ ⋮  │       │ ⋮  │
    │ 1  ω^{N-1} ω^{2(N-1)} ...   │   │c_{N-1}│   │f_{N-1}│
    └                              ┘   └    ┘       └    ┘

this is the DFT matrix F_N with entries F[j][k] = ω^{jk}. the inverse matrix has entries F⁻¹[j][k] = N⁻¹ · ω^{−jk}.

direct matrix-vector multiplication costs O(N²). the NTT computes the same result in O(N log N) by exploiting the structure of the DFT matrix.

the butterfly decomposition

the key identity: ω^{N/2} = −1 (since ω is a primitive N-th root of unity and has order N). this means:

f(ω^k) = f_even(ω^{2k}) + ω^k · f_odd(ω^{2k})
f(ω^{k+N/2}) = f_even(ω^{2k}) − ω^k · f_odd(ω^{2k})

where f_even and f_odd are the polynomials formed from the even and odd coefficients of f. the values ω^{2k} for k = 0, ..., N/2 − 1 are exactly the (N/2)-th roots of unity.

this splits one length-N NTT into two length-N/2 NTTs plus N/2 butterfly operations. each butterfly:

a' = a + ω · b
b' = a − ω · b

one multiplication, two additions. the total work: N/2 butterflies per stage × log₂ N stages = (N/2) log₂ N multiplications and N log₂ N additions.

Cooley-Tukey (decimation in time)

the forward NTT uses the Cooley-Tukey algorithm: split by even/odd indices (decimation in time), process bottom-up.

input:  bit-reversed order
output: natural order

for each stage s = 0, 1, ..., log₂(N) − 1:
  block size m = 2^(s+1)
  ω_m = primitive m-th root of unity
  for each block starting at j:
    for each butterfly in the block:
      t = ω^i · a[j + i + m/2]
      a[j + i + m/2] = a[j + i] − t
      a[j + i]       = a[j + i] + t

the bit-reversal permutation at the input reorders elements so that the butterfly's access pattern is sequential within each stage.

Gentleman-Sande (decimation in frequency)

the inverse NTT uses the Gentleman-Sande algorithm: split by first/second half (decimation in frequency), process top-down.

input:  natural order
output: bit-reversed order (then permuted back)

for each stage s = log₂(N) − 1, ..., 1, 0:
  block size m = 2^(s+1)
  ω_m_inv = inverse of primitive m-th root
  for each block starting at j:
    for each butterfly in the block:
      u = a[j + i]
      v = a[j + i + m/2]
      a[j + i]       = u + v
      a[j + i + m/2] = ω^i · (u − v)

followed by bit-reversal permutation and scaling by N⁻¹.

the inverse NTT satisfies INTT(NTT(a)) = a. the scaling factor N⁻¹ exists because gcd(N, p) = 1 (N is a power of 2, p is odd).

bit-reversal permutation

the decimation-in-time NTT expects input in bit-reversed order. for index i in a 2ᵏ-point NTT, the bit-reversed index is obtained by reversing the k least significant bits of i:

example (k = 3, N = 8):
  000 → 000  (0 → 0)
  001 → 100  (1 → 4)
  010 → 010  (2 → 2)
  011 → 110  (3 → 6)
  100 → 001  (4 → 1)
  101 → 101  (5 → 5)
  110 → 011  (6 → 3)
  111 → 111  (7 → 7)

the permutation is an involution (applying it twice returns to the original order) and can be computed in-place by swapping elements where i < bit_reverse(i).

computational complexity

operation	multiplications	additions	total for N = 2²⁰
forward NTT	N/2 · log₂ N	N · log₂ N	~10M muls, ~20M adds
inverse NTT	N/2 · log₂ N	N · log₂ N	~10M muls, ~20M adds
pointwise multiply	N	0	~1M muls

the NTT dominates: polynomial multiplication is 2× NTT + pointwise + INTT = 3× NTT cost.

at 5 ns per field multiplication (Goldilocks on modern x86), a 2²⁰-point NTT takes approximately 50 ms. this is the ballpark for one STARK proof layer.

convolution theorem

the convolution theorem states:

NTT(a ∗ b) = NTT(a) · NTT(b)

where ∗ is convolution (polynomial multiplication in coefficient form) and · is pointwise multiplication. equivalently:

a ∗ b = INTT(NTT(a) · NTT(b))

this is why the NTT enables fast polynomial multiplication. it also works for cyclic convolution: when the polynomial modulus is x^N − 1, the NTT naturally computes the cyclic product. for negacyclic convolution (modulus x^N + 1, as in TFHE), a twist by half-roots is applied before the NTT.

NTT-friendly fields

a field is NTT-friendly if it has high two-adicity (large k where 2ᵏ | p − 1). comparison:

field	bits	two-adicity	max NTT length
Goldilocks (2⁶⁴ − 2³² + 1)	64	32	2³²
BabyBear (2³¹ − 2²⁷ + 1)	31	27	2²⁷
Mersenne31 (2³¹ − 1)	31	1	2
BN254 scalar field	254	28	2²⁸

Goldilocks has the highest two-adicity among commonly used fields. Mersenne31 is nearly useless for NTT (two-adicity 1 means only length-2 NTTs). BabyBear is popular in newer STARK systems but has a smaller field (31-bit elements vs 64-bit).

nebu/docs/explanation/ntt-theory.md