data-availability.md

data availability and algebraic DAS

DAS (Data Availability Sampling) proves data physically exists across the neuron set without downloading it. validity proves correctness. ordering proves sequence. completeness proves nothing was withheld. availability proves the data survives.

polynomial nouns make DAS native

when particles ARE polynomials, DAS is no longer a separate encoding step. the particle's content is already a polynomial with a Lens commitment. extending that polynomial beyond its evaluation domain produces the erasure code automatically. the particle IS a polynomial. the extension IS the erasure code. no separate encoding pipeline.

sampling = Lens opening at random positions on the extended polynomial. reconstruction = polynomial interpolation from enough evaluations. the Lens commitment that identifies the particle is simultaneously the DAS commitment. one object serves both purposes.

this collapses the traditional DAS pipeline (data → encode → commit → sample → verify) to (polynomial noun → sample → verify). the encoding step vanishes because polynomial nouns are born erasure-coded.

the availability problem

a neuron stores a file. the neuron dies. the file is gone. no amount of proof verifies data that no longer exists.

the harder failure: a neuron is online but selectively withholds data. it responds to some requests, ignores others. a CRDT merge converges — to incomplete state. the receiving neuron has no way to know data is missing without a completeness proof and an availability check.

erasure coding

erasure coding splits data into k data chunks and generates n - k parity chunks such that any k of n chunks reconstruct the original. the encoding is 2D Reed-Solomon over Goldilocks field (p = 2^64 - 2^32 + 1).

original data:  √n × √n grid of field elements
extended:       2√n × 2√n with parity rows and columns
reconstruction: any √n × √n submatrix of original dimension is sufficient

parameters:
  chunk size:   256 KiB (content-defined boundaries)
  rate:         k/n configurable per neuron group (default 1/2)
  at rate 1/2:  any 50% of chunks → full reconstruction
  per-neuron:   2/N × file_size (distributed, not replicated)

2D encoding enables row-level and column-level fraud proofs. a single incorrectly encoded cell is provable by downloading one row and one column — O(sqrt(n)) data instead of O(n). this is the foundation of efficient sampling.

algebraic DAS

the 2D Reed-Solomon grid maps naturally to a bivariate polynomial P(row, col). each cell in the extended grid is an evaluation of P at a grid point. the commitment is a single Lens digest:

each row     → polynomial over Goldilocks
all rows     → bivariate polynomial P(row, col)
commitment   = Lens.commit(P)    32 bytes

sampling

a verifier picks random grid positions and requests each cell with a Lens opening proof:

per sample:
  request:      cell value at (row, col)
  proof:        Lens opening of P(row, col)    ~200 bytes
  verification: Lens.verify(commitment, (row, col), value, proof)    O(1) field ops

20 samples → 99.9999% confidence data is available
30 samples → 99.99999999% confidence

total bandwidth:    20 × (256 + 200) = ~9 KiB
total computation:  20 × O(1) field verifications
total constraints:  ~3K

a phone verifies 1 TB of data with ~20 random samples and ~9 KiB of bandwidth.

why sampling works

if a neuron withholds even one chunk, the 2D erasure structure means the chunk's absence affects an entire row and column. the probability that random samples miss all affected positions decreases exponentially.

with rate 1/2 encoding, withholding more than 50% means reconstruction fails — and a random sample has 50% chance per try of hitting a withheld position. after 20 samples: probability of missing all withheld = (0.5)^20 < 10^-6.

withholding less than 50% means reconstruction succeeds anyway — the withheld chunks are recoverable from parity. the neuron gains nothing.

binary outcome: either data is available (sampling confirms) or reconstruction fails (sampling detects). no middle ground where partial withholding goes undetected.

fraud proofs

incorrect encoding is detected when k+1 cells on a row or column fail to lie on a degree-k polynomial:

fraud proof:    k+1 cells + Lens openings
size:           (k+1) × (256 + ~200 bytes)
verification:   O(k) field operations

namespace completeness

"give me all chunks in namespace N" uses the same Lens mechanism as algebraic NMT — a completeness proof is a Lens opening over the evaluation points in that namespace. DAS adds: "and here is proof the chunks are correctly erasure-coded." both proofs are polynomial evaluations against the same commitment.

Lens completeness

DAS proves chunks exist. Lens completeness proves the full set was committed — no chunks omitted from the commitment.

neuron commits all chunks to polynomial:
  content_commitment = Lens.commit(chunk_poly)

completeness proof:
  verifier requests: "all chunks in positions [a, b]"
  neuron returns: chunks + Lens opening for the range

polynomial binding guarantees:
  the commitment uniquely determines all evaluation points.
  a Lens opening for range [a, b] cannot omit a point
  that the polynomial encodes. algebraic — the commitment
  cannot lie about its contents.

three layers composed

each layer solves a different failure mode:

failure	CRDT alone	DAS alone	Lens alone	all three
neuron dies	data lost	recoverable	no help	recoverable
selective withholding	undetectable	detectable	provable	provable + detectable
merge conflict	resolved	no help	no help	resolved
incomplete transfer	undetectable	no help	provable	provable

CRDT — merge semantics. content-addressed chunks have no conflicts. grow-only set union is commutative, associative, idempotent.

Lens completeness — provable completeness per namespace. one commitment, one opening, algebraic proof that nothing was omitted. O(1) proof size.

DAS + erasure coding — physical availability. data survives permanent neuron failure through k-of-n reconstruction. O(sqrt(n)) sampling cost.

together: provably complete, provably available, correctly merged.

local vs global

the mechanism is identical at both scales. parameters differ:

parameter	local (1-20 neurons)	global (network)
participants	same identity	different identities
chunk distribution	capacity-weighted	stake-weighted
reconstruction	any k-of-n local neurons	any k-of-n in namespace
fraud response	neuron key revocation	stake slashing

the same erasure coding, the same 2D Reed-Solomon, the same Lens commitment scheme. the availability layer is scale-invariant.

cost summary

                        algebraic DAS
commitment:             32 bytes (one Lens)
per-sample proof:       ~200 bytes (Lens opening)
per-sample verify:      O(1) field operations
20-sample verification: ~9 KiB bandwidth, ~3K constraints
fraud proof (k=64):     ~30 KiB
namespace completeness: O(range) field operations

see data-availability for the reference specification, why-polynomial-state for why polynomial over hash tree, architecture-overview for the full pipeline

Homonyms

bbg/specs/data-availability

data availability bbg without data availability is incomplete. authenticated state means nothing if the data behind it cannot be retrieved. algebraic DAS (Data Availability Sampling) allows light clients to verify that block data is available without downloading the full block, using Lens openings…