data availability: from sampling to algebraic proofs
how cyber guarantees that data physically exists without any verifier downloading it. layer 4 of structural sync. the mechanism that closes the gap between "the data is committed" and "the data is retrievable."
the problem
a node commits to data (posts a hash, signs a root). the hash proves integrity — if you have the data, you can verify it matches. but the hash says nothing about whether the data is AVAILABLE. the committer could publish the root and discard the payload. every verifier sees a valid commitment to data nobody can retrieve.
this is the data availability problem. it is independent of consensus, ordering, and validity — a system can have perfect proofs of correct computation while the underlying data is lost.
the solution: erasure coding + sampling
2D Reed-Solomon encoding
data is arranged in a $k \times k$ grid (with $k = \sqrt{n}$ for $n$ cells) and extended with parity rows and columns via Reed-Solomon over the Goldilocks field:

```
original: k × k data cells
extended: 2k × 2k (data + parity)
any k × k submatrix → full reconstruction
```

default rate 1/2 per dimension: 4× total cells (2k × 2k vs k × k), distributed across N devices
withholding more than 50% of cells makes reconstruction fail, but random sampling detects it exponentially fast. withholding less than 50% leaves the data recoverable from parity anyway. the outcome is binary: available, or detectably withheld.
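the extension property is easiest to see on a single row. a minimal sketch using Lagrange interpolation over the Goldilocks prime (illustrative only: a real encoder would use NTT-based evaluation, and none of these function names are cyber's API):

```python
# illustrative 1D Reed-Solomon row extension over the Goldilocks field.
# shows the recoverability property: any k of the 2k cells reconstruct
# the original row.
P = 2**64 - 2**32 + 1  # Goldilocks prime

def lagrange_eval(points, x):
    """evaluate the unique degree < len(points) polynomial through
    the (xi, yi) pairs at x, all arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * ((x - xj) % P) % P
                den = den * ((xi - xj) % P) % P
        # pow(den, P - 2, P) is the modular inverse (P is prime)
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def extend_row(row):
    """k data cells at positions 0..k-1 -> 2k cells at positions 0..2k-1."""
    k = len(row)
    pts = list(enumerate(row))
    return row + [lagrange_eval(pts, x) for x in range(k, 2 * k)]

def reconstruct_row(known, k):
    """any k (position, value) pairs -> the original k data cells."""
    pts = known[:k]
    return [lagrange_eval(pts, x) for x in range(k)]
```

dropping the entire first half of an extended row and rebuilding it from the parity half round-trips back to the original data, which is the "any k of 2k" guarantee in one dimension.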
data availability sampling (DAS)
a verifier samples k random cells (a constant chosen for the target confidence), each with an NMT inclusion proof against the committed root:

```
for i in 1..k:
    pos ← random_cell()
    (cell, proof) ← request(pos)
    verify: NMT_proof(cell, proof, root)
```
k = 20: confidence 1 - (1/2)^20 ≈ 99.99990%
k = 30: confidence 1 - (1/2)^30 ≈ 99.9999999%
a phone verifies 1 TB of data with ~30 random samples. no full download. no trust in the data holder.
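the sampling loop can be sketched against a toy data holder (the `Holder` class and its `request` interface are hypothetical stand-ins, and inclusion-proof verification is stubbed out to keep the sketch short):

```python
# toy DAS loop: sample k random cells from a possibly-withholding holder.
import random

class Holder:
    def __init__(self, cells, withheld=frozenset()):
        self.cells = cells
        self.withheld = set(withheld)

    def request(self, pos):
        # returns the cell, or None if the holder withholds it
        return None if pos in self.withheld else self.cells[pos]

def sample_availability(holder, n_cells, k, rng):
    """returns True iff all k random samples were served.
    against a holder withholding >= 50% of cells, each sample catches
    the withholding with probability >= 1/2, so confidence after k
    samples is at least 1 - (1/2)**k."""
    for _ in range(k):
        pos = rng.randrange(n_cells)
        if holder.request(pos) is None:
            return False  # missing cell: data not available
    return True

def confidence(k):
    """detection confidence vs a holder withholding half the cells."""
    return 1 - 0.5**k
```

an honest holder passes every sample; a withholding holder is caught with probability growing exponentially in k, which is the 1 - (1/2)^k figure above.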
fraud proofs for bad encoding
if a block producer encodes data incorrectly (a row is not a valid Reed-Solomon codeword), any node that obtains k+1 cells from one row can detect it:
1. download k+1 cells from row
2. Reed-Solomon decode → expected polynomial
3. check: decoded polynomial matches committed row root?
4. if not → fraud proof (k+1 cells + proofs)
5. verification: O(k log n) — linear in row, logarithmic in block
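the consistency check in steps 2-3 is just interpolation: decode a polynomial from k cells and test the extra cell against it. a sketch over a small prime for readability (the real system works over Goldilocks; `detect_bad_encoding` is an illustrative name, not cyber's API):

```python
# bad-encoding detection on one row: interpolate from the first k of
# k+1 cells and check that the extra cell lies on the same polynomial.
P = 97  # small prime for readability

def lagrange_eval(points, x):
    """evaluate the unique degree < len(points) polynomial through
    the (xi, yi) pairs at x, all arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * ((x - xj) % P) % P
                den = den * ((xi - xj) % P) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def detect_bad_encoding(cells):
    """cells: k+1 (position, value) pairs from one row.
    returns True if the row is NOT a valid degree < k codeword,
    i.e. the extra cell disagrees with the decoded polynomial."""
    *first_k, (x_last, y_last) = cells
    return lagrange_eval(first_k, x_last) != y_last
```

a valid row passes; tampering with a single cell makes the extra point fall off the decoded polynomial, which is exactly the evidence packaged into a fraud proof.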
three scales
the same mechanism works at every scale. only parameters change:
| scale | participants | distribution | commitment | fraud handling |
|---|---|---|---|---|
| local (devices) | 1-20 devices of one neuron | capacity-weighted | per-device NMT | key revocation |
| global (neurons) | 10³-10⁶ neurons | stake-weighted | per-neuron NMT | stake slashing |
| query (light clients) | 10⁹ clients | pull-based | BBG root | proof rejection |
at local scale: a phone verifies its server has all files without downloading them. at global scale: light clients verify block data availability without running a full node.
the cost (current NMT-based)
- per sample: NMT inclusion proof → O(log n) Hemera hashes
- at n = 2³²: 32 × 736 = 23,552 constraints per sample
- 20 samples: 20 × 23,552 = 471,040 constraints
- bandwidth: 20 × 1 KiB = 20 KiB per verification
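these figures follow directly from the stated constants (a quick arithmetic check, with all values taken from the text above):

```python
# back-of-the-envelope check of the NMT DAS cost figures.
HEMERA_CONSTRAINTS = 736   # constraints per Hemera hash (from the text)
TREE_DEPTH = 32            # log2(n) at n = 2**32
SAMPLES = 20
PROOF_BYTES = 1024         # ~1 KiB NMT proof per sample

per_sample = TREE_DEPTH * HEMERA_CONSTRAINTS   # constraints per sample
total_constraints = SAMPLES * per_sample       # constraints for 20 samples
bandwidth_kib = SAMPLES * PROOF_BYTES // 1024  # KiB for 20 samples
```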
this is layer 4's contribution to Verified Eventual Consistency. combined with layer 3 (NMT completeness) and layer 5 (CRDT merge), it provides the full VEC guarantee: convergence verified + completeness verified + availability verified.
algebraic DAS: the evolution
the shift
replace NMT inclusion proofs with lens (polynomial commitment scheme) openings. every sample becomes a polynomial evaluation instead of a Merkle path:
current (NMT):
- sample cell → NMT inclusion proof → O(log n) Hemera hashes
- proof size: ~1 KiB per sample
- verification: walk the Merkle path, check the sorting invariant

algebraic:
- sample cell → lens opening → O(1) field operations
- proof size: ~200 bytes per sample
- verification: one polynomial evaluation check
the numbers
| metric | NMT DAS | algebraic DAS | improvement |
|---|---|---|---|
| proof size per sample | ~1 KiB | ~200 bytes | 5× smaller |
| constraints per sample | 23,552 | ~100 | 235× fewer |
| 20 samples bandwidth | 20 KiB | 4 KiB | 5× less |
| 20 samples constraints | 471K | 2K | 235× fewer |
| in-circuit verification | prohibitive for bulk | feasible | enables provable DAS |
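the improvement column is a direct consequence of the per-sample figures (a consistency check using the approximate values from the table):

```python
# ratios behind the comparison table, from the per-sample figures:
# NMT: ~23,552 constraints / ~1 KiB; algebraic: ~100 constraints / ~200 B.
nmt_constraints, alg_constraints = 23_552, 100
nmt_bytes, alg_bytes = 1024, 200
samples = 20

constraint_ratio = nmt_constraints // alg_constraints  # ~235x fewer
size_ratio = nmt_bytes / alg_bytes                     # ~5x smaller
nmt_total = samples * nmt_constraints                  # NMT, 20 samples
alg_total = samples * alg_constraints                  # algebraic, 20 samples
```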
why this matters
algebraic DAS is not a standalone optimization. it is a consequence of algebraic state commitments: when the state layer moves from hash trees to polynomials, the availability layer inherits the efficiency automatically. the NMT that commits erasure-coded chunks becomes a polynomial that commits them algebraically.
the cascade:
- BBG polynomial state (one commitment for all state) → state reads are O(1) field ops
- algebraic DAS (Lens openings for samples) → availability verification is O(1) field ops
- provable consensus (tri-kernel in zheng circuit) → the circuit can verify availability IN the proof
with NMT DAS, verifying 20 samples in-circuit costs 471K constraints: feasible but expensive. with algebraic DAS it costs 2K constraints: negligible. the prover can verify data availability as part of proving consensus correctness, at near-zero marginal cost.
the complete availability stack
layer 4 in structural sync:
```
DATA CREATION
  neuron creates signal → content chunks
        ↓
ERASURE ENCODING
  2D Reed-Solomon over the Goldilocks field
  k data chunks → 2k encoded chunks
        ↓
COMMITMENT
  NMT (current): namespace Merkle tree over chunks
  polynomial (future): lens commitment over chunk evaluations
        ↓
DISTRIBUTION
  local: capacity-weighted across device set
  global: stake-weighted across neuron set
        ↓
SAMPLING (DAS)
  verifier samples random cells
  each cell: inclusion proof (NMT path or lens opening)
  confidence: 1 - (1/2)^k with k samples
        ↓
RECONSTRUCTION (on failure)
  any k-of-2k chunks → full reconstruction
  device dies → other devices hold parity → data survives
```
what DAS does NOT solve
availability proves data EXISTS. it does not prove data is CORRECT (that is layer 1: zheng validity), ORDERED (layer 2: hash chain + VDF), or COMPLETE for a given source (layer 3: NMT completeness). each layer provides an independent guarantee. together: Verified Eventual Consistency.
DAS also does not provide PERMANENT availability. erasure coding survives k failures out of 2k holders. if more than k holders disappear, data is lost. long-term persistence requires archival strategies — replication to additional holders as existing holders leave.
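the k-of-2k boundary can be quantified with a simple independence model (illustrative only: real holder failures are correlated, which is exactly why re-replication is needed):

```python
# probability that k-of-2k erasure-coded data survives when each of the
# 2k holders fails independently with probability p.
from math import comb

def survival_prob(k, p):
    """data survives iff at least k of the 2k holders remain,
    i.e. at most k holders fail (binomial tail)."""
    n = 2 * k
    return sum(comb(n, f) * p**f * (1 - p)**(n - f) for f in range(k + 1))
```

with k = 8 and a 10% independent failure rate, survival is overwhelmingly likely; at a 90% failure rate the same data is almost surely lost, which is the cliff the paragraph above describes.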
relationship to other systems
| system | erasure coding | sampling | commitment | transparent |
|---|---|---|---|---|
| Celestia | 2D Reed-Solomon | DAS (O(√n)) | NMT | yes |
| Ethereum (EIP-4844) | 1D RS | no sampling | KZG | no (trusted setup) |
| EigenDA | 1D RS | no sampling | KZG | no |
| Avail | 2D RS (Kate) | DAS | KZG | no |
| cyber (current) | 2D RS over Goldilocks field | DAS (O(√n)) | NMT + Hemera | yes |
| cyber (algebraic) | 2D RS over Goldilocks | DAS (O(√n)) | lens (WHIR/Brakedown) | yes |
cyber's approach is closest to Celestia (both use NMT + 2D RS + transparent DAS). the key difference: cyber's algebraic evolution replaces NMT with lens, making availability verification algebraic — composable with the zheng proof system. Celestia keeps NMT permanently.
the other difference: cyber uses the Goldilocks field for erasure coding (the same field as proofs, consensus, and FHE), so there is no field mismatch. Celestia uses a separate field.
see DAS for the sampling algorithm. see erasure coding for the Reed-Solomon construction. see NMT for namespace Merkle trees. see structural sync for the five-layer framework. see algebraic state commitments for the polynomial transition. see cyber/research/vec formalization for formal availability guarantees. see BBG for the authenticated state layer.