encoding.md

noun encoding specification

version: 0.3 status: canonical

overview

this document specifies three encoding layers for nox nouns:

storage encoding — canonical binary serialization of individual nouns
content-addressed identity — how noun identity is derived from encoded content
wire format — how nouns are transmitted between nodes

all concrete sizes refer to the canonical instantiation: nox<Goldilocks, Z/2^32, Hemera>.

storage encoding

every noun is either an atom or a cell. the first byte is a tag that determines interpretation. the encoding is deterministic — one valid byte sequence per noun.

tag byte

0x00   atom (field element)
0x01   atom (word element)
0x02   atom (hash element)
0x03   cell

three atom tags distinguish runtime types. the tag byte makes stored nouns self-describing — a resolver does not need execution context to parse a noun.

atom: field element (tag 0x00)

offset  size   field
──────  ─────  ──────────────────────────────
0       1      tag = 0x00
1       8      value: canonical little-endian Goldilocks element

total: 9 bytes

the 8-byte value MUST be in [0, p) where p = 2^64 - 2^32 + 1 = 18446744069414584321. values >= p are invalid. the canonical encoding is the unique little-endian representation of the reduced value.

atom: word element (tag 0x01)

offset  size   field
──────  ─────  ──────────────────────────────
0       1      tag = 0x01
1       8      value: canonical little-endian Goldilocks element
                (word value zero-extended to field element, MUST be in [0, 2^32))

total: 9 bytes

word atoms store 32-bit values as field elements. the value MUST be in [0, 2^32). values >= 2^32 are invalid for word-typed atoms. the encoding is 8 bytes little-endian, same as field atoms — the upper 4 bytes are zero.

atom: hash element (tag 0x02)

offset  size   field
──────  ─────  ──────────────────────────────
0       1      tag = 0x02
1       32     value: 4 consecutive Goldilocks elements, each 8 bytes little-endian

total: 33 bytes

each of the 4 field elements MUST be in [0, p). this encodes the output of a hemera hash (4 x F_p).

cell (tag 0x03)

offset  size   field
──────  ─────  ──────────────────────────────
0       1      tag = 0x03
1       32     left:  NounId (32 bytes, identity of the left child)
33      32     right: NounId (32 bytes, identity of the right child)

total: 65 bytes

left before right. a cell does not contain its children inline — it contains their identities. to materialize a cell, resolve both NounIds recursively.

encoding summary

tag    type          size (bytes)   payload
────   ────────────  ─────────────  ────────────────────────────────
0x00   field atom     9             8-byte LE Goldilocks element
0x01   word atom      9             8-byte LE value (must be < 2^32)
0x02   hash atom     33             4 × 8-byte LE Goldilocks elements
0x03   cell          65             NounId(left) ‖ NounId(right)

four sizes: 9, 9, 33, 65. tag byte disambiguates the two 9-byte variants.

NounId — content-addressed identity

every noun has a 32-byte identity derived from its content:

NounId = hemera(encoded_noun)     32 bytes (4 × F_p elements)

encoded_noun is the canonical binary encoding defined above (tag byte + payload). hemera absorbs the byte sequence and squeezes 4 field elements. the identity is deterministic — same noun always produces the same NounId.

identity computation

fn noun_id(noun: &Noun) -> NounId {
    let bytes = encode(noun);          // tag byte + payload
    hemera::hash(&bytes)               // → [F_p; 4] = 32 bytes
}

hemera uses the default domain tag (capacity[11] = 0x00) for structural noun hashing. other domain tags (COMMITMENT, NULLIFIER, etc.) are protocol-level — they do not affect NounId.

polynomial identity (optimization)

for large nouns (many leaves), the polynomial identity is more efficient:

identity = hemera(Lens.commit(noun_polynomial) ‖ domain_tag)    32 bytes

the noun is encoded as a multilinear polynomial (see nouns.md polynomial representation). Lens commitment costs O(N) field operations where N = number of leaves (Brakedown linear-time commitment). the hemera wrap adds 1 hemera call for domain separation.

both methods produce the same NounId for the same noun. the polynomial path is an optimization for nouns with many leaves — it replaces recursive hashing with O(N) field ops + 1 hemera call. for small nouns (< 56 bytes), cost is equivalent to direct hemera hashing.

an implementation MUST support the direct path (hemera of encoded bytes). the polynomial path is optional.

computation key

the memoization key for a computation (formula applied to object):

computation_key = NounId(object) ‖ NounId(formula)    64 bytes
computation_val = NounId(result)                       32 bytes

this is the axon in the cybergraph (see reduction.md global memoization).

content-addressed store

nouns are stored in a content-addressed key-value store:

store: NounId (32 bytes) → encoded_noun (9, 9, 33, or 65 bytes)

to retrieve a noun, query the store by its NounId. the tag byte in the stored value determines how to parse the payload.

resolution

to materialize a full noun tree from a NounId:

fn resolve(store: &Store, id: NounId) -> Result<Noun, Unavailable> {
    let bytes = store.get(id)?;
    match bytes[0] {
        0x00 => {
            assert(bytes.len() == 9);
            let value = u64_le(bytes[1..9]);
            assert(value < p);
            Atom::Field(value)
        }
        0x01 => {
            assert(bytes.len() == 9);
            let value = u64_le(bytes[1..9]);
            assert(value < p);
            assert(value < 2^32);
            Atom::Word(value as u32)
        }
        0x02 => {
            assert(bytes.len() == 33);
            let elems = [u64_le(bytes[1..9]),
                         u64_le(bytes[9..17]),
                         u64_le(bytes[17..25]),
                         u64_le(bytes[25..33])];
            assert(all elems < p);
            Atom::Hash(elems)
        }
        0x03 => {
            assert(bytes.len() == 65);
            let left_id  = bytes[1..33];
            let right_id = bytes[33..65];
            let left  = resolve(store, left_id)?;
            let right = resolve(store, right_id)?;
            Cell(left, right)
        }
        _ => error("invalid tag byte")
    }
}

resolution is recursive for cells. an implementation SHOULD use iterative deepening or a stack to avoid call-stack overflow on deep nouns.

store verification

when receiving nouns from an untrusted source, verify:

assert(hemera::hash(encoded_noun) == claimed_noun_id);

this is the only check needed. if the hash matches, the content is authentic. content addressing eliminates the need for signatures on individual nouns.

wire format

nouns are transmitted between nodes as length-prefixed messages containing content-addressed noun entries.

message framing

offset  size       field
──────  ─────────  ──────────────────────────────
0       4          message_length: u32 little-endian (byte count of payload)
4       variable   payload: one or more noun entries

max message_length: 2^24 (16 MiB). implementations MUST reject messages > 16 MiB.

noun entry within a message

each entry in the payload is:

offset  size       field
──────  ─────────  ──────────────────────────────
0       32         NounId (expected identity)
32      1          entry_length: encoded noun size (9, 9, 33, or 65)
33      variable   encoded_noun (tag byte + payload, as specified above)

the receiver:

reads the NounId (32 bytes)
reads the entry_length (1 byte)
reads entry_length bytes of encoded noun data
verifies: hemera(encoded_noun) == NounId
stores the entry if valid, rejects the message if any entry fails verification

message types

type byte (first byte of payload):
0x10   noun_push    — sender pushes noun entries (no request)
0x11   noun_request — request nouns by NounId list
0x12   noun_response — response to a noun_request

noun_push payload:
  [type=0x10] [entry_count: u32 LE] [entry_0] [entry_1] ...

noun_request payload:
  [type=0x11] [count: u32 LE] [NounId_0] [NounId_1] ...

noun_response payload:
  [type=0x12] [entry_count: u32 LE] [entry_0] [entry_1] ...

a noun_push for a cell MUST include all transitive children before the cell itself (topological order). the receiver can verify and store each entry as it arrives — no forward references.

particle CID

a particle in the cybergraph is identified by its NounId:

particle_cid = NounId(noun) = hemera(encode(noun))

the CID (content identifier) is the hemera hash of the canonical encoding. any node can independently compute the CID from the noun content. two nodes with the same noun always agree on its CID.

variable-length atom encoding (optional optimization)

for atoms whose value fits in fewer than 8 bytes, an implementation MAY use a compact encoding within internal storage. this is an implementation optimization — the canonical wire format and identity computation always use the fixed-width encoding defined above.

compact atom encoding (internal use only):

tag     size   value range         encoding
─────   ─────  ──────────────────  ─────────────────────
0x80    2      [0, 256)            tag + 1 byte
0x81    3      [0, 65536)          tag + 2 bytes LE
0x82    5      [0, 2^32)           tag + 4 bytes LE
0x00    9      [0, p)              tag + 8 bytes LE (canonical)

the compact encoding MUST NOT be used for identity computation or wire transmission. it is strictly an internal storage optimization for implementations that want to reduce memory for small atoms.

an implementation that does not support compact encoding is fully conforming.

field element encoding

Goldilocks field element:
  8 bytes, little-endian
  canonical value in [0, p) where p = 2^64 - 2^32 + 1
  values >= p are invalid
  reduction: a mod p = a_lo - a_hi × (2^32 - 1) + correction

formula encoding

a formula is a noun of the form cell(tag, body) where tag is a field atom with value 0-17 (the pattern number).

[0 a]           axis     — navigate object tree
[1 c]           quote    — return literal
[2 [x y]]       compose  — reduce x, reduce y, apply
[3 [a b]]       cons     — construct cell
[4 [t [y n]]]   branch   — conditional evaluation
[5 [a b]]       add      — field addition
[6 [a b]]       sub      — field subtraction
[7 [a b]]       mul      — field multiplication
[8 a]           inv      — field inversion (Fermat)
[9 [a b]]       eq       — equality test
[10 [a b]]      lt       — less-than test
[11 [a b]]      xor      — bitwise exclusive-or
[12 [a b]]      and      — bitwise conjunction
[13 a]          not      — bitwise complement
[14 [a n]]      shl      — bitwise left shift
[15 a]          hash     — hemera hash
[16 [t c]]      call     — non-deterministic witness injection
[17 a]          look     — deterministic BBG state read

formulas are nouns. they are stored and resolved the same way as any other noun — by NounId lookup, recursively. the tag atom uses field type (0x00) in the encoding.

canonical invariants

tag byte MUST be 0x00, 0x01, 0x02, or 0x03
field element values MUST be in [0, p)
word element values MUST be in [0, 2^32)
hash elements: each of the 4 field elements MUST be in [0, p)
cell content: left NounId before right NounId
encoded noun length MUST match the tag (9, 9, 33, or 65 bytes)
NounId MUST equal hemera(encoded_noun) — no aliasing
one valid encoding per noun — no alternative representations
wire messages: topological order (children before parents)
wire messages MUST NOT exceed 16 MiB

violation of any invariant is a rejection. there is no recovery, no fallback, no "best effort" parsing. invalid data is discarded.

test vectors (canonical instantiation)

atom encoding

field atom with value 0:
  encoded: 00 00 00 00 00 00 00 00 00
  NounId:  hemera(00 00 00 00 00 00 00 00 00)

field atom with value 1:
  encoded: 00 01 00 00 00 00 00 00 00
  NounId:  hemera(00 01 00 00 00 00 00 00 00)

field atom with value p-1 (max valid):
  p-1 = 18446744069414584320 = 0xFFFFFFFF00000000
  little-endian of p-1: 00 00 00 00 FF FF FF FF
  encoded: 00 00 00 00 00 FF FF FF FF   (tag=0x00, then 8 bytes LE)
  NounId:  hemera(00 00 00 00 00 FF FF FF FF)

word atom with value 42:
  encoded: 01 2A 00 00 00 00 00 00 00
  NounId:  hemera(01 2A 00 00 00 00 00 00 00)

cell encoding

cell(atom(0), atom(1)):
  left  = NounId(atom(0))  = hemera(00 00 00 00 00 00 00 00 00)
  right = NounId(atom(1))  = hemera(00 01 00 00 00 00 00 00 00)
  encoded: 03 ‖ left ‖ right  (65 bytes)
  NounId:  hemera(03 ‖ left ‖ right)

invalid encodings (must be rejected)

tag byte 0x04 or higher           — unknown tag
field atom with value = p          — value out of range
field atom with value = p + 1      — value out of range
word atom with value = 2^32        — word out of range
cell with only 64 bytes total      — truncated
cell with 66 bytes total           — trailing bytes
hash atom with 3 elements          — truncated
empty payload (0 bytes)            — missing tag

Homonyms

hemera/specs/encoding

encoding specification Input Encoding (Bytes → Field Elements) Pack input bytes into 7-byte little-endian chunks. Each 7-byte chunk is zero-extended to 8 bytes and interpreted as a u64 in little-endian order, producing one Goldilocks field element. Why 7 bytes, not 8: The maximum 7-byte value is…

strata/genies/specs/encoding

encoding serialization formats for genies types. F_q element an element x in F_q is serialized as 64 bytes, little-endian. each x_i is a 64-bit limb stored in little-endian byte order. the value x must satisfy 0 <= x < q. **canonical form**: on deserialization, reject if the decoded value >= q.…

strata/nebu/specs/encoding

encoding specification how bytes map to field elements and back. this encoding is shared by every system that moves data across the byte↔field boundary. bytes to field elements input bytes are packed into field elements using 7-byte little-endian chunks. the maximum 7-byte value is 2⁵⁶ − 1 =…

strata/kuro/specs/encoding

encoding specification how bytes map to tower field elements and back. this encoding is shared by every system that moves data across the byte/field boundary. bytes to tower elements F₂⁸ (byte-aligned) each byte maps directly to one F₂⁸ element. the byte's bit layout matches the tower…

strata/jali/specs/encoding

encoding — RingElement serialization how ring elements are serialized to bytes and deserialized back. two forms exist: coefficient encoding and NTT encoding. coefficient form encoding a RingElement in coefficient form is n Goldilocks field elements. each coefficient is 8 bytes little-endian: total…

strata/trop/specs/encoding

F_p encoding how tropical elements and matrices are represented over the Goldilocks field F_p (p = 2⁶⁴ - 2³² + 1). element encoding a tropical element is a single F_p value with the following interpretation: | F_p value | tropical meaning | |-----------|-----------------| | 0 | tropical…