hemera/roadmap/semantic-hashing.md

prop: semantic hashing

status

draft

the core insight

hemera is an identification layer. not compression. not archival.

compression transforms bytes to reduce size. identification assigns every piece of data a permanent, verifiable address. deduplication is a consequence: if two chunks have the same address, they are the same chunk — store once, reference many times.

compression:      file.bin → file.bin.gz   (bytes change, size shrinks)
identification:   file.bin → particle_id   (bytes unchanged, address permanent)

dedup, sync efficiency, and graph ranking all follow from correct identification. they are not the goal — they are what happens when identity is done right.

the problem

hemera's current tree hash splits bytes into fixed 4 KB chunks. this is correct for opaque bytes. it fails for structured data:

boundary alignment problem: two .model files share identical weight tensors, but the tensor starts at a different byte offset because frontmatter sizes differ. 4 KB cuts fall at different positions → no deduplication.

semantic unit problem: for weights, the meaningful unit is a quantization block (18 bytes for q4), not a 4 KB window. cutting through the middle of a block is semantically wrong and produces no dedup.

two levels of structure

structured data has identity at two levels:

level 1: section boundaries    (from format declaration)
level 2: chunk boundaries      (from content, within each section)

level 1 is format-specific. level 2 is content-defined.

level 1: section tree

for .cyb containers, sections are the top-level semantic units:

1. parse frontmatter → ordered section list
2. section_root[i] = hemera::root_hash(section_bytes[i])
3. particle_id = left_balanced_tree([section_root[0], ..., section_root[n]])

sections taken in files declaration order. root_hash applies level 2 chunking internally.

two containers sharing an identical section share that section_root. storage deduplicates at section granularity automatically.

frontmatter is not a section — it declares order but is not hashed as a unit. identity is insensitive to frontmatter serialization.

level 2: element-aligned local minimum CDC

within each section, chunk boundaries are content-defined. different content types need different element granularity.

gear table

gear_table[256] = [hemera(0), hemera(1), ..., hemera(255)]

generated once from hemera. deterministic. frozen forever.

fingerprint per element

for an element of size S bytes:

fp(element) = gear_table[e[0]]
            ^ rotate_left(gear_table[e[1]], 8)
            ^ rotate_left(gear_table[e[2]], 16)
            ^ ...  (up to S bytes)

element boundaries are never crossed. raw bytes: S=1. q4 quant block: S=18.

chunk boundary = local minimum

slide window of W elements. chunk boundary where fp is the local minimum:

boundary at position i  iff  fp[i] = min(fp[i-W/2 .. i+W/2])

guaranteed ~W elements between boundaries. no min/max parameters. no MASK threshold.

by content type

content element_size W why
q4 weights 18 bytes (quant block) 256 boundaries between quant blocks
q8 weights 34 bytes (quant block) 128 same
u16 (embeddings) 2 bytes 2048 per weight
u32 (norms) 4 bytes 1024 per weight
ternary 8 bytes (32-value block) 512 per ternary block
text / TOML 1 byte 4096 standard CDC
raw bitmap bytes per pixel (3-4) 1024 per pixel
WAV audio 2-4 bytes (sample) 1024 per sample
compressed (JPEG, video) 1 byte 4096 no semantic structure, no dedup expected

element_size comes from the section's encoding declaration (tensors.toml). unknown binary: element_size = 1 → degrades to standard byte-level CDC.

dedup quality by case

case quality why
identical tensors across models excellent same bytes → same fp → same minima
zero-runs (abliteration) excellent constant fp → regular minima
identical vocab across fine-tunes good text CDC, chunks align
compressed video/audio none near-random bytes, expected
random bytes none correct, nothing to dedup

connections to adjacent fields

git: content-addressed objects. every blob, tree, commit is identified by hash. dedup across branches and history is automatic. we are applying this to binary data at sub-file granularity.

rsync: rolling hash finds matching blocks between source and destination. transfers only the diff. our CDC uses the same idea to find stable chunk boundaries — same content at different offsets still chunks identically.

bup / borg / restic: CDC-based backup systems. identical to our approach. bup uses Rabin fingerprinting; restic uses CDC with variable chunk sizes. years of production validation.

MinHash (Broder 1997): local minimum IS MinHash applied to chunking. MinHash finds the minimum hash value in a set — an unbiased estimator of Jaccard similarity. our local minimum CDC finds minimum fingerprint positions, which are maximally stable across similar content. the theoretical guarantee: two regions that are mostly identical will chunk identically with high probability.

information theory — Kolmogorov complexity: the Merkle DAG over CDC chunks IS a near-minimum description of the data. identical chunks referenced once, unique chunks stored once. approaches the information-theoretic lower bound for structured redundant data.

LZ compression: finds repeated substrings within one file. we find repeated chunks across files. different scope, same mathematical intuition: pattern = compressible = dedupable.

what compression theory contributes: entropy estimation. high-entropy content (compressed video, random bytes) → dedup ratio ≈ 0 regardless of chunk size. if entropy estimate > threshold → use 4 KB flat chunks, skip CDC overhead. this could be added as a fast pre-check.

what we do NOT borrow: approximate/perceptual matching (Shazam, SimHash, LSH for similarity search). these find "nearly identical" content. we require exact identity. approximate matching changes the meaning of particle_id.

spec targets

specs/tree.md — two new sections:

  • Semantic Units: section tree construction for .cyb containers
  • Element-Aligned Chunking: local minimum CDC, gear table, element_size by encoding

root/format.md in cyb repo — ## identity section (pending this prop acceptance).

Neighbours