semantic-hashing.md

prop: semantic hashing

status

draft

the core insight

hemera is an identification layer. not compression. not archival.

compression transforms bytes to reduce size. identification assigns every piece of data a permanent, verifiable address. deduplication is a consequence: if two chunks have the same address, they are the same chunk — store once, reference many times.

compression:      file.bin → file.bin.gz   (bytes change, size shrinks)
identification:   file.bin → particle_id   (bytes unchanged, address permanent)

dedup, sync efficiency, and graph ranking all follow from correct identification. they are not the goal — they are what happens when identity is done right.

the problem

hemera's current tree hash splits bytes into fixed 4 KB chunks. this is correct for opaque bytes. it fails for structured data:

boundary alignment problem: two .model files share identical weight tensors, but the tensor starts at a different byte offset because frontmatter sizes differ. 4 KB cuts fall at different positions → no deduplication.

semantic unit problem: for weights, the meaningful unit is a quantization block (18 bytes for q4), not a 4 KB window. cutting through the middle of a block is semantically wrong and produces no dedup.

two levels of structure

structured data has identity at two levels:

level 1: section boundaries    (from format declaration)
level 2: chunk boundaries      (from content, within each section)

level 1 is format-specific. level 2 is content-defined.

level 1: section tree

for .cyb containers, sections are the top-level semantic units:

1. parse frontmatter → ordered section list
2. section_root[i] = hemera::root_hash(section_bytes[i])
3. particle_id = left_balanced_tree([section_root[0], ..., section_root[n]])

sections taken in files declaration order. root_hash applies level 2 chunking internally.

two containers sharing an identical section share that section_root. storage deduplicates at section granularity automatically.

frontmatter is not a section — it declares order but is not hashed as a unit. identity is insensitive to frontmatter serialization.

level 2: element-aligned local minimum CDC

within each section, chunk boundaries are content-defined. different content types need different element granularity.

gear table

gear_table[256] = [hemera(0), hemera(1), ..., hemera(255)]

generated once from hemera. deterministic. frozen forever.

fingerprint per element

for an element of size S bytes:

fp(element) = gear_table[e[0]]
            ^ rotate_left(gear_table[e[1]], 8)
            ^ rotate_left(gear_table[e[2]], 16)
            ^ ...  (up to S bytes)

element boundaries are never crossed. raw bytes: S=1. q4 quant block: S=18.

chunk boundary = local minimum

slide window of W elements. chunk boundary where fp is the local minimum:

boundary at position i  iff  fp[i] = min(fp[i-W/2 .. i+W/2])

guaranteed ~W elements between boundaries. no min/max parameters. no MASK threshold.

by content type

content	element_size	W	why
q4 weights	18 bytes (quant block)	256	boundaries between quant blocks
q8 weights	34 bytes (quant block)	128	same
u16 (embeddings)	2 bytes	2048	per weight
u32 (norms)	4 bytes	1024	per weight
ternary	8 bytes (32-value block)	512	per ternary block
text / TOML	1 byte	4096	standard CDC
raw bitmap	bytes per pixel (3-4)	1024	per pixel
WAV audio	2-4 bytes (sample)	1024	per sample
compressed (JPEG, video)	1 byte	4096	no semantic structure, no dedup expected

element_size comes from the section's encoding declaration (tensors.toml). unknown binary: element_size = 1 → degrades to standard byte-level CDC.

dedup quality by case

case	quality	why
identical tensors across models	excellent	same bytes → same fp → same minima
zero-runs (abliteration)	excellent	constant fp → regular minima
identical vocab across fine-tunes	good	text CDC, chunks align
compressed video/audio	none	near-random bytes, expected
random bytes	none	correct, nothing to dedup

connections to adjacent fields

git: content-addressed objects. every blob, tree, commit is identified by hash. dedup across branches and history is automatic. we are applying this to binary data at sub-file granularity.

rsync: rolling hash finds matching blocks between source and destination. transfers only the diff. our CDC uses the same idea to find stable chunk boundaries — same content at different offsets still chunks identically.

bup / borg / restic: CDC-based backup systems. identical to our approach. bup uses Rabin fingerprinting; restic uses CDC with variable chunk sizes. years of production validation.

MinHash (Broder 1997): local minimum IS MinHash applied to chunking. MinHash finds the minimum hash value in a set — an unbiased estimator of Jaccard similarity. our local minimum CDC finds minimum fingerprint positions, which are maximally stable across similar content. the theoretical guarantee: two regions that are mostly identical will chunk identically with high probability.

information theory — Kolmogorov complexity: the Merkle DAG over CDC chunks IS a near-minimum description of the data. identical chunks referenced once, unique chunks stored once. approaches the information-theoretic lower bound for structured redundant data.

LZ compression: finds repeated substrings within one file. we find repeated chunks across files. different scope, same mathematical intuition: pattern = compressible = dedupable.

what compression theory contributes: entropy estimation. high-entropy content (compressed video, random bytes) → dedup ratio ≈ 0 regardless of chunk size. if entropy estimate > threshold → use 4 KB flat chunks, skip CDC overhead. this could be added as a fast pre-check.

what we do NOT borrow: approximate/perceptual matching (Shazam, SimHash, LSH for similarity search). these find "nearly identical" content. we require exact identity. approximate matching changes the meaning of particle_id.

spec targets

specs/tree.md — two new sections:

Semantic Units: section tree construction for .cyb containers
Element-Aligned Chunking: local minimum CDC, gear table, element_size by encoding

root/format.md in cyb repo — ## identity section (pending this prop acceptance).

Neighbours

Mentions

hemera/roadmap