cyb-model: four manifest models, two paths, one truth.

today the four mvp manifest models — qwen3-0.6b, qwen2.5-coder-1.5b, qwen2.5-coder-14b-abl, gemma-4-31b — pass end-to-end inference through both the curated decoder forward and the graph IR executor, on the canonical cyb-model format. same prompt, same tokens, same output, two independent codepaths.

why two paths matter

a model file is supposed to be a contract between disk and runtime. if only one reader can interpret it, the contract is invisible — every bug is silent. with two readers walking the same bytes and producing the same logits, the format is verified against the code, and the code against the format.
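the check itself is plain differential testing. a minimal sketch, with hypothetical names; the real readers are the curated forward and the graph executor:

```rust
// two independent decoders read the same bytes; the logits must agree.
// `curated` and `graph` stand in for the two codepaths (hypothetical data).

fn logits_match(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() <= tol)
}

fn main() {
    let curated = vec![1.25_f32, -0.5, 3.0];
    let graph = vec![1.25_f32, -0.5, 3.0];
    assert!(logits_match(&curated, &graph, 1e-6));
}
```

same bytes in, same logits out, or the run fails loudly instead of silently.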

qwen3-0.6b           cpu: " Paris. The capital of Italy is Rome…"
                     graph: same.
qwen2.5-coder-1.5b   cpu: "match n { 0 => 0, 1 => 1, _ => fibonacci(n-1)+fibonacci(n-2) } …"
                     graph: same.
qwen2.5-coder-14b    cpu: same prefix.
                     graph: same.
gemma-4-31b          cpu: " Paris."
                     graph: " Paris. The capital"

byte-aligned encoder/decoder pairs. five canonical encodings (u32, u16, q8, q4, ternary), all integer fixed-point, no floats on disk. eps is stored as its inverse, sampling parameters as per-mille. every number an integer, every encoding a small kernel.
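the inverse and per-mille conventions can be sketched like this (function names are assumptions, not the actual header layout):

```rust
// integer-only header fields: eps is stored as its inverse (u32),
// sampling parameters as per-mille (u32). floats exist only at runtime.

fn eps_from_inverse(inv_eps: u32) -> f32 {
    1.0 / inv_eps as f32
}

fn per_mille_to_f32(pm: u32) -> f32 {
    pm as f32 / 1000.0
}

fn main() {
    // inv_eps = 1_000_000 on disk means eps = 1e-6 at runtime
    assert!((eps_from_inverse(1_000_000) - 1e-6).abs() < 1e-12);
    // temperature 800 per-mille means 0.8
    assert!((per_mille_to_f32(800) - 0.8).abs() < 1e-6);
}
```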

the gemma-4 wall, and breaking it

gemma-4-31b was the last model holding out. two walls fell in sequence.

first wall — architecture. gemma-4 is llamastyle+:

  • per-layer head_dim/kv_heads/rope switching (sliding vs full-attention layers)
  • pre-SDPA q-scale to make unity attention work through a 1/sqrt(hd) SDPA kernel
  • post-attn / post-ffw norms, layer output scale
  • final-logit softcap, scaled embeddings

the graph template now emits all of this, conditional on flags derived from the family profile. the tied-embedding alias is zero-copy via Arc<Vec<u8>>; the gemma extras (_embed_scale, _v_norm_ones_<hd>, _q_scale_<hd>, _softcap) are synthesised into the weight map at load time.
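the q-scale trick deserves a sketch: the SDPA kernel unconditionally applies 1/sqrt(hd), so scaling q by sqrt(hd) beforehand cancels it and leaves unity attention. a sketch under that assumption (names hypothetical):

```rust
// the kernel hard-codes score *= 1/sqrt(hd); to get unity attention,
// pre-scale q by sqrt(hd) so the two factors cancel inside the kernel.

fn q_scale_for_unity(head_dim: usize) -> f32 {
    (head_dim as f32).sqrt()
}

fn main() {
    let hd = 128;
    let kernel_scale = 1.0 / (hd as f32).sqrt();
    let effective = q_scale_for_unity(hd) * kernel_scale;
    // pre-scale × kernel scale ≈ 1.0: attention runs at unity
    assert!((effective - 1.0).abs() < 1e-6);
}
```

any other target scale folds in the same way, which is why it can live as a synthesised `_q_scale_<hd>` tensor rather than a special-cased kernel.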

second wall — RAM. the graph executor was a "correctness reference": dequantise everything to f32 at load. for 31B q8 weights that's ~120 GB of host memory (31e9 params × 4 bytes each), well past what the laptop has. the fix was making the executor quant-aware:

  • block-quantised weights stay in native byte form. Op::Matmul checks the weight dtype and routes through Backend::quant_matmul when w is quantised.
  • Op::TokenEmbed dequantises one row per token on demand.
  • non-block tensors (norms, biases, scalars stored as u32/u16) still dequant once at load — they're tiny and consumed by f32-only ops.

result: graph mode now uses about the same RAM as the curated forward. 31B fits.

the meaning

a manifest is a small, finite list. four models. two paths. five encodings. one file.

the smallness is the point. canonical formats don't grow by accumulating cases — they grow by proving that one minimal definition handles everything we throw at it. when something doesn't fit, the answer isn't "add a sixth encoding" — the answer is to look at what the model is doing, find the abstraction that absorbs it, and check that the small set still works.

today's milestone is that the small set still works. four models from 0.6B to 31B, three families (qwen2, qwen3, gemma-4), two attention shapes, two paths through the runtime. one truth on disk.

that's the contract.

next

  • gpu kernels (wgpu, honeycrisp) for canonical q4/q8/u16 — currently fall back to cpu reference.
  • spec drift: run/specs/format.md keeps wandering from cyb-model as we touch the runtime; collapse the runtime spec into a pointer + runtime-only sections (mmap, scan procedure).
  • mi import name / source / parameters are still HF snapshot hashes / placeholders — clean up so the frontmatter actually says what's inside.
  • batch prefill: today every forward consumes one token; bulk-prefill would help long-context loads.