honeycrisp/acpu/specs.md

acpu — API specification

pure Rust driver for Apple Silicon CPU compute. direct access to every useful compute unit in M1–M4: matrix coprocessor, vector engine, numeric extensions, atomics, memory system, performance counters. zero external dependencies — only inline assembly and system calls.

organs

six groups. each maps to a compute unit or system resource in silicon.

| organ | silicon | what it does |
|-------|---------|--------------|
| probe | sysctl, MRS | detect chip, enumerate capabilities |
| matrix | AMX coprocessor | 512-bit matrix fma, undocumented Apple hardware |
| vector | NEON (AdvSIMD) | 128-bit SIMD, 32 registers, the workhorse |
| numeric | FP16, BF16, DotProd, I8MM, FCMA, RDM | precision and format extensions inside NEON |
| sync | LSE, LRCPC, barriers | atomics, memory ordering, core affinity |
| pulse | PMU (Apple kpc) | cycle counters, cache misses, branch stats |

concepts

| concept | what it is |
|---------|------------|
| chip | identified Apple Silicon die: M1, M1 Pro, M2, M3, M4, etc. |
| features | runtime capability flags: which extensions this chip has |
| matrix | AMX context: set/clr bracket, owns matrix register file |
| xrow, yrow, zrow | typed AMX register handles (X and Y: 8 × 512-bit rows each; Z: 64 × 512-bit rows) |
| kernel | a compute operation dispatched to the best available unit |
| lane | NEON vector register (128-bit, v0–v31) |

probe — chip detection

runtime identification of Apple Silicon chip and its capabilities. all detection is zero-cost after first call (cached in static).

chip

| variant | CPU | AMX ver | extensions |
|---------|-----|---------|------------|
| M1 | Firestorm/Icestorm | 1 | NEON, FP16, DotProd, FCMA, RDM, LSE |
| M1 Pro/Max/Ultra | Firestorm/Icestorm | 1 | same as M1 |
| M2 | Avalanche/Blizzard | 2 | + BF16, I8MM, AMX bf16 ops |
| M2 Pro/Max/Ultra | Avalanche/Blizzard | 2 | same as M2 |
| M3 | Everest/Sawtooth | 3 | + AMX ldx/ldy, matint |
| M3 Pro/Max/Ultra | Everest/Sawtooth | 3 | same as M3 |
| M4 | Dorada/Brava | 4 | + AMX extrh/extrv, vecfp/vecint |
| M4 Pro/Max | Dorada/Brava | 4 | same as M4 |

features

| field | type | semantics |
|-------|------|-----------|
| chip | Chip | enum variant |
| amx_ver | u8 | 1–4, matches chip generation |
| has_fp16 | bool | FEAT_FP16, always true M1+ |
| has_bf16 | bool | FEAT_BF16, M2+ |
| has_dotprod | bool | FEAT_DotProd, always true M1+ |
| has_i8mm | bool | FEAT_I8MM, M2+ |
| has_fcma | bool | FEAT_FCMA, always true M1+ |
| has_rdm | bool | FEAT_RDM, always true M1+ |
| has_lse | bool | FEAT_LSE, always true M1+ |
| has_lrcpc | bool | FEAT_LRCPC, always true M1+ |
| p_cores | u8 | performance core count |
| e_cores | u8 | efficiency core count |
| l1_line | usize | L1 cache line size (bytes) |
| l2_size | usize | L2 cache size per cluster (bytes) |

| method | signature | semantics |
|--------|-----------|-----------|
| scan | () -> &'static Features | detect once, return cached reference |
| chip | () -> Chip | shortcut for scan().chip |
| has | (Feature) -> bool | query single feature flag |
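
a minimal sketch of the detect-once, cached-reference behaviour of scan(), assuming std::sync::OnceLock as the cache. the trimmed Features struct and the detect() stub are hypothetical stand-ins; the real probe would read sysctl values instead of returning constants.

```rust
use std::sync::OnceLock;

// Hypothetical, trimmed-down stand-in for the Features struct above.
#[derive(Debug)]
pub struct Features {
    pub amx_ver: u8,
    pub has_bf16: bool,
}

// Placeholder for the real sysctl-based probe; these values are made up.
fn detect() -> Features {
    Features { amx_ver: 2, has_bf16: true }
}

// Runs detect() on the first call only, then returns the same
// &'static reference on every later call.
pub fn scan() -> &'static Features {
    static CACHE: OnceLock<Features> = OnceLock::new();
    CACHE.get_or_init(detect)
}
```

after the first call, every scan() is a load of an already-initialised static, which is what makes repeated feature checks cheap.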

system mapping

| field | source |
|-------|--------|
| chip | sysctl hw.optional.arm.FEAT_* + machdep.cpu.brand_string |
| core counts | sysctl hw.perflevel0.physicalcpu / hw.perflevel1.physicalcpu |
| cache | sysctl hw.perflevel0.l1dcachesize / hw.perflevel0.l2cachesize |
| features | sysctl hw.optional.arm.FEAT_FP16 etc. or MRS ID_AA64ISAR0_EL1 |

matrix — AMX coprocessor

Apple Matrix coprocessor. undocumented. three register banks: X (8 × 512-bit), Y (8 × 512-bit), Z (64 × 512-bit). total: 5120 bytes of matrix register state (512 + 512 + 4096).

AMX instructions live in reserved ARM encoding space at 0x201000 + (opcode << 5) + operand. emitted via .word in inline assembly.

context lifecycle

| method | signature | semantics |
|--------|-----------|-----------|
| new | () -> Matrix | AMX_SET: enable coprocessor, zero registers |
| drop | automatic | AMX_CLR: disable coprocessor |

AMX_SET/AMX_CLR are bracketing instructions. all AMX operations must occur between set and clr. context is per-thread.
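
the set/clr bracket maps naturally onto a Rust guard type. the sketch below is an assumption about how Matrix could be shaped, not the crate's code: amx_set/amx_clr are stubs that flip a thread-local flag so the example runs on any machine, where the real versions would emit the AMX_SET and AMX_CLR words.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

thread_local! {
    // Stand-in for the per-thread hardware enable state.
    static AMX_ON: Cell<bool> = Cell::new(false);
}

fn amx_set() { AMX_ON.with(|f| f.set(true)); }  // real code: emit AMX_SET
fn amx_clr() { AMX_ON.with(|f| f.set(false)); } // real code: emit AMX_CLR

pub fn amx_enabled() -> bool { AMX_ON.with(|f| f.get()) }

pub struct Matrix {
    // A raw-pointer marker makes the guard !Send and !Sync,
    // so it cannot leave the thread that enabled the coprocessor.
    _not_send: PhantomData<*const ()>,
}

impl Matrix {
    pub fn new() -> Matrix {
        amx_set();
        Matrix { _not_send: PhantomData }
    }
}

impl Drop for Matrix {
    // Leaving scope closes the bracket, including on early return or panic unwind.
    fn drop(&mut self) { amx_clr(); }
}
```

tying AMX_CLR to Drop means no code path can forget to close the bracket.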

encoding

AMX_SET: NOP; NOP; NOP; .word (0x201000 + (17 << 5) + 0)
AMX_CLR: NOP; NOP; NOP; .word (0x201000 + (17 << 5) + 1)
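
the formula above can be checked with a few lines of arithmetic. this sketch only computes the 32-bit instruction words; amx_word and AMX_BASE are hypothetical names, and the 5-bit limits on both fields are an assumption read off the << 5 shift.

```rust
pub const AMX_BASE: u32 = 0x0020_1000;

// Build the instruction word for one AMX operation.
// `operand` is the GPR number, or the immediate 0/1 for set/clr.
pub const fn amx_word(opcode: u32, operand: u32) -> u32 {
    assert!(opcode < 32 && operand < 32);
    AMX_BASE + (opcode << 5) + operand
}

pub const AMX_SET: u32 = amx_word(17, 0);
pub const AMX_CLR: u32 = amx_word(17, 1);
```

because amx_word is a const fn, the words can be computed at compile time and handed to a .word directive through an inline-asm const operand.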

register model

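| type | bank | count | width | total |
|------|------|-------|-------|-------|
| XRow | X | 8 | 512 bits (64 bytes) | 512 bytes |
| YRow | Y | 8 | 512 bits (64 bytes) | 512 bytes |
| ZRow | Z | 64 | 512 bits (64 bytes) | 4096 bytes |

registers are typed wrappers: XRow(0) .. XRow(7), YRow(0) .. YRow(7), ZRow(0) .. ZRow(63).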

Z rows are accumulators — fma/mac results land here.

load / store

| method | signature | semantics |
|--------|-----------|-----------|
| ldx | (&self, row: XRow, ptr: *const u8) | load 64 bytes into X row |
| ldy | (&self, row: YRow, ptr: *const u8) | load 64 bytes into Y row |
| stx | (&self, row: XRow, ptr: *mut u8) | store 64 bytes from X row |
| sty | (&self, row: YRow, ptr: *mut u8) | store 64 bytes from Y row |
| stz | (&self, row: ZRow, ptr: *mut u8) | store 64 bytes from Z row |
| ldzi | (&self, row: ZRow, ptr: *const u8) | load into Z row (interleaved) |

encoding

operand GPR holds: ptr | (row_index << 56) for load/store.

| op | opcode |
|----|--------|
| ldx | 0 |
| ldy | 1 |
| stx | 2 |
| sty | 3 |
| ldz | 4 |
| stz | 5 |
| ldzi | 6 |
| stzi | 7 |
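
the load/store operand packing described above, as a small sketch. ldst_operand is a hypothetical helper name; the check that the pointer fits below bit 56 is an assumption that follows from the row index occupying the top byte.

```rust
// Pack a 64-bit AMX load/store operand: address in the low 56 bits,
// register row index starting at bit 56.
pub fn ldst_operand(ptr: *const u8, row: u8) -> u64 {
    let addr = ptr as u64;
    debug_assert!(addr >> 56 == 0, "address must fit in the low 56 bits");
    debug_assert!(row < 64, "row index out of range");
    addr | ((row as u64) << 56)
}
```

the resulting value is what goes into the general-purpose register named by the instruction word's operand field.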

compute

| method | signature | semantics |
|--------|-----------|-----------|
| fma32 | (&self, x: XRow, y: YRow, z: ZRow) | Z += X × Y (fp32, outer product) |
| fma16 | (&self, x: XRow, y: YRow, z: ZRow) | Z += X × Y (fp16 inputs, fp32 accum) |
| fmabf16 | (&self, x: XRow, y: YRow, z: ZRow) | Z += X × Y (bf16 inputs, fp32 accum), M2+ |
| mac16 | (&self, x: XRow, y: YRow, z: ZRow) | Z += X × Y (int16, int32 accum) |
| matint | (&self, x: XRow, y: YRow, z: ZRow) | integer matrix op, M3+ |
| vecfp | (&self, x: XRow, z: ZRow) | vector fp op, M4+ |
| vecint | (&self, x: XRow, z: ZRow) | vector int op, M4+ |
| extrh | (&self, z: ZRow, ptr: *mut u8) | extract horizontal from Z, M4+ |
| extrv | (&self, z: ZRow, ptr: *mut u8) | extract vertical from Z, M4+ |

encoding (compute)

operand GPR holds bit-packed config:

fma32:  bits[8:0]   = y_offset (bytes)
        bits[18:10] = x_offset (bytes)
        bits[25:20] = z_row
        bit[27]     = skip Z input (0 = accumulate, Z +=; 1 = overwrite, Z =)
opcode: 12 (fma32), 15 (fma16), 14 (mac16); fmabf16 has no opcode of its own in corsix/amx and is assumed to be a bf16 mode of fma16
M3+:    20 (matint)
M4+:    19 (vecfp), 18 (vecint), 8 (extrh), 9 (extrv)

exact bit layout per opcode documented in corsix/amx Instructions.md.

fma geometry

X row:  64 bytes = 16 × fp32  or  32 × fp16
Y row:  64 bytes = 16 × fp32  or  32 × fp16
Z row:  64 bytes = 16 × fp32

fma32:  Z[16×16] += outer_product(X[16], Y[16])
fma16:  Z[16×16] += outer_product(X[32→16], Y[32→16])  (fp16 in, fp32 accum)

one fma32 = 256 fp32 multiply-accumulates per instruction.
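
a scalar reference model of one fma32 step makes the 256 figure concrete. this is a plain-Rust sketch for testing against, not the hardware path; the orientation (Z row indexed by the Y element, Z column by the X element) is an assumption, and fma32_ref and Tile are hypothetical names.

```rust
pub type Tile = [[f32; 16]; 16];

// Z[j][i] += X[i] * Y[j] for all 16 × 16 pairs.
// Returns the number of multiply-accumulates performed.
pub fn fma32_ref(z: &mut Tile, x: &[f32; 16], y: &[f32; 16]) -> usize {
    let mut macs = 0;
    for (j, row) in z.iter_mut().enumerate() {
        for (i, cell) in row.iter_mut().enumerate() {
            *cell += x[i] * y[j];
            macs += 1;
        }
    }
    macs
}
```

a GEMM inner loop repeats this step once per k index, feeding a column of A and a row of B, so the tile accumulates a 16 × 16 block of C.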


vector — NEON (AdvSIMD)

ARM Advanced SIMD. 32 registers × 128 bits. documented, stable, available on all AArch64. accessed via core::arch::aarch64 intrinsics or inline assembly.

acpu exposes NEON through typed kernel functions, not raw intrinsics. the user calls acpu::exp(slice), not vfmaq_f32(a, b, c).

register file

v0–v31:  128-bit SIMD registers
  as f32:  4 lanes  (float32x4_t)
  as f16:  8 lanes  (float16x8_t)
  as f64:  2 lanes  (float64x2_t)
  as i32:  4 lanes  (int32x4_t)
  as i16:  8 lanes  (int16x8_t)
  as i8:  16 lanes  (int8x16_t)
  as u8:  16 lanes  (uint8x16_t)

operations used

not a full NEON reference — only the instruction classes acpu uses.

| class | instructions | what for |
|-------|--------------|----------|
| arithmetic | fadd, fmul, fmla, fmls, fneg, fabs | vector math kernels |
| compare | fcmeq, fcmgt, fcmge | branchless select |
| convert | fcvt (fp16↔fp32), fcvtn, fcvtl | precision conversion |
| load/store | ld1, st1 (1- to 4-register forms) | bulk data movement |
| shuffle | tbl, trn1, trn2, zip1, zip2, uzp1, uzp2 | transpose, interleave |
| reduce | faddp, fmaxnmv, fminnmv | horizontal sum, min, max |
| bitwise | and, orr, eor, bsl, bif, bit | mask ops, branchless logic |
| shift | shl, sshr, ushr, ssra | fixed-point, quantization |
| reciprocal | frecpe, frecps, frsqrte, frsqrts | fast 1/x, 1/√x (Newton steps) |
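
the reciprocal row relies on Newton refinement: frecpe gives a rough estimate and each frecps step computes 2 − d·x, which is multiplied back into the estimate. a scalar sketch of that iteration follows; recip_step and recip_refine are hypothetical names, and the starting estimate is passed in rather than produced by frecpe.

```rust
// What frecps computes for one lane: 2 - d * x.
fn recip_step(d: f32, x: f32) -> f32 {
    2.0 - d * x
}

// Refine an estimate of 1/d. Each step roughly squares the relative error.
pub fn recip_refine(d: f32, estimate: f32, steps: u32) -> f32 {
    let mut x = estimate;
    for _ in 0..steps {
        x *= recip_step(d, x);
    }
    x
}
```

with a 10% initial error, three steps already reach single-precision accuracy, which is why a short fixed number of refinements is enough.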

numeric — precision and format extensions

extensions that operate within the NEON register file but add specialized data formats. each is gated by a flag in features (see probe).

fp16 (FEAT_FP16) — M1+

native half-precision arithmetic in NEON. not just conversion — actual compute in 16-bit.

| operation | instructions | semantics |
|-----------|--------------|-----------|
| arithmetic | fadd(h), fmul(h), fmla(h) | fp16 vector math |
| convert | fcvt h↔s, fcvtn, fcvtl | fp16 ↔ fp32 bulk |
| compare | fcmeq(h), fcmgt(h) | fp16 comparison |

bulk conversion (NEON vectorized)

| function | signature | semantics |
|----------|-----------|-----------|
| cast_f16_f32 | (&mut [f32], &[u16]) | bulk fp16→fp32, 32/iter (4× unrolled) |
| cast_f32_f16 | (&mut [u16], &[f32]) | bulk fp32→fp16, 32/iter (4× unrolled) |
| fp16_to_f32 | (u16) -> f32 | scalar, single NEON fcvt |
| f32_to_fp16 | (f32) -> u16 | scalar, single NEON fcvt |
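
for testing the NEON paths it helps to have a bit-level reference for the fp16 → fp32 direction. the sketch below is a portable software model, assuming IEEE 754 binary16 layout (1 sign, 5 exponent, 10 fraction bits); it is not how the table's fcvt-based functions are implemented.

```rust
// Portable reference: decode an IEEE 754 binary16 bit pattern to f32.
pub fn fp16_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h as u32) >> 10) & 0x1F;
    let frac = (h as u32) & 0x03FF;
    let bits = match (exp, frac) {
        (0, 0) => sign, // signed zero
        // subnormal: value is frac × 2^-24, exactly representable in f32
        (0, f) => sign | (f as f32 * (1.0 / 16_777_216.0)).to_bits(),
        (0x1F, 0) => sign | 0x7F80_0000, // infinity
        (0x1F, f) => sign | 0x7FC0_0000 | (f << 13), // NaN, kept quiet
        // normal: rebias exponent from 15 to 127, widen fraction 10 → 23 bits
        (e, f) => sign | ((e + 112) << 23) | (f << 13),
    };
    f32::from_bits(bits)
}

// Bulk form with the same argument order as the table above.
pub fn cast_f16_f32(dst: &mut [f32], src: &[u16]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = fp16_to_f32(s);
    }
}
```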

bf16 (FEAT_BF16) — M2+

brain float: 8-bit exponent, 7-bit mantissa. same range as fp32, less precision. preferred for training.

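| operation | instructions | semantics |
|-----------|--------------|-----------|
| dot | bfdot | bf16 dot product → fp32 |
| matmul | bfmmla | bf16 2×4 × 4×2 → fp32 2×2 |
| convert | bfcvt (scalar), bfcvtn, bfcvtn2 (vector) | fp32 → bf16 (rounded narrowing) |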

bulk conversion

| function | signature | semantics |
|----------|-----------|-----------|
| cast_f32_bf16 | (&mut [u16], &[f32]) | bulk fp32→bf16 via bfcvt |
| cast_bf16_f32 | (&mut [f32], &[u16]) | bulk bf16→fp32 (shift left 16) |
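
because bf16 is the top half of an fp32 bit pattern, both directions can be modelled in a few lines. a portable sketch: widening is the shift from the table, narrowing here uses round-to-nearest-even, which is an assumption about the desired rounding rather than a statement about what bfcvt does. f32_to_bf16_rne is a hypothetical name.

```rust
// bf16 → fp32: place the 16 bits in the upper half, zero the rest.
pub fn bf16_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

// fp32 → bf16 with round-to-nearest-even on the 16 discarded bits.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // keep NaN a NaN: force the quiet bit so the fraction cannot become zero
        return ((bits >> 16) as u16) | 0x0040;
    }
    // add 0x7FFF, plus 1 more when the kept LSB is odd, then truncate
    let round = 0x7FFF + ((bits >> 16) & 1);
    ((bits + round) >> 16) as u16
}
```

a bare `>> 16` would truncate toward zero; the rounding term is what keeps accumulated error unbiased over long sums.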

dotprod (FEAT_DotProd) — M1+

INT8 dot product. four int8 × int8 multiplies accumulated into one int32. the foundation of quantized inference.

| operation | instructions | semantics |
|-----------|--------------|-----------|
| signed | sdot | 4 × (i8 × i8) → i32, accumulated |
| unsigned | udot | 4 × (u8 × u8) → u32, accumulated |
| mixed | usdot | 4 × (u8 × i8) → i32, M2+ only (part of FEAT_I8MM) |
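
a scalar model of what one 128-bit sdot does: each of the four i32 lanes accumulates the dot product of its own group of four i8 pairs. the function name mirrors the instruction, but this is a reference sketch, not a wrapper around it.

```rust
// Reference model of a 128-bit sdot:
// lane n accumulates a[4n..4n+4] · b[4n..4n+4].
pub fn sdot(mut acc: [i32; 4], a: &[i8; 16], b: &[i8; 16]) -> [i32; 4] {
    for lane in 0..4 {
        for k in 0..4 {
            let i = lane * 4 + k;
            acc[lane] = acc[lane].wrapping_add(a[i] as i32 * b[i] as i32);
        }
    }
    acc
}
```

that is 16 multiply-accumulates per instruction, with the widening to i32 keeping the products from overflowing.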

i8mm (FEAT_I8MM) — M2+

INT8 matrix multiply. 2×8 × 8×2 → 2×2 int32: 32 multiply-accumulates per instruction, twice what one sdot performs.

| operation | instructions | semantics |
|-----------|--------------|-----------|
| signed | smmla | i8[2×8] × i8[8×2] → i32[2×2] |
| unsigned | ummla | u8[2×8] × u8[8×2] → u32[2×2] |
| mixed | usmmla | u8[2×8] × i8[8×2] → i32[2×2] |
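
a scalar model of smmla, useful for checking packed layouts. it assumes both source registers hold two rows of eight i8 values, so the second operand is effectively the 8×2 matrix stored transposed; the function name mirrors the instruction but this is only a reference sketch.

```rust
// Reference model of smmla: acc (2×2, row-major) += A (2×8) × Bᵀ,
// where `a` and `b` each hold two rows of eight i8 values.
pub fn smmla(mut acc: [i32; 4], a: &[i8; 16], b: &[i8; 16]) -> [i32; 4] {
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..8 {
                let prod = a[i * 8 + k] as i32 * b[j * 8 + k] as i32;
                acc[i * 2 + j] = acc[i * 2 + j].wrapping_add(prod);
            }
        }
    }
    acc
}
```

the transposed storage of the second operand is why int8 GEMM packing routines lay out B in 8-deep column pairs.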

fcma (FEAT_FCMA) — M1+

complex number arithmetic in NEON registers. pairs of floats treated as (real, imag).

| operation | instructions | semantics |
|-----------|--------------|-----------|
| rotate | fcadd | complex add with 90° or 270° rotation |
| mul-acc | fcmla | complex multiply-accumulate (0°/90°/180°/270°) |

used for: FFT, complex-valued attention, signal processing.
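
a full complex multiply-accumulate takes two fcmla steps, one with rotation 0° and one with 90°. the scalar sketch below models a single (real, imag) pair; C32 and the function names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct C32 {
    pub re: f32,
    pub im: f32,
}

// fcmla, rotation 0°: uses only the real part of `a`.
pub fn fcmla_rot0(acc: C32, a: C32, b: C32) -> C32 {
    C32 { re: acc.re + a.re * b.re, im: acc.im + a.re * b.im }
}

// fcmla, rotation 90°: uses only the imaginary part of `a`.
pub fn fcmla_rot90(acc: C32, a: C32, b: C32) -> C32 {
    C32 { re: acc.re - a.im * b.im, im: acc.im + a.im * b.re }
}

// acc + a × b, built from the two half-steps.
pub fn cmul_acc(acc: C32, a: C32, b: C32) -> C32 {
    fcmla_rot90(fcmla_rot0(acc, a, b), a, b)
}
```

for example, (1 + 2i)(3 + 4i) accumulates to −5 + 10i after both steps.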

rdm (FEAT_RDM) — M1+

rounding doubling multiply with saturation. fixed-point DSP where overflow clamps instead of wrapping.

| operation | instructions | semantics |
|-----------|--------------|-----------|
| mul-add | sqrdmlah | sat(a + round(b×c >> shift)) |
| mul-sub | sqrdmlsh | sat(a - round(b×c >> shift)) |

used for: fixed-point quantization, audio DSP.
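
a scalar model of sqrdmlah on 16-bit lanes, assuming Q15 fixed-point operands: the product is doubled, a rounding constant is added, the high half is kept, and the result saturates. sqrdmlah_i16 is a hypothetical name for this reference sketch.

```rust
// Reference model of sqrdmlah for one i16 lane:
// sat16(((a << 16) + 2*b*c + (1 << 15)) >> 16)
pub fn sqrdmlah_i16(a: i16, b: i16, c: i16) -> i16 {
    let wide = ((a as i64) << 16) + 2 * (b as i64) * (c as i64) + (1 << 15);
    (wide >> 16).clamp(i16::MIN as i64, i16::MAX as i64) as i16
}
```

in Q15 terms, 0.5 × 0.5 gives 0.25 (0x4000 × 0x4000 → 0x2000), and a sum that would exceed the range clamps to 32767 instead of wrapping negative.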


sync — atomics, ordering, affinity

concurrency primitives for parallel compute across P-cores and E-cores. not for general concurrent programming — specifically for multi-threaded GEMM, producer-consumer pipelines, and work-stealing.

lse atomics (FEAT_LSE) — M1+

hardware atomic operations. single instruction, no LL/SC loop. essential for lock-free thread pool in parallel GEMM.

| operation | instructions | semantics |
|-----------|--------------|-----------|
| fetch-add | ldadd, ldaddal | atomic add, return old value |
| compare-swap | cas, casal | atomic CAS, single instruction |
| swap | swp, swpal | atomic exchange |
| clear | ldclr | atomic bit clear |
| set | ldset | atomic bit set |

Rust mapping

LSE atomics are used automatically by LLVM on AArch64 when -Ctarget-feature=+lse is set (default on macOS). std::sync::atomic operations compile to single LSE instructions. acpu does not wrap these — Rust already does the right thing.

acpu exposes LSE through the sync module only where the standard library is insufficient (e.g. custom fence patterns, spin-wait with WFE).
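
since the standard atomics already lower to LSE, a lock-free tile queue for parallel GEMM needs nothing beyond std. the sketch below is one possible shape, with claim_tiles as a hypothetical name: workers pull tile indices from a shared counter with fetch_add until the range is exhausted.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Each worker claims the next unclaimed tile index with a single fetch_add.
// Returns all claimed indices, sorted, so callers can check coverage.
pub fn claim_tiles(total: usize, workers: usize) -> Vec<usize> {
    let next = AtomicUsize::new(0);
    let mut claimed: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                s.spawn(|| {
                    let mut mine = Vec::new();
                    loop {
                        let t = next.fetch_add(1, Ordering::Relaxed);
                        if t >= total {
                            break;
                        }
                        mine.push(t); // real code would compute tile t here
                    }
                    mine
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    claimed.sort_unstable();
    claimed
}
```

Relaxed ordering is enough for the counter itself because each index is handed out exactly once; publishing the computed results still needs a release/acquire pair or the join shown here.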

lrcpc (FEAT_LRCPC) — M1+

load-acquire with weaker ordering. faster than full acquire barrier for producer-consumer patterns.

| operation | instructions | semantics |
|-----------|--------------|-----------|
| load-acquire | ldapr | load with acquire semantics, weaker than ldar |

used between pack-thread and compute-thread in parallel GEMM.

memory barriers

| function | instruction | semantics |
|----------|-------------|-----------|
| barrier | DMB ISH | data memory barrier, inner shareable |
| fence | DSB ISH | data sync barrier, inner shareable |
| isb | ISB | instruction sync barrier |
| wait | WFE | wait for event (low-power spin) |
| wake | SEV | signal event (wake spinning core) |

wait/wake pair: spin-waiting threads use WFE to sleep until producer calls SEV. saves power and thermal headroom during parallel GEMM synchronization.
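
the producer/consumer handshake can be sketched portably with a release store and an acquire load. here std::hint::spin_loop() stands in for WFE and the comment marks where SEV would go; those substitutions, and the handoff name, are assumptions made so the example runs on any target.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

// Producer publishes `value`, consumer spins until it is visible.
pub fn handoff(value: u32) -> u32 {
    let ready = AtomicBool::new(false);
    let slot = AtomicU32::new(0);
    thread::scope(|s| {
        let consumer = s.spawn(|| {
            // acquire load pairs with the release store below
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop(); // stand-in for WFE
            }
            slot.load(Ordering::Relaxed)
        });
        slot.store(value, Ordering::Relaxed);
        ready.store(true, Ordering::Release); // real code would follow with SEV
        consumer.join().unwrap()
    })
}
```

the release/acquire pair is what guarantees the consumer sees the data written before the flag, independent of how the waiting itself is implemented.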

core affinity

pin threads to performance or efficiency cores.

| function | signature | semantics |
|----------|-----------|-----------|
| pin_p_core | () | pin current thread to P-core cluster |
| pin_e_core | () | pin current thread to E-core cluster |
| pin_any | () | remove pinning, allow migration |

system mapping

| function | system call |
|----------|-------------|
| pin_p_core | pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0) |
| pin_e_core | pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0) |
| pin_any | pthread_set_qos_class_self_np(QOS_CLASS_DEFAULT, 0) |

note: macOS QoS classes influence core scheduling but do not guarantee hard affinity. USER_INTERACTIVE strongly prefers P-cores. BACKGROUND strongly prefers E-cores. for inference, pin all compute threads to P-cores.

prefetch

| function | signature | semantics |
|----------|-----------|-----------|
| prefetch_l1 | (ptr: *const u8) | PRFM PLDL1KEEP, prefetch into L1 |
| prefetch_l2 | (ptr: *const u8) | PRFM PLDL2KEEP, prefetch into L2 |
| prefetch_l1_write | (ptr: *mut u8) | PRFM PSTL1KEEP, prefetch for write |

used in GEMM packing loops to hide memory latency.


pulse — performance counters

Apple Performance Monitoring Unit. undocumented kpc_* API in /usr/lib/libkperf.dylib. gives cycle-accurate measurements without Instruments or dtrace.

accessed via dlopen at runtime (same pattern as rane uses for AppleNeuralEngine.framework).

counters

| counter | what it counts |
|---------|----------------|
| cycles | CPU cycles (fixed counter 0) |
| instructions | instructions retired (fixed counter 1) |
| branches | branches retired |
| branch_misses | branch mispredictions |
| l1d_misses | L1 data cache misses |
| l1i_misses | L1 instruction cache misses |
| l2_misses | L2 cache misses |

context

| method | signature | semantics |
|--------|-----------|-----------|
| new | `(counters: &[Counter]) -> Result<Counters>` | configure PMU via kpc_set_config |
| start | `(&mut self)` | kpc_set_counting + kpc_set_thread_counting |
| read | `(&self) -> Snapshot` | kpc_get_thread_counters64 |
| stop | `(&mut self)` | disable counting |
| elapsed | `(&self, a: &Snapshot, b: &Snapshot) -> Counts` | delta between two snapshots |

system mapping

| method | symbol (libkperf.dylib) |
|--------|-------------------------|
| configure | kpc_set_config |
| start | kpc_set_counting(KPC_CLASS_FIXED \| KPC_CLASS_CONFIGURABLE) |
| read | kpc_get_thread_counters64 |
| stop | kpc_set_counting(0) |

usage pattern

let mut pulse = Counters::new(&[Counter::Cycles, Counter::L1dMisses])?;
pulse.start();
let a = pulse.read();
// ... compute ...
let b = pulse.read();
let counts = pulse.elapsed(&a, &b);
println!("cycles: {}, L1 misses: {}", counts.cycles, counts.l1d_misses);
pulse.stop();
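
the elapsed step is just a per-counter subtraction of two snapshots. a minimal sketch, assuming a Snapshot holds raw u64 counter values; the two-field struct and the free function are hypothetical simplifications of the Counters API above.

```rust
// Hypothetical two-counter snapshot; the real type would hold every
// counter that was configured.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Snapshot {
    pub cycles: u64,
    pub l1d_misses: u64,
}

pub type Counts = Snapshot;

// Delta b - a. wrapping_sub keeps the result correct even if a raw
// hardware counter wrapped around between the two reads.
pub fn elapsed(a: &Snapshot, b: &Snapshot) -> Counts {
    Counts {
        cycles: b.cycles.wrapping_sub(a.cycles),
        l1d_misses: b.l1d_misses.wrapping_sub(a.l1d_misses),
    }
}
```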

kernels — high-level compute operations

safe, auto-dispatching operations. each kernel picks the fastest available path based on features:

AMX  →  NEON+extension  →  plain NEON  →  scalar fallback

dispatch is resolved at first call and cached.
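
resolve-once dispatch can be sketched as a function pointer stored in a OnceLock. everything here is illustrative: has_fast_path() is a hypothetical stand-in for a features check, and the two sum variants stand in for a vectorized path and a scalar fallback.

```rust
use std::sync::OnceLock;

type SumFn = fn(&[f32]) -> f32;

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Four independent accumulators, mimicking a 4-lane vector sum.
fn sum_unrolled(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = xs.chunks_exact(4);
    let tail = chunks.remainder();
    for c in chunks {
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}

// Hypothetical capability check; real code would consult the probe.
fn has_fast_path() -> bool {
    cfg!(target_arch = "aarch64")
}

fn select() -> SumFn {
    if has_fast_path() { sum_unrolled } else { sum_scalar }
}

// The choice is made on the first call; later calls are one indirect call.
pub fn sum(xs: &[f32]) -> f32 {
    static IMPL: OnceLock<SumFn> = OnceLock::new();
    (IMPL.get_or_init(select))(xs)
}
```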

gemm

| function | signature | semantics |
|----------|-----------|-----------|
| matmul_f32 | (a, b, c, m, n, k) | C[m×n] += A[m×k] × B[k×n], fp32 |
| matmul_f16 | (a, b, c, m, n, k) | C[m×n] += A[m×k] × B[k×n], fp16 in, fp32 accum |
| matmul_bf16 | (a, b, c, m, n, k) | C[m×n] += A[m×k] × B[k×n], bf16 in, fp32 accum, M2+ |
| matmul_i8 | (a, b, c, m, n, k, scale, zero) | int8 quantized matmul → fp32 |

dispatch:

  • matmul_f32: AMX fma32 → NEON fmla
  • matmul_f16: AMX fma16 → NEON FP16 fmla
  • matmul_bf16: AMX fmabf16 → NEON bfmmla → NEON bfdot
  • matmul_i8: NEON I8MM smmla → NEON DotProd sdot → scalar
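
whichever path is dispatched, the result should match a plain triple loop. a scalar reference for matmul_f32 follows, assuming row-major storage for all three matrices (the layout is not stated above) and the accumulate-into-C semantics from the table; matmul_f32_ref is a hypothetical name.

```rust
// Reference: C[m×n] += A[m×k] × B[k×n], all row-major.
pub fn matmul_f32_ref(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            // inner loop walks a row of B and a row of C contiguously
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
}
```

note that C is accumulated into, so callers who want C = A × B must zero it first.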

math (elementwise, vectorized)

all take fp32 slices. kernels with a single &mut [f32] argument work in-place; normalize and rotate write to a separate out buffer. NEON vectorized, 4-wide minimum, tail-masked.

| function | signature | semantics |
|----------|-----------|-----------|
| exp | (&mut [f32]) | e^x, polynomial approximation |
| log | (&mut [f32]) | ln(x) |
| tanh | (&mut [f32]) | tanh(x) |
| sigmoid | (&mut [f32]) | 1/(1+e^-x) |
| gelu | (&mut [f32]) | 0.5x(1+tanh(√(2/π)(x+0.044715x³))) |
| silu | (&mut [f32]) | x × sigmoid(x) |
| softmax | (&mut [f32]) | exp(x)/Σexp(x) |
| normalize | (out, x, weight, eps) | x × weight / √(mean(x²)+ε) |
| rotate | (out, x, freqs, pos) | rotary positional embedding |
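
scalar references for two of the formulas above, useful as test oracles for the vectorized kernels. softmax subtracts the maximum before exponentiating, which is an added numerical-stability step not spelled out in the table; the parameter types of normalize are assumptions.

```rust
// In-place softmax: exp(x - max) / Σ exp(x - max).
// Subtracting the max leaves the result unchanged but avoids overflow.
pub fn softmax(xs: &mut [f32]) {
    if xs.is_empty() {
        return;
    }
    let max = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for x in xs.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in xs.iter_mut() {
        *x /= sum;
    }
}

// out = x × weight / √(mean(x²) + eps)
pub fn normalize(out: &mut [f32], x: &[f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    for ((o, &xi), &w) in out.iter_mut().zip(x).zip(weight) {
        *o = xi * w * inv;
    }
}
```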

convert (bulk, vectorized)

| function | signature | semantics |
|----------|-----------|-----------|
| cast_f16_f32 | (&mut [f32], &[u16]) | fp16 → fp32, NEON 32/iter |
| cast_f32_f16 | (&mut [u16], &[f32]) | fp32 → fp16, NEON 32/iter |
| cast_bf16_f32 | (&mut [f32], &[u16]) | bf16 → fp32, shift |
| cast_f32_bf16 | (&mut [u16], &[f32]) | fp32 → bf16, NEON bfcvt |
| cast_f32_i8 | (&mut [i8], &[f32], scale) | quantize fp32 → int8 |
| cast_i8_f32 | (&mut [f32], &[i8], scale, zero) | dequantize int8 → fp32 |
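
the quantize/dequantize pair can be sketched in scalar form. the convention used here is an assumption, since the table does not define it: q = clamp(round(x / scale)) for quantization and x = (q − zero) × scale for dequantization, with zero typed as i8.

```rust
// fp32 → int8: divide by scale, round to nearest, saturate to [-128, 127].
pub fn cast_f32_i8(dst: &mut [i8], src: &[f32], scale: f32) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = (s / scale).round().clamp(-128.0, 127.0) as i8;
    }
}

// int8 → fp32: remove the zero point, then multiply by scale.
pub fn cast_i8_f32(dst: &mut [f32], src: &[i8], scale: f32, zero: i8) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = (s as i32 - zero as i32) as f32 * scale;
    }
}
```

f32::round rounds halves away from zero, so a hardware path that rounds ties to even may differ by one step on exact ties.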

reduce

| function | signature | semantics |
|----------|-----------|-----------|
| sum | (&[f32]) -> f32 | NEON pairwise add |
| max | (&[f32]) -> f32 | NEON fmaxnmv |
| min | (&[f32]) -> f32 | NEON fminnmv |
| dot | (&[f32], &[f32]) -> f32 | NEON fmla + reduce |
| length | (&[f32]) -> f32 | √Σx² |

errors

ChipNotSupported         not Apple Silicon
AmxSetFailed             AMX_SET instruction failed
AmxOpFailed(String)      AMX operation error
PmuNotAvailable          libkperf.dylib not found or kpc denied
PmuConfigFailed(String)  counter configuration rejected
FeatureNotAvailable(Feature)  required extension absent on this chip
AffinityFailed(String)   QoS class change failed

execution model

  • AMX is per-thread. each thread needs its own Matrix
  • AMX set/clr are cheap (~1 cycle). open a context per GEMM call, not per thread lifetime
  • only the low 64 bits of v8–v15 are callee-saved (AAPCS64). inline asm must declare every vector register it clobbers
  • all kernels are synchronous. no async dispatch model (this is CPU, not GPU)
  • parallel GEMM: partition M dimension across threads, each thread gets own Matrix
  • sync between threads: WFE/SEV + LSE atomics (no mutexes in hot path)
  • memory: all buffers are caller-owned slices. acpu allocates nothing on heap
  • all public functions are #[inline] or dispatch through cached function pointers
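
the "partition M dimension across threads" rule can be sketched as a helper that splits the row range into near-equal contiguous chunks. partition_rows is a hypothetical name; each returned range would be handed to one thread that opens its own Matrix.

```rust
use std::ops::Range;

// Split rows 0..m into `threads` contiguous ranges whose lengths
// differ by at most one; the first `m % threads` ranges get the extra row.
pub fn partition_rows(m: usize, threads: usize) -> Vec<Range<usize>> {
    let threads = threads.max(1);
    let base = m / threads;
    let extra = m % threads;
    let mut start = 0;
    (0..threads)
        .map(|t| {
            let len = base + usize::from(t < extra);
            let r = start..start + len;
            start += len;
            r
        })
        .collect()
}
```

because the ranges are disjoint, each thread writes its own rows of C and the hot path needs no locking.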

driver stack

acpu crate
  → inline asm (.word for AMX, intrinsics for NEON)
  → sysctl (chip detection)
  → libkperf.dylib dlopen (PMU counters)
  → pthread (core affinity)
  → no frameworks, no ObjC, no C compiler

module map

src/
  lib.rs              pub API re-exports, CpuError
  probe.rs            Chip, Features, Feature, scan()
  matrix/
    mod.rs            Matrix lifecycle (set/clr)
    ops.rs            load/store, fma32/fma16/fmabf16/mac16
    regs.rs           XRow, YRow, ZRow typed wrappers
    asm.rs            raw .word encoding macros
  vector/
    mod.rs            NEON kernel dispatch
    math.rs           exp, log, tanh, sigmoid, gelu, silu
    reduce.rs         sum, max, min, dot, length
    softmax.rs        softmax, normalize
    rope.rs           rotary positional embedding
  numeric/
    fp16.rs           FP16 arithmetic + bulk convert
    bf16.rs           BF16 ops + bulk convert
    quant.rs          DotProd, I8MM, quantize/dequantize
    complex.rs        FCMA complex mul-acc
  sync/
    mod.rs            barriers, wait/wake
    affinity.rs       pin_p_core, pin_e_core
    prefetch.rs       PRFM wrappers
  pulse/
    mod.rs            Counters, Counter, Snapshot
    ffi.rs            dlopen libkperf, kpc_* symbols
  gemm.rs             matmul_f32, matmul_f16, matmul_bf16, matmul_i8 (auto-dispatch)
  convert.rs          cast_* bulk conversion (re-exports from numeric)
  probe/
    main.rs           acpu_probe binary — exercise every organ

file limit: 500 lines per source file. split if exceeded.


license

cyber license: don't trust. don't fear. don't beg.
