specs.md

unimem: Zero-Copy Memory Driver for Apple Silicon

Goal

Single pinned buffer visible to CPU, GPU, AMX, and ANE — zero copies between pipeline stages. The memory layer for inference on unified memory.

file I/O → IOSurface → CPU / AMX / ANE / GPU → result
               ↑
      pinned, shared, never copied

v1 adds NVMe DMA via DEXT — full zero-copy from disk to compute.

NVMe DMA → IOSurface(IOVA) → AMX → ANE → result
                 ↑
        never moves, never copies

Why this exists

Every inference framework on Apple Silicon leaks performance through copies:

malloc gives virtual addresses — hardware can't share them directly
Metal buffers copy on sync between CPU/GPU
CoreML copies between pipeline stages internally
Model loading: NVMe → kernel buf → userspace buf → framework buf

The M1/M2/M3/M4 unified memory makes this unnecessary. CPU, GPU, AMX, ANE share the same physical DRAM. One buffer, visible to everyone. Nobody has built this properly yet.

Architecture: v0 and v1

v0 — IOSurface-backed (implement now)

IOSurface provides pinned, kernel-managed shared memory accessible to CPU, GPU, AMX, and ANE from userspace. Proven by rane (ANE crate). No special signing, no kernel components.

                    IOSurface (pinned, DART-registered)
                   /     |      \         \
                CPU    AMX     GPU       ANE
              (VA)   (VA)   (Metal)  (private API)

One copy on model load (file I/O → IOSurface). Zero copies during inference.

v1 — DEXT + IOSurface (after Apple Developer enrollment)

DriverKit extension reads DART IOVA from IOSurface-backed memory via CreateMemoryDescriptorFromClient + IODMACommand::PrepareForDMA. NVMe uses IOVA for direct DMA. Full zero-copy from disk to compute.

NVMe ──DMA(IOVA)──→ IOSurface ──→ CPU/AMX/ANE/GPU
                         ↑
               DEXT reads IOVA via IODMACommand

Crate structure (v0)

unimem/
  src/
    lib.rs          ← public API, re-exports, MemError
    ffi.rs          ← raw FFI: IOSurface, CoreFoundation symbols + types
    block.rs        ← Block: pinned IOSurface, locked at creation
    tape.rs         ← Tape: Turing tape bump allocator
    grid.rs         ← Grid/Cell: fixed-size cell grid over Tape
    layout.rs       ← Layout: three-tape inference layout

Layer 1 — IOSurface allocation (`block`)

Purpose

Allocate pinned shared memory via IOSurface.framework. Buffer is visible to CPU, GPU (via Metal wrap), AMX (via CPU pointer), and ANE (via private API). Pages are wired — never swapped, never moved.

Mechanism

Raw FFI to IOSurface.framework and CoreFoundation. Same pattern as rane — no objc2, no wrapper crates.

IOSurface FFI functions:

IOSurfaceCreate(properties) → IOSurfaceRef
IOSurfaceLock(block, options, seed) → lock for CPU access
IOSurfaceUnlock(block, options, seed) → release lock
IOSurfaceGetBaseAddress(block) → virtual pointer
IOSurfaceGetAllocSize(block) → allocation size
IOSurfaceGetID(block) → global block ID (for sharing)

CoreFoundation FFI functions:

CFDictionaryCreateMutable(alloc, capacity, keyCallbacks, valueCallbacks) → dict
CFDictionarySetValue(dict, key, value) → set property
CFNumberCreate(alloc, type, valuePtr) → wrap int as CFNumber
CFRelease(ref) → release any CF object

IOSurface property keys (extern CFStringRef constants):

kIOSurfaceWidth → size (total bytes as width)
kIOSurfaceHeight → 1 (single row)
kIOSurfaceBytesPerElement → 1 (raw bytes)
kIOSurfaceBytesPerRow → size (full row)
kIOSurfaceAllocSize → size (allocation size)
kIOSurfacePixelFormat → 0 (no pixel format, raw buffer)

Lock model

Block is locked once at creation time and stays locked until drop. This makes VA stable for the entire lifetime — Tape can hand out raw pointers without per-alloc lock/unlock overhead.

open(): IOSurfaceCreate → IOSurfaceLock → cache VA → ready
drop(): IOSurfaceUnlock → CFRelease

No public lock/unlock API. Lock is internal implementation detail.

Data: Block

Field	Type	Description
ref	IOSurfaceRef	kernel block handle (CFRelease on drop)
va	non-null pointer	base virtual address (stable — locked at creation)
size	usize	allocation size in bytes
id	u32	global IOSurface ID

Thread safety

Block is Send + Sync. After creation, all fields are immutable (ref, va, size, id never change). The lock is held for the entire lifetime — no mutable state, no data races. Multiple threads can read VA concurrently.

Operations

Operation	Input	Output	Latency	Notes
open	size (bytes)	Block or error	~20us	creates, locks, caches VA
address	—	raw pointer	inline, 0ns	returns cached VA (always valid)
size	—	usize	inline, 0ns	returns cached size
id	—	u32	inline, 0ns	for sharing with DEXT (v1)
handle	—	IOSurfaceRef	inline, 0ns	raw handle for ANE (rane) / GPU (Metal) integration
drop	—	—	~5us	IOSurfaceUnlock + CFRelease

Measured performance (from experiment)

Size	Alloc	Write throughput	Read throughput
4 KB	18 us	22.8 GB/s	22.8 GB/s
1 MB	15 us	23.6 GB/s	23.5 GB/s
16 MB	17 us	23.6 GB/s	18.9 GB/s
256 MB	20 us	23.1 GB/s	19.5 GB/s

Throughput measured with volatile u64, single thread, no SIMD. With NEON/AMX: 60-70+ GB/s expected.

Invariants

Allocation is lazy — kernel reserves VA space, pages backed on first touch
First-touch page fault: ~1000-1200ns per 16KB page
All IOSurfaces map as single contiguous VM region
Block is Send + Sync (immutable after creation, lock held for lifetime)
Drop sequence: IOSurfaceUnlock → CFRelease — no leaks
open(0) returns error (ZeroSize)
Allocation failure returns error, never panics

Layer 2 — Bump allocator (`tape`)

Purpose

Fast sub-allocation over a single IOSurface. Zero syscalls after init. Clear in O(1). Block stays pinned between clears.

Mechanism

Atomic bump pointer over Block. Allocation = compare-exchange loop with alignment rounding. Clear = single atomic store.

Data: Tape

Field	Type	Description
block	Block	backing IOSurface
cursor	atomic usize	current allocation offset
capacity	usize	total tape size (= block.size)

Thread safety

Tape is Send + Sync. Block is immutable (Send + Sync). Cursor is atomic. Multiple threads can take concurrently via compare_exchange loop.

Alloc algorithm

fn take(size, align) -> Option<*mut u8>:
    loop:
        current = cursor.load(Relaxed)
        aligned = (current + align - 1) & !(align - 1)
        new_cursor = aligned + size
        if new_cursor > capacity:
            return None
        if cursor.compare_exchange_weak(current, new_cursor, Relaxed, Relaxed).ok:
            return block.address() + aligned

compare_exchange chosen over fetch_add because:

No wasted space on overshoot (fetch_add advances cursor even on failure)
Retry loop cost is negligible — contention on bump allocator is rare
Bounds check happens before cursor advances, not after

Operations

Operation	Input	Output	Latency	Notes
start	capacity (bytes)	Tape or error	~20us	creates IOSurface
take	size, alignment	pointer or none	< 5ns	compare_exchange loop
take_one	type T	typed pointer or none	< 5ns	uses size_of/align_of T
clear	—	—	< 10ns	store 0 to cursor
used	—	usize	inline	cursor.load
free	—	usize	inline	capacity - used
owns	pointer	bool	inline	ptr within [base, base+capacity)

Invariants

Take is lock-free — compare_exchange_weak, no mutex, no syscall
Alignment: must be power of 2. Minimum 1, recommended 64 for AMX SIMD
Apple Silicon kernel pages are 16KB (not 4KB)
start(0) returns error (ZeroSize)
no_std compatible core logic (only depends on atomic ops + pointer arithmetic)
Clear does NOT zero memory — caller responsible if needed
Concurrent take is safe (atomic). Concurrent take + clear is NOT safe — caller must ensure no take is in-flight during clear (e.g. single-threaded clear between inference passes)

Layer 3 — Fixed-size tensor grid (`grid`)

Purpose

Pre-allocated grid for inference tensors. Take and give with zero allocator overhead.

Mechanism

Tape-backed array of fixed-size cells. Lock-free queue (crossbeam SegQueue) tracks free cell indices. Grid owns the Tape.

Sizing

Tape capacity = CELL_SIZE * CELLS. CELL_SIZE must be a multiple of 64 (AMX alignment). This guarantees every cell is 64-byte aligned with no wasted padding.

Example: 32 cells of 4MB each → Tape of 128MB → one IOSurface of 128MB.

Data: Grid

Parameter	Description
CELL_SIZE	size of each cell in bytes (compile-time, must be multiple of 64)
CELLS	number of cells (compile-time)

Data: Cell

Cell borrows the Grid (Cell<'a> with lifetime tied to &'a Grid). This prevents use-after-free at compile time with zero runtime overhead — no Arc, no refcount.

Field	Type	Description
ptr	raw pointer	virtual address of cell data
id	usize	cell index (for give back to grid)
_grid	PhantomData<&'a Grid>	lifetime tie — Cell cannot outlive Grid

Operations

Operation	Input	Output	Latency	Notes
new	—	Grid or error	~20us	creates tape (CELL_SIZE * CELLS), fills free queue
take	—	Cell or none	~10ns	pop from lock-free queue
give	Cell	—	~10ns	push index back to queue
free	—	usize	~10ns	free queue length
total	—	usize	inline	CELLS (compile-time)

Invariants

Cell count fixed at compile time — no runtime resize
Take returns none when grid full — no panic, no block
Cell lifetime tied to Grid — compile-time use-after-free prevention
Given cells immediately reusable
Grid::drop is safe only when all Cells are given (enforced by lifetime — compiler prevents outstanding borrows)

Layer 4 — Hardware integration (`dma`)

Purpose

v0: no trait. Block and Tape expose address() and handle() directly. Hardware crates (rane, aruminium) consume IOSurfaceRef or raw pointers.

v1: add DmaBuffer trait with iova_segments() when DEXT is available.

v0 integration points

Consumer	Gets	How
CPU code	raw pointer	tape.take() or cell.ptr
AMX	raw pointer	same as CPU (AMX uses CPU instructions)
ANE (rane)	IOSurfaceRef	block.handle() → pass to _ANEIOSurfaceObject
GPU (Metal)	IOSurfaceRef	block.handle() → MTLTexture from IOSurface

No trait needed — direct method calls. Trait appears in v1 when DEXT adds IOVA capability.

Performance targets (v0)

Operation	Target	Baseline (mimalloc)
Single take (tape)	< 5ns	~100ns
Sequential write 1GB	> 23 GB/s (volatile)	~20 GB/s
Sequential write 1GB (NEON)	> 60 GB/s	~50 GB/s
Tape clear 1GB	< 10ns	~5ms (free loop)
IOSurface alloc 256MB	< 25us	N/A
Grid take/give	< 10ns	N/A

Error model

Error	Layer	Cause
ZeroSize	block, tape	open(0) / start(0) — zero-size allocation requested
BlockCreateFailed	block	IOSurfaceCreate returned null (OOM or system limit)
BlockLockFailed	block	IOSurfaceLock returned non-zero
InvalidAlignment	tape	alignment not power of 2
tape full	tape	take: not enough space remaining
grid full	grid	all cells in use

Tape.take and Grid.take return Option (None = exhausted) — not Result. These are hot-path operations, Option is lighter.

Block::open and Grid::new return Result<T, MemError> — these are cold-path init operations.

All errors non-panicking. No unwrap, no expect in library code.

Dependencies

Dependency	Purpose	Layer
IOSurface.framework	pinned shared buffers	block
CoreFoundation	CFDictionary for IOSurface properties	block
crossbeam	lock-free queue for grid	grid
criterion	benchmarking	benches

No objc2. No Metal. No Hypervisor. No DriverKit (v0).

Build

cargo build
cargo test
cargo bench

No special signing required for v0. No entitlements. No SIP changes.

Non-goals (v0)

No PA/IOVA visibility — that's v1 with DEXT
No NVMe DMA — that's v1
No safe API at this layer — unsafe foundation
No async — synchronous polling
No allocator trait impl — not a general purpose allocator

Implementation order (v0)

ffi.rs — IOSurface + CoreFoundation FFI declarations (extern C, link attrs, type aliases)
block.rs — Block struct: open (create + lock + cache VA), address, handle, drop (unlock + CFRelease)
tape.rs — Tape struct: start (creates Block), take (compare_exchange), clear, used/free/owns
Benchmarks — prove tape take < 5ns, Block throughput matches iosurface_probe experiment
grid.rs — Grid struct: new (creates Tape), take/give with SegQueue, Cell with lifetime
Integration test — roundtrip: take → write → read → verify
ANE test — create Block, pass handle() to rane, verify shared access

v1 roadmap: full zero-copy via DEXT

Prerequisites

Apple Developer Program enrollment ($99/year)
DriverKit entitlement provisioning from Apple
SIP reduced security + systemextensionsctl developer on for testing

Architecture

DEXT (CybMemDriver) acts as bridge between IOSurface and NVMe:

Client creates IOSurface, locks it, sends VA + length to DEXT
DEXT calls CreateMemoryDescriptorFromClient(VA, length) → IOMemoryDescriptor
DEXT calls IODMACommand::PrepareForDMA(descriptor) → DART IOVA segments (up to 32)
Client receives IOVA list, passes to NVMe for direct DMA

Key API (verified in DriverKit SDK 24.2)

IOUserClient::CreateMemoryDescriptorFromClient(
    uint64_t options,
    uint32_t segmentsCount,
    const IOAddressSegment segments[32],
    IOMemoryDescriptor ** memory
)

IODMACommand::PrepareForDMA(
    uint64_t options,
    IOMemoryDescriptor * memory,
    uint64_t offset,
    uint64_t length,
    uint64_t * flags,
    uint32_t * segmentsCount,
    IOAddressSegment segments[32]
)

DART IOMMU makes scatter/gather transparent

On Apple Silicon, PrepareForDMA returns DART IOVAs, not raw physical addresses. DART maps scattered physical pages into contiguous IOVA ranges for each device. NVMe sees contiguous address space even if physical RAM is fragmented.

Physical RAM:    [page7] [page3] [page19] [page42]   scattered
                    ↓        ↓        ↓        ↓
DART IOVA:       [0x1000] [0x2000] [0x3000] [0x4000]  contiguous for device

kIOMemoryPhysicallyContiguous is kernel-only — DriverKit cannot request it. But DART makes it unnecessary for DMA.

What we verified (experiments)

Finding	Source
IOSurface: pinned, contiguous VM region, ~20us alloc, ~23 GB/s write	experiments/iosurface_probe
Hypervisor: hv_vm_map does NOT improve host access latency (~10ns unchanged)	experiments/hyp_probe
Hypervisor: minimum 16KB pages on Apple Silicon	experiments/hyp_probe
Hypervisor: GPU/ANE cannot see guest IPA — useless for hardware sharing	audit
IOMallocContiguous: kernel-only, not callable from userspace	audit
IOMemoryDescriptorGetPhysicalAddress: kernel-only C++ method, not a C function	audit
DriverKit kIOMemoryPhysicallyContiguous: not available, kernel-only flag	research
CreateMemoryDescriptorFromClient: exists in DriverKit SDK 24.2, takes client VA segments	verified in headers
IODMACommand::PrepareForDMA: returns up to 32 DART IOVA segments	verified in headers
DEXT requires Apple Developer Program ($99/yr) — no self-signing workaround	research

DEXT experiment code (prepared, needs signing)

experiments/
  dext_iosurface_pa/       ← Path 2: IOSurface → DEXT reads IOVA
    dext/CybMemDriver.iig
    dext/CybMemDriver.cpp  ← CreateMemoryDescriptorFromClient + IODMACommand
    client/src/main.rs     ← Rust client: create IOSurface, call DEXT, print IOVAs
  dext_contiguous_alloc/   ← Path 1: DEXT alloc (limited — no contiguous flag)

v1 open questions

Can DEXT read IOVA from IOSurface that was created by another process (the client)?
What is PrepareForDMA latency for 256MB IOSurface?
How many IOVA segments does a 1GB IOSurface produce? (max 32 per call — may need multiple calls)
Can NVMe controller be addressed via DEXT, or does it need a separate NVMe DEXT?

Relationship to the cyber stack

unimem        ← this crate: IOSurface + tape + grid
  ↓
rane          ← ANE inference (uses IOSurface for buffers)
aruminium     ← AMX/GPU ops via Metal (can wrap IOSurface)
  ↓
cyb-runtime   ← orchestrates the full pipeline
  ↓
tru           ← runs tri-kernel on the cybergraph

Folder

api-sketch

Synonyms

hemera/specs

Hemera: A Permanent Hash Primitive for Planetary-Scale Collective Intelligence | field | value | |----------|--------------------------------| | version | 2.0 | | status | Decision Record | | authors | mastercyb | | date | March 2026 | Abstract Hemera is the cryptographic hash primitive for cyber,…

bbg/specs

specs

zheng/specs

zheng: polynomial proof system one IOP: SuperSpartan + sumcheck (CCS constraints, O(N) prover, O(log N) verifier). one folding: HyperNova (CCS-native, ~30 field ops per fold, one decider at the end). one hash: hemera (~3 calls per proof — binding, Fiat-Shamir seed, domain separation). five…

nox/specs

nox reference canonical specification of the nox virtual machine. this is the source of truth — when code and reference disagree, fix reference first, then propagate to code. specifications | page | scope | status | |------|-------|--------| | vm.md | overview, field, hash, algebra polymorphism,…

lens/specs

lens reference canonical specification for polynomial commitment — five lenses for five algebras. the trait three operations. commit is O(N). open produces a proof. verify checks the proof. all transparent (no trusted setup), all post-quantum. see trait for the full specification. naming convention…

strata/genies/specs

genies specification canonical reference for isogeny group action arithmetic: F_q field operations, supersingular curves, isogeny computation, and class group action. spec pages | page | defines | |------|---------| | [prime](/strata/genies/specs/prime) | CSIDH prime form, selection criteria,…

honeycrisp/acpu/specs

acpu — API specification pure Rust driver for Apple Silicon CPU compute. direct access to every useful compute unit in M1–M4: matrix coprocessor, vector engine, numeric extensions, atomics, memory system, performance counters. zero external dependencies — only inline assembly and system calls.…

strata/nebu/specs

nebu specification canonical reference for the Goldilocks prime field, its arithmetic, and its hardware. spec pages | page | defines | |------|---------| | field | prime, elements, arithmetic, properties, why Goldilocks | | ntt | Number Theoretic Transform, roots of unity, butterfly, Cooley-Tukey |…

honeycrisp/rane/specs

specs

strata/trop/specs

trop specification canonical reference for tropical semiring arithmetic: the (min, +) semiring, its matrix algebra, and dual certificate verification. spec pages | page | defines | |------|---------| | [semiring](/strata/trop/specs/semiring) | tropical semiring axioms, (min, +) definition, identity…

strata/kuro/specs

kuro specification canonical reference for the F₂ tower field, its arithmetic, packed operations, and hardware targets. spec pages | page | defines | |------|---------| | [field](/strata/kuro/specs/field) | tower levels, all field operations, properties, cost model vs Goldilocks | |…

honeycrisp/aruminium/specs

aruminium — API specification pure Rust driver for Apple Metal GPU. direct objc_msgSend FFI, zero external dependencies, only macOS system frameworks. concepts | concept | what it is | |---------|-----------| | device | a Metal GPU — discovered at runtime, owns all GPU resources | | buffer |…

strata/jali/specs

jali reference canonical specification for polynomial ring arithmetic R_q = F_p[x]/(x^n+1) over Goldilocks. what jali is jali (जाली — lattice/mesh) is the fifth execution algebra for cyber. polynomial ring elements are structured vectors of n Goldilocks field elements with multiplication defined by…

honeycrisp/unimem/specs.md

unimem: Zero-Copy Memory Driver for Apple Silicon

Goal

Why this exists

Architecture: v0 and v1

v0 — IOSurface-backed (implement now)

v1 — DEXT + IOSurface (after Apple Developer enrollment)

Crate structure (v0)

Layer 1 — IOSurface allocation (block)

Purpose

Mechanism

Lock model

Data: Block

Thread safety

Operations

Measured performance (from experiment)

Invariants

Layer 2 — Bump allocator (tape)

Purpose

Mechanism

Data: Tape

Thread safety

Alloc algorithm

Operations

Invariants

Layer 3 — Fixed-size tensor grid (grid)

Purpose

Mechanism

Sizing

Data: Grid

Data: Cell

Operations

Invariants

Layer 4 — Hardware integration (dma)

Purpose

v0 integration points

Performance targets (v0)

Error model

Dependencies

Build

Non-goals (v0)

Implementation order (v0)

v1 roadmap: full zero-copy via DEXT

Prerequisites

Architecture

Key API (verified in DriverKit SDK 24.2)

DART IOMMU makes scatter/gather transparent

What we verified (experiments)

DEXT experiment code (prepared, needs signing)

v1 open questions

Relationship to the cyber stack

Folder

Synonyms

Neighbours

Layer 1 — IOSurface allocation (`block`)

Layer 2 — Bump allocator (`tape`)

Layer 3 — Fixed-size tensor grid (`grid`)

Layer 4 — Hardware integration (`dma`)