explanations
how aruminium works and why it's built this way.
why direct FFI
Metal.framework is an ObjC API. the standard Rust approach is objc2-metal (safe wrappers over ObjC runtime). aruminium bypasses this entirely:
objc2-metal: Rust -> objc2 runtime -> ObjC protocol wrappers -> Metal.framework
aruminium: Rust -> objc_msgSend (transmuted fn pointers) -> Metal.framework
one less layer. no ObjC protocol conformance checks, no dynamic dispatch tables, no wrapper allocations. the cost is manual memory management (retain/release) and unsafe transmutes for every call.
objc_msgSend dispatch
every Metal API call is an ObjC message send. the C function objc_msgSend
takes a target object, a selector (method name), and arguments. the return
type varies — so we transmute objc_msgSend to typed function pointers:
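// illustrative shape only: (receiver, selector, args...) -> result; the exact types vary per method
type F = unsafe extern "C" fn(*mut c_void, *const c_void, usize) -> *mut c_void;
let f: F = transmute(objc_msgSend as *const c_void);
f(obj, sel, arg)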
this is the same mechanism the ObjC compiler uses. safe as long as the type signature matches the actual method.
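
a minimal sketch of the transmute pattern that runs without Metal or libobjc. fake_msg_send is a hypothetical stand-in for objc_msgSend, and SendUsize and send_usize are names invented for this example. in real code the untyped symbol would come from the ObjC runtime, and the typed signature has to match the method actually being called.

```rust
use std::ffi::c_void;
use std::mem::transmute;

/// Hypothetical stand-in for `objc_msgSend`: takes a receiver, a selector,
/// and one integer argument. It is not the real runtime function.
unsafe extern "C" fn fake_msg_send(
    _obj: *mut c_void,
    _sel: *const c_void,
    n: usize,
) -> usize {
    n + 1
}

/// Typed view of the call: (receiver, selector, arg) -> result.
type SendUsize = unsafe extern "C" fn(*mut c_void, *const c_void, usize) -> usize;

/// Erases the signature (as an untyped symbol address would be), then
/// transmutes it back to a typed function pointer and calls through it.
pub fn send_usize(obj: *mut c_void, sel: *const c_void, n: usize) -> usize {
    let untyped: *const c_void = fake_msg_send as *const c_void;
    // SAFETY: `SendUsize` matches the real signature of `fake_msg_send`.
    let f: SendUsize = unsafe { transmute::<*const c_void, SendUsize>(untyped) };
    unsafe { f(obj, sel, n) }
}
```

the only thing keeping this sound is the comment above the transmute: if the typed signature drifts from the callee's real one, the call is undefined behavior and nothing will flag it at compile time.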
selector caching
sel_registerName("commandBuffer") resolves a C string to a selector
pointer. idempotent — same string always returns same pointer. we cache
the result in an AtomicPtr with Relaxed ordering:
- ARM64: compiles to plain ldr (no memory barrier)
- first call: resolves + stores
- subsequent calls: single atomic load
- race on first init is benign (same value written by all threads)
50+ selectors are cached this way. hot-path cost: one pointer load.
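
the caching scheme can be sketched as below. this is an illustration under assumptions, not aruminium's code: resolve_stub is a hypothetical stand-in for sel_registerName (it returns a fixed address and counts calls), and CachedSel is a name made up for the example.

```rust
use std::ffi::c_void;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Counts how often the (stand-in) resolver runs.
pub static RESOLVE_CALLS: AtomicUsize = AtomicUsize::new(0);
static FAKE_SELECTOR: u8 = 0;

/// Hypothetical stand-in for `sel_registerName`: idempotent, always
/// returns the same non-null pointer.
fn resolve_stub(_name: &str) -> *mut c_void {
    RESOLVE_CALLS.fetch_add(1, Ordering::Relaxed);
    &FAKE_SELECTOR as *const u8 as *const c_void as *mut c_void
}

/// A selector resolved lazily and cached in an `AtomicPtr`.
pub struct CachedSel {
    name: &'static str,
    ptr: AtomicPtr<c_void>,
}

impl CachedSel {
    pub const fn new(name: &'static str) -> Self {
        Self { name, ptr: AtomicPtr::new(ptr::null_mut()) }
    }

    #[inline]
    pub fn get(&self) -> *mut c_void {
        // Hot path: a single relaxed load.
        let cached = self.ptr.load(Ordering::Relaxed);
        if !cached.is_null() {
            return cached;
        }
        // Cold path: resolve and publish. Racing threads all store the
        // same value, so no compare-and-swap is needed.
        let resolved = resolve_stub(self.name);
        self.ptr.store(resolved, Ordering::Relaxed);
        resolved
    }
}

pub static SEL_COMMAND_BUFFER: CachedSel = CachedSel::new("commandBuffer");
```

Relaxed is enough here because the cached value is a plain address that does not publish any other memory; readers only need to see either null or the final pointer.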
three dispatch tiers
tier 1: objc_msgSend (standard path)
every API call goes through objc_msgSend with cached selectors.
safe wrappers with null checks and Result return types. used by
examples, probe, and general-purpose code.
tier 2: ARC fast-retain
objc_retainAutoreleasedReturnValue called immediately after
objc_msgSend. the ObjC runtime recognizes this pattern and skips
the autorelease+retain round-trip. saves ~100ns per command buffer
creation. used by default commands() and encoder().
tier 3: pre-resolved IMP (Dispatch)
class_getMethodImplementation resolves a selector to a direct
function pointer (IMP) at construction time. subsequent calls bypass
objc_msgSend entirely — no selector lookup, no method cache check,
no dispatch overhead:
objc_msgSend path: selector -> method cache -> IMP -> call
IMP path: IMP -> call
saves ~50ns per call. for inference loops doing 600+ dispatches per token, that is roughly 30us per token (600 x 50ns), assuming one such call per dispatch.
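
the resolve-once idea can be modeled without the ObjC runtime. in the sketch below, MethodTable is a hypothetical stand-in for the class method lookup that class_getMethodImplementation performs, Imp is a plain Rust function pointer rather than a real IMP, and the Dispatch struct is a simplified illustration, not aruminium's actual type.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Stand-in for an IMP: a plain function pointer.
pub type Imp = fn(u64) -> u64;

fn double(x: u64) -> u64 {
    x * 2
}

/// Hypothetical model of a class's method table; counts lookups.
pub struct MethodTable {
    methods: HashMap<&'static str, Imp>,
    lookups: Cell<usize>,
}

impl MethodTable {
    pub fn new() -> Self {
        let mut methods: HashMap<&'static str, Imp> = HashMap::new();
        methods.insert("double:", double as Imp);
        Self { methods, lookups: Cell::new(0) }
    }

    /// Selector -> function pointer. This is the step paid on every call
    /// when going through a message send.
    pub fn lookup(&self, sel: &str) -> Option<Imp> {
        self.lookups.set(self.lookups.get() + 1);
        self.methods.get(sel).copied()
    }

    pub fn lookups(&self) -> usize {
        self.lookups.get()
    }
}

/// Holds a function pointer resolved once at construction time.
pub struct Dispatch {
    imp: Imp,
}

impl Dispatch {
    pub fn new(table: &MethodTable, sel: &str) -> Option<Self> {
        table.lookup(sel).map(|imp| Self { imp })
    }

    /// Direct call through the stored pointer; no lookup involved.
    #[inline]
    pub fn call(&self, x: u64) -> u64 {
        (self.imp)(x)
    }
}
```

the trade-off is that a pre-resolved pointer is a snapshot: if the method implementation were swapped at runtime, the stored pointer would keep calling the old one.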
memory management
Metal objects follow ObjC retain/release semantics:
- newXxx methods return retained objects (caller owns)
- commandBuffer returns autoreleased (runtime owns temporarily)
- objc_retain / objc_release are direct C calls (not msg_send)
aruminium tracks ownership with owned: bool on command buffers
and encoders. Drop only calls release on owned objects.
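
a minimal sketch of the owned flag, assuming a wrapper shaped roughly like the one described. CommandBufferHandle and its constructors are hypothetical names, and release_stub stands in for objc_release so the example runs without the ObjC runtime.

```rust
use std::ffi::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts calls to the stand-in release function.
pub static RELEASE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `objc_release`; it only counts calls.
fn release_stub(_obj: *mut c_void) {
    RELEASE_CALLS.fetch_add(1, Ordering::Relaxed);
}

/// Wrapper that releases on drop only when it owns a reference.
pub struct CommandBufferHandle {
    raw: *mut c_void,
    owned: bool,
}

impl CommandBufferHandle {
    /// The caller holds a +1 reference (for example after a retain).
    pub fn retained(raw: *mut c_void) -> Self {
        Self { raw, owned: true }
    }

    /// The object is autoreleased or owned elsewhere; never release it.
    pub fn borrowed(raw: *mut c_void) -> Self {
        Self { raw, owned: false }
    }
}

impl Drop for CommandBufferHandle {
    fn drop(&mut self) {
        if self.owned {
            release_stub(self.raw);
        }
    }
}
```

keeping the decision in one bool means the unretained variants can share the same wrapper type and simply skip the release.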
five command buffer variants trade safety for speed:
| variant | retain | null check | resource refs |
|---|---|---|---|
| command_buffer | ARC fast | yes | retained |
| command_buffer_unretained | none | yes | retained |
| command_buffer_fast | explicit | yes | unretained |
| command_buffer_unchecked | ARC fast | no | unretained |
| command_buffer_autoreleased | none | no | unretained |
"unretained references" means Metal skips retain/release on all buffers, textures, and pipelines used in the command buffer. caller must ensure resources outlive the command buffer.
shared vs private buffers
shared (MTLResourceStorageModeShared): CPU and GPU share the
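same physical memory. zero-copy: write from CPU, read on GPU.
no explicit synchronization is needed beyond command buffer boundaries:
CPU writes made before commit are visible to the GPU, and GPU writes
are visible to the CPU once the command buffer completes.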
the contents() pointer is cached at creation time.
private (MTLResourceStorageModePrivate): GPU-only memory.
Metal has full control over placement and caching. higher bandwidth
for inter-kernel buffers. CPU access requires blit encoder copy.
fp16 conversion
half-precision floats (fp16) are the common weight format for inference. conversion uses inline ARM64 NEON assembly:
- single value: fcvt instruction (h->s or s->h)
- bulk (32/iter): 4x unrolled fcvtl/fcvtn with ldp/stp
- tail: 8/iter NEON, then scalar fallback
throughput: 58-72 GB/s on M1 Pro (memory bandwidth limited).
software fallback for non-aarch64 platforms handles IEEE 754 half-precision format manually (sign, exponent, mantissa bit shifting).
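
for reference, a software half-to-float conversion can be written as below. this is a sketch of the standard bit-level approach for one direction only (fp16 bits to f32); it is not taken from aruminium's source, and the function name is made up.

```rust
/// Converts IEEE 754 binary16 bits to an `f32` using integer bit shifting.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 0x1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let man = (h & 0x3ff) as u32;

    let bits = if exp == 0 {
        if man == 0 {
            // Signed zero.
            sign << 31
        } else {
            // Subnormal half: shift the mantissa up until the implicit
            // bit appears, lowering the exponent once per shift.
            let mut e: u32 = 127 - 15 + 1;
            let mut m = man;
            while m & 0x400 == 0 {
                m <<= 1;
                e -= 1;
            }
            (sign << 31) | (e << 23) | ((m & 0x3ff) << 13)
        }
    } else if exp == 0x1f {
        // Infinity (man == 0) or NaN (man != 0).
        (sign << 31) | (0xff << 23) | (man << 13)
    } else {
        // Normal number: rebias the exponent from 15 to 127 and widen
        // the 10-bit mantissa to 23 bits.
        (sign << 31) | ((exp + 127 - 15) << 23) | (man << 13)
    };
    f32::from_bits(bits)
}
```

widening is exact because every fp16 value is representable in f32; the narrowing direction additionally needs rounding and overflow handling, which is left out here.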
MTLSize on ARM64
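MTLSize is a 24-byte struct (3x usize). on ARM64 (AAPCS64), a composite
larger than 16 bytes that is not a homogeneous float aggregate is passed
indirectly: the caller copies it to memory and passes a pointer to the copy.
a #[repr(C)] Rust struct passed by value through an extern "C" function
pointer follows the same rule, so the transmuted call matches what Metal
expects. this is verified by the dispatch examples: if the ABI
didn't match, threadgroup dimensions would be garbage.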
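
a small sketch of the layout and by-value passing, assuming a #[repr(C)] mirror of MTLSize. MtlSize, fake_dispatch, and total_threads are hypothetical names, and fake_dispatch stands in for a Metal method that takes two sizes by value.

```rust
use std::ffi::c_void;
use std::mem::transmute;
use std::ptr;

/// C-compatible mirror of MTLSize: three pointer-sized unsigned integers.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct MtlSize {
    pub width: usize,
    pub height: usize,
    pub depth: usize,
}

/// Hypothetical stand-in for a method taking two sizes by value.
/// Returns threadgroups multiplied by threads per threadgroup.
unsafe extern "C" fn fake_dispatch(
    _obj: *mut c_void,
    _sel: *const c_void,
    groups: MtlSize,
    threads: MtlSize,
) -> usize {
    groups.width * groups.height * groups.depth
        * threads.width * threads.height * threads.depth
}

type DispatchFn =
    unsafe extern "C" fn(*mut c_void, *const c_void, MtlSize, MtlSize) -> usize;

/// Calls through a transmuted pointer, passing both structs by value.
pub fn total_threads(groups: MtlSize, threads: MtlSize) -> usize {
    let untyped = fake_dispatch as *const c_void;
    // SAFETY: `DispatchFn` matches the signature of `fake_dispatch`.
    let f: DispatchFn = unsafe { transmute::<*const c_void, DispatchFn>(untyped) };
    unsafe { f(ptr::null_mut(), ptr::null(), groups, threads) }
}
```

the repr(C) attribute is what makes the field order and size predictable; without it the Rust compiler is free to lay the struct out differently.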
batch encoding
the single most important optimization for inference: encoding multiple dispatches into one command buffer.
without batching: N dispatches = N command buffers = N round-trips to the GPU driver. each round-trip costs ~20us.
with batching: N dispatches = 1 command buffer = 1 round-trip.
pipeline switches happen inside the same encoder — different
setComputePipelineState calls within one computeCommandEncoder.
dispatch_batch_async adds pipelining: commit batch N, start
encoding batch N+1 on CPU while GPU executes batch N. the overlap
hides encoding latency whenever the GPU takes longer to execute a
batch than the CPU takes to encode the next one.
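
the round-trip arithmetic can be made concrete with a small model. it uses the ~20us figure quoted above and a 600-dispatch token as example inputs; the function name is made up, and the model counts submission overhead only, not GPU execution time.

```rust
/// Submission overhead in microseconds for `dispatches` dispatches when
/// each command buffer carries up to `batch_size` of them and every
/// command buffer costs `round_trip_us` to submit.
pub fn submit_overhead_us(dispatches: u64, batch_size: u64, round_trip_us: u64) -> u64 {
    assert!(batch_size > 0, "batch_size must be at least 1");
    // Ceiling division: a partially filled batch still needs a buffer.
    let command_buffers = (dispatches + batch_size - 1) / batch_size;
    command_buffers * round_trip_us
}
```

with these inputs, one command buffer per dispatch costs 600 x 20us = 12ms per token, while a single batched command buffer costs 20us.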
relationship to cyb-llm
- cyb-llm: inference runtime (graphs, models, scheduling)
- cyb-llm/backend: Metal backend (MSL shaders, jet dispatch, frame allocator)
- aruminium: this crate (device, buffer, pipeline, dispatch)
aruminium does not know about models, operations, or shaders. it provides the Metal.framework API surface that the backend needs. one-way dependency: cyb-llm depends on aruminium, not the reverse.