aruminium — API specification

pure Rust driver for Apple Metal GPU. direct objc_msgSend FFI, zero external dependencies, only macOS system frameworks.

concepts

concept         what it is
device          a Metal GPU — discovered at runtime, owns all GPU resources
buffer          CPU/GPU memory region — shared (zero-copy) or private (GPU-only)
library         compiled shader code — one or more functions from MSL source
function        a single shader entry point — vertex, fragment, or kernel
pipeline        a compiled GPU state object — binds function + config for dispatch
queue           a serial command submission channel to the GPU
command buffer  a batch of encoded GPU commands — submitted atomically
encoder         records commands into a command buffer — compute or blit
dispatcher      pre-resolved IMP dispatch engine for inference hot loops
texture         GPU image data — 2D/3D, region read/write
fence           GPU work tracking within a single command buffer
event           synchronization between command buffers
shared event    CPU/GPU synchronization with signaled counter

lifecycle

source  ->  compile  ->  pipeline  ->  encode  ->  commit  ->  complete
  MSL      MTLLibrary   pipeline    encoder    cmdBuf      GPU done

device

method                            signature                             semantics
open                              () -> Result<Gpu>                     get default Metal GPU
all                               () -> Result<Vec<Gpu>>                enumerate all Metal GPUs
name                              (&self) -> String                     device name (e.g. "Apple M1 Pro")
has_unified_memory                (&self) -> bool                       shared CPU/GPU memory architecture
max_buffer_length                 (&self) -> usize                      max buffer allocation in bytes
max_threads_per_threadgroup       (&self) -> MTLSize                    max threads per threadgroup
recommended_max_working_set_size  (&self) -> u64                        recommended GPU memory budget
new_command_queue                 (&self) -> Result<Queue>              create command queue
buffer                            (&self, bytes) -> Result<Buffer>      allocate shared buffer (CPU+GPU)
buffer_private                    (&self, bytes) -> Result<Buffer>      allocate private buffer (GPU-only)
buffer_with_data                  (&self, &[u8]) -> Result<Buffer>      shared buffer with initial data
compile                           (&self, &str) -> Result<ShaderLib>    compile MSL source
pipeline                          (&self, &Shader) -> Result<Pipeline>  create compute pipeline
texture                           (&self, desc) -> Result<Texture>      create texture from descriptor (unsafe)
fence                             (&self) -> Fence                      create fence
event                             (&self) -> Event                      create event
shared_event                      (&self) -> SharedEvent                create shared event

apple mapping

method                            ObjC
open                              MTLCreateSystemDefaultDevice()
all                               MTLCopyAllDevices()
name                              [device name]
has_unified_memory                [device hasUnifiedMemory]
max_buffer_length                 [device maxBufferLength]
max_threads_per_threadgroup       [device maxThreadsPerThreadgroup]
recommended_max_working_set_size  [device recommendedMaxWorkingSetSize]
new_command_queue                 [device newCommandQueue]
buffer                            [device newBufferWithLength:options:] (StorageModeShared)
buffer_private                    [device newBufferWithLength:options:] (StorageModePrivate)
buffer_with_data                  [device newBufferWithBytes:length:options:]
compile                           [device newLibraryWithSource:options:error:]
pipeline                          [device newComputePipelineStateWithFunction:error:]
texture                           [device newTextureWithDescriptor:]
fence                             [device newFence]
event                             [device newEvent]
shared_event                      [device newSharedEvent]

buffer

CPU/GPU memory region. two storage modes:

  • shared (default) — zero-copy, CPU and GPU share physical memory. no lock/unlock needed. contents pointer cached at creation.
  • private — GPU-only, higher bandwidth for inter-kernel buffers. CPU cannot read/write. use blit encoder to copy data in/out.

method     signature               semantics
is_shared  (&self) -> bool         true if CPU-accessible (shared mode)
contents   (&self) -> *mut c_void  raw pointer to shared memory (cached)
read       (&self, |&[u8]|)        read access via closure
write      (&self, |&mut [u8]|)    write access via closure
read_f32   (&self, |&[f32]|)       typed read as f32
write_f32  (&self, |&mut [f32]|)   typed write as f32
size       (&self) -> usize        allocation in bytes
drop       automatic               [buffer release]

apple mapping

method    ObjC
contents  [buffer contents]
size      construction parameter
drop      objc_release
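
the closure-based accessors keep the borrowed slice from outliving the call. a minimal sketch of that shape, assuming a hypothetical MockSharedBuffer backed by heap memory instead of the cached [buffer contents] pointer; the generic return type R is also an assumption, since the table does not show what the closures return.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for a shared buffer: heap memory replaces the
/// cached `[buffer contents]` pointer so the sketch runs without a GPU.
pub struct MockSharedBuffer {
    // Stored as f32 so the memory is 4-byte aligned for typed views.
    words: RefCell<Vec<f32>>,
}

impl MockSharedBuffer {
    /// Allocate `bytes` zeroed bytes (rounded down to a multiple of 4 here).
    pub fn new(bytes: usize) -> Self {
        Self { words: RefCell::new(vec![0.0; bytes / 4]) }
    }

    /// Allocation size in bytes.
    pub fn size(&self) -> usize {
        self.words.borrow().len() * 4
    }

    /// Typed write: the mutable slice cannot escape the closure.
    pub fn write_f32<R>(&self, f: impl FnOnce(&mut [f32]) -> R) -> R {
        f(self.words.borrow_mut().as_mut_slice())
    }

    /// Typed read over the same memory.
    pub fn read_f32<R>(&self, f: impl FnOnce(&[f32]) -> R) -> R {
        f(self.words.borrow().as_slice())
    }

    /// Byte-level read: reinterpret the same memory as `u8`.
    pub fn read<R>(&self, f: impl FnOnce(&[u8]) -> R) -> R {
        let words = self.words.borrow();
        // SAFETY: any f32 bit pattern is valid as bytes, u8 has alignment 1,
        // and the length covers exactly the allocation.
        let bytes = unsafe {
            std::slice::from_raw_parts(words.as_ptr() as *const u8, words.len() * 4)
        };
        f(bytes)
    }
}
```

the mock uses RefCell to get mutation through &self; a real shared buffer would hand out slices over the GPU-visible pointer instead.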

library

compiled shader code from MSL source text.

method          signature                        semantics
function        (&self, &str) -> Result<Shader>  get function by name
function_names  (&self) -> Vec<String>           list all function names

apple mapping

method          ObjC
function        [library newFunctionWithName:]
function_names  [library functionNames]

function

a single shader entry point extracted from a library.

method  signature          semantics
name    (&self) -> String  function name

compute pipeline

compiled GPU state — function + hardware config.

method                             signature         semantics
max_total_threads_per_threadgroup  (&self) -> usize  max threads per threadgroup for this pipeline
thread_execution_width             (&self) -> usize  SIMD width (32 on Apple GPU)
static_threadgroup_memory_length   (&self) -> usize  threadgroup memory used by pipeline (bytes)

apple mapping

method                             ObjC
max_total_threads_per_threadgroup  [pipeline maxTotalThreadsPerThreadgroup]
thread_execution_width             [pipeline threadExecutionWidth]
static_threadgroup_memory_length   [pipeline staticThreadgroupMemoryLength]
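
the two limits above are typically combined when choosing a 1D threadgroup width: a multiple of thread_execution_width that stays under max_total_threads_per_threadgroup. a minimal sketch of one such heuristic; pick_group_width is a hypothetical helper and is not part of the spec.

```rust
/// Hypothetical helper: pick a 1D threadgroup width that is a multiple of
/// the SIMD width, fits under the pipeline limit, and is no larger than the
/// work size rounded up to one whole SIMD group.
pub fn pick_group_width(max_total: usize, exec_width: usize, work_items: usize) -> usize {
    assert!(exec_width > 0 && max_total >= exec_width);
    // Largest multiple of exec_width that the pipeline allows.
    let cap = (max_total / exec_width) * exec_width;
    // Work size rounded up to a whole SIMD group (at least one group).
    let needed = ((work_items.max(1) + exec_width - 1) / exec_width) * exec_width;
    cap.min(needed)
}
```

the inputs would come from max_total_threads_per_threadgroup and thread_execution_width on the pipeline; whether the largest width is also the fastest depends on the kernel.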

command queue

method                 signature                           semantics
commands               (&self) -> Result<Commands>         retained, ARC fast-retain
commands_unretained    unsafe (&self) -> Result<Commands>  autoreleased, no retain overhead
commands_fast          unsafe (&self) -> Result<Commands>  unretained references — Metal skips resource retain/release
commands_unchecked     unsafe (&self) -> Commands          unretained refs, no null check
commands_autoreleased  unsafe (&self) -> Commands          fastest — must be in autorelease_pool

overhead hierarchy (low to high):

commands_autoreleased   — no retain, no null check; requires pool
commands_unchecked      — no null check, unretained refs
commands_fast           — unretained refs, null checked
commands_unretained     — autoreleased, null checked
commands                — retained, safe, standard

apple mapping

method                 ObjC
commands               [queue commandBuffer] + objc_retainAutoreleasedReturnValue
commands_unretained    [queue commandBuffer] (no retain)
commands_fast          [queue commandBufferWithUnretainedReferences] + retain
commands_unchecked     [queue commandBufferWithUnretainedReferences] + ARC fast-retain
commands_autoreleased  [queue commandBufferWithUnretainedReferences] (no retain)

command buffer

method                signature                          semantics
encoder               (&self) -> Result<Encoder>         retained compute encoder
encoder_unretained    unsafe (&self) -> Result<Encoder>  autoreleased
encoder_unchecked     unsafe (&self) -> Encoder          no null check, retained
encoder_autoreleased  unsafe (&self) -> Encoder          fastest, requires pool
copier                (&self) -> Result<Copier>          blit encoder
submit                (&self)                            submit for GPU execution
wait                  (&self)                            block until GPU done
status                (&self) -> u64                     execution status code
error                 (&self) -> Option<String>          error description if failed
gpu_start_time        (&self) -> f64                     GPU start time (seconds since boot)
gpu_end_time          (&self) -> f64                     GPU end time (seconds since boot)
gpu_time              (&self) -> f64                     GPU execution duration (end - start)

apple mapping

method          ObjC
encoder         [cmdBuf computeCommandEncoder] + ARC fast-retain
copier          [cmdBuf blitCommandEncoder]
submit          [cmdBuf commit]
wait            [cmdBuf waitUntilCompleted]
status          [cmdBuf status]
error           [cmdBuf error]
gpu_start_time  [cmdBuf GPUStartTime]
gpu_end_time    [cmdBuf GPUEndTime]

compute encoder

method         signature                        semantics
bind           (&self, &Pipeline)               bind compute pipeline
bind_buffer    (&self, &Buffer, offset, index)  bind buffer at index
push           (&self, &[u8], index)            inline constant data
launch         (&self, grid, group)             dispatch with auto non-uniform grid handling
launch_groups  (&self, groups, threads)         dispatch with explicit group count
finish         (&self)                          finish encoding

apple mapping

method         ObjC
bind           [encoder setComputePipelineState:]
bind_buffer    [encoder setBuffer:offset:atIndex:]
push           [encoder setBytes:length:atIndex:]
launch         [encoder dispatchThreads:threadsPerThreadgroup:]
launch_groups  [encoder dispatchThreadgroups:threadsPerThreadgroup:]
finish         [encoder endEncoding]
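
launch takes the grid size in threads, while launch_groups takes a threadgroup count that the caller computes. a minimal sketch of that division, assuming grid and group are MTLSize-like triples; the struct mirrors Metal's three NSUInteger fields, and groups_for is a hypothetical helper.

```rust
/// Mirror of Metal's `MTLSize` (three `NSUInteger` fields; `usize` here).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MTLSize {
    pub width: usize,
    pub height: usize,
    pub depth: usize,
}

/// Hypothetical helper: number of threadgroups needed to cover `grid`
/// with threadgroups of size `group`, rounding up in every dimension.
pub fn groups_for(grid: MTLSize, group: MTLSize) -> MTLSize {
    assert!(group.width > 0 && group.height > 0 && group.depth > 0);
    let div = |n: usize, d: usize| (n + d - 1) / d; // ceiling division
    MTLSize {
        width: div(grid.width, group.width),
        height: div(grid.height, group.height),
        depth: div(grid.depth, group.depth),
    }
}
```

when the grid is not a multiple of the group size, the rounded-up count launches extra threads, so a kernel dispatched this way needs its own bounds check.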

blit encoder

method  signature                                  semantics
copy    (&self, src, src_off, dst, dst_off, size)  GPU buffer-to-buffer copy
finish  (&self)                                    finish encoding

apple mapping

method  ObjC
copy    [encoder copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:]
finish  [encoder endEncoding]

compute dispatcher

pre-resolved IMP dispatch engine for inference hot loops. resolves all ObjC method implementations at construction — every dispatch call goes through direct function pointers, bypassing objc_msgSend entirely.

method               signature                                                     semantics
new                  (&Queue) -> Self                                              resolve all IMPs eagerly
dispatch             unsafe (&self, pipeline, buffers, grid, group)                single dispatch: encode + commit + wait
dispatch_with_bytes  unsafe (&self, pipeline, buffers, bytes, index, grid, group)  single dispatch with inline constants
batch                unsafe (&self, |&Batch|)                                      multiple dispatches in one command buffer
batch_raw            unsafe (&self, |&Batch|)                                      batch without autorelease management (caller manages pool)
batch_async          unsafe (&self, |&Batch|) -> GpuFuture                         encode + commit, return handle for deferred wait
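
the idea behind eager IMP resolution can be shown without the ObjC runtime. the sketch below is a portable analogy only, not the crate's code: a hypothetical Runtime looks methods up by name on every send, while Resolved performs the lookup once at construction and then calls through the stored function pointer.

```rust
use std::collections::HashMap;

/// Stand-in for an ObjC IMP: a plain function pointer.
type Imp = fn(u32) -> u32;

fn double(x: u32) -> u32 {
    x * 2
}

/// Hypothetical runtime that resolves a method by selector name on each call.
pub struct Runtime {
    methods: HashMap<&'static str, Imp>,
}

impl Runtime {
    pub fn new() -> Self {
        let mut methods: HashMap<&'static str, Imp> = HashMap::new();
        methods.insert("double", double as Imp);
        Self { methods }
    }

    /// Dynamic send: one table lookup per call.
    pub fn send(&self, sel: &str, arg: u32) -> Option<u32> {
        self.methods.get(sel).map(|f| f(arg))
    }

    /// Resolve a selector to its function pointer.
    pub fn resolve(&self, sel: &str) -> Option<Imp> {
        self.methods.get(sel).copied()
    }
}

/// Pre-resolved dispatcher: the lookup happens once, in `new`.
pub struct Resolved {
    double: Imp,
}

impl Resolved {
    pub fn new(rt: &Runtime) -> Option<Self> {
        Some(Self { double: rt.resolve("double")? })
    }

    /// Hot path: direct call through the cached pointer, no lookup.
    pub fn double(&self, x: u32) -> u32 {
        (self.double)(x)
    }
}
```

resolving at construction also moves the failure case (a missing method) out of the hot loop and into new.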

batch

provided to batch closures. same IMP-resolved hot path.

method         signature                        semantics
bind           (&self, &Pipeline)               bind pipeline
bind_buffer    (&self, &Buffer, offset, index)  bind buffer
push           (&self, &[u8], index)            inline constants
launch         (&self, grid, group)             dispatch
launch_groups  (&self, groups, threads)         dispatch with explicit groups

gpu future

handle for committed but not yet completed command buffer.

method  signature  semantics
wait    (self)     block until GPU finishes, release command buffer
drop    automatic  if not waited, waits + releases (prevents leak)

pipelining pattern

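// explicit type so h.wait() resolves before the first assignment
let mut prev: Option<GpuFuture> = None;
for pass in passes {
    // encode + commit this pass while the GPU still runs the previous one
    // (batch_async is unsafe, see the dispatcher table)
    let handle = unsafe { disp.batch_async(|batch| { ... }) };
    // only now block on the previous pass
    if let Some(h) = prev { h.wait(); }
    prev = Some(handle);
}
// drain the last in-flight pass
if let Some(h) = prev { h.wait(); }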

overlap GPU execution of batch N with CPU encoding of batch N+1.
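
the same pattern can be exercised without a GPU. a minimal sketch under stated assumptions: MockFuture and this batch_async are hypothetical stand-ins in which a worker thread plays the GPU, and wait returns a value only so the ordering can be checked.

```rust
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the GPU future: a worker thread plays the GPU.
pub struct MockFuture {
    handle: Option<JoinHandle<u32>>,
}

impl MockFuture {
    /// Block until the simulated GPU work finishes and return its result.
    pub fn wait(mut self) -> u32 {
        self.handle.take().expect("handle present until waited").join().unwrap()
    }
}

impl Drop for MockFuture {
    /// Mirrors the documented drop behaviour: wait if nobody did.
    fn drop(&mut self) {
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
    }
}

/// Stand-in for `batch_async`: "commit" the pass and return immediately.
pub fn batch_async(pass: u32) -> MockFuture {
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(5)); // simulated GPU time
        pass * 10
    });
    MockFuture { handle: Some(handle) }
}

/// Keep at most one pass in flight while the next one is being encoded.
pub fn run_pipelined(passes: &[u32]) -> Vec<u32> {
    let mut results = Vec::with_capacity(passes.len());
    let mut prev: Option<MockFuture> = None;
    for &pass in passes {
        let current = batch_async(pass); // commit the next pass first
        if let Some(p) = prev.take() {
            results.push(p.wait()); // then wait for the previous one
        }
        prev = Some(current);
    }
    if let Some(p) = prev {
        results.push(p.wait()); // drain the last pass
    }
    results
}
```

results come back in submission order because each wait targets the pass committed one iteration earlier.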

texture

GPU image data. wraps id<MTLTexture>.

method          signature                                            semantics
width           (&self) -> usize                                     width in pixels
height          (&self) -> usize                                     height in pixels
depth           (&self) -> usize                                     depth (3D textures)
pixel_format    (&self) -> usize                                     MTLPixelFormat value
replace_region  unsafe (&self, region, mipmap, data, bytes_per_row)  write data to region
get_bytes       unsafe (&self, data, bytes_per_row, region, mipmap)  read data from region

apple mapping

method          ObjC
width           [texture width]
height          [texture height]
depth           [texture depth]
pixel_format    [texture pixelFormat]
replace_region  [texture replaceRegion:mipmapLevel:withBytes:bytesPerRow:]
get_bytes       [texture getBytes:bytesPerRow:fromRegion:mipmapLevel:]
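
both region calls take a region and a row stride. a minimal sketch of the arithmetic for tightly packed data, assuming the region argument mirrors Metal's MTLRegion (an origin plus a size, all NSUInteger); region_2d, bytes_per_row and region_bytes are hypothetical helpers, and the crate's own types may differ.

```rust
/// Mirror of Metal's `MTLOrigin`.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MTLOrigin {
    pub x: usize,
    pub y: usize,
    pub z: usize,
}

/// Mirror of Metal's `MTLSize`.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MTLSize {
    pub width: usize,
    pub height: usize,
    pub depth: usize,
}

/// Mirror of Metal's `MTLRegion`: origin followed by size.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MTLRegion {
    pub origin: MTLOrigin,
    pub size: MTLSize,
}

/// Hypothetical helper modelled on `MTLRegionMake2D`: z = 0, depth = 1.
pub fn region_2d(x: usize, y: usize, width: usize, height: usize) -> MTLRegion {
    MTLRegion {
        origin: MTLOrigin { x, y, z: 0 },
        size: MTLSize { width, height, depth: 1 },
    }
}

/// Row stride for tightly packed pixel data covering the region.
pub fn bytes_per_row(region: &MTLRegion, bytes_per_pixel: usize) -> usize {
    region.size.width * bytes_per_pixel
}

/// Total bytes the caller's slice must hold for the whole region.
pub fn region_bytes(region: &MTLRegion, bytes_per_pixel: usize) -> usize {
    bytes_per_row(region, bytes_per_pixel) * region.size.height * region.size.depth
}
```

checking the data length against region_bytes before the call is one way to uphold the safety contract of the unsafe methods.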

synchronization

fence

GPU work tracking within a single command buffer.

method  signature          semantics
as_raw  (&self) -> ObjcId  raw pointer for encoder fence ops

event

synchronization between command buffers on same device.

method  signature          semantics
as_raw  (&self) -> ObjcId  raw pointer for command buffer signal/wait

shared event

CPU/GPU synchronization with monotonic counter.

method          signature          semantics
signaled_value  (&self) -> u64     current signaled counter value
as_raw          (&self) -> ObjcId  raw pointer
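
the counter model can be sketched on the CPU alone. the code below is an illustration under stated assumptions: MockSharedEvent is hypothetical, an atomic stands in for the event's counter, and a thread plays the GPU side that would normally signal from a command buffer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical CPU-only model of a shared event counter.
pub struct MockSharedEvent {
    value: AtomicU64,
}

impl MockSharedEvent {
    pub fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    /// Current counter value.
    pub fn signaled_value(&self) -> u64 {
        self.value.load(Ordering::Acquire)
    }

    /// Signal `v`; in this model the counter only ever moves forward.
    pub fn signal(&self, v: u64) {
        self.value.fetch_max(v, Ordering::Release);
    }

    /// Spin until the counter reaches `target`.
    pub fn wait_until(&self, target: u64) {
        while self.signaled_value() < target {
            thread::yield_now();
        }
    }
}

/// A "GPU" thread signals steps 1..=3 while the caller waits for step 3.
pub fn demo() -> u64 {
    let ev = Arc::new(MockSharedEvent::new());
    let gpu = {
        let ev = Arc::clone(&ev);
        thread::spawn(move || {
            for step in 1..=3u64 {
                ev.signal(step);
            }
        })
    };
    ev.wait_until(3);
    gpu.join().unwrap();
    ev.signaled_value()
}
```

waiting on "counter >= target" rather than "counter == target" means a waiter that arrives late still proceeds.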

conversion

fp16<->f32 via inline NEON assembly (aarch64) with software fallback.

function      signature             semantics
fp16_to_f32   (u16) -> f32          single half -> single precision
f32_to_fp16   (f32) -> u16          single -> half precision
cast_f16_f32  (&mut [f32], &[u16])  bulk half -> single (32/iter, 4x unrolled NEON)
cast_f32_f16  (&mut [u16], &[f32])  bulk single -> half (32/iter, 4x unrolled NEON)

tail: 8/iter NEON, then scalar fallback.
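
a software fallback for these conversions could look like the sketch below. it is not the crate's implementation: it assumes IEEE 754 binary16 with round-to-nearest-even on narrowing, and the bulk function simply processes the shorter of the two slices, which is also an assumption.

```rust
/// Half -> single. Exact: every half value is representable in f32.
pub fn fp16_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h as u32) >> 10) & 0x1f;
    let man = (h as u32) & 0x03ff;
    let bits = match exp {
        // zero or subnormal: the value is man * 2^-24
        0 => sign | (man as f32 / 16_777_216.0).to_bits(),
        // infinity or NaN: keep the payload bits
        0x1f => sign | 0x7f80_0000 | (man << 13),
        // normal: re-bias the exponent from 15 to 127
        _ => sign | ((exp + 112) << 23) | (man << 13),
    };
    f32::from_bits(bits)
}

/// Single -> half with round-to-nearest-even.
pub fn f32_to_fp16(f: f32) -> u16 {
    let bits = f.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let man = bits & 0x007f_ffff;
    if exp == 0xff {
        // infinity, or NaN with the quiet bit forced so it stays a NaN
        return sign | 0x7c00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // exponent re-biased for half precision
    if e >= 0x1f {
        return sign | 0x7c00; // too large: infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // too small: signed zero
        }
        // subnormal half: shift the 24-bit significand into place
        let m = man | 0x0080_0000;
        let shift = (14 - e) as u32; // 14..=24
        let half = m >> shift;
        let rem = m & ((1u32 << shift) - 1);
        let halfway = 1u32 << (shift - 1);
        let round = (rem > halfway || (rem == halfway && half & 1 == 1)) as u32;
        return sign | (half + round) as u16;
    }
    // normal half: drop 13 mantissa bits; a rounding carry may bump the exponent
    let half = ((e as u32) << 10) | (man >> 13);
    let rem = man & 0x1fff;
    let round = (rem > 0x1000 || (rem == 0x1000 && half & 1 == 1)) as u32;
    sign | (half + round) as u16
}

/// Scalar bulk conversion; handles min(dst.len(), src.len()) elements.
pub fn cast_f16_f32(dst: &mut [f32], src: &[u16]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = fp16_to_f32(s);
    }
}
```

a scalar loop like cast_f16_f32 is also the natural shape for the final tail after the vectorized iterations.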

autorelease pool

autorelease_pool(|| {
    // autoreleased ObjC objects valid here
})

required when using unretained/autoreleased command buffer and encoder variants.
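
a closure-scoped pool of this shape can be built on the runtime's objc_autoreleasePoolPush and objc_autoreleasePoolPop. the sketch below is an assumption about how such a wrapper might look, not the crate's source; PoolGuard, pool_push and pool_pop are hypothetical names, and on non-macOS targets the pool is a no-op so the example still compiles.

```rust
use std::ffi::c_void;

#[cfg(target_os = "macos")]
#[link(name = "objc")]
extern "C" {
    fn objc_autoreleasePoolPush() -> *mut c_void;
    fn objc_autoreleasePoolPop(pool: *mut c_void);
}

#[cfg(target_os = "macos")]
fn pool_push() -> *mut c_void {
    unsafe { objc_autoreleasePoolPush() }
}

#[cfg(target_os = "macos")]
fn pool_pop(pool: *mut c_void) {
    unsafe { objc_autoreleasePoolPop(pool) }
}

// No ObjC runtime on other targets: the pool is a no-op.
#[cfg(not(target_os = "macos"))]
fn pool_push() -> *mut c_void {
    std::ptr::null_mut()
}

#[cfg(not(target_os = "macos"))]
fn pool_pop(_pool: *mut c_void) {}

/// RAII guard so the pool is popped even if the closure panics.
struct PoolGuard(*mut c_void);

impl Drop for PoolGuard {
    fn drop(&mut self) {
        pool_pop(self.0);
    }
}

/// Run `f` inside an autorelease pool and return its result.
pub fn autorelease_pool<R>(f: impl FnOnce() -> R) -> R {
    let _guard = PoolGuard(pool_push());
    f()
}
```

objects autoreleased inside the closure are released when the guard drops, so nothing autoreleased should be returned from the closure.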

errors

DeviceNotFound                    no Metal GPU available
BufferCreationFailed(String)      buffer allocation failed
LibraryCompilationFailed(String)  MSL compilation error
FunctionNotFound(String)          shader function not in library
PipelineCreationFailed(String)    pipeline creation error
CommandBufferError(String)        command buffer execution error
EncoderCreationFailed             encoder creation failed
QueueCreationFailed               command queue creation failed
TextureCreationFailed(String)     texture creation failed
Io(io::Error)                     filesystem error
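
the variants above map naturally onto a Rust enum. a minimal sketch, assuming the enum is called GpuError (the spec does not name it) and that the display strings follow the descriptions in the list; the From impl is there so ? works on filesystem calls.

```rust
use std::fmt;
use std::io;

/// Hypothetical name; the variants are the ones listed in the spec.
#[derive(Debug)]
pub enum GpuError {
    DeviceNotFound,
    BufferCreationFailed(String),
    LibraryCompilationFailed(String),
    FunctionNotFound(String),
    PipelineCreationFailed(String),
    CommandBufferError(String),
    EncoderCreationFailed,
    QueueCreationFailed,
    TextureCreationFailed(String),
    Io(io::Error),
}

impl fmt::Display for GpuError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        use GpuError::*;
        match self {
            DeviceNotFound => write!(f, "no Metal GPU available"),
            BufferCreationFailed(m) => write!(f, "buffer allocation failed: {m}"),
            LibraryCompilationFailed(m) => write!(f, "MSL compilation error: {m}"),
            FunctionNotFound(m) => write!(f, "shader function not in library: {m}"),
            PipelineCreationFailed(m) => write!(f, "pipeline creation error: {m}"),
            CommandBufferError(m) => write!(f, "command buffer execution error: {m}"),
            EncoderCreationFailed => write!(f, "encoder creation failed"),
            QueueCreationFailed => write!(f, "command queue creation failed"),
            TextureCreationFailed(m) => write!(f, "texture creation failed: {m}"),
            Io(e) => write!(f, "filesystem error: {e}"),
        }
    }
}

impl std::error::Error for GpuError {}

impl From<io::Error> for GpuError {
    fn from(e: io::Error) -> Self {
        GpuError::Io(e)
    }
}
```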

execution model

  • one pipeline = one compiled shader function
  • command buffers submitted atomically via commit
  • GPU executes command buffers in order per queue
  • multiple queues enable concurrent GPU work
  • shared buffers need no explicit synchronization across command buffer boundaries (CPU access is safe once the command buffer has completed)
  • private buffers need blit encoder for CPU data transfer
  • launch (dispatchThreads) handles non-uniform grids automatically
  • launch_groups (dispatchThreadgroups) requires manual grid division
  • the compute dispatcher bypasses objc_msgSend for hot-loop performance

driver stack

aruminium crate (objc_msgSend FFI + IMP resolution)
  -> Metal.framework (linked at build time)
    -> GPU driver
      -> GPU hardware

Metal.framework is public. linked via #[link(name = "Metal", kind = "framework")]. core path: objc_msgSend with transmuted function pointers. hot path: pre-resolved IMP via class_getMethodImplementation.

render pipeline

RenderPipeline struct exists (wraps id<MTLRenderPipelineState>). no render encoder yet — compute-only driver.
