specs.md

aruminium — API specification

pure Rust driver for Apple Metal GPU. direct objc_msgSend FFI, zero external dependencies, only macOS system frameworks.

concepts

concept	what it is
device	a Metal GPU — discovered at runtime, owns all GPU resources
buffer	CPU/GPU memory region — shared (zero-copy) or private (GPU-only)
library	compiled shader code — one or more functions from MSL source
function	a single shader entry point — vertex, fragment, or kernel
pipeline	a compiled GPU state object — binds function + config for dispatch
queue	a serial command submission channel to the GPU
command buffer	a batch of encoded GPU commands — submitted atomically
encoder	records commands into a command buffer — compute or blit
dispatcher	pre-resolved IMP dispatch engine for inference hot loops
texture	GPU image data — 2D/3D, region read/write
fence	GPU work tracking within a single command buffer
event	synchronization between command buffers
shared event	CPU/GPU synchronization with signaled counter

lifecycle

source  ->  compile  ->  pipeline  ->  encode  ->  commit  ->  complete
  MSL      MTLLibrary   pipeline    encoder    cmdBuf      GPU done

device

method	signature	semantics
open	`() -> Result<Gpu>`	get default Metal GPU
all	`() -> Result<Vec<Gpu>>`	enumerate all Metal GPUs
name	`(&self) -> String`	device name (e.g. "Apple M1 Pro")
has_unified_memory	`(&self) -> bool`	shared CPU/GPU memory architecture
max_buffer_length	`(&self) -> usize`	max buffer allocation in bytes
max_threads_per_threadgroup	`(&self) -> MTLSize`	max threads per threadgroup
recommended_max_working_set_size	`(&self) -> u64`	recommended GPU memory budget
new_command_queue	`(&self) -> Result<Queue>`	create command queue
buffer	`(&self, bytes) -> Result<Buffer>`	allocate shared buffer (CPU+GPU)
buffer_private	`(&self, bytes) -> Result<Buffer>`	allocate private buffer (GPU-only)
buffer_with_data	`(&self, &[u8]) -> Result<Buffer>`	shared buffer with initial data
compile	`(&self, &str) -> Result<ShaderLib>`	compile MSL source
pipeline	`(&self, &Shader) -> Result<Pipeline>`	create compute pipeline
texture	`(&self, desc) -> Result<Texture>`	create texture from descriptor (unsafe)
fence	`(&self) -> Fence`	create fence
event	`(&self) -> Event`	create event
shared_event	`(&self) -> SharedEvent`	create shared event

apple mapping

method	ObjC
open	MTLCreateSystemDefaultDevice()
all	MTLCopyAllDevices()
name	[device name]
has_unified_memory	[device hasUnifiedMemory]
max_buffer_length	[device maxBufferLength]
max_threads_per_threadgroup	[device maxThreadsPerThreadgroup]
recommended_max_working_set_size	[device recommendedMaxWorkingSetSize]
new_command_queue	[device newCommandQueue]
buffer	[device newBufferWithLength:options:] (StorageModeShared)
buffer_private	[device newBufferWithLength:options:] (StorageModePrivate)
buffer_with_data	[device newBufferWithBytes:length:options:]
compile	[device newLibraryWithSource:options:error:]
pipeline	[device newComputePipelineStateWithFunction:error:]
texture	[device newTextureWithDescriptor:]
fence	[device newFence]
event	[device newEvent]
shared_event	[device newSharedEvent]

buffer

CPU/GPU memory region. two storage modes:

shared (default) — zero-copy, CPU and GPU share physical memory. no lock/unlock needed. contents pointer cached at creation.
private — GPU-only, higher bandwidth for inter-kernel buffers. CPU cannot read/write. use blit encoder to copy data in/out.

method	signature	semantics
is_shared	`(&self) -> bool`	true if CPU-accessible (shared mode)
contents	`(&self) -> *mut c_void`	raw pointer to shared memory (cached)
read	`(&self, \|&[u8]\|)`	read access via closure
write	`(&self, \|&mut [u8]\|)`	write access via closure
read_f32	`(&self, \|&[f32]\|)`	typed read as f32
write_f32	`(&self, \|&mut [f32]\|)`	typed write as f32
size	`(&self) -> usize`	allocation in bytes
drop	automatic	[buffer release]

apple mapping

method	ObjC
contents	[buffer contents]
size	construction parameter
drop	objc_release

library

compiled shader code from MSL source text.

method	signature	semantics
function	`(&self, &str) -> Result<Shader>`	get function by name
function_names	`(&self) -> Vec<String>`	list all function names

apple mapping

method	ObjC
function	[library newFunctionWithName:]
function_names	[library functionNames]

function

a single shader entry point extracted from a library.

method	signature	semantics
name	`(&self) -> String`	function name

compute pipeline

compiled GPU state — function + hardware config.

method	signature	semantics
max_total_threads_per_threadgroup	`(&self) -> usize`	max threads per threadgroup for this pipeline
thread_execution_width	`(&self) -> usize`	SIMD width (32 on Apple GPU)
static_threadgroup_memory_length	`(&self) -> usize`	threadgroup memory used by pipeline (bytes)

apple mapping

method	ObjC
max_total_threads_per_threadgroup	[pipeline maxTotalThreadsPerThreadgroup]
thread_execution_width	[pipeline threadExecutionWidth]
static_threadgroup_memory_length	[pipeline staticThreadgroupMemoryLength]

command queue

method	signature	semantics
commands	`(&self) -> Result<Commands>`	retained, ARC fast-retain
commands_unretained	`unsafe (&self) -> Result<Commands>`	autoreleased, no retain overhead
commands_fast	`unsafe (&self) -> Result<Commands>`	unretained references — Metal skips resource retain/release
commands_unchecked	`unsafe (&self) -> Commands`	unretained refs, no null check
commands_autoreleased	`unsafe (&self) -> Commands`	fastest — must be in autorelease_pool

overhead hierarchy (low to high):

commands_autoreleased   — zero overhead, requires pool
commands_unchecked      — no null check, unretained refs
commands_fast           — unretained refs, null checked
commands_unretained     — autoreleased, null checked
commands                — retained, safe, standard

apple mapping

method	ObjC
commands	[queue commandBuffer] + objc_retainAutoreleasedReturnValue
commands_unretained	[queue commandBuffer] (no retain)
commands_fast	[queue commandBufferWithUnretainedReferences] + retain
commands_unchecked	[queue commandBufferWithUnretainedReferences] + ARC fast-retain
commands_autoreleased	[queue commandBufferWithUnretainedReferences] (no retain)

command buffer

method	signature	semantics
encoder	`(&self) -> Result<Encoder>`	retained compute encoder
encoder_unretained	`unsafe (&self) -> Result<Encoder>`	autoreleased
encoder_unchecked	`unsafe (&self) -> Encoder`	no null check, retained
encoder_autoreleased	`unsafe (&self) -> Encoder`	fastest, requires pool
copier	`(&self) -> Result<Copier>`	blit encoder
submit	`(&self)`	submit for GPU execution
wait	`(&self)`	block until GPU done
status	`(&self) -> u64`	execution status code
error	`(&self) -> Option<String>`	error description if failed
gpu_start_time	`(&self) -> f64`	GPU start time (seconds since boot)
gpu_end_time	`(&self) -> f64`	GPU end time (seconds since boot)
gpu_time	`(&self) -> f64`	GPU execution duration (end - start)

apple mapping

method	ObjC
encoder	[cmdBuf computeCommandEncoder] + ARC fast-retain
copier	[cmdBuf blitCommandEncoder]
submit	[cmdBuf commit]
wait	[cmdBuf waitUntilCompleted]
status	[cmdBuf status]
error	[cmdBuf error]
gpu_start_time	[cmdBuf GPUStartTime]
gpu_end_time	[cmdBuf GPUEndTime]

compute encoder

method	signature	semantics
bind	`(&self, &Pipeline)`	bind compute pipeline
bind_buffer	`(&self, &Buffer, offset, index)`	bind buffer at index
push	`(&self, &[u8], index)`	inline constant data
launch	`(&self, grid, group)`	dispatch with auto non-uniform grid handling
launch_groups	`(&self, groups, threads)`	dispatch with explicit group count
finish	`(&self)`	finish encoding

apple mapping

method	ObjC
bind	[encoder setComputePipelineState:]
bind_buffer	[encoder setBuffer:offset:atIndex:]
push	[encoder setBytes:length:atIndex:]
launch	[encoder dispatchThreads:threadsPerThreadgroup:]
launch_groups	[encoder dispatchThreadgroups:threadsPerThreadgroup:]
finish	[encoder endEncoding]

blit encoder

method	signature	semantics
copy	`(&self, src, src_off, dst, dst_off, size)`	GPU buffer-to-buffer copy
finish	`(&self)`	finish encoding

apple mapping

method	ObjC
copy	[encoder copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:]
finish	[encoder endEncoding]

compute dispatcher

pre-resolved IMP dispatch engine for inference hot loops. resolves all ObjC method implementations at construction — every dispatch call goes through direct function pointers, bypassing objc_msgSend entirely.

method	signature	semantics
new	`(&Queue) -> Self`	resolve all IMPs eagerly
dispatch	`unsafe (&self, pipeline, buffers, grid, group)`	single dispatch: encode + commit + wait
dispatch_with_bytes	`unsafe (&self, pipeline, buffers, bytes, index, grid, group)`	single dispatch with inline constants
batch	`unsafe (&self, \|&Batch\|)`	multiple dispatches in one command buffer
batch_raw	`unsafe (&self, \|&Batch\|)`	batch without autorelease management (caller manages pool)
batch_async	`unsafe (&self, \|&Batch\|) -> GpuFuture`	encode + commit, return handle for deferred wait

batch

provided to batch closures. same IMP-resolved hot path.

method	signature	semantics
bind	`(&self, &Pipeline)`	bind pipeline
bind_buffer	`(&self, &Buffer, offset, index)`	bind buffer
push	`(&self, &[u8], index)`	inline constants
launch	`(&self, grid, group)`	dispatch
launch_groups	`(&self, groups, threads)`	dispatch with explicit groups

gpu future

handle for committed but not yet completed command buffer.

method	signature	semantics
wait	`(self)`	block until GPU finishes, release command buffer
drop	automatic	if not waited, waits + releases (prevents leak)

pipelining pattern

let mut prev = None;
for pass in passes {
    let handle = disp.batch_async(|batch| { ... });
    if let Some(h) = prev { h.wait(); }
    prev = Some(handle);
}
if let Some(h) = prev { h.wait(); }

overlap GPU execution of batch N with CPU encoding of batch N+1.

texture

GPU image data. wraps id<MTLTexture>.

method	signature	semantics
width	`(&self) -> usize`	width in pixels
height	`(&self) -> usize`	height in pixels
depth	`(&self) -> usize`	depth (3D textures)
pixel_format	`(&self) -> usize`	MTLPixelFormat value
replace_region	`unsafe (&self, region, mipmap, data, bytes_per_row)`	write data to region
get_bytes	`unsafe (&self, data, bytes_per_row, region, mipmap)`	read data from region

apple mapping

method	ObjC
width	[texture width]
height	[texture height]
depth	[texture depth]
pixel_format	[texture pixelFormat]
replace_region	[texture replaceRegion:mipmapLevel:withBytes:bytesPerRow:]
get_bytes	[texture getBytes:bytesPerRow:fromRegion:mipmapLevel:]

synchronization

fence

GPU work tracking within a single command buffer.

method	signature	semantics
as_raw	`(&self) -> ObjcId`	raw pointer for encoder fence ops

event

synchronization between command buffers on same device.

method	signature	semantics
as_raw	`(&self) -> ObjcId`	raw pointer for command buffer signal/wait

shared event

CPU/GPU synchronization with monotonic counter.

method	signature	semantics
signaled_value	`(&self) -> u64`	current signaled counter value
as_raw	`(&self) -> ObjcId`	raw pointer

conversion

fp16<->f32 via inline NEON assembly (aarch64) with software fallback.

function	signature	semantics
fp16_to_f32	`(u16) -> f32`	single half -> single precision
f32_to_fp16	`(f32) -> u16`	single -> half precision
cast_f16_f32	`(&mut [f32], &[u16])`	bulk half -> single (32/iter, 4x unrolled NEON)
cast_f32_f16	`(&mut [u16], &[f32])`	bulk single -> half (32/iter, 4x unrolled NEON)

tail: 8/iter NEON, then scalar fallback.

autorelease pool

autorelease_pool(|| {
    // autoreleased ObjC objects valid here
})

required when using unretained/autoreleased command buffer and encoder variants.

errors

DeviceNotFound              no Metal GPU available
BufferCreationFailed(String) buffer allocation failed
LibraryCompilationFailed(String) MSL compilation error
FunctionNotFound(String)    shader function not in library
PipelineCreationFailed(String) pipeline creation error
CommandBufferError(String)  command buffer execution error
EncoderCreationFailed       encoder creation failed
QueueCreationFailed         command queue creation failed
TextureCreationFailed(String) texture creation failed
Io(io::Error)               filesystem error

execution model

one pipeline = one compiled shader function
command buffers submitted atomically via commit
GPU executes command buffers in order per queue
multiple queues enable concurrent GPU work
shared buffers need no synchronization between command buffer boundaries
private buffers need blit encoder for CPU data transfer
dispatch_threads handles non-uniform grids automatically
dispatch_threadgroups requires manual grid division
Dispatch bypasses objc_msgSend for hot-loop performance

driver stack

aruminium crate (objc_msgSend FFI + IMP resolution)
  -> Metal.framework (linked at build time)
    -> GPU driver
      -> GPU hardware

Metal.framework is public. linked via #[link(name = "Metal", kind = "framework")]. core path: objc_msgSend with transmuted function pointers. hot path: pre-resolved IMP via class_getMethodImplementation.

render pipeline

RenderPipeline struct exists (wraps id<MTLRenderPipelineState>). no render encoder yet — compute-only driver.

Synonyms

hemera/specs

Hemera: A Permanent Hash Primitive for Planetary-Scale Collective Intelligence | field | value | |----------|--------------------------------| | version | 2.0 | | status | Decision Record | | authors | mastercyb | | date | March 2026 | Abstract Hemera is the cryptographic hash primitive for cyber,…

bbg/specs

specs

zheng/specs

zheng: polynomial proof system one IOP: SuperSpartan + sumcheck (CCS constraints, O(N) prover, O(log N) verifier). one folding: HyperNova (CCS-native, ~30 field ops per fold, one decider at the end). one hash: hemera (~3 calls per proof — binding, Fiat-Shamir seed, domain separation). five…

nox/specs

nox reference canonical specification of the nox virtual machine. this is the source of truth — when code and reference disagree, fix reference first, then propagate to code. specifications | page | scope | status | |------|-------|--------| | vm.md | overview, field, hash, algebra polymorphism,…

lens/specs

lens reference canonical specification for polynomial commitment — five lenses for five algebras. the trait three operations. commit is O(N). open produces a proof. verify checks the proof. all transparent (no trusted setup), all post-quantum. see trait for the full specification. naming convention…

strata/genies/specs

genies specification canonical reference for isogeny group action arithmetic: F_q field operations, supersingular curves, isogeny computation, and class group action. spec pages | page | defines | |------|---------| | [prime](/strata/genies/specs/prime) | CSIDH prime form, selection criteria,…

honeycrisp/acpu/specs

acpu — API specification pure Rust driver for Apple Silicon CPU compute. direct access to every useful compute unit in M1–M4: matrix coprocessor, vector engine, numeric extensions, atomics, memory system, performance counters. zero external dependencies — only inline assembly and system calls.…

strata/nebu/specs

nebu specification canonical reference for the Goldilocks prime field, its arithmetic, and its hardware. spec pages | page | defines | |------|---------| | field | prime, elements, arithmetic, properties, why Goldilocks | | ntt | Number Theoretic Transform, roots of unity, butterfly, Cooley-Tukey |…

honeycrisp/rane/specs

specs

strata/trop/specs

trop specification canonical reference for tropical semiring arithmetic: the (min, +) semiring, its matrix algebra, and dual certificate verification. spec pages | page | defines | |------|---------| | [semiring](/strata/trop/specs/semiring) | tropical semiring axioms, (min, +) definition, identity…

honeycrisp/unimem/specs

unimem: Zero-Copy Memory Driver for Apple Silicon Goal Single pinned buffer visible to CPU, GPU, AMX, and ANE — zero copies between pipeline stages. The memory layer for inference on unified memory. v1 adds NVMe DMA via DEXT — full zero-copy from disk to compute. Why this exists Every inference…

strata/kuro/specs

kuro specification canonical reference for the F₂ tower field, its arithmetic, packed operations, and hardware targets. spec pages | page | defines | |------|---------| | [field](/strata/kuro/specs/field) | tower levels, all field operations, properties, cost model vs Goldilocks | |…

strata/jali/specs

jali reference canonical specification for polynomial ring arithmetic R_q = F_p[x]/(x^n+1) over Goldilocks. what jali is jali (जाली — lattice/mesh) is the fifth execution algebra for cyber. polynomial ring elements are structured vectors of n Goldilocks field elements with multiplication defined by…

honeycrisp/aruminium/specs.md

aruminium — API specification

concepts

lifecycle

device

apple mapping

buffer

apple mapping

library

apple mapping

function

compute pipeline

apple mapping

command queue

apple mapping

command buffer

apple mapping

compute encoder

apple mapping

blit encoder

apple mapping

compute dispatcher

batch

gpu future

pipelining pattern

texture

apple mapping

synchronization

fence

event

shared event

conversion

autorelease pool

errors

execution model

driver stack

render pipeline

Synonyms

Neighbours