aruminium — API specification

pure Rust driver for Apple Metal GPU. direct objc_msgSend FFI, zero external dependencies, only macOS system frameworks.

concepts

concept what it is
device a Metal GPU — discovered at runtime, owns all GPU resources
buffer CPU/GPU memory region — shared (zero-copy) or private (GPU-only)
library compiled shader code — one or more functions from MSL source
function a single shader entry point — vertex, fragment, or kernel
pipeline a compiled GPU state object — binds function + config for dispatch
queue a serial command submission channel to the GPU
command buffer a batch of encoded GPU commands — submitted atomically
encoder records commands into a command buffer — compute or blit
dispatcher pre-resolved IMP dispatch engine for inference hot loops
texture GPU image data — 2D/3D, region read/write
fence GPU work tracking within a single command buffer
event synchronization between command buffers
shared event CPU/GPU synchronization with signaled counter

lifecycle

source  ->  compile  ->  pipeline  ->  encode  ->  commit  ->  complete
  MSL      MTLLibrary   pipeline    encoder    cmdBuf      GPU done

device

method signature semantics
open () -> Result<Gpu> get default Metal GPU
all () -> Result<Vec<Gpu>> enumerate all Metal GPUs
name (&self) -> String device name (e.g. "Apple M1 Pro")
has_unified_memory (&self) -> bool shared CPU/GPU memory architecture
max_buffer_length (&self) -> usize max buffer allocation in bytes
max_threads_per_threadgroup (&self) -> MTLSize max threads per threadgroup
recommended_max_working_set_size (&self) -> u64 recommended GPU memory budget
new_command_queue (&self) -> Result<Queue> create command queue
buffer (&self, bytes) -> Result<Buffer> allocate shared buffer (CPU+GPU)
buffer_private (&self, bytes) -> Result<Buffer> allocate private buffer (GPU-only)
buffer_with_data (&self, &[u8]) -> Result<Buffer> shared buffer with initial data
buffer_wrap unsafe (&self, *mut c_void, usize) -> Result<Buffer> zero-copy wrap of caller-owned page-aligned memory
wrap (&self, &Block) -> Result<Buffer> zero-copy wrap of a unimem::Block
compile (&self, &str) -> Result<ShaderLib> compile MSL source
pipeline (&self, &Shader) -> Result<Pipeline> create compute pipeline
texture unsafe (&self, desc) -> Result<Texture> create texture from descriptor
fence (&self) -> Result<Fence> create fence
event (&self) -> Result<Event> create event
shared_event (&self) -> Result<SharedEvent> create shared event
as_raw (&self) -> ObjcId raw MTLDevice

apple mapping

method ObjC
open MTLCreateSystemDefaultDevice()
all MTLCopyAllDevices()
name [device name]
has_unified_memory [device hasUnifiedMemory]
max_buffer_length [device maxBufferLength]
max_threads_per_threadgroup [device maxThreadsPerThreadgroup]
recommended_max_working_set_size [device recommendedMaxWorkingSetSize]
new_command_queue [device newCommandQueue]
buffer [device newBufferWithLength:options:] (StorageModeShared)
buffer_private [device newBufferWithLength:options:] (StorageModePrivate)
buffer_with_data [device newBufferWithBytes:length:options:]
compile [device newLibraryWithSource:options:error:]
pipeline [device newComputePipelineStateWithFunction:error:]
texture [device newTextureWithDescriptor:]
fence [device newFence]
event [device newEvent]
shared_event [device newSharedEvent]
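buffer_wrap expects caller-owned, page-aligned memory. a minimal std-only sketch of producing such memory follows. PageBlock, round_up_to_page and the 16 KiB PAGE_SIZE are assumptions for illustration and are not part of aruminium; real code should query the page size at runtime.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Assumed page size: 16 KiB, as on Apple Silicon. Real code should query it.
pub const PAGE_SIZE: usize = 16 * 1024;

/// Round a byte count up to a whole number of pages.
pub fn round_up_to_page(len: usize) -> usize {
    (len + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE
}

/// Hypothetical owner of a zeroed, page-aligned allocation.
pub struct PageBlock {
    ptr: *mut u8,
    layout: Layout,
}

impl PageBlock {
    /// Allocate at least `len` bytes, rounded up to whole pages (minimum one).
    pub fn new(len: usize) -> Option<Self> {
        let size = round_up_to_page(len.max(1));
        let layout = Layout::from_size_align(size, PAGE_SIZE).ok()?;
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            None
        } else {
            Some(PageBlock { ptr, layout })
        }
    }

    /// Page-aligned base pointer.
    pub fn as_mut_ptr(&self) -> *mut u8 {
        self.ptr
    }

    /// Allocation size in bytes (a multiple of PAGE_SIZE).
    pub fn size(&self) -> usize {
        self.layout.size()
    }
}

impl Drop for PageBlock {
    fn drop(&mut self) {
        // SAFETY: `ptr` came from `alloc_zeroed` with this exact layout.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}
```

the pointer (cast to *mut c_void) and size are the two arguments buffer_wrap takes. because the memory stays caller-owned, the block has to outlive any Buffer wrapping it.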

buffer

CPU/GPU memory region. two storage modes:

  • shared (default) — zero-copy, CPU and GPU share physical memory. no lock/unlock needed. contents pointer cached at creation.
  • private — GPU-only, higher bandwidth for inter-kernel buffers. CPU cannot read/write. use blit encoder to copy data in/out.

method signature semantics
is_shared (&self) -> bool true if CPU-accessible (shared mode)
as_bytes (&self) -> &[u8] direct slice view; panics on private buffer
read (&self, |&[u8]|) read access via closure; panics on private buffer
write (&self, |&mut [u8]|) write access via closure; panics on private buffer
read_f32 (&self, |&[f32]|) typed read as f32
write_f32 (&self, |&mut [f32]|) typed write as f32
size (&self) -> usize allocation in bytes
as_raw (&self) -> ObjcId raw MTLBuffer
drop automatic [buffer release]

apple mapping

method ObjC
as_bytes / read / write [buffer contents] (pointer cached at construction)
size construction parameter
drop objc_release
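read_f32 and write_f32 expose the same bytes as f32 values. how the crate performs that cast is not stated here; the sketch below shows one checked way to reinterpret a byte slice, with bytes_as_f32 and f32_as_bytes as hypothetical helper names.

```rust
use std::mem::{align_of, size_of};
use std::slice;

/// View f32 data as raw bytes (always valid: u8 has alignment 1).
pub fn f32_as_bytes(data: &[f32]) -> &[u8] {
    // SAFETY: f32 has no padding; the byte length covers exactly `data`.
    unsafe { slice::from_raw_parts(data.as_ptr() as *const u8, data.len() * size_of::<f32>()) }
}

/// View bytes as f32, or None if the slice is misaligned or has a
/// trailing partial element.
pub fn bytes_as_f32(bytes: &[u8]) -> Option<&[f32]> {
    let aligned = bytes.as_ptr() as usize % align_of::<f32>() == 0;
    let whole = bytes.len() % size_of::<f32>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: alignment and length were checked; every bit pattern is a valid f32.
    Some(unsafe {
        slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / size_of::<f32>())
    })
}
```

a page-aligned shared buffer always passes the alignment check, so only the length check matters in practice.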

library

compiled shader code from MSL source text.

method signature semantics
function (&self, &str) -> Result<Shader> get function by name
function_names (&self) -> Vec<String> list all function names
as_raw (&self) -> ObjcId raw MTLLibrary

apple mapping

method ObjC
function [library newFunctionWithName:]
function_names [library functionNames]

function

a single shader entry point extracted from a library.

method signature semantics
name (&self) -> String function name
as_raw (&self) -> ObjcId raw MTLFunction

compute pipeline

compiled GPU state — function + hardware config.

method signature semantics
max_total_threads_per_threadgroup (&self) -> usize max threads per threadgroup for this pipeline
thread_execution_width (&self) -> usize SIMD width (32 on Apple GPU)
static_threadgroup_memory_length (&self) -> usize threadgroup memory used by pipeline (bytes)
as_raw (&self) -> ObjcId raw MTLComputePipelineState

apple mapping

method ObjC
max_total_threads_per_threadgroup [pipeline maxTotalThreadsPerThreadgroup]
thread_execution_width [pipeline threadExecutionWidth]
static_threadgroup_memory_length [pipeline staticThreadgroupMemoryLength]
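the two pipeline limits are typically combined to pick a threadgroup shape: width equal to the SIMD width, height filling the remaining thread budget. a minimal sketch of that arithmetic follows; threadgroup_2d and threadgroup_1d are hypothetical helpers, not crate functions.

```rust
/// 2D threadgroup shape: one SIMD group wide, as tall as the limit allows.
pub fn threadgroup_2d(max_total_threads: usize, execution_width: usize) -> (usize, usize) {
    assert!(execution_width > 0 && max_total_threads >= execution_width);
    (execution_width, max_total_threads / execution_width)
}

/// 1D threadgroup size for `n` elements: `n` rounded up to a multiple of
/// the SIMD width, capped at the largest such multiple within the limit.
pub fn threadgroup_1d(max_total_threads: usize, execution_width: usize, n: usize) -> usize {
    assert!(execution_width > 0 && max_total_threads >= execution_width);
    let cap = max_total_threads / execution_width * execution_width;
    let wanted = (n.max(1) + execution_width - 1) / execution_width * execution_width;
    wanted.min(cap)
}
```

with max_total_threads_per_threadgroup = 1024 and thread_execution_width = 32 this yields a 32 x 32 group.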

command queue

method signature semantics
commands (&self) -> Result<Commands> retained, ARC fast-retain
commands_unretained unsafe (&self) -> Result<Commands> autoreleased, no retain overhead
commands_fast unsafe (&self) -> Result<Commands> unretained references — Metal skips resource retain/release
commands_unchecked unsafe (&self) -> Commands unretained refs, no null check
commands_autoreleased unsafe (&self) -> Commands fastest — must be in autorelease_pool

overhead hierarchy (low to high):

commands_autoreleased   — zero overhead, requires pool
commands_unchecked      — no null check, unretained refs
commands_fast           — unretained refs, null checked
commands_unretained     — autoreleased, null checked
commands                — retained, safe, standard

apple mapping

method ObjC
commands [queue commandBuffer] + objc_retainAutoreleasedReturnValue
commands_unretained [queue commandBuffer] (no retain)
commands_fast [queue commandBufferWithUnretainedReferences] + retain
commands_unchecked [queue commandBufferWithUnretainedReferences] + ARC fast-retain
commands_autoreleased [queue commandBufferWithUnretainedReferences] (no retain)

command buffer

method signature semantics
encoder (&self) -> Result<Encoder> retained compute encoder
encoder_unretained unsafe (&self) -> Result<Encoder> autoreleased
encoder_unchecked unsafe (&self) -> Encoder no null check, retained
encoder_autoreleased unsafe (&self) -> Encoder fastest, requires pool
copier (&self) -> Result<Copier> blit encoder
render_encoder (&self, &RenderPassDescriptor) -> Result<RenderEncoder> render encoder for a pass
submit (&self) submit for GPU execution
wait (&self) block until GPU done
status (&self) -> u64 execution status code
error (&self) -> Option<String> error description if failed
gpu_start_time (&self) -> f64 GPU start time (seconds since boot)
gpu_end_time (&self) -> f64 GPU end time (seconds since boot)
gpu_time (&self) -> f64 GPU execution duration (end - start)
as_raw (&self) -> ObjcId raw MTLCommandBuffer

apple mapping

method ObjC
encoder [cmdBuf computeCommandEncoder] + ARC fast-retain
copier [cmdBuf blitCommandEncoder]
render_encoder [cmdBuf renderCommandEncoderWithDescriptor:] + retain
submit [cmdBuf commit]
wait [cmdBuf waitUntilCompleted]
status [cmdBuf status]
error [cmdBuf error]
gpu_start_time [cmdBuf GPUStartTime]
gpu_end_time [cmdBuf GPUEndTime]

compute encoder

method signature semantics
bind (&self, &Pipeline) bind compute pipeline
bind_buffer (&self, &Buffer, offset, index) bind buffer at index
push (&self, &[u8], index) inline constant data
launch (&self, grid, group) dispatch with auto non-uniform grid handling
launch_groups (&self, groups, threads) dispatch with explicit group count
finish (&self) finish encoding
as_raw (&self) -> ObjcId raw MTLComputeCommandEncoder

apple mapping

method ObjC
bind [encoder setComputePipelineState:]
bind_buffer [encoder setBuffer:offset:atIndex:]
push [encoder setBytes:length:atIndex:]
launch [encoder dispatchThreads:threadsPerThreadgroup:]
launch_groups [encoder dispatchThreadgroups:threadsPerThreadgroup:]
finish [encoder endEncoding]

blit encoder

method signature semantics
copy (&self, src, src_off, dst, dst_off, size) GPU buffer-to-buffer copy
finish (&self) finish encoding
as_raw (&self) -> ObjcId raw MTLBlitCommandEncoder

apple mapping

method ObjC
copy [encoder copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:]
finish [encoder endEncoding]

compute dispatcher

pre-resolved IMP dispatch engine for inference hot loops. resolves all ObjC method implementations at construction — every dispatch call goes through direct function pointers, bypassing objc_msgSend entirely.

method signature semantics
new (&Queue) -> Self resolve all IMPs eagerly
dispatch unsafe (&self, pipeline, buffers, grid, group) single dispatch: encode + commit + wait
dispatch_with_bytes unsafe (&self, pipeline, buffers, bytes, index, grid, group) single dispatch with inline constants
batch unsafe (&self, |&Batch|) multiple dispatches in one command buffer
batch_raw unsafe (&self, |&Batch|) batch without autorelease management (caller manages pool)
batch_async unsafe (&self, |&Batch|) -> GpuFuture encode + commit, return handle for deferred wait

batch

provided to batch closures. same IMP-resolved hot path.

method signature semantics
bind (&self, &Pipeline) bind pipeline
bind_buffer (&self, &Buffer, offset, index) bind buffer
push (&self, &[u8], index) inline constants
launch (&self, grid, group) dispatch
launch_groups (&self, groups, threads) dispatch with explicit groups
memory_barrier_buffers (&self) MTLBarrierScopeBuffers between dispatches in same encoder

gpu future

handle for committed but not yet completed command buffer.

method signature semantics
wait (self) block until GPU finishes, release command buffer
drop automatic if not waited, waits + releases (prevents leak)

pipelining pattern

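let mut prev = None;
for pass in passes {
    // batch_async is unsafe: encode + commit this pass without waiting
    let handle = unsafe { disp.batch_async(|batch| { ... }) };
    // then block on the previous pass, which ran on the GPU during encoding
    if let Some(h) = prev { h.wait(); }
    prev = Some(handle);
}
// drain the last in-flight pass
if let Some(h) = prev { h.wait(); }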

overlap GPU execution of batch N with CPU encoding of batch N+1.
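the same pattern can be exercised without a GPU. in the sketch below a worker thread stands in for the GPU and MockFuture stands in for GpuFuture; both are hypothetical, and unlike the real wait, this one returns a value so the ordering can be checked.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for GpuFuture: a worker thread plays the GPU.
pub struct MockFuture {
    handle: Option<JoinHandle<u32>>,
}

impl MockFuture {
    /// "Commit" a pass; the work runs asynchronously on another thread.
    pub fn submit(pass: u32) -> Self {
        MockFuture { handle: Some(thread::spawn(move || pass * 2)) }
    }

    /// Block until the work finishes, consuming the handle.
    pub fn wait(mut self) -> u32 {
        let handle = self.handle.take().expect("handle already taken");
        handle.join().expect("worker panicked")
    }
}

impl Drop for MockFuture {
    /// If never waited, wait here so in-flight work is not abandoned.
    fn drop(&mut self) {
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
    }
}

/// Submit pass N+1 before waiting on pass N; results come back in order.
pub fn run_pipelined(passes: &[u32]) -> Vec<u32> {
    let mut results = Vec::with_capacity(passes.len());
    let mut prev: Option<MockFuture> = None;
    for &pass in passes {
        let next = MockFuture::submit(pass);
        if let Some(h) = prev.take() {
            results.push(h.wait());
        }
        prev = Some(next);
    }
    if let Some(h) = prev {
        results.push(h.wait());
    }
    results
}
```

at most two passes are in flight at any time, which bounds memory while still hiding encode latency.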

texture

GPU image data. wraps id<MTLTexture>.

method signature semantics
width (&self) -> usize width in pixels
height (&self) -> usize height in pixels
depth (&self) -> usize depth (3D textures)
pixel_format (&self) -> usize MTLPixelFormat value
replace_region unsafe (&self, region, mipmap, data, bytes_per_row) write data to region
get_bytes unsafe (&self, data, bytes_per_row, region, mipmap) read data from region

apple mapping

method ObjC
width [texture width]
height [texture height]
depth [texture depth]
pixel_format [texture pixelFormat]
replace_region [texture replaceRegion:mipmapLevel:withBytes:bytesPerRow:]
get_bytes [texture getBytes:bytesPerRow:fromRegion:mipmapLevel:]

synchronization

fence

GPU work tracking within a single command buffer.

method signature semantics
as_raw (&self) -> ObjcId raw pointer for encoder fence ops

event

synchronization between command buffers on same device.

method signature semantics
as_raw (&self) -> ObjcId raw pointer for command buffer signal/wait

shared event

CPU/GPU synchronization with monotonic counter.

method signature semantics
signaled_value (&self) -> u64 current signaled counter value
as_raw (&self) -> ObjcId raw pointer

conversion

fp16<->f32 via inline NEON assembly (aarch64) with software fallback.

function signature semantics
fp16_to_f32 (u16) -> f32 single half -> single precision
f32_to_fp16 (f32) -> u16 single -> half precision
cast_f16_f32 (&mut [f32], &[u16]) bulk half -> single (32/iter, 4x unrolled NEON)
cast_f32_f16 (&mut [u16], &[f32]) bulk single -> half (32/iter, 4x unrolled NEON)

tail: 8/iter NEON, then scalar fallback.
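the scalar fallback is not shown in this spec. the following is a minimal sketch of what a software conversion with the same signatures could look like, assuming IEEE 754 binary16 and round-to-nearest-even; it is not the crate's actual implementation.

```rust
/// Half -> single. Exact for every half value.
pub fn fp16_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x03ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign, // signed zero
        // subnormal: value = frac * 2^-24, exactly representable in f32
        (0, _) => sign | (frac as f32 * (1.0 / 16_777_216.0)).to_bits(),
        (0x1f, _) => sign | 0x7f80_0000 | (frac << 13), // inf / NaN
        _ => sign | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(bits)
}

/// Single -> half, round-to-nearest-even, overflow to infinity.
pub fn f32_to_fp16(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let frac = bits & 0x007f_ffff;
    if exp == 0xff {
        let quiet = if frac != 0 { 0x0200 } else { 0 }; // keep NaN a NaN
        return sign | 0x7c00 | quiet;
    }
    let e = exp - 127 + 15; // rebias 127 -> 15
    if e >= 0x1f {
        return sign | 0x7c00; // too large: infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // below half the smallest subnormal: signed zero
        }
        let m = frac | 0x0080_0000; // restore the implicit leading 1
        let shift = (14 - e) as u32; // 14..=24
        let half = m >> shift;
        let rem = m & ((1u32 << shift) - 1);
        let tie = 1u32 << (shift - 1);
        let up = (rem > tie || (rem == tie && half & 1 == 1)) as u32;
        return sign | (half + up) as u16;
    }
    let half = ((e as u32) << 10) | (frac >> 13);
    let rem = frac & 0x1fff;
    let up = (rem > 0x1000 || (rem == 0x1000 && half & 1 == 1)) as u32;
    sign | (half + up) as u16 // a carry into the exponent is the correct result
}

/// Scalar bulk conversion, same argument order as cast_f16_f32.
pub fn cast_f16_f32_scalar(dst: &mut [f32], src: &[u16]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = fp16_to_f32(s);
    }
}
```

every non-NaN half value round-trips through f32 unchanged, which makes an exhaustive 65536-case test cheap.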

autorelease pool

autorelease_pool(|| {
    // autoreleased ObjC objects valid here
})

required when using unretained/autoreleased command buffer and encoder variants.

errors

DeviceNotFound                    no Metal GPU available
BufferCreationFailed(String)      buffer allocation failed
LibraryCompilationFailed(String)  MSL compilation error
FunctionNotFound(String)          shader function not in library
PipelineCreationFailed(String)    pipeline creation error
CommandBufferError(String)        command buffer execution error
EncoderCreationFailed             encoder creation failed
QueueCreationFailed               command queue creation failed
TextureCreationFailed(String)     texture creation failed
Io(io::Error)                     filesystem error
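the variants above map naturally onto a Rust enum. the sketch below is an assumption about shape only: the enum name GpuError and the message strings are illustrative, since the spec gives variant names and payloads but not the type name or Display output.

```rust
use std::{fmt, io};

/// Hypothetical error type mirroring the variants listed above.
#[derive(Debug)]
pub enum GpuError {
    DeviceNotFound,
    BufferCreationFailed(String),
    LibraryCompilationFailed(String),
    FunctionNotFound(String),
    PipelineCreationFailed(String),
    CommandBufferError(String),
    EncoderCreationFailed,
    QueueCreationFailed,
    TextureCreationFailed(String),
    Io(io::Error),
}

impl fmt::Display for GpuError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GpuError::DeviceNotFound => write!(f, "no Metal GPU available"),
            GpuError::BufferCreationFailed(m) => write!(f, "buffer allocation failed: {m}"),
            GpuError::LibraryCompilationFailed(m) => write!(f, "MSL compilation error: {m}"),
            GpuError::FunctionNotFound(m) => write!(f, "shader function not in library: {m}"),
            GpuError::PipelineCreationFailed(m) => write!(f, "pipeline creation error: {m}"),
            GpuError::CommandBufferError(m) => write!(f, "command buffer execution error: {m}"),
            GpuError::EncoderCreationFailed => write!(f, "encoder creation failed"),
            GpuError::QueueCreationFailed => write!(f, "command queue creation failed"),
            GpuError::TextureCreationFailed(m) => write!(f, "texture creation failed: {m}"),
            GpuError::Io(e) => write!(f, "filesystem error: {e}"),
        }
    }
}

impl std::error::Error for GpuError {}

/// Lets `?` lift io::Error into the Io variant.
impl From<io::Error> for GpuError {
    fn from(e: io::Error) -> Self {
        GpuError::Io(e)
    }
}
```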

execution model

  • one pipeline = one compiled shader function
  • command buffers submitted atomically via commit
  • GPU executes command buffers in order per queue
  • multiple queues enable concurrent GPU work
  • shared buffers need no explicit synchronization across command buffer boundaries
  • private buffers need blit encoder for CPU data transfer
  • launch (dispatchThreads) handles non-uniform grids automatically
  • launch_groups (dispatchThreadgroups) requires manual grid division
  • Dispatch bypasses objc_msgSend for hot-loop performance
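the manual grid division needed for explicit group counts is a ceiling division per axis. a minimal sketch follows, with groups_for, groups_for_3d and overshoot as hypothetical helper names.

```rust
/// Threadgroups needed to cover `grid` threads with groups of `group`
/// threads along one axis (ceiling division).
pub fn groups_for(grid: usize, group: usize) -> usize {
    assert!(group > 0, "threadgroup size must be non-zero");
    (grid + group - 1) / group
}

/// Same computation for a (width, height, depth) grid.
pub fn groups_for_3d(grid: [usize; 3], group: [usize; 3]) -> [usize; 3] {
    [
        groups_for(grid[0], group[0]),
        groups_for(grid[1], group[1]),
        groups_for(grid[2], group[2]),
    ]
}

/// Threads launched past the end of the grid along one axis. The kernel
/// has to bounds-check these itself when explicit group counts are used.
pub fn overshoot(grid: usize, group: usize) -> usize {
    groups_for(grid, group) * group - grid
}
```

for 1000 elements and groups of 256 this gives 4 groups and 24 surplus threads, which is why kernels dispatched this way usually start with an index check.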

driver stack

aruminium crate (objc_msgSend FFI + IMP resolution)
  -> Metal.framework (linked at build time)
    -> GPU driver
      -> GPU hardware

Metal.framework is public. linked via #[link(name = "Metal", kind = "framework")]. core path: objc_msgSend with transmuted function pointers. hot path: pre-resolved IMP via class_getMethodImplementation.

render pipeline

raster path. mirrors compute: RenderPipeline <-> Pipeline, RenderEncoder <-> Encoder.

shaders -> RenderPipeline -> RenderPassDescriptor -> RenderEncoder -> draw -> end

RenderPipeline

wraps id<MTLRenderPipelineState>.

method signature semantics
color_attachments (&self) -> usize number of color attachments
sample_count (&self) -> u32 MSAA sample count (1 = no MSAA)
as_raw (&self) -> ObjcId raw MTLRenderPipelineState

RenderPipelineSpec

builder for Gpu::render_pipeline.

field type semantics
color_attachments Vec<ColorAttachmentSpec> per-slot format + blend
depth_format Option<NSUInteger> depth attachment pixel format (None = no depth)
stencil_format Option<NSUInteger> stencil attachment pixel format
sample_count u32 MSAA sample count (1 = none)
vertex_descriptor Option<VertexDescriptor> typed vertex input layout

constructors: RenderPipelineSpec::color(fmt) — single attachment, no depth, no MSAA. RenderPipelineSpec::colors(&[fmt]) — multiple color attachments. builders: with_depth(fmt), with_stencil(fmt), with_sample_count(n), with_vertex_descriptor(vd), with_blend(index, blend).

ColorAttachmentSpec

field type semantics
format NSUInteger pixel format (e.g. MTLPixelFormatBGRA8Unorm)
blend Option<BlendState> per-attachment blend (None = disabled)
write_mask u32 RGBA write mask (R=1, G=2, B=4, A=8; default 0xF)

builders: with_blend(blend), with_write_mask(mask).
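the write_mask bit values listed above compose with bitwise OR. a small sketch, where the constant names and the write_mask helper are hypothetical and only the bit values come from the field description:

```rust
pub const WRITE_R: u32 = 1;
pub const WRITE_G: u32 = 2;
pub const WRITE_B: u32 = 4;
pub const WRITE_A: u32 = 8;
/// Default mask: all four channels enabled.
pub const WRITE_ALL: u32 = 0xF;

/// Build a mask from per-channel flags.
pub fn write_mask(r: bool, g: bool, b: bool, a: bool) -> u32 {
    (if r { WRITE_R } else { 0 })
        | (if g { WRITE_G } else { 0 })
        | (if b { WRITE_B } else { 0 })
        | (if a { WRITE_A } else { 0 })
}
```

a color-only pass that must leave destination alpha untouched would use WRITE_R | WRITE_G | WRITE_B, i.e. 0x7.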

BlendState

field type semantics
rgb_op, alpha_op BlendOp blend operation
src_rgb, dst_rgb BlendFactor RGB blend factors
src_alpha, dst_alpha BlendFactor alpha blend factors

constructors: BlendState::alpha_over() (standard source-over), BlendState::additive().

BlendOp = Add | Subtract | ReverseSubtract | Min | Max. BlendFactor = Zero | One | SourceColor | OneMinusSourceColor | SourceAlpha | OneMinusSourceAlpha | DestinationAlpha | OneMinusDestinationAlpha | DestinationColor | OneMinusDestinationColor.
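for the Add operation the blend result is src * src_factor + dst * dst_factor. the CPU reference below is a sketch only: it covers a subset of the factors, applies one factor pair to all four channels, and assumes that alpha_over means SourceAlpha / OneMinusSourceAlpha and additive means One / One, which the spec does not spell out.

```rust
/// Subset of the blend factors, enough for source-over and additive.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Factor {
    Zero,
    One,
    SourceAlpha,
    OneMinusSourceAlpha,
}

/// Scalar weight a factor contributes, given the source alpha.
fn weight(f: Factor, src_alpha: f32) -> f32 {
    match f {
        Factor::Zero => 0.0,
        Factor::One => 1.0,
        Factor::SourceAlpha => src_alpha,
        Factor::OneMinusSourceAlpha => 1.0 - src_alpha,
    }
}

/// Add blend on RGBA: out = src * src_factor + dst * dst_factor.
pub fn blend_add(src: [f32; 4], dst: [f32; 4], src_f: Factor, dst_f: Factor) -> [f32; 4] {
    let (ws, wd) = (weight(src_f, src[3]), weight(dst_f, src[3]));
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = src[i] * ws + dst[i] * wd;
    }
    out
}
```

half-transparent red over opaque blue comes out as [0.5, 0.0, 0.5, 0.75] under the source-over assumption.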

DepthStencil

field type semantics
compare CompareFunction depth comparison
write_enabled bool write depth on pass

constructors: DepthStencil::less_write(), DepthStencil::always_no_write().

CompareFunction = Never | Less | Equal | LessEqual | Greater | NotEqual | GreaterEqual | Always.
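a CPU reference of the depth test makes the two fields concrete. the sketch redeclares CompareFunction locally so it is self-contained; passes and depth_test are hypothetical helpers, and reading less_write as (Less, write on) and always_no_write as (Always, write off) is inferred from the constructor names.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CompareFunction {
    Never,
    Less,
    Equal,
    LessEqual,
    Greater,
    NotEqual,
    GreaterEqual,
    Always,
}

impl CompareFunction {
    /// Does an incoming fragment depth pass against the stored depth?
    pub fn passes(self, incoming: f32, stored: f32) -> bool {
        match self {
            CompareFunction::Never => false,
            CompareFunction::Less => incoming < stored,
            CompareFunction::Equal => incoming == stored,
            CompareFunction::LessEqual => incoming <= stored,
            CompareFunction::Greater => incoming > stored,
            CompareFunction::NotEqual => incoming != stored,
            CompareFunction::GreaterEqual => incoming >= stored,
            CompareFunction::Always => true,
        }
    }
}

/// Run the test for one fragment; update the stored depth only when the
/// test passes and writes are enabled. Returns whether the fragment passed.
pub fn depth_test(compare: CompareFunction, write_enabled: bool, incoming: f32, stored: &mut f32) -> bool {
    let pass = compare.passes(incoming, *stored);
    if pass && write_enabled {
        *stored = incoming;
    }
    pass
}
```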

DepthStencilState wraps id<MTLDepthStencilState>. method: as_raw() -> ObjcId.

Gpu — render factory

method signature semantics
render_target (&self, w, h, format) -> Result<Texture> Private color render target (RenderTarget | ShaderRead)
render_target_ms (&self, w, h, format, samples) -> Result<Texture> multisampled color render target
depth_target (&self, w, h, format) -> Result<Texture> Private depth render target
depth_target_ms (&self, w, h, format, samples) -> Result<Texture> multisampled depth render target
render_pipeline (&self, &Shader, &Shader, &RenderPipelineSpec) -> Result<RenderPipeline> compile vertex+fragment into pipeline state
depth_stencil_state (&self, DepthStencil) -> Result<DepthStencilState> compile depth/stencil state

RenderPassDescriptor

wraps id<MTLRenderPassDescriptor>.

method signature semantics
new () -> Self empty descriptor
color_attachment (&mut self, index, ColorAttachmentDesc) configure color slot
depth_attachment (&mut self, DepthAttachmentDesc) configure depth slot
as_raw (&self) -> ObjcId raw descriptor

ColorAttachmentDesc{ texture, load_action, store_action, clear_color: [f64;4], resolve_texture: Option<&Texture>, level: u32, slice: u32 }. constructor: ColorAttachmentDesc::clear(&tex, color) — Clear/Store, no resolve.

DepthAttachmentDesc{ texture, load_action, store_action, clear_depth: f64 }. constructor: DepthAttachmentDesc::clear(&tex) — Clear/DontCare, depth=1.0.

LoadAction = DontCare | Load | Clear. StoreAction = DontCare | Store | MultisampleResolve | StoreAndMultisampleResolve.

RenderEncoder

wraps id<MTLRenderCommandEncoder>.

method signature semantics
bind (&self, &RenderPipeline) bind pipeline state
set_vertex_buffer (&self, index, &Buffer, offset) vertex stage buffer
set_fragment_buffer (&self, index, &Buffer, offset) fragment stage buffer
push_vertex (&self, &[u8], index) inline vertex constants
push_fragment (&self, &[u8], index) inline fragment constants
set_vertex_texture (&self, index, &Texture) vertex stage texture
set_fragment_texture (&self, index, &Texture) fragment stage texture
set_viewport (&self, x, y, w, h, near, far) viewport rect (f64 each)
set_scissor (&self, x, y, w, h) scissor rect (u32 each)
set_cull_mode (&self, CullMode) None / Front / Back
set_front_facing_winding (&self, Winding) Clockwise / CounterClockwise
set_depth_stencil_state (&self, &DepthStencilState) bind depth/stencil state
set_depth_bias (&self, bias, slope_scale, clamp) depth offset (f32 each)
draw (&self, PrimitiveType, start, count) non-indexed draw
draw_instanced (&self, PrimitiveType, start, count, instances) instanced non-indexed
draw_indexed (&self, PrimitiveType, index_count, IndexType, &Buffer, offset) indexed draw
draw_indexed_instanced (&self, PrimitiveType, index_count, IndexType, &Buffer, offset, instances) indexed instanced
end (self) finish encoding (consumes; Drop calls endEncoding if skipped)

PrimitiveType = Point | Line | LineStrip | Triangle | TriangleStrip. IndexType = UInt16 | UInt32. .size() returns element size in bytes.
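the element size matters because the offset passed to indexed draws is in bytes, not indices. a sketch of the arithmetic, redeclaring IndexType locally; index_byte_offset and index_buffer_len are hypothetical helpers.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum IndexType {
    UInt16,
    UInt32,
}

impl IndexType {
    /// Element size in bytes.
    pub const fn size(self) -> usize {
        match self {
            IndexType::UInt16 => 2,
            IndexType::UInt32 => 4,
        }
    }
}

/// Byte offset of the index at position `first_index`.
pub fn index_byte_offset(ty: IndexType, first_index: usize) -> usize {
    first_index * ty.size()
}

/// Bytes needed to hold `count` indices.
pub fn index_buffer_len(ty: IndexType, count: usize) -> usize {
    count * ty.size()
}
```

drawing the second of two quads (6 indices each) from a UInt16 buffer therefore starts at byte offset 12.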

VertexDescriptor

typed vertex input layout for stage_in vertex functions.

VertexDescriptor::new()
    .with_attribute(VertexAttribute { shader_location, format, offset, buffer_index })
    .with_layout(VertexBufferLayout { buffer_index, stride, step, step_rate })

VertexFormat = Float | Float2 | Float3 | Float4 | Half2 | Half4 | UChar4Normalized | UInt | Int. VertexStep = Constant | PerVertex | PerInstance.
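offset and stride follow from the byte size of each format. the sketch below redeclares VertexFormat locally and computes a tightly packed interleaved layout; size and packed_layout are hypothetical helpers, not part of the documented builder.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum VertexFormat {
    Float,
    Float2,
    Float3,
    Float4,
    Half2,
    Half4,
    UChar4Normalized,
    UInt,
    Int,
}

impl VertexFormat {
    /// Size of one attribute of this format, in bytes.
    pub const fn size(self) -> usize {
        match self {
            VertexFormat::Float | VertexFormat::UInt | VertexFormat::Int => 4,
            VertexFormat::Half2 | VertexFormat::UChar4Normalized => 4,
            VertexFormat::Float2 | VertexFormat::Half4 => 8,
            VertexFormat::Float3 => 12,
            VertexFormat::Float4 => 16,
        }
    }
}

/// Offsets of each attribute and the resulting stride when attributes are
/// packed back to back in a single buffer.
pub fn packed_layout(formats: &[VertexFormat]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(formats.len());
    let mut cursor = 0;
    for f in formats {
        offsets.push(cursor);
        cursor += f.size();
    }
    (offsets, cursor)
}
```

a position (Float3), uv (Float2) and packed color (UChar4Normalized) vertex lands at offsets 0, 12 and 20 with a 24-byte stride; those are the values that would go into VertexAttribute.offset and VertexBufferLayout.stride.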

apple mapping

method ObjC
Gpu::render_target [device newTextureWithDescriptor:] (usage RenderTarget|ShaderRead)
Gpu::depth_target [device newTextureWithDescriptor:] (usage RenderTarget, depth format)
Gpu::render_pipeline [device newRenderPipelineStateWithDescriptor:error:]
Gpu::depth_stencil_state [device newDepthStencilStateWithDescriptor:]
Commands::render_encoder [cmdBuf renderCommandEncoderWithDescriptor:] + retain
RenderEncoder::bind [encoder setRenderPipelineState:]
RenderEncoder::set_vertex_buffer [encoder setVertexBuffer:offset:atIndex:]
RenderEncoder::set_fragment_buffer [encoder setFragmentBuffer:offset:atIndex:]
RenderEncoder::push_vertex [encoder setVertexBytes:length:atIndex:]
RenderEncoder::push_fragment [encoder setFragmentBytes:length:atIndex:]
RenderEncoder::set_vertex_texture [encoder setVertexTexture:atIndex:]
RenderEncoder::set_fragment_texture [encoder setFragmentTexture:atIndex:]
RenderEncoder::set_viewport [encoder setViewport:] (MTLViewport struct)
RenderEncoder::set_scissor [encoder setScissorRect:] (MTLScissorRect struct)
RenderEncoder::set_cull_mode [encoder setCullMode:]
RenderEncoder::set_front_facing_winding [encoder setFrontFacingWinding:]
RenderEncoder::set_depth_stencil_state [encoder setDepthStencilState:]
RenderEncoder::set_depth_bias [encoder setDepthBias:slopeScale:clamp:]
RenderEncoder::draw [encoder drawPrimitives:vertexStart:vertexCount:]
RenderEncoder::draw_instanced [encoder drawPrimitives:vertexStart:vertexCount:instanceCount:]
RenderEncoder::draw_indexed [encoder drawIndexedPrimitives:indexCount:indexType:indexBuffer:indexBufferOffset:]
RenderEncoder::draw_indexed_instanced [encoder drawIndexedPrimitives:...:instanceCount:]
RenderEncoder::end [encoder endEncoding]
