tutorial
step-by-step: from zero to GPU compute in aruminium.
prerequisites
- macOS with Metal-capable GPU (any Mac from 2012+)
- Rust toolchain (rustup)
step 1: create project
add aruminium to Cargo.toml:
[dependencies]
aruminium = { path = "../aruminium" } # or git URL
step 2: discover the GPU
use aruminium::Device;

// constructor and accessor names below are assumed; adapt to the crate's API
let device = Device::system_default()?;
println!("GPU: {}", device.name());
println!("Unified memory: {}", device.unified_memory());
println!("Max buffer: {} MB", device.max_buffer_mb());
$ cargo run
GPU: Apple M1 Pro
Unified memory: true
Max buffer: 5461 MB
step 3: allocate GPU buffers
create three buffers for vector addition: A + B = C.
let n = 1024usize;
let buf_a = device.buffer(n * 4)?; // 1024 floats = 4096 bytes
let buf_b = device.buffer(n * 4)?;
let buf_c = device.buffer(n * 4)?;
these are shared-mode buffers: CPU and GPU see the same memory.
write data from CPU:
buf_a.write_f32(&vec![1.0f32; n]);
buf_b.write_f32(&vec![2.0f32; n]);
step 4: write a Metal shader
Metal Shading Language (MSL) is C++-like. each thread gets its
position via thread_position_in_grid:
let source = r#"
#include <metal_stdlib>
using namespace metal;
kernel void vecadd(device float *a [[buffer(0)]],
                   device float *b [[buffer(1)]],
                   device float *c [[buffer(2)]],
                   uint id [[thread_position_in_grid]]) {
    c[id] = a[id] + b[id];
}
"#;
step 5: compile and create pipeline
let lib = device.compile(source)?;
let func = lib.function("vecadd")?;
let pipeline = device.pipeline(&func)?;
this compiles MSL to GPU bytecode at runtime. the pipeline is a reusable compiled state — create once, dispatch many times.
step 6: encode and dispatch
let queue = device.new_command_queue()?;
let cmd = queue.commands()?;
let enc = cmd.encoder()?;
enc.bind(&pipeline);
enc.bind_buffer(&buf_a, 0, 0); // buffer, offset, index
enc.bind_buffer(&buf_b, 0, 1);
enc.bind_buffer(&buf_c, 0, 2);
enc.launch(n); // one thread per element
enc.finish();
cmd.submit()?;
cmd.wait()?;
step 7: read results
let result = buf_c.read_f32(n); // copy n floats back to a Vec<f32>
step 8: add GPU timing
let t0 = std::time::Instant::now();
cmd.submit()?;
cmd.wait()?;
println!("elapsed: {:?}", t0.elapsed()); // wall-clock over submit + wait
step 9: pass parameters via push
for small constants (uniforms), use push instead of a buffer:
#[repr(C)]
struct Params { scale: f32 }

let params = Params { scale: 2.0 };
let bytes = unsafe {
    std::slice::from_raw_parts(&params as *const Params as *const u8,
                               std::mem::size_of::<Params>())
};
enc.push(bytes, 3); // bind at index 3
in the shader:
kernel void my_kernel(constant Params &p [[buffer(3)]],
                      ...) {
    float x = p.scale * input[id];
}
complete example
putting steps 2 through 8 together (method names as used in the steps above):

use aruminium::Device;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = Device::system_default()?;
    let n = 1024usize;
    let buf_a = device.buffer(n * 4)?;
    let buf_b = device.buffer(n * 4)?;
    let buf_c = device.buffer(n * 4)?;
    buf_a.write_f32(&vec![1.0f32; n]);
    buf_b.write_f32(&vec![2.0f32; n]);

    let lib = device.compile(source)?; // `source`: the MSL string from step 4
    let func = lib.function("vecadd")?;
    let pipeline = device.pipeline(&func)?;

    let queue = device.new_command_queue()?;
    let cmd = queue.commands()?;
    let enc = cmd.encoder()?;
    enc.bind(&pipeline);
    enc.bind_buffer(&buf_a, 0, 0);
    enc.bind_buffer(&buf_b, 0, 1);
    enc.bind_buffer(&buf_c, 0, 2);
    enc.launch(n);
    enc.finish();
    cmd.submit()?;
    cmd.wait()?;

    let c = buf_c.read_f32(n);
    println!("c[0] = {}", c[0]);
    Ok(())
}
next steps
- see examples/matmul.rs for 2D dispatch with struct parameters
- see docs/guide.md for Dispatch, batch encoding, pipelining
- see docs/explanations.md for why the architecture works this way