unimem: API Reference
Extracted from the actual implementation in src/.
lib.rs
pub use Block;
pub use Tape;
pub use ;
ffi.rs
IOSurface + CoreFoundation raw FFI. No objc2, no wrappers.
// Types
pub type IOSurfaceRef = *mut c_void;
pub type CFTypeRef = *const c_void;
pub type CFStringRef = *const c_void;
pub type CFMutableDictionaryRef = *mut c_void;
pub type kern_return_t = i32;
// IOSurface.framework
extern "C"
// CoreFoundation
extern "C"
// Helpers
pub ; // string → CFString
pub *const c_void; // i64 → CFNumber
block.rs
Pinned shared memory. Locked at creation; the virtual address (VA) stays stable for the block's lifetime.
// Send + Sync — immutable after creation
IOSurface properties set at creation:
| Key | Value |
|---|---|
| IOSurfaceWidth | size |
| IOSurfaceHeight | 1 |
| IOSurfaceBytesPerElement | 1 |
| IOSurfaceBytesPerRow | size |
| IOSurfaceAllocSize | size |
| IOSurfacePixelFormat | 0 |
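The table describes a one-row surface whose width equals the requested byte count. As a minimal sketch, the same mapping can be written as a function from `size` to key/value pairs. `surface_properties` is a hypothetical name used only for illustration and is not part of unimem; the values are `i64` on the assumption that they are passed through the i64 → CFNumber helper listed under ffi.rs.

```rust
/// Hypothetical helper (not part of unimem): the six IOSurface properties
/// from the table above for a flat buffer of `size` bytes.
fn surface_properties(size: i64) -> [(&'static str, i64); 6] {
    [
        ("IOSurfaceWidth", size),        // one element per byte
        ("IOSurfaceHeight", 1),          // a single row
        ("IOSurfaceBytesPerElement", 1), // each element is one byte
        ("IOSurfaceBytesPerRow", size),  // width * bytes per element
        ("IOSurfaceAllocSize", size),    // bytes per row * height
        ("IOSurfacePixelFormat", 0),     // 0, as listed in the table
    ]
}
```

With these values the allocation size always equals bytes-per-row times height, so the surface behaves as a plain byte buffer.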
tape.rs
Bump allocator. Each take costs ~1 ns; a reset costs 0.3 ns.
// Send + Sync — atomic cursor, immutable block
Alloc algorithm:
loop
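One common way to implement a bump allocator with an atomic cursor is a compare-and-swap retry loop. The following is a minimal sketch under that assumption, not unimem's actual code: `BumpCursor`, its fields, and the `(size, align)` signature are hypothetical, and it hands out byte offsets rather than pointers so that it runs without an IOSurface.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative stand-in for a tape: reserves offsets in a region of `capacity` bytes.
struct BumpCursor {
    cursor: AtomicUsize, // offset of the first unreserved byte
    capacity: usize,
}

impl BumpCursor {
    fn new(capacity: usize) -> Self {
        Self { cursor: AtomicUsize::new(0), capacity }
    }

    /// Reserve `size` bytes at an offset aligned to `align` (a power of two).
    /// Returns the start offset, or None if the region is exhausted.
    /// Offset alignment implies address alignment only if the base address
    /// is itself aligned to `align` (assumed here).
    fn take(&self, size: usize, align: usize) -> Option<usize> {
        assert!(align.is_power_of_two());
        let mut current = self.cursor.load(Ordering::Relaxed);
        loop {
            // Round the cursor up to the requested alignment.
            let start = current.checked_add(align - 1)? & !(align - 1);
            let end = start.checked_add(size)?;
            if end > self.capacity {
                return None;
            }
            // Publish the new cursor only if no other thread moved it first.
            match self.cursor.compare_exchange_weak(
                current,
                end,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(start),
                Err(actual) => current = actual, // lost the race: retry
            }
        }
    }

    /// Reset the cursor; every offset handed out so far becomes invalid.
    fn clear(&self) {
        self.cursor.store(0, Ordering::Relaxed);
    }
}
```

In this sketch a reset is a single atomic store, which is why clearing costs the same regardless of how many allocations were made.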
grid.rs
Fixed-size tensor grid. ~15ns take+give cycle.
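A fixed-size grid can be tracked with one bit per cell, which makes a take+give cycle two atomic operations. The sketch below shows that idea only; `CellPool`, its 64-cell limit, and the index-based `give` are assumptions for illustration and do not describe how unimem's grid is implemented.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative pool of up to 64 equal-size cells tracked by one bitmask.
struct CellPool {
    free: AtomicU64, // bit i set means cell i is free
    cell_size: usize,
}

impl CellPool {
    fn new(cell_size: usize, cells: u32) -> Self {
        assert!(cells >= 1 && cells <= 64);
        let mask = if cells == 64 { u64::MAX } else { (1u64 << cells) - 1 };
        Self { free: AtomicU64::new(mask), cell_size }
    }

    /// Claim the lowest free cell; returns (index, byte offset), or None when full.
    fn take(&self) -> Option<(u32, usize)> {
        let mut mask = self.free.load(Ordering::Acquire);
        loop {
            if mask == 0 {
                return None;
            }
            let idx = mask.trailing_zeros();
            let next = mask & !(1u64 << idx);
            match self.free.compare_exchange_weak(
                mask,
                next,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some((idx, idx as usize * self.cell_size)),
                Err(actual) => mask = actual, // another thread changed the mask: retry
            }
        }
    }

    /// Return a cell to the pool by setting its bit again.
    fn give(&self, idx: u32) {
        self.free.fetch_or(1u64 << idx, Ordering::Release);
    }
}
```

Unlike the tape, cells can be returned individually, so long-lived and short-lived tensors can share one surface.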
Usage examples
Basic tape allocation
use Tape;
let tape = start?; // 64MB pinned
let buf = tape.take.unwrap;
unsafe
let tensor = tape.take.unwrap; // 1MB, 64-byte aligned
unsafe
tape.clear; // instant — all allocations invalidated
Tensor grid for inference
use Grid;
// 32 cells of 4MB each = 128MB IOSurface
let grid: Grid<{ 4 * 1024 * 1024 }, 32> = Grid::new()?;
let mut cell = grid.take.unwrap;
unsafe
grid.give;
ANE integration via rane
use Block;
let block = open?;
// Write input data
unsafe
// Pass to ANE — same physical memory, zero copy
let iosurface_ref = block.handle;
// rane uses IOSurfaceRef directly via _ANEIOSurfaceObject
Benchmark results
All numbers were measured on an Apple Silicon M1, on a single thread, using volatile u64 accesses.
| Operation | unimem | malloc | Vec | Box | mmap | mmap+mlock |
|---|---|---|---|---|---|---|
| take 64 B | 1.3 ns | 16 ns | 17 ns | 15 ns | - | - |
| take 4 KB | 0.9 ns | 18 ns | 20 ns | 83 ns | 464 ns | - |
| take 1 MB | 0.9 ns | 23 ns | 22 ns | - | 461 ns | - |
| free all | 0.3 ns | ~5 ms | ~5 ms | ~5 ms | ~5 ms | ~5 ms |
| grid cycle 4 KB | 15 ns | 19 ns | 20 ns | - | 928 ns | - |
| init 16 MB (lazy) | 23 us | - | - | - | 0.47 us | - |
| init 16 MB (warm) | 1.4 ms | 6.7 us | - | - | - | 1.6 ms |
| write 64 MB | 23.2 GB/s | 23.1 GB/s | 23.4 GB/s | 23.4 GB/s | 22.8 GB/s | 22.7 GB/s |
| read 64 MB | 22.5 GB/s | 21.6 GB/s | 22.8 GB/s | 22.8 GB/s | 22.0 GB/s | 21.2 GB/s |
| pinned | yes | no | no | no | no | yes |
| HW shared | CPU+GPU+AMX+ANE | CPU | CPU | CPU | CPU | CPU |
Init breakdown (16 MB, one-time cost):
- unimem lazy (23 us): IOSurface kernel object + DART registration + lock. Pages not backed.
- unimem warm (1.4 ms): same + walk all 1024 pages (16KB each), trigger page faults.
- malloc touch (6.7 us): malloc reuses cached pages from system allocator — no real faults.
- mmap+mlock+touch (1.6 ms): mmap + wire + page faults. Roughly the same cost as unimem warm (1.4 ms).
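The "warm" step in the breakdown above amounts to writing one byte in every page. A minimal sketch of that walk, assuming the 16 KB page size stated above and a page-aligned buffer; `warm` and `PAGE_SIZE` are illustrative names, and a plain `Vec<u8>` works as a stand-in for the pinned block:

```rust
/// 16 KB pages, as stated in the init breakdown.
const PAGE_SIZE: usize = 16 * 1024;

/// Write one byte at the start of every page-sized stride so the OS has to
/// back each page. Returns the number of strides touched.
fn warm(buf: &mut [u8]) -> usize {
    let mut pages = 0;
    for offset in (0..buf.len()).step_by(PAGE_SIZE) {
        // Volatile write so the compiler cannot drop the store.
        // SAFETY: `offset` is always less than `buf.len()`.
        unsafe { std::ptr::write_volatile(buf.as_mut_ptr().add(offset), 0) };
        pages += 1;
    }
    pages
}
```

For a 16 MB buffer this touches 1024 pages, matching the page count in the breakdown.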
malloc looks faster on init only because the system allocator caches freed pages. On the first allocation in a cold process, malloc+touch would also cost about 1.5 ms.
Bandwidth is effectively identical across all methods (22.7 to 23.4 GB/s write, 21.2 to 22.8 GB/s read): DRAM is the bottleneck at ~23 GB/s for volatile u64 access.
unimem wins on: take speed (roughly 12-25x faster than malloc), dealloc speed (about 16 million times faster: 0.3 ns vs ~5 ms), and hardware sharing (the only method where CPU, GPU, AMX, and ANE see one buffer without copies).
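The speedup figures can be recomputed from the malloc and unimem columns of the benchmark table. A small sketch of that arithmetic; the constant and function names are illustrative:

```rust
/// (operation, malloc time in ns, unimem time in ns), copied from the benchmark table.
const ROWS: [(&str, f64, f64); 4] = [
    ("take 64 B", 16.0, 1.3),
    ("take 4 KB", 18.0, 0.9),
    ("take 1 MB", 23.0, 0.9),
    ("free all", 5_000_000.0, 0.3), // ~5 ms expressed in ns
];

/// How many times faster unimem is than malloc for each row.
fn speedups() -> Vec<(&'static str, f64)> {
    ROWS.iter()
        .map(|&(op, malloc_ns, unimem_ns)| (op, malloc_ns / unimem_ns))
        .collect()
}
```

The take rows come out at roughly 12x, 20x, and 26x, and the free-all row at roughly 16.7 million.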