instructions.md

Apple Silicon CPU Instruction Reference

Complete catalog of compute instructions accessible from acpu on Apple Silicon M1–M4. Each section maps hardware features to the acpu module that exposes them.

Detected on this machine: Apple M1 Max, AMX v1, 8P+2E cores, L1=128B line, L2=4MB. FP16, DotProd, FCMA, RDM, LSE, LRCPC: yes. BF16, I8MM: no (M1, appears in M2+).

1. AMX — Matrix Coprocessor

Undocumented per-thread coprocessor. 23 valid opcodes (0-22). Full reference: specs/amx.md.

Module: acpu::matrix

Register file

X: 8 × 64 B = 512 B (input rows for FMA)
Y: 8 × 64 B = 512 B (input columns for FMA)
Z: 64 × 64 B = 4096 B (accumulator, 4 independent 16×16 f32 tiles)

Instructions (opcodes 0-22)

Op	Name	Description	FLOPS
0	LDX	load 64 B → X row	—
1	LDY	load 64 B → Y row	—
2	STX	store X row → 64 B	—
3	STY	store Y row → 64 B	—
4	LDZ	load 64 B → Z row	—
5	STZ	store Z row → 64 B	—
6	LDZI	load interleaved → Z	—
7	STZI	store interleaved ← Z	—
8	EXTRX	extract Z slice → X	—
9	EXTRY	extract Z slice → Y	—
10	FMA64	f64 outer product 8×8, Z stride 8	128
11	FMS64	f64 outer subtract 8×8	128
12	FMA32	f32 outer product 16×16, Z stride 4	512
13	FMS32	f32 outer subtract 16×16	512
14	MAC16	i16 multiply-accumulate 32×32, Z stride 2	2048
15	FMA16	f16 outer product 32×32, Z stride 2	2048
16	FMS16	f16 outer subtract 32×32	2048
17	SET/CLR	activate/deactivate AMX (reg 0=set, 1=clr)	—
18	???	Z write, 1 row. possible i32 reduced product	?
19	???	Z write, 1 row. values are sums (dot product?)	?
20	???	Z write, 16 rows stride 2. i32 matrix variant	?
21	???	Z write, 16 rows stride 2. f32 matrix variant	?
22	???	X write. extract variant	—

Encoding

.word (0x00201000 + (opcode << 5) + register_encoding)

Operand: 64-bit value in GPR. Bit layout per opcode in amx.md.

2. NEON — 128-bit SIMD

ARM Advanced SIMD. 32 registers × 128 bits. 4-wide f32 or 2-wide f64.

Module: acpu::vector (math, reduce, softmax, rope)

Arithmetic (f32 × 4)

Intrinsic	Op	Throughput	Use in acpu
`vfmaq_f32`	fused multiply-add	1/cycle	gemm accumulate, math
`vaddq_f32`	add	1/cycle	accumulate_tile, reduce
`vsubq_f32`	subtract	1/cycle	—
`vmulq_f32`	multiply	1/cycle	math kernels
`vdivq_f32`	divide	~10 cycles	softmax
`vsqrtq_f32`	square root	~10 cycles	normalize
`vmaxq_f32` / `vminq_f32`	max/min	1/cycle	reduce
`vabsq_f32`	absolute	1/cycle	—
`vnegq_f32`	negate	1/cycle	—
`vrecpeq_f32` + `vrecpsq_f32`	reciprocal estimate	2 cycles	softmax fast path
`vrsqrteq_f32` + `vrsqrtsq_f32`	rsqrt estimate	2 cycles	normalize fast path

Load / Store

Intrinsic	Description
`vld1q_f32`	load 4 f32 (unaligned OK)
`vst1q_f32`	store 4 f32
`vld1q_f32_x2` / `x4`	load 8/16 f32
`vdupq_n_f32`	broadcast scalar → 4 lanes
`vdupq_laneq_f32`	broadcast one lane → 4

Shuffle / Transpose

Intrinsic	Description	Use in acpu
`vzip1q_f32`	interleave low halves	pack_a NEON 4×4 transpose
`vzip2q_f32`	interleave high halves	pack_a NEON 4×4 transpose
`vtrn1q_f32`	transpose even elements	—
`vtrn2q_f32`	transpose odd elements	—
`vextq_f32`	extract/rotate	rotate
`vrev64q_f32`	reverse within 64-bit	rotate

Reduction (horizontal)

Intrinsic	Description
`vaddvq_f32`	sum 4 lanes → scalar
`vmaxvq_f32`	max of 4 lanes
`vminvq_f32`	min of 4 lanes
`vpaddq_f32`	pairwise add

Comparison

Intrinsic	Description
`vcgtq_f32`	greater than → mask
`vcgeq_f32`	greater or equal
`vceqq_f32`	equal
`vbslq_f32`	bitwise select (blend)

3. FP16 — Half-Precision (FEAT_FP16)

16-bit IEEE 754 float. Present on all M-series.

Module: acpu::numeric::fp16

Intrinsic	Description
`vcvt_f32_f16`	convert 4 f16 → 4 f32
`vcvt_f16_f32`	convert 4 f32 → 4 f16
`vfmaq_f16`	f16 fused multiply-add (8-wide)
`vaddq_f16`	f16 add (8-wide)

AMX FMA16 (opcode 15): 32×32 f16 outer product, 2048 FLOPS/instruction.

4. BF16 — BFloat16 (FEAT_BF16, M2+)

16-bit brain float. NOT available on M1. Available M2+.

Intrinsic	Description
`vcvtq_low_bf16_f32`	convert f32 → bf16
`vbfmmlaq_f32`	bf16 matrix multiply-accumulate 2×4 → f32
`vbfdotq_f32`	bf16 dot product
`vbfmlalbq_f32`	bf16 widening multiply-add (low)
`vbfmlaltq_f32`	bf16 widening multiply-add (high)

Module: acpu::numeric::bf16 (software convert on M1, native on M2+)

5. DotProd — Integer Dot Product (FEAT_DotProd)

4-element i8/u8 dot product accumulated into i32. Present on all M-series.

Module: acpu::numeric::quant

Intrinsic	Description	FLOPS (per call)
`vdotq_s32`	dot(i8×4) → i32, 4 accumulators	32 int ops
`vdotq_u32`	unsigned variant	32 int ops
`vusdotq_s32`	mixed signed/unsigned	32 int ops

6. I8MM — Int8 Matrix Multiply (FEAT_I8MM, M2+)

8×8 int8 matrix multiply → int32. NOT available on M1.

Intrinsic	Description
`vmmlaq_s32`	signed i8 matmul 2×8 × 8×2 → 2×2 i32
`vmmlaq_u32`	unsigned variant
`vusmmlaq_s32`	mixed signed/unsigned

AMX opcode 14 (MAC16): 32×32 i16 multiply-accumulate. AMX opcodes 20-21: possibly i32/mixed matrix ops (undocumented).

7. FCMA — Complex Multiply-Accumulate (FEAT_FCMA)

Fused complex number operations. Present on all M-series.

Module: acpu::numeric::complex

Intrinsic	Description
`vcmlaq_f32`	complex multiply-accumulate (rotation 0°)
`vcmlaq_rot90_f32`	complex mul-acc rotated 90°
`vcmlaq_rot180_f32`	complex mul-acc rotated 180°
`vcmlaq_rot270_f32`	complex mul-acc rotated 270°

Computes (a + bi) × (c + di) in 2 instructions (0° + 90°).

8. RDM — Rounding Doubling Multiply (FEAT_RDM)

Fixed-point multiply with rounding. Present on all M-series.

Intrinsic	Description
`vqrdmulhq_s16`	rounding doubling multiply high (i16)
`vqrdmlahq_s16`	rounding doubling multiply-add high
`vqrdmlshq_s16`	rounding doubling multiply-sub high
`vqrdmulhq_s32`	i32 variant

Useful for fixed-point quantized inference.

9. LSE — Large System Extensions (FEAT_LSE)

Hardware atomics. Present on all M-series. Essential for lock-free threading.

Module: acpu::sync

Instruction	Intrinsic / asm	Description
LDADD	`core::sync::atomic`	atomic fetch-add
LDSET	—	atomic fetch-or
LDCLR	—	atomic fetch-and-not
CAS	—	compare-and-swap
SWP	—	atomic swap

All available as Ordering::Relaxed/Acquire/Release/SeqCst via std::sync::atomic::AtomicUsize etc.

10. Barriers and Events

Module: acpu::sync

Instruction	Function	Description
DMB ISH	`barrier()`	data memory barrier (inner shareable)
DSB ISH	`fence()`	data synchronization barrier
ISB	`isb()`	instruction synchronization barrier
WFE	`wait()`	wait for event (low-power sleep)
SEV	`wake()`	signal event (wake all cores in cluster)

WFE/SEV pair: lightweight cross-core signaling without mutex overhead.

11. Prefetch (PRFM)

Module: acpu::sync::prefetch

Instruction	Function	Description
PRFM PLDL1KEEP	`prefetch_l1(ptr)`	prefetch for read into L1
PRFM PLDL2KEEP	`prefetch_l2(ptr)`	prefetch for read into L2
PRFM PSTL1KEEP	`prefetch_l1_write(ptr)`	prefetch for write into L1

Software prefetch hints. Hardware prefetcher is aggressive on Apple Silicon; explicit prefetch most useful for non-sequential access patterns (packed panels).

12. Core Affinity (QoS)

Module: acpu::sync::affinity

Function	QoS class	Effect
`pin_p_core()`	USER_INTERACTIVE (0x21)	schedule on P-cores
`pin_e_core()`	BACKGROUND (0x09)	schedule on E-cores
`pin_any()`	DEFAULT (0x15)	no preference

Uses pthread_set_qos_class_self_np(). Soft hint, not hard pinning.

13. PMU — Performance Monitoring Unit

Module: acpu::pulse

Access via dlopen of libkperf.dylib (private framework). Requires SIP-related permissions.

Counter	Description
Cycles	CPU cycles
Instructions	retired instructions
BranchMiss	branch mispredictions
L1DMiss	L1 data cache misses
L1IMiss	L1 instruction cache misses

14. LRCPC — Load-Acquire RCpc (FEAT_LRCPC)

Weaker acquire semantics for better performance on weakly-ordered loads.

Instruction	Description
LDAPR	load-acquire with release consistency (cheaper than LDAR)

Available via core::sync::atomic::Ordering::Acquire with LLVM optimization.

Feature Matrix by Chip

Feature	M1	M2	M3	M4
AMX v1	YES	YES	—	—
AMX v2	—	—	YES	YES
NEON	YES	YES	YES	YES
FP16	YES	YES	YES	YES
BF16	NO	YES	YES	YES
DotProd	YES	YES	YES	YES
I8MM	NO	YES	YES	YES
FCMA	YES	YES	YES	YES
RDM	YES	YES	YES	YES
LSE	YES	YES	YES	YES
LRCPC	YES	YES	YES	YES

AMX v2 (M3+): adds opcodes 8-9 (EXTRX/EXTRY), possibly opcodes 18-22. BF16 + I8MM appear in M2. All other features present since M1.