honeycrisp/acpu/specs/instructions.md

Apple Silicon CPU Instruction Reference

Complete catalog of compute instructions accessible from acpu on Apple Silicon M1–M4. Each section maps hardware features to the acpu module that exposes them.

Detected on this machine: Apple M1 Max, AMX v1, 8P+2E cores, L1 cache line = 128 B, L2 = 4 MB. FP16, DotProd, FCMA, RDM, LSE, LRCPC: yes. BF16, I8MM: no (absent on M1, introduced with M2).


1. AMX — Matrix Coprocessor

Undocumented per-thread coprocessor. 23 valid opcodes (0-22). Full reference: specs/amx.md.

Module: acpu::matrix

Register file

  • X: 8 × 64 B = 512 B (input rows for FMA)
  • Y: 8 × 64 B = 512 B (input columns for FMA)
  • Z: 64 × 64 B = 4096 B (accumulator, 4 independent 16×16 f32 tiles)

Instructions (opcodes 0-22)

| Op | Name | Description | FLOPS |
| --- | --- | --- | --- |
| 0 | LDX | load 64 B → X row | |
| 1 | LDY | load 64 B → Y row | |
| 2 | STX | store X row → 64 B | |
| 3 | STY | store Y row → 64 B | |
| 4 | LDZ | load 64 B → Z row | |
| 5 | STZ | store Z row → 64 B | |
| 6 | LDZI | load interleaved → Z | |
| 7 | STZI | store interleaved ← Z | |
| 8 | EXTRX | extract Z slice → X | |
| 9 | EXTRY | extract Z slice → Y | |
| 10 | FMA64 | f64 outer product 8×8, Z stride 8 | 128 |
| 11 | FMS64 | f64 outer subtract 8×8 | 128 |
| 12 | FMA32 | f32 outer product 16×16, Z stride 4 | 512 |
| 13 | FMS32 | f32 outer subtract 16×16 | 512 |
| 14 | MAC16 | i16 multiply-accumulate 32×32, Z stride 2 | 2048 |
| 15 | FMA16 | f16 outer product 32×32, Z stride 2 | 2048 |
| 16 | FMS16 | f16 outer subtract 32×32 | 2048 |
| 17 | SET/CLR | activate/deactivate AMX (reg 0=set, 1=clr) | |
| 18 | ??? | Z write, 1 row. possible i32 reduced product | ? |
| 19 | ??? | Z write, 1 row. values are sums (dot product?) | ? |
| 20 | ??? | Z write, 16 rows stride 2. i32 matrix variant | ? |
| 21 | ??? | Z write, 16 rows stride 2. f32 matrix variant | ? |
| 22 | ??? | X write. extract variant | |
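
The FLOPS and Z-stride figures for the outer-product opcodes follow from the 64 B register width: an X or Y row holds 64 / element-size lanes, and each output element costs one multiply and one add. A minimal sketch of that arithmetic; the helper names are illustrative and not part of acpu:

```rust
/// Lanes in one 64-byte X or Y row for a given element size in bytes.
pub const fn lanes(elem_bytes: usize) -> usize {
    64 / elem_bytes
}

/// FLOPs of one outer-product instruction: n x n outputs,
/// one multiply and one add each.
pub const fn outer_product_flops(elem_bytes: usize) -> usize {
    let n = lanes(elem_bytes);
    2 * n * n
}

/// Distance between the Z rows one tile occupies: the result uses
/// `lanes` of the 64 Z rows, spaced evenly.
pub const fn z_stride(elem_bytes: usize) -> usize {
    64 / lanes(elem_bytes)
}
```

With element sizes 8, 4 and 2 bytes this reproduces the table rows for FMA64, FMA32 and FMA16.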

Encoding

.word (0x00201000 + (opcode << 5) + register_encoding)

Operand: 64-bit value in GPR. Bit layout per opcode in amx.md.
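
The encoding formula can be checked with plain integer arithmetic. The sketch below only computes the instruction word; it assumes the register field is 5 bits wide (implied by the `opcode << 5` shift) and the function name is hypothetical. Real code would emit the word through an inline-assembly `.word` directive.

```rust
/// Base word of the AMX instruction space, from the formula above.
const AMX_BASE: u32 = 0x0020_1000;

/// Computes the 32-bit instruction word for an AMX opcode (0-22) and a
/// 5-bit register field. Returns None if either field is out of range.
/// Assumption: the low 5 bits carry the GPR index (or the SET/CLR selector).
pub fn amx_word(opcode: u32, register_encoding: u32) -> Option<u32> {
    if opcode > 22 || register_encoding > 31 {
        return None;
    }
    Some(AMX_BASE + (opcode << 5) + register_encoding)
}
```

For example, opcode 17 with register field 0 (SET) gives 0x00201220, and with field 1 (CLR) gives 0x00201221.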


2. NEON — 128-bit SIMD

ARM Advanced SIMD. 32 registers × 128 bits. 4-wide f32 or 2-wide f64.

Module: acpu::vector (math, reduce, softmax, rope)

Arithmetic (f32 × 4)

| Intrinsic | Op | Throughput | Use in acpu |
| --- | --- | --- | --- |
| vfmaq_f32 | fused multiply-add | 1/cycle | gemm accumulate, math |
| vaddq_f32 | add | 1/cycle | accumulate_tile, reduce |
| vsubq_f32 | subtract | 1/cycle | |
| vmulq_f32 | multiply | 1/cycle | math kernels |
| vdivq_f32 | divide | ~10 cycles | softmax |
| vsqrtq_f32 | square root | ~10 cycles | normalize |
| vmaxq_f32 / vminq_f32 | max/min | 1/cycle | reduce |
| vabsq_f32 | absolute | 1/cycle | |
| vnegq_f32 | negate | 1/cycle | |
| vrecpeq_f32 + vrecpsq_f32 | reciprocal estimate + Newton-Raphson step | 2 cycles | softmax fast path |
| vrsqrteq_f32 + vrsqrtsq_f32 | rsqrt estimate + Newton-Raphson step | 2 cycles | normalize fast path |
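
The estimate/step pairs work together: vrecpeq_f32 gives a coarse reciprocal, and vrecpsq_f32(d, x) returns 2 - d·x, so multiplying by it performs one Newton-Raphson refinement. A minimal sketch, not acpu's actual kernel; the function name is hypothetical and the non-aarch64 branch is a scalar stand-in so the example builds on any host:

```rust
/// Approximates 1/d per lane: one estimate plus one refinement
/// x <- x * (2 - d * x).
#[cfg(target_arch = "aarch64")]
pub fn recip4(d: [f32; 4]) -> [f32; 4] {
    use core::arch::aarch64::{vld1q_f32, vmulq_f32, vrecpeq_f32, vrecpsq_f32, vst1q_f32};
    let mut out = [0.0f32; 4];
    // SAFETY: both pointers refer to live 4-element f32 arrays, and NEON
    // is a baseline feature of aarch64 targets.
    unsafe {
        let v = vld1q_f32(d.as_ptr());
        let x = vrecpeq_f32(v); // coarse estimate
        let x = vmulq_f32(x, vrecpsq_f32(v, x)); // vrecpsq_f32 yields 2 - v*x
        vst1q_f32(out.as_mut_ptr(), x);
    }
    out
}

/// Portable scalar model of the same sequence for other hosts.
#[cfg(not(target_arch = "aarch64"))]
pub fn recip4(d: [f32; 4]) -> [f32; 4] {
    d.map(|v| {
        // Stand-in for the hardware estimate: the exact reciprocal
        // truncated to 7 mantissa bits.
        let x = f32::from_bits((1.0 / v).to_bits() & 0xFFFF_0000);
        x * (2.0 - v * x)
    })
}
```

A second refinement step can be appended when accuracy close to full f32 precision is needed; the same pattern applies to vrsqrteq_f32 with vrsqrtsq_f32.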

Load / Store

| Intrinsic | Description |
| --- | --- |
| vld1q_f32 | load 4 f32 (unaligned OK) |
| vst1q_f32 | store 4 f32 |
| vld1q_f32_x2 / x4 | load 8/16 f32 |
| vdupq_n_f32 | broadcast scalar → 4 lanes |
| vdupq_laneq_f32 | broadcast one lane → 4 |

Shuffle / Transpose

| Intrinsic | Description | Use in acpu |
| --- | --- | --- |
| vzip1q_f32 | interleave low halves | pack_a NEON 4×4 transpose |
| vzip2q_f32 | interleave high halves | pack_a NEON 4×4 transpose |
| vtrn1q_f32 | transpose even elements | |
| vtrn2q_f32 | transpose odd elements | |
| vextq_f32 | extract/rotate | rotate |
| vrev64q_f32 | reverse within 64-bit | rotate |

Reduction (horizontal)

| Intrinsic | Description |
| --- | --- |
| vaddvq_f32 | sum 4 lanes → scalar |
| vmaxvq_f32 | max of 4 lanes |
| vminvq_f32 | min of 4 lanes |
| vpaddq_f32 | pairwise add |

Comparison

| Intrinsic | Description |
| --- | --- |
| vcgtq_f32 | greater than → mask |
| vcgeq_f32 | greater or equal |
| vceqq_f32 | equal |
| vbslq_f32 | bitwise select (blend) |

3. FP16 — Half-Precision (FEAT_FP16)

16-bit IEEE 754 float. Present on all M-series.

Module: acpu::numeric::fp16

| Intrinsic | Description |
| --- | --- |
| vcvt_f32_f16 | convert 4 f16 → 4 f32 |
| vcvt_f16_f32 | convert 4 f32 → 4 f16 |
| vfmaq_f16 | f16 fused multiply-add (8-wide) |
| vaddq_f16 | f16 add (8-wide) |

AMX FMA16 (opcode 15): 32×32 f16 outer product, 2048 FLOPS/instruction.


4. BF16 — BFloat16 (FEAT_BF16, M2+)

16-bit brain float. NOT available on M1. Available M2+.

| Intrinsic | Description |
| --- | --- |
| vcvtq_low_bf16_f32 | convert 4 f32 → 4 bf16 (low half of the result) |
| vbfmmlaq_f32 | bf16 matrix multiply-accumulate, 2×4 × 4×2 → 2×2 f32 |
| vbfdotq_f32 | bf16 dot product |
| vbfmlalbq_f32 | bf16 widening multiply-add (low) |
| vbfmlaltq_f32 | bf16 widening multiply-add (high) |

Module: acpu::numeric::bf16 (software convert on M1, native on M2+)
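
A software f32 → bf16 conversion keeps the top 16 bits of the f32 pattern after rounding. The sketch below uses round-to-nearest-even and quiets NaNs; it is one plausible shape for the M1 fallback, not a copy of acpu's implementation, and the function names are hypothetical:

```rust
/// Converts f32 to bf16 bits with round-to-nearest-even.
pub fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep sign and payload top bits, force the quiet bit so the
        // result stays a NaN after truncation.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit: ties round to even.
    let bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(bias) >> 16) as u16
}

/// Widens bf16 bits back to f32; this direction is exact.
pub fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}
```

Widening is a plain 16-bit shift, which is why only the narrowing direction needs rounding logic.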


5. DotProd — Integer Dot Product (FEAT_DotProd)

4-element i8/u8 dot product accumulated into i32. Present on all M-series.

Module: acpu::numeric::quant

| Intrinsic | Description | Ops (per call) |
| --- | --- | --- |
| vdotq_s32 | dot(i8×4) → i32, 4 accumulators | 32 int ops |
| vdotq_u32 | unsigned variant | 32 int ops |
| vusdotq_s32 | mixed unsigned × signed; requires FEAT_I8MM (M2+), not FEAT_DotProd | 32 int ops |
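
What one vdotq_s32 call computes can be written as a scalar reference: each of the four i32 accumulators receives the dot product of its own group of four i8 pairs. This is an illustrative model for checking kernels, with a hypothetical name; it is not the intrinsic itself:

```rust
/// Scalar model of a signed 4-way dot product: lane `l` of the result is
/// acc[l] + sum over k in 0..4 of a[4l + k] * b[4l + k].
pub fn sdot_ref(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        for k in 0..4 {
            let i = lane * 4 + k;
            // Widen to i32 before multiplying, as the hardware does.
            out[lane] = out[lane].wrapping_add(a[i] as i32 * b[i] as i32);
        }
    }
    out
}
```

The 32 int ops per call are the 16 multiplies plus 16 additions across the four lanes.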

6. I8MM — Int8 Matrix Multiply (FEAT_I8MM, M2+)

Int8 matrix multiply: a 2×8 by 8×2 product accumulated into a 2×2 int32 tile. NOT available on M1.

| Intrinsic | Description |
| --- | --- |
| vmmlaq_s32 | signed i8 matmul 2×8 × 8×2 → 2×2 i32 |
| vmmlaq_u32 | unsigned variant |
| vusmmlaq_s32 | mixed signed/unsigned |
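
The operand layout of the matrix multiply is easy to get wrong, so a scalar reference helps. The sketch assumes the architectural layout: the first operand holds two rows of eight i8 values, the second holds two columns of eight i8 values, and the accumulator is a row-major 2×2 i32 tile. The function name is hypothetical:

```rust
/// Scalar model of a signed 2x8 by 8x2 int8 matrix multiply-accumulate.
/// a[8*i + k] is element (i, k) of the left matrix; b[8*j + k] is
/// element (k, j) of the right matrix; out[2*i + j] is element (i, j).
pub fn smmla_ref(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for i in 0..2 {
        for j in 0..2 {
            let mut sum = 0i32;
            for k in 0..8 {
                sum += a[8 * i + k] as i32 * b[8 * j + k] as i32;
            }
            out[2 * i + j] = out[2 * i + j].wrapping_add(sum);
        }
    }
    out
}
```

Each call performs 32 multiplies and 32 additions, twice the work of one dot-product instruction on the same 128-bit inputs.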

AMX opcode 14 (MAC16): 32×32 i16 multiply-accumulate. AMX opcodes 20-21: possibly i32/mixed matrix ops (undocumented).


7. FCMA — Complex Multiply-Accumulate (FEAT_FCMA)

Fused complex number operations. Present on all M-series.

Module: acpu::numeric::complex

| Intrinsic | Description |
| --- | --- |
| vcmlaq_f32 | complex multiply-accumulate (rotation 0°) |
| vcmlaq_rot90_f32 | complex mul-acc rotated 90° |
| vcmlaq_rot180_f32 | complex mul-acc rotated 180° |
| vcmlaq_rot270_f32 | complex mul-acc rotated 270° |

Computes (a + bi) × (c + di) in 2 instructions (0° + 90°).
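
The two-instruction product works because each rotation contributes half of the terms. A scalar model of the 0° and 90° steps for a single complex lane; the type and function names are illustrative, and the hardware fuses each multiply-add whereas this sketch does not:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct C32 {
    pub re: f32,
    pub im: f32,
}

/// Rotation 0: accumulates the terms that use the real part of `a`.
pub fn fcmla_rot0(acc: C32, a: C32, b: C32) -> C32 {
    C32 { re: acc.re + a.re * b.re, im: acc.im + a.re * b.im }
}

/// Rotation 90: accumulates the terms that use the imaginary part of `a`.
pub fn fcmla_rot90(acc: C32, a: C32, b: C32) -> C32 {
    C32 { re: acc.re - a.im * b.im, im: acc.im + a.im * b.re }
}

/// Full complex product: rotation 0 followed by rotation 90.
pub fn cmul(a: C32, b: C32) -> C32 {
    let zero = C32 { re: 0.0, im: 0.0 };
    fcmla_rot90(fcmla_rot0(zero, a, b), a, b)
}
```

For (1 + 2i) × (3 + 4i) the first step yields 3 + 4i and the second adds -8 + 6i, giving -5 + 10i.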


8. RDM — Rounding Doubling Multiply (FEAT_RDM)

Fixed-point multiply with rounding. Present on all M-series.

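| Intrinsic | Description |
| --- | --- |
| vqrdmulhq_s16 | rounding doubling multiply high (i16); baseline NEON, does not need FEAT_RDM |
| vqrdmlahq_s16 | rounding doubling multiply-add high (added by FEAT_RDM) |
| vqrdmlshq_s16 | rounding doubling multiply-sub high (added by FEAT_RDM) |
| vqrdmulhq_s32 | i32 variant of vqrdmulhq; baseline NEON |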

Useful for fixed-point quantized inference.
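
In Q15 fixed point these instructions compute a rounded product of two fractions. A scalar model of one i16 lane, written from the architectural definition (double the product, add the rounding constant, keep the high half, saturate); the function names are hypothetical:

```rust
/// Saturates a wide intermediate to the i16 range.
fn sat16(x: i64) -> i16 {
    x.clamp(i16::MIN as i64, i16::MAX as i64) as i16
}

/// One lane of a saturating rounding doubling multiply high.
pub fn sqrdmulh16(a: i16, b: i16) -> i16 {
    let p = 2 * (a as i64) * (b as i64) + (1 << 15);
    sat16(p >> 16)
}

/// One lane of the accumulate form: the accumulator is scaled up,
/// added to the doubled product, then rounded and narrowed.
pub fn sqrdmlah16(acc: i16, a: i16, b: i16) -> i16 {
    let p = ((acc as i64) << 16) + 2 * (a as i64) * (b as i64) + (1 << 15);
    sat16(p >> 16)
}
```

With Q15 inputs, 16384 represents 0.5, so sqrdmulh16(16384, 16384) returns 8192, which is 0.25. The only product that saturates in the plain multiply is i16::MIN × i16::MIN.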


9. LSE — Large System Extensions (FEAT_LSE)

Hardware atomics. Present on all M-series. Essential for lock-free threading.

Module: acpu::sync

| Instruction | Intrinsic / asm | Description |
| --- | --- | --- |
| LDADD | core::sync::atomic | atomic fetch-add |
| LDSET | | atomic fetch-or |
| LDCLR | | atomic fetch-and-not |
| CAS | | compare-and-swap |
| SWP | | atomic swap |

All are reachable through the std::sync::atomic types (AtomicUsize etc.), with the memory ordering chosen per call: Ordering::Relaxed, Acquire, Release, AcqRel, or SeqCst.
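
A short sketch of the standard-library calls that correspond to each row. Which instruction the compiler actually emits depends on the target features in effect, so the comments name the expected LSE form rather than a guarantee; the function name is hypothetical:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Runs one of each read-modify-write operation and returns the values
/// observed along the way, ending with the final stored value.
pub fn lse_style_ops() -> [u64; 6] {
    let a = AtomicU64::new(0b0101);
    let prev_add = a.fetch_add(1, Ordering::Relaxed); // LDADD form
    let prev_or = a.fetch_or(0b1000, Ordering::AcqRel); // LDSET form
    // fetch_and with an inverted mask clears bits, matching LDCLR.
    let prev_clr = a.fetch_and(!0b0010, Ordering::AcqRel);
    let prev_swp = a.swap(7, Ordering::AcqRel); // SWP form
    let cas = a
        .compare_exchange(7, 42, Ordering::AcqRel, Ordering::Acquire) // CAS form
        .unwrap_or(u64::MAX);
    [prev_add, prev_or, prev_clr, prev_swp, cas, a.load(Ordering::Acquire)]
}
```

Each fetch_* call returns the value held before the operation, which is the same contract the LSE load-op instructions have.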


10. Barriers and Events

Module: acpu::sync

| Instruction | Function | Description |
| --- | --- | --- |
| DMB ISH | barrier() | data memory barrier (inner shareable) |
| DSB ISH | fence() | data synchronization barrier |
| ISB | isb() | instruction synchronization barrier |
| WFE | wait() | wait for event (low-power sleep) |
| SEV | wake() | signal event (wake all cores in cluster) |

WFE/SEV pair: lightweight cross-core signaling without mutex overhead.


11. Prefetch (PRFM)

Module: acpu::sync::prefetch

| Instruction | Function | Description |
| --- | --- | --- |
| PRFM PLDL1KEEP | prefetch_l1(ptr) | prefetch for read into L1 |
| PRFM PLDL2KEEP | prefetch_l2(ptr) | prefetch for read into L2 |
| PRFM PSTL1KEEP | prefetch_l1_write(ptr) | prefetch for write into L1 |

Software prefetch hints. The hardware prefetcher is aggressive on Apple Silicon, so explicit prefetch is most useful for non-sequential access patterns (packed panels).
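
A sketch of how a read prefetch might be issued and used on an index-driven (non-sequential) walk. The function names and the prefetch distance are assumptions for illustration, not acpu's implementation; on non-aarch64 hosts the hint compiles to nothing:

```rust
/// Issues a read prefetch into L1 for the cache line containing `ptr`.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
pub fn prefetch_read_l1(ptr: *const u8) {
    // SAFETY: PRFM is only a hint and does not fault on bad addresses.
    unsafe {
        core::arch::asm!(
            "prfm pldl1keep, [{p}]",
            p = in(reg) ptr,
            options(nostack, preserves_flags, readonly)
        );
    }
}

/// No-op stand-in for other architectures.
#[cfg(not(target_arch = "aarch64"))]
#[inline(always)]
pub fn prefetch_read_l1(_ptr: *const u8) {}

/// Prefetch distance in elements; a tuning guess, not a measured value.
const AHEAD: usize = 8;

/// Sums data[idx[i]] while prefetching the element needed AHEAD steps later.
pub fn gather_sum(data: &[f32], idx: &[usize]) -> f32 {
    let mut sum = 0.0f32;
    for (i, &j) in idx.iter().enumerate() {
        if let Some(next) = idx.get(i + AHEAD).and_then(|&n| data.get(n)) {
            prefetch_read_l1(next as *const f32 as *const u8);
        }
        sum += data[j];
    }
    sum
}
```

The right distance depends on memory latency and per-element work, so it is worth measuring rather than fixing in advance.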


12. Core Affinity (QoS)

Module: acpu::sync::affinity

| Function | QoS class | Effect |
| --- | --- | --- |
| pin_p_core() | USER_INTERACTIVE (0x21) | schedule on P-cores |
| pin_e_core() | BACKGROUND (0x09) | schedule on E-cores |
| pin_any() | DEFAULT (0x15) | no preference |

Uses pthread_set_qos_class_self_np(). Soft hint, not hard pinning.
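
A minimal sketch of the underlying call through FFI. The wrapper name request_qos is hypothetical; the constants are the values from the table, and the declaration mirrors the C prototype in <pthread/qos.h>. On hosts other than macOS the wrapper does nothing and reports failure:

```rust
/// QoS class values as listed in the table above.
pub const QOS_USER_INTERACTIVE: u32 = 0x21;
pub const QOS_DEFAULT: u32 = 0x15;
pub const QOS_BACKGROUND: u32 = 0x09;

#[cfg(target_os = "macos")]
extern "C" {
    // int pthread_set_qos_class_self_np(qos_class_t, int);
    // On edition 2024 this block must be written `unsafe extern "C"`.
    fn pthread_set_qos_class_self_np(qos_class: u32, relative_priority: i32) -> i32;
}

/// Requests a QoS class for the calling thread; true on success.
#[cfg(target_os = "macos")]
pub fn request_qos(class: u32) -> bool {
    // SAFETY: plain C call with by-value arguments; 0 means success.
    unsafe { pthread_set_qos_class_self_np(class, 0) == 0 }
}

/// Stand-in for other operating systems: no effect.
#[cfg(not(target_os = "macos"))]
pub fn request_qos(_class: u32) -> bool {
    false
}
```

Because the scheduler treats the class as a hint, code that depends on running on a particular core type should verify it by measurement rather than assume it.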


13. PMU — Performance Monitoring Unit

Module: acpu::pulse

Access via dlopen of libkperf.dylib (private framework). Requires SIP-related permissions.

| Counter | Description |
| --- | --- |
| Cycles | CPU cycles |
| Instructions | retired instructions |
| BranchMiss | branch mispredictions |
| L1DMiss | L1 data cache misses |
| L1IMiss | L1 instruction cache misses |

14. LRCPC — Load-Acquire RCpc (FEAT_LRCPC)

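Acquire loads with RCpc (release-consistent, processor-consistent) ordering. Unlike LDAR, an LDAPR may be reordered ahead of an earlier store-release to a different address, which gives the core more freedom while still satisfying language-level acquire semantics.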

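| Instruction | Description |
| --- | --- |
| LDAPR | load-acquire with RCpc ordering (can be cheaper than LDAR) |

No dedicated intrinsic: a load with core::sync::atomic::Ordering::Acquire can be lowered to LDAPR when LLVM compiles for a target with the rcpc feature enabled, and to LDAR otherwise.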
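
The pattern that benefits is ordinary release/acquire message passing. A minimal sketch with a hypothetical function name; whether the acquire load becomes LDAPR or LDAR is the compiler's choice and is not visible in the source:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// A producer publishes a value, then sets a flag with Release; the
/// consumer spins on the flag with Acquire and then reads the value.
pub fn publish_and_read() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));
    let (d, r) = (Arc::clone(&data), Arc::clone(&ready));

    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);
        r.store(true, Ordering::Release); // store-release
    });

    // Acquire load: the candidate for LDAPR on targets with rcpc.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let value = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    value
}
```

The acquire load guarantees the consumer sees the data written before the flag, so the function always returns 42.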


Feature Matrix by Chip

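| Feature | M1 | M2 | M3 | M4 |
| --- | --- | --- | --- | --- |
| AMX v1 | YES | YES | | |
| AMX v2 | | | YES | YES |
| NEON | YES | YES | YES | YES |
| FP16 | YES | YES | YES | YES |
| BF16 | NO | YES | YES | YES |
| DotProd | YES | YES | YES | YES |
| I8MM | NO | YES | YES | YES |
| FCMA | YES | YES | YES | YES |
| RDM | YES | YES | YES | YES |
| LSE | YES | YES | YES | YES |
| LRCPC | YES | YES | YES | YES |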

AMX v2 (M3+): adds opcodes 8-9 (EXTRX/EXTRY), possibly opcodes 18-22. BF16 + I8MM appear in M2. All other features present since M1.
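
The architectural rows of the matrix can be checked at run time with the standard library's feature-detection macro. A sketch with a hypothetical function name; AMX has no such flag and is not covered, and on non-aarch64 hosts the function returns an empty list:

```rust
/// Returns the names of the features from the matrix that the running
/// CPU reports.
#[cfg(target_arch = "aarch64")]
pub fn detected_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    if std::arch::is_aarch64_feature_detected!("neon") { found.push("neon"); }
    if std::arch::is_aarch64_feature_detected!("fp16") { found.push("fp16"); }
    if std::arch::is_aarch64_feature_detected!("bf16") { found.push("bf16"); }
    if std::arch::is_aarch64_feature_detected!("dotprod") { found.push("dotprod"); }
    if std::arch::is_aarch64_feature_detected!("i8mm") { found.push("i8mm"); }
    if std::arch::is_aarch64_feature_detected!("fcma") { found.push("fcma"); }
    if std::arch::is_aarch64_feature_detected!("rdm") { found.push("rdm"); }
    if std::arch::is_aarch64_feature_detected!("lse") { found.push("lse"); }
    if std::arch::is_aarch64_feature_detected!("rcpc") { found.push("rcpc"); }
    found
}

/// Stand-in for other architectures: nothing to report.
#[cfg(not(target_arch = "aarch64"))]
pub fn detected_features() -> Vec<&'static str> {
    Vec::new()
}
```

On an M1, by the matrix above, the list would be expected to lack "bf16" and "i8mm" and contain the rest.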
