Apple AMX Instruction Set Reference
Reverse-engineered on Apple Silicon (M-series). All opcodes verified on real hardware.
Encoding
Every AMX instruction is a single .word:
.word (0x00201000 + (opcode << 5) + register_encoding)
The register_encoding converts LLVM's register numbering to the 5-bit hardware
encoding using the BCD trick: 0{reg} - ((0{reg} >> 4) * 6).
Operand is a 64-bit value in the general-purpose register. The hardware reads this register and interprets the bits according to the opcode.
Register File
| Register | Count | Size each | Total | Access |
|---|---|---|---|---|
| X | 8 rows | 64 bytes | 512 B | LDX/STX |
| Y | 8 rows | 64 bytes | 512 B | LDY/STY |
| Z | 64 rows | 64 bytes | 4096 B | LDZ/STZ/LDZI/STZI |
Z has 4 independent 16×16 f32 tiles (selected by z_row & 3):
- Tile 0: Z rows {0, 4, 8, ..., 60}
- Tile 1: Z rows {1, 5, 9, ..., 61}
- Tile 2: Z rows {2, 6, 10, ..., 62}
- Tile 3: Z rows {3, 7, 11, ..., 63}
Lifecycle
AMX_SET(opcode 17, operand register = 0): activate AMX on this threadAMX_CLR(opcode 17, operand register = 1): deactivate AMX on this thread- AMX state is per-thread. Each thread needs its own SET/CLR bracket.
- SET/CLR are cheap (~1 cycle). Z/X/Y contents persist across SET/CLR pairs.
Complete Opcode Map
Verified on hardware. Opcodes 23-31 cause SIGILL.
Load / Store (opcodes 0-7)
| Op | Name | Operand | Effect |
|---|---|---|---|
| 0 | LDX | ptr | (row << 56) |
Load 64 bytes from ptr into X[row] (row: 0-7) |
| 1 | LDY | ptr | (row << 56) |
Load 64 bytes from ptr into Y[row] (row: 0-7) |
| 2 | STX | ptr | (row << 56) |
Store 64 bytes from X[row] to ptr |
| 3 | STY | ptr | (row << 56) |
Store 64 bytes from Y[row] to ptr |
| 4 | LDZ | ptr | (row << 56) |
Load 64 bytes from ptr into Z[row] (row: 0-63) |
| 5 | STZ | ptr | (row << 56) |
Store 64 bytes from Z[row] to ptr |
| 6 | LDZI | ptr | (row << 56) |
Load into Z[row] with interleaved layout |
| 7 | STZI | ptr | (row << 56) |
Store from Z[row] with interleaved layout |
Row field: bits 58:56 for X/Y (3 bits, 0-7), bits 61:56 for Z (6 bits, 0-63).
Extract (opcodes 8-9)
| Op | Name | Effect |
|---|---|---|
| 8 | EXTRX | Extract from Z into X (horizontal slice) |
| 9 | EXTRY | Extract from Z into Y (vertical slice) |
Operand format: selects which Z row/column to extract.
FMA — Fused Multiply-Accumulate (opcodes 10-16)
All FMA instructions write to Z. X and Y are read-only inputs.
| Op | Name | Data type | Z stride | Elements/row | Outer product size |
|---|---|---|---|---|---|
| 10 | FMA64 | f64 | 8 | 8 | 8×8 |
| 11 | FMS64 | f64 (negate) | 8 | 8 | 8×8 |
| 12 | FMA32 | f32 | 4 | 16 | 16×16 |
| 13 | FMS32 | f32 (negate) | 4 | 16 | 16×16 |
| 14 | MAC16 | i16 | 2 | 32 | 32×32 |
| 15 | FMA16 | f16 | 2 | 32 | 32×32 |
| 16 | FMS16 | f16 (negate) | 2 | 32 | 32×32 |
FMS = fused multiply-subtract (Z -= X × Y instead of Z += X × Y).
FMA32 Operand Bit Layout (opcode 12)
Verified on hardware via single-bit probing:
Bit 63: vector mode (0 = matrix/outer product, 1 = vector/element-wise)
Bit 62: no visible effect
Bits 61-60: KILL output (Z empty when set)
Bits 59-38: no visible effect (default behavior)
Bit 37: partial output (1 Z row only)
Bits 36-34: KILL output (Z empty when set)
Bit 33: switches to stride-8 mode (f64-like Z addressing)
Bit 32: switches to stride-8 mode
Bits 31-30: no visible effect
Bit 29: skip_x (treat X as zero — no visible effect with Y present)
Bit 28: skip_y (treat Y as zero — Z[0][0] = 1.0 with default X)
Bit 27: skip_z (no accumulate: Z = X×Y instead of Z += X×Y)
Bits 26-22: no visible effect
Bit 21: Z tile select bit 1 (tile 2: first row at Z[2])
Bit 20: Z tile select bit 0 (tile 1: first row at Z[1])
Bits 19-16: no visible effect (X offset high bits)
Bits 15-12: X offset — selects X row via byte offset
Bits 11-10: X offset low bits
Bits 9-6: Y offset — selects Y row via byte offset
Bits 5-2: Y offset — changes output value (Y element selection)
Bits 1-0: Y offset low bits
Functional fields:
- Y offset (bits 9:0): byte offset into 512-byte circular Y buffer. Y row N = N×64 bytes.
- X offset (bits 19:10): byte offset into 512-byte circular X buffer. X row N = N×64 bytes.
- Z tile (bits 21:20): selects Z tile 0-3.
- skip_z (bit 27): first iteration flag — Z = X×Y (ignores current Z).
- skip_y (bit 28): Y treated as zero.
- skip_x (bit 29): X treated as zero.
- vector mode (bit 63): element-wise instead of outer product.
FMA32 Operation (matrix mode, bit 63 = 0)
for j in 0..16:
for i in 0..16:
Z[(j * 4 + tile) % 64][i] += X[x_offset/4 + i] * Y[y_offset/4 + j]
One fma32 instruction = 512 FLOPS (16 × 16 × 2).
Unknown Opcodes (18-22)
Valid on this hardware. Not yet fully characterized.
| Op | Z changed | Z rows | Z stride | Hypothesis |
|---|---|---|---|---|
| 18 | YES | 1 | - | i32 dot product or reduced mac |
| 19 | YES | 1 | - | f32 dot product (values are sums) |
| 20 | YES | 16 | 2 | i32 matrix mode (like mac16 variant) |
| 21 | YES | 16 | 2 | f32 matrix with fp16-like accumulation |
| 22 | X changed | 0 | - | X extract variant (like extrx) |
Opcodes 18-19 produce single-row Z output (possible dot product / reduction). Opcodes 20-21 produce 16-row output with stride 2 (possible mixed-precision). Opcode 22 modifies X registers (possible extract with different addressing).
Opcodes 23-31: SIGILL (illegal instruction) on this chip.
Performance Characteristics
- LDX/LDY: ~1-2 cycles from L1
- FMA32: ~2-3 cycles throughput (pipelined)
- STZ: ~1-2 cycles
- AMX load and FMA do NOT overlap well (interleaving hurts performance)
- AMX SET/CLR: ~1 cycle each
- Prefetch (PRFM) before AMX loads improves throughput ~5-10%
- 4 Z tiles allow register blocking: one Y load can serve 4 FMA instructions
Optimal GEMM Pattern
- Pack A into MR=16 wide column-major strips (Y loading)
- Pack B into NR=16 wide row-major strips (X loading)
- For each K batch of 8: load 8 Y + 8 X, issue 8 FMA32
- Use all 4 Z tiles for 16×64 computation (each Y load → 4 FMAs)
- STZ result to memory, NEON vaddq for C accumulation
- L1 constraint: KC × (MR + NR) × 4 ≤ L1D (64KB for Apple Silicon)
- L2 constraint: MC × KC × 4 ≤ L2 (4MB per core)