Apple AMX Instruction Set Reference

Reverse-engineered on Apple Silicon (M-series). All opcodes verified on real hardware.

Encoding

Every AMX instruction is a single .word:

.word (0x00201000 + (opcode << 5) + register_encoding)

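The register_encoding term produces the 5-bit hardware register number (x0-x30 map to 0-30). Inside an inline-assembly template the register is only available as text such as x12, so the expression pastes it after a 0 to form the hex literal 0x12 and then undoes the decimal-read-as-hex effect with the BCD trick: 0{reg} - ((0{reg} >> 4) * 6). For x12 this gives 0x12 - (1 * 6) = 12.

The operand is a 64-bit value held in that general-purpose register. The hardware reads the register and interprets its bits according to the opcode.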
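
As a quick check of the encoding formula, the arithmetic can be reproduced outside the assembler. The sketch below is a model, not part of any toolchain: `amx_word` and `bcd_to_binary` are hypothetical helper names, and the code assumes the operand lives in one of x0-x30.

```python
def bcd_to_binary(hex_value: int) -> int:
    """Undo the decimal-digits-read-as-hex effect, e.g. 0x12 -> 12."""
    return hex_value - ((hex_value >> 4) * 6)


def amx_word(opcode: int, reg: int) -> int:
    """Return the 32-bit instruction word for an AMX opcode and register xN."""
    if not 0 <= opcode < 32:
        raise ValueError("opcode must fit in 5 bits")
    if not 0 <= reg <= 30:
        raise ValueError("reg must name one of x0..x30")
    # What the assembler sees: the text "x12" pasted after "0" is parsed as 0x12.
    as_hex_literal = int(f"0x{reg}", 16)
    return 0x00201000 + (opcode << 5) + bcd_to_binary(as_hex_literal)


# FMA32 (opcode 12) with its operand in x12.
print(hex(amx_word(12, 12)))
```

The BCD step only matters for registers x10 and above; for x0-x9 the hex and decimal readings already agree.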

Register File

| Register | Count | Size each | Total | Access |
|---|---|---|---|---|
| X | 8 rows | 64 bytes | 512 B | LDX/STX |
| Y | 8 rows | 64 bytes | 512 B | LDY/STY |
| Z | 64 rows | 64 bytes | 4096 B | LDZ/STZ/LDZI/STZI |

Z has 4 independent 16×16 f32 tiles (selected by z_row & 3):

Tile 0: Z rows {0, 4, 8, ..., 60}
Tile 1: Z rows {1, 5, 9, ..., 61}
Tile 2: Z rows {2, 6, 10, ..., 62}
Tile 3: Z rows {3, 7, 11, ..., 63}
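
The row-to-tile mapping above can be written down directly. This is a small illustrative sketch; `tile_of` and `tile_rows` are hypothetical names, not part of any AMX API.

```python
def tile_of(z_row: int) -> int:
    """Tile index that a Z row belongs to (z_row & 3)."""
    return z_row & 3


def tile_rows(tile: int) -> list[int]:
    """The 16 Z rows that make up one 16x16 f32 tile."""
    return [tile + 4 * j for j in range(16)]


for tile in range(4):
    rows = tile_rows(tile)
    print(f"Tile {tile}: first {rows[0]}, last {rows[-1]}")
```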

Lifecycle

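AMX_SET (opcode 17, 5-bit operand field = 0): activate AMX on this thread. The field holds the immediate 0 rather than a register number, so the instruction word is 0x00201220.
AMX_CLR (opcode 17, 5-bit operand field = 1): deactivate AMX on this thread (instruction word 0x00201221).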
AMX state is per-thread. Each thread needs its own SET/CLR bracket.
SET/CLR are cheap (~1 cycle). Z/X/Y contents persist across SET/CLR pairs.
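
Plugging opcode 17 into the encoding formula gives the two words that bracket an AMX region. The sketch below only computes and prints them; it assumes the 5-bit field carries the 0 or 1 directly, and `lifecycle_word` is a hypothetical helper name.

```python
AMX_BASE = 0x00201000
LIFECYCLE_OPCODE = 17


def lifecycle_word(field: int) -> int:
    """Instruction word for opcode 17 with the given 5-bit field value."""
    if field not in (0, 1):
        raise ValueError("only 0 (SET) and 1 (CLR) are described here")
    return AMX_BASE + (LIFECYCLE_OPCODE << 5) + field


AMX_SET_WORD = lifecycle_word(0)
AMX_CLR_WORD = lifecycle_word(1)

# Emit the bracket each thread needs around its AMX work.
print(f".word {AMX_SET_WORD:#010x}  // AMX_SET")
print("// ... AMX loads, FMAs, stores ...")
print(f".word {AMX_CLR_WORD:#010x}  // AMX_CLR")
```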

Complete Opcode Map

Verified on hardware. Opcodes 23-31 cause SIGILL.

Load / Store (opcodes 0-7)

| Op | Name | Operand | Effect |
|---|---|---|---|
| 0 | LDX | `ptr \| (row << 56)` | Load 64 bytes from `ptr` into X[row] (row: 0-7) |
| 1 | LDY | `ptr \| (row << 56)` | Load 64 bytes from `ptr` into Y[row] (row: 0-7) |
| 2 | STX | `ptr \| (row << 56)` | Store 64 bytes from X[row] to `ptr` |
| 3 | STY | `ptr \| (row << 56)` | Store 64 bytes from Y[row] to `ptr` |
| 4 | LDZ | `ptr \| (row << 56)` | Load 64 bytes from `ptr` into Z[row] (row: 0-63) |
| 5 | STZ | `ptr \| (row << 56)` | Store 64 bytes from Z[row] to `ptr` |
| 6 | LDZI | `ptr \| (row << 56)` | Load into Z[row] with interleaved layout |
| 7 | STZI | `ptr \| (row << 56)` | Store from Z[row] with interleaved layout |

Row field: bits 58:56 for X/Y (3 bits, 0-7), bits 61:56 for Z (6 bits, 0-63).
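
A load/store operand can be packed with a few lines of integer arithmetic. The sketch below is illustrative: `ldst_operand` is a hypothetical name, and it assumes the pointer fits entirely in bits 55:0, below the row field.

```python
PTR_MASK = (1 << 56) - 1  # assumption: the address occupies bits 55:0


def ldst_operand(ptr: int, row: int, z: bool = False) -> int:
    """Pack `ptr | (row << 56)` for LDX/LDY/STX/STY, or LDZ/STZ when z=True."""
    max_row = 63 if z else 7  # 6-bit row field for Z, 3-bit for X and Y
    if not 0 <= row <= max_row:
        raise ValueError(f"row must be in 0..{max_row}")
    if ptr & ~PTR_MASK:
        raise ValueError("ptr must fit in bits 55:0")
    return ptr | (row << 56)


# X row 3 loaded from a made-up address.
print(hex(ldst_operand(0x1000, 3)))
```

Range-checking the row before shifting keeps a stray value from spilling into higher operand bits whose meaning is not described here.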

Extract (opcodes 8-9)

| Op | Name | Effect |
|---|---|---|
| 8 | EXTRX | Extract from Z into X (horizontal slice) |
| 9 | EXTRY | Extract from Z into Y (vertical slice) |

Operand format: selects which Z row/column to extract.

FMA — Fused Multiply-Accumulate (opcodes 10-16)

All FMA instructions write to Z. X and Y are read-only inputs.

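| Op | Name | Data type | Z stride | Elements/row | Outer product size |
|---|---|---|---|---|---|
| 10 | FMA64 | f64 | 8 | 8 | 8×8 |
| 11 | FMS64 | f64 (negate) | 8 | 8 | 8×8 |
| 12 | FMA32 | f32 | 4 | 16 | 16×16 |
| 13 | FMS32 | f32 (negate) | 4 | 16 | 16×16 |
| 14 | MAC16 | i16 | 2 | 32 | 32×32 |
| 15 | FMA16 | f16 | 2 | 32 | 32×32 |
| 16 | FMS16 | f16 (negate) | 2 | 32 | 32×32 |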

FMS = fused multiply-subtract (Z -= X × Y instead of Z += X × Y).
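
The stride and size columns in the table follow a regular pattern: a 64-byte row holds 64 / element-size values, and the Z stride spreads that many output rows across the 64 Z rows. The sketch below reproduces the table from the element size alone; treating this as a general rule is an assumption drawn from the table, and `fma_geometry` is a hypothetical name.

```python
ROW_BYTES = 64  # size of one X, Y, or Z row
Z_ROWS = 64     # number of Z rows


def fma_geometry(element_bytes: int) -> dict:
    """Derive the per-row element count, Z stride, and outer-product shape."""
    elements = ROW_BYTES // element_bytes
    return {
        "elements_per_row": elements,
        "z_stride": Z_ROWS // elements,
        "outer_product": (elements, elements),
    }


for name, size in (("f64", 8), ("f32", 4), ("f16/i16", 2)):
    print(name, fma_geometry(size))
```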

FMA32 Operand Bit Layout (opcode 12)

Verified on hardware via single-bit probing:

Bit 63:    vector mode (0 = matrix/outer product, 1 = vector/element-wise)
Bit 62:    no visible effect
Bits 61-60: KILL output (Z empty when set)
Bits 59-38: no visible effect (default behavior)
Bit 37:    partial output (1 Z row only)
Bits 36-34: KILL output (Z empty when set)
Bit 33:    switches to stride-8 mode (f64-like Z addressing)
Bit 32:    switches to stride-8 mode
Bits 31-30: no visible effect
Bit 29:    skip_x (treat X as zero — no visible effect with Y present)
Bit 28:    skip_y (treat Y as zero — Z[0][0] = 1.0 with default X)
Bit 27:    skip_z (no accumulate: Z = X×Y instead of Z += X×Y)
Bits 26-22: no visible effect
Bit 21:    Z tile select bit 1 (tile 2: first row at Z[2])
Bit 20:    Z tile select bit 0 (tile 1: first row at Z[1])
Bits 19-16: no visible effect (X offset high bits)
Bits 15-12: X offset — selects X row via byte offset
Bits 11-10: X offset low bits
Bits 9-6:  Y offset — selects Y row via byte offset
Bits 5-2:  Y offset — changes output value (Y element selection)
Bits 1-0:  Y offset low bits

Functional fields:

Y offset (bits 9:0): byte offset into 512-byte circular Y buffer. Y row N = N×64 bytes.
X offset (bits 19:10): byte offset into 512-byte circular X buffer. X row N = N×64 bytes.
Z tile (bits 21:20): selects Z tile 0-3.
skip_z (bit 27): first iteration flag — Z = X×Y (ignores current Z).
skip_y (bit 28): Y treated as zero.
skip_x (bit 29): X treated as zero.
vector mode (bit 63): element-wise instead of outer product.
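
The functional fields can be packed into an operand as follows. This is a sketch under the bit positions listed above; `fma32_operand` and its parameter names are hypothetical, and offsets are conservatively restricted to the 512-byte buffer even though the fields are 10 bits wide.

```python
def fma32_operand(x_offset: int = 0, y_offset: int = 0, tile: int = 0,
                  skip_z: bool = False, skip_y: bool = False,
                  skip_x: bool = False, vector: bool = False) -> int:
    """Pack the documented FMA32 fields into a 64-bit operand."""
    if not 0 <= x_offset < 512 or not 0 <= y_offset < 512:
        raise ValueError("offsets are byte offsets into a 512-byte buffer")
    if not 0 <= tile <= 3:
        raise ValueError("tile must be 0..3")
    operand = y_offset             # bits 9:0
    operand |= x_offset << 10      # bits 19:10
    operand |= tile << 20          # bits 21:20
    operand |= int(skip_z) << 27
    operand |= int(skip_y) << 28
    operand |= int(skip_x) << 29
    operand |= int(vector) << 63
    return operand


# First iteration of a K loop: X row 1, Y row 2, tile 2, overwrite Z.
print(hex(fma32_operand(x_offset=1 * 64, y_offset=2 * 64, tile=2, skip_z=True)))
```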

FMA32 Operation (matrix mode, bit 63 = 0)

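for j in 0..16:
    for i in 0..16:
        Z[(j * 4 + tile) % 64][i] += X[(x_offset/4 + i) % 128] * Y[(y_offset/4 + j) % 128]

Here X and Y are the 512-byte register files viewed as circular arrays of 128 f32 values, so element indices wrap modulo 128.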

One FMA32 instruction performs 512 FLOPs (16 × 16 multiply-adds × 2 operations each).
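
The matrix-mode loop can be modeled in plain Python to check expected Z contents against hardware output. This is a software sketch, not an emulator: it uses Python floats (f64) rather than f32, models only skip_z, and the name `fma32_matrix` is hypothetical.

```python
def fma32_matrix(z, x, y, x_offset=0, y_offset=0, tile=0, skip_z=False):
    """Model FMA32 matrix mode. z: 64 rows x 16 floats; x, y: 128 floats each."""
    for j in range(16):
        row = z[(j * 4 + tile) % 64]
        y_j = y[(y_offset // 4 + j) % 128]  # circular 128-element Y buffer
        for i in range(16):
            product = x[(x_offset // 4 + i) % 128] * y_j
            # skip_z overwrites instead of accumulating (first K iteration).
            row[i] = product if skip_z else row[i] + product


z = [[0.0] * 16 for _ in range(64)]
x = [float(i) for i in range(128)]
y = [2.0] * 128

fma32_matrix(z, x, y, tile=1)
print(z[1][:4])  # first output row of tile 1
print(z[0][:4])  # tile 0 is untouched
```

Calling the function again without skip_z accumulates into the same 16 rows; other tiles are never written.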

Unknown Opcodes (18-22)

Valid on this hardware. Not yet fully characterized.

| Op | Z changed | Z rows | Z stride | Hypothesis |
|---|---|---|---|---|
| 18 | YES | 1 | - | i32 dot product or reduced mac |
| 19 | YES | 1 | - | f32 dot product (values are sums) |
| 20 | YES | 16 | 2 | i32 matrix mode (like mac16 variant) |
| 21 | YES | 16 | 2 | f32 matrix with fp16-like accumulation |
| 22 | X changed | 0 | - | X extract variant (like extrx) |

Opcodes 18-19 produce single-row Z output (possible dot product / reduction). Opcodes 20-21 produce 16-row output with stride 2 (possible mixed-precision). Opcode 22 modifies X registers (possible extract with different addressing).

Opcodes 23-31: SIGILL (illegal instruction) on this chip.

Performance Characteristics

LDX/LDY: ~1-2 cycles from L1
FMA32: ~2-3 cycles throughput (pipelined)
STZ: ~1-2 cycles
AMX load and FMA do NOT overlap well (interleaving hurts performance)
AMX SET/CLR: ~1 cycle each
Prefetch (PRFM) before AMX loads improves throughput ~5-10%
4 Z tiles allow register blocking: one Y load can serve 4 FMA instructions
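
The FMA32 throughput figure translates into a rough per-core ceiling once a clock frequency is chosen. The sketch below is back-of-the-envelope arithmetic only: the 3.2 GHz clock is an illustrative input rather than a measurement from these notes, and `fma32_gflops` is a hypothetical name.

```python
FLOPS_PER_FMA32 = 512  # 16 x 16 multiply-adds, 2 FLOPs each


def fma32_gflops(cycles_per_instruction: float, clock_ghz: float) -> float:
    """Upper bound on f32 GFLOP/s if FMA32 issues back to back."""
    return FLOPS_PER_FMA32 / cycles_per_instruction * clock_ghz


# Bounds for the ~2-3 cycle throughput range quoted above.
for cycles in (2, 3):
    print(f"{cycles} cycles/FMA32: {fma32_gflops(cycles, 3.2):.1f} GFLOP/s")
```

Real kernels land below this bound because loads and stores compete with the FMA stream.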

Optimal GEMM Pattern

Pack A into MR=16 wide column-major strips (Y loading)
Pack B into NR=16 wide row-major strips (X loading)
For each K batch of 8: load 8 Y + 8 X, issue 8 FMA32
Use all 4 Z tiles for 16×64 computation (each Y load → 4 FMAs)
Store the result to memory with STZ, then accumulate into C with NEON vaddq
L1 constraint: KC × (MR + NR) × 4 ≤ L1D (64 KB is the efficiency-core L1D; M1 performance cores have 128 KB)
L2 constraint: MC × KC × 4 ≤ L2 (4MB per core)
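
The two cache constraints can be turned into block-size limits with integer division. The sketch below uses the 64 KB and 4 MB budgets and the MR = NR = 16 strip widths from the list above; `max_kc` and `max_mc` are hypothetical names.

```python
F32_BYTES = 4


def max_kc(l1_bytes: int, mr: int = 16, nr: int = 16) -> int:
    """Largest KC with KC * (MR + NR) * 4 <= L1D."""
    return l1_bytes // ((mr + nr) * F32_BYTES)


def max_mc(l2_bytes: int, kc: int) -> int:
    """Largest MC with MC * KC * 4 <= L2."""
    return l2_bytes // (kc * F32_BYTES)


kc = max_kc(64 * 1024)
mc = max_mc(4 * 1024 * 1024, kc)
print(f"KC <= {kc}, MC <= {mc}")

# With all 4 Z tiles in use the B strip is 64 wide, which lowers the KC limit.
print(f"KC <= {max_kc(64 * 1024, nr=64)} for a 16x64 block")
```

These are upper bounds; a practical choice would likely round KC down to a multiple of the K batch size of 8 and leave cache headroom for C and for prefetched data.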