BLIS Algorithm
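The CPU implementation uses BLIS-style cache blocking: operands are split into tiles sized for each level of the cache hierarchy, and a small microkernel computes one register-sized tile of C at a time.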
5-Loop Blocking
The product of A (M × K) and B (K × N) is blocked into tiles that fit in cache:
┌────────────────────────────────────────────────────────────────────────┐
│ Loop 5: for jc in 0..N step NC  (L3 cache - columns of B)              │
│   Loop 4: for pc in 0..K step KC  (L2 cache - depth)                   │
│     Pack B[pc:KC, jc:NC] → B̃  (contiguous in L3)                       │
│     Loop 3: for ic in 0..M step MC  (L1 cache - rows of A)             │
│       Pack A[ic:MC, pc:KC] → Ã  (contiguous in L2)                     │
│       Loop 2: for jr in 0..NC step NR  (register blocking)             │
│         Loop 1: for ir in 0..MC step MR  (microkernel)                 │
│           microkernel(Ã[ir], B̃[jr], C[ic+ir, jc+jr])                   │
└────────────────────────────────────────────────────────────────────────┘
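The loop nest above can be sketched in Rust as follows. This is a minimal illustration, not the code in core/gemm.rs: the function name `gemm_blocked`, the small block sizes, and the scalar inner kernel are assumptions made for the example, and the packing steps are left out so that the loop structure stays visible.

```rust
// Hypothetical block sizes for this sketch, deliberately small.
// They are not the tuned values from the parameter table.
const MC: usize = 8;
const NC: usize = 8;
const KC: usize = 4;
const MR: usize = 4;
const NR: usize = 4;

/// Computes C += A * B for row-major A (m x k), B (k x n) and C (m x n)
/// using the five-loop blocking order. Packing is omitted.
pub fn gemm_blocked(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);

    for jc in (0..n).step_by(NC) {
        // Loop 5: column blocks of B and C
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            // Loop 4: depth blocks (B would be packed here)
            let kc = KC.min(k - pc);
            for ic in (0..m).step_by(MC) {
                // Loop 3: row blocks of A and C (A would be packed here)
                let mc = MC.min(m - ic);
                for jr in (0..nc).step_by(NR) {
                    // Loop 2: NR-wide column panels
                    let nr = NR.min(nc - jr);
                    for ir in (0..mc).step_by(MR) {
                        // Loop 1: MR-tall row panels
                        let mr = MR.min(mc - ir);
                        // Scalar stand-in for the microkernel: updates an
                        // mr x nr tile of C with kc rank-1 updates.
                        for p in 0..kc {
                            for i in 0..mr {
                                let a_ip = a[(ic + ir + i) * k + pc + p];
                                for j in 0..nr {
                                    c[(ic + ir + i) * n + jc + jr + j] +=
                                        a_ip * b[(pc + p) * n + jc + jr + j];
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```

The `min` calls handle edge tiles when M, N, or K is not a multiple of the block size; a tuned implementation would more likely pad the packed buffers and always run a full-size microkernel.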
Cache Tiling Parameters
| Parameter | Description | f32 AVX2 | f64 AVX2 | Portable |
|---|---|---|---|---|
| MC | Rows per L2 block | 256 | 128 | 64 |
| NC | Columns per L3 block | 256 | 128 | 64 |
| KC | Depth per block | 512 | 256 | 256 |
| MR | Microkernel rows | 8 | 4 | 4 |
| NR | Microkernel columns | 8 | 4 | 4 |
Parameters are tuned to fit in cache:
- MC × KC fits in L2 cache
- KC × NC fits in L3 cache
- MR × NR fits in registers
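To make these constraints concrete, the footprint of each block can be computed from the table values. The sketch below uses a hypothetical `TilingSketch` struct as a stand-in (it is not the TilingParams struct from core/tiling.rs, whose fields are not shown here) and passes the element size in bytes explicitly.

```rust
/// Hypothetical stand-in for a set of tiling parameters.
pub struct TilingSketch {
    pub mc: usize,
    pub nc: usize,
    pub kc: usize,
    pub mr: usize,
    pub nr: usize,
}

impl TilingSketch {
    /// Bytes occupied by one packed MC x KC block of A.
    pub fn a_block_bytes(&self, elem_size: usize) -> usize {
        self.mc * self.kc * elem_size
    }

    /// Bytes occupied by one packed KC x NC block of B.
    pub fn b_block_bytes(&self, elem_size: usize) -> usize {
        self.kc * self.nc * elem_size
    }

    /// Bytes occupied by one MR x NR accumulator tile of C.
    pub fn c_tile_bytes(&self, elem_size: usize) -> usize {
        self.mr * self.nr * elem_size
    }
}

/// Values copied from the "f32 AVX2" column of the table.
pub const F32_AVX2: TilingSketch = TilingSketch { mc: 256, nc: 256, kc: 512, mr: 8, nr: 8 };

/// Values copied from the "f64 AVX2" column of the table.
pub const F64_AVX2: TilingSketch = TilingSketch { mc: 128, nc: 128, kc: 256, mr: 4, nr: 4 };
```

With these numbers, the f32 AVX2 blocks of A and B each take 256 × 512 × 4 bytes = 512 KiB, and the 8 × 8 accumulator tile takes 256 bytes, which is 8 of the 16 32-byte YMM registers. Whether 512 KiB actually fits in L2 depends on the target CPU.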
Packing
Before computation, matrices are packed into contiguous buffers:
Pack A (MC × KC block)
Original layout (row-major):
A[0,0] A[0,1] A[0,2] ...
A[1,0] A[1,1] A[1,2] ...
...
Packed layout (MR-contiguous panels):
A[0,0] A[1,0] ... A[MR-1,0] // First column of first panel
A[0,1] A[1,1] ... A[MR-1,1] // Second column of first panel
...
A[MR,0] A[MR+1,0] ... // First column of second panel
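A packing routine that produces this layout might look like the sketch below. The name `pack_a`, its signature, and the zero-padding of a partial last panel are assumptions for illustration; the functions in core/packing.rs may differ.

```rust
/// Packs the mc x kc block of row-major `a` (row stride `lda`) that starts
/// at row `ic`, column `pc` into panels of `mr` rows. Within a panel, the
/// `mr` entries of each column are stored contiguously, column after column.
/// Rows past the end of the block are zero-padded (an assumption of this sketch).
pub fn pack_a(
    a: &[f32],
    lda: usize,
    ic: usize,
    pc: usize,
    mc: usize,
    kc: usize,
    mr: usize,
) -> Vec<f32> {
    let panels = (mc + mr - 1) / mr; // number of MR-row panels, rounded up
    let mut packed = vec![0.0f32; panels * mr * kc];
    for panel in 0..panels {
        let base = panel * mr * kc; // start of this panel in the packed buffer
        for p in 0..kc {
            for i in 0..mr {
                let row = panel * mr + i;
                if row < mc {
                    packed[base + p * mr + i] = a[(ic + row) * lda + pc + p];
                }
            }
        }
    }
    packed
}
```

For a 4 × 3 matrix holding 0 through 11 in row-major order and `mr = 2`, the first panel comes out as 0, 3, 1, 4, 2, 5: rows 0 and 1 interleaved column by column.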
Pack B (KC × NC block)
Packed into NR-wide panels for broadcasting:
B[0,0] B[0,1] ... B[0,NR-1] // First row of first panel
B[1,0] B[1,1] ... B[1,NR-1] // Second row of first panel
...
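The corresponding routine for B can be sketched the same way. As before, the name `pack_b`, its signature, and the zero-padding of a narrow last panel are assumptions, not the actual API in core/packing.rs.

```rust
/// Packs the kc x nc block of row-major `b` (row stride `ldb`) that starts
/// at row `pc`, column `jc` into panels of `nr` columns. Within a panel, the
/// `nr` entries of each row are stored contiguously, row after row.
/// Columns past the end of the block are zero-padded (an assumption of this sketch).
pub fn pack_b(
    b: &[f32],
    ldb: usize,
    pc: usize,
    jc: usize,
    kc: usize,
    nc: usize,
    nr: usize,
) -> Vec<f32> {
    let panels = (nc + nr - 1) / nr; // number of NR-wide panels, rounded up
    let mut packed = vec![0.0f32; panels * kc * nr];
    for panel in 0..panels {
        let base = panel * kc * nr; // start of this panel in the packed buffer
        let col0 = panel * nr; // first block column covered by this panel
        let width = nr.min(nc - col0); // narrower than nr only for the last panel
        for p in 0..kc {
            let src = (pc + p) * ldb + jc + col0;
            let dst = base + p * nr;
            packed[dst..dst + width].copy_from_slice(&b[src..src + width]);
        }
    }
    packed
}
```

Because B is row-major, each row segment of a panel is already contiguous in the source, so the copy is a plain slice copy rather than a strided gather.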
Benefits
- Sequential access: Packed data is accessed linearly
- Cache reuse: Each block is loaded once, used many times
- TLB efficiency: Packed buffers span fewer pages, so there are fewer TLB misses
- SIMD friendly: Contiguous data enables vectorization
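The sequential-access point shows up clearly in a microkernel that consumes one packed panel of A and one of B. The sketch below is a scalar illustration under the panel layouts described in the Packing section; the function name, the const-generic signature, and the row-major C tile with stride `ldc` are assumptions, and the Microkernel trait in core/kernel.rs may be shaped differently.

```rust
/// Scalar microkernel sketch: updates an MR x NR tile of row-major C
/// (row stride `ldc`) with the product of one packed A panel and one
/// packed B panel of depth `kc`.
pub fn microkernel<const MR: usize, const NR: usize>(
    kc: usize,
    a_panel: &[f32],
    b_panel: &[f32],
    c: &mut [f32],
    ldc: usize,
) {
    // Accumulators stand in for the vector registers of a SIMD kernel.
    let mut acc = [[0.0f32; NR]; MR];
    for p in 0..kc {
        // Both panels are read front to back, one MR-column and one NR-row per step.
        let a_col = &a_panel[p * MR..(p + 1) * MR];
        let b_row = &b_panel[p * NR..(p + 1) * NR];
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += a_col[i] * b_row[j];
            }
        }
    }
    // Write the accumulated tile back into C.
    for i in 0..MR {
        for j in 0..NR {
            c[i * ldc + j] += acc[i][j];
        }
    }
}
```

Each step of the `p` loop is a rank-1 update of the accumulator tile, which is the part a SIMD kernel would implement with vector loads and fused multiply-adds.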
Code Location
- core/gemm.rs: Main blocking loops
- core/packing.rs: Pack functions
- core/tiling.rs: TilingParams struct
- core/kernel.rs: Microkernel trait