Core tropical GEMM algorithms using BLIS-style blocking.
This module provides the portable implementation of tropical matrix multiplication, organized for cache efficiency with the blocking scheme used by the BLIS framework.
§BLIS Algorithm Overview
The BLIS (BLAS-like Library Instantiation Software) approach achieves near-optimal performance through hierarchical cache blocking:
┌───────────────────────────────────────────────────────────────────┐
│ Loop 5: for jc in 0..N step NC          (partition columns of B)  │
│   Loop 4: for pc in 0..K step KC        (partition depth)         │
│     Pack B[pc:KC, jc:NC] → B̃            (fits in L3 cache)        │
│     Loop 3: for ic in 0..M step MC      (partition rows of A)     │
│       Pack A[ic:MC, pc:KC] → Ã          (fits in L2 cache)        │
│       Loop 2: for jr in 0..NC step NR   (register blocking)       │
│         Loop 1: for ir in 0..MC step MR (microkernel)             │
│           microkernel(Ã[ir], B̃[jr], C[ic+ir, jc+jr])              │
└───────────────────────────────────────────────────────────────────┘
§Cache Tiling Parameters
The TilingParams struct controls blocking sizes:
| Parameter | Purpose | Typical Value (f32 AVX2) |
|---|---|---|
| MC | Rows per L2 block | 256 |
| NC | Columns per L3 block | 256 |
| KC | Depth per block | 512 |
| MR | Microkernel rows | 8 |
| NR | Microkernel columns | 8 |
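To make the role of each parameter concrete, the five-loop nest can be sketched in plain Rust. This is a minimal illustration, not the crate's implementation: the `Tiles` struct and `blocked_maxplus` function are hypothetical names, the semiring is assumed to be max-plus over `f32`, matrices are assumed row-major, and the packing steps are left out so that only the blocking structure is visible.

```rust
/// Hypothetical blocking sizes. The field names mirror the table above,
/// but this is not the crate's `TilingParams` type.
#[derive(Clone, Copy)]
pub struct Tiles {
    pub mc: usize,
    pub nc: usize,
    pub kc: usize,
    pub mr: usize,
    pub nr: usize,
}

/// Max-plus GEMM over row-major slices (assumed semiring and layout):
/// c[i][j] = max(c[i][j], max over p of (a[i][p] + b[p][j])).
/// `c` should start at f32::NEG_INFINITY, the max-plus additive identity.
/// All tile sizes must be non-zero (`step_by(0)` panics).
pub fn blocked_maxplus(
    a: &[f32], b: &[f32], c: &mut [f32],
    m: usize, n: usize, k: usize, t: Tiles,
) {
    for jc in (0..n).step_by(t.nc) {                  // Loop 5: columns of B
        let nc = t.nc.min(n - jc);
        for pc in (0..k).step_by(t.kc) {              // Loop 4: depth
            let kc = t.kc.min(k - pc);
            for ic in (0..m).step_by(t.mc) {          // Loop 3: rows of A
                let mc = t.mc.min(m - ic);
                for jr in (0..nc).step_by(t.nr) {     // Loop 2: register columns
                    let nr = t.nr.min(nc - jr);
                    for ir in (0..mc).step_by(t.mr) { // Loop 1: register rows
                        let mr = t.mr.min(mc - ir);
                        // Stand-in for the microkernel: one (mr x nr) tile of C.
                        for p in 0..kc {
                            for i in 0..mr {
                                let av = a[(ic + ir + i) * k + pc + p];
                                for j in 0..nr {
                                    let bv = b[(pc + p) * n + jc + jr + j];
                                    let idx = (ic + ir + i) * n + jc + jr + j;
                                    c[idx] = c[idx].max(av + bv);
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Because max is associative and commutative, splitting the depth dimension into KC-sized chunks and accumulating into C gives the same result for any choice of tile sizes; the parameters affect only memory traffic.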
§Packing
Before computation, matrices are packed into contiguous buffers:
- pack_a: Packs an MC×KC block of A for sequential microkernel access
- pack_b: Packs a KC×NC block of B for efficient broadcasting
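Packing reduces TLB misses by making each block contiguous in memory, and it gives the microkernel unit-stride access that is straightforward to vectorize with SIMD.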
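As an illustration of what a packed layout can look like, the sketch below copies an MC×KC block of A into MR-row panels, storing each panel one depth step at a time so the microkernel reads MR consecutive values per step. The function `pack_a_panels` is hypothetical and is not the crate's `pack_a`; the row-major source layout, the panel ordering, and the padding of edge panels with `f32::NEG_INFINITY` (the max-plus additive identity) are all assumptions.

```rust
/// Hypothetical packing helper, not the crate's `pack_a`.
///
/// `a` points at the top-left element of an `mc x kc` block inside a
/// row-major matrix with leading dimension `lda`. The result holds
/// ceil(mc / mr) panels; each panel stores `kc` groups of `mr` values,
/// one group per depth index. `mr` must be non-zero.
pub fn pack_a_panels(a: &[f32], lda: usize, mc: usize, kc: usize, mr: usize) -> Vec<f32> {
    let panels = (mc + mr - 1) / mr;
    let mut out = Vec::with_capacity(panels * mr * kc);
    for panel in 0..panels {
        let row0 = panel * mr;
        for p in 0..kc {
            for i in 0..mr {
                let row = row0 + i;
                // Rows past the block edge are padded with the max-plus
                // identity so they never win a `max` in the microkernel.
                out.push(if row < mc { a[row * lda + p] } else { f32::NEG_INFINITY });
            }
        }
    }
    out
}
```

Padding to a full MR rows lets a microkernel run on fixed-size tiles without edge-case branches, at the cost of a few wasted lanes on the last panel.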
§Microkernel
The innermost loop executes an MR×NR tile computation, where ⊕ and ⊗ denote the tropical semiring's addition and multiplication:
for each k in packed_k:
    C[0:MR, 0:NR] = C ⊕ (A_col[0:MR] ⊗ B_row[0:NR])
The Microkernel trait abstracts over portable and SIMD implementations.
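The tile update above can be sketched as a free function with const-generic tile sizes. This is an illustrative stand-in and does not reproduce the crate's `Microkernel` trait, whose exact methods are not shown here. It assumes the max-plus semiring over `f32` (⊕ is `max`, ⊗ is `+`), an A panel storing MR values per depth step, a B panel storing NR values per depth step, and a row-major C with leading dimension `ldc`.

```rust
/// Hypothetical max-plus microkernel for one MR x NR tile of C.
///
/// `a_panel` holds `kc * MR` values (MR per depth step) and `b_panel`
/// holds `kc * NR` values (NR per depth step). `c` points at the
/// tile's top-left element in a row-major matrix with stride `ldc`.
pub fn microkernel_maxplus<const MR: usize, const NR: usize>(
    kc: usize,
    a_panel: &[f32],
    b_panel: &[f32],
    c: &mut [f32],
    ldc: usize,
) {
    // Accumulate in a local tile so the compiler can keep it in registers.
    let mut acc = [[f32::NEG_INFINITY; NR]; MR];
    for k in 0..kc {
        let a_col = &a_panel[k * MR..(k + 1) * MR];
        let b_row = &b_panel[k * NR..(k + 1) * NR];
        for i in 0..MR {
            for j in 0..NR {
                // acc = acc (+) (a (x) b), i.e. max(acc, a + b).
                acc[i][j] = acc[i][j].max(a_col[i] + b_row[j]);
            }
        }
    }
    // Merge the tile into C with the semiring addition.
    for i in 0..MR {
        for j in 0..NR {
            let idx = i * ldc + j;
            c[idx] = c[idx].max(acc[i][j]);
        }
    }
}
```

A SIMD variant would keep the same contract and replace the inner `i`/`j` loops with vector max and add instructions over the packed panels.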
§Module Contents
Structs§
- BlockIterator - Iterator over blocks for the outer loop.
- GemmWithArgmax - Result of GEMM with argmax tracking.
- PortableMicrokernel - Portable (non-SIMD) microkernel implementation.
- TilingParams - Tiling parameters for BLIS-style GEMM blocking.
Enums§
Traits§
- Microkernel - Trait for GEMM microkernels.
- MicrokernelWithArgmax - Trait for microkernels that track argmax during computation.
Functions§
- pack_a ⚠ - Pack a panel of matrix A into a contiguous buffer.
- pack_b ⚠ - Pack a panel of matrix B into a contiguous buffer.
- packed_a_size - Calculate packed buffer size for A.
- packed_b_size - Calculate packed buffer size for B.
- tropical_gemm_inner ⚠ - Tropical GEMM with custom kernel and tiling parameters.
- tropical_gemm_portable ⚠ - Tropical GEMM: C = A ⊗ B
- tropical_gemm_with_argmax_inner ⚠ - Tropical GEMM with argmax tracking and custom kernel.
- tropical_gemm_with_argmax_portable ⚠ - Tropical GEMM with argmax tracking.
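For readers unfamiliar with argmax tracking, the unblocked sketch below shows the idea: alongside each output value it records the depth index k that produced the maximum. The function `maxplus_with_argmax` is hypothetical and does not reproduce the signatures of the functions listed above or the layout of `GemmWithArgmax`; the max-plus semiring, row-major layout, `u32` index type, and first-wins tie-breaking are assumptions.

```rust
/// Hypothetical reference routine, not part of the crate's API.
///
/// Computes c[i][j] = max over p of (a[i][p] + b[p][j]) for row-major
/// `a` (m x k) and `b` (k x n), and records the winning `p` per entry.
/// Ties keep the smallest `p` because the comparison is strict.
pub fn maxplus_with_argmax(
    a: &[f32], b: &[f32], m: usize, n: usize, k: usize,
) -> (Vec<f32>, Vec<u32>) {
    let mut vals = vec![f32::NEG_INFINITY; m * n];
    let mut args = vec![0u32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                let cand = a[i * k + p] + b[p * n + j];
                if cand > vals[i * n + j] {
                    vals[i * n + j] = cand;
                    args[i * n + j] = p as u32;
                }
            }
        }
    }
    (vals, args)
}
```

The recorded indices are what make the operation useful for backtracking, for example recovering an optimal path after a max-plus dynamic program.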