Module core

Core tropical GEMM algorithms using BLIS-style blocking.

This module provides the portable implementation of tropical matrix multiplication, optimized for cache efficiency using the BLIS framework.

§BLIS Algorithm Overview

The BLIS (BLAS-like Library Instantiation Software) approach achieves near-optimal performance through hierarchical cache blocking:

┌─────────────────────────────────────────────────────────────────┐
│ Loop 5: for jc in 0..N step NC    (partition columns of B)     │
│   Loop 4: for pc in 0..K step KC  (partition depth)            │
│     Pack B[pc:KC, jc:NC] → B̃  (fits in L3 cache)               │
│     Loop 3: for ic in 0..M step MC  (partition rows of A)      │
│       Pack A[ic:MC, pc:KC] → Ã  (fits in L2 cache)             │
│       Loop 2: for jr in 0..NC step NR  (register blocking)     │
│         Loop 1: for ir in 0..MC step MR  (microkernel)         │
│           microkernel(Ã[ir], B̃[jr], C[ic+ir, jc+jr])           │
└─────────────────────────────────────────────────────────────────┘
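
The loop nest above can be sketched in plain Rust. This is a minimal illustration, not the crate's implementation: it assumes the max-plus semiring (⊕ = max, ⊗ = +), row-major `f32` storage, and a hypothetical function name `blocked_maxplus_gemm`. Packing is left out so that only the blocking structure is visible, and the microkernel is a scalar loop.

```rust
/// Blocked max-plus GEMM over row-major slices (hypothetical sketch).
/// `a` is m x k, `b` is k x n, `c` is m x n and must be pre-filled with
/// f32::NEG_INFINITY, the additive identity of max-plus.
/// `tiles` is (MC, NC, KC, MR, NR); every entry must be non-zero.
pub fn blocked_maxplus_gemm(
    m: usize,
    n: usize,
    k: usize,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    (mc, nc, kc, mr, nr): (usize, usize, usize, usize, usize),
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for jc in (0..n).step_by(nc) {
        // Loop 5: columns of B
        let nc_eff = nc.min(n - jc);
        for pc in (0..k).step_by(kc) {
            // Loop 4: depth
            let kc_eff = kc.min(k - pc);
            for ic in (0..m).step_by(mc) {
                // Loop 3: rows of A
                let mc_eff = mc.min(m - ic);
                for jr in (0..nc_eff).step_by(nr) {
                    // Loop 2: NR-wide column strips
                    let nr_eff = nr.min(nc_eff - jr);
                    for ir in (0..mc_eff).step_by(mr) {
                        // Loop 1: MR-tall row strips
                        let mr_eff = mr.min(mc_eff - ir);
                        // Scalar stand-in for the microkernel.
                        for p in pc..pc + kc_eff {
                            for i in ic + ir..ic + ir + mr_eff {
                                let a_ip = a[i * k + p];
                                for j in jc + jr..jc + jr + nr_eff {
                                    let cand = a_ip + b[p * n + j];
                                    let c_ij = &mut c[i * n + j];
                                    if cand > *c_ij {
                                        *c_ij = cand;
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Because max is associative and commutative, splitting the depth dimension into KC-sized chunks and accumulating into C across chunks gives the same result as a single pass over K, so the output does not depend on the tile sizes.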

§Cache Tiling Parameters

The TilingParams struct controls blocking sizes:

Parameter  Purpose               Typical Value (f32 AVX2)
MC         Rows per L2 block     256
NC         Columns per L3 block  256
KC         Depth per block       512
MR         Microkernel rows      8
NR         Microkernel columns   8
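
To see what these sizes mean in memory, the sketch below computes the footprint of one block of A, one block of B, and one accumulator tile. `TilingSketch` and its methods are hypothetical stand-ins written for this illustration; they are not the fields or API of TilingParams. Only the numeric values are taken from the table.

```rust
/// Hypothetical stand-in for a set of blocking sizes (not the crate's TilingParams).
#[derive(Debug, Clone, Copy)]
pub struct TilingSketch {
    pub mc: usize,
    pub nc: usize,
    pub kc: usize,
    pub mr: usize,
    pub nr: usize,
}

impl TilingSketch {
    /// Values copied from the "Typical Value (f32 AVX2)" column.
    pub const TABLE_F32_AVX2: Self = Self { mc: 256, nc: 256, kc: 512, mr: 8, nr: 8 };

    /// Bytes in one MC x KC block of A for elements of `elem_bytes` bytes.
    pub fn a_block_bytes(&self, elem_bytes: usize) -> usize {
        self.mc * self.kc * elem_bytes
    }

    /// Bytes in one KC x NC block of B for elements of `elem_bytes` bytes.
    pub fn b_block_bytes(&self, elem_bytes: usize) -> usize {
        self.kc * self.nc * elem_bytes
    }

    /// Number of accumulators in one MR x NR microkernel tile.
    pub fn tile_elems(&self) -> usize {
        self.mr * self.nr
    }
}
```

With the table's values and 4-byte `f32` elements, each block of A or B occupies 256 × 512 × 4 = 524,288 bytes (512 KiB), and the 8 × 8 tile holds 64 accumulators, which is eight 256-bit AVX2 registers of eight `f32` lanes each. Whether those blocks fit a given cache level depends on the target CPU.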

§Packing

Before computation, matrices are packed into contiguous buffers:

  • pack_a: Packs an MC×KC block of A for sequential microkernel access
  • pack_b: Packs a KC×NC block of B for efficient broadcasting

Packing reduces TLB misses and gives the microkernel contiguous, unit-stride operands, which is what makes SIMD loads efficient.
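
As an illustration of what packing a block of A can look like, the sketch below copies an MC×KC block into MR-row panels, the layout conventionally used in BLIS-style kernels. The function name `pack_a_sketch`, its signature, the row-major input, and the choice to pad partial panels with −∞ (the max-plus zero) are all assumptions made for this example; the actual pack_a may differ.

```rust
/// Packs an `mc x kc` block of a row-major matrix (row stride `lda`) into
/// panels of `mr` rows. Inside a panel, the `mr` entries belonging to one
/// depth index are adjacent, so a kernel can read them sequentially.
/// Rows past `mc` in the last panel are padded with f32::NEG_INFINITY.
pub fn pack_a_sketch(a: &[f32], lda: usize, mc: usize, kc: usize, mr: usize) -> Vec<f32> {
    let panels = (mc + mr - 1) / mr; // ceil(mc / mr)
    let mut packed = Vec::with_capacity(panels * mr * kc);
    for panel in 0..panels {
        let row0 = panel * mr;
        for p in 0..kc {
            for r in 0..mr {
                let i = row0 + r;
                packed.push(if i < mc {
                    a[i * lda + p]
                } else {
                    f32::NEG_INFINITY
                });
            }
        }
    }
    packed
}
```

Padding with the semiring's zero means a kernel can always process full MR-row panels: the padded rows never win a max and are simply not written back.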

§Microkernel

The innermost loop executes an MR×NR tile computation:

for each k in packed_k:
    C[0:MR, 0:NR] = C ⊕ (A_col[0:MR] ⊗ B_row[0:NR])

The Microkernel trait abstracts over portable and SIMD implementations.
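
A scalar version of the tile update can be written as follows. This is a sketch under stated assumptions, not the crate's Microkernel trait: it assumes max-plus (⊕ = max, ⊗ = +), `f32` elements, and packed panels in which the MR entries of A and the NR entries of B for each depth index are contiguous. The name `maxplus_microkernel` is hypothetical.

```rust
/// Updates an MR x NR accumulator tile with `kc` rank-1 max-plus updates.
/// `a_panel` holds kc groups of MR values, `b_panel` holds kc groups of NR values.
pub fn maxplus_microkernel<const MR: usize, const NR: usize>(
    kc: usize,
    a_panel: &[f32],
    b_panel: &[f32],
    acc: &mut [[f32; NR]; MR],
) {
    assert!(a_panel.len() >= kc * MR);
    assert!(b_panel.len() >= kc * NR);
    for p in 0..kc {
        let a_col = &a_panel[p * MR..(p + 1) * MR];
        let b_row = &b_panel[p * NR..(p + 1) * NR];
        for i in 0..MR {
            for j in 0..NR {
                // acc[i][j] = acc[i][j] ⊕ (a_col[i] ⊗ b_row[j])
                acc[i][j] = acc[i][j].max(a_col[i] + b_row[j]);
            }
        }
    }
}
```

Making MR and NR const generics gives the compiler fixed trip counts for the two inner loops, which helps it unroll them and keep the tile in registers. A SIMD variant would replace the inner `j` loop with vector max and add instructions.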

§Module Contents

  • gemm: The main GEMM algorithm with blocking loops
  • kernel: Microkernel trait and portable implementation
  • packing: Matrix packing utilities
  • tiling: Cache tiling parameters and iterators
  • argmax: Argmax tracking for backpropagation
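
To illustrate what argmax tracking records, here is an unblocked sketch that returns, for every output element, both the max-plus value and the depth index k that produced it. It assumes max-plus and row-major `f32` storage; the function name and the tuple return type are hypothetical and do not reflect the GemmWithArgmax struct.

```rust
/// Naive max-plus GEMM that also records the winning depth index.
/// Returns (c, argmax), both m x n in row-major order.
/// Ties keep the smallest k, because the comparison is strict.
pub fn maxplus_gemm_with_argmax_sketch(
    m: usize,
    n: usize,
    k: usize,
    a: &[f32],
    b: &[f32],
) -> (Vec<f32>, Vec<usize>) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![f32::NEG_INFINITY; m * n];
    let mut argmax = vec![0usize; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                let cand = a[i * k + p] + b[p * n + j];
                if cand > c[i * n + j] {
                    c[i * n + j] = cand;
                    argmax[i * n + j] = p;
                }
            }
        }
    }
    (c, argmax)
}
```

In a backward pass for max-plus, the gradient arriving at C[i, j] is routed only to A[i, p] and B[p, j] with p = argmax[i, j], which is why the index is stored alongside the value.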

Structs§

  • BlockIterator: Iterator over blocks for the outer loop.
  • GemmWithArgmax: Result of GEMM with argmax tracking.
  • PortableMicrokernel: Portable (non-SIMD) microkernel implementation.
  • TilingParams: Tiling parameters for BLIS-style GEMM blocking.

Enums§

  • Layout: Matrix layout enumeration.
  • Transpose: Transpose specification.

Traits§

  • Microkernel: Trait for GEMM microkernels.
  • MicrokernelWithArgmax: Trait for microkernels that track argmax during computation.

Functions§

  • pack_a: Pack a panel of matrix A into a contiguous buffer.
  • pack_b: Pack a panel of matrix B into a contiguous buffer.
  • packed_a_size: Calculate packed buffer size for A.
  • packed_b_size: Calculate packed buffer size for B.
  • tropical_gemm_inner: Tropical GEMM with custom kernel and tiling parameters.
  • tropical_gemm_portable: Tropical GEMM (C = A ⊗ B).
  • tropical_gemm_with_argmax_inner: Tropical GEMM with argmax tracking and custom kernel.
  • tropical_gemm_with_argmax_portable: Tropical GEMM with argmax tracking.