tropical_gemm/core/mod.rs
//! Core tropical GEMM algorithms using BLIS-style blocking.
//!
//! This module provides the portable implementation of tropical matrix
//! multiplication, optimized for cache efficiency using the BLIS framework.
//!
//! # BLIS Algorithm Overview
//!
//! The BLIS (BLAS-like Library Instantiation Software) approach achieves
//! near-optimal performance through **hierarchical cache blocking**:
//!
//! ```text
//! ┌─────────────────────────────────────────────────────────────────┐
//! │ Loop 5: for jc in 0..N step NC  (partition columns of B)        │
//! │   Loop 4: for pc in 0..K step KC  (partition depth)             │
//! │     Pack B[pc:KC, jc:NC] → B̃  (fits in L3 cache)                │
//! │     Loop 3: for ic in 0..M step MC  (partition rows of A)       │
//! │       Pack A[ic:MC, pc:KC] → Ã  (fits in L2 cache)              │
//! │       Loop 2: for jr in 0..NC step NR  (register blocking)      │
//! │         Loop 1: for ir in 0..MC step MR  (microkernel)          │
//! │           microkernel(Ã[ir], B̃[jr], C[ic+ir, jc+jr])            │
//! └─────────────────────────────────────────────────────────────────┘
//! ```
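//!
//! As a rough illustration of the outer three loops, the sketch below runs a
//! blocked max-plus product on row-major slices. It is not this crate's
//! implementation: the function name, the tiny block sizes, and the choice of
//! the max-plus semiring (⊕ = `max`, ⊗ = `+`) are assumptions, and packing and
//! register blocking are omitted (Loops 2 and 1 collapse to plain tile loops).
//!
//! ```rust
//! // Hypothetical block sizes, kept small so edge tiles are exercised.
//! const MC: usize = 4;
//! const NC: usize = 4;
//! const KC: usize = 2;
//!
//! /// c[i][j] = max(c[i][j], max over p of a[i][p] + b[p][j]), all row-major.
//! fn blocked_max_plus(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
//!     for jc in (0..n).step_by(NC) {
//!         let nc = NC.min(n - jc); // Loop 5: column block of B and C
//!         for pc in (0..k).step_by(KC) {
//!             let kc = KC.min(k - pc); // Loop 4: depth block
//!             for ic in (0..m).step_by(MC) {
//!                 let mc = MC.min(m - ic); // Loop 3: row block of A and C
//!                 for i in ic..ic + mc {
//!                     for j in jc..jc + nc {
//!                         let mut acc = c[i * n + j];
//!                         for p in pc..pc + kc {
//!                             acc = acc.max(a[i * k + p] + b[p * n + j]);
//!                         }
//!                         c[i * n + j] = acc;
//!                     }
//!                 }
//!             }
//!         }
//!     }
//! }
//! ```
//!
//! The caller is expected to initialise `c` with `f32::NEG_INFINITY`, the
//! additive identity of max-plus, when no prior accumulation is wanted.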
//!
//! # Cache Tiling Parameters
//!
//! The [`TilingParams`] struct controls blocking sizes:
//!
//! | Parameter | Purpose              | Typical Value (f32 AVX2) |
//! |-----------|----------------------|--------------------------|
//! | MC        | Rows per L2 block    | 256                      |
//! | NC        | Columns per L3 block | 256                      |
//! | KC        | Depth per block      | 512                      |
//! | MR        | Microkernel rows     | 8                        |
//! | NR        | Microkernel columns  | 8                        |
//!
//! # Packing
//!
//! Before computation, matrices are **packed** into contiguous buffers:
//!
//! - [`pack_a`]: Packs an `MC×KC` block of A for sequential microkernel access
//! - [`pack_b`]: Packs a `KC×NC` block of B for efficient broadcasting
//!
//! Packing reduces TLB and cache misses by making each block contiguous, and
//! gives the microkernel unit-stride data that is straightforward to vectorize
//! with SIMD.
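//!
//! The layout below is one common way to pack A, shown only as a sketch. The
//! function name `pack_a_sketch`, its signature, the value of `MR`, and the
//! padding rule are assumptions, not the behaviour of [`pack_a`]. Rows are
//! grouped into panels of `MR`, and within a panel the `MR` entries for each
//! depth index are stored next to each other.
//!
//! ```rust
//! // Hypothetical microkernel row count.
//! const MR: usize = 2;
//!
//! /// Packs an `mc x kc` block of row-major `a` (leading dimension `lda`).
//! /// Rows past `mc` are padded with -inf, the max-plus additive identity,
//! /// so padded rows never win a `max`.
//! fn pack_a_sketch(a: &[f32], lda: usize, mc: usize, kc: usize) -> Vec<f32> {
//!     let panels = (mc + MR - 1) / MR;
//!     let mut packed = Vec::with_capacity(panels * MR * kc);
//!     for panel in 0..panels {
//!         for p in 0..kc {
//!             for r in 0..MR {
//!                 let i = panel * MR + r;
//!                 packed.push(if i < mc { a[i * lda + p] } else { f32::NEG_INFINITY });
//!             }
//!         }
//!     }
//!     packed
//! }
//! ```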
//!
//! # Microkernel
//!
//! The innermost loop executes an `MR×NR` tile computation:
//!
//! ```text
//! for each k in packed_k:
//!     C[0:MR, 0:NR] = C ⊕ (A_col[0:MR] ⊗ B_row[0:NR])
//! ```
//!
//! The [`Microkernel`] trait abstracts over portable and SIMD implementations.
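//!
//! A scalar version of that update might look like the following. This is a
//! sketch under assumptions, not the [`Microkernel`] trait's actual interface:
//! the function name, the tile sizes, the panel layout (`MR` values of A and
//! `NR` values of B per depth index), and the max-plus semiring are all
//! illustrative choices.
//!
//! ```rust
//! // Hypothetical register tile sizes.
//! const MR: usize = 2;
//! const NR: usize = 2;
//!
//! /// Updates an `MR x NR` row-major tile `c` with `kc` tropical rank-1 updates.
//! fn microkernel_sketch(kc: usize, a_panel: &[f32], b_panel: &[f32], c: &mut [f32; MR * NR]) {
//!     let mut acc = *c; // keep the tile in local accumulators
//!     for p in 0..kc {
//!         let a_col = &a_panel[p * MR..(p + 1) * MR];
//!         let b_row = &b_panel[p * NR..(p + 1) * NR];
//!         for i in 0..MR {
//!             for j in 0..NR {
//!                 // ⊗ is `+`, ⊕ is `max`
//!                 acc[i * NR + j] = acc[i * NR + j].max(a_col[i] + b_row[j]);
//!             }
//!         }
//!     }
//!     *c = acc;
//! }
//! ```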
//!
//! # Module Contents
//!
//! - [`gemm`](gemm): The main GEMM algorithm with blocking loops
//! - [`kernel`](kernel): Microkernel trait and portable implementation
//! - [`packing`](packing): Matrix packing utilities
//! - [`tiling`](tiling): Cache tiling parameters and iterators
//! - [`argmax`](argmax): Argmax tracking for backpropagation

mod argmax;
mod gemm;
mod kernel;
mod packing;
mod tiling;

pub use argmax::GemmWithArgmax;
pub use gemm::{
    tropical_gemm_inner, tropical_gemm_portable, tropical_gemm_with_argmax_inner,
    tropical_gemm_with_argmax_portable,
};
pub use kernel::{Microkernel, MicrokernelWithArgmax, PortableMicrokernel};
pub use packing::{pack_a, pack_b, packed_a_size, packed_b_size, Layout, Transpose};
pub use tiling::{BlockIterator, TilingParams};