tropical_gemm/simd/mod.rs
//! SIMD-optimized microkernels for tropical GEMM.
//!
//! This module provides architecture-specific SIMD implementations of the
//! microkernel, which is the innermost loop of the BLIS-style GEMM algorithm.
//!
//! # Supported Architectures
//!
//! | Architecture | Instruction Set | Register Width | Supported Types |
//! |--------------|-----------------|----------------|-----------------|
//! | x86_64       | AVX-512         | 512-bit        | f32, f64        |
//! | x86_64       | AVX2            | 256-bit        | f32, f64        |
//! | x86_64       | SSE4.1          | 128-bit        | f32, f64        |
//! | aarch64      | NEON            | 128-bit        | f32             |
//! | Any          | Portable        | Scalar         | All types       |
//!
//! # Runtime Dispatch
//!
//! At runtime, [`tropical_gemm_dispatch`] selects the best kernel available on the host CPU:
//!
//! ```rust,ignore
//! // For example, this runs the AVX2 kernel on CPUs that support AVX2
//! tropical_gemm_dispatch::<MaxPlus<f32>>(...);
//! ```
//!
//! The dispatch mechanism:
//! 1. [`simd_level()`] detects CPU features at runtime.
//! 2. The [`KernelDispatch`] trait routes to the appropriate implementation.
//! 3. If no SIMD support is detected, the portable kernel is used as a fallback.
//!
//! # Microkernel Design
//!
//! For tropical MaxPlus f32 with AVX2 (8-wide vectors):
//!
//! ```text
//! // MR×NR = 8×8 output tile
//! for k in 0..KC:
//!     a_vec = load_8xf32(packed_a)           // 8 elements from A column
//!     for j in 0..8:
//!         b_scalar = broadcast(packed_b[j])  // 1 element from B row
//!         prod = a_vec + b_scalar            // tropical multiply
//!         c[j] = max(c[j], prod)             // tropical accumulate
//! ```
//!
//! # Module Contents
//!
//! - [`detect`](detect): CPU feature detection ([`SimdLevel`])
//! - [`dispatch`](dispatch): Runtime kernel selection ([`KernelDispatch`])
//! - [`kernels`](kernels): Architecture-specific microkernel implementations

mod detect;
pub mod dispatch;
pub mod kernels;

pub use detect::{simd_level, SimdLevel};
pub use dispatch::{tropical_gemm_dispatch, KernelDispatch};
pub use kernels::*;