SIMD Kernels

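The microkernel, the innermost routine that updates one MR×NR tile of the output, is vectorized with SIMD instructions. The tile size depends on the instruction set available on the CPU.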

Supported Architectures

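| Architecture | Instruction Set | Vector Width | f32 MR×NR | f64 MR×NR |
|--------------|-----------------|--------------|-----------|-----------|
| x86_64       | AVX-512         | 512-bit      | 16×16     | 8×8       |
| x86_64       | AVX2            | 256-bit      | 8×8       | 4×4       |
| x86_64       | SSE4.1          | 128-bit      | 4×4       | 2×2       |
| aarch64      | NEON            | 128-bit      | 4×4       | 2×2       |
| Any          | Portable        | Scalar       | 4×4       | 4×4       |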
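For the SIMD rows, the tile edge in the table matches the number of elements that fit in one vector register (for example, 256 bits / 32 bits = 8 lanes for f32 on AVX2). The arithmetic can be sketched as below; the `lanes` helper and the constant names are hypothetical and not part of the crate.

```rust
/// Hypothetical helper (not part of tropical_gemm): number of elements of
/// `elem_bits` width that fit in a register of `vector_bits` width.
const fn lanes(vector_bits: usize, elem_bits: usize) -> usize {
    vector_bits / elem_bits
}

// Tile edges for some SIMD rows of the table, derived from the vector width.
const AVX512_F32: usize = lanes(512, 32); // 16×16 tile
const AVX2_F32: usize = lanes(256, 32); // 8×8 tile
const AVX2_F64: usize = lanes(256, 64); // 4×4 tile
const NEON_F64: usize = lanes(128, 64); // 2×2 tile
```

The portable row does not follow this rule: it uses a fixed 4×4 tile for both f32 and f64.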

Runtime Detection

CPU features are detected at runtime, and simd_level() reports the level that was selected:

use tropical_gemm::{simd_level, SimdLevel};

fn main() {
    // Report which SIMD level was selected for this CPU
    match simd_level() {
        SimdLevel::Avx512 => println!("Using AVX-512"),
        SimdLevel::Avx2   => println!("Using AVX2"),
        SimdLevel::Sse41  => println!("Using SSE4.1"),
        SimdLevel::Neon   => println!("Using NEON"),
        SimdLevel::None   => println!("Using portable"),
    }
}
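How the crate performs the detection is not shown here. One plausible approach, sketched below, uses the standard library's feature-detection macros and probes from the widest instruction set down. The `SimdLevel` enum is a local stand-in that copies the variant names from the example above, and `detect_simd_level` is a hypothetical name.

```rust
/// Local stand-in for the crate's `SimdLevel` (variant names copied from the
/// example above; the real definition may differ).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdLevel {
    Avx512,
    Avx2,
    Sse41,
    Neon,
    None,
}

/// Hypothetical detector: probe from the widest instruction set down and
/// return the first one the running CPU supports.
pub fn detect_simd_level() -> SimdLevel {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return SimdLevel::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return SimdLevel::Avx2;
        }
        if std::arch::is_x86_feature_detected!("sse4.1") {
            return SimdLevel::Sse41;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return SimdLevel::Neon;
        }
    }
    // No supported SIMD extension found: use the portable kernel.
    SimdLevel::None
}
```

Because the available features do not change while the process runs, a real implementation would likely cache the result (for example in a `std::sync::OnceLock`) rather than probe on every call.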

Microkernel Design

For MaxPlus f32 with AVX2 (8-wide vectors):

// Simplified inner loop of the 8×8 microkernel (setup and the final store to C omitted).
// a_ptr, b_ptr: *const f32 into the packed A and B panels.
// c_vec: [__m256; 8], one register per output column.
for _ in 0..KC {
    // Load 8 elements from packed A
    let a_vec = _mm256_loadu_ps(a_ptr);

    // For each column in the 8-column output tile
    for j in 0..8 {
        // Broadcast one scalar from packed B to all 8 lanes
        let b_bcast = _mm256_broadcast_ss(&*b_ptr.add(j));

        // Tropical multiply: a + b (element-wise)
        let prod = _mm256_add_ps(a_vec, b_bcast);

        // Tropical accumulate: max(c, prod)
        c_vec[j] = _mm256_max_ps(c_vec[j], prod);
    }

    a_ptr = a_ptr.add(8); // Next column in packed A
    b_ptr = b_ptr.add(8); // Next row in packed B
}
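A scalar version of the same tile update can serve as a reference when checking a vectorized kernel. The sketch below is not taken from the crate: the function name is hypothetical, and it assumes the packed layout implied by the loop above (8 values of A and 8 values of B per step of k, with `c[j]` holding output column j).

```rust
/// Hypothetical scalar reference for the 8×8 MaxPlus f32 tile update.
/// Assumed layout: `a[k * 8 + i]` is row i of the packed A panel at step k,
/// `b[k * 8 + j]` is column j of the packed B panel at step k,
/// and `c[j][i]` is row i of output column j.
fn maxplus_tile_8x8_ref(kc: usize, a: &[f32], b: &[f32], c: &mut [[f32; 8]; 8]) {
    assert!(a.len() >= kc * 8 && b.len() >= kc * 8);
    for k in 0..kc {
        let a_col = &a[k * 8..k * 8 + 8];
        let b_row = &b[k * 8..k * 8 + 8];
        for j in 0..8 {
            for i in 0..8 {
                // Tropical multiply is +, tropical accumulate is max.
                c[j][i] = c[j][i].max(a_col[i] + b_row[j]);
            }
        }
    }
}
```

The accumulator tile should start at `f32::NEG_INFINITY`, the identity for max, unless the kernel is meant to accumulate into existing values of C.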

Semiring-Specific Operations

| Semiring | Tropical Mul  | Tropical Add  |
|----------|---------------|---------------|
| MaxPlus  | _mm256_add_ps | _mm256_max_ps |
| MinPlus  | _mm256_add_ps | _mm256_min_ps |
| MaxMul   | _mm256_mul_ps | _mm256_max_ps |
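The same table can be read at the scalar level: each semiring swaps in a different pair of operations, and the rest of the kernel is unchanged. The sketch below illustrates this with a hypothetical `ScalarSemiring` trait; the trait, the type names, and the `dot` helper are illustrative and may not match the crate's own definitions. The `ZERO` values are the usual identities for tropical add, with MaxMul assuming non-negative entries.

```rust
/// Hypothetical scalar view of the semiring table (not the crate's API).
trait ScalarSemiring {
    /// Identity for tropical add: the value an accumulator starts from.
    const ZERO: f32;
    fn mul(a: f32, b: f32) -> f32;
    fn add(a: f32, b: f32) -> f32;
}

struct MaxPlus;
struct MinPlus;
struct MaxMul;

impl ScalarSemiring for MaxPlus {
    const ZERO: f32 = f32::NEG_INFINITY;
    fn mul(a: f32, b: f32) -> f32 { a + b }
    fn add(a: f32, b: f32) -> f32 { a.max(b) }
}

impl ScalarSemiring for MinPlus {
    const ZERO: f32 = f32::INFINITY;
    fn mul(a: f32, b: f32) -> f32 { a + b }
    fn add(a: f32, b: f32) -> f32 { a.min(b) }
}

impl ScalarSemiring for MaxMul {
    // Assumes non-negative entries, for which 0 is the identity of max.
    const ZERO: f32 = 0.0;
    fn mul(a: f32, b: f32) -> f32 { a * b }
    fn add(a: f32, b: f32) -> f32 { a.max(b) }
}

/// Tropical dot product: fold "mul" results together with "add".
fn dot<S: ScalarSemiring>(x: &[f32], y: &[f32]) -> f32 {
    x.iter()
        .zip(y)
        .fold(S::ZERO, |acc, (&a, &b)| S::add(acc, S::mul(a, b)))
}
```

For example, with `x = [1, 2, 3]` and `y = [4, 0, 1]`, the MaxPlus dot product is max(5, 2, 4) = 5, the MinPlus one is min(5, 2, 4) = 2, and the MaxMul one is max(4, 0, 3) = 4.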

Dispatch Mechanism

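The KernelDispatch trait routes each call to the appropriate implementation. In the MaxPlus f32 excerpt below, AVX2 and AVX-512 CPUs both use the AVX2 kernel, and every other level falls back to the portable kernel: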

impl KernelDispatch for TropicalMaxPlus<f32> {
    unsafe fn dispatch_gemm(...) {
        match simd_level() {
            // AVX-512 CPUs also support AVX2, so both use the AVX2 kernel
            SimdLevel::Avx2 | SimdLevel::Avx512 => {
                tropical_gemm_inner::<Self, Avx2MaxPlusF32>(...);
            }
            // Everything else uses the portable fallback
            _ => {
                tropical_gemm_inner::<Self, PortableMicrokernel>(...);
            }
        }
    }
}
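The pattern in the excerpt above is a single runtime `match` at the entry point that selects a kernel type, followed by a generic driver that is compiled separately for each kernel. A self-contained sketch of that shape follows; every name in it (`Level`, `Microkernel`, `Avx2Kernel`, `PortableKernel`, `gemm_inner`, `dispatch`) is a hypothetical stand-in, and the driver only reports which kernel it was instantiated with instead of doing any arithmetic.

```rust
/// Hypothetical stand-in for the detected SIMD level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Avx512,
    Avx2,
    Other,
}

/// Hypothetical kernel trait: each kernel advertises its name and tile shape.
trait Microkernel {
    const NAME: &'static str;
    const MR: usize;
    const NR: usize;
}

struct Avx2Kernel;
struct PortableKernel;

impl Microkernel for Avx2Kernel {
    const NAME: &'static str = "avx2";
    const MR: usize = 8;
    const NR: usize = 8;
}

impl Microkernel for PortableKernel {
    const NAME: &'static str = "portable";
    const MR: usize = 4;
    const NR: usize = 4;
}

/// Generic driver, monomorphized once per kernel type, so the kernel choice
/// is resolved at compile time inside the loops.
fn gemm_inner<K: Microkernel>() -> (&'static str, usize, usize) {
    (K::NAME, K::MR, K::NR)
}

/// One runtime branch at the entry point picks the kernel type.
fn dispatch(level: Level) -> (&'static str, usize, usize) {
    match level {
        Level::Avx2 | Level::Avx512 => gemm_inner::<Avx2Kernel>(),
        Level::Other => gemm_inner::<PortableKernel>(),
    }
}
```

The appeal of this design is that the branch on the CPU level is paid once per GEMM call, while the hot loops see a concrete kernel type that the compiler can inline.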

Code Location

  • simd/detect.rs: CPU feature detection
  • simd/dispatch.rs: Runtime dispatch trait
  • simd/kernels/avx2.rs: AVX2 implementations
  • simd/kernels/neon.rs: NEON implementations
  • simd/kernels/portable.rs: Fallback implementation