GPU Acceleration

tropical-gemm-cuda provides NVIDIA GPU acceleration via CUDA.

Requirements

  • NVIDIA GPU (compute capability 3.5+)
  • CUDA Toolkit 11.0 or later
  • nvcc in PATH
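Before building, you can confirm that nvcc is visible from your environment. A minimal sketch using only the Rust standard library (this only checks PATH visibility, not the toolkit version):

```rust
use std::process::Command;

/// Returns the nvcc version banner if nvcc is on PATH, else None.
fn find_nvcc() -> Option<String> {
    let out = Command::new("nvcc").arg("--version").output().ok()?;
    if out.status.success() {
        Some(String::from_utf8_lossy(&out.stdout).into_owned())
    } else {
        None
    }
}

fn main() {
    match find_nvcc() {
        Some(banner) => println!("nvcc found:\n{banner}"),
        None => println!("nvcc not found in PATH"),
    }
}
```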

Basic Usage

use tropical_gemm::{MatRef, MaxPlus};
use tropical_gemm_cuda::{CudaContext, GpuMat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create CUDA context (compiles kernels on first use)
    let ctx = CudaContext::new()?;

    // Prepare CPU data
    let a_data = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b_data = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];

    let a = MatRef::<MaxPlus<f32>>::from_slice(&a_data, 2, 3);
    let b = MatRef::<MaxPlus<f32>>::from_slice(&b_data, 3, 2);

    // Upload to GPU
    let a_gpu = GpuMat::from_matref(&ctx, &a)?;
    let b_gpu = GpuMat::from_matref(&ctx, &b)?;

    // Compute on GPU
    let c_gpu = a_gpu.matmul(&ctx, &b_gpu)?;

    // Download result
    let c = c_gpu.to_mat(&ctx)?;

    println!("C[0,0] = {}", c.get_value(0, 0));
    Ok(())
}
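As a sanity check for GPU results, the max-plus product computed above can be reproduced on the CPU in a few lines of plain Rust. This is a reference sketch independent of the crate, computing C[i][j] = max over k of (A[i][k] + B[k][j]) on row-major slices:

```rust
/// Reference max-plus matrix product on row-major slices.
/// `a` is m x kd, `b` is kd x n; returns the m x n result.
fn maxplus_matmul(a: &[f32], b: &[f32], m: usize, kd: usize, n: usize) -> Vec<f32> {
    // NEG_INFINITY is the additive identity of the max-plus semiring.
    let mut c = vec![f32::NEG_INFINITY; m * n];
    for i in 0..m {
        for j in 0..n {
            for k in 0..kd {
                let cand = a[i * kd + k] + b[k * n + j];
                if cand > c[i * n + j] {
                    c[i * n + j] = cand;
                }
            }
        }
    }
    c
}

fn main() {
    // Same data as the GPU example: a is 2x3, b is 3x2.
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let c = maxplus_matmul(&a, &b, 2, 3, 2);
    // C[0][0] = max(1+1, 2+3, 3+5) = 8
    println!("{c:?}");
}
```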

Context Reuse

The CudaContext compiles CUDA kernels on first use. Always reuse contexts to avoid repeated compilation:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
// GOOD: create the context once and reuse it across calls
let ctx = CudaContext::new()?;
for _ in 0..100 {
    let c = a_gpu.matmul(&ctx, &b_gpu)?;
}

// BAD: creates a new context (and recompiles kernels) each iteration
for _ in 0..100 {
    let ctx = CudaContext::new()?;  // Slow!
    let c = a_gpu.matmul(&ctx, &b_gpu)?;
}
Ok(())
}
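One common way to guarantee reuse is a lazily initialized, process-wide context. The sketch below shows the pattern with std::sync::OnceLock and a stand-in Context type; whether CudaContext itself is Sync (and therefore usable this way) is an assumption to verify against the crate:

```rust
use std::sync::OnceLock;

// Stand-in for CudaContext; assume construction is expensive
// (in the real crate, this is where kernels get compiled).
struct Context {
    id: u32,
}

impl Context {
    fn new() -> Context {
        Context { id: 42 }
    }
}

static CTX: OnceLock<Context> = OnceLock::new();

/// Returns the shared context, constructing it exactly once.
fn context() -> &'static Context {
    CTX.get_or_init(Context::new)
}

fn main() {
    // Both calls observe the same instance; Context::new ran at most once.
    assert!(std::ptr::eq(context(), context()));
    println!("context id = {}", context().id);
}
```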

GPU Argmax

For backpropagation with GPU computation:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
let ctx = CudaContext::new()?;
let a_gpu = GpuMat::from_matref(&ctx, &a)?;
let b_gpu = GpuMat::from_matref(&ctx, &b)?;

// Forward pass with argmax tracking
let result = a_gpu.matmul_argmax(&ctx, &b_gpu)?;

// Download values and argmax indices
let result_cpu = result.to_mat_with_argmax(&ctx)?;
let value = result_cpu.get_value(0, 0);
let k_idx = result_cpu.get_argmax(0, 0);

// Backward pass on GPU
let grad_c_gpu = GpuMat::from_matref(&ctx, &grad_c)?;
let grad_a_gpu = result.backward_a(&ctx, &grad_c_gpu)?;
let grad_b_gpu = result.backward_b(&ctx, &grad_c_gpu)?;
Ok(())
}
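The argmax indices drive the backward pass: under max-plus, the subgradient routes each grad_c[i][j] entirely to the single k that achieved the maximum, so grad_a[i][k] accumulates grad_c[i][j] over every j whose argmax was k (and symmetrically for grad_b). A CPU sketch of that routing, independent of the crate:

```rust
/// Forward max-plus product that also records, per output cell,
/// the index k that achieved the maximum.
fn maxplus_argmax(a: &[f32], b: &[f32], m: usize, kd: usize, n: usize) -> (Vec<f32>, Vec<usize>) {
    let mut c = vec![f32::NEG_INFINITY; m * n];
    let mut arg = vec![0usize; m * n];
    for i in 0..m {
        for j in 0..n {
            for k in 0..kd {
                let cand = a[i * kd + k] + b[k * n + j];
                if cand > c[i * n + j] {
                    c[i * n + j] = cand;
                    arg[i * n + j] = k;
                }
            }
        }
    }
    (c, arg)
}

/// Backward pass for A: route grad_c[i][j] to grad_a[i][argmax(i, j)].
fn backward_a_ref(grad_c: &[f32], arg: &[usize], m: usize, kd: usize, n: usize) -> Vec<f32> {
    let mut grad_a = vec![0.0f32; m * kd];
    for i in 0..m {
        for j in 0..n {
            grad_a[i * kd + arg[i * n + j]] += grad_c[i * n + j];
        }
    }
    grad_a
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2x3
    let b = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]; // 3x2
    let (c, arg) = maxplus_argmax(&a, &b, 2, 3, 2);
    println!("values = {c:?}, argmax = {arg:?}");
    let grad_a = backward_a_ref(&[1.0; 4], &arg, 2, 3, 2);
    println!("grad_a = {grad_a:?}");
}
```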

Batched GPU Operations

Process multiple matrices efficiently:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use tropical_gemm::{Mat, MaxPlus};
use tropical_gemm_cuda::{CudaContext, GpuMat};

let ctx = CudaContext::new()?;

// Upload batch to GPU
let a_batch: Vec<Mat<MaxPlus<f32>>> = /* ... */;
let b_batch: Vec<Mat<MaxPlus<f32>>> = /* ... */;

let a_gpu_batch = GpuMat::from_mats(&ctx, &a_batch)?;
let b_gpu_batch = GpuMat::from_mats(&ctx, &b_batch)?;

// Batched multiply
let c_gpu_batch = GpuMat::matmul_batched(&ctx, &a_gpu_batch, &b_gpu_batch)?;

// Download results
let c_batch = GpuMat::to_mats(&ctx, &c_gpu_batch)?;
Ok(())
}
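Batched kernels amortize launch overhead, and a single contiguous upload is cheaper than many small transfers. The sketch below packs same-shaped matrices into one buffer; the contiguous layout is an illustration of why batching helps, not the crate's documented internal format:

```rust
/// Pack a batch of same-shaped row-major matrices into one contiguous buffer,
/// with matrix i occupying elements [i*rows*cols .. (i+1)*rows*cols).
fn pack_batch(batch: &[Vec<f32>], rows: usize, cols: usize) -> Vec<f32> {
    let stride = rows * cols;
    let mut out = Vec::with_capacity(batch.len() * stride);
    for m in batch {
        assert_eq!(m.len(), stride, "all matrices in the batch must share one shape");
        out.extend_from_slice(m);
    }
    out
}

fn main() {
    let batch = vec![vec![1.0f32, 2.0], vec![3.0f32, 4.0]]; // two 1x2 matrices
    let packed = pack_batch(&batch, 1, 2);
    println!("{packed:?}");
}
```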

One-Shot API

For simple cases without context reuse:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use tropical_gemm::TropicalMaxPlus;
use tropical_gemm_cuda::tropical_matmul_gpu;

let a = vec![1.0f32; 64 * 64];
let b = vec![1.0f32; 64 * 64];

// One-shot GPU multiplication (creates a temporary context each call)
let c = tropical_matmul_gpu::<TropicalMaxPlus<f32>>(&a, 64, 64, &b, 64)?;
Ok(())
}
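The slice-based entry points shown in this chapter (from_slice and the one-shot call) take flat row-major data with explicit dimensions. A quick reminder of that indexing convention:

```rust
/// Row-major index: element (row, col) of a matrix with `cols` columns
/// lives at flat offset row*cols + col.
fn idx(row: usize, col: usize, cols: usize) -> usize {
    row * cols + col
}

fn main() {
    // The 2x3 matrix [[1,2,3],[4,5,6]] as a flat row-major slice.
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    // Element (1, 2) is 6.0, at flat offset 1*3 + 2 = 5.
    println!("a[1][2] = {}", a[idx(1, 2, 3)]);
}
```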

Performance Comparison

| Size | CPU SIMD  | GPU      | Speedup |
|------|-----------|----------|---------|
| 256  | 4.1 ms    | 0.032 ms | 128x    |
| 512  | 32.8 ms   | 0.086 ms | 381x    |
| 1024 | 262.3 ms  | 0.358 ms | 733x    |
| 2048 | 2091.6 ms | 2.510 ms | 833x    |

GPU becomes advantageous for matrices larger than ~256×256.