Blog – Vinay R Jumani

Profiling CUDA GEMM Kernels: 2D Block Tiling vs Vectorized Loads

June 7, 2026 · CUDA, Systems, HPC, Profiling · 20 min read

In this post, I profile and compare a 2D block-tiled GEMM kernel against a vectorized variant on an NVIDIA RTX A6000 GPU, analyzing memory coalescing, shared memory bank conflicts, and occupancy using Nsight Systems and Nsight Compute.

