Profiling CUDA GEMM Kernels: 2D Block Tiling vs Vectorized Loads

In this post, I profile and compare a 2D block-tiled GEMM kernel against a vectorized variant on an NVIDIA RTX A6000 GPU, analyzing memory coalescing, shared memory bank conflicts, and occupancy using Nsight Systems and Nsight Compute.
