Profiling CUDA GEMM Kernels: 2D Block Tiling vs Vectorized Loads
In this post, I profile a 2D block-tiled GEMM kernel and a vectorized-load variant on an NVIDIA RTX A6000 GPU, comparing their memory coalescing, shared-memory bank conflicts, and occupancy using Nsight Systems and Nsight Compute.