Hand-written CUDA Tensor Core GEMM kernels on Blackwell (sm_120) and Hopper (sm_90): raw mma.sync reaching 106% of the cuBLAS-TC kernel on sm_120, CUTLASS 3.x wgmma at 85.5% of nvjet on H100, and an FP16→FP8→MXFP4 precision ladder. Every number is reproducible from the committed bench data.
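For readers who want to check relative figures like the ones above, here is a minimal sketch of the usual arithmetic. It assumes the conventional 2·M·N·K operation count for a GEMM and that both kernels are timed on the same problem size. The function names and the timings in the usage section are hypothetical and are not taken from the repository's bench data.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one M x N x K GEMM in TFLOP/s.

    Uses the conventional count of 2*M*N*K operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def percent_of_baseline(kernel_seconds: float, baseline_seconds: float) -> float:
    """Kernel throughput as a percentage of a baseline on the same problem size.

    With identical M, N, K the operation counts cancel, so the ratio of
    throughputs equals the inverse ratio of the run times.
    """
    if kernel_seconds <= 0 or baseline_seconds <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * baseline_seconds / kernel_seconds


if __name__ == "__main__":
    # Hypothetical timings for illustration only; not measured results.
    m = n = k = 4096
    kernel_s, baseline_s = 1.0e-3, 1.25e-3
    print(f"kernel:   {gemm_tflops(m, n, k, kernel_s):.1f} TFLOP/s")
    print(f"baseline: {gemm_tflops(m, n, k, baseline_s):.1f} TFLOP/s")
    print(f"relative: {percent_of_baseline(kernel_s, baseline_s):.1f}% of baseline")
```

A value above 100% means the kernel finished faster than the baseline it is compared against. Median times over many timed launches are the usual input here, since single-launch timings on a GPU are noisy.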