Fix c_pointer mem leak #3352

Open
Difers wants to merge 1 commit into NVIDIA:main from Difers:fix_mem_leak

Difers commented Jun 26, 2026

Problem

c_pointers() on CUTLASS DSL scalar types (Int32, Float32, Float64, Float16, BFloat16, TFloat32, etc.) creates fresh ctypes objects on every
kernel invocation. In ML training loops with frequent kernel launches, these short-lived allocations interleave with long-lived framework objects.
The result is heap fragmentation and monotonic growth of the process RSS, which eventually ends in an out-of-memory (OOM) failure.

Observed in production: RSS grew by ~0.3 GB per step during large-model training, leading to OOM within hundreds of steps. MoE architectures,
which launch many expert kernels per step, are particularly affected.

Root Cause

Each call to c_pointers() creates 3 ctypes objects (the c_int/c_float value, the pointer() to it, and the cast() result) that are discarded
immediately after use. In a training loop with multiple kernel calls per step, this produces hundreds to thousands of short-lived objects per
step, and these fragment the heap. tracemalloc does not reveal the problem because the ctypes objects are already freed by the time a snapshot
is taken; the damage is heap fragmentation, not retained Python objects.
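
The allocation pattern described above can be sketched in plain ctypes. This is only an illustration, not the CUTLASS DSL source: the function name `uncached_c_pointers` and the choice of `c_int32` are assumptions made for the example.

```python
import ctypes


def uncached_c_pointers(value: int):
    """Hypothetical stand-in for an uncached c_pointers() on an Int32-like scalar."""
    c_value = ctypes.c_int32(value)               # object 1: the ctypes scalar
    typed_ptr = ctypes.pointer(c_value)           # object 2: a typed pointer to it
    void_ptr = ctypes.cast(typed_ptr, ctypes.c_void_p)  # object 3: the untyped pointer
    return [void_ptr]


# Two calls with the same value still build two independent sets of objects.
first = uncached_c_pointers(7)
second = uncached_c_pointers(7)
print(first[0] is second[0])
```

Each call allocates all three objects again even though the scalar value is unchanged, which is the churn the PR attributes the fragmentation to.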

Fix

Add a per-type _cptr_cache dictionary that caches c_pointers() results keyed by scalar value. On subsequent calls with the same value, return
the cached result directly instead of allocating new ctypes objects.
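
A minimal sketch of this caching idea, assuming a simplified scalar class. `Int32Like` is a hypothetical stand-in rather than the actual CUTLASS DSL type, and the real patch may differ in how the cache is declared and populated; only the `_cptr_cache` name and the keyed-by-value behavior come from the description above.

```python
import ctypes


class Int32Like:
    """Hypothetical stand-in for a DSL scalar type (not the CUTLASS class)."""

    # Per-type cache: scalar value -> previously built pointer list.
    _cptr_cache: dict = {}

    def __init__(self, value: int):
        self.value = value

    def c_pointers(self):
        cached = self._cptr_cache.get(self.value)
        if cached is None:
            # First time this value is seen: build the ctypes objects once.
            c_value = ctypes.c_int32(self.value)
            void_ptr = ctypes.cast(ctypes.pointer(c_value), ctypes.c_void_p)
            cached = [void_ptr]
            self._cptr_cache[self.value] = cached
        return cached


a = Int32Like(5).c_pointers()
b = Int32Like(5).c_pointers()
print(a is b)  # the second call reuses the cached list
```

In this sketch the cache is unbounded, so a workload with many distinct scalar values would grow it without limit; whether the actual change bounds the cache is not stated here. A value-keyed dictionary also treats `0.0` and `-0.0` as the same key, which matters for the floating-point types.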

Related issue #3351

Difers changed the title from "Fix mem c_pointer leak" to "Fix c_pointer mem c_pointer leak" Jun 29, 2026
Difers changed the title from "Fix c_pointer mem c_pointer leak" to "Fix c_pointer mem leak" Jun 29, 2026