Fix c_pointer mem leak #3352
Open
Difers wants to merge 1 commit into
Problem
c_pointers() on CUTLASS DSL scalar types (Int32, Float32, Float64, Float16, BFloat16, TFloat32, etc.) creates fresh ctypes objects on every kernel invocation. In ML training loops with frequent kernel launches, these short-lived allocations interleave with long-lived framework objects and fragment the heap. Process RSS then grows monotonically until the process hits OOM.
Observed in production: ~0.3 GB/step RSS growth in large model training, leading to OOM within hundreds of steps. MoE architectures, which launch many expert kernels per step, are particularly affected.
Root Cause
Each call to c_pointers() creates three ctypes objects (a c_int or c_float value, the result of pointer(), and the result of cast()) and discards them immediately after use. In a training loop with multiple kernel calls per step, this produces hundreds to thousands of short-lived objects per step, which fragment the heap. tracemalloc does not show the problem because the ctypes objects are freed before snapshots are taken; the damage is heap fragmentation rather than retained objects.
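A minimal sketch of the allocation pattern described above, for reviewers who want to see it in isolation. The function name `scalar_c_pointers` and the choice of `c_int32` are assumptions for illustration; this is not the CUTLASS DSL source, only the same sequence of ctypes calls.

```python
import ctypes


def scalar_c_pointers(value: int):
    """Hypothetical stand-in for c_pointers() on an Int32-like scalar."""
    c_value = ctypes.c_int32(value)               # object 1: the boxed scalar
    ptr = ctypes.pointer(c_value)                 # object 2: typed pointer to it
    void_ptr = ctypes.cast(ptr, ctypes.c_void_p)  # object 3: untyped pointer
    return [void_ptr]


# Every call allocates a new set of objects, even for the same value.
first = scalar_c_pointers(7)
second = scalar_c_pointers(7)
print(first[0] is second[0])  # False: nothing is reused between calls
```

Each launch that passes a scalar argument repeats this sequence, so the number of transient objects scales with the number of kernel launches per step.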
Fix
Add a per-type _cptr_cache dictionary that caches c_pointers() results keyed by scalar value. On subsequent calls with the same value, return the cached result directly.
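The caching approach can be sketched roughly as follows. The class `Int32Sketch` is hypothetical and stands in for a DSL scalar type; only the `_cptr_cache` name and the keyed-by-value behavior come from the description above, and the actual patch may differ in detail.

```python
import ctypes


class Int32Sketch:
    """Hypothetical stand-in for a DSL scalar type (not the CUTLASS class)."""

    # Per-type cache: scalar value -> previously built c_pointers() result.
    _cptr_cache = {}

    def __init__(self, value: int):
        self.value = value

    def c_pointers(self):
        cache = type(self)._cptr_cache
        cached = cache.get(self.value)
        if cached is None:
            # Build the ctypes objects once; the cache keeps them alive.
            c_value = ctypes.c_int32(self.value)
            cached = [ctypes.cast(ctypes.pointer(c_value), ctypes.c_void_p)]
            cache[self.value] = cached
        return cached


# Repeated launches with the same scalar reuse one cached result.
assert Int32Sketch(5).c_pointers() is Int32Sketch(5).c_pointers()
```

Two points may be worth checking in review of any cache of this shape: it grows without bound if the scalar value changes on every call (for example, a scheduled learning rate), and for floating-point types `0.0` and `-0.0` compare equal as dictionary keys, so they would share one entry.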
Related issue: #3351