- #15044 · laikhtewari opened on Jun 6, 2026 · 1
- #3148 · juney-nvidia opened on Mar 29, 2025 · 5
- #3124 · juney-nvidia opened on Mar 27, 2025 · 11
Issues
is:issue state:open
Open issues in NVIDIA/TensorRT-LLM:
- #15747 · SuperCompress: Free open-source LLM prompt compression - cut token costs by ~65% · Labels: feature request
- #15735 · [Feature]: General cross-instance KV-cache transfer/staging (kv_transfer) · Labels: Disaggregated serving
- #15715 · [Bug]: Clamp very small non-zero temperature values to avoid numerical instability · Labels: bug, Decoding/Sampling
- #15706 · [Bug]: Bad Words Missing Space-Prefixed Token Variants for BPE Tokenizers (e.g., GPT-2, Qwen) · Labels: bug
- #15684 · [DeepSeek-V4] Overlap scheduler + chunked prefill deadlocks in sparse-MLA ctx metadata (device→host sync at _compute_ctx_compressed_position_ids) · Labels: Pytorch
- #15673 · [Bug]: FP8 linear cuda_scaled_mm fast path silently disabled on SM121 (DGX Spark GB10) · Labels: Customized kernels
- #15639 · [DeepSeek-V4] Async CUDA illegal memory access in MTP spec-decode sampler under sustained load with per-size CUDA graphs · Labels: Customized kernels, Speculative Decoding
- #15638 · (title not captured)
- #15634 · [Bug] Qwen3-Next (Gated-DeltaNet) fails at warmup on consumer Blackwell sm120 (RTX PRO 6000) - TRT-LLM 1.3.0rc19 bundles flashinfer 0.6.12 (Hopper-only GDN); please bump to >=0.6.13 · Labels: Customized kernels
- #15610 · [Bug]: Warp Illegal Address / MMU Fault (Xid 13) during prefill when running GLM-5.2-NVFP4 on NVIDIA B200 GPUs · Labels: bug, Customized kernels
- #15565 · [AutoDeploy] Re-enable SSM replay for Nemotron-Super MTP (replay kernel illegal memory access at CUDA-graph capture on Blackwell) · Labels: AutoDeploy, Speculative Decoding
- #15537 · [Bug]: XQA multi_block_mode crashes with CUDA_ERROR_INVALID_VALUE under concurrent inference (v1.0.0) · Labels: bug, Customized kernels, Inference runtime