Issues

#15747

· arjunkshah opened

on Jun 30, 2026

[Feature]: General cross-instance KV-cache transfer/staging (kv_transfer)

Disaggregated serving

#15735

· nafis271 opened

on Jun 29, 2026

[Bug]: Clamp very small non-zero temperature values to avoid numerical instability

bug

Decoding/Sampling

#15715

· chfeng-cs opened

on Jun 29, 2026

[Bug]: Bad Words Missing Space-Prefixed Token Variants for BPE Tokenizers (e.g., GPT-2, Qwen)

bug

#15706

· chfeng-cs opened

on Jun 29, 2026

[DeepSeek-V4] Overlap scheduler + chunked prefill deadlocks in sparse-MLA ctx metadata (device→host sync at _compute_ctx_compressed_position_ids)

Pytorch

#15684

· Thachnh opened

on Jun 27, 2026

[Bug]: FP8 linear cuda_scaled_mm fast path silently disabled on SM121 (DGX Spark GB10)

Customized kernels

#15673

· mihai-chiorean opened

on Jun 26, 2026

[DeepSeek-V4] Async CUDA illegal memory access in MTP spec-decode sampler under sustained load with per-size CUDA graphs

CUDA Graph

Customized kernels

Speculative Decoding

#15639

· Thachnh opened

on Jun 25, 2026

RFC: Deprecating AutoDeploy Backend

RFC

#15638

· arysef opened

on Jun 25, 2026

[Bug] Qwen3-Next (Gated-DeltaNet) fails at warmup on consumer Blackwell sm120 (RTX PRO 6000) — TRT-LLM 1.3.0rc19 bundles flashinfer 0.6.12 (Hopper-only GDN); please bump to >=0.6.13

Customized kernels

#15634

· zentradev-rabih opened

on Jun 25, 2026

[Bug]: Warp Illegal Address / MMU Fault (Xid 13) during prefill when running GLM-5.2-NVFP4 on NVIDIA B200 GPUs

bug

Customized kernels

#15610

· bleedingfight opened

on Jun 25, 2026

[AutoDeploy] Re-enable SSM replay for Nemotron-Super MTP (replay kernel illegal memory access at CUDA-graph capture on Blackwell)

AutoDeploy

CUDA Graph

Speculative Decoding

#15565

· govind-ramnarayan opened

on Jun 24, 2026

[Bug]: XQA multi_block_mode crashes with CUDA_ERROR_INVALID_VALUE under concurrent inference (v1.0.0)

bug

Customized kernels

Inference runtime

#15537

· xuxiongjie272 opened

on Jun 23, 2026

SuperCompress: Free open-source LLM prompt compression - cut token costs by ~65%