A high-throughput and memory-efficient inference and serving engine for LLMs
Updated Jun 29, 2026 - Python
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
A Python framework for GPU-accelerated simulation, robotics, and machine learning.
A flexible framework of neural networks for deep learning
FlashInfer: Kernel Library for LLM Serving
A GPU cluster manager for high-performance AI model serving (vLLM, SGLang) and on-demand SSH-accessible GPU instances.
cuML - RAPIDS Machine Learning Library
A PyTorch Library for Accelerating 3D Deep Learning Research
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
Self-hosted, local-only NVR and AI computer vision software. With features such as object detection, motion detection, face recognition, and more, it gives you the power to keep an eye on your home, office, or any other place you want to monitor.
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.
Created by NVIDIA
Released June 23, 2007