Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.
Updated Jun 29, 2026 · Python
Welcome to the Llama Cookbook! This is your go-to guide for building with Llama: getting started with inference, fine-tuning, and RAG. We also show you how to solve end-to-end problems using the Llama model family on various provider services.
Comprehensive open-source library of AI research and engineering skills for any AI model. Package the skills and your Claude Code/Codex/Gemini agent becomes an AI research agent with full horsepower. Maintained by Orchestra Research.
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Async RL)
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
A Datacenter Scale Distributed Inference Serving Framework
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
A GPU cluster manager for high-performance AI model serving (vLLM, SGLang) and on-demand SSH-accessible GPU instances.
Structured data extraction, instruction calling and agentic workflows with ML, LLM and Vision LLM
Reliable model swapping for any local OpenAI/Anthropic-compatible server: llama.cpp, vLLM, etc.
System Level Intelligent Router for Mixture-of-Models at Cloud, Data Center and Edge
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.
Cascading runtime for AI agents. Optimize cost, latency, quality, and policy decisions inside the agent loop.
Community maintained hardware plugin for vLLM on Ascend