Getting started · Models · Server · Production · Benchmarks · Releases · Changelog
LLM inference in Rust, for Apple Silicon and NVIDIA CUDA.
A single binary that downloads a model, runs it GPU-resident, and prints text — built from scratch with zero ML dependencies (no PyTorch, no ONNX, no Python), native CUDA C and Metal kernels, a native tokenizer, and a native model format.
lumen run qwen3.5-9b:q8_0 "Write a haiku about Rust"That one command downloads the model on first use, converts it, picks your backend (Metal on Apple Silicon, CUDA on NVIDIA), and streams tokens.
Status: Production-ready for the shipped Qwen3.5 / Qwen3.6 models (dense 9B, dense 27B, and MoE-30B-A3B) on NVIDIA CUDA (compute capability 8.0+) and Apple Silicon (M-series). The public API and the binary
.lbcformat are not yet stable — read Production deployment before deploying.
Get Lumen one of two ways, then run.
Option A — pre-built binary (no Rust toolchain). One command detects your platform (macOS → Metal, Linux x86_64 + NVIDIA → CUDA), installs the matching validated binary, and helps you set up a model:
curl -fsSL https://servelumen.com/install.sh | bashOption B — build from source (Rust toolchain):
git clone https://github.com/faisalmumtaz89/Lumen && cd Lumen
cargo install --path crates/lumen-cli # Apple Silicon (Metal)
cargo install --path crates/lumen-cli --features cuda # NVIDIA Linux (CUDA)(That installs the lumen CLI. For the lumen-server binary too: cargo install --path crates/lumen-server --features bin — append ,cuda on NVIDIA. Option A installs both.)
Run — the model auto-downloads + converts on first use, then runs GPU-resident on your backend (Metal on Apple Silicon, CUDA on NVIDIA):
lumen run qwen3.5-9b:q8_0 "Write a haiku about Rust"
lumen run qwen3.5-moe:q4_0 "Explain quantum computing in one paragraph" # mixture-of-expertsMore install paths (Docker, the one-command clone → running server script): Getting started.
- One self-contained binary, zero ML dependencies — native CUDA C and Metal MSL kernels, a native BPE tokenizer, and the native
.lbcmodel format, all in Rust. No PyTorch, no ONNX, no Python runtime. - No build-time CUDA SDK — kernels JIT-compile at runtime via NVRTC, so one CUDA build runs on any compute-capability-8.0+ device (driver-only).
- Download → convert → run in a single command; weights stay GPU-resident for fast batch-1 decode.
- OpenAI- and Anthropic-compatible HTTP server with SSE streaming and template-driven tool calls; optional per-request reasoning / extended thinking.
- Runs on NVIDIA cc 8.0+ (Ampere / Hopper) and Apple Silicon (M-series); a scalar + SIMD CPU path is the correctness reference.
- Tuned for interactive serving — single-stream, GPU-resident decode latency, not large-batch throughput.
v1 (current) verifies the Qwen3.5 family and the Qwen3.6-27B dense model end-to-end; more model families are planned.
| Model | Architecture | Parameters | Quants |
|---|---|---|---|
qwen3.5-9b |
Dense GDN-hybrid | 9B | Q8_0, Q4_0, BF16 |
qwen3.6-27b |
Dense GDN-hybrid | 27B | Q8_0, Q4_0, BF16 |
qwen3.5-moe |
MoE GDN-hybrid | 30B total / 3B active | Q8_0, Q4_0, BF16 |
| Backend | Hardware | Status |
|---|---|---|
| CUDA | NVIDIA, compute capability 8.0+ (e.g. A100, H100) | Production-ready |
| Metal | Apple Silicon (M-series) | Production-ready |
| CPU | Scalar reference + SIMD NEON | Correctness reference, not throughput-optimized |
lumen models lists what is available and disk-cached. Live support matrix and per-config verification status: docs/support.md.
For concurrent clients, run the long-lived server (not repeated lumen run):
lumen pull qwen3.5-9b:q8_0
lumen-server qwen3.5-9b:q8_0 # defaults: port 8000, auto backend
# (explicit form: lumen-server --model qwen3.5-9b --quant q8_0 --port 8000)
curl http://localhost:8000/v1/modelsPOST /v1/chat/completions # OpenAI-compatible, SSE streaming
POST /v1/completions # OpenAI-compatible
POST /v1/messages # Anthropic-compatible, SSE streaming
Wire formats, reasoning / extended thinking, sampling & reproducibility, and embedding the engine as a library: docs/server.md.
Lumen optimizes for batch-1, GPU-resident decode latency — single-stream interactive serving.
Workload-weighted decode (Qwen3.5-MoE-30B-A3B, A100-80GB, mean across short / medium / long / code / multi-turn):
| Quant | Decode (tok/s) |
|---|---|
| Q8_0 | 76.4 |
| Q4_0 | 99.5 |
| BF16 | 87.4 |
Full per-cell decode + prefill numbers (every model × quant, on A100-80GB and M3 Ultra), methodology, and baseline comparisons: bench/RESULTS.md. Long-context decode is validated to 65K+ tokens.
lumen-format LBC binary format, quantization descriptors, test model generators
lumen-convert GGUF -> LBC converter (qwen35 dense, qwen35moe MoE)
lumen-runtime CUDA backend (200+ NVRTC kernels), Metal backend (MSL shaders),
CPU + SIMD NEON references, KV cache (memory + disk),
GDN recurrent state, sampling, sessions, suffix prefill
lumen-server axum HTTP server: OpenAI + Anthropic SSE endpoints, tool calling
lumen-bench benchmark harness with JSON + table output
lumen-cli CLI: built-in BPE tokenizer, model registry, HuggingFace downloader
Shipped instances share an L=32 stack of GDN linear-attention layers interleaved with full-attention layers, with a fused gate+up+SwiGLU+down FFN (dense) or top-k expert dispatch (MoE). Forward-pass details, the .lbc on-disk format, and suffix-prefill cache reuse: docs/architecture.md.
For development — build the workspace and run the test suite (the install commands are in Quick start above):
cargo build --release # Metal (macOS)
cargo build --release --features cuda # CUDA (Linux)
cargo test --workspace --release # CPU reference suite needs no GPURust is pinned via rust-toolchain.toml; CUDA needs libnvrtc + libcublas present at run time (no build-time SDK). Dev workflow: CONTRIBUTING.md.
- Getting started — install, pull, run (binaries, Docker, source)
- CLI reference — all subcommands and flags (or
lumen run --help) - HTTP server — endpoints, reasoning, library embedding
- Model support — live support matrix and verification status
- Production deployment — serving mode, GPU sizing, known limitations
- Environment variables — all
LUMEN_*flags - Benchmarks — per-cell numbers and methodology
- Releases & versioning · Changelog
Dual-licensed under MIT or Apache-2.0, at your option.