Lumen

Getting started · Models · Server · Production · Benchmarks · Releases · Changelog

LLM inference in Rust, for Apple Silicon and NVIDIA CUDA.

A single binary that downloads a model, runs it GPU-resident, and prints text — built from scratch with zero ML dependencies (no PyTorch, no ONNX, no Python), native CUDA C and Metal kernels, a native tokenizer, and a native model format.

lumen run qwen3.5-9b:q8_0 "Write a haiku about Rust"

That one command downloads the model on first use, converts it, picks your backend (Metal on Apple Silicon, CUDA on NVIDIA), and streams tokens.

Status: Production-ready for the shipped Qwen3.5 / Qwen3.6 models (dense 9B, dense 27B, and MoE-30B-A3B) on NVIDIA CUDA (compute capability 8.0+) and Apple Silicon (M-series). The public API and the binary .lbc format are not yet stable — read Production deployment before deploying.

Quick start

Get Lumen one of two ways, then run.

Option A — pre-built binary (no Rust toolchain). One command detects your platform (macOS → Metal, Linux x86_64 + NVIDIA → CUDA), installs the matching validated binary, and helps you set up a model:

curl -fsSL https://servelumen.com/install.sh | bash

Option B — build from source (Rust toolchain):

git clone https://github.com/faisalmumtaz89/Lumen && cd Lumen
cargo install --path crates/lumen-cli                  # Apple Silicon (Metal)
cargo install --path crates/lumen-cli --features cuda   # NVIDIA Linux (CUDA)

(That installs the lumen CLI. For the lumen-server binary too: cargo install --path crates/lumen-server --features bin — append ,cuda on NVIDIA. Option A installs both.)

Run — the model auto-downloads + converts on first use, then runs GPU-resident on your backend (Metal on Apple Silicon, CUDA on NVIDIA):

lumen run qwen3.5-9b:q8_0 "Write a haiku about Rust"
lumen run qwen3.5-moe:q4_0 "Explain quantum computing in one paragraph"   # mixture-of-experts

More install paths (Docker, the one-command clone → running server script): Getting started.

What it is

One self-contained binary, zero ML dependencies — native CUDA C and Metal MSL kernels, a native BPE tokenizer, and the native .lbc model format, all in Rust. No PyTorch, no ONNX, no Python runtime.
No build-time CUDA SDK — kernels JIT-compile at runtime via NVRTC, so one CUDA build runs on any compute-capability-8.0+ device (driver-only).
Download → convert → run in a single command; weights stay GPU-resident for fast batch-1 decode.
OpenAI- and Anthropic-compatible HTTP server with SSE streaming and template-driven tool calls; optional per-request reasoning / extended thinking.
Runs on NVIDIA cc 8.0+ (Ampere / Hopper) and Apple Silicon (M-series); a scalar + SIMD CPU path is the correctness reference.
Tuned for interactive serving — single-stream, GPU-resident decode latency, not large-batch throughput.

Supported models & hardware

v1 (current) verifies the Qwen3.5 family and the Qwen3.6-27B dense model end-to-end; more model families are planned.

Model	Architecture	Parameters	Quants
`qwen3.5-9b`	Dense GDN-hybrid	9B	Q8_0, Q4_0, BF16
`qwen3.6-27b`	Dense GDN-hybrid	27B	Q8_0, Q4_0, BF16
`qwen3.5-moe`	MoE GDN-hybrid	30B total / 3B active	Q8_0, Q4_0, BF16

Backend	Hardware	Status
CUDA	NVIDIA, compute capability 8.0+ (e.g. A100, H100)	Production-ready
Metal	Apple Silicon (M-series)	Production-ready
CPU	Scalar reference + SIMD NEON	Correctness reference, not throughput-optimized

lumen models lists what is available and disk-cached. Live support matrix and per-config verification status: docs/support.md.

HTTP server

For concurrent clients, run the long-lived server (not repeated lumen run):

lumen pull qwen3.5-9b:q8_0
lumen-server qwen3.5-9b:q8_0                          # defaults: port 8000, auto backend
# (explicit form: lumen-server --model qwen3.5-9b --quant q8_0 --port 8000)
curl http://localhost:8000/v1/models

POST /v1/chat/completions   # OpenAI-compatible, SSE streaming
POST /v1/completions        # OpenAI-compatible
POST /v1/messages           # Anthropic-compatible, SSE streaming

Wire formats, reasoning / extended thinking, sampling & reproducibility, and embedding the engine as a library: docs/server.md.

Performance

Lumen optimizes for batch-1, GPU-resident decode latency — single-stream interactive serving.

Workload-weighted decode (Qwen3.5-MoE-30B-A3B, A100-80GB, mean across short / medium / long / code / multi-turn):

Quant	Decode (tok/s)
Q8_0	76.4
Q4_0	99.5
BF16	87.4

Full per-cell decode + prefill numbers (every model × quant, on A100-80GB and M3 Ultra), methodology, and baseline comparisons: bench/RESULTS.md. Long-context decode is validated to 65K+ tokens.

Architecture

lumen-format      LBC binary format, quantization descriptors, test model generators
lumen-convert     GGUF -> LBC converter (qwen35 dense, qwen35moe MoE)
lumen-runtime     CUDA backend (200+ NVRTC kernels), Metal backend (MSL shaders),
                  CPU + SIMD NEON references, KV cache (memory + disk),
                  GDN recurrent state, sampling, sessions, suffix prefill
lumen-server      axum HTTP server: OpenAI + Anthropic SSE endpoints, tool calling
lumen-bench       benchmark harness with JSON + table output
lumen-cli         CLI: built-in BPE tokenizer, model registry, HuggingFace downloader

Shipped instances share an L=32 stack of GDN linear-attention layers interleaved with full-attention layers, with a fused gate+up+SwiGLU+down FFN (dense) or top-k expert dispatch (MoE). Forward-pass details, the .lbc on-disk format, and suffix-prefill cache reuse: docs/architecture.md.

Building & testing

For development — build the workspace and run the test suite (the install commands are in Quick start above):

cargo build --release                  # Metal (macOS)
cargo build --release --features cuda  # CUDA (Linux)
cargo test --workspace --release       # CPU reference suite needs no GPU

Rust is pinned via rust-toolchain.toml; CUDA needs libnvrtc + libcublas present at run time (no build-time SDK). Dev workflow: CONTRIBUTING.md.

Documentation

Getting started — install, pull, run (binaries, Docker, source)
CLI reference — all subcommands and flags (or lumen run --help)
HTTP server — endpoints, reasoning, library embedding
Model support — live support matrix and verification status
Production deployment — serving mode, GPU sizing, known limitations
Environment variables — all LUMEN_* flags
Benchmarks — per-cell numbers and methodology
Releases & versioning · Changelog

License

Dual-licensed under MIT or Apache-2.0, at your option.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
.github		.github
bench		bench
crates		crates
docs		docs
examples		examples
packaging		packaging
quality		quality
scripts		scripts
tests/fixtures		tests/fixtures
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
RELEASING.md		RELEASING.md
SECURITY.md		SECURITY.md
lumen.sh		lumen.sh
model_registry.toml		model_registry.toml
rust-toolchain.toml		rust-toolchain.toml
setup-local.sh		setup-local.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Lumen

Quick start

What it is

Supported models & hardware

HTTP server

Performance

Architecture

Building & testing

Documentation

License

About

Licenses found

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Lumen

Quick start

What it is

Supported models & hardware

HTTP server

Performance

Architecture

Building & testing

Documentation

License

About

Topics

Resources

License

Licenses found

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages