# llmux Design Specification

## Overview

llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.

## Hardware Constraints

- GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
- CPU: AMD Ryzen 9 9900X
- RAM: 64GB DDR5
- Storage: ~1.3TB free on /home
- OS: Debian 12 (Bookworm)
- NVIDIA driver: 590.48 (CUDA 13.1 capable)
- Host CUDA toolkit: 12.8

## Architecture

### Single Process Design

llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.

### Runtimes

Three inference runtimes coexist within the single process:

| Runtime | Purpose | Models |
|---------|---------|--------|
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |

### Why transformers (not vLLM)

vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.
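The three runtimes can coexist cleanly because route handlers only ever see one common interface. A minimal sketch of that interface (class and method names here are illustrative; the real contract is defined in `backends/base.py`, described later in this spec):

```python
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    """Common contract for the transformers, llama-cpp-python, and chatterbox backends."""

    def __init__(self, physical_model_id: str, estimated_vram_gb: float):
        self.physical_model_id = physical_model_id
        self.estimated_vram_gb = estimated_vram_gb
        self.loaded = False

    @abstractmethod
    def load(self) -> None:
        """Move weights into VRAM."""

    @abstractmethod
    def unload(self) -> None:
        """Free VRAM (drop the model, empty the CUDA cache)."""

# Illustrative stand-in backend, useful for exercising the interface without a GPU.
class DummyBackend(ModelBackend):
    def load(self) -> None:
        self.loaded = True

    def unload(self) -> None:
        self.loaded = False
```

Because every backend exposes the same `load`/`unload` pair, the VRAM manager can evict and load models without knowing which runtime owns them.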
## Physical Models

| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
|----|------|---------|----------------------|----------------|--------|-------|
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |

## Virtual Models

Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.
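In code, this resolution amounts to a lookup plus parameter injection. A hypothetical sketch with two illustrative entries (the real mapping is loaded from `config/models.yaml` by `model_registry.py`):

```python
# Hypothetical in-memory mapping; llmux loads the real one from config/models.yaml.
VIRTUAL_MODELS = {
    "Qwen3.5-4B-Instruct": {
        "physical": "qwen3.5-4b",
        "params": {"enable_thinking": False},
    },
    "GPT-OSS-20B-High": {
        "physical": "gpt-oss-20b",
        "params": {"system_prompt_prefix": "Reasoning: high"},
    },
}

def resolve(virtual_name: str) -> tuple[str, dict]:
    """Map a client-facing model name to (physical_model_id, behavior_params)."""
    entry = VIRTUAL_MODELS[virtual_name]
    return entry["physical"], entry.get("params", {})
```

Two virtual models that resolve to the same physical ID never trigger a load/unload when the client switches between them, which is why the swap is free.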
| Virtual Model Name | Physical Model | Behavior |
|--------------------|----------------|----------|
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |

## VRAM Manager

### Preemption Policy

Models remain loaded until VRAM is needed for another model. No idle timeout — a model stays in VRAM indefinitely until evicted.

### Priority (highest to lowest)

1. ASR (cohere-transcribe) — highest priority, evicted only as a last resort
2. TTS (one Chatterbox variant at a time)
3. LLM (one at a time) — lowest priority, evicted first

### Loading Algorithm

When a request arrives for a model whose physical model is not loaded:

1. If the physical model is already loaded, proceed immediately.
2. If it fits in available VRAM, load it alongside existing models.
3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
   - Evict the LLM first
   - Evict TTS second
   - Evict ASR only as a last resort
   - Never evict a higher-priority model to load a lower-priority one (e.g., never evict ASR to make room for TTS; in that case, evict the LLM instead)
4. Load the requested model.

### Concurrency

- An asyncio Lock ensures only one load/unload operation at a time.
- Requests arriving during a model swap await the lock.
- Inference requests hold a read-lock on their model to prevent eviction mid-inference.

### Typical Scenarios

| Current State | Request | Action |
|---------------|---------|--------|
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed, already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict LLM (4B), load 9B (~9GB). ASR+TTS+9B ≈ 15GB, fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |

## API Endpoints

All endpoints on `127.0.0.1:8081`. All `/v1/*` endpoints require Bearer token authentication.

### GET /v1/models

Returns all 16 virtual models in OpenAI format, regardless of what's currently loaded. Users can freely select any model; llmux handles swapping.

### POST /v1/chat/completions

OpenAI-compatible chat completions. Accepts a `model` parameter matching a virtual model name. Supports `stream: true` for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.

### POST /v1/audio/transcriptions

OpenAI Whisper-compatible endpoint. Accepts a multipart form with an audio file and a `model` parameter. Returns the transcript in OpenAI response format.
Supports a `language` parameter (required by cohere-transcribe — default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.

### POST /v1/audio/speech

OpenAI TTS-compatible endpoint. Accepts JSON with `model`, `input` (text), and `voice` (maps to Chatterbox voice/speaker config). Returns audio bytes.

### GET /health

Unauthenticated. Returns service status and currently loaded models.

## Authentication

- All `/v1/*` endpoints require a Bearer token (`Authorization: Bearer <key>`)
- API keys are stored in `config/api_keys.yaml`, mounted read-only into the container
- Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
- `GET /health` is unauthenticated for monitoring/readiness probes
- Traefik acts purely as a router; no auth on its side

## Container & Pod Architecture

### Pod

- Pod name: `llmux_pod`
- Single container: `llmux_ctr`
- Port: `127.0.0.1:8081:8081`
- GPU: NVIDIA CDI (`--device nvidia.com/gpu=all`)
- Network: default (no host loopback needed)

### Base Image

`pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime`

Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports the RTX 5070 Ti. The host driver 590.48 (CUDA 13.1) is backwards compatible.

### Dockerfile Layers

1. System deps: libsndfile, ffmpeg (audio processing)
2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
3. Copy llmux application code
4. Entrypoint: `uvicorn llmux.main:app --host 0.0.0.0 --port 8081`

### Bind Mounts

| Host Path | Container Path | Mode |
|-----------|----------------|------|
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |

### Systemd

Managed via `create_pod_llmux.sh` following the Kischdle pattern: create pod, create container, generate systemd units, enable service.
## Application Structure

```
llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script
```

### Key Design Decisions

- `backends/` encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
- `vram_manager.py` is the single authority on what's loaded. Route handlers call `vram_manager.ensure_loaded(physical_model_id)` before inference.
- `model_registry.py` handles virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing to the backend.
- Streaming for chat completions uses FastAPI `StreamingResponse` with SSE, matching the OpenAI streaming format.

## Model Downloads

All models are pre-downloaded before the pod is created. The `scripts/download_models.sh` script runs as user `llm` and downloads to `/home/llm/.local/share/llmux_pod/models/`.
| Model | Method | Approx Size |
|-------|--------|-------------|
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |

Total estimated: ~60GB. The script is idempotent (skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).

## Open WebUI Configuration

Open WebUI (user `wbg`, port 8080) connects to llmux:

### Connections (Admin > Settings > Connections)

- OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- API Key: the key from api_keys.yaml designated for Open WebUI

### Audio (Admin > Settings > Audio)

- STT Engine: openai
- STT OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- STT Model: cohere-transcribe
- TTS Engine: openai
- TTS OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- TTS Model: Chatterbox-Multilingual
- TTS Voice: to be configured based on Chatterbox options

### User Experience

- Model dropdown lists all 16 virtual models
- Chat works on any model selection (with a potential swap delay for the first request)
- Dictation uses cohere-transcribe
- Audio playback uses Chatterbox
- Voice chat combines ASR, LLM, and TTS

## Traefik Routing

New dynamic config at `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`:

```yaml
http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux
  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"
```

- Routed through the WireGuard VPN entry point
- No Traefik-level auth (llmux handles API key auth)
- DNS setup for kidirekt.kischdle.com is a manual step

## Configuration Files

### config/models.yaml

```yaml
physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true
  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true
  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true
  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true
  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true
  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"
  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2
  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2
  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }
  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }
  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }
  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }
  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }
  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox
```

### config/api_keys.yaml

```yaml
api_keys:
  - key: "sk-llmux-openwebui-"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-"
    name: "OpenCode"
```

Keys are generated at deployment time.

## Testing & Verification

### Phase 1: System Integration (iterative, fix issues before proceeding)

1. Container build — Dockerfile builds successfully, image contains all dependencies
2. GPU passthrough — container sees the RTX 5070 Ti (nvidia-smi works inside the container)
3. Model mount — container can read model weights from /models
4. Service startup — llmux starts, port 8081 reachable from the host
5. Open WebUI connection — model list populates in Open WebUI
6. Traefik routing — kidirekt.kischdle.com routes to llmux (once DNS is configured)
7. Systemd lifecycle — start/stop/restart works, service survives reboot

### Phase 2: Functional Tests

8. Auth — requests without a valid API key get 401
9. Model listing — GET /v1/models returns all 16 virtual models
10. Chat inference — for each physical LLM, chat via Open WebUI as user "try":
    - Qwen3.5-9B-FP8 (Thinking + Instruct)
    - Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
    - Qwen3.5-4B (Thinking + Instruct)
    - GPT-OSS-20B (Low, Medium, High)
    - GPT-OSS-20B-Uncensored (Low, Medium, High)
11. Streaming — chat responses stream token-by-token in Open WebUI
12. ASR — Open WebUI dictation transcribes speech (English and German)
13. TTS — Open WebUI audio playback speaks text
14. Vision — image + text prompt to each vision-capable model:
    - Qwen3.5-4B
    - Qwen3.5-9B-FP8
    - Qwen3.5-9B-FP8-Uncensored
15. Tool usage — verify tool calling for each runtime and tool-capable model:
    - Qwen3.5-9B-FP8 (transformers)
    - Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
    - GPT-OSS-20B (transformers)
    - GPT-OSS-20B-Uncensored (transformers)

### Phase 3: VRAM Management Tests

16. Small LLM — load Qwen3.5-4B (~4GB), verify ASR and TTS remain loaded (~10GB total)
17. Medium LLM — load Qwen3.5-9B-FP8 (~9GB), verify ASR and TTS remain loaded (~15GB total)
18. Large LLM — load GPT-OSS-20B (~13GB), verify ASR and TTS are evicted. The next ASR request evicts the LLM first.
19. Model swapping — switch between two LLMs, verify the second loads and the first is evicted

### Phase 4: Performance Tests

20. Transformers GPU vs CPU — for each transformers-backed physical model, run the same prompt on GPU and CPU, verify GPU is at least 5x faster. Requires an admin test endpoint or CLI tool to force CPU execution.
    - Qwen3.5-9B-FP8
    - Qwen3.5-4B
    - gpt-oss-20b
    - gpt-oss-20b-uncensored
    - cohere-transcribe
21. llama-cpp-python GPU vs CPU — run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU), verify GPU is at least 5x faster. Same admin test endpoint.
22. Chatterbox performance — run TTS synthesis, verify audio generation time is reasonable relative to audio duration.
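The Phase 3 expectations follow directly from the "evict lowest priority first" policy in the VRAM Manager section. A simplified pure-Python sketch of that planning step (illustrative only — `vram_manager.py` holds the real, lock-protected logic, and in practice ASR is only reached as a last resort):

```python
TOTAL_VRAM_GB = 16
PRIORITY = {"asr": 3, "tts": 2, "llm": 1}  # higher number = evicted later

def plan_evictions(loaded: dict[str, tuple[str, float]], needed_gb: float) -> list[str]:
    """Return model IDs to evict, lowest priority first, until needed_gb fits.

    `loaded` maps model_id -> (type, vram_gb).
    """
    free = TOTAL_VRAM_GB - sum(gb for _, gb in loaded.values())
    evictions: list[str] = []
    # walk victims from lowest priority (LLM) to highest (ASR)
    for model_id, (mtype, gb) in sorted(loaded.items(), key=lambda kv: PRIORITY[kv[1][0]]):
        if needed_gb <= free:
            break
        evictions.append(model_id)
        free += gb
    return evictions

# Scenario from the table: ASR + TTS + Qwen3.5-4B loaded, GPT-OSS-20B (~13GB) requested
# evicts all three, matching "Load gpt-oss-20b alone".
loaded = {"cohere-transcribe": ("asr", 4), "chatterbox": ("tts", 2), "qwen3.5-4b": ("llm", 4)}
print(plan_evictions(loaded, 13))
```

Replaying the other scenario rows against this sketch gives the same answers as the table, which is essentially what the Phase 3 tests verify against the real service.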
## Manual Steps

These require human action and cannot be automated:

- DNS setup for kidirekt.kischdle.com (during implementation)
- HuggingFace terms for cohere-transcribe: accepted 2026-04-03
- HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg, needs setup for user llm during deployment)
- Open WebUI admin configuration (connections, audio settings)