# llmux Design Specification
## Overview
llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.
## Hardware Constraints
- GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
- CPU: AMD Ryzen 9 9900X
- RAM: 64GB DDR5
- Storage: ~1.3TB free on /home
- OS: Debian 12 (Bookworm)
- NVIDIA driver: 590.48 (CUDA 13.1 capable)
- Host CUDA toolkit: 12.8
## Architecture
### Single Process Design
llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.
### Runtimes
Three inference runtimes coexist within the single process:
| Runtime | Purpose | Models |
|---------|---------|--------|
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |
### Why transformers (not vLLM)
vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.
## Physical Models
| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
|----|------|---------|---------------------|---------------|--------|-------|
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |
## Virtual Models
Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.
| Virtual Model Name | Physical Model | Behavior |
|--------------------|---------------|----------|
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |
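The zero-cost switching follows from the mapping being pure configuration; a minimal sketch of the resolution step (an excerpt only, and the function and variable names are illustrative, not the actual `model_registry.py` API):

```python
# Illustrative sketch of virtual-to-physical resolution; the real
# model_registry.py may differ. Each virtual model contributes only
# behavior params, so switching variants never touches VRAM.
VIRTUAL_MODELS = {
    "Qwen3.5-4B-Thinking": {"physical": "qwen3.5-4b", "params": {"enable_thinking": True}},
    "Qwen3.5-4B-Instruct": {"physical": "qwen3.5-4b", "params": {"enable_thinking": False}},
    "GPT-OSS-20B-High": {"physical": "gpt-oss-20b",
                         "params": {"system_prompt_prefix": "Reasoning: high"}},
}

def resolve_virtual(name: str) -> tuple[str, dict]:
    """Return (physical_model_id, behavior_params) for a virtual model name."""
    entry = VIRTUAL_MODELS[name]
    return entry["physical"], entry.get("params", {})
```

Only the returned params differ between variants sharing a physical model, so no load/unload is triggered.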
## VRAM Manager
### Preemption Policy
Models remain loaded until VRAM is needed for another model. No idle timeout — a model stays in VRAM indefinitely until evicted.
### Priority (highest to lowest)
1. ASR (cohere-transcribe) — highest priority, evicted only as last resort
2. TTS (one Chatterbox variant at a time)
3. LLM (one at a time) — lowest priority, evicted first
### Loading Algorithm
When a request arrives for a virtual model:
1. If its physical model is already loaded, proceed immediately.
2. If it fits in available VRAM, load alongside existing models.
3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
   - Evict the LLM first
   - Evict TTS second
   - Evict ASR only as a last resort
   - Evict a higher-priority model only when freeing every lower-priority model is still insufficient (e.g., never evict ASR to make room for TTS; evict the LLM instead)
4. Load the requested model.
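The steps above can be sketched as a pure planning function (a sketch under assumed data shapes; `plan_load` and the dict layout are illustrative, not the actual `vram_manager.py` interface):

```python
# Illustrative eviction sketch; the real vram_manager.py may differ.
# Lower number = higher priority (evicted last), per the priority list above.
PRIORITY = {"asr": 0, "tts": 1, "llm": 2}

def plan_load(requested: dict, loaded: list[dict], total_gb: float = 16.0) -> list[str]:
    """Return ids of models to evict so `requested` fits, lowest priority first.

    Higher-priority models are evicted only as a last resort, consistent
    with the scenarios table below. Raises if the request cannot fit at all.
    """
    if any(m["id"] == requested["id"] for m in loaded):
        return []  # already loaded, nothing to do
    evict = []
    free = total_gb - sum(m["vram_gb"] for m in loaded)
    # Walk victims from lowest priority (llm) to highest (asr).
    for victim in sorted(loaded, key=lambda m: -PRIORITY[m["type"]]):
        if free >= requested["vram_gb"]:
            break
        evict.append(victim["id"])
        free += victim["vram_gb"]
    if free < requested["vram_gb"]:
        raise RuntimeError("cannot free enough VRAM")
    return evict
```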
### Concurrency
- An asyncio Lock ensures only one load/unload operation at a time.
- Requests arriving during a model swap await the lock.
- Inference requests hold a read-lock on their model to prevent eviction mid-inference.
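One way to realize this scheme with stdlib asyncio (an illustrative sketch, not the actual implementation; the read-lock is modeled as a per-model reader count that evictors wait to drain):

```python
import asyncio

class ModelGuard:
    """Illustrative sketch: one global swap lock plus per-model reader counts.

    A model with active inference requests must not be evicted; an evictor
    holding the swap lock waits until the reader count drains to zero.
    """
    def __init__(self):
        self.swap_lock = asyncio.Lock()    # serializes all load/unload operations
        self.readers: dict[str, int] = {}  # active inference requests per model
        self.drained = asyncio.Condition()

    async def acquire_read(self, model_id: str):
        self.readers[model_id] = self.readers.get(model_id, 0) + 1

    async def release_read(self, model_id: str):
        async with self.drained:
            self.readers[model_id] -= 1
            self.drained.notify_all()

    async def wait_until_idle(self, model_id: str):
        """Called under swap_lock before evicting model_id."""
        async with self.drained:
            await self.drained.wait_for(lambda: self.readers.get(model_id, 0) == 0)
```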
### Typical Scenarios
| Current State | Request | Action |
|---------------|---------|--------|
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed, already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict LLM (4B), load 9B (~9GB). ASR+TTS+9B≈15GB, fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |
## API Endpoints
All endpoints on `127.0.0.1:8081`. All `/v1/*` endpoints require Bearer token authentication.
### GET /v1/models
Returns all 16 virtual models in OpenAI format, regardless of what's currently loaded. Users can freely select any model; llmux handles swapping.
### POST /v1/chat/completions
OpenAI-compatible chat completions. Accepts `model` parameter matching a virtual model name. Supports `stream: true` for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.
### POST /v1/audio/transcriptions
OpenAI Whisper-compatible endpoint. Accepts a multipart form with an audio file and a `model` parameter; returns the transcript in the OpenAI response format. Supports a `language` parameter, which cohere-transcribe requires (defaults to "en"; "de" is also supported). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
### POST /v1/audio/speech
OpenAI TTS-compatible endpoint. Accepts JSON with `model`, `input` (text), `voice` (maps to Chatterbox voice/speaker config). Returns audio bytes.
### GET /health
Unauthenticated. Returns service status and currently loaded models.
## Authentication
- All `/v1/*` endpoints require a Bearer token (`Authorization: Bearer <api-key>`)
- API keys stored in `config/api_keys.yaml`, mounted read-only into the container
- Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
- `GET /health` is unauthenticated for monitoring/readiness probes
- Traefik acts purely as a router, no auth on its side
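The bearer check reduces to a small pure function (illustrative sketch; the real `auth.py` presumably wraps something like this in a FastAPI dependency or middleware and maps the failure to HTTP 401):

```python
# Illustrative sketch of bearer-token validation; the real auth.py may differ.
from typing import Optional

def check_bearer(authorization: Optional[str], valid_keys: dict[str, str]) -> str:
    """Return the client name for a valid 'Authorization: Bearer <key>' header.

    Raises PermissionError (mapped to HTTP 401 by the route layer) otherwise.
    `valid_keys` maps API key -> client name, as loaded from api_keys.yaml.
    """
    if not authorization or not authorization.startswith("Bearer "):
        raise PermissionError("missing or malformed Authorization header")
    key = authorization.removeprefix("Bearer ").strip()
    if key not in valid_keys:
        raise PermissionError("unknown API key")
    return valid_keys[key]
```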
## Container & Pod Architecture
### Pod
- Pod name: `llmux_pod`
- Single container: `llmux_ctr`
- Port: `127.0.0.1:8081:8081`
- GPU: NVIDIA CDI (`--device nvidia.com/gpu=all`)
- Network: default (no host loopback needed)
### Base Image
`pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime`
Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports RTX 5070 Ti. Host driver 590.48 (CUDA 13.1) is backwards compatible.
### Dockerfile Layers
1. System deps: libsndfile, ffmpeg (audio processing)
2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
3. Copy llmux application code
4. Entrypoint: `uvicorn llmux.main:app --host 0.0.0.0 --port 8081`
### Bind Mounts
| Host Path | Container Path | Mode |
|-----------|---------------|------|
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |
### Systemd
Managed via `create_pod_llmux.sh` following the Kischdle pattern: create pod, create container, generate systemd units, enable service.
## Application Structure
```
llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script
```
### Key Design Decisions
- `backends/` encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
- `vram_manager.py` is the single authority on what's loaded. Route handlers call `vram_manager.ensure_loaded(physical_model_id)` before inference.
- `model_registry.py` handles virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing to the backend.
- Streaming for chat completions uses FastAPI `StreamingResponse` with SSE, matching OpenAI streaming format.
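The OpenAI streaming format frames each token delta as an SSE `data:` event; a minimal, stdlib-only sketch of the chunk encoding (function names are illustrative):

```python
# Illustrative sketch of OpenAI-style SSE chunk framing for streaming chat.
import json
import time

def sse_chunk(model: str, delta: str, chunk_id: str = "chatcmpl-0") -> str:
    """Encode one token delta as an OpenAI-style chat.completion.chunk event."""
    payload = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": {"content": delta}, "finish_reason": None}],
    }
    return f"data: {json.dumps(payload)}\n\n"

def sse_done() -> str:
    """Terminal event closing the stream, per the OpenAI format."""
    return "data: [DONE]\n\n"
```

A route handler would yield these strings from an async generator into `StreamingResponse(..., media_type="text/event-stream")`, closing with the `[DONE]` sentinel.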
## Model Downloads
All models are pre-downloaded before the pod is created. The `scripts/download_models.sh` script runs as user `llm` and downloads to `/home/llm/.local/share/llmux_pod/models/`.
| Model | Method | Approx Size |
|-------|--------|-------------|
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |
Total estimated: ~60GB. The script is idempotent (skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).
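The skip-if-present check can be sketched like this (illustrative Python transcription; the actual script is shell, and its directory layout may differ):

```python
# Illustrative sketch of the idempotency check in download_models.sh;
# the real script's target-directory layout may differ.
from pathlib import Path

def needs_download(models_root: str, repo_id: str) -> bool:
    """Skip repos whose target directory already exists and is non-empty."""
    target = Path(models_root) / repo_id.replace("/", "--")
    return not (target.is_dir() and any(target.iterdir()))
```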
## Open WebUI Configuration
Open WebUI (user `wbg`, port 8080) connects to llmux:
### Connections (Admin > Settings > Connections)
- OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- API Key: the key from api_keys.yaml designated for Open WebUI
### Audio (Admin > Settings > Audio)
- STT Engine: openai
- STT OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- STT Model: cohere-transcribe
- TTS Engine: openai
- TTS OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- TTS Model: Chatterbox-Multilingual
- TTS Voice: to be configured based on Chatterbox options
### User Experience
- Model dropdown lists all 16 virtual models
- Chat works on any model selection (with potential swap delay for first request)
- Dictation uses cohere-transcribe
- Audio playback uses Chatterbox
- Voice chat combines ASR, LLM, and TTS
## Traefik Routing
New dynamic config at `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`:
```yaml
http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux
  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"
```
- Routed through WireGuard VPN entry point
- No Traefik-level auth (llmux handles API key auth)
- DNS setup for kidirekt.kischdle.com is a manual step
## Configuration Files
### config/models.yaml
```yaml
physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true
  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true
  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true
  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true
  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true
  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"
  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2
  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2
  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }
  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }
  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }
  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }
  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }
  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox
```
### config/api_keys.yaml
```yaml
api_keys:
  - key: "sk-llmux-openwebui-<generated>"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-<generated>"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-<generated>"
    name: "OpenCode"
```
Keys generated at deployment time.
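Generation might be as simple as the following (illustrative sketch following the `sk-llmux-<client>-<generated>` scheme above; `make_api_key` is not an existing helper):

```python
# Illustrative key generation following the sk-llmux-<client>-<random> scheme.
import secrets

def make_api_key(client: str) -> str:
    """Generate one API key with a 128-bit random hex suffix."""
    return f"sk-llmux-{client}-{secrets.token_hex(16)}"
```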
## Testing & Verification
### Phase 1: System Integration (iterative, fix issues before proceeding)
1. Container build — Dockerfile builds successfully, image contains all dependencies
2. GPU passthrough — container sees RTX 5070 Ti (nvidia-smi works inside container)
3. Model mount — container can read model weights from /models
4. Service startup — llmux starts, port 8081 reachable from host
5. Open WebUI connection — model list populates in Open WebUI
6. Traefik routing — kidirekt.kischdle.com routes to llmux (when DNS configured)
7. Systemd lifecycle — start/stop/restart works, service survives reboot
### Phase 2: Functional Tests
8. Auth — requests without valid API key get 401
9. Model listing — GET /v1/models returns all 16 virtual models
10. Chat inference — for each physical LLM, chat via Open WebUI as user "try":
- Qwen3.5-9B-FP8 (Thinking + Instruct)
- Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
- Qwen3.5-4B (Thinking + Instruct)
- GPT-OSS-20B (Low, Medium, High)
- GPT-OSS-20B-Uncensored (Low, Medium, High)
11. Streaming — chat responses stream token-by-token in Open WebUI
12. ASR — Open WebUI dictation transcribes speech (English and German)
13. TTS — Open WebUI audio playback speaks text
14. Vision — image + text prompt to each vision-capable model:
- Qwen3.5-4B
- Qwen3.5-9B-FP8
- Qwen3.5-9B-FP8-Uncensored
15. Tool usage — verify tool calling for each runtime and tool-capable model:
- Qwen3.5-9B-FP8 (transformers)
- Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
- GPT-OSS-20B (transformers)
- GPT-OSS-20B-Uncensored (transformers)
### Phase 3: VRAM Management Tests
16. Small LLM — load Qwen3.5-4B (~4GB), verify ASR and TTS remain loaded (~10GB total)
17. Medium LLM — load Qwen3.5-9B-FP8 (~9GB), verify ASR and TTS remain loaded (~15GB total)
18. Large LLM — load GPT-OSS-20B (~13GB), verify ASR and TTS are evicted. Next ASR request evicts LLM first.
19. Model swapping — switch between two LLMs, verify second loads and first is evicted
### Phase 4: Performance Tests
20. Transformers GPU vs CPU — for each transformers-backed physical model, run same prompt on GPU and CPU, verify GPU is at least 5x faster. Requires admin test endpoint or CLI tool to force CPU execution.
- Qwen3.5-9B-FP8
- Qwen3.5-4B
- gpt-oss-20b
- gpt-oss-20b-uncensored
- cohere-transcribe
21. llama-cpp-python GPU vs CPU — run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU), verify GPU is at least 5x faster. Same admin test endpoint.
22. Chatterbox performance — run TTS synthesis, verify audio generation time is reasonable relative to audio duration.
## Manual Steps
These require human action and cannot be automated:
- DNS setup for kidirekt.kischdle.com (during implementation)
- HuggingFace terms for cohere-transcribe: accepted 2026-04-03
- HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg, needs setup for user llm during deployment)
- Open WebUI admin configuration (connections, audio settings)