# llmux Design Specification

## Overview

llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system.

## Hardware Constraints

- GPU: NVIDIA RTX 5070 Ti, 16GB VRAM, compute capability 12.0 (Blackwell/SM12.0)
- CPU: AMD Ryzen 9 9900X
- RAM: 64GB DDR5
- Storage: ~1.3TB free on /home
- OS: Debian 12 (Bookworm)
- NVIDIA driver: 590.48 (CUDA 13.1 capable)
- Host CUDA toolkit: 12.8

## Architecture

### Single Process Design

llmux is a monolithic FastAPI application. One Python process handles all model loading/unloading, VRAM management, and inference routing. This keeps the system simple and gives full control over GPU memory.

### Runtimes

Three inference runtimes coexist within the single process:

| Runtime | Purpose | Models |
|---------|---------|--------|
| transformers (HuggingFace) | HF safetensors models | Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe |
| llama-cpp-python | GGUF models | Qwen3.5-9B-FP8-Uncensored |
| chatterbox | TTS | Chatterbox-Turbo, Chatterbox-Multilingual, Chatterbox |

### Why transformers (not vLLM)

vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). The PyTorch transformers stack works with PyTorch 2.7+ and CUDA 12.8+ on SM12.0. vLLM can be reconsidered once SM12.0 support matures.

## Physical Models

| ID | Type | Backend | HuggingFace / Source | Estimated VRAM | Vision | Tools |
|----|------|---------|---------------------|---------------|--------|-------|
| qwen3.5-9b-fp8 | LLM | transformers | lovedheart/Qwen3.5-9B-FP8 | ~9GB | yes | yes |
| qwen3.5-9b-fp8-uncensored | LLM | llamacpp | HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 GGUF + mmproj GGUF) | ~9GB | yes | yes |
| qwen3.5-4b | LLM | transformers | Qwen/Qwen3.5-4B | ~4GB | yes | yes |
| gpt-oss-20b | LLM | transformers | openai/gpt-oss-20b (MXFP4 quantized MoE, designed for 16GB VRAM) | ~13GB | no | yes |
| gpt-oss-20b-uncensored | LLM | transformers | aoxo/gpt-oss-20b-uncensored | ~13GB | no | yes |
| cohere-transcribe | ASR | transformers | CohereLabs/cohere-transcribe-03-2026 (gated, terms accepted) | ~4GB | n/a | n/a |
| chatterbox-turbo | TTS | chatterbox | resemble-ai/chatterbox (turbo variant) | ~2GB | n/a | n/a |
| chatterbox-multilingual | TTS | chatterbox | resemble-ai/chatterbox (multilingual variant) | ~2GB | n/a | n/a |
| chatterbox | TTS | chatterbox | resemble-ai/chatterbox (default variant) | ~2GB | n/a | n/a |

## Virtual Models

Virtual models are what Open WebUI and API clients see. Multiple virtual models can map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model has zero VRAM cost.

| Virtual Model Name | Physical Model | Behavior |
|--------------------|---------------|----------|
| Qwen3.5-9B-FP8-Thinking | qwen3.5-9b-fp8 | Thinking enabled (default Qwen3.5 behavior) |
| Qwen3.5-9B-FP8-Instruct | qwen3.5-9b-fp8 | enable_thinking=False |
| Qwen3.5-9B-FP8-Uncensored-Thinking | qwen3.5-9b-fp8-uncensored | Thinking enabled |
| Qwen3.5-9B-FP8-Uncensored-Instruct | qwen3.5-9b-fp8-uncensored | enable_thinking=False |
| Qwen3.5-4B-Thinking | qwen3.5-4b | Thinking enabled |
| Qwen3.5-4B-Instruct | qwen3.5-4b | enable_thinking=False |
| GPT-OSS-20B-Low | gpt-oss-20b | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Medium | gpt-oss-20b | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-High | gpt-oss-20b | System prompt prefix: "Reasoning: high" |
| GPT-OSS-20B-Uncensored-Low | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: low" |
| GPT-OSS-20B-Uncensored-Medium | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: medium" |
| GPT-OSS-20B-Uncensored-High | gpt-oss-20b-uncensored | System prompt prefix: "Reasoning: high" |
| cohere-transcribe | cohere-transcribe | ASR (used via /v1/audio/transcriptions) |
| Chatterbox-Turbo | chatterbox-turbo | TTS (used via /v1/audio/speech) |
| Chatterbox-Multilingual | chatterbox-multilingual | TTS |
| Chatterbox | chatterbox | TTS |
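
The virtual-to-physical mapping can be sketched as follows. This is illustrative only (the registry entries mirror the table above, but the function name and return shape are assumptions, not the actual llmux API):

```python
# Hypothetical sketch of virtual->physical resolution. Two entries from the
# table above are shown; real llmux would load all of them from models.yaml.
VIRTUAL_MODELS = {
    "Qwen3.5-4B-Instruct": {
        "physical": "qwen3.5-4b",
        "params": {"enable_thinking": False},
    },
    "GPT-OSS-20B-High": {
        "physical": "gpt-oss-20b",
        "params": {"system_prompt_prefix": "Reasoning: high"},
    },
}

def resolve(virtual_name, messages):
    """Map a virtual model name to (physical_id, adjusted_messages, params)."""
    entry = VIRTUAL_MODELS[virtual_name]
    params = dict(entry.get("params", {}))
    messages = list(messages)
    prefix = params.pop("system_prompt_prefix", None)
    if prefix is not None:
        # Reasoning effort is injected as a system prompt prefix.
        messages = [{"role": "system", "content": prefix}] + messages
    return entry["physical"], messages, params

physical, msgs, params = resolve(
    "GPT-OSS-20B-High", [{"role": "user", "content": "hi"}]
)
```

Because the mapping is pure bookkeeping, switching among virtual models that share a physical model never touches the GPU.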

## VRAM Manager

### Preemption Policy

Models remain loaded until VRAM is needed for another model. There is no idle timeout: a model stays in VRAM indefinitely until it is evicted.

### Priority (highest to lowest)

1. ASR (cohere-transcribe): highest priority, evicted only as a last resort
2. TTS (one Chatterbox variant at a time)
3. LLM (one at a time): lowest priority, evicted first

### Loading Algorithm

When a request arrives for a virtual model:

1. If the physical model is already loaded, proceed immediately.
2. If it fits in available VRAM, load it alongside the existing models.
3. If it doesn't fit, evict models by priority (lowest first) until enough VRAM is free:
   - Evict the LLM first
   - Evict TTS second
   - Evict ASR only as a last resort, when the requested model cannot fit otherwise
4. Load the requested model.
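
The eviction step can be sketched as a pure function over estimated sizes. This is a minimal sketch assuming the 16GB capacity and the type priorities above; the function name and bookkeeping structure are illustrative:

```python
# Lowest priority number is evicted first, matching the policy above.
PRIORITY = {"llm": 0, "tts": 1, "asr": 2}
CAPACITY_GB = 16

def plan_eviction(loaded, new_size_gb):
    """Return model ids to evict (lowest priority first) so the new model fits.

    loaded: {model_id: {"type": "llm"|"tts"|"asr", "size": gb}}
    """
    free = CAPACITY_GB - sum(m["size"] for m in loaded.values())
    evictions = []
    # Candidates ordered LLM -> TTS -> ASR, per the priority table.
    candidates = sorted(loaded.items(), key=lambda kv: PRIORITY[kv[1]["type"]])
    for model_id, info in candidates:
        if free >= new_size_gb:
            break
        evictions.append(model_id)
        free += info["size"]
    if free < new_size_gb:
        raise MemoryError("model does not fit even with everything evicted")
    return evictions
```

For example, with ASR (4GB) + TTS (2GB) + Qwen3.5-4B (4GB) loaded, a request for the 9GB model evicts only the LLM, while a 13GB request evicts LLM, then TTS, then ASR, matching the scenarios table below.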

### Concurrency

- An asyncio `Lock` ensures only one load/unload operation runs at a time.
- Requests arriving during a model swap await the lock.
- Inference requests hold a read lock on their model to prevent eviction mid-inference.
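
A minimal sketch of this locking scheme, assuming a global swap lock plus per-model in-flight counters (class and method names are illustrative, not the real llmux internals):

```python
import asyncio

class ModelGuard:
    """One swap lock for load/unload; in-flight counters act as read locks."""

    def __init__(self):
        self.swap_lock = asyncio.Lock()   # serializes load/unload operations
        self.inflight = {}                # model_id -> active request count

    async def run_inference(self, model_id, fn):
        # Count the request in so eviction waits for it to finish.
        self.inflight[model_id] = self.inflight.get(model_id, 0) + 1
        try:
            return await fn()
        finally:
            self.inflight[model_id] -= 1

    async def evict(self, model_id):
        async with self.swap_lock:
            # Do not unload while any request still uses the model.
            while self.inflight.get(model_id, 0) > 0:
                await asyncio.sleep(0.01)
            # ... actual unload would happen here ...

async def demo():
    guard = ModelGuard()
    result = await guard.run_inference(
        "qwen3.5-4b", lambda: asyncio.sleep(0, result="ok")
    )
    await guard.evict("qwen3.5-4b")
    return result

print(asyncio.run(demo()))  # -> ok
```

A production version would use a condition variable instead of poll-sleep, but the invariant is the same: eviction cannot interleave with in-flight inference.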

### Typical Scenarios

| Current State | Request | Action |
|---------------|---------|--------|
| ASR + Qwen3.5-4B (~8GB) | Chat with Qwen3.5-4B | Proceed; already loaded |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with Qwen3.5-9B-FP8 | Evict the LLM (4B), load the 9B (~9GB). ASR + TTS + 9B ≈ 15GB; fits. |
| ASR + TTS + Qwen3.5-4B (~10GB) | Chat with GPT-OSS-20B | Evict the LLM first, then TTS, then ASR if needed. Load gpt-oss-20b alone (~13GB). |
| GPT-OSS-20B loaded (~13GB) | Transcription request | Evict the LLM (gpt-oss-20b). Load ASR (~4GB). |
| ASR + Qwen3.5-4B (~8GB) | TTS request | Fits (~10GB). Load Chatterbox alongside. |

## API Endpoints

All endpoints are served on `127.0.0.1:8081`. All `/v1/*` endpoints require Bearer token authentication.

### GET /v1/models

Returns all 16 virtual models in OpenAI format, regardless of what is currently loaded. Users can freely select any model; llmux handles swapping.
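
The response shape can be sketched as follows (a subset of the virtual model names from the table above; the builder function is illustrative):

```python
# Four of the 16 virtual model names, for illustration.
VIRTUAL_MODEL_NAMES = [
    "Qwen3.5-9B-FP8-Thinking",
    "Qwen3.5-9B-FP8-Instruct",
    "cohere-transcribe",
    "Chatterbox-Turbo",
]

def list_models(names):
    """Build the OpenAI-format model list returned by GET /v1/models."""
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model", "owned_by": "llmux"}
            for name in names
        ],
    }

resp = list_models(VIRTUAL_MODEL_NAMES)
```

Open WebUI consumes exactly this structure to populate its model dropdown.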

### POST /v1/chat/completions

OpenAI-compatible chat completions. Accepts a `model` parameter matching a virtual model name. Supports `stream: true` for SSE streaming. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) are applied transparently. Tool/function calling is passed through to models that support it.
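
A typical request body, sketched in Python (the API key value is a placeholder; the model name comes from the virtual models table):

```python
import json

# JSON body a client would POST to /v1/chat/completions.
payload = {
    "model": "Qwen3.5-4B-Instruct",  # virtual model; llmux maps it to qwen3.5-4b
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,                  # request SSE streaming chunks
}
headers = {
    "Authorization": "Bearer sk-llmux-openwebui-<generated>",  # placeholder key
    "Content-Type": "application/json",
}
body = json.dumps(payload)
```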

### POST /v1/audio/transcriptions

OpenAI Whisper-compatible endpoint. Accepts a multipart form with an audio file and a `model` parameter. Returns the transcript in OpenAI response format. Supports a `language` parameter (required by cohere-transcribe; default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
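
The upload-format check implied by that list can be sketched as (the helper name is illustrative):

```python
# Formats accepted by /v1/audio/transcriptions, from the list above.
SUPPORTED_AUDIO = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}

def is_supported(filename):
    """True when the uploaded file's extension is an accepted audio format."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ext in SUPPORTED_AUDIO
```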

### POST /v1/audio/speech

OpenAI TTS-compatible endpoint. Accepts JSON with `model`, `input` (text), and `voice` (maps to the Chatterbox voice/speaker config). Returns audio bytes.

### GET /health

Unauthenticated. Returns the service status and the currently loaded models.

## Authentication

- All `/v1/*` endpoints require a Bearer token (`Authorization: Bearer <api-key>`)
- API keys are stored in `config/api_keys.yaml`, mounted read-only into the container
- Multiple keys: one per client (Open WebUI, remote Whisper, OpenCode, etc.)
- `GET /health` is unauthenticated for monitoring/readiness probes
- Traefik acts purely as a router; it performs no authentication
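
The header check can be sketched as a small pure function (key values are placeholders mirroring the api_keys.yaml layout; the function name is illustrative):

```python
# Placeholder keys; the real values come from config/api_keys.yaml.
API_KEYS = {
    "sk-llmux-openwebui-abc123": "Open WebUI",
    "sk-llmux-whisper-def456": "Remote Whisper clients",
}

def authenticate(authorization_header):
    """Return the client name for a valid 'Bearer <key>' header, else None."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        return None
    key = authorization_header[len("Bearer "):]
    return API_KEYS.get(key)
```

In FastAPI this would typically run as a dependency on all `/v1/*` routes, returning 401 when the result is `None`.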

## Container & Pod Architecture

### Pod

- Pod name: `llmux_pod`
- Single container: `llmux_ctr`
- Port: `127.0.0.1:8081:8081`
- GPU: NVIDIA CDI (`--device nvidia.com/gpu=all`)
- Network: default (no host loopback needed)

### Base Image

`pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime`

Verified compatible with SM12.0 (Blackwell). PyTorch 2.7+ with CUDA 12.8+ supports the RTX 5070 Ti. The host driver 590.48 (CUDA 13.1) is backward compatible with the container's CUDA 12.8 userspace.

### Dockerfile Layers

1. System deps: libsndfile, ffmpeg (audio processing)
2. pip install: FastAPI, uvicorn, transformers (>=5.4.0), llama-cpp-python (CUDA build), chatterbox, soundfile, librosa, sentencepiece, protobuf, PyYAML
3. Copy the llmux application code
4. Entrypoint: `uvicorn llmux.main:app --host 0.0.0.0 --port 8081`

### Bind Mounts

| Host Path | Container Path | Mode |
|-----------|---------------|------|
| /home/llm/.local/share/llmux_pod/models/ | /models | read-only |
| /home/llm/.local/share/llmux_pod/config/ | /config | read-only |

### Systemd

Managed via `create_pod_llmux.sh` following the Kischdle pattern: create the pod, create the container, generate systemd units, enable the service.

## Application Structure

```
llmux/
├── Dockerfile
├── requirements.txt
├── config/
│   ├── models.yaml
│   └── api_keys.yaml
├── llmux/
│   ├── main.py              # FastAPI app, startup/shutdown, health endpoint
│   ├── auth.py              # API key validation middleware
│   ├── vram_manager.py      # VRAM tracking, load/unload, eviction logic
│   ├── model_registry.py    # Parse models.yaml, virtual→physical mapping
│   ├── routes/
│   │   ├── models.py        # GET /v1/models
│   │   ├── chat.py          # POST /v1/chat/completions
│   │   ├── transcription.py # POST /v1/audio/transcriptions
│   │   └── speech.py        # POST /v1/audio/speech
│   └── backends/
│       ├── base.py          # Abstract base class for model backends
│       ├── transformers.py  # HuggingFace transformers backend
│       ├── llamacpp.py      # llama-cpp-python backend (GGUF)
│       └── chatterbox.py    # Chatterbox TTS backend
└── scripts/
    ├── download_models.sh   # Pre-download all model weights
    └── create_pod_llmux.sh  # Podman pod creation script
```

### Key Design Decisions

- `backends/` encapsulates runtime differences. Each backend knows how to load, unload, and run inference. Route handlers are backend-agnostic.
- `vram_manager.py` is the single authority on what is loaded. Route handlers call `vram_manager.ensure_loaded(physical_model_id)` before inference.
- `model_registry.py` handles the virtual-to-physical mapping and injects behavior params (thinking toggle, system prompts) before passing the request to the backend.
- Streaming for chat completions uses FastAPI `StreamingResponse` with SSE, matching the OpenAI streaming format.
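
An individual SSE chunk in the OpenAI streaming format can be sketched as (field values are illustrative; real chunks also carry `id` and `created` fields):

```python
import json

def sse_chunk(model, delta_text):
    """Format one OpenAI-style streaming chunk as an SSE data line."""
    event = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"content": delta_text},
            "finish_reason": None,
        }],
    }
    # SSE frames are "data: <json>" terminated by a blank line.
    return f"data: {json.dumps(event)}\n\n"

line = sse_chunk("Qwen3.5-4B-Instruct", "Hel")
```

The `StreamingResponse` generator would yield one such frame per generated token, followed by a final `data: [DONE]` frame.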

## Model Downloads

All models are pre-downloaded before the pod is created. The `scripts/download_models.sh` script runs as user `llm` and downloads to `/home/llm/.local/share/llmux_pod/models/`.

| Model | Method | Approx Size |
|-------|--------|-------------|
| lovedheart/Qwen3.5-9B-FP8 | huggingface-cli download | ~9GB |
| HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive (Q8_0 + mmproj GGUF) | huggingface-cli download (specific files) | ~10GB |
| Qwen/Qwen3.5-4B | huggingface-cli download | ~8GB |
| openai/gpt-oss-20b | huggingface-cli download | ~13GB |
| aoxo/gpt-oss-20b-uncensored | huggingface-cli download | ~13GB |
| CohereLabs/cohere-transcribe-03-2026 | huggingface-cli download (gated, terms accepted) | ~4GB |
| resemble-ai/chatterbox (3 variants) | per Chatterbox install docs | ~2GB |

Total estimated: ~60GB. The script is idempotent (it skips existing models). A HuggingFace access token is required for gated models (stored at ~/.cache/huggingface/token).
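
The idempotency check can be sketched as follows. Note the directory naming convention (`org--name`) is an assumption for illustration, not necessarily what the actual script uses:

```python
import os

def needs_download(models_root, repo_id):
    """A repo is skipped when its target directory already exists non-empty.

    The "org--name" directory layout here is illustrative only.
    """
    target = os.path.join(models_root, repo_id.replace("/", "--"))
    return not (os.path.isdir(target) and os.listdir(target))
```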

## Open WebUI Configuration

Open WebUI (user `wbg`, port 8080) connects to llmux:

### Connections (Admin > Settings > Connections)

- OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- API Key: the key from api_keys.yaml designated for Open WebUI

### Audio (Admin > Settings > Audio)

- STT Engine: openai
- STT OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- STT Model: cohere-transcribe
- TTS Engine: openai
- TTS OpenAI API Base URL: `http://127.0.0.1:8081/v1`
- TTS Model: Chatterbox-Multilingual
- TTS Voice: to be configured based on Chatterbox options

### User Experience

- The model dropdown lists all 16 virtual models
- Chat works on any model selection (with a potential swap delay on the first request)
- Dictation uses cohere-transcribe
- Audio playback uses Chatterbox
- Voice chat combines ASR, LLM, and TTS

## Traefik Routing

New dynamic config at `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`:

```yaml
http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux

  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"
```

- Routed through the WireGuard VPN entry point
- No Traefik-level auth (llmux handles API key auth)
- DNS setup for kidirekt.kischdle.com is a manual step

## Configuration Files

### config/models.yaml

```yaml
physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true

  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"

  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2

  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2

  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }

  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }

  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }

  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }

  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }

  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox
```

### config/api_keys.yaml

```yaml
api_keys:
  - key: "sk-llmux-openwebui-<generated>"
    name: "Open WebUI"
  - key: "sk-llmux-whisper-<generated>"
    name: "Remote Whisper clients"
  - key: "sk-llmux-opencode-<generated>"
    name: "OpenCode"
```

Keys are generated at deployment time.

## Testing & Verification

### Phase 1: System Integration (iterative; fix issues before proceeding)

1. Container build: the Dockerfile builds successfully and the image contains all dependencies
2. GPU passthrough: the container sees the RTX 5070 Ti (`nvidia-smi` works inside the container)
3. Model mount: the container can read model weights from /models
4. Service startup: llmux starts and port 8081 is reachable from the host
5. Open WebUI connection: the model list populates in Open WebUI
6. Traefik routing: kidirekt.kischdle.com routes to llmux (once DNS is configured)
7. Systemd lifecycle: start/stop/restart works and the service survives a reboot

### Phase 2: Functional Tests

8. Auth: requests without a valid API key receive 401
9. Model listing: GET /v1/models returns all 16 virtual models
10. Chat inference: for each physical LLM, chat via Open WebUI as user "try":
    - Qwen3.5-9B-FP8 (Thinking + Instruct)
    - Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct)
    - Qwen3.5-4B (Thinking + Instruct)
    - GPT-OSS-20B (Low, Medium, High)
    - GPT-OSS-20B-Uncensored (Low, Medium, High)
11. Streaming: chat responses stream token-by-token in Open WebUI
12. ASR: Open WebUI dictation transcribes speech (English and German)
13. TTS: Open WebUI audio playback speaks text
14. Vision: image + text prompt to each vision-capable model:
    - Qwen3.5-4B
    - Qwen3.5-9B-FP8
    - Qwen3.5-9B-FP8-Uncensored
15. Tool usage: verify tool calling for each runtime and tool-capable model:
    - Qwen3.5-9B-FP8 (transformers)
    - Qwen3.5-9B-FP8-Uncensored (llama-cpp-python)
    - GPT-OSS-20B (transformers)
    - GPT-OSS-20B-Uncensored (transformers)

### Phase 3: VRAM Management Tests

16. Small LLM: load Qwen3.5-4B (~4GB); verify ASR and TTS remain loaded (~10GB total)
17. Medium LLM: load Qwen3.5-9B-FP8 (~9GB); verify ASR and TTS remain loaded (~15GB total)
18. Large LLM: load GPT-OSS-20B (~13GB); verify ASR and TTS are evicted. The next ASR request evicts the LLM first.
19. Model swapping: switch between two LLMs; verify the second loads and the first is evicted

### Phase 4: Performance Tests

20. transformers GPU vs CPU: for each transformers-backed physical model, run the same prompt on GPU and CPU and verify the GPU is at least 5x faster. Requires an admin test endpoint or CLI tool to force CPU execution.
    - Qwen3.5-9B-FP8
    - Qwen3.5-4B
    - gpt-oss-20b
    - gpt-oss-20b-uncensored
    - cohere-transcribe
21. llama-cpp-python GPU vs CPU: run inference for Qwen3.5-9B-FP8-Uncensored with n_gpu_layers=-1 (GPU) and n_gpu_layers=0 (CPU); verify the GPU is at least 5x faster. Uses the same admin test endpoint.
22. Chatterbox performance: run TTS synthesis and verify generation time is reasonable relative to the audio duration.
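
The phase-4 acceptance check can be sketched as a small timing harness (function names are illustrative; the real test would call the admin endpoint with the same prompt on each device):

```python
import time

def timed(fn):
    """Return fn's wall-clock duration in seconds."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def meets_speedup(t_gpu_s, t_cpu_s, factor=5.0):
    """Phase-4 acceptance: CPU time must be at least `factor` x GPU time."""
    return t_cpu_s / t_gpu_s >= factor
```

Keeping the threshold check separate from the timing makes the 5x factor trivially adjustable per model.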

## Manual Steps

These require human action and cannot be automated:

- DNS setup for kidirekt.kischdle.com (during implementation)
- HuggingFace terms for cohere-transcribe: accepted 2026-04-03
- HuggingFace token configured at ~/.cache/huggingface/token (done for user tlg; needs setup for user llm during deployment)
- Open WebUI admin configuration (connections, audio settings)