diff --git a/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-requirements.sdoc b/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-requirements.sdoc new file mode 100644 index 0000000..1a18d96 --- /dev/null +++ b/kischdle/llmux/docs/superpowers/specs/2026-04-03-llmux-requirements.sdoc @@ -0,0 +1,630 @@ +[DOCUMENT] +TITLE: llmux Product Requirements +VERSION: 1.0 +DATE: 2026-04-03 + +[TEXT] +STATEMENT: >>> +llmux is a single-process FastAPI application that manages multiple AI models on a single GPU (NVIDIA RTX 5070 Ti, 16GB VRAM). It provides an OpenAI-compatible API for chat completions, speech-to-text, and text-to-speech, serving as the unified AI backend for Open WebUI and external clients on the Kischdle on-premise system. +<<< + +[[SECTION]] +TITLE: System Architecture + +[REQUIREMENT] +UID: LLMUX-ARCH-001 +TITLE: Single process design +STATEMENT: >>> +llmux shall be a monolithic FastAPI application where one Python process handles all model loading/unloading, VRAM management, and inference routing. +<<< +RATIONALE: >>> +Keeps the system simple, easy to debug, and gives full control over GPU memory management. The 16GB VRAM constraint means concurrent model usage is limited anyway. +<<< + +[REQUIREMENT] +UID: LLMUX-ARCH-002 +TITLE: Containerized deployment +STATEMENT: >>> +llmux shall run as a rootless Podman pod (pod name: llmux_pod, container name: llmux_ctr) under the dedicated Linux user llm, managed via systemd user services. +<<< +RATIONALE: >>> +Consistent with the Kischdle microservice architecture where each service runs as a rootless Podman pod under a dedicated user. +<<< + +[REQUIREMENT] +UID: LLMUX-ARCH-003 +TITLE: Base container image +STATEMENT: >>> +llmux shall use pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime as the base container image. +<<< +RATIONALE: >>> +PyTorch 2.7+ with CUDA 12.8+ supports SM12.0 (Blackwell/RTX 5070 Ti). Host driver 590.48 (CUDA 13.1) is backwards compatible. Verified available on Docker Hub. 
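
NOTE (non-normative): the version floor argued above can be checked mechanically. A minimal sketch, assuming the pinned tag format; the parsing helper is illustrative, not part of llmux:

```python
# Check that the pinned base image satisfies the SM 12.0 floor stated in the
# rationale: PyTorch >= 2.7 and CUDA >= 12.8. Tag parsing is illustrative.
IMAGE = "pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime"

def parse_tag(image):
    """Return ((torch version), (cuda version)) parsed from the image tag."""
    tag = image.split(":", 1)[1]              # "2.11.0-cuda12.8-cudnn9-runtime"
    parts = tag.split("-")
    torch_v = tuple(map(int, parts[0].split(".")))
    cuda_v = tuple(map(int, parts[1].removeprefix("cuda").split(".")))
    return torch_v, cuda_v

torch_v, cuda_v = parse_tag(IMAGE)
print(torch_v >= (2, 7), cuda_v >= (12, 8))  # True True
```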
+<<< + +[REQUIREMENT] +UID: LLMUX-ARCH-004 +TITLE: GPU passthrough +STATEMENT: >>> +The container shall have access to the NVIDIA RTX 5070 Ti GPU via NVIDIA CDI (--device nvidia.com/gpu=all). +<<< + +[REQUIREMENT] +UID: LLMUX-ARCH-005 +TITLE: Pod creation script +STATEMENT: >>> +A shell script create_pod_llmux.sh shall be provided that creates the Podman pod and enables it as a systemd service, following the Kischdle shell script pattern (create pod, create container, generate systemd units, enable service). The script shall be installed at /home/llm/bin/create_pod_llmux.sh. +<<< +RELATIONS: +- TYPE: Parent + VALUE: LLMUX-ARCH-002 + +[[/SECTION]] + +[[SECTION]] +TITLE: Inference Runtimes + +[REQUIREMENT] +UID: LLMUX-RT-001 +TITLE: HuggingFace transformers runtime +STATEMENT: >>> +llmux shall use the HuggingFace transformers library (version >= 5.4.0) as the primary runtime for loading and running inference on HuggingFace safetensors models. +<<< +RATIONALE: >>> +vLLM lacks stable support for SM12.0 (RTX Blackwell consumer GPUs). Specifically, NVFP4 MoE kernels fail on SM12.0 (vllm-project/vllm#33416). vLLM can be reconsidered once SM12.0 support matures. +<<< + +[REQUIREMENT] +UID: LLMUX-RT-002 +TITLE: llama-cpp-python runtime +STATEMENT: >>> +llmux shall use the llama-cpp-python library (built with CUDA support) for loading and running inference on GGUF format models. +<<< +RATIONALE: >>> +The Qwen3.5-9B-Uncensored model is distributed in GGUF format and requires a llama.cpp compatible runtime. +<<< + +[REQUIREMENT] +UID: LLMUX-RT-003 +TITLE: Chatterbox runtime +STATEMENT: >>> +llmux shall use the resemble-ai/chatterbox library for text-to-speech inference. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: AI Models + +[[SECTION]] +TITLE: Physical Models + +[REQUIREMENT] +UID: LLMUX-MDL-001 +TITLE: Qwen3.5-9B-FP8 +STATEMENT: >>> +llmux shall support the lovedheart/Qwen3.5-9B-FP8 model via the transformers runtime. 
The model supports vision (image input) and tool/function calling. Estimated VRAM: ~9GB. +<<< + +[REQUIREMENT] +UID: LLMUX-MDL-002 +TITLE: Qwen3.5-9B-FP8-Uncensored +STATEMENT: >>> +llmux shall support the HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive model via the llama-cpp-python runtime, using the files Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf (main model) and mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf (vision encoder). The model supports vision and tool/function calling. Estimated VRAM: ~9GB. +<<< + +[REQUIREMENT] +UID: LLMUX-MDL-003 +TITLE: Qwen3.5-4B +STATEMENT: >>> +llmux shall support the Qwen/Qwen3.5-4B model via the transformers runtime. The model supports vision and tool/function calling. Estimated VRAM: ~4GB. +<<< + +[REQUIREMENT] +UID: LLMUX-MDL-004 +TITLE: gpt-oss-20B +STATEMENT: >>> +llmux shall support the openai/gpt-oss-20b model via the transformers runtime. The model uses MXFP4 quantization on MoE weights and is designed for 16GB VRAM. The model supports tool/function calling but not vision. Estimated VRAM: ~13GB. +<<< + +[REQUIREMENT] +UID: LLMUX-MDL-005 +TITLE: gpt-oss-20B-uncensored +STATEMENT: >>> +llmux shall support the aoxo/gpt-oss-20b-uncensored model via the transformers runtime. The model supports tool/function calling but not vision. Estimated VRAM: ~13GB. +<<< + +[REQUIREMENT] +UID: LLMUX-MDL-006 +TITLE: cohere-transcribe ASR +STATEMENT: >>> +llmux shall support the CohereLabs/cohere-transcribe-03-2026 model via the transformers runtime for automatic speech recognition. The model supports English and German. Estimated VRAM: ~4GB. +<<< + +[REQUIREMENT] +UID: LLMUX-MDL-007 +TITLE: Chatterbox TTS variants +STATEMENT: >>> +llmux shall support three Chatterbox TTS model variants: Chatterbox-Turbo, Chatterbox-Multilingual, and Chatterbox (default). Only one Chatterbox variant shall be loaded in VRAM at a time. Estimated VRAM per variant: ~2GB. 
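
NOTE (non-normative): the single-slot behavior of this requirement can be sketched as follows. Class and variable names are illustrative assumptions, not the implementation:

```python
# Illustrative sketch of the single-slot TTS policy: at most one Chatterbox
# variant is resident, and requesting a different variant replaces it.
class TtsSlot:
    """Holds at most one loaded Chatterbox variant at a time."""

    def __init__(self):
        self.loaded = None  # name of the currently loaded variant, or None

    def ensure(self, variant, load_fn, unload_fn):
        """Make `variant` resident, unloading any other variant first."""
        if self.loaded == variant:
            return  # already resident: zero cost
        if self.loaded is not None:
            unload_fn(self.loaded)  # free ~2GB before loading the next variant
        load_fn(variant)
        self.loaded = variant

slot = TtsSlot()
events = []
slot.ensure("Chatterbox", events.append, lambda v: events.append("unload:" + v))
slot.ensure("Chatterbox-Turbo", events.append, lambda v: events.append("unload:" + v))
print(events)  # ['Chatterbox', 'unload:Chatterbox', 'Chatterbox-Turbo']
```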
+<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Virtual Models + +[REQUIREMENT] +UID: LLMUX-VMDL-001 +TITLE: Virtual model concept +STATEMENT: >>> +llmux shall expose virtual models to API clients. Multiple virtual models may map to the same physical model with different behavior parameters. Switching between virtual models that share a physical model shall have zero VRAM cost. +<<< + +[REQUIREMENT] +UID: LLMUX-VMDL-002 +TITLE: Qwen3.5 Thinking and Instruct variants +STATEMENT: >>> +For each Qwen3.5 physical model (qwen3.5-9b-fp8, qwen3.5-9b-fp8-uncensored, qwen3.5-4b), llmux shall expose two virtual models: one with Thinking enabled (default Qwen3.5 behavior) and one with Instruct mode (enable_thinking=False for direct response). +<<< +RELATIONS: +- TYPE: Parent + VALUE: LLMUX-VMDL-001 + +[REQUIREMENT] +UID: LLMUX-VMDL-003 +TITLE: gpt-oss-20B reasoning level variants +STATEMENT: >>> +For each gpt-oss-20b physical model (gpt-oss-20b, gpt-oss-20b-uncensored), llmux shall expose three virtual models corresponding to reasoning levels Low, Medium, and High, implemented by prepending "Reasoning: low/medium/high" to the system prompt. +<<< +RELATIONS: +- TYPE: Parent + VALUE: LLMUX-VMDL-001 + +[REQUIREMENT] +UID: LLMUX-VMDL-004 +TITLE: Total virtual model count +STATEMENT: >>> +llmux shall expose exactly 16 virtual models: 6 Qwen3.5 variants (3 physical x 2 modes), 6 gpt-oss-20b variants (2 physical x 3 levels), 1 ASR model, and 3 TTS models. +<<< +RELATIONS: +- TYPE: Parent + VALUE: LLMUX-VMDL-001 + +[[/SECTION]] + +[[/SECTION]] + +[[SECTION]] +TITLE: VRAM Management + +[REQUIREMENT] +UID: LLMUX-VRAM-001 +TITLE: No idle timeout +STATEMENT: >>> +Models shall remain loaded in VRAM indefinitely until eviction is required to load another model. There shall be no idle timeout for unloading models. 
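
NOTE (non-normative): a minimal sketch of this retention policy combined with the priority order of LLMUX-VRAM-002, under the reading (consistent with LLMUX-TST-018) that ASR may fall as a last resort when a large LLM needs the space. All names, numbers, and structure are illustrative:

```python
# Nothing is unloaded until a load needs space; then victims are chosen
# lowest-priority first (LLM, then TTS, then ASR as last resort).
PRIORITY = {"llm": 0, "tts": 1, "asr": 2}  # lower value = evicted earlier
BUDGET_GB = 16.0  # RTX 5070 Ti

def eviction_plan(loaded, request_gb):
    """Return models to evict (in order) so `request_gb` fits, or None if impossible.

    `loaded` maps model name -> (type, estimated_vram_gb).
    """
    free = BUDGET_GB - sum(gb for _, gb in loaded.values())
    if free >= request_gb:
        return []  # fits alongside, per LLMUX-VRAM-003: evict nothing
    victims = []
    for name, (_, gb) in sorted(loaded.items(), key=lambda kv: PRIORITY[kv[1][0]]):
        victims.append(name)
        free += gb
        if free >= request_gb:
            return victims
    return None  # cannot satisfy the request even with everything evicted

# LLMUX-TST-018 scenario: a ~13GB LLM forces out everything else.
loaded = {"qwen3.5-4b": ("llm", 4.0),
          "cohere-transcribe": ("asr", 4.0),
          "chatterbox": ("tts", 2.0)}
print(eviction_plan(loaded, 13.0))
# ['qwen3.5-4b', 'chatterbox', 'cohere-transcribe']
```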
+<<< + +[REQUIREMENT] +UID: LLMUX-VRAM-002 +TITLE: Eviction priority order +STATEMENT: >>> +When VRAM is insufficient to load a requested model, llmux shall evict loaded models in the following order (lowest priority evicted first): +1. LLM models (lowest priority, evicted first) +2. TTS models +3. ASR model (highest priority, evicted only as last resort) + +When evicting, llmux shall exhaust lower-priority candidates before evicting a higher-priority model. In particular, llmux shall never evict ASR to make room for TTS (evict the LLM instead), and shall evict ASR for an LLM request only when evicting all lower-priority models still leaves insufficient VRAM. +<<< + +[REQUIREMENT] +UID: LLMUX-VRAM-003 +TITLE: Load alongside if VRAM permits +STATEMENT: >>> +If sufficient VRAM is available, llmux shall load a requested model alongside already-loaded models without evicting any model. +<<< + +[REQUIREMENT] +UID: LLMUX-VRAM-004 +TITLE: One LLM at a time +STATEMENT: >>> +At most one LLM physical model shall be loaded in VRAM at any time. +<<< + +[REQUIREMENT] +UID: LLMUX-VRAM-005 +TITLE: One TTS variant at a time +STATEMENT: >>> +At most one Chatterbox TTS variant shall be loaded in VRAM at any time. Loading a different TTS variant shall unload the current one. +<<< + +[REQUIREMENT] +UID: LLMUX-VRAM-006 +TITLE: Concurrency during model swap +STATEMENT: >>> +An asyncio Lock shall ensure only one load/unload operation at a time. Requests arriving during a model swap shall await the lock. Inference requests shall hold a read-lock on their model to prevent eviction mid-inference. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: API + +[[SECTION]] +TITLE: Endpoints + +[REQUIREMENT] +UID: LLMUX-API-001 +TITLE: Listen address +STATEMENT: >>> +llmux shall listen on 127.0.0.1:8081 for all API traffic. +<<< + +[REQUIREMENT] +UID: LLMUX-API-002 +TITLE: Model listing endpoint +STATEMENT: >>> +llmux shall provide a GET /v1/models endpoint that returns all virtual models in OpenAI format, regardless of which models are currently loaded in VRAM. 
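
NOTE (non-normative): the response shape might look as sketched below. The two model names and the `owned_by` value are illustrative assumptions; the real 16-entry list comes from models.yaml:

```python
# Build an OpenAI-style model list from the virtual model registry. Listing is
# pure metadata and is independent of what is currently loaded in VRAM.
import time

VIRTUAL_MODELS = ["qwen3.5-4b-thinking", "qwen3.5-4b-instruct"]  # excerpt only

def list_models(registry):
    return {
        "object": "list",
        "data": [{"id": name, "object": "model",
                  "created": int(time.time()), "owned_by": "llmux"}
                 for name in registry],
    }

payload = list_models(VIRTUAL_MODELS)
print(payload["object"], [m["id"] for m in payload["data"]])
```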
+<<< + +[REQUIREMENT] +UID: LLMUX-API-003 +TITLE: Chat completions endpoint +STATEMENT: >>> +llmux shall provide a POST /v1/chat/completions endpoint compatible with the OpenAI chat completions API. It shall accept a model parameter matching a virtual model name, support stream: true for SSE streaming, and pass through tool/function calling for models that support it. The virtual-to-physical mapping and behavior modification (thinking toggle, reasoning system prompt) shall be applied transparently. +<<< + +[REQUIREMENT] +UID: LLMUX-API-004 +TITLE: Audio transcription endpoint +STATEMENT: >>> +llmux shall provide a POST /v1/audio/transcriptions endpoint compatible with the OpenAI Whisper API. It shall accept multipart form data with an audio file and model parameter. It shall support the language parameter (default "en", also "de"). Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. +<<< + +[REQUIREMENT] +UID: LLMUX-API-005 +TITLE: Text-to-speech endpoint +STATEMENT: >>> +llmux shall provide a POST /v1/audio/speech endpoint compatible with the OpenAI TTS API. It shall accept JSON with model, input (text), and voice parameters. It shall return audio bytes. +<<< + +[REQUIREMENT] +UID: LLMUX-API-006 +TITLE: Health endpoint +STATEMENT: >>> +llmux shall provide a GET /health endpoint that returns service status and currently loaded models. This endpoint shall not require authentication. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Authentication + +[REQUIREMENT] +UID: LLMUX-AUTH-001 +TITLE: API key authentication +STATEMENT: >>> +All /v1/* endpoints shall require a Bearer token in the Authorization header (Authorization: Bearer <api-key>). Requests without a valid API key shall receive HTTP 401. +<<< + +[REQUIREMENT] +UID: LLMUX-AUTH-002 +TITLE: API key storage +STATEMENT: >>> +API keys shall be stored in a config/api_keys.yaml file mounted read-only into the container. 
Multiple keys shall be supported (one per client: Open WebUI, remote Whisper clients, OpenCode, etc.). Keys shall be generated at deployment time. +<<< + +[REQUIREMENT] +UID: LLMUX-AUTH-003 +TITLE: No Traefik authentication +STATEMENT: >>> +Traefik shall act purely as a router. Authentication shall be handled entirely by llmux via API keys. +<<< + +[[/SECTION]] + +[[/SECTION]] + +[[SECTION]] +TITLE: Configuration + +[REQUIREMENT] +UID: LLMUX-CFG-001 +TITLE: Model registry configuration +STATEMENT: >>> +All physical and virtual model definitions shall be stored in a config/models.yaml file. Physical model entries shall define: type (llm/asr/tts), backend (transformers/llamacpp/chatterbox), model identifier, estimated VRAM in GB, and capability flags (vision, tools). Virtual model entries shall reference a physical model and define behavior parameters. +<<< + +[REQUIREMENT] +UID: LLMUX-CFG-002 +TITLE: Configuration bind mounts +STATEMENT: >>> +Model weights shall be bind-mounted from /home/llm/.local/share/llmux_pod/models/ to /models (read-only). Configuration files shall be bind-mounted from /home/llm/.local/share/llmux_pod/config/ to /config (read-only). +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Model Downloads + +[REQUIREMENT] +UID: LLMUX-DL-001 +TITLE: Pre-download all models +STATEMENT: >>> +All model weights shall be pre-downloaded before the pod is created. A scripts/download_models.sh script shall download all models to /home/llm/.local/share/llmux_pod/models/. The script shall be idempotent (skip existing models). +<<< + +[REQUIREMENT] +UID: LLMUX-DL-002 +TITLE: HuggingFace token requirement +STATEMENT: >>> +The download script shall use a HuggingFace access token (stored at ~/.cache/huggingface/token) for downloading gated models (cohere-transcribe). The token must be configured for user llm during deployment. +<<< + +[REQUIREMENT] +UID: LLMUX-DL-003 +TITLE: Estimated storage +STATEMENT: >>> +Total estimated model storage is ~60GB. 
The host has ~1.3TB free on /home, which is sufficient. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: System Integration + +[[SECTION]] +TITLE: Open WebUI + +[REQUIREMENT] +UID: LLMUX-INT-001 +TITLE: Open WebUI connection +STATEMENT: >>> +Open WebUI (user wbg, port 8080) shall be configured with OpenAI API base URL http://127.0.0.1:8081/v1 and the designated API key from api_keys.yaml. +<<< + +[REQUIREMENT] +UID: LLMUX-INT-002 +TITLE: Open WebUI audio configuration +STATEMENT: >>> +Open WebUI shall be configured with STT engine set to "openai" with base URL http://127.0.0.1:8081/v1 and model "cohere-transcribe", and TTS engine set to "openai" with base URL http://127.0.0.1:8081/v1 and model "Chatterbox-Multilingual". +<<< + +[REQUIREMENT] +UID: LLMUX-INT-003 +TITLE: Model visibility in Open WebUI +STATEMENT: >>> +All 16 virtual models shall be visible in the Open WebUI model dropdown for user selection. Users shall be able to select any model; llmux handles loading/swapping transparently. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Traefik + +[REQUIREMENT] +UID: LLMUX-INT-004 +TITLE: Traefik route for remote access +STATEMENT: >>> +A Traefik dynamic configuration file shall be added at /home/trf/.local/share/traefik_pod/dynamic/llmux.yml, routing the hostname kidirekt.kischdle.com through the WireGuard VPN entry point to http://10.0.2.2:8081. +<<< + +[REQUIREMENT] +UID: LLMUX-INT-005 +TITLE: DNS setup +STATEMENT: >>> +DNS for kidirekt.kischdle.com shall be configured as a manual step during implementation. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Systemd + +[REQUIREMENT] +UID: LLMUX-INT-006 +TITLE: Systemd service lifecycle +STATEMENT: >>> +The llmux pod shall be managed as a systemd user service under user llm. The service shall support start, stop, and restart operations via systemctl --user, and shall survive system reboots. 
+<<< + +[[/SECTION]] + +[[/SECTION]] + +[[SECTION]] +TITLE: Testing and Verification + +[[SECTION]] +TITLE: Phase 1 - System Integration Tests + +[TEXT] +STATEMENT: >>> +System integration tests are iterative: issues are fixed before proceeding to the next phase. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-001 +TITLE: Container build +STATEMENT: >>> +The Dockerfile shall build successfully and the resulting image shall contain all required dependencies (FastAPI, uvicorn, transformers, llama-cpp-python, chatterbox, and supporting libraries). +<<< + +[REQUIREMENT] +UID: LLMUX-TST-002 +TITLE: GPU passthrough verification +STATEMENT: >>> +nvidia-smi shall execute successfully inside the container and report the RTX 5070 Ti GPU. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-003 +TITLE: Model mount verification +STATEMENT: >>> +The container shall be able to read model weight files from the /models bind mount. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-004 +TITLE: Service startup verification +STATEMENT: >>> +llmux shall start inside the pod and port 8081 shall be reachable from the host. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-005 +TITLE: Open WebUI connection verification +STATEMENT: >>> +Open WebUI shall connect to llmux and the model list shall populate with all 16 virtual models. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-006 +TITLE: Traefik routing verification +STATEMENT: >>> +When DNS is configured, kidirekt.kischdle.com shall route to llmux through the WireGuard VPN. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-007 +TITLE: Systemd lifecycle verification +STATEMENT: >>> +systemctl --user start/stop/restart pod-llmux_pod.service shall work cleanly, and the service shall survive reboot. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Phase 2 - Functional Tests + +[REQUIREMENT] +UID: LLMUX-TST-008 +TITLE: Authentication test +STATEMENT: >>> +Requests to /v1/* endpoints without a valid API key shall receive HTTP 401 Unauthorized. 
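
NOTE (non-normative): the key check behind this test might look as sketched below. The key set is hard-coded for illustration only; llmux reads it from api_keys.yaml (LLMUX-AUTH-002):

```python
# Parse an "Authorization: Bearer <key>" header value and compare the token
# against the configured keys in constant time; None maps to HTTP 401.
import hmac

API_KEYS = {"openwebui": "k-example-1", "opencode": "k-example-2"}  # illustrative

def authorize(authorization_header):
    """Return the client name for a valid Bearer token, else None."""
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme != "Bearer" or not token:
        return None
    for client, key in API_KEYS.items():
        if hmac.compare_digest(token, key):
            return client
    return None

print(authorize("Bearer k-example-2"))  # opencode
print(authorize("Bearer wrong"))        # None
```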
+<<< + +[REQUIREMENT] +UID: LLMUX-TST-009 +TITLE: Model listing test +STATEMENT: >>> +GET /v1/models shall return all 16 virtual models in OpenAI format. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-010 +TITLE: Chat inference test +STATEMENT: >>> +For each physical LLM model, a chat request via Open WebUI as user "try" shall produce a reasonable response. This shall be tested for all virtual model variants: Qwen3.5-9B-FP8 (Thinking + Instruct), Qwen3.5-9B-FP8-Uncensored (Thinking + Instruct), Qwen3.5-4B (Thinking + Instruct), GPT-OSS-20B (Low, Medium, High), GPT-OSS-20B-Uncensored (Low, Medium, High). +<<< + +[REQUIREMENT] +UID: LLMUX-TST-011 +TITLE: Streaming test +STATEMENT: >>> +Chat responses shall stream token-by-token in Open WebUI, not be delivered as a single block. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-012 +TITLE: ASR test +STATEMENT: >>> +Open WebUI dictation shall transcribe speech correctly in English and German using cohere-transcribe. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-013 +TITLE: TTS test +STATEMENT: >>> +Open WebUI audio playback shall produce spoken audio from text using Chatterbox. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-014 +TITLE: Vision test +STATEMENT: >>> +An image + text prompt shall produce a correct response for each vision-capable model: Qwen3.5-4B, Qwen3.5-9B-FP8, and Qwen3.5-9B-FP8-Uncensored. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-015 +TITLE: Tool usage test +STATEMENT: >>> +Tool/function calling shall work for each runtime and all tool-capable models: Qwen3.5-9B-FP8 (transformers), Qwen3.5-9B-FP8-Uncensored (llama-cpp-python), GPT-OSS-20B (transformers), GPT-OSS-20B-Uncensored (transformers). +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Phase 3 - VRAM Management Tests + +[REQUIREMENT] +UID: LLMUX-TST-016 +TITLE: Small LLM coexistence test +STATEMENT: >>> +Loading Qwen3.5-4B (~4GB) shall leave ASR and TTS models loaded in VRAM (~10GB total). 
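
NOTE (non-normative): the arithmetic behind the coexistence tests, using the VRAM estimates from the model requirements and the 16GB budget:

```python
# Estimated VRAM per physical model (GB), taken from the model requirements.
EST_GB = {"qwen3.5-4b": 4, "qwen3.5-9b-fp8": 9, "gpt-oss-20b": 13,
          "cohere-transcribe": 4, "chatterbox": 2}

def fits(*names, budget=16):
    """Return (total GB, whether the set fits the VRAM budget)."""
    total = sum(EST_GB[n] for n in names)
    return total, total <= budget

print(fits("qwen3.5-4b", "cohere-transcribe", "chatterbox"))     # (10, True)
print(fits("qwen3.5-9b-fp8", "cohere-transcribe", "chatterbox")) # (15, True)
print(fits("gpt-oss-20b", "cohere-transcribe", "chatterbox"))    # (19, False)
```

The last line is why LLMUX-TST-018 expects eviction: the large LLM cannot coexist with ASR and TTS.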
+<<< + +[REQUIREMENT] +UID: LLMUX-TST-017 +TITLE: Medium LLM coexistence test +STATEMENT: >>> +Loading Qwen3.5-9B-FP8 (~9GB) shall leave ASR and TTS models loaded in VRAM (~15GB total). +<<< + +[REQUIREMENT] +UID: LLMUX-TST-018 +TITLE: Large LLM eviction test +STATEMENT: >>> +Loading GPT-OSS-20B (~13GB) shall evict ASR and TTS from VRAM. A subsequent ASR request shall evict the LLM first (not attempt to fit alongside it). +<<< + +[REQUIREMENT] +UID: LLMUX-TST-019 +TITLE: Model swapping test +STATEMENT: >>> +Switching between two LLMs in Open WebUI shall result in the second model loading and the first being evicted. +<<< + +[[/SECTION]] + +[[SECTION]] +TITLE: Phase 4 - Performance Tests + +[REQUIREMENT] +UID: LLMUX-TST-020 +TITLE: Transformers GPU vs CPU performance test +STATEMENT: >>> +For each transformers-backed physical model (Qwen3.5-9B-FP8, Qwen3.5-4B, gpt-oss-20b, gpt-oss-20b-uncensored, cohere-transcribe), running the same inference on GPU shall be at least 5x faster than on CPU. An admin test endpoint or CLI tool shall be provided to force CPU execution for this test. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-021 +TITLE: llama-cpp-python GPU vs CPU performance test +STATEMENT: >>> +For Qwen3.5-9B-FP8-Uncensored, running inference with n_gpu_layers=-1 (GPU) shall be at least 5x faster than with n_gpu_layers=0 (CPU). The same admin test endpoint shall support this. +<<< + +[REQUIREMENT] +UID: LLMUX-TST-022 +TITLE: Chatterbox performance test +STATEMENT: >>> +TTS synthesis duration shall be reasonable relative to the duration of the generated audio output. 
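
NOTE (non-normative): "reasonable" can be made measurable via the real-time factor (RTF), synthesis wall-clock time divided by the duration of the produced audio. The threshold of 1.0 (faster than real time) is an assumption, not part of this requirement:

```python
# Real-time factor: < 1.0 means synthesis is faster than playback.
def real_time_factor(synthesis_seconds, audio_seconds):
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(3.2, 8.0)  # e.g. 3.2s to synthesize 8s of speech
print(round(rtf, 2), rtf < 1.0)   # 0.4 True
```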
+<<< + +[[/SECTION]] + +[[/SECTION]] + +[[SECTION]] +TITLE: Manual Steps + +[TEXT] +STATEMENT: >>> +The following steps require human action and cannot be automated: + +- DNS setup for kidirekt.kischdle.com (during implementation) +- HuggingFace terms for cohere-transcribe: accepted 2026-04-03 +- HuggingFace token configuration for user llm during deployment +- Open WebUI admin configuration (connections, audio settings) +<<< + +[[/SECTION]]