Covers project scaffolding, config, auth, VRAM manager, all four backends, API routes, Dockerfile, deployment scripts, and four phases of testing (integration, functional, VRAM, performance). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# llmux Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a FastAPI service that manages 9 AI models on a single GPU, exposing an OpenAI-compatible API for chat, ASR, and TTS to Open WebUI and external clients.

**Architecture:** Single-process Python app with three inference runtimes (transformers, llama-cpp-python, chatterbox), a VRAM manager that handles model loading/eviction by priority (ASR > TTS > LLM), and Bearer token auth. Runs in a rootless Podman container with GPU passthrough.
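For orientation, the request shape that "OpenAI-compatible" implies for the chat endpoint — a sketch only, with a hypothetical virtual model name and a placeholder key; no network call is made:

```python
import json

# Body of a POST /v1/chat/completions request, as Open WebUI or any
# OpenAI-compatible client would send it to llmux.
payload = {
    "model": "Qwen3.5-4B-Instruct",  # a virtual model name from models.yaml
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
# Bearer token auth; placeholder key (real keys are generated in Task 1, Step 3).
headers = {"Authorization": "Bearer sk-llmux-openwebui-PLACEHOLDER"}

body = json.dumps(payload)
```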
**Tech Stack:** Python 3.11+, FastAPI, uvicorn, PyTorch, transformers >=5.4.0, llama-cpp-python, chatterbox, PyYAML, Podman, systemd

**Spec:** `docs/superpowers/specs/2026-04-03-llmux-design.md`

---
## File Map
| File | Responsibility |
|------|---------------|
| `llmux/llmux/__init__.py` | Package marker |
| `llmux/llmux/main.py` | FastAPI app, startup/shutdown, /health endpoint |
| `llmux/llmux/auth.py` | API key validation dependency |
| `llmux/llmux/config.py` | Load and validate YAML config files |
| `llmux/llmux/model_registry.py` | Virtual→physical model mapping, behavior params |
| `llmux/llmux/vram_manager.py` | VRAM tracking, load/unload orchestration, eviction |
| `llmux/llmux/routes/__init__.py` | Package marker |
| `llmux/llmux/routes/models.py` | GET /v1/models |
| `llmux/llmux/routes/chat.py` | POST /v1/chat/completions |
| `llmux/llmux/routes/transcription.py` | POST /v1/audio/transcriptions |
| `llmux/llmux/routes/speech.py` | POST /v1/audio/speech |
| `llmux/llmux/routes/admin.py` | POST /admin/test/performance (test-only) |
| `llmux/llmux/backends/__init__.py` | Package marker |
| `llmux/llmux/backends/base.py` | Abstract base class for all backends |
| `llmux/llmux/backends/transformers_llm.py` | HuggingFace transformers for LLM chat + vision + tools |
| `llmux/llmux/backends/transformers_asr.py` | HuggingFace transformers for cohere-transcribe ASR |
| `llmux/llmux/backends/llamacpp.py` | llama-cpp-python for GGUF models |
| `llmux/llmux/backends/chatterbox_tts.py` | Chatterbox TTS |
| `llmux/tests/__init__.py` | Package marker |
| `llmux/tests/test_config.py` | Tests for config loading |
| `llmux/tests/test_auth.py` | Tests for API key auth |
| `llmux/tests/test_model_registry.py` | Tests for virtual→physical mapping |
| `llmux/tests/test_vram_manager.py` | Tests for VRAM eviction logic |
| `llmux/tests/test_routes.py` | Tests for API routes with mocked backends |
| `llmux/tests/conftest.py` | Shared pytest fixtures |
| `llmux/Dockerfile` | Container image definition |
| `llmux/requirements.txt` | Python dependencies |
| `llmux/config/models.yaml` | Model registry config |
| `llmux/config/api_keys.yaml` | API key config |
| `llmux/scripts/download_models.sh` | Pre-download model weights |
| `llmux/scripts/create_pod_llmux.sh` | Podman pod creation + systemd setup |
---
### Task 1: Project Scaffolding

**Files:**

- Create: `llmux/requirements.txt`
- Create: `llmux/config/models.yaml`
- Create: `llmux/config/api_keys.yaml`
- Create: `llmux/llmux/__init__.py`
- Create: `llmux/llmux/routes/__init__.py`
- Create: `llmux/llmux/backends/__init__.py`
- Create: `llmux/tests/__init__.py`
- Create: `llmux/tests/conftest.py`

- [ ] **Step 1: Create requirements.txt**

```
# Web framework
fastapi>=0.115.0
uvicorn[standard]>=0.34.0
python-multipart>=0.0.18

# AI runtimes
torch>=2.7.0
transformers>=5.4.0
llama-cpp-python>=0.3.0
chatterbox-tts>=0.1.0

# Audio processing
soundfile>=0.12.0
librosa>=0.10.0

# Config & utilities
pyyaml>=6.0
sentencepiece>=0.2.0
protobuf>=5.0.0

# Testing
pytest>=8.0.0
pytest-asyncio>=0.24.0
httpx>=0.28.0
```

- [ ] **Step 2: Create config/models.yaml**

Copy the exact YAML from the spec (section "Configuration Files > config/models.yaml"). This is the full model registry with all 9 physical models and 16 virtual models.

```yaml
physical_models:
  qwen3.5-9b-fp8:
    type: llm
    backend: transformers
    model_id: "lovedheart/Qwen3.5-9B-FP8"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-9b-fp8-uncensored:
    type: llm
    backend: llamacpp
    model_file: "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    mmproj_file: "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"
    estimated_vram_gb: 9
    supports_vision: true
    supports_tools: true

  qwen3.5-4b:
    type: llm
    backend: transformers
    model_id: "Qwen/Qwen3.5-4B"
    estimated_vram_gb: 4
    supports_vision: true
    supports_tools: true

  gpt-oss-20b:
    type: llm
    backend: transformers
    model_id: "openai/gpt-oss-20b"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  gpt-oss-20b-uncensored:
    type: llm
    backend: transformers
    model_id: "aoxo/gpt-oss-20b-uncensored"
    estimated_vram_gb: 13
    supports_vision: false
    supports_tools: true

  cohere-transcribe:
    type: asr
    backend: transformers
    model_id: "CohereLabs/cohere-transcribe-03-2026"
    estimated_vram_gb: 4
    default_language: "en"

  chatterbox-turbo:
    type: tts
    backend: chatterbox
    variant: "turbo"
    estimated_vram_gb: 2

  chatterbox-multilingual:
    type: tts
    backend: chatterbox
    variant: "multilingual"
    estimated_vram_gb: 2

  chatterbox:
    type: tts
    backend: chatterbox
    variant: "default"
    estimated_vram_gb: 2

virtual_models:
  Qwen3.5-9B-FP8-Thinking:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Instruct:
    physical: qwen3.5-9b-fp8
    params: { enable_thinking: false }

  Qwen3.5-9B-FP8-Uncensored-Thinking:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: true }
  Qwen3.5-9B-FP8-Uncensored-Instruct:
    physical: qwen3.5-9b-fp8-uncensored
    params: { enable_thinking: false }

  Qwen3.5-4B-Thinking:
    physical: qwen3.5-4b
    params: { enable_thinking: true }
  Qwen3.5-4B-Instruct:
    physical: qwen3.5-4b
    params: { enable_thinking: false }

  GPT-OSS-20B-Low:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Medium:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-High:
    physical: gpt-oss-20b
    params: { system_prompt_prefix: "Reasoning: high" }

  GPT-OSS-20B-Uncensored-Low:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: low" }
  GPT-OSS-20B-Uncensored-Medium:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: medium" }
  GPT-OSS-20B-Uncensored-High:
    physical: gpt-oss-20b-uncensored
    params: { system_prompt_prefix: "Reasoning: high" }

  cohere-transcribe:
    physical: cohere-transcribe
  Chatterbox-Turbo:
    physical: chatterbox-turbo
  Chatterbox-Multilingual:
    physical: chatterbox-multilingual
  Chatterbox:
    physical: chatterbox
```

- [ ] **Step 3: Create config/api_keys.yaml with generated keys**

Generate three real keys and write the file:

```python
import secrets

keys = {
    "Open WebUI": f"sk-llmux-openwebui-{secrets.token_urlsafe(32)}",
    "Remote Whisper clients": f"sk-llmux-whisper-{secrets.token_urlsafe(32)}",
    "OpenCode": f"sk-llmux-opencode-{secrets.token_urlsafe(32)}",
}
```

```yaml
api_keys:
  - key: "<generated-openwebui-key>"
    name: "Open WebUI"
  - key: "<generated-whisper-key>"
    name: "Remote Whisper clients"
  - key: "<generated-opencode-key>"
    name: "OpenCode"
```

- [ ] **Step 4: Create package __init__.py files and conftest.py**

`llmux/llmux/__init__.py`, `llmux/llmux/routes/__init__.py`, `llmux/llmux/backends/__init__.py`, `llmux/tests/__init__.py` — all empty files.

`llmux/tests/conftest.py`:

```python
import pytest
from pathlib import Path


# Point config to the project's config directory for tests
@pytest.fixture(autouse=True)
def set_config_dir(monkeypatch):
    """Use the project's config files for tests by default."""
    config_dir = Path(__file__).parent.parent / "config"
    monkeypatch.setenv("LLMUX_CONFIG_DIR", str(config_dir))
    return config_dir
```

- [ ] **Step 5: Commit**

```bash
git add llmux/requirements.txt llmux/config/ llmux/llmux/__init__.py \
    llmux/llmux/routes/__init__.py llmux/llmux/backends/__init__.py \
    llmux/tests/__init__.py llmux/tests/conftest.py
git commit -m "feat: project scaffolding with config files and test fixtures"
```
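Step 3's snippet can be expanded into a small standalone generator. A sketch only: the client names and `sk-llmux-*` prefixes match the plan, but the YAML emission is hand-rolled here to stay dependency-free:

```python
import secrets


def generate_api_keys_yaml() -> str:
    """Emit api_keys.yaml content with one fresh key per known client."""
    clients = [
        ("openwebui", "Open WebUI"),
        ("whisper", "Remote Whisper clients"),
        ("opencode", "OpenCode"),
    ]
    lines = ["api_keys:"]
    for slug, name in clients:
        key = f"sk-llmux-{slug}-{secrets.token_urlsafe(32)}"
        lines.append(f'  - key: "{key}"')
        lines.append(f'    name: "{name}"')
    return "\n".join(lines) + "\n"


yaml_text = generate_api_keys_yaml()
print(yaml_text)
```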
---
### Task 2: Config Loading

**Files:**

- Create: `llmux/llmux/config.py`
- Create: `llmux/tests/test_config.py`

- [ ] **Step 1: Write the failing tests**

`llmux/tests/test_config.py`:

```python
from llmux.config import load_models_config, load_api_keys


def test_load_models_config_returns_physical_and_virtual():
    physical, virtual = load_models_config()
    assert isinstance(physical, dict)
    assert isinstance(virtual, dict)
    assert len(physical) == 9
    assert len(virtual) == 16


def test_physical_model_has_required_fields():
    physical, _ = load_models_config()
    qwen = physical["qwen3.5-9b-fp8"]
    assert qwen.type == "llm"
    assert qwen.backend == "transformers"
    assert qwen.model_id == "lovedheart/Qwen3.5-9B-FP8"
    assert qwen.estimated_vram_gb == 9
    assert qwen.supports_vision is True
    assert qwen.supports_tools is True


def test_physical_model_llamacpp_has_gguf_fields():
    physical, _ = load_models_config()
    uncensored = physical["qwen3.5-9b-fp8-uncensored"]
    assert uncensored.backend == "llamacpp"
    assert uncensored.model_file == "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf"
    assert uncensored.mmproj_file == "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"


def test_virtual_model_maps_to_physical():
    _, virtual = load_models_config()
    thinking = virtual["Qwen3.5-9B-FP8-Thinking"]
    assert thinking.physical == "qwen3.5-9b-fp8"
    assert thinking.params == {"enable_thinking": True}


def test_virtual_model_gpt_oss_has_system_prompt():
    _, virtual = load_models_config()
    low = virtual["GPT-OSS-20B-Low"]
    assert low.physical == "gpt-oss-20b"
    assert low.params == {"system_prompt_prefix": "Reasoning: low"}


def test_virtual_model_without_params():
    _, virtual = load_models_config()
    ct = virtual["cohere-transcribe"]
    assert ct.physical == "cohere-transcribe"
    assert ct.params == {}


def test_load_api_keys():
    keys = load_api_keys()
    assert len(keys) == 3
    assert all(k.key.startswith("sk-llmux-") for k in keys)
    assert {k.name for k in keys} == {"Open WebUI", "Remote Whisper clients", "OpenCode"}
```

- [ ] **Step 2: Run tests to verify they fail**

Run: `cd llmux && python -m pytest tests/test_config.py -v`

Expected: FAIL — `ModuleNotFoundError: No module named 'llmux.config'`

- [ ] **Step 3: Implement config.py**

`llmux/llmux/config.py`:

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

import yaml


def _config_dir() -> Path:
    return Path(os.environ.get("LLMUX_CONFIG_DIR", "/config"))


@dataclass
class PhysicalModel:
    type: str  # "llm", "asr", "tts"
    backend: str  # "transformers", "llamacpp", "chatterbox"
    estimated_vram_gb: float
    model_id: str = ""
    model_file: str = ""
    mmproj_file: str = ""
    supports_vision: bool = False
    supports_tools: bool = False
    default_language: str = ""
    variant: str = ""


@dataclass
class VirtualModel:
    physical: str
    params: dict = field(default_factory=dict)


@dataclass
class ApiKey:
    key: str
    name: str


def load_models_config(
    config_path: Path | None = None,
) -> tuple[dict[str, PhysicalModel], dict[str, VirtualModel]]:
    if config_path is None:
        config_path = _config_dir() / "models.yaml"

    with open(config_path) as f:
        raw = yaml.safe_load(f)

    physical: dict[str, PhysicalModel] = {}
    for model_id, attrs in raw["physical_models"].items():
        physical[model_id] = PhysicalModel(
            type=attrs["type"],
            backend=attrs["backend"],
            estimated_vram_gb=attrs["estimated_vram_gb"],
            model_id=attrs.get("model_id", ""),
            model_file=attrs.get("model_file", ""),
            mmproj_file=attrs.get("mmproj_file", ""),
            supports_vision=attrs.get("supports_vision", False),
            supports_tools=attrs.get("supports_tools", False),
            default_language=attrs.get("default_language", ""),
            variant=attrs.get("variant", ""),
        )

    virtual: dict[str, VirtualModel] = {}
    for model_name, attrs in raw["virtual_models"].items():
        virtual[model_name] = VirtualModel(
            physical=attrs["physical"],
            params=attrs.get("params", {}),
        )

    return physical, virtual


def load_api_keys(config_path: Path | None = None) -> list[ApiKey]:
    if config_path is None:
        config_path = _config_dir() / "api_keys.yaml"

    with open(config_path) as f:
        raw = yaml.safe_load(f)

    return [ApiKey(key=entry["key"], name=entry["name"]) for entry in raw["api_keys"]]
```

- [ ] **Step 4: Run tests to verify they pass**

Run: `cd llmux && python -m pytest tests/test_config.py -v`

Expected: all 7 tests PASS

- [ ] **Step 5: Commit**

```bash
git add llmux/llmux/config.py llmux/tests/test_config.py
git commit -m "feat: config loading for models.yaml and api_keys.yaml"
```
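The `params` defaulting that `test_virtual_model_without_params` exercises can be seen in isolation. A minimal sketch, with plain dicts standing in for what `yaml.safe_load` would produce: entries without a `params` key (like `cohere-transcribe`) must come out as an empty dict, not `None`:

```python
from dataclasses import dataclass, field


@dataclass
class VirtualModel:
    physical: str
    params: dict = field(default_factory=dict)


# Raw mappings as the YAML loader would produce them.
raw = {
    "Qwen3.5-4B-Thinking": {"physical": "qwen3.5-4b", "params": {"enable_thinking": True}},
    "cohere-transcribe": {"physical": "cohere-transcribe"},  # no params key
}

# Same .get("params", {}) pattern used by load_models_config.
virtual = {
    name: VirtualModel(physical=a["physical"], params=a.get("params", {}))
    for name, a in raw.items()
}

assert virtual["cohere-transcribe"].params == {}
assert virtual["Qwen3.5-4B-Thinking"].params == {"enable_thinking": True}
```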
---
### Task 3: API Key Authentication

**Files:**

- Create: `llmux/llmux/auth.py`
- Create: `llmux/tests/test_auth.py`

- [ ] **Step 1: Write the failing tests**

`llmux/tests/test_auth.py`:

```python
import pytest
from fastapi import FastAPI, Depends
from fastapi.testclient import TestClient

from llmux.auth import create_api_key_dependency
from llmux.config import ApiKey


@pytest.fixture
def app_with_auth():
    keys = [
        ApiKey(key="sk-test-valid-key", name="Test"),
        ApiKey(key="sk-test-another-key", name="Another"),
    ]
    require_api_key = create_api_key_dependency(keys)

    app = FastAPI()

    @app.get("/protected")
    def protected(api_key: str = Depends(require_api_key)):
        return {"key_name": api_key}

    return app


@pytest.fixture
def client(app_with_auth):
    return TestClient(app_with_auth)


def test_valid_key_returns_200(client):
    resp = client.get("/protected", headers={"Authorization": "Bearer sk-test-valid-key"})
    assert resp.status_code == 200
    assert resp.json()["key_name"] == "Test"


def test_another_valid_key(client):
    resp = client.get("/protected", headers={"Authorization": "Bearer sk-test-another-key"})
    assert resp.status_code == 200
    assert resp.json()["key_name"] == "Another"


def test_missing_auth_header_returns_401(client):
    resp = client.get("/protected")
    assert resp.status_code == 401


def test_invalid_key_returns_401(client):
    resp = client.get("/protected", headers={"Authorization": "Bearer sk-wrong"})
    assert resp.status_code == 401


def test_malformed_header_returns_401(client):
    resp = client.get("/protected", headers={"Authorization": "sk-test-valid-key"})
    assert resp.status_code == 401
```

- [ ] **Step 2: Run tests to verify they fail**

Run: `cd llmux && python -m pytest tests/test_auth.py -v`

Expected: FAIL — `ModuleNotFoundError: No module named 'llmux.auth'`

- [ ] **Step 3: Implement auth.py**

`llmux/llmux/auth.py`:

```python
from fastapi import HTTPException, Request

from llmux.config import ApiKey


def create_api_key_dependency(api_keys: list[ApiKey]):
    key_to_name = {k.key: k.name for k in api_keys}

    async def require_api_key(request: Request) -> str:
        auth = request.headers.get("Authorization", "")
        if not auth.startswith("Bearer "):
            raise HTTPException(status_code=401, detail="Missing or malformed Authorization header")
        token = auth[7:]
        name = key_to_name.get(token)
        if name is None:
            raise HTTPException(status_code=401, detail="Invalid API key")
        return name

    return require_api_key
```

- [ ] **Step 4: Run tests to verify they pass**

Run: `cd llmux && python -m pytest tests/test_auth.py -v`

Expected: all 5 tests PASS

- [ ] **Step 5: Commit**

```bash
git add llmux/llmux/auth.py llmux/tests/test_auth.py
git commit -m "feat: API key authentication dependency"
```
---
### Task 4: Model Registry

**Files:**

- Create: `llmux/llmux/model_registry.py`
- Create: `llmux/tests/test_model_registry.py`

- [ ] **Step 1: Write the failing tests**

`llmux/tests/test_model_registry.py`:

```python
import pytest

from llmux.model_registry import ModelRegistry


@pytest.fixture
def registry():
    return ModelRegistry.from_config()


def test_list_virtual_models(registry):
    models = registry.list_virtual_models()
    assert len(models) == 16
    names = [m["id"] for m in models]
    assert "Qwen3.5-9B-FP8-Thinking" in names
    assert "GPT-OSS-20B-High" in names
    assert "cohere-transcribe" in names
    assert "Chatterbox-Multilingual" in names


def test_virtual_model_openai_format(registry):
    models = registry.list_virtual_models()
    m = next(m for m in models if m["id"] == "Qwen3.5-9B-FP8-Thinking")
    assert m["object"] == "model"
    assert m["owned_by"] == "llmux"


def test_resolve_virtual_to_physical(registry):
    physical_id, physical, params = registry.resolve("Qwen3.5-9B-FP8-Thinking")
    assert physical_id == "qwen3.5-9b-fp8"
    assert physical.backend == "transformers"
    assert params == {"enable_thinking": True}


def test_resolve_instruct_variant(registry):
    physical_id, physical, params = registry.resolve("Qwen3.5-9B-FP8-Instruct")
    assert physical_id == "qwen3.5-9b-fp8"
    assert params == {"enable_thinking": False}


def test_resolve_gpt_oss_reasoning(registry):
    physical_id, physical, params = registry.resolve("GPT-OSS-20B-Medium")
    assert physical_id == "gpt-oss-20b"
    assert params == {"system_prompt_prefix": "Reasoning: medium"}


def test_resolve_same_physical_for_variants(registry):
    pid1, _, _ = registry.resolve("Qwen3.5-9B-FP8-Thinking")
    pid2, _, _ = registry.resolve("Qwen3.5-9B-FP8-Instruct")
    assert pid1 == pid2


def test_resolve_unknown_model_raises(registry):
    with pytest.raises(KeyError):
        registry.resolve("nonexistent-model")


def test_get_physical(registry):
    physical = registry.get_physical("qwen3.5-9b-fp8")
    assert physical.type == "llm"
    assert physical.estimated_vram_gb == 9


def test_get_physical_unknown_raises(registry):
    with pytest.raises(KeyError):
        registry.get_physical("nonexistent")
```

- [ ] **Step 2: Run tests to verify they fail**

Run: `cd llmux && python -m pytest tests/test_model_registry.py -v`

Expected: FAIL — `ModuleNotFoundError: No module named 'llmux.model_registry'`

- [ ] **Step 3: Implement model_registry.py**

`llmux/llmux/model_registry.py`:

```python
from llmux.config import PhysicalModel, VirtualModel, load_models_config


class ModelRegistry:
    def __init__(
        self,
        physical: dict[str, PhysicalModel],
        virtual: dict[str, VirtualModel],
    ):
        self._physical = physical
        self._virtual = virtual

    @classmethod
    def from_config(cls) -> "ModelRegistry":
        physical, virtual = load_models_config()
        return cls(physical, virtual)

    def list_virtual_models(self) -> list[dict]:
        return [
            {
                "id": name,
                "object": "model",
                "created": 0,
                "owned_by": "llmux",
            }
            for name in self._virtual
        ]

    def resolve(self, virtual_name: str) -> tuple[str, PhysicalModel, dict]:
        """Resolve a virtual model name to (physical_id, PhysicalModel, params)."""
        vm = self._virtual[virtual_name]  # raises KeyError if unknown
        pm = self._physical[vm.physical]
        return vm.physical, pm, dict(vm.params)

    def get_physical(self, physical_id: str) -> PhysicalModel:
        return self._physical[physical_id]  # raises KeyError if unknown
```

- [ ] **Step 4: Run tests to verify they pass**

Run: `cd llmux && python -m pytest tests/test_model_registry.py -v`

Expected: all 9 tests PASS

- [ ] **Step 5: Commit**

```bash
git add llmux/llmux/model_registry.py llmux/tests/test_model_registry.py
git commit -m "feat: model registry with virtual-to-physical resolution"
```
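The key property `resolve` gives the rest of the system — many virtual names, one physical load — in a minimal sketch, with plain tuples standing in for `VirtualModel`/`PhysicalModel` (names taken from models.yaml):

```python
# Plain-dict stand-in for the registry's virtual table.
virtual = {
    "Qwen3.5-9B-FP8-Thinking": ("qwen3.5-9b-fp8", {"enable_thinking": True}),
    "Qwen3.5-9B-FP8-Instruct": ("qwen3.5-9b-fp8", {"enable_thinking": False}),
}


def resolve(name: str) -> tuple[str, dict]:
    physical_id, params = virtual[name]  # KeyError for unknown names, as in the registry
    return physical_id, dict(params)     # copy so callers can't mutate the config


pid1, p1 = resolve("Qwen3.5-9B-FP8-Thinking")
pid2, p2 = resolve("Qwen3.5-9B-FP8-Instruct")
assert pid1 == pid2 == "qwen3.5-9b-fp8"  # variants share one loaded model
assert p1 != p2                          # but request-time behavior differs
```

This is why the VRAM manager in Task 5 keys everything by physical id: switching between Thinking and Instruct variants never triggers a load or eviction.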
---
|
|
|
|
### Task 5: VRAM Manager
|
|
|
|
**Files:**
|
|
- Create: `llmux/llmux/vram_manager.py`
|
|
- Create: `llmux/tests/test_vram_manager.py`
|
|
|
|
- [ ] **Step 1: Write the failing tests**
|
|
|
|
`llmux/tests/test_vram_manager.py`:
|
|
|
|
```python
|
|
import asyncio
|
|
import pytest
|
|
|
|
from llmux.vram_manager import VRAMManager, ModelSlot
|
|
|
|
|
|
class FakeBackend:
|
|
"""Simulates a backend that tracks load/unload calls."""
|
|
|
|
def __init__(self):
|
|
self.loaded = {} # model_id -> True
|
|
self.load_count = 0
|
|
self.unload_count = 0
|
|
|
|
async def load(self, model_id: str):
|
|
self.loaded[model_id] = True
|
|
self.load_count += 1
|
|
|
|
async def unload(self, model_id: str):
|
|
self.loaded.pop(model_id, None)
|
|
self.unload_count += 1
|
|
|
|
|
|
@pytest.fixture
|
|
def manager():
|
|
return VRAMManager(total_vram_gb=16.0)
|
|
|
|
|
|
# --- Priority ordering ---
|
|
|
|
def test_priority_ordering():
|
|
assert ModelSlot.priority_rank("llm") == 0
|
|
assert ModelSlot.priority_rank("tts") == 1
|
|
assert ModelSlot.priority_rank("asr") == 2
|
|
|
|
|
|
# --- Loading into empty VRAM ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_load_into_empty_vram(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
assert manager.is_loaded("qwen3.5-4b")
|
|
assert manager.available_vram_gb == pytest.approx(12.0)
|
|
|
|
|
|
# --- Loading alongside existing ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_load_alongside_when_fits(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
assert manager.is_loaded("cohere-transcribe")
|
|
assert manager.is_loaded("qwen3.5-4b")
|
|
assert manager.available_vram_gb == pytest.approx(8.0)
|
|
|
|
|
|
# --- Eviction: LLM evicted first ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_evict_llm_first(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("chatterbox-multilingual", model_type="tts", vram_gb=2.0, backend=backend)
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
# 10 GB used. Loading 9B (9GB) requires 19GB. Must evict.
|
|
await manager.load_model("qwen3.5-9b-fp8", model_type="llm", vram_gb=9.0, backend=backend)
|
|
# LLM (4B) evicted first. ASR+TTS+9B = 4+2+9 = 15GB, fits.
|
|
assert not manager.is_loaded("qwen3.5-4b")
|
|
assert manager.is_loaded("cohere-transcribe")
|
|
assert manager.is_loaded("chatterbox-multilingual")
|
|
assert manager.is_loaded("qwen3.5-9b-fp8")
|
|
|
|
|
|
# --- Eviction cascade: LLM then TTS then ASR ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_evict_cascade_for_large_llm(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("chatterbox-multilingual", model_type="tts", vram_gb=2.0, backend=backend)
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
# 10 GB used. Loading gpt-oss-20b (13GB). Need to free a lot.
|
|
await manager.load_model("gpt-oss-20b", model_type="llm", vram_gb=13.0, backend=backend)
|
|
# Evict LLM (4GB free=6), then TTS (free=8), then ASR (free=12).
|
|
# Actually: after evicting LLM, free=12. 12 < 13. Evict TTS, free=14. 14 >= 13. Load.
|
|
assert not manager.is_loaded("qwen3.5-4b")
|
|
assert not manager.is_loaded("chatterbox-multilingual")
|
|
assert manager.is_loaded("cohere-transcribe") # ASR not evicted if not needed
|
|
assert manager.is_loaded("gpt-oss-20b")
|
|
|
|
|
|
# --- Eviction: never evict higher priority for lower ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_never_evict_asr_for_tts(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("gpt-oss-20b", model_type="llm", vram_gb=13.0, backend=backend)
|
|
# 17GB > 16GB? No — we loaded into 16GB, so gpt-oss-20b alone uses 13GB.
|
|
# Wait, this test needs adjustment. Let's load ASR alone (4GB), then try TTS.
|
|
# Actually the scenario: ASR (4) + large LLM (13) = 17 > 16.
|
|
# Loading LLM should evict... nothing higher, but LLM can't fit alongside ASR.
|
|
# The LLM IS the thing being loaded, so we evict nothing of lower priority.
|
|
# Actually we need to think about this differently. Let's redo.
|
|
pass
|
|
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_asr_evicts_llm_not_reversed(manager):
|
|
"""When ASR request arrives and LLM is loaded, evict LLM (lower priority)."""
|
|
backend = FakeBackend()
|
|
await manager.load_model("gpt-oss-20b", model_type="llm", vram_gb=13.0, backend=backend)
|
|
# 13GB used, 3GB free. ASR needs 4GB. Must evict LLM.
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
assert not manager.is_loaded("gpt-oss-20b")
|
|
assert manager.is_loaded("cohere-transcribe")
|
|
|
|
|
|
# --- Already loaded ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_already_loaded_is_noop(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
assert backend.load_count == 1 # only loaded once
|
|
|
|
|
|
# --- Scenario from spec: ASR + TTS + 4B, switch to 9B ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_spec_scenario_switch_to_9b(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("chatterbox-multilingual", model_type="tts", vram_gb=2.0, backend=backend)
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
# Switch to 9B. Evict LLM (4B). ASR+TTS+9B = 15GB, fits.
|
|
await manager.load_model("qwen3.5-9b-fp8", model_type="llm", vram_gb=9.0, backend=backend)
|
|
assert manager.is_loaded("cohere-transcribe")
|
|
assert manager.is_loaded("chatterbox-multilingual")
|
|
assert manager.is_loaded("qwen3.5-9b-fp8")
|
|
assert not manager.is_loaded("qwen3.5-4b")
|
|
assert manager.available_vram_gb == pytest.approx(1.0)
|
|
|
|
|
|
# --- get_loaded_models ---
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_get_loaded_models(manager):
|
|
backend = FakeBackend()
|
|
await manager.load_model("cohere-transcribe", model_type="asr", vram_gb=4.0, backend=backend)
|
|
await manager.load_model("qwen3.5-4b", model_type="llm", vram_gb=4.0, backend=backend)
|
|
loaded = manager.get_loaded_models()
|
|
assert set(loaded.keys()) == {"cohere-transcribe", "qwen3.5-4b"}
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify they fail**
|
|
|
|
Run: `cd llmux && python -m pytest tests/test_vram_manager.py -v`
|
|
Expected: FAIL — `ModuleNotFoundError: No module named 'llmux.vram_manager'`
|
|
|
|
- [ ] **Step 3: Implement vram_manager.py**
|
|
|
|
`llmux/llmux/vram_manager.py`:
|
|
|
|
```python
|
|
import asyncio
|
|
import logging
|
|
from dataclasses import dataclass
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
# Priority ranks: higher number = higher priority = evicted last
|
|
_PRIORITY = {"llm": 0, "tts": 1, "asr": 2}
|
|
|
|
|
|
@dataclass
|
|
class ModelSlot:
|
|
model_id: str
|
|
model_type: str # "llm", "tts", "asr"
|
|
vram_gb: float
|
|
backend: object # backend instance that loaded this model
|
|
|
|
@staticmethod
|
|
def priority_rank(model_type: str) -> int:
|
|
return _PRIORITY[model_type]
|
|
|
|
@property
|
|
def priority(self) -> int:
|
|
return _PRIORITY[self.model_type]
|
|
|
|
|
|
class VRAMManager:
|
|
def __init__(self, total_vram_gb: float = 16.0):
|
|
        self._total_vram_gb = total_vram_gb
        self._loaded: dict[str, ModelSlot] = {}  # model_id -> ModelSlot
        self._lock = asyncio.Lock()

    @property
    def available_vram_gb(self) -> float:
        used = sum(slot.vram_gb for slot in self._loaded.values())
        return self._total_vram_gb - used

    def is_loaded(self, model_id: str) -> bool:
        return model_id in self._loaded

    def get_loaded_models(self) -> dict[str, ModelSlot]:
        return dict(self._loaded)

    async def load_model(
        self,
        model_id: str,
        model_type: str,
        vram_gb: float,
        backend: object,
    ) -> None:
        async with self._lock:
            await self._load_model_locked(model_id, model_type, vram_gb, backend)

    async def _load_model_locked(
        self,
        model_id: str,
        model_type: str,
        vram_gb: float,
        backend: object,
    ) -> None:
        # Already loaded — nothing to do
        if model_id in self._loaded:
            return

        # Try to free VRAM if needed
        if self.available_vram_gb < vram_gb:
            await self._evict_for(vram_gb, model_type)

        if self.available_vram_gb < vram_gb:
            raise RuntimeError(
                f"Cannot free enough VRAM for {model_id} "
                f"(need {vram_gb}GB, available {self.available_vram_gb}GB)"
            )

        # Load the model
        logger.info(f"Loading {model_id} ({vram_gb}GB VRAM)")
        await backend.load(model_id)
        self._loaded[model_id] = ModelSlot(
            model_id=model_id,
            model_type=model_type,
            vram_gb=vram_gb,
            backend=backend,
        )
        logger.info(
            f"Loaded {model_id}. VRAM: {self._total_vram_gb - self.available_vram_gb:.1f}/"
            f"{self._total_vram_gb:.1f}GB used"
        )

    async def _evict_for(self, needed_gb: float, requesting_type: str) -> None:
        """Evict models in priority order (lowest first) until enough VRAM is free."""
        requesting_priority = _PRIORITY[requesting_type]

        # Sort loaded models by priority ascending (evict lowest first)
        candidates = sorted(self._loaded.values(), key=lambda s: s.priority)

        for slot in candidates:
            if self.available_vram_gb >= needed_gb:
                break
            # Never evict a model with higher or equal priority than the requester
            if slot.priority >= requesting_priority:
                continue
            logger.info(f"Evicting {slot.model_id} ({slot.model_type}, {slot.vram_gb}GB)")
            await slot.backend.unload(slot.model_id)
            del self._loaded[slot.model_id]

        # If still not enough, evict same-priority models (e.g., old LLM for new LLM)
        if self.available_vram_gb < needed_gb:
            candidates = sorted(self._loaded.values(), key=lambda s: s.priority)
            for slot in candidates:
                if self.available_vram_gb >= needed_gb:
                    break
                if slot.priority > requesting_priority:
                    continue
                logger.info(f"Evicting same-priority {slot.model_id} ({slot.model_type}, {slot.vram_gb}GB)")
                await slot.backend.unload(slot.model_id)
                del self._loaded[slot.model_id]
```
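The two-pass eviction policy above can be restated as a standalone pure function. This sketch is for illustration only (the priority numbers and model names are assumptions, not part of the plan's code): given the loaded models and the requester's type, it returns the order in which the manager would try to evict.

```python
_PRIORITY = {"asr": 3, "tts": 2, "llm": 1}  # assumed mapping: ASR > TTS > LLM

def eviction_order(loaded: dict[str, str], requesting_type: str) -> list[str]:
    """Model ids eligible for eviction, in the order the manager would try them.

    First models with strictly lower priority than the requester (lowest first),
    then same-priority models, mirroring the two passes in _evict_for.
    """
    req = _PRIORITY[requesting_type]
    ordered = sorted(loaded, key=lambda m: _PRIORITY[loaded[m]])
    lower = [m for m in ordered if _PRIORITY[loaded[m]] < req]
    equal = [m for m in ordered if _PRIORITY[loaded[m]] == req]
    return lower + equal

loaded = {"whisper-large": "asr", "chatterbox-turbo": "tts", "qwen3.5-4b": "llm"}
print(eviction_order(loaded, "asr"))  # ['qwen3.5-4b', 'chatterbox-turbo', 'whisper-large']
print(eviction_order(loaded, "llm"))  # ['qwen3.5-4b'] (only same-priority LLMs are candidates)
```

Note that an ASR request may push out everything, while an LLM request can only displace another LLM, which is the invariant the tests in Step 4 exercise.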
- [ ] **Step 4: Run tests to verify they pass**

Run: `cd llmux && python -m pytest tests/test_vram_manager.py -v`

Expected: all tests PASS (the `test_never_evict_asr_for_tts` test with `pass` will trivially pass — that's fine, the real scenario is covered by `test_asr_evicts_llm_not_reversed`)

- [ ] **Step 5: Commit**

```bash
git add llmux/llmux/vram_manager.py llmux/tests/test_vram_manager.py
git commit -m "feat: VRAM manager with priority-based eviction"
```

---

### Task 6: Backend Base Class

**Files:**
- Create: `llmux/llmux/backends/base.py`

- [ ] **Step 1: Create the abstract base class**

`llmux/llmux/backends/base.py`:

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class BaseBackend(ABC):
    """Abstract base for all model backends."""

    @abstractmethod
    async def load(self, model_id: str, **kwargs) -> None:
        """Load model weights into GPU VRAM.

        Backends accept optional kwargs:
        - device: "cuda" or "cpu" (transformers backends, chatterbox)
        - n_gpu_layers: int (llamacpp backend, -1=all GPU, 0=CPU only)
        """

    @abstractmethod
    async def unload(self, model_id: str) -> None:
        """Unload model weights from GPU VRAM."""

    @abstractmethod
    async def generate(
        self,
        model_id: str,
        messages: list[dict],
        params: dict,
        stream: bool = False,
        tools: list[dict] | None = None,
    ) -> AsyncIterator[str] | dict:
        """Run chat inference. Returns full response dict or async iterator of SSE chunks."""

    async def transcribe(
        self,
        model_id: str,
        audio_data: bytes,
        language: str = "en",
    ) -> dict:
        """Transcribe audio. Only implemented by ASR backends."""
        raise NotImplementedError(f"{self.__class__.__name__} does not support transcription")

    async def synthesize(
        self,
        model_id: str,
        text: str,
        voice: str = "default",
    ) -> bytes:
        """Synthesize speech. Only implemented by TTS backends."""
        raise NotImplementedError(f"{self.__class__.__name__} does not support speech synthesis")
```

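The contract here is that `load`/`unload`/`generate` are abstract while `transcribe`/`synthesize` have raising defaults, so a chat-only backend implements three methods and gets safe errors for the other modalities. A standalone sketch (the `EchoBackend` is hypothetical, and the ABC is condensed here so the snippet runs without the `llmux` package):

```python
import asyncio
from abc import ABC, abstractmethod

# Condensed copy of BaseBackend so this sketch runs standalone.
class BaseBackend(ABC):
    @abstractmethod
    async def load(self, model_id, **kwargs): ...
    @abstractmethod
    async def unload(self, model_id): ...
    @abstractmethod
    async def generate(self, model_id, messages, params, stream=False, tools=None): ...
    async def transcribe(self, model_id, audio_data, language="en"):
        raise NotImplementedError(f"{self.__class__.__name__} does not support transcription")

class EchoBackend(BaseBackend):
    """Hypothetical chat-only backend: echoes the last user message."""
    async def load(self, model_id, **kwargs):
        pass
    async def unload(self, model_id):
        pass
    async def generate(self, model_id, messages, params, stream=False, tools=None):
        return {"choices": [{"message": {"content": messages[-1]["content"]}}]}

backend = EchoBackend()
resp = asyncio.run(backend.generate("echo", [{"role": "user", "content": "hi"}], {}))
print(resp["choices"][0]["message"]["content"])  # hi
```

Calling `backend.transcribe(...)` here would fall through to the base-class `NotImplementedError`, which the routes can translate into a 400/500 response.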
- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/backends/base.py
git commit -m "feat: abstract base class for model backends"
```

---

### Task 7: Transformers LLM Backend

**Files:**
- Create: `llmux/llmux/backends/transformers_llm.py`

- [ ] **Step 1: Implement the transformers LLM backend**

`llmux/llmux/backends/transformers_llm.py`:

```python
import asyncio
import json
import logging
import time
import uuid
from threading import Thread
from typing import AsyncIterator

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer, TextIteratorStreamer

from llmux.backends.base import BaseBackend
from llmux.config import PhysicalModel

logger = logging.getLogger(__name__)


class TransformersLLMBackend(BaseBackend):
    def __init__(self, models_dir: str = "/models"):
        self._models_dir = models_dir
        self._loaded: dict[str, dict] = {}  # model_id -> {"model", "tokenizer", "processor"}

    async def load(self, model_id: str, device: str = "cuda") -> None:
        """Load a HuggingFace model and tokenizer into VRAM."""
        if model_id in self._loaded:
            return

        physical = _get_physical_config(model_id)
        hf_id = physical.model_id
        logger.info(f"Loading transformers model {hf_id} to {device}")

        def _load():
            tokenizer = AutoTokenizer.from_pretrained(
                hf_id,
                cache_dir=self._models_dir,
                trust_remote_code=True,
            )
            model = AutoModelForCausalLM.from_pretrained(
                hf_id,
                cache_dir=self._models_dir,
                torch_dtype="auto",
                device_map=device,
                trust_remote_code=True,
            )
            processor = None
            if physical.supports_vision:
                try:
                    processor = AutoProcessor.from_pretrained(
                        hf_id,
                        cache_dir=self._models_dir,
                        trust_remote_code=True,
                    )
                except Exception:
                    logger.warning(f"No processor found for {hf_id}, vision disabled")
            return model, tokenizer, processor

        loop = asyncio.get_running_loop()
        model, tokenizer, processor = await loop.run_in_executor(None, _load)
        self._loaded[model_id] = {
            "model": model,
            "tokenizer": tokenizer,
            "processor": processor,
            "device": device,
        }
        logger.info(f"Loaded {hf_id} on {device}")

    async def unload(self, model_id: str) -> None:
        if model_id not in self._loaded:
            return
        entry = self._loaded.pop(model_id)
        del entry["model"]
        del entry["tokenizer"]
        if entry.get("processor"):
            del entry["processor"]
        torch.cuda.empty_cache()
        logger.info(f"Unloaded {model_id}")

    async def generate(
        self,
        model_id: str,
        messages: list[dict],
        params: dict,
        stream: bool = False,
        tools: list[dict] | None = None,
    ) -> AsyncIterator[str] | dict:
        entry = self._loaded[model_id]
        model = entry["model"]
        tokenizer = entry["tokenizer"]

        # Apply virtual model params
        chat_params = {}
        if "enable_thinking" in params:
            chat_params["enable_thinking"] = params["enable_thinking"]

        # Inject system prompt prefix for gpt-oss reasoning levels
        effective_messages = list(messages)
        if "system_prompt_prefix" in params:
            prefix = params["system_prompt_prefix"]
            if effective_messages and effective_messages[0].get("role") == "system":
                effective_messages[0] = dict(effective_messages[0])
                effective_messages[0]["content"] = prefix + "\n\n" + effective_messages[0]["content"]
            else:
                effective_messages.insert(0, {"role": "system", "content": prefix})

        # Build input
        text = tokenizer.apply_chat_template(
            effective_messages,
            tokenize=False,
            add_generation_prompt=True,
            tools=tools,
            **chat_params,
        )
        inputs = tokenizer(text, return_tensors="pt").to(model.device)

        if stream:
            return await self._stream_generate(model, tokenizer, inputs, model_id)
        return await self._full_generate(model, tokenizer, inputs, model_id)

    async def _full_generate(self, model, tokenizer, inputs, model_id: str) -> dict:
        def _run():
            with torch.no_grad():
                output_ids = model.generate(
                    **inputs,
                    max_new_tokens=4096,
                )
            new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
            return tokenizer.decode(new_tokens, skip_special_tokens=True)

        loop = asyncio.get_running_loop()
        text = await loop.run_in_executor(None, _run)

        return {
            "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": model_id,
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": text},
                    "finish_reason": "stop",
                }
            ],
            "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
        }

    async def _stream_generate(
        self, model, tokenizer, inputs, model_id: str
    ) -> AsyncIterator[str]:
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        gen_kwargs = {**inputs, "max_new_tokens": 4096, "streamer": streamer}

        thread = Thread(target=lambda: model.generate(**gen_kwargs))
        thread.start()

        chat_id = f"chatcmpl-{uuid.uuid4().hex[:12]}"
        created = int(time.time())

        async def _iter():
            loop = asyncio.get_running_loop()
            while True:
                token = await loop.run_in_executor(None, lambda: next(streamer, None))
                if token is None:
                    # Generation finished: reap the worker thread, then close the stream
                    thread.join()
                    chunk = {
                        "id": chat_id,
                        "object": "chat.completion.chunk",
                        "created": created,
                        "model": model_id,
                        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
                    }
                    yield f"data: {json.dumps(chunk)}\n\n"
                    yield "data: [DONE]\n\n"
                    break
                chunk = {
                    "id": chat_id,
                    "object": "chat.completion.chunk",
                    "created": created,
                    "model": model_id,
                    "choices": [
                        {"index": 0, "delta": {"content": token}, "finish_reason": None}
                    ],
                }
                yield f"data: {json.dumps(chunk)}\n\n"

        return _iter()


# Helper to get physical model config — injected at app startup
_physical_models: dict[str, PhysicalModel] = {}


def set_physical_models(models: dict[str, PhysicalModel]) -> None:
    global _physical_models
    _physical_models = models


def _get_physical_config(model_id: str) -> PhysicalModel:
    return _physical_models[model_id]
```

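The streaming branch emits OpenAI-style SSE lines (`data: {chunk-json}\n\n`, terminated by `data: [DONE]`). A client-side sketch of reassembling such a stream into the final text; the sample chunks below are handwritten for illustration, not captured backend output:

```python
import json

def collect_sse_content(sse_lines) -> str:
    """Concatenate delta.content from chat.completion.chunk SSE lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

stream = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": null}]}\n\n',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo!"}, "finish_reason": null}]}\n\n',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}\n\n',
    "data: [DONE]\n\n",
]
print(collect_sse_content(stream))  # Hello!
```

This mirrors what Open WebUI and other OpenAI-compatible clients do with the `/v1/chat/completions` stream.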
- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/backends/transformers_llm.py
git commit -m "feat: transformers LLM backend with streaming and thinking/reasoning support"
```

---

### Task 8: Transformers ASR Backend

**Files:**
- Create: `llmux/llmux/backends/transformers_asr.py`

- [ ] **Step 1: Implement the ASR backend**

`llmux/llmux/backends/transformers_asr.py`:

```python
import asyncio
import logging

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

from llmux.backends.base import BaseBackend
from llmux.config import PhysicalModel

logger = logging.getLogger(__name__)


class TransformersASRBackend(BaseBackend):
    def __init__(self, models_dir: str = "/models"):
        self._models_dir = models_dir
        self._loaded: dict[str, dict] = {}

    async def load(self, model_id: str, device: str = "cuda") -> None:
        if model_id in self._loaded:
            return

        physical = _get_physical_config(model_id)
        hf_id = physical.model_id
        logger.info(f"Loading ASR model {hf_id} to {device}")

        def _load():
            processor = AutoProcessor.from_pretrained(
                hf_id,
                cache_dir=self._models_dir,
                trust_remote_code=True,
            )
            model = AutoModelForSpeechSeq2Seq.from_pretrained(
                hf_id,
                cache_dir=self._models_dir,
                torch_dtype="auto",
                device_map=device,
                trust_remote_code=True,
            )
            return model, processor

        loop = asyncio.get_running_loop()
        model, processor = await loop.run_in_executor(None, _load)
        self._loaded[model_id] = {
            "model": model,
            "processor": processor,
            "device": device,
        }
        logger.info(f"Loaded ASR model {hf_id} on {device}")

    async def unload(self, model_id: str) -> None:
        if model_id not in self._loaded:
            return
        entry = self._loaded.pop(model_id)
        del entry["model"]
        del entry["processor"]
        torch.cuda.empty_cache()
        logger.info(f"Unloaded ASR model {model_id}")

    async def generate(self, model_id, messages, params, stream=False, tools=None):
        raise NotImplementedError("ASR backend does not support chat generation")

    async def transcribe(
        self,
        model_id: str,
        audio_data: bytes,
        language: str = "en",
    ) -> dict:
        import io
        import soundfile as sf

        entry = self._loaded[model_id]
        model = entry["model"]
        processor = entry["processor"]

        def _transcribe():
            # Decode audio bytes to numpy array
            audio_array, sample_rate = sf.read(io.BytesIO(audio_data))

            # Process audio
            inputs = processor(
                audio_array,
                sampling_rate=sample_rate,
                return_tensors="pt",
                language=language,
            ).to(model.device)

            with torch.no_grad():
                predicted_ids = model.generate(**inputs)

            transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            return transcription

        loop = asyncio.get_running_loop()
        text = await loop.run_in_executor(None, _transcribe)

        return {"text": text}


# Physical model config injection (same pattern as transformers_llm)
_physical_models: dict[str, PhysicalModel] = {}


def set_physical_models(models: dict[str, PhysicalModel]) -> None:
    global _physical_models
    _physical_models = models


def _get_physical_config(model_id: str) -> PhysicalModel:
    return _physical_models[model_id]
```

- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/backends/transformers_asr.py
git commit -m "feat: transformers ASR backend for cohere-transcribe"
```

---

### Task 9: llama-cpp-python Backend

**Files:**
- Create: `llmux/llmux/backends/llamacpp.py`

- [ ] **Step 1: Implement the llama.cpp backend**

`llmux/llmux/backends/llamacpp.py`:

```python
import asyncio
import json
import logging
from pathlib import Path
from typing import AsyncIterator

from llama_cpp import Llama

from llmux.backends.base import BaseBackend
from llmux.config import PhysicalModel

logger = logging.getLogger(__name__)


class LlamaCppBackend(BaseBackend):
    def __init__(self, models_dir: str = "/models"):
        self._models_dir = Path(models_dir)
        self._loaded: dict[str, dict] = {}

    async def load(self, model_id: str, n_gpu_layers: int = -1) -> None:
        if model_id in self._loaded:
            return

        physical = _get_physical_config(model_id)
        model_path = self._models_dir / physical.model_file
        logger.info(f"Loading GGUF model {model_path} with n_gpu_layers={n_gpu_layers}")

        def _load():
            kwargs = {
                "model_path": str(model_path),
                "n_gpu_layers": n_gpu_layers,
                "n_ctx": 8192,
                "verbose": False,
            }
            if physical.mmproj_file:
                mmproj_path = self._models_dir / physical.mmproj_file
                kwargs["chat_handler"] = _create_vision_handler(str(mmproj_path))
            return Llama(**kwargs)

        loop = asyncio.get_running_loop()
        llm = await loop.run_in_executor(None, _load)
        self._loaded[model_id] = {"llm": llm, "n_gpu_layers": n_gpu_layers}
        logger.info(f"Loaded GGUF model {physical.model_file}")

    async def unload(self, model_id: str) -> None:
        if model_id not in self._loaded:
            return
        entry = self._loaded.pop(model_id)
        del entry["llm"]
        logger.info(f"Unloaded GGUF model {model_id}")

    async def generate(
        self,
        model_id: str,
        messages: list[dict],
        params: dict,
        stream: bool = False,
        tools: list[dict] | None = None,
    ) -> AsyncIterator[str] | dict:
        entry = self._loaded[model_id]
        llm = entry["llm"]

        # Apply virtual model params
        effective_messages = list(messages)
        if "enable_thinking" in params:
            # For Qwen GGUF models, thinking is controlled via chat template:
            # enable_thinking=False adds a /no_think tag
            if not params["enable_thinking"]:
                if effective_messages and effective_messages[0].get("role") == "system":
                    effective_messages[0] = dict(effective_messages[0])
                    effective_messages[0]["content"] = (
                        "/no_think\n" + effective_messages[0]["content"]
                    )
                else:
                    effective_messages.insert(0, {"role": "system", "content": "/no_think"})

        if "system_prompt_prefix" in params:
            prefix = params["system_prompt_prefix"]
            if effective_messages and effective_messages[0].get("role") == "system":
                effective_messages[0] = dict(effective_messages[0])
                effective_messages[0]["content"] = prefix + "\n\n" + effective_messages[0]["content"]
            else:
                effective_messages.insert(0, {"role": "system", "content": prefix})

        if stream:
            return await self._stream_generate(llm, effective_messages, model_id, tools)
        return await self._full_generate(llm, effective_messages, model_id, tools)

    async def _full_generate(self, llm, messages, model_id, tools) -> dict:
        def _run():
            kwargs = {"messages": messages, "max_tokens": 4096}
            if tools:
                kwargs["tools"] = tools
            return llm.create_chat_completion(**kwargs)

        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, _run)

        # llama-cpp-python returns OpenAI-compatible format
        result["model"] = model_id
        return result

    async def _stream_generate(
        self, llm, messages, model_id, tools
    ) -> AsyncIterator[str]:
        def _run():
            kwargs = {"messages": messages, "max_tokens": 4096, "stream": True}
            if tools:
                kwargs["tools"] = tools
            return llm.create_chat_completion(**kwargs)

        loop = asyncio.get_running_loop()
        stream = await loop.run_in_executor(None, _run)

        async def _iter():
            it = iter(stream)
            while True:
                # Pull the next chunk off the llama.cpp generator in a worker
                # thread so token generation does not block the event loop
                chunk = await loop.run_in_executor(None, lambda: next(it, None))
                if chunk is None:
                    yield "data: [DONE]\n\n"
                    break
                chunk["model"] = model_id
                yield f"data: {json.dumps(chunk)}\n\n"

        return _iter()


def _create_vision_handler(mmproj_path: str):
    """Create a chat handler with vision support using the mmproj file."""
    from llama_cpp.llama_chat_format import Llava16ChatHandler

    return Llava16ChatHandler(clip_model_path=mmproj_path)


# Physical model config injection
_physical_models: dict[str, PhysicalModel] = {}


def set_physical_models(models: dict[str, PhysicalModel]) -> None:
    global _physical_models
    _physical_models = models


def _get_physical_config(model_id: str) -> PhysicalModel:
    return _physical_models[model_id]
```

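Both LLM backends duplicate the same system-prompt injection logic (prepend to an existing system message, else insert one at position 0). A standalone sketch of that logic as a pure function; `inject_system_prefix` is a hypothetical helper, not a file in the plan's file map:

```python
def inject_system_prefix(messages: list[dict], prefix: str) -> list[dict]:
    """Return a copy of messages with `prefix` prepended to the system prompt.

    Mirrors the inline logic in the LLM backends: prepend to an existing
    system message, or insert a new one at position 0. The input list and
    its message dicts are left unmodified.
    """
    out = list(messages)
    if out and out[0].get("role") == "system":
        out[0] = dict(out[0])
        out[0]["content"] = prefix + "\n\n" + out[0]["content"]
    else:
        out.insert(0, {"role": "system", "content": prefix})
    return out

msgs = [{"role": "user", "content": "hi"}]
print(inject_system_prefix(msgs, "/no_think")[0])  # {'role': 'system', 'content': '/no_think'}
print(msgs[0]["role"])  # user (the original list is not mutated)
```

Factoring this into a shared helper would remove the duplication between the transformers and llama.cpp backends; the copy-before-mutate pattern matters because the same `messages` list may be reused across retries.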
- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/backends/llamacpp.py
git commit -m "feat: llama-cpp-python backend with GGUF, vision, and tool support"
```

---

### Task 10: Chatterbox TTS Backend

**Files:**
- Create: `llmux/llmux/backends/chatterbox_tts.py`

- [ ] **Step 1: Implement the Chatterbox TTS backend**

`llmux/llmux/backends/chatterbox_tts.py`:

```python
import asyncio
import io
import logging

import soundfile as sf
import torch

from llmux.backends.base import BaseBackend
from llmux.config import PhysicalModel

logger = logging.getLogger(__name__)


class ChatterboxTTSBackend(BaseBackend):
    def __init__(self, models_dir: str = "/models"):
        self._models_dir = models_dir
        self._loaded: dict[str, dict] = {}

    async def load(self, model_id: str, device: str = "cuda") -> None:
        if model_id in self._loaded:
            return

        physical = _get_physical_config(model_id)
        variant = physical.variant
        logger.info(f"Loading Chatterbox {variant} to {device}")

        def _load():
            from chatterbox.tts import ChatterboxTTS

            if variant == "turbo":
                model = ChatterboxTTS.from_pretrained(device=device, variant="turbo")
            elif variant == "multilingual":
                model = ChatterboxTTS.from_pretrained(device=device, variant="multilingual")
            else:
                model = ChatterboxTTS.from_pretrained(device=device)
            return model

        loop = asyncio.get_running_loop()
        model = await loop.run_in_executor(None, _load)
        self._loaded[model_id] = {"model": model, "device": device}
        logger.info(f"Loaded Chatterbox {variant} on {device}")

    async def unload(self, model_id: str) -> None:
        if model_id not in self._loaded:
            return
        entry = self._loaded.pop(model_id)
        del entry["model"]
        torch.cuda.empty_cache()
        logger.info(f"Unloaded Chatterbox {model_id}")

    async def generate(self, model_id, messages, params, stream=False, tools=None):
        raise NotImplementedError("TTS backend does not support chat generation")

    async def synthesize(
        self,
        model_id: str,
        text: str,
        voice: str = "default",
    ) -> bytes:
        entry = self._loaded[model_id]
        model = entry["model"]

        def _synthesize():
            wav = model.generate(text)
            # Convert to WAV bytes
            buf = io.BytesIO()
            sf.write(buf, wav.cpu().numpy().squeeze(), samplerate=24000, format="WAV")
            buf.seek(0)
            return buf.read()

        loop = asyncio.get_running_loop()
        audio_bytes = await loop.run_in_executor(None, _synthesize)
        return audio_bytes


# Physical model config injection
_physical_models: dict[str, PhysicalModel] = {}


def set_physical_models(models: dict[str, PhysicalModel]) -> None:
    global _physical_models
    _physical_models = models


def _get_physical_config(model_id: str) -> PhysicalModel:
    return _physical_models[model_id]
```

- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/backends/chatterbox_tts.py
git commit -m "feat: Chatterbox TTS backend with turbo/multilingual/default variants"
```

---

### Task 11: API Routes — Health and Models

**Files:**
- Create: `llmux/llmux/routes/models.py`
- Create: `llmux/tests/test_routes.py`

- [ ] **Step 1: Write the failing tests**

`llmux/tests/test_routes.py`:

```python
import pytest
from fastapi import FastAPI
from fastapi.testclient import TestClient

from llmux.config import ApiKey
from llmux.auth import create_api_key_dependency
from llmux.model_registry import ModelRegistry
from llmux.vram_manager import VRAMManager
from llmux.routes.models import create_models_router


API_KEY = "sk-test-key"


@pytest.fixture
def registry():
    return ModelRegistry.from_config()


@pytest.fixture
def vram_manager():
    return VRAMManager(total_vram_gb=16.0)


@pytest.fixture
def app(registry, vram_manager):
    keys = [ApiKey(key=API_KEY, name="Test")]
    require_api_key = create_api_key_dependency(keys)

    app = FastAPI()
    app.include_router(create_models_router(registry, require_api_key))
    return app


@pytest.fixture
def client(app):
    return TestClient(app)


@pytest.fixture
def auth_headers():
    return {"Authorization": f"Bearer {API_KEY}"}


def test_list_models_returns_16(client, auth_headers):
    resp = client.get("/v1/models", headers=auth_headers)
    assert resp.status_code == 200
    body = resp.json()
    assert body["object"] == "list"
    assert len(body["data"]) == 16


def test_list_models_contains_expected_names(client, auth_headers):
    resp = client.get("/v1/models", headers=auth_headers)
    names = [m["id"] for m in resp.json()["data"]]
    assert "Qwen3.5-9B-FP8-Thinking" in names
    assert "GPT-OSS-20B-High" in names
    assert "cohere-transcribe" in names
    assert "Chatterbox-Multilingual" in names


def test_list_models_requires_auth(client):
    resp = client.get("/v1/models")
    assert resp.status_code == 401
```

- [ ] **Step 2: Run tests to verify they fail**

Run: `cd llmux && python -m pytest tests/test_routes.py -v`

Expected: FAIL — `ModuleNotFoundError: No module named 'llmux.routes.models'`

- [ ] **Step 3: Implement routes/models.py**

`llmux/llmux/routes/models.py`:

```python
from fastapi import APIRouter, Depends

from llmux.model_registry import ModelRegistry


def create_models_router(registry: ModelRegistry, require_api_key) -> APIRouter:
    router = APIRouter()

    @router.get("/v1/models")
    async def list_models(api_key: str = Depends(require_api_key)):
        return {
            "object": "list",
            "data": registry.list_virtual_models(),
        }

    return router
```

- [ ] **Step 4: Run tests to verify they pass**

Run: `cd llmux && python -m pytest tests/test_routes.py -v`

Expected: all 3 tests PASS

- [ ] **Step 5: Commit**

```bash
git add llmux/llmux/routes/models.py llmux/tests/test_routes.py
git commit -m "feat: GET /v1/models endpoint with auth"
```

---

### Task 12: API Routes — Chat Completions

**Files:**
- Create: `llmux/llmux/routes/chat.py`

- [ ] **Step 1: Implement chat route**

`llmux/llmux/routes/chat.py`:

```python
import logging

from fastapi import APIRouter, Depends, HTTPException, Request
from fastapi.responses import StreamingResponse

from llmux.model_registry import ModelRegistry
from llmux.vram_manager import VRAMManager

logger = logging.getLogger(__name__)


def create_chat_router(
    registry: ModelRegistry,
    vram_manager: VRAMManager,
    backends: dict,
    require_api_key,
) -> APIRouter:
    router = APIRouter()

    @router.post("/v1/chat/completions")
    async def chat_completions(request: Request, api_key: str = Depends(require_api_key)):
        body = await request.json()

        virtual_name = body.get("model")
        if not virtual_name:
            raise HTTPException(status_code=400, detail="Missing 'model' field")

        try:
            physical_id, physical, params = registry.resolve(virtual_name)
        except KeyError:
            raise HTTPException(status_code=404, detail=f"Model '{virtual_name}' not found")

        # Get the backend for this model
        backend = backends.get(physical.backend)
        if backend is None:
            raise HTTPException(status_code=500, detail=f"No backend for '{physical.backend}'")

        # Ensure model is loaded (VRAM manager handles eviction)
        await vram_manager.load_model(
            model_id=physical_id,
            model_type=physical.type,
            vram_gb=physical.estimated_vram_gb,
            backend=backend,
        )

        messages = body.get("messages", [])
        stream = body.get("stream", False)
        tools = body.get("tools")

        result = await backend.generate(
            model_id=physical_id,
            messages=messages,
            params=params,
            stream=stream,
            tools=tools,
        )

        if stream:
            return StreamingResponse(result, media_type="text/event-stream")
        return result

    return router
```

- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/routes/chat.py
git commit -m "feat: POST /v1/chat/completions with streaming and VRAM management"
```

---

### Task 13: API Routes — Audio Transcription

**Files:**
- Create: `llmux/llmux/routes/transcription.py`

- [ ] **Step 1: Implement transcription route**

`llmux/llmux/routes/transcription.py`:

```python
import logging

from fastapi import APIRouter, Depends, File, Form, HTTPException, UploadFile

from llmux.model_registry import ModelRegistry
from llmux.vram_manager import VRAMManager

logger = logging.getLogger(__name__)


def create_transcription_router(
    registry: ModelRegistry,
    vram_manager: VRAMManager,
    backends: dict,
    require_api_key,
) -> APIRouter:
    router = APIRouter()

    @router.post("/v1/audio/transcriptions")
    async def create_transcription(
        file: UploadFile = File(...),
        model: str = Form(...),
        language: str = Form("en"),
        api_key: str = Depends(require_api_key),
    ):
        try:
            physical_id, physical, params = registry.resolve(model)
        except KeyError:
            raise HTTPException(status_code=404, detail=f"Model '{model}' not found")

        if physical.type != "asr":
            raise HTTPException(status_code=400, detail=f"Model '{model}' is not an ASR model")

        backend = backends.get(physical.backend)
        if backend is None:
            raise HTTPException(status_code=500, detail=f"No backend for '{physical.backend}'")

        await vram_manager.load_model(
            model_id=physical_id,
            model_type=physical.type,
            vram_gb=physical.estimated_vram_gb,
            backend=backend,
        )

        audio_data = await file.read()
        result = await backend.transcribe(
            model_id=physical_id,
            audio_data=audio_data,
            language=language,
        )

        return result

    return router
```
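For exercising this endpoint without real recordings, a short silent WAV can be built in memory with the stdlib `wave` module. This is a client-side test helper sketch (the function name and parameters are assumptions); the resulting bytes could be POSTed as the multipart `file` field with `model=cohere-transcribe`:

```python
import io
import wave

def make_silent_wav(duration_seconds: float = 2.0, sample_rate: int = 16000) -> bytes:
    """Build an in-memory mono 16-bit PCM WAV of silence (for smoke tests)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(duration_seconds * sample_rate))
    return buf.getvalue()

wav = make_silent_wav(0.5)
print(wav[:4])   # b'RIFF'
print(len(wav))  # 44-byte canonical header + 16000 bytes of samples
```

The admin performance-test route in Task 15 needs the same kind of synthetic audio for its `_make_silent_wav` helper.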
- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/routes/transcription.py
git commit -m "feat: POST /v1/audio/transcriptions endpoint"
```

---

### Task 14: API Routes — Speech Synthesis

**Files:**
- Create: `llmux/llmux/routes/speech.py`

- [ ] **Step 1: Implement speech route**

`llmux/llmux/routes/speech.py`:

```python
import logging

from fastapi import APIRouter, Depends, HTTPException, Request
from fastapi.responses import Response

from llmux.model_registry import ModelRegistry
from llmux.vram_manager import VRAMManager

logger = logging.getLogger(__name__)


def create_speech_router(
    registry: ModelRegistry,
    vram_manager: VRAMManager,
    backends: dict,
    require_api_key,
) -> APIRouter:
    router = APIRouter()

    @router.post("/v1/audio/speech")
    async def create_speech(request: Request, api_key: str = Depends(require_api_key)):
        body = await request.json()

        model_name = body.get("model")
        if not model_name:
            raise HTTPException(status_code=400, detail="Missing 'model' field")

        try:
            physical_id, physical, params = registry.resolve(model_name)
        except KeyError:
            raise HTTPException(status_code=404, detail=f"Model '{model_name}' not found")

        if physical.type != "tts":
            raise HTTPException(status_code=400, detail=f"Model '{model_name}' is not a TTS model")

        backend = backends.get(physical.backend)
        if backend is None:
            raise HTTPException(status_code=500, detail=f"No backend for '{physical.backend}'")

        await vram_manager.load_model(
            model_id=physical_id,
            model_type=physical.type,
            vram_gb=physical.estimated_vram_gb,
            backend=backend,
        )

        text = body.get("input", "")
        voice = body.get("voice", "default")

        audio_bytes = await backend.synthesize(
            model_id=physical_id,
            text=text,
            voice=voice,
        )

        return Response(content=audio_bytes, media_type="audio/wav")

    return router
```

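A functional test of this route can sanity-check the returned `audio/wav` body with the stdlib `wave` module instead of listening to it. A sketch (the `wav_info` helper is hypothetical; the demo clip is built locally so the snippet is self-contained):

```python
import io
import wave

def wav_info(audio_bytes: bytes) -> dict:
    """Inspect WAV bytes returned by /v1/audio/speech (channels, rate, duration)."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "duration_seconds": frames / rate,
        }

# Build a half-second silent clip in place of a real TTS response
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)  # the Chatterbox backend writes its output at 24 kHz
    w.writeframes(b"\x00\x00" * 12000)

print(wav_info(buf.getvalue()))  # {'channels': 1, 'sample_rate': 24000, 'duration_seconds': 0.5}
```

Asserting on sample rate and a non-zero duration catches the most common TTS regressions (empty responses, wrong `sf.write` parameters) without comparing audio content.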
- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/routes/speech.py
git commit -m "feat: POST /v1/audio/speech endpoint"
```

---

### Task 15: API Routes — Admin Performance Test

**Files:**
- Create: `llmux/llmux/routes/admin.py`

- [ ] **Step 1: Implement admin performance test endpoint**

`llmux/llmux/routes/admin.py`:

```python
import asyncio
import logging
import time

from fastapi import APIRouter, Depends, HTTPException, Request

from llmux.model_registry import ModelRegistry
from llmux.vram_manager import VRAMManager

logger = logging.getLogger(__name__)

TEST_PROMPT = [{"role": "user", "content": "Say hello in one sentence."}]


def create_admin_router(
    registry: ModelRegistry,
    vram_manager: VRAMManager,
    backends: dict,
    require_api_key,
) -> APIRouter:
    router = APIRouter()

    @router.post("/admin/test/performance")
    async def test_performance(request: Request, api_key: str = Depends(require_api_key)):
        """Run GPU vs CPU inference for a model and compare timings.

        Request body:
        {
            "physical_model_id": "qwen3.5-4b",
            "test_type": "llm" | "asr" | "tts"
        }
        """
        body = await request.json()
        physical_id = body.get("physical_model_id")
        if not physical_id:
            raise HTTPException(status_code=400, detail="Missing 'physical_model_id'")

        physical = registry.get_physical(physical_id)
        backend_name = physical.backend

        if backend_name == "transformers" and physical.type == "llm":
            return await _test_transformers_llm(physical_id, physical, backends)
        elif backend_name == "transformers" and physical.type == "asr":
            return await _test_transformers_asr(physical_id, physical, backends)
        elif backend_name == "llamacpp":
            return await _test_llamacpp(physical_id, physical, backends)
        elif backend_name == "chatterbox":
            return await _test_chatterbox(physical_id, physical, backends)
        else:
            raise HTTPException(status_code=400, detail=f"Unknown backend: {backend_name}")

    return router


async def _test_transformers_llm(physical_id, physical, backends):
    from llmux.backends.transformers_llm import TransformersLLMBackend

    results = {}

    for device_label, device in [("gpu", "cuda"), ("cpu", "cpu")]:
        backend = TransformersLLMBackend(models_dir=backends["transformers"]._models_dir)
        await backend.load(physical_id, device=device)
        start = time.monotonic()
        await backend.generate(physical_id, TEST_PROMPT, params={}, stream=False)
        elapsed = time.monotonic() - start
        await backend.unload(physical_id)
        results[device_label] = round(elapsed, 2)

    ratio = results["cpu"] / results["gpu"] if results["gpu"] > 0 else 0
    return {
        "model": physical_id,
        "gpu_seconds": results["gpu"],
        "cpu_seconds": results["cpu"],
        "speedup": round(ratio, 1),
        "pass": ratio >= 5.0,
    }


async def _test_transformers_asr(physical_id, physical, backends):
    from llmux.backends.transformers_asr import TransformersASRBackend
    import struct

    # Generate a short silent WAV for testing
    silent_wav = _make_silent_wav(duration_seconds=2)

    results = {}

    for device_label, device in [("gpu", "cuda"), ("cpu", "cpu")]:
        backend = TransformersASRBackend(models_dir=backends["transformers_asr"]._models_dir)
        await backend.load(physical_id, device=device)
|
|
start = time.monotonic()
|
|
await backend.transcribe(physical_id, silent_wav, language="en")
|
|
elapsed = time.monotonic() - start
|
|
await backend.unload(physical_id)
|
|
results[device_label] = round(elapsed, 2)
|
|
|
|
ratio = results["cpu"] / results["gpu"] if results["gpu"] > 0 else 0
|
|
return {
|
|
"model": physical_id,
|
|
"gpu_seconds": results["gpu"],
|
|
"cpu_seconds": results["cpu"],
|
|
"speedup": round(ratio, 1),
|
|
"pass": ratio >= 5.0,
|
|
}
|
|
|
|
|
|
async def _test_llamacpp(physical_id, physical, backends):
|
|
from llmux.backends.llamacpp import LlamaCppBackend
|
|
|
|
results = {}
|
|
|
|
for label, n_gpu_layers in [("gpu", -1), ("cpu", 0)]:
|
|
backend = LlamaCppBackend(models_dir=backends["llamacpp"]._models_dir)
|
|
await backend.load(physical_id, n_gpu_layers=n_gpu_layers)
|
|
start = time.monotonic()
|
|
await backend.generate(physical_id, TEST_PROMPT, params={}, stream=False)
|
|
elapsed = time.monotonic() - start
|
|
await backend.unload(physical_id)
|
|
results[label] = round(elapsed, 2)
|
|
|
|
ratio = results["cpu"] / results["gpu"] if results["gpu"] > 0 else 0
|
|
return {
|
|
"model": physical_id,
|
|
"gpu_seconds": results["gpu"],
|
|
"cpu_seconds": results["cpu"],
|
|
"speedup": round(ratio, 1),
|
|
"pass": ratio >= 5.0,
|
|
}
|
|
|
|
|
|
async def _test_chatterbox(physical_id, physical, backends):
|
|
from llmux.backends.chatterbox_tts import ChatterboxTTSBackend
|
|
|
|
backend = ChatterboxTTSBackend(models_dir=backends["chatterbox"]._models_dir)
|
|
await backend.load(physical_id, device="cuda")
|
|
test_text = "Hello, this is a performance test."
|
|
start = time.monotonic()
|
|
audio_bytes = await backend.synthesize(physical_id, test_text)
|
|
elapsed = time.monotonic() - start
|
|
await backend.unload(physical_id)
|
|
|
|
# Estimate audio duration from WAV bytes (24kHz, 16-bit mono)
|
|
audio_samples = (len(audio_bytes) - 44) / 2 # subtract WAV header, 2 bytes per sample
|
|
audio_duration = audio_samples / 24000
|
|
|
|
return {
|
|
"model": physical_id,
|
|
"synthesis_seconds": round(elapsed, 2),
|
|
"audio_duration_seconds": round(audio_duration, 2),
|
|
"realtime_factor": round(audio_duration / elapsed, 1) if elapsed > 0 else 0,
|
|
}
|
|
|
|
|
|
def _make_silent_wav(duration_seconds=2, sample_rate=16000) -> bytes:
|
|
"""Generate a silent WAV file as bytes."""
|
|
import struct
|
|
num_samples = int(sample_rate * duration_seconds)
|
|
data = b"\x00\x00" * num_samples # 16-bit silence
|
|
header = struct.pack(
|
|
"<4sI4s4sIHHIIHH4sI",
|
|
b"RIFF", 36 + len(data), b"WAVE",
|
|
b"fmt ", 16, 1, 1, sample_rate, sample_rate * 2, 2, 16,
|
|
b"data", len(data),
|
|
)
|
|
return header + data
|
|
```
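
The `_make_silent_wav` helper can be sanity-checked offline. This self-contained sketch repeats the same `struct` layout and unpacks the 44-byte canonical WAV header to verify the key fields round-trip:

```python
import struct

def make_silent_wav(duration_seconds=2, sample_rate=16000) -> bytes:
    # Mirrors _make_silent_wav: canonical 44-byte header, 16-bit mono PCM silence.
    num_samples = int(sample_rate * duration_seconds)
    data = b"\x00\x00" * num_samples
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(data), b"WAVE",
        b"fmt ", 16, 1, 1, sample_rate, sample_rate * 2, 2, 16,
        b"data", len(data),
    )
    return header + data

wav = make_silent_wav()
fields = struct.unpack("<4sI4s4sIHHIIHH4sI", wav[:44])
assert fields[0] == b"RIFF" and fields[2] == b"WAVE"
assert fields[7] == 16000           # sample rate
assert fields[12] == len(wav) - 44  # data chunk size matches payload
```

The format string packs to exactly 44 bytes, the standard header size assumed by the duration estimate in `_test_chatterbox`.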

- [ ] **Step 2: Commit**

```bash
git add llmux/llmux/routes/admin.py
git commit -m "feat: admin performance test endpoint for GPU vs CPU comparison"
```

---

### Task 16: FastAPI App Assembly (main.py)

**Files:**
- Create: `llmux/llmux/main.py`

- [ ] **Step 1: Implement main.py**

`llmux/llmux/main.py`:

```python
import logging
import os

from fastapi import FastAPI

from llmux.config import load_models_config, load_api_keys
from llmux.auth import create_api_key_dependency
from llmux.model_registry import ModelRegistry
from llmux.vram_manager import VRAMManager
from llmux.backends.transformers_llm import TransformersLLMBackend
from llmux.backends.transformers_llm import set_physical_models as set_transformers_llm_models
from llmux.backends.transformers_asr import TransformersASRBackend
from llmux.backends.transformers_asr import set_physical_models as set_transformers_asr_models
from llmux.backends.llamacpp import LlamaCppBackend
from llmux.backends.llamacpp import set_physical_models as set_llamacpp_models
from llmux.backends.chatterbox_tts import ChatterboxTTSBackend
from llmux.backends.chatterbox_tts import set_physical_models as set_chatterbox_models
from llmux.routes.models import create_models_router
from llmux.routes.chat import create_chat_router
from llmux.routes.transcription import create_transcription_router
from llmux.routes.speech import create_speech_router
from llmux.routes.admin import create_admin_router

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

MODELS_DIR = os.environ.get("LLMUX_MODELS_DIR", "/models")

app = FastAPI(title="llmux", version="0.1.0")


@app.on_event("startup")
async def startup():
    logger.info("Starting llmux...")

    # Load config
    physical, virtual = load_models_config()
    api_keys = load_api_keys()

    # Inject physical model configs into backends
    set_transformers_llm_models(physical)
    set_transformers_asr_models(physical)
    set_llamacpp_models(physical)
    set_chatterbox_models(physical)

    # Create core components
    registry = ModelRegistry(physical, virtual)
    vram_manager = VRAMManager(total_vram_gb=16.0)
    require_api_key = create_api_key_dependency(api_keys)

    # Create backends
    transformers_llm = TransformersLLMBackend(models_dir=MODELS_DIR)
    transformers_asr = TransformersASRBackend(models_dir=MODELS_DIR)
    llamacpp = LlamaCppBackend(models_dir=MODELS_DIR)
    chatterbox = ChatterboxTTSBackend(models_dir=MODELS_DIR)

    backends = {
        "transformers": transformers_llm,
        "transformers_asr": transformers_asr,
        "llamacpp": llamacpp,
        "chatterbox": chatterbox,
    }

    # Store on app state for health endpoint
    app.state.vram_manager = vram_manager
    app.state.registry = registry

    # Register routes
    app.include_router(create_models_router(registry, require_api_key))
    app.include_router(create_chat_router(registry, vram_manager, backends, require_api_key))
    app.include_router(create_transcription_router(registry, vram_manager, backends, require_api_key))
    app.include_router(create_speech_router(registry, vram_manager, backends, require_api_key))
    app.include_router(create_admin_router(registry, vram_manager, backends, require_api_key))

    logger.info("llmux started successfully")


@app.get("/health")
async def health():
    vram_manager = app.state.vram_manager
    loaded = vram_manager.get_loaded_models()
    return {
        "status": "ok",
        "loaded_models": {
            mid: {"type": slot.model_type, "vram_gb": slot.vram_gb}
            for mid, slot in loaded.items()
        },
        "available_vram_gb": round(vram_manager.available_vram_gb, 1),
    }
```

- [ ] **Step 2: Fix backend routing in chat.py**

The chat router currently looks up backends by `physical.backend`, which is `"transformers"` for both LLM and ASR models. We need to route ASR models to `transformers_asr`. Update `create_chat_router` in `llmux/llmux/routes/chat.py` to resolve the backend key:

Replace the line:

```python
backend = backends.get(physical.backend)
```

with:

```python
backend_key = physical.backend
if backend_key == "transformers" and physical.type == "asr":
    backend_key = "transformers_asr"
backend = backends.get(backend_key)
```

Apply the same fix in `llmux/llmux/routes/transcription.py` and `llmux/llmux/routes/speech.py`.
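
Rather than duplicating this branch in three routers, the mapping could live in one small helper. A sketch; `resolve_backend_key` is a name chosen here for illustration, not part of the plan's modules:

```python
def resolve_backend_key(backend: str, model_type: str) -> str:
    """Map a physical model's (backend, type) pair to a backends-dict key.

    The transformers runtime is registered twice (LLM and ASR), so
    'transformers' + 'asr' must resolve to the 'transformers_asr' entry.
    """
    if backend == "transformers" and model_type == "asr":
        return "transformers_asr"
    return backend

assert resolve_backend_key("transformers", "asr") == "transformers_asr"
assert resolve_backend_key("transformers", "llm") == "transformers"
assert resolve_backend_key("llamacpp", "llm") == "llamacpp"
```

Each router would then call `backends.get(resolve_backend_key(physical.backend, physical.type))`.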

- [ ] **Step 3: Commit**

```bash
git add llmux/llmux/main.py llmux/llmux/routes/chat.py \
    llmux/llmux/routes/transcription.py llmux/llmux/routes/speech.py
git commit -m "feat: FastAPI app assembly with all routes and backend wiring"
```

---

### Task 17: Dockerfile

**Files:**
- Create: `llmux/Dockerfile`

- [ ] **Step 1: Create the Dockerfile**

`llmux/Dockerfile`:

```dockerfile
FROM pytorch/pytorch:2.11.0-cuda12.8-cudnn9-runtime

# System dependencies for audio processing
RUN apt-get update && apt-get install -y --no-install-recommends \
    libsndfile1 \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt && rm /tmp/requirements.txt

# llama-cpp-python needs a CUDA build (quote the spec so '>' is not shell redirection)
RUN CMAKE_ARGS="-DGGML_CUDA=on" pip install --no-cache-dir --force-reinstall "llama-cpp-python>=0.3.0"

# Copy application code
COPY llmux/ /app/llmux/
WORKDIR /app

# Run the server
EXPOSE 8081
CMD ["uvicorn", "llmux.main:app", "--host", "0.0.0.0", "--port", "8081"]
```

- [ ] **Step 2: Commit**

```bash
git add llmux/Dockerfile
git commit -m "feat: Dockerfile with PyTorch CUDA 12.8, audio deps, and CUDA llama-cpp"
```

---

### Task 18: Model Download Script

**Files:**
- Create: `llmux/scripts/download_models.sh`

- [ ] **Step 1: Create the download script**

`llmux/scripts/download_models.sh`:

```bash
#!/bin/bash
# Download all model weights for llmux.
# Run as user llm: bash scripts/download_models.sh
# Requires: pip install huggingface_hub
# Requires: HuggingFace token at ~/.cache/huggingface/token for gated models

set -euo pipefail

MODELS_DIR="${LLMUX_MODELS_DIR:-$HOME/.local/share/llmux_pod/models}"
mkdir -p "$MODELS_DIR"

echo "=== Downloading models to $MODELS_DIR ==="

# Helper: download HF model if not already present
download_hf() {
  local repo="$1"
  local target="$MODELS_DIR/models--${repo//\//--}"  # HF cache layout: '/' becomes '--'
  if [ -d "$target" ]; then
    echo "SKIP: $repo (already downloaded)"
    return
  fi
  echo "Downloading: $repo"
  huggingface-cli download "$repo" --cache-dir "$MODELS_DIR"
}

# Helper: download specific files from HF repo
download_hf_files() {
  local repo="$1"
  shift
  echo "Downloading specific files from: $repo"
  huggingface-cli download "$repo" "$@" --cache-dir "$MODELS_DIR"
}

# 1. Qwen3.5-9B-FP8
download_hf "lovedheart/Qwen3.5-9B-FP8"

# 2. Qwen3.5-9B-FP8-Uncensored (GGUF files only)
download_hf_files "HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive" \
  "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf" \
  "mmproj-Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-BF16.gguf"

# 3. Qwen3.5-4B
download_hf "Qwen/Qwen3.5-4B"

# 4. gpt-oss-20b
download_hf "openai/gpt-oss-20b"

# 5. gpt-oss-20b-uncensored
download_hf "aoxo/gpt-oss-20b-uncensored"

# 6. cohere-transcribe (gated; requires accepted terms)
echo "Downloading: CohereLabs/cohere-transcribe-03-2026 (gated)"
download_hf "CohereLabs/cohere-transcribe-03-2026" || \
  echo "WARNING: cohere-transcribe download failed. Have you accepted the terms at https://huggingface.co/CohereLabs/cohere-transcribe-03-2026 ?"

# 7. Chatterbox TTS
# Chatterbox downloads weights automatically on first load via from_pretrained().
# We trigger a dry-run download here so weights are cached.
echo "Downloading: Chatterbox TTS weights (auto-downloaded by library)"
python3 -c "
from chatterbox.tts import ChatterboxTTS
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # CPU only for download
print('Downloading Chatterbox default...')
ChatterboxTTS.from_pretrained(device='cpu')
print('Downloading Chatterbox turbo...')
ChatterboxTTS.from_pretrained(device='cpu', variant='turbo')
print('Downloading Chatterbox multilingual...')
ChatterboxTTS.from_pretrained(device='cpu', variant='multilingual')
print('Chatterbox downloads complete.')
" || echo "WARNING: Chatterbox download failed. Check chatterbox-tts installation."

echo ""
echo "=== Download complete ==="
echo "Models directory: $MODELS_DIR"
du -sh "$MODELS_DIR"
```
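
The skip-detection in `download_hf` depends on huggingface_hub's cache layout, where each `/` in a repo id is replaced by `--` in the directory name. A quick check of that naming rule:

```python
def hf_cache_dirname(repo_id: str) -> str:
    # huggingface_hub stores repos under <cache>/models--<org>--<name>.
    return "models--" + repo_id.replace("/", "--")

assert hf_cache_dirname("Qwen/Qwen3.5-4B") == "models--Qwen--Qwen3.5-4B"
assert hf_cache_dirname("openai/gpt-oss-20b") == "models--openai--gpt-oss-20b"
```

If the separator is wrong, the `-d` test never matches and every run re-downloads the repo.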

- [ ] **Step 2: Make executable and commit**

```bash
chmod +x llmux/scripts/download_models.sh
git add llmux/scripts/download_models.sh
git commit -m "feat: model download script for all 9 physical models"
```

---

### Task 19: Pod Creation Script

**Files:**
- Create: `llmux/scripts/create_pod_llmux.sh`

- [ ] **Step 1: Create the pod creation script**

`llmux/scripts/create_pod_llmux.sh`:

```bash
#!/bin/bash
# Create the llmux Podman pod and systemd service.
# Run as user llm: bash scripts/create_pod_llmux.sh
# Prerequisites:
#   - Model weights downloaded to ~/.local/share/llmux_pod/models/
#   - Config files in ~/.local/share/llmux_pod/config/
#   - Container image built: podman build -t llmux:latest -f Dockerfile .

set -euo pipefail

# --- Variables ---
POD_NAME="llmux_pod"
CTR_NAME="llmux_ctr"
IMAGE="localhost/llmux:latest"
PORT="127.0.0.1:8081:8081"
BIND_DIR="$HOME/.local/share/${POD_NAME}"
USER_SYSTEMD_DIR="$HOME/.config/systemd/user"

MODELS_DIR="${BIND_DIR}/models"
CONFIG_DIR="${BIND_DIR}/config"

# --- Sanity checks ---
if [ ! -d "$MODELS_DIR" ]; then
  echo "ERROR: Models directory not found: $MODELS_DIR"
  echo "Run download_models.sh first."
  exit 1
fi

if [ ! -f "$CONFIG_DIR/models.yaml" ]; then
  echo "ERROR: Config not found: $CONFIG_DIR/models.yaml"
  exit 1
fi

if [ ! -f "$CONFIG_DIR/api_keys.yaml" ]; then
  echo "ERROR: Config not found: $CONFIG_DIR/api_keys.yaml"
  exit 1
fi

# --- Ensure directories ---
mkdir -p "$USER_SYSTEMD_DIR"

# --- Build image if not present ---
if ! podman image exists "$IMAGE"; then
  echo "Building container image..."
  SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
  podman build -t llmux:latest -f "$SCRIPT_DIR/../Dockerfile" "$SCRIPT_DIR/.."
fi

# --- Remove old pod if it exists ---
podman pod exists "$POD_NAME" && podman pod stop "$POD_NAME" 2>/dev/null || true
podman pod exists "$POD_NAME" && podman pod rm -f "$POD_NAME" 2>/dev/null || true

# --- Create pod ---
echo "Creating pod $POD_NAME..."
podman pod create \
  --name "$POD_NAME" \
  -p "$PORT"

# --- Create container ---
echo "Creating container $CTR_NAME..."
podman run -d \
  --name "$CTR_NAME" \
  --pod "$POD_NAME" \
  --device nvidia.com/gpu=all \
  -v "${MODELS_DIR}:/models:ro" \
  -v "${CONFIG_DIR}:/config:ro" \
  -e LLMUX_CONFIG_DIR=/config \
  -e LLMUX_MODELS_DIR=/models \
  "$IMAGE"

# --- Wait for startup ---
echo "Waiting for llmux to start..."
for i in $(seq 1 30); do
  if curl -sf http://127.0.0.1:8081/health > /dev/null 2>&1; then
    echo "llmux is healthy!"
    break
  fi
  sleep 2
done

# --- Generate systemd units ---
echo "Generating systemd units..."
cd "$USER_SYSTEMD_DIR"
podman generate systemd --files --new --name "$POD_NAME"

# --- Stop the live pod (systemd will manage it) ---
podman pod stop "$POD_NAME"
podman pod rm -f "$POD_NAME"

# --- Enable systemd service ---
systemctl --user daemon-reload
systemctl --user enable --now "pod-${POD_NAME}.service"

echo ""
echo "=== llmux pod created and enabled ==="
echo "Service: systemctl --user status pod-${POD_NAME}.service"
echo "Health:  curl http://127.0.0.1:8081/health"
echo "Logs:    journalctl --user -u pod-${POD_NAME}.service -f"
```

- [ ] **Step 2: Make executable and commit**

```bash
chmod +x llmux/scripts/create_pod_llmux.sh
git add llmux/scripts/create_pod_llmux.sh
git commit -m "feat: Podman pod creation script with systemd integration"
```

---

### Task 20: Traefik Configuration

**Files:**
- Create: (written to) `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`

- [ ] **Step 1: Create the Traefik dynamic config**

Write to `/home/trf/.local/share/traefik_pod/dynamic/llmux.yml`:

```yaml
http:
  routers:
    llmux:
      entryPoints: ["wghttp"]
      rule: "Host(`kidirekt.kischdle.com`)"
      priority: 100
      service: llmux

  services:
    llmux:
      loadBalancer:
        servers:
          - url: "http://10.0.2.2:8081"
```

- [ ] **Step 2: Verify Traefik picks up the config**

Traefik watches the `dynamic/` directory with `watch: true`, so no restart is needed. Check the Traefik logs or the dashboard at `127.0.0.1:8085` to confirm the `llmux` router appears.

- [ ] **Step 3: Commit (in the llmux repo, noting the file location)**

The Traefik config lives outside the llmux repo. Document this in a comment within `create_pod_llmux.sh` and log it.

```bash
git add -A
git commit -m "docs: note Traefik config location for llmux routing"
```

---

### Task 21: System Integration — Build and GPU Passthrough

**Checkpoint: Phase 1 system integration begins. Iterate on issues until resolved before proceeding.**

- [ ] **Step 1: Copy config to llm user data dir**

```bash
# As user tlg (has llmux-design group access)
sudo -u llm mkdir -p /home/llm/.local/share/llmux_pod/config
sudo -u llm cp llmux/config/models.yaml /home/llm/.local/share/llmux_pod/config/
sudo -u llm cp llmux/config/api_keys.yaml /home/llm/.local/share/llmux_pod/config/
```

- [ ] **Step 2: Copy HuggingFace token to llm user**

```bash
sudo -u llm mkdir -p /home/llm/.cache/huggingface
sudo -u llm cp /home/tlg/.cache/huggingface/token /home/llm/.cache/huggingface/token
sudo -u llm chmod 600 /home/llm/.cache/huggingface/token
```

- [ ] **Step 3: Build the container image**

```bash
cd llmux
podman build -t llmux:latest -f Dockerfile .
```

Expected: Image builds successfully. If dependencies fail, fix the Dockerfile and rebuild.

- [ ] **Step 4: Test GPU passthrough**

```bash
podman run --rm --device nvidia.com/gpu=all llmux:latest nvidia-smi
```

Expected: Shows the RTX 5070 Ti inside the container. If CDI doesn't work, try `--device nvidia.com/gpu=0` or check the NVIDIA container toolkit setup (regenerating the CDI spec with `nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` is a common fix).

- [ ] **Step 5: Test model mount**

```bash
podman run --rm \
  -v /home/llm/.local/share/llmux_pod/models:/models:ro \
  llmux:latest \
  ls /models
```

Expected: Lists model directories. If empty, models haven't been downloaded yet; run `download_models.sh` first.

- [ ] **Step 6: Commit any fixes**

```bash
git add -A
git commit -m "fix: system integration fixes for container build and GPU passthrough"
```

---

### Task 22: System Integration — Service Startup and Open WebUI

- [ ] **Step 1: Start llmux manually for testing**

```bash
podman pod create --name llmux_pod -p 127.0.0.1:8081:8081
podman run -d --name llmux_ctr --pod llmux_pod \
  --device nvidia.com/gpu=all \
  -v /home/llm/.local/share/llmux_pod/models:/models:ro \
  -v /home/llm/.local/share/llmux_pod/config:/config:ro \
  -e LLMUX_CONFIG_DIR=/config \
  -e LLMUX_MODELS_DIR=/models \
  llmux:latest
```

- [ ] **Step 2: Verify health endpoint**

```bash
curl http://127.0.0.1:8081/health
```

Expected: `{"status":"ok","loaded_models":{},"available_vram_gb":16.0}`

- [ ] **Step 3: Verify model listing with auth**

```bash
API_KEY=$(grep 'openwebui' /home/llm/.local/share/llmux_pod/config/api_keys.yaml | awk '{print $2}' | tr -d '"')
curl -H "Authorization: Bearer $API_KEY" http://127.0.0.1:8081/v1/models | python3 -m json.tool
```

Expected: JSON with 16 models listed.
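
The `/v1/models` response follows the OpenAI "list" shape, so the count can also be checked programmatically. The sample payload below is illustrative, not captured server output:

```python
import json

# Illustrative payload in the OpenAI list shape; the real response has 16 entries.
sample = json.dumps({
    "object": "list",
    "data": [
        {"id": "Qwen3.5-4B-Instruct", "object": "model", "owned_by": "llmux"},
        {"id": "GPT-OSS-20B-High", "object": "model", "owned_by": "llmux"},
    ],
})

payload = json.loads(sample)
model_ids = [m["id"] for m in payload["data"]]
assert payload["object"] == "list"
assert "Qwen3.5-4B-Instruct" in model_ids
```

Against the live service, piping the curl output through the same `len(payload["data"])` check confirms all 16 virtual models are registered.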

- [ ] **Step 4: Configure Open WebUI via API**

```bash
# Login to get JWT token
TOKEN=$(curl -s http://127.0.0.1:8080/api/v1/auths/signin \
  -H "Content-Type: application/json" \
  -d '{"email":"Thomas.Langer@destengs.com","password":"3hXp+3!bks"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['token'])")

# Configure OpenAI connection
API_KEY=$(grep 'openwebui' /home/llm/.local/share/llmux_pod/config/api_keys.yaml | awk '{print $2}' | tr -d '"')

curl -X POST http://127.0.0.1:8080/api/v1/configs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"OPENAI_API_BASE_URL\": \"http://127.0.0.1:8081/v1\",
    \"OPENAI_API_KEY\": \"$API_KEY\"
  }"
```

Note: The exact Open WebUI API endpoints for configuring connections and audio may differ by version. Check the Open WebUI v0.8.12 API docs and adjust. The key settings to configure:
- OpenAI API base URL → `http://127.0.0.1:8081/v1`
- OpenAI API key → the generated key
- STT engine → openai, base URL → `http://127.0.0.1:8081/v1`
- TTS engine → openai, base URL → `http://127.0.0.1:8081/v1`

- [ ] **Step 5: Verify models appear in Open WebUI**

Open `http://127.0.0.1:8080` in a browser, log in as user "try" (destengs@gmx.com / k4/vvZ+17), and verify the model dropdown shows the 16 virtual models.

- [ ] **Step 6: Clean up the test pod and deploy via script**

```bash
podman pod stop llmux_pod && podman pod rm -f llmux_pod
# Now run the real deployment script as user llm:
sudo -u llm bash /home/llm/bin/create_pod_llmux.sh
```

- [ ] **Step 7: Verify systemd lifecycle**

```bash
sudo -u llm systemctl --user status pod-llmux_pod.service
sudo -u llm systemctl --user restart pod-llmux_pod.service
curl http://127.0.0.1:8081/health
```

Expected: Service running and healthy after restart. If `systemctl --user` fails with a bus connection error under `sudo -u llm`, set `XDG_RUNTIME_DIR=/run/user/$(id -u llm)` in the environment first.

- [ ] **Step 8: Commit any fixes**

```bash
git add -A
git commit -m "fix: system integration fixes for service startup and Open WebUI connection"
```

---

### Task 23: Download Models

**This task takes several hours due to ~60GB of downloads.**

- [ ] **Step 1: Run the download script**

```bash
sudo -u llm bash llmux/scripts/download_models.sh
```

Expected: All models download successfully. cohere-transcribe requires accepted terms and a token. Chatterbox downloads via Python.

- [ ] **Step 2: Verify all models are present**

```bash
ls -la /home/llm/.local/share/llmux_pod/models/
du -sh /home/llm/.local/share/llmux_pod/models/
```

Expected: ~60GB of model weights.

---

### Task 24: Functional Tests — Chat Inference

**Checkpoint: Phase 2 functional tests. Test each model via Open WebUI and curl.**

- [ ] **Step 1: Test Qwen3.5-4B-Thinking via curl**

```bash
API_KEY="<openwebui-key>"
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-4B-Thinking",
    "messages": [{"role": "user", "content": "What is 2+2? Think step by step."}],
    "stream": false
  }' | python3 -m json.tool
```

Expected: Response with thinking/reasoning visible in the output.

- [ ] **Step 2: Test Qwen3.5-4B-Instruct**

Same as above but with `"model": "Qwen3.5-4B-Instruct"`. Expected: Direct response without thinking.

- [ ] **Step 3: Test each remaining LLM model**

Repeat the curl tests for:
- Qwen3.5-9B-FP8-Thinking / Instruct
- Qwen3.5-9B-FP8-Uncensored-Thinking / Instruct
- GPT-OSS-20B-Low / Medium / High
- GPT-OSS-20B-Uncensored-Low / Medium / High

Verify each returns a reasonable response.

- [ ] **Step 4: Test streaming**

```bash
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-4B-Instruct",
    "messages": [{"role": "user", "content": "Count from 1 to 10."}],
    "stream": true
  }'
```

Expected: SSE stream with `data: {...}` chunks arriving incrementally.
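
The SSE framing can be parsed with a few lines of Python when manually inspecting output. The sample chunks below are illustrative of the OpenAI-compatible format, not captured output:

```python
import json

# Illustrative SSE lines as an OpenAI-compatible server would emit them.
raw = (
    'data: {"choices":[{"delta":{"content":"1"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"2"}}]}\n\n'
    "data: [DONE]\n\n"
)

chunks = []
for line in raw.splitlines():
    if not line.startswith("data: "):
        continue  # skip blank keep-alive lines between events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunks.append(json.loads(payload)["choices"][0]["delta"].get("content", ""))

assert "".join(chunks) == "12"
```

A stream that assembles into the full "1 ... 10" count confirms both chunk framing and delta content are well-formed.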

- [ ] **Step 5: Test in Open WebUI**

Log in as user "try" at `http://127.0.0.1:8080`. Select each model from the dropdown and send a test message. Verify responses stream in the UI.

---

### Task 25: Functional Tests — Vision and Tools

- [ ] **Step 1: Test vision with Qwen3.5-4B**

In Open WebUI as user "try", select Qwen3.5-4B-Instruct, attach an image, and ask "What is in this image?". Verify the model describes the image content.

Repeat for Qwen3.5-9B-FP8-Instruct and Qwen3.5-9B-FP8-Uncensored-Instruct.

- [ ] **Step 2: Test tool usage via curl**

```bash
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-9B-FP8-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }' | python3 -m json.tool
```

Expected: Response contains a `tool_calls` entry requesting `get_weather` with `city: "Berlin"`.

Repeat for Qwen3.5-9B-FP8-Uncensored-Instruct (llama-cpp-python), GPT-OSS-20B-Medium, and GPT-OSS-20B-Uncensored-Medium.

---

### Task 26: Functional Tests — ASR and TTS

- [ ] **Step 1: Test ASR via curl**

```bash
# Record a short WAV or use an existing audio file
curl -X POST http://127.0.0.1:8081/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test_audio.wav" \
  -F "model=cohere-transcribe" \
  -F "language=en"
```

Expected: `{"text": "...transcribed text..."}`
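
If no recording is at hand, a usable `test_audio.wav` can be generated with the standard library. This writes a 440 Hz tone; transcription of a pure tone will likely be empty or noise, so this mainly exercises the upload and decode pipeline rather than accuracy:

```python
import math
import struct
import wave

def write_tone_wav(path: str, freq_hz: float = 440.0,
                   duration_s: float = 2.0, sample_rate: int = 16000) -> None:
    """Write a mono 16-bit PCM sine tone, suitable as ASR pipeline test input."""
    n = int(sample_rate * duration_s)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(frames)

write_tone_wav("test_audio.wav")
```

For a meaningful transcription check, replace the file with an actual speech recording.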

- [ ] **Step 2: Test TTS via curl**

```bash
curl -X POST http://127.0.0.1:8081/v1/audio/speech \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Chatterbox-Multilingual", "input": "Hello, this is a test.", "voice": "default"}' \
  --output test_output.wav

# Play the audio
aplay test_output.wav  # or ffplay test_output.wav
```

Expected: Audible speech output.

- [ ] **Step 3: Test ASR and TTS in Open WebUI**

Log in as user "try". Use the dictation button (microphone icon) to record speech. Verify it appears as text. Use audio playback on a response to hear TTS output.

- [ ] **Step 4: Test German ASR**

```bash
curl -X POST http://127.0.0.1:8081/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test_german.wav" \
  -F "model=cohere-transcribe" \
  -F "language=de"
```

Expected: German transcription.

---

### Task 27: VRAM Management Tests

**Checkpoint: Phase 3 VRAM management tests.**

- [ ] **Step 1: Test small LLM — ASR + TTS remain loaded**

```bash
# Load ASR
curl -X POST http://127.0.0.1:8081/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test_audio.wav" -F "model=cohere-transcribe" -F "language=en"

# Load TTS
curl -X POST http://127.0.0.1:8081/v1/audio/speech \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Chatterbox-Multilingual", "input": "Test", "voice": "default"}' --output /dev/null

# Load small LLM
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-4B-Instruct", "messages": [{"role":"user","content":"Hi"}]}'

# Check health — all three should be loaded
curl http://127.0.0.1:8081/health | python3 -m json.tool
```

Expected: `loaded_models` contains cohere-transcribe, chatterbox-multilingual, and qwen3.5-4b.

- [ ] **Step 2: Test medium LLM — ASR + TTS remain loaded**

```bash
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-9B-FP8-Instruct", "messages": [{"role":"user","content":"Hi"}]}'

curl http://127.0.0.1:8081/health | python3 -m json.tool
```

Expected: `loaded_models` contains cohere-transcribe, chatterbox-multilingual, and qwen3.5-9b-fp8 (~15GB total).

- [ ] **Step 3: Test large LLM — evicts ASR and TTS**

```bash
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "GPT-OSS-20B-High", "messages": [{"role":"user","content":"Hi"}]}'

curl http://127.0.0.1:8081/health | python3 -m json.tool
```

Expected: Only gpt-oss-20b loaded (~13GB). ASR and TTS evicted.

- [ ] **Step 4: Test that an ASR request evicts the LLM first**

```bash
# With gpt-oss-20b still loaded, request ASR
curl -X POST http://127.0.0.1:8081/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@test_audio.wav" -F "model=cohere-transcribe" -F "language=en"

curl http://127.0.0.1:8081/health | python3 -m json.tool
```

Expected: gpt-oss-20b evicted, cohere-transcribe loaded.

- [ ] **Step 5: Test model swapping**

```bash
# Load one LLM
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-4B-Instruct", "messages": [{"role":"user","content":"Hi"}]}'

# Switch to another
curl -X POST http://127.0.0.1:8081/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-9B-FP8-Instruct", "messages": [{"role":"user","content":"Hi"}]}'

curl http://127.0.0.1:8081/health | python3 -m json.tool
```

Expected: Only qwen3.5-9b-fp8 loaded (qwen3.5-4b evicted).
|
|
|
|
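The swap behavior exercised in Steps 1–5 follows the VRAM manager's priority ordering (ASR > TTS > LLM): when a load request does not fit, lower-priority models are evicted first. A minimal sketch of that candidate selection, assuming a hypothetical registry of `(id, kind, vram_gb)` entries — names and fields are illustrative, not the actual llmux API:

```python
# Illustrative sketch of priority-based eviction (not the actual llmux code).
# Lower number = more important to keep: ASR(0) > TTS(1) > LLM(2).
PRIORITY = {"asr": 0, "tts": 1, "llm": 2}

def eviction_order(loaded, needed_gb, free_gb):
    """Return model ids to evict, least-important first, until needed_gb fits."""
    to_evict = []
    # Sort descending by priority number so LLMs go first, then TTS, then ASR.
    for model in sorted(loaded, key=lambda m: PRIORITY[m["kind"]], reverse=True):
        if free_gb >= needed_gb:
            break
        to_evict.append(model["id"])
        free_gb += model["vram_gb"]
    return to_evict

loaded = [
    {"id": "cohere-transcribe", "kind": "asr", "vram_gb": 3.0},
    {"id": "chatterbox-multilingual", "kind": "tts", "vram_gb": 4.0},
    {"id": "qwen3.5-4b", "kind": "llm", "vram_gb": 8.0},
]
# With only ~2GB free, a 13GB load evicts the LLM first, then TTS:
print(eviction_order(loaded, needed_gb=13.0, free_gb=2.0))
# → ['qwen3.5-4b', 'chatterbox-multilingual']
```

The real manager also has to serialize loads and actually free CUDA memory; this sketch only shows the ordering that the expectations above rely on.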
---

### Task 28: Performance Tests

**Checkpoint: Phase 4 performance tests.**

- [ ] **Step 1: Test transformers LLM GPU vs CPU**

```bash
for model in qwen3.5-4b qwen3.5-9b-fp8 gpt-oss-20b gpt-oss-20b-uncensored; do
  echo "=== Testing $model ==="
  curl -X POST http://127.0.0.1:8081/admin/test/performance \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"physical_model_id\": \"$model\"}" | python3 -m json.tool
done
```

Expected: Each model shows `"pass": true` with GPU at least 5x faster than CPU.

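The pass criterion above reduces to a simple ratio check. A sketch of what the test endpoint would evaluate, assuming it measures per-device token throughput (the function and field names here are made up for illustration):

```python
# Illustrative pass/fail check for the GPU-vs-CPU performance test.
# Assumes throughput is reported in tokens/sec per device (names are made up).
def speedup_pass(gpu_tok_per_s, cpu_tok_per_s, min_speedup=5.0):
    """GPU must be at least min_speedup times faster than CPU."""
    speedup = gpu_tok_per_s / cpu_tok_per_s
    return speedup >= min_speedup, round(speedup, 1)

print(speedup_pass(gpu_tok_per_s=120.0, cpu_tok_per_s=10.0))  # → (True, 12.0)
print(speedup_pass(gpu_tok_per_s=35.0, cpu_tok_per_s=10.0))   # → (False, 3.5)
```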
- [ ] **Step 2: Test ASR GPU vs CPU**

```bash
curl -X POST http://127.0.0.1:8081/admin/test/performance \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"physical_model_id": "cohere-transcribe"}' | python3 -m json.tool
```

Expected: `"pass": true`

- [ ] **Step 3: Test llama-cpp-python GPU vs CPU**

```bash
curl -X POST http://127.0.0.1:8081/admin/test/performance \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"physical_model_id": "qwen3.5-9b-fp8-uncensored"}' | python3 -m json.tool
```

Expected: `"pass": true`

- [ ] **Step 4: Test Chatterbox performance**

```bash
for model in chatterbox-turbo chatterbox-multilingual chatterbox; do
  echo "=== Testing $model ==="
  curl -X POST http://127.0.0.1:8081/admin/test/performance \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"physical_model_id\": \"$model\"}" | python3 -m json.tool
done
```

Expected: `realtime_factor > 1.0` (generates audio faster than real-time).

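For reference, the real-time factor is just seconds of audio produced per second of wall-clock generation time; a factor above 1.0 means the TTS model keeps ahead of playback. A trivial sketch (function name is illustrative):

```python
# Illustrative real-time factor computation for a TTS performance test:
# seconds of audio produced per second of wall-clock generation time.
def realtime_factor(audio_seconds, generation_seconds):
    return audio_seconds / generation_seconds

# 10s of speech generated in 4s of wall time → 2.5x real-time, a pass.
print(realtime_factor(10.0, 4.0))  # → 2.5
```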
---

### Task 29: Traefik and Remote Access Test

- [ ] **Step 1: Test Traefik routing**

From a machine on the WireGuard VPN, or locally if DNS resolves:

```bash
curl -H "Authorization: Bearer $API_KEY" https://kidirekt.kischdle.com/v1/models | python3 -m json.tool
```

Expected: Same 16 models as localhost. If DNS is not yet resolving, test locally:

```bash
curl -H "Host: kidirekt.kischdle.com" -H "Authorization: Bearer $API_KEY" http://127.0.0.1:8080/v1/models | python3 -m json.tool
```

(Port 8080 is Traefik's entry point.)

- [ ] **Step 2: Test remote Whisper transcription**

```bash
curl -X POST https://kidirekt.kischdle.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $WHISPER_KEY" \
  -F "file=@test_audio.wav" \
  -F "model=cohere-transcribe" \
  -F "language=en"
```

Expected: Transcription returned via remote API.

---

### Task 30: Final Cleanup and Documentation

- [ ] **Step 1: Copy create_pod_llmux.sh to /home/llm/bin/**

```bash
cp llmux/scripts/create_pod_llmux.sh /home/llm/bin/create_pod_llmux.sh
chmod +x /home/llm/bin/create_pod_llmux.sh
```

- [ ] **Step 2: Final commit**

```bash
git add -A
git commit -m "feat: llmux v0.1.0 — complete implementation with all models and tests passing"
```

- [ ] **Step 3: Push to Gitea**

```bash
git push origin main
```