- Force gc.collect() before torch.cuda.empty_cache() to ensure all
model references are released
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in container
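The unload sequence described above can be sketched as follows. This is an illustrative helper, not the actual module; `release_model` is a hypothetical name, and the env var is normally set via `ENV` in the Dockerfile before torch first initializes its CUDA allocator.

```python
import gc
import os

# Normally set via ENV in the container image; it must be in place before
# torch initializes its CUDA allocator for expandable segments to apply.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

def release_model(model):
    """Drop the last reference to `model`, then return cached VRAM to the driver."""
    del model
    gc.collect()  # break reference cycles so GPU tensors become collectable
    try:
        import torch  # imported lazily so the helper degrades gracefully off-GPU
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached allocator blocks back to CUDA
    except ImportError:
        pass
```

Calling `gc.collect()` first matters: `empty_cache()` only releases blocks that are no longer referenced, so cyclic references left behind by a model can otherwise pin VRAM.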
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Multi-stage: devel image builds llama-cpp-python with CUDA, runtime
image gets the compiled library via COPY
- chatterbox-tts installed --no-deps to prevent torch 2.6 downgrade
- librosa and diskcache added as explicit chatterbox/llama-cpp deps
- All imports verified with GPU passthrough
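The multi-stage layout might look like the sketch below. The image tags, Python path, and CMake flag are illustrative assumptions, not the exact values used here; it also assumes pip is already present in both stages.

```dockerfile
# Build stage: compile llama-cpp-python against CUDA (tags illustrative).
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
RUN CMAKE_ARGS="-DGGML_CUDA=on" pip install --no-cache-dir llama-cpp-python

# Runtime stage: copy only the compiled packages, no compiler toolchain.
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
COPY --from=build /usr/local/lib/python3.10/dist-packages \
                  /usr/local/lib/python3.10/dist-packages

# --no-deps keeps chatterbox-tts from pulling in its pinned torch 2.6;
# the runtime deps it actually needs are installed explicitly instead.
RUN pip install --no-cache-dir --no-deps chatterbox-tts && \
    pip install --no-cache-dir librosa diskcache
```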
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed librosa (unused) from the install list, along with torch and
pyyaml, which are already in the base image. Avoids a numpy rebuild conflict.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Podman requires the docker.io/ prefix on image references when
unqualified-search-registries is not configured.
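For reference, the alternative to qualifying every image name is the registries setting below; this is a standard containers config fragment, shown here as an assumption about the host rather than what this deployment does.

```toml
# /etc/containers/registries.conf — with this set, `podman pull python:3.12`
# resolves against docker.io instead of failing on an unqualified name.
unqualified-search-registries = ["docker.io"]
```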
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original eviction logic blocked ASR eviction even when an LLM
genuinely needed all 16GB VRAM (e.g., gpt-oss-20b at 13GB). Now uses
two-pass eviction: first evicts lower/same priority, then cascades to
higher priority as last resort. Added tests for ASR-survives and
full-cascade scenarios.
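The two-pass policy can be sketched like this. All names and sizes are illustrative (not the real module); the key property is that pass 1 never touches higher-priority models, and pass 2 only runs when pass 1 leaves the request unsatisfied.

```python
from dataclasses import dataclass

# Priorities mirror the eviction order: lower number = evicted first.
LLM, TTS, ASR = 0, 1, 2

@dataclass
class Loaded:
    name: str
    priority: int
    vram_gb: float

def plan_eviction(loaded, needed_gb, requester_priority, total_gb=16.0):
    """Return models to evict so `needed_gb` fits, or None if impossible."""
    free = total_gb - sum(m.vram_gb for m in loaded)
    victims = []
    for cascade in (False, True):
        for m in sorted(loaded, key=lambda m: m.priority):
            if free >= needed_gb:
                return victims
            if m in victims:
                continue
            if not cascade and m.priority > requester_priority:
                continue  # pass 1: spare higher-priority models
            victims.append(m)  # pass 2: cascade upward as a last resort
            free += m.vram_gb
    return victims if free >= needed_gb else None
```

With a 13 GB request against a 16 GB card holding a 4 GB ASR and a 6 GB TTS model, evicting TTS alone leaves only 12 GB free, so the cascade pass correctly reaches into ASR; a 9 GB request stops after TTS and ASR survives.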
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks GPU VRAM usage (16GB) and handles model loading/unloading with
priority-based eviction: LLM (lowest) -> TTS -> ASR (highest, protected).
Uses asyncio Lock for concurrency safety.
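A minimal skeleton of the budget-tracking-plus-lock idea, with illustrative names (the real manager also performs the priority-based eviction described above):

```python
import asyncio

class VramManager:
    """Tracks a fixed VRAM budget; a single Lock serializes load/unload."""

    def __init__(self, total_gb: float = 16.0):
        self.total_gb = total_gb
        self.used_gb = 0.0
        self._lock = asyncio.Lock()  # prevents interleaved budget updates

    async def load(self, name: str, vram_gb: float) -> None:
        async with self._lock:
            if self.used_gb + vram_gb > self.total_gb:
                raise MemoryError(f"{name}: not enough VRAM")
            self.used_gb += vram_gb  # safe: no await between check and update

    async def unload(self, vram_gb: float) -> None:
        async with self._lock:
            self.used_gb -= vram_gb
```

Holding the lock across the check-then-update means two concurrent `load` calls can never both pass the budget check and overcommit the card.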
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements create_api_key_dependency() FastAPI dependency that validates
Bearer tokens against a configured list of ApiKey objects (401 on missing,
malformed, or unknown tokens). Includes 5 TDD tests covering all cases.
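The token check inside that dependency amounts to the logic below. The FastAPI wiring (header extraction, `HTTPException(status_code=401)`) is omitted so the sketch stands alone; `Unauthorized` and `validate_bearer` are illustrative stand-ins, not the actual names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApiKey:
    name: str
    token: str

class Unauthorized(Exception):
    """Stand-in for FastAPI's HTTPException with status_code=401."""

def validate_bearer(authorization, keys):
    """Return the matching ApiKey, raising Unauthorized for all 401 cases."""
    if not authorization:
        raise Unauthorized("missing Authorization header")
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise Unauthorized("malformed Authorization header")
    for key in keys:
        if key.token == token:
            return key
    raise Unauthorized("unknown token")
```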
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers project scaffolding, config, auth, VRAM manager, all four
backends, API routes, Dockerfile, deployment scripts, and four
phases of testing (integration, functional, VRAM, performance).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers architecture, model registry, VRAM management, API endpoints,
container setup, Open WebUI integration, Traefik routing, and
four-phase testing plan.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>