Commit Graph

16 Commits

Author SHA256 Message Date
tlg
d615bb4553 fix: Chatterbox uses separate classes per variant, remove turbo
ChatterboxTTS and ChatterboxMultilingualTTS are separate classes.
Turbo variant doesn't exist in chatterbox-tts 0.1.7.
Multilingual generate() requires language_id parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:43:40 +02:00
tlg
f24a225baf fix: resolve GGUF paths through HF cache, add model_id to GGUF config
The llama-cpp-python backend now uses huggingface_hub to resolve GGUF
file paths within the HF cache structure instead of assuming a flat
/models/ directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:33:36 +02:00
tlg
38e1523d7e feat: proper VRAM cleanup and admin clear-vram endpoint
- gc.collect() + torch.cuda.empty_cache() in unload for reliable VRAM release
- POST /admin/clear-vram endpoint unloads all models and reports GPU memory
- VRAMManager.clear_all() method for programmatic VRAM cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:03:39 +02:00
tlg
aa7a160118 fix: proper VRAM cleanup on model unload + CUDA alloc config
- Force gc.collect() before torch.cuda.empty_cache() to ensure all
  model references are released
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in container

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 17:59:23 +02:00
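
The ordering described in aa7a160118 can be illustrated with a minimal sketch. It is not the repository's code: the registry dict and the release_vram name are hypothetical, and the torch import is guarded so the sketch also runs where PyTorch is not installed. The idea is that gc.collect() runs first so that unreachable model objects (including reference cycles) are actually freed, and only then is the CUDA caching allocator asked to return its unused blocks.

```python
import gc

try:
    import torch  # optional: the sketch still runs without PyTorch installed
except ImportError:
    torch = None


def release_vram(registry, name):
    """Drop the registry's reference to a model, then reclaim GPU memory.

    `registry` is a hypothetical dict mapping model names to loaded models.
    Returns True if a model was unloaded, False if the name was not loaded.
    """
    model = registry.pop(name, None)
    if model is None:
        return False

    # Drop the local reference too, so nothing in this frame keeps the model alive.
    del model

    # Collect unreachable objects first; tensors that are still referenced
    # cannot be released by the allocator.
    gc.collect()

    # Then hand cached, now-unused blocks back to the driver (CUDA only).
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return True
```

Calling release_vram(registry, "some-model") a second time returns False, which makes the helper safe to call from a bulk cleanup path.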
tlg
17818a3860 feat: FastAPI app assembly with all routes and backend wiring
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:04:56 +02:00
tlg
d55c80ae35 feat: API routes for models, chat, transcription, speech, and admin
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:04:45 +02:00
tlg
ef44bc09b9 feat: Chatterbox TTS backend with turbo/multilingual/default variants
2026-04-04 09:40:42 +02:00
tlg
c6677dcab3 feat: llama-cpp-python backend with GGUF, vision, and tool support
2026-04-04 09:40:40 +02:00
tlg
de25b5e2a7 feat: transformers ASR backend for cohere-transcribe
2026-04-04 09:40:39 +02:00
tlg
449e37d318 feat: abstract base class for model backends
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:29:35 +02:00
tlg
813bbe0ad0 fix: VRAM eviction cascades through all tiers for large LLM loads
The original eviction logic blocked ASR eviction even when an LLM
genuinely needed all 16GB VRAM (e.g., gpt-oss-20b at 13GB). Now uses
two-pass eviction: first evicts lower/same priority, then cascades to
higher priority as last resort. Added tests for ASR-survives and
full-cascade scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:22:14 +02:00
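
The two-pass eviction described in 813bbe0ad0 might look roughly like the sketch below. It is an illustration under assumptions, not the repository's VRAMManager: the Loaded dataclass, the numeric PRIORITY table, and plan_eviction are hypothetical names, and the model sizes in the usage note (other than the 13GB figure from the commit message) are made up. Pass 1 considers only models at the same or lower priority than the incoming one; pass 2 reaches higher-priority models only if pass 1 did not free enough.

```python
from dataclasses import dataclass

TOTAL_VRAM_GB = 16.0  # capacity stated in the commit message

# Assumed numeric priorities: a higher number means "evicted later".
PRIORITY = {"llm": 0, "tts": 1, "asr": 2}


@dataclass
class Loaded:
    name: str
    kind: str        # "llm", "tts", or "asr"
    vram_gb: float


def plan_eviction(loaded, incoming, total_gb=TOTAL_VRAM_GB):
    """Return the names to evict so `incoming` fits, or None if it never can."""
    if incoming.vram_gb > total_gb:
        return None

    free = total_gb - sum(m.vram_gb for m in loaded)
    incoming_prio = PRIORITY[incoming.kind]

    # Pass 1: same or lower priority than the incoming model, lowest first.
    first_pass = sorted(
        (m for m in loaded if PRIORITY[m.kind] <= incoming_prio),
        key=lambda m: PRIORITY[m.kind],
    )
    # Pass 2: higher-priority models, touched only as a last resort.
    second_pass = sorted(
        (m for m in loaded if PRIORITY[m.kind] > incoming_prio),
        key=lambda m: PRIORITY[m.kind],
    )

    victims = []
    for candidate in first_pass + second_pass:
        if free >= incoming.vram_gb:
            break
        victims.append(candidate.name)
        free += candidate.vram_gb
    return victims
```

With a 4GB ASR model, a 2GB TTS model, and a 6GB LLM loaded, an incoming 8GB LLM evicts only the old LLM and the ASR model survives; an incoming 13GB LLM cascades through all three.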
tlg
d7a091df8c feat: VRAM manager with priority-based model eviction
Tracks GPU VRAM usage (16GB) and handles model loading/unloading with
priority-based eviction: LLM (lowest) -> TTS -> ASR (highest, protected).
Uses asyncio Lock for concurrency safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:14:41 +02:00
tlg
969bcb3292 feat: API key authentication dependency
Implements create_api_key_dependency() FastAPI dependency that validates
Bearer tokens against a configured list of ApiKey objects (401 on missing,
malformed, or unknown tokens). Includes 5 TDD tests covering all cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:31:30 +02:00
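
The validation rules named in 969bcb3292 (401 on missing, malformed, or unknown tokens) can be sketched without the web framework. This is an assumption-laden illustration rather than the actual create_api_key_dependency(): the ApiKey fields, the Unauthorized exception, and the authenticate function are hypothetical, and a real FastAPI dependency would raise an HTTPException with status 401 instead.

```python
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiKey:
    # Field names are assumptions; the commit only says "ApiKey objects".
    name: str
    key: str


class Unauthorized(Exception):
    """Stand-in for an HTTP 401 response."""
    status_code = 401


def authenticate(authorization, api_keys):
    """Validate an Authorization header value against the configured keys."""
    if not authorization:
        raise Unauthorized("missing Authorization header")

    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        raise Unauthorized("malformed Authorization header")

    for api_key in api_keys:
        # Constant-time comparison avoids leaking key contents via timing.
        if hmac.compare_digest(token.encode(), api_key.key.encode()):
            return api_key
    raise Unauthorized("unknown API key")
```

Returning the matched ApiKey lets a route handler know which client made the request, while every failure mode funnels into the same 401-style error.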
tlg
c4eaf5088b feat: model registry with virtual-to-physical resolution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:31:10 +02:00
tlg
690ad46d88 feat: config loading for models.yaml and api_keys.yaml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:30:13 +02:00
tlg
a64f32b590 feat: project scaffolding with config files and test fixtures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 07:23:14 +02:00