fix: Open WebUI integration — Harmony stripping, VRAM eviction, concurrency lock
- Add harmony.py: strip GPT-OSS-20B analysis/thinking channel from both streaming and non-streaming responses (HarmonyStreamFilter + extract_final_text)
- Add per-model asyncio.Lock in llamacpp backend to prevent concurrent C++ access that caused container segfaults (exit 139)
- Fix chat handler swap for streaming: move inside _stream_generate within lock scope (was broken by try/finally running before the stream was consumed)
- Filter /v1/models to return only LLM models (hide ASR/TTS from the chat dropdown)
- Correct Qwen3.5-4B estimated_vram_gb: 4 → 9 (actual allocation ~8GB)
- Add GPU memory verification after eviction with retry loop in vram_manager
- Add HF_TOKEN_PATH support in main.py for gated model access
- Add /v1/audio/models and /v1/audio/voices discovery endpoints (no auth)
- Add OOM error handling in both backends and the chat route
- Add AUDIO_STT_SUPPORTED_CONTENT_TYPES for webm/wav/mp3/ogg
- Add performance test script (scripts/perf_test.py)
- Update tests to match the current config (42 tests pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
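The first two items above can be sketched roughly as follows. This is a minimal illustration, not the repo's code: the class and method names are hypothetical, and the Harmony channel tokens (`<|channel|>`, `<|message|>`, `<|end|>`) follow the published GPT-OSS Harmony format but should be treated as an assumption here.

```python
import asyncio
import re
from collections import defaultdict


def extract_final_text(raw: str) -> str:
    """Keep only the 'final' channel of a Harmony-formatted completion,
    dropping the analysis (thinking) channel. Token names are assumed."""
    m = re.search(r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|$)", raw, re.S)
    return m.group(1) if m else raw


class LlamaCppBackendSketch:
    """Hypothetical backend: serialize access per model so two requests
    never enter the llama.cpp C++ code concurrently (the exit-139 crash)."""

    def __init__(self) -> None:
        # One asyncio.Lock per model name, created lazily on first use.
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def generate(self, model_name: str, prompt: str) -> str:
        async with self._locks[model_name]:
            # Only one coroutine per model reaches this point at a time.
            return await self._run_inference(model_name, prompt)

    async def _run_inference(self, model_name: str, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for the real llama.cpp call
        return f"[{model_name}] {prompt}"
```

The per-model granularity matters: a single global lock would also prevent the crash, but would needlessly serialize requests to different loaded models.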
@@ -13,10 +13,10 @@ def test_physical_model_has_required_fields():
     physical, _ = load_models_config()
     qwen = physical["qwen3.5-9b-fp8"]
     assert qwen.type == "llm"
-    assert qwen.backend == "transformers"
-    assert qwen.model_id == "lovedheart/Qwen3.5-9B-FP8"
-    assert qwen.estimated_vram_gb == 9
-    assert qwen.supports_vision is True
+    assert qwen.backend == "llamacpp"
+    assert qwen.model_id == "unsloth/Qwen3.5-9B-GGUF"
+    assert qwen.estimated_vram_gb == 10
+    assert qwen.supports_vision is False
+    assert qwen.supports_tools is True