fix: add triton kernels for MXFP4, fix GGUF KV cache quantization

- Add 'kernels' package to Dockerfile for native MXFP4 execution
  (fixes gpt-oss-20b OOM: 15.2GB→13.5GB)
- Reduce GGUF n_ctx from 8192 to 4096 and quantize KV cache to Q8_0
  to reduce VRAM usage
- Use GGML_TYPE_Q8_0 constant instead of string for type_k/type_v

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: tlg
Date: 2026-04-05 22:49:16 +02:00
Parent: a88f0afb8a
Commit: da35e94b16

2 changed files with 7 additions and 3 deletions


@@ -27,10 +27,11 @@ RUN pip install --no-cache-dir --break-system-packages \
     "sentencepiece>=0.2.0" \
     "protobuf>=5.0.0"
 
-# Install transformers + accelerate (needed for device_map)
+# Install transformers + accelerate + kernels (MXFP4/FP8 triton kernels)
 RUN pip install --no-cache-dir --break-system-packages --no-build-isolation \
     "transformers>=5.4.0" \
-    "accelerate>=1.0.0"
+    "accelerate>=1.0.0" \
+    "kernels"
 
 # Install chatterbox-tts WITHOUT its dependencies (it would downgrade
 # torch from 2.11 to 2.6 and pull gradio, librosa, etc.)
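The GGUF change described in the commit message (n_ctx halved, KV cache quantized to Q8_0 via the enum constant rather than a string) is not shown in this hunk. A minimal sketch of what those loader settings look like, assuming a llama-cpp-python style API; the `GGML_TYPE_Q8_0` value mirrors ggml's type enum, and `loader_kwargs` is an illustrative name:

```python
# ggml type enum value for Q8_0; llama-cpp-python exposes this as
# llama_cpp.GGML_TYPE_Q8_0. Defined inline here so the sketch is self-contained.
GGML_TYPE_Q8_0 = 8

# Settings implied by the commit: the constructor expects integer enum
# values for type_k/type_v, so passing the string "q8_0" would be a bug.
loader_kwargs = dict(
    n_ctx=4096,             # reduced from 8192 to shrink the KV cache
    type_k=GGML_TYPE_Q8_0,  # quantize K cache to Q8_0
    type_v=GGML_TYPE_Q8_0,  # quantize V cache to Q8_0 (typically requires flash_attn=True)
)

# Usage would then be roughly:
# llm = llama_cpp.Llama(model_path="model.gguf", **loader_kwargs)  # path is a placeholder
```

Q8_0 KV quantization roughly halves KV-cache VRAM versus FP16 at the same context length, which combines with the smaller `n_ctx` to give the savings the commit targets.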