- Add 'kernels' package to Dockerfile for native MXFP4 execution
(fixes gpt-oss-20b OOM: 15.2GB→13.5GB)
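The change might look like the following Dockerfile fragment (the base image tag is an assumption, not shown in this log):

```dockerfile
# Base image is illustrative; substitute the project's actual image.
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

# The 'kernels' package supplies the quantization kernels transformers
# needs to run gpt-oss-20b's MXFP4 weights natively; without it the
# weights are dequantized to bf16, pushing VRAM from 13.5GB to 15.2GB.
RUN pip install --no-cache-dir kernels
```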
- Reduce GGUF n_ctx from 8192 to 4096 and quantize KV cache to Q8_0
to reduce VRAM usage
- Use GGML_TYPE_Q8_0 constant instead of string for type_k/type_v
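Assuming llama-cpp-python is the loader (the log doesn't show the call site), the settings above map onto the constructor roughly like this:

```python
def load_gguf(model_path: str):
    """Load a GGUF model with halved context and an 8-bit KV cache."""
    from llama_cpp import Llama, GGML_TYPE_Q8_0

    # Halving n_ctx and quantizing K/V from fp16 to Q8_0 roughly
    # quarters the KV-cache VRAM footprint.
    return Llama(
        model_path=model_path,
        n_ctx=4096,             # down from 8192
        type_k=GGML_TYPE_Q8_0,  # integer constant, not the string "q8_0"
        type_v=GGML_TYPE_Q8_0,
        flash_attn=True,        # llama.cpp requires flash attention for a quantized V cache
    )
```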
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Call gc.collect() before torch.cuda.empty_cache() so lingering model
references are collected first and their VRAM can actually be freed
- Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in container
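A minimal sketch of the unload path described above (torch is imported defensively so the helper also loads on CPU-only machines; the function name is illustrative):

```python
import gc
import os

# Must be set before torch initializes CUDA; lets the allocator grow
# segments instead of failing on fragmented free memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

try:
    import torch
except ImportError:  # keeps the sketch importable without torch installed
    torch = None


def release_vram() -> None:
    """Collect dropped model references, then return cached VRAM to the driver."""
    # empty_cache() only frees blocks the allocator considers unused, so
    # run a full GC pass first to drop any lingering model references.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
```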
- Multi-stage: devel image builds llama-cpp-python with CUDA, runtime
image gets the compiled library via COPY
- chatterbox-tts installed --no-deps to prevent torch 2.6 downgrade
- librosa and diskcache added as explicit chatterbox/llama-cpp deps
- All imports verified with GPU passthrough
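The two stages might be structured like this (image tags, the CMake flag, and the site-packages path are assumptions; check the actual Dockerfile):

```dockerfile
# Build stage: has nvcc, so llama-cpp-python compiles with CUDA support.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
RUN CMAKE_ARGS="-DGGML_CUDA=on" pip install --no-cache-dir llama-cpp-python

# Runtime stage: smaller image without the build toolchain.
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
COPY --from=build /usr/local/lib/python3.10/dist-packages/llama_cpp \
                  /usr/local/lib/python3.10/dist-packages/llama_cpp

# --no-deps stops pip from replacing the base image's torch with 2.6;
# the dependencies chatterbox actually needs are then installed explicitly.
RUN pip install --no-cache-dir --no-deps chatterbox-tts && \
    pip install --no-cache-dir librosa diskcache
```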
Removed librosa (unused), torch, and pyyaml from the install list since
they're already in the base image; this avoids a numpy rebuild conflict.
Podman requires the docker.io/ prefix when unqualified-search registries
are not configured.
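Two equivalent fixes; the image name here is only an example:

```shell
# Either qualify the image reference explicitly...
podman pull docker.io/library/python:3.12-slim

# ...or configure a search registry once in /etc/containers/registries.conf:
#   unqualified-search-registries = ["docker.io"]
```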