feat: Jinja template thinking toggle, Qwen3.5-9B GGUF Q8_0
- Thinking/Instruct toggle via Jinja template patching in llama-cpp backend: creates separate handlers for thinking-enabled and thinking-disabled modes
- Replace lovedheart/Qwen3.5-9B-FP8 (safetensors, 15.8GB, OOM) with unsloth/Qwen3.5-9B-GGUF Q8_0 (9.2GB, fits)
- Enable flash_attn in llama-cpp for better performance
- GGUF path resolution falls back to flat gguf/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
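The first bullet can be sketched roughly as follows. The commit's actual handler classes and template markers are not shown here, so every name below is illustrative: the idea is that llama-cpp renders the chat template itself, so instead of passing `enable_thinking` at call time, the backend bakes the flag into two template variants up front, one per handler.

```python
def patch_chat_template(template: str, enable_thinking: bool) -> str:
    """Bake the thinking flag into a Jinja chat template as a literal.

    Hypothetical sketch: the real Qwen3.5 template is much larger, and the
    substitution target depends on how the model references the flag.
    """
    literal = "true" if enable_thinking else "false"
    return template.replace("enable_thinking", literal)


# Illustrative template fragment only (not the real Qwen3.5 template):
TEMPLATE = "{% if enable_thinking %}<think>\n{% endif %}"

# Two handlers, each hard-wired to one mode:
thinking_tmpl = patch_chat_template(TEMPLATE, True)
instruct_tmpl = patch_chat_template(TEMPLATE, False)
```

Pre-rendering two variants avoids plumbing a per-request flag through llama-cpp's template machinery.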
@@ -1,10 +1,11 @@
 physical_models:
   qwen3.5-9b-fp8:
     type: llm
-    backend: transformers
-    model_id: "lovedheart/Qwen3.5-9B-FP8"
-    estimated_vram_gb: 9
-    supports_vision: true
+    backend: llamacpp
+    model_id: "unsloth/Qwen3.5-9B-GGUF"
+    model_file: "Qwen3.5-9B-Q8_0.gguf"
+    estimated_vram_gb: 10
+    supports_vision: false
+    supports_tools: true

   qwen3.5-9b-fp8-uncensored:
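The new `model_file` key pairs with the path-fallback behavior from the commit message. A minimal sketch of that resolution, assuming a conventional per-model layout (the directory names here are assumptions; the commit only states that resolution falls back to a flat gguf/ directory):

```python
from pathlib import Path


def resolve_gguf(models_dir: Path, model_id: str, model_file: str) -> Path:
    """Locate a GGUF file, falling back to a flat gguf/ directory.

    Hypothetical layout: try models/<model_id>/<model_file> first,
    then models/gguf/<model_file>.
    """
    nested = models_dir / model_id / model_file
    if nested.is_file():
        return nested
    flat = models_dir / "gguf" / model_file
    if flat.is_file():
        return flat
    raise FileNotFoundError(f"GGUF not found: {model_file}")
```

With the config above, `resolve_gguf(models_dir, "unsloth/Qwen3.5-9B-GGUF", "Qwen3.5-9B-Q8_0.gguf")` would pick up the file from either location.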