fix: VRAM eviction cascades through all tiers for large LLM loads

The original eviction logic never evicted ASR, even when an LLM
genuinely needed nearly all 16GB of VRAM (e.g., gpt-oss-20b at 13GB).
Eviction now runs in two passes: first evict lower/same-priority
models, then cascade into higher-priority ones as a last resort.
Added tests for the ASR-survives and full-cascade scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tlg
2026-04-04 09:22:14 +02:00
parent d7a091df8c
commit 813bbe0ad0
2 changed files with 71 additions and 18 deletions


@@ -70,20 +70,39 @@ class VRAMManager:
         requesting_priority = _PRIORITY[requesting_type]
         # Evict in priority order: lowest first (LLM=0, TTS=1, ASR=2).
-        # Rule: never evict highest-priority tier (ASR) for a lower-priority
-        # request. Same-priority replacement is always allowed (e.g., old LLM
-        # evicted for new LLM). Lower-priority models are fair game for any
-        # requester — cascade through them until enough VRAM is freed.
+        #
+        # Rule: never evict a higher-priority model to make room for a
+        # lower-priority one. E.g., a TTS request must not evict ASR —
+        # it should evict the LLM instead. But an LLM request CAN cascade
+        # through TTS and ASR as a last resort, because there is nothing
+        # lower to evict. Same-priority replacement is always allowed.
+        #
+        # Pass 1: evict models with priority <= requesting priority
+        #         (lower or same tier).
+        # Pass 2: if still not enough, evict higher-priority models
+        #         in ascending order (only when the requester has no
+        #         lower-priority alternatives left).
         candidates = sorted(self._loaded.values(), key=lambda s: s.priority)
-        for slot in candidates:
+        # Pass 1: evict lower and same priority
+        for slot in list(candidates):
             if self.available_vram_gb >= needed_gb:
                 break
-            # Skip if this slot is the highest-priority tier and the requester
-            # is lower priority. (Protects ASR from eviction by TTS/LLM.)
-            if slot.priority > requesting_priority and slot.model_type == "asr":
-                continue
-            logger.info(
-                f"Evicting {slot.model_id} ({slot.model_type}, {slot.vram_gb}GB)"
-            )
-            await slot.backend.unload(slot.model_id)
-            del self._loaded[slot.model_id]
+            if slot.priority <= requesting_priority:
+                logger.info(
+                    f"Evicting {slot.model_id} ({slot.model_type}, {slot.vram_gb}GB)"
+                )
+                await slot.backend.unload(slot.model_id)
+                del self._loaded[slot.model_id]
+        # Pass 2: evict higher priority as last resort
+        if self.available_vram_gb < needed_gb:
+            candidates = sorted(self._loaded.values(), key=lambda s: s.priority)
+            for slot in list(candidates):
+                if self.available_vram_gb >= needed_gb:
+                    break
+                logger.info(
+                    f"Evicting {slot.model_id} ({slot.model_type}, {slot.vram_gb}GB) [last resort]"
+                )
+                await slot.backend.unload(slot.model_id)
+                del self._loaded[slot.model_id]
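The two-pass policy can be demonstrated in isolation. The sketch below is a minimal synchronous stand-in, not the project's actual classes: `Slot`, this `VRAMManager`, the 16GB budget, and the model names/sizes are all illustrative assumptions; only the `_PRIORITY` ordering and the two-pass loop structure mirror the diff above.

```python
from dataclasses import dataclass

_PRIORITY = {"llm": 0, "tts": 1, "asr": 2}  # higher number = more protected


@dataclass
class Slot:
    model_id: str
    model_type: str
    vram_gb: float

    @property
    def priority(self) -> int:
        return _PRIORITY[self.model_type]


class VRAMManager:
    def __init__(self, total_gb: float = 16.0):
        self.total_gb = total_gb
        self._loaded: dict[str, Slot] = {}

    @property
    def available_vram_gb(self) -> float:
        return self.total_gb - sum(s.vram_gb for s in self._loaded.values())

    def load(self, slot: Slot) -> None:
        self._loaded[slot.model_id] = slot

    def evict_for(self, requesting_type: str, needed_gb: float) -> None:
        requesting_priority = _PRIORITY[requesting_type]
        # Pass 1: only lower/same-priority models are fair game.
        for slot in sorted(self._loaded.values(), key=lambda s: s.priority):
            if self.available_vram_gb >= needed_gb:
                return
            if slot.priority <= requesting_priority:
                del self._loaded[slot.model_id]
        # Pass 2: cascade through higher-priority models as a last resort.
        for slot in sorted(self._loaded.values(), key=lambda s: s.priority):
            if self.available_vram_gb >= needed_gb:
                return
            del self._loaded[slot.model_id]


# ASR-survives scenario: a 4GB TTS request evicts the LLM, never ASR.
mgr = VRAMManager()
mgr.load(Slot("llama", "llm", 12.0))
mgr.load(Slot("whisper", "asr", 2.0))
mgr.evict_for("tts", 4.0)
print(sorted(mgr._loaded))  # → ['whisper']

# Full-cascade scenario: a 15GB LLM request clears even TTS and ASR,
# because pass 1 finds nothing lower-priority to evict.
mgr2 = VRAMManager()
mgr2.load(Slot("piper", "tts", 2.0))
mgr2.load(Slot("whisper", "asr", 2.0))
mgr2.evict_for("llm", 15.0)
print(sorted(mgr2._loaded))  # → []
```

Note the ordering inside each pass: candidates are always walked lowest-priority first, so even the last-resort pass sacrifices TTS before ASR.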