The purpose of these changes is to stop LoRA adapters from accumulating in GPU memory forever.

#17
by RahulRathod7 - opened

Before:

  • Every newly selected LoRA stayed loaded for the lifetime of the app.
  • VRAM usage could keep growing as users tried more adapters.
  • Because the pipeline is global, concurrent requests could also interfere with adapter switching.

After:

  • The app keeps only a small number of recently used LoRAs in memory.
  • When the cache is full, it evicts the least recently used adapter before loading a new one.
  • A lock ensures adapter load/evict/switch operations don’t race with inference.
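For reviewers who want a mental model of the new behavior, here is a minimal sketch of an LRU adapter cache guarded by a lock. It is not the code in this PR: the class name `LoraLruCache`, the `load`/`unload` callbacks, and the `run_with_adapter` helper are hypothetical stand-ins for whatever the app uses to attach and detach adapters on the shared pipeline.

```python
import threading
from collections import OrderedDict
from typing import Callable, List, TypeVar

T = TypeVar("T")


class LoraLruCache:
    """Tracks which adapters are loaded and evicts the least recently used one.

    `load` and `unload` are hypothetical callbacks that attach or detach an
    adapter on the shared pipeline; they are assumptions, not a real API.
    """

    def __init__(self, max_cached: int,
                 load: Callable[[str], None],
                 unload: Callable[[str], None]) -> None:
        if max_cached < 1:
            raise ValueError("max_cached must be at least 1")
        self._max_cached = max_cached
        self._load = load
        self._unload = unload
        # Keys are adapter names, ordered from least to most recently used.
        self._cached: "OrderedDict[str, None]" = OrderedDict()
        # One lock covers load, evict, switch, and inference.
        self.lock = threading.Lock()

    def activate(self, name: str) -> None:
        """Make `name` available. The caller must already hold `self.lock`."""
        if name in self._cached:
            # Cache hit: mark as most recently used, nothing to load.
            self._cached.move_to_end(name)
            return
        # Cache miss: free a slot first so VRAM never holds more than the limit.
        while len(self._cached) >= self._max_cached:
            victim, _ = self._cached.popitem(last=False)
            self._unload(victim)
        self._load(name)
        self._cached[name] = None

    def cached_names(self) -> List[str]:
        """Adapter names from least to most recently used."""
        return list(self._cached)


def run_with_adapter(cache: LoraLruCache, name: str,
                     infer: Callable[[str], T]) -> T:
    """Switch adapters and run inference under the same lock."""
    with cache.lock:
        cache.activate(name)
        return infer(name)
```

Holding one lock across both the switch and the inference call serializes requests on the global pipeline. That trades throughput for safety, which matches the stated goal of avoiding shared-pipeline corruption.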

The practical goals are:

  • lower long-session VRAM growth
  • make the app more stable under repeated adapter switching
  • reduce the chance of OOMs and shared-pipeline corruption

You can control the cache size with MAX_CACHED_LORAS.
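This description does not say how MAX_CACHED_LORAS is supplied. Assuming it is read from an environment variable, the parsing could look like the sketch below. The helper name `read_max_cached_loras` and the default of 3 are hypothetical, not values taken from this PR.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback; the PR does not state the actual default.
DEFAULT_MAX_CACHED_LORAS = 3


def read_max_cached_loras(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse MAX_CACHED_LORAS, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get("MAX_CACHED_LORAS", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or non-numeric: use the default instead of crashing at startup.
        return DEFAULT_MAX_CACHED_LORAS
    # Keep at least one slot so the selected adapter can always be loaded.
    return max(1, value)
```

Clamping to a minimum of 1 is a design choice in this sketch: a cache size of zero would leave no room for the adapter currently in use.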
