Thursday, April 2, 2026
Running Multiple Models on One GPU with vLLM and GPU Memory Utilization
In this video I show how to run multiple vLLM model instances in parallel on the same NVIDIA GPU by giving each server its own slice of VRAM with the --gpu-memory-utilization flag.
You'll see:
- How to launch separate vLLM servers for different models
- How to split GPU memory between them without running out of VRAM
This approach works well when you want to serve several smaller models concurrently on limited hardware; a sketch of the setup follows below.
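A minimal sketch of launching two servers from Python, assuming a single NVIDIA GPU. The model names, memory fractions, and ports are illustrative placeholders; the vllm serve command and its --gpu-memory-utilization, --port, and --max-model-len flags are the real CLI options discussed in the video.

```python
import subprocess

# Each server claims a fixed fraction of total VRAM via --gpu-memory-utilization.
# The fractions are assumptions for this sketch; together they must leave
# headroom and stay well under 1.0, or the second launch will OOM.
servers = [
    ("Qwen/Qwen2.5-1.5B-Instruct", 0.35, 8000),       # hypothetical choice
    ("microsoft/Phi-3-mini-4k-instruct", 0.45, 8001),  # hypothetical choice
]

procs = []
for model, mem_fraction, port in servers:
    procs.append(subprocess.Popen([
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(mem_fraction),
        "--port", str(port),
        "--max-model-len", "4096",  # cap context length to shrink the KV cache
    ]))

for p in procs:
    p.wait()
```

Note that each vLLM instance pre-allocates its fraction of GPU memory for weights plus KV cache at startup, so if the fractions together exceed what the card can actually provide, the later server fails to start rather than degrading gracefully.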
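Once both servers are up, each exposes an OpenAI-compatible endpoint on its own port. A quick check using the openai client, with ports and model names matching the assumed values in the launch sketch above:

```python
from openai import OpenAI

# Query each vLLM server independently; api_key is unused but required.
for port, model in [(8000, "Qwen/Qwen2.5-1.5B-Instruct"),
                    (8001, "microsoft/Phi-3-mini-4k-instruct")]:
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="EMPTY")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(port, reply.choices[0].message.content)
```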