Thursday, April 2, 2026
Running Multiple Models on One GPU with vLLM and GPU Memory Utilization
In this video I show how to run multiple vLLM model instances in parallel on the same NVIDIA GPU by giving each server its own slice of VRAM with the --gpu-memory-utilization flag.
You'll see:
- How to launch separate vLLM servers for different models
- How to split GPU memory between them without running out of VRAM
This approach works well when you want to serve several smaller models concurrently on limited hardware; a sketch of the setup follows below.
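A minimal sketch of launching two servers from Python, assuming a single NVIDIA GPU. The model names, memory fractions, and ports are illustrative placeholders; the vllm serve command and its --gpu-memory-utilization, --port, and --max-model-len flags are the real CLI options discussed in the video.

```python
import subprocess

# Each server claims a fixed fraction of total VRAM via --gpu-memory-utilization.
# The fractions are assumptions for this sketch; together they must leave
# headroom and stay well under 1.0, or the second launch will OOM.
servers = [
    ("Qwen/Qwen2.5-1.5B-Instruct", 0.35, 8000),       # hypothetical choice
    ("microsoft/Phi-3-mini-4k-instruct", 0.45, 8001),  # hypothetical choice
]

procs = []
for model, mem_fraction, port in servers:
    procs.append(subprocess.Popen([
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(mem_fraction),
        "--port", str(port),
        "--max-model-len", "4096",  # cap context length to shrink the KV cache
    ]))

for p in procs:
    p.wait()
```

Note that each vLLM instance pre-allocates its fraction of GPU memory for weights plus KV cache at startup, so if the fractions together exceed what the card can actually provide, the later server fails to start rather than degrading gracefully.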
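Once both servers are up, each exposes an OpenAI-compatible endpoint on its own port. A quick check using the openai client, with ports and model names matching the assumed values in the launch sketch above:

```python
from openai import OpenAI

# Query each vLLM server independently; api_key is unused but required.
for port, model in [(8000, "Qwen/Qwen2.5-1.5B-Instruct"),
                    (8001, "microsoft/Phi-3-mini-4k-instruct")]:
    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="EMPTY")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(port, reply.choices[0].message.content)
```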