Tuesday, April 21, 2026

Gemma 4 for Structured Data Extraction: Can It Beat Qwen 3.5?

In this video, I put Gemma 4 to the test on a real-world task — extracting structured data from bank statements — and benchmark it head-to-head against Mistral's Ministral and Qwen 3.5.

I run both the MoE and dense variants of Gemma 4 to see how architecture affects extraction accuracy on financial documents, then compare the results side by side.

My takeaway: Gemma 4 performs on par with Qwen 3.5, a strong result for local structured-extraction workflows.

 

Thursday, April 2, 2026

Running Multiple Models on One GPU with vLLM and GPU Memory Utilization

In this video I show how to run multiple vLLM server instances in parallel on the same NVIDIA GPU by adjusting the --gpu-memory-utilization flag.

You'll see: 

- How to launch separate vLLM servers for different models 

- How to cap each server's share of GPU memory so the instances don't run out of VRAM

This approach works when you want to serve several smaller models concurrently on limited hardware.
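The setup above can be sketched as two launch commands. This is a minimal illustration, not the exact commands from the video: the model names and ports are placeholders, and the memory fractions assume two roughly equal-sized models. `--gpu-memory-utilization` tells each vLLM instance what fraction of the GPU's total memory it may claim for weights and KV cache, so the fractions across all instances must sum to less than 1.0.

```shell
# Server 1: first model, capped at 45% of the GPU's memory.
# (Model name and port are illustrative placeholders.)
vllm serve your-org/model-a \
  --port 8000 \
  --gpu-memory-utilization 0.45 &

# Server 2: second model, another 45%, leaving ~10% headroom
# for CUDA context and other processes on the GPU.
vllm serve your-org/model-b \
  --port 8001 \
  --gpu-memory-utilization 0.45 &
```

Each server then exposes its own OpenAI-compatible endpoint on its port. If a launch fails with an out-of-memory error, lower the fractions; vLLM pre-allocates its KV-cache pool up to the cap at startup, so failures show up immediately rather than under load.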