Monday, April 27, 2026

MoE vs Dense Models for Structured Data Extraction — Who Wins?

MoE or Dense — which model architecture wins for structured data extraction from documents? It depends on document complexity. In this video, I test MoE vs Dense models on real extraction tasks and show that MoE struggles with larger tables, while Dense models handle them reliably. In my benchmark, Qwen 3.6 Dense comes out on top.


Tuesday, April 21, 2026

Gemma 4 for Structured Data Extraction: Can It Beat Qwen 3.5?

In this video, I put Gemma 4 to the test on a real-world task — extracting structured data from bank statements — and benchmark it head-to-head against Mistral's Ministral and Qwen 3.5.

I run both the MoE and Dense variants of Gemma 4 to see how architecture affects accuracy on financial documents, then compare the results side-by-side.

My takeaway: Gemma 4 holds its own and performs on par with Qwen 3.5 — a strong result for local structured extraction workflows.


Thursday, April 2, 2026

Running Multiple Models on One GPU with vLLM and GPU Memory Utilization

In this video, I show how to run multiple vLLM model instances in parallel on the same NVIDIA GPU by adjusting the --gpu-memory-utilization flag.

You'll see:

- How to launch separate vLLM servers for different models
- How to split GPU memory between them without running out of VRAM

This approach works when you want to serve several smaller models concurrently on limited hardware.
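As a rough sketch of the setup, here are two vLLM servers sharing one GPU. The model names, ports, and memory fractions below are placeholders, not the ones used in the video; the key idea is that each server's --gpu-memory-utilization value caps its share of total VRAM:

```shell
# First model: claim ~45% of GPU memory, serve on port 8000
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --gpu-memory-utilization 0.45 \
  --port 8000 &

# Second model: another ~45%, on a different port
vllm serve google/gemma-2-2b-it \
  --gpu-memory-utilization 0.45 \
  --port 8001 &

# Keep the fractions summing to well under 1.0 — the remaining headroom
# covers CUDA context and activation memory. If the second server can't
# fit its KV cache in its slice, it will fail at startup rather than
# silently degrade.
```

Each server then exposes its own OpenAI-compatible endpoint (here on ports 8000 and 8001), so clients pick a model by pointing at the matching port.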