Monday, May 18, 2026

Instruction-Based Data Analysis with Sparrow and Local LLM

In this video, I show how to use the Sparrow instruction processing pipeline to analyze a bond portfolio JSON extracted from a financial document. Everything runs locally, with no external APIs.

I run three analysis cases using Gemma 4 31B on an Apple Silicon Mac mini (M4 Pro):

  • Risk classification — categorize each position into low, medium, or high risk based on loss percentage
  • Concentration risk — flag overweight positions above 20% portfolio weighting
  • Portfolio aggregation — total valuation, weighted average P&L, best and worst performer

All three cases use the same sparrow-instructor pipeline, demonstrating how different instruction types — classification, rule-based flagging, and aggregation — are handled by a single local LLM.
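For reference, the three instructions correspond to simple deterministic rules, which is handy for checking the LLM's answers. The sketch below is plain Python, not Sparrow code. The `Position` fields, the sample values, and the 5% and 10% loss thresholds are assumptions made for illustration; only the 20% weight limit comes from the cases above.

```python
from dataclasses import dataclass


@dataclass
class Position:
    name: str
    valuation: float  # current market value
    pnl_pct: float    # profit/loss in percent; negative means a loss


def classify_risk(pnl_pct, medium_loss=5.0, high_loss=10.0):
    """Map a loss percentage to low/medium/high (thresholds are assumed)."""
    loss = max(0.0, -pnl_pct)
    if loss >= high_loss:
        return "high"
    if loss >= medium_loss:
        return "medium"
    return "low"


def analyze(positions, max_weight=0.20):
    """Run all three analysis cases over a list of positions."""
    total = sum(p.valuation for p in positions)
    weights = {p.name: p.valuation / total for p in positions}
    return {
        # Case 1: risk classification per position
        "risk": {p.name: classify_risk(p.pnl_pct) for p in positions},
        # Case 2: positions whose portfolio weight exceeds the limit
        "overweight": [name for name, w in weights.items() if w > max_weight],
        # Case 3: portfolio-level aggregation
        "total_valuation": total,
        "weighted_avg_pnl_pct": sum(weights[p.name] * p.pnl_pct for p in positions),
        "best": max(positions, key=lambda p: p.pnl_pct).name,
        "worst": min(positions, key=lambda p: p.pnl_pct).name,
    }


# Made-up sample data, not taken from the document used in the video.
portfolio = [
    Position("Bond A", 500_000.0, 2.0),
    Position("Bond B", 300_000.0, -6.0),
    Position("Bond C", 200_000.0, -12.0),
]
report = analyze(portfolio)
print(report)
```

Comparing a report like this with the LLM output for the same JSON is one way to spot arithmetic slips in the aggregation case.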

Monday, May 11, 2026

Smart Document Extraction with Business Rules — Gemma vs Qwen vs Ministral

In this video I show how Sparrow hints work — a powerful feature that goes beyond simple field extraction. Using a bank bonds portfolio document, I demonstrate how to define business rules directly in the hints file: formatting rules for European number standards, short name normalization, and risk classification logic derived from extracted fields. I test the same hints across three local vision models — Gemma 4 31B Dense, Qwen 3.6 27B Dense, and Ministral 3 14B. All processing runs locally with no cloud dependencies.
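To make the European number formatting rule concrete, here is a minimal sketch of that normalization in plain Python. It is not the Sparrow hints syntax, and the function names are hypothetical. It assumes the European convention of "." as the thousands separator and "," as the decimal separator.

```python
from decimal import Decimal


def parse_european_number(text: str) -> Decimal:
    """Convert a string such as '1.234.567,89' to a Decimal."""
    # Drop regular and non-breaking spaces that sometimes pad amounts.
    cleaned = text.strip().replace("\u00a0", "").replace(" ", "")
    # Remove thousands separators, then turn the decimal comma into a point.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def format_european_number(value: Decimal) -> str:
    """Format a Decimal with two decimals in European style."""
    us_style = format(value, ",.2f")  # e.g. '1,234,567.89'
    # Swap the two separator characters in a single pass.
    return us_style.translate(str.maketrans(",.", ".,"))


amount = parse_european_number("1.234.567,89")
print(amount)
print(format_european_number(amount))
```

`Decimal` is used instead of `float` so that monetary amounts keep their exact decimal digits.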

 

Monday, May 4, 2026

Large Table Extraction to JSON with dots.ocr — No Vision LLM Hallucinations

Sparrow now supports a dedicated table mode for extracting large, complex tables into structured JSON — without Vision LLM hallucinations. 

Vision LLMs struggle with dense tabular data: they hallucinate values, misalign rows, and lose precision at scale. Sparrow's table mode solves this by using dots.ocr to capture the full table structure as HTML, then applying a generic Sparrow template to convert that HTML into clean, structured JSON. 
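The HTML-to-JSON step can be illustrated with a small, deterministic converter. The sketch below uses only Python's standard `html.parser` module. It is not Sparrow's template mechanism, and the sample table is made up. It assumes a simple table whose first row holds the column headers, with no merged cells.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_records(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample table for illustration.
sample_html = """
<table>
  <tr><th>Instrument</th><th>Quantity</th></tr>
  <tr><td>Bond A</td><td>100</td></tr>
  <tr><td>Bond B</td><td>250</td></tr>
</table>
"""
records = html_table_to_records(sample_html)
print(json.dumps(records, indent=2))
```

Because the conversion is rule-based, every value in the JSON is copied from the HTML rather than generated, so this step cannot introduce hallucinated values.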

 

Monday, April 27, 2026

MoE vs Dense Models for Structured Data Extraction — Who Wins?

MoE or Dense — which model architecture wins for structured data extraction from documents? It depends on document complexity. In this video, I test MoE vs Dense models on real extraction tasks and show that MoE struggles with larger tables, while Dense models handle them reliably. In my benchmark, Qwen 3.6 Dense comes out on top.

 

Tuesday, April 21, 2026

Gemma 4 for Structured Data Extraction: Can It Beat Qwen 3.5?

In this video, I put Gemma 4 to the test on a real-world task — extracting structured data from bank statements — and benchmark it head-to-head against Mistral's Ministral and Qwen 3.5.

I run both the MoE and Dense variants of Gemma 4 to see how architecture affects accuracy on financial documents, then compare the results side-by-side.

My takeaway: Gemma 4 holds its own and performs on par with Qwen 3.5 — a strong result for local structured extraction workflows.

 

Thursday, April 2, 2026

Running Multiple Models on One GPU with vLLM and GPU Memory Utilization

In this video, I show how to run multiple vLLM model instances in parallel on the same NVIDIA GPU by adjusting the --gpu-memory-utilization flag for each instance.

You'll see:

- How to launch separate vLLM servers for different models
- How to split GPU memory between them without running out of VRAM

This approach works when you want to serve several smaller models concurrently on limited hardware.
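As a rough sketch of the setup, the helper below builds one `vllm serve` command per model and gives each instance an equal share of GPU memory. The model names, port numbers, and the 10% headroom are assumptions for illustration. The values used in the video may differ.

```python
import shlex


def build_vllm_commands(models, first_port=8000, headroom=0.10):
    """Build one `vllm serve` command per model with an equal memory share.

    `headroom` is the fraction of GPU memory left unassigned (an assumed
    safety margin, not a vLLM requirement).
    """
    share = round((1.0 - headroom) / len(models), 2)
    commands = []
    for offset, model in enumerate(models):
        commands.append([
            "vllm", "serve", model,
            "--port", str(first_port + offset),
            "--gpu-memory-utilization", str(share),
        ])
    return commands


# Hypothetical model identifiers; replace with the models you want to serve.
commands = build_vllm_commands(["org/model-a", "org/model-b"])
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed command can be run in its own terminal or started with `subprocess.Popen(cmd)`. Starting the servers one after another, and waiting for each to finish loading, is a cautious choice because each instance measures GPU memory while it starts.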

 

Tuesday, March 24, 2026

How to Cache vLLM Model in FastAPI for Faster Inference

I show how to keep a vLLM model loaded in a cache inside a FastAPI app, so inference is much faster because the model is not reloaded on every request.
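The core idea can be sketched without FastAPI or vLLM installed: load the model once per process and hand the same object to every request. In the sketch below, `PlaceholderModel` stands in for an expensive object such as `vllm.LLM`, and `handle_request` stands in for a FastAPI route function. Both names are hypothetical, and the video's actual caching code may be organized differently.

```python
from functools import lru_cache


class PlaceholderModel:
    """Stand-in for a model that is slow to load, such as vllm.LLM."""

    load_count = 0  # counts how many times a model was constructed

    def __init__(self, name):
        PlaceholderModel.load_count += 1
        self.name = name

    def generate(self, prompt):
        return f"[{self.name}] {prompt}"


@lru_cache(maxsize=1)
def get_model(name):
    """Load the model on first use and reuse it on later calls."""
    return PlaceholderModel(name)


def handle_request(prompt, model_name="hypothetical/model"):
    """Plays the role of a route handler: it never loads the model itself."""
    return get_model(model_name).generate(prompt)


first = handle_request("hello")
second = handle_request("world")
print(first, second, PlaceholderModel.load_count)
```

With `maxsize=1`, asking for a different model name evicts the previous one, which keeps at most one model in memory. In a real FastAPI app, the same effect is often achieved by loading the model in a startup or lifespan handler and storing it on the application state.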