Wednesday, June 24, 2026

Mistral OCR + Sparrow: Document to JSON

Mistral OCR is now integrated into Sparrow, an open-source document extraction platform, as a new cloud inference backend. This gives Sparrow a full cloud option alongside its existing local backends (MLX, vLLM), so users without GPU infrastructure can still run enterprise-grade document extraction.

Pipeline: Mistral OCR converts the document to structured HTML, then Mistral Small extracts and transforms the data into JSON based on a defined schema with field-level hints.
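
To make the two-stage shape concrete, here is a minimal sketch in Python. It is not Sparrow's actual implementation: the function names, the prompt wording, and the schema and hint dictionaries are all hypothetical, and the OCR and extraction stages are passed in as plain callables so the example runs without any cloud calls.

```python
import json
from typing import Callable


def build_extraction_prompt(html: str, schema: dict, hints: dict) -> str:
    """Combine OCR output, target schema, and per-field hints into one prompt.

    The prompt wording is an assumption for illustration only.
    """
    hint_lines = "\n".join(f"- {field}: {hint}" for field, hint in hints.items())
    return (
        "Extract data from the document below as JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n"
        f"Field hints:\n{hint_lines}\n"
        f"Document:\n{html}"
    )


def document_to_json(
    document: bytes,
    schema: dict,
    hints: dict,
    ocr: Callable[[bytes], str],      # stage 1: document -> structured HTML
    extract: Callable[[str], str],    # stage 2: prompt -> JSON text
) -> dict:
    html = ocr(document)
    prompt = build_extraction_prompt(html, schema, hints)
    return json.loads(extract(prompt))


# Stand-ins for the two model calls (hypothetical, no network access).
def fake_ocr(document: bytes) -> str:
    return "<table><tr><td>ACME Bond Fund</td></tr></table>"


def fake_extract(prompt: str) -> str:
    return '{"instrument_name": "ACME"}'


schema = {"instrument_name": "str"}
hints = {"instrument_name": "issuer brand only"}
result = document_to_json(b"raw-pdf-bytes", schema, hints, fake_ocr, fake_extract)
print(result)
```

Keeping the two stages behind separate callables mirrors the idea that the OCR step and the extraction step are independent, so either one could in principle be swapped out.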

The video shows the extraction of a bonds portfolio table with hint-driven rules:

  • Instrument name normalization (extracting issuer brand from full fund names)
  • European number formatting (period as thousands separator, comma as decimal)
  • Percentage formatting with sign preservation
  • Derived risk classification computed from profit/loss percentage

The Sparrow API, schema, and hint format are the same as for the local backends: just switch the backend flag to run on Mistral Cloud instead of MLX or vLLM.
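
In Sparrow these rules are expressed as field-level hints and applied by the model, but the intended transformations for the hint-driven rules listed above can be sketched as plain Python to show what the output should look like. Everything here is an illustrative assumption: the function names, the choice to always print an explicit sign, and especially the risk thresholds, which are not taken from the video.

```python
from decimal import Decimal


def parse_european_number(text: str) -> Decimal:
    """Parse '1.234,56' style numbers (period = thousands, comma = decimal)."""
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def format_percentage(value: Decimal) -> str:
    """Format with two decimals and an explicit sign, e.g. '+2.50%'."""
    return f"{value:+.2f}%"


def classify_risk(pnl_percent: Decimal) -> str:
    """Derive a risk label from profit/loss percentage.

    Thresholds are hypothetical, chosen only for this example.
    """
    if pnl_percent <= Decimal("-5"):
        return "high"
    if pnl_percent < 0:
        return "medium"
    return "low"


market_value = parse_european_number("1.234,56")
pnl = parse_european_number("-7,20")
print(market_value, format_percentage(pnl), classify_risk(pnl))
```

Using `Decimal` rather than `float` avoids rounding surprises when the parsed amounts are later compared or re-serialized.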

Sparrow is open source and local-first by design — documents never leave your infrastructure unless you choose the cloud backend.

⭐ GitHub: github.com/katanaml/sparrow
🌐 Live demo: sparrow.katanaml.io 

 
