Monday, May 20, 2024

Sparrow Parse - Data Processing for LLM

Data processing matters a lot in LLM RAG: it improves data extraction results, especially for documents with complex layouts and large tables. This is why I am building the open source Sparrow Parse library, which helps balance LLM-based and standard Python data extraction methods.


Monday, May 13, 2024

Invoice Data Preprocessing for LLM

Data preprocessing is an important step in an LLM pipeline. I show various approaches to preprocessing invoice data before feeding it to an LLM. This step is quite challenging, especially for tables.
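
As one illustration of table preprocessing, here is a minimal sketch that flattens an extracted table into pipe-delimited text before it goes into a prompt. The function name `table_to_text` and the sample rows are hypothetical, and this is only one possible approach, not necessarily one of those covered in the post.

```python
def table_to_text(header, rows):
    """Render a table as pipe-delimited lines, one row per line."""

    def clean(cell):
        # Collapse stray whitespace and newlines that PDF extraction
        # often leaves inside cells.
        return " ".join(str(cell).split())

    lines = [" | ".join(clean(h) for h in header)]
    for row in rows:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [""] * (len(header) - len(row))
        lines.append(" | ".join(clean(c) for c in padded))
    return "\n".join(lines)


# Hypothetical sample data: one cell with a line break, one short row.
header = ["Description", "Qty", "Unit price"]
rows = [["Web  hosting\n(annual)", "1", "120.00"], ["Domain renewal", "2"]]
text = table_to_text(header, rows)
print(text)
```

Keeping one row per line with a fixed column count gives the LLM a stable structure to read, which tends to matter most for wide invoice tables.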


Monday, May 6, 2024

You Don't Need RAG to Extract Invoice Data

Documents like invoices or receipts can be processed by an LLM directly, without RAG. I explain how you can do this locally with Ollama and Instructor. Thanks to Instructor, structured output from the LLM can be validated with your own Pydantic class.
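
To show the idea of validating structured output, here is a rough sketch that uses a standard-library dataclass as a stand-in for a Pydantic class. It does not call Ollama or Instructor. The `Invoice` fields and the `raw` string (playing the role of an LLM response) are made up for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Invoice:
    # Hypothetical fields chosen for illustration.
    invoice_number: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Parse a JSON string and check it against the Invoice fields."""
    data = json.loads(raw)
    values = {}
    for f in fields(Invoice):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        # Coerce to the declared type; float("abc") raises ValueError.
        values[f.name] = f.type(data[f.name])
    return Invoice(**values)


# Stand-in for the text an LLM might return.
raw = '{"invoice_number": "INV-001", "total": "149.50"}'
invoice = parse_invoice(raw)
print(invoice)
```

With Pydantic and Instructor the coercion and error reporting are handled by the library. The sketch only shows the principle: the response either fits the declared class or is rejected.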


Monday, April 29, 2024

LLM JSON Output with Instructor RAG and WizardLM-2

With the Instructor library you can implement simple RAG without a vector DB or dependencies on other LLM libraries. The key RAG components are good data pre-processing and cleaning, a powerful local LLM (such as WizardLM-2, Nous Hermes 2 PRO or Llama3) and an Ollama or MLX backend.
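
A minimal sketch of retrieval without a vector DB, assuming simple keyword overlap is good enough for short documents like invoices. The function names, sample chunks and prompt wording are hypothetical, and the call to the LLM itself is left out.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=2):
    """Rank chunks by how many question words they share."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks):
    """Put the best-matching chunks in front of the question."""
    context = "\n".join(retrieve(question, chunks))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer in JSON."


# Hypothetical pre-processed chunks from one invoice.
chunks = [
    "Invoice number: INV-001",
    "Payment terms: net 30 days",
    "Total amount due: 149.50 EUR",
]
prompt = build_prompt("What is the total amount due?", chunks)
print(prompt)
```

The resulting prompt would then be sent to the local LLM. Overlap scoring is crude, so it relies on the data being cleaned well beforehand.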

Monday, April 22, 2024

Local RAG Explained with Unstructured and LangChain

In this tutorial, I do a code walkthrough and demonstrate how to implement the RAG pipeline using Unstructured, LangChain, and Pydantic for processing invoice data and extracting structured JSON data.


Monday, April 15, 2024

Local LLM RAG with Unstructured and LangChain [Structured JSON]

Using the Unstructured library to pre-process PDF document content into a cleaner format helps the LLM produce a more accurate response. The JSON response is generated by the Nous Hermes 2 PRO LLM, without any additional post-processing. A dynamic Pydantic class validates the response to make sure it matches the request.
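
The dynamic-class idea can be sketched with the standard library alone. Here `dataclasses.make_dataclass` stands in for a dynamically created Pydantic model. The `spec` format, the `TYPES` mapping and the sample response are assumptions made for this example.

```python
import json
from dataclasses import make_dataclass

# Type names allowed in a request spec (an assumption for this sketch).
TYPES = {"str": str, "int": int, "float": float}


def build_model(spec):
    """Create a dataclass from a {field_name: type_name} request spec."""
    return make_dataclass(
        "DynamicModel", [(name, TYPES[t]) for name, t in spec.items()]
    )


def validate(raw, spec):
    """Check the JSON response has exactly the requested fields and types."""
    data = json.loads(raw)
    if set(data) != set(spec):
        raise ValueError(f"expected {sorted(spec)}, got {sorted(data)}")
    for name, type_name in spec.items():
        if not isinstance(data[name], TYPES[type_name]):
            raise ValueError(f"{name} should be {type_name}")
    return build_model(spec)(**data)


# Hypothetical request spec and LLM response.
spec = {"invoice_number": "str", "total": "float"}
raw = '{"invoice_number": "INV-001", "total": 149.5}'
result = validate(raw, spec)
print(result)
```

Because the class is built from the request, the same validation code works for any set of fields the user asks for.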


Sunday, March 31, 2024

LlamaIndex Upgrade to 0.10.x Experience

I explain key points you should keep in mind when upgrading to LlamaIndex 0.10.x.