Tuesday, January 31, 2023

Preparing Dataset for Donut Fine-Tuning (part 1, Document AI)

I explain the dataset I will be using to fine-tune the Donut model. I show how PDFs are converted to image files for further processing and OCR data extraction. In the next step, the JSON data is converted to a format understood by the Sparrow annotation processing/review tool.
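
As a rough illustration of the PDF-to-image step, here is a minimal sketch. It assumes the third-party `pdf2image` package (which needs Poppler installed); the post does not say which library is used. The function names, the output naming scheme, and the default DPI are hypothetical.

```python
from pathlib import Path
from typing import List


def page_image_path(pdf_path: str, page_number: int, out_dir: str) -> Path:
    """Build the output file name for one page (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number}.jpg"


def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 200) -> List[Path]:
    """Render every page of a PDF to a JPEG file and return the saved paths."""
    # Imported lazily: pdf2image is an assumed dependency, not named in the post.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    # convert_from_path returns one PIL image per page.
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "JPEG")
        saved.append(target)
    return saved
```

Calling `pdf_to_images("invoice_1.pdf", "images")` would write one JPEG per page into `images/`, ready for OCR or annotation.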

 

Monday, January 23, 2023

How To Fine-tune Donut Model

Donut is an awesome Document AI model for extracting data from documents. I share my experiences in fine-tuning the model with the CORD dataset, based on an example from Transformers Tutorials.

 

Monday, January 16, 2023

Donut 🍩 - ChatGPT for Document AI

Donut is an OCR-free Document Understanding Transformer. This ML model can process documents (images, scans) and return structured JSON describing the content. It works for different use cases: form understanding, visual question answering about the document, and document image classification.

 

Thursday, January 5, 2023

Best Platform for Python Apps Deployment - Hugging Face Spaces with Docker

I walk through the Hugging Face Spaces Docker SDK deployment option. I used it to deploy Sparrow, our Streamlit/Python app. So far I am very happy with the Spaces Docker SDK: simple setup, stable operation with good runtime performance, and both HTTPS and content compression out of the box.

 

Monday, December 19, 2022

File Upload/Download in Streamlit/Python

File upload/download is supported by Streamlit out of the box. I share a few hints for a more effective file upload implementation. You will learn how to wrap the file upload widget in a Streamlit form, use the Submit button to confirm the upload, and reinitialize the upload widget. Additionally, I show an example of how to download a JSON file from the server with the Streamlit download component.
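
The form-wrapped upload and the JSON download can be sketched roughly as follows. This is a minimal example under stated assumptions, not necessarily the code from the video: the function names, form key, widget labels, accepted file types, and sample payload are all hypothetical. It relies on `st.form(..., clear_on_submit=True)` to reset the uploader after Submit.

```python
import json


def build_json_download(record: dict) -> bytes:
    """Serialize a record to UTF-8 JSON bytes suitable for a download button."""
    return json.dumps(record, indent=2).encode("utf-8")


def render_upload_page() -> None:
    """Render an upload form and a JSON download button (call from a Streamlit script)."""
    # Imported lazily so the helper above works without Streamlit installed.
    import streamlit as st

    # clear_on_submit=True resets widgets inside the form after Submit,
    # which reinitializes the file uploader.
    with st.form("upload_form", clear_on_submit=True):
        uploaded = st.file_uploader("Choose a file", type=["png", "jpg", "pdf"])
        submitted = st.form_submit_button("Submit")

    # The upload is only acted on once the user confirms with Submit.
    if submitted and uploaded is not None:
        st.write(f"Received {uploaded.name} ({uploaded.size} bytes)")

    # Hypothetical payload; a real app would serve its own JSON data here.
    st.download_button(
        label="Download JSON",
        data=build_json_download({"status": "ok"}),
        file_name="result.json",
        mime="application/json",
    )
```

To try it, call `render_upload_page()` at the bottom of a script and start it with `streamlit run`.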

 

Monday, December 12, 2022

Dependent UI Widgets in Streamlit/Python

This video explains how to refresh dependent UI widgets in Streamlit/Python when a value changes. I use the Streamlit Empty widget as a placeholder to update a selectbox with a new entry after each new file upload. The selectbox displays the list of uploaded files.
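
One way to sketch the placeholder pattern is shown below. It is a minimal example and not necessarily the approach from the video: keeping the file list in `st.session_state`, the `"files"` key, the helper name, and the widget labels are all assumptions made for illustration.

```python
from typing import List


def add_unique(files: List[str], name: str) -> List[str]:
    """Return the file list with name appended, unless it is already present."""
    return files if name in files else files + [name]


def render_dependent_widgets() -> None:
    """Render a selectbox that reflects the latest upload (call from a Streamlit script)."""
    # Imported lazily so the helper above works without Streamlit installed.
    import streamlit as st

    if "files" not in st.session_state:
        st.session_state["files"] = []

    # Reserve the selectbox position before the uploader is processed.
    placeholder = st.empty()

    uploaded = st.file_uploader("Upload a file")
    if uploaded is not None:
        st.session_state["files"] = add_unique(
            st.session_state["files"], uploaded.name
        )

    # Fill the placeholder last, so the selectbox already includes the new file.
    placeholder.selectbox("Uploaded files", st.session_state["files"])
```

Because the placeholder is created first but filled last, the selectbox appears above the uploader while still showing the file that was just uploaded.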

 

Sunday, December 4, 2022

Invoice Annotation with Sparrow/Python

I explain our Streamlit component for invoice/receipt document annotation and labeling. It can be used either to create new annotations or to review and edit existing ones. With this component you can add new annotations directly on top of the document image. Existing annotations can be resized or moved, and values and labels can be assigned to them.

This component is part of Sparrow - our open-source solution for data extraction from invoices/receipts with ML.