


Stars: 124


Open-DocLLM is an open-source project that addresses data extraction and processing challenges using OCR and LLM technologies. It consists of two main layers: an OCR layer that reads document content and an LLM layer that extracts specific content in a structured form. The project offers a larger context window than JP Morgan's DocLLM and integrates tools such as Tesseract OCR and Mistral. Users can run the models on-premises using LM Studio or Ollama, and the project includes a FastAPI app for testing.




This project aims to tackle the challenges of data extraction and processing using OCR and LLMs. It is inspired by JP Morgan's DocLLM but is fully open-source and offers a larger context window. The project is divided into two parts: the OCR layer and the LLM layer.


OCR Layer

The OCR layer is responsible for reading all the content from a document. It involves the following steps:

  1. Convert pages to images: Any type of file is converted into an image so that all the content in the document can be read.

  2. Preprocess image for OCR: The image is adjusted to improve its quality and readability.

  3. Tesseract OCR: Tesseract, a widely used open-source OCR engine, reads the content from the images.
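
The three steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the input is a PDF, that the third-party packages pdf2image (which needs Poppler), Pillow, and pytesseract are installed, and that grayscale plus auto-contrast is an acceptable stand-in for the preprocessing step. The function names are hypothetical.

```python
from typing import Any, Callable, List, Optional


def load_pages(path: str, dpi: int = 300) -> List[Any]:
    """Step 1: render each PDF page to a PIL image (assumes pdf2image + Poppler)."""
    from pdf2image import convert_from_path  # third-party, imported lazily

    return convert_from_path(path, dpi=dpi)


def preprocess(image: Any) -> Any:
    """Step 2: grayscale and contrast stretch (an assumed, simple cleanup)."""
    from PIL import ImageOps  # third-party (Pillow), imported lazily

    return ImageOps.autocontrast(ImageOps.grayscale(image))


def recognize(image: Any, tesseract_cmd: Optional[str] = None) -> str:
    """Step 3: run Tesseract on one image and return the recognized text."""
    import pytesseract  # third-party wrapper around the Tesseract executable

    if tesseract_cmd:
        # Point pytesseract at a specific Tesseract executable.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    return pytesseract.image_to_string(image)


def ocr_document(
    path: str,
    load: Callable[[str], List[Any]] = load_pages,
    clean: Callable[[Any], Any] = preprocess,
    read: Callable[[Any], str] = recognize,
) -> str:
    """Chain the three steps and join the per-page text with blank lines."""
    texts = [read(clean(page)) for page in load(path)]
    return "\n\n".join(text.strip() for text in texts)
```

Passing the steps in as parameters keeps the pipeline easy to try without Tesseract installed, for example by swapping in stub functions.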

LLM Layer

The LLM layer is responsible for extracting specific content from the document in a structured way. It involves two steps: defining an extraction contract and extracting the data as JSON.
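
As a rough illustration of what an extraction contract and the JSON extraction step might look like, the sketch below uses only the standard library. The contract fields, the prompt wording, and the helper names are all hypothetical; the project's real contract format may differ.

```python
import json
from typing import Any, Dict

# Hypothetical contract: field name -> short description of the expected value.
INVOICE_CONTRACT: Dict[str, str] = {
    "invoice_number": "string, the invoice identifier",
    "total_amount": "number, the grand total",
    "issue_date": "string, the date the invoice was issued",
}


def build_prompt(contract: Dict[str, str], document_text: str) -> str:
    """Turn the contract and the OCR text into a single extraction prompt."""
    fields = json.dumps(contract, indent=2)
    return (
        "Extract the following fields from the document and answer with "
        "a single JSON object only.\n"
        f"Fields:\n{fields}\n\nDocument:\n{document_text}\n"
    )


def parse_response(raw: str, contract: Dict[str, str]) -> Dict[str, Any]:
    """Pull the JSON object out of the model output and keep contract fields only."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start : end + 1])
    # Missing fields become None; fields outside the contract are dropped.
    return {field: data.get(field) for field in contract}
```

Filtering the parsed object against the contract means extra keys invented by the model never reach downstream code, and missing keys show up explicitly as None.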

Running Locally

You can run the models on-premises using LM Studio or Ollama. This project uses LlamaIndex and Ollama.
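
For a quick check that a local Ollama server responds, the sketch below calls Ollama's HTTP API directly with the standard library instead of going through LlamaIndex. It assumes Ollama is listening on its default port 11434 and that a model named "mistral" has already been pulled; the helper names are hypothetical.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumption: unchanged defaults).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral", url: str = OLLAMA_URL):
    """Build a non-streaming generate request that asks for JSON output."""
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "mistral", timeout: float = 120.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]
```

Calling generate("...") requires a running Ollama server; build_request can be inspected on its own without one.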

Running the Code

The repo includes a FastAPI app with one endpoint for testing. Make sure to point to the proper Tesseract executable and change the key in the file.

  1. Install Tesseract.

  2. Install the required Python packages.

pip install -r requirements.txt
  3. Run the FastAPI app.
uvicorn main:app --reload
  4. Go to the Swagger page: http://localhost:8000/docs
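
Before starting the app, it can help to confirm where the Tesseract executable actually lives. The following is a small, hedged helper for that, using only the standard library; the TESSERACT_CMD environment variable is a hypothetical convention for this sketch, not something the project defines.

```python
import os
import shutil
from typing import Optional, Sequence


def find_tesseract(
    names: Sequence[str] = ("tesseract",),
    env_var: str = "TESSERACT_CMD",  # hypothetical override variable
) -> Optional[str]:
    """Return the path to a Tesseract executable, or None if none is found."""
    override = os.environ.get(env_var)
    if override and os.path.isfile(override):
        return override
    for name in names:
        found = shutil.which(name)  # searches the directories on PATH
        if found:
            return found
    return None


if __name__ == "__main__":
    print(find_tesseract() or "Tesseract not found on PATH")
```

If pytesseract is used, the returned path can be assigned to pytesseract.pytesseract.tesseract_cmd.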

Running with Docker

  1. Build the Docker image.
docker build -t your-image-name .
  2. Run the Docker container.
docker run -p 8000:8000 your-image-name
  3. Go to the Swagger page: http://localhost:8000/docs

Advanced Cases: 1 Million token context

The project also explores advanced cases such as a 1 million token context, using LLMLingua together with the Mistral YaRN model and its 128k context window.
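
One way to read this combination is that prompt compression shrinks the input until it fits the model's window. Under that assumption, and taking "128k" as 128,000 tokens, the required compression ratio can be estimated with a few lines of arithmetic. The function name and the reserved-token parameter are hypothetical.

```python
def required_compression_ratio(
    input_tokens: int, window_tokens: int, reserved_tokens: int = 0
) -> float:
    """How many times the input must shrink to fit the usable context window."""
    budget = window_tokens - reserved_tokens  # tokens left for the document
    if budget <= 0:
        raise ValueError("reserved_tokens must be smaller than window_tokens")
    # A ratio below 1 would mean the input already fits, so clamp at 1.
    return max(1.0, input_tokens / budget)


if __name__ == "__main__":
    # 1,000,000 input tokens into a 128,000-token window.
    print(required_compression_ratio(1_000_000, 128_000))
```

Reserving part of the window for the instructions and the generated answer raises the ratio slightly, which is why the sketch exposes it as a parameter.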


Combining OCR with an LLM gives this project a practical way to analyze unstructured data. Because it builds on open-source components such as Tesseract and Mistral, the whole pipeline can run on-premises.

References & Documents 

  2. YaRN: Efficient Context Window Extension of Large Language Models
