ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Stars: 317

ExtractThinker is a library for extracting data from files and documents using Large Language Models (LLMs). It offers ORM-style interaction between files and LLMs and supports multiple document loaders, such as Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI. Users can customize extraction with contract definitions, process documents asynchronously, handle various document formats, and split documents before processing them. The project is inspired by the LangChain ecosystem and focuses on Intelligent Document Processing (IDP), using LLMs to achieve high accuracy in document extraction tasks.

README:

ExtractThinker

Library to extract data from files and documents agnostically using LLMs. extract_thinker provides ORM-style interaction between files and LLMs, allowing for flexible and powerful document extraction workflows.

Features

  • Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
  • Customizable extraction using contract definitions.
  • Asynchronous processing for efficient document handling.
  • Built-in support for various document formats.
  • ORM-style interaction between files and LLMs.
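To make the "contract" and "ORM-style" ideas concrete, here is a minimal sketch using only the Python standard library. It is not the library's implementation: `InvoiceFields` and `fill_contract` are hypothetical names, and the JSON string is hard-coded in place of a real model response. The point is only the shape of the idea: a class declares the fields you want, and the extracted values come back as a typed object.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class InvoiceFields:
    # Hypothetical stand-in for a contract: one attribute per field to extract.
    invoice_number: str
    invoice_date: str


def fill_contract(contract_cls, llm_json: str):
    """Build a contract instance from a JSON object, rejecting missing fields."""
    data = json.loads(llm_json)
    names = [f.name for f in fields(contract_cls)]
    missing = [n for n in names if n not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Ignore any extra keys that the contract does not declare.
    return contract_cls(**{n: data[n] for n in names})


# Hard-coded JSON standing in for a model response (assumption for illustration).
sample = '{"invoice_number": "INV-001", "invoice_date": "2024-01-31", "note": "x"}'
invoice = fill_contract(InvoiceFields, sample)
print(invoice.invoice_number, invoice.invoice_date)
```

Declaring the fields in one place keeps the extraction target and the result type in sync, which is the main appeal of the ORM analogy.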

Extract Thinker Features Diagram

Installation

To install extract_thinker, you can use pip:

pip install extract_thinker

Usage

Here's a quick example to get you started with extract_thinker. This example demonstrates how to load a document using Tesseract OCR and extract specific fields defined in a contract.

import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

load_dotenv()
cwd = os.getcwd()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

tesseract_path = os.getenv("TESSERACT_PATH")
test_file_path = os.path.join(cwd, "test_images", "invoice.png")

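# Configure the extractor with an OCR document loader and an LLM.
extractor = Extractor()
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)
extractor.load_llm("claude-3-haiku-20240307")

# Load the image and extract the fields declared in InvoiceContract.
result = extractor.extract(test_file_path, InvoiceContract)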

print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
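The feature list mentions asynchronous processing, while the example above handles a single file synchronously. The following is a rough sketch of one way to run several blocking extractions concurrently using only the standard library. It does not use the library's own asynchronous API, which this README does not show: `fake_extract` and `extract_many` are hypothetical names, and `fake_extract` stands in for a blocking call such as `extractor.extract(path, InvoiceContract)`.

```python
import asyncio
from typing import Callable, List


def fake_extract(path: str) -> dict:
    # Hypothetical stand-in for a blocking extraction call.
    return {"source": path, "invoice_number": "N/A"}


async def extract_many(paths: List[str], extract: Callable[[str], dict]) -> List[dict]:
    # Run each blocking extraction in a worker thread.
    # asyncio.gather returns results in the same order as the inputs.
    tasks = [asyncio.to_thread(extract, p) for p in paths]
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many(["a.png", "b.png"], fake_extract))
print([r["source"] for r in results])
```

Threads suit this case because OCR and LLM calls spend most of their time waiting on I/O or external processes. `asyncio.to_thread` requires Python 3.9 or later.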

Splitting Files Example

You can also split and process documents using extract_thinker. Here's how you can do it:

import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract, Process, Classification, ImageSplitter

load_dotenv()

class DriverLicense(Contract):
    # Define your DriverLicense contract fields here
    pass

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-3.5-turbo")

classifications = [
    Classification(name="Driver License", description="This is a driver license", contract=DriverLicense, extractor=extractor),
    Classification(name="Invoice", description="This is an invoice", contract=InvoiceContract, extractor=extractor)
]

process = Process()
process.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
process.load_splitter(ImageSplitter())

path = "..."

split_content = (
    process.load_file(path)
    .split(classifications)
    .extract()
)

# Process the split_content as needed
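What "processing" looks like depends on what `.extract()` returns, which the example above does not show. As a sketch, assuming the result is a list of extracted contract objects, one per detected document, you could group them by type before handling each kind. `DriverLicenseResult`, `InvoiceResult`, and `group_by_type` are hypothetical stand-ins so that the snippet runs without the library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DriverLicenseResult:
    # Hypothetical stand-in for an extracted DriverLicense contract.
    name: str


@dataclass
class InvoiceResult:
    # Hypothetical stand-in for an extracted InvoiceContract.
    invoice_number: str
    invoice_date: str


def group_by_type(items):
    """Group extracted objects by their class name."""
    grouped = defaultdict(list)
    for item in items:
        grouped[type(item).__name__].append(item)
    return dict(grouped)


# Sample data standing in for split_content; the real return type is an assumption.
sample_content = [
    InvoiceResult("INV-001", "2024-01-31"),
    DriverLicenseResult("Jane Doe"),
    InvoiceResult("INV-002", "2024-02-15"),
]
grouped = group_by_type(sample_content)
for kind, docs in grouped.items():
    print(kind, len(docs))
```

Grouping first lets each document type go to its own handler, for example writing invoices to one store and licenses to another.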

Infrastructure

The extract_thinker project is inspired by the LangChain ecosystem, featuring a modular infrastructure with templates, components, and core functions to facilitate robust document extraction and processing.

Why Not Just LangChain?

While LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). Although achieving 100% accuracy in IDP remains a challenge, leveraging LLMs brings us significantly closer to this goal.

Additional Examples

You can find more examples in the repository. They cover various use cases and demonstrate the flexibility of extract_thinker. Also check the author's Medium, which contains several examples of the library in use.

Contributing

We welcome contributions from the community! If you would like to contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bugfix.
  3. Write tests for your changes.
  4. Run tests to ensure everything is working correctly.
  5. Submit a pull request with a description of your changes.

Community

Júlio Almeida https://pub.towardsai.net/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef

License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.

Contact

For any questions or issues, please open an issue on the GitHub repository.
