data:image/s3,"s3://crabby-images/74c83/74c83df2ebf176f02fdd6a78b77f5efae33d2d47" alt="swift-ocr-llm-powered-pdf-to-markdown"
swift-ocr-llm-powered-pdf-to-markdown
An open-source OCR API that leverages OpenAI's powerful language models with optimized performance techniques like parallel processing and batching to deliver high-quality text extraction from complex PDF documents. Ideal for businesses seeking efficient document digitization and data extraction solutions.
Stars: 219
data:image/s3,"s3://crabby-images/a7134/a71346980a3147f11e6e57ab6ae1d7d49bbf9132" alt="screenshot"
Swift OCR is a powerful tool for extracting text from PDF files using OpenAI's GPT-4 Turbo with Vision model. It offers flexible input options, advanced OCR processing, performance optimizations, structured output, robust error handling, and scalable architecture. The tool ensures accurate text extraction, resilience against failures, and efficient handling of multiple requests.
README:
- Flexible Input Options: Accepts PDF files via direct upload or by specifying a URL.
- Advanced OCR Processing: Utilizes OpenAI's GPT-4 Turbo with Vision model for accurate text extraction.
-
Performance Optimizations:
- Parallel PDF Conversion: Converts PDF pages to images concurrently using multiprocessing.
- Batch Processing: Processes multiple images in batches to maximize throughput.
- Retry Mechanism with Exponential Backoff: Ensures resilience against transient failures and API rate limits.
- Structured Output: Extracted text is formatted using Markdown for readability and consistency.
- Robust Error Handling: Comprehensive logging and exception handling for reliable operations.
- Scalable Architecture: Asynchronous processing enables handling multiple requests efficiently.
https://github.com/user-attachments/assets/6b39f3ea-248e-4c29-ac2e-b57de64d5d65
Demo video showcasing the conversion of NASA's Apollo 17 flight documents, which include unorganized, horizontally and vertically oriented pages, into well-structured Markdown format without any issues.
Here's a single, comprehensive section on cost comparison for your README:
Our solution offers an optimal balance of affordability, accuracy, and advanced features:
- Average token usage per image: ~1200
- Total tokens per page (including prompt): ~1500
- [GPT4O] Input token cost: $5 per million tokens
- [GPT4O] Output token cost: $15 per million tokens
For 1000 documents:
- Estimated total cost: $15
- Utilizing GPT4 mini: Reduces cost to ~$8 per 1000 documents
- Implementing batch API: Further reduces cost to ~$4 per 1000 documents
This solution is significantly more affordable than alternatives:
- Our cost: $15 per 1000 documents
- CloudConvert: ~$30 per 1000 documents (PDFTron mode, 4 credits required)
While cost-effectiveness is a major advantage, our solution also provides:
- Superior accuracy and consistency
- Precise table generation
- Output in easily editable markdown format
This combination of affordability and advanced features makes solution stand out in the document processing market. It's not just about being cheaper; it's about providing excellent value through reliability, flexibility, and high-quality output.
- Python 3.8+
- Git
- Virtualenv (optional but recommended)
-
Clone the Repository
git clone https://github.com/yourusername/llm-openai-ocr.git cd llm-openai-ocr
-
Create a Virtual Environment
python3 -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install Dependencies
pip install -r requirements.txt
-
Configure Environment Variables
Create a
.env
file in the root directory and add the following variables:OPENAI_API_KEY=your_openai_api_key AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint OPENAI_DEPLOYMENT_ID=your_openai_deployment_id OPENAI_API_VERSION=your_openai_api_version # Default is "gpt-4o" BATCH_SIZE=10 # Optional: Default is 1 MAX_CONCURRENT_OCR_REQUESTS=5 # Optional: Default is 5 MAX_CONCURRENT_PDF_CONVERSION=4 # Optional: Default is 4
Note: Replace
your_openai_api_key
,your_azure_openai_endpoint
, andyour_openai_deployment_id
with your actual OpenAI credentials. -
Run the Application
uvicorn main:app --reload
The API will be available at
http://127.0.0.1:8000
.
POST /ocr
- file: (Optional) Upload a PDF file.
- ocr_request.url: (Optional) URL of the PDF to process.
You must provide either a file or a URL, not both.
Uploading a PDF File:
curl -X POST "http://127.0.0.1:8000/ocr" -F "file=@/path/to/your/document.pdf"
Providing a PDF URL:
curl -X POST "http://127.0.0.1:8000/ocr" -F "ocr_request={\"url\": \"https://example.com/document.pdf\"}" -H "Content-Type: application/json"
-
200 OK
{ "text": "Extracted and formatted text from the PDF." }
-
Error Responses
-
400 Bad Request
: Invalid input parameters. -
422 Unprocessable Entity
: Validation errors. -
500 Internal Server Error
: Processing errors.
-
All configurations are managed via environment variables. Ensure you have a .env
file set up with the necessary variables as described in the Installation section.
- OPENAI_API_KEY: Your OpenAI API key.
- AZURE_OPENAI_ENDPOINT: The endpoint for Azure OpenAI service.
- OPENAI_DEPLOYMENT_ID: Deployment ID for the OpenAI model.
- OPENAI_API_VERSION: API version for OpenAI (default: "gpt-4o").
- BATCH_SIZE: Number of images to process per OCR request (default: 1).
- MAX_CONCURRENT_OCR_REQUESTS: Maximum number of concurrent OCR requests (default: 5).
- MAX_CONCURRENT_PDF_CONVERSION: Maximum number of concurrent PDF page conversions (default: 4).
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for swift-ocr-llm-powered-pdf-to-markdown
Similar Open Source Tools
data:image/s3,"s3://crabby-images/a7134/a71346980a3147f11e6e57ab6ae1d7d49bbf9132" alt="swift-ocr-llm-powered-pdf-to-markdown Screenshot"
swift-ocr-llm-powered-pdf-to-markdown
Swift OCR is a powerful tool for extracting text from PDF files using OpenAI's GPT-4 Turbo with Vision model. It offers flexible input options, advanced OCR processing, performance optimizations, structured output, robust error handling, and scalable architecture. The tool ensures accurate text extraction, resilience against failures, and efficient handling of multiple requests.
data:image/s3,"s3://crabby-images/d5ef8/d5ef8271a079938826a00337bff70a6101bf6d8a" alt="Hacx-GPT Screenshot"
Hacx-GPT
Hacx GPT is a cutting-edge AI tool developed by BlackTechX, inspired by WormGPT, designed to push the boundaries of natural language processing. It is an advanced broken AI model that facilitates seamless and powerful interactions, allowing users to ask questions and perform various tasks. The tool has been rigorously tested on platforms like Kali Linux, Termux, and Ubuntu, offering powerful AI conversations and the ability to do anything the user wants. Users can easily install and run Hacx GPT on their preferred platform to explore its vast capabilities.
data:image/s3,"s3://crabby-images/18103/181033341e756ffe03671ffdddf1100e605e442b" alt="nodetool Screenshot"
nodetool
NodeTool is a platform designed for AI enthusiasts, developers, and creators, providing a visual interface to access a variety of AI tools and models. It simplifies access to advanced AI technologies, offering resources for content creation, data analysis, automation, and more. With features like a visual editor, seamless integration with leading AI platforms, model manager, and API integration, NodeTool caters to both newcomers and experienced users in the AI field.
data:image/s3,"s3://crabby-images/cf01d/cf01d76bb694232ba192ae7c6082026545df5a7d" alt="Visionatrix Screenshot"
Visionatrix
Visionatrix is a project aimed at providing easy use of ComfyUI workflows. It offers simplified setup and update processes, a minimalistic UI for daily workflow use, stable workflows with versioning and update support, scalability for multiple instances and task workers, multiple user support with integration of different user backends, LLM power for integration with Ollama/Gemini, and seamless integration as a service with backend endpoints and webhook support. The project is approaching version 1.0 release and welcomes new ideas for further implementation.
data:image/s3,"s3://crabby-images/b82d3/b82d3520bb8eafe1ab797d30e6902b69649e5ea9" alt="ComfyUI-fal-API Screenshot"
ComfyUI-fal-API
ComfyUI-fal-API is a repository containing custom nodes for using Flux models with fal API in ComfyUI. It provides nodes for image generation, video generation, language models, and vision language models. Users can easily install and configure the repository to access various nodes for different tasks such as generating images, creating videos, processing text, and understanding images. The repository also includes troubleshooting steps and is licensed under the Apache License 2.0.
data:image/s3,"s3://crabby-images/85747/8574794c7e202694a2973d730ee08526a590b6d9" alt="farfalle Screenshot"
farfalle
Farfalle is an open-source AI-powered search engine that allows users to run their own local LLM or utilize the cloud. It provides a tech stack including Next.js for frontend, FastAPI for backend, Tavily for search API, Logfire for logging, and Redis for rate limiting. Users can get started by setting up prerequisites like Docker and Ollama, and obtaining API keys for Tavily, OpenAI, and Groq. The tool supports models like llama3, mistral, and gemma. Users can clone the repository, set environment variables, run containers using Docker Compose, and deploy the backend and frontend using services like Render and Vercel.
data:image/s3,"s3://crabby-images/6889c/6889c18e8e8c95bd4d045745080e04ff21d40938" alt="polyfire-js Screenshot"
polyfire-js
Polyfire is an all-in-one managed backend for AI apps that allows users to build AI apps directly from the frontend, eliminating the need for a separate backend. It simplifies the process by providing most backend services in just a few lines of code. With Polyfire, users can easily create chatbots, transcribe audio files to text, generate simple text, create a long-term memory, and generate images with Dall-E. The tool also offers starter guides and tutorials to help users get started quickly and efficiently.
data:image/s3,"s3://crabby-images/ee397/ee3971c652e4f14e4775a088218b031b521bc592" alt="summarize Screenshot"
summarize
The 'summarize' tool is designed to transcribe and summarize videos from various sources using AI models. It helps users efficiently summarize lengthy videos, take notes, and extract key insights by providing timestamps, original transcripts, and support for auto-generated captions. Users can utilize different AI models via Groq, OpenAI, or custom local models to generate grammatically correct video transcripts and extract wisdom from video content. The tool simplifies the process of summarizing video content, making it easier to remember and reference important information.
data:image/s3,"s3://crabby-images/b87fe/b87fe2b645bd88569a2a912be8b7b8cc2f62b4a9" alt="llms-interview-questions Screenshot"
llms-interview-questions
This repository contains a comprehensive collection of 63 must-know Large Language Models (LLMs) interview questions. It covers topics such as the architecture of LLMs, transformer models, attention mechanisms, training processes, encoder-decoder frameworks, differences between LLMs and traditional statistical language models, handling context and long-term dependencies, transformers for parallelization, applications of LLMs, sentiment analysis, language translation, conversation AI, chatbots, and more. The readme provides detailed explanations, code examples, and insights into utilizing LLMs for various tasks.
data:image/s3,"s3://crabby-images/100c1/100c1743d25c7f4f5d3e579607f96bbab2c7b443" alt="agentneo Screenshot"
agentneo
AgentNeo is a Python package that provides functionalities for project, trace, dataset, experiment management. It allows users to authenticate, create projects, trace agents and LangGraph graphs, manage datasets, and run experiments with metrics. The tool aims to streamline AI project management and analysis by offering a comprehensive set of features.
data:image/s3,"s3://crabby-images/3e39a/3e39aefd9bf1161b5379c7b630bc42a4597fec9c" alt="llmchat Screenshot"
llmchat
LLMChat is an all-in-one AI chat interface that supports multiple language models, offers a plugin library for enhanced functionality, enables web search capabilities, allows customization of AI assistants, provides text-to-speech conversion, ensures secure local data storage, and facilitates data import/export. It also includes features like knowledge spaces, prompt library, personalization, and can be installed as a Progressive Web App (PWA). The tech stack includes Next.js, TypeScript, Pglite, LangChain, Zustand, React Query, Supabase, Tailwind CSS, Framer Motion, Shadcn, and Tiptap. The roadmap includes upcoming features like speech-to-text and knowledge spaces.
data:image/s3,"s3://crabby-images/15f2a/15f2a162624a35921dec331a1d7cad5a917b6c83" alt="RookieAI_yolov8 Screenshot"
RookieAI_yolov8
RookieAI_yolov8 is an open-source project designed for developers and users interested in utilizing YOLOv8 models for object detection tasks. The project provides instructions for setting up the required libraries and Pytorch, as well as guidance on using custom or official YOLOv8 models. Users can easily train their own models and integrate them with the software. The tool offers features for packaging the code, managing model files, and organizing the necessary resources for running the software. It also includes updates and optimizations for better performance and functionality, with a focus on FPS game aimbot functionalities. The project aims to provide a comprehensive solution for object detection tasks using YOLOv8 models.
data:image/s3,"s3://crabby-images/2e9fc/2e9fc58490a8d5862ff89c951cf684ea222b4083" alt="WebAI-to-API Screenshot"
WebAI-to-API
This project implements a web API that offers a unified interface to Google Gemini and Claude 3. It provides a self-hosted, lightweight, and scalable solution for accessing these AI models through a streaming API. The API supports both Claude and Gemini models, allowing users to interact with them in real-time. The project includes a user-friendly web UI for configuration and documentation, making it easy to get started and explore the capabilities of the API.
data:image/s3,"s3://crabby-images/f9366/f93669774ef9f5ac6f0b90da1f1a0e00b3554b4d" alt="cog Screenshot"
cog
Cog is an open-source tool that lets you package machine learning models in a standard, production-ready container. You can deploy your packaged model to your own infrastructure, or to Replicate.
data:image/s3,"s3://crabby-images/59b5f/59b5f7149eae3bdea77d300ad8d3513803ee51f2" alt="kweaver Screenshot"
kweaver
KWeaver is an open-source cognitive intelligence development framework that provides data scientists, application developers, and domain experts with the ability for rapid development, comprehensive openness, and high-performance knowledge network generation and cognitive intelligence large model framework. It offers features such as automated and visual knowledge graph construction, visualization and analysis of knowledge graph data, knowledge graph integration, knowledge graph resource management, large model prompt engineering and debugging, and visual configuration for large model access.
For similar tasks
data:image/s3,"s3://crabby-images/a7134/a71346980a3147f11e6e57ab6ae1d7d49bbf9132" alt="swift-ocr-llm-powered-pdf-to-markdown Screenshot"
swift-ocr-llm-powered-pdf-to-markdown
Swift OCR is a powerful tool for extracting text from PDF files using OpenAI's GPT-4 Turbo with Vision model. It offers flexible input options, advanced OCR processing, performance optimizations, structured output, robust error handling, and scalable architecture. The tool ensures accurate text extraction, resilience against failures, and efficient handling of multiple requests.
data:image/s3,"s3://crabby-images/100b1/100b1a548ab1673d76d378f985bfe9259bfcc8ad" alt="MegaParse Screenshot"
MegaParse
MegaParse is a powerful and versatile parser designed to handle various types of documents such as text, PDFs, Powerpoint presentations, and Word documents with no information loss. It is fast, efficient, and open source, supporting a wide range of file formats. MegaParse ensures compatibility with tables, table of contents, headers, footers, and images, making it a comprehensive solution for document parsing.
data:image/s3,"s3://crabby-images/288cb/288cb5a1b9640787688ee71a6a8a1152ecf971b2" alt="NekoImageGallery Screenshot"
NekoImageGallery
NekoImageGallery is an online AI image search engine that utilizes the Clip model and Qdrant vector database. It supports keyword search and similar image search. The tool generates 768-dimensional vectors for each image using the Clip model, supports OCR text search using PaddleOCR, and efficiently searches vectors using the Qdrant vector database. Users can deploy the tool locally or via Docker, with options for metadata storage using Qdrant database or local file storage. The tool provides API documentation through FastAPI's built-in Swagger UI and can be used for tasks like image search, text extraction, and vector search.
data:image/s3,"s3://crabby-images/a49e8/a49e8ca994a1a6f4964365820cf24b296932d852" alt="gemini_multipdf_chat Screenshot"
gemini_multipdf_chat
Gemini PDF Chatbot is a Streamlit-based application that allows users to chat with a conversational AI model trained on PDF documents. The chatbot extracts information from uploaded PDF files and answers user questions based on the provided context. It features PDF upload, text extraction, conversational AI using the Gemini model, and a chat interface. Users can deploy the application locally or to the cloud, and the project structure includes main application script, environment variable file, requirements, and documentation. Dependencies include PyPDF2, langchain, Streamlit, google.generativeai, and dotenv.
data:image/s3,"s3://crabby-images/9a34b/9a34be72be2a40c7cec5e6f980e18000e342699a" alt="screen-pipe Screenshot"
screen-pipe
Screen-pipe is a Rust + WASM tool that allows users to turn their screen into actions using Large Language Models (LLMs). It enables users to record their screen 24/7, extract text from frames, and process text and images for tasks like analyzing sales conversations. The tool is still experimental and aims to simplify the process of recording screens, extracting text, and integrating with various APIs for tasks such as filling CRM data based on screen activities. The project is open-source and welcomes contributions to enhance its functionalities and usability.
data:image/s3,"s3://crabby-images/ee3f5/ee3f5fc47e504e26841df626935677dc9a00a200" alt="whisper Screenshot"
whisper
Whisper is an open-source library by Open AI that converts/extracts text from audio. It is a cross-platform tool that supports real-time transcription of various types of audio/video without manual conversion to WAV format. The library is designed to run on Linux and Android platforms, with plans for expansion to other platforms. Whisper utilizes three frameworks to function: DART for CLI execution, Flutter for mobile app integration, and web/WASM for web application deployment. The tool aims to provide a flexible and easy-to-use solution for transcription tasks across different programs and platforms.
data:image/s3,"s3://crabby-images/7ffab/7ffabfbbac429b5d78646739347280eb20e9b71e" alt="extractous Screenshot"
extractous
Extractous offers a fast and efficient solution for extracting content and metadata from various document types such as PDF, Word, HTML, and many other formats. It is built with Rust, providing high performance, memory safety, and multi-threading capabilities. The tool eliminates the need for external services or APIs, making data processing pipelines faster and more efficient. It supports multiple file formats, including Microsoft Office, OpenOffice, PDF, spreadsheets, web documents, e-books, text files, images, and email formats. Extractous provides a clear and simple API for extracting text and metadata content, with upcoming support for JavaScript/TypeScript. It is free for commercial use under the Apache 2.0 License.
data:image/s3,"s3://crabby-images/e2ed7/e2ed784f6b0c04d9d7d17620ef5f19435f015ab4" alt="ai-starter-kit Screenshot"
ai-starter-kit
SambaNova AI Starter Kits is a collection of open-source examples and guides designed to facilitate the deployment of AI-driven use cases for developers and enterprises. The kits cover various categories such as Data Ingestion & Preparation, Model Development & Optimization, Intelligent Information Retrieval, and Advanced AI Capabilities. Users can obtain a free API key using SambaNova Cloud or deploy models using SambaStudio. Most examples are written in Python but can be applied to any programming language. The kits provide resources for tasks like text extraction, fine-tuning embeddings, prompt engineering, question-answering, image search, post-call analysis, and more.
For similar jobs
data:image/s3,"s3://crabby-images/7a828/7a828889d979cbf4be5a04454f679734bb36585f" alt="sweep Screenshot"
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
data:image/s3,"s3://crabby-images/cac11/cac1100b7e92d3c9c9529eacfe5a6e8d943d8f57" alt="teams-ai Screenshot"
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
data:image/s3,"s3://crabby-images/10f6b/10f6b939c21eecaacb4aeb678159f5a587a20256" alt="ai-guide Screenshot"
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
data:image/s3,"s3://crabby-images/8b8c3/8b8c30180bcfba25fde40a102b6ae98fd35704b8" alt="classifai Screenshot"
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
data:image/s3,"s3://crabby-images/c6b52/c6b52a0438e707c19f9dcb358608627496141f31" alt="chatbot-ui Screenshot"
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
data:image/s3,"s3://crabby-images/2fa15/2fa15d62e208bea0a119405a82ad37a6b24564c0" alt="BricksLLM Screenshot"
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
data:image/s3,"s3://crabby-images/e597e/e597e24a3c2657c376591c1e0da9159b22cd2ff2" alt="uAgents Screenshot"
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
data:image/s3,"s3://crabby-images/8ab69/8ab692a869eef895ffca840dda9b43d13f3cf958" alt="griptape Screenshot"
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.