markdrop

A Python package for converting PDFs to markdown while extracting images and tables, generate descriptive text descriptions for extracted tables/images using several LLM clients. And many more functionalities. Markdrop is available on PyPI.

Stars: 80

Visit

Markdrop is a Python package that facilitates the conversion of PDFs to markdown format while extracting images and tables. It also generates descriptive text descriptions for extracted tables and images using various LLM clients. The tool offers additional functionalities such as PDF URL support, AI-powered image and table descriptions, interactive HTML output with downloadable Excel tables, customizable image resolution and UI elements, and a comprehensive logging system. Markdrop aims to simplify the process of handling PDF documents and enhancing their content with AI-generated descriptions.

README:

Markdrop

Quick Start

Features

[x] PDF to Markdown conversion with formatting preservation using Docling
[x] Automatic image extraction with quality preservation using XRef Id
[x] Table detection using Microsoft's Table Transformer
[x] PDF URL support for core functionalities
[x] AI-powered image and table descriptions using multiple LLM providers
[x] Interactive HTML output with downloadable Excel tables
[x] Customizable image resolution and UI elements
[x] Comprehensive logging system
[ ] Support for other files
[ ] Streamlit/web interface

Installation

pip install markdrop

Python Package Index (PyPI) Page: https://pypi.org/project/markdrop

Basic PDF Processing

from markdrop import extract_images, make_markdown, extract_tables_from_pdf

source_pdf = 'url/or/path/to/pdf/file'    # Replace with your local PDF file path or a URL
output_dir = 'data/output'                 # Replace with desired output directory's path

make_markdown(source_pdf, output_dir)
extract_images(source_pdf, output_dir)
extract_tables_from_pdf(source_pdf, output_dir=output_dir)

Advanced PDF Processing with MarkDrop

from markdrop import markdrop, MarkDropConfig, add_downloadable_tables
from pathlib import Path
import logging

# Configure processing options
config = MarkDropConfig(
    image_resolution_scale=2.0,        # Scale factor for image resolution
    download_button_color='#444444',   # Color for download buttons in HTML
    log_level=logging.INFO,           # Logging detail level
    log_dir='logs',                   # Directory for log files
    excel_dir='markdropped-excel-tables'  # Directory for Excel table exports
)

# Process PDF document
input_doc_path = "path/to/input.pdf"
output_dir = Path('output_directory')

# Convert PDF and generate HTML with images and tables
html_path = markdrop(input_doc_path, output_dir, config)

# Add interactive table download functionality
downloadable_html = add_downloadable_tables(html_path, config)

AI-Powered Content Analysis

from markdrop import setup_keys, process_markdown, ProcessorConfig, AIProvider, logger
from pathlib import Path

# Set up API keys for AI providers
setup_apikeys(key='gemini')  # or setup_keys(key='openai')

# Configure AI processing options
config = ProcessorConfig(
    input_path="path/to/markdown/file.md",    # Input markdown file path
    output_dir=Path("output_directory"),      # Output directory
    ai_provider=AIProvider.GEMINI,            # AI provider (GEMINI or OPENAI)
    remove_images=False,                      # Keep or remove original images
    remove_tables=False,                      # Keep or remove original tables
    table_descriptions=True,                  # Generate table descriptions
    image_descriptions=True,                  # Generate image descriptions
    max_retries=3,                           # Number of API call retries
    retry_delay=2,                           # Delay between retries in seconds
    gemini_model_name="gemini-1.5-flash",    # Gemini model for images
    gemini_text_model_name="gemini-pro",     # Gemini model for text
    image_prompt=DEFAULT_IMAGE_PROMPT,        # Custom prompt for image analysis
    table_prompt=DEFAULT_TABLE_PROMPT         # Custom prompt for table analysis
)

# Process markdown with AI descriptions
output_path = process_markdown(config)

Image Description Generation

from markdrop import generate_descriptions

prompt = "Give textual highly detailed descriptions from this image ONLY, nothing else."
input_path = 'path/to/img_file/or/dir'
output_dir = 'data/output'
llm_clients = ['gemini', 'llama-vision']  # Available: ['qwen', 'gemini', 'openai', 'llama-vision', 'molmo', 'pixtral']

generate_descriptions(
    input_path=input_path,
    output_dir=output_dir,
    prompt=prompt,
    llm_client=llm_clients
)

API Reference

Core Functions

markdrop(input_doc_path: str, output_dir: str, config: Optional[MarkDropConfig] = None) -> Path

Converts PDF to markdown and HTML with enhanced features.

Parameters:

input_doc_path (str): Path to input PDF file
output_dir (str): Output directory path
config (MarkDropConfig, optional): Configuration options for processing

add_downloadable_tables(html_path: Path, config: Optional[MarkDropConfig] = None) -> Path

Adds interactive table download functionality to HTML output.

Parameters:

html_path (Path): Path to HTML file
config (MarkDropConfig, optional): Configuration options

Configuration Classes

MarkDropConfig

Configuration for PDF processing:

image_resolution_scale (float): Scale factor for image resolution (default: 2.0)
download_button_color (str): HTML color code for download buttons (default: '#444444')
log_level (int): Logging level (default: logging.INFO)
log_dir (str): Directory for log files (default: 'logs')
excel_dir (str): Directory for Excel table exports (default: 'markdropped-excel-tables')

ProcessorConfig

Configuration for AI processing:

input_path (str): Path to markdown file
output_dir (str): Output directory path
ai_provider (AIProvider): AI provider selection (GEMINI or OPENAI)
remove_images (bool): Whether to remove original images
remove_tables (bool): Whether to remove original tables
table_descriptions (bool): Generate table descriptions
image_descriptions (bool): Generate image descriptions
max_retries (int): Maximum API call retries
retry_delay (int): Delay between retries in seconds
gemini_model_name (str): Gemini model for image processing
gemini_text_model_name (str): Gemini model for text processing
image_prompt (str): Custom prompt for image analysis
table_prompt (str): Custom prompt for table analysis

Legacy Functions

make_markdown(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic PDF to markdown conversion.

Parameters:

source (str): Path to input PDF or URL
output_dir (str): Output directory path
verbose (bool): Enable detailed logging

extract_images(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic image extraction.

Parameters:

source (str): Path to input PDF or URL
output_dir (str): Output directory path
verbose (bool): Enable detailed logging

extract_tables_from_pdf(pdf_path: str, **kwargs)

Legacy function for basic table extraction.

Parameters:

pdf_path (str): Path to input PDF or URL
start_page (int, optional): Starting page number
end_page (int, optional): Ending page number
threshold (float, optional): Detection confidence threshold
output_dir (str): Output directory path

Quick Start for Legacy Functions

Check an example in run.py

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

Clone the repository:

git clone https://github.com/shoryasethia/markdrop.git  
cd markdrop

Create a virtual environment:

python -m venv venv  
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install development dependencies:

pip install -r requirements.txt

Project Structure

markdrop/  
├── LICENSE  
├── README.md  
├── CONTRIBUTING.md  
├── CHANGELOG.md  
├── requirements.txt  
├── setup.py  
└── markdrop/ 
    ├── __init__.py 
    ├── src
    |    └── markdrop-logo.png
    ├── main.py
    ├── process.py
    ├── api_setup.py
    ├── parse.py
    ├── utils.py  
    ├── helper.py
    ├── ignore_warnings.py
    ├── run.py
    └── models/
        ├── __init__.py
        ├── .env
        ├── img_descriptions.py
        ├── logger.py
        ├── model_loader.py
        ├── responder.py
        └── setup_keys.py

Star History

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for version history.

Code of Conduct

Please note that this project follows our Code of Conduct.

Support

Open an issue

For Tasks:

Click tags to check more tools for each tasks

convert pdfs extract images generate descriptions create markdown analyze content

For Jobs:

data scientist content creator research analyst technical writer data analyst

Alternative AI tools for markdrop

Similar Open Source Tools

markdrop

github

: 80

serve

Jina-Serve is a framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets. It provides native support for major ML frameworks and data types, high-performance service design with scaling and dynamic batching, LLM serving with streaming output, built-in Docker integration and Executor Hub, one-click deployment to Jina AI Cloud, and enterprise-ready features with Kubernetes and Docker Compose support. Users can create gRPC-based AI services, build pipelines, scale services locally with replicas, shards, and dynamic batching, deploy to the cloud using Kubernetes, Docker Compose, or JCloud, and enable token-by-token streaming for responsive LLM applications.

github

: 21.4k

educhain

Educhain is a powerful Python package that leverages Generative AI to create engaging and personalized educational content. It enables users to generate multiple-choice questions, create lesson plans, and support various LLM models. Users can export questions to JSON, PDF, and CSV formats, customize prompt templates, and generate questions from text, PDF, URL files, youtube videos, and images. Educhain outperforms traditional methods in content generation speed and quality. It offers advanced configuration options and has a roadmap for future enhancements, including integration with popular Learning Management Systems and a mobile app for content generation on-the-go.

github

: 157

llama_ros

This repository provides a set of ROS 2 packages to integrate llama.cpp into ROS 2. By using the llama_ros packages, you can easily incorporate the powerful optimization capabilities of llama.cpp into your ROS 2 projects by running GGUF-based LLMs and VLMs.

github

: 195

agentops

AgentOps is a toolkit for evaluating and developing robust and reliable AI agents. It provides benchmarks, observability, and replay analytics to help developers build better agents. AgentOps is open beta and can be signed up for here. Key features of AgentOps include: - Session replays in 3 lines of code: Initialize the AgentOps client and automatically get analytics on every LLM call. - Time travel debugging: (coming soon!) - Agent Arena: (coming soon!) - Callback handlers: AgentOps works seamlessly with applications built using Langchain and LlamaIndex.

github

: 4.1k

langcheck

LangCheck is a Python library that provides a suite of metrics and tools for evaluating the quality of text generated by large language models (LLMs). It includes metrics for evaluating text fluency, sentiment, toxicity, factual consistency, and more. LangCheck also provides tools for visualizing metrics, augmenting data, and writing unit tests for LLM applications. With LangCheck, you can quickly and easily assess the quality of LLM-generated text and identify areas for improvement.

github

: 184

obsei

Obsei is an open-source, low-code, AI powered automation tool that consists of an Observer to collect unstructured data from various sources, an Analyzer to analyze the collected data with various AI tasks, and an Informer to send analyzed data to various destinations. The tool is suitable for scheduled jobs or serverless applications as all Observers can store their state in databases. Obsei is still in alpha stage, so caution is advised when using it in production. The tool can be used for social listening, alerting/notification, automatic customer issue creation, extraction of deeper insights from feedbacks, market research, dataset creation for various AI tasks, and more based on creativity.

github

: 1.2k

LongLLaVA

LongLLaVA is a tool for scaling multi-modal LLMs to 1000 images efficiently via hybrid architecture. It includes stages for single-image alignment, instruction-tuning, and multi-image instruction-tuning, with evaluation through a command line interface and model inference. The tool aims to achieve GPT-4V level capabilities and beyond, providing reproducibility of results and benchmarks for efficiency and performance.

github

: 158

ChatGPT-Next-Web

ChatGPT Next Web is a well-designed cross-platform ChatGPT web UI tool that supports Claude, GPT4, and Gemini Pro models. It allows users to deploy their private ChatGPT applications with ease. The tool offers features like one-click deployment, compact client for Linux/Windows/MacOS, compatibility with self-deployed LLMs, privacy-first approach with local data storage, markdown support, responsive design, fast loading speed, prompt templates, awesome prompts, chat history compression, multilingual support, and more.

github

: 78.5k

evalplus

EvalPlus is a rigorous evaluation framework for LLM4Code, providing HumanEval+ and MBPP+ tests to evaluate large language models on code generation tasks. It offers precise evaluation and ranking, coding rigorousness analysis, and pre-generated code samples. Users can use EvalPlus to generate code solutions, post-process code, and evaluate code quality. The tool includes tools for code generation and test input generation using various backends.

github

: 1.3k

e2m

E2M is a Python library that can parse and convert various file types into Markdown format. It supports the conversion of multiple file formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning. The core architecture consists of a Parser responsible for parsing various file types into text or image data, and a Converter responsible for converting text or image data into Markdown format.

github

: 143

NextChat

NextChat is a well-designed cross-platform ChatGPT web UI tool that supports Claude, GPT4, and Gemini Pro. It offers a compact client for Linux, Windows, and MacOS, with features like self-deployed LLMs compatibility, privacy-first data storage, markdown support, responsive design, and fast loading speed. Users can create, share, and debug chat tools with prompt templates, access various prompts, compress chat history, and use multiple languages. The tool also supports enterprise-level privatization and customization deployment, with features like brand customization, resource integration, permission control, knowledge integration, security auditing, private deployment, and continuous updates.

github

: 78.7k

desktop

E2B Desktop Sandbox is a secure virtual desktop environment powered by E2B, allowing users to create isolated sandboxes with customizable dependencies. It provides features such as streaming the desktop screen, mouse and keyboard control, taking screenshots, opening files, and running bash commands. The environment is based on Linux and Xfce, offering a fast and lightweight experience that can be fully customized to create unique desktop environments.

github

: 569

ollama4j

Ollama4j is a Java library that serves as a wrapper or binding for the Ollama server. It facilitates communication with the Ollama server and provides models for deployment. The tool requires Java 11 or higher and can be installed locally or via Docker. Users can integrate Ollama4j into Maven projects by adding the specified dependency. The tool offers API specifications and supports various development tasks such as building, running unit tests, and integration tests. Releases are automated through GitHub Actions CI workflow. Areas of improvement include adhering to Java naming conventions, updating deprecated code, implementing logging, using lombok, and enhancing request body creation. Contributions to the project are encouraged, whether reporting bugs, suggesting enhancements, or contributing code.

github

: 162

client-ts

Mistral Typescript Client is an SDK for Mistral AI API, providing Chat Completion and Embeddings APIs. It allows users to create chat completions, upload files, create agent completions, create embedding requests, and more. The SDK supports various JavaScript runtimes and provides detailed documentation on installation, requirements, API key setup, example usage, error handling, server selection, custom HTTP client, authentication, providers support, standalone functions, debugging, and contributions.

github

: 52

openlrc

Open-Lyrics is a Python library that transcribes voice files using faster-whisper and translates/polishes the resulting text into `.lrc` files in the desired language using LLM, e.g. OpenAI-GPT, Anthropic-Claude. It offers well preprocessed audio to reduce hallucination and context-aware translation to improve translation quality. Users can install the library from PyPI or GitHub and follow the installation steps to set up the environment. The tool supports GUI usage and provides Python code examples for transcription and translation tasks. It also includes features like utilizing context and glossary for translation enhancement, pricing information for different models, and a list of todo tasks for future improvements.

github

: 476

For similar tasks

Open-DocLLM

Open-DocLLM is an open-source project that addresses data extraction and processing challenges using OCR and LLM technologies. It consists of two main layers: OCR for reading document content and LLM for extracting specific content in a structured manner. The project offers a larger context window size compared to JP Morgan's DocLLM and integrates tools like Tesseract OCR and Mistral for efficient data analysis. Users can run the models on-premises using LLM studio or Ollama, and the project includes a FastAPI app for testing purposes.

github

: 124

Awesome-AI

Awesome AI is a repository that collects and shares resources in the fields of large language models (LLM), AI-assisted programming, AI drawing, and more. It explores the application and development of generative artificial intelligence. The repository provides information on various AI tools, models, and platforms, along with tutorials and web products related to AI technologies.

github

: 157

Qmedia

QMedia is an open-source multimedia AI content search engine designed specifically for content creators. It provides rich information extraction methods for text, image, and short video content. The tool integrates unstructured text, image, and short video information to build a multimodal RAG content Q&A system. Users can efficiently search for image/text and short video materials, analyze content, provide content sources, and generate customized search results based on user interests and needs. QMedia supports local deployment for offline content search and Q&A for private data. The tool offers features like content cards display, multimodal content RAG search, and pure local multimodal models deployment. Users can deploy different types of models locally, manage language models, feature embedding models, image models, and video models. QMedia aims to spark new ideas for content creation and share AI content creation concepts in an open-source manner.

github

: 537

aws-ai-intelligent-document-processing

This repository is part of Intelligent Document Processing with AWS AI Services workshop. It aims to automate the extraction of information from complex content in various document formats such as insurance claims, mortgages, healthcare claims, contracts, and legal contracts using AWS Machine Learning services like Amazon Textract and Amazon Comprehend. The repository provides hands-on labs to familiarize users with these AI services and build solutions to automate business processes that rely on manual inputs and intervention across different file types and formats.

github

: 124

Scrapegraph-LabLabAI-Hackathon

ScrapeGraphAI is a web scraping Python library that utilizes LangChain, LLM, and direct graph logic to create scraping pipelines. Users can specify the information they want to extract, and the library will handle the extraction process. The tool is designed to simplify web scraping tasks by providing a streamlined and efficient approach to data extraction.

github

: 75

parsera

Parsera is a lightweight Python library designed for scraping websites using LLMs. It offers simplicity and efficiency by minimizing token usage, enhancing speed, and reducing costs. Users can easily set up and run the tool to extract specific elements from web pages, generating JSON output with relevant data. Additionally, Parsera supports integration with various chat models, such as Azure, expanding its functionality and customization options for web scraping tasks.

github

: 1.1k

Scrapegraph-demo

ScrapeGraphAI is a web scraping Python library that utilizes LangChain, LLM, and direct graph logic to create scraping pipelines. Users can specify the information they want to extract, and the library will handle the extraction process. This repository contains an official demo/trial for the ScrapeGraphAI library, showcasing its capabilities in web scraping tasks. The tool is designed to simplify the process of extracting data from websites by providing a user-friendly interface and powerful scraping functionalities.

github

: 76

you2txt

You2Txt is a tool developed for the Vercel + Nvidia 2-hour hackathon that converts any YouTube video into a transcribed .txt file. The project won first place in the hackathon and is hosted at you2txt.com. Due to rate limiting issues with YouTube requests, it is recommended to run the tool locally. The project was created using Next.js, Tailwind, v0, and Claude, and can be built and accessed locally for development purposes.

github

: 71

For similar jobs

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

daily-poetry-image

Daily Chinese ancient poetry and AI-generated images powered by Bing DALL-E-3. GitHub Action triggers the process automatically. Poetry is provided by Today's Poem API. The website is built with Astro.

github

: 492

exif-photo-blog

EXIF Photo Blog is a full-stack photo blog application built with Next.js, Vercel, and Postgres. It features built-in authentication, photo upload with EXIF extraction, photo organization by tag, infinite scroll, light/dark mode, automatic OG image generation, a CMD-K menu with photo search, experimental support for AI-generated descriptions, and support for Fujifilm simulations. The application is easy to deploy to Vercel with just a few clicks and can be customized with a variety of environment variables.

github

: 992

SillyTavern

SillyTavern is a user interface you can install on your computer (and Android phones) that allows you to interact with text generation AIs and chat/roleplay with characters you or the community create. SillyTavern is a fork of TavernAI 1.2.8 which is under more active development and has added many major features. At this point, they can be thought of as completely independent programs.

github

: 13.2k

Twitter-Insight-LLM

This project enables you to fetch liked tweets from Twitter (using Selenium), save it to JSON and Excel files, and perform initial data analysis and image captions. This is part of the initial steps for a larger personal project involving Large Language Models (LLMs).

github

: 401

AISuperDomain

Aila Desktop Application is a powerful tool that integrates multiple leading AI models into a single desktop application. It allows users to interact with various AI models simultaneously, providing diverse responses and insights to their inquiries. With its user-friendly interface and customizable features, Aila empowers users to engage with AI seamlessly and efficiently. Whether you're a researcher, student, or professional, Aila can enhance your AI interactions and streamline your workflow.

github

: 1.2k

ChatGPT-On-CS

This project is an intelligent dialogue customer service tool based on a large model, which supports access to platforms such as WeChat, Qianniu, Bilibili, Douyin Enterprise, Douyin, Doudian, Weibo chat, Xiaohongshu professional account operation, Xiaohongshu, Zhihu, etc. You can choose GPT3.5/GPT4.0/ Lazy Treasure Box (more platforms will be supported in the future), which can process text, voice and pictures, and access external resources such as operating systems and the Internet through plug-ins, and support enterprise AI applications customized based on their own knowledge base.

github

: 768

obs-localvocal

LocalVocal is a live-streaming AI assistant plugin for OBS that allows you to transcribe audio speech into text and perform various language processing functions on the text using AI / LLMs (Large Language Models). It's privacy-first, with all data staying on your machine, and requires no GPU, cloud costs, network, or downtime.

github

: 248