markdrop

A Python package for converting PDFs to markdown while extracting images and tables. It can also generate text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

Stars: 52

Markdrop is a Python package that converts PDFs to markdown while extracting images and tables. It also generates text descriptions of the extracted tables and images using various LLM clients. Additional features include PDF URL support, interactive HTML output with downloadable Excel tables, customizable image resolution and UI elements, and a comprehensive logging system. Markdrop aims to simplify working with PDF documents and to enrich their content with AI-generated descriptions.

README:

Markdrop

Features

  • [x] PDF to Markdown conversion with formatting preservation using Docling
  • [x] Automatic image extraction with quality preservation using XRef Id
  • [x] Table detection using Microsoft's Table Transformer
  • [x] PDF URL support for core functionalities
  • [x] AI-powered image and table descriptions using multiple LLM providers
  • [x] Interactive HTML output with downloadable Excel tables
  • [x] Customizable image resolution and UI elements
  • [x] Comprehensive logging system
  • [ ] Support for other file types
  • [ ] Streamlit/web interface

Installation

pip install markdrop  

Python Package Index (PyPI) Page: https://pypi.org/project/markdrop

Quick Start

Basic PDF Processing

from markdrop import extract_images, make_markdown, extract_tables_from_pdf

source_pdf = 'url/or/path/to/pdf/file'    # Replace with your local PDF file path or a URL
output_dir = 'data/output'                 # Replace with desired output directory's path

make_markdown(source_pdf, output_dir)
extract_images(source_pdf, output_dir)
extract_tables_from_pdf(source_pdf, output_dir=output_dir)
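
Because `source_pdf` can be either a local path or a URL, it can be useful to catch a mistyped local path before starting a conversion. The following is a minimal sketch using only the standard library; `is_url` and `check_source` are hypothetical helpers, not part of markdrop, and this README does not show how markdrop itself tells URLs and paths apart.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True if `source` looks like an http(s) URL (hypothetical helper)."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_source(source: str) -> str:
    """Return 'url' or 'file'; raise FileNotFoundError for a missing local path."""
    if is_url(source):
        return "url"
    if not Path(source).is_file():
        raise FileNotFoundError(f"No such PDF: {source}")
    return "file"
```

Calling `check_source(source_pdf)` before `make_markdown` would then surface a bad path immediately, without waiting for the conversion to start.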

Advanced PDF Processing with MarkDrop

from markdrop import markdrop, MarkDropConfig, add_downloadable_tables
from pathlib import Path
import logging

# Configure processing options
config = MarkDropConfig(
    image_resolution_scale=2.0,        # Scale factor for image resolution
    download_button_color='#444444',   # Color for download buttons in HTML
    log_level=logging.INFO,           # Logging detail level
    log_dir='logs',                   # Directory for log files
    excel_dir='markdropped-excel-tables'  # Directory for Excel table exports
)

# Process PDF document
input_doc_path = "path/to/input.pdf"
output_dir = Path('output_directory')

# Convert PDF and generate HTML with images and tables
html_path = markdrop(input_doc_path, output_dir, config)

# Add interactive table download functionality
downloadable_html = add_downloadable_tables(html_path, config)

AI-Powered Content Analysis

from markdrop import setup_keys, process_markdown, ProcessorConfig, AIProvider, logger
from pathlib import Path

# Set up API keys for AI providers
setup_keys(key='gemini')  # or setup_keys(key='openai')

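# Placeholder prompts: DEFAULT_IMAGE_PROMPT and DEFAULT_TABLE_PROMPT are not imported above,
# so they are defined here as example strings (not markdrop's built-in defaults)
DEFAULT_IMAGE_PROMPT = "Describe this image in detail."
DEFAULT_TABLE_PROMPT = "Summarize the contents of this table."

# Configure AI processing options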
config = ProcessorConfig(
    input_path="path/to/markdown/file.md",    # Input markdown file path
    output_dir=Path("output_directory"),      # Output directory
    ai_provider=AIProvider.GEMINI,            # AI provider (GEMINI or OPENAI)
    remove_images=False,                      # Keep or remove original images
    remove_tables=False,                      # Keep or remove original tables
    table_descriptions=True,                  # Generate table descriptions
    image_descriptions=True,                  # Generate image descriptions
    max_retries=3,                           # Number of API call retries
    retry_delay=2,                           # Delay between retries in seconds
    gemini_model_name="gemini-1.5-flash",    # Gemini model for images
    gemini_text_model_name="gemini-pro",     # Gemini model for text
    image_prompt=DEFAULT_IMAGE_PROMPT,        # Custom prompt for image analysis
    table_prompt=DEFAULT_TABLE_PROMPT         # Custom prompt for table analysis
)

# Process markdown with AI descriptions
output_path = process_markdown(config)
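
The `max_retries` and `retry_delay` options describe a retry policy for API calls. As a rough illustration of what such a policy usually looks like, here is a minimal sketch with a fixed delay between attempts. It is not markdrop's implementation: `call_with_retries` is a hypothetical helper, and treating `max_retries` as the total number of attempts is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_retries: int = 3, retry_delay: float = 2) -> T:
    """Call `fn`, making at most `max_retries` attempts in total (assumed meaning)."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            time.sleep(retry_delay)  # fixed pause before the next attempt
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

With the values from the example above (`max_retries=3`, `retry_delay=2`), a call that keeps failing would be attempted three times with two two-second pauses in between.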

Image Description Generation

from markdrop import generate_descriptions

prompt = "Give textual highly detailed descriptions from this image ONLY, nothing else."
input_path = 'path/to/img_file/or/dir'
output_dir = 'data/output'
llm_clients = ['gemini', 'llama-vision']  # Available: ['qwen', 'gemini', 'openai', 'llama-vision', 'molmo', 'pixtral']

generate_descriptions(
    input_path=input_path,
    output_dir=output_dir,
    prompt=prompt,
    llm_client=llm_clients
)
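
Since the client names are plain strings, a typo such as `'llama_vision'` is easy to make. One option is to check the list against the names given in the comment above before calling `generate_descriptions`. This is a small sketch; `AVAILABLE_CLIENTS` and `validate_clients` are hypothetical and simply restate the list from this README, which may differ between markdrop versions.

```python
from typing import List

# Names copied from the "Available" comment in the example above
AVAILABLE_CLIENTS = ("qwen", "gemini", "openai", "llama-vision", "molmo", "pixtral")


def validate_clients(requested: List[str]) -> List[str]:
    """Return `requested` unchanged if every name is known; otherwise raise ValueError."""
    unknown = [name for name in requested if name not in AVAILABLE_CLIENTS]
    if unknown:
        raise ValueError(
            f"Unknown LLM client(s): {unknown}. Choose from {list(AVAILABLE_CLIENTS)}"
        )
    return requested
```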

API Reference

Core Functions

markdrop(input_doc_path: str, output_dir: str, config: Optional[MarkDropConfig] = None) -> Path

Converts PDF to markdown and HTML with enhanced features.

Parameters:

  • input_doc_path (str): Path to input PDF file
  • output_dir (str): Output directory path
  • config (MarkDropConfig, optional): Configuration options for processing

add_downloadable_tables(html_path: Path, config: Optional[MarkDropConfig] = None) -> Path

Adds interactive table download functionality to HTML output.

Parameters:

  • html_path (Path): Path to HTML file
  • config (MarkDropConfig, optional): Configuration options

Configuration Classes

MarkDropConfig

Configuration for PDF processing:

  • image_resolution_scale (float): Scale factor for image resolution (default: 2.0)
  • download_button_color (str): HTML color code for download buttons (default: '#444444')
  • log_level (int): Logging level (default: logging.INFO)
  • log_dir (str): Directory for log files (default: 'logs')
  • excel_dir (str): Directory for Excel table exports (default: 'markdropped-excel-tables')
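
To see the documented fields and defaults in one place, the list above can be mirrored as a dataclass. This is a stand-in for illustration only: `ConfigSketch` is a hypothetical class, not markdrop's `MarkDropConfig`, and the validation in `__post_init__` (including the restriction to six-digit hex colors) is an assumption, since this README does not say whether markdrop validates these values.

```python
import logging
import re
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Stand-in mirroring the documented MarkDropConfig fields and defaults."""

    image_resolution_scale: float = 2.0
    download_button_color: str = "#444444"
    log_level: int = logging.INFO
    log_dir: str = "logs"
    excel_dir: str = "markdropped-excel-tables"

    def __post_init__(self) -> None:
        # Illustrative checks only; not taken from markdrop.
        if self.image_resolution_scale <= 0:
            raise ValueError("image_resolution_scale must be positive")
        if not re.fullmatch(r"#[0-9a-fA-F]{6}", self.download_button_color):
            raise ValueError("download_button_color must be a hex code like '#444444'")
```

Overriding a single field, for example `ConfigSketch(image_resolution_scale=3.0)`, leaves the other defaults in place, which is the same pattern the `MarkDropConfig(...)` call in the Quick Start follows.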

ProcessorConfig

Configuration for AI processing:

  • input_path (str): Path to markdown file
  • output_dir (str): Output directory path
  • ai_provider (AIProvider): AI provider selection (GEMINI or OPENAI)
  • remove_images (bool): Whether to remove original images
  • remove_tables (bool): Whether to remove original tables
  • table_descriptions (bool): Generate table descriptions
  • image_descriptions (bool): Generate image descriptions
  • max_retries (int): Maximum API call retries
  • retry_delay (int): Delay between retries in seconds
  • gemini_model_name (str): Gemini model for image processing
  • gemini_text_model_name (str): Gemini model for text processing
  • image_prompt (str): Custom prompt for image analysis
  • table_prompt (str): Custom prompt for table analysis

Legacy Functions

make_markdown(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic PDF to markdown conversion.

Parameters:

  • source (str): Path to input PDF or URL
  • output_dir (str): Output directory path
  • verbose (bool): Enable detailed logging

extract_images(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic image extraction.

Parameters:

  • source (str): Path to input PDF or URL
  • output_dir (str): Output directory path
  • verbose (bool): Enable detailed logging

extract_tables_from_pdf(pdf_path: str, **kwargs)

Legacy function for basic table extraction.

Parameters:

  • pdf_path (str): Path to input PDF or URL
  • start_page (int, optional): Starting page number
  • end_page (int, optional): Ending page number
  • threshold (float, optional): Detection confidence threshold
  • output_dir (str): Output directory path
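
The `threshold`, `start_page`, and `end_page` options narrow down which detected tables are kept. As a conceptual sketch of that kind of filtering (not markdrop's code), the snippet below keeps detections whose confidence score meets the threshold and whose page lies in an inclusive range. `Detection` and `filter_detections` are hypothetical names, and the inclusive page range is an assumption.

```python
from typing import List, NamedTuple, Optional


class Detection(NamedTuple):
    """Hypothetical record for one detected table."""

    page: int
    score: float


def filter_detections(
    detections: List[Detection],
    threshold: float,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[Detection]:
    """Keep detections with score >= threshold inside the inclusive page range."""
    kept = []
    for det in detections:
        if det.score < threshold:
            continue  # below the confidence threshold
        if start_page is not None and det.page < start_page:
            continue  # before the requested range
        if end_page is not None and det.page > end_page:
            continue  # after the requested range
        kept.append(det)
    return kept
```

Raising the threshold trades recall for precision: fewer false table detections, at the cost of possibly missing faint or unusual tables.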

Quick Start for Legacy Functions

See run.py for an example.

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

  1. Clone the repository:
git clone https://github.com/shoryasethia/markdrop.git
cd markdrop
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
pip install -r requirements.txt

Project Structure

markdrop/  
├── LICENSE  
├── README.md  
├── CONTRIBUTING.md  
├── CHANGELOG.md  
├── requirements.txt  
├── setup.py  
└── markdrop/ 
    ├── __init__.py 
    ├── src/
    │   └── markdrop-logo.png
    ├── main.py
    ├── process.py
    ├── api_setup.py
    ├── parse.py
    ├── utils.py  
    ├── helper.py
    ├── ignore_warnings.py
    ├── run.py
    └── models/
        ├── __init__.py
        ├── .env
        ├── img_descriptions.py
        ├── logger.py
        ├── model_loader.py
        ├── responder.py
        └── setup_keys.py  

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for version history.

Code of Conduct

Please note that this project follows our Code of Conduct.

Support
