
ppt2desc
Convert PowerPoint files into semantically rich text using vision language models
Stars: 84

ppt2desc is a command-line tool that converts PowerPoint presentations into detailed textual descriptions using vision language models. It interprets and describes visual elements, capturing the full semantic meaning of each slide in a machine-readable format. The tool supports multiple model providers and offers features such as converting PPT/PPTX files to semantic descriptions, processing individual files or entire directories, interpreting visual elements, rate limiting for API calls, customizable prompts, and JSON output for easy integration.
README:
Convert PowerPoint presentations into semantically rich text using Vision Language Models.
ppt2desc is a command-line tool that converts PowerPoint presentations into detailed textual descriptions. PowerPoint presentations are an inherently visual medium that often convey complex ideas through a combination of text, graphics, charts, and other visual layouts. This tool uses vision language models to not only transcribe the text content but also interpret and describe the visual elements and their relationships, capturing the full semantic meaning of each slide in a machine-readable format.
- Convert PPT/PPTX files to semantic descriptions
- Process individual files or entire directories
- Support for interpreting visual elements (charts, graphs, figures)
- Rate limiting for API calls
- Customizable prompts and instructions
- JSON output format for easy integration
Current Model Provider Support
- Gemini models via the Google Gemini API
- GPT models via the OpenAI API
- Claude models via the Anthropic API
- Gemini models via Google Cloud Platform Vertex AI
- GPT models via Microsoft Azure AI Foundry deployments
- Nova & Claude models via Amazon Bedrock (AWS)
Requirements
- Python 3.9 or higher
- LibreOffice (for PPT/PPTX to PDF conversion)
- Option 1: Install LibreOffice locally.
- Option 2: Use the provided Docker container for LibreOffice.
- API credentials for your chosen vision LLM provider
- Clone the repository:
git clone https://github.com/ALucek/ppt2desc.git
cd ppt2desc
- Installing LibreOffice
LibreOffice is a critical dependency for this tool, as it handles the headless conversion of PowerPoint files to PDF format.
Option 1: Local Installation
Linux (Debian/Ubuntu):
sudo apt install libreoffice
macOS:
brew install --cask libreoffice
Windows:
Install using the installer from LibreOffice's official website
Option 2: Docker-based Installation
a. Ensure you have Docker installed on your system
b. Run the following command:
docker compose up -d
This command builds the Docker image from the provided Dockerfile and starts the container in detached mode. The LibreOffice conversion service will be accessible at http://localhost:2002. Take it down with docker compose down.
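Both options ultimately perform the same headless conversion. For reference, here is a minimal Python sketch of the PPTX-to-PDF step LibreOffice carries out; the function name and paths are illustrative, not part of this tool's API:

import subprocess
from pathlib import Path

def convert_to_pdf(pptx_path: str, out_dir: str, soffice: str = "soffice") -> Path:
    # Run LibreOffice headlessly to convert a PPT/PPTX file to PDF
    subprocess.run(
        [soffice, "--headless", "--convert-to", "pdf", "--outdir", out_dir, pptx_path],
        check=True,
    )
    return Path(out_dir) / (Path(pptx_path).stem + ".pdf")

# e.g., convert_to_pdf("deck.pptx", "./out", soffice="/usr/bin/soffice")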
- Create and activate a virtual environment:
python -m venv ppt2desc_venv
source ppt2desc_venv/bin/activate # On Windows: ppt2desc_venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
Basic usage with Gemini API:
python src/main.py \
--input_dir /path/to/presentations \
--output_dir /path/to/output \
--libreoffice_path /path/to/soffice \
--client gemini \
--api_key YOUR_GEMINI_API_KEY
General Arguments:
- --input_dir: Path to input directory or PPT file (required)
- --output_dir: Output directory path (required)
- --client: LLM client to use: 'gemini', 'vertexai', 'anthropic', 'azure', 'aws', or 'openai' (required)
- --model: Model to use (default: "gemini-1.5-flash")
- --instructions: Additional instructions for the model
- --libreoffice_path: Path to LibreOffice installation
- --libreoffice_url: URL for the Docker-based LibreOffice installation (the provided container listens on http://localhost:2002)
- --rate_limit: API calls per minute (default: 60; see the sketch after the argument lists)
- --prompt_path: Custom prompt file path
- --api_key: Model provider API key (if not set via environment variable)
- --save_pdf: Include to save the converted PDF in your output folder
- --save_images: Include to save the individual slide images in your output folder
Vertex AI Specific Arguments:
- --gcp_project_id: GCP project ID for the Vertex AI service account
- --gcp_region: GCP region for the Vertex AI service (e.g., us-central1)
- --gcp_application_credentials: Path to the GCP service account JSON credentials file
Azure AI Foundry Specific Arguments:
- --azure_openai_api_key: Azure AI Foundry resource Key 1 or Key 2
- --azure_openai_endpoint: Azure AI Foundry deployment service endpoint link
- --azure_deployment_name: The name of your model deployment
- --azure_api_version: Azure API version (default: "2023-12-01-preview")
AWS Amazon Bedrock Specific Arguments:
- --aws_access_key_id: Bedrock account access key
- --aws_secret_access_key: Bedrock account secret access key
- --aws_region: AWS Bedrock region
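For illustration, here is a minimal sketch of what per-minute rate limiting typically looks like; the class name is hypothetical and the tool's actual implementation may differ:

import time

class RateLimiter:
    # Space out calls so that at most `calls_per_minute` happen per minute
    def __init__(self, calls_per_minute: int):
        self.interval = 60.0 / calls_per_minute
        self.last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(calls_per_minute=60)  # matches --rate_limit 60
# call limiter.wait() before each model API request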
Using Gemini API:
python src/main.py \
--input_dir ./presentations \
--output_dir ./output \
--libreoffice_path ./soffice \
--client gemini \
--model gemini-1.5-flash \
--rate_limit 30 \
--instructions "Focus on extracting numerical data from charts and graphs"
Using Vertex AI:
python src/main.py \
--input_dir ./presentations \
--output_dir ./output \
--client vertexai \
--libreoffice_path ./soffice \
--gcp_project_id my-project-123 \
--gcp_region us-central1 \
--gcp_application_credentials ./service-account.json \
--model gemini-1.5-pro \
--instructions "Extract detailed information from technical diagrams"
Using Azure AI Foundry:
python src/main.py \
--input_dir ./presentations \
--output_dir ./output \
--libreoffice_path ./soffice \
--client azure \
--azure_openai_api_key 123456790ABCDEFG \
--azure_openai_endpoint 'https://example-endpoint-001.openai.azure.com/' \
--azure_deployment_name gpt-4o \
--azure_api_version 2023-12-01-preview \
--rate_limit 60
Using AWS Amazon Bedrock:
python src/main.py \
--input_dir ./presentations \
--output_dir ./output \
--libreoffice_path ./soffice \
--client aws \
--model us.amazon.nova-lite-v1:0 \
--aws_access_key_id 123456790ABCDEFG \
--aws_secret_access_key 123456790ABCDEFG \
--aws_region us-east-1 \
--rate_limit 60
The tool generates JSON files with the following structure:
{
  "deck": "presentation.pptx",
  "model": "model-name",
  "slides": [
    {
      "number": 1,
      "content": "Detailed description of slide content..."
    },
    // ... more slides
  ]
}
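Because the output is plain JSON, downstream integration takes only a few lines of Python; the file name below is illustrative:

import json

with open("output/presentation.json") as f:
    doc = json.load(f)

print(f"Deck: {doc['deck']} (model: {doc['model']})")
for slide in doc["slides"]:
    print(f"Slide {slide['number']}: {slide['content'][:80]}...")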
When using the Docker container for LibreOffice, you can use the --libreoffice_url argument to direct the conversion process to the container's API endpoint rather than a local installation.
python src/main.py \
--input_dir ./presentations \
--output_dir ./output \
--libreoffice_url http://localhost:2002 \
--client vertexai \
--model gemini-1.5-pro \
--gcp_project_id my-project-123 \
--gcp_region us-central1 \
--gcp_application_credentials ./service-account.json \
--rate_limit 30 \
--instructions "Extract detailed information from technical diagrams" \
--save_pdf \
--save_images
You should use either --libreoffice_url or --libreoffice_path, but not both.
You can modify the base prompt by editing src/prompt.txt or by providing additional instructions via the command line:
python src/main.py \
--input_dir ./presentations \
--output_dir ./output \
--libreoffice_path ./soffice \
--instructions "Include mathematical equations and formulas in LaTeX format"
For Consumer APIs:
- Set your API key via the --api_key argument or through your respective provider's environment variables
For Vertex AI:
- Create a service account in your GCP project IAM
- Grant the necessary permissions (typically, the "Vertex AI User" role)
- Download the service account JSON key file
- Provide the credentials file path via --gcp_application_credentials
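Before a full run, you can sanity-check the service-account file with the google-auth package; a minimal sketch, with an illustrative file path:

from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "./service-account.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
print(creds.project_id)  # should match the value passed to --gcp_project_id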
For Azure OpenAI Foundry:
- Create an Azure OpenAI resource
- Navigate to Azure AI Foundry and choose the subscription and Azure OpenAI resource to work with
- Under Management, select Deployments
- Select Create new deployment and configure it with your vision LLM
- Provide the deployment name, API key, endpoint, and API version via --azure_deployment_name, --azure_openai_api_key, --azure_openai_endpoint, and --azure_api_version
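To confirm the deployment settings are valid before a full run, a quick check with the openai SDK (v1+) can help; all values below are placeholders:

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="YOUR_KEY",
    api_version="2023-12-01-preview",
    azure_endpoint="https://example-endpoint-001.openai.azure.com/",
)
print([m.id for m in client.models.list()])  # a cheap call to verify credentials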
For AWS Bedrock:
- Request access to serverless model deployments in Amazon Bedrock's model catalog
- Create a user in AWS IAM
- Enable Amazon Bedrock access policies for the user
- Save the user's access key and secret access key
- Provide the user's credentials via --aws_access_key_id and --aws_secret_access_key
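A quick way to confirm the credentials resolve before a full run is an STS identity call via boto3; the key values below are placeholders:

import boto3

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    region_name="us-east-1",
)
print(session.client("sts").get_caller_identity()["Arn"])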
Contributions are welcome! Please feel free to submit a Pull Request.
Todo
- Handle Google's new genai SDK for a unified Gemini/Vertex experience
- Better Docker setup
- Confirm AWS Llama vision support
- Combine JSON files across multiple PPTs
- Dynamic font understanding (i.e., the tool struggles when a font used in the PPT is not installed on the machine)
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- LibreOffice for PPT/PPTX conversion
- PyMuPDF for PDF processing