gpt-rag-ingestion

The GPT-RAG Data Ingestion service automates processing of diverse documents—PDFs, images, spreadsheets, transcripts, and SharePoint—readying them for Azure AI Search. It applies smart chunking, generates text and image embeddings, and enables rich, multimodal retrieval.

README:

GPT-RAG Data Ingestion

Part of the GPT-RAG solution.

The GPT-RAG Data Ingestion service automates the processing of diverse document types—such as PDFs, images, spreadsheets, transcripts, and SharePoint files—preparing them for indexing in Azure AI Search. It uses intelligent chunking strategies tailored to each format, generates text and image embeddings, and enables rich, multimodal retrieval experiences for agent-based RAG applications.

How data ingestion works

The service performs the following steps:

  • Scan sources: Detects new or updated content in configured sources
  • Process content: Chunks and enriches data for retrieval
  • Index documents: Writes processed chunks into Azure AI Search
  • Schedule execution: Runs on a CRON-based scheduler defined by environment variables
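
For example, the schedule is a standard CRON expression supplied through the container environment. A minimal sketch, assuming a placeholder variable name (the real setting name comes from the service's configuration):

# Placeholder variable name, for illustration only; check the service configuration for the actual setting.
export INGESTION_CRON_SCHEDULE="0 */6 * * *"   # hypothetical schedule: run every 6 hours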

Supported data sources

  • Blob Storage
  • NL2SQL Metadata
  • SharePoint

Supported formats and chunkers

The ingestion service selects a chunker based on the file extension, ensuring each document is processed with the most suitable method.

  • .pdf files — Processed by the DocAnalysisChunker using the Document Intelligence API. Structured elements such as tables and sections are extracted and converted into Markdown, then segmented with LangChain splitters. When Document Intelligence API 4.0 is enabled, .docx and .pptx files are handled the same way.

  • Image files (.bmp, .png, .jpeg, .tiff) — The DocAnalysisChunker applies OCR to extract text before chunking.

  • Text-based files (.txt, .md, .json, .csv) — Processed by the LangChainChunker, which splits content into paragraphs or sections.

  • Specialized formats:

    • .vtt (video transcripts) — Handled by the TranscriptionChunker, which splits content by time codes.
    • .xlsx (spreadsheets) — Processed by the SpreadsheetChunker, chunked by rows or sheets.
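
As a rough sketch of the selection logic described above (illustration only; the actual selection happens inside the ingestion service), the extension-to-chunker mapping looks like this:

#!/usr/bin/env bash
# Illustrative mapping of file extension to chunker, based on the list above.
file="$1"
ext="${file##*.}"
case "${ext,,}" in
  pdf)                 echo "DocAnalysisChunker (Document Intelligence)" ;;
  docx|pptx)           echo "DocAnalysisChunker (requires Document Intelligence API 4.0)" ;;
  bmp|png|jpeg|tiff)   echo "DocAnalysisChunker (OCR)" ;;
  txt|md|json|csv)     echo "LangChainChunker" ;;
  vtt)                 echo "TranscriptionChunker" ;;
  xlsx)                echo "SpreadsheetChunker" ;;
  *)                   echo "No chunker listed for .${ext}" ;;
esac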

How to deploy the data ingestion service

Prerequisites

Provision the infrastructure first by following the instructions in the GPT-RAG repository. This ensures all required Azure resources (e.g., Container App, Storage, AI Search) are in place before deploying the service.

Software prerequisites

The machine used to customize and/or deploy the service should have the tooling used in the deployment steps below installed, such as the Azure CLI (az) and, if you provisioned with azd, the Azure Developer CLI.

Permissions requirements

To customize the service, your user should have the following roles:

Resource | Role | Description
App Configuration Store | App Configuration Data Owner | Full control over configuration settings
Container Registry | AcrPush | Push and pull container images
AI Search Service | Search Index Data Contributor | Read and write index data
Storage Account | Storage Blob Data Contributor | Read and write blob data
Cosmos DB | Cosmos DB Built-in Data Contributor | Read and write documents in Cosmos DB
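
These roles can be granted with the Azure CLI. A sketch with placeholder values (the principal ID, subscription, resource group, and resource name below are not taken from this guide):

# Example: grant App Configuration Data Owner to your user; all bracketed values are placeholders.
az role assignment create \
  --assignee "<principal-id>" \
  --role "App Configuration Data Owner" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.AppConfiguration/configurationStores/<app-config-name>"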

To deploy the service, assign these roles to your user or service principal:

Resource | Role | Description
App Configuration Store | App Configuration Data Reader | Read config
Container Registry | AcrPush | Push images
Azure Container App | Azure Container Apps Contributor | Manage Container Apps

Ensure the deployment identity has these roles at the correct scope (subscription or resource group).
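
To check what the deployment identity already has at resource-group scope, you can list its role assignments (placeholder values shown):

# List existing role assignments for the deployment identity in the target resource group.
az role assignment list \
  --assignee "<principal-id>" \
  --resource-group "<resource-group>" \
  --output table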

Deployment steps

Make sure you're logged in to Azure before anything else:

az login

Clone this repository.

If you used azd provision

Just run:

azd env refresh
azd deploy 

[!IMPORTANT] Make sure you use the same subscription, resource group, environment name, and location from azd provision.
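
If you are unsure which environment and values were used during provisioning, azd can show them before you deploy:

# Inspect existing azd environments and the values stored for the selected one.
azd env list
azd env get-values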

If you did not use azd provision

You need to set the App Configuration endpoint and run the deploy script.

Bash (Linux/macOS):

export APP_CONFIG_ENDPOINT="https://<your-app-config-name>.azconfig.io"
./scripts/deploy.sh

PowerShell (Windows):

$env:APP_CONFIG_ENDPOINT = "https://<your-app-config-name>.azconfig.io"
.\scripts\deploy.ps1
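
If you don't have the endpoint at hand, it can be read from the App Configuration resource with the Azure CLI (the resource name and resource group below are placeholders):

az appconfig show --name "<your-app-config-name>" --resource-group "<resource-group>" --query endpoint --output tsv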

Previous Releases

[!NOTE]
For earlier versions, use the corresponding release in the GitHub repository (e.g., v1.0.0 for the initial version).

🤝 Contributing

We appreciate contributions! See CONTRIBUTING.md for guidelines on the Contributor License Agreement (CLA), code of conduct, and submitting pull requests.

Trademarks

This project may contain trademarks or logos. Authorized use of Microsoft trademarks or logos must follow Microsoft’s Trademark & Brand Guidelines. Modified versions must not imply sponsorship or cause confusion. Third-party trademarks are subject to their own policies.
