
ai-chunking
Stars: 67

README:
A powerful Python library for semantic document chunking and enrichment using AI. This library provides intelligent document chunking capabilities with various strategies to split text while preserving semantic meaning, particularly useful for processing markdown documentation.
Multiple chunking strategies:
- Recursive Text Splitting: Hierarchical document splitting with configurable chunk sizes
- Section-based Semantic Chunking: Structure-aware semantic splitting using section markers
- Base Chunking: Extensible base implementation for custom chunking strategies
Key Benefits:
- Preserve semantic meaning across chunks
- Configurable chunk sizes and overlap
- Support for various text formats
- Easy to extend with custom chunking strategies
The RecursiveTextSplitter uses a hierarchical approach to split documents at natural breakpoints while maintaining context. It works by recursively splitting text with a list of separators in order of priority (headings, paragraphs, sentences, words).
- Use when: You have structured text with clear headings and sections and want simple, reliable chunking.
- Benefits: Fast, deterministic, maintains hierarchical structure.
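The separator-priority idea can be sketched in plain Python. This is a simplified illustration only, not the library's implementation; the function name and separator handling are assumptions for the sketch:

```python
# Simplified sketch of recursive splitting by separator priority.
# Illustrative only -- not ai-chunking's actual implementation.

def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ")) -> list[str]:
    """Split text into chunks no longer than chunk_size, trying
    higher-priority separators (paragraphs, lines, words) first."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > chunk_size:
                        # A single part can still be too long; recurse so
                        # lower-priority separators get a chance.
                        chunks.extend(recursive_split(part, chunk_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator matched: fall back to hard character slicing.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "First paragraph.\n\nSecond paragraph that is a bit longer.\n\nThird."
chunks = recursive_split(text, chunk_size=40)
print(chunks)
```

Because paragraph breaks are tried before line breaks and spaces, the sketch keeps whole paragraphs together whenever they fit within the size budget.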
The SemanticTextChunker uses embeddings to find natural semantic breakpoints in text, creating chunks that preserve meaning regardless of formatting or explicit structure markers.
- Use when: Working with documents without clear section markers or when meaning preservation is critical.
- Benefits: Creates cohesive chunks based on semantic similarity rather than arbitrary length.
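The core breakpoint idea can be illustrated with cosine similarity over sentence embeddings. The tiny hand-made vectors below are placeholders so the example is self-contained; the real chunker uses model embeddings (OpenAI's text-embedding-3-small), and the helper names here are illustrative:

```python
# Toy illustration of similarity-based breakpoint detection.
# Placeholder vectors stand in for real model embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def split_on_breakpoints(sentences, embeddings, threshold=0.85):
    """Start a new chunk wherever the similarity between consecutive
    sentence embeddings drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats meow.", "Stocks fell today."]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]  # placeholder vectors
print(split_on_breakpoints(sentences, embeddings))
```

The two cat sentences have highly similar vectors and land in one chunk; the topic shift to stocks falls below the threshold and starts a new one.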
The SectionBasedSemanticChunker processes documents by first identifying logical sections, then creating semantically meaningful chunks within each section. It enriches chunks with section and document summaries, providing rich context.
- Use when: You need deeper semantic understanding of document structure and relationships between sections.
- Benefits: Preserves document structure, adds rich metadata including section summaries and contextual relationships.
The AutoAIChunker is the most advanced chunker; it uses LLMs to intelligently analyze document structure, creating optimal chunks with rich metadata and summaries.
- Use when: You need maximum semantic understanding and are willing to trade processing time for quality.
- Benefits: Superior chunk quality, rich metadata generation, deep semantic understanding.
```bash
pip install ai-chunking
```
To use the semantic chunking capabilities, you need to set the following environment variable:
```bash
# Required for semantic chunkers (SemanticTextChunker, SectionBasedSemanticChunker, AutoAIChunker)
export OPENAI_API_KEY=your_openai_api_key_here
```
You can also set this environment variable in your Python code:
```python
import os

os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"
```
Each chunker type supports processing multiple documents together:
```python
from ai_chunking import RecursiveTextSplitter

chunker = RecursiveTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

# List of markdown files to process
files = ['document1.md', 'document2.md', 'document3.md']

# Process all documents at once
all_chunks = chunker.chunk_documents(files)
print(f"Processed {len(files)} documents into {len(all_chunks)} chunks")
```
```python
from ai_chunking import SectionBasedSemanticChunker, SemanticTextChunker

# Choose the chunker based on your needs
chunker = SemanticTextChunker(
    chunk_size=1000,
    similarity_threshold=0.85
)

# Or use the more advanced section-based chunker
# chunker = SectionBasedSemanticChunker(
#     min_chunk_size=100,
#     max_chunk_size=1000
# )

# List of files to process
markdown_files = ['user_guide.md', 'api_reference.md', 'tutorials.md']

# Process all documents, getting enriched chunks with metadata
all_chunks = chunker.chunk_documents(markdown_files)

# Access metadata for each chunk
for i, chunk in enumerate(all_chunks):
    print(f"Chunk {i}:")
    print(f"  Source: {chunk.metadata.get('filename')}")
    print(f"  Tokens: {chunk.metadata.get('tokens_count')}")
    if 'summary' in chunk.metadata:
        print(f"  Summary: {chunk.metadata.get('summary')}")
    print(f"  Text snippet: {chunk.text[:100]}...\n")
```
You can create your own markdown-specific chunking strategy:
```python
import re

from ai_chunking import BaseChunker

class MarkdownChunker(BaseChunker):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Zero-width lookahead splits *before* each heading, so the
        # '#' markers stay attached to their sections
        self.heading_pattern = re.compile(r'^(?=#{1,6}\s)', re.MULTILINE)

    def chunk_documents(self, files: list[str]) -> list[str]:
        chunks = []
        for file in files:
            with open(file, 'r') as f:
                content = f.read()
            # Split on markdown headings while preserving them
            sections = self.heading_pattern.split(content)
            # Remove empty sections and trim whitespace
            chunks.extend(section.strip() for section in sections if section.strip())
        return chunks

# Usage
chunker = MarkdownChunker()
chunks = chunker.chunk_documents(['README.md', 'CONTRIBUTING.md'])
```
Each chunker accepts different configuration parameters:
RecursiveTextSplitter:
- `chunk_size`: Maximum size of each chunk (default: 500)
- `chunk_overlap`: Number of characters to overlap between chunks (default: 50)
- `separators`: List of separators to use for splitting (default: `["\n\n", "\n", " ", ""]`)
SemanticTextChunker:
- `chunk_size`: Maximum chunk size in tokens (default: 1024)
- `min_chunk_size`: Minimum chunk size in tokens (default: 128)
- `similarity_threshold`: Threshold for semantic similarity when determining breakpoints (default: 0.9)
The chunker uses OpenAI's text-embedding-3-small model to determine semantic breakpoints in the text. Chunks that are too large are split further using recursive semantic chunking; chunks that are too small are merged with adjacent chunks when possible.
Each chunk includes metadata:
- Source filename
- Token count
- Source path
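The split-large / merge-small normalization can be sketched in plain Python. This is a simplified illustration using character counts rather than the library's token counts, and the function name is hypothetical:

```python
# Sketch of chunk-size normalization: merge undersized chunks into
# their neighbor, hard-split oversized ones. Illustrative only --
# the library works in tokens and splits semantically, not by slicing.

def normalize_chunks(chunks, min_size=128, max_size=1024):
    # Pass 1: merge chunks below min_size into the previous chunk.
    merged = []
    for chunk in chunks:
        if merged and len(chunk) < min_size:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # Pass 2: hard-split anything still above max_size.
    result = []
    for chunk in merged:
        while len(chunk) > max_size:
            result.append(chunk[:max_size])
            chunk = chunk[max_size:]
        result.append(chunk)
    return result
```

With small limits for demonstration, `normalize_chunks(["abcdefghijkl", "ab", "hello"], min_size=5, max_size=10)` first merges the tiny `"ab"` into its neighbor, then slices the oversized result back under the maximum.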
SectionBasedSemanticChunker:
- `min_section_chunk_size`: Minimum size for section chunks in tokens (default: 5000)
- `max_section_chunk_size`: Maximum size for section chunks in tokens (default: 50000)
- `min_chunk_size`: Minimum size for individual chunks in tokens (default: 128)
- `max_chunk_size`: Maximum size for individual chunks in tokens (default: 1000)
- `page_content_similarity_threshold`: Threshold for content similarity between pages (default: 0.8)
The chunker uses OpenAI's text-embedding-3-small embeddings with dynamic similarity thresholds:
- For chunks ≤ 500 tokens: 80% threshold (more aggressive splitting)
- For chunks 501-1000 tokens: 85% threshold (moderate splitting)
- For chunks > 1000 tokens: 90% threshold (less aggressive splitting)
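The tiers above map directly onto a small selection function (the function name is illustrative, not part of the library's API):

```python
def similarity_threshold(token_count: int) -> float:
    """Pick the similarity threshold tier from chunk size:
    smaller chunks get a lower threshold, i.e. more aggressive splitting."""
    if token_count <= 500:
        return 0.80
    if token_count <= 1000:
        return 0.85
    return 0.90
```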
Each chunk includes rich metadata:
- Unique chunk ID
- Text content and token count
- Page numbers
- Summary with metadata
- Global context
- Section and document summaries
- Previous/Next chunk summaries
- List of questions the excerpt can answer
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a new branch for your feature
- Write tests for your changes
- Submit a pull request
For more details, see our Contributing Guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- Issue Tracker: GitHub Issues
- Documentation: Coming soon
If you use this software in your research, please cite:
```bibtex
@software{ai_chunking2024,
  title = {AI Chunking: A Python Library for Semantic Document Processing},
  author = {Desai, Amey},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/nexla-opensource/ai-chunking}
}
```