exstruct
Excel to structured JSON (tables, shapes, charts) for LLM/RAG pipelines
Stars: 116
ExStruct is an Excel structured extraction engine that reads Excel workbooks and outputs structured data as JSON, including cells, table candidates, shapes, charts, smartart, merged cell ranges, print areas/views, auto page-break areas, and hyperlinks. It offers different output modes, formula map extraction, table detection tuning, CLI rendering options, and graceful fallback in case Excel COM is unavailable. The tool is designed to fit LLM/RAG pipelines and provides benchmark reports for accuracy and utility. It supports various formats like JSON, YAML, and TOON, with optional extras for rendering and full extraction targeting Windows + Excel environments.
README:
ExStruct reads Excel workbooks and outputs structured data (cells, table candidates, shapes, charts, smartart, merged cell ranges, print areas/views, auto page-break areas, hyperlinks) as JSON by default, with optional YAML/TOON formats. It targets both COM/Excel environments (rich extraction) and non-COM environments (cells + table candidates + print areas), with tunable detection heuristics and multiple output modes to fit LLM/RAG pipelines.
- Excel → Structured JSON: cells, shapes, charts, smartart, table candidates, print areas/views, and auto page-break areas per sheet.
- Output modes: `light` (cells + table candidates + print areas; no COM, shapes/charts empty), `standard` (texted shapes + arrows, charts, smartart, merged cell ranges, print areas), `verbose` (all shapes with width/height, charts with size, merged cell ranges, print areas). Verbose also emits cell hyperlinks and `colors_map`. Size output is flag-controlled.
- Formula map extraction: emits `formulas_map` (formula string → cell coordinates) via openpyxl/COM; enabled by default in `verbose` or via `include_formulas_map`.
- Auto page-break export (COM only): capture Excel-computed auto page breaks and write per-area JSON/YAML/TOON when requested (the CLI option appears only when COM is available).
- Formats: JSON (compact by default, `--pretty` available), YAML, TOON (optional dependencies).
- Table detection tuning: adjust heuristics at runtime via the API.
- CLI rendering (Excel required): optional PDF and per-sheet PNGs.
- Graceful fallback: if Excel COM is unavailable, extraction falls back to cells + table candidates without crashing.
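As an example of consuming the formula map, the `formulas_map` described above (formula string → cell coordinates) can be queried with plain Python. This is a sketch over inline sample data; the exact coordinate format in real output is an assumption here:

```python
# Hypothetical excerpt of a formulas_map: formula string -> cell coordinates
# (the value format is an assumption for illustration).
formulas_map = {
    "=SUM(C4:C9)": ["C10"],
    "=AVERAGE(C4:C9)": ["C11", "D11"],
}

def cells_for_formula(fmap, needle):
    """Return all cells whose formula contains the given substring."""
    return sorted(
        cell
        for formula, cells in fmap.items()
        if needle in formula
        for cell in cells
    )

print(cells_for_formula(formulas_map, "SUM"))  # cells whose formula uses SUM
```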
This repository includes benchmark reports focused on RAG/LLM preprocessing of Excel documents.
We track two perspectives: (1) core extraction accuracy and (2) reconstruction utility for downstream structure queries (RUB).
See benchmark/REPORT.md for the working summary and benchmark/public/REPORT.md for the public bundle.
Current results are based on n=12 cases and will be expanded.
```shell
pip install exstruct
```

Optional extras:
- YAML: `pip install pyyaml`
- TOON: `pip install python-toon`
- Rendering (PDF/PNG): Excel + `pip install pypdfium2 pillow`
- All extras at once: `pip install exstruct[yaml,toon,render]`

Platform note:
- Full extraction (shapes/charts) targets Windows + Excel (COM via xlwings). On other platforms, use `mode=light` to get cells + `table_candidates`.
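One way to act on this platform note is to pick the mode up front; a minimal sketch (the `pick_mode` helper is hypothetical, while `light` is exstruct's documented non-COM mode):

```python
import sys

def pick_mode(platform: str = sys.platform) -> str:
    """Choose an extraction mode: rich modes need Excel COM (Windows only)."""
    # "standard"/"verbose" use COM when available; "light" works everywhere.
    return "standard" if platform == "win32" else "light"

# wb = extract("input.xlsx", mode=pick_mode())  # exstruct call, shown later in this README
```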
```shell
exstruct input.xlsx > output.json                        # compact JSON to stdout (default)
exstruct input.xlsx -o out.json --pretty                 # pretty JSON to a file
exstruct input.xlsx --format yaml                        # YAML (needs pyyaml)
exstruct input.xlsx --format toon                        # TOON (needs python-toon)
exstruct input.xlsx --sheets-dir sheets/                 # split per sheet in chosen format
exstruct input.xlsx --auto-page-breaks-dir auto_areas/   # COM only; option appears when available
exstruct input.xlsx --print-areas-dir areas/             # split per print area (if any)
exstruct input.xlsx --mode light                         # cells + table candidates only
exstruct input.xlsx --pdf --image                        # PDF and PNGs (Excel required)
```

Auto page-break exports are available via API and CLI when Excel/COM is available; the CLI exposes `--auto-page-breaks-dir` only in COM-capable environments.
Run directly without installation:

```shell
uvx --from 'exstruct[mcp]' exstruct-mcp --root C:\data --log-file C:\logs\exstruct-mcp.log --on-conflict rename
```

Benefits:
- No `pip install` required
- Automatic dependency management
- Environment isolation
- Easy version pinning: `uvx --from 'exstruct[mcp]==0.4.4' exstruct-mcp`

Alternatively, install with pip:

```shell
pip install exstruct[mcp]
exstruct-mcp --root C:\data --log-file C:\logs\exstruct-mcp.log --on-conflict rename
```

Available tools: `exstruct_extract`, `exstruct_patch`, `exstruct_read_json_chunk`, `exstruct_validate_input`
Notes:
- In MCP, `exstruct_extract` defaults to `options.alpha_col=true` (column keys: `A`, `B`, ...). Set `options.alpha_col=false` for legacy 0-based numeric string keys.
- Logs go to stderr (and optionally `--log-file`) to avoid contaminating stdio responses.
- On Windows with Excel, standard/verbose can use COM for richer extraction. On non-Windows, COM is unavailable and extraction uses openpyxl-based fallbacks.
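Alpha column keys follow spreadsheet column naming, so consumers can normalize both key styles to 0-based indices; the helper below is illustrative, not part of exstruct:

```python
def col_to_index(key: str) -> int:
    """Convert an alpha column key ("A", "B", ..., "AA") to a 0-based index."""
    if key.isdigit():  # legacy options.alpha_col=false keys are numeric strings
        return int(key)
    idx = 0
    for ch in key.upper():
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx - 1

print(col_to_index("A"), col_to_index("Z"), col_to_index("AA"))  # 0 25 26
```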
MCP Setup Guide for Each AI Agent:
```python
from pathlib import Path

from exstruct import extract, export, set_table_detection_params

# Tune table detection (optional)
set_table_detection_params(table_score_threshold=0.3, density_min=0.04)

# Extract with modes: "light", "standard", "verbose"
wb = extract("input.xlsx", mode="standard")
export(wb, Path("out.json"), pretty=False)  # compact JSON

# Model helpers: iterate, index, and serialize directly
first_sheet = wb["Sheet1"]          # __getitem__ access
for name, sheet in wb:              # __iter__ yields (name, SheetData)
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True)    # WorkbookData -> file (by extension)
first_sheet.save("sheet.json")      # SheetData -> file (by extension)
print(first_sheet.to_yaml())        # YAML text (requires pyyaml)

# ExStructEngine: per-instance options (nested configs)
from exstruct import (
    DestinationOptions,
    ExStructEngine,
    FilterOptions,
    FormatOptions,
    OutputOptions,
    StructOptions,
    export_auto_page_breaks,
)

engine = ExStructEngine(
    options=StructOptions(mode="verbose"),  # verbose includes hyperlinks by default
    output=OutputOptions(
        format=FormatOptions(pretty=True),
        filters=FilterOptions(include_shapes=False),  # drop shapes in output
        destinations=DestinationOptions(sheets_dir=Path("out_sheets")),  # also write per-sheet files
    ),
)
wb2 = engine.extract("input.xlsx")
engine.export(wb2, Path("out_filtered.json"))

# Enable hyperlinks in standard mode
engine_links = ExStructEngine(options=StructOptions(mode="standard", include_cell_links=True))
with_links = engine_links.extract("input.xlsx")

# Export per print area (if print areas exist)
from exstruct import export_print_areas_as

export_print_areas_as(wb, "areas", fmt="json", pretty=True)

# Auto page-break extraction/output (COM only; raises if no auto breaks exist)
engine_auto = ExStructEngine(
    output=OutputOptions(
        destinations=DestinationOptions(auto_page_breaks_dir=Path("auto_areas"))
    )
)
wb_auto = engine_auto.extract("input.xlsx")  # includes SheetData.auto_print_areas
engine_auto.export(wb_auto, Path("out_with_auto.json"))  # also writes auto_areas/*
export_auto_page_breaks(wb_auto, "auto_areas", fmt="json", pretty=True)  # manual writer
```

Note (non-COM environments): If Excel COM is unavailable, extraction still runs and returns cells + `table_candidates`; shapes/charts will be empty.
```python
from exstruct import set_table_detection_params

set_table_detection_params(
    table_score_threshold=0.35,  # increase to be stricter
    density_min=0.05,
    coverage_min=0.2,
    min_nonempty_cells=3,
)
```

Use higher thresholds to reduce false positives; lower them if true tables are missed.
Modes:
- `light`: cells + table candidates (no COM needed).
- `standard`: texted shapes + arrows, charts (COM if available), merged cell ranges, table candidates. Hyperlinks are off unless `include_cell_links=True`.
- `verbose`: all shapes (with width/height), charts, merged cell ranges, table candidates, cell hyperlinks, and `colors_map`.

Error handling:
- Excel COM unavailable → falls back to cells + table candidates; shapes/charts empty.
- Shape extraction failure → logs a warning, still returns cells + table candidates.
- The CLI prints errors to stdout/stderr and returns non-zero on failures.
Requires Excel and pypdfium2.

```shell
exstruct input.xlsx --pdf --image --dpi 144
```

Creates `<output>.pdf` and `<output>_images/` PNGs per sheet.
To show how well exstruct can structure Excel, we parse a workbook that combines three elements on one sheet and share an AI reasoning benchmark that uses the JSON output.
- Table (sales data)
- Line chart
- Flowchart built only with shapes
(Screenshot below is the actual sample Excel sheet)
Sample workbook: sample/sample.xlsx
This sample Excel contains:
| Month | Product A | Product B | Product C |
|---|---|---|---|
| Jan-25 | 120 | 80 | 60 |
| Feb-25 | 135 | 90 | 64 |
| Mar-25 | 150 | 100 | 70 |
| Apr-25 | 170 | 110 | 72 |
| May-25 | 160 | 120 | 75 |
| Jun-25 | 180 | 130 | 80 |
- Title: Sales Data
- Series: Product A / Product B / Product C (six months)
- Y axis: 0–200
The sheet includes this flow:
- Start / End
- Format check
- Loop (items remaining?)
- Error handling
- Yes/No decision for sending email
Below is a shortened JSON output example from parsing this Excel workbook.

```json
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [
        {
          "r": 3,
          "c": {
            "1": "月",
            "2": "製品A",
            "3": "製品B",
            "4": "製品C"
          }
        },
        ...
      ],
      "shapes": [
        {
          "id": 1,
          "text": "開始",
          "l": 148,
          "t": 220,
          "kind": "shape",
          "type": "AutoShape-FlowchartProcess"
        },
        {
          "id": 2,
          "text": "入力データ読み込み",
          "l": 132,
          "t": 282,
          "kind": "shape",
          "type": "AutoShape-FlowchartProcess"
        },
        {
          "l": 193,
          "t": 246,
          "kind": "arrow",
          "begin_arrow_style": 1,
          "end_arrow_style": 2,
          "begin_id": 1,
          "end_id": 2,
          "direction": "N"
        },
        ...
      ],
      "charts": [
        {
          "name": "Chart 1",
          "chart_type": "Line",
          "title": "売上データ",
          "y_axis_range": [0.0, 200.0],
          "series": [
            {
              "name": "製品A",
              "name_range": "Sheet1!$C$3",
              "x_range": "Sheet1!$B$4:$B$9",
              "y_range": "Sheet1!$C$4:$C$9"
            },
            ...
          ],
          "l": 377,
          "t": 25
        }
      ],
      "table_candidates": ["B3:E9"]
    }
  }
}
```

Below is the Markdown reconstruction of the Excel workbook. The table, chart, and flowchart are all represented.
---
## 📊 Sales Data Table
| Month | Product A | Product B | Product C |
| ---------- | --------- | --------- | --------- |
| 2025-01-01 | 120 | 80 | 60 |
| 2025-02-01 | 135 | 90 | 64 |
| 2025-03-01 | 150 | 100 | 70 |
| 2025-04-01 | 170 | 110 | 72 |
| 2025-05-01 | 160 | 120 | 75 |
| 2025-06-01 | 180 | 130 | 80 |
---
## 📈 Sales Data (Line Chart)
- Chart title: **売上データ (Sales Data)**
- Chart type: Line
- Y-axis range: 0 to 200
- Data series:
- Product A: 120 → 135 → 150 → 170 → 160 → 180
- Product B: 80 → 90 → 100 → 110 → 120 → 130
- Product C: 60 → 64 → 70 → 72 → 75 → 80
---
## 🔄 Process Flow (Mermaid Flowchart)
```mermaid
flowchart TD
A[Start]
B[Load input data]
C{Is format valid?}
D[Show error]
E[Process one item]
F{Items remaining?}
G[Generate output]
H{Send email?}
I[Send email]
J[Finish]
A --> B
B --> C
C -->|yes| D
C --> H
D --> E
E --> F
F --> G
G -->|yes| I
G -->|no| J
H --> J
I --> J
```

(Truncated for brevity)
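The `rows` structure in the JSON sample above (row index `r`, column-keyed dict `c`) maps to such a Markdown table mechanically; a minimal stdlib sketch over an inline excerpt (the translated header/values and the first-row-as-header assumption are illustrative):

```python
# Inline excerpt mimicking the "rows" structure from the sample JSON above.
rows = [
    {"r": 3, "c": {"1": "Month", "2": "Product A", "3": "Product B", "4": "Product C"}},
    {"r": 4, "c": {"1": "Jan-25", "2": "120", "3": "80", "4": "60"}},
    {"r": 5, "c": {"1": "Feb-25", "2": "135", "3": "90", "4": "64"}},
]

def rows_to_markdown(rows):
    """Render exstruct-style rows as a Markdown table (first row treated as header)."""
    keys = sorted({k for row in rows for k in row["c"]}, key=int)
    lines = []
    for i, row in enumerate(rows):
        cells = [str(row["c"].get(k, "")) for k in keys]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:  # header separator after the first row
            lines.append("|" + "---|" * len(keys))
    return "\n".join(lines)

print(rows_to_markdown(rows))
```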
```json
{
  "book_name": "en_sf425.xlsx",
  "sheets": {
    "FFR": {
      "rows": [
        { "r": 1, "c": { "0": "FEDERAL FINANCIAL REPORT" } },
        { "r": 3, "c": { "0": "(Follow form instructions)" } },
        {
          "r": 4,
          "c": {
            "0": "1. Federal Agency and Organizational Element",
            "5": "2. Federal Grant or Other Identifying Number Assigned by Federal Agency",
            "20": "Page",
            "23": " of"
          }
        },
        ...
      ],
      "shapes": [
        {
          "id": 1,
          "text": "Not Required by EPA",
          "l": 575,
          "t": 396,
          "kind": "shape",
          "type": "AutoShape-Rectangle"
        }
      ],
      "print_areas": [{ "r1": 1, "c1": 0, "r2": 66, "c2": 23 }],
      "merged_cells": {
        "schema": ["r1", "c1", "r2", "c2", "v"],
        "items": [
          [34, 15, 34, 23, " "],
          [56, 10, 57, 17, "Federal Share Calculation"],
          [18, 10, 18, 23, " "],
          [15, 0, 15, 1, " "],
          ...
        ]
      }
    }
  }
}
```
`merged_cells` changed from a list of objects to a schema/items structure in v0.3.5 (a breaking change for JSON consumers).
Old format (<= v0.3.2):
```json
"merged_cells": [
  { "r1": 34, "c1": 15, "r2": 34, "c2": 23, "v": " " },
  { "r1": 56, "c1": 10, "r2": 57, "c2": 17, "v": "Federal Share Calculation" }
]
```

New format (v0.3.5+):
```json
"merged_cells": {
  "schema": ["r1", "c1", "r2", "c2", "v"],
  "items": [
    [34, 15, 34, 23, " "],
    [56, 10, 57, 17, "Federal Share Calculation"]
  ]
}
```

Migration example (support both during transition):
```python
def normalize_merged_cells(raw):
    """Accept both the old (list-of-dicts) and new (schema/items) merged_cells formats."""
    schema = ["r1", "c1", "r2", "c2", "v"]
    if isinstance(raw, list):  # old format (<= v0.3.2)
        items = [[d.get(k, " ") for k in schema] for d in raw]
        return {"schema": schema, "items": items}
    if isinstance(raw, dict) and "schema" in raw and "items" in raw:
        return raw  # already new format (v0.3.5+)
    return None
```

<!-- This is a structured interpretation, not an official reproduction. -->
# **Federal Financial Report (FFR)**
_(Follow form instructions)_
---
## **1. Federal Agency and Organizational Element**
**United States Environmental Protection Agency**
## **2. Federal Grant or Other Identifying Number Assigned by Federal Agency**
**Page 1 of pages**
---
## **3. Recipient Organization**
_(Name and complete address including Zip code)_
---
## **4. Recipient Identifiers**
- **4a. DUNS Number**
- **4b. EIN**
- **5. Recipient Account Number or Identifying Number**
_(To report multiple grants, use FFR Attachment)_
- **6. Report Type**
- □ Quarterly
- □ Semi-Annual
- □ Annual
- □ Final
- **7. Basis of Accounting**
- □ Cash
- □ Accrual
---
## **8. Project/Grant Period**
- **From:** (Month, Day, Year)
- **To:** (Month, Day, Year)
## **9. Reporting Period End Date**
(Month, Day, Year)
---
# **10. Transactions**
_(Use lines a–c for single or multiple grant reporting)_
### **Federal Cash**
_(To report multiple grants, also use FFR Attachment)_
- **a. Cash Receipts**
- **b. Cash Disbursements**
- **c. Cash on Hand** (line a minus b)
_(Use lines d–o for single grant reporting)_
### **Federal Expenditures and Unobligated Balance**
- **d. Total Federal funds authorized**
- **e. Federal share of expenditures**
- **f. Federal share of unliquidated obligations** — 0
- **g. Total Federal share** (sum of lines e and f) — 0
- **h. Unobligated balance of Federal funds** (line d minus g) — 0
### **Recipient Share**
- **i. Total recipient share required**
- **j. Recipient share of expenditures**
- **k. Remaining recipient share to be provided** (line i minus j) — 0
### **Program Income**
- **l. Total Federal program income earned**
- **m. Program income expended (deduction alternative)**
- **n. Program income expended (addition alternative)**
- **o. Unexpended program income** (line l minus line m or line n)
---
# **11. Indirect Expense**
| Type | Rate | Period From | Period To | Base | Amount Charged | Federal Share |
| -------------- | ---- | ----------- | --------- | ---- | -------------- | ------------- |
| | | | | | | |
| **g. Totals:** | | | | 0 | 0 | 0 |
---
# **12. Remarks**
Attach any explanations deemed necessary or information required by the Federal sponsoring agency in compliance with governing legislation.
---
# **13. Certification**
By signing this report, I certify that it is true, complete, and accurate to the best of my knowledge.
I am aware that any false, fictitious, or fraudulent information may subject me to criminal, civil, or administrative penalties.
_(U.S. Code, Title 18, Section 1001)_
- **a. Typed or Printed Name and Title of Authorized Certifying Official**
- **b. Signature of Authorized Certifying Official**
- **c. Telephone** (Area code, number, extension)
- **d. Email address**
- **e. Date Report Submitted** (Month, Day, Year)
---
# **14. Agency Use Only**
**Federal Share Calculation**
---
### **Standard Form 425**
**OMB Approval Number:** 0348-0061
**Expiration Date:** 02/28/2015
---
# **Paperwork Burden Statement**
According to the Paperwork Reduction Act, as amended, no persons are required to respond to a collection of information unless it displays a valid OMB Control Number.
The valid OMB control number for this information collection is **0348-0061**.
Public reporting burden for this collection of information is estimated to average **1.5 hours per response**, including time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information.
Send comments regarding the burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to:
**Office of Management and Budget**
Paperwork Reduction Project (0348-0060)
Washington, DC 20503
---
### **Shape in Original Document**
- _Not Required by EPA_

From this we can see:
exstruct's JSON is already in a format that AI can read and reason over directly.
Other LLM inference samples using this library can be found in the following directory:
This benchmark confirms exstruct can:
- Parse tables, charts, and shapes (flowcharts) simultaneously
- Convert the semantic structure of Excel into JSON
- Let AI/LLMs read that JSON directly and reconstruct the workbook contents
In short, exstruct = “an engine that converts Excel into a format AI can understand.”
- Default JSON is compact to reduce tokens; use `--pretty` or `pretty=True` when readability matters.
- Field `table_candidates` replaces `tables`; adjust downstream consumers accordingly.
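For downstream consumers, the A1-style range strings in `table_candidates` (e.g. `"B3:E9"` in the sample output) can be parsed with the standard library; the `parse_range` helper below is illustrative, not part of exstruct:

```python
import re

def parse_range(a1: str):
    """Parse an A1-style range like "B3:E9" into ((row1, col1), (row2, col2)), 1-based."""
    def cell(ref):
        m = re.fullmatch(r"([A-Z]+)(\d+)", ref)
        col = 0
        for ch in m.group(1):
            col = col * 26 + (ord(ch) - ord("A") + 1)
        return int(m.group(2)), col
    first, last = a1.split(":")
    return cell(first), cell(last)

sheet = {"table_candidates": ["B3:E9"]}        # excerpt of exstruct output
for rng in sheet.get("table_candidates", []):  # "tables" no longer exists
    (r1, c1), (r2, c2) = parse_range(rng)
    print(rng, "->", r2 - r1 + 1, "rows x", c2 - c1 + 1, "cols")
```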
ExStruct is used primarily as a library, not a service.
- No official support or SLA is provided
- Long-term stability is prioritized over rapid feature growth
- Forking and internal modification are expected in enterprise use
This project is suitable for teams that:
- need transparency over black-box tools
- are comfortable maintaining internal forks if necessary
- `SheetData.print_areas` holds print areas (cell coordinates) in light/standard/verbose.
- `SheetData.auto_print_areas` holds Excel COM-computed auto page-break areas when auto page-break extraction is enabled (COM only).
- Use `export_print_areas_as(...)` or CLI `--print-areas-dir` to write one file per print area (nothing is written if none exist).
- Use CLI `--auto-page-breaks-dir` (COM only), `DestinationOptions.auto_page_breaks_dir` (preferred), or `export_auto_page_breaks(...)` to write per-auto-page-break files; the API raises `ValueError` if no auto page breaks exist.
- `PrintAreaView` includes rows and table candidates inside the area, plus shapes/charts that overlap the area (size-less shapes are treated as points). `normalize=True` rebases row/col indices to the area origin.
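The `normalize=True` rebasing can be pictured as subtracting the area origin from every coordinate. A stdlib-only sketch of the idea (the `rebase` helper is illustrative, not exstruct's implementation; field names follow the `print_areas` sample above):

```python
def rebase(area, cells):
    """Rebase absolute (r, c) cell coordinates to the print-area origin."""
    r0, c0 = area["r1"], area["c1"]
    return [(r - r0, c - c0) for (r, c) in cells]

area = {"r1": 1, "c1": 0, "r2": 66, "c2": 23}  # print_areas entry from the FFR sample
print(rebase(area, [(1, 0), (4, 5)]))          # [(0, 0), (3, 5)]
```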
- Update generated model docs before building the site: `python scripts/gen_model_docs.py`.
- Build locally with mkdocs + mkdocstrings (dev deps required): `uv run mkdocs serve` or `uv run mkdocs build`.
ExStruct uses a pipeline-based architecture that separates extraction strategy (Backend) from orchestration (Pipeline) and semantic modeling.
→ See: docs/architecture/pipeline.md
If you plan to extend ExStruct internals, please read the contributor architecture guide.
→ docs/contributors/architecture.md
The cell-structure inference logic (cells.py) relies on heuristic rules and Excel-specific behaviors. Full coverage is intentionally not pursued, as exhaustive testing would not reflect real-world reliability.
BSD-3-Clause. See LICENSE for details.
- API Reference (GitHub Pages): https://harumiweb.github.io/exstruct/
- JSON Schemas: see `schemas/` (one file per model); regenerate via `python scripts/gen_json_schema.py`.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for exstruct
Similar Open Source Tools
exstruct
ExStruct is an Excel structured extraction engine that reads Excel workbooks and outputs structured data as JSON, including cells, table candidates, shapes, charts, smartart, merged cell ranges, print areas/views, auto page-break areas, and hyperlinks. It offers different output modes, formula map extraction, table detection tuning, CLI rendering options, and graceful fallback in case Excel COM is unavailable. The tool is designed to fit LLM/RAG pipelines and provides benchmark reports for accuracy and utility. It supports various formats like JSON, YAML, and TOON, with optional extras for rendering and full extraction targeting Windows + Excel environments.
acte
Acte is a framework designed to build GUI-like tools for AI Agents. It aims to address the issues of cognitive load and freedom degrees when interacting with multiple APIs in complex scenarios. By providing a graphical user interface (GUI) for Agents, Acte helps reduce cognitive load and constraints interaction, similar to how humans interact with computers through GUIs. The tool offers APIs for starting new sessions, executing actions, and displaying screens, accessible via HTTP requests or the SessionManager class.
ai-counsel
AI Counsel is a true deliberative consensus MCP server where AI models engage in actual debate, refine positions across multiple rounds, and converge with voting and confidence levels. It features two modes (quick and conference), mixed adapters (CLI tools and HTTP services), auto-convergence, structured voting, semantic grouping, model-controlled stopping, evidence-based deliberation, local model support, data privacy, context injection, semantic search, fault tolerance, and full transcripts. Users can run local and cloud models to deliberate on various questions, ground decisions in reality by querying code and files, and query past decisions for analysis. The tool is designed for critical technical decisions requiring multi-model deliberation and consensus building.
auto-round
AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference. It competes impressively against recent methods without introducing any additional inference overhead. The method adopts sign gradient descent to fine-tune rounding values and minmax values of weights in just 200 steps, often significantly outperforming SignRound with the cost of more tuning time for quantization. AutoRound is tailored for a wide range of models and consistently delivers noticeable improvements.
LightRAG
LightRAG is a repository hosting the code for LightRAG, a system that supports seamless integration of custom knowledge graphs, Oracle Database 23ai, Neo4J for storage, and multiple file types. It includes features like entity deletion, batch insert, incremental insert, and graph visualization. LightRAG provides an API server implementation for RESTful API access to RAG operations, allowing users to interact with it through HTTP requests. The repository also includes evaluation scripts, code for reproducing results, and a comprehensive code structure.
rust-genai
genai is a multi-AI providers library for Rust that aims to provide a common and ergonomic single API to various generative AI providers such as OpenAI, Anthropic, Cohere, Ollama, and Gemini. It focuses on standardizing chat completion APIs across major AI services, prioritizing ergonomics and commonality. The library initially focuses on text chat APIs and plans to expand to support images, function calling, and more in the future versions. Version 0.1.x will have breaking changes in patches, while version 0.2.x will follow semver more strictly. genai does not provide a full representation of a given AI provider but aims to simplify the differences at a lower layer for ease of use.
cellseg_models.pytorch
cellseg-models.pytorch is a Python library built upon PyTorch for 2D cell/nuclei instance segmentation models. It provides multi-task encoder-decoder architectures and post-processing methods for segmenting cell/nuclei instances. The library offers high-level API to define segmentation models, open-source datasets for training, flexibility to modify model components, sliding window inference, multi-GPU inference, benchmarking utilities, regularization techniques, and example notebooks for training and finetuning models with different backbones.
Bindu
Bindu is an operating layer for AI agents that provides identity, communication, and payment capabilities. It delivers a production-ready service with a convenient API to connect, authenticate, and orchestrate agents across distributed systems using open protocols: A2A, AP2, and X402. Built with a distributed architecture, Bindu makes it fast to develop and easy to integrate with any AI framework. Transform any agent framework into a fully interoperable service for communication, collaboration, and commerce in the Internet of Agents.
pixeltable
Pixeltable is a Python library designed for ML Engineers and Data Scientists to focus on exploration, modeling, and app development without the need to handle data plumbing. It provides a declarative interface for working with text, images, embeddings, and video, enabling users to store, transform, index, and iterate on data within a single table interface. Pixeltable is persistent, acting as a database unlike in-memory Python libraries such as Pandas. It offers features like data storage and versioning, combined data and model lineage, indexing, orchestration of multimodal workloads, incremental updates, and automatic production-ready code generation. The tool emphasizes transparency, reproducibility, cost-saving through incremental data changes, and seamless integration with existing Python code and libraries.
candle-vllm
Candle-vllm is an efficient and easy-to-use platform designed for inference and serving local LLMs, featuring an OpenAI compatible API server. It offers a highly extensible trait-based system for rapid implementation of new module pipelines, streaming support in generation, efficient management of key-value cache with PagedAttention, and continuous batching. The tool supports chat serving for various models and provides a seamless experience for users to interact with LLMs through different interfaces.
alphora
Alphora is a full-stack framework for building production AI agents, providing agent orchestration, prompt engineering, tool execution, memory management, streaming, and deployment with an async-first, OpenAI-compatible design. It offers features like agent derivation, reasoning-action loop, async streaming, visual debugger, OpenAI compatibility, multimodal support, tool system with zero-config tools and type safety, prompt engine with dynamic prompts, memory and storage management, sandbox for secure execution, deployment as API, and more. Alphora allows users to build sophisticated AI agents easily and efficiently.
freeGPT
freeGPT provides free access to text and image generation models. It supports various models, including gpt3, gpt4, alpaca_7b, falcon_40b, prodia, and pollinations. The tool offers both asynchronous and non-asynchronous interfaces for text completion and image generation. It also features an interactive Discord bot that provides access to all the models in the repository. The tool is easy to use and can be integrated into various applications.
pocketgroq
PocketGroq is a tool that provides advanced functionalities for text generation, web scraping, web search, and AI response evaluation. It includes features like an Autonomous Agent for answering questions, web crawling and scraping capabilities, enhanced web search functionality, and flexible integration with Ollama server. Users can customize the agent's behavior, evaluate responses using AI, and utilize various methods for text generation, conversation management, and Chain of Thought reasoning. The tool offers comprehensive methods for different tasks, such as initializing RAG, error handling, and tool management. PocketGroq is designed to enhance development processes and enable the creation of AI-powered applications with ease.
Scrapegraph-ai
ScrapeGraphAI is a Python library that uses Large Language Models (LLMs) and direct graph logic to create web scraping pipelines for websites, documents, and XML files. It allows users to extract specific information from web pages by providing a prompt describing the desired data. ScrapeGraphAI supports various LLMs, including Ollama, OpenAI, Gemini, and Docker, enabling users to choose the most suitable model for their needs. The library provides a user-friendly interface through its `SmartScraper` class, which simplifies the process of building and executing scraping pipelines. ScrapeGraphAI is open-source and available on GitHub, with extensive documentation and examples to guide users. It is particularly useful for researchers and data scientists who need to extract structured data from web pages for analysis and exploration.
mLLMCelltype
mLLMCelltype is a multi-LLM consensus framework for automated cell type annotation in single-cell RNA sequencing (scRNA-seq) data. The tool integrates multiple large language models to improve annotation accuracy through consensus-based predictions. It offers advantages over single-model approaches by combining predictions from models like OpenAI GPT-5.2, Anthropic Claude-4.6/4.5, Google Gemini-3, and others. Researchers can incorporate mLLMCelltype into existing workflows without the need for reference datasets.
openai-scala-client
This is a no-nonsense async Scala client for OpenAI API supporting all the available endpoints and params including streaming, chat completion, vision, and voice routines. It provides a single service called OpenAIService that supports various calls such as Models, Completions, Chat Completions, Edits, Images, Embeddings, Batches, Audio, Files, Fine-tunes, Moderations, Assistants, Threads, Thread Messages, Runs, Run Steps, Vector Stores, Vector Store Files, and Vector Store File Batches. The library aims to be self-contained with minimal dependencies and supports API-compatible providers like Azure OpenAI, Azure AI, Anthropic, Google Vertex AI, Groq, Grok, Fireworks AI, OctoAI, TogetherAI, Cerebras, Mistral, Deepseek, Ollama, FastChat, and more.
For similar tasks
exstruct
ExStruct is an Excel structured extraction engine that reads Excel workbooks and outputs structured data as JSON, including cells, table candidates, shapes, charts, smartart, merged cell ranges, print areas/views, auto page-break areas, and hyperlinks. It offers different output modes, formula map extraction, table detection tuning, CLI rendering options, and graceful fallback in case Excel COM is unavailable. The tool is designed to fit LLM/RAG pipelines and provides benchmark reports for accuracy and utility. It supports various formats like JSON, YAML, and TOON, with optional extras for rendering and full extraction targeting Windows + Excel environments.
extractor
Extractor is an AI-powered data extraction library for Laravel that leverages OpenAI's capabilities to effortlessly extract structured data from various sources, including images, PDFs, and emails. It features a convenient wrapper around OpenAI Chat and Completion endpoints, supports multiple input formats, includes a flexible Field Extractor for arbitrary data extraction, and integrates with Textract for OCR functionality. Extractor utilizes JSON Mode from the latest GPT-3.5 and GPT-4 models, providing accurate and efficient data extraction.
NeMo-Guardrails
NeMo Guardrails is an open-source toolkit for easily adding _programmable guardrails_ to LLM-based conversational applications. Guardrails (or "rails" for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.
kor
Kor is a prototype tool designed to help users extract structured data from text using Language Models (LLMs). It generates prompts, sends them to specified LLMs, and parses the output. The tool works with the parsing approach and is integrated with the LangChain framework. Kor is compatible with pydantic v2 and v1, and schema is typed checked using pydantic. It is primarily used for extracting information from text based on provided reference examples and schema documentation. Kor is designed to work with all good-enough LLMs regardless of their support for function/tool calling or JSON modes.
awesome-llm-json
This repository is an awesome list dedicated to resources for using Large Language Models (LLMs) to generate JSON or other structured outputs. It includes terminology explanations, hosted and local models, Python libraries, blog articles, videos, Jupyter notebooks, and leaderboards related to LLMs and JSON generation. The repository covers various aspects such as function calling, JSON mode, guided generation, and tool usage with different providers and models.
tensorzero
TensorZero is an open-source platform that helps LLM applications graduate from API wrappers into defensible AI products. It enables a data-and-learning flywheel for LLMs by unifying inference, observability, optimization, and experimentation. The platform includes a high-performance model gateway, structured schema-based inference, observability tooling, experimentation features, and a data warehouse for analytics. TensorZero Recipes optimize prompts and models, and the platform supports GitOps orchestration for deployment.
stagehand
Stagehand is an AI web browsing framework that simplifies and extends web automation using three simple APIs: act, extract, and observe. It aims to provide a lightweight, configurable framework without complex abstractions, allowing users to automate web tasks reliably. The tool generates Playwright code based on atomic instructions provided by the user, enabling natural language-driven web automation. Stagehand is open source, maintained by the Browserbase team, and supports different models and model providers for flexibility in automation tasks.
azure-ai-document-processing-samples
This repository contains a collection of code samples that demonstrate how to use various Azure AI capabilities to process documents. The samples help engineering teams establish techniques with Azure AI Foundry, Azure OpenAI, Azure AI Document Intelligence, and Azure AI Language services to build solutions for extracting structured data from, classifying, and analyzing documents. The techniques simplify custom model training, improve reliability, and streamline document processing workflows by providing reusable code and patterns that can be easily modified and evaluated for most use cases.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud-native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI, and vLLM. BricksLLM aims to provide enterprise-level infrastructure that can power any LLM production use case. Some use cases for BricksLLM:
- Set LLM usage limits for users on different pricing tiers
- Track LLM usage on a per-user and per-organization basis
- Block or redact requests containing PII
- Improve LLM reliability with failovers, retries, and caching
- Distribute API keys with rate limits and cost limits for internal development/production use cases
- Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai for creating autonomous AI agents. These agents can perform tasks on a schedule or take action in response to events. uAgents are easy to create and manage, and they connect to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs, giving developers control and flexibility at every step. Its core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation, Task, and Meta Memory), Drivers (Prompt, Embedding, Vector Store, Image Generation, Image Query, SQL, Web Scraper, and Conversation Memory Drivers), Engines (Query, Extraction, Summary, Image Generation, and Image Query Engines), and supporting components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers).

