
FDABench

Python 3.10+ · License: MIT · arXiv · Leaderboard · HF Full · HF Lite

FDABench is a benchmark for evaluating data agents' reasoning ability over heterogeneous data in analytical scenarios. It contains 2,007 tasks across different data sources, domains, difficulty levels, and task types. We provide ready-to-use data agent implementations, a DAG-based evaluation system, and an agent-expert collaboration framework for dataset generation.

Overview

FDABench Overview

PUDDING Framework

Key Features

  • Data Agent Implementations: Ready-to-use agents (Planning, Multi-Agent, Reflection, Tool-Use)
  • Comprehensive Evaluation: Accuracy metrics, rubric-based scoring, and performance analytics
  • Multi-Database Support: SQLite, BigQuery, Snowflake, and unstructured data sources
  • Three Task Types: Single-choice, multiple-choice, and open-ended report generation
  • Extensible Framework: Modular base classes for custom agent integration
  • Cost Tracking: Token usage and latency monitoring

Task Types

Type              Description
Single Choice     One correct answer from four options
Multiple Choice   Multiple correct answers allowed
Report            Open-ended analytical report generation

Environment Setup

System Requirements

  • Python: 3.10+
  • OS: Linux, macOS, Windows

Option 1: One-Command Setup

Create the complete environment with all dependencies:

conda env create -f environment.yml
conda activate fdabench

This will:

  • Create a new conda environment named fdabench
  • Install Python 3.11 and all required dependencies
  • Automatically install FDABench in development mode

Option 2: Manual Setup

If you prefer manual installation:

# Create environment
conda create -n fdabench python=3.11
conda activate fdabench

# Install FDABench
pip install -e .

API Configuration

Set up your API keys for LLM access:

# Option 1: Environment variables
export OPENROUTER_API_KEY="your-openrouter-api-key"

# Option 2: Create .env file in project root
echo "OPENROUTER_API_KEY=your-openrouter-api-key" >> .env

Quick Start

After completing the environment setup above, you can immediately start using FDABench with FDABench-Lite:

Database Download

Download the FDABench-Lite database files from Google Drive, extract to your directory, and configure paths in FDABench/utils/database_connection_manager.py (see FDABench-Full Usage for details).

Dataset Loading and Start

HuggingFace Dataset: FDABench now loads data directly from the HuggingFace dataset hub. The FDAbench2026/Fdabench-Lite dataset contains 289 curated test cases across the three task types, ready for immediate use. We also offer FDABench-Full with 2,007 test cases on HuggingFace.

# Activate your environment (if not already active)
conda activate fdabench

# Run your first example - automatically loads from HuggingFace
python examples/run_planning_agent.py

# Run with a specific sample (0-116 available)
python examples/run_planning_agent.py --index 10

# Run with a custom model
python examples/run_planning_agent.py --model "openai/gpt-5" --index 5
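
To browse the raw dataset outside the provided scripts, you can pull it straight from the Hub with the datasets library. A minimal sketch; the "report" config name and "test" split are assumptions, so check the dataset card for the exact names:

# Hedged sketch: load FDABench-Lite directly from HuggingFace.
# The config name "report" and split "test" are assumptions.
from datasets import load_dataset

ds = load_dataset("FDAbench2026/Fdabench-Lite", "report", split="test")
print(len(ds), "samples")
print(ds[0])  # one task record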

FDABench-Full Usage

Database Configuration

FDABench-Full supports multiple database types, including Snowflake, BigQuery, and SQLite. You need to configure database paths and obtain the required data:

1. SQLite Databases

BIRD Dataset: Download from BIRD repository

Spider2-lite Dataset: Download from Spider2 spider-agent-lite

2. Cloud Databases

BigQuery and Snowflake: Follow registration and setup instructions from Spider2 README

3. Unstructured Data

Unstructured Dataset: Download from Google Drive

4. Configure Database Paths

Edit FDABench/utils/database_connection_manager.py and update the configuration:

default_config = {
    # SQLite database paths
    'bird_db_path': "/your/path/to/BIRD_train/train_databases",
    'local_db_path': "/your/path/to/local/databases", 
    'spider1_db_path': "/your/path/to/spider1/databases",
    
    # Cloud database credentials
    'bigquery_credentials_path': "/your/path/to/bigquery-service-account.json",
    'snowflake_config': {
        'account': 'your-snowflake-account',
        'user': 'your-username', 
        'password': 'your-password',
        'warehouse': 'your-warehouse',
        'database': 'your-database'
    }
}

5. Directory Structure

your_databases/
├── BIRD_train/train_databases/
│   ├── california_schools/
│   │   └── california_schools.sqlite  
│   ├── card_games/
│   │   └── card_games.sqlite
│   └── ...
├── spider1_databases/
│   ├── concert_singer.sqlite
│   ├── pets_1.sqlite  
│   └── ...
├── local_databases/
│   └── merchant_data.db
└── credentials/
    └── bigquery-service-account.json

6. Dataset Configuration

HuggingFace Dataset (Default): The benchmark uses the FDAbench2026/Fdabench-Lite dataset from HuggingFace, which includes:

  • 289 curated test cases
  • Three subsets (report, single, multiple)
  • Multiple database types (BIRD, local, Spider2-lite)
  • Various difficulty levels (easy, medium, hard)

Loading Data in Your Code:

from FDABench.utils.test_utils import load_test_data

# Load the first sample (default)
test_data = load_test_data()

# Load a specific sample by index (0-116)
test_data = load_test_data(index=10)

Custom Local Datasets: If you have your own test data, you can still use local JSON files by modifying the load_test_data() function in FDABench/utils/test_utils.py.
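
As a rough illustration of that change, a hypothetical local loader that reads one JSON task per line from a JSONL file (the file name and record layout are assumptions):

import json

def load_test_data(index=0, path="my_tasks.jsonl"):
    # Hypothetical replacement: one JSON task object per line,
    # mirroring the index-based interface of the HuggingFace loader.
    with open(path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]
    return tasks[index]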

Examples

Test different agent workflows with the HuggingFace dataset:

# Planning Agent - Uses step-by-step planning
python examples/run_planning_agent.py                # Default: index 0
python examples/run_planning_agent.py --index 25     # Specific sample

# Multi-Agent - Coordinates multiple specialized agents
python examples/run_multi_agent.py --index 10

# Reflection Agent - Self-improving with reflection
python examples/run_reflection_agent.py --index 50

# Tool-Use Agent - Optimized for tool selection
python examples/run_tooluse_agent.py --index 100

# All agents support the same parameters:
# --index N: Select sample N from the dataset (0-116)
# --model "model_name": Specify the LLM model to use

FDABench with DeepAnalyze

We also provide a ready-to-run benchmarking script that connects DeepAnalyze to FDABench's tasks.

  1. Position the DeepAnalyze project alongside FDABench so the runtime layout looks like:
    /path/to/workspace/
    ├── FDAbench/
    └── DeepAnalyze/
    
  2. Start the DeepAnalyze vLLM server and export its model path and endpoint (or pass them as CLI flags):
    export DEEPANALYZE_MODEL_PATH=/path/to/DeepAnalyze/model/DeepAnalyze-8B
    export DEEPANALYZE_API_URL=http://localhost:8000/v1/chat/completions
  3. Run the benchmark. By default the script pulls FDABench-Lite sample index=0, runs evaluation, and writes metrics to results/test_query_results_deepanalyze.duckdb; the flags below override those defaults:
    python FDABench/examples/test_deepanalyze_adapter.py \
      --index 5 \
      --max_agent_rounds 8 \
      --max_deepanalyze_rounds 25

Useful flags:

  • --input: load custom JSON/JSONL tasks instead of HuggingFace samples.
  • --duckdb_path: specify a custom metrics file.
  • --api_key: override the API key FDABench uses for auxiliary tools.
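
For example, to benchmark a local task file and write metrics to a separate DuckDB file (both file names are illustrative):

python FDABench/examples/test_deepanalyze_adapter.py \
  --input my_tasks.jsonl \
  --duckdb_path results/deepanalyze_custom.duckdb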

Data Agent with Semantic Operator

Data agents integrated with semantic data operators for advanced data processing:

# DocETL Semantic Operator Agent - Uses DocETL operators for document processing
python FDABench/examples/test_planning_agent_docetl_batch.py

# Lotus Semantic Operator Agent - Uses Lotus operators for natural language processing
python FDABench/examples/test_planning_agent_lotus_batch.py

# Palimpzest Semantic Operator Agent - Uses Palimpzest operators for data transformation
python FDABench/examples/test_planning_agent_pz_batch.py

Note: Data agents with semantic operators require additional environment setup. Check the respective environment files:

  • FDABench/examples/docetl_environment.yml
  • FDABench/examples/lotus_environment.yml
  • FDABench/examples/palimpzest_environment.yml

Basic Usage Example

from FDABench.agents.planning_agent import PlanningAgent
from FDABench.evaluation.evaluation_tools import ReportEvaluator
from FDABench.utils.test_utils import load_test_data

# Initialize agent with your preferred model
agent = PlanningAgent(
    model="openai/gpt-5",  # or "deepseek/deepseek-chat-v3"
    api_key="your-api-key"
)

# Load test data from HuggingFace dataset
test_data = load_test_data(index=0)  # Load first sample
print(f"Processing task: {test_data['task_id']}")
print(f"Database: {test_data['db']}")
print(f"Question type: {test_data['question_type']}")

# Process the query
result = agent.process_query_from_json(test_data)
print(f"Generated report: {result['report'][:200]}...")

# Load and process multiple samples
for i in range(5):
    test_data = load_test_data(index=i)
    result = agent.process_query_from_json(test_data)
    print(f"Task {i}: {test_data['task_id']} - Completed")

Output and Results

All test results are automatically saved to:

  • results/ - DuckDB files with test results and metrics
  • FDABench/examples/data/ - Temporary processing files
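
To inspect a metrics file afterwards, a minimal sketch with the duckdb Python client; table names vary by script, so list them first:

import duckdb

# Hedged sketch: open a results file read-only and discover its tables.
con = duckdb.connect("results/test_query_results_deepanalyze.duckdb", read_only=True)
print(con.sql("SHOW TABLES"))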

Vector Index Building and Search

The VectorSearchTool enables semantic search over unstructured documents using FAISS + OpenAI Embeddings.

Option A: Download Pre-built Index (Recommended)

Download the pre-built FAISS index from Google Drive and extract to project root:

cd /path/to/FDAbench

# Download storage_faiss.tar.gz, then extract
tar xzvf storage_faiss.tar.gz

# This creates ./storage_faiss/ directory with:
#   - faiss.index     (FAISS vector index)
#   - chunks.json     (text chunks with metadata)
#   - config.json     (index configuration)

Now you can use VectorSearchTool directly:

from FDABench.tools.search_tools import VectorSearchTool

# Uses ./storage_faiss by default
tool = VectorSearchTool()
result = tool.execute(query="machine learning", top_k=5)

Option B: Build Index from Source

If you need to rebuild or customize the index:

1. Download Unstructured Data

Download the raw documents from Google Drive; the archive contains 50 domain categories with PDFs and other files.

2. Build Vector Index

cd /path/to/FDAbench
export OPENAI_API_KEY="your-openai-api-key"

python -m FDABench.utils.vector_index_builder \
    --doc-path /path/to/Vector_Database \
    --index-path ./storage_faiss \
    --unified \
    --chunk-size 1024

Builder Options:

Option            Description                            Default
--doc-path        Path to document categories            Required
--index-path      Where to save the index                ./storage_faiss
--unified         Merge all categories into one index    Flag
--chunk-size      Text chunk size in characters          1024
--chunk-overlap   Overlap between chunks                 200
--api-key         OpenAI API key                         Uses OPENAI_API_KEY env

Features:

  • Supports PDF files (via pdfplumber/PyPDF2)
  • 50 concurrent embedding requests
  • Auto-truncation for long texts (max 30K chars)
  • Timeout handling for problematic PDFs (30s)
  • Skips failed chunks and continues

Using VectorSearchTool

from FDABench.tools.search_tools import VectorSearchTool

# Initialize (uses ./storage_faiss by default)
tool = VectorSearchTool()

# Or specify custom path
tool = VectorSearchTool(storage_path="./my_index", api_key="your-key")

# Search
result = tool.execute(query="machine learning in healthcare", top_k=5)

if result["status"] == "success":
    print(f"Found {result['num_results']} results")
    print(result["results"])  # Formatted output

    # Access raw results
    for r in result["raw_results"]:
        print(f"Score: {r['score']:.4f}, Category: {r['metadata']['category']}")

Example Output:

[Rank 1] (Score: 0.6523)
Category: Healthcare_Medical Systems | File: medical_ai.pdf
Content: This paper presents a novel approach to...

Dataset Generation (PUDDING)

PUDDING is an agentic dataset construction framework that combines LLM generation with iterative expert validation. It operates in three phases:

  1. Initialization: Gather structured data (schema, SQL results) and unstructured context (web search, vector retrieval, file system)
  2. Expert Verification: Iterative agent-expert collaboration with accept/revise/dispose decisions
  3. Finalization: Quality validation and difficulty classification

python -m PUDDING.main          # Interactive mode (with expert review)
python -m PUDDING.main --auto   # Automatic mode

See PUDDING/README.md for detailed documentation.

Custom Agent Integration

Inherit from BaseAgent to create custom agents:

from FDABench.core.base_agent import BaseAgent

class YourAgent(BaseAgent):
    def process_query_from_json(self, query_data):
        question_type = query_data.get("question_type", "report")
        if question_type == "single_choice":
            return self.process_single_choice(query_data)
        elif question_type == "multiple_choice":
            return self.process_multiple_choice(query_data)
        else:
            return self.process_report(query_data)
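
The custom agent then plugs into the same driver code as the built-in agents. A hypothetical instantiation, assuming BaseAgent accepts the same constructor arguments as PlanningAgent in the basic usage example above:

from FDABench.utils.test_utils import load_test_data

# Assumption: BaseAgent subclasses take model/api_key like PlanningAgent.
agent = YourAgent(model="openai/gpt-5", api_key="your-api-key")
result = agent.process_query_from_json(load_test_data(index=0))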

Evaluation Metrics

  • Accuracy: Correctness for choice questions
  • Rubric Score: Report quality evaluation
  • Latency: Response time per query
  • Token Usage: Cost tracking
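
As a plain-Python illustration of the accuracy metric (this is not the benchmark's own scorer, and exact set match over answer labels is an assumption):

def choice_accuracy(predictions, golds):
    # predictions/golds: lists of answer-label lists, e.g. [["A"], ["A", "C"]]
    correct = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return correct / len(golds)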

Directory Structure

FDABench/
├── FDABench/                # Main package
│   ├── agents/              # Agent implementations (planning, multi, reflection, tool-use)
│   ├── core/                # Base classes, token tracking, tool registry
│   ├── evaluation/          # Evaluation and scoring tools
│   ├── tools/               # Schema, SQL, search tools
│   └── utils/               # Database connection, utilities
├── PUDDING/                 # Dataset generation framework (see PUDDING/README.md)
├── examples/                # Usage examples
├── results/                 # Test results (DuckDB files)
└── environment.yml          # Conda environment

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Implement changes (inherit from BaseAgent for new agents)
  4. Test with the evaluation suite
  5. Open a Pull Request

Submission

If you need to submit results, please send them in JSONL format, following the format of results/submission.jsonl, to [email protected].

Each line should contain a JSON object with the following key fields:

  • task_id, instance_id, db, level, database_type, question_type
  • For report tasks: "generated_report": "your report content"
  • For single choice: "selected_answer": ["A"]
  • For multiple choice: "selected_answer": ["A", "C", "F"]
  • Performance metrics: tool_executed, latency, total_tokens, total_cost, etc.

Example format:

{"task_id": "FDA0045", "question_type": "report", "generated_report": "...", "tool_executed": ["tool_1"], "latency": "", "total_tokens": ""}
{"task_id": "FDA0803", "question_type": "single_choice", "selected_answer": ["D"], "tool_executed": ["tool_1"], "latency": "", "total_tokens": ""}

Citation

If you find FDABench useful in your research, please consider citing our paper:

@article{wang2025fdabench,
  title={FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data},
  author={Wang, Ziting and Zhang, Shize and Yuan, Haitao and Zhu, Jinwei and Li, Shifu and Dong, Wei and Cong, Gao},
  journal={arXiv preprint arXiv:2509.02473},
  year={2025}
}
