FDAbench
FDABench, a benchmark for evaluating data agents' reasoning ability over heterogeneous data in analytical scenarios.
Stars: 53
FDABench is a benchmark tool designed for evaluating data agents' reasoning ability over heterogeneous data in analytical scenarios. It offers 2,007 tasks across various data sources, domains, difficulty levels, and task types. The tool provides ready-to-use data agent implementations, a DAG-based evaluation system, and a framework for agent-expert collaboration in dataset generation. Key features include data agent implementations, comprehensive evaluation metrics, multi-database support, different task types, extensible framework for custom agent integration, and cost tracking. Users can set up the environment using Python 3.10+ on Linux, macOS, or Windows. FDABench can be installed with a one-command setup or manually. The tool supports API configuration for LLM access and offers quick start guides for database download, dataset loading, and running examples. It also includes features like dataset generation using the PUDDING framework, custom agent integration, evaluation metrics like accuracy and rubric score, and a directory structure for easy navigation.
README:
FDABench is a benchmark for evaluating data agents' reasoning ability over heterogeneous data in analytical scenarios. It contains 2,007 tasks across different data sources, domains, difficulty levels, and task types. We provide ready-to-use data agent implementations, a DAG-based evaluation system, and an agent-expert collaboration framework for dataset generation.
- Data Agent Implementations: Ready-to-use agents (Planning, Multi-Agent, Reflection, Tool-Use)
- Comprehensive Evaluation: Accuracy metrics, rubric-based scoring, and performance analytics
- Multi-Database Support: SQLite, BigQuery, Snowflake, and unstructured data sources
- Three Task Types: Single-choice, multiple-choice, and open-ended report generation
- Extensible Framework: Modular base classes for custom agent integration
- Cost Tracking: Token usage and latency monitoring
| Type | Description |
|---|---|
| Single Choice | One correct answer from four options |
| Multiple Choice | Multiple correct answers allowed |
| Report | Open-ended analytical report generation |
- Python: 3.10+
- OS: Linux, macOS, Windows
Create the complete environment with all dependencies:
```bash
conda env create -f environment.yml
conda activate fdabench
```

This will:
- Create a new conda environment named `fdabench`
- Install Python 3.11 and all required dependencies
- Automatically install FDABench in development mode
If you prefer manual installation:
```bash
# Create environment
conda create -n fdabench python=3.11
conda activate fdabench

# Install FDABench
pip install -e .
```

Set up your API keys for LLM access:
```bash
# Option 1: Environment variables
export OPENROUTER_API_KEY="your-openrouter-api-key"

# Option 2: Create .env file in project root
echo "OPENROUTER_API_KEY=your-openrouter-api-key" >> .env
```
After completing the environment setup above, you can immediately start using FDABench with FDABench-Lite.

Download the FDABench-Lite database files from Google Drive, extract them to your directory, and configure the paths in `FDABench/utils/database_connection_manager.py` (see FDABench-Full Usage for details).
HuggingFace Dataset: FDABench now loads data directly from the HuggingFace dataset hub. The dataset `FDAbench2026/Fdabench-Lite` contains 289 curated test cases across the three task types for immediate use. We also offer FDABench-Full, with 2,007 test cases, on HuggingFace.
```bash
# Activate your environment (if not already active)
conda activate fdabench

# Run your first example - automatically loads from HuggingFace
python examples/run_planning_agent.py

# Run with a specific sample (0-116 available)
python examples/run_planning_agent.py --index 10

# Run with a custom model
python examples/run_planning_agent.py --model "openai/gpt-5" --index 5
```

FDABench-Full supports multiple database types, including Snowflake, BigQuery, and SQLite. You need to configure database paths and obtain the required data:
- BIRD Dataset: Download from the BIRD repository
- Spider2-lite Dataset: Download from Spider2 spider-agent-lite
- BigQuery and Snowflake: Follow the registration and setup instructions in the Spider2 README
- Unstructured Dataset: Download from Google Drive
Edit `FDABench/utils/database_connection_manager.py` and update the configuration:
```python
default_config = {
    # SQLite database paths
    'bird_db_path': "/your/path/to/BIRD_train/train_databases",
    'local_db_path': "/your/path/to/local/databases",
    'spider1_db_path': "/your/path/to/spider1/databases",

    # Cloud database credentials
    'bigquery_credentials_path': "/your/path/to/bigquery-service-account.json",
    'snowflake_config': {
        'account': 'your-snowflake-account',
        'user': 'your-username',
        'password': 'your-password',
        'warehouse': 'your-warehouse',
        'database': 'your-database'
    }
}
```

Expected database layout:

```
your_databases/
├── BIRD_train/train_databases/
│ ├── california_schools/
│ │ └── california_schools.sqlite
│ ├── card_games/
│ │ └── card_games.sqlite
│ └── ...
├── spider1_databases/
│ ├── concert_singer.sqlite
│ ├── pets_1.sqlite
│ └── ...
├── local_databases/
│ └── merchant_data.db
└── credentials/
    └── bigquery-service-account.json
```
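As a sanity check that the configured paths resolve, here is a minimal sketch that opens one of the BIRD SQLite files directly with the standard-library `sqlite3` module. The database name `california_schools` is taken from the layout above; adjust it to whatever you actually downloaded:

```python
# Sketch: verify a configured BIRD database path by listing its tables
import sqlite3
from pathlib import Path

bird_db_path = Path("/your/path/to/BIRD_train/train_databases")
db_file = bird_db_path / "california_schools" / "california_schools.sqlite"

conn = sqlite3.connect(db_file)
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
conn.close()
print(tables)  # non-empty list means the path and file are valid
```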
HuggingFace Dataset (Default): The benchmark uses the FDAbench2026/Fdabench-Lite dataset from HuggingFace, which includes:
- 289 curated test cases
- Three subsets (report, single, multiple)
- Multiple database types (BIRD, local, Spider2-lite)
- Various difficulty levels (easy, medium, hard)
Loading Data in Your Code:
```python
from FDABench.utils.test_utils import load_test_data

# Load the first sample (default)
test_data = load_test_data()

# Load a specific sample by index (0-116)
test_data = load_test_data(index=10)
```

Custom Local Datasets: If you have your own test data, you can still use local JSON files by modifying the `load_test_data()` function in `FDABench/utils/test_utils.py`.
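A minimal sketch of what such a local loader could look like. The function name and the assumption that the file holds a JSON list of task objects are illustrative, not FDABench's actual implementation; in practice, mirror `load_test_data()`'s return shape:

```python
# Sketch: load tasks from a local JSON file instead of HuggingFace
# (hypothetical helper; match load_test_data()'s return format)
import json
from pathlib import Path

def load_local_test_data(path: str, index: int = 0) -> dict:
    """Return the index-th task from a JSON file containing a list of tasks."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    return tasks[index]

test_data = load_local_test_data("my_tasks.json", index=0)
```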
Test different agent workflows with the HuggingFace dataset:

```bash
# Planning Agent - Uses step-by-step planning
python examples/run_planning_agent.py             # Default: index 0
python examples/run_planning_agent.py --index 25  # Specific sample

# Multi-Agent - Coordinates multiple specialized agents
python examples/run_multi_agent.py --index 10

# Reflection Agent - Self-improving with reflection
python examples/run_reflection_agent.py --index 50

# Tool-Use Agent - Optimized for tool selection
python examples/run_tooluse_agent.py --index 100

# All agents support the same parameters:
# --index N: Select sample N from the dataset (0-116)
# --model "model_name": Specify the LLM model to use
```

We also provide a ready-to-run benchmarking script that connects DeepAnalyze to FDABench's tasks.
- Position the DeepAnalyze project alongside FDABench so the runtime layout looks like:

```
/path/to/workspace/
├── FDAbench/
└── DeepAnalyze/
```

- Start the DeepAnalyze vLLM server and export its model path and endpoint (or pass them as CLI flags):

```bash
export DEEPANALYZE_MODEL_PATH=/path/to/DeepAnalyze/model/DeepAnalyze-8B
export DEEPANALYZE_API_URL=http://localhost:8000/v1/chat/completions
```

- Run the benchmark. By default the script pulls FDABench-Lite sample `index=0`, runs evaluation, and writes metrics to `results/test_query_results_deepanalyze.duckdb`:

```bash
python FDABench/examples/test_deepanalyze_adapter.py \
  --index 5 \
  --max_agent_rounds 8 \
  --max_deepanalyze_rounds 25
```
Useful flags:
- `--input`: load custom JSON/JSONL tasks instead of HuggingFace samples.
- `--duckdb_path`: specify a custom metrics file.
- `--api_key`: override the API key FDABench uses for auxiliary tools.
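For instance, an illustrative invocation that combines these flags (the file names here are placeholders):

```bash
# Run the adapter on local tasks and write metrics to a custom DuckDB file
python FDABench/examples/test_deepanalyze_adapter.py \
  --input my_tasks.jsonl \
  --duckdb_path results/deepanalyze_custom.duckdb \
  --api_key "your-openrouter-api-key"
```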
Data agents integrated with semantic data operators for advanced data processing:
```bash
# DocETL Semantic Operator Agent - Uses DocETL operators for document processing
python FDABench/examples/test_planning_agent_docetl_batch.py

# Lotus Semantic Operator Agent - Uses Lotus operators for natural language processing
python FDABench/examples/test_planning_agent_lotus_batch.py

# Palimpzest Semantic Operator Agent - Uses Palimpzest operators for data transformation
python FDABench/examples/test_planning_agent_pz_batch.py
```

Note: Data agents with semantic operators require additional environment setup. Check the respective environment files:
- `FDABench/examples/docetl_environment.yml`
- `FDABench/examples/lotus_environment.yml`
- `FDABench/examples/palimpzest_environment.yml`
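Each of these files can be turned into its own conda environment in the usual way; for example:

```bash
# Create and activate the DocETL environment (repeat for lotus/palimpzest)
conda env create -f FDABench/examples/docetl_environment.yml
conda activate docetl  # assumption: the env name declared inside the yml file
```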
```python
from FDABench.agents.planning_agent import PlanningAgent
from FDABench.evaluation.evaluation_tools import ReportEvaluator
from FDABench.utils.test_utils import load_test_data

# Initialize agent with your preferred model
agent = PlanningAgent(
    model="openai/gpt-5",  # or "deepseek/deepseek-chat-v3"
    api_key="your-api-key"
)

# Load test data from HuggingFace dataset
test_data = load_test_data(index=0)  # Load first sample
print(f"Processing task: {test_data['task_id']}")
print(f"Database: {test_data['db']}")
print(f"Question type: {test_data['question_type']}")

# Process the query
result = agent.process_query_from_json(test_data)
print(f"Generated report: {result['report'][:200]}...")

# Load and process multiple samples
for i in range(5):
    test_data = load_test_data(index=i)
    result = agent.process_query_from_json(test_data)
    print(f"Task {i}: {test_data['task_id']} - Completed")
```

All test results are automatically saved to:
- `results/` - DuckDB files with test results and metrics
- `FDABench/examples/data/` - Temporary processing files
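A minimal sketch of inspecting those DuckDB files afterwards, assuming the `duckdb` Python package is installed; the table name is discovered at runtime rather than assumed:

```python
# Sketch: list tables in a results file and peek at the first rows
import duckdb

con = duckdb.connect(
    "results/test_query_results_deepanalyze.duckdb", read_only=True
)
tables = con.execute("SHOW TABLES").fetchall()
print(tables)

if tables:
    first_table = tables[0][0]
    print(con.execute(f"SELECT * FROM {first_table} LIMIT 5").fetchall())
con.close()
```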
The VectorSearchTool enables semantic search over unstructured documents using FAISS + OpenAI Embeddings.
Download the pre-built FAISS index from Google Drive and extract it to the project root:

```bash
cd /path/to/FDAbench

# Download storage_faiss.tar.gz, then extract
tar xzvf storage_faiss.tar.gz

# This creates ./storage_faiss/ directory with:
# - faiss.index (FAISS vector index)
# - chunks.json (text chunks with metadata)
# - config.json (index configuration)
```

Now you can use VectorSearchTool directly:
```python
from FDABench.tools.search_tools import VectorSearchTool

# Uses ./storage_faiss by default
tool = VectorSearchTool()
result = tool.execute(query="machine learning", top_k=5)
```

If you need to rebuild or customize the index:
1. Download Unstructured Data
Download the raw documents from Google Drive containing 50 domain categories with PDFs and other files.
2. Build Vector Index
```bash
cd /path/to/FDAbench
export OPENAI_API_KEY="your-openai-api-key"

python -m FDABench.utils.vector_index_builder \
  --doc-path /path/to/Vector_Database \
  --index-path ./storage_faiss \
  --unified \
  --chunk-size 1024
```

Builder Options:
| Option | Description | Default |
|---|---|---|
| `--doc-path` | Path to document categories | Required |
| `--index-path` | Where to save the index | `./storage_faiss` |
| `--unified` | Merge all categories into one index | Flag |
| `--chunk-size` | Text chunk size in characters | 1024 |
| `--chunk-overlap` | Overlap between chunks | 200 |
| `--api-key` | OpenAI API key | Uses `OPENAI_API_KEY` env |
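For illustration, a variant invocation that tunes the chunking parameters; omitting `--unified` presumably keeps the categories as separate indexes (the values here are arbitrary examples, not recommendations):

```bash
# Smaller chunks with more overlap; categories not merged into one index
python -m FDABench.utils.vector_index_builder \
  --doc-path /path/to/Vector_Database \
  --index-path ./storage_faiss_per_category \
  --chunk-size 512 \
  --chunk-overlap 100
```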
Features:
- Supports PDF files (via pdfplumber/PyPDF2)
- 50 concurrent embedding requests
- Auto-truncation for long texts (max 30K chars)
- Timeout handling for problematic PDFs (30s)
- Skips failed chunks and continues
```python
from FDABench.tools.search_tools import VectorSearchTool

# Initialize (uses ./storage_faiss by default)
tool = VectorSearchTool()

# Or specify custom path
tool = VectorSearchTool(storage_path="./my_index", api_key="your-key")

# Search
result = tool.execute(query="machine learning in healthcare", top_k=5)
if result["status"] == "success":
    print(f"Found {result['num_results']} results")
    print(result["results"])  # Formatted output

    # Access raw results
    for r in result["raw_results"]:
        print(f"Score: {r['score']:.4f}, Category: {r['metadata']['category']}")
```

Example Output:
```
[Rank 1] (Score: 0.6523)
Category: Healthcare_Medical Systems | File: medical_ai.pdf
Content: This paper presents a novel approach to...
```
PUDDING is an agentic dataset construction framework that combines LLM generation with iterative expert validation. It operates in three phases:
- Initialization: Gather structured data (schema, SQL results) and unstructured context (web search, vector retrieval, file system)
- Expert Verification: Iterative agent-expert collaboration with accept/revise/dispose decisions
- Finalization: Quality validation and difficulty classification
```bash
python -m PUDDING.main         # Interactive mode (with expert review)
python -m PUDDING.main --auto  # Automatic mode
```

See PUDDING/README.md for detailed documentation.
Inherit from BaseAgent to create custom agents:
```python
from FDABench.core.base_agent import BaseAgent

class YourAgent(BaseAgent):
    def process_query_from_json(self, query_data):
        question_type = query_data.get("question_type", "report")
        if question_type == "single_choice":
            return self.process_single_choice(query_data)
        elif question_type == "multiple_choice":
            return self.process_multiple_choice(query_data)
        else:
            return self.process_report(query_data)
```
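A minimal sketch of driving such an agent end to end, assuming `YourAgent` accepts the same `model`/`api_key` constructor arguments as the built-in agents shown earlier:

```python
# Sketch: run a custom agent on one benchmark sample
# (constructor arguments assumed to match the built-in agents)
from FDABench.utils.test_utils import load_test_data

agent = YourAgent(model="openai/gpt-5", api_key="your-api-key")
test_data = load_test_data(index=0)
result = agent.process_query_from_json(test_data)
print(result)
```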
Evaluation metrics include:
- Accuracy: Correctness for choice questions
- Rubric Score: Report quality evaluation
- Latency: Response time per query
- Token Usage: Cost tracking
```
FDABench/
├── FDABench/            # Main package
│   ├── agents/          # Agent implementations (planning, multi, reflection, tool-use)
│   ├── core/            # Base classes, token tracking, tool registry
│   ├── evaluation/      # Evaluation and scoring tools
│   ├── tools/           # Schema, SQL, search tools
│   └── utils/           # Database connection, utilities
├── PUDDING/             # Dataset generation framework (see PUDDING/README.md)
├── examples/            # Usage examples
├── results/             # Test results (DuckDB files)
└── environment.yml      # Conda environment
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Implement changes (inherit from `BaseAgent` for new agents)
- Test with the evaluation suite
- Open a Pull Request
To submit results, send them in JSONL format (similar to `results/submission.jsonl`) to [email protected].
Each line should contain a JSON object with the following key fields:
- `task_id`, `instance_id`, `db`, `level`, `database_type`, `question_type`
- For report tasks: `"generated_report": "your report content"`
- For single choice: `"selected_answer": ["A"]`
- For multiple choice: `"selected_answer": ["A", "C", "F"]`
- Performance metrics: `tool_executed`, `latency`, `total_tokens`, `total_cost`, etc.
Example format:
{"task_id": "FDA0045", "question_type": "report", "generated_report": "...", "tool_executed": ["tool_1"], "latency": "", "total_tokens": ""}
{"task_id": "FDA0803", "question_type": "single_choice", "selected_answer": ["D"], "tool_executed": ["tool_1"], "latency": "", "total_tokens": ""}If you find FDABench useful in your research, please consider citing our paper:
If you find FDABench useful in your research, please consider citing our paper:

```bibtex
@article{wang2025fdabench,
title={FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data},
author={Wang, Ziting and Zhang, Shize and Yuan, Haitao and Zhu, Jinwei and Li, Shifu and Dong, Wei and Cong, Gao},
journal={arXiv preprint arXiv:2509.02473},
year={2025}
}
```