scabench

A framework for evaluating AI audit agents using recent real-world data
ScaBench is a framework for evaluating security analysis tools and AI agents on real-world smart contract vulnerabilities, built around curated datasets from recent Code4rena, Cantina, and Sherlock audits and official tooling for consistent evaluation. It ships a baseline runner for LLM-based security analysis, a scoring tool that evaluates findings via LLM matching, a report generator for HTML reports with visualizations, and pipeline automation for end-to-end runs. Users can work with the curated datasets, generate new ones, download project source code at audit-time commits, run LLM-based analysis, and score any tool's findings against the benchmark. A strict matching policy keeps evaluation results accurate.

README:

ScaBench: Smart Contract Audit Benchmark

A comprehensive framework for evaluating security analysis tools and AI agents on real-world smart contract vulnerabilities. ScaBench provides curated datasets from recent audits and official tooling for consistent evaluation.

📚 Dataset Documentation

Features

  • 🎯 Curated Datasets: Real-world vulnerabilities from Code4rena, Cantina, and Sherlock audits
  • 🤖 Baseline Runner: LLM-based security analyzer with configurable models
  • 📊 Scoring Tool: Evaluates findings with LLM-based matching (confidence = 1.0 only)
  • 📈 Report Generator: HTML reports with visualizations and performance metrics
  • 🔄 Pipeline Automation: Complete workflow with single-command execution

Available Curated Datasets

Note: New datasets are added regularly to prevent models from being trained on known results and to maintain benchmark integrity.

Current Dataset: curated-2025-08-18

Location: datasets/curated-2025-08-18/curated-2025-08-18.json

The current dataset contains contest-scope repositories and their expected vulnerabilities from audit competitions:

  • 31 projects from Code4rena, Cantina, and Sherlock platforms
  • 555 total vulnerabilities (114 high/critical severity)
  • Time range: 2024-08 to 2025-08
  • Data format: JSON with project metadata (see the loading sketch after this list), including:
    • project_id: Unique identifier for each project
    • codebases: Repository URLs, commit hashes, and download links
    • vulnerabilities: Array of findings with severity, title, and detailed descriptions
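To make the format concrete, here is a minimal sketch that loads the dataset and summarizes each project. It assumes the top level of the JSON is a list of project objects with the fields listed above; adjust if the actual layout differs.

import json

# Load the current curated dataset (path given above); assumed to be
# a list of project objects with the fields documented in this README
with open("datasets/curated-2025-08-18/curated-2025-08-18.json") as f:
    dataset = json.load(f)

for project in dataset:
    vulns = project["vulnerabilities"]
    high = [v for v in vulns if v.get("severity") in ("high", "critical")]
    print(f"{project['project_id']}: {len(vulns)} findings, {len(high)} high/critical")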

Baseline Results

Location: datasets/curated-2025-08-18/baseline-results/

Pre-computed baseline results from analyzing each individual file with GPT-5:

  • Approach: Single-file analysis using GPT-5 to identify vulnerabilities
  • Coverage: One baseline file per project (e.g., baseline_cantina_minimal-delegation_2025_04.json)
  • Data format: JSON containing (a short inspection sketch follows this list):
    • project: Project identifier
    • files_analyzed: Number of files processed
    • total_findings: Count of vulnerabilities found
    • findings: Array of identified issues with:
      • title: Brief vulnerability description
      • description: Detailed explanation
      • severity: Risk level (high/medium/low)
      • confidence: Model's confidence score
      • location: Specific code location
      • file: Source file name
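As a quick illustration, this sketch loads one of the pre-computed baseline result files and tallies findings by severity. It relies only on the fields listed above; the filename is the example given earlier.

import json
from collections import Counter

# Load one pre-computed baseline result file
with open("datasets/curated-2025-08-18/baseline-results/"
          "baseline_cantina_minimal-delegation_2025_04.json") as f:
    result = json.load(f)

# Tally findings by severity using the documented fields
severities = Counter(finding["severity"] for finding in result["findings"])
print(result["project"], result["total_findings"], dict(severities))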

What Each Component Does

📊 Official Datasets (datasets/)

Pre-curated benchmark datasets with real vulnerabilities from audits.

  • Current: curated-2025-08-18.json (31 projects, 555 vulnerabilities)
  • Format: JSON with project metadata, repo URLs, commits, and vulnerability details

🔧 Dataset Generator (dataset-generator/)

Create NEW datasets by scraping and curating audit data.

Step 1: Scrape audit platforms

cd dataset-generator
python scraper.py --platforms code4rena cantina sherlock --months 3

Step 2: Curate the dataset

# Filter projects based on quality criteria
python curate_dataset.py \
  --input raw_dataset.json \
  --output curated_dataset.json \
  --min-vulnerabilities 5 \
  --min-high-critical 1

# This filters out projects that:
# - Have fewer than 5 vulnerabilities
# - Have no high/critical severity findings
# - Have inaccessible GitHub repositories
# - Have invalid or missing data

The curation step ensures high-quality benchmark data by removing low-value or inaccessible projects.
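For intuition, the core filter can be pictured as a predicate like the one below. This is an illustrative sketch, not the actual internals of curate_dataset.py; repository accessibility and data validity are separate checks.

# Illustrative only; the real curate_dataset.py may be structured differently
def keep_project(project, min_vulns=5, min_high_critical=1):
    """Return True if a project meets the curation criteria above."""
    vulns = project.get("vulnerabilities") or []
    if len(vulns) < min_vulns:
        return False  # fewer than 5 vulnerabilities
    high = [v for v in vulns if v.get("severity") in ("high", "critical")]
    if len(high) < min_high_critical:
        return False  # no high/critical severity findings
    return True  # repo accessibility and data validity are checked elsewhere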

📥 Source Checkout (dataset-generator/checkout_sources.py)

Download project source code at the EXACT commits specified in the dataset.

# Download all projects
python dataset-generator/checkout_sources.py

# Download specific project
python dataset-generator/checkout_sources.py --project vulnerable_vault

🔍 Baseline Runner (baseline-runner/)

Reference security analyzer using LLMs. Produces findings in standard JSON format.

python baseline-runner/baseline_runner.py \
  --project my_project \
  --source sources/my_project

📈 Scorer (scoring/scorer_v2.py)

Evaluates ANY tool's findings against the benchmark using LLM matching with one-by-one comparison for better consistency.

Important: Model Requirements

  • The scorer uses one-by-one matching, processing each expected finding sequentially
  • With a fixed seed and zero temperature, this is more deterministic than batch matching
  • Recommended: gpt-4o (default, best accuracy)
  • Alternative: gpt-4o-mini (faster, cheaper, good for testing)

Scoring a Single Project

IMPORTANT: When scoring a single project, you must specify the exact project ID from the benchmark dataset using the --project flag. Project IDs often contain hyphens (e.g., code4rena_iq-ai_2025_03) while baseline result filenames may have underscores.

# Example: Score results for a single project
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --project code4rena_iq-ai_2025_03 \
  --model gpt-4o \
  --confidence-threshold 0.75

# With verbose output to see matching details
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --project code4rena_iq-ai_2025_03 \
  --verbose

Note: The --project parameter must match the exact project_id field from the benchmark dataset JSON. Check the dataset file if unsure about the correct project ID.
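If you are unsure, a quick way to list the exact IDs is to read them straight from the dataset (again assuming a top-level list of project objects):

import json

with open("datasets/curated-2025-08-18/curated-2025-08-18.json") as f:
    dataset = json.load(f)

# Print the exact values to pass via --project (note the hyphens)
for project in dataset:
    print(project["project_id"])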

Scoring an Entire Baseline Run (All Projects)

To score all baseline results at once:

# Score all baseline results in a directory
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --output scores/ \
  --model gpt-4o \
  --confidence-threshold 0.75

# This will:
# 1. Process all *.json files in the results directory
# 2. Automatically extract and match project IDs
# 3. Generate individual score files for each project
# 4. Save results to the scores/ directory

# With debug output
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --output scores/ \
  --debug

Available Options

  • --confidence-threshold: Set matching confidence threshold (default: 0.75)
  • --verbose: Show detailed matching progress for each finding
  • --debug: Enable debug output for troubleshooting

After scoring, generate a comprehensive report:

python scoring/report_generator.py \
  --scores scores/ \
  --output baseline_report.html \
  --tool-name "Baseline" \
  --model gpt-5-mini

📄 Report Generator (scoring/report_generator.py)

Creates HTML reports with metrics and visualizations.

python scoring/report_generator.py \
  --scores scores/ \
  --output report.html

Quick Start

Prerequisites

# Install dependencies
pip install -r requirements.txt

# Set OpenAI API key
export OPENAI_API_KEY="your-key-here"

Option 1: Process ALL Projects (Easiest!) 🚀

The run_all.sh script provides a complete end-to-end pipeline that:

  1. Downloads source code - Clones all project repositories at exact audit commits
  2. Runs baseline analysis - Analyzes each project with LLM-based security scanner
  3. Scores results - Evaluates findings against known vulnerabilities using strict matching
  4. Generates reports - Creates comprehensive HTML report with metrics and visualizations

Basic Usage

# Run everything with defaults (all projects in dataset, gpt-5-mini model)
./run_all.sh

# Use different model (e.g., gpt-4o-mini for faster/cheaper runs)
./run_all.sh --model gpt-4o-mini

# Use a different dataset
./run_all.sh --dataset datasets/my_custom_dataset.json

# Combine options
./run_all.sh --model gpt-4o-mini --output-dir test_run

All Options

./run_all.sh [OPTIONS]

Options:
  --dataset FILE       Dataset to use (default: datasets/curated-2025-08-18.json)
  --model MODEL        Model for analysis (default: gpt-5-mini)
                       Options: gpt-5-mini, gpt-4o-mini, gpt-4o
  --output-dir DIR     Output directory (default: all_results_TIMESTAMP)
  --skip-checkout      Skip source checkout (use existing sources)
  --skip-baseline      Skip baseline analysis (use existing results)
  --skip-scoring       Skip scoring and report generation
  --help               Show help

What It Does (Step by Step)

Step 1: Source Checkout

  • Downloads all projects from the dataset (from their GitHub repositories)
  • Checks out exact commits from audit time
  • Preserves original project structure
  • Creates: OUTPUT_DIR/sources/PROJECT_ID/

Step 2: Baseline Analysis

  • Runs LLM-based security analysis on each project
  • Configurable file limits for testing
  • Uses specified model (default: gpt-5-mini)
  • Creates: OUTPUT_DIR/baseline_results/baseline_PROJECT_ID.json

Step 3: Scoring

  • Compares findings against known vulnerabilities in the dataset
  • Uses STRICT matching (confidence = 1.0 only)
  • Batch processes all projects
  • Creates: OUTPUT_DIR/scoring_results/score_PROJECT_ID.json

Step 4: Report Generation

  • Aggregates all scoring results
  • Generates HTML report with charts and metrics
  • Calculates overall detection rates and F1 scores (see the metric sketch after these steps)
  • Creates: OUTPUT_DIR/reports/full_report.html

Step 5: Summary Statistics

  • Computes aggregate metrics across all projects
  • Saves summary JSON with key statistics
  • Creates: OUTPUT_DIR/summary.json
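For reference, the headline metrics follow the standard definitions. The sketch below aggregates them from per-project score files; it assumes files named score_*.json exposing true_positives and total_expected (as in the Scoring Results format later in this README), plus a false_positives count, which is an assumption here.

import json
from pathlib import Path

tp = fp = expected = 0
for path in Path("scores").glob("score_*.json"):
    score = json.loads(path.read_text())
    tp += score["true_positives"]
    fp += score.get("false_positives", 0)  # assumed field
    expected += score["total_expected"]

recall = tp / expected if expected else 0.0  # detection rate
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"detection rate={recall:.2f}  precision={precision:.2f}  F1={f1:.2f}")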

Performance Notes

  • Full run (all files): 4-6 hours for the default dataset (31 projects)
  • Fast test (--model gpt-4o-mini): 30-45 minutes
  • Model selection:
    • gpt-5-mini: Best accuracy (default)
    • gpt-4o-mini: Faster, cheaper, good for testing

Note: The default dataset (curated-2025-08-18.json) contains 31 projects with 555 total vulnerabilities. Custom datasets may have different counts.

Option 2: Process Single Project

# For a specific project
./run_pipeline.sh --project vulnerable_vault --source sources/vulnerable_vault

Option 3: Complete Command-Line Guides

Complete Guide: Analyze and Score a Single Project

# Step 1: Set up environment
export OPENAI_API_KEY="your-key-here"

# Step 2: Find your project ID in the dataset
PROJECT_ID="code4rena_iq-ai_2025_03"  # Example - check dataset for exact ID

# Step 3: Download the source code
python dataset-generator/checkout_sources.py \
  --dataset datasets/curated-2025-08-18.json \
  --project $PROJECT_ID \
  --output sources/

# Step 4: Run baseline analysis
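# (${PROJECT_ID//-/_} below swaps hyphens for underscores; checked-out
#  source directories are assumed to use underscores, per the note above)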
python baseline-runner/baseline_runner.py \
  --project $PROJECT_ID \
  --source sources/${PROJECT_ID//-/_} \
  --output datasets/curated-2025-08-18/baseline-results/ \
  --model gpt-5-mini

# Step 5: Score the results (IMPORTANT: use exact project ID with hyphens)
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --project $PROJECT_ID \
  --output scores/ \
  --model gpt-4o

# Step 6: Generate HTML report
python scoring/report_generator.py \
  --scores scores/ \
  --output single_project_report.html \
  --tool-name "Baseline" \
  --model gpt-5-mini

# Step 7: View the report
open single_project_report.html  # macOS
# xdg-open single_project_report.html  # Linux

Complete Guide: Analyze and Score ALL Projects

# Step 1: Set up environment
export OPENAI_API_KEY="your-key-here"

# Step 2: Download ALL project sources (this may take a while)
python dataset-generator/checkout_sources.py \
  --dataset datasets/curated-2025-08-18.json \
  --output sources/

# Step 3: Run baseline on ALL projects (this will take hours)
for dir in sources/*/; do
  project=$(basename "$dir")
  echo "Analyzing $project..."
  python baseline-runner/baseline_runner.py \
    --project "$project" \
    --source "$dir" \
    --output datasets/curated-2025-08-18/baseline-results/ \
    --model gpt-5-mini
done

# Step 4: Score ALL baseline results
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --output scores/ \
  --model gpt-4o

# Step 5: Generate comprehensive report
python scoring/report_generator.py \
  --scores scores/ \
  --output full_baseline_report.html \
  --tool-name "Baseline Analysis" \
  --model gpt-5-mini

# Step 6: View the report
open full_baseline_report.html  # macOS
# xdg-open full_baseline_report.html  # Linux

Quick Test Run (Small Sample)

# Test with just one small project for quick validation
export OPENAI_API_KEY="your-key-here"

# Pick a small project
PROJECT_ID="code4rena_coded-estate-invitational_2024_12"

# Run complete pipeline for single project
python dataset-generator/checkout_sources.py --project $PROJECT_ID --output sources/
python baseline-runner/baseline_runner.py \
  --project $PROJECT_ID \
  --source sources/${PROJECT_ID//-/_} \
  --model gpt-5-mini
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --project $PROJECT_ID \
  --model gpt-4o
python scoring/report_generator.py \
  --scores scores/ \
  --output test_report.html \
  --model gpt-5-mini
open test_report.html

Option 4: Step-by-Step Manual Process

Two Ways to Use ScaBench

🎯 Option A: Run the Official Baseline

Easiest - Process ALL projects with one command:

./run_all.sh

This automatically:

  1. Downloads all source code at exact commits
  2. Runs baseline security analysis
  3. Scores against benchmark
  4. Generates comprehensive reports

Manual approach for specific projects:

# 1. Download source code
python dataset-generator/checkout_sources.py --project vulnerable_vault

# 2. Run baseline analysis
python baseline-runner/baseline_runner.py \
  --project vulnerable_vault \
  --source sources/vulnerable_vault

# 3. Score results
python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir datasets/curated-2025-08-18/baseline-results/ \
  --project vulnerable_vault

# 4. Generate report
python scoring/report_generator.py \
  --scores scores/ \
  --output report.html

🚀 Option B: Evaluate YOUR Tool

Step 1: Get the source code

python dataset-generator/checkout_sources.py \
  --dataset datasets/curated-2025-08-18.json \
  --output sources/

Step 2: Run YOUR tool on each project

# Example with your tool
your-tool analyze sources/project1/ > results/project1.json

Step 3: Format results to match required JSON structure

{
  "project": "project_name",
  "findings": [{
    "title": "Reentrancy in withdraw",
    "description": "Details...",
    "severity": "high",
    "location": "withdraw() function",
    "file": "Vault.sol"
  }]
}

See format specification below
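Before scoring, it can help to sanity-check your result files against this structure. Below is a minimal, unofficial validator sketch; save it under any name (e.g., a hypothetical validate_results.py) and pass your result files as arguments.

import json
import sys

REQUIRED_FINDING_KEYS = {"title", "description", "severity", "location", "file"}

def validate(path):
    """Minimal structural check; illustrative, not official tooling."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data.get("project"), str), "missing 'project' string"
    for i, finding in enumerate(data.get("findings", [])):
        missing = REQUIRED_FINDING_KEYS - finding.keys()
        assert not missing, f"finding {i} missing keys: {missing}"
    print(f"{path}: OK ({len(data.get('findings', []))} findings)")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        validate(p)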

Step 4: Score your results

python scoring/scorer_v2.py \
  --benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
  --results-dir results/

Step 5: View your performance

python scoring/report_generator.py \
  --scores scores/ \
  --output my_tool_report.html

Installation

Requirements

  • Python 3.8+
  • OpenAI API key
  • 4GB+ RAM for large codebases

Setup

# Clone the repository
git clone https://github.com/scabench/scabench.git
cd scabench

# Install all dependencies
pip install -r requirements.txt

# Run tests to verify installation
pytest tests/

Strict Matching Policy

The scorer enforces EXTREMELY STRICT matching criteria:

  • IDENTICAL LOCATION - Must be exact same file/contract/function
  • EXACT IDENTIFIERS - Same contract names, function names, variables
  • IDENTICAL ROOT CAUSE - Must be THE SAME vulnerability
  • IDENTICAL ATTACK VECTOR - Exact same exploitation method
  • IDENTICAL IMPACT - Exact same security consequence
  • NO MATCH for similar patterns in different locations
  • NO MATCH for same bug type but different functions
  • ⚠️ WHEN IN DOUBT: DO NOT MATCH

Only findings with confidence = 1.0 count as true positives!

Performance Tips

  1. Model Selection:

    • For Scoring: Use gpt-4o (the scorer's default, best accuracy); gpt-4o-mini is faster and cheaper for testing
    • For Baseline Analysis: Use gpt-5-mini for best accuracy
    • Note: scorer_v2.py matches each expected finding one by one (see the Scorer section above), so context length limits are rarely an issue even on large projects
    • Use --patterns to specify which files to analyze
  2. Batch Processing:

    # Process multiple projects
    for project in project1 project2 project3; do
      ./run_pipeline.sh --project $project --source sources/$project
    done
  3. Caching: Results are saved to disk for reprocessing

Output Formats

Baseline Results

{
  "project": "vulnerable_vault",
  "files_analyzed": 10,
  "total_findings": 5,
  "findings": [{
    "title": "Reentrancy vulnerability",
    "severity": "high",
    "confidence": 0.95,
    "location": "withdraw() function"
  }]
}

Scoring Results

{
  "total_expected": 10,
  "true_positives": 6,
  "detection_rate": 0.6,
  "matched_findings": [{
    "confidence": 1.0,
    "justification": "Perfect match: identical vulnerability"
  }]
}

License

MIT License - see LICENSE file for details
