
scabench
A framework for evaluating AI audit agents using recent real-world data
Stars: 53

ScaBench is a comprehensive framework designed for evaluating security analysis tools and AI agents on real-world smart contract vulnerabilities. It provides curated datasets from recent audits and official tooling for consistent evaluation. The tool includes features such as curated datasets from Code4rena, Cantina, and Sherlock audits, a baseline runner for security analysis, a scoring tool for evaluating findings, a report generator for HTML reports with visualizations, and pipeline automation for complete workflow execution. Users can access curated datasets, generate new datasets, download project source code, run security analysis using LLMs, and evaluate tool findings against benchmarks using LLM matching. The tool enforces strict matching policies to ensure accurate evaluation results.
README:
A comprehensive framework for evaluating security analysis tools and AI agents on real-world smart contract vulnerabilities. ScaBench provides curated datasets from recent audits and official tooling for consistent evaluation.
- Dataset Format & Statistics - Complete dataset information
- 🎯 Curated Datasets: Real-world vulnerabilities from Code4rena, Cantina, and Sherlock audits
- 🤖 Baseline Runner: LLM-based security analyzer with configurable models
- 📊 Scoring Tool: Evaluates findings with LLM-based matching (confidence = 1.0 only)
- 📈 Report Generator: HTML reports with visualizations and performance metrics
- 🔄 Pipeline Automation: Complete workflow with single-command execution
Note: New datasets are added regularly to prevent models from being trained on known results and to maintain benchmark integrity.
Location: datasets/curated-2025-08-18/curated-2025-08-18.json
The most current dataset contains contest scope repositories with expected vulnerabilities from audit competitions:
- 31 projects from Code4rena, Cantina, and Sherlock platforms
- 555 total vulnerabilities (114 high/critical severity)
- Time range: 2024-08 to 2025-08
Data format: JSON with project metadata including:
- project_id: Unique identifier for each project
- codebases: Repository URLs, commit hashes, and download links
- vulnerabilities: Array of findings with severity, title, and detailed descriptions
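For a quick look at a dataset, the JSON can be loaded directly. This is a minimal sketch, not official tooling; it assumes the top level is either a list of project entries or an object holding them under a "projects" key, and uses the field names listed above.
import json

# Load the curated dataset (path as documented above).
with open("datasets/curated-2025-08-18/curated-2025-08-18.json") as fh:
    data = json.load(fh)

# Assumption: project entries live at the top level or under a "projects" key.
projects = data if isinstance(data, list) else data.get("projects", [])
for project in projects:
    vulns = project.get("vulnerabilities", [])
    high = sum(1 for v in vulns if v.get("severity", "").lower() in ("high", "critical"))
    print(f"{project.get('project_id')}: {len(vulns)} vulnerabilities ({high} high/critical)")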
Location: datasets/curated-2025-08-18/baseline-results/
Pre-computed baseline results from analyzing each individual file with GPT-5:
- Approach: Single-file analysis using GPT-5 to identify vulnerabilities
- Coverage: One baseline file per project (e.g., baseline_cantina_minimal-delegation_2025_04.json)
- Data format: JSON containing:
  - project: Project identifier
  - files_analyzed: Number of files processed
  - total_findings: Count of vulnerabilities found
  - findings: Array of identified issues with:
    - title: Brief vulnerability description
    - description: Detailed explanation
    - severity: Risk level (high/medium/low)
    - confidence: Model's confidence score
    - location: Specific code location
    - file: Source file name
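An individual baseline file can be inspected the same way. A short sketch, assuming the filename from the Coverage example above and the field names just listed:
import json
from collections import Counter

path = "datasets/curated-2025-08-18/baseline-results/baseline_cantina_minimal-delegation_2025_04.json"
with open(path) as fh:
    baseline = json.load(fh)

# Count findings per severity level.
by_severity = Counter(item.get("severity", "unknown") for item in baseline.get("findings", []))
print(baseline["project"], "-", baseline["files_analyzed"], "files analyzed,",
      baseline["total_findings"], "findings:", dict(by_severity))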
Pre-curated benchmark datasets with real vulnerabilities from audits.
- Current: curated-2025-08-18.json (31 projects, 555 vulnerabilities)
- Format: JSON with project metadata, repo URLs, commits, and vulnerability details
Create NEW datasets by scraping and curating audit data.
Step 1: Scrape audit platforms
cd dataset-generator
python scraper.py --platforms code4rena cantina sherlock --months 3
Step 2: Curate the dataset
# Filter projects based on quality criteria
python curate_dataset.py \
--input raw_dataset.json \
--output curated_dataset.json \
--min-vulnerabilities 5 \
--min-high-critical 1
# This filters out projects that:
# - Have fewer than 5 vulnerabilities
# - Have no high/critical severity findings
# - Have inaccessible GitHub repositories
# - Have invalid or missing data
The curation step ensures high-quality benchmark data by removing low-value or inaccessible projects.
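Conceptually, the filter boils down to a predicate like the one below. This is an illustrative sketch of the criteria above, not the actual curate_dataset.py code, and the repository-accessibility check is only stubbed here.
def keep_project(project: dict, min_vulns: int = 5, min_high_critical: int = 1) -> bool:
    """Return True if a project meets the curation criteria described above."""
    vulns = project.get("vulnerabilities", [])
    high_critical = [v for v in vulns if v.get("severity", "").lower() in ("high", "critical")]
    has_repo = bool(project.get("codebases"))  # the real script would also verify the repo is reachable
    return len(vulns) >= min_vulns and len(high_critical) >= min_high_critical and has_repo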
Download project source code at EXACT commits from dataset.
# Download all projects
python dataset-generator/checkout_sources.py
# Download specific project
python dataset-generator/checkout_sources.py --project vulnerable_vault
Reference security analyzer using LLMs. Produces findings in standard JSON format.
python baseline-runner/baseline_runner.py \
--project my_project \
--source sources/my_project
Evaluates ANY tool's findings against the benchmark using LLM matching with one-by-one comparison for better consistency.
Important: Model Requirements
- The scorer uses one-by-one matching - processes each expected finding sequentially
- More deterministic than batch matching with fixed seed and zero temperature
- Recommended: gpt-4o (default, best accuracy)
- Alternative: gpt-4o-mini (faster, cheaper, good for testing)
IMPORTANT: When scoring a single project, you must specify the exact project ID from the benchmark dataset using the --project flag. Project IDs often contain hyphens (e.g., code4rena_iq-ai_2025_03) while baseline result filenames may have underscores.
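For orientation, the one-by-one matching described above works roughly like the loop below. This is a conceptual sketch only, not scorer_v2.py's actual implementation; judge_match is a placeholder for the LLM comparison call.
def judge_match(expected: dict, candidate: dict) -> float:
    """Placeholder for the LLM call comparing one expected finding to one reported finding."""
    raise NotImplementedError

def score_project(expected_findings, reported_findings, threshold=0.75):
    matched = 0
    for expected in expected_findings:  # each expected finding is processed sequentially
        best = max((judge_match(expected, cand) for cand in reported_findings), default=0.0)
        if best >= threshold:
            matched += 1
    return {
        "total_expected": len(expected_findings),
        "true_positives": matched,
        "detection_rate": matched / len(expected_findings) if expected_findings else 0.0,
    }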
# Example: Score results for a single project
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--project code4rena_iq-ai_2025_03 \
--model gpt-4o \
--confidence-threshold 0.75
# With verbose output to see matching details
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--project code4rena_iq-ai_2025_03 \
--verbose
Note: The --project parameter must match the exact project_id field from the benchmark dataset JSON. Check the dataset file if unsure about the correct project ID.
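One way to find the exact ID is to search the dataset for it. A small sketch, assuming the dataset layout described earlier:
import json
import sys

needle = sys.argv[1] if len(sys.argv) > 1 else ""  # e.g. "iq-ai"
data = json.load(open("datasets/curated-2025-08-18/curated-2025-08-18.json"))
entries = data if isinstance(data, list) else data.get("projects", [])
for entry in entries:
    pid = entry.get("project_id", "")
    if needle in pid:
        print(pid)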
To score all baseline results at once:
# Score all baseline results in a directory
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--output scores/ \
--model gpt-4o \
--confidence-threshold 0.75
# This will:
# 1. Process all *.json files in the results directory
# 2. Automatically extract and match project IDs
# 3. Generate individual score files for each project
# 4. Save results to the scores/ directory
# With debug output
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--output scores/ \
--debug
- --confidence-threshold: Set matching confidence threshold (default: 0.75)
- --verbose: Show detailed matching progress for each finding
- --debug: Enable debug output for troubleshooting
After scoring, generate a comprehensive report:
python scoring/report_generator.py \
--scores scores/ \
--output baseline_report.html \
--tool-name "Baseline" \
--model gpt-5-mini
Creates HTML reports with metrics and visualizations.
python scoring/report_generator.py \
--scores scores/ \
--output report.html
# Install dependencies
pip install -r requirements.txt
# Set OpenAI API key
export OPENAI_API_KEY="your-key-here"
The run_all.sh script provides a complete end-to-end pipeline that:
- Downloads source code - Clones all project repositories at exact audit commits
- Runs baseline analysis - Analyzes each project with LLM-based security scanner
- Scores results - Evaluates findings against known vulnerabilities using strict matching
- Generates reports - Creates comprehensive HTML report with metrics and visualizations
# Run everything with defaults (all projects in dataset, gpt-5-mini model)
./run_all.sh
# Use different model (e.g., gpt-4o-mini for faster/cheaper runs)
./run_all.sh --model gpt-4o-mini
# Use a different dataset
./run_all.sh --dataset datasets/my_custom_dataset.json
# Combine options
./run_all.sh --model gpt-4o-mini --output-dir test_run
./run_all.sh [OPTIONS]
Options:
--dataset FILE Dataset to use (default: datasets/curated-2025-08-18.json)
--model MODEL Model for analysis (default: gpt-5-mini)
Options: gpt-5-mini, gpt-4o-mini, gpt-4o
--output-dir DIR Output directory (default: all_results_TIMESTAMP)
--skip-checkout Skip source checkout (use existing sources)
--skip-baseline Skip baseline analysis (use existing results)
--skip-scoring Skip scoring and report generation
--help Show help
Step 1: Source Checkout
- Downloads all projects from the dataset (from their GitHub repositories)
- Checks out exact commits from audit time
- Preserves original project structure
- Creates: OUTPUT_DIR/sources/PROJECT_ID/
Step 2: Baseline Analysis
- Runs LLM-based security analysis on each project
- Configurable file limits for testing
- Uses specified model (default: gpt-5-mini)
- Creates: OUTPUT_DIR/baseline_results/baseline_PROJECT_ID.json
Step 3: Scoring
- Compares findings against known vulnerabilities in the dataset
- Uses STRICT matching (confidence = 1.0 only)
- Batch processes all projects
- Creates: OUTPUT_DIR/scoring_results/score_PROJECT_ID.json
Step 4: Report Generation
- Aggregates all scoring results
- Generates HTML report with charts and metrics
- Calculates overall detection rates and F1 scores
- Creates: OUTPUT_DIR/reports/full_report.html
Step 5: Summary Statistics
- Computes aggregate metrics across all projects
- Saves summary JSON with key statistics
- Creates: OUTPUT_DIR/summary.json
- Full run (all files): 4-6 hours for default dataset (31 projects)
- Fast test (--model gpt-4o-mini): 30-45 minutes
- Model selection:
  - gpt-5-mini: Best accuracy (default)
  - gpt-4o-mini: Faster, cheaper, good for testing
- Note: The default dataset (curated-2025-08-18.json) contains 31 projects with 555 total vulnerabilities. Custom datasets may have different counts.
# For a specific project
./run_pipeline.sh --project vulnerable_vault --source sources/vulnerable_vault
# Step 1: Set up environment
export OPENAI_API_KEY="your-key-here"
# Step 2: Find your project ID in the dataset
PROJECT_ID="code4rena_iq-ai_2025_03" # Example - check dataset for exact ID
# Step 3: Download the source code
python dataset-generator/checkout_sources.py \
--dataset datasets/curated-2025-08-18.json \
--project $PROJECT_ID \
--output sources/
# Step 4: Run baseline analysis
python baseline-runner/baseline_runner.py \
--project $PROJECT_ID \
--source sources/${PROJECT_ID//-/_} \
--output datasets/curated-2025-08-18/baseline-results/ \
--model gpt-5-mini
# Step 5: Score the results (IMPORTANT: use exact project ID with hyphens)
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--project $PROJECT_ID \
--output scores/ \
--model gpt-4o
# Step 6: Generate HTML report
python scoring/report_generator.py \
--scores scores/ \
--output single_project_report.html \
--tool-name "Baseline" \
--model gpt-5-mini
# Step 7: View the report
open single_project_report.html # macOS
# xdg-open single_project_report.html # Linux
# Step 1: Set up environment
export OPENAI_API_KEY="your-key-here"
# Step 2: Download ALL project sources (this may take a while)
python dataset-generator/checkout_sources.py \
--dataset datasets/curated-2025-08-18.json \
--output sources/
# Step 3: Run baseline on ALL projects (this will take hours)
for dir in sources/*/; do
  project=$(basename "$dir")
  echo "Analyzing $project..."
  python baseline-runner/baseline_runner.py \
    --project "$project" \
    --source "$dir" \
    --output datasets/curated-2025-08-18/baseline-results/ \
    --model gpt-5-mini
done
# Step 4: Score ALL baseline results
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--output scores/ \
--model gpt-4o
# Step 5: Generate comprehensive report
python scoring/report_generator.py \
--scores scores/ \
--output full_baseline_report.html \
--tool-name "Baseline Analysis" \
--model gpt-5-mini
# Step 6: View the report
open full_baseline_report.html # macOS
# xdg-open full_baseline_report.html # Linux
# Test with just one small project for quick validation
export OPENAI_API_KEY="your-key-here"
# Pick a small project
PROJECT_ID="code4rena_coded-estate-invitational_2024_12"
# Run complete pipeline for single project
python dataset-generator/checkout_sources.py --project $PROJECT_ID --output sources/
python baseline-runner/baseline_runner.py \
--project $PROJECT_ID \
--source sources/${PROJECT_ID//-/_} \
--model gpt-5-mini
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--project $PROJECT_ID \
--model gpt-4o
python scoring/report_generator.py \
--scores scores/ \
--output test_report.html \
--model gpt-5-mini
open test_report.html
Easiest - Process ALL projects with one command:
./run_all.sh
This automatically:
- Downloads all source code at exact commits
- Runs baseline security analysis
- Scores against benchmark
- Generates comprehensive reports
Manual approach for specific projects:
# 1. Download source code
python dataset-generator/checkout_sources.py --project vulnerable_vault
# 2. Run baseline analysis
python baseline-runner/baseline_runner.py \
--project vulnerable_vault \
--source sources/vulnerable_vault
# 3. Score results
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir datasets/curated-2025-08-18/baseline-results/ \
--project vulnerable_vault
# 4. Generate report
python scoring/report_generator.py \
--scores scores/ \
--output report.html
Step 1: Get the source code
python dataset-generator/checkout_sources.py \
--dataset datasets/curated-2025-08-18.json \
--output sources/
Step 2: Run YOUR tool on each project
# Example with your tool
your-tool analyze sources/project1/ > results/project1.json
Step 3: Format results to match required JSON structure
{
  "project": "project_name",
  "findings": [{
    "title": "Reentrancy in withdraw",
    "description": "Details...",
    "severity": "high",
    "location": "withdraw() function",
    "file": "Vault.sol"
  }]
}
See format specification below
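If your tool emits a different schema, a small adapter can produce this structure. The sketch below is only an example; your_tool_findings and its keys are placeholders for whatever your tool actually outputs.
import json

# Placeholder raw output from your own tool.
your_tool_findings = [
    {"issue": "Reentrancy in withdraw", "detail": "External call before state update",
     "level": "high", "where": "withdraw() function", "path": "Vault.sol"},
]

# Map the raw fields onto the required results format.
results = {
    "project": "project1",
    "findings": [
        {
            "title": item["issue"],
            "description": item["detail"],
            "severity": item["level"],
            "location": item["where"],
            "file": item["path"],
        }
        for item in your_tool_findings
    ],
}

with open("results/project1.json", "w") as out:
    json.dump(results, out, indent=2)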
Step 4: Score your results
python scoring/scorer_v2.py \
--benchmark datasets/curated-2025-08-18/curated-2025-08-18.json \
--results-dir results/
Step 5: View your performance
python scoring/report_generator.py \
--scores scores/ \
--output my_tool_report.html
- Python 3.8+
- OpenAI API key
- 4GB+ RAM for large codebases
# Clone the repository
git clone https://github.com/scabench/scabench.git
cd scabench
# Install all dependencies
pip install -r requirements.txt
# Run tests to verify installation
pytest tests/
The scorer enforces EXTREMELY STRICT matching criteria:
- ✅ IDENTICAL LOCATION - Must be exact same file/contract/function
- ✅ EXACT IDENTIFIERS - Same contract names, function names, variables
- ✅ IDENTICAL ROOT CAUSE - Must be THE SAME vulnerability
- ✅ IDENTICAL ATTACK VECTOR - Exact same exploitation method
- ✅ IDENTICAL IMPACT - Exact same security consequence
- ❌ NO MATCH for similar patterns in different locations
- ❌ NO MATCH for same bug type but different functions
⚠️ WHEN IN DOUBT: DO NOT MATCH
Only findings with confidence = 1.0 count as true positives!
- Model Selection:
  - For Scoring: Use gpt-5-mini (recommended) - needs long context for batch matching
  - For Baseline Analysis: Use gpt-5-mini for best accuracy
  - Important: The scorer processes ALL findings in a single LLM call, so a model with sufficient context window is critical
  - Use gpt-4o if you encounter context length errors with very large projects
  - Use --patterns to specify which files to analyze
- Batch Processing:
  # Process multiple projects
  for project in project1 project2 project3; do
    ./run_pipeline.sh --project $project --source sources/$project
  done
- Caching: Results are saved to disk for reprocessing
{
  "project": "vulnerable_vault",
  "files_analyzed": 10,
  "total_findings": 5,
  "findings": [{
    "title": "Reentrancy vulnerability",
    "severity": "high",
    "confidence": 0.95,
    "location": "withdraw() function"
  }]
}
{
  "total_expected": 10,
  "true_positives": 6,
  "detection_rate": 0.6,
  "matched_findings": [{
    "confidence": 1.0,
    "justification": "Perfect match: identical vulnerability"
  }]
}
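Detection rate and F1 can be derived from these fields. The sketch below uses the field names from the example above; precision additionally needs the tool's total reported findings, assumed here to be total_findings in the corresponding results file, and both filenames are hypothetical.
import json

score = json.load(open("scores/score_vulnerable_vault.json"))  # hypothetical score filename
results = json.load(open("results/vulnerable_vault.json"))     # hypothetical results filename

tp = score["true_positives"]
expected = score["total_expected"]
reported = results.get("total_findings", 0)

recall = tp / expected if expected else 0.0        # the detection rate
precision = tp / reported if reported else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")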
MIT License - see LICENSE file for details
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.