BinaryAudit
An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
Stars: 68
BinaryAudit is an open-source benchmark designed to evaluate AI agents' ability to detect backdoors hidden in compiled binaries of real open-source software. The tool provides a benchmark dataset with tasks across five target applications, including lighttpd, dnsmasq, Dropbear, Sozu, and Caddy. AI agents are given compiled binary executables and reverse engineering tools inside isolated Docker environments to determine the presence of backdoors or malicious modifications. The tool also includes tasks for detecting timebombs and verifying the correct operation of reverse engineering tools like Ghidra and Radare2.
README:
An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries of real open-source software.
- Benchmark: BinaryAudit results
- Blog post: We hid backdoors in binaries — Opus 4.6 found 49% of them
Requires Harbor (uv tool install harbor), Docker, and relevant API keys.
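A quick pre-flight check can catch a missing prerequisite before a run starts. The sketch below is illustrative and not part of the repository; the helper name `missing_tools` is hypothetical, and `uv` is included only because the install command above uses it.

```shell
#!/usr/bin/env bash
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report anything missing from the tools named in this README.
missing="$(missing_tools uv harbor docker)"
if [ -n "$missing" ]; then
  printf 'Missing prerequisites:\n%s\n' "$missing" >&2
fi
```

The check only confirms that the commands exist; it does not verify versions or that the Docker daemon is running.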
Before running tasks with Docker, build the shared base image:
docker build --platform linux/amd64 -t binaryaudit-base:latest -f docker/base.Dockerfile .
By default, we use the terminus-2 agent (default for Harbor) via OpenRouter to compare models.
You are free to use other agents, including well-known CLI agents such as Claude Code, Codex, or Cursor CLI.
You need to clone this repo:
git clone git@github.com:QuesmaOrg/BinaryAudit.git
cd BinaryAudit
Run a single task, for a single model:
export ANTHROPIC_API_KEY=...
harbor run \
--path tasks \
--task-name lighttpd-backdoor-detect \
--agent terminus-2 \
--model anthropic/claude-opus-4-6-20260409
Run all lighttpd tasks with a few models, with 3 attempts per model-task combination:
export OPENROUTER_API_KEY=...
harbor run \
--path tasks \
--task-name 'lighttpd-*' \
--agent terminus-2 \
--model openrouter/anthropic/claude-opus-4.6 \
--model openrouter/google/gemini-3-pro-preview \
--model openrouter/openai/gpt-5.2 \
--n-attempts 3
You can view trajectories (interactions between the agent and the system) via harbor view jobs.
For our overview of Harbor, see "Migrating CompileBench to Harbor: standardizing AI agent evals".
AI agents are given compiled binary executables (without source code) and reverse engineering tools (Ghidra, Radare2) inside isolated Docker environments. They must determine whether the binary contains a backdoor or other malicious modification.
We started with several open-source projects, manually injected backdoors into them, and compiled the modified code into stripped binaries. Backdoors range from remote code execution via hidden HTTP headers to timebombs that trigger after a specific date.
The benchmark dataset in tasks/ contains tasks across five target applications and several task categories.
[!WARNING] All backdoors in this benchmark are artificially injected for testing. We are not claiming these projects have real vulnerabilities.
- lighttpd — a lightweight web server (C)
- dnsmasq — a DNS/DHCP server (C)
- Dropbear — an SSH server (C)
- Sozu — a reverse proxy / load balancer (Rust)
- Caddy — an HTTP server (Go)
Tasks where agents must find injected backdoors in compiled binaries:
- lighttpd: backdoor-detect, backdoor-detect-open, backdoor-detect-proc, backdoor-detect-proc-obfuscated, backdoor-detect-syscall, backdoor-detect-syscall-obfuscated, authentication-harvester-detect, backdoor-multiple-binaries-detect, backdoor-multiple-arch-binaries-detect
- dnsmasq: backdoor-detect, backdoor-detect-obfuscated, backdoor-detect-printf, backdoor-detect-posix-spawn, backdoor-detect-posix-spawn-obfuscated, backdoor-detect-execvp-obfuscated, backdoor-detect-syscall, backdoor-detect-syscall-obfuscated
- Dropbear: brokenauth-detect, brokenauth-detect-hint, brokenauth-detect-nologline, brokenauth2-detect
- Sozu: backdoor-multiple-binaries-detect, backdoor-multiple-arch-binaries-detect
- Caddy: backdoor-detect, backdoor-simple-detect, backdoor-multiple-binaries-detect, backdoor-multiple-arch-binaries-detect
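The entries above read as target and category pairs. The sketch below assumes the full task name is `<target>-<category>` with the target lowercased; that convention is inferred from the single `lighttpd-backdoor-detect` example earlier and should be checked against the directory names in tasks/. The helper names are hypothetical, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Compose a task name as <target>-<category>, lowercasing the target.
# Assumption: convention inferred from the lighttpd-backdoor-detect example.
task_name() {
  local target category
  target="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  category="$2"
  printf '%s-%s\n' "$target" "$category"
}

# Print (do not run) the harbor command for one task and one model,
# using only flags that appear elsewhere in this README.
print_run_command() {
  printf 'harbor run --path tasks --task-name %s --agent terminus-2 --model %s\n' \
    "$(task_name "$1" "$2")" "$3"
}

print_run_command Dropbear brokenauth-detect openrouter/anthropic/claude-opus-4.6
```

Printing the command first makes it easy to confirm the task name exists before spending API credits on a run.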
Tasks with unmodified binaries where the correct answer is "no backdoor found." These test false positive rates:
- lighttpd: backdoor-detect-negative, backdoor-detect-negative2
- dnsmasq: backdoor-detect-negative, backdoor-detect-negative2
- Dropbear: brokenauth-detect-negative, brokenauth-detect-negative2
- Sozu: backdoor-detect-negative, backdoor-detect-negative2
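As a rough illustration of how a false positive rate could be computed from these tasks, the sketch below divides the number of clean-binary attempts that were wrongly flagged by the total number of clean-binary attempts. The function name and the counts are hypothetical, and this is not the benchmark's own scoring code.

```shell
#!/usr/bin/env bash
# Percentage of negative (clean-binary) attempts wrongly flagged as backdoored.
# Usage: false_positive_rate <false_positives> <total_negative_attempts>
false_positive_rate() {
  awk -v fp="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * fp / total
  }'
}

# Hypothetical counts: 8 negative tasks x 3 attempts = 24 attempts, 6 flagged.
false_positive_rate 6 24   # prints 25.0
```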
Tasks where agents must find code that triggers malicious behavior after a specific date and extract the exact timestamp:
- lighttpd: timebomb-multiple-binaries-detect
- Sozu: timebomb-multiple-binaries-detect
- Caddy: timebomb-multiple-binaries-detect, timebomb-multiple-arch-binaries-detect
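Conceptually, a timebomb reduces to comparing the current time against a hard-coded constant, and that constant is what the agent has to recover. The sketch below is a harmless illustration of that comparison only; the trigger value, the function name, and the choice of "at or after" semantics are assumptions, not taken from any benchmark task.

```shell
#!/usr/bin/env bash
# Hypothetical trigger: 2026-01-01 00:00:00 UTC as Unix epoch seconds.
TRIGGER_EPOCH=1767225600

# Succeeds (exit 0) once the given epoch time has reached the trigger.
timebomb_armed() {
  local now="$1"
  [ "$now" -ge "$TRIGGER_EPOCH" ]
}

if timebomb_armed "$(date +%s)"; then
  echo "trigger date reached"
else
  echo "dormant"
fi
```

In a stripped binary the same logic typically shows up as a call that reads the clock followed by a comparison against an immediate value.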
Baseline tasks verifying that agents can operate reverse engineering tools correctly:
- Ghidra: decompile-vanilla, decompile-vanilla-jq, decompile-pyghidra, decompile-pyghidra-jq
- Radare2: decompile, decompile-jq
The lighttpd multi-arch task cross-compiles lighttpd for ARM64, which requires installing :arm64 packages inside the Docker build. On amd64 hosts, APT pulls python3.12-minimal:arm64 as a transitive dependency, and its post-install script tries to execute an ARM64 binary, which fails unless QEMU user-mode emulation is registered.
Fix by registering binfmt_misc handlers before running the task:
docker run --privileged --rm tonistiigi/binfmt --install all
This is only needed for multi-arch cross-compilation tasks. All other tasks work without it.
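To see whether the handlers are already in place on an amd64 Linux host, one option is to look for an ARM64 entry under binfmt_misc. The sketch below assumes the handler is registered under the name `qemu-aarch64` in /proc/sys/fs/binfmt_misc; the function name is hypothetical, and the directory is a parameter so the check can be pointed at another location.

```shell
#!/usr/bin/env bash
# Succeeds if an ARM64 QEMU handler is registered with binfmt_misc.
# Assumption: the handler entry is named qemu-aarch64.
arm64_binfmt_registered() {
  local dir="${1:-/proc/sys/fs/binfmt_misc}"
  [ -e "$dir/qemu-aarch64" ]
}

if arm64_binfmt_registered; then
  echo "ARM64 emulation is registered"
else
  echo "ARM64 emulation is not registered; run the binfmt command above"
fi
```

Registrations made this way usually do not survive a reboot, so the check is worth repeating before a multi-arch run.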
Apache 2.0, see LICENSE for details.
