sec-code-bench
SecCodeBench is a benchmark suite focused on evaluating the security of code generated by large language models (LLMs).
Stars: 88
SecCodeBench is a benchmark suite for evaluating the security of AI-generated code, specifically designed for modern Agentic Coding Tools. It addresses challenges in existing security benchmarks by ensuring test case quality, employing precise evaluation methods, and covering Agentic Coding Tools. The suite includes 98 test cases across 5 programming languages, focusing on functionality-first evaluation and dynamic execution-based validation. It offers a highly extensible testing framework for end-to-end automated evaluation of agentic coding tools, generating comprehensive reports and logs for analysis and improvement.
README:
SecCodeBench is a benchmark suite for evaluating the security of AI-generated code, specifically designed for modern Agentic Coding Tools. It is jointly developed by Alibaba Group in collaboration with the Institute for Network Sciences and Cyberspace at Tsinghua University, the School of Cyber Science and Technology at Zhejiang University, Fudan University, and Peking University.
With the proliferation of Large Language Model (LLM)-powered coding assistants, the security of AI-generated code has become a critical concern. To scientifically evaluate the security of AI-generated code, identify its intrinsic flaws, and foster improvements in model security capabilities, a comprehensive and reliable benchmark is essential.
However, existing security benchmarks in the community suffer from significant limitations across three core dimensions, making them inadequate for authentically assessing the secure coding capabilities of models or Agentic Coding Tools:
- Test Case Quality: Many datasets are sourced from open-source repositories, relying heavily on automated generation and simplistic filtering with minimal deep human involvement. This leads to: (a) Data Imbalance, where a large volume of low-priority security issues predominates, failing to effectively measure model performance on critical vulnerabilities; (b) Invalid Test Cases, where some problems are flawed by design (e.g., generating secure code is impossible under the given constraints), causing a systematic underestimation of model capabilities rather than an objective evaluation; and (c) Potential Data Contamination, where the source code of the test cases may have been part of the models' pre-training corpus, thus compromising the fairness of the evaluation.
- Singular and Imprecise Evaluation Methods: Most existing evaluation methods rely on simple regular expressions or static analysis tools. This makes them incapable of accurately identifying syntactically or semantically complex code variants and completely overlooks dynamic vulnerabilities that can only be verified through live execution. More importantly, many methods neglect the importance of functionality, leading to a disconnect between evaluation criteria and real-world usability, and may even favor non-functional "secure" code over correct solutions.
- Failure to Cover Agentic Coding Tools: Real-world programming has evolved to rely on agentic coding tools—intelligent agents capable of autonomously using tools and retrieving knowledge. Existing evaluation paradigms, however, remain stuck at testing atomic API calls. This creates a disconnect between the evaluation paradigm and real-world application scenarios, limiting the practical value of their conclusions.
To address these challenges, we introduce SecCodeBench, a benchmark suite purpose-built for modern Agentic Coding Tools. It ensures evaluation depth and breadth through three core design principles:
- Dataset: We ensure the authenticity and diversity of our test cases. Most of the cases are based on anonymized, real-world historical vulnerabilities from within Alibaba and are presented as complete, runnable projects rather than mere code snippets. Each test case is uniquely defined by four attributes: (Functional Requirements, Programming Language, Third-Party Libraries, Function Interface). Currently, it includes 98 test cases across 5 programming languages (Java: 53, C/C++: 15, Python: 13, Go: 13, Node.js: 4), covering 22 CWE types, adapted into four testing modes: Code Generation (native/security-aware) and Code Fix (native/security-aware). Each test case is crafted by a team of senior security experts and undergoes a rigorous three-person peer review. Furthermore, all cases have been subjected to multiple rounds of empirical testing and fine-tuning across more than ten models to ensure their fairness and challenge.
- Evaluation: We have established a multi-stage, high-precision evaluation process. This process is governed by a "Functionality-First" principle, where generated code must first pass all functional tests to qualify for security assessment. The security evaluation employs a layered strategy: it prioritizes dynamic execution-based validation using Proof-of-Concept (PoC) exploits to ensure objective and reliable results. For complex scenarios not coverable by dynamic execution, we introduce an LLM-as-a-Judge infused with domain-specific security knowledge. The final score is a weighted sum of the pass@1 results, where the weights holistically consider factors such as the test scenario (with a 4:1 ratio for native vs. security-aware modes) and a combined metric of vulnerability prevalence and severity (assigned weights of 4, 2, and 1 for critical, high, and medium tiers, respectively). This scoring mechanism is designed to provide a more authentic reflection of the model's comprehensive security capabilities; a formula sketch follows this list.
- Framework: We provide a highly extensible testing framework. It not only supports standard multi-turn dialogue testing of model APIs but also enables end-to-end automated evaluation of mainstream agentic coding tools (e.g., IDE plugins, CLI tools). Additionally, the framework generates comprehensive, visual reports and logs to facilitate in-depth analysis and model diagnostics, thereby driving continuous improvement in the secure coding capabilities of large models.
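To make the scoring rule described above concrete, the final score can be read as a weighted sum of per-test-case pass@1 results. The following is a sketch in our own notation, not the framework's exact implementation; the authoritative weights and aggregation live in system_config.yaml and the framework's scoring code:

$$
\mathrm{Score} = \sum_{i} w_{\mathrm{scenario}}(i)\; w_{\mathrm{severity}}(i)\; \mathrm{pass@1}_i
$$

where $i$ ranges over test-case/mode combinations, $w_{\mathrm{scenario}}$ is 4 for native modes and 1 for security-aware modes, and $w_{\mathrm{severity}}$ is 4, 2, or 1 for critical-, high-, and medium-tier vulnerabilities, respectively.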
To ensure the reproducibility of our results, we strongly recommend using an official release of this project rather than cloning directly from the main branch.
Clone a specific version of the repository using the following commands:
# Clone the repository
git clone https://github.com/alibaba/sec-code-bench.git
cd sec-code-bench
# Check out the desired version tag
git checkout v2.2.0
Requirements:
- Python: 3.12 or higher
- Docker: 24.0 or higher
Install uv (if not already installed) for project management and dependency synchronization:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Update
uv self update
# Sync dependencies
uv sync
- High Token Consumption Warning: This evaluation framework will incur significant token consumption. Before starting, please ensure your API account has a sufficient balance. Reference case: a single full evaluation of the DeepSeek V3.2 model consumes approximately 22 million tokens in thinking mode, compared to roughly 12 million tokens in non-thinking mode.
- Computational and Time Costs: This is a computationally intensive task. We recommend running it on hardware comparable to or better than the reference below. Performance benchmark: on a 32C128G (32-core, 128 GB RAM) server with unrestricted API concurrency, a full evaluation is estimated to take approximately 3 hours.
Note that the resource consumption and evaluation time will gradually increase as more test cases are added.
Step 1: Configure Parameters
Copy the example configuration file and modify it with your settings:
cp config.example.yaml config.yaml
Edit config.yaml to configure the following fields (a minimal example follows the table):
| Field | Description |
|---|---|
| lang_configs | Language configurations for evaluation; each entry specifies the language and benchmark fields below. |
| lang_configs[].language | Programming language to evaluate (e.g., java, python, cpp, go, nodejs) |
| lang_configs[].benchmark | Path to the benchmark JSON file (e.g., ./datasets/benchmark/java/java.json) |
| eval_llm | The LLM model to be evaluated |
| eval_llm.provider | LLM provider type (e.g., OPENAI for OpenAI-compatible APIs) |
| eval_llm.model | Model name to evaluate (e.g., gpt-4, qwen-plus) |
| eval_llm.api_key | API key for authentication |
| eval_llm.endpoint | API endpoint URL (e.g., https://api.openai.com/v1) |
| judge_llms | Judge LLMs for security evaluation. Must be an odd number (1, 3, 5, etc.) for majority voting. Each entry has the same fields as eval_llm. |
| experiment.cycle | Number of experiment cycles for each test case (default: 10) |
| experiment.parameters | Optional JSON string of parameters to pass to LLM API calls (e.g., '{"enable_thinking": true}') |
| experiment.rpm_limit | Optional RPM (Requests Per Minute) limit for the evaluated LLM models (default: 60) |
| directories.container_result | Path inside the container for results (default: /dockershare). When using Docker, set the host directory via the LOCAL_RESULT_DIR environment variable. |
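For orientation, a filled-in config.yaml might look roughly like the sketch below. It is illustrative only: start from config.example.yaml for the authoritative structure, treat the model names, API key, and endpoint as placeholders, and note that the exact nesting (e.g., judge_llms as a list of entries) is inferred from the dotted field names in the table above.

```yaml
# Illustrative sketch based on the field table above; config.example.yaml is authoritative.
lang_configs:
  - language: java
    benchmark: ./datasets/benchmark/java/java.json
  - language: python
    benchmark: ./datasets/benchmark/python/python.json

eval_llm:
  provider: OPENAI                       # OpenAI-compatible API
  model: qwen-plus                       # model under evaluation (placeholder)
  api_key: your-api-key                  # placeholder
  endpoint: https://api.openai.com/v1

judge_llms:                              # use an odd number of entries (1, 3, 5, ...) for majority voting
  - provider: OPENAI
    model: judge-model-1                 # placeholder
    api_key: your-api-key
    endpoint: https://api.openai.com/v1

experiment:
  cycle: 10                              # runs per test case
  parameters: '{"enable_thinking": true}'
  rpm_limit: 60

directories:
  container_result: /dockershare         # host-side directory is set via LOCAL_RESULT_DIR
```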
Step 2: (Optional) Modify System Configuration
If needed, you can modify system_config.yaml to adjust:
- category_weights: Score weights for different severity levels (low, medium, high, critical)
- scenario_weights: Weights for different test scenarios (gen, gen-hints, fix, fix-hints)
- languages_need_llm_judges: Languages that require LLM judges
In most cases, you don't need to modify this file.
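Should you need to, the file's shape is roughly as sketched below. This is an illustrative guess assembled from the option names above and the weight ratios described in the Evaluation section (assuming the "-hints" scenarios correspond to security-aware mode); the values shipped in system_config.yaml are authoritative.

```yaml
# Illustrative sketch only; the shipped system_config.yaml is authoritative.
category_weights:           # score weights per severity tier
  critical: 4
  high: 2
  medium: 1
  low: 1                    # assumed: the low-tier weight is not stated in this README
scenario_weights:           # 4:1 for native vs. security-aware scenarios
  gen: 4
  fix: 4
  gen-hints: 1
  fix-hints: 1
languages_need_llm_judges:  # languages whose checks fall back to LLM judges
  - java                    # assumed example entry
```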
Step 3: Start Verifiers
Start the shared verifier services (only need to start once, can be reused across multiple evaluations):
docker compose -f docker-compose-verifiers.yml up -d --build
Wait for all verifiers to become healthy:
docker compose -f docker-compose-verifiers.yml ps
Step 4: Run Evaluation
Run the evaluation:
docker compose -f docker-compose-eval.yml up -d
You can monitor the evaluation progress by checking the logs:
docker compose -f docker-compose-eval.yml logs -f
Completion Indicator: The evaluation is complete when a finish file is generated in the output directory at {result_dir}/finish, where result_dir is under the directory set by LOCAL_RESULT_DIR (Docker) or --log-dir (native run).
You can run multiple evaluations simultaneously by using different project names and config files. All evaluations share the same verifier containers.
Create separate config files for each model:
cp config.example.yaml config-gpt4.yaml
cp config.example.yaml config-claude.yaml
# Edit each file with different model settings
Run the evaluations in parallel using different Docker Compose project names (-p):
# Terminal 1
CONFIG_FILE=./config-gpt4.yaml docker compose -f docker-compose-eval.yml -p eval-gpt4 up -d
# Terminal 2
CONFIG_FILE=./config-claude.yaml docker compose -f docker-compose-eval.yml -p eval-claude up -d
When all evaluations are complete, stop the verifier services:
docker compose -f docker-compose-verifiers.yml down
Results are saved under the directory set by LOCAL_RESULT_DIR when using Docker (e.g. LOCAL_RESULT_DIR=/path/to/results docker compose ...), or by --log-dir when running natively.
The following agentic coding tools are currently supported for end-to-end evaluation (selected via the --editor parameter):
| Agentic Coding Tool | Type | --editor Parameter |
|---|---|---|
| Claude Code | CLI | claude-code |
| Qwen Code | CLI | qwen-code |
| Codex | CLI | codex |
| Gemini CLI | CLI | gemini |
| Cursor CLI | CLI | cursor |
- Update to Latest Versions: Ensure all CLI tools to be tested are updated to their latest official versions.
- Prepare API Account: Ensure your configured LLM API account has a sufficient balance to cover the high Token consumption during evaluation.
- Authorize Automated Execution: Pre-authorize the CLI tool to automatically execute terminal commands. Settings vary by tool, so please refer to the respective documentation.
- CLI Tools: A high-concurrency testing mode is supported; adjust the number of concurrent threads based on machine performance.
- Large-Scale Testing Strategy: For full-scale evaluations, you can partition the test cases using the -p parameter and run them in parallel across multiple machines to significantly reduce the total evaluation time.
Step 1: Start Verifier Services
Start the language verifier containers using Docker Compose:
docker compose -f docker-compose-verifiers.yml up -d --build
This will start verifier services for all supported languages (C/C++, Python, Go, Node.js, Java) with port mappings for local access.
Step 2: Run E2E Evaluation
Execute the evaluation command:
uv run -m sec_code_bench.e2e \
--editor claude-code \
--lang-config java:en-US:./datasets/benchmark/java/java.json \
--lang-config go:en-US:./datasets/benchmark/go/go.json \
--lang-config cpp:en-US:./datasets/benchmark/cpp/c.json \
--lang-config python:en-US:./datasets/benchmark/python/python.json \
--lang-config nodejs:en-US:./datasets/benchmark/nodejs/nodejs.json \
--judge-llm-list \
'OPENAI::judge-model-1::your-api-key::https://api.openai.com/v1' \
'OPENAI::judge-model-2::your-api-key::https://api.openai.com/v1' \
'OPENAI::judge-model-3::your-api-key::https://api.openai.com/v1' \
--threads 2 \
--experiment-cycle 1
Completion Indicator: The evaluation is complete when a finish file is generated in the output directory at {result_dir}/{model_name}/{date}/{time}/finish.
Step 3: Stop Verifier Services (when done)
Stop the Docker containers to free up resources:
docker compose -f docker-compose-verifiers.yml down
The e2e module accepts the following arguments:

| Argument | Description |
|---|---|
| --editor, -e | (Required) The CLI tool to evaluate (e.g., claude-code, qwen-code) |
| --lang-config | (Required) Per-language configuration in the format language:locale:benchmark_path. Can be specified multiple times for multi-language evaluation. Example: java:en-US:./datasets/benchmark/java/java.json |
| --judge-llm-list | Judge LLMs provided as PROVIDER::MODEL::API_KEY::BASE_URL. Can be specified multiple times. Must be an odd number for majority voting. |
| --experiment-cycle | Number of experiment cycles for each test case (default: 10) |
| --threads | Number of worker threads for parallel execution (default: 1) |
| --batch-size | Batch size for processing test cases (default: 15) |
| --prompt, -p | Filter test cases: use a range like 0-4 for indices, or a string for exact/partial key matching. Empty means all test cases. |
| --prepare, -f | Call the prepare method of the editor before execution |
| --debug | Enable debug mode: save debug snapshots on exceptions |
| --log-level | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO) |
| --log-dir | Log directory path (default: ./logs/) |
We are committed to making SecCodeBench a continuously evolving, vibrant security benchmark. We welcome you to create Issues to discuss new features or propose suggestions!
Thanks to all the developers who have contributed to this project!
This project is licensed under the Apache 2.0 license.
Alternative AI tools for sec-code-bench
Similar Open Source Tools
RA.Aid
RA.Aid is an AI software development agent powered by `aider` and advanced reasoning models like `o1`. It combines `aider`'s code editing capabilities with LangChain's agent-based task execution framework to provide an intelligent assistant for research, planning, and implementation of multi-step development tasks. It handles complex programming tasks by breaking them down into manageable steps, running shell commands automatically, and leveraging expert reasoning models like OpenAI's o1. RA.Aid is designed for everyday software development, offering features such as multi-step task planning, automated command execution, and the ability to handle complex programming tasks beyond single-shot code edits.
code2prompt
Code2Prompt is a powerful command-line tool that generates comprehensive prompts from codebases, designed to streamline interactions between developers and Large Language Models (LLMs) for code analysis, documentation, and improvement tasks. It bridges the gap between codebases and LLMs by converting projects into AI-friendly prompts, enabling users to leverage AI for various software development tasks. The tool offers features like holistic codebase representation, intelligent source tree generation, customizable prompt templates, smart token management, Gitignore integration, flexible file handling, clipboard-ready output, multiple output options, and enhanced code readability.
Biomni
Biomni is a general-purpose biomedical AI agent designed to autonomously execute a wide range of research tasks across diverse biomedical subfields. By integrating cutting-edge large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, Biomni helps scientists dramatically enhance research productivity and generate testable hypotheses.
EasyInstruct
EasyInstruct is a Python package proposed as an easy-to-use instruction processing framework for Large Language Models (LLMs) like GPT-4, LLaMA, ChatGLM in your research experiments. EasyInstruct modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.
rag-chatbot
The RAG ChatBot project combines Lama.cpp, Chroma, and Streamlit to build a Conversation-aware Chatbot and a Retrieval-augmented generation (RAG) ChatBot. The RAG Chatbot works by taking a collection of Markdown files as input and provides answers based on the context provided by those files. It utilizes a Memory Builder component to load Markdown pages, divide them into sections, calculate embeddings, and save them in an embedding database. The chatbot retrieves relevant sections from the database, rewrites questions for optimal retrieval, and generates answers using a local language model. It also remembers previous interactions for more accurate responses. Various strategies are implemented to deal with context overflows, including creating and refining context, hierarchical summarization, and async hierarchical summarization.
premsql
PremSQL is an open-source library designed to help developers create secure, fully local Text-to-SQL solutions using small language models. It provides essential tools for building and deploying end-to-end Text-to-SQL pipelines with customizable components, ideal for secure, autonomous AI-powered data analysis. The library offers features like Local-First approach, Customizable Datasets, Robust Executors and Evaluators, Advanced Generators, Error Handling and Self-Correction, Fine-Tuning Support, and End-to-End Pipelines. Users can fine-tune models, generate SQL queries from natural language inputs, handle errors, and evaluate model performance against predefined metrics. PremSQL is extendible for customization and private data usage.
evalverse
Evalverse is an open-source project designed to support Large Language Model (LLM) evaluation needs. It provides a standardized and user-friendly solution for processing and managing LLM evaluations, catering to AI research engineers and scientists. Evalverse supports various evaluation methods, insightful reports, and no-code evaluation processes. Users can access unified evaluation with submodules, request evaluations without code via Slack bot, and obtain comprehensive reports with scores, rankings, and visuals. The tool allows for easy comparison of scores across different models and swift addition of new evaluation tools.
graphiti
Graphiti is a framework for building and querying temporally-aware knowledge graphs, tailored for AI agents in dynamic environments. It continuously integrates user interactions, structured and unstructured data, and external information into a coherent, queryable graph. The framework supports incremental data updates, efficient retrieval, and precise historical queries without complete graph recomputation, making it suitable for developing interactive, context-aware AI applications.
BoxPwnr
BoxPwnr is a tool designed to test the performance of different agentic architectures using Large Language Models (LLMs) to autonomously solve HackTheBox machines. It provides a plug and play system with various strategies and platforms supported. BoxPwnr uses an iterative process where LLMs receive system prompts, suggest commands, execute them in a Docker container, analyze outputs, and repeat until the flag is found. The tool automates commands, saves conversations and commands for analysis, and tracks usage statistics. With recent advancements in LLM technology, BoxPwnr aims to evaluate AI systems' reasoning capabilities, creative thinking, security understanding, problem-solving skills, and code generation abilities.
basic-memory
Basic Memory is a tool that enables users to build persistent knowledge through natural conversations with Large Language Models (LLMs) like Claude. It uses the Model Context Protocol (MCP) to allow compatible LLMs to read and write to a local knowledge base stored in simple Markdown files on the user's computer. The tool facilitates creating structured notes during conversations, maintaining a semantic knowledge graph, and keeping all data local and under user control. Basic Memory aims to address the limitations of ephemeral LLM interactions by providing a structured, bi-directional, and locally stored knowledge management solution.
fraim
Fraim is an AI-powered toolkit designed for security engineers to enhance their workflows by leveraging AI capabilities. It offers solutions to find, detect, fix, and flag vulnerabilities throughout the development lifecycle. The toolkit includes features like Risk Flagger for identifying risks in code changes, Code Security Analysis for context-aware vulnerability detection, and Infrastructure as Code Analysis for spotting misconfigurations in cloud environments. Fraim can be run as a CLI tool or integrated into Github Actions, making it a versatile solution for security teams and organizations looking to enhance their security practices with AI technology.
TokenFormer
TokenFormer is a fully attention-based neural network architecture that leverages tokenized model parameters to enhance architectural flexibility. It aims to maximize the flexibility of neural networks by unifying token-token and token-parameter interactions through the attention mechanism. The architecture allows for incremental model scaling and has shown promising results in language modeling and visual modeling tasks. The codebase is clean, concise, easily readable, state-of-the-art, and relies on minimal dependencies.
py-llm-core
PyLLMCore is a light-weighted interface with Large Language Models with native support for llama.cpp, OpenAI API, and Azure deployments. It offers a Pythonic API that is simple to use, with structures provided by the standard library dataclasses module. The high-level API includes the assistants module for easy swapping between models. PyLLMCore supports various models including those compatible with llama.cpp, OpenAI, and Azure APIs. It covers use cases such as parsing, summarizing, question answering, hallucinations reduction, context size management, and tokenizing. The tool allows users to interact with language models for tasks like parsing text, summarizing content, answering questions, reducing hallucinations, managing context size, and tokenizing text.
ProX
ProX is a lm-based data refinement framework that automates the process of cleaning and improving data used in pre-training large language models. It offers better performance, domain flexibility, efficiency, and cost-effectiveness compared to traditional methods. The framework has been shown to improve model performance by over 2% and boost accuracy by up to 20% in tasks like math. ProX is designed to refine data at scale without the need for manual adjustments, making it a valuable tool for data preprocessing in natural language processing tasks.
FlexFlow
FlexFlow Serve is an open-source compiler and distributed system for **low latency**, **high performance** LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.
For similar tasks
evalverse
Evalverse is an open-source project designed to support Large Language Model (LLM) evaluation needs. It provides a standardized and user-friendly solution for processing and managing LLM evaluations, catering to AI research engineers and scientists. Evalverse supports various evaluation methods, insightful reports, and no-code evaluation processes. Users can access unified evaluation with submodules, request evaluations without code via Slack bot, and obtain comprehensive reports with scores, rankings, and visuals. The tool allows for easy comparison of scores across different models and swift addition of new evaluation tools.
TrustEval-toolkit
TrustEval-toolkit is a dynamic and comprehensive framework for evaluating the trustworthiness of Generative Foundation Models (GenFMs) across dimensions such as safety, fairness, robustness, privacy, and more. It offers features like dynamic dataset generation, multi-model compatibility, customizable metrics, metadata-driven pipelines, comprehensive evaluation dimensions, optimized inference, and detailed reports.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.







