VulBench

This is a benchmark for evaluating the vulnerability discovery ability of automated approaches, including Large Language Models (LLMs), deep learning methods, and static analyzers.

This repository contains materials for the paper "How Far Have We Gone in Vulnerability Detection Using Large Language Models". It provides a tool for evaluating vulnerability detection models on datasets such as d2a, ctf, magma, big-vul, and devign. Users can query a model (e.g., Llama-2-7b-chat-hf) through an OpenAI-compatible endpoint and store the results in a SQLite database for analysis. The tool supports binary and multiple classification tasks with configurable concurrency. Afterwards, users can evaluate the results and generate a CSV file with metrics for each dataset and prompt type.

README:

VulBench

This repository contains materials for the paper "How Far Have We Gone in Vulnerability Detection Using Large Language Models".

Usage

Suppose there is an OpenAI-compatible endpoint at http://localhost:8080; you can then use the following command to evaluate the model.

# Set the OPENAI_API_KEY environment variable
export OPENAI_API_KEY=xxx
python3 query.py \
  --datasets d2a ctf magma big-vul devign \
  --db_path ./query_result.db \
  --api_endpoint http://localhost:8080 \
  --model 'Llama-2-7b-chat-hf' \
  --trials 5 \
  --do_binary_classification \
  --do_multiple_classification \
  --concurrency 100

Specifically, this command queries the model Llama-2-7b-chat-hf on the datasets d2a, ctf, magma, big-vul, and devign. The results are stored in the SQLite database query_result.db. The script runs 5 trials for each dataset and sends up to 100 requests concurrently.
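Under the hood, each query amounts to a chat-completion request against the OpenAI-compatible endpoint. The following is a minimal sketch of a single such request using the openai Python package (v1+); the prompt wording and the example function are placeholders for illustration, not the exact prompts used by query.py.

from openai import OpenAI

# Point the client at the local OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="xxx")

# Placeholder prompt and snippet; query.py builds its own prompts per dataset.
snippet = "int f(char *s) { char buf[8]; strcpy(buf, s); return 0; }"
response = client.chat.completions.create(
    model="Llama-2-7b-chat-hf",
    messages=[{
        "role": "user",
        "content": "Is the following function vulnerable? Answer YES or NO.\n" + snippet,
    }],
)
print(response.choices[0].message.content)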

After that, evaluate one or more result databases with the following command.

python3 eval.py ./query_result.db [./query_result_1.db, ./query_result_2.db, ...]

It will generate an all_result.csv file containing the metrics for each dataset and prompt type.
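As a quick sanity check, you can load the CSV in Python; the column names used below (dataset, prompt_type, accuracy) are assumptions for illustration, since the actual header is defined by eval.py.

import csv

# Print a few assumed columns from each row; adjust names to the actual header.
with open("all_result.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row.get("dataset"), row.get("prompt_type"), row.get("accuracy"))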

Extra Data

The compiled binaries (both fixed and unfixed versions, compiled with -g -fno-inline-functions -O2 to better represent the original individual vulnerable functions) and the source code (with MAGMA_ENABLE_FIXES and MAGMA_ENABLE_CANARIES left untouched) are available at VulBench.
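For reference, compiling an individual function with the same flags looks like the following; the file names are placeholders.

# Placeholder file names; the flags match those used for the released binaries.
gcc -g -fno-inline-functions -O2 -c vulnerable_function.c -o vulnerable_function.o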

News

  • [2023/12/18] We release the raw datasets of VulBench.
  • [2024/3/19] We release the code for querying the model and evaluating the results.

BibTeX

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{gao2023far,
  title={How Far Have We Gone in Vulnerability Detection Using Large Language Models},
  author={Gao, Zeyu and Wang, Hao and Zhou, Yuchen and Zhu, Wenyu and Zhang, Chao},
  journal={arXiv preprint arXiv:2311.12420},
  year={2023}
}
