CJA_Comprehensive_Jailbreak_Assessment

This is the public code repository of the paper 'Comprehensive Assessment of Jailbreak Attacks Against LLMs'.

This repository accompanies the paper 'Comprehensive Assessment of Jailbreak Attacks Against LLMs'. It provides a Python labeling method for scoring post-jailbreak responses and lets you submit evaluation results to the leaderboard. The full code will be released after the paper is accepted.

README:

Comprehensive Assessment of Jailbreak Attacks

Website: online | Dataset: released

This is the public repository of the paper Comprehensive Assessment of Jailbreak Attacks Against LLMs.
Future updates will be released on the official repository first.

Be careful! This repository may contain harmful/offensive responses. Please use it responsibly.

How to use this repository?

Install and set up the environment

  1. Clone this repository.
  2. Prepare the Python environment:
conda create -n CJA python=3.10
conda activate CJA
cd PATH_TO_THE_REPOSITORY
pip install -r requirements.txt

Label - use our labeling method to label the post-jailbreak responses.

Option 1: label a single file

  1. Switch directory:
cd ./scripts_label
  2. Label a single file:
python label.py \
--model_name gpt-4 --test_mode False \
--start_line 0 \
--raw_questions_path "$QUESTIONS" \
--results_path "$file"

$QUESTIONS is the path to the forbidden questions (ideally a .csv file; see ./forbidden_questions/forbidden_questions.csv for an example).
$file is the path to the LLM responses collected after the jailbreak; it should be a .json file, which can be generated by the following code.

import json

answers = []
answers.append({'response': answer})  # answer: the target LLM's post-jailbreak response
# Write into the output file
with open(output_file, 'w') as out_file:
    json.dump(answers, out_file, indent=4)

Note that answer is the response produced by the target LLM under the jailbreak attack.
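
Putting this together, the results file is a .json list of {'response': ...} entries (typically one per forbidden question). The sketch below is illustrative only: query_target_llm and the path/prompt strings are hypothetical placeholders for your own jailbreak pipeline, not part of this repository.

import json

def query_target_llm(adv_prompt):
    # Hypothetical placeholder: replace with your own jailbreak pipeline.
    return "TARGET_LLM_RESPONSE"

adv_prompts = ["ADV_PROMPT_1", "ADV_PROMPT_2"]   # placeholder adversarial prompts
output_file = "adv_basic_responses.json"         # placeholder output path

# Collect one response entry per prompt, in the list-of-dicts format shown above.
answers = []
for adv_prompt in adv_prompts:
    answers.append({'response': query_target_llm(adv_prompt)})

with open(output_file, 'w') as out_file:
    json.dump(answers, out_file, indent=4)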

Option 2: label files in a directory
You may also use label.sh to label the files in a directory:

bash label.sh PATH_TO_RESPONSES_DIRECTORY

The label files will be saved to the same directory as the jailbreak responses.
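
If you prefer Python, the loop below approximates what label.sh does. This is a minimal sketch, not the repository's script; it assumes it is run from ./scripts_label and that label.py accepts the flags shown in Option 1.

import glob
import subprocess

responses_dir = "PATH_TO_RESPONSES_DIRECTORY"    # placeholder
questions = "PATH_TO_FORBIDDEN_QUESTIONS_CSV"    # e.g., the forbidden_questions .csv mentioned above

# Label every .json response file in the directory, one label.py run per file.
for file in sorted(glob.glob(f"{responses_dir}/*.json")):
    subprocess.run([
        "python", "label.py",
        "--model_name", "gpt-4", "--test_mode", "False",
        "--start_line", "0",
        "--raw_questions_path", questions,
        "--results_path", file,
    ], check=True)
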
NOTE: We have omitted the harmful responses related to this project, e.g., the few-shot examples in scripts_label/label.py. Feel free to use your own examples.

Defense - use our defense scripts to detect the jailbreak prompts (adv prompts).

  1. Switch directory:
cd ./scripts_defense
  2. Execute the defense:
bash ./defense_execute.sh DEFENSE_METHOD PATH_TO_YOUR_ADV_PROMPTS_FOLDER

Currently, seven defense methods are supported (refer to ./scripts_defense/defense_execute.sh for details).

The adv prompts folder should follow this structure:

example_adv_prompts
└─ adv_basic.json

The .json file can be generated by the following code:

import json

adv_prompts = [prompt_1, prompt_2, ...]  # a list of adversarial prompt strings
json_file = OUTPUT_PATH                  # path to write the .json file
with open(json_file, 'w') as outfile:
    json.dump(adv_prompts, outfile, indent=4)

Refer to folder ./example_adv_prompts for an example.
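
The folder layout can also be prepared programmatically. This is a minimal illustrative sketch (the prompt strings are placeholders), not part of the repository's scripts.

import json
import os

adv_prompts = ["ADV_PROMPT_1", "ADV_PROMPT_2"]   # placeholders for your adversarial prompts

# Create the folder structure shown above: example_adv_prompts/adv_basic.json
os.makedirs("example_adv_prompts", exist_ok=True)
with open("example_adv_prompts/adv_basic.json", "w") as outfile:
    json.dump(adv_prompts, outfile, indent=4)

# Then run the defense, e.g.:
#   bash ./defense_execute.sh DEFENSE_METHOD PATH_TO_YOUR_ADV_PROMPTS_FOLDER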

Add new results to the leaderboard.

You are welcome to submit your own evaluation results (steps = 50) of jailbreak attacks. The leaderboard is available here.

The full code will be released after the paper is accepted.

TO DO

  • [ ] Check the env file requirements.txt.
  • [ ] Test the guide in the README.md.
  • [ ] Clean up the code/comments.
