alignment-attribution-code
Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
This repository provides an original implementation of Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications. It includes tools for neuron-level pruning, pruning based on set difference, Wanda/SNIP score dumping, rank-level pruning, and rank removal with orthogonal projection. Users can specify parameters like prune method, datasets, sparsity ratio, model, and save location to evaluate and modify neural networks for safety alignment.
README:
This repository provides an original implementation of Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications by Boyi Wei*, Kaixuan Huang*, Yangsibo Huang*, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang and Peter Henderson.
You can use the following command to create the conda environment:
conda env create -f environment.yml
Note that you need to specify your own environment path inside environment.yml.
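As a hedged illustration (the path below is a placeholder, not taken from this repository), the `prefix:` field at the end of environment.yml should point at your own conda installation, for example:
prefix: /path/to/your/anaconda3/envs/<your-env-name>
Adjust it before running `conda env create`.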
You also need to manually install a hacked version of lm_eval to support evaluating the pruned model; see the wanda repository for details.
There are known issues with the transformers library around loading the LLaMA tokenizer correctly; please follow the linked suggestions to resolve this issue.
Before running experiments, make sure you have specified the path to the model stored locally.
The main entry point for neuron-level pruning is `main.py`. When using top-down pruning, add `--neg_prune` to the command line.
Important parameters are:
- `--prune_method`: specifies the pruning method. Available options are `wanda`, `wandg` (SNIP in the paper), and `random` (the scores these methods compute are summarized after this list).
- `--prune_data`: specifies the dataset used for pruning. When top-down pruning safety-critical neurons, use `align` (safety-full in the paper) or `align_short` (safety-short in the paper).
- `--sparsity_ratio`: specifies the pruning sparsity.
- `--eval_zero_shot`: whether to evaluate the model's zero-shot accuracy after pruning.
- `--eval_attack`: whether to evaluate the model's attack success rate (ASR) after pruning.
- `--save`: specifies the save location.
- `--model`: specifies the model. Currently only `llama2-7b-chat-hf` and `llama2-13b-chat-hf` are supported.
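As background (our summary of the underlying methods, not code taken from this repository): Wanda scores each weight as $|W_{ij}| \cdot \|X_j\|_2$, the weight magnitude scaled by the norm of the corresponding input activation on the pruning dataset, while SNIP (`wandg`) scores each weight as $|W_{ij} \cdot \nabla_{W_{ij}} \mathcal{L}|$, the magnitude of the weight times its gradient on that dataset; `random` assigns random scores as a baseline.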
Example: using `llama2-7b-chat-hf`, prune 50% of the weights with the safety-full dataset.
model="llama2-7b-chat-hf"
method="wanda"
type="unstructured"
suffix="weightonly"
save_dir="out/$model/$type/${method}_${suffix}/align/"
python main.py \
--model $model \
--prune_method $method \
--prune_data align \
--sparsity_ratio 0.5 \
--sparsity_type $type \
--neg_prune \
--save $save_dir \
--eval_zero_shot \
--eval_attack \
--save_attack_res
Simply removing `--neg_prune` will reverse the pruning order. We recommend using `align_short` (safety-short in our paper) when pruning the least safety-critical neurons to obtain more pronounced results.
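For instance, a minimal sketch of such a run (mirroring the example above, with `align_short` and without `--neg_prune`; the output directory is an illustrative placeholder):
model="llama2-7b-chat-hf"
method="wanda"
type="unstructured"
suffix="weightonly"
save_dir="out/$model/$type/${method}_${suffix}/align_short/" # illustrative output path
python main.py \
--model $model \
--prune_method $method \
--prune_data align_short \
--sparsity_ratio 0.5 \
--sparsity_type $type \
--save $save_dir \
--eval_zero_shot \
--eval_attack \
--save_attack_res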
To prune based on set difference, select the option `--prune_method` as `wandg_set_difference` (SNIP with set difference in our paper). Add the option `--p`, which corresponds to the top-p scored entries in the alpaca_no_safety-based wandg score, and the option `--q`, which corresponds to the top-q scored entries in the aligned-based wandg score. Note that you still have to specify a non-zero value for `--sparsity_ratio`. The utility importance score is computed on `alpaca_cleaned_no_safety` by default; `--prune_data` here specifies the dataset used to compute the safety importance score, with available options `align` (safety-full in our paper) and `align_short` (safety-short in our paper).
Example: prune the set difference between the top-10% utility-critical neurons (identified with the alpaca_cleaned_no_safety dataset) and the top-10% safety-critical neurons (identified with safety-full).
model="llama2-7b-chat-hf"
method="wandg_set_difference"
type="unstructured"
suffix="weightonly"
save_dir="out/$model/$type/wandg_set_difference_{$suffix}"
python main.py \
--model $model \
--prune_method $method \
--sparsity_ratio 0.5 \
--prune_data align \
--p 0.1 \
--q 0.1 \
--sparsity_type $type \
--save $save_dir \
--eval_zero_shot \
--eval_attack \
--save_attack_res
To dump the Wanda/SNIP importance scores, simply add the option `--dump_wanda_score` to the command.
Example: dump the scores for safety-first pruning with the align dataset on `llama2-7b-chat-hf`:
model="llama2-7b-chat-hf"
method="wanda"
type="unstructured"
suffix="weightonly"
save_dir="out/$model/$type/${method}_${suffix}/align/"
python main.py \
--model $model \
--prune_method $method \
--prune_data align \
--sparsity_ratio 0.5 \
--sparsity_type $type \
--save $save_dir \
--dump_wanda_score
For rank-level pruning, the main entry point is `main_low_rank.py`. Most of the parameters are similar to the neuron-pruning setting.
Important parameters are:
- `--prune_method`: specifies the pruning method; in this case we choose `low_rank`, which corresponds to ActSVD in our paper.
- `--prune_data`: specifies the dataset used to identify the safety/utility projection matrix. Available options are `align` (safety-full), `align_short` (safety-short), and `alpaca_cleaned_no_safety` (the filtered alpaca_cleaned dataset).
- `--rank`: determines how many ranks to remove.
- `--top_remove`: determines whether to remove the most-critical or the least-critical ranks. If set, the top (most critical) ranks are removed.
Example: prune the top-10 safety-critical ranks based on the safety-full (`align` in the code) dataset.
model="llama2-7b-chat-hf"
method="low_rank"
type="unstructured"
suffix="weightonly"
save_dir="out/$model/$type/${method}_${suffix}/align/"
python main_low_rank.py \
--model $model \
--prune_method $method \
--prune_data align \
--rank 10 \
--top_remove \
--save $save_dir \
--eval_zero_shot \
--eval_attack \
--save_attack_res
To remove the least safety-critical ranks, proceed as above, but do not add `--top_remove` to the command line.
Example: remove the bottom-1000 safety-critical ranks based on the safety-short (`align_short` in the code) dataset.
model="llama2-7b-chat-hf"
method="low_rank"
type="unstructured"
save_dir="out/$model/$type/${method}/align_short/"
python main_low_rank.py \
--model $model \
--prune_method $method \
--prune_data align_short \
--rank 1000 \
--save $save_dir \
--eval_zero_shot \
--eval_attack \
--save_attack_res
For rank removal with orthogonal projection, the main entry point is `main_low_rank_diff.py`.
Important parameters are:
- `--prune_method`: specifies the rank-removal method; here we use `low_rank_diff`, which corresponds to ActSVD with orthogonal projection in the paper.
- `--rank_pos`: specifies $r^u$ in the paper.
- `--rank_neg`: specifies $r^s$ in the paper.
- `--prune_data_pos`: the dataset used to determine the utility projection matrix; we use `alpaca_cleaned_no_safety`.
- `--prune_data_neg`: the dataset used to determine the safety projection matrix; we recommend using `align`.
Example: prune `llama2-7b-chat-hf` based on a rank-3000 utility projection matrix and a rank-4000 safety projection matrix, computed on `alpaca_cleaned_no_safety` (the filtered alpaca_cleaned dataset without safety-related prompt-response pairs) and safety-full, respectively.
model="llama2-7b-chat-hf"
type="unstructured"
ru=3000
rs=4000
method="low_rank_diff"
save_dir="out/$model/$type/${method}/align/"
python main_low_rank_diff.py \
--model $model \
--rank_pos $ru \
--rank_neg $rs \
--prune_data_pos "alpaca_cleaned_no_safety" \
--prune_data_neg "align" \
--save $save_dir \
--eval_zero_shot \
--eval_attack
If you find our code and paper helpful, please consider citing our work:
@inproceedings{weiassessing,
title={Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications},
author={Wei, Boyi and Huang, Kaixuan and Huang, Yangsibo and Xie, Tinghao and Qi, Xiangyu and Xia, Mengzhou and Mittal, Prateek and Wang, Mengdi and Henderson, Peter},
booktitle={Forty-first International Conference on Machine Learning}
}