llm-past-tense

Does Refusal Training in LLMs Generalize to the Past Tense? [NeurIPS 2024 Safe Generative AI Workshop (Oral)]

The 'llm-past-tense' repository contains the code for the research paper 'Does Refusal Training in LLMs Generalize to the Past Tense?' by Maksym Andriushchenko and Nicolas Flammarion, which studies whether refusal training in large language models (LLMs) generalizes to past-tense reformulations of requests. The code includes the experiments from the paper and examples for running them against different models. The codebase is released under the MIT License, and a citation entry is provided for use in research.

README:

Does Refusal Training in LLMs Generalize to the Past Tense?

Maksym Andriushchenko (EPFL), Nicolas Flammarion (EPFL)

Paper: https://arxiv.org/abs/2407.11969

NeurIPS 2024 Safe Generative AI Workshop (Oral)

Getting started

To get started, install dependencies: pip install transformers openai anthropic

Make sure your API keys are stored in the environment variables OPENAI_API_KEY, TOGETHER_API_KEY, and ANTHROPIC_API_KEY. For this, you can run:

export OPENAI_API_KEY=[YOUR_API_KEY_HERE]
export TOGETHER_API_KEY=[YOUR_API_KEY_HERE]
export ANTHROPIC_API_KEY=[YOUR_API_KEY_HERE]

Similarly, some Hugging Face models, such as Llama-3, require an access token, which can be set in HF_TOKEN.
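Before launching a long run, it can help to confirm that all the keys above are actually set. The snippet below is a minimal, optional check (not part of the repository; the variable names are the ones listed above):

```python
import os

# Environment variables the experiments expect, per the setup instructions above.
REQUIRED_KEYS = ["OPENAI_API_KEY", "TOGETHER_API_KEY", "ANTHROPIC_API_KEY"]

def missing_keys(env=os.environ):
    """Return the names of required API-key variables that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing API keys:", ", ".join(absent))
    else:
        print("All API keys set.")
```

Add HF_TOKEN to the list if you plan to run gated Hugging Face models.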

Run experiments

Simply run main.py with the desired target model! :-) Examples:

python main.py --target_model=gpt-3.5-turbo --n_requests=100 --n_restarts=20
python main.py --target_model=gpt-4o-mini --n_requests=100 --n_restarts=20 
python main.py --target_model=gpt-4o-2024-05-13 --n_requests=100 --n_restarts=20 
python main.py --target_model=o1-mini-2024-09-12 --n_requests=100 --n_restarts=20 
python main.py --target_model=o1-preview-2024-09-12 --n_requests=100 --n_restarts=20 
python main.py --target_model=claude-3-5-sonnet-20240620 --n_requests=100 --n_restarts=20 
python main.py --target_model=phi3 --n_requests=100 --n_restarts=20  
python main.py --target_model=gemma2-9b --n_requests=100 --n_restarts=20 
python main.py --target_model=llama3-8b --n_requests=100 --n_restarts=20 
python main.py --target_model=r2d2 --n_requests=100 --n_restarts=20  
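At a high level, each run reformulates a request into the past tense and queries the target model, restarting up to --n_restarts times. The sketch below illustrates that loop; the prompt wording and function names are assumptions for illustration only — the actual templates and model wrappers live in main.py:

```python
# Illustrative sketch (not the repository's code): rephrase a request into the
# past tense, then send the reformulation to the target model.

def build_reformulation_prompt(request: str) -> str:
    """Build a prompt asking a rephrasing model to recast a request in the past tense.

    The exact template used in the paper differs; this wording is hypothetical.
    """
    return (
        "Reformulate the following request so that it asks about the past, "
        f'keeping the meaning intact:\n\n"{request}"'
    )

def attack_once(request: str, rephrase, target) -> str:
    """One restart: rephrase the request, then query the target model.

    `rephrase` and `target` are caller-supplied callables wrapping LLM APIs.
    """
    past_tense_request = rephrase(build_reformulation_prompt(request))
    return target(past_tense_request)

if __name__ == "__main__":
    # Stub callables for demonstration; a real run would call an LLM API.
    response = attack_once(
        "How do I do X?",
        rephrase=lambda prompt: "How did people do X?",
        target=lambda query: f"(target model response to: {query})",
    )
    print(response)
```

In the actual experiments, the rephrasing and target calls go through the OpenAI, Together, or Anthropic APIs depending on --target_model.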

Citation

If you find this work useful in your own research, please consider citing it:

@article{andriushchenko2024refusal,
      title={Does Refusal Training in LLMs Generalize to the Past Tense?}, 
      author={Andriushchenko, Maksym and Flammarion, Nicolas},
      journal={arXiv preprint arXiv:2407.11969},
      year={2024}
}

License

This codebase is released under the MIT License.
