AgentPoison
[NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning"
Stars: 78
AgentPoison is a repository that provides the official PyTorch implementation of the paper 'AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning'. It offers tools for red-teaming LLM agents by poisoning memory or knowledge bases. The repository includes trigger optimization algorithms, agent experiments, and evaluation scripts for Agent-Driver, ReAct-StrategyQA, and EHRAgent. Users can fine-tune motion planners, inject queries with triggers, and evaluate red-teaming performance. The codebase supports multiple RAG embedders and provides a unified dataset access for all three agents.
README:
π₯π₯ Recent news please check Project page !
This repository provides the official PyTorch implementation of the following paper:
AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning
Zhaorun Chen1, Zhen Xiang2, Chaowei Xiao 3, Dawn Song 4, Bo Li1,21University of Chicago, 2University of Illinois, Urbana-Champaign
3University of Wisconsin, Madison, 4University of California, Berkeley
To install, run the following commands to install the required packages:
git clone https://github.com/BillChan226/AgentPoison.git
cd AgentPoison
conda env create -f environment.yml
conda activate agentpoison
You can download the embedder checkpoints from the links below then specify the path to the embedder checkpoints in the algo/config.yaml file.
You can also use custmor embedders (e.g. fine-tuned yourself) as long as you specify their identifier and model path in the config.
After setting up the configuration for the embedders, you can run trigger optimization for all three agents using the following command:
python algo/trigger_optimization.py --agent ad --algo ap --model dpr-ctx_encoder-single-nq-base --save_dir ./results --ppl_filter --target_gradient_guidance --asr_threshold 0.5 --num_adv_passage_tokens 10 --golden_trigger -w -pSpecifically, the descriptions of arguments are listed below:
| Argument | Example | Description |
|---|---|---|
--agent |
ad |
Specify the type of agent to red-team, [ad, qa, ehr]. |
--algo |
ap |
Trigger optimization algorithm to use, [ap, cpa]. |
--model |
dpr-ctx_encoder-single-nq-base |
Target RAG embedder to optimize, see a complete list above. |
--save_dir |
./result |
Path to save the optimized trigger and procedural plots |
--num_iter |
1000 |
Number of iterations to run each gradient optimization |
--num_grad_iter |
30 |
Number of gradient accumulation steps |
--per_gpu_eval_batch_size |
64 |
Batch size for trigger optimization |
--num_cand |
100 |
Number of discrete tokens sampled per optimization |
--num_adv_passage_tokens |
10 |
Number of tokens in the trigger sequence |
--golden_trigger |
False |
Whether to start with a golden trigger (will overwrite --num_adv_passage_tokens) |
--target_gradient_guidance |
True |
Whether to guide the token update with target model loss |
--use_gpt |
False |
Whether to approximate target model loss via MC sampling |
--asr_threshold |
0.5 |
ASR threshold for target model loss |
--ppl_filter |
True |
Whether to enable coherence loss filter for token sampling |
--plot |
False |
Whether to plot the procedural optimization of the embeddings |
--report_to_wandb |
True |
Whether to report the results to wandb |
We have modified the original code for Agent-Driver, ReAct-StrategyQA, EHRAgent to support more RAG embedders, and add interface for data poisoning. We have provided unified dataset access for all three agents at here. Specifically, we list the inference command for all three agents.
First download the corresponding dataset from here or the original dataset host. Put the corresponding dataset in agentdriver/data.
Then put the optimized trigger tokens in here and you can also determine more attack parameters in here. Specifically, set attack_or_not to False to get the benign utility under attack.
Then run the following script for inference:
sh scripts/agent_driver/run_inference.shThe motion planning result regarding ASR-r, ASR-a, and ACC will be printed directly at the end of the program. The planned trajectory will be saved to ./result. Run the following command to get ASR-t:
sh scripts/agent_driver/run_evaluation.shWe provide more options for red-teaming agent-driver that cover each individual components of an autonomous agent, including perception APIs, memory module, ego-states, mission goal.
You need to follow the instruction here and fine-tune a motion planner based on GPT-3.5 using OpenAI's API first. As an alternative, we fine-tune a motion planner based on LLaMA-3 in here, such that the agent inference can be completely offline. Set use_local_planner in here to True to enable this.
First download the corresponding dataset from here or the StrategyQA dataset. Put the corresponding dataset in ReAct/database.
Then put the optimized trigger tokens in here. Run the following command to infer with GPT backbone:
python ReAct/run_strategyqa_gpt3.5.py --model dpr --task_type advand similarly to infer with LLaMA-3-70b backbone (you need to first obtain an API key in Replicate to access LLaMA-3) and put it here.
python ReAct/run_strategyqa_llama3_api.py --model dpr --task_type advSpecifically, set --task_type to adv to inject querries with trigger and benign to get the benign utility under attack. You can also run corresponding commands through scripts/react_strategyqa. The results will be saved to a path indicated by --save_dir.
To evaluate the red-teaming performance for StrategyQA, simply run the following command:
python ReAct/eval.py -p [RESPONSE_PATH]where RESPONSE_PATH is the path to the response json file.
First download the corresponding dataset from here and put it under EhrAgent/database.
Then put the optimized trigger tokens in here. Run the following command to infer with GPT/LLaMA3:
python EhrAgent/ehragent/main.py --backbone gpt --model dpr --algo ap --attackYou can specify --backbone to llama3 to infer with LLaMA3, and set --attack to False to get the benign utility under attack. You can also run corresponding commands through scripts/ehragent. The results will be saved to a path indicated by --save_dir.
To evaluate the red-teaming performance for EHRAgent, simply run the following command:
python EhrAgent/ehragent/eval.py -p [RESPONSE_PATH]where RESPONSE_PATH is the path to the response json file.
Note that for each of the agent, you need to run the experiments twice, once with the trigger to get the ASR-r, ASR-a, and ASR-t, and another time without the trigger to get ACC (benign utility).
Please cite the paper as follows if you use the data or code from AgentPoison:
@misc{chen2024agentpoisonredteamingllmagents,
title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases},
author={Zhaorun Chen and Zhen Xiang and Chaowei Xiao and Dawn Song and Bo Li},
year={2024},
eprint={2407.12784},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.12784},
}
Please reach out to us if you have any suggestions or need any help in reproducing the results. You can submit an issue or pull request, or send an email to [email protected].
This repository is under MIT License.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for AgentPoison
Similar Open Source Tools
AgentPoison
AgentPoison is a repository that provides the official PyTorch implementation of the paper 'AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning'. It offers tools for red-teaming LLM agents by poisoning memory or knowledge bases. The repository includes trigger optimization algorithms, agent experiments, and evaluation scripts for Agent-Driver, ReAct-StrategyQA, and EHRAgent. Users can fine-tune motion planners, inject queries with triggers, and evaluate red-teaming performance. The codebase supports multiple RAG embedders and provides a unified dataset access for all three agents.
skyvern
Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions. Traditional approaches to browser automations required writing custom scripts for websites, often relying on DOM parsing and XPath-based interactions which would break whenever the website layouts changed. Instead of only relying on code-defined XPath interactions, Skyvern adds computer vision and LLMs to the mix to parse items in the viewport in real-time, create a plan for interaction and interact with them. This approach gives us a few advantages: 1. Skyvern can operate on websites itβs never seen before, as itβs able to map visual elements to actions necessary to complete a workflow, without any customized code 2. Skyvern is resistant to website layout changes, as there are no pre-determined XPaths or other selectors our system is looking for while trying to navigate 3. Skyvern leverages LLMs to reason through interactions to ensure we can cover complex situations. Examples include: 1. If you wanted to get an auto insurance quote from Geico, the answer to a common question βWere you eligible to drive at 18?β could be inferred from the driver receiving their license at age 16 2. If you were doing competitor analysis, itβs understanding that an Arnold Palmer 22 oz can at 7/11 is almost definitely the same product as a 23 oz can at Gopuff (even though the sizes are slightly different, which could be a rounding error!) Want to see examples of Skyvern in action? Jump to #real-world-examples-of- skyvern
LEADS
LEADS is a lightweight embedded assisted driving system designed to simplify the development of instrumentation, control, and analysis systems for racing cars. It is written in Python and C/C++ with impressive performance. The system is customizable and provides abstract layers for component rearrangement. It supports hardware components like Raspberry Pi and Arduino, and can adapt to various hardware types. LEADS offers a modular structure with a focus on flexibility and lightweight design. It includes robust safety features, modern GUI design with dark mode support, high performance on different platforms, and powerful ESC systems for traction control and braking. The system also supports real-time data sharing, live video streaming, and AI-enhanced data analysis for driver training. LEADS VeC Remote Analyst enables transparency between the driver and pit crew, allowing real-time data sharing and analysis. The system is designed to be user-friendly, adaptable, and efficient for racing car development.
moatless-tools
Moatless Tools is a hobby project focused on experimenting with using Large Language Models (LLMs) to edit code in large existing codebases. The project aims to build tools that insert the right context into prompts and handle responses effectively. It utilizes an agentic loop functioning as a finite state machine to transition between states like Search, Identify, PlanToCode, ClarifyChange, and EditCode for code editing tasks.
thepipe
The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that require a deep understanding of tricky data sources. The Pipe is available as a hosted API at thepi.pe, or it can be set up locally.
mindcraft
Mindcraft is a project that crafts minds for Minecraft using Large Language Models (LLMs) and Mineflayer. It allows an LLM to write and execute code on your computer, with code sandboxed but still vulnerable to injection attacks. The project requires Minecraft Java Edition, Node.js, and one of several API keys. Users can run tasks to acquire specific items or construct buildings, customize project details in settings.js, and connect to online servers with a Microsoft/Minecraft account. The project also supports Docker container deployment for running in a secure environment.
nano-graphrag
nano-GraphRAG is a simple, easy-to-hack implementation of GraphRAG that provides a smaller, faster, and cleaner version of the official implementation. It is about 800 lines of code, small yet scalable, asynchronous, and fully typed. The tool supports incremental insert, async methods, and various parameters for customization. Users can replace storage components and LLM functions as needed. It also allows for embedding function replacement and comes with pre-defined prompts for entity extraction and community reports. However, some features like covariates and global search implementation differ from the original GraphRAG. Future versions aim to address issues related to data source ID, community description truncation, and add new components.
call-center-ai
Call Center AI is an AI-powered call center solution leveraging Azure and OpenAI GPT. It allows for AI agent-initiated phone calls or direct calls to the bot from a configured phone number. The bot is customizable for various industries like insurance, IT support, and customer service, with features such as accessing claim information, conversation history, language change, SMS sending, and more. The project is a proof of concept showcasing the integration of Azure Communication Services, Azure Cognitive Services, and Azure OpenAI for an automated call center solution.
agenticSeek
AgenticSeek is a voice-enabled AI assistant powered by DeepSeek R1 agents, offering a fully local alternative to cloud-based AI services. It allows users to interact with their filesystem, code in multiple languages, and perform various tasks autonomously. The tool is equipped with memory to remember user preferences and past conversations, and it can divide tasks among multiple agents for efficient execution. AgenticSeek prioritizes privacy by running entirely on the user's hardware without sending data to the cloud.
binary_ninja_mcp
This repository contains a Binary Ninja plugin, MCP server, and bridge that enables seamless integration of Binary Ninja's capabilities with your favorite LLM client. It provides real-time integration, AI assistance for reverse engineering, multi-binary support, and various MCP tools for tasks like decompiling functions, getting IL code, managing comments, renaming variables, and more.
ChatGPT-Telegram-Bot
ChatGPT Telegram Bot is a Telegram bot that provides a smooth AI experience. It supports both Azure OpenAI and native OpenAI, and offers real-time (streaming) response to AI, with a faster and smoother experience. The bot also has 15 preset bot identities that can be quickly switched, and supports custom bot identities to meet personalized needs. Additionally, it supports clearing the contents of the chat with a single click, and restarting the conversation at any time. The bot also supports native Telegram bot button support, making it easy and intuitive to implement required functions. User level division is also supported, with different levels enjoying different single session token numbers, context numbers, and session frequencies. The bot supports English and Chinese on UI, and is containerized for easy deployment.
can-ai-code
Can AI Code is a self-evaluating interview tool for AI coding models. It includes interview questions written by humans and tests taken by AI, inference scripts for common API providers and CUDA-enabled quantization runtimes, a Docker-based sandbox environment for validating untrusted Python and NodeJS code, and the ability to evaluate the impact of prompting techniques and sampling parameters on large language model (LLM) coding performance. Users can also assess LLM coding performance degradation due to quantization. The tool provides test suites for evaluating LLM coding performance, a webapp for exploring results, and comparison scripts for evaluations. It supports multiple interviewers for API and CUDA runtimes, with detailed instructions on running the tool in different environments. The repository structure includes folders for interviews, prompts, parameters, evaluation scripts, comparison scripts, and more.
text-embeddings-inference
Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for popular models like FlagEmbedding, Ember, GTE, and E5. It implements features such as no model graph compilation step, Metal support for local execution on Macs, small docker images with fast boot times, token-based dynamic batching, optimized transformers code for inference using Flash Attention, Candle, and cuBLASLt, Safetensors weight loading, and production-ready features like distributed tracing with Open Telemetry and Prometheus metrics.
magentic
Easily integrate Large Language Models into your Python code. Simply use the `@prompt` and `@chatprompt` decorators to create functions that return structured output from the LLM. Mix LLM queries and function calling with regular Python code to create complex logic.
paxml
Pax is a framework to configure and run machine learning experiments on top of Jax.
monacopilot
Monacopilot is a powerful and customizable AI auto-completion plugin for the Monaco Editor. It supports multiple AI providers such as Anthropic, OpenAI, Groq, and Google, providing real-time code completions with an efficient caching system. The plugin offers context-aware suggestions, customizable completion behavior, and framework agnostic features. Users can also customize the model support and trigger completions manually. Monacopilot is designed to enhance coding productivity by providing accurate and contextually appropriate completions in daily spoken language.
For similar tasks
AgentPoison
AgentPoison is a repository that provides the official PyTorch implementation of the paper 'AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning'. It offers tools for red-teaming LLM agents by poisoning memory or knowledge bases. The repository includes trigger optimization algorithms, agent experiments, and evaluation scripts for Agent-Driver, ReAct-StrategyQA, and EHRAgent. Users can fine-tune motion planners, inject queries with triggers, and evaluate red-teaming performance. The codebase supports multiple RAG embedders and provides a unified dataset access for all three agents.
For similar jobs
ciso-assistant-community
CISO Assistant is a tool that helps organizations manage their cybersecurity posture and compliance. It provides a centralized platform for managing security controls, threats, and risks. CISO Assistant also includes a library of pre-built frameworks and tools to help organizations quickly and easily implement best practices.
PurpleLlama
Purple Llama is an umbrella project that aims to provide tools and evaluations to support responsible development and usage of generative AI models. It encompasses components for cybersecurity and input/output safeguards, with plans to expand in the future. The project emphasizes a collaborative approach, borrowing the concept of purple teaming from cybersecurity, to address potential risks and challenges posed by generative AI. Components within Purple Llama are licensed permissively to foster community collaboration and standardize the development of trust and safety tools for generative AI.
vpnfast.github.io
VPNFast is a lightweight and fast VPN service provider that offers secure and private internet access. With VPNFast, users can protect their online privacy, bypass geo-restrictions, and secure their internet connection from hackers and snoopers. The service provides high-speed servers in multiple locations worldwide, ensuring a reliable and seamless VPN experience for users. VPNFast is easy to use, with a user-friendly interface and simple setup process. Whether you're browsing the web, streaming content, or accessing sensitive information, VPNFast helps you stay safe and anonymous online.
taranis-ai
Taranis AI is an advanced Open-Source Intelligence (OSINT) tool that leverages Artificial Intelligence to revolutionize information gathering and situational analysis. It navigates through diverse data sources like websites to collect unstructured news articles, utilizing Natural Language Processing and Artificial Intelligence to enhance content quality. Analysts then refine these AI-augmented articles into structured reports that serve as the foundation for deliverables such as PDF files, which are ultimately published.
NightshadeAntidote
Nightshade Antidote is an image forensics tool used to analyze digital images for signs of manipulation or forgery. It implements several common techniques used in image forensics including metadata analysis, copy-move forgery detection, frequency domain analysis, and JPEG compression artifacts analysis. The tool takes an input image, performs analysis using the above techniques, and outputs a report summarizing the findings.
h4cker
This repository is a comprehensive collection of cybersecurity-related references, scripts, tools, code, and other resources. It is carefully curated and maintained by Omar Santos. The repository serves as a supplemental material provider to several books, video courses, and live training created by Omar Santos. It encompasses over 10,000 references that are instrumental for both offensive and defensive security professionals in honing their skills.
AIMr
AIMr is an AI aimbot tool written in Python that leverages modern technologies to achieve an undetected system with a pleasing appearance. It works on any game that uses human-shaped models. To optimize its performance, users should build OpenCV with CUDA. For Valorant, additional perks in the Discord and an Arduino Leonardo R3 are required.
admyral
Admyral is an open-source Cybersecurity Automation & Investigation Assistant that provides a unified console for investigations and incident handling, workflow automation creation, automatic alert investigation, and next step suggestions for analysts. It aims to tackle alert fatigue and automate security workflows effectively by offering features like workflow actions, AI actions, case management, alert handling, and more. Admyral combines security automation and case management to streamline incident response processes and improve overall security posture. The tool is open-source, transparent, and community-driven, allowing users to self-host, contribute, and collaborate on integrations and features.

