Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.

Stars: 507

Visit

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on Large Language Models (LLMs). The repository contains papers, codes, datasets, evaluations, and analyses related to jailbreak attacks on LLMs. It serves as a comprehensive resource for researchers and practitioners interested in exploring various jailbreak techniques and defenses in the context of LLMs. Contributions such as additional jailbreak-related content, pull requests, and issue reports are welcome, and contributors are acknowledged. For any inquiries or issues, contact [email protected]. If you find this repository useful for your research or work, consider starring it to show appreciation.

README:

Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses. Any additional things regarding jailbreak, PRs, issues are welcome and we are glad to add you to the contributor list here. Any problems, please contact [email protected]. If you find this repository useful to your research or work, it is really appreciated to star this repository and cite our papers here. ✨

Papers

Jailbreak Attack

Black-box Attack

Time	Title	Venue	Paper	Code
2025.03	FlipAttack: Jailbreak LLMs via Flipping (FlipAttack)	ICLR Workshop'25	link	link
2024.11	The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models	arXiv	link	link
2024.11	Playing Language Game with LLMs Leads to Jailbreaking	arXiv	link	link
2024.11	GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs (GASP)	arXiv	link	link
2024.11	LLM STINGER: Jailbreaking LLMs using RL fine-tuned LLMs	arXiv	link	-
2024.11	SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt	arXiv	link	link
2024.11	Diversity Helps Jailbreak Large Language Models	arXiv	link	-
2024.11	Plentiful Jailbreaks with String Compositions	arXiv	link	-
2024.11	Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models	arXiv	link	link
2024.11	Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring	arXiv	link	-
2024.10	Endless Jailbreaks with Bijection	arXiv	link	-
2024.10	Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models	arXiv	link	-
2024.10	You Know What I'm Saying: Jailbreak Attack via Implicit Reference	arXiv	link	link
2024.10	Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation	arXiv	link	link
2024.10	AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (AutoDAN-Turbo)	arXiv	link	link
2024.10	PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach (PathSeeker)	arXiv	link	-
2024.10	Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity	arXiv	link	link
2024.09	AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs	arXiv	link	link
2024.09	Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs	arXiv	link	-
2024.09	Jailbreaking Large Language Models with Symbolic Mathematics	arXiv	link	-
2024.08	Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models	arXiv	link	-
2024.08	Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles	arXiv	link	-
2024.08	h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment (h4rm3l)	arXiv	link	link
2024.08	EnJa: Ensemble Jailbreak on Large Language Models (EnJa)	arXiv	link	-
2024.07	Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack	arXiv	link	link
2024.07	Single Character Perturbations Break LLM Alignment	arXiv	link	link
2024.07	A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses	arXiv	link	-
2024.07	Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context)	arXiv	link	-
2024.07	SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP)	arXiv	link	link
2024.06	Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ)	NeurIPS'24	link	link
2024.06	When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker)	NeurIPS'24	link	-
2024.06	Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith)	ICML'24	link	link
2024.06	Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation	ICML'24	link	-
2024.06	ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt)	ACL'24	link	link
2024.06	From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF)	arXiv	link	-
2024.06	CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack)	ACL'24	link	-
2024.06	Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA)	USENIX Security'24	link	link
2024.06	AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak)	arXiv	link	-
2024.06	Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks	arXiv	link	link
2024.06	GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFUZZER)	arXiv	link	link
2024.06	A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)	NAACL'24	link	link
2024.06	QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA)	arXiv	link	link
2024.06	Poisoned LangChain: Jailbreak LLMs by LangChain (PLC)	arXiv	link	link
2024.05	Multilingual Jailbreak Challenges in Large Language Models	ICLR'24	link	link
2024.05	DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception)	EMNLP'24	link	link
2024.05	GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS)	ACL'24	link	-
2024.05	GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD)	arXiv	link	-
2024.05	"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN)	CCS'24	link	link
2024.05	Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher (SelfCipher)	ICLR'24	link	link
2024.05	Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM)	NeurIPS'24	link	-
2024.05	Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA)	arXiv	link	-
2024.04	Many-shot jailbreaking (MSJ)	NeurIPS'24 Anthropic	link	-
2024.04	PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA)	ICLR Workshop'24	link	-
2024.04	Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models (FuzzLLM)	ICASSP'24	link	link
2024.04	Sandwich attack: Multi-language mixture adaptive attack on llms (Sandwich attack)	TrustNLP'24	link	-
2024.03	Tastle: Distract large language models for automatic jailbreak attack (TASTLE)	arXiv	link	-
2024.03	DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack)	EMNLP'24	link	link
2024.02	PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP)	arXiv	link	-
2024.02	CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon)	arXiv	link	link
2024.02	PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL)	arXiv	link	link
2024.02	Jailbreaking Proprietary Large Language Models using Word Substitution Cipher	arXiv	link	-
2024.02	Query-Based Adversarial Prompt Generation	arXiv	link	-
2024.02	Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack)	arXiv	link	-
2024.02	Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ)	arXiv	link	-
2024.02	Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking	NAACL'24	link	link
2024.01	Low-Resource Languages Jailbreak GPT-4	NeurIPS Workshop'24	link	-
2024.01	How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)	arXiv	link	link
2023.12	Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP)	NeurIPS'24	link	link
2023.12	Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs	arXiv	link	-
2023.12	Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition	ACL'24	link	-
2023.11	Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona)	NeurIPS Workshop'23	link	-
2023.10	Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR)	NeurIPS'24	link	link
2023.10	Adversarial Demonstration Attacks on Large Language Models (advICL)	EMNLP'24	link	-
2023.10	MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY)	NDSS'24	link	link
2023.10	Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP)	EMNLP'23	link	link
2023.10	An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack)	ICLR'24	link	link
2023.09	Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP)	EMNLP Findings'23	link	link
2023.09	Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA)	Applied Sciences'24	link	-
2023.05	Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection	CCS'23	link	link
2022.11	Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject)	NeurIPS WorkShop'22	link	link

White-box Attack

Year	Title	Venue	Paper	Code
2024.11	AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts	arXiv	link	-
2024.11	DROJ: A Prompt-Driven Attack against Large Language Models	arXiv	link	link
2024.11	SQL Injection Jailbreak: a structural disaster of large language models	arXiv	link	link
2024.10	Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks	arXiv	link	-
2024.10	AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation	arXiv	link	link
2024.10	Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting	arXiv	link	-
2024.10	Boosting Jailbreak Transferability for Large Language Models (SI-GCG)	arXiv	link	-
2024.10	Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (ADV-LLM)	arXiv	link	link
2024.08	Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation (JVD)	arXiv	link	-
2024.08	Jailbreak Open-Sourced Large Language Models via Enforced Decoding (EnDec)	ACL'24	link	-
2024.07	Refusal in Language Models Is Mediated by a Single Direction	arXiv	Link	Link
2024.07	Revisiting Character-level Adversarial Attacks for Language Models	ICML'24	link	link
2024.07	Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3)	arXiv	link	-
2024.07	SOS! Soft Prompt Attack Against Open-Source Large Language Models	arXiv	link	-
2024.06	COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack)	ICML'24	link	link
2024.06	Improved Techniques for Optimization-Based Jailbreaking on Large Language Models (I-GCG)	arXiv	link	link
2024.05	Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs	arXiv	link
2024.05	Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization	NeurIPS'24	Link	-
2024.05	AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN)	ICLR'24	link	link
2024.05	AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG)	arXiv	link	link
2024.05	Boosting jailbreak attack with momentum (MAC)	ICLR Workshop'24	link	link
2024.04	AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter)	arXiv	link	link
2024.03	Universal Jailbreak Backdoors from Poisoned Human Feedback	ICLR'24	link	-
2024.02	Attacking large language models with projected gradient descent (PGD)	arXiv	link	-
2024.02	Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE)	arXiv	link	-
2024.02	Curiosity-driven red-teaming for large language models (CRT)	arXiv	link	link
2023.12	AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN)	arXiv	link	link
2023.10	Catastrophic jailbreak of open-source llms via exploiting generation	ICLR'24	link	link
2023.06	Automatically Auditing Large Language Models via Discrete Optimization (ARCA)	ICML'23	link	link
2023.07	Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)	arXiv	link	link

Multi-turn Attack

Time	Title	Venue	Paper	Code
2024.11	MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue	arXiv	link	-
2024.10	Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models (JSP)	arXiv	link	link
2024.10	Multi-round jailbreak attack on large language	arXiv	link	-
2024.10	Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues	arXiv	link	link
2024.09	LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet	arXiv	link	link
2024.09	RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking	arXiv	link	link
2024.08	FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)	arXiv	link	-
2024.08	Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks	arXiv	link	link
2024.05	CoA: Context-Aware based Chain of Attack for Multi-Turn Dialogue LLM (CoA)	arXiv	link	link
2024.04	Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo)	Microsoft Azure	link	-

Attack on RAG-based LLM

Time	Title	Venue	Paper	Code
2024.09	Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking	arXiv	link	link
2024.02	Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora)	arXiv	link	-

Multi-modal Attack

Time	Title	Venue	Paper	Code
2024.11	Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey	arXiv	link	link
2024.10	Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step	arXiv	link	-
2024.10	ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation	NeurIPS'24	Link	-
2024.08	Jailbreaking Text-to-Image Models with LLM-Based Agents (Atlas)	arXiv	link	-
2024.07	Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything	arXiv	link	-
2024.06	Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt	arXiv	link	link
2024.05	Voice Jailbreak Attacks Against GPT-4o	arXiv	link	link
2024.05	Automatic Jailbreaking of the Text-to-Image Generative AI Systems	ICML'24 Workshop	link	link
2024.04	Image hijacks: Adversarial images can control generative models at runtime	arXiv	link	link
2024.03	An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models (CroPA)	ICLR'24	link	link
2024.03	Jailbreak in pieces: Compositional adversarial attacks on multi-modal language model	ICLR'24	link	-
2024.03	Rethinking model ensemble in transfer-based adversarial attacks	ICLR'24	link	link
2024.02	VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models	NeurIPS'23	link	link
2024.02	Jailbreaking Attack against Multimodal Large Language Model	arXiv	link	-
2024.01	Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts	arXiv	link	-
2024.03	Visual Adversarial Examples Jailbreak Aligned Large Language Models	AAAI'24	link	-
2023.12	OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack)	arXiv	link	-
2023.12	FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep)	arXiv	link	link
2023.11	SneakyPrompt: Jailbreaking Text-to-image Generative Models	S&P'24	link	link
2023.11	On Evaluating Adversarial Robustness of Large Vision-Language Models	NeurIPS'23	link	link
2023.10	How Robust is Google's Bard to Adversarial Image Attacks?	arXiv	link	link
2023.08	AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP)	ACM MM'23	link	link
2023.07	Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA)	ICCV'23	link	link
2023.07	On the Adversarial Robustness of Multi-Modal Foundation Models	ICCV Workshop'23	link	-
2022.10	Towards Adversarial Attack on Vision-Language Pre-training Models	arXiv	link	link

Jailbreak Defense

Learning-based Defense

Time	Title	Venue	Paper	Code
2024.12	Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models	arXiv'24	link	-
2024.10	Safety-Aware Fine-Tuning of Large Language Models	arXiv'24	link	-
2024.10	MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks	AAAI'24	link	-
2024.08	BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger (BaThe)	arXiv	link	-
2024.07	DART: Deep Adversarial Automated Red Teaming for LLM Safety	arXiv	link	-
2024.07	Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (Eraser)	arXiv	link	link
2024.07	Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks	arXiv	link	link
2024.06	Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs	arXiv	Link	-
2024.06	Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo)	arXiv	link	link
2024.06	Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization (SafeDecoding)	ACL'24	link	link
2024.06	Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment	NeurIPS'24	link	link
2024.06	On Prompt-Driven Safeguarding for Large Language Models (DRO)	ICML'24	link	link
2024.06	Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO)	NeurIPS'24	link	-
2024.06	Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT)	NeurIPS'24	link	link
2024.05	Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching (SAFEPATCHING)	arXiv	link	-
2024.05	Detoxifying Large Language Models via Knowledge Editing (DINM)	ACL'24	link	link
2024.05	Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing	arXiv	link	link
2023.11	MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART)	ACL'24	link	-
2023.11	Baseline defenses for adversarial attacks against aligned language models	arXiv	link	-
2023.10	Safe rlhf: Safe reinforcement learning from human feedback	arXiv	link	link
2023.08	Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT)	arXiv	link	link
2022.04	Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback	Anthropic	link	-

Strategy-based Defense

Time	Title	Venue	Paper	Code
2024.11	Rapid Response: Mitigating LLM Jailbreaks with a Few Examples	arXiv	link	link
2024.10	RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (RePD)	arXiv	link	-
2024.10	Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models (G4D)	arXiv	link	link
2024.10	Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models	arXiv	link	-
2024.09	HSF: Defending against Jailbreak Attacks with Hidden State Filtering	arXiv	link	link
2024.08	EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models (EEG-Defender)	arXiv	link	-
2024.08	Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks (PG)	arXiv	link	link
2024.08	Self-Evaluation as a Defense Against Adversarial Attacks on LLMs (Self-Evaluation)	arXiv	link	link
2024.06	Defending LLMs against Jailbreaking Attacks via Backtranslation (Backtranslation)	ACL Findings'24	link	link
2024.06	SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding)	ACL'24	link	link
2024.06	Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM	ACL'24	link	-
2024.06	A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)	NAACL'24	link	link
2024.06	SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks	arXiv	link	link
2024.05	Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique)	ACL'24	link	link
2024.05	PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN)	ICML'24	link	link
2024.05	LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked	ICLR Tiny Paper'24	link	link
2024.05	GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe)	ACL'24	link	link
2024.05	Multilingual Jailbreak Challenges in Large Language Models	ICLR'24	link	link
2024.05	Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes	NeurIPS'24	link	-
2024.05	AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks	arXiv	link	link
2024.05	Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron)	arXiv	link	link
2024.05	Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD)	arXiv	link	-
2024.04	Protecting your llms with information bottleneck	NeurIPS'24	link	link
2024.04	Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning	arXiv	link	link
2024.02	Certifying LLM Safety against Adversarial Prompting	arXiv	link	link
2024.02	Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement	arXiv	link	-
2024.02	Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH)	arXiv	link	link
2024.01	Intention Analysis Makes LLMs A Good Jailbreak Defender (IA)	arXiv	link	link
2024.01	How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)	ACL'24	link	link
2023.12	Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder)	Nature Machine Intelligence	link	link
2023.11	Detecting language model attacks with perplexity	arXiv	link	-
2023.10	RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN)	ICLR'24	link	link

Guard Model

Time	Title	Venue	Paper	Code
2025.03	GuardReasoner: Towards Reasoning-based LLM Safeguards (GuardReasoner)	ICLR Workshop'25	link	link
2025.02	Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming	arXiv'25	link	-
2024.12	Lightweight Safety Classification Using Pruned Language Models (Sentence-BERT)	arXiv'24	link	-
2024.11	GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding (GuardFormer)	Meta	link	-
2024.11	Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations (LLaMA Guard 3 Vision)	Meta	link	link
2024.11	AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Aegis2.0)	Nvidia, NeurIPS'24 Workshop	link	-
2024.11	Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings (Sentence-BERT)	arXiv'24	link	-
2024.11	STAND-Guard: A Small Task-Adaptive Content Moderation Model (STAND-Guard)	Microsoft	link	-
2024.10	VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data	arXiv	link	-
2024.09	AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts (Aegis)	Nvidia	link	link
2024.09	Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (LLaMA Guard 3)	Meta	link	link
2024.08	ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)	Google	link	link
2024.07	WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard)	NeurIPS'24	link	link
2024.06	GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning (GuardAgent)	arXiv'24	link	-
2024.06	R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning (R2-Guard)	arXiv	link	link
2024.04	Llama Guard 2	Meta	link	link
2024.03	AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield)	ECCV'24	link	link
2023.12	Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (LLaMA Guard)	Meta	link	link

Moderation API

Time	Title	Venue	Paper	Code
2023.08	Using GPT-4 for content moderation (GPT-4)	OpenAI	link	-
2023.02	A Holistic Approach to Undesired Content Detection in the Real World (OpenAI Moderation Endpoint)	AAAI OpenAI	link	link
2022.02	A New Generation of Perspective API: Efficient Multilingual Character-level Transformers (Perspective API)	KDD Google	link	link
-	Azure AI Content Safety	Microsoft Azure	-	link
-	Detoxify	unitary.ai	-	link

Evaluation & Analysis

Time	Title	Venue	Paper	Code
2025.02	GuidedBench: Equipping Jailbreak Evaluation with Guidelines	arXiv	link	link
2024.12	Agent-SafetyBench: Evaluating the Safety of LLM Agents	arXiv	link	link
2024.11	Global Challenge for Safe and Secure LLMs Track 1	arXiv	link	-
2024.11	JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit	arXiv	link	-
2024.11	The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense	arXiv	link	-
2024.11	HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment	arXiv	link	-
2024.11	ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain	arXiv	link	link
2024.11	GuardBench: A Large-Scale Benchmark for Guardrail Models	EMNLP'24	link	link
2024.11	What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks	arXiv	Link	link
2024.11	Benchmarking LLM Guardrails in Handling Multilingual Toxicity	arXiv	link	link
2024.10	JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework	arXiv	link	link
2024.10	Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems	arXiv	link	link
2024.10	A Realistic Threat Model for Large Language Model Jailbreaks	arXiv	link	link
2024.10	ADVERSARIAL SUFFIXES MAY BE FEATURES TOO!	arXiv	link	link
2024.09	JAILJUDGE: A COMPREHENSIVE JAILBREAK	arXiv	Link	Link
2024.09	Multimodal Pragmatic Jailbreak on Text-to-image Models	arXiv	link	link
2024.08	ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma)	arXiv	link	link
2024.08	MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models (MMJ-Bench)	arXiv	link	link
2024.08	Mission Impossible: A Statistical Perspective on Jailbreaking LLMs	NeurIPS'24	Link	-
2024.07	Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)	arXiv	link	link
2024.07	JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks	arXiv	link	link
2024.07	Jailbreak Attacks and Defenses Against Large Language Models: A Survey	arXiv	link	-
2024.06	"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak	arXiv	link	link
2024.06	WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming)	NeurIPS'24	link	link
2024.06	From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking	arXiv	link	-
2024.06	AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways	arXiv	link	-
2024.06	MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench)	arXiv	link	-
2024.06	ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC)	ACL'24	link	link
2024.06	Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	NeurIPS'24	link	link
2024.06	JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo)	arXiv	link	link
2024.06	Fundamental limitations of alignment in large language models	arXiv	link	-
2024.06	JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench)	NeurIPS'24	link	link
2024.06	Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis	arXiv	link	link
2024.06	JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval)	arXiv	link	link
2024.05	Rethinking How to Evaluate Language Model Jailbreak	arXiv	link	link
2024.05	Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust)	arXiv	link	link
2024.05	Prompt Injection attack against LLM-integrated Applications	arXiv	link	-
2024.05	Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks	LREC-COLING'24	link	link
2024.05	LLM Jailbreak Attack versus Defense Techniques--A Comprehensive Study	NDSS'24	link	-
2024.05	Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study	arXiv	link	-
2024.05	Detoxifying Large Language Models via Knowledge Editing (SafeEdit)	ACL'24	link	link
2024.04	JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens)	arXiv	link	-
2024.03	How (un) ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries (TECHHAZARDQA)	arXiv	link	link
2024.03	Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models	USENIX Security	link	-
2024.03	EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak)	arXiv	link	link
2024.02	Comprehensive Assessment of Jailbreak Attacks Against LLMs	arXiv	link	-
2024.02	SPML: A DSL for Defending Language Models Against Prompt Attacks	arXiv	link	-
2024.02	Coercing LLMs to do and reveal (almost) anything	arXiv	link	-
2024.02	A STRONGREJECT for Empty Jailbreaks (StrongREJECT)	NeurIPS'24	link	link
2024.02	ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages	ACL'24	link	link
2024.02	HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench)	arXiv	link	link
2023.12	Goal-Oriented Prompt Attack and Safety Evaluation for LLMs	arXiv	link	link
2023.12	The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness	arXiv	link	-
2023.12	A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models	UbiSec'23	link	-
2023.11	Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild	arXiv	link	-
2023.11	How many unicorns are in this image? a safety evaluation benchmark for vision llms	arXiv	link	link
2023.11	Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles	arXiv	link	-
2023.10	Explore, establish, exploit: Red teaming language models from scratch	arXiv	link	-
2023.10	Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks	arXiv	link	-
2023.10	Fine-tuning aligned language models compromises safety, even when users do not intend to! (HEx-PHI)	ICLR'24 (oral)	link	link
2023.08	Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL)	arXiv	link	link
2023.08	Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities	arXiv	link	-
2023.07	Jailbroken: How Does LLM Safety Training Fail? (Jailbroken)	NeurIPS'23	link	-
2023.08	Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities	arXiv	link	-
2023.08	From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy	IEEE Access	link	-
2023.07	Llm censorship: A machine learning challenge or a computer security problem?	arXiv	link	-
2023.07	Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench)	arXiv	link	link
2023.06	DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models	NeurIPS'23	link	link
2023.04	Safety Assessment of Chinese Large Language Models	arXiv	link	link
2023.02	Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks	arXiv	link	-
2022.11	Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned	arXiv	link	-
2022.02	Red Teaming Language Models with Language Models	arXiv	link	-

Application

Time	Title	Venue	Paper	Code
2024.11	Attacking Vision-Language Computer Agents via Pop-ups	arXiv	link	link
2024.10	Jailbreaking LLM-Controlled Robots (ROBOPAIR)	arXiv	link	link
2024.10	SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis	arXiv	link	link
2024.10	Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates	arXiv	link	link
2024.09	RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems	arXiv	link	-
2024.08	A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares (APwT)	arXiv	link	-

Other Related Awesome Repository

Reference

If you find this repository helpful to your research, it is really appreciated to cite our papers. ✨

@article{liuyue_GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Jun, Xia and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{liuyue_FlipAttack,
  title={FlipAttack: Jailbreak LLMs via Flipping},
  author={Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Hooi, Bryan},
  journal={arXiv preprint arXiv:2410.02832},
  year={2024}
}

Contributors

(back to top)

For Tasks:

Click tags to check more tools for each tasks

analyze jailbreak methods implement defense techniques evaluate attack robustness explore adversarial strategies benchmark safety measures

For Jobs:

data scientist machine learning engineer cybersecurity analyst research scientist ai ethics consultant

Alternative AI tools for Awesome-Jailbreak-on-LLMs

Similar Open Source Tools

Awesome-Jailbreak-on-LLMs

github

: 507

Awesome-LLMs-for-Video-Understanding

Awesome-LLMs-for-Video-Understanding is a repository dedicated to exploring Video Understanding with Large Language Models. It provides a comprehensive survey of the field, covering models, pretraining, instruction tuning, and hybrid methods. The repository also includes information on tasks, datasets, and benchmarks related to video understanding. Contributors are encouraged to add new papers, projects, and materials to enhance the repository.

github

: 1.8k

llms-from-scratch-cn

This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.

github

: 860

Awesome-Knowledge-Distillation-of-LLMs

A collection of papers related to knowledge distillation of large language models (LLMs). The repository focuses on techniques to transfer advanced capabilities from proprietary LLMs to smaller models, compress open-source LLMs, and refine their performance. It covers various aspects of knowledge distillation, including algorithms, skill distillation, verticalization distillation in fields like law, medical & healthcare, finance, science, and miscellaneous domains. The repository provides a comprehensive overview of the research in the area of knowledge distillation of LLMs.

github

: 890

Tiktoken

Tiktoken is a high-performance implementation focused on token count operations. It provides various encodings like o200k_base, cl100k_base, r50k_base, p50k_base, and p50k_edit. Users can easily encode and decode text using the provided API. The repository also includes a benchmark console app for performance tracking. Contributions in the form of PRs are welcome.

github

: 68

step_into_llm

The 'step_into_llm' repository is dedicated to the 昇思MindSpore technology open class, which focuses on exploring cutting-edge technologies, combining theory with practical applications, expert interpretations, open sharing, and empowering competitions. The repository contains course materials, including slides and code, for the ongoing second phase of the course. It covers various topics related to large language models (LLMs) such as Transformer, BERT, GPT, GPT2, and more. The course aims to guide developers interested in LLMs from theory to practical implementation, with a special emphasis on the development and application of large models.

github

: 405

2025-AI-College-Jobs

2025-AI-College-Jobs is a repository containing a comprehensive list of AI/ML & Data Science jobs suitable for college students seeking internships or new graduate positions. The repository is regularly updated with positions posted within the last 120 days, featuring opportunities from various companies in the USA and internationally. The list includes positions in areas such as research scientist internships, quantitative research analyst roles, and other data science-related positions. The repository aims to provide a valuable resource for students looking to kickstart their careers in the field of artificial intelligence and machine learning.

github

: 988

Chinese-LLaMA-Alpaca

This project open sources the **Chinese LLaMA model and the Alpaca large model fine-tuned with instructions**, to further promote the open research of large models in the Chinese NLP community. These models **extend the Chinese vocabulary based on the original LLaMA** and use Chinese data for secondary pre-training, further enhancing the basic Chinese semantic understanding ability. At the same time, the Chinese Alpaca model further uses Chinese instruction data for fine-tuning, significantly improving the model's understanding and execution of instructions.

github

: 17.2k

Chinese-LLaMA-Alpaca-3

Chinese-LLaMA-Alpaca-3 is a project based on Meta's latest release of the new generation open-source large model Llama-3. It is the third phase of the Chinese-LLaMA-Alpaca open-source large model series projects (Phase 1, Phase 2). This project open-sources the Chinese Llama-3 base model and the Chinese Llama-3-Instruct instruction fine-tuned large model. These models incrementally pre-train with a large amount of Chinese data on the basis of the original Llama-3 and further fine-tune using selected instruction data, enhancing Chinese basic semantics and instruction understanding capabilities. Compared to the second-generation related models, significant performance improvements have been achieved.

github

: 825

MindChat

MindChat is a psychological large language model designed to help individuals relieve psychological stress and solve mental confusion, ultimately improving mental health. It aims to provide a relaxed and open conversation environment for users to build trust and understanding. MindChat offers privacy, warmth, safety, timely, and convenient conversation settings to help users overcome difficulties and challenges, achieve self-growth, and development. The tool is suitable for both work and personal life scenarios, providing comprehensive psychological support and therapeutic assistance to users while strictly protecting user privacy. It combines psychological knowledge with artificial intelligence technology to contribute to a healthier, more inclusive, and equal society.

github

: 436

LLM4EC

LLM4EC is an interdisciplinary research repository focusing on the intersection of Large Language Models (LLM) and Evolutionary Computation (EC). It provides a comprehensive collection of papers and resources exploring various applications, enhancements, and synergies between LLM and EC. The repository covers topics such as LLM-assisted optimization, EA-based LLM architecture search, and applications in code generation, software engineering, neural architecture search, and other generative tasks. The goal is to facilitate research and development in leveraging LLM and EC for innovative solutions in diverse domains.

github

: 78

stylellm_models

**stylellm** is a text style transfer project based on large language models (llms). The project utilizes large language models to learn the writing style of specific literary works (commonly used vocabulary, sentence structure, rhetoric, character dialogue, etc.), forming a series of specific style models. Using the **stylellm** model, the learned style can be transferred to other general texts, that is: input a piece of original text, the model can rewrite it, output text with the characteristics of that style, achieving the effect of text modification,润色or style imitation.

github

: 121

LLM-for-Healthcare

The repository 'LLM-for-Healthcare' provides a comprehensive survey of large language models (LLMs) for healthcare, covering data, technology, applications, and accountability and ethics. It includes information on various LLM models, training data, evaluation methods, and computation costs. The repository also discusses tasks such as NER, text classification, question answering, dialogue systems, and generation of medical reports from images in the healthcare domain.

github

: 96

SpinQuant

SpinQuant is a tool designed for LLM quantization with learned rotations. It focuses on optimizing rotation matrices to enhance the performance of quantized models, narrowing the accuracy gap to full precision models. The tool implements rotation optimization and PTQ evaluation with optimized rotation, providing arguments for model name, batch sizes, quantization bits, and rotation options. SpinQuant is based on the findings that rotation helps in removing outliers and improving quantization, with specific enhancements achieved through learning rotation with Cayley optimization.

github

: 76

PaddleScience

PaddleScience is a scientific computing suite developed based on the deep learning framework PaddlePaddle. It utilizes the learning ability of deep neural networks and the automatic (higher-order) differentiation mechanism of PaddlePaddle to solve problems in physics, chemistry, meteorology, and other fields. It supports three solving methods: physics mechanism-driven, data-driven, and mathematical fusion, and provides basic APIs and detailed documentation for users to use and further develop.

github

: 305

VoiceBench

VoiceBench is a repository containing code and data for benchmarking LLM-Based Voice Assistants. It includes a leaderboard with rankings of various voice assistant models based on different evaluation metrics. The repository provides setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. Users can submit new voice assistant results through the issue tracker for updates on the ranking list.

github

: 119

For similar tasks

Awesome-Jailbreak-on-LLMs

github

: 507

For similar jobs

LLM-and-Law

This repository is dedicated to summarizing papers related to large language models with the field of law. It includes applications of large language models in legal tasks, legal agents, legal problems of large language models, data resources for large language models in law, law LLMs, and evaluation of large language models in the legal domain.

github

: 180

start-llms

This repository is a comprehensive guide for individuals looking to start and improve their skills in Large Language Models (LLMs) without an advanced background in the field. It provides free resources, online courses, books, articles, and practical tips to become an expert in machine learning. The guide covers topics such as terminology, transformers, prompting, retrieval augmented generation (RAG), and more. It also includes recommendations for podcasts, YouTube videos, and communities to stay updated with the latest news in AI and LLMs.

github

: 789

aiverify

AI Verify is an AI governance testing framework and software toolkit that validates the performance of AI systems against internationally recognised principles through standardised tests. It offers a new API Connector feature to bypass size limitations, test various AI frameworks, and configure connection settings for batch requests. The toolkit operates within an enterprise environment, conducting technical tests on common supervised learning models for tabular and image datasets. It does not define AI ethical standards or guarantee complete safety from risks or biases.

github

: 140

Awesome-LLM-Watermark

This repository contains a collection of research papers related to watermarking techniques for text and images, specifically focusing on large language models (LLMs). The papers cover various aspects of watermarking LLM-generated content, including robustness, statistical understanding, topic-based watermarks, quality-detection trade-offs, dual watermarks, watermark collision, and more. Researchers have explored different methods and frameworks for watermarking LLMs to protect intellectual property, detect machine-generated text, improve generation quality, and evaluate watermarking techniques. The repository serves as a valuable resource for those interested in the field of watermarking for LLMs.

github

: 212

LLM-LieDetector

This repository contains code for reproducing experiments on lie detection in black-box LLMs by asking unrelated questions. It includes Q/A datasets, prompts, and fine-tuning datasets for generating lies with language models. The lie detectors rely on asking binary 'elicitation questions' to diagnose whether the model has lied. The code covers generating lies from language models, training and testing lie detectors, and generalization experiments. It requires access to GPUs and OpenAI API calls for running experiments with open-source models. Results are stored in the repository for reproducibility.

github

: 54

graphrag

The GraphRAG project is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs. It enhances LLMs' ability to reason about private data. The repository provides guidance on using knowledge graph memory structures to enhance LLM outputs, with a warning about the potential costs of GraphRAG indexing. It offers contribution guidelines, development resources, and encourages prompt tuning for optimal results. The Responsible AI FAQ addresses GraphRAG's capabilities, intended uses, evaluation metrics, limitations, and operational factors for effective and responsible use.

github

: 24.1k

langtest

LangTest is a comprehensive evaluation library for custom LLM and NLP models. It aims to deliver safe and effective language models by providing tools to test model quality, augment training data, and support popular NLP frameworks. LangTest comes with benchmark datasets to challenge and enhance language models, ensuring peak performance in various linguistic tasks. The tool offers more than 60 distinct types of tests with just one line of code, covering aspects like robustness, bias, representation, fairness, and accuracy. It supports testing LLMS for question answering, toxicity, clinical tests, legal support, factuality, sycophancy, and summarization.

github

: 515

Awesome-Jailbreak-on-LLMs

github

: 507