
Awesome Interpretability in Large Language Models

The area of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository collects relevant resources to help beginners quickly get started in the area and to help researchers keep up with the latest research progress.

This is an actively maintained repository; please open a new issue if I have missed any relevant resources. If you have any questions or suggestions, feel free to contact me via email: [email protected].


Table of Contents

  • Awesome Interpretability Libraries
  • Awesome Interpretability Blogs & Videos
  • Awesome Interpretability Tutorials
  • Awesome Interpretability Forums & Workshops
  • Awesome Interpretability Tools
  • Awesome Interpretability Programs
  • Awesome Interpretability Papers
  • Other Awesome Interpretability Resources

Awesome Interpretability Libraries

  • GitHub Repo stars TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models. (Doc, Tutorial, Demo)
  • GitHub Repo stars nnsight: enables interpreting and manipulating the internals of deep learned models. (Doc, Tutorial, Paper)
  • GitHub Repo stars SAE Lens: train and analyse SAEs. (Doc, Tutorial, Blog)
  • GitHub Repo stars EleutherAI sae: train SAEs on very large models, based on the method and released code of the OpenAI SAE paper.
  • GitHub Repo stars Automatic Circuit DisCovery: automatically builds circuits for mechanistic interpretability. (Paper, Demo)
  • GitHub Repo stars Pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. (Paper, Demo)
  • GitHub Repo stars pyreft: a powerful, efficient, and interpretable fine-tuning method. (Paper, Demo)
  • GitHub Repo stars repeng: A Python library for generating control vectors with representation engineering. (Paper, Blog)
  • GitHub Repo stars Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Paper, Doc, Tutorial)
  • GitHub Repo stars LXT: LRP eXplains Transformers: Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
  • GitHub Repo stars Tuned Lens: Tools for understanding how transformer predictions are built layer-by-layer. (Paper, Doc)
  • GitHub Repo stars Inseq: Pytorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
  • GitHub Repo stars shap: Python library for computing SHAP feature/token importance for any black-box model. Works with Hugging Face, PyTorch, and TensorFlow models, including LLMs. (Paper, Doc)
  • GitHub Repo stars captum: model interpretability and understanding library for PyTorch. (Paper, Doc)
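The attribution-style libraries above (shap, captum, Inseq, LXT) all answer some version of "which input tokens mattered for this output?". As a self-contained illustration of the idea (not any particular library's API), here is a leave-one-out token-importance sketch against a toy stand-in model; `toy_sentiment_score` and all its contents are hypothetical:

```python
# Minimal leave-one-out token attribution: score how much each token
# contributes to a model's output by deleting it and re-scoring.
# toy_sentiment_score is a stand-in for a real model's scalar output
# (e.g. the logit of a target class); libraries like shap or captum
# estimate similar importances with far more principled methods.

def toy_sentiment_score(tokens):
    """Toy scalar 'model': positive minus negative lexicon hits."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "hate"}
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

def leave_one_out_importance(tokens, score_fn):
    """Importance of token i = score(full) - score(full minus token i)."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

tokens = ["the", "movie", "was", "great"]
print(leave_one_out_importance(tokens, toy_sentiment_score))  # "great" gets +1
```

Real libraries replace the brute-force deletion with Shapley-value estimation, gradients, or relevance propagation, but the question being asked is the same.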

Awesome Interpretability Blogs & Videos

Awesome Interpretability Tutorials

Awesome Interpretability Forums & Workshops

Awesome Interpretability Tools

  • GitHub Repo stars Transformer Debugger: investigate specific behaviors of small LLMs
  • GitHub Repo stars LLM Transparency Tool (Demo)
  • GitHub Repo stars sae_vis: a tool to replicate Anthropic's sparse autoencoder visualisations (Demo)
  • Neuronpedia: an open platform for interpretability research. (Doc)
  • GitHub Repo stars Comgra: A tool to analyze and debug neural networks in pytorch. Use a GUI to traverse the computation graph and view the data from many different angles at the click of a button. (Paper)

Awesome Interpretability Programs

  • ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.

Awesome Interpretability Papers

Survey Papers

Title Venue Date Code
Attention Heads of Large Language Models: A Survey arXiv 2024-09-06 Github
Internal Consistency and Self-Feedback in Large Language Models: A Survey arXiv 2024-07-22 Github Paper List
Relational Composition in Neural Networks: A Survey and Call to Action MechInterp@ICML 2024-07-15 -
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP arXiv 2024-06-18 -
A Primer on the Inner Workings of Transformer-based Language Models arXiv 2024-05-02 -
Mechanistic Interpretability for AI Safety -- A Review arXiv 2024-04-22 -
From Understanding to Utilization: A Survey on Explainability for Large Language Models arXiv 2024-02-22 -
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks arXiv 2023-08-18 -

Position Papers

Title Venue Date Code
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience ICML 2024-06-25 -
Interpretability Needs a New Paradigm arXiv 2024-05-08 -
Position Paper: Toward New Frameworks for Studying Model Representations arXiv 2024-02-06 -
Rethinking Interpretability in the Era of Large Language Models arXiv 2024-01-30 -

Interpretable Analysis of LLMs

Title Venue Date Code Blog
Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
MechInterp@ICML 2024-07-15 - -
GitHub Repo stars
Compact Proofs of Model Performance via Mechanistic Interpretability
MechInterp@ICML 2024-07-15 Github -
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
MechInterp@ICML 2024-07-15 - -
How Do Llamas Process Multilingual Text? A Latent Exploration through Activation Patching
MechInterp@ICML 2024-07-15 - -
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
MechInterp@ICML 2024-07-15 - -
What Makes and Breaks Safety Fine-tuning? Mechanistic Study
MechInterp@ICML 2024-07-15 - -
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
MechInterp@ICML 2024-07-15 - -
Loss in the Crowd: Hidden Breakthroughs in Language Model Training
MechInterp@ICML 2024-07-15 - -
Robust Knowledge Unlearning via Mechanistic Localizations
MechInterp@ICML 2024-07-15 - -
Language Models Linearly Represent Sentiment
MechInterp@ICML 2024-07-15 - -
GitHub Repo stars
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
MechInterp@ICML 2024-07-15 Github -
Learning and Unlearning of Fabricated Knowledge in Language Models
MechInterp@ICML 2024-07-15 - -
Faithful and Fast Influence Function via Advanced Sampling
MechInterp@ICML 2024-07-15 - -
Hypothesis Testing the Circuit Hypothesis in LLMs
MechInterp@ICML 2024-07-15 - -
GitHub Repo stars
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
MechInterp@ICML 2024-07-15 Github -
GitHub Repo stars
InversionView: A General-Purpose Method for Reading Information from Neural Activations
MechInterp@ICML 2024-07-15 Github -
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
MechInterp@ICML 2024-07-15 - -
GitHub Repo stars
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
arXiv 2024-07-01 Github -
GitHub Repo stars
Recovering the Pre-Fine-Tuning Weights of Generative Models
ICML 2024-07-01 Github Blog
GitHub Repo stars
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
arXiv 2024-06-28 Github Blog
GitHub Repo stars
Observable Propagation: Uncovering Feature Vectors in Transformers
ICML 2024-06-25 Github -
GitHub Repo stars
Multi-property Steering of Large Language Models with Dynamic Activation Composition
arXiv 2024-06-25 Github -
Confidence Regulation Neurons in Language Models
arXiv 2024-06-24 - -
GitHub Repo stars
Compact Proofs of Model Performance via Mechanistic Interpretability
arXiv 2024-06-24 Github -
GitHub Repo stars
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
arXiv 2024-06-23 Github -
Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models
arXiv 2024-06-23 - -
GitHub Repo stars
Estimating Knowledge in Large Language Models Without Generating a Single Token
arXiv 2024-06-18 Github -
Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations
arXiv 2024-06-17 - -
GitHub Repo stars
Transcoders Find Interpretable LLM Feature Circuits
MechInterp@ICML 2024-06-17 Github -
GitHub Repo stars
Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
arXiv 2024-06-16 Github -
GitHub Repo stars
Context versus Prior Knowledge in Language Models
ACL 2024-06-16 Github -
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
arXiv 2024-06-13 - -
GitHub Repo stars
MambaLRP: Explaining Selective State Space Sequence Models
arXiv 2024-06-11 Github -
GitHub Repo stars
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
ICML 2024-06-06 Github Blog
GitHub Repo stars
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
ACL 2024-06-06 Github -
Learned feature representations are biased by complexity, learning order, position, and more
arXiv 2024-06-06 Demo -
Iteration Head: A Mechanistic Study of Chain-of-Thought
arXiv 2024-06-05 - -
Activation Addition: Steering Language Models Without Optimization
arXiv 2024-06-04 Code -
Interpretability Illusions in the Generalization of Simplified Models
arXiv 2024-06-04 - -
GitHub Repo stars
SyntaxShap: Syntax-aware Explainability Method for Text Generation
arXiv 2024-06-03 Github Blog
Calibrating Reasoning in Language Models with Internal Consistency
arXiv 2024-05-29 - -
Black-Box Access is Insufficient for Rigorous AI Audits
FAccT 2024-05-29 - -
Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting
arXiv 2024-05-28 - -
GitHub Repo stars
From Neurons to Neutrons: A Case Study in Interpretability
ICML 2024-05-27 Github -
GitHub Repo stars
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
MechInterp@ICML 2024-05-27 Github -
GitHub Repo stars
Explorations of Self-Repair in Language Models
ICML 2024-05-26 Github -
Emergence of a High-Dimensional Abstraction Phase in Language Transformers
arXiv 2024-05-24 - -
GitHub Repo stars
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions
arXiv 2024-05-23 Github -
GitHub Repo stars
Not All Language Model Features Are Linear
arXiv 2024-05-23 Github -
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
arXiv 2024-05-20 - -
GitHub Repo stars
Your Transformer is Secretly Linear
arXiv 2024-05-19 Github -
GitHub Repo stars
Are self-explanations from Large Language Models faithful?
ACL 2024-05-16 Github -
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
arXiv 2024-05-14 - -
GitHub Repo stars
Steering Llama 2 via Contrastive Activation Addition
arXiv 2024-05-07 Github -
GitHub Repo stars
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability
AISTATS 2024-05-07 Github -
GitHub Repo stars
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
arXiv 2024-05-06 Github -
GitHub Repo stars
Circuit Component Reuse Across Tasks in Transformer Language Models
ICLR 2024-05-06 Github -
GitHub Repo stars
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations
HCI+NLP@NAACL 2024-04-24 Github -
How to use and interpret activation patching
arXiv 2024-04-23 - -
Understanding Addition in Transformers
arXiv 2024-04-23 - -
Towards Uncovering How Large Language Model Works: An Explainability Perspective
arXiv 2024-04-15 - -
GitHub Repo stars
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
ICML 2024-04-10 Github -
Does Transformer Interpretability Transfer to RNNs?
arXiv 2024-04-09 - -
GitHub Repo stars
Locating and Editing Factual Associations in Mamba
arXiv 2024-04-04 Github Demo
Eliciting Latent Knowledge from Quirky Language Models
ME-FoMo@ICLR 2024-04-03 - -
Do language models plan ahead for future tokens?
arXiv 2024-04-01 - -
GitHub Repo stars
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
arXiv 2024-03-31 Github Demo
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
arXiv 2024-03-26 - -
GitHub Repo stars
Language Models Represent Space and Time
ICLR 2024-03-04 Github -
AtP*: An efficient and scalable method for localizing LLM behaviour to components
arXiv 2024-03-01 - -
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
arXiv 2024-02-28 - -
GitHub Repo stars
Function Vectors in Large Language Models
ICLR 2024-02-25 Github Blog
A Language Model's Guide Through Latent Space
arXiv 2024-02-22 - -
Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model
arXiv 2024-02-22 - -
GitHub Repo stars
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
ICLR 2024-02-22 Github Blog
GitHub Repo stars
Fine-grained Hallucination Detection and Editing for Language Models
arXiv 2024-02-21 Github Blog
GitHub Repo stars
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation
arXiv 2024-02-20 Github -
Identifying Semantic Induction Heads to Understand In-Context Learning
arXiv 2024-02-20 - -
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
arXiv 2024-02-20 - -
Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models
ACML 2024-02-12 - -
GitHub Repo stars
Model Editing with Canonical Examples
arXiv 2024-02-09 Github -
GitHub Repo stars
Opening the AI black box: program synthesis via mechanistic interpretability
arXiv 2024-02-07 Github -
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
ICLR 2024-02-06 - -
GitHub Repo stars
In-Context Language Learning: Architectures and Algorithms
arXiv 2024-01-30 Github -
Gradient-Based Language Model Red Teaming
EACL 2024-01-30 Github -
The Calibration Gap between Model and Human Confidence in Large Language Models
arXiv 2024-01-24 - -
GitHub Repo stars
Universal Neurons in GPT2 Language Models
arXiv 2024-01-22 Github -
The mechanistic basis of data dependence and abrupt learning in an in-context classification task
ICLR 2024-01-16 - -
GitHub Repo stars
Overthinking the Truth: Understanding how Language Models Process False Demonstrations
ICLR 2024-01-16 Github -
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
ICLR 2024-01-16 - -
Feature emergence via margin maximization: case studies in algebraic tasks
ICLR 2024-01-16 - -
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
ICLR 2024-01-16 - -
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
ICLR 2024-01-16 - -
GitHub Repo stars
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
ICML 2024-01-03 Github -
GitHub Repo stars
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
ATTRIB@NeurIPS 2023-12-31 Github Blog
GitHub Repo stars
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
arXiv 2023-12-08 Github Blog
GitHub Repo stars
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
ATTRIB@NeurIPS 2023-12-06 Github -
GitHub Repo stars
Structured World Representations in Maze-Solving Transformers
UniReps@NeurIPS 2023-12-05 Github -
Generating Interpretable Networks using Hypernetworks
arXiv 2023-12-05 - -
GitHub Repo stars
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
NeurIPS 2023-11-21 Github -
GitHub Repo stars
Attribution Patching Outperforms Automated Circuit Discovery
ATTRIB@NeurIPS 2023-11-20 Github -
GitHub Repo stars
Tracr: Compiled Transformers as a Laboratory for Interpretability
NeurIPS 2023-11-03 Github -
GitHub Repo stars
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
NeurIPS 2023-11-02 Github -
GitHub Repo stars
Learning Transformer Programs
NeurIPS 2023-10-31 Github -
GitHub Repo stars
Towards Automated Circuit Discovery for Mechanistic Interpretability
NeurIPS 2023-10-28 Github -
GitHub Repo stars
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
EMNLP 2023-10-23 Github -
GitHub Repo stars
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
NeurIPS 2023-10-20 Github -
GitHub Repo stars
Progress measures for grokking via mechanistic interpretability
ICLR 2023-10-19 Github Blog
GitHub Repo stars
Copy Suppression: Comprehensively Understanding an Attention Head
arXiv 2023-10-06 Github Blog & Demo
GitHub Repo stars
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
NeurIPS 2023-09-21 Github -
GitHub Repo stars
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
NeurIPS 2023-09-21 Github -
GitHub Repo stars
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
BlackboxNLP@EMNLP 2023-09-07 Github Blog
GitHub Repo stars
Finding Neurons in a Haystack: Case Studies with Sparse Probing
arXiv 2023-06-02 Github -
GitHub Repo stars
Efficient Shapley Values Estimation by Amortization for Text Classification
ACL 2023-05-31 Github Video
GitHub Repo stars
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
ICML 2023-05-24 Github -
Localizing Model Behavior with Path Patching
arXiv 2023-05-16 - -
Language models can explain neurons in language models
OpenAI 2023-05-09 - -
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
ICLR Workshop 2023-04-22 - -
GitHub Repo stars
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
ICLR 2023-01-20 Github -
Interpreting Neural Networks through the Polytope Lens
arXiv 2022-11-22 - -
Scaling Laws and Interpretability of Learning from Repeated Data
arXiv 2022-05-21 - -
In-context Learning and Induction Heads
Anthropic 2022-03-08 - -
A Mathematical Framework for Transformer Circuits
Anthropic 2021-12-22 - -
GitHub Repo stars
Thinking Like Transformers
ICML 2021-07-19 Github Mini Tutorial
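Many entries in the table above rely on activation patching (and its variants, attribution and path patching): run the model on a corrupted input while splicing in one activation cached from a clean run, and measure how much of the clean behavior returns. A minimal sketch on a toy two-layer network, with hypothetical weights and inputs:

```python
import numpy as np

# Minimal activation patching on a toy 2-layer MLP (hypothetical weights):
# run the "corrupted" input, but splice in one hidden activation cached
# from the "clean" input, and see how far the output moves.

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))

def forward(x, patch=None):
    """patch = (unit_index, cached_value) overwrites one hidden unit."""
    h = np.maximum(x @ W1, 0.0)   # ReLU hidden layer
    if patch is not None:
        i, v = patch
        h = h.copy()
        h[i] = v                  # the intervention
    return h @ W2

clean, corrupted = rng.normal(size=4), rng.normal(size=4)
clean_h = np.maximum(clean @ W1, 0.0)   # cache clean activations

# Patch each hidden unit in turn; units whose patch moves the corrupted
# output the most are causally implicated in the behavior under study.
effects = [np.linalg.norm(forward(corrupted, patch=(i, clean_h[i])) - forward(corrupted))
           for i in range(8)]
print(np.argmax(effects))  # hidden unit with the largest causal effect
```

In real LLM work the same intervention is applied at hook points inside a transformer (e.g. via TransformerLens hooks) and the effect is measured on a task-specific logit difference rather than an output norm.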

SAE, Dictionary Learning and Superposition

Title Venue Date Code Blog
Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task
MechInterp@ICML 2024-07-15 - -
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
MechInterp@ICML 2024-07-15 - -
Interpreting Attention Layer Outputs with Sparse Autoencoders
MechInterp@ICML 2024-06-25 - Demo
GitHub Repo stars
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
MechInterp@ICML 2024-05-24 Github -
Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis
arXiv 2024-05-23 - -
Automatically Identifying Local and Global Circuits with Linear Computation Graphs
arXiv 2024-05-22 - -
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Anthropic 2024-05-21 - Demo
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
arXiv 2024-05-21 - -
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
arXiv 2024-05-20 - -
GitHub Repo stars
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
arXiv 2024-05-20 Github -
Improving Dictionary Learning with Gated Sparse Autoencoders
arXiv 2024-04-30 - -
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
LessWrong 2024-04-29 - Demo
Activation Steering with SAEs
LessWrong 2024-04-19 - -
SAE reconstruction errors are (empirically) pathological
LessWrong 2024-03-29 - -
GitHub Repo stars
Sparse autoencoders find composed features in small toy models
LessWrong 2024-03-14 Github -
GitHub Repo stars
Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
LessWrong 2024-03-05 Github -
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
arXiv 2024-02-19 - -
Do sparse autoencoders find "true features"?
LessWrong 2024-02-12 - -
Toward A Mathematical Framework for Computation in Superposition
LessWrong 2024-01-18 - -
Sparse Autoencoders Work on Attention Layer Outputs
LessWrong 2024-01-16 - Demo
GitHub Repo stars
Sparse Autoencoders Find Highly Interpretable Features in Language Models
ICLR 2024-01-16 Github -
GitHub Repo stars
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
arXiv 2023-10-26 Github Demo
GitHub Repo stars
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Anthropic 2023-10-04 Github Demo-1, Demo-2, Tutorial
Polysemanticity and Capacity in Neural Networks
arXiv 2023-07-12 - -
Distributed Representations: Composition & Superposition
Anthropic 2023-05-04 - -
Superposition, Memorization, and Double Descent
Anthropic 2023-01-05 - -
GitHub Repo stars
Engineering Monosemanticity in Toy Models
arXiv 2022-11-16 Github -
GitHub Repo stars
Toy Models of Superposition
Anthropic 2022-09-14 Github Demo
Softmax Linear Units
Anthropic 2022-06-27 - -
GitHub Repo stars
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
DeeLIO@NAACL 2021-03-29 Github -
Zoom In: An Introduction to Circuits
Distill 2020-03-10 - -
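The unifying idea of this section is to train a sparse, overcomplete autoencoder on model activations so that each dictionary direction becomes a candidate interpretable feature. Below is a minimal sketch of one training step with an L1 sparsity penalty, using hypothetical dimensions and random data; real implementations (e.g. SAE Lens or Anthropic's dictionary-learning setup) add details such as decoder normalization and resampling of dead features:

```python
import numpy as np

# One gradient step of a minimal sparse autoencoder on fake "activations":
# reconstruct x from a wider ReLU code f, with an L1 penalty pushing most
# of f to zero. All sizes and the data are hypothetical.

rng = np.random.default_rng(0)
d_model, d_dict, lam, lr = 16, 64, 1e-3, 1e-2
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc, b_dec = np.zeros(d_dict), np.zeros(d_model)

x = rng.normal(size=(32, d_model))        # batch of "activations"

f = np.maximum(x @ W_enc + b_enc, 0.0)    # sparse code (ReLU)
x_hat = f @ W_dec + b_dec                 # reconstruction
loss = np.mean((x_hat - x) ** 2) + lam * np.abs(f).mean()

# Manual gradients (MSE term plus L1, back through the ReLU):
g_xhat = 2 * (x_hat - x) / x.size
g_f = g_xhat @ W_dec.T + lam * np.sign(f) / f.size
g_f *= (f > 0)                            # ReLU gate
W_dec -= lr * f.T @ g_xhat
b_dec -= lr * g_xhat.sum(0)
W_enc -= lr * x.T @ g_f
b_enc -= lr * g_f.sum(0)
```

After training on real residual-stream or MLP activations, the rows of `W_dec` are the learned dictionary, and interpretability work then asks which inputs maximally activate each code dimension.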

Interpretability in Vision LLMs

Title Venue Date Code Blog
Dissecting Query-Key Interaction in Vision Transformers
MechInterp@ICML 2024-06-25 - -
Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
MechInterp@ICML 2024-06-25 - -
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
MechInterp@ICML 2024-06-25 - -
The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
MechInterp@ICML 2024-06-25 - -
GitHub Repo stars
Don’t trust your eyes: on the (un)reliability of feature visualizations
ICML 2024-06-25 Github -
GitHub Repo stars
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation
arXiv 2024-06-24 Github -
GitHub Repo stars
PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
XAI4CV@CVPR 2024-04-09 Github -
GitHub Repo stars
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
arXiv 2024-02-16 Github -
GitHub Repo stars
Analyzing Vision Transformers for Image Classification in Class Embedding Space
NeurIPS 2023-09-21 Github -
GitHub Repo stars
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
CLVL@ICCV 2023-08-27 Github -
GitHub Repo stars
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
NeurIPS 2023-07-11 Github Blog

Benchmarking Interpretability

Title Venue Date Code Blog
Benchmarking Mental State Representations in Language Models
MechInterp@ICML 2024-06-25 - -
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
ACL 2024-05-21 Dataset Blog
GitHub Repo stars
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
arXiv 2024-02-27 Github -
GitHub Repo stars
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
arXiv 2024-02-19 Github -

Enhancing Interpretability

Title Venue Date Code Blog
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability
arXiv 2024-01-08 - -
GitHub Repo stars
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
arXiv 2023-06-06 Github -

Others

Title Venue Date Code Blog
An introduction to graphical tensor notation for mechanistic interpretability
arXiv 2024-02-02 - -
GitHub Repo stars
Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks
arXiv 2023-10-03 Github -

Other Awesome Interpretability Resources
