Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
Stars: 230
This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.
README:
The area of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository tries to collect all relevant resources to help beginners quickly get started in this area and help researchers to keep up with the latest research progress.
This is an active repository and welcome to open a new issue if I miss any relevant resources. If you have any questions or suggestions, please feel free to contact me via email: [email protected]
.
Table of Contents
- Awesome Interpretability Libraries
- Awesome Interpretability Blogs & Videos
- Awesome Interpretability Tutorials
- Awesome Interpretability Forums
- Awesome Interpretability Tools
- Awesome Interpretability Programs
- Awesome Interpretability Papers
- Other Awesome Interpretability Resources
- TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models. (Doc, Tutorial, Demo)
- nnsight: enables interpreting and manipulating the internals of deep learned models. (Doc, Tutorial, Paper)
- SAE Lens: train and analyse SAE. (Doc, Tutorial, Blog)
- EleutherAI: sae: train SAE on very large model based on the method and released code of the openAI SAE paper
- Automatic Circuit DisCovery: automatically build circuit for mechanistic interpretability. (Paper, Demo)
- Pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. (Paper, Demo)
- pyreft: A Powerful, Efficient and Interpretable fine-tuning method. (Paper, Demo)
- repeng: A Python library for generating control vectors with representation engineering. (Paper, Blog)
- Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Paper, Doc, Tutorial)
- LXT: LRP eXplains Transformers: Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
- Tuned Lens: Tools for understanding how transformer predictions are built layer-by-layer. (Paper, Doc)
- Inseq: Pytorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
- shap: Python library for computing SHAP feature / token importance for any black box model. Works with hugginface, pytorch, tensorflow models, including LLMs. (Paper, Doc)
- captum: Model interpretability and understanding library for PyTorch (Paper, Doc)
- A Barebones Guide to Mechanistic Interpretability Prerequisites
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
- 200 Concrete Open Problems in Mechanistic Interpretability
- 3Blue1Brown: But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning
- 3Blue1Brown: Attention in transformers, visually explained | Chapter 6, Deep Learning
- 3Blue1Brown: How might LLMs store facts | Chapter 7, Deep Learning
- ARENA 3.0: understand mechanistic interpretability using TransformerLens.
- EACL24: Transformer-specific Interpretability (Github)
- ICML24: Physics of Language Models (Youtube)
- NAACL24: Explanations in the Era of Large Language Models
- AI Alignment Forum
- LessWrong
- Mechanistic Interpretability Workshop 2024 ICML (Accepted papers)
- Attributing Model Behavior at Scale Workshop 2023 NeurIPS (Accepted papers)
- BlackboxNLP 2023 EMNLP (Accepted papers)
- Transformer Debugger: investigate specific behaviors of small LLMs
- LLM Transparency Tool (Demo)
- sae_vis: a tool to replicate Anthropic's sparse autoencoder visualisations (Demo)
- Neuronpedia: an open platform for interpretability research. (Doc)
- Comgra: A tool to analyze and debug neural networks in pytorch. Use a GUI to traverse the computation graph and view the data from many different angles at the click of a button. (Paper)
- ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.
Title | Venue | Date | Code |
---|---|---|---|
Attention Heads of Large Language Models: A Survey | arXiv | 2024-09-06 | Github |
Internal Consistency and Self-Feedback in Large Language Models: A Survey | arXiv | 2024-07-22 | Github Paper List |
Relational Composition in Neural Networks: A Survey and Call to Action |
MechInterp@ICML | 2024-07-15 | - |
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP | arXiv | 2024-06-18 | - |
A Primer on the Inner Workings of Transformer-based Language Models | arXiv | 2024-05-02 | - |
Mechanistic Interpretability for AI Safety -- A Review | arXiv | 2024-04-22 | - |
From Understanding to Utilization: A Survey on Explainability for Large Language Models | arXiv | 2024-02-22 | - |
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks | arXiv | 2023-08-18 | - |
Title | Venue | Date | Code |
---|---|---|---|
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | ICML | 2024-06-25 | - |
Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | ICML | 2024-06-03 | - |
Interpretability Needs a New Paradigm | arXiv | 2024-05-08 | - |
Position Paper: Toward New Frameworks for Studying Model Representations | arXiv | 2024-02-06 | - |
Rethinking Interpretability in the Era of Large Language Models | arXiv | 2024-01-30 | - |
Title | Venue | Date | Code | Blog |
---|---|---|---|---|
Benchmarking Mental State Representations in Language Models |
MechInterp@ICML | 2024-06-25 | - | - |
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains |
ACL | 2024-05-21 | Dataset | Blog |
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations |
arXiv | 2024-02-27 | Github | - |
CausalGym: Benchmarking causal interpretability methods on linguistic tasks |
arXiv | 2024-02-19 | Github | - |
Title | Venue | Date | Code | Blog |
---|---|---|---|---|
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability |
arXiv | 2024-01-08 | - | - |
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability |
arXiv | 2023-06-06 | Github | - |
Title | Venue | Date | Code | Blog |
---|---|---|---|---|
An introduction to graphical tensor notation for mechanistic interpretability |
arXiv | 2024-02-02 | - | - |
Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks |
arXiv | 2023-10-03 | Github | - |
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for Awesome-Interpretability-in-Large-Language-Models
Similar Open Source Tools
Awesome-Interpretability-in-Large-Language-Models
This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.
Awesome-Knowledge-Distillation-of-LLMs
A collection of papers related to knowledge distillation of large language models (LLMs). The repository focuses on techniques to transfer advanced capabilities from proprietary LLMs to smaller models, compress open-source LLMs, and refine their performance. It covers various aspects of knowledge distillation, including algorithms, skill distillation, verticalization distillation in fields like law, medical & healthcare, finance, science, and miscellaneous domains. The repository provides a comprehensive overview of the research in the area of knowledge distillation of LLMs.
Awesome-LLMs-for-Video-Understanding
Awesome-LLMs-for-Video-Understanding is a repository dedicated to exploring Video Understanding with Large Language Models. It provides a comprehensive survey of the field, covering models, pretraining, instruction tuning, and hybrid methods. The repository also includes information on tasks, datasets, and benchmarks related to video understanding. Contributors are encouraged to add new papers, projects, and materials to enhance the repository.
llms-from-scratch-cn
This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.
MMOS
MMOS (Mix of Minimal Optimal Sets) is a dataset designed for math reasoning tasks, offering higher performance and lower construction costs. It includes various models and data subsets for tasks like arithmetic reasoning and math word problem solving. The dataset is used to identify minimal optimal sets through reasoning paths and statistical analysis, with a focus on QA-pairs generated from open-source datasets. MMOS also provides an auto problem generator for testing model robustness and scripts for training and inference.
LLamaTuner
LLamaTuner is a repository for the Efficient Finetuning of Quantized LLMs project, focusing on building and sharing instruction-following Chinese baichuan-7b/LLaMA/Pythia/GLM model tuning methods. The project enables training on a single Nvidia RTX-2080TI and RTX-3090 for multi-round chatbot training. It utilizes bitsandbytes for quantization and is integrated with Huggingface's PEFT and transformers libraries. The repository supports various models, training approaches, and datasets for supervised fine-tuning, LoRA, QLoRA, and more. It also provides tools for data preprocessing and offers models in the Hugging Face model hub for inference and finetuning. The project is licensed under Apache 2.0 and acknowledges contributions from various open-source contributors.
MindChat
MindChat is a psychological large language model designed to help individuals relieve psychological stress and solve mental confusion, ultimately improving mental health. It aims to provide a relaxed and open conversation environment for users to build trust and understanding. MindChat offers privacy, warmth, safety, timely, and convenient conversation settings to help users overcome difficulties and challenges, achieve self-growth, and development. The tool is suitable for both work and personal life scenarios, providing comprehensive psychological support and therapeutic assistance to users while strictly protecting user privacy. It combines psychological knowledge with artificial intelligence technology to contribute to a healthier, more inclusive, and equal society.
VoiceBench
VoiceBench is a repository containing code and data for benchmarking LLM-Based Voice Assistants. It includes a leaderboard with rankings of various voice assistant models based on different evaluation metrics. The repository provides setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. Users can submit new voice assistant results through the issue tracker for updates on the ranking list.
Awesome-Tabular-LLMs
This repository is a collection of papers on Tabular Large Language Models (LLMs) specialized for processing tabular data. It includes surveys, models, and applications related to table understanding tasks such as Table Question Answering, Table-to-Text, Text-to-SQL, and more. The repository categorizes the papers based on key ideas and provides insights into the advancements in using LLMs for processing diverse tables and fulfilling various tabular tasks based on natural language instructions.
Awesome-Model-Merging-Methods-Theories-Applications
A comprehensive repository focusing on 'Model Merging in LLMs, MLLMs, and Beyond', providing an exhaustive overview of model merging methods, theories, applications, and future research directions. The repository covers various advanced methods, applications in foundation models, different machine learning subfields, and tasks like pre-merging methods, architecture transformation, weight alignment, basic merging methods, and more.
Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers for Large Language Models (like ChatGPT, LLaMA, GLM, Baichuan, etc) Evaluation on Language capabilities, Knowledge, Reasoning, Fairness and Safety.
PaddleNLP
PaddleNLP is an easy-to-use and high-performance NLP library. It aggregates high-quality pre-trained models in the industry and provides out-of-the-box development experience, covering a model library for multiple NLP scenarios with industry practice examples to meet developers' flexible customization needs.
MedicalGPT
MedicalGPT is a training medical GPT model with ChatGPT training pipeline, implement of Pretraining, Supervised Finetuning, RLHF(Reward Modeling and Reinforcement Learning) and DPO(Direct Preference Optimization).
rulm
This repository contains language models for the Russian language, as well as their implementation and comparison. The models are trained on a dataset of ChatGPT-generated instructions and chats in Russian. They can be used for a variety of tasks, including question answering, text generation, and translation.
Firefly
Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.
For similar tasks
Awesome-Interpretability-in-Large-Language-Models
This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.
FigStep
FigStep is a black-box jailbreaking algorithm against large vision-language models (VLMs). It feeds harmful instructions through the image channel and uses benign text prompts to induce VLMs to output contents that violate common AI safety policies. The tool highlights the vulnerability of VLMs to jailbreaking attacks, emphasizing the need for safety alignments between visual and textual modalities.
Awesome-LLM-Interpretability
Awesome-LLM-Interpretability is a curated list of materials related to LLM (Large Language Models) interpretability, covering tutorials, code libraries, surveys, videos, papers, and blogs. It includes resources on transformer mechanistic interpretability, visualization, interventions, probing, fine-tuning, feature representation, learning dynamics, knowledge editing, hallucination detection, and redundancy analysis. The repository aims to provide a comprehensive overview of tools, techniques, and methods for understanding and interpreting the inner workings of large language models.
pytorch-grad-cam
This repository provides advanced AI explainability for PyTorch, offering state-of-the-art methods for Explainable AI in computer vision. It includes a comprehensive collection of Pixel Attribution methods for various tasks like Classification, Object Detection, Semantic Segmentation, and more. The package supports high performance with full batch image support and includes metrics for evaluating and tuning explanations. Users can visualize and interpret model predictions, making it suitable for both production and model development scenarios.
For similar jobs
responsible-ai-toolbox
Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment interfaces and libraries for understanding AI systems. It empowers developers and stakeholders to develop and monitor AI responsibly, enabling better data-driven actions. The toolbox includes visualization widgets for model assessment, error analysis, interpretability, fairness assessment, and mitigations library. It also offers a JupyterLab extension for managing machine learning experiments and a library for measuring gender bias in NLP datasets.
fairlearn
Fairlearn is a Python package designed to help developers assess and mitigate fairness issues in artificial intelligence (AI) systems. It provides mitigation algorithms and metrics for model assessment. Fairlearn focuses on two types of harms: allocation harms and quality-of-service harms. The package follows the group fairness approach, aiming to identify groups at risk of experiencing harms and ensuring comparable behavior across these groups. Fairlearn consists of metrics for assessing model impacts and algorithms for mitigating unfairness in various AI tasks under different fairness definitions.
Open-Prompt-Injection
OpenPromptInjection is an open-source toolkit for attacks and defenses in LLM-integrated applications, enabling easy implementation, evaluation, and extension of attacks, defenses, and LLMs. It supports various attack and defense strategies, including prompt injection, paraphrasing, retokenization, data prompt isolation, instructional prevention, sandwich prevention, perplexity-based detection, LLM-based detection, response-based detection, and know-answer detection. Users can create models, tasks, and apps to evaluate different scenarios. The toolkit currently supports PaLM2 and provides a demo for querying models with prompts. Users can also evaluate ASV for different scenarios by injecting tasks and querying models with attacked data prompts.
aws-machine-learning-university-responsible-ai
This repository contains slides, notebooks, and data for the Machine Learning University (MLU) Responsible AI class. The mission is to make Machine Learning accessible to everyone, covering widely used ML techniques and applying them to real-world problems. The class includes lectures, final projects, and interactive visuals to help users learn about Responsible AI and core ML concepts.
AIF360
The AI Fairness 360 toolkit is an open-source library designed to detect and mitigate bias in machine learning models. It provides a comprehensive set of metrics, explanations, and algorithms for bias mitigation in various domains such as finance, healthcare, and education. The toolkit supports multiple bias mitigation algorithms and fairness metrics, and is available in both Python and R. Users can leverage the toolkit to ensure fairness in AI applications and contribute to its development for extensibility.
Awesome-Interpretability-in-Large-Language-Models
This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.
hallucination-index
LLM Hallucination Index - RAG Special is a comprehensive evaluation of large language models (LLMs) focusing on context length and open vs. closed-source attributes. The index explores the impact of context length on model performance and tests the assumption that closed-source LLMs outperform open-source ones. It also investigates the effectiveness of prompting techniques like Chain-of-Note across different context lengths. The evaluation includes 22 models from various brands, analyzing major trends and declaring overall winners based on short, medium, and long context insights. Methodologies involve rigorous testing with different context lengths and prompting techniques to assess models' abilities in handling extensive texts and detecting hallucinations.
llm-misinformation-survey
The 'llm-misinformation-survey' repository is dedicated to the survey on combating misinformation in the age of Large Language Models (LLMs). It explores the opportunities and challenges of utilizing LLMs to combat misinformation, providing insights into the history of combating misinformation, current efforts, and future outlook. The repository serves as a resource hub for the initiative 'LLMs Meet Misinformation' and welcomes contributions of relevant research papers and resources. The goal is to facilitate interdisciplinary efforts in combating LLM-generated misinformation and promoting the responsible use of LLMs in fighting misinformation.