Awesome-LLM-Post-training

Awesome Reasoning LLM Tutorial/Survey/Guide

The Awesome-LLM-Post-training repository is a curated collection of influential papers, code implementations, benchmarks, and resources on post-training methodologies for Large Language Models (LLMs). It covers various aspects of LLMs, including reasoning, decision-making, reinforcement learning, reward learning, policy optimization, explainability, multimodal agents, benchmarks, tutorials, libraries, and implementations. The repository aims to provide a comprehensive overview and resources for researchers and practitioners interested in advancing LLM technologies.

README:

LLM Post-Training: A Deep Dive into Reasoning Large Language Models


Welcome to the Awesome-LLM-Post-training repository! This repository is a curated collection of the most influential papers, code implementations, benchmarks, and resources on post-training methodologies for Large Language Models (LLMs).

Our work is based on the following paper:
📄 LLM Post-Training: A Deep Dive into Reasoning Large Language Models – Available on arXiv

* Equally contributing first authors

Feel free to ⭐ star and fork this repository to keep up with the latest advancements and contribute to the community.



📌 Contents

| Section | Subsection |
|---|---|
| 📖 Papers | Survey, Theory, Explainability |
| 🤖 LLMs in RL | LLM-Augmented Reinforcement Learning |
| 🏆 Reward Learning | Human Feedback, Preference-Based RL, Intrinsic Motivation |
| 🚀 Policy Optimization | Offline RL, Imitation Learning, Hierarchical RL |
| 🧠 LLMs for Reasoning & Decision-Making | Causal Reasoning, Planning, Commonsense RL |
| 🌀 Exploration & Generalization | Zero-Shot RL, Generalization in RL, Self-Supervised RL |
| 🤝 Multi-Agent RL (MARL) | Emergent Communication, Coordination, Social RL |
| ⚡ Applications & Benchmarks | Autonomous Agents, Simulations, LLM-RL Benchmarks |
| 📚 Tutorials & Courses | Lectures, Workshops |
| 🛠️ Libraries & Implementations | Open-Source RL-LLM Frameworks |
| 🔗 Other Resources | Additional Research & Readings |

📖 Papers

πŸ” Survey

| Title | Publication Date | Link |
|---|---|---|
| Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning | 3 Mar 2025 | arXiv |
| From System 1 to System 2: A Survey of Reasoning Large Language Models | 25 Feb 2025 | arXiv |
| Empowering LLMs with Logical Reasoning: A Comprehensive Survey | 24 Feb 2025 | arXiv |
| Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models | 16 Jan 2025 | arXiv |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | 26 Sep 2024 | arXiv |
| Reasoning with Large Language Models, a Survey | 16 Jul 2024 | arXiv |
| Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods | 30 Mar 2024 | arXiv |
| Reinforcement Learning Enhanced LLMs: A Survey | 5 Dec 2024 | arXiv |
| Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey | 29 Dec 2024 | arXiv |
| Large Language Models: A Survey of Their Development, Capabilities, and Applications | 15 Jan 2025 | Springer |
| A Survey on Multimodal Large Language Models | 10 Feb 2025 | Oxford Academic |
| Large Language Models (LLMs): Survey, Technical Frameworks, and Future Directions | 20 Jul 2024 | Springer |
| Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machines | 11 Feb 2024 | arXiv |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 14 Mar 2024 | arXiv |
| Reinforcement Learning Problem Solving with Large Language Models | 29 Apr 2024 | arXiv |
| A Survey on Large Language Models for Reinforcement Learning | 10 Dec 2023 | arXiv |
| Large Language Models as Decision-Makers: A Survey | 23 Aug 2023 | arXiv |
| A Survey on Large Language Model Alignment Techniques | 6 May 2023 | arXiv |
| Reinforcement Learning with Human Feedback: A Survey | 12 Apr 2023 | arXiv |
| Reasoning with Large Language Models: A Survey | 14 Feb 2023 | arXiv |
| A Survey on Foundation Models for Decision Making | 9 Jan 2023 | arXiv |
| Large Language Models in Reinforcement Learning: Opportunities and Challenges | 5 Dec 2022 | arXiv |

🤖 LLMs in RL

  • "Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search" | 08-02-2025 | [Paper]
  • "DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL" | 08-02-2025 | [Paper]
  • "QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search" | 08-02-2025 | [Paper]
  • "Process Reinforcement through Implicit Rewards" | 04-02-2025 | [Paper]
  • "Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling" | 29-01-2025 | [Paper]
  • "Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies" | 30-01-2025 [Paper]
  • "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" | 26-01-2025 | [Paper]
  • "Kimi k1.5: Scaling Reinforcement Learning with LLMs" | 25-01-2025 | [Paper]
  • "Does RLHF Scale? Exploring the Impacts From Data, Model, and Method" | 18-12-2024 | [Paper]
  • "Offline Reinforcement Learning for LLM Multi-Step Reasoning" | 29-12-2024 | [Paper]
  • "ReFT: Representation Finetuning for Language Models" | 10-07-2024 | [Paper]
  • "Deepseekmath: Pushing the Limits of Mathematical Reasoning in Open Language Models" | 02-02-2024 | [Paper]
  • "Reasoning with Reinforced Functional Token Tuning" | 15-02-2025 | [Paper]
  • "Value-Based Deep RL Scales Predictably" | 07-02-2025 | [Paper]
  • "InfAlign: Inference-aware Language Model Alignment" | 30-12-2024 | [Paper]
  • "LIMR: Less is More for RL Scaling" | 12-02-2025 | [Paper]
  • "A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics" | 14-02-2025 | [Paper]

πŸ† Reward Learning (Process Reward Models)

  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [Paper]
  • ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [Paper]
  • The Lessons of Developing Process Reward Models in Mathematical Reasoning. [Paper]
  • ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. [Paper]
  • AutoPSV: Automated Process-Supervised Verifier [Paper]
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
  • Free Process Rewards without Process Labels. [Paper]
  • Outcome-Refining Process Supervision for Code Generation [Paper]
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [Paper]
  • OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [Paper]
  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper]
  • Let's Verify Step by Step. [Paper]
  • Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
  • Making Large Language Models Better Reasoners with Step-Aware Verifier [Paper]
  • Solving Math Word Problems with Process and Outcome-Based Feedback [Paper]
  • Uncertainty-Aware Step-wise Verification with Generative Reward Models [Paper]
  • AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
  • Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [Paper]
  • Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
  • Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [Paper]
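
To make the process-reward idea behind the papers above concrete, here is a minimal, generic sketch of step-level scoring: a process reward model (PRM) scores every intermediate reasoning step, and the step scores are aggregated into a trajectory-level score that can rerank sampled solutions or serve as an RL reward. This is an illustrative sketch only, not any specific paper's method; `prm_score_step` is a hypothetical placeholder for a trained PRM.

```python
# Generic sketch of process-level reward scoring (illustrative only).
# `prm_score_step` is a hypothetical stand-in for a trained process reward
# model that returns P(step is correct) in [0, 1].
from typing import Callable, List


def score_solution(
    question: str,
    steps: List[str],
    prm_score_step: Callable[[str, List[str], str], float],
    aggregate: str = "min",
) -> float:
    """Score a chain of thought by scoring each step, then aggregating."""
    step_scores = [
        prm_score_step(question, steps[:i], step)  # condition on prior steps
        for i, step in enumerate(steps)
    ]
    if aggregate == "min":       # a single bad step sinks the whole solution
        return min(step_scores)
    product = 1.0
    for s in step_scores:        # "prod": treat step scores as independent probs
        product *= s
    return product
```

A typical use is best-of-N reranking: sample several candidate solutions, score each with `score_solution`, and keep the highest-scoring one, or feed the scores back as rewards during RL training.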

🚀 Policy Optimization

  • "Decision Transformer: Reinforcement Learning via Sequence Modeling" - Chen et al. (2021) [Paper]
  • "Offline RL with LLMs as Generalist Memory" - Tian et al. (2023) [Paper]

Part 8: Explainability

  • Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [Paper]
  • What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [Paper]
  • When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [Paper]
  • The Impact of Reasoning Step Length on Large Language Models [Paper]
  • Distilling System 2 into System 1 [Paper]
  • System 2 Attention (is something you might need too) [Paper]
  • Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]
  • LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [Paper]
  • Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [Paper]

Part 9: Multimodal Agent-Related Slow-Fast Systems

  • Diving into Self-Evolving Training for Multimodal Reasoning [Paper]
  • Visual Agents as Fast and Slow Thinkers [Paper]
  • Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
  • Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [Paper]
  • Slow Perception: Let's Perceive Geometric Figures Step-by-Step [Paper]
  • AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
  • LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
  • Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
  • I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [Paper]
  • RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [Paper]

Part 10: Benchmarks and Datasets

  • Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [Paper]
  • PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [Paper]
  • MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [Paper]
  • Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [Paper]
  • A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [Paper]
  • EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [Paper]
  • SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [Paper]
  • Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [Paper]
  • FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [Paper]
  • Evaluation of OpenAI o1: Opportunities and Challenges of AGI [Paper]
  • MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [Paper]
  • LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [Paper]
  • Humanity's Last Exam [Paper]
  • LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [Paper]
  • BIG-Bench Extra Hard [Paper]

Part 11: Reasoning and Safety

  • Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [Paper]
  • OverThink: Slowdown Attacks on Reasoning LLMs [Paper]
  • GuardReasoner: Towards Reasoning-based LLM Safeguards [Paper]
  • SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
  • ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [Paper]
  • H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [Paper]
  • BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [Paper]

🚀 RL & LLM Fine-Tuning Repositories

| # | Repository & Link | Description |
|---|---|---|
| 1 | RL4VLM (archived and read-only as of December 15, 2024) | Offers code for fine-tuning large vision-language models as decision-making agents via RL. Includes implementations for training models with task-specific rewards and evaluating them in various environments. |
| 2 | LlamaGym | Simplifies fine-tuning large language model (LLM) agents with online RL. Provides an abstract Agent class to handle various aspects of RL training, allowing for quick iteration and experimentation across different environments. |
| 3 | RL-Based Fine-Tuning of Diffusion Models for Biological Sequences | Accompanies a tutorial and review paper on RL-based fine-tuning, focusing on the design of biological sequences (DNA/RNA). Provides comprehensive tutorials and code implementations for training and fine-tuning diffusion models using RL. |
| 4 | LM-RL-Finetune | Aims to improve KL penalty optimization in RL fine-tuning of language models by computing the KL penalty term analytically (see the sketch after this table). Includes configurations for training with Proximal Policy Optimization (PPO). |
| 5 | InstructLLaMA | Implements pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) to train and fine-tune the LLaMA2 model to follow human instructions, similar to InstructGPT or ChatGPT. |
| 6 | SEIKO | Introduces a novel RL method to efficiently fine-tune diffusion models in an online setting. Its techniques outperform baselines such as PPO, classifier-based guidance, and direct reward backpropagation for fine-tuning Stable Diffusion. |
| 7 | TRL (Train Transformer Language Models with RL) | A state-of-the-art library for post-training foundation models using methods like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), GRPO, and Direct Preference Optimization (DPO). Built on the 🤗 Transformers ecosystem, it supports multiple model architectures and scales efficiently across hardware setups (a minimal usage sketch also follows this table). |
| 8 | Fine-Tuning Reinforcement Learning Models as Continual Learning | Explores fine-tuning RL models as a forgetting-mitigation problem (continual learning). Provides insights and code implementations to address forgetting in RL models. |
| 9 | RL4LMs | A modular RL library to fine-tune language models to human preferences. Rigorously evaluated through 2000+ experiments using the GRUE benchmark, ensuring robustness across various NLP tasks. |
| 10 | Lamorel | A high-throughput, distributed architecture for seamless LLM integration in interactive environments. While not specialized in RL or RLHF by default, it supports custom implementations and is ideal for users needing maximum flexibility. |
| 11 | LLM-Reverse-Curriculum-RL | Implements the ICML 2024 paper "Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning". Focuses on enhancing LLM reasoning capabilities using a reverse curriculum RL approach. |
| 12 | veRL | A flexible, efficient, and production-ready RL training library for large language models (LLMs). Serves as the open-source implementation of the HybridFlow framework and supports various RL algorithms (PPO, GRPO), advanced resource utilization, and scalability up to 70B models on hundreds of GPUs. Integrates with Hugging Face models, supervised fine-tuning, and RLHF with multiple reward types. |
| 13 | trlX | A distributed training framework for fine-tuning large language models (LLMs) with reinforcement learning. Supports both Accelerate and NVIDIA NeMo backends, allowing training of models up to 20B+ parameters. Implements PPO and ILQL, and integrates with CHEESE for human-in-the-loop data collection. |
| 14 | Okapi | A framework for instruction tuning in LLMs with RLHF, supporting 26 languages. Provides multilingual resources such as ChatGPT prompts, instruction datasets, and response ranking data, along with both BLOOM-based and LLaMa-based models and evaluation benchmarks. |
| 15 | LLaMA-Factory | Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024). Supports a wide array of models (e.g., LLaMA, LLaVA, Qwen, Mistral) with methods including pre-training, multimodal fine-tuning, reward modeling, PPO, DPO, and ORPO. Offers scalable tuning (16-bit, LoRA, QLoRA) with advanced optimizations and logging integrations, and provides fast inference via API, Gradio UI, and CLI with vLLM workers. |
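
To illustrate the analytic KL penalty mentioned for entry 4, the sketch below computes the full per-token KL(policy || reference) over the vocabulary from the two models' logits, rather than the usual single-sample estimate log pi(a) - log pi_ref(a). This is a generic, hedged sketch, not code taken from the LM-RL-Finetune repository.

```python
# Hedged sketch of an analytically computed per-token KL penalty for RL
# fine-tuning (illustrative only; not the LM-RL-Finetune implementation).
import torch
import torch.nn.functional as F


def analytic_kl(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Exact KL(pi || pi_ref) per token, summed over the vocabulary.

    Shapes: [batch, seq_len, vocab] -> [batch, seq_len].
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - ref_logp)).sum(dim=-1)


# In a PPO-style objective this penalty is typically subtracted from the task
# reward, e.g. reward = task_reward - beta * analytic_kl(policy_logits, ref_logits).
```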
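
And as a quick orientation to entry 7 (TRL), here is a minimal supervised fine-tuning sketch. It assumes a recent `trl` release (with `SFTConfig`); the model checkpoint and dataset names are illustrative choices, not prescribed by this list.

```python
# Minimal TRL SFT sketch (assumes a recent `trl` release; model and dataset
# names below are illustrative examples, not requirements).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("trl-lib/Capybara", split="train")  # example chat dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",                      # any causal LM checkpoint
    train_dataset=train_ds,
    args=SFTConfig(output_dir="sft-out", max_steps=100),
)
trainer.train()
```

The same trainer pattern carries over to TRL's preference and RL methods (DPOTrainer, PPO/GRPO trainers), which is why it serves as a common entry point for the post-training recipes listed in this repository.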

⚡ Applications & Benchmarks

  • "AutoGPT: LLMs for Autonomous RL Agents" - OpenAI (2023) [Paper]
  • "Barkour: Benchmarking LLM-Augmented RL" - Wu et al. (2023) [Paper]

📚 Tutorials & Courses

  • 🎥 Deep RL Bootcamp (Berkeley) [Website]
  • 🎥 DeepMind RL Series [Website]

πŸ› οΈ Libraries & Implementations


🔗 Other Resources


📌 Contributing

Contributions are welcome! If you have relevant papers, code, or insights, feel free to submit a pull request.

Star History Chart

Citation

If you find our work useful or use it in your research, please consider citing:

@misc{kumar2025llmposttrainingdeepdive,
      title={LLM Post-Training: A Deep Dive into Reasoning Large Language Models}, 
      author={Komal Kumar and Tajamul Ashraf and Omkar Thawakar and Rao Muhammad Anwer and Hisham Cholakkal and Mubarak Shah and Ming-Hsuan Yang and Phillip H. S. Torr and Salman Khan and Fahad Shahbaz Khan},
      year={2025},
      eprint={2502.21321},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.21321}, 
}

License 📜

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.

