Awesome-RL-for-LRMs

A Survey of Reinforcement Learning for Large Reasoning Models

Stars: 1441


This repository contains a collection of awesome resources for reinforcement learning in language models. It includes tutorials, code implementations, research papers, and tools to help researchers and practitioners explore and apply reinforcement learning techniques in natural language processing tasks. Whether you are a beginner or an expert in the field, this repository aims to provide valuable insights and guidance to enhance your understanding and implementation of reinforcement learning in language models.

README:

A Survey of Reinforcement Learning for Large Reasoning Models

We welcome everyone to open an issue for any related work we haven’t discussed, and we’ll try to address it in the next release!

🎉 News

  • [2025-09-18] 🎉 We updated the full list of papers to follow the category structure of the survey!
  • [2025-09-12] 🎉 Our survey was ranked #1 Paper of the Day on 🤗 Hugging Face Daily Papers!
  • [2025-09-11] 🔥 Excited to release our RL for LRMs Survey! We’ll be updating the full list of papers with a new category structure soon. Check it out: Paper.
  • [2025-08-15] 🔥 Introducing SSRL: an investigation of Agentic Search RL without reliance on an external search engine. Check it out: GitHub and Paper.
  • [2025-05-27] 🔥 Introducing MARTI: A Framework for LLM-based Multi-Agent Reinforced Training and Inference. Check it out: GitHub.
  • [2025-04-23] 🔥 Introducing TTRL: an open-source solution for online RL on data without ground-truth labels, especially test data. Check it out: GitHub and Paper.
  • [2025-03-20] 🔥 We are excited to introduce a collection of papers and projects on RL for reasoning models!

🗺️ Overview

Our survey provides a comprehensive examination of Reinforcement Learning for Large Reasoning Models.

Overview of RL for LRMs Survey

We organize the survey into five main sections:

  1. Foundational Components: Reward design, policy optimization, and sampling strategies
  2. Foundational Problems: Key debates and challenges in RL for LRMs
  3. Training Resources: Static corpora, dynamic environments, and infrastructure
  4. Applications: Real-world implementations across diverse domains
  5. Future Directions: Emerging research opportunities and challenges

📄 Paper List

Frontier Models

Date Name Title Paper Github
2025-08 Intern-S1 Intern-S1: A Scientific Multimodal Foundation Model Paper GitHub Stars
2025-08 GLM-4.5 GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models Paper GitHub Stars
2025-08 gpt-oss gpt-oss-120b & gpt-oss-20b Model Card Paper GitHub Stars
2025-08 InternVL3.5 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper GitHub Stars
2025-07 Kimi K2 Kimi K2: Open Agentic Intelligence Paper GitHub Stars
2025-07 Step 3 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Paper GitHub Stars
2025-07 GLM-4.1V-Thinking GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper GitHub Stars
2025-07 Skywork-R1V3 Skywork-R1V3 Technical Report Paper GitHub Stars
2025-07 GLM-4.5V GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper GitHub Stars
2025-06 Magistral Magistral Paper -
2025-06 MiniMax-M1 MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention Paper GitHub Stars
2025-05 MiMo MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining Paper GitHub Stars
2025-05 Qwen3 Qwen3 Technical Report Paper GitHub Stars
2025-05 Llama-Nemotron-Ultra Llama-Nemotron: Efficient Reasoning Models Paper GitHub Stars
2025-05 INTELLECT-2 INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning Paper -
2025-05 Hunyuan-TurboS Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought Paper GitHub Stars
2025-05 Skywork OR-1 Skywork Open Reasoner 1 Technical Report Paper GitHub Stars
2025-04 Phi-4 Reasoning Phi-4-reasoning Technical Report Paper -
2025-04 Skywork-R1V2 Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning Paper GitHub Stars
2025-04 InternVL3 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper GitHub Stars
2025-03 ORZ Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model Paper GitHub Stars
2025-01 DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper GitHub Stars
- QwQ QwQ-32B: Embracing the Power of Reinforcement Learning Blog GitHub Stars
- Seed-OSS Seed-OSS Open-Source Models Paper GitHub Stars
- ERNIE-4.5-Thinking ERNIE 4.5 Technical Report Blog -

Reward Design

Generative Rewards

Date Name Title Paper Github
2025-08 CAPO CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment Paper GitHub Stars
2025-08 CompassVerifier CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward Paper GitHub Stars
2025-08 Cooper Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models Paper GitHub Stars
2025-08 ReviewRL ReviewRL: Towards Automated Scientific Review with RL Paper GitHub Stars
2025-08 Rubicon Reinforcement Learning with Rubric Anchors Paper -
2025-08 RuscaRL Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning Paper -
2025-07 OMNI-THINKER OMNI-THINKER: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards Paper -
2025-07 URPO URPO: A Unified Reward & Policy Optimization Framework for Large Language Models Paper -
2025-07 RaR Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains Paper -
2025-07 RLCF Checklists Are Better Than Reward Models For Aligning Language Models Paper -
2025-07 PCL Post-Completion Learning for Language Models Paper -
2025-07 K2 Kimi K2: Open Agentic Intelligence Paper -
2025-07 LIBRA Libra: Assessing and Improving Reward Model by Learning to Think Paper -
2025-07 TP-GRPO Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner Paper GitHub Stars
2025-06 RewardAnything RewardAnything: Generalizable Principle-Following Reward Models Paper Blog
2025-06 Writing-Zero Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards Paper -
2025-06 Critique-GRPO Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback Paper GitHub Stars
2025-06 PAG PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier Paper -
2025-06 GRAM GRAM: A Generative Foundation Reward Model for Reward Generalization Paper GitHub Stars
2025-06 ProxyReward From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation Paper -
2025-06 QA-LIGN QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA Paper -
2025-05 RM-R1 RM-R1: Reward Modeling as Reasoning Paper GitHub Stars
2025-05 J1 J1: Incentivizing Thinking in LLM-as-a-Judge via RL Paper -
2025-05 TinyV TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning Paper GitHub Stars
2025-05 General-Reasoner General-Reasoner: Advancing LLM Reasoning Across All Domains Paper -
2025-05 RRM Reward Reasoning Model Paper -
2025-05 RL Tango RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning Paper GitHub Stars
2025-05 Think-RM Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models Paper GitHub Stars
2025-04 JudgeLRM JudgeLRM: Large Reasoning Models as a Judge Paper GitHub Stars
2025-04 GenPRM GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning Paper GitHub Stars
2025-04 DeepSeek-GRM Inference-Time Scaling for Generalist Reward Modeling Paper -
2025-04 AIR AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset Paper -
2025-04 Pairwise-RL A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization Paper -
2025-04 xVerify xVerify: Efficient Answer Verifier for Reasoning Model Evaluations Paper GitHub Stars
2025-04 Seed-Thinking-v1.5 Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning Paper -
2025-04 ThinkPRM Process Reward Models That Think Paper GitHub Stars
2025-03 - Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains Paper -
2025-02 - Self-rewarding correction for mathematical reasoning Paper GitHub Stars
2024-10 GenRM Generative Reward Models Paper -
2024-08 CLoud Critique-out-Loud Reward Models Paper GitHub Stars
2024-08 Generative Verifier Generative Verifiers: Reward Modeling as Next-Token Prediction Paper -
2024-01 Self-Rewarding LM Self-Rewarding Language Models Paper -
2023-10 Auto-J Generative Judge for Evaluating Alignment Paper GitHub Stars
2023-06 Judge LLM-as-a-Judge Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Paper GitHub Stars

Dense Rewards

Date Name Title Paper Github
2025-09 TARL Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents Paper -
2025-09 PROF Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training Paper GitHub Stars
2025-09 HICRA Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning Paper -
2025-08 KlearReasoner Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization Paper GitHub Stars
2025-08 CAPO CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment Paper GitHub Stars
2025-08 GTPO & GRPO-S GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy Paper -
2025-08 VSRM Promoting Efficient Reasoning with Verifiable Stepwise Reward Paper -
2025-08 G-RA Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards Paper -
2025-08 SSPO SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression Paper -
2025-08 AIRL-S Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS Paper -
2025-08 TreePO TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling Paper GitHub Stars
2025-08 MUA-RL MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use Paper -
2025-07 SPRO Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Paper -
2025-07 FR3E First Return, Entropy-Eliciting Explore Paper -
2025-07 ARPO Agentic Reinforced Policy Optimization Paper GitHub Stars
2025-07 TP-GRPO Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner Paper GitHub Stars
2025-06 TreeRPO TreeRPO: Tree Relative Policy Optimization Paper GitHub Stars
2025-06 TreeRL TreeRL: LLM Reinforcement Learning with On-Policy Tree Search Paper GitHub Stars
2025-06 Entropy Advantage Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs Paper -
2025-06 ReasonFlux-PRM ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs Paper GitHub Stars
2025-05 S-GRPO S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models Paper -
2025-05 GiGPO Group-in-Group Policy Optimization for LLM Agent Training Paper GitHub Stars
2025-05 - Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment Paper -
2025-05 Tango RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning Paper GitHub Stars
2025-05 StepSearch StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization Paper GitHub Stars
2025-05 - Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition Paper -
2025-05 Tool-Star Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning Paper GitHub Stars
2025-05 SPA-RL SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution Paper GitHub Stars
2025-05 SPO Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models Paper GitHub Stars
2025-04 GenPRM GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning Paper GitHub Stars
2025-04 PURE Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning Paper GitHub Stars
2025-03 MRT Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning Paper GitHub Stars
2025-03 SWEET-RL SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks Paper GitHub Stars
2025-02 PRIME Process Reinforcement through Implicit Rewards Paper GitHub Stars
2024-12 Implicit PRM Free Process Rewards without Process Labels Paper GitHub Stars
2024-10 VinePPO VinePPO: Refining Credit Assignment in RL Training of LLMs Paper GitHub Stars
2024-10 PAV Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning Paper -
2024-04 - From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function Paper -
2024-03 GELI Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback Paper -
2023-12 Math-Shepherd Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations Paper -
2023-05 PRM800K Let's Verify Step by Step Paper GitHub Stars
2022-11 - Solving math word problems with process- and outcome-based feedback Paper -

Unsupervised Rewards

Date Name Title Paper Github
2025-08 Co-Reward Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement Paper GitHub Stars
2025-08 SQLM Self-Questioning Language Models Paper GitHub Stars
2025-08 R-zero R-Zero: Self-Evolving Reasoning LLM from Zero Data Paper GitHub Stars
2025-08 ETTRL ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism Paper -
2025-07 RLSF Post-Training Large Language Models via Reinforcement Learning from Self-Feedback Paper -
2025-06 RLSC Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models Paper -
2025-06 RPT Reinforcement Pre-Training Paper -
2025-06 CoVo Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning Paper GitHub Stars
2025-06 SEAL Self-Adapting Language Models Paper -
2025-06 Spurious Rewards Spurious Rewards: Rethinking Training Signals in RLVR Paper GitHub Stars
2025-06 No Free Lunch No Free Lunch: Rethinking Internal Feedback for LLM Reasoning Paper -
2025-05 Absolute Zero Absolute Zero: Reinforced Self-play Reasoning with Zero Data Paper GitHub Stars
2025-05 EM-RL The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning Paper GitHub Stars
2025-05 SSR-Zero SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation Paper GitHub Stars
2025-05 - Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers Paper GitHub Stars
2025-05 RLIF Learning to Reason without External Rewards Paper GitHub Stars
2025-05 SeRL SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data Paper GitHub Stars
2025-05 SRT Can Large Reasoning Models Self-Train? Paper GitHub Stars
2025-05 RENT-RL Maximizing Confidence Alone Improves Reasoning Paper GitHub Stars
2025-04 EMPO Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization Paper GitHub Stars
2025-04 TRANS-ZERO TRANS-ZERO: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data Paper GitHub Stars
2025-04 TTRL TTRL: Test-Time Reinforcement Learning Paper GitHub Stars
2025-04 One-Shot-RLVR Reinforcement Learning for Reasoning in Large Language Models with One Training Example Paper GitHub Stars
2025-02 CAGSR A Self-Supervised Reinforcement Learning Approach for Fine-Tuning Large Language Models Using Cross-Attention Signals Paper -
2024-07 MINIMO Learning Formal Mathematics From Intrinsic Motivation Paper GitHub Stars

Reward Shaping

Date Name Title Paper Github
2025-09 CDE CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models Paper -
2025-09 DARLING Jointly Reinforcing Diversity and Quality in Language Model Generations Paper GitHub Stars
2025-09 DRER Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL Paper -
2025-09 OBE Outcome-based Exploration for LLM Reasoning Paper -
2025-08 Pass@kTraining Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models Paper GitHub Stars
2025-05 PKPO Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems Paper -
2025-05 rl-without-gt Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers Paper GitHub Stars
2025-03 CrossDomain-RLVR Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains Paper -
2025-01 DeepSeek-R1 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper GitHub Stars
2024-09 Qwen2.5-Math Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement Paper GitHub Stars

Policy Optimization

Policy Gradient Objective

Date Name Title Paper Github
2017-07 PPO Proximal policy optimization algorithms Paper -
- PG Policy gradient methods for reinforcement learning with function approximation Paper -
- REINFORCE Simple statistical gradient-following algorithms for connectionist reinforcement learning Paper -
- TRPO Trust region policy optimization Paper -

Critic-based Algorithms

Date Name Title Paper Github
2025-08 VRPO VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision Paper -
2025-05 VerIPO VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization Paper GitHub Stars
2025-04 VAPO VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks Paper -
2025-03 VCPPO What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the Secret Paper -
2025-03 Open-Reasoner-Zero Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model Paper GitHub Stars
2025-02 PRIME Process Reinforcement through Implicit Rewards Paper GitHub Stars
2024-12 Implicit PRM Free Process Rewards without Process Labels Paper GitHub Stars
2023-12 Math-Shepherd Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations Paper -
2015-06 GAE High-Dimensional Continuous Control Using Generalized Advantage Estimation Paper -
- AutoPSV AutoPSV: Automated Process-Supervised Verifier Paper GitHub Stars

Critic-Free Algorithms

Date Name Title Paper Github
2025-09 UPGE Towards a Unified View of Large Language Model Post-Training Paper GitHub Stars
2025-09 SPO Single-stream Policy Optimization Paper -
2025-08 LitePPO Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning Paper -
2025-07 R1-RE R1-RE: Cross-Domain Relation Extraction with RLVR Paper -
2025-07 GSPO Group Sequence Policy Optimization Paper -
2025-06 CISPO MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention Paper GitHub Stars
2025-05 CPGD CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models Paper GitHub Stars
2025-05 NFT Bridging Supervised Learning and Reinforcement Learning in Math Reasoning Paper -
2025-05 Clip-Cov/KL-Cov The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models Paper GitHub Stars
2025-03 OpenVLThinker OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles Paper GitHub Stars
2025-03 DAPO DAPO: an Open-Source LLM Reinforcement Learning System at Scale Paper GitHub Stars
2025-03 Dr. GRPO Understanding R1-Zero-Like Training: A Critical Perspective Paper GitHub Stars
2025-01 Kimi k1.5 Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper -
2024-02 RLOO Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper -
2024-02 GRPO DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models Paper GitHub Stars
2023-10 ReMax ReMax: A Simple, Effective, and Efficient Method for Aligning Large Language Models Paper GitHub Stars
- REINFORCE Simple statistical gradient-following algorithms for connectionist reinforcement learning Paper -
- REINFORCE++ REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models Paper GitHub Stars
- VinePPO VinePPO: Unlocking RL Potential for LLM Reasoning through Refined Credit Assignment Paper GitHub Stars
- FlashRL Fast RL training with Quantized Rollouts Paper GitHub Stars

Off-policy Optimization

Date Name Title Paper Github
2025-09 HPT Towards a Unified View of Large Language Model Post-Training Paper GitHub Stars
2025-08 DFT On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification Paper GitHub Stars
2025-08 RED Recall-Extend Dynamics: Enhancing Small Language Models through Controlled Exploration and Refined Offline Integration Paper GitHub Stars
2025-07 Prefix‑RFT Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling Paper -
2025-07 ReMix Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model Paper GitHub Stars
2025-06 ReLIFT Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions Paper GitHub Stars
2025-06 BREAD BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning Paper -
2025-06 SRFT SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning Paper -
2025-05 UFT UFT: Unifying Supervised and Reinforcement Fine-Tuning Paper GitHub Stars
2025-04 LUFFY Learning to Reason under Off-Policy Guidance Paper GitHub Stars
2025-03 SPO Soft Policy Optimization: Online Off-Policy RL for Sequence Models Paper GitHub Stars
2025-03 TOPR Tapered Off-Policy REINFORCE: Stable and Efficient Reinforcement Learning for LLMs Paper -
2024-05 IFT Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process Paper GitHub Stars
2023-05 DPO Direct Preference Optimization: Your Language Model is Secretly a Reward Model Paper -
2015-11 - Fixed point quantization of deep convolutional networks Paper -
- - Your Efficient RL Framework Secretly Brings You Off-Policy RL Training Paper GitHub Stars

Off-policy Optimization (Exp replay)

Date Name Title Paper Github
2025-09 SAPO Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing Paper -
2025-09 SEELE Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding Paper GitHub Stars
2025-08 Memory-R1 Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning Paper -
2025-07 RLEP RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning Paper GitHub Stars
2025-06 EFRame EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework Paper GitHub Stars
2025-05 ARPO ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay Paper GitHub Stars
2025-04 - Improving RL Exploration for LLM Reasoning through Retrospective Replay Paper -

Regularization Objectives

Date Name Title Paper Github
2025-09 CDE CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models Paper -
2025-09 DPH RL The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward Paper GitHub Stars
2025-09 empgseed-seed Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents Paper -
2025-06 HighEntropy RL Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning Paper -
2025-06 Entropy RL Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs Paper -
2025-06 ALP RL Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning Paper -
2025-05 Skywork OR1 Skywork Open Reasoner 1 Technical Report Paper GitHub Stars
2025-05 Entropy Mechanism The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models Paper GitHub Stars
2025-05 ProRL ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models Paper -
2025-05 Short RL Efficient RL Training for Reasoning Models via Length-Aware Optimization Paper GitHub Stars
2025-03 DAPO DAPO: An Open-Source LLM Reinforcement Learning System at Scale Paper -
2025-03 L1 L1: Controlling how long a reasoning model thinks with reinforcement learning Paper GitHub Stars

Sampling Strategy

Dynamic and Structured Sampling

Date Name Title Paper Github
2025-09 DACE Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning Paper -
2025-09 Parallel-R1 Parallel-R1: Towards Parallel Thinking via Reinforcement Learning Paper GitHub Stars
2025-08 G^2RPO-A G^2RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance Paper GitHub Stars
2025-08 RuscaRL Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning Paper -
2025-08 TreePO TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling Paper GitHub Stars
2025-07 ARPO Agentic Reinforced Policy Optimization Paper GitHub Stars
2025-06 TreeRPO TreeRPO: Tree Relative Policy Optimization Paper GitHub Stars
2025-06 E2H Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning Paper -
2025-06 TreeRL TreeRL: LLM Reinforcement Learning with On-Policy Tree Search Paper GitHub Stars
2025-05 ToTRL ToTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving Paper -
2025-03 DARS DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal Paper GitHub Stars
2025-03 DAPO DAPO: An Open-Source LLM Reinforcement Learning System at Scale Paper GitHub Stars
2025-02 PRIME Process Reinforcement through Implicit Rewards Paper GitHub Stars
- POLARIS POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS Blog GitHub Stars

Sampling Hyper-Parameters

Date Name Title Paper Github
2025-08 GFPO Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning Paper -
2025-06 AceReason-Nemotron 1.1 AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy Paper -
2025-06 T-PPO Truncated Proximal Policy Optimization Paper -
2025-06 Confucius3-Math Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning Paper GitHub Stars
2025-05 E3-RL4LLMs Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs Paper GitHub Stars
2025-05 AceReason-Nemotron AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning Paper -
2025-05 Pro-RL ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models Paper -
2025-03 - Output Length Effect on DeepSeek-R1's Safety in Forced Thinking Paper -
2025-03 DAPO DAPO: An Open-Source LLM Reinforcement Learning System at Scale Paper GitHub Stars
2025-02 PRIME Process Reinforcement through Implicit Rewards Paper GitHub Stars
2025-02 - Training Language Models to Reason Efficiently Paper -
- DeepScaleR DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL Paper GitHub Stars
- POLARIS POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS Paper GitHub Stars

Training Resources

Static Corpus (Code)

Date Name Title Paper Github
2025-05 rStar-Coder rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset Paper GitHub Stars
2025-04 Z1 Z1: Efficient Test-time Scaling with Code Paper GitHub Stars
2025-04 OpenCodeReasoning OpenCodeReasoning: Advancing Data Distillation for Competitive Coding Paper -
2025-04 LeetCodeDataset LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs Paper GitHub Stars
2025-03 KodCode KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding Paper -
2025-01 SWE-Fixer SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution Paper GitHub Stars
2024-12 SWE-Gym Training Software Engineering Agents and Verifiers with SWE-Gym Paper GitHub Stars
- Code-R1 Code-R1: Reproducing R1 for Code with Reliable Rewards Paper GitHub Stars
- codeforces-cots CodeForces CoTs Paper -
- DeepCoder DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level Blog GitHub Stars

Static Corpus (STEM)

Date Name Title Paper Github
2025-09 SSMR-Bench Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning Paper GitHub Stars
2025-09 Loong Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers Paper GitHub Stars
2025-07 MegaScience MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning Paper -
2025-06 ReasonMed ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning Paper GitHub Stars
2025-05 ChemCoTDataset Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations Paper -
2025-02 NaturalReasoning NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions Paper -
2025-01 SCP-116K SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain Paper -

Static Corpus (Math)

Date Name Title Paper Github
2025-07 MiroMind-M1-RL-62K MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization Paper GitHub Stars
2025-04 DeepMath DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning Paper GitHub Stars
2025-04 OpenMathReasoning AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset Paper GitHub Stars
2025-03 STILL-3-RL An Empirical Study on Eliciting and Improving R1-like Reasoning Models Paper GitHub Stars
2025-03 Light-R1 Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond Paper -
2025-03 DAPO DAPO: An Open-Source LLM Reinforcement Learning System at Scale Paper GitHub Stars
2025-03 OpenReasoningZero Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model Paper GitHub Stars
2025-02 PRIME Process Reinforcement through Implicit Rewards Paper GitHub Stars
2025-02 LIMO LIMO: Less is More for Reasoning Paper GitHub Stars
2025-02 LIMR LIMR: Less is More for RL Scaling Paper GitHub Stars
2025-02 Big-MATH Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models Paper -
- NuminaMath 1.5 NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions Paper GitHub Stars
- OpenR1-Math Open R1: A fully open reproduction of DeepSeek-R1 Blog GitHub Stars
- DeepScaleR DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL Paper -

Static Corpus (Agent)

Date Name Title Paper Github
2025-08 ASearcher Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL Paper -
2025-07 WebShaper WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization Paper -
2025-05 ZeroSearch ZeroSearch: Incentivize the Search Capability of LLMs without Searching Paper GitHub Stars
2025-04 ToolRL ToolRL: Reward is All Tool Learning Needs Paper GitHub Stars
2025-03 Search-R1 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning Paper GitHub Stars
2025-03 ToRL ToRL: Scaling Tool-Integrated RL Paper GitHub Stars
- MicroThinker MiroVerse V0.1: A Reproducible, Full-Trajectory, Ever-Growing Deep Research Dataset Paper -
2025-03 DeepRetrieval DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning Paper GitHub Stars

Static Corpus (Mix)

Date Name Title Paper Github
2025-08 Graph-R1 Graph-R1: Unleashing LLM Reasoning with NP-Hard Graph Problem Paper -
2025-06 RewardAnything RewardAnything: Generalizable Principle-Following Reward Models Paper Blog
2025-06 guru-RL-92k Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective Paper -
2025-05 Llama-Nemotron-PT Llama-Nemotron: Efficient Reasoning Models Paper -
2025-05 SkyWork OR1 Skywork Open Reasoner 1 Technical Report Paper GitHub Stars
2025-03 OpenVLThinker OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles Paper GitHub Stars
- AM-DS-R1-0528-Distilled AM-DeepSeek-R1-0528-Distilled Paper GitHub Stars
- dolphin-r1 Dolphin R1 Dataset Paper -
- SYNTHETIC-1/2 SYNTHETIC-1 Release: Two Million Collaboratively Generated Reasoning Traces from Deepseek-R1 Blog -

Dynamic Environment (Rule-based)

Date Name Title Paper Github
2025-06 ProtoReasoning ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs Paper -
2025-05 SynLogic SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond Paper GitHub Stars
2025-05 Reasoning Gym REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards Paper GitHub Stars
2025-05 Enigmata Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles Paper GitHub Stars
2025-02 AutoLogi AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models Paper GitHub Stars
2025-02 Logic-RL Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning Paper GitHub Stars

Dynamic Environment (Code-based)

Date Name Title Paper Github
2025-06 AgentCPM-GUI AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning Paper GitHub Stars
2025-06 MedAgentGym MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale Paper GitHub Stars
2025-05 MLE-Dojo MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering Paper GitHub Stars
2025-05 SWE-rebench SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents Paper -
2025-05 ZeroGUI ZeroGUI: Automating Online GUI Learning at Zero Human Cost Paper GitHub Stars
2025-04 R2E-Gym R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents Paper GitHub Stars
2025-03 ReSearch ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning Paper GitHub Stars
2025-02 MLGym MLGym: A New Framework and Benchmark for Advancing AI Research Agents Paper GitHub Stars
2024-07 AppWorld AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents Paper GitHub Stars

Dynamic Environment (Game-based)

Date Name Title Paper Github
2025-08 PuzzleJAX PuzzleJAX: A Benchmark for Reasoning and Learning Paper GitHub Stars
2025-06 Play to Generalize Play to Generalize: Learning to Reason Through Game Play Paper GitHub Stars
2025-06 Optimus-3 Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts Paper GitHub Stars
2025-05 lmgame-Bench lmgame-Bench: How Good are LLMs at Playing Games? Paper GitHub Stars
2025-05 G1 G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning Paper GitHub Stars
2025-05 Code2Logic Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Paper GitHub Stars
2025-05 KORGym KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation Paper GitHub Stars
2025-04 Cross-env-coop Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination Paper GitHub Stars
2022-03 ScienceWorld ScienceWorld: Is your Agent Smarter than a 5th Grader? Paper GitHub Stars
2020-10 ALFWorld ALFWorld: Aligning Text and Embodied Environments for Interactive Learning Paper GitHub Stars

Dynamic Environment (Model-based)

Date Name Title Paper Github
2025-06 SwS SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning Paper GitHub Stars
2025-06 SPIRAL SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning Paper GitHub Stars
2025-05 Absolute Zero Absolute Zero: Reinforced Self-play Reasoning with Zero Data Paper GitHub Stars
2025-04 TextArena TextArena Paper GitHub Stars
2025-03 SWEET-RL SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks Paper GitHub Stars
- Genie 3 Genie 3: A new frontier for world models Blog -

Dynamic Environment (Ensemble-based)

Date Name Title Paper Github
2025-08 InternBootcamp InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling Paper GitHub Stars
- SYNTHETIC-2 SYNTHETIC-2 Release: Four Million Collaboratively Generated Reasoning Traces Blog -

RL Infrastructure (Primary)

Date Name Title Paper Github
2025-06 ROLL Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library Paper GitHub Stars
2025-05 AReaL AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning Paper GitHub Stars
2024-09 veRL HybridFlow: A Flexible and Efficient RLHF Framework Paper GitHub Stars
2024-05 OpenRLHF OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper GitHub Stars
- TRL Transformer Reinforcement Learning - GitHub Stars
- NeMo-RL Nemo RL: A Scalable and Efficient Post-Training Library - GitHub Stars
- slime slime: An SGLang-Native Post-Training Framework for RL Scaling - GitHub Stars
- RLinf RLinf: Reinforcement Learning Infrastructure for Agentic AI - GitHub Stars

RL Infrastructure (Secondary)

Date Name Title Paper Github
2025-09 RL-Factory RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use Paper GitHub Stars
2025-09 verl-tool VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use Paper GitHub Stars
2025-09 dLLM-RL Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models Paper GitHub Stars
2025-08 agent-lightning Agent Lightning: Train ANY AI Agents with Reinforcement Learning Paper GitHub Stars
2025-05 verl-agent Group-in-Group Policy Optimization for LLM Agent Training Paper GitHub Stars
2025-04 VLM-R1 VLM-R1: A stable and generalizable R1-style Large Vision-Language Model Paper GitHub Stars
- rllm rLLM: A Framework for Post-Training Language Agents - GitHub Stars
- EasyR1 EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework - GitHub Stars
- verifiers Verifiers: Reinforcement Learning with LLMs in Verifiable Environments - GitHub Stars
- prime-rl PRIME-RL: Decentralized RL Training at Scale - GitHub Stars
- MARTI A Framework for LLM-based Multi-Agent Reinforced Training and Inference - GitHub Stars

Applications

Coding Agent

Date Name Title Paper Github
2025-09 - Reinforcement Learning for Machine Learning Engineering Agents Paper -
2025-09 - Advancing SLM Tool-Use Capability using Reinforcement Learning Paper -
2025-09 SimpleTIR SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning Paper GitHub Stars
2025-09 - The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Paper GitHub Stars
2025-08 GLM-4.5 GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models Paper GitHub Stars
2025-08 FormaRL FormaRL: Enhancing Autoformalization with no Labeled Data Paper GitHub Stars
2025-08 RLTR Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning Paper -
2025-07 ARPO Agentic Reinforced Policy Optimization Paper GitHub Stars
2025-07 Kimi K2 Kimi K2: Open Agentic Intelligence Paper -
2025-07 AutoTIR AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning Paper GitHub Stars
2025-06 CoRT CoRT: Code-integrated Reasoning within Thinking Paper GitHub Stars
2025-05 EvoScale Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering Paper GitHub Stars
2025-03 ToRL ToRL: Scaling Tool-Integrated RL Paper GitHub Stars
2025-02 SWE-RL SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution Paper GitHub Stars
- Qwen3-Coder Qwen3-Coder: Agentic Coding in the World. - GitHub Stars

Search Agent

Date Name Title Paper Github
2025-08 SSRL SSRL: Self-Search Reinforcement Learning Paper GitHub Stars
2025-07 WebSailor WebSailor: Navigating Super-human Reasoning for Web Agent Paper GitHub Stars
2025-07 WebShaper WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization Paper GitHub Stars
2025-05 ZeroSearch ZeroSearch: Incentivize the Search Capability of LLMs without Searching Paper GitHub Stars
2025-05 SEM SEM: Reinforcement Learning for Search-Efficient Large Language Models Paper -
2025-05 S3 s3: You Don't Need That Much Data to Train a Search Agent via RL Paper GitHub Stars
2025-05 StepSearch StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization Paper GitHub Stars
2025-05 R1-Searcher++ R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning Paper GitHub Stars
2025-04 ReZero ReZero: Enhancing LLM search ability by trying one-more-time Paper -
2025-03 DeepRetrieval DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning Paper GitHub Stars
2025-03 Search-R1 Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning Paper GitHub Stars
2025-03 R1-Searcher R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning Paper GitHub Stars

Browser-Use Agent

Date Name Title Paper Github
2025-05 WebAgent-R1 WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning Paper GitHub Stars
2025-05 WebDancer WebDancer: Towards Autonomous Information Seeking Agency Paper GitHub Stars
2025-04 DeepResearcher DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments Paper GitHub Stars
2024-11 Web-RL WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning Paper GitHub Stars
2021-12 WebGPT WebGPT: Browser-assisted question-answering with human feedback Paper -

DeepResearch Agent

Date Name Title Paper Github
2025-09 SFR-DeepResearch SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents Paper -
2025-09 DeepDive DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL Paper GitHub Stars
2025-08 Webwatcher Webwatcher: Breaking new frontiers of vision-language deep research agent Paper GitHub Stars
2025-08 ASearcher Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl Paper GitHub Stars
2025-08 Atom-searcher Atom-searcher: Enhancing agentic deep research via fine-grained atomic thought reward Paper GitHub Stars
2025-08 MedResearcher-R1 MedResearcher-R1: Expert-level medical deep researcher via a knowledge-informed trajectory synthesis framework Paper GitHub Stars
2025-06 Jan-nano Jan-nano Technical Report Paper -
2025-04 WebThinker WebThinker: Empowering Large Reasoning Models with Deep Research Capability Paper GitHub Stars
- Kimi-Researcher Kimi-Researcher-End-to-End RL Training for Emerging Agentic Capabilities Blog -
- Mirothinker Mirothinker: An open-source agentic model series trained for deep research and complex, long-horizon problem solving Blog GitHub Stars

GUI&Computer Agent

Date Name Title Paper Github
2025-09 UI-TARS 2 UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning Paper GitHub Stars
2025-08 GUI-RC Test-Time Reinforcement Learning for GUI Grounding via Region Consistency Paper GitHub Stars
2025-08 Os-r1 OS-R1: Agentic Operating System Kernel Tuning with Reinforcement Learning Paper GitHub Stars
2025-08 ComputerRL ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents Paper -
2025-08 Mobile-Agent-v3 Mobile-Agent-v3: Fundamental Agents for GUI Automation Paper GitHub Stars
2025-08 SWIRL SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control Paper GitHub Stars
2025-08 InquireMobile InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning Paper -
2025-07 MobileGUI-RL MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment Paper -
2025-06 GUI-Critic-R1 Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation Paper GitHub Stars
2025-06 GUI-Reflection GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior Paper -
2025-06 Mobile-R1 Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards Paper -
2025-05 UIShift UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning Paper GitHub Stars
2025-05 GUI-G1 GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents Paper GitHub Stars
2025-05 ARPO ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay Paper GitHub Stars
2025-05 ZeroGUI ZeroGUI: Automating Online GUI Learning at Zero Human Cost Paper GitHub Stars
2025-04 GUI-R1 GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents Paper GitHub Stars
2025-03 UI-R1 UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning Paper GitHub Stars
2025-01 UI-TARS UI-TARS: Pioneering Automated GUI Interaction with Native Agents Paper GitHub Stars

Recommendation Agent

Date Name Title Paper Github
2025-07 Shop-R1 Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning Paper -
2025-03 Rec-R1 Rec-R1: Bridging LLMs and Recommendation Systems via Reinforcement Learning Paper GitHub Stars

Agent (Others)

Date Name Title Paper Github
2025-07 OpenTable-R1 OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering Paper GitHub Stars
2025-07 LaViPlan LaViPlan: Language-Guided Visual Path Planning with RLVR Paper -
2025-06 Drive-R1 Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning Paper -

Code Generation

Date Name Title Paper Github
2025-09 Proof2Silicon Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning Paper -
2025-09 AR$^2$ AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models Paper GitHub Stars
2025-09 Dream-Coder Dream-Coder 7B: An Open Diffusion Language Model for Code Paper GitHub Stars
2025-08 MSRL Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation Paper -
2025-07 CogniSQL-R1-Zero CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation Paper -
2025-07 Leanabell-Prover-V2 Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning Paper GitHub Stars
2025-07 StepFun-Prover StepFun-Prover Preview: Let's Think and Verify Step by Step Paper GitHub Stars
2025-06 MedAgentGym MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale Paper GitHub Stars
2025-05 VeriReason VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation Paper GitHub Stars
2025-05 ReEX-SQL ReEx-SQL: Reasoning with Execution-Aware Reinforcement Learning for Text-to-SQL Paper -
2025-05 AceReason-Nemotron AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning Paper -
2025-05 SkyWork OR1 Skywork Open Reasoner 1 Technical Report Paper GitHub Stars
2025-05 CodeV-R1 CodeV-R1: Reasoning-Enhanced Verilog Generation Paper GitHub Stars
2025-05 AReaL AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning Paper GitHub Stars
2025-04 SQL-R1 SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning Paper GitHub Stars
2025-04 Kimina-Prover Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning Paper -
2025-04 DeepSeek-Prover-V2 DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition Paper GitHub Stars
2025-03 Reasoning-SQL Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL Paper -
- code-r1 Code-R1: Reproducing R1 for Code with Reliable Rewards - GitHub Stars
- Open-R1 Open-R1: a fully open reproduction of DeepSeek-R1 Blog GitHub Stars
- DeepCoder Deepcoder: A fully open-source 14b coder at o3-mini level Paper GitHub Stars

Software Engineering

Date Name Title Paper Github
2025-08 UTRL Learning to Generate Unit Test via Adversarial Reinforcement Learning Paper -
2025-07 RePaCA RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment Paper -
2025-07 Repair-R1 Repair-R1: Better Test Before Repair Paper GitHub Stars
2025-06 CURE Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning Paper GitHub Stars
2025-05 REAL Training Language Models to Generate Quality Code with Program Analysis Feedback Paper -
2025-05 Afterburner Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization Paper -
2024-09 RepoGenReflex RepoGenReflex: Enhancing Repository-Level Code Completion with Verbal Reinforcement and Retrieval-Augmented Generation Paper -
2024-07 RLCoder RLCoder: Reinforcement Learning for Repository-Level Code Completion Paper GitHub Stars

Multimodal Understanding

Date Name Title Paper Github
2025-09 ReAd-R AdsQA: Towards Advertisement Video Understanding Paper GitHub Stars
2025-09 Keye Kwai Keye-VL 1.5 Technical Report Paper GitHub Stars
2025-08 Sifthinker Sifthinker: Spatially-aware image focus for visual reasoning Paper GitHub Stars
2025-07 Long-RL Scaling rl to long videos Paper GitHub Stars
2025-06 RefSpatial Roborefer: Towards spatial referring with reasoning in vision-language models for robotics Paper GitHub Stars
2025-06 Ego-R1 Ego-R1: Chain-of-tool-thought for ultra-long egocentric video reasoning Paper GitHub Stars
2025-05 VerIPO VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization Paper GitHub Stars
2025-05 Openthinkimg Openthinkimg: Learning to think with images via visual tool reinforcement learning Paper GitHub Stars
2025-05 Visual Planning Visual Planning: Let's think only with images Paper GitHub Stars
2025-05 VideoRFT Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning Paper GitHub Stars
2025-05 Deepeyes Deepeyes: Incentivizing "thinking with images" via reinforcement learning Paper GitHub Stars
2025-05 Visionary-R1 Visionary-R1: Mitigating shortcuts in visual reasoning with reinforcement learning Paper GitHub Stars
2025-05 CoF Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl Paper GitHub Stars
2025-05 GRIT GRIT: Teaching mllms to think with images Paper GitHub Stars
2025-05 Pixel Reasoner Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning Paper GitHub Stars
2025-05 - Don’t look only once: Towards multimodal interactive reasoning with selective visual revisitation Paper GitHub Stars
2025-05 Ground-R1 Ground-R1: Incentivizing grounded visual reasoning via reinforcement learning Paper GitHub Stars
2025-05 TACO TACO: Think-answer consistency for optimized long-chain reasoning and efficient data learning via reinforcement learning in lvlms Paper -
2025-05 Qwen-LA Qwen look again: Guiding vision-language reasoning models to re-attention visual information Paper GitHub Stars
2025-05 TW-GRPO Reinforcing video reasoning with focused thinking Paper GitHub Stars
2025-05 Spatial-MLLM Spatial-MLLM: Boosting mllm capabilities in visual-based spatial intelligence Paper GitHub Stars
2025-04 R1-Zero-VSI Improved visual-spatial reasoning via r1-zero-like training Paper GitHub Stars
2025-04 Spacer Spacer: Reinforcing mllms in video spatial reasoning Paper GitHub Stars
2025-04 Videochat-R1 Videochat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning Paper GitHub Stars
2025-04 VLM-R1 VLM-R1: A stable and generalizable r1-style large vision-language model Paper GitHub Stars
2025-03 OpenVLThinker OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles Paper GitHub Stars
2025-03 Visual-RFT Visual-RFT: Visual reinforcement fine-tuning Paper GitHub Stars
2025-03 Vision-R1 Vision-R1: Incentivizing reasoning capability in multimodal large language models Paper GitHub Stars
2025-03 VisRL VisRL: Intention-Driven Visual Perception via Reinforced Reasoning Paper GitHub Stars
2025-03 Metaspatial Metaspatial: Reinforcing 3d spatial reasoning in vlms for the metaverse Paper GitHub Stars
2025-03 Video-R1 Video-R1: Reinforcing video reasoning in mllms Paper GitHub Stars

Multimodal Generation

Date Name Title Paper Github
2025-09 IGPO Inpainting-Guided Policy Optimization for Diffusion Large Language Models Paper -
2025-08 Qwen-Image Qwen-Image Technical Report Paper GitHub Stars
2025-08 TempFlow-GRPO TempFlow-GRPO: When timing matters for grpo in flow models Paper GitHub Stars
2025-07 MixGRPO MixGRPO: Unlocking flow-based grpo efficiency with mixed ode-sde Paper GitHub Stars
2025-06 FocusDiff Focusdiff: Advancing fine-grained text-image alignment for autoregressive visual generation through rl Paper GitHub Stars
2025-06 SUDER Reinforcing multimodal understanding and generation with dual self-rewards Paper -
2025-05 T2I-R1 T2I-R1: Reinforcing image generation with collaborative semantic-level and token-level cot Paper GitHub Stars
2025-05 Flow-GRPO Flow-GRPO: Training flow matching models via online rl Paper GitHub Stars
2025-05 DanceGRPO DanceGRPO: Unleashing grpo on visual generation Paper GitHub Stars
2025-05 GoT-R1 GoT-R1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning Paper GitHub Stars
2025-05 ULM-R1 Co-Reinforcement learning for unified multimodal understanding and generation Paper GitHub Stars
2025-05 RePrompt Reprompt: Reasoning-augmented reprompting for text-to-image generation via reinforcement learning Paper GitHub Stars
2025-05 InfLVG InfLVG: Reinforce inference-time consistent long video generation with grpo Paper GitHub Stars
2025-05 Reasongen-R1 Reasongen-R1: Cot for autoregressive image generation models through sft and rl Paper GitHub Stars
2025-04 PhysAR Reasoning physical video generation with diffusion timestep tokens via reinforcement learning Paper -

Robotics Tasks

Date Name Title Paper Github
2025-09 SimpleVLA-RL SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning Paper GitHub Stars
2025-06 TGRPO TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization Paper -
2025-05 ReinboT ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning Paper GitHub Stars
2025-05 RIPT-VLA Interactive Post-Training for Vision-Language-Action Models Paper GitHub Stars
2025-05 VLA-RL VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning Paper GitHub Stars
2025-05 RFTF RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback Paper -
2025-05 VLA Generalization What can rl bring to vla generalization? an empirical study Paper GitHub Stars
2025-02 ConRFT ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy Paper GitHub Stars
2024-11 GRAPE GRAPE: Generalizing Robot Policy via Preference Alignment Paper GitHub Stars
- RLinf RLinf: Reinforcement Learning Infrastructure for Agentic AI Paper -

Multi-Agent Systems

Date Name Title Paper Github
2025-09 SoftRankPO Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning Paper -
2025-09 BFS-Prover-V2 Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers Paper -
2025-08 MAGRPO LLM Collaboration With Multi-Agent Reinforcement Learning Paper -
2025-06 AlphaEvolve AlphaEvolve: A coding agent for scientific and algorithmic discovery Paper -
2025-06 JoyAgents-R1 JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning Paper -
2025-03 ReMA ReMA: Learning to Meta-think for LLMs with Multi-agent Reinforcement Learning Paper GitHub Stars
2025-02 CTRL Teaching Language Models to Critique via Reinforcement Learning Paper GitHub Stars
2025-02 Maporl MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning Paper GitHub Stars
2023-11 LLaMAC Controlling large language model-based agents for large-scale decision-making: An actor-critic approach Paper -

Scientific Tasks

Date Name Title Paper Github
2025-09 Baichuan-M2 Baichuan-M2: Scaling Medical Capability with Large Verifier System Paper -
2025-08 CX-Mind CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning Paper GitHub Stars
2025-08 MORE-CLEAR MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation Paper -
2025-08 ARMed Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination Paper -
2025-08 ProMed ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs Paper GitHub Stars
2025-08 OwkinZero OwkinZero: Accelerating Biological Discovery with AI Paper -
2025-08 MolReasoner MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs Paper GitHub Stars
2025-08 MedGR$^2$ MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning Paper -
2025-07 MedGround-R1 MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization Paper GitHub Stars
2025-07 MedGemma MedGemma Technical Report Paper -
2025-06 MMedAgent-RL MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning Paper -
2025-06 Cell-o1 Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning Paper GitHub Stars
2025-06 MedAgentGym MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale Paper GitHub Stars
2025-06 Med-U1 Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning Paper GitHub Stars
2025-06 MedVIE Efficient Medical VIE via Reinforcement Learning Paper -
2025-06 LA-CDM Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning Paper -
2025-06 ether0 Training a Scientific Reasoning Model for Chemistry Paper GitHub Stars
2025-06 Gazal-R1 Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training Paper -
2025-05 DRG-Sapphire Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding Paper GitHub Stars
2025-05 BioReason BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model Paper GitHub Stars
2025-05 EHRMIND Training LLMs for EHR-Based Reasoning Tasks via Reinforcement Learning Paper -
2025-04 Open-Medical-R1 Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain Paper GitHub Stars
2025-04 ChestX-Reasoner ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification Paper -
2025-04 BoxMed-RL Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation Paper -
2025-03 PPME Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning Paper -
2025-03 DOLA Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent Paper -
2025-02 Baichuan-M1 Baichuan-M1: Pushing the Medical Capability of Large Language Models Paper -
2025-02 MedVLM-R1 MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning Paper -
2025-02 Med-RLVR Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning Paper -
2025-01 MedXpertQA MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding Paper GitHub Stars
2024-12 HuatuoGPT-o1 HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs Paper GitHub Stars
- Pro-1 Pro-1 Blog GitHub Stars
- rbio rbio1 - training scientific reasoning LLMs with biological world models as soft verifiers Paper GitHub Stars

🌟 Acknowledgment

This survey extends and refines the original Awesome RL Reasoning Recipes repo. We are deeply grateful to all contributors for their efforts, and we sincerely thank everyone for their interest in Awesome RL Reasoning Recipes. The contents of the previous repository are available here.

🎈 Citation

If you find this survey helpful, please cite our work:

@article{zhang2025survey,
  title={A Survey of Reinforcement Learning for Large Reasoning Models},
  author={Zhang, Kaiyan and Zuo, Yuxin and He, Bingxiang and Sun, Youbang and Liu, Runze and Jiang, Che and Fan, Yuchen and Tian, Kai and Jia, Guoli and Li, Pengfei and others},
  journal={arXiv preprint arXiv:2509.08827},
  year={2025}
}
