Awesome-AgenticLLM-RL-Papers

This repository serves as the official source for the survey paper 'The Landscape of Agentic Reinforcement Learning for LLMs: A Survey'. It provides an extensive overview of various algorithms, methods, and frameworks related to Agentic RL, including detailed information on different families of algorithms, their key mechanisms, objectives, and links to relevant papers and resources. The repository covers a wide range of tasks such as Search & Research Agent, Code Agent, Mathematical Agent, GUI Agent, RL in Vision Agents, RL in Embodied Agents, and RL in Multi-Agent Systems. Additionally, it includes information on environments, frameworks, and methods suitable for different tasks related to Agentic RL and LLMs.


This is the official repository for the survey paper: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

ArXiv – https://arxiv.org/abs/2509.02547

HuggingFace – https://huggingface.co/papers/2509.02547

Citation

@misc{zhang2025landscapeagenticreinforcementlearning,
      title={The Landscape of Agentic Reinforcement Learning for LLMs: A Survey}, 
      author={Guibin Zhang and Hejia Geng and Xiaohang Yu and Zhenfei Yin and Zaibin Zhang and Zelin Tan and Heng Zhou and Zhongzhi Li and Xiangyuan Xue and Yijiang Li and Yifan Zhou and Yang Chen and Chen Zhang and Yutao Fan and Zihu Wang and Songtao Huang and Yue Liao and Hongru Wang and Mengyue Yang and Heng Ji and Michael Littman and Jun Wang and Shuicheng Yan and Philip Torr and Lei Bai},
      year={2025},
      eprint={2509.02547},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.02547}, 
}

Sec2.7 Agentic RL: Algorithms

Clip indicates whether the method clips the policy ratio so it cannot move far from 1, ensuring stable updates.
KL penalty indicates whether the method penalizes the KL divergence between the learned policy and a reference policy, keeping the policy aligned.
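As a minimal sketch (not code from the survey), the two stabilizers can be written per token/action as follows; `beta` and `eps` are illustrative hyperparameter names:

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_penalized_objective(ratio: float, advantage: float,
                           logp: float, ref_logp: float,
                           beta: float = 0.1, eps: float = 0.2) -> float:
    """Clipped surrogate minus a KL penalty toward the reference policy.

    (logp - ref_logp) is the common sample-based estimate of the per-token
    KL divergence used in RLHF-style training.
    """
    return clipped_surrogate(ratio, advantage, eps) - beta * (logp - ref_logp)
```

The clipping makes the objective flat once the ratio leaves [1-eps, 1+eps] in the direction the advantage favors, which is what prevents overly large policy updates.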

Method Year Objective Type Clip KL Penalty Key Mechanism Signal Link Resource
PPO family
PPO 2017 Policy gradient Yes No Policy ratio clipping Reward Paper -
VAPO 2025 Policy gradient Yes Adaptive Adaptive KL penalty + variance control Reward + variance signal Paper -
PF-PPO 2024 Policy gradient Yes Yes Policy filtration Noisy reward Paper Code
VinePPO 2024 Policy gradient Yes Yes Unbiased value estimates Reward Paper Code
PSGPO 2024 Policy gradient Yes Yes Process supervision Process Reward Paper -
DPO family
DPO 2023 Preference optimization No Yes Implicit reward parameterized by the policy Human preference Paper -
β-DPO 2024 Preference optimization No Adaptive Dynamic KL coefficient Human preference Paper Code
SimPO 2024 Preference optimization No Scaled Use avg log-prob of a sequence as implicit reward Human preference Paper Code
IPO 2024 Identity preference optimization No No Bounded preference objective that avoids DPO's overfitting Preference rank Paper -
KTO 2024 Kahneman-Tversky optimization No Yes Human-aware loss on binary feedback Binary desirability signal Paper Code Model
ORPO 2024 Odds ratio preference optimization No No Odds-ratio penalty on the SFT loss, reference-model-free Human preference Paper Code Model
Step-DPO 2024 Preference optimization No Yes Step-wise supervision Step-wise preference Paper Code Model
LCPO 2025 Preference optimization No Yes Length preference with limited data/training Reward Paper -
GRPO family
GRPO 2024 Policy gradient under group-based reward Yes Yes Group-based relative reward to eliminate value estimates Group-based reward Paper -
DAPO 2025 Surrogate of GRPO's Yes Yes Decoupled clip + dynamic sampling Dynamic group-based reward Paper Code Model Website
GSPO 2025 Surrogate of GRPO's Yes Yes Sequence-level clipping, rewarding, optimization Smooth group-based reward Paper -
GMPO 2025 Surrogate of GRPO's Yes Yes Geometric mean of token-level rewards Margin-based reward Paper Code
ProRL 2025 Same as GRPO's Yes Yes Reference policy reset Group-based reward Paper Model
Posterior-GRPO 2025 Same as GRPO's Yes Yes Reward only successful processes Process-based reward Paper -
Dr.GRPO 2025 Unbiased GRPO objective Yes Yes Eliminate bias in optimization Group-based reward Paper Code Model
Step-GRPO 2025 Same as GRPO's Yes Yes Rule-based reasoning rewards Step-wise reward Paper Code Model
SRPO 2025 Same as GRPO's Yes Yes Two-staged history-resampling Reward Paper Model
GRESO 2025 Same as GRPO's Yes Yes Pre-rollout filtering Reward Paper Code Website
StarPO 2025 Same as GRPO's Yes Yes Reasoning-guided actions for multi-turn interactions Group-based reward Paper Code Website
GHPO 2025 Policy gradient Yes Yes Adaptive prompt refinement Reward Paper Code
Skywork R1V2 2025 GRPO with hybrid reward signal Yes Yes Selective sample buffer Multimodal reward Paper Code Model
ASPO 2025 GRPO with shaped advantage Yes Yes Clipped bias to advantage Group-based reward Paper Code Model
TreePO 2025 Same as GRPO's Yes Yes Self-guided rollout, reduced compute burden Group-based reward Paper Code Model Website
EDGE-GRPO 2025 Same as GRPO's Yes Yes Entropy-driven advantage + error correction Group-based reward Paper Code Model
DARS 2025 Same as GRPO's Yes No Multi-stage rollout for hardest problems Group-based reward Paper Code Model
CHORD 2025 Weighted GRPO + SFT Yes Yes Auxiliary supervised loss Group-based reward Paper Code
PAPO 2025 Surrogate of GRPO's Yes Yes Implicit Perception Loss Group-based reward Paper Code Model Website
Pass@k Training 2025 Same as GRPO's Yes Yes Pass@k metric as reward Group-based reward Paper Code
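The mechanism shared across the GRPO family, replacing a learned value function with reward statistics over a group of G responses to the same prompt, can be sketched as follows (an illustrative implementation, not the survey's code):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: z-score each reward within its group of G
    responses sampled for the same prompt, eliminating the critic/value model."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are normalized within each group, only the relative quality of responses to the same prompt matters; the variants above mostly differ in how rewards are shaped, filtered, or aggregated before this step.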

Sec4.1 Task: Search & Research Agent

Method Category Base LLM Link Resource
Open Source Methods
DeepRetrieval External Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct Paper Code
Search-R1 External Qwen2.5-3B/7B-Base/Instruct Paper Code
R1-Searcher External Qwen2.5-7B, Llama3.1-8B-Instruct Paper Code
R1-Searcher++ External Qwen2.5-7B-Instruct Paper Code
ReSearch External Qwen2.5-7B/32B-Instruct Paper Code
StepSearch External Qwen2.5-3B/7B-Base/Instruct Paper Code
WebDancer External Qwen2.5-7B/32B, QWQ-32B Paper Code
WebThinker External QwQ-32B, DeepSeek-R1-Distilled-Qwen-7B/14B/32B, Qwen2.5-32B-Instruct Paper Code
WebSailor External Qwen2.5-3B/7B/32B/72B Paper Code
WebWatcher External Qwen2.5-VL-7B/32B Paper Code
ASearcher External Qwen2.5-7B/14B, QwQ-32B Paper Code
ZeroSearch Internal Qwen2.5-3B/7B-Base/Instruct Paper Code
SSRL Internal Qwen2.5-1.5B/3B/7B/14B/32B/72B-Instruct, Llama-3.2-1B/8B-Instruct, Llama-3.1-8B/70B-Instruct, Qwen3-0.6B/1.7B/4B/8B/14B/32B Paper Code
Closed Source Methods
OpenAI Deep Research External OpenAI Models Blog Website
Perplexity’s DeepResearch External - Blog Website
Google Gemini’s DeepResearch External Gemini Blog Website
Kimi-Researcher External Kimi K2 Blog Website
Grok AI DeepSearch External Grok3 Blog Website
Doubao with Deep Think External Doubao Blog Website

Sec4.2 Task: Code Agent

Method RL Reward Type Base LLM Link Resource
RL for Code Generation
AceCoder Outcome Qwen2.5-Coder-7B-Base/Instruct, Qwen2.5-7B-Instruct Paper Code
DeepCoder-14B Outcome Deepseek-R1-Distilled-Qwen-14B Blog Code
RLTF Outcome CodeGen-NL 2.7B, CodeT5 Paper Code
CURE Outcome Qwen2.5-7B/14B-Instruct, Qwen3-4B Paper Code
Absolute Zero Outcome Qwen2.5-7B/14B, Qwen2.5-Coder-3B/7B/14B, Llama-3.1-8B Paper Code
StepCoder Process DeepSeek-Coder-Instruct-6.7B Paper Code
Process Supervision-Guided PO Process - Paper -
CodeBoost Process Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct, Seed-Coder-8B-Instruct, Yi-Coder-9B-Chat Paper Code
PRLCoder Process CodeT5+, Unixcoder, T5-base Paper -
o1-Coder Process DeepSeek-1.3B-Instruct Paper Code
CodeFavor Process Mistral-NeMo-12B-Instruct, Gemma-2-9B-Instruct, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 Paper Code
Focused-DPO Process DeepSeek-Coder-6.7B-Base/Instruct, Magicoder-S-DS-6.7B, Qwen2.5-Coder-7B-Instruct Paper -
RL for Iterative Code Refinement
RLEF Outcome Llama-3.0-8B-Instruct, Llama-3.1-8B/70B-Instruct Paper -
μCode Outcome Llama-3.2-1B/8B-Instruct Paper Code
R1-Code-Interpreter Outcome Qwen2.5-7B/14B-Instruct-1M, Qwen2.5-3B-Instruct Paper Code
IterPref Process Deepseek-Coder-7B-Instruct, Qwen2.5-Coder-7B, StarCoder2-15B Paper -
LeDex Process StarCoder-15B, CodeLlama-7B/13B Paper -
CTRL Process Qwen2.5-Coder-7B/14B/32B-Instruct Paper Code
ReVeal Process DAPO-Qwen-32B, Qwen2.5-32B-Instruc(not-working) Paper -
Posterior-GRPO Process Qwen2.5-Coder-3B/7B-Base, Qwen2.5-Math-7B Paper -
Policy Filtration for RLHF Process DeepSeek-Coder-6.7B, Qwen1.5-7B Paper Code
RL for Automated Software Engineering (SWE)
DeepSWE Outcome Qwen3-32B Blog Code
SWE-RL Outcome Llama-3.3-70B-Instruct Paper Code
Satori-SWE Outcome Qwen-2.5-Math-7B Paper Code
RLCoder Outcome CodeLlama-7B, StarCoder-7B, StarCoder2-7B, DeepSeekCoder-1B/7B Paper Code
Qwen3-Coder Outcome - Paper Code
ML-Agent Outcome Qwen2.5-7B-Base/Instruct, DeepSeek-R1-Distill-Qwen-7B Paper Code
Golubev et al. Process Qwen2.5-72B-Instruct Paper -
SWEET-RL Process Llama-3.1-8B/70B-Instruct Paper Code

Sec4.3 Task: Mathematical Agent

Method Reward Link Resource
RL for Informal Mathematical Reasoning
ARTIST Outcome Paper -
ToRL Outcome Paper Code Model
ZeroTIR Outcome Paper Code Model
TTRL Outcome Paper Code
RENT Outcome Paper Code Website
Satori Outcome Paper Code Model Website
1-shot RLVR Outcome Paper Code Model
Prover-Verifier Games (legibility) Outcome Paper -
rStar2-Agent Outcome Paper Code
START Process Paper -
LADDER Process Paper -
SWiRL Process Paper -
RLoT Process Paper Code
RL for Formal Mathematical Reasoning
DeepSeek-Prover-v1.5 Outcome Paper Code Model
Leanabell-Prover Outcome Paper Code Model
Kimina-Prover (Preview) Outcome Paper Code Model
Seed-Prover Outcome Paper Code
DeepSeek-Prover-v2 Process Paper Code Model
ProofNet++ Process Paper -
Leanabell-Prover-v2 Process Paper Code
Hybrid
InternLM2.5-StepProver Hybrid Paper Code
Lean-STaR Hybrid Paper Code Model Website
STP Hybrid Paper Code Model

Sec4.4 Task: GUI Agent

Method Paradigm Environment Link Resource
Non-RL GUI Agents
MM-Navigator Vanilla VLM - Paper Code
SeeAct Vanilla VLM - Paper Code
TRISHUL Vanilla VLM - Paper -
InfiGUIAgent SFT - Paper Code Model Website
UI-AGILE SFT - Paper Code Model
TongUI SFT - Paper Code Model Website
RL-based GUI Agents
GUI-R1 RL Static Paper Code Model
UI-R1 RL Static Paper Code Model
InFiGUI-R1 RL Static Paper Code Model
AgentCPM RL Static Paper Code Model
WebAgent-R1 RL Interactive Paper -
Vattikonda et al. RL Interactive Paper -
UI-TARS RL Interactive Paper Code Model Website
DiGiRL RL Interactive Paper Code Model Website
ZeroGUI RL Interactive Paper Code
MobileGUI-RL RL Interactive Paper -

Sec4.5 Task: RL in Vision Agents

TO BE ADDED


Sec4.6 Task: RL in Embodied Agents

TO BE ADDED


Sec4.7 Task: RL in Multi-Agent Systems

“Dynamic” denotes whether the multi-agent system is task-dynamic, i.e., whether it processes different task queries with different configurations (agent count, topology, reasoning depth, prompts, etc.).
“Train” denotes whether the method trains the LLM backbone of the agents.
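To illustrate what "task-dynamic" means here, the following is a purely hypothetical sketch (names and routing rule are invented, not from any surveyed method): a dynamic system chooses a different configuration per query instead of one fixed setup.

```python
from dataclasses import dataclass

@dataclass
class AgentSystemConfig:
    """Per-query configuration a task-dynamic system may vary."""
    num_agents: int
    topology: str        # e.g. "chain", "star", "debate"
    reasoning_depth: int

def configure(query: str) -> AgentSystemConfig:
    # Toy routing rule for illustration: queries that look harder get
    # more agents, a debate topology, and deeper reasoning.
    hard = len(query.split()) > 20 or "prove" in query.lower()
    if hard:
        return AgentSystemConfig(num_agents=4, topology="debate", reasoning_depth=3)
    return AgentSystemConfig(num_agents=1, topology="chain", reasoning_depth=1)
```

A static system, by contrast, would return the same `AgentSystemConfig` for every query; RL-based methods in the table below learn either this routing, the agents' LLM backbone, or both.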

Method Dynamic Train RL Algorithm Link Resource
RL-Free Multi-Agent Systems (not exhaustive)
CAMEL - Paper Code Model
MetaGPT - Paper Code
MAD - Paper Code
MoA - Paper Code
AFlow - Paper Code
RL-Based Multi-Agent Training
GPTSwarm policy gradient Paper Code Website
MaAS policy gradient Paper Code
G-Designer policy gradient Paper Code
MALT DPO Paper -
MARFT MARFT Paper Code
MAPoRL PPO Paper Code
MLPO MLPO Paper -
ReMA MAMRP Paper Code
FlowReasoner GRPO Paper Code
LERO MLPO Paper -
CURE rule-based RL Paper Code Model
MMedAgent-RL GRPO Paper -

Sec4.8 Task: Other Tasks

TO BE ADDED

Sec5.1 Environments

The agent capabilities are denoted by:
① Reasoning, ② Planning, ③ Tool Use, ④ Memory, ⑤ Collaboration, ⑥ Self-Improve.

Environment / Benchmark Agent Capability Task Domain Modality Link Resource
LMRL-Gym ①, ④ Interaction Text Paper Code
ALFWorld ②, ① Embodied, Text Games Text Paper Code Website
TextWorld ②, ① Text Games Text Paper Code
ScienceWorld ①, ② Embodied, Science Text Paper Code Website
AgentGym ①, ④ Text Games Text Paper Code Website
AgentBench General Text, Visual Paper Code
InternBootcamp General, Coding, Logic Text Paper Code
LoCoMo Interaction Text Paper Code Website
MemoryAgentBench Interaction Text Paper Code
WebShop ②, ③ Web Text Paper Code Website
Mind2Web ②, ③ Web Text, Visual Paper Code Website
WebArena ②, ③ Web Text Paper Code Website
VisualWebArena ①, ②, ③ Web Text, Visual Paper Code Website
AppWorld ②, ③ App Text Paper Code Website
AndroidWorld ②, ③ GUI, App Text, Visual Paper Code
OSWorld ②, ③ GUI, OS Text, Visual Paper Code Website
Debug-Gym ①, ③ SWE Text Paper Code Website
MLE-Dojo ②, ① MLE Text Paper Code Website
τ-bench ①, ③ SWE Text Paper Code
TheAgentCompany ②, ③, ⑤ SWE Text Paper Code Website
MedAgentGym Science Text Paper Code
SecRepoBench ①, ③ Coding, Security Text Paper -
R2E-Gym ①, ② SWE Text Paper Code Website
HumanEval Coding Text Paper Code
MBPP Coding Text Paper Code
BigCodeBench Coding Text Paper Code Website
LiveCodeBench Coding Text Paper Code Website
SWE-bench ①, ③ SWE Text Paper Code Website
SWE-rebench ①, ③ SWE Text Paper Website
DevBench ②, ① SWE Text Paper Code
ProjectEval ②, ① SWE Text Paper Code Website
DA-Code ①, ③ Data Science, SWE Text Paper Code Website
ColBench ②, ①, ③ SWE, Web Dev Text Paper Code Website
NoCode-bench ②, ① SWE Text Paper Code Website
MLE-Bench ②, ①, ③ MLE Text Paper Code Website
PaperBench ②, ①, ③ MLE Text Paper Code Website
Crafter ②, ④ Game Visual Paper Code Website
Craftax ②, ④ Game Visual Paper Code
ELLM (Crafter variant) ②, ① Game Visual Paper Code Website
SMAC / SMAC-Exp ⑤, ② Game Visual Paper Code
Factorio ②, ① Game Visual Paper Code Website

Sec5.2 Frameworks

Framework Type Key Features Link Resource
Agentic RL Frameworks
Verifiers Agent RL / LLM RL Verifiable environment setup - Code
SkyRL-v0/v0.1 Agent RL Long-horizon real-world training Blog (v0) Blog (v0.1) Code
AREAL Agent RL / LLM RL Asynchronous training Paper Code
MARTI Multi-agent RL / LLM RL Integrated multi-agent training - Code
EasyR1 Agent RL / LLM RL Multimodal support - Code
AgentFly Agent RL Scalable asynchronous execution Paper Code
Agent Lightning Agent RL Decoupled hierarchical RL Paper Code
RLHF and LLM Fine-tuning Frameworks
OpenRLHF RLHF / LLM RL High-performance scalable RLHF Paper Code
TRL RLHF / LLM RL Hugging Face RLHF - Code
trlX RLHF / LLM RL Distributed large-model RLHF Paper Code
HybridFlow RLHF / LLM RL Streamlined experiment management Paper Code
SLiMe RLHF / LLM RL High-performance async RL - Code
General-purpose RL Frameworks
RLlib General RL / Multi-agent RL Production-grade scalable library Paper Code
Acme General RL Modular distributed components Paper Code
Tianshou General RL High-performance PyTorch platform Paper Code
Stable Baselines3 General RL Reliable PyTorch algorithms Paper Code
PFRL General RL Benchmarked prototyping algorithms Paper Code
