
Awesome-RL-based-LLM-Reasoning
We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we will keep adding the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, to help everyone learn quickly!
Starring this repository keeps you at the forefront of RL-based LLM reasoning.
In the teeth of the storm
- Why do we need reasoning?
- Why do we use reinforcement learning to obtain reasoning ability? (What are the advantages over reasoning methods that do not use reinforcement learning?)
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (Introduced cosine length-scaling reward with repetition penalty for stable CoT length growth) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & Deepseek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving Math Word Problems with Process- and Outcome-Based Feedback (DeepMind)
- [2503] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (agent & reasoning)
- [2502] Reasoning with Reinforced Functional Token Tuning
- [2503] DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models (shortens thinking length via RL)
- [2503] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (CMU)
- [2502] Provably Optimal Distributional RL for LLM Post-Training (Cornell, Harvard)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (the scarcity of correct proofs makes rewards sparse, so performance quickly plateaus; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises, often variants of known results, and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2503] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- [2503] Interpreting the Repeated Token Phenomenon in Large Language Models (DeepMind)
- [2503] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models (Emcie Co Ltd)
- [2501] Reasoning Language Models: A Blueprint
- [2502] From System 1 to System 2: A Survey of Reasoning Large Language Models
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (on the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play generate data) (LSE)
- [2501] s1: Simple test-time scaling (Stanford)
- [2412] Efficiently Serving LLM Reasoning Programs with Certaindex (UCSD) (overthinking, probe in the middle)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
- [2412] Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
- [2408] Visual Agents as Fast and Slow Thinkers
- A Visual Guide to Reasoning LLMs
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
- What is the difference between a large reasoning model and an LLM? (Zhihu)
- LLM Reasoning: Key Ideas and Limitations Denny Zhou-DeepMind (Video)
- Towards Reasoning in Large Language Models Jie Huang-UIUC
- Can LLMs Reason & Plan? Subbarao Kambhampati-ASU
- Inference-Time Techniques for LLM Reasoning Xinyun Chen-DeepMind
- Chain-of-Thought Reasoning In Language Models Zhuosheng Zhang-SJTU
- Learning to Self-Improve & Reason with LLMs Jason Weston-Meta & NYU
- Why did no one try abandoning fine-tuning alignment and training chain-of-thought reasoning models purely via reinforcement learning before DeepSeek-R1-Zero appeared? (Zhihu)
- Kimi, by Flood Sung (Zhihu)
- A Roundup of the DeepSeek Article Series (Zhihu)
- ChatGPT and The Art of Post-Training (Stanford, 2025/02/18)
- [LLM+RL] R1 Paper Walkthrough: SFT vs. RL, RL Fundamentals, GRPO Details, and a Discussion of Reproduction Efforts
- [LLM+RL] Understanding the GRPO Formula and the TRL GRPOTrainer Implementation (advantage and loss computation)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop in ACCV 2024)
- Reinforcement Learning in DeepSeek-R1 Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 Paper Explained, Part 3: GPT History | Scaling Laws | Training Paradigms | Emergent Abilities
- EZ撸paper: DeepSeek-R1 Paper Explained, Part 2: What is AGI? | A Quick Intro to Reinforcement Learning | An Introduction to AlphaGo
- EZ撸paper: DeepSeek-R1 Paper Explained, Part 1: On Par with OpenAI o1, How Was It Done?
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
- TinyZero (4*4090 is enough for a 0.5B LLM, but the aha moment can't be observed)
- Open-r1
- Logic-RL
- Unsloth-GRPO (simplest R1 implementation)
- OpenR (An Open Source Framework for Advanced Reasoning)
- DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k
- deepseek_r1_train
The core of reinforcement learning is how an agent chooses its next action within an environment so as to maximize return; the environment's role is to provide states and rewards.
- Q-learning (value-based method): With epsilon-greedy exploration, a random number is drawn at each step; if it falls below the threshold ε, a random action is selected, otherwise the highest-value action is chosen from the Q-table. Either way, the Q-table is updated: after every action, we update the Q-value of the previous state-action pair toward the reward plus the discounted maximum Q-value of the next state, so as to maximize return (see the first sketch after this list).
- REINFORCE (policy-based method): It's like playing Mario, where every action in a given playthrough is sampled from a policy network. After the game ends, we have the reward at each step and can compute the cumulative return G for each state. Using these returns, we compute the loss and update the parameters of the policy network (see the second sketch below).
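To make the Q-learning item concrete, here is a minimal tabular sketch in Python. The chain environment, its size, and the hyperparameters (ε, α, γ) are illustrative assumptions, not taken from any paper above.

```python
import random

# Minimal tabular Q-learning sketch (illustrative toy setup): a chain of
# 5 states; action 0 moves left, action 1 moves right; reaching the
# rightmost state yields reward 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2
EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.9  # exploration rate, learning rate, discount

q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Toy environment: provides the next state and the reward."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy(qs):
    """Best action from the Q-table, breaking ties randomly."""
    best = max(qs)
    return random.choice([a for a, v in enumerate(qs) if v == best])

for episode in range(300):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: with probability EPSILON explore with a random
        # action; otherwise exploit the current Q-table.
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = greedy(q_table[state])
        next_state, reward, done = step(state, action)
        # Update the Q-value of the previous state-action pair toward
        # r + gamma * max_a' Q(s', a')  (zero future value at terminal states).
        td_target = reward + GAMMA * max(q_table[next_state]) * (not done)
        q_table[state][action] += ALPHA * (td_target - q_table[state][action])
        state = next_state

print(q_table)  # action 1 (move right) should end up with the higher Q-values
```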
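And a matching REINFORCE sketch in PyTorch: the network shape, the episode data, and the hyperparameters are placeholders for illustration.

```python
import torch
import torch.nn as nn

# Minimal REINFORCE sketch (illustrative; shapes and data are placeholders).
# A policy network maps states to action probabilities; after an episode
# ends, we compute the discounted return G_t for every step and use it to
# weight the log-probability of the action actually taken.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
GAMMA = 0.99

# Pretend we played one episode of T=3 steps (e.g., a toy Mario run).
states = torch.randn(3, 4)         # placeholder observations
actions = torch.tensor([0, 1, 1])  # actions the policy chose
rewards = [0.0, 0.0, 1.0]          # per-step rewards from the environment

# Compute cumulative discounted returns G_t backwards through the episode.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + GAMMA * g
    returns.insert(0, g)
returns = torch.tensor(returns)

# REINFORCE loss: -sum_t G_t * log pi(a_t | s_t); minimizing it ascends
# the policy-gradient estimate of expected return.
log_probs = torch.log_softmax(policy(states), dim=-1)
chosen = log_probs[torch.arange(3), actions]
loss = -(returns * chosen).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```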
- [2501] (REINFORCE++) A Simple and Efficient Approach for Aligning Large Language Models (REINFORCE++ is more stable in training than GRPO and faster than PPO, per the OpenRLHF report)
- [2407] A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
- [2405] (SimPO) Simple Preference Optimization with a Reference-Free Reward
- [2402] (KTO) Model Alignment as Prospect Theoretic Optimization
- [2402] (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (see the advantage sketch after this list)
- [2305] (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- [2203] (InstructGPT/PPO+LLM) Training Language Models to Follow Instructions with Human Feedback
- [1707] (PPO) Proximal Policy Optimization Algorithms
- [1706] (RLHF) Deep Reinforcement Learning from Human Preferences
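As referenced next to the GRPO entry above, below is a minimal sketch of the group-relative advantage computation at the heart of GRPO: several responses are sampled for the same prompt, each is scored, and each response's advantage is its reward standardized against its group, so no learned value network is needed. The binary correctness rewards and the eps stabilizer here are illustrative assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: standardize each response's reward
    against the group of responses sampled for the same prompt.
    `eps` is an illustrative stabilizer for constant-reward groups."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one math prompt, scored 1 if correct else 0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
print(adv)  # correct answers get positive advantage, incorrect ones negative
```

In training, this per-response advantage is broadcast to every token of the response and plugged into a PPO-style clipped objective, typically with a KL penalty toward a reference model.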
- Compshare (after registration there is a 50-yuan credit, enough to run R1 with Unsloth)
- Awesome-Efficient-CoT-Reasoning-Summary
- Awesome RL-based Reasoning MLLMs
- DecryptPrompt (very comprehensive)
- Feel free to contribute more papers or any other resources!