Awesome-RL-based-LLM-Reasoning

Awesome RL-based LLM Reasoning

Stars: 380

Visit

This repository is dedicated to enhancing Language Model (LLM) reasoning with reinforcement learning (RL). It includes a collection of the latest papers, slides, and materials related to RL-based LLM reasoning, aiming to facilitate quick learning and understanding in this field. Starring this repository allows users to stay updated and engaged with the forefront of RL-based LLM reasoning.

README:

Awesome-RL-based-LLM-Reasoning

We have witnessed the powerful capabilities of pure RL-based LLM Reasoning. In this repository, we will add newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, helping everyone learn quickly!
Starring this repository is like being at the forefront of RL-based LLM reasoning.
在风口浪尖 (In the teeth of the storm)

Why ?

Why do we need reasoning?
Why do we use reinforcement learning to get reasoning ability? (What are the advantages compared to reasoning methods that do not use reinforcement learning?)

Papers

Question about LLM Reasoning Ability

[2504] Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? (ByteDance Seed) (by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer 60% performance loss on elementary school-level arithmetic and reasoning problems)
[2503] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (the outcome of the 2025 USA Math Olympiad was worse. Most of the results were 0)

Other Newest Interesting Papers about LLM Reasoning

[2503] Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models (visulize reasoning process)
[2503] Efficient Test-Time Scaling via Self-Calibration (WUSTL) (LLMs are known to be overconfident and provide unreliable confidence estimation)
[2503] Interpreting the Repeated Token Phenomenon in Large Language Models (DeepMind)
[2503] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models (Emcie Co Ltd)
[2501] Reasoning Language Models: A Blueprint
[2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think is also about overthinking) (PKU, MIT)
[2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
[2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
[2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
[2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
[2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (Current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max-Plank)
[2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path through explicit trajectory design obtaining complex reasoning ability) (SJTU)
[2502] Confidence Improves Self-Consistency in LLMs (the quality of LLM outputs) (Google Reasearch)
[2502] LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! (UC Berkeley)
[2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
[2502] LLMs Can Teach Themselves to Better Predict the Future (self-play generate data) (LSE)
[2501] s1: Simple test-time scaling (Stanford) (distillation and using 'wait' append response)
[2412] Formal Mathematical Reasoning: A New Frontier in AI
[2412] Efficiently Serving LLM Reasoning Programs with Certaindex (UCSD) (overthinking, probe in the middle)
[2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
[2412] Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
[2408] Visual Agents as Fast and Slow Thinkers

Surveys

Slides and Discussion

Self-improvement of LLM agents through Reinforcement Learning at Scale
A Visual Guide to Reasoning LLMs
Understanding Reasoning LLMs Methods and Strategies for Building and Refining Reasoning Models
What is the difference between large reasoning model and LLM? (Zhihu)
LLM Reasoning: Key Ideas and Limitations Denny Zhou-DeepMind (Video)
Towards Reasoning in Large Language Models Jie Huang-UIUC
Can LLMs Reason & Plan? Subbarao Kambhampati-ASU
Inference-Time Techniques for LLM Reasoning Xinyun Chen-DeepMind
Chain-of-Thought Reasoning In Language Models Zhuosheng Zhang-SJTU
Learning to Self-Improve & Reason with LLMs Jason Weston-Meta & NYU
为什么在Deepseek-R1-ZERO出现前，无人尝试放弃微调对齐，通过强化学习生成思考链推理模型？ Zhihu
Kimi Flood Sung Zhihu
Deepseek系列文章梳理 Zhihu
ChatGPT and The Art of Post-Training Stanford-25/02/18

Video

Open-Source Project

TinyZero (4*4090 is enough for 0.5B LLM, but can't observe aha moment)
Open-r1
Logic-RL
Unsloth-GRPO (simplest r1 implementation)
OpenR (An Open Source Framework for Advanced Reasoning)
DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k
deepseek_r1_train

Introduction to Reinforcement Learning

The core essence of reinforcement learning is how an agent determines the next action within an environment to maximize the return; the environment’s role is to provide the state and reward.

Q-learning (Value-based method): A threshold is set, and if the current value is greater than the threshold (epsilon-greddy), a random action is selected; if it is smaller, an action is chosen from the Q-table. Regardless of which method is chosen, the Q-table needs to be updated. After every action, we update the Q-table of the previous state to maximize the return.
REINFORCE (Policy-based method): It’s like playing Mario where every action in a given playthrough is determined by a policy network. After the game ends, we have the reward for each state and can compute the cumulative return (G) for each state. Then, using this computed G, we calculate the loss and update the parameters of the policy network.

X_PO

[2501] (REINFORCE++) A Simple and Efficient Approach for Aligning Large Language Models 6 (REINFORCE++ is more stable in training compared to GRPO and faster than PPO in OpenRLHF report)
[2405] (SimPO) Simple Preference Optimization with a Reference-Free Reward 227
[2402] (KTO) Model Alignment as Prospect Theoretic Optimization 326
[2402] (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models 250
[2305] (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model 2580
[2203] (InstructGPT/PPO+LLM) Training language models to follow instructions with human feedback 12443
[1707] (PPO) Proximal Policy Optimization Algorithms 23934
[1706] (RLHF) Deep Reinforcement Learning from Human Preferences 3571

Cloud GPU

Compshare (After registration, there is a quota of 50 yuan, enough to run R1 on unsloth)

Other Interesting RL-based Reasoning Repository

Contributing

Feel free to contribute more papers or other any resources!

For Tasks:

Click tags to check more tools for each tasks

improve reasoning verify and reinforce boost language inference solve word problems teach to predict future

For Jobs:

research scientist machine learning engineer data scientist ai researcher academic researcher

Alternative AI tools for Awesome-RL-based-LLM-Reasoning

Similar Open Source Tools

Awesome-RL-based-LLM-Reasoning

github

: 380

Awesome-LLM4Graph-Papers

A collection of papers and resources about Large Language Models (LLM) for Graph Learning (Graph). Integrating LLMs with graph learning techniques to enhance performance in graph learning tasks. Categorizes approaches based on four primary paradigms and nine secondary-level categories. Valuable for research or practice in self-supervised learning for recommendation systems.

github

: 290

Awesome-Text2SQL

Awesome Text2SQL is a curated repository containing tutorials and resources for Large Language Models, Text2SQL, Text2DSL, Text2API, Text2Vis, and more. It provides guidelines on converting natural language questions into structured SQL queries, with a focus on NL2SQL. The repository includes information on various models, datasets, evaluation metrics, fine-tuning methods, libraries, and practice projects related to Text2SQL. It serves as a comprehensive resource for individuals interested in working with Text2SQL and related technologies.

github

: 1.5k

Awesome-explainable-AI

This repository contains frontier research on explainable AI (XAI), a hot topic in the field of artificial intelligence. It includes trends, use cases, survey papers, books, open courses, papers, and Python libraries related to XAI. The repository aims to organize and categorize publications on XAI, provide evaluation methods, and list various Python libraries for explainable AI.

github

: 1.3k

sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system. The core features of SGLang include: - **A Flexible Front-End Language**: This allows for easy programming of LLM applications with multiple chained generation calls, advanced prompting techniques, control flow, multiple modalities, parallelism, and external interaction. - **A High-Performance Runtime with RadixAttention**: This feature significantly accelerates the execution of complex LLM programs by automatic KV cache reuse across multiple calls. It also supports other common techniques like continuous batching and tensor parallelism.

github

: 12.9k

awesome-cuda-and-hpc

github

: 221

awesome-cuda-triton-hpc

github

: 211

awesome-cuda-triton-mlir-hpc

github

: 169

awesome-cuda-triton-tvm-hpc

github

: 169

lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams. It has the following core features: * **Efficient Inference** : LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch(a.k.a. continuous batching), blocked KV cache, dynamic split&fuse, tensor parallelism, high-performance CUDA kernels and so on. * **Effective Quantization** : LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation. * **Effortless Distribution Server** : Leveraging the request distribution service, LMDeploy facilitates an easy and efficient deployment of multi-model services across multiple machines and cards. * **Interactive Inference Mode** : By caching the k/v of attention during multi-round dialogue processes, the engine remembers dialogue history, thus avoiding repetitive processing of historical sessions.

github

: 6.0k

Lumi-AI

Lumi AI is a friendly AI sidekick with a human-like personality that offers features like file upload and analysis, web search, local chat storage, custom instructions, changeable conversational style, enhanced context retention, voice query input, and various tools. The project has been developed with contributions from a team of developers, designers, and testers, and is licensed under Apache 2.0 and MIT licenses.

github

: 106

awesome-mcp-servers

A curated list of awesome Model Context Protocol (MCP) servers that enable AI models to securely interact with local and remote resources through standardized server implementations. The list focuses on production-ready and experimental servers extending AI capabilities through file access, database connections, API integrations, and other contextual services.

github

: 1.6k

superagentx

SuperAgentX is a lightweight open-source AI framework designed for multi-agent applications with Artificial General Intelligence (AGI) capabilities. It offers goal-oriented multi-agents with retry mechanisms, easy deployment through WebSocket, RESTful API, and IO console interfaces, streamlined architecture with no major dependencies, contextual memory using SQL + Vector databases, flexible LLM configuration supporting various Gen AI models, and extendable handlers for integration with diverse APIs and data sources. It aims to accelerate the development of AGI by providing a powerful platform for building autonomous AI agents capable of executing complex tasks with minimal human intervention.

github

: 57

BadukMegapack

BadukMegapack is an installer for various AI Baduk (Go) programs, designed for baduk players who want to easily access and use a variety of baduk AI programs without complex installations. The megapack includes popular programs like Lizzie, KaTrain, Sabaki, KataGo, LeelaZero, and more, along with weight files for different AI models. Users can update their graphics card drivers before installation for optimal performance.

github

: 194

Phi-3CookBook

Phi-3CookBook is a manual on how to use the Microsoft Phi-3 family, which consists of open AI models developed by Microsoft. The Phi-3 models are highly capable and cost-effective small language models, outperforming models of similar and larger sizes across various language, reasoning, coding, and math benchmarks. The repository provides detailed information on different Phi-3 models, their performance, availability, and usage scenarios across different platforms like Azure AI Studio, Hugging Face, and Ollama. It also covers topics such as fine-tuning, evaluation, and end-to-end samples for Phi-3-mini and Phi-3-vision models, along with labs, workshops, and contributing guidelines.

github

: 2.7k

dom-to-semantic-markdown

DOM to Semantic Markdown is a tool that converts HTML DOM to Semantic Markdown for use in Large Language Models (LLMs). It maximizes semantic information, token efficiency, and preserves metadata to enhance LLMs' processing capabilities. The tool captures rich web content structure, including semantic tags, image metadata, table structures, and link destinations. It offers customizable conversion options and supports both browser and Node.js environments.

github

: 708

For similar tasks

Awesome-RL-based-LLM-Reasoning

github

: 380

Step-DPO

Step-DPO is a method for enhancing long-chain reasoning ability of LLMs with a data construction pipeline creating a high-quality dataset. It significantly improves performance on math and GSM8K tasks with minimal data and training steps. The tool fine-tunes pre-trained models like Qwen2-7B-Instruct with Step-DPO, achieving superior results compared to other models. It provides scripts for training, evaluation, and deployment, along with examples and acknowledgements.

github

: 155

optillm

optillm is an OpenAI API compatible optimizing inference proxy implementing state-of-the-art techniques to enhance accuracy and performance of LLMs, focusing on reasoning over coding, logical, and mathematical queries. By leveraging additional compute at inference time, it surpasses frontier models across diverse tasks.

github

: 2.1k

MoBA

MoBA (Mixture of Block Attention) is an innovative approach for long-context language models, enabling efficient processing of long sequences by dividing the full context into blocks and introducing a parameter-less gating mechanism. It allows seamless transitions between full and sparse attention modes, enhancing efficiency without compromising performance. MoBA has been deployed to support long-context requests and demonstrates significant advancements in efficient attention computation for large language models.

github

: 1.3k

Awesome_Test_Time_LLMs

This repository focuses on test-time computing, exploring various strategies such as test-time adaptation, modifying the input, editing the representation, calibrating the output, test-time reasoning, and search strategies. It covers topics like self-supervised test-time training, in-context learning, activation steering, nearest neighbor models, reward modeling, and multimodal reasoning. The repository provides resources including papers and code for researchers and practitioners interested in enhancing the reasoning capabilities of large language models.

github

: 69

Awesome-Neuro-Symbolic-Learning-with-LLM

The Awesome-Neuro-Symbolic-Learning-with-LLM repository is a curated collection of papers and resources focusing on improving reasoning and planning capabilities of Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) through neuro-symbolic learning. It covers a wide range of topics such as neuro-symbolic visual reasoning, program synthesis, logical reasoning, mathematical reasoning, code generation, visual reasoning, geometric reasoning, classical planning, game AI planning, robotic planning, AI agent planning, and more. The repository provides a comprehensive overview of tutorials, workshops, talks, surveys, papers, datasets, and benchmarks related to neuro-symbolic learning with LLMs and MLLMs.

github

: 53

For similar jobs

Perplexica

Perplexica is an open-source AI-powered search engine that utilizes advanced machine learning algorithms to provide clear answers with sources cited. It offers various modes like Copilot Mode, Normal Mode, and Focus Modes for specific types of questions. Perplexica ensures up-to-date information by using SearxNG metasearch engine. It also features image and video search capabilities and upcoming features include finalizing Copilot Mode and adding Discover and History Saving features.

github

: 21.0k

KULLM

KULLM (구름) is a Korean Large Language Model developed by Korea University NLP & AI Lab and HIAI Research Institute. It is based on the upstage/SOLAR-10.7B-v1.0 model and has been fine-tuned for instruction. The model has been trained on 8×A100 GPUs and is capable of generating responses in Korean language. KULLM exhibits hallucination and repetition phenomena due to its decoding strategy. Users should be cautious as the model may produce inaccurate or harmful results. Performance may vary in benchmarks without a fixed system prompt.

github

: 527

MMMU

MMMU is a benchmark designed to evaluate multimodal models on college-level subject knowledge tasks, covering 30 subjects and 183 subfields with 11.5K questions. It focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of various models highlights substantial challenges, with room for improvement to stimulate the community towards expert artificial general intelligence (AGI).

github

: 374

1filellm

1filellm is a command-line data aggregation tool designed for LLM ingestion. It aggregates and preprocesses data from various sources into a single text file, facilitating the creation of information-dense prompts for large language models. The tool supports automatic source type detection, handling of multiple file formats, web crawling functionality, integration with Sci-Hub for research paper downloads, text preprocessing, and token count reporting. Users can input local files, directories, GitHub repositories, pull requests, issues, ArXiv papers, YouTube transcripts, web pages, Sci-Hub papers via DOI or PMID. The tool provides uncompressed and compressed text outputs, with the uncompressed text automatically copied to the clipboard for easy pasting into LLMs.

github

: 292

gpt-researcher

GPT Researcher is an autonomous agent designed for comprehensive online research on a variety of tasks. It can produce detailed, factual, and unbiased research reports with customization options. The tool addresses issues of speed, determinism, and reliability by leveraging parallelized agent work. The main idea involves running 'planner' and 'execution' agents to generate research questions, seek related information, and create research reports. GPT Researcher optimizes costs and completes tasks in around 3 minutes. Features include generating long research reports, aggregating web sources, an easy-to-use web interface, scraping web sources, and exporting reports to various formats.

github

: 20.7k

ChatTTS

ChatTTS is a generative speech model optimized for dialogue scenarios, providing natural and expressive speech synthesis with fine-grained control over prosodic features. It supports multiple speakers and surpasses most open-source TTS models in terms of prosody. The model is trained with 100,000+ hours of Chinese and English audio data, and the open-source version on HuggingFace is a 40,000-hour pre-trained model without SFT. The roadmap includes open-sourcing additional features like VQ encoder, multi-emotion control, and streaming audio generation. The tool is intended for academic and research use only, with precautions taken to limit potential misuse.

github

: 33.9k

HebTTS

HebTTS is a language modeling approach to diacritic-free Hebrew text-to-speech (TTS) system. It addresses the challenge of accurately mapping text to speech in Hebrew by proposing a language model that operates on discrete speech representations and is conditioned on a word-piece tokenizer. The system is optimized using weakly supervised recordings and outperforms diacritic-based Hebrew TTS systems in terms of content preservation and naturalness of generated speech.

github

: 52

do-research-in-AI

This repository is a collection of research lectures and experience sharing posts from frontline researchers in the field of AI. It aims to help individuals upgrade their research skills and knowledge through insightful talks and experiences shared by experts. The content covers various topics such as evaluating research papers, choosing research directions, research methodologies, and tips for writing high-quality scientific papers. The repository also includes discussions on academic career paths, research ethics, and the emotional aspects of research work. Overall, it serves as a valuable resource for individuals interested in advancing their research capabilities in the field of AI.

github

: 61

Awesome-RL-based-LLM-Reasoning

README:

Awesome-RL-based-LLM-Reasoning

Why ?

Papers

Outcome-based Reward Model

Process-based Reward Model

Reinforcement learning

Search algorithms (Monte Carlo Tree Search or Beam Search)

Question about LLM Reasoning Ability

Other Newest Interesting Papers about LLM Reasoning

Surveys

Slides and Discussion

Video

Open-Source Project

Introduction to Reinforcement Learning

X_PO

Cloud GPU

Other Interesting RL-based Reasoning Repository

Contributing

For Tasks:

For Jobs:

Alternative AI tools for Awesome-RL-based-LLM-Reasoning

Similar Open Source Tools

Awesome-RL-based-LLM-Reasoning

Awesome-LLM4Graph-Papers

Awesome-Text2SQL

Awesome-explainable-AI

sglang

awesome-cuda-and-hpc

awesome-cuda-triton-hpc

awesome-cuda-triton-mlir-hpc

awesome-cuda-triton-tvm-hpc

lmdeploy

Lumi-AI

awesome-mcp-servers

superagentx

BadukMegapack

Phi-3CookBook

dom-to-semantic-markdown

For similar tasks

Awesome-RL-based-LLM-Reasoning

Step-DPO

optillm

MoBA

Awesome_Test_Time_LLMs

Awesome-Neuro-Symbolic-Learning-with-LLM

For similar jobs

Perplexica

KULLM

MMMU

1filellm

gpt-researcher

ChatTTS

HebTTS

do-research-in-AI