Awesome-RL-based-LLM-Reasoning


PR Welcome License: MIT Awesome

We have witnessed the powerful capabilities of pure RL-based LLM reasoning. This repository collects the newest papers, slides, and other interesting materials on enhancing LLM reasoning with reinforcement learning, so everyone can learn quickly!
Starring this repository keeps you at the forefront of RL-based LLM reasoning.
In the teeth of the storm

Why?

  • Why do we need reasoning?
  • Why do we use reinforcement learning to get reasoning ability? (What are the advantages compared to reasoning methods that do not use reinforcement learning?)

Papers

Outcome-based Reward Model

Process-based Reward Model

Reinforcement learning

Search algorithms (Monte Carlo Tree Search or Beam Search)

Other Recent Interesting Papers on LLM Reasoning

Slides and Discussion

Video

Open-Source Project

Introduction to Reinforcement Learning

The core of reinforcement learning is how an agent chooses its next action within an environment so as to maximize return; the environment's role is to provide states and rewards.

  • Q-learning (value-based method): With epsilon-greedy exploration, a random number is drawn at each step; if it falls below the threshold epsilon, a random action is selected (exploration), otherwise the best action according to the Q-table is chosen (exploitation). Either way, after every action we update the Q-table entry for the previous state-action pair toward the observed reward plus the discounted best value of the next state, so as to maximize the return.
  • REINFORCE (policy-based method): Think of playing Mario, where every action in a given playthrough is sampled from a policy network. After the episode ends, we have the reward at each step and can compute the cumulative (discounted) return G for each step. We then use these returns G to compute the loss and update the policy network's parameters.
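The two bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the repository: the toy corridor environment, the hyperparameter values, and all function names are assumptions made for the example. It shows epsilon-greedy action selection with the tabular Q-update, plus the REINFORCE-style computation of the cumulative return G for one episode.

```python
import random
from collections import defaultdict

# Toy environment (an assumption for illustration): a 1-D corridor with
# states 0..4; reaching state 4 ends the episode with reward 1.
N_STATES = 5
ACTIONS = [-1, +1]            # move left / move right
EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.9  # illustrative hyperparameters

Q = defaultdict(float)        # Q[(state, action)] -> estimated return

def choose_action(state):
    """Epsilon-greedy: explore with probability EPSILON, else exploit the Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def step(state, action):
    """The environment's role: provide the next state and the reward."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def discounted_returns(rewards, gamma=GAMMA):
    """REINFORCE-style cumulative return G_t for every step of one episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

random.seed(0)
for _ in range(500):                    # Q-learning training episodes
    state = 0
    while state != N_STATES - 1:
        action = choose_action(state)
        next_state, reward = step(state, action)
        # Update the Q-table entry for the previous state-action pair
        # toward reward + discounted best value of the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

print(round(Q[(3, +1)], 2))             # value of stepping right, next to the goal
print([round(g, 2) for g in discounted_returns([0.0, 0.0, 1.0])])
```

The Q-value for moving right from the state adjacent to the goal converges to 1.0 (reward 1 plus a discounted terminal value of 0), while `discounted_returns` shows how G grows as a step gets closer to the reward.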

X_PO

Cloud GPU

  • Compshare (after registration you receive a 50 CNY credit, enough to run R1 with unsloth)

Other Interesting RL-based Reasoning Repository

Contributing

  • Feel free to contribute more papers or any other resources!
