PURE

SOTA RL fine-tuning solution for advanced math reasoning with LLMs


PURE (Process-sUpervised Reinforcement lEarning) is a framework that trains a Process Reward Model (PRM) on the PRM800K dataset and uses it to fine-tune a language model to state-of-the-art mathematical reasoning capabilities. It introduces a min-form credit assignment method to compute returns and supports multiple reward types (process rewards, verifiable rewards, or both). The final model outperforms existing methods while using a small fraction of the RL data or compute, achieving high accuracy on several benchmarks. The project also documents reward hacking issues and aims to enhance long-range decision-making and reasoning with large language models.

README:

Stop Gamma Decay: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Notion | GitHub | Wandb | Hugging Face Collection

TL;DR: A Process Reward Model (PRM) trained on the PRM800K dataset can fine-tune an LLM to SOTA-level mathematical reasoning capabilities for under $150 (8 A100 GPUs × 16 hours).

🎉 News

  • [2025/03/24] We re-implement our algorithm based on verl. ✨✨ Key features: (1) ~50 additional metrics to comprehensively monitor training progress and stability, (2) a custom wandb workspace to monitor ~20 important metrics, (3) curriculum learning. ✨✨
  • [2025/02/22] 🔥🔥 We release the Notion blog, which details our algorithm, the difference between gamma-decay and min-form credit assignment, examples of reward hacking, and more. 🔥🔥
  • [2025/02/09] We release the training and evaluation code, wandb logs, and checkpoints. Paper's on its way!

📖 Introduction

This month, we saw a huge boost in LLM reasoning power from verifiable reward (VR)-based reinforcement learning fine-tuning (ReFT). Previous work exploring PRMs has encountered challenges and unsuccessful attempts, so we wonder: how far can a PRM actually take us? How does it stack up against VR-based methods in reasoning performance and training cost?

To answer these questions, we present PURE (Process-sUpervised Reinforcement lEarning). Using Qwen2.5-Math-7B as the base model, we train a PRM on the PRM800K dataset and then fine-tune another Qwen2.5-Math-7B model using only 8K MATH prompts, process rewards from the PRM, and optional verifiable rewards. For the RL algorithm, we use the PPO loss with an RLOO advantage estimator. We improve credit assignment by computing the return as a weighted sum of the process rewards, $\sum_t \text{softmax}(-\text{PR}_t/T)\cdot\text{PR}_t$, which approximates ${\min}_t \text{PR}_t$ as $T\rightarrow 0$, instead of the usual gamma-decay sum $\sum_t \gamma^t \cdot \text{PR}_t$. Our framework supports multiple reward types: process rewards only (PURE-PRM), verifiable rewards only (PURE-VR), which is DeepSeek-R1-Zero's setting, or a mix of both (PURE-PRM+VR), as shown in the table below.
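
To make the difference concrete, here is a minimal PyTorch sketch of the two credit-assignment schemes. It is an illustration only, not the repository's implementation; the function names and the temperature value T = 0.1 are assumptions chosen for the example.

import torch

def gamma_decay_return(process_rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    # Usual return: sum_t gamma^t * PR_t over the reasoning steps.
    t = torch.arange(process_rewards.shape[-1], dtype=process_rewards.dtype)
    return (gamma ** t * process_rewards).sum(dim=-1)

def min_form_return(process_rewards: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # Min-form return: sum_t softmax(-PR_t / T) * PR_t, which approaches min_t PR_t as T -> 0.
    weights = torch.softmax(-process_rewards / temperature, dim=-1)
    return (weights * process_rewards).sum(dim=-1)

# Toy example: a single flawed step (negative process reward) dominates the
# min-form return, while the gamma-decay sum still looks healthy.
pr = torch.tensor([0.9, 0.8, -0.7, 0.85])
print(gamma_decay_return(pr))  # ~1.85 with gamma = 1: the flaw is averaged away
print(min_form_return(pr))     # ~-0.70: close to min_t PR_t = -0.7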

📊 The final model achieves pass@1 accuracy of 82.6% on MATH500, 82.5% on AMC, and 53.3% on average across 5 benchmarks, beating Qwen2.5-Math-7B-Instruct, PRIME, and SimpleRL with either less than 1/50th of the RL data or 1/5th of the compute resources.

All results are reported as pass@1 accuracy (%).

| Model | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Base | 13.3 | 71.8 | 47.5 | 29.8 | 35.1 | 39.5 |
| Qwen2.5-Math-7B-Instruct | 16.7 | 83.2 | 52.5 | 37.5 | 41.3 | 46.2 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL-Zero | 33.3 | 77.2 | 62.5 | 33.5 | 37.6 | 48.8 |
| Qwen2.5-7B-PURE-PRM+VR* | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
| Qwen2.5-7B-PURE-PRM | 16.7 | 81.8 | 60.0 | 38.2 | 44.7 | 49.3 |
| Qwen2.5-7B-PURE-VR | 23.3 | 79.4 | 60.0 | 36.8 | 41.8 | 48.3 |

*The SOTA model was trained on 8K MATH problems, of which only ~800 had ground-truth final answers that could be used to compute VRs.

Note: Eurus-2-7B-PRIME and Qwen2.5-7B-SimpleRL-Zero are also based on Qwen2.5-Math-7B.

🔧 Quick Start

We implement our algorithm on two frameworks, OpenRLHF and verl, in two separate branches. If you are new to our project, we recommend the verl version.

Installation

1. OpenRLHF version

Please follow OpenRLHF's guidance to configure the required environments, then run pip install -r requirements.txt.

2. verl version

Please refer to the official installation guidance of verl.

Training of PRM

We train the PRM in 2 stages using TRL and a preprocessed PRM800K dataset. In the first stage, we freeze the LLM and train only the last score layer (MLP) with a 1e-4 learning rate for 3 epochs. In the second stage, we unfreeze the LLM and fine-tune all parameters with a 1e-6 learning rate for 1 epoch. The resulting PRM is released on Hugging Face.

cd PRM
# stage 1
bash train_stage_1.sh
# stage 2
bash train_stage_2.sh
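
For readers who want to see the two-stage scheme in code, the following is a minimal sketch of the freeze/unfreeze logic described above; the actual training is done by train_stage_1.sh / train_stage_2.sh via TRL, and the model class, the "score" attribute name, and the head architecture are assumptions for illustration.

import torch
from transformers import AutoModelForSequenceClassification

# Load a Qwen2.5-Math-7B backbone with a scalar score head (assumed setup;
# the real PRM head may be an MLP rather than a single linear layer).
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-Math-7B", num_labels=1, torch_dtype=torch.bfloat16
)

def configure_stage(model, stage: int):
    if stage == 1:
        # Stage 1: freeze the LLM backbone, train only the score head (lr 1e-4, 3 epochs).
        for p in model.parameters():
            p.requires_grad = False
        for p in model.score.parameters():
            p.requires_grad = True
        lr, epochs = 1e-4, 3
    else:
        # Stage 2: unfreeze everything and fine-tune all parameters (lr 1e-6, 1 epoch).
        for p in model.parameters():
            p.requires_grad = True
        lr, epochs = 1e-6, 1
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    return optimizer, epochs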

Training of LLM

1. OpenRLHF version

Switch to the openrlhf branch and run the command below. It uses Ray + vLLM for rollout acceleration, with the first 4 GPUs allocated to the actor, the initial actor (reference model), and the PRM, and the remaining GPUs used for the vLLM engines. This setup works with 5 to 8 GPUs; just adjust the number of vLLM engines in the script accordingly.

bash examples/scripts/train_pure.sh

2. verl version

Switch to the verl branch. Modify actor_rollout_ref.model.path and trainer.default_local_dir in the config file, then start training:

python -m verl.trainer.main_ppo

The hybrid engine of verl allows for higher GPU utilization than the OpenRLHF version.

Evaluation of Math Reasoning

We use Qwen Math's codebase for evaluation (pass@1 accuracy). For fairness, following SimpleRL, we completely prohibit solving problems by calling code. Please follow the instructions in /eval for evaluation.
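
As a reminder of the metric, pass@1 here is simply the fraction of problems whose single generated answer is correct. The toy sketch below shows the bookkeeping only, with a hypothetical record format; the real evaluation code verifies answers with math-aware normalization rather than raw string equality.

# Hypothetical records: one generated answer per problem plus the ground truth.
results = [
    {"pred": "3/4", "answer": "3/4"},
    {"pred": "12", "answer": "15"},
]

pass_at_1 = sum(r["pred"] == r["answer"] for r in results) / len(results)
print(f"pass@1 = {pass_at_1:.1%}")  # 50.0% for this toy example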

πŸ“ TODO:

  • [x] re-implementation on verl
  • [ ] paper with more discussions and evaluations
  • [ ] attempts to mitigate reward hacking for PRM (Online PURE)

🎈 Citation

If you find our code useful, we would appreciate it if you could cite our work:

@misc{cheng2025pure,
  title={Stop Gamma Decay: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning},
  author={Jie Cheng and Lijun Li and Gang Xiong and Jing Shao and Yisheng Lv and Fei-Yue Wang},
  year={2025},
  howpublished={\url{https://tungsten-ink-510.notion.site/Stop-Gamma-Decay-Min-Form-Credit-Assignment-Is-All-Process-Reward-Model-Needs-for-Reasoning-19fcb6ed0184804eb07fd310b38af155?pvs=4}},
  note={Notion Blog}
}

🌻 Acknowledgement

We implement our RL algorithm based on OpenRLHF and verl. We thank the developers of OpenRLHF and the author of SimpleRL for discussions! In addition, we refer to TRL and PRIME's code and hyperparameter values to varying degrees. We thank them for their wonderful work!
