RLinf

RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI

Stars: 2418

RLinf is a flexible and scalable open-source infrastructure for post-training foundation models with reinforcement learning. It is intended as a backbone for next-generation training that supports open-ended learning and continuous generalization. Key features include Macro-to-Micro Flow, flexible execution modes, an auto-scheduling strategy, embodied agent support, and fast adaptation for mainstream VLA models. Its hybrid mode and automatic online scaling strategy improve throughput and efficiency. RLinf integrates multiple backends, uses adaptive communication, and ships with built-in support for popular RL methods. The roadmap covers system-level enhancements and application-level extensions for more training scenarios and models. The documentation includes quickstart guides, key design principles, an example gallery, advanced features, and guidelines for extending the framework. Contributions are welcome, and users are encouraged to cite the GitHub repository and acknowledge the broader open-source community.

README:

RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI

RLinf is a flexible and scalable open-source RL infrastructure designed for Embodied and Agentic AI. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

What's NEW!

Key Features

Embodied AI

Supported components fall into four categories: simulators, real-world robotics, models, and algorithms.

Agentic AI

RLinf supports RL training that improves reasoning ability (for example, Math Reasoning) and coding ability (for example, Online Coder). We expect embodied AI to incorporate these agentic abilities in the future in order to complete complex tasks.

High flexibility, efficiency, and scalability

Beyond the features above, RLinf flexibly supports diverse RL training workflows (PPO, GRPO, SAC, and others) while hiding the complexity of distributed programming. Users can scale RL training to a large number of GPU nodes without modifying code, which meets the growing compute demands of RL training.

This flexibility also lets RLinf explore more efficient scheduling and execution strategies. The hybrid execution mode for embodied RL achieves up to 2.434× the throughput of existing frameworks.

Multiple Backend Integrations

  • FSDP + HuggingFace/SGLang/vLLM: rapid adaptation to new models and algorithms, ideal for beginners and fast prototyping.
  • Megatron + SGLang/vLLM: optimized for large-scale training, delivering maximum efficiency for expert users with demanding workloads.

Quick Start

Installation: Refer to our installation guide to install RLinf. We recommend using the provided Docker image (i.e., Installation Method 1), because the environment and dependencies of embodied RL are complex.

Run a simple example: After setting up the environment, run a simple embodied RL example with the ManiSkill3 simulator by following this document.

For more RLinf tutorials and application examples, check out our documentation and example gallery.

Main Results

Embodied Intelligence

  • RLinf supports both PPO and GRPO algorithms, enabling state-of-the-art training for Vision-Language-Action models.
  • The framework provides seamless integration with mainstream embodied intelligence benchmarks, and achieves strong performance across diverse evaluation metrics.
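
Because the results below compare PPO and GRPO, it may help to recall the main difference between them: PPO typically estimates advantages with a learned value function, while GRPO scores each rollout relative to the other rollouts sampled for the same task. The sketch below illustrates only that group-relative advantage step. It is not RLinf's implementation; the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts in the same group.

    `rewards` holds one scalar reward per rollout of a single task or prompt.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeed (reward 1.0), two fail (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, and a group whose rollouts all earn the same reward contributes no learning signal.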

OpenVLA and OpenVLA-OFT Results

[Figure: training curves, two panels: OpenVLA and OpenVLA-OFT]
  • Training curves on ManiSkill “PutOnPlateInScene25Mani-v3” with OpenVLA and OpenVLA-OFT models, using PPO and GRPO algorithms. PPO consistently outperforms GRPO and exhibits greater stability.

Evaluation results on ManiSkill. Values denote success rates. The last four columns are out-of-distribution (OOD) settings.

| Model | In-Distribution | OOD Vision | OOD Semantic | OOD Execution | OOD Avg. |
| --- | --- | --- | --- | --- | --- |
| OpenVLA (Base) | 53.91% | 38.75% | 35.94% | 42.11% | 39.10% |
| RL4VLA (PPO) | 93.75% | 80.47% | 75.00% | 81.77% | 79.15% |
| OpenVLA (RLinf-GRPO) | 84.38% | 74.69% | 72.99% | 77.86% | 75.15% |
| OpenVLA (RLinf-PPO) | 96.09% | 82.03% | 78.35% | 85.42% | 81.93% |
| OpenVLA-OFT (Base) | 28.13% | 27.73% | 12.95% | 11.72% | 18.29% |
| OpenVLA-OFT (RLinf-GRPO) | 94.14% | 84.69% | 45.54% | 44.66% | 60.64% |
| OpenVLA-OFT (RLinf-PPO) | 97.66% | 92.11% | 64.84% | 73.57% | 77.05% |

Evaluation results of the unified model on the five LIBERO task groups

| Model | Spatial | Object | Goal | Long | 90 | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT (Base) | 72.18% | 71.48% | 64.06% | 48.44% | 70.97% | 65.43% |
| OpenVLA-OFT (RLinf-GRPO) | 99.40% | 99.80% | 98.79% | 93.95% | 98.59% | 98.11% |
| Δ Improvement | +27.22 | +28.32 | +34.73 | +45.51 | +27.62 | +32.68 |
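
The Δ Improvement row is the per-group difference, in percentage points, between the RLinf-GRPO row and the base row. The short sketch below recomputes it from the published success rates. Treating the Avg. column as the unweighted mean of the five task groups is an assumption of this sketch, although it is consistent with the published numbers; the variable and function names are illustrative only.

```python
# Success rates (%) copied from the LIBERO table above.
base = {"Spatial": 72.18, "Object": 71.48, "Goal": 64.06, "Long": 48.44, "90": 70.97}
grpo = {"Spatial": 99.40, "Object": 99.80, "Goal": 98.79, "Long": 93.95, "90": 98.59}


def average(row):
    """Unweighted mean over the five task groups, rounded to two decimals."""
    return round(sum(row.values()) / len(row), 2)


# Per-group gain in percentage points.
delta = {group: round(grpo[group] - base[group], 2) for group in base}
delta_avg = round(average(grpo) - average(base), 2)

for group, gain in delta.items():
    print(f"{group:>7}: +{gain:.2f}")
print(f"{'Avg.':>7}: +{delta_avg:.2f}")
```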

π0 and π0.5 Results

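Evaluation results on the four LIBERO task groups

| Setting | Model | Spatial | Object | Goal | Long | Avg. | Δ Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Full Dataset SFT | Octo | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% | |
| Full Dataset SFT | OpenVLA | 84.7% | 88.4% | 79.2% | 53.7% | 76.5% | |
| Full Dataset SFT | πfast | 96.4% | 96.8% | 88.6% | 60.2% | 85.5% | |
| Full Dataset SFT | OpenVLA-OFT | 91.6% | 95.3% | 90.6% | 86.5% | 91.0% | |
| Full Dataset SFT | π0 | 96.8% | 98.8% | 95.8% | 85.2% | 94.2% | |
| Full Dataset SFT | π0.5 | 98.8% | 98.2% | 98.0% | 92.4% | 96.9% | |
| Few-shot Dataset SFT + RL | π0 (SFT) | 65.3% | 64.4% | 49.8% | 51.2% | 57.6% | |
| Few-shot Dataset SFT + RL | π0 (Flow-SDE) | 98.4% | 99.4% | 96.2% | 90.2% | 96.1% | +38.5 |
| Few-shot Dataset SFT + RL | π0 (Flow-Noise) | 99.0% | 99.2% | 98.2% | 93.8% | 97.6% | +40.0 |
| Few-shot Dataset SFT + RL | π0.5 (SFT) | 84.6% | 95.4% | 84.6% | 43.9% | 77.1% | |
| Few-shot Dataset SFT + RL | π0.5 (Flow-SDE) | 99.6% | 100% | 98.8% | 93.0% | 97.9% | +20.8 |
| Few-shot Dataset SFT + RL | π0.5 (Flow-Noise) | 99.6% | 100% | 99.6% | 94.0% | 98.3% | +21.2 |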

Math Reasoning

1.5B model results

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B (base model) | 28.33 | 24.90 | 27.45 | 26.89 |
| DeepMath-1.5B | 37.80 | 30.42 | 32.11 | 33.44 |
| DeepScaleR-1.5B-Preview | 40.41 | 30.93 | 27.54 | 32.96 |
| AReaL-1.5B-Preview-Stage-3 | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| FastCuRL-1.5B-V3 | 43.65 | 32.49 | 35.00 | 37.05 |
| RLinf-math-1.5B | 48.44 | 35.63 | 38.46 | 40.84 |

* We retrain the model using the default settings for 600 steps.

7B model results

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-7B (base model) | 54.90 | 40.20 | 45.48 | 46.86 |
| AReaL-boba-RL-7B | 61.66 | 49.38 | 46.93 | 52.66 |
| Skywork-OR1-7B | 66.87 | 52.49 | 44.43 | 54.60 |
| Polaris-7B-Preview | 68.55 | 51.24 | 43.88 | 54.56 |
| AceMath-RL-Nemotron-7B | 67.30 | 55.00 | 45.57 | 55.96 |
| RLinf-math-7B | 68.33 | 52.19 | 48.18 | 56.23 |

  • RLinf achieves the highest average score among the compared models at both the 1.5B and 7B sizes. RLinf-math-1.5B leads on all three benchmarks (AIME 24, AIME 25, GPQA-diamond), while RLinf-math-7B leads on GPQA-diamond and on the overall average.
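
The Average column in both tables matches the unweighted mean of the three benchmark scores. The sketch below recomputes it for the 7B table and reports which model leads each column, which makes it easy to see where the ranking by average and the per-benchmark rankings differ. The scores are copied from the table above; the variable names are illustrative only.

```python
benchmarks = ("AIME 24", "AIME 25", "GPQA-diamond")

# Scores copied from the 7B table, in the same order as `benchmarks`.
scores_7b = {
    "DeepSeek-R1-Distill-Qwen-7B": (54.90, 40.20, 45.48),
    "AReaL-boba-RL-7B": (61.66, 49.38, 46.93),
    "Skywork-OR1-7B": (66.87, 52.49, 44.43),
    "Polaris-7B-Preview": (68.55, 51.24, 43.88),
    "AceMath-RL-Nemotron-7B": (67.30, 55.00, 45.57),
    "RLinf-math-7B": (68.33, 52.19, 48.18),
}

# Unweighted mean per model, rounded to two decimals like the table.
averages = {model: round(sum(s) / len(s), 2) for model, s in scores_7b.items()}

# Model with the highest score on each individual benchmark.
leaders = {
    name: max(scores_7b, key=lambda model: scores_7b[model][i])
    for i, name in enumerate(benchmarks)
}
best_average = max(averages, key=averages.get)

for name, model in leaders.items():
    print(f"{name}: {model}")
print(f"Average: {best_average} ({averages[best_average]:.2f})")
```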

Roadmap

1. System-Level Enhancements

  • [X] Support for heterogeneous GPUs
  • [ ] Support for asynchronous pipeline execution
  • [X] Support for Mixture of Experts (MoE)

2. Application-Level Extensions

  • [X] Support for Vision-Language Models (VLMs) training
  • [ ] Support for deep searcher agent training
  • [ ] Support for multi-agent training
  • [ ] Support for integration with more embodied simulators (e.g., GENESIS)
  • [ ] Support for more Vision Language Action models (VLAs) (e.g., WALL-OSS)
  • [X] Support for world model
  • [X] Support for real-world RL

CI Test Status

RLinf has comprehensive CI tests for both the core components (via unit tests) and the end-to-end RL training workflows of the embodied, agent, and reasoning scenarios. The main branch is covered by the following CI workflows:

  • unit-tests
  • agent-reason-e2e-tests
  • embodied-e2e-tests
  • scheduler-tests

Contribution Guidelines

We welcome contributions to RLinf. Please read the contribution guide before you start. We thank all of our contributors and welcome more developers to join us on this open-source project.

Citation and Acknowledgement

If you find RLinf helpful, please cite the paper:

@article{yu2025rlinf,
  title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
  author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and others},
  journal={arXiv preprint arXiv:2509.15965},
  year={2025}
}

If you use RL+VLA in RLinf, you can also cite our technical report and empirical study paper:

@article{zang2025rlinf,
  title={RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training},
  author={Zang, Hongzhi and Wei, Mingjie and Xu, Si and Wu, Yongji and Guo, Zhen and Wang, Yuanqing and Lin, Hao and Shi, Liangzhi and Xie, Yuqing and Xu, Zhexuan and others},
  journal={arXiv preprint arXiv:2510.06710},
  year={2025}
}
@article{liu2025can,
  title={What Can {RL} Bring to {VLA} Generalization? An Empirical Study},
  author={Liu, Jijia and Gao, Feng and Wei, Bingwen and Chen, Xinlei and Liao, Qingmin and Wu, Yi and Yu, Chao and Wang, Yu},
  journal={arXiv preprint arXiv:2505.19789},
  year={2025}
}
@article{chen2025pi_,
  title={$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models},
  author={Chen, Kang and Liu, Zhihao and Zhang, Tonghe and Guo, Zhen and Xu, Si and Lin, Hao and Zang, Hongzhi and Zhang, Quanlu and Yu, Zhaofei and Fan, Guoliang and others},
  journal={arXiv preprint arXiv:2510.25889},
  year={2025}
}

Acknowledgements: RLinf has been inspired by, and benefits from, the ideas and tooling of the broader open-source community. In particular, we would like to thank the teams and contributors behind VeRL, AReaL, Megatron-LM, SGLang, and PyTorch Fully Sharded Data Parallel (FSDP). If we have inadvertently missed your project or contribution, please open an issue or a pull request so we can properly credit you.

Contact: We welcome applications from Postdocs, PhD/Master's students, and interns. Join us in shaping the future of RL infrastructure and embodied AI!
