R1-Searcher
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Stars: 368
R1-searcher is a tool designed to incentivize the search capability in large reasoning models (LRMs) via reinforcement learning. It enables LRMs to invoke web search and obtain external information during the reasoning process by utilizing a two-stage outcome-supervision reinforcement learning approach. The tool does not require instruction fine-tuning for cold start and is compatible with existing Base LLMs or Chat LLMs. It includes training code, inference code, model checkpoints, and a detailed technical report.
README:
Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the significant impact of reinforcement learning in enhancing long-step reasoning capabilities, thereby greatly improving reasoning performance. Despite these advances, when faced with knowledge-intensive problems, especially multi-hop questions and time-sensitive issues, these models may lack the necessary knowledge. It is therefore important to enable LLMs to invoke web search and obtain external information during the reasoning process.
We propose R1-Searcher, which utilizes a two-stage outcome-supervision reinforcement learning approach to enable the model to learn to invoke web search during the reasoning process: the first stage lets the model learn how to invoke web search, and the second teaches it how to use the search engine effectively. The method requires no instruction fine-tuning for cold start and is compatible with existing Base LLMs and Chat LLMs. We open-source the training code, inference code, model checkpoints, and a detailed technical report.
- Arxiv: https://arxiv.org/pdf/2503.05592
- Model:
- Qwen-2.5-7B-Base-RAG-RL: https://huggingface.co/XXsongLALA/Qwen-2.5-7B-base-RAG-RL
- Llama-3.1-8B-Instruct-RAG-RL: https://huggingface.co/XXsongLALA/Llama-3.1-8B-instruct-RAG-RL
- Train-data: https://huggingface.co/datasets/XXsongLALA/RAG-RL-Hotpotqa-with-2wiki
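To make the search-augmented reasoning loop concrete, here is a minimal sketch in Python. The `model.generate` and `retrieve` callables are hypothetical stand-ins, and the tag names follow the solution format described below; the actual loop lives in the repo's inference code.

```python
# Minimal sketch of the reasoning-with-search loop. `model` and `retrieve`
# are hypothetical stand-ins, not the repo's actual interfaces.
SEARCH_OPEN, SEARCH_CLOSE = "<begin_of_search>", "<end_of_search>"
DOCS_OPEN, DOCS_CLOSE = "<begin_of_documents>", "<end_of_documents>"

def reason_with_search(model, retrieve, question: str, max_turns: int = 8) -> str:
    prompt = f"Question: {question}\n<think>"
    for _ in range(max_turns):
        # Generate until the model either closes a search call or answers.
        chunk = model.generate(prompt, stop=[SEARCH_CLOSE, "</answer>"])
        prompt += chunk
        if SEARCH_OPEN in chunk:
            # Run the emitted query against the local corpus (or a web
            # search API) and splice the documents back into the context.
            query = chunk.rsplit(SEARCH_OPEN, 1)[-1].strip()
            docs = retrieve(query)
            prompt += f"{SEARCH_CLOSE}\n{DOCS_OPEN}\n{docs}\n{DOCS_CLOSE}\n"
        else:
            return prompt + "</answer>"
    return prompt
```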
- By relying solely on outcome-supervised reinforcement learning, we can activate the model's intrinsic search capabilities using only query-answer pairs, regardless of whether we are dealing with Base LLMs or Chat LLMs.
- Recent reinforcement learning algorithms, such as GRPO and Reinforce++, can both effectively activate the internal search capabilities of LLMs.
- There is no requirement for complex prompt engineering or process supervision during training.
- The capability of the Base LLM largely determines whether the model can start RL training directly from zero.
- LongCoT reasoning after RL is a more effective and efficient test-time scaling method than existing tree-search-based methods, e.g., Monte Carlo Tree Search.
- By using local retrieval for RL training, the model generalizes well to other datasets and to online search scenarios.
- The final 7B-parameter LLMs achieve significant performance improvements over existing complex methods and even closed-source LLMs (e.g., GPT-4o-mini).
We employ a Two-Stage Reward-Guided RL Training approach:
Stage 1: Learn to invoke search with only format-reward.
Stage 2: Learn to solve questions by invoking search, with both format-reward and answer-reward.
We use only outcome-supervised reinforcement learning for training, so we need to consider two main aspects: (1) the reinforcement learning algorithm, and (2) the design of the reward.
- RL Algorithm: We use Reinforce++ as our RL algorithm. For each question, we average the rewards of the n sampled rollouts, which stabilizes the training process. For the solution format, we utilize the <think>...</think> tag for thinking, <begin_of_search>...<end_of_search> for invoking the search tool, <begin_of_documents>...<end_of_documents> for the returned retrieval documents, and <answer>...</answer> for answering.
- Reward Design: In Stage 1, we use the retrieval reward: if the model performs retrieval and the solution meets the format requirements, 0.5 points are added to the answer reward. In Stage 2, the retrieval requirement is removed, and we utilize the F1-based answer reward. A penalty of 2 points is subtracted from the answer reward if the solution does not meet the format requirements. Detailed implementation, including hyperparameters, can be found in our code; a minimal sketch of the reward follows below.
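The two-stage reward described above can be sketched as follows. This is an illustrative reading of the scheme, not the repo's actual reward server; the function names and the token-level F1 variant are our assumptions.

```python
import re
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between predicted and gold answers (our assumed
    F1 variant; see the repo for the exact implementation)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def outcome_reward(solution: str, gold_answer: str, stage: int) -> float:
    """Two-stage outcome reward following the description above."""
    answer_match = re.search(r"<answer>(.*?)</answer>", solution, re.S)
    has_format = (re.search(r"<think>.*?</think>", solution, re.S) is not None
                  and answer_match is not None)
    has_retrieval = ("<begin_of_search>" in solution
                     and "<end_of_search>" in solution)
    if stage == 1:
        # Stage 1: format-only signal; +0.5 when the model both retrieves
        # and follows the required tag format.
        return 0.5 if (has_format and has_retrieval) else 0.0
    # Stage 2: F1-based answer reward, with a 2-point format penalty.
    if not has_format:
        return -2.0
    return f1_score(answer_match.group(1).strip(), gold_answer)
```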
We choose a portion of the training sets from HotpotQA and 2WikiMultiHopQA as our training data. We use Qwen-2.5-7B-Instruct to perform rollouts on the training dataset.
Based on the number of rollouts required to answer a question correctly, we classify the data into three categories: easy (fewer than 10 rollouts), medium (10 to 20 rollouts), and difficult (more than 20 rollouts). These categories are then mixed in a specific ratio to form our training data. All of our training data can be found here: https://huggingface.co/datasets/XXsongLALA/RAG-RL-Hotpotqa-with-2wiki.
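A hypothetical sketch of the bucketing rule; the handling of questions needing exactly 10 or 20 rollouts is our assumption, since the text only states <10, 10-20, and >20:

```python
def difficulty_bucket(rollouts_needed: int) -> str:
    """Classify a training question by how many Qwen-2.5-7B-Instruct
    rollouts were needed before one answered correctly."""
    if rollouts_needed < 10:
        return "easy"
    if rollouts_needed <= 20:  # boundary handling is our assumption
        return "medium"
    return "difficult"
```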
Following ReARTeR (https://arxiv.org/pdf/2501.07861), we select four representative benchmarks: HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle.
HotpotQA and 2WikiMultiHopQA are considered in-domain since we use their training sets, while Musique and Bamboogle are out-of-domain, allowing us to assess the generalization capabilities of our model. We randomly sample 500 examples from the development sets of HotpotQA, 2WikiMultiHopQA, and Musique to serve as our test sets. For Bamboogle, we use the entire test set (125 samples).
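Sampling the evaluation subsets could look like the following; the input path and seed are hypothetical, and the released files under data/eval_set/ are the ones actually used:

```python
import json
import random

random.seed(0)  # hypothetical seed; the report does not specify one

with open("hotpotqa_dev.jsonl") as f:  # hypothetical input path
    dev = [json.loads(line) for line in f]

subset = random.sample(dev, 500)
with open("data/eval_set/hotpotqa_500.jsonl", "w") as f:
    for example in subset:
        f.write(json.dumps(example) + "\n")
```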
Wikipedia passages serve as the retrieval corpus for all datasets; specifically, we use the Wikipedia corpus released by KILT in August 2019. Additionally, because Bamboogle requires more recent knowledge, we also evaluate with online web search to examine how well our model aligns with online search capabilities.
For the evaluation metrics, we use ACC_R (Cover-Exact-Match) and ACC_L (LLM-as-Judge).
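Cover-Exact-Match can be read as a containment check after normalization. The sketch below is our interpretation; evaluation/metric_calc_rule.py holds the actual rule:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def cover_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """ACC_R: correct if any normalized gold answer is contained in the
    normalized prediction."""
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers if normalize(g))
```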
As the results show, when using the same LLaMA-3.1-8B-Instruct base model, our method achieves significant improvements over existing methods, even surpassing closed-source models such as GPT-4o-mini. Furthermore, when switching to the more powerful Qwen-2.5-7B-Base model, we conduct reinforcement learning directly from scratch and achieve even better results, attaining the best performance on all in-domain and out-of-domain datasets and demonstrating the strong generalization capabilities of our model.
For Bamboogle, we additionally use Google for online search. Compared to relying solely on a local knowledge base, incorporating online search yields superior results, indicating that online search capabilities can be seamlessly integrated into our model.

Note: the environment is the same as STILL-3 (great work!).
conda create --name r1-searcher python=3.10.16
conda activate r1-searcher
pip install vllm==0.6.5
pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation
pip install deepspeed
pip install accelerate
pip install datasets
cd R1-Searcher
## Process wiki (abstracts only)
wget -nv --no-check-certificate https://rocketqa.bj.bcebos.com/corpus/nq.tar.gz
tar -zxf nq.tar.gz
rm -rf nq.tar.gz # We only use the title and abs.
## Process wiki full texts
wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
cd R1-Searcher
python wiki_corpus_index_bulid/split_kilt_to_100.py
## Index the TSV file. We recommend splitting the original TSV file into n parts for embedding (see the helper sketch after these commands); otherwise the process will be very slow.
python wiki_corpus_index_bulid/build_corpus_embedding.py --file_path the_tsv_file_path --save_path the_pickle_path --gpu_id 0
python wiki_corpus_index_bulid/build_corpus_idnex.py
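The recommended split could be done with a small helper like the one below; this is a hypothetical script, not part of the repo:

```python
import sys

def split_tsv(path: str, n_parts: int) -> None:
    """Split a corpus TSV into n_parts shards so each shard can be
    embedded separately (e.g., one GPU per shard via
    build_corpus_embedding.py)."""
    with open(path) as f:
        lines = f.readlines()
    shard = (len(lines) + n_parts - 1) // n_parts
    for i in range(n_parts):
        with open(f"{path}.part{i}", "w") as out:
            out.writelines(lines[i * shard : (i + 1) * shard])

if __name__ == "__main__":
    split_tsv(sys.argv[1], int(sys.argv[2]))
```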
cd R1-Searcher
## Ray start
bash scripts/ray_start.sh
## Mount Wikipedia
python train/wiki_corpus_load.py hotpotqa 5004 &
## Start Reward Server
python train/reward_server_qwen_zero.py --data_path data/training_set/stage_2.jsonl --reward_pretrain the_model_path --log_file results/samples/qwen.jsonl --port 1278
## Training
bash scripts/qwen_reinforce_plus_train.sh | tee results/logs/qwen_reinforce_plus_train.txt
cd R1-Searcher
## Local Search
## HotpotQA
python train/wiki_corpus_load.py hotpotqa 5004 &
python evaluation/eval_search_loacl.py --gpu_id 0 --temp 0.0 --port 5004 --prompt_type v0 --src_file data/eval_set/hotpotqa_500.jsonl --model_path the_path_to_model
## 2Wiki, Musique, Bamboogle
python train/wiki_corpus_load.py kilt 5005 &
python evaluation/eval_search_loacl.py --gpu_id 0 --temp 0.0 --port 5005 --prompt_type v0 --src_file data/eval_set/bamboogle_500.jsonl --model_path the_path_to_model
## Online Search
## Bamboogle
python evaluation/eval_search_online.py --gpu_id 0 --temp 0.0 --port 5004 --prompt_type v0 --src_file data/eval_set/bamboogle_500.jsonl --model_path the_path_to_model
## Calculate Metric
## Exact Match, Cover Exact Match, F1 Score
python evaluation/metric_calc_rule.py the_path_to_results
## LLM-as-Judge. Remember to replace the input file with your own results.
python evaluation/metric_calc_gpt_as_judge.py

Please kindly cite our report if it is helpful for your research.
@article{R1-searcher,
  title={R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning},
  author={Huatong Song and Jinhao Jiang and Yingqian Min and Jie Chen and Zhipeng Chen and Wayne Xin Zhao and Ji-Rong Wen and Yang Lu and Xu Miu},
  url={https://github.com/SsmallSong/R1-searcher},
  year={2025}
}
This project is released under the MIT License.
For any questions or feedback, please reach out to us at [email protected].
Alternative AI tools for R1-Searcher
Similar Open Source Tools
OREAL
OREAL is a reinforcement learning framework designed for mathematical reasoning tasks, aiming to achieve optimal performance through outcome reward-based learning. The framework utilizes behavior cloning, reshaping rewards, and token-level reward models to address challenges in sparse rewards and partial correctness. OREAL has achieved significant results, with a 7B model reaching 94.0 pass@1 accuracy on MATH-500 and surpassing previous 32B models. The tool provides training tutorials and Hugging Face model repositories for easy access and implementation.
aligner
Aligner is a model-agnostic alignment tool designed to efficiently correct responses from large language models. It redistributes initial answers to align with human intentions, improving performance across various LLMs. The tool can be applied with minimal training, enhancing upstream models and reducing hallucination. Aligner's 'copy and correct' method preserves the base structure while enhancing responses. It achieves significant performance improvements in helpfulness, harmlessness, and honesty dimensions, with notable success in boosting Win Rates on evaluation leaderboards.
NineRec
NineRec is a benchmark dataset suite for evaluating transferable recommendation models. It provides datasets for pre-training and transfer learning in recommender systems, focusing on multimodal and foundation model tasks. The dataset includes user-item interactions, item texts in multiple languages, item URLs, and raw images. Researchers can use NineRec to develop more effective and efficient methods for pre-training recommendation models beyond end-to-end training. The dataset is accompanied by code for dataset preparation, training, and testing in PyTorch environment.
magpie
This is the official repository for 'Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing'. Magpie is a tool designed to synthesize high-quality instruction data at scale by extracting it directly from aligned Large Language Models (LLMs). It aims to democratize AI by generating large-scale alignment data and enhancing the transparency of model alignment processes. Magpie has been tested on various model families and can be used to fine-tune models for improved performance on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
aimo-progress-prize
This repository contains the training and inference code needed to replicate the winning solution to the AI Mathematical Olympiad - Progress Prize 1. It consists of fine-tuning DeepSeekMath-Base 7B, high-quality training datasets, a self-consistency decoding algorithm, and carefully chosen validation sets. The training methodology involves Chain of Thought (CoT) and Tool Integrated Reasoning (TIR) training stages. Two datasets, NuminaMath-CoT and NuminaMath-TIR, were used to fine-tune the models. The models were trained using open-source libraries like TRL, PyTorch, vLLM, and DeepSpeed. Post-training quantization to 8-bit precision was done to improve performance on Kaggle's T4 GPUs. The project structure includes scripts for training, quantization, and inference, along with necessary installation instructions and hardware/software specifications.
bocoel
BoCoEL is a tool that leverages Bayesian Optimization to efficiently evaluate large language models by selecting a subset of the corpus for evaluation. It encodes individual entries into embeddings, uses Bayesian optimization to select queries, retrieves from the corpus, and provides easily managed evaluations. The tool aims to reduce computation costs during evaluation with a dynamic budget, supporting models like GPT2, Pythia, and LLAMA through integration with Hugging Face transformers and datasets. BoCoEL offers a modular design and efficient representation of the corpus to enhance evaluation quality.
asreview
The ASReview project implements active learning for systematic reviews, utilizing AI-aided pipelines to assist in finding relevant texts for search tasks. It accelerates the screening of textual data with minimal human input, saving time and increasing output quality. The software offers three modes: Oracle for interactive screening, Exploration for teaching purposes, and Simulation for evaluating active learning models. ASReview LAB is designed to support decision-making in any discipline or industry by improving efficiency and transparency in screening large amounts of textual data.
Instruct2Act
Instruct2Act is a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks. It generates Python programs using the LLM model for perception, planning, and action. The framework leverages foundation models like SAM and CLIP to convert high-level instructions into policy codes, accommodating various instruction modalities and task demands. Instruct2Act has been validated on robotic tasks in tabletop manipulation domains, outperforming learning-based policies in several tasks.
gepa
GEPA (Genetic-Pareto) is a framework for optimizing arbitrary systems composed of text components like AI prompts, code snippets, or textual specs against any evaluation metric. It employs LLMs to reflect on system behavior, using feedback from execution and evaluation traces to drive targeted improvements. Through iterative mutation, reflection, and Pareto-aware candidate selection, GEPA evolves robust, high-performing variants with minimal evaluations, co-evolving multiple components in modular systems for domain-specific gains. The repository provides the official implementation of the GEPA algorithm as proposed in the paper titled 'GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning'.
LazyLLM
LazyLLM is a low-code development tool for building complex AI applications with multiple agents. It assists developers in building AI applications at a low cost and continuously optimizing their performance. The tool provides a convenient workflow for application development and offers standard processes and tools for various stages of application development. Users can quickly prototype applications with LazyLLM, analyze bad cases with scenario task data, and iteratively optimize key components to enhance the overall application performance. LazyLLM aims to simplify the AI application development process and provide flexibility for both beginners and experts to create high-quality applications.
Woodpecker
Woodpecker is a tool designed to correct hallucinations in Multimodal Large Language Models (MLLMs) by introducing a training-free method that picks out and corrects inconsistencies between generated text and image content. It consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Woodpecker can be easily integrated with different MLLMs and provides interpretable results by accessing intermediate outputs of the stages. The tool has shown significant improvements in accuracy over baseline models like MiniGPT-4 and mPLUG-Owl.
aligner
Aligner is a model-agnostic alignment tool that learns correctional residuals between preferred and dispreferred answers using a small model. It can be directly applied to various open-source and API-based models with only one-off training, suitable for rapid iteration and improving model performance. Aligner has shown significant improvements in helpfulness, harmlessness, and honesty dimensions across different large language models.
baal
Baal is an active learning library that supports both industrial applications and research use cases. It provides a framework for Bayesian active learning methods such as Monte-Carlo Dropout, MCDropConnect, Deep ensembles, and Semi-supervised learning. Baal helps in labeling the most uncertain items in the dataset pool to improve model performance and reduce annotation effort. The library is actively maintained by a dedicated team and has been used in various research papers for production and experimentation.
ArcticTraining
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs). It offers modular trainer designs, simplified code structures, and integrated pipelines for creating and cleaning synthetic data, enabling users to enhance LLM capabilities like code generation and complex reasoning with greater efficiency and flexibility.
Controllable-RAG-Agent
This repository contains a sophisticated deterministic graph-based solution for answering complex questions using a controllable autonomous agent. The solution is designed to ensure that answers are solely based on the provided data, avoiding hallucinations. It involves various steps such as PDF loading, text preprocessing, summarization, database creation, encoding, and utilizing large language models. The algorithm follows a detailed workflow involving planning, retrieval, answering, replanning, content distillation, and performance evaluation. Heuristics and techniques implemented focus on content encoding, anonymizing questions, task breakdown, content distillation, chain of thought answering, verification, and model performance evaluation.
For similar tasks
NeoPass
NeoPass is a free Chrome extension designed for students taking tests on exam portals like Iamneo and Wildlife Ecology NPTEL. It provides features such as NPTEL integration, NeoExamShield bypass, AI chatbot with stealth mode, AI search answers/code, MCQ solving, tab switching bypass, pasting when restricted, and remote logout. Users can install the extension by following simple steps and use shortcuts for quick access to features. The tool is intended for educational purposes only and promotes academic integrity.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.
