
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

A revolutionary template-augmented reasoning paradigm that empowers a 32B model to outperform o1-mini and DeepSeek-R1 distilled models on reasoning tasks.

| Task / Pass@1 | ReasonFlux-F1-32B | ReasonFlux-Zero-32B | R1-Distill-32B | o1-mini | LIMO-32B | s1-32B |
| --- | --- | --- | --- | --- | --- | --- |
| MATH500 | 96.0 | 91.2 | 94.3 | 90.0 | 90.6 | 93.0 |
| AIME 2024 | 76.7 | 56.7 | 72.6 | 56.7 | 50.0 | 56.7 |
| AIME 2025 | 53.3 | 37.2 | 46.7 | 50.8 | 37.2 | 49.3 |
| GPQA-Diamond | 67.2 | 61.2 | 62.1 | 60.0 | 65.2 | 59.6 |

This repository provides official resources for the paper "ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates". Try our latest released model ReasonFlux-F1-32B.

Table of Contents (ReasonFlux-Zero & ReasonFlux-F1)

Updates

  • [2025/2/11] 🎉 We release the data and training scripts for the SFT stage, demo inference code, and the template library of ReasonFlux-Zero.

Dataset Links

Model Zoo

| Model | Download |
| --- | --- |
| ReasonFlux-F1-32B | 🤗 HuggingFace |
| ReasonFlux-F1-14B | 🤗 HuggingFace |
| ReasonFlux-F1-7B | 🤗 HuggingFace |

Getting Started

conda create -n ReasonFlux python==3.9
conda activate ReasonFlux
pip install -r requirements.txt

Training

Training ReasonFlux-F1

To train ReasonFlux-F1, you should follow the steps below (also refer to ./reasonflux-f1/README.md):

    Step 1: Add your data path to the file_name field of the ReasonFlux-F1 entry in LLaMA-Factory/data/dataset_info.json (a sketch of this entry follows the training command below).
    Step 2: Run the following command to train ReasonFlux-F1-32B:
llamafactory-cli train \
      --stage sft \
      --do_train True \
      --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --preprocessing_num_workers 16 \
      --finetuning_type full \
      --template qwen \
      --flash_attn auto \
      --dataset_dir data \
      --dataset ReasonFlux-F1 \
      --cutoff_len 16384 \
      --learning_rate 1e-05 \
      --num_train_epochs 5.0 \
      --max_samples 100000 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 2 \
      --lr_scheduler_type cosine \
      --max_grad_norm 1.0 \
      --logging_steps 5 \
      --save_steps 100 \
      --warmup_steps 0 \
      --packing False \
      --report_to none \
      --output_dir saves/DeepSeek-R1-Distill-Qwen-32B/full/ReasonFlux-F1 \
      --bf16 True \
      --plot_loss True \
      --trust_remote_code True \
      --ddp_timeout 180000000 \
      --include_num_input_tokens_seen True \
      --optim adamw_torch \
      --deepspeed cache/ds_z3_offload_config.json
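
For reference, the dataset registration in Step 1 might look like the sketch below. The file name is a placeholder for your own data path, and the column mapping is an assumption based on LLaMA-Factory's standard alpaca-style schema:

```json
{
  "ReasonFlux-F1": {
    "file_name": "path/to/your/reasonflux_f1_sft.json",
    "columns": {
      "prompt": "instruction",
      "response": "output"
    }
  }
}
```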
  
Training ReasonFlux-Zero

We utilize the open-source framework LLaMA-Factory for our training process.

Step 1: Add your data path to the file_name field of the ReasonFlux entry in LLaMA-Factory/data/dataset_info.json.
Step 2: Run the following command to fine-tune Qwen2.5-32B-Instruct on 8 A100 GPUs:
llamafactory-cli train \
      --stage sft \
      --do_train True \
      --model_name_or_path Qwen/Qwen2.5-32B-Instruct \
      --preprocessing_num_workers 16 \
      --finetuning_type full \
      --template qwen \
      --flash_attn auto \
      --dataset_dir train/LLaMA-Factory/data \
      --dataset ReasonFlux \
      --cutoff_len 2048 \
      --learning_rate 2e-05 \
      --num_train_epochs 3.0 \
      --max_samples 100000 \
      --per_device_train_batch_size 4 \
      --gradient_accumulation_steps 8 \
      --lr_scheduler_type cosine \
      --max_grad_norm 1.0 \
      --logging_steps 5 \
      --save_steps 100 \
      --warmup_steps 0 \
      --packing False \
      --report_to none \
      --output_dir saves/Qwen2.5-32B-Instruct/full \
      --bf16 True \
      --plot_loss True \
      --trust_remote_code True \
      --ddp_timeout 180000000 \
      --optim adamw_torch \
      --deepspeed cache/ds_z3_offload_config.json
  

Evaluation for ReasonFlux-F1

For evaluation, we reuse the evaluation framework from s1. It is a clone of lm-evaluation-harness at commit 4cec66e4e468d15789473d6d63c3a61a751fa524, modified to add some tasks. Setup:

cd reasonflux-f1/eval/lm-evaluation-harness
pip install -e .[math,vllm]

All commands are in eval/commands.sh. For AIME24 we always report the aime24_nofigures result, which uses a dataset that includes the AIME24 figures only when they are essential to the problem.

For example, to evaluate ReasonFlux-F1-32B on AIME24/25, MATH500 and GPQA-Diamond, you can use the command below:

OPENAI_API_KEY=<your-openai-key> lm_eval \
      --model vllm \
      --model_args pretrained=Gen-Verse/ReasonFlux-F1,dtype=float32,tensor_parallel_size=8,gpu_memory_utilization=0.95 \
      --tasks aime24_figures,aime25_nofigures,openai_math,gpqa_diamond_openai \
      --batch_size auto \
      --apply_chat_template \
      --output_path ReasonFlux-F1 \
      --log_samples \
      --gen_kwargs "max_gen_toks=32768"

Inference

ReasonFlux-F1

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = 'Gen-Verse/ReasonFlux-F1'

model = LLM(
    model_id,
    tensor_parallel_size=8,  # shard the 32B model across 8 GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# stop_token_ids = tokenizer("<|im_end|>\n")["input_ids"]

sampling_params = SamplingParams(
    max_tokens=32768,  # allow long reasoning traces
)
# 2022 AIME I Problems/Problem 15
question = """Let \(x, y\), and \(z\) be positive real numbers satisfying the system of equations:
\[
\begin{array}{c}
\sqrt{2 x-x y}+\sqrt{2 y-x y}=1 \\
\sqrt{2 y-y z}+\sqrt{2 z-y z}=\sqrt{2} \\
\sqrt{2 z-z x}+\sqrt{2 x-z x}=\sqrt{3} .
\end{array}
\]
Then \(\left[(1-x)(1-y)(1-z)\right]^{2}\) can be written as \(\frac{m}{n}\), where \(m\) and \(n\) are relatively prime positive integers. Find \(m+n\)."""
ds_prompt = "<|User|>\n" + question + "<|Assistant|>\n"  # DeepSeek-R1-style chat markers
output = model.generate(ds_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
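
Equivalently, you can let the tokenizer build the prompt from its bundled chat template instead of hard-coding the special tokens. A minimal sketch, assuming the checkpoint ships the DeepSeek-R1 chat template:

```python
# Build the prompt via the tokenizer's chat template (assumes one is bundled).
messages = [{"role": "user", "content": question}]
ds_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a string for vLLM
    add_generation_prompt=True,  # append the assistant turn marker
)
output = model.generate(ds_prompt, sampling_params=sampling_params)
print(output[0].outputs[0].text)
```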

ReasonFlux-Zero

When you complete your first-stage training, you can run reasoning with a few lines of code.

from reasonflux import ReasonFlux

reasonflux = ReasonFlux(navigator_path='path-to-navigator',
                        template_matcher_path='jinaai/jina-embeddings-v3',
                        inference_path='path-to-inference-model',
                        template_path='template_library.json')
problem = """Given a sequence {aₙ} satisfying a₁=3, and aₙ₊₁=2aₙ+5 (n≥1), find the general term formula aₙ"""
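
To actually run the pipeline on the problem, invoke the ReasonFlux object. The method name below is an assumption; check the repo's demo script for the exact entry point:

```python
# Hypothetical invocation; the actual method name may differ in the demo script.
result = reasonflux.reason(problem)
print(result)
```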

navigator_path is the path to the navigator; you can point it to your trained LLM after the SFT stage.

template_matcher_path is the path to the embedding model; set it to a local embedding model or download jina-embeddings-v3 from Hugging Face.

inference_path is the path to the reasoning model; you can test LLMs of different sizes, but we recommend using the same LLM as the navigator to save memory.

template_path is the path to our template library. The first time you run the code, we encode the template library for efficient query and retrieval and cache the embeddings, so it is normal for the first run to spend longer in the initialization stage before reasoning.
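
Conceptually, template matching amounts to embedding-based nearest-neighbor retrieval over the library. The sketch below illustrates that step; it is not the repository's actual implementation, and the JSON schema ("content" field) and caching details are assumptions:

```python
# Illustrative sketch of embedding-based template retrieval.
import json
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

with open("template_library.json") as f:
    templates = json.load(f)  # assumed: a list of entries with a "content" field

# Encode the library once; this is why the first run spends longer initializing.
texts = [t["content"] for t in templates]
template_embs = encoder.encode(texts, convert_to_tensor=True)

def retrieve(problem: str, top_k: int = 1):
    query_emb = encoder.encode(problem, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, template_embs)[0]
    return [templates[i] for i in scores.topk(top_k).indices.tolist()]
```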

You can test your trained model after the SFT stage to see whether it retrieves accurate templates for a given problem and solves it in our demo implementation.

🚨 Note: if you choose to use jina-embeddings-v3, make sure flash-attn is not installed in your environment; it conflicts with the model, and encoding the query and template library will fail.

Performance

We present the evaluation results of ReasonFlux-F1-32B on challenging reasoning tasks including AIME 2024, AIME 2025, MATH500, and GPQA-Diamond. For a fair comparison, we report results for all LLMs under the ReasonFlux-F1 evaluation scripts.

| Model | AIME2024 pass@1 | AIME2025 pass@1 | MATH500 pass@1 | GPQA pass@1 |
| --- | --- | --- | --- | --- |
| QwQ-32B-Preview | 46.7 | 37.2 | 90.6 | 65.2 |
| LIMO-32B | 56.3 | 44.5 | 94.8 | 58.1 |
| s1-32B | 56.7 | 49.3 | 93.0 | 59.6 |
| OpenThinker-32B | 66.0 | 53.3 | 94.8 | 60.1 |
| R1-Distill-32B | 70.0 | 46.7 | 92.0 | 59.6 |
| ReasonFlux-Zero-32B | 56.7 | 37.2 | 91.2 | 61.2 |
| ReasonFlux-F1-32B | 76.7 | 53.3 | 96.0 | 67.2 |

Reasoning Example

[Image: reasoning example of ReasonFlux-F1; see the example figure in the repository.]

Preliminary Work

ReasonFlux builds mainly on our preliminary works, Buffer of Thoughts and SuperCorrect.

Citation

@article{yang2025reasonflux,
  title={ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates},
  author={Yang, Ling and Yu, Zhaochen and Cui, Bin and Wang, Mengdi},
  journal={arXiv preprint arXiv:2502.06772},
  year={2025}
}
