dLLM-RL

TraceRL: Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models


dLLM-RL is a reinforcement learning framework designed for Diffusion Large Language Models. It supports models with diverse structures and provides inference acceleration, RL training, and SFT functionality. It introduces TraceRL for trajectory-aware RL, together with diffusion-based value models for optimization stability. Users can download and try models such as TraDo-4B-Instruct and TraDo-8B-Instruct. The framework also supports multi-node setups, makes it easy to build new reinforcement learning methods, and offers supervised fine-tuning strategies for different models and tasks.

README:


Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models


Overview

🌱 Features

  • Model Support: TraDo, SDAR, Dream, LLaDA, MMaDA, Diffu-Coder. We support models with diverse structures, including full-attention models, adapted models, and block-attention models.
  • Inference Acceleration: improved KV-cache, jetengine (based on nano-vllm), and multiple sampling strategies; multi-node support; easy to build your own accelerated inference methods.
  • RL Training: TraceRL (with optional diffusion value model), coupled RL, and random masking RL with accelerated sampling, covering math, coding, and general RL tasks; multi-node support; easy to build your own reinforcement learning methods across diverse settings.
  • SFT: block SFT, semi-AR SFT, and random masking SFT, with multi-node support and long-CoT fine-tuning.

🧠 RL Methods (TraceRL) & Models (TraDo)

We propose TraceRL, a trajectory-aware reinforcement learning method for diffusion language models, which demonstrates the best performance among RL approaches for DLMs. We also introduce a diffusion-based value model that reduces variance and improves stability during optimization.
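
Schematically, and in our own notation rather than necessarily the paper's exact formulation: if a sampled trajectory unmasks the token set $U_t$ at denoising step $t$, with state $s_t$ and an advantage estimate $\hat{A}_t$ (e.g. from the diffusion value model), a trajectory-aware policy gradient credits each token's log-likelihood to the step at which it was actually generated:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \hat{A}_t \sum_{i \in U_t} \nabla_\theta \log \pi_\theta(x_i \mid s_t)\Big],$$

in contrast to random-masking objectives, which score tokens under masks unrelated to the order in which they were generated.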

Based on TraceRL, we derive a series of diffusion language models, TraDo, which achieve state-of-the-art performance on math and coding reasoning tasks. TraDo-4B-Instruct and TraDo-8B-Instruct are trained solely with TraceRL, while TraDo-8B-Thinking, the first long-CoT diffusion language model, is obtained through a combination of TraceRL and SFT on long-CoT data. TraDo models challenge AR models with strong empirical results, as shown in the table below.

[Main results table: TraDo vs. AR baselines on math and coding reasoning tasks]

You can download and try our models as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer
from generate import block_diffusion_generate

model_name = "Gen-Verse/TraDo-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="float16", device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "What's the solution of x^2 - 2x + 1 = 0\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

tokens = tokenizer.batch_encode_plus([text], return_tensors='pt', padding=True, truncation=True, max_length=200)
tokens = {k: v.to(model.device) for k, v in tokens.items()}

output_ids = block_diffusion_generate(
    model,
    prompt=tokens,
    mask_id=151669,  # token id of the mask token (decoded as <|MASK|>)
    gen_length=200,
    block_length=4, denoising_steps=4,
    temperature=1.0, top_k=0, top_p=1.0,
    remasking_strategy="low_confidence_dynamic",  # dynamic, threshold-based sampling
    confidence_threshold=0.9
)

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
cleaned_text = output_text.replace('<|MASK|>', '').replace('<|endoftext|>', '')
print(cleaned_text)

📰 Latest Updates

🚀 Quick Start

conda create --name dllm-rl python=3.10
source activate dllm-rl
pip install torch==2.6.0
pip install --no-cache-dir \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt

⚙️ Data

You can navigate to ./data to download datasets for evaluation and training, for example as follows. In that directory, you will also find detailed instructions on how to add your own datasets.

cd data
python download_data.py --dataset MATH500
python download_data.py --dataset MATH_train
cd ..

After downloading the data, you are almost ready to evaluate or train diffusion language models. The only remaining step is to select (or create) a config file in ./configs that corresponds to your project, then use the commands in the following sections. Details on how to select and modify (or create) a config file are provided in ./configs.

📊 Inference & Evaluations

After downloading the data, take TraDo models as an example. You can set the configurations in configs/trado_eval.yaml (see instructions and details in ./configs) and run the following commands to perform inference with different sampling strategies.

python eval.py config=configs/trado_eval.yaml
# python eval.py config=configs/trado_longcot_eval.yaml
# python eval.py config=configs/sdar_eval.yaml
# python eval.py config=configs/dream_eval.yaml
# python eval.py config=configs/llada_eval.yaml
# see details in ./configs

Use trado_eval.yaml for TraDo models' inference, sdar_eval.yaml for SDAR, dream_eval.yaml for Dream and Diffu-Coder, and llada_eval.yaml for LLaDA and MMaDA. Instructions on how to set the configurations are provided in the corresponding configuration files.
In evaluation we support both general tasks and coding tasks, including automated execution of generated code.
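
For coding tasks, automated execution typically means running each generated program against test inputs in a subprocess with a hard timeout. A minimal sketch of such a harness (our own illustration, not the repo's evaluator):

import subprocess, tempfile, os

def run_candidate(code: str, stdin: str, timeout: float = 5.0) -> str | None:
    # Write the generated program to a temporary file, execute it with a
    # timeout, and return its stdout (or None on failure/timeout).
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], input=stdin, timeout=timeout,
                                capture_output=True, text=True)
        return result.stdout if result.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None
    finally:
        os.remove(path)

A real harness would additionally sandbox the process and compare stdout against the expected test outputs.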

There are two main sampling methods to choose from:

Static Sampling: unmask a fixed number of tokens at each denoising step.

Dynamic Sampling: unmask every token whose confidence exceeds a chosen threshold; typically faster than static sampling.
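
As a minimal illustration of the difference (a self-contained sketch over 1-D tensors; the repo's actual sampling code is more involved):

import torch

def static_unmask(confidence, mask, k):
    # Unmask exactly the k masked positions with the highest confidence.
    conf = confidence.masked_fill(~mask, float("-inf"))  # ignore unmasked slots
    idx = conf.topk(k).indices
    mask[idx] = False
    return idx

def dynamic_unmask(confidence, mask, threshold, min_unmask=1):
    # Unmask every masked position whose confidence exceeds the threshold,
    # but always unmask at least `min_unmask` tokens so decoding progresses.
    conf = confidence.masked_fill(~mask, float("-inf"))
    idx = (conf > threshold).nonzero(as_tuple=True)[0]
    if idx.numel() < min_unmask:
        idx = conf.topk(min_unmask).indices
    mask[idx] = False
    return idx

Dynamic sampling finishes in fewer steps whenever many tokens clear the threshold at once, which is why it is typically faster; confidence_threshold=0.9 in the earlier generation example plays this role.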

To see how diffusion language models sample, open ./sample/trace.viewer.html in your browser, or generate a trajectory yourself with ./sample/get_trace_viewer.py.

You can also perform inference across multiple nodes using multinode_eval.py with the same configuration files, after only the minor modifications described in those files. In the multi-node setup, the first node controls the others. You can run
python multinode_eval.py config=configs/dream_multinode_eval.yaml on the first node to evaluate, or submit the following as the entry command for a job:

if [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then
    # the first node (rank 0) launches the evaluation and drives the others
    python multinode_eval.py config=configs/dream_multinode_eval.yaml
else
    # worker nodes just stay alive, waiting to be controlled by rank 0
    exec tail -f /dev/null
fi
# python multinode_eval.py config=configs/trado_longcot_multinode_eval.yaml
# python multinode_eval.py config=configs/llada_multinode_eval.yaml
# ...

🔧 Reinforcement Learning

After downloading the data and model and setting the configuration, you can start reinforcement learning simply with:

python rl.py config=configs/rl_trado.yaml
# python rl.py config=configs/rl_sdar.yaml
# python rl.py config=configs/rl_dream.yaml
# python rl.py config=configs/rl_llada.yaml
# python rl.py config=configs/rl_mmada.yaml
# see details in ./configs

We support TraceRL (optionally with a diffusion-based value model), Coupled RL, and random masking RL across different diffusion language models. In all cases, the sampling process is accelerated with the KV-cache.

TraceRL: We optimize the policy according to how it actually generates sequences. For block-attention models, training can be performed efficiently thanks to block attention. For full-attention models, we introduce a shrinkage parameter s that aggregates every s neighboring steps to accelerate training. We also provide an optional value model for TraceRL, which we find reduces variance and improves training stability, enabling larger learning rates or fewer gradient-accumulation steps more reliably than training without a value model.
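
As a rough sketch of the trajectory-aware loss under our own simplifications (hypothetical tensor layout; the actual implementation in this repo differs):

import torch

def trace_policy_loss(logps, advantages, step_of_token, s=1):
    # logps:         (L,) log-probs the current policy assigns to the generated
    #                tokens, each evaluated at the state in which it was unmasked
    # advantages:    (T,) per-step advantage estimates (e.g. from the value model);
    #                T is assumed divisible by s in this sketch
    # step_of_token: (L,) denoising step at which each token was unmasked
    # s:             shrinkage parameter; merging every s neighboring steps means
    #                a full-attention model needs T/s forward passes instead of T
    grouped_adv = advantages.reshape(-1, s).mean(dim=1)  # (T/s,)
    grouped_step = step_of_token // s                    # token -> merged step
    # Advantage-weighted log-likelihood, credited to the (merged) step at which
    # each token was actually generated -- the trajectory-aware part.
    return -(grouped_adv[grouped_step] * logps).mean()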

Random Masking RL: Sampled responses are randomly masked and used as training data with a PPO-like objective.

Coupled RL: For each sampled random mask, Coupled RL additionally uses its complement as an extra training sample.
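
A minimal sketch of how such a coupled pair of masks can be drawn (our own illustration, not the repo's code): sample one random mask over the response tokens and pair it with its complement, so every token is supervised in exactly one of the two samples.

import torch

def coupled_masks(seq_len, mask_ratio=0.5):
    # One random mask over the response tokens ...
    m = torch.rand(seq_len) < mask_ratio
    # ... plus its complement: together they cover each position exactly once.
    return m, ~m

mask_a, mask_b = coupled_masks(seq_len=16)
assert bool((mask_a ^ mask_b).all())  # every position masked in exactly one sample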

We also support a multi-node RL framework; you can submit the following as the entry command:

if [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then   
    python multinode_rl.py config=configs/multinode_rl_trado.yaml
else
    exec tail -f /dev/null
fi
# python multinode_rl.py config=configs/multinode_rl_sdar.yaml
# python multinode_rl.py config=configs/multinode_rl_dream.yaml
# python multinode_rl.py config=configs/multinode_rl_llada.yaml
# python multinode_rl.py config=configs/multinode_rl_mmada.yaml

🔧 Supervised Finetuning

After downloading the data and setting the configurations, you can start supervised fine-tuning with:

accelerate launch \
  --num_machines 1 \
  --machine_rank 0 \
  --main_process_ip 127.0.0.1 \
  --main_process_port 8888 \
  --config_file accelerate_configs/1_node_8_gpus_deepspeed_zero3.yaml \
  train/sft_trado.py \
  config=configs/sft_trado.yaml
# sft_sdar.py, sft_sdar.yaml
# sft_dream.py, sft_dream.yaml
# sft_llada.py, sft_llada.yaml
# sft_mmada.py, sft_mmada.yaml
# see details in ./configs

We support different SFT strategies for different models; a sketch of the random-masking objective follows the list.

Block diffusion models (e.g., TraDo and SDAR): support semi-autoregressive fine-tuning or trace fine-tuning (requires setting a specific trace first).

Adapted full-attention models (e.g., Dream and DiffuCoder): support the semi-autoregressive method (using sliced data), random-masking SFT, and AR training (i.e., standard SFT for LLMs).

Pretrained full-attention models (e.g., LLaDA and MMaDA): support semi-autoregressive and random-masking SFT.
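
A minimal sketch of the random-masking SFT objective (LLaDA-style masked cross-entropy, written under our own assumptions: an HF-style model returning .logits, a mask_id token, and the usual 1/t re-weighting; not lifted from this repo's training code):

import torch
import torch.nn.functional as F

def random_masking_sft_loss(model, input_ids, response_mask, mask_id):
    # Sample a masking ratio t in (0, 1] and mask roughly that fraction of
    # response tokens; prompt tokens are left intact.
    t = torch.rand(()).clamp_min(1e-3)
    masked = (torch.rand_like(input_ids, dtype=torch.float) < t) & response_mask
    noisy = input_ids.masked_fill(masked, mask_id)
    logits = model(input_ids=noisy).logits
    # Cross-entropy only on the masked positions, re-weighted by 1/t so the
    # objective stays comparable across sampled masking ratios.
    return F.cross_entropy(logits[masked], input_ids[masked]) / t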

To train on multiple nodes, simply run:

accelerate launch \
  --num_machines $MLP_WORKER_NUM \
  --machine_rank $MLP_ROLE_INDEX \
  --main_process_ip $MLP_WORKER_0_HOST \
  --main_process_port $MLP_WORKER_0_PORT \
  --config_file accelerate_configs/4_node_8_gpus_deepspeed_zero3.yaml \
  train/sft_dream.py \
  config=configs/sft_dream.yaml
# sft_trado.py, sft_trado.yaml
# ...

🤝 Acknowledgement

This work builds heavily on the following open-source models:

SDAR, Dream, LLaDA, MMaDA, and Diffu-Coder;

these acceleration methods (engines):

Fast-dLLM and jetengine;

and these theoretical foundations:

MDLM, DiffuLLaMA, and Block Diffusion.

📖 Citation

@article{wang2025revolutionizing,
  title={Revolutionizing reinforcement learning framework for diffusion large language models},
  author={Wang, Yinjie and Yang, Ling and Li, Bowen and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2509.06949},
  year={2025}
}
