Awesome-System2-Reasoning-LLM

Latest Advances on System-2 Reasoning

Stars: 863

Visit

The Awesome-System2-Reasoning-LLM repository is dedicated to a survey paper titled 'From System 1 to System 2: A Survey of Reasoning Large Language Models'. It explores the development of reasoning Large Language Models (LLMs), their foundational technologies, benchmarks, and future directions. The repository provides resources and updates related to the research, tracking the latest developments in the field of reasoning LLMs.

README:

Awesome-System2-Reasoning-LLM

📢 Updates

2025.02: We released a survey paper "From System 1 to System 2: A Survey of Reasoning Large Language Models". Feel free to cite or open pull requests.

👀 Introduction

Welcome to the repository for our survey paper, "From System 1 to System 2: A Survey of Reasoning Large Language Models". This repository provides resources and updates related to our research. For a detailed introduction, please refer to our survey paper.

Achieving human-level intelligence requires enhancing the transition from System 1 (fast, intuitive) to System 2 (slow, deliberate) reasoning. While foundational Large Language Models (LLMs) have made significant strides, they still fall short of human-like reasoning in complex tasks. Recent reasoning LLMs, like OpenAI’s o1, have demonstrated expert-level performance in domains such as mathematics and coding, resembling System 2 thinking. This survey explores the development of reasoning LLMs, their foundational technologies, benchmarks, and future directions. We maintain an up-to-date GitHub repository to track the latest developments in this rapidly evolving field.

This image highlights the progression of AI systems, emphasizing the shift from rapid, intuitive approaches to deliberate, reasoning-driven models. It shows how AI has evolved to handle a broader range of real-world challenges.

The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.

📒 Table of Contents

Awesome-System-2-AI

Part 1: O1 Replication

Open-Reasoner-Zero [Paper]
X-R1 [github]
Unlock-Deepseek [Blog]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
LLM-R1 [github]
mini-deepseek-r1 [Blog]
Run DeepSeek R1 Dynamic 1.58-bit [Blog]
Simple Reinforcement Learning for Reasoning [Notion]
TinyZero [github]
Open R1 [github]
Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [Paper]
o1-Coder: an o1 Replication for Coding [Paper]
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
DRT: Deep Reasoning Translation via Long Chain-of-Thought [Paper]
Enhancing LLM Reasoning with Reward-guided Tree Search [Paper]
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [Paper]
O1 Replication Journey: A Strategic Progress Report -- Part 1 [Paper]
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [Paper]

Part 2: Process Reward Models

ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [Paper]
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. [Paper]
ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [Paper]
The Lessons of Developing Process Reward Models in Mathematical Reasoning. [Paper]
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark. [Paper]
AutoPSV: Automated Process-Supervised Verifier [Paper]
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
Free Process Rewards without Process Labels. [Paper]
Outcome-Refining Process Supervision for Code Generation [Paper]
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [Paper]
OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [Paper]
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper]
Let's Verify Step by Step. [Paper]
Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
Making Large Language Models Better Reasoners with Step-Aware Verifier [Paper]
Solving Math Word Problems with Process and Outcome-Based Feedback [Paper]
Uncertainty-Aware Step-wise Verification with Generative Reward Models [Paper]
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [Paper]
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [Paper]
Unified Reward Model for Multimodal Understanding and Generation [Paper]
Multi-head Reward Aggregation Guided by Entropy [Paper]

Part 3: Reinforcement Learning

Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [Paper]
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Paper]
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [Paper]
Process Reinforcement through Implicit Rewards [Paper]
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [Paper]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
Kimi k1.5: Scaling Reinforcement Learning with LLMs [Paper]
Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [Paper]
Offline Reinforcement Learning for LLM Multi-Step Reasoning [Paper]
ReFT: Representation Finetuning for Language Models [Paper]
Deepseekmath: Pushing the limits of mathematical reasoning in open language models [Paper]
Reasoning with Reinforced Functional Token Tuning [Paper]
Value-Based Deep RL Scales Predictably [Paper]
InfAlign: Inference-aware language model alignment [Paper]
LIMR: Less is More for RL Scaling [Paper]
A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [Paper]
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [Paper]
Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning [Paper]
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
Visual-RFT: Visual Reinforcement Fine-Tuning [Paper]
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [Paper]
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [Paper]
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
Improve Vision Language Model Chain-of-thought Reasoning [Paper]
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [Paper]
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [Paper]

Part 4: MCTS/Tree Search

ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [Paper]
On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [Paper]
Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper]
Proposing and solving olympiad geometry with guided tree search [Paper]
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [Paper]
Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [Paper]
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [Paper]
GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [Paper]
MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [Paper]
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [Paper]
Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [Paper]
AFlow: Automating Agentic Workflow Generation [Paper]
Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper]
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [Paper]
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [Paper]
Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [Paper]
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [Paper]
LiteSearch: Efficacious Tree Search for LLM [Paper]
Tree Search for Language Model Agents [Paper]
Uncertainty-Guided Optimization on Large Language Model Search Trees [Paper]
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [Paper]
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [Paper]
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [Paper]
AlphaMath Almost Zero: process Supervision without process [Paper]
Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [Paper]
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [Paper]
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
Stream of Search (SoS): Learning to Search in Language [Paper]
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [Paper]
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [Paper]
Reasoning with Language Model is Planning with World Model [Paper]
Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [Paper]
ALPHAZERO-LIKE TREE-SEARCH CAN GUIDE LARGE LANGUAGE MODEL DECODING AND TRAINING [Paper]
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [Paper]
MAKING PPO EVEN BETTER: VALUE-GUIDED MONTE-CARLO TREE SEARCH DECODING [Paper]
Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [Paper]
Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [Paper]
Fine-grained Conversational Decoding via Isotropic and Proximal Search [Paper]
Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [Paper]
Look-back Decoding for Open-Ended Text Generation [Paper]
PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament [Paper]
LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [Paper]
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [Paper]
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [Paper]

Part 5: Self-Training / Self-Improve

Small LLMs Can Master Reasoning with Self-Evolved Deep Thinking (Rstar-Math) [Paper]
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [Paper]
ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [Paper]
ReFT: Representation Finetuning for Language Models [Paper]
Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [Paper]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [Paper]
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper]
V-star: Training Verifiers for Self-Taught Reasoners [Paper]
Self-Refine: Iterative Refinement with Self-Feedback [Paper]
ReST: Reinforced Self-Training for Language Modeling [Paper]
STaR: Bootstrapping Reasoning With Reasoning [Paper]
Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [Paper]
Self-Improvement in Language Models: The Sharpening Mechanism [Paper]
Enabling Scalable Oversight via Self-Evolving Critic [Paper]
S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
ProgCo: Program Helps Self-Correction of Large Language Models [Paper]
Self-Refine: Iterative Refinement with Self-Feedback [Paper]
SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [Paper]
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
Large Language Models are Better Reasoners with Self-Verification [Paper]
Self-Evaluation Guided Beam Search for Reasoning [Paper]
Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [Paper]
Language Models can Self-Improve at State-Value Estimation for Better Search [Paper]
Self-Training Elicits Concise Reasoning in Large Language Models [Paper]

Part 6: Reflection

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [Paper]
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper]
AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [Paper]
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper]
Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities [Paper]
Reflection-Tuning: An Approach for Data Recycling [Paper]
Learning From Mistakes Makes LLM Better Reasoner [Paper]
SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [Paper]

Part 7: Efficient System2

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [Paper]
Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [Paper]
DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [Paper]
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [Paper]
Token-Budget-Aware LLM Reasoning [Paper]
Training Large Language Models to Reason in a Continuous Latent Space [Paper]
Guiding Language Model Reasoning with Planning Tokens [Paper]
One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [Paper]
Small Models Struggle to Learn from Strong Reasoners [Paper]
TokenSkip: Controllable Chain-of-Thought Compression in LLMs [Paper]
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [Paper]
Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning [Paper]
Thinking Preference Optimization [Paper]
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [Paper]
Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options [Paper]
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [Paper]
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [Paper]
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [Paper]
Atom of Thoughts for Markov LLM Test-Time Scaling [Paper]
Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [Paper]
Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models [Paper]
Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [Paper]
Titans: Learning to Memorize at Test Time [Paper]
MoBA: Mixture of Block Attention for Long-Context LLMs [Paper]
AutoReason: Automatic Few-Shot Reasoning Decomposition [Paper]
Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [Paper]
Scalable Language Models with Posterior Inference of Latent Thought Vectors [Paper]
Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [Paper]
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [Paper]
LightThinker: Thinking Step-by-Step Compression [Paper]
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities [Paper]
Reasoning with Latent Thoughts: On the Power of Looped Transformers [Paper]
Efficient Reasoning with Hidden Thinking [Paper]
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models [Paper]
Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study [Paper]
Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [Paper]
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [Paper]
MixLLM: Dynamic Routing in Mixed Large Language Models [Paper]
PEARL: Towards Permutation-Resilient LLMs [Paper]
Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [Paper]
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [Paper]
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [Paper]
Training Large Language Models to be Better Rule Followers [Paper]
SIFT: Grounding LLM Reasoning in Contexts via Stickers [Paper]
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [Paper]
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [Paper]
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [Paper]
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [Paper]
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [Paper]
From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs [Paper]
MALT: Improving Reasoning with Multi-Agent LLM Training [Paper]
Efficient Reasoning with Hidden Thinking [Paper]
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [Paper]
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [Paper]
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation [Paper]
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [Paper]
Entropy-based Exploration Conduction for Multi-step Reasoning [Paper]
MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion [Paper]
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [Paper]
ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [Paper]
Agent models: Internalizing Chain-of-Action Generation into Reasoning models [Paper]
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [Paper]
StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error [Paper]
Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [Paper]
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [Paper]
Shared Global and Local Geometry of Language Model Embeddings [Paper]

Part 8: Explainability

Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [Paper]
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [Paper]
When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [Paper]
The Impact of Reasoning Step Length on Large Language Models [Paper]
Distilling System 2 into System 1 [Paper]
System 2 Attention (is something you might need too) [Paper]
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [Paper]
Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [Paper]
Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [Paper]
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [Paper]

Part 9: Multimodal Agent related Slow-Fast System

Diving into Self-Evolving Training for Multimodal Reasoning [Paper]
Visual Agents as Fast and Slow Thinkers [Paper]
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [Paper]
Slow Perception: Let's Perceive Geometric Figures Step-by-Step [Paper]
AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [Paper]
RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [Paper]

Part 10: Benchmark and Datasets

Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [Paper]
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [Paper]
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [Paper]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [Paper]
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [Paper]
EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [Paper]
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [Paper]
Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [Paper]
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [Paper]
Evaluation of OpenAI o1: Opportunities and Challenges of AGI [Paper]
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [Paper]
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [Paper]
Humanity's Last Exam [Paper]
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [Paper]
LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [Paper]
BIG-Bench Extra Hard [Paper]
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [paper]
MastermindEval: A Simple But Scalable Reasoning Benchmark [paper]
DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs [paper]
ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [paper]
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [paper]
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [paper]
V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [github]
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [paper]

Part 11: Reasoning and Safety

Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [Paper]
OverThink: Slowdown Attacks on Reasoning LLMs [Paper]
GuardReasoner: Towards Reasoning-based LLM Safeguards [Paper]
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [Paper]
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [Paper]
BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [Paper]
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [Paper]
Deliberative Alignment: Reasoning Enables Safer Language Models [Paper]
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [Blog]
OpenAI trained o1 and o3 to ‘think’ about its safety policy [Blog]
Why AI Safety Researchers Are Worried About DeepSeek [Blog]
Measuring Faithfulness in Chain-of-Thought Reasoning [Blog]

Part 12: R1 Driven Multimodal Reasoning Enhancement

Efficient-R1-VLLM: Efficient RL-Tuned MoE Vision-Language Model For Reasoning [github]
MMR1: Advancing the Frontiers of Multimodal Reasoning [github]
Skywork-R1V: Pioneering Multimodal Reasoning with CoT [github]
Open R1 Video [github]
R1-Vision: Let's first take a look at the image [github]
VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Blog]
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [paper]
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [paper]
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [paper]
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [paper]
Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [paper]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [paper]
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [paper]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [paper]
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [paper]

Citation

If you find this work useful, welcome to cite us.

@misc{li202512surveyreasoning,
      title={From System 1 to System 2: A Survey of Reasoning Large Language Models}, 
      author={Zhong-Zhi Li and Duzhen Zhang and Ming-Liang Zhang and Jiaxin Zhang and Zengyan Liu and Yuxuan Yao and Haotian Xu and Junhao Zheng and Pei-Jie Wang and Xiuyi Chen and Yingying Zhang and Fei Yin and Jiahua Dong and Zhijiang Guo and Le Song and Cheng-Lin Liu},
      year={2025},
      eprint={2502.17419},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.17419}, 
}

⭐ Star History

For Tasks:

Click tags to check more tools for each tasks

analyze reasoning llms track llm developments explore llm benchmarks research ai safety improve reasoning capabilities

For Jobs:

research scientist machine learning engineer data scientist ai researcher natural language processing specialist

Alternative AI tools for Awesome-System2-Reasoning-LLM

Similar Open Source Tools

Awesome-System2-Reasoning-LLM

github

: 863

LLM-for-misinformation-research

LLM-for-misinformation-research is a curated paper list of misinformation research using large language models (LLMs). The repository covers methods for detection and verification, tools for fact-checking complex claims, decision-making and explanation, claim matching, post-hoc explanation generation, and other tasks related to combating misinformation. It includes papers on fake news detection, rumor detection, fact verification, and more, showcasing the application of LLMs in various aspects of misinformation research.

github

: 78

LLM-IR-Bias-Fairness-Survey

LLM-IR-Bias-Fairness-Survey is a collection of papers related to bias and fairness in Information Retrieval (IR) with Large Language Models (LLMs). The repository organizes papers according to a survey paper titled 'Bias and Unfairness in Information Retrieval Systems: New Challenges in the LLM Era'. The survey provides a comprehensive review of emerging issues related to bias and unfairness in the integration of LLMs into IR systems, categorizing mitigation strategies into data sampling and distribution reconstruction approaches.

github

: 52

awesome-LLM-AIOps

The 'awesome-LLM-AIOps' repository is a curated list of academic research and industrial materials related to Large Language Models (LLM) and Artificial Intelligence for IT Operations (AIOps). It covers various topics such as incident management, log analysis, root cause analysis, incident mitigation, and incident postmortem analysis. The repository provides a comprehensive collection of papers, projects, and tools related to the application of LLM and AI in IT operations, offering valuable insights and resources for researchers and practitioners in the field.

github

: 54

Awesome-Efficient-AIGC

This repository, Awesome Efficient AIGC, collects efficient approaches for AI-generated content (AIGC) to cope with its huge demand for computing resources. It includes efficient Large Language Models (LLMs), Diffusion Models (DMs), and more. The repository is continuously improving and welcomes contributions of works like papers and repositories that are missed by the collection.

github

: 145

Awesome-LLM-Compression

Awesome LLM compression research papers and tools to accelerate LLM training and inference.

github

: 1.4k

llm-continual-learning-survey

This repository is an updating survey for Continual Learning of Large Language Models (CL-LLMs), providing a comprehensive overview of various aspects related to the continual learning of large language models. It covers topics such as continual pre-training, domain-adaptive pre-training, continual fine-tuning, model refinement, model alignment, multimodal LLMs, and miscellaneous aspects. The survey includes a collection of relevant papers, each focusing on different areas within the field of continual learning of large language models.

github

: 215

awesome-embodied-vla-va-vln

github

: 1.6k

Awesome-LLMs-in-Graph-tasks

This repository is a collection of papers on leveraging Large Language Models (LLMs) in Graph Tasks. It provides a comprehensive overview of how LLMs can enhance graph-related tasks by combining them with traditional Graph Neural Networks (GNNs). The integration of LLMs with GNNs allows for capturing both structural and contextual aspects of nodes in graph data, leading to more powerful graph learning. The repository includes summaries of various models that leverage LLMs to assist in graph-related tasks, along with links to papers and code repositories for further exploration.

github

: 552

lobe-cli-toolbox

Lobe CLI Toolbox is an AI CLI Toolbox designed to enhance git commit and i18n workflow efficiency. It includes tools like Lobe Commit for generating Gitmoji-based commit messages and Lobe i18n for automating the i18n translation process. The toolbox also features Lobe label for automatically copying issues labels from a template repo. It supports features such as automatic splitting of large files, incremental updates, and customization options for the OpenAI model, API proxy, and temperature.

github

: 232

Efficient-LLMs-Survey

This repository provides a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from **model-centric** , **data-centric** , and **framework-centric** perspective, respectively. We hope our survey and this GitHub repository can serve as valuable resources to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

github

: 1.1k

AwesomeLLM4APR

Awesome LLM for APR is a repository dedicated to exploring the capabilities of Large Language Models (LLMs) in Automated Program Repair (APR). It provides a comprehensive collection of research papers, tools, and resources related to using LLMs for various scenarios such as repairing semantic bugs, security vulnerabilities, syntax errors, programming problems, static warnings, self-debugging, type errors, web UI tests, smart contracts, hardware bugs, performance bugs, API misuses, crash bugs, test case repairs, formal proofs, GitHub issues, code reviews, motion planners, human studies, and patch correctness assessments. The repository serves as a valuable reference for researchers and practitioners interested in leveraging LLMs for automated program repair.

github

: 88

awesome-llm-understanding-mechanism

This repository is a collection of papers focused on understanding the internal mechanism of large language models (LLM). It includes research on topics such as how LLMs handle multilingualism, learn in-context, and handle factual associations. The repository aims to provide insights into the inner workings of transformer-based language models through a curated list of papers and surveys.

github

: 376

Efficient_Foundation_Model_Survey

Efficient Foundation Model Survey is a comprehensive analysis of resource-efficient large language models (LLMs) and multimodal foundation models. The survey covers algorithmic and systemic innovations to support the growth of large models in a scalable and environmentally sustainable way. It explores cutting-edge model architectures, training/serving algorithms, and practical system designs. The goal is to provide insights on tackling resource challenges posed by large foundation models and inspire future breakthroughs in the field.

github

: 192

prompt-in-context-learning

An Open-Source Engineering Guide for Prompt-in-context-learning from EgoAlpha Lab. 📝 Papers | ⚡️ Playground | 🛠 Prompt Engineering | 🌍 ChatGPT Prompt | ⛳ LLMs Usage Guide > **⭐️ Shining ⭐️:** This is fresh, daily-updated resources for in-context learning and prompt engineering. As Artificial General Intelligence (AGI) is approaching, let’s take action and become a super learner so as to position ourselves at the forefront of this exciting era and strive for personal and professional greatness. The resources include: _🎉Papers🎉_: The latest papers about _In-Context Learning_ , _Prompt Engineering_ , _Agent_ , and _Foundation Models_. _🎉Playground🎉_: Large language models（LLMs）that enable prompt experimentation. _🎉Prompt Engineering🎉_: Prompt techniques for leveraging large language models. _🎉ChatGPT Prompt🎉_: Prompt examples that can be applied in our work and daily lives. _🎉LLMs Usage Guide🎉_: The method for quickly getting started with large language models by using LangChain. In the future, there will likely be two types of people on Earth (perhaps even on Mars, but that's a question for Musk): - Those who enhance their abilities through the use of AIGC; - Those whose jobs are replaced by AI automation. 💎EgoAlpha: Hello! human👤, are you ready?

github

: 1.5k

AI-System-School

AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.

github

: 2.6k

For similar tasks

Awesome-System2-Reasoning-LLM

github

: 863

Open-Reasoning-Tasks

The Open-Reasoning-Tasks repository is a collaborative project aimed at creating a comprehensive list of reasoning tasks for training large language models (LLMs). Contributors can submit tasks with descriptions, examples, and optional diagrams to enhance LLMs' reasoning capabilities.

github

: 205

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 980

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.9k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 32.1k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675