Awesome-System2-Reasoning-LLM
Latest Advances on System-2 Reasoning
Stars: 863
The Awesome-System2-Reasoning-LLM repository is dedicated to a survey paper titled 'From System 1 to System 2: A Survey of Reasoning Large Language Models'. It explores the development of reasoning Large Language Models (LLMs), their foundational technologies, benchmarks, and future directions. The repository provides resources and updates related to the research, tracking the latest developments in the field of reasoning LLMs.
README:
- 2025.02: We released our survey paper "From System 1 to System 2: A Survey of Reasoning Large Language Models". Feel free to cite it or open pull requests.
Welcome to the repository for our survey paper, "From System 1 to System 2: A Survey of Reasoning Large Language Models". This repository provides resources and updates related to our research. For a detailed introduction, please refer to our survey paper.
Achieving human-level intelligence requires enhancing the transition from System 1 (fast, intuitive) to System 2 (slow, deliberate) reasoning. While foundational Large Language Models (LLMs) have made significant strides, they still fall short of human-like reasoning in complex tasks. Recent reasoning LLMs, like OpenAI's o1, have demonstrated expert-level performance in domains such as mathematics and coding, resembling System 2 thinking. This survey explores the development of reasoning LLMs, their foundational technologies, benchmarks, and future directions. We maintain an up-to-date GitHub repository to track the latest developments in this rapidly evolving field.
Figure: The progression of AI systems, emphasizing the shift from rapid, intuitive approaches to deliberate, reasoning-driven models, and how AI has evolved to handle a broader range of real-world challenges.
Figure: The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.
Awesome-System-2-AI
- Part 1: O1 Replication
- Part 2: Process Reward Models
- Part 3: Reinforcement Learning
- Part 4: MCTS/Tree Search
- Part 5: Self-Training / Self-Improve
- Part 6: Reflection
- Part 7: Efficient System 2
- Part 8: Explainability
- Part 9: Multimodal Agent-Related Slow-Fast Systems
- Part 10: Benchmark and Datasets
- Part 11: Reasoning and Safety
- Part 12: R1 Driven Multimodal Reasoning Enhancement
- Open-Reasoner-Zero [Paper]
- X-R1 [GitHub]
- Unlock-Deepseek [Blog]
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning [Paper]
- LLM-R1 [GitHub]
- mini-deepseek-r1 [Blog]
- Run DeepSeek R1 Dynamic 1.58-bit [Blog]
- Simple Reinforcement Learning for Reasoning [Notion]
- TinyZero [GitHub]
- Open R1 [GitHub]
- Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems [Paper]
- o1-Coder: an o1 Replication for Coding [Paper]
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
- DRT: Deep Reasoning Translation via Long Chain-of-Thought [Paper]
- Enhancing LLM Reasoning with Reward-guided Tree Search [Paper]
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
- O1 Replication Journey--Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [Paper]
- O1 Replication Journey: A Strategic Progress Report -- Part 1 [Paper]
- The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer [Paper]
- ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [Paper]
- PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [Paper]
- ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [Paper]
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [Paper]
- ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [Paper]
- AutoPSV: Automated Process-Supervised Verifier [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
- Free Process Rewards without Process Labels [Paper]
- Outcome-Refining Process Supervision for Code Generation [Paper]
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [Paper]
- OVM: Outcome-supervised Value Models for Planning in Mathematical Reasoning [Paper]
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [Paper]
- Let's Verify Step by Step [Paper]
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [Paper]
- Making Large Language Models Better Reasoners with Step-Aware Verifier [Paper]
- Solving Math Word Problems with Process and Outcome-Based Feedback [Paper]
- Uncertainty-Aware Step-wise Verification with Generative Reward Models [Paper]
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [Paper]
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [Paper]
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [Paper]
- Unified Reward Model for Multimodal Understanding and Generation [Paper]
- Multi-head Reward Aggregation Guided by Entropy [Paper]
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [Paper]
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL [Paper]
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search [Paper]
- Process Reinforcement through Implicit Rewards [Paper]
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [Paper]
- Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies [Paper]
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
- Kimi k1.5: Scaling Reinforcement Learning with LLMs [Paper]
- Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [Paper]
- Offline Reinforcement Learning for LLM Multi-Step Reasoning [Paper]
- ReFT: Representation Finetuning for Language Models [Paper]
- Deepseekmath: Pushing the limits of mathematical reasoning in open language models [Paper]
- Reasoning with Reinforced Functional Token Tuning [Paper]
- Value-Based Deep RL Scales Predictably [Paper]
- InfAlign: Inference-aware language model alignment [Paper]
- LIMR: Less is More for RL Scaling [Paper]
- A Survey on Feedback-based Multi-step Reasoning for Large Language Models on Mathematics [Paper]
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [Paper]
- Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning [Paper]
- R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [Paper]
- Visual-RFT: Visual Reinforcement Fine-Tuning [Paper]
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [Paper]
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning [Paper]
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [Paper]
- Improve Vision Language Model Chain-of-thought Reasoning [Paper]
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [Paper]
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [Paper]
- ARMAP: Scaling Autonomous Agents via Automatic Reward Modeling And Planning [Paper]
- On the Convergence Rate of MCTS for the Optimal Value Estimation in Markov Decision Processes [Paper]
- Search-o1: Agentic Search-Enhanced Large Reasoning Models [Paper]
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
- Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper]
- Proposing and solving olympiad geometry with guided tree search [Paper]
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [Paper]
- Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning [Paper]
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [Paper]
- GPT-Guided Monte Carlo Tree Search for Symbolic Regression in Financial Fraud Detection [Paper]
- MC-NEST -- Enhancing Mathematical Reasoning in Large Language Models with a Monte Carlo Nash Equilibrium Self-Refine Tree [Paper]
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [Paper]
- Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding [Paper]
- AFlow: Automating Agentic Workflow Generation [Paper]
- Interpretable Contrastive Monte Carlo Tree Search Reasoning [Paper]
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
- Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning [Paper]
- TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [Paper]
- Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination [Paper]
- RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation [Paper]
- Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [Paper]
- LiteSearch: Efficacious Tree Search for LLM [Paper]
- Tree Search for Language Model Agents [Paper]
- Uncertainty-Guided Optimization on Large Language Model Search Trees [Paper]
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [Paper]
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping [Paper]
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models [Paper]
- AlphaMath Almost Zero: process Supervision without process [Paper]
- Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search [Paper]
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [Paper]
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [Paper]
- Stream of Search (SoS): Learning to Search in Language [Paper]
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [Paper]
- Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models [Paper]
- Reasoning with Language Model is Planning with World Model [Paper]
- Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [Paper]
- Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [Paper]
- Making PPO Even Better: Value-Guided Monte-Carlo Tree Search Decoding [Paper]
- Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning [Paper]
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [Paper]
- Fine-grained Conversational Decoding via Isotropic and Proximal Search [Paper]
- Control-DAG: Constrained Decoding for Non-Autoregressive Directed Acyclic T5 using Weighted Finite State Automata [Paper]
- Look-back Decoding for Open-Ended Text Generation [Paper]
- PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament [Paper]
- LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [Paper]
- Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking [Paper]
- DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [Paper]
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [Paper]
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve [Paper]
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [Paper]
- ReST-EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [Paper]
- ReFT: Representation Finetuning for Language Models [Paper]
- Interactive Evolution: A Neural-Symbolic Self-Training Framework for Large Language Models [Paper]
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [Paper]
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking [Paper]
- V-STaR: Training Verifiers for Self-Taught Reasoners [Paper]
- Self-Refine: Iterative Refinement with Self-Feedback [Paper]
- ReST: Reinforced Self-Training for Language Modeling [Paper]
- STaR: Bootstrapping Reasoning With Reasoning [Paper]
- Expert Iteration: Thinking Fast and Slow with Deep Learning and Tree Search [Paper]
- Self-Improvement in Language Models: The Sharpening Mechanism [Paper]
- Enabling Scalable Oversight via Self-Evolving Critic [Paper]
- S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
- ProgCo: Program Helps Self-Correction of Large Language Models [Paper]
- Self-Refine: Iterative Refinement with Self-Feedback [Paper]
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [Paper]
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [Paper]
- Large Language Models are Better Reasoners with Self-Verification [Paper]
- Self-Evaluation Guided Beam Search for Reasoning [Paper]
- Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [Paper]
- Language Models can Self-Improve at State-Value Estimation for Better Search [Paper]
- Self-Training Elicits Concise Reasoning in Large Language Models [Paper]
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [Paper]
- RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? [Paper]
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Paper]
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [Paper]
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
- Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [Paper]
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [Paper]
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
- Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [Paper]
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [Paper]
- Refiner: Restructure Retrieved Content Efficiently to Advance Question-Answering Capabilities [Paper]
- Reflection-Tuning: An Approach for Data Recycling [Paper]
- Learning From Mistakes Makes LLM Better Reasoner [Paper]
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [Paper]
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [Paper]
- Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking [Paper]
- DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models [Paper]
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoner [Paper]
- Token-Budget-Aware LLM Reasoning [Paper]
- Training Large Language Models to Reason in a Continuous Latent Space [Paper]
- Guiding Language Model Reasoning with Planning Tokens [Paper]
- One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [Paper]
- Small Models Struggle to Learn from Strong Reasoners [Paper]
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs [Paper]
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [Paper]
- Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning [Paper]
- Thinking Preference Optimization [Paper]
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [Paper]
- Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options [Paper]
- CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [Paper]
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning [Paper]
- LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [Paper]
- Atom of Thoughts for Markov LLM Test-Time Scaling [Paper]
- Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [Paper]
- Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models [Paper]
- Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [Paper]
- Titans: Learning to Memorize at Test Time [Paper]
- MoBA: Mixture of Block Attention for Long-Context LLMs [Paper]
- AutoReason: Automatic Few-Shot Reasoning Decomposition [Paper]
- Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning [Paper]
- Scalable Language Models with Posterior Inference of Latent Thought Vectors [Paper]
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [Paper]
- LightThinker: Thinking Step-by-Step Compression [Paper]
- The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities [Paper]
- Reasoning with Latent Thoughts: On the Power of Looped Transformers [Paper]
- Efficient Reasoning with Hidden Thinking [Paper]
- Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models [Paper]
- Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study [Paper]
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [Paper]
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [Paper]
- MixLLM: Dynamic Routing in Mixed Large Language Models [Paper]
- PEARL: Towards Permutation-Resilient LLMs [Paper]
- Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [Paper]
- Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [Paper]
- Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [Paper]
- Training Large Language Models to be Better Rule Followers [Paper]
- SIFT: Grounding LLM Reasoning in Contexts via Stickers [Paper]
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [Paper]
- How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach [Paper]
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [Paper]
- DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [Paper]
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [Paper]
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [Paper]
- From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs [Paper]
- MALT: Improving Reasoning with Multi-Agent LLM Training [Paper]
- Efficient Reasoning with Hidden Thinking [Paper]
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation [Paper]
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [Paper]
- TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation [Paper]
- Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [Paper]
- Entropy-based Exploration Conduction for Multi-step Reasoning [Paper]
- MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion [Paper]
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [Paper]
- ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs [Paper]
- Agent models: Internalizing Chain-of-Action Generation into Reasoning models [Paper]
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research [Paper]
- StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error [Paper]
- Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding [Paper]
- Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators [Paper]
- Shared Global and Local Geometry of Language Model Embeddings [Paper]
- Agents Thinking Fast and Slow: A Talker-Reasoner Architecture [Paper]
- What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective [Paper]
- When a Language Model is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1 [Paper]
- The Impact of Reasoning Step Length on Large Language Models [Paper]
- Distilling System 2 into System 1 [Paper]
- System 2 Attention (is something you might need too) [Paper]
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought [Paper]
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [Paper]
- Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time [Paper]
- Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [Paper]
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning [Paper]
- Diving into Self-Evolving Training for Multimodal Reasoning [Paper]
- Visual Agents as Fast and Slow Thinkers [Paper]
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [Paper]
- Scaling Inference-Time Search With Vision Value Model for Improved Visual Comprehension [Paper]
- Slow Perception: Let's Perceive Geometric Figures Step-by-Step [Paper]
- AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning [Paper]
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step [Paper]
- Vision-Language Models Can Self-Improve Reasoning via Reflection [Paper]
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [Paper]
- RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [Paper]
- Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models [Paper]
- PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [Paper]
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [Paper]
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-like LLMs [Paper]
- A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [Paper]
- EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [Paper]
- SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [Paper]
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [Paper]
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI [Paper]
- Evaluation of OpenAI o1: Opportunities and Challenges of AGI [Paper]
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [Paper]
- LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [Paper]
- Humanity's Last Exam [Paper]
- RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style [Paper]
- LR2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems [Paper]
- BIG-Bench Extra Hard [Paper]
- MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts [Paper]
- MastermindEval: A Simple But Scalable Reasoning Benchmark [Paper]
- DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs [Paper]
- ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models [Paper]
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [Paper]
- MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [Paper]
- V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks [GitHub]
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [Paper]
- Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable [Paper]
- OverThink: Slowdown Attacks on Reasoning LLMs [Paper]
- GuardReasoner: Towards Reasoning-based LLM Safeguards [Paper]
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [Paper]
- ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails [Paper]
- H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking [Paper]
- BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack [Paper]
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [Paper]
- Deliberative Alignment: Reasoning Enables Safer Language Models [Paper]
- Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [Blog]
- OpenAI trained o1 and o3 to "think" about its safety policy [Blog]
- Why AI Safety Researchers Are Worried About DeepSeek [Blog]
- Measuring Faithfulness in Chain-of-Thought Reasoning [Blog]
- Efficient-R1-VLLM: Efficient RL-Tuned MoE Vision-Language Model For Reasoning [GitHub]
- MMR1: Advancing the Frontiers of Multimodal Reasoning [GitHub]
- Skywork-R1V: Pioneering Multimodal Reasoning with CoT [GitHub]
- Open R1 Video [GitHub]
- R1-Vision: Let's first take a look at the image [GitHub]
- VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [Blog]
- MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [Paper]
- R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [Paper]
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [Paper]
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [Paper]
- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [Paper]
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [Paper]
- TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM [Paper]
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [Paper]
- MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning [Paper]
If you find this work useful, please cite us.
@misc{li202512surveyreasoning,
title={From System 1 to System 2: A Survey of Reasoning Large Language Models},
author={Zhong-Zhi Li and Duzhen Zhang and Ming-Liang Zhang and Jiaxin Zhang and Zengyan Liu and Yuxuan Yao and Haotian Xu and Junhao Zheng and Pei-Jie Wang and Xiuyi Chen and Yingying Zhang and Fei Yin and Jiahua Dong and Zhijiang Guo and Le Song and Cheng-Lin Liu},
year={2025},
eprint={2502.17419},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.17419},
}For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for Awesome-System2-Reasoning-LLM
Similar Open Source Tools
Awesome-System2-Reasoning-LLM
The Awesome-System2-Reasoning-LLM repository is dedicated to a survey paper titled 'From System 1 to System 2: A Survey of Reasoning Large Language Models'. It explores the development of reasoning Large Language Models (LLMs), their foundational technologies, benchmarks, and future directions. The repository provides resources and updates related to the research, tracking the latest developments in the field of reasoning LLMs.
Awesome-LLM-Post-training
The Awesome-LLM-Post-training repository is a curated collection of influential papers, code implementations, benchmarks, and resources related to Large Language Models (LLMs) Post-Training Methodologies. It covers various aspects of LLMs, including reasoning, decision-making, reinforcement learning, reward learning, policy optimization, explainability, multimodal agents, benchmarks, tutorials, libraries, and implementations. The repository aims to provide a comprehensive overview and resources for researchers and practitioners interested in advancing LLM technologies.
NL2SQL_Handbook
NL2SQL Handbook provides a comprehensive overview of Natural Language to SQL (NL2SQL) advancements, including survey papers, tutorial slides, and a river diagram of NL2SQL methods. It covers the evolution of NL2SQL solutions, module-based methods, benchmark development, and future directions. The repository also offers practical guides for beginners, access to high-performance language models, and evaluation metrics for NL2SQL models.
awesome-llm-understanding-mechanism
This repository is a collection of papers focused on understanding the internal mechanism of large language models (LLM). It includes research on topics such as how LLMs handle multilingualism, learn in-context, and handle factual associations. The repository aims to provide insights into the inner workings of transformer-based language models through a curated list of papers and surveys.
EvalAI
EvalAI is an open-source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale. It provides a central leaderboard and submission interface, making it easier for researchers to reproduce results mentioned in papers and perform reliable & accurate quantitative analysis. EvalAI also offers features such as custom evaluation protocols and phases, remote evaluation, evaluation inside environments, CLI support, portability, and faster evaluation.
Awesome-Text2SQL
Awesome Text2SQL is a curated repository containing tutorials and resources for Large Language Models, Text2SQL, Text2DSL, Text2API, Text2Vis, and more. It provides guidelines on converting natural language questions into structured SQL queries, with a focus on NL2SQL. The repository includes information on various models, datasets, evaluation metrics, fine-tuning methods, libraries, and practice projects related to Text2SQL. It serves as a comprehensive resource for individuals interested in working with Text2SQL and related technologies.
Efficient_Foundation_Model_Survey
Efficient Foundation Model Survey is a comprehensive analysis of resource-efficient large language models (LLMs) and multimodal foundation models. The survey covers algorithmic and systemic innovations to support the growth of large models in a scalable and environmentally sustainable way. It explores cutting-edge model architectures, training/serving algorithms, and practical system designs. The goal is to provide insights on tackling resource challenges posed by large foundation models and inspire future breakthroughs in the field.
AwesomeLLM4APR
Awesome LLM for APR is a repository dedicated to exploring the capabilities of Large Language Models (LLMs) in Automated Program Repair (APR). It provides a comprehensive collection of research papers, tools, and resources related to using LLMs for various scenarios such as repairing semantic bugs, security vulnerabilities, syntax errors, programming problems, static warnings, self-debugging, type errors, web UI tests, smart contracts, hardware bugs, performance bugs, API misuses, crash bugs, test case repairs, formal proofs, GitHub issues, code reviews, motion planners, human studies, and patch correctness assessments. The repository serves as a valuable reference for researchers and practitioners interested in leveraging LLMs for automated program repair.
Awesome-TimeSeries-SpatioTemporal-LM-LLM
Awesome-TimeSeries-SpatioTemporal-LM-LLM is a curated list of Large (Language) Models and Foundation Models for Temporal Data, including Time Series, Spatio-temporal, and Event Data. The repository aims to summarize recent advances in Large Models and Foundation Models for Time Series and Spatio-Temporal Data with resources such as papers, code, and data. It covers various applications like General Time Series Analysis, Transportation, Finance, Healthcare, Event Analysis, Climate, Video Data, and more. The repository also includes related resources, surveys, and papers on Large Language Models, Foundation Models, and their applications in AIOps.
Awesome-Efficient-AIGC
This repository, Awesome Efficient AIGC, collects efficient approaches for AI-generated content (AIGC) to cope with its huge demand for computing resources. It includes efficient Large Language Models (LLMs), Diffusion Models (DMs), and more. The repository is continuously improving and welcomes contributions of works like papers and repositories that are missed by the collection.
Cyberion-Spark-X
Cyberion-Spark-X is a powerful open-source tool designed for cybersecurity professionals and data analysts. It provides advanced capabilities for analyzing and visualizing large datasets to detect security threats and anomalies. The tool integrates with popular data sources and supports various machine learning algorithms for predictive analytics and anomaly detection. Cyberion-Spark-X is user-friendly and highly customizable, making it suitable for both beginners and experienced professionals in the field of cybersecurity and data analysis.
Efficient-LLMs-Survey
This repository provides a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from **model-centric**, **data-centric**, and **framework-centric** perspectives, respectively. We hope our survey and this GitHub repository can serve as valuable resources to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.
LLM-Discrete-Tokenization-Survey
The repository contains a comprehensive survey paper on Discrete Tokenization for Multimodal Large Language Models (LLMs). It covers various techniques, applications, and challenges related to discrete tokenization in the context of LLMs. The survey explores the use of vector quantization, product quantization, and other methods for tokenizing different modalities like text, image, audio, video, graph, and more. It also discusses the integration of discrete tokenization with LLMs for tasks such as image generation, speech recognition, recommendation systems, and multimodal understanding and generation.
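The vector quantization mentioned above is the basic mechanism behind discrete tokenization: each continuous feature vector is replaced by the index of its nearest codebook entry, producing discrete ids an LLM can consume like text tokens. A minimal NumPy sketch (the codebook and feature values are illustrative, not from any surveyed model):

```python
import numpy as np

def vector_quantize(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    # For each input vector, compute Euclidean distances to every codebook
    # entry and return the index of the nearest one (the discrete token id).
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

# Toy 2-D codebook with three entries (illustrative values)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2]])
tokens = vector_quantize(features, codebook)  # array of discrete token ids
```

Product quantization, also covered in the survey, extends this by splitting each vector into sub-vectors and quantizing each sub-vector against its own smaller codebook.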
LLMFarm
LLMFarm is an iOS and macOS app designed to work with large language models (LLMs). It allows users to load different LLMs with specific parameters, test the performance of various LLMs on iOS and macOS, and identify the most suitable model for their projects. The tool is based on ggml and llama.cpp by Georgi Gerganov and incorporates sources from rwkv.cpp by saharNooby, Mia by byroneverson, and LlamaChat by alexrozanski. LLMFarm features support for macOS (13+) and iOS (16+), various inference and sampling methods, Metal compatibility (not supported on Intel Mac), model setting templates, LoRA adapter support, LoRA finetuning, LoRA export as a model, and more. It also offers a range of inferences including LLaMA, GPTNeoX, Replit, GPT2, Starcoder, RWKV, Falcon, MPT, Bloom, and others. Additionally, it supports multimodal models like LLaVA, Obsidian, and MobileVLM. Users can customize inference options through JSON files and access supported models for download.
AI-resources
AI-resources is a repository containing links to various resources for learning Artificial Intelligence. It includes video lectures, courses, tutorials, and open-source libraries related to deep learning, reinforcement learning, machine learning, and more. The repository categorizes resources for beginners, average users, and advanced users/researchers, providing a comprehensive collection of materials to enhance knowledge and skills in AI.
For similar tasks
Awesome-System2-Reasoning-LLM
The Awesome-System2-Reasoning-LLM repository is dedicated to a survey paper titled 'From System 1 to System 2: A Survey of Reasoning Large Language Models'. It explores the development of reasoning Large Language Models (LLMs), their foundational technologies, benchmarks, and future directions. The repository provides resources and updates related to the research, tracking the latest developments in the field of reasoning LLMs.
Open-Reasoning-Tasks
The Open-Reasoning-Tasks repository is a collaborative project aimed at creating a comprehensive list of reasoning tasks for training large language models (LLMs). Contributors can submit tasks with descriptions, examples, and optional diagrams to enhance LLMs' reasoning capabilities.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API providing access to over 100 different AI models, spanning image generation to audio.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open-access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks so operators can focus on more complicated and time-consuming work, and it can also identify security harms such as misuse (e.g., malware generation, jailbreaking) and privacy harms (e.g., identity theft). The goal is to give researchers a baseline of how well their model and entire inference pipeline perform against different harm categories, and to let them compare that baseline to future iterations of the model. This gives them empirical data on how well the model is doing today and helps detect any degradation in performance as the model evolves.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
- Self-contained, with no need for a DBMS or cloud service.
- OpenAPI interface, easy to integrate with existing infrastructure (e.g. Cloud IDE).
- Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.
