
Awesome Test-time Computing

This repository contains the resources for the survey Test-time Computing: from System-1 Thinking to System-2 Thinking. It collects papers and code on test-time computing strategies, including test-time adaptation, input modification, representation editing, output calibration, test-time reasoning, and search strategies, and covers topics such as self-supervised test-time training, in-context learning, activation steering, nearest-neighbor models, reward modeling, and multimodal reasoning, for researchers and practitioners interested in enhancing the reasoning capabilities of large language models.

Figure: Overview of Test-time Computing.

Table of Contents

Test-time Adaptation

Updating the Model

  • Test-Time Training with Self-Supervision for Generalization under Distribution Shifts [ICML 2020] paper
  • MT3: Meta Test-Time Training for Self-Supervised Test-Time Adaption [AISTATS 2022] paper code
  • Test-Time Training with Masked Autoencoders [NeurIPS 2022] paper
  • TTT++: When Does Self-Supervised Test-Time Training Fail or Thrive? [NeurIPS 2021] paper code
  • Efficient Test-Time Prompt Tuning for Vision-Language Models [arxiv 2024.8] paper
  • Tent: Fully Test-time Adaptation by Entropy Minimization [ICLR 2021] paper code
  • MEMO: Test Time Robustness via Adaptation and Augmentation [NeurIPS 2022] paper code
  • The Entropy Enigma: Success and Failure of Entropy Minimization [arxiv 2024.5] paper code
  • On Pitfalls of Test-Time Adaptation [ICML 2023] paper code
  • Beware of Model Collapse! Fast and Stable Test-time Adaptation for Robust Question Answering [EMNLP 2023] paper code
  • Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization [NeurIPS 2023] paper
  • Protected Test-Time Adaptation via Online Entropy Matching: A Betting Approach [arxiv 2024.8] paper code
  • Simulating Bandit Learning from User Feedback for Extractive Question Answering [ACL 2022] paper code
  • Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment [ACL 2022] paper
  • Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization [ACL 2023] paper code
  • COMET: A Neural Framework for MT Evaluation [EMNLP 2020] paper
  • Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [ICLR 2024] paper code
  • Improving robustness against common corruptions by covariate shift adaptation [NeurIPS 2020] paper
  • Selective Annotation Makes Language Models Better Few-Shot Learners [arxiv 2022.9] paper code
  • Test-Time Adaptation with Perturbation Consistency Learning [arxiv 2023.4] paper
  • Test-Time Prompt Adaptation for Vision-Language Models [NeurIPS 2023] paper
  • Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning [ICCV 2023] paper code
  • Test-Time Model Adaptation with Only Forward Passes [ICML 2024] paper code
  • Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models [arxiv 2024.7] paper code
  • StreamAdapter: Efficient Test Time Adaptation from Contextual Streams [arxiv 2024.11] paper
  • Towards Stable Test-time Adaptation in Dynamic Wild World [ICLR 2023] paper code
  • SoTTA: Robust Test-Time Adaptation on Noisy Data Streams [NeurIPS 2023] paper code
  • Robust Question Answering against Distribution Shifts with Test-Time Adaption: An Empirical Study [EMNLP 2022] paper code
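
Several of the entries above (e.g. Tent) share the same core loop: minimize the entropy of the model's predictions on each unlabeled test batch, updating only a small parameter subset such as the normalization layers. A minimal PyTorch sketch of that loop, with the model and test stream left as hypothetical placeholders:

```python
# Minimal sketch of Tent-style test-time adaptation (entropy minimization).
# Assumes a PyTorch image classifier; only BatchNorm affine parameters adapt.
import torch
import torch.nn as nn

def collect_bn_params(model: nn.Module):
    """Unfreeze and collect the affine parameters of all BatchNorm layers."""
    params = []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.requires_grad_(True)
            params += [module.weight, module.bias]
    return params

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    log_probs = logits.log_softmax(dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

def adapt_step(model, batch, optimizer):
    """One adaptation step on an unlabeled test batch."""
    logits = model(batch)
    loss = prediction_entropy(logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()

# Usage (hypothetical model and test stream):
# model.requires_grad_(False)    # freeze all weights except the BN affines
# optimizer = torch.optim.SGD(collect_bn_params(model), lr=1e-3)
# model.train()                  # BatchNorm uses test-batch statistics, as in Tent
# for batch in test_stream:
#     predictions = adapt_step(model, batch, optimizer)
```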

Modifying the Input

  • What Makes Good In-Context Examples for GPT-3? [DeeLIO 2022] paper
  • In-Context Learning with Iterative Demonstration Selection [EMNLP 2024] paper
  • Dr.ICL: Demonstration-Retrieved In-context Learning [arxiv 2023.5] paper
  • Learning To Retrieve Prompts for In-Context Learning [NAACL 2022] paper
  • Unified Demonstration Retriever for In-Context Learning [ACL 2023] paper code
  • Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers [ACL 2023] paper code
  • Finding Support Examples for In-Context Learning [EMNLP 2023] paper code
  • Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [NeurIPS 2023] paper code
  • Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity [ACL 2022] paper
  • Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering [ACL 2023] paper
  • RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning [arxiv 2024.4] paper
  • Automatic Chain of Thought Prompting in Large Language Models [ICLR 2023] paper code
  • Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations [EMNLP 2023] paper code
  • Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations [ACL 2023] paper
  • Self-Generated In-Context Learning: Leveraging Auto-regressive Language Models as a Demonstration Generator [arxiv 2022.6] paper
  • Demonstration Augmentation for Zero-shot In-context Learning [ACL 2024] paper code
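
A recurring recipe in the demonstration-selection work above is retrieval: embed a pool of labeled examples, embed the test input, and prepend the nearest neighbors as in-context demonstrations. A minimal sketch using sentence-transformers; the encoder checkpoint and toy pool are illustrative:

```python
# Minimal sketch of similarity-based demonstration selection for ICL.
# The embedding model and demonstration pool below are illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

pool = [
    ("What is 2+2?", "4"),
    ("Translate 'chat' from French to English.", "cat"),
    ("What is the capital of France?", "Paris"),
]
pool_embeddings = encoder.encode([q for q, _ in pool], normalize_embeddings=True)

def build_prompt(test_input: str, k: int = 2) -> str:
    """Select the k most similar demonstrations and format a prompt."""
    query = encoder.encode([test_input], normalize_embeddings=True)[0]
    scores = pool_embeddings @ query          # cosine similarity (normalized)
    top_k = np.argsort(-scores)[:k]
    demos = "\n".join(f"Q: {pool[i][0]}\nA: {pool[i][1]}" for i in top_k)
    return f"{demos}\nQ: {test_input}\nA:"

print(build_prompt("What is the capital of Italy?"))
```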

Editing the Representation

  • Plug and Play Language Models: A Simple Approach to Controlled Text Generation [ICLR 2020] paper
  • Steering Language Models With Activation Engineering [arxiv 2024.10] paper
  • Improving Instruction-Following in Language Models through Activation Steering [arxiv 2024.10] paper
  • Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [arxiv 2024.6] paper code
  • Refusal in Language Models Is Mediated by a Single Direction [arxiv 2024.10] paper code
  • In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering [arxiv 2024.2] paper code
  • Investigating Bias Representations in Llama 2 Chat via Activation Steering [arxiv 2024.2] paper
  • Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [arxiv 2024.7] paper code
  • Spectral Editing of Activations for Large Language Model Alignment [NeurIPS 2024] paper code
  • Multi-property Steering of Large Language Models with Dynamic Activation Composition [BlackboxNLP 2024] paper code
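
Most of the steering papers above rely on the same mechanism: add a fixed steering vector (often the difference of mean activations between contrastive prompt sets) to the hidden states of one layer during generation. A minimal PyTorch forward-hook sketch; the LLaMA-style module path, layer index, and scale are assumptions:

```python
# Minimal sketch of activation steering via a forward hook.
# steering_vector would typically be the difference of mean hidden states
# between contrastive prompt sets; here it is a placeholder.
import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor,
                      scale: float = 1.0):
    """Register a hook that adds scale * v to one layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # Assumes a LLaMA-like layout (model.model.layers[i]); adjust per architecture.
    layer = model.model.layers[layer_idx]
    return layer.register_forward_hook(hook)

# Usage (hypothetical model, vector v of size hidden_dim):
# handle = add_steering_hook(model, layer_idx=14, steering_vector=v, scale=4.0)
# output_ids = model.generate(**inputs, max_new_tokens=64)
# handle.remove()   # stop steering
```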

Calibrating the Output

  • Generalization through Memorization: Nearest Neighbor Language Models [ICLR 2020] paper code
  • Nearest Neighbor Machine Translation [ICLR 2021] paper code
  • Efficient Cluster-Based k-Nearest-Neighbor Machine Translation [ACL 2022] paper code
  • What Knowledge Is Needed? Towards Explainable Memory for kNN-MT Domain Adaptation [ACL 2023] paper code
  • Efficient Domain Adaptation for Non-Autoregressive Machine Translation [ACL 2024] paper code
  • kNN-NER: Named Entity Recognition with Nearest Neighbor Search [arxiv 2022.3] paper code
  • kNN-CM: A Non-parametric Inference-Phase Adaptation of Parametric Text Classifiers [EMNLP 2023] paper code
  • AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [ICML 2023] paper code
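
The kNN methods above calibrate the parametric distribution by interpolating it with a non-parametric one built from retrieved (context representation, target) pairs: p(y|x) = λ · p_kNN(y|x) + (1 − λ) · p_LM(y|x). A self-contained NumPy sketch of the kNN-LM interpolation step; the toy datastore and hyperparameters are illustrative:

```python
# Minimal sketch of kNN-LM interpolation (Khandelwal et al., ICLR 2020).
# The datastore, query, and hyperparameters below are illustrative.
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, temp=10.0, lam=0.25):
    """Interpolate the LM distribution with a kNN distribution over the vocab."""
    dists = np.linalg.norm(keys - query, axis=1)   # L2 distance to datastore keys
    nn = np.argsort(dists)[:k]
    weights = np.exp(-dists[nn] / temp)            # softmax over negative distances
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, idx in zip(weights, nn):
        p_knn[values[idx]] += w                    # mass on retrieved next tokens
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy usage: keys are stored context embeddings, values their next-token ids.
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 8))
values = rng.integers(0, 50, size=100)
p_lm = np.full(50, 1 / 50)
p = knn_lm_probs(p_lm, query=rng.normal(size=8), keys=keys,
                 values=values, vocab_size=50)
assert abs(p.sum() - 1.0) < 1e-6
```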

Test-time Reasoning

Feedback Modeling

Score-based

  • Scaling Laws for Reward Model Overoptimization [PMLR 2023] paper
  • Training Verifiers to Solve Math Word Problems [arxiv 2021.10] paper
  • Advancing LLM Reasoning Generalists with Preference Trees [arxiv 2024.4] paper code
  • V-STaR: Training Verifiers for Self-Taught Reasoners [COLM 2024] paper
  • Solving math word problems with process- and outcome-based feedback [arxiv 2022.11] paper
  • Let's Verify Step by Step [ICLR 2024] paper code
  • Entropy-Regularized Process Reward Model [arxiv 2024.12] code
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [ACL 2024] paper
  • Improve Mathematical Reasoning in Language Models by Automated Process Supervision [arxiv 2024.6] paper
  • Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [arxiv 2024.10] paper
  • AutoPSV: Automated Process-Supervised Verifier [NeurIPS 2024] paper code
  • Free Process Rewards without Process Labels [arxiv 2024.12] paper code
  • Critique-out-Loud Reward Models [arxiv 2024.8] paper code
  • Improving Reward Models with Synthetic Critiques [arxiv 2024.5] paper
  • Generative Verifiers: Reward Modeling as Next-Token Prediction [arxiv 2024.8] paper
  • Self-Generated Critiques Boost Reward Modeling for Language Models [arxiv 2024.11] paper
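
At inference, the verifiers and reward models above are commonly used as best-of-N rerankers: sample N candidate solutions, score each (an outcome RM scores the final answer; a process RM scores every step, with per-step scores aggregated by, e.g., min or product), and return the highest-scoring one. A hedged sketch; `generate` and `score_step` are hypothetical placeholders:

```python
# Minimal sketch of verifier-guided best-of-N reranking.
# `generate` (a temperature-sampled LLM returning a list of solution steps)
# and `score_step` (a process reward model) are hypothetical placeholders.
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], List[str]],
              score_step: Callable[[str, List[str]], float],
              n: int = 16) -> List[str]:
    """Sample n step-by-step solutions, score them with a PRM, keep the best."""
    best, best_score = None, float("-inf")
    for _ in range(n):
        steps = generate(prompt)
        # Aggregate per-step PRM scores with min: one common convention is
        # that a solution is only as good as its weakest step.
        score = min(score_step(prompt, steps[:i + 1]) for i in range(len(steps)))
        if score > best_score:
            best, best_score = steps, score
    return best
```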

Verbal-based

  • Is ChatGPT a Good NLG Evaluator? A Preliminary Study [ACL 2023] paper code
  • ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [arxiv 2023.3] paper
  • G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [EMNLP 2023] paper code
  • Can Large Language Models Be an Alternative to Human Evaluations? [ACL 2023] paper
  • LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [arxiv 2024.6] paper
  • Large Language Models are not Fair Evaluators [ACL 2024] paper code
  • Large Language Models are Inconsistent and Biased Evaluators [arxiv 2024.5] paper
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [NeurIPS 2023] paper code
  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization [ICLR 2024] paper
  • JudgeLM: Fine-tuned Large Language Models are Scalable Judges [arxiv 2023.10] paper code
  • Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [arxiv 2024.5] paper code
  • REFINER: Reasoning Feedback on Intermediate Representations [ACL 2024] paper
  • Shepherd: A Critic for Language Model Generation [arxiv 2023.8] paper code
  • Generative Judge for Evaluating Alignment [ICLR 2024] paper code
  • Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [ICLR 2024] paper
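
The verbal-feedback line instead prompts a strong LLM to act as the judge, producing a critique and a verdict rather than a scalar score. A minimal pairwise-judging sketch in the spirit of MT-Bench; the template wording is illustrative rather than any paper's exact prompt, and `call_llm` is a hypothetical completion function:

```python
# Minimal sketch of a pairwise LLM-as-a-judge prompt (MT-Bench style).
# The wording is illustrative, not the exact prompt from any paper;
# `call_llm` is a hypothetical text-completion function.
JUDGE_TEMPLATE = """You are an impartial judge. Compare two answers to the
user question below. Evaluate helpfulness, accuracy, and depth. Avoid
position bias: do not prefer an answer because of its order or length.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

First explain your reasoning, then output a final verdict on its own line:
"A", "B", or "tie"."""

def judge(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    prompt = JUDGE_TEMPLATE.format(question=question,
                                   answer_a=answer_a, answer_b=answer_b)
    return call_llm(prompt).strip().splitlines()[-1]   # "A", "B", or "tie"
```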

Search Strategies

Repeated sampling

  • Competition-level code generation with alphacode [Science 2022] paper code
  • Code Llama: Open Foundation Models for Code [arxiv 2023.8] paper code
  • More Agents Is All You Need [arxiv 2024.2] paper code
  • Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios [ACL 2024] paper
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models [ICLR 2023] paper
  • Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [arxiv 2024.10] paper code
  • Learning to summarize with human feedback [NeurIPS 2020] paper
  • Training Verifiers to Solve Math Word Problems [arxiv 2021.10] paper
  • WebGPT: Browser-assisted question-answering with human feedback [arxiv 2021.12] paper
  • Making Language Models Better Reasoners with Step-Aware Verifier [ACL 2023] paper code
  • Accelerating Best-of-N via Speculative Rejection [ICML 2024] paper
  • TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling [arxiv 2024.10] paper
  • Fast Best-of-N Decoding via Speculative Rejection [NeurIPS 2024] paper
  • Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation [arxiv 2024.10] paper code
  • Preference-Guided Reflective Sampling for Aligning Language Models [EMNLP 2024] paper code
  • Reinforced Self-Training (ReST) for Language Modeling [arxiv 2023.8] paper
  • Variational Best-of-N Alignment [arxiv 2024.7] paper
  • BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling [NeurIPS 2024] paper
  • BOND: Aligning LLMs with Best-of-N Distillation [arxiv 2024.7] paper
  • Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models [arxiv 2024.12] paper
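
The simplest of these strategies, self-consistency, reduces to: sample k chain-of-thought completions at nonzero temperature, parse the final answer from each, and return the majority answer. A minimal sketch; `sample_cot` and `extract_answer` are hypothetical placeholders:

```python
# Minimal sketch of self-consistency decoding (Wang et al., ICLR 2023).
# `sample_cot` (a temperature-sampled LLM call) and `extract_answer`
# (a final-answer parser) are hypothetical placeholders.
from collections import Counter

def self_consistency(prompt: str, sample_cot, extract_answer, k: int = 20) -> str:
    """Sample k reasoning paths and return the majority final answer."""
    answers = []
    for _ in range(k):
        chain = sample_cot(prompt)        # e.g. temperature = 0.7 sampling
        answers.append(extract_answer(chain))
    # Marginalize over reasoning paths: majority vote on final answers.
    return Counter(answers).most_common(1)[0][0]
```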

Self-correction

  • Reflexion: Language Agents with Verbal Reinforcement Learning [arxiv 2023.3] paper code
  • Interscript: A dataset for interactive learning of scripts through error feedback [arxiv 2021.12] paper code
  • NL-EDIT: Correcting Semantic Parse Errors through Natural Language Interaction [ACL 2021] paper code
  • Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback [ACL 2022] paper code
  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [ICLR 2024] paper code
  • Teaching Large Language Models to Self-Debug [ICLR 2024] paper
  • RARR: Researching and Revising What Language Models Say, Using Language Models [ACL 2023] paper code
  • Graph-based, Self-Supervised Program Repair from Diagnostic Feedback [ICML 2020] paper
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate [arxiv 2023.5] paper code
  • Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate [EMNLP 2023] paper code
  • Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [EMNLP 2024] paper code
  • ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [ACL 2024] paper code
  • Mixture-of-Agents Enhances Large Language Model Capabilities [arxiv 2024.6] paper code
  • Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate [arxiv 2024.7] paper
  • Debating with More Persuasive LLMs Leads to More Truthful Answers [ICML 2024] paper code
  • ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [ICLR 2024] paper
  • ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [LREC-COLING 2024] paper
  • Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [arxiv 2023.11] paper
  • MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate [arxiv 2024.6] paper
  • Teaching Models to Balance Resisting and Accepting Persuasion [arxiv 2024.10] paper code
  • GroupDebate: Enhancing the Efficiency of Multi-Agent Debate Using Group Discussion [arxiv 2024.9] paper
  • Improving Multi-Agent Debate with Sparse Communication Topology [arxiv 2024.6] paper
  • Self-Rewarding Language Models [arxiv 2024.1] paper
  • Constitutional AI: Harmlessness from AI Feedback [arxiv 2022.12] paper code
  • Self-Refine: Iterative Refinement with Self-Feedback [NeurIPS 2023] paper
  • Language Models can Solve Computer Tasks [arxiv 2023.3] paper code
  • Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models [arxiv 2024.2] paper code
  • Is Self-Repair a Silver Bullet for Code Generation? [ICLR 2024] paper code
  • Large Language Models Cannot Self-Correct Reasoning Yet [ICLR 2024] paper
  • Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies [arxiv 2024.6] paper
  • Can Large Language Models Really Improve by Self-critiquing Their Own Plans? [arxiv 2023.10] paper
  • GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [arxiv 2023.10] paper
  • When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [arxiv 2024.6] paper
  • LLMs cannot find reasoning errors, but can correct them given the error location [ACL 2024] paper code
  • Self-critiquing models for assisting human evaluators [arxiv 2022.6] paper
  • Recursive Introspection: Teaching Language Model Agents How to Self-Improve [arxiv 2024.7] paper
  • Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning [arxiv 2024.10] paper
  • Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning [arxiv 2024.6] paper code
  • GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements [arxiv 2024.2] paper
  • Generating Sequences by Learning to Self-Correct [ICLR 2023] paper code
  • Training Language Models to Self-Correct via Reinforcement Learning [arxiv 2024.9] paper
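
Intrinsic self-correction methods such as Self-Refine share a generate-critique-revise loop; the negative results above (e.g. "Large Language Models Cannot Self-Correct Reasoning Yet") caution that it only helps when the critique signal is actually informative. A hedged sketch of the loop; the three LLM calls are hypothetical placeholders:

```python
# Minimal sketch of a Self-Refine style loop (Madaan et al., NeurIPS 2023).
# `generate`, `critique`, and `refine` are hypothetical LLM-call placeholders.
def self_refine(task: str, generate, critique, refine,
                max_iters: int = 3, stop_token: str = "LOOKS GOOD") -> str:
    """Iteratively revise an answer using the model's own feedback."""
    answer = generate(task)
    for _ in range(max_iters):
        feedback = critique(task, answer)
        if stop_token in feedback:        # critic is satisfied; stop early
            break
        answer = refine(task, answer, feedback)
    return answer
```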

Tree search

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models [NeurIPS 2023] paper code
  • Self-Evaluation Guided Beam Search for Reasoning [NeurIPS 2023] paper code
  • Reasoning with Language Model is Planning with World Model [EMNLP 2023] paper code
  • Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B [arxiv 2024.6] paper code
  • Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [arxiv 2025.2] paper code
  • Large Language Model Guided Tree-of-Thought [arxiv 2023.5] paper code
  • Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training [arxiv 2023.9] paper
  • Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers [arxiv 2024.8] paper code
  • Interpretable Contrastive Monte Carlo Tree Search Reasoning [arxiv 2024.10] paper code
  • ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [arxiv 2024.6] paper code
  • Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [arxiv 2024.5] paper code
  • O1 Replication Journey: A Strategic Progress Report -- Part 1 [arxiv 2024.10] paper code
  • Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [arxiv 2024.11] paper code
  • o1-Coder: an o1 Replication for Coding [arxiv 2024.12] paper code
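
These methods decompose a solution into intermediate "thoughts", expand several candidates per step, score partial paths with the model itself or a reward model, and keep only the most promising frontier. A minimal breadth-first Tree-of-Thoughts sketch; `propose` and `evaluate` are hypothetical LLM-call placeholders:

```python
# Minimal sketch of breadth-first Tree-of-Thoughts search (Yao et al., 2023).
# `propose` (next-thought generator) and `evaluate` (partial-path scorer)
# are hypothetical LLM-call placeholders.
from typing import Callable, List, Tuple

def tot_bfs(problem: str,
            propose: Callable[[str, List[str]], List[str]],
            evaluate: Callable[[str, List[str]], float],
            depth: int = 3, beam: int = 5) -> List[str]:
    """Expand thoughts level by level, keeping the `beam` best partial paths."""
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(depth):
        candidates = []
        for _, path in frontier:
            for thought in propose(problem, path):
                new_path = path + [thought]
                candidates.append((evaluate(problem, new_path), new_path))
        # Keep only the highest-scoring partial solutions.
        frontier = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
    return frontier[0][1]   # best path of thoughts found
```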

Future Directions

Generalizable System-2 Model

  • A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift [arxiv 2023.10] paper
  • Evaluating Robustness of Reward Models for Mathematical Reasoning [arxiv 2024.10] paper
  • ODIN: Disentangled Reward Mitigates Hacking in RLHF [ICML 2024] paper
  • Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [arxiv 2024.6] paper code
  • DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging [EMNLP 2024] paper code
  • Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [NeurIPS 2024] paper code
  • Generalizing Reward Modeling for Out-of-Distribution Preference Learning [ECML-PKDD 2024] paper code
  • Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [arxiv 2023.12] paper
  • MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models [EMNLP 2023] paper code
  • Training language models to follow instructions with human feedback [arxiv 2022.3] paper

Multimodal Reasoning

  • Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models [ICLR 2024] paper code
  • Multimodal Chain-of-Thought Reasoning in Language Models [TMLR 2024] paper code
  • Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [NeurIPS 2024] paper
  • KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning [AAAI 2024] paper
  • Multimodal Reasoning with Multimodal Knowledge Graph [ACL 2024] paper
  • Interleaved-Modal Chain-of-Thought [arxiv 2024.11] paper
  • LLaVA-Critic: Learning to Evaluate Multimodal Models [arxiv 2024.10] paper
  • LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [arxiv 2024.11] paper code

Efficiency and Performance Trade-off

  • Learning How Hard to Think: Input-Adaptive Allocation of LM Computation [arxiv 2024.10] paper
  • Scaling LLM Inference with Optimized Sample Compute Allocation [arxiv 2024.10] paper code
  • Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [arxiv 2025.2] paper
  • Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies [EMNLP 2024] paper
  • Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models [ICLR 2024] paper
  • MoDE-CoTD: Chain-of-Thought Distillation for Complex Reasoning Tasks with Mixture of Decoupled LoRA-Experts [ACL 2024] paper
  • Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning [arxiv 2025.2] paper
  • LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference [arxiv 2024.7] paper
  • Token-Budget-Aware LLM Reasoning [arxiv 2024.12] paper code
  • Fast Inference from Transformers via Speculative Decoding [ICML 2023] paper
  • Compressed Chain of Thought: Efficient Reasoning Through Dense Representations [arxiv 2024.12] paper
  • Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding [arxiv 2024.6] paper code
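
Speculative decoding, listed above, has a small draft model propose tokens that the large target model verifies in a single pass: a drafted token is accepted with probability min(1, p_target/p_draft), and on rejection a replacement is sampled from the normalized residual distribution, which preserves the target distribution exactly. A NumPy sketch of the acceptance rule for one drafted token, with toy distributions:

```python
# Minimal sketch of the speculative-decoding acceptance rule
# (Leviathan et al., ICML 2023). Distributions below are toy examples.
import numpy as np

def accept_or_resample(token: int, p_draft: np.ndarray, p_target: np.ndarray,
                       rng: np.random.Generator):
    """Accept drafted token w.p. min(1, p_target/p_draft); else resample."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    # Rejected: sample from the normalized residual max(0, p_target - p_draft),
    # which keeps the overall output distribution exactly p_target.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False

rng = np.random.default_rng(0)
p_draft = np.array([0.7, 0.2, 0.1])
p_target = np.array([0.5, 0.3, 0.2])
drafted = int(rng.choice(3, p=p_draft))
token, accepted = accept_or_resample(drafted, p_draft, p_target, rng)
print(token, accepted)
```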

Scaling Law

  • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [arxiv 2024.9] paper
  • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [arxiv 2024.8] paper
  • Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [arxiv 2024.10] paper code
  • A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models [arxiv 2024.11] paper
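
Repeated-sampling scaling curves like those in Large Language Monkeys are typically reported as coverage, pass@k: the probability that at least one of k samples solves the problem. The standard unbiased estimator from n samples with c correct is 1 − C(n−c, k)/C(n, k). A small sketch:

```python
# Unbiased pass@k estimator from n samples with c correct
# (Chen et al., 2021; used in the repeated-sampling scaling studies above).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:          # too few incorrect samples: a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 1000 samples with 15 correct: estimated coverage at k = 100
print(pass_at_k(n=1000, c=15, k=100))   # ~0.8
```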

Strategy Combination

  • The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [arxiv 2024.11] paper code
  • Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions [arxiv 2024.11] paper code
  • Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [arxiv 2024.11] paper

Reference

If our survey is helpful to your research, please cite our paper:

@article{ji2025test,
  title={Test-time Computing: from System-1 Thinking to System-2 Thinking},
  author={Ji, Yixin and Li, Juntao and Ye, Hai and Wu, Kaixin and Xu, Jia and Mo, Linjian and Zhang, Min},
  journal={arXiv preprint arXiv:2501.02497},
  year={2025}
}
