LLM-as-a-Judge


LLM-as-a-Judge is a repository that includes papers discussed in a survey paper titled 'A Survey on LLM-as-a-Judge'. The repository covers various aspects of using Large Language Models (LLMs) as judges for tasks such as evaluation, reasoning, and decision-making. It provides insights into evaluation pipelines, improvement strategies, and specific tasks related to LLMs. The papers included in the repository explore different methodologies, applications, and future research directions for leveraging LLMs as evaluators in various domains.

README:


🌻 Homepage • 📖 Paper List • 📊 Meta-eval • 🌟 Arxiv • 🔗 Talk

This repo includes the papers discussed in our survey paper, A Survey on LLM-as-a-Judge.

Reference

Feel free to cite our survey if you find it useful for your research:

@article{gu2024surveyllmasajudge,
	title   = {A Survey on LLM-as-a-Judge},
	author  = {Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Yuanzhuo Wang and Jian Guo},
	year    = {2024},
	journal = {arXiv preprint arXiv: 2411.15594}
}

🔔 News

🔥 [2025-01-28] We added analysis on LLM-as-a-Judge and o1-like Reasoning Enhancement, as well as meta-evaluation results on o1-mini, Gemini-2.0-Flash-Thinking-1219, and DeepSeek-R1!

🌟 [2025-01-16] We shared and discussed the methodologies, applications (Finance, RAG, and Synthetic Data), and future research directions of LLM-as-a-Judge at BAAI Talk! 🤗 [Replay] [Methodology] [RAG & Synthetic Data]

🚀 [2024-11-23] We released A Survey on LLM-as-a-Judge, exploring LLMs as reliable, scalable evaluators and outlining key challenges and future directions!

Overview of LLM-as-a-Judge

[Figure: overview]

Evaluation Pipelines

[Figure: evaluation pipeline]

Improvement Strategies for LLM-as-a-Judge

[Figure: improvement strategies]

Table of Contents

A Survey on LLM-as-a-Judge

Paper List

1 What is LLM-as-a-Judge?

2 How to use LLM-as-a-Judge?

2.1 In-Context Learning

Generating scores
  • A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models NAACL 2024

    Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun. [Paper]

  • Generative judge for evaluating alignment. ArXiv preprint 2023

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]

  • Judgelm: Fine-tuned large language models are scalable judges. ArXiv preprint 2023

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]

  • Large Language Models are Better Reasoners with Self-Verification. EMNLP findings 2023

    Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. [Paper]

  • Benchmarking Foundation Models with Language-Model-as-an-Examiner. NeurIPS 2023

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. [Paper]

  • Human-like summarization evaluation with chatgpt. ArXiv preprint 2023

    Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. [Paper]
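
To make the score-generation setting concrete, here is a minimal sketch (not taken from any of the papers above): the judge model is prompted to return a single integer rating, and the score is parsed from its reply. The `judge` callable is a hypothetical stand-in for whatever chat-completion API is used.

```python
import re
from typing import Callable

SCORE_PROMPT = """You are an impartial evaluator.
Rate the response below for helpfulness on a scale of 1 to 10.
Reply with only the integer score.

Question: {question}
Response: {response}
Score:"""

def judge_score(question: str, response: str,
                judge: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-10 rating and parse the first integer."""
    reply = judge(SCORE_PROMPT.format(question=question, response=response))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return max(1, min(10, int(match.group())))

# Usage with a dummy judge that always answers "7".
print(judge_score("What is 2 + 2?", "4", judge=lambda prompt: "7"))
```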

Solving Yes/No questions
  • Reflexion: language agents with verbal reinforcement learning. NeurIPS 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. [Paper]

  • MacGyver: Are Large Language Models Creative Problem Solvers? NAACL 2024

    Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas Griffiths, and Faeze Brahman. [Paper]

  • Think-on-graph: Deep and responsible reasoning of large language model with knowledge graph. ArXiv preprint 2023

    Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. [Paper]

Conducting pairwise comparisons
  • Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. NAACL findings 2024

    Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. [Paper]

  • Aligning with human judgement: The role of pairwise preference in large language model evaluators. COLM 2024

    Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulic, Anna Korhonen, and Nigel Collier. [Paper]

  • LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models. EACL 2024

    Adian Liusie, Potsawee Manakul, and Mark Gales. [Paper]

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. [Paper]

  • Rrhf: Rank responses to align language models with human feedback without tears. ArXiv preprint 2023

    Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. [Paper]

  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. ArXiv preprint 2023

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. 2023. [Paper]

  • Human-like summarization evaluation with chatgpt. ArXiv preprint 2023

    Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. [Paper]
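
A minimal sketch of the pairwise setting in the spirit of MT-Bench-style judging, assuming a hypothetical `judge` callable that wraps an LLM API; the model is asked to emit exactly "A", "B", or "tie".

```python
from typing import Callable

PAIRWISE_PROMPT = """You are an impartial judge. Compare the two responses to the
question and decide which one is better.
Answer with exactly one of: "A", "B", or "tie".

Question: {question}
Response A: {a}
Response B: {b}
Verdict:"""

def pairwise_verdict(question: str, a: str, b: str,
                     judge: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'tie' for a single ordering of the two responses."""
    reply = judge(PAIRWISE_PROMPT.format(question=question, a=a, b=b)).strip().upper()
    if reply.startswith("A"):
        return "A"
    if reply.startswith("B"):
        return "B"
    return "tie"

# Usage with a dummy judge that always picks response A.
print(pairwise_verdict("What is 2 + 2?", "4", "5", judge=lambda prompt: "A"))
```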

Making multiple-choice selections

2.2 Model Selection

General LLM
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. [Paper]

  • AlpacaEval: An Automatic Evaluator of Instruction-following Models. 2023

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. [Code]

Fine-tuned LLM
  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. ArXiv preprint 2023

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. 2023. [Paper]

  • Judgelm: Fine-tuned large language models are scalable judges. ArXiv preprint 2023

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]

  • Generative judge for evaluating alignment. ArXiv preprint 2023

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]

  • Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ArXiv preprint 2023

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. [Paper]

2.3 Post-processing Method

Extracting specific tokens
  • xFinder: Robust and Pinpoint Answer Extraction for Large Language Models. ArXiv preprint 2024

    Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, and Ding Chen. [Paper]

  • MacGyver: Are Large Language Models Creative Problem Solvers? NAACL 2024

    Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas Griffiths, and Faeze Brahman. [Paper]
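
As a deliberately simple illustration of this post-processing step, the sketch below pulls a score or a choice label out of a free-form judge reply with regular expressions; trained extractors such as xFinder exist precisely because this rule-based step is brittle.

```python
import re
from typing import Optional

def extract_score(reply: str) -> Optional[float]:
    """Return the last number that follows a 'score' keyword, if any."""
    matches = re.findall(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)", reply, flags=re.IGNORECASE)
    return float(matches[-1]) if matches else None

def extract_choice(reply: str, choices=("A", "B", "C", "D")) -> Optional[str]:
    """Return the last standalone occurrence of an allowed choice label, if any."""
    pattern = r"\b(" + "|".join(re.escape(c) for c in choices) + r")\b"
    matches = re.findall(pattern, reply)
    return matches[-1] if matches else None

print(extract_score("The answer is well argued. Score: 8/10"))   # 8.0
print(extract_choice("After comparing both options, I pick B."))  # B
```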

Constrained decoding
  • Guiding LLMs the right way: fast, non-invasive constrained generation. ICML 2024

    Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. [Paper]

  • XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. ArXiv preprint 2024

    Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. [Paper]

  • SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. [Paper]
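
A toy sketch of the mechanism these engines build on: at each decoding step, tokens that would violate the expected output format are masked out before selection. Real engines enforce a full grammar over the model's tokenizer; here the "grammar" only allows the verdict tokens "A" or "B", and the logits are made up for illustration.

```python
import math

def masked_argmax(logits: dict[str, float], allowed: set[str]) -> str:
    """Greedy decoding with a constraint mask: disallowed tokens get -inf."""
    masked = {tok: (logit if tok in allowed else -math.inf)
              for tok, logit in logits.items()}
    return max(masked, key=masked.get)

# Toy next-token logits from a judge model right after the text "Verdict:".
logits = {"A": 1.2, "B": 0.9, "The": 2.5, "I": 1.8}

print(max(logits, key=logits.get))                # unconstrained: "The" (chatty preamble)
print(masked_argmax(logits, allowed={"A", "B"}))  # constrained: "A"
```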

Normalizing the output logits
  • Reasoning with Language Model is Planning with World Model. EMNLP 2023

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. [Paper]

  • Speculative rag: Enhancing retrieval augmented generation through drafting. ArXiv preprint 2024

    Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, et al. [Paper]

  • Agent-as-a-Judge: Evaluate Agents with Agents. ArXiv preprint 2024

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. [Paper]
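
A minimal sketch of this step, assuming the judge API exposes log-probabilities for the candidate answer tokens: renormalizing the "Yes"/"No" logits with a softmax turns the judge's hard label into a continuous confidence score.

```python
import math

def yes_probability(token_logprobs: dict[str, float]) -> float:
    """Renormalize the 'Yes'/'No' log-probabilities into P(Yes) with a softmax."""
    yes, no = token_logprobs["Yes"], token_logprobs["No"]
    return math.exp(yes) / (math.exp(yes) + math.exp(no))

# Example log-probabilities for the first token of the judge's answer.
print(yes_probability({"Yes": -0.3, "No": -1.5}))  # ~0.77
```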

Selecting sentences
  • Reasoning with Language Model is Planning with World Model. EMNLP 2023

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. [Paper]

2.4 Evaluation Pipeline

LLM-as-a-Judge for Models
  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. NeurIPS 2023

    Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. [Paper]

  • Large language models are not fair evaluators. ACL 2024

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. [Paper]

  • Wider and deeper llm networks are fairer llm evaluators. ArXiv preprint 2023

    Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. [Paper]

  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. [Paper]

  • SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation. Blog 2023

    Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, and Minjoon Seo. [Blog]

  • Shepherd: A Critic for Language Model Generation. ArXiv preprint 2023

    Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. [Paper]

  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. ArXiv preprint 2023

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. 2023. [Paper]

LLM-as-a-Judge for Data
  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. ArXiv preprint 2023

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. [Paper]

  • Rrhf: Rank responses to align language models with human feedback without tears. ArXiv preprint 2023

    Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. [Paper]

  • Stanford Alpaca: An Instruction-following LLaMA model. 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. [Code]

  • Languages are rewards: Hindsight finetuning using human feedback. ArXiv preprint 2023

    Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. [Paper]

  • The Wisdom of Hindsight Makes Language Models Better Instruction Followers. PMLR 2023

    Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, and Joseph E. Gonzalez. [Paper]

  • Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. NeurIPS 2023

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. [Paper]

  • Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. ArXiv preprint 2023

    Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. [Paper]

  • Self-taught evaluators. ArXiv preprint 2024

    Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. [Paper]

  • Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. ArXiv preprint 2023

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. [Paper]

  • Evaluating Object Hallucination in Large Vision-Language Models. EMNLP 2023

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. [Paper]

  • Evaluation and analysis of hallucination in large vision-language models. ArXiv preprint 2023

    Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. [Paper]

  • Aligning large multimodal models with factually augmented rlhf. ArXiv preprint 2023

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. [Paper]

  • MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. ICML 2024

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. [Paper]

LLM-as-a-Judge for Agents
  • Agent-as-a-Judge: Evaluate Agents with Agents. ArXiv preprint 2024

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. [Paper]

  • Reasoning with Language Model is Planning with World Model. EMNLP 2023

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. [Paper]

  • Reflexion: language agents with verbal reinforcement learning. NeurIPS 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. [Paper]

LLM-as-a-Judge for Reasoning/Thinking
  • Towards Reasoning in Large Language Models: A Survey. ACL findings 2023

    Jie Huang and Kevin Chen-Chuan Chang. [Paper]

  • Let's verify step by step. ICLR 2024

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. [Paper]

3 How to improve LLM-as-a-Judge?

3.1 Design Strategy of Evaluation Prompts

Few-shot prompting
  • FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. [Paper]

  • SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. ACL findings 2024

    Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. [Paper]

  • GPTScore: Evaluate as You Desire. NAACL 2024

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. [Paper]
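
A minimal sketch of few-shot evaluation prompting, with made-up in-context examples: a couple of already-scored cases are prepended so the judge can calibrate its ratings before scoring the new one.

```python
# Made-up in-context examples; in practice these come from a small calibrated set.
FEW_SHOT_EXAMPLES = [
    ("What is the capital of France?", "Paris.", 10),
    ("What is the capital of France?", "It might be Lyon or maybe Paris.", 5),
]

def build_few_shot_prompt(question: str, response: str) -> str:
    """Assemble an evaluation prompt that starts with already-scored examples."""
    lines = ["Rate each response for correctness on a scale of 1 to 10.", ""]
    for q, r, s in FEW_SHOT_EXAMPLES:
        lines += [f"Question: {q}", f"Response: {r}", f"Score: {s}", ""]
    lines += [f"Question: {question}", f"Response: {response}", "Score:"]
    return "\n".join(lines)

print(build_few_shot_prompt("What is 2 + 2?", "4"))
```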

Evaluation steps decomposition
  • G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. EMNLP 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. [Paper]

  • DHP Benchmark: Are LLMs Good NLG Evaluators? ArXiv preprint 2024

    Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. [Paper]

  • SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation. NAACL findings 2024

    Hangfeng He, Hongming Zhang, and Dan Roth. [Paper]

  • Branch-Solve-Merge Improves Large Language Model Evaluation and Generation. NAACL 2024

    Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. [Paper]
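
A minimal sketch in the spirit of G-Eval's two-stage prompting (the prompts and the `judge` callable are illustrative, not the originals): the model first drafts evaluation steps for a criterion, then scores the text by following those steps.

```python
from typing import Callable

def decomposed_evaluation(task: str, criterion: str, text: str,
                          judge: Callable[[str], str]) -> str:
    """Two calls: draft evaluation steps for the criterion, then score with them."""
    steps = judge(
        f"Task: {task}\nCriterion: {criterion}\n"
        "Write a short numbered list of steps for evaluating this criterion."
    )
    return judge(
        f"Task: {task}\nCriterion: {criterion}\n"
        f"Evaluation steps:\n{steps}\n\n"
        f"Text to evaluate:\n{text}\n\n"
        "Follow the steps above and give a score from 1 to 5. Reply with the score only."
    )

# Usage with a dummy judge that returns a fixed reply for both calls.
print(decomposed_evaluation("summarization", "coherence", "Some summary.",
                            judge=lambda prompt: "3"))
```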

Evaluation criteria decomposition
  • HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition. ACL 2024

    Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. [Paper]

  • Are LLM-based Evaluators Confusing NLG Quality Criteria? ACL 2024

    Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, and Xiaojun Wan. [Paper]

Shuffling contents
  • Large language models are not fair evaluators. ACL 2024

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. [Paper]

  • Generative judge for evaluating alignment. ArXiv preprint 2023

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]

  • Judgelm: Fine-tuned large language models are scalable judges. ArXiv preprint 2023

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]

  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. ArXiv preprint 2023

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. 2023. [Paper]
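
A minimal sketch of one common shuffling scheme, assuming a `compare` callable such as the pairwise sketch earlier in this list: the judge is queried with both orderings, and a verdict is kept only if it survives the swap; otherwise the result falls back to a tie.

```python
from typing import Callable

def debiased_verdict(question: str, a: str, b: str,
                     compare: Callable[[str, str, str], str]) -> str:
    """Query the judge with both orderings; keep the verdict only if it is
    consistent under the swap, otherwise fall back to a tie."""
    first = compare(question, a, b)    # 'A' here means the first argument won
    second = compare(question, b, a)
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == swapped else "tie"

# A deliberately position-biased dummy judge that always prefers whichever
# response it sees first: the swap test exposes the bias and yields a tie.
print(debiased_verdict("q", "response 1", "response 2",
                       compare=lambda q, x, y: "A"))  # tie
```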

Conversion of evaluation tasks
  • Aligning with human judgement: The role of pairwise preference in large language model evaluators. COLM 2024

    Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulic, Anna Korhonen, and Nigel Collier. [Paper]

Constraining outputs in structured formats
  • G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. EMNLP 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. [Paper]

  • DHP Benchmark: Are LLMs Good NLG Evaluators? ArXiv preprint 2024

    Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. [Paper]

  • LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models. NLP4ConvAI 2023

    Yen-Ting Lin and Yun-Nung Chen. [Paper]

Providing evaluations with explanations
  • CLAIR: Evaluating Image Captions with Large Language Models. EMNLP 2023

    David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. [Paper]

  • FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. ACL 2024

    Yebin Lee, Imseong Park, and Myungjoo Kang. [Paper]

3.2 Improvement Strategy of LLMs' Abilities

Fine-tuning via Meta Evaluation Dataset
  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. ArXiv preprint 2023

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. 2023. [Paper]

  • SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. ACL findings 2024

    Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. [Paper]

  • Offsetbias: Leveraging debiased data for tuning evaluators. ArXiv preprint 2024

    Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, and Sanghyuk Choi. [Paper]

  • Judgelm: Fine-tuned large language models are scalable judges. ArXiv preprint 2023

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]

  • CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. ACL 2024

    Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. [Paper]

Iterative Optimization Based on Feedback
  • INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback. EMNLP 2023

    Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, and Lei Li. [Paper]

  • Jade: A linguistics-based safety evaluation platform for llm. ArXiv preprint 2023

    Mi Zhang, Xudong Pan, and Min Yang. [Paper]

3.3 Optimization Strategy of Final Results

Summarize by multiple rounds
  • Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks. EMNLP 2023

    Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. [Paper]

  • On the humanity of conversational ai: Evaluating the psychological portrayal of llms. ICLR 2024

    Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. [Paper]

  • Generative judge for evaluating alignment. ArXiv preprint 2023

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]

Vote by multiple LLMs
  • Goal-Oriented Prompt Attack and Safety Evaluation for LLMs. ArXiv preprint 2023

    Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. [Paper]

  • Benchmarking Foundation Models with Language-Model-as-an-Examiner. NeurIPS 2023

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. [Paper]
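
A minimal sketch of judge voting, with dummy judges standing in for different LLM backends: each judge returns a verdict and the majority wins.

```python
from collections import Counter
from typing import Callable, Sequence

def majority_verdict(prompt: str,
                     judges: Sequence[Callable[[str], str]]) -> str:
    """Collect one verdict per judge model and return the most common one."""
    votes = [judge(prompt) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Three dummy judges standing in for different LLM backends.
print(majority_verdict("Is the answer correct? Reply Yes or No.",
                       [lambda p: "Yes", lambda p: "Yes", lambda p: "No"]))  # Yes
```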

Score smoothing
  • FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. ACL 2024

    Yebin Lee, Imseong Park, and Myungjoo Kang. [Paper]

  • G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. EMNLP 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. [Paper]

  • DHP Benchmark: Are LLMs Good NLG Evaluators? ArXiv preprint 2024

    Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. [Paper]
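
A minimal sketch of G-Eval-style score smoothing, assuming access to the log-probabilities of the candidate score tokens at the answer position: the final score is the probability-weighted average rather than the argmax.

```python
import math

def smoothed_score(score_logprobs: dict[int, float]) -> float:
    """Probability-weighted average of the candidate scores (instead of argmax)."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(s * p for s, p in probs.items()) / total

# Log-probabilities of the score tokens "1".."5" at the judge's answer position.
print(smoothed_score({1: -4.0, 2: -3.0, 3: -1.2, 4: -0.6, 5: -2.0}))  # ~3.70
```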

Self validation
  • TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. EMNLP 2023

    Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. [Paper]
