
LLM-as-a-Judge
Stars: 79

LLM-as-a-Judge is a repository collecting the papers discussed in the survey 'A Survey on LLM-as-a-Judge'. It covers the use of Large Language Models (LLMs) as judges for evaluation, reasoning, and decision-making, with insights into evaluation pipelines, improvement strategies, and task-specific applications. The collected papers explore methodologies, applications, and future research directions for leveraging LLMs as evaluators across domains.
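To make that scope concrete: most of the evaluation pipelines catalogued below share one core pattern, in which an LLM is prompted with a task instruction and one or two candidate responses and asked to score or compare them. Below is a minimal, illustrative sketch of pairwise judging with position swapping, a common mitigation for the position bias examined in papers such as "Large language models are not fair evaluators". It is only a sketch: `call_llm`, the prompt wording, and the return labels are hypothetical placeholders, not an implementation from the survey or the repository.

```python
# Minimal illustrative sketch of a pairwise LLM-as-a-Judge comparison with
# position swapping. `call_llm` is a hypothetical placeholder, not an API
# defined by the survey or the repository.

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" if Answer A is better, "B" if Answer B is better,
or "TIE" if they are of comparable quality.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Verdict:"""


def call_llm(prompt: str) -> str:
    """Hypothetical model call; wire this to whatever chat-completion client you use."""
    raise NotImplementedError


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Query the judge with the answers in both orders; accept only consistent verdicts."""
    first = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip().upper()
    second = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip().upper()

    if first == "A" and second == "B":
        return "answer_1"   # answer_1 preferred in both orderings
    if first == "B" and second == "A":
        return "answer_2"   # answer_2 preferred in both orderings
    return "tie"            # inconsistent verdicts or an explicit tie
```

Requiring agreement across both answer orderings is one of the simplest consistency checks; the paper list below covers more elaborate strategies, from multi-agent examiners to fine-tuned judge models such as JudgeLM, PandaLM, and Prometheus.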
README:
Homepage • Paper List • Meta-eval • Arxiv • Talk
This repo includes the papers discussed in our survey paper A Survey on LLM-as-a-Judge.
Feel free to cite our survey if you find it useful for your research:
@article{gu2024surveyllmasajudge,
title = {A Survey on LLM-as-a-Judge},
author = {Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Yuanzhuo Wang and Jian Guo},
year = {2024},
journal = {arXiv preprint arXiv:2411.15594}
}
[2025-01-28] We added analysis on LLM-as-a-Judge and o1-like Reasoning Enhancement, as well as meta-evaluation results on o1-mini, Gemini-2.0-Flash-Thinking-1219, and DeepSeek-R1!
[2025-01-16] We shared and discussed the methodologies, applications (Finance, RAG, and Synthetic Data), and future research directions of LLM-as-a-Judge at BAAI Talk! [Replay] [Methodology] [RAG & Synthetic Data]
[2024-11-23] We released A Survey on LLM-as-a-Judge, exploring LLMs as reliable, scalable evaluators and outlining key challenges and future directions!
- Reference
- Overview of LLM-as-a-Judge
- Evaluation Pipelines of LLM-as-a-Judge
- Improvement Strategies for LLM-as-a-Judge
- Table of Contents
- Paper List
  - 1 What is LLM-as-a-Judge?
  - 2 How to use LLM-as-a-Judge?
  - 3 How to improve LLM-as-a-Judge?
-
A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models
NAACL
2024
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun. [Paper]
-
Generative judge for evaluating alignment.
ArXiv preprint
2023
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]
-
Judgelm: Fine-tuned large language models are scalable judges.
ArXiv preprint
2023
Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]
-
Large Language Models are Better Reasoners with Self-Verification.
EMNLP findings
2023
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. [Paper]
-
Benchmarking Foundation Models with Language-Model-as-an-Examiner.
NeurIPS
2023
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. [Paper]
-
Human-like summarization evaluation with chatgpt.
ArXiv preprint
2023
Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. [Paper]
-
Reflexion: language agents with verbal reinforcement learning.
NeurIPS
2023
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. [Paper]
-
MacGyver: Are Large Language Models Creative Problem Solvers?
NAACL
2024
Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas Griffiths, and Faeze Brahman. [Paper]
-
Think-on-graph: Deep and responsible reasoning of large language model with knowledge graph.
ArXiv preprint
2023
Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian Guo. [Paper]
-
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.
NAACL findings
2024
Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. [Paper]
-
Aligning with human judgement: The role of pairwise preference in large language model evaluators.
COLM
2024
Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulic, Anna Korhonen, and Nigel Collier. [Paper]
-
LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models.
EACL
2024
Adian Liusie, Potsawee Manakul, and Mark Gales. [Paper]
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
NeurIPS
2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. [Paper]
-
Rrhf: Rank responses to align language models with human feedback without tears.
ArXiv preprint
2023
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. [Paper]
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization.
ArXiv preprint
2023
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. [Paper]
-
Human-like summarization evaluation with chatgpt.
ArXiv preprint
2023
Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. [Paper]
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
NeurIPS
2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. [Paper]
-
AlpacaEval: An Automatic Evaluator of Instruction-following Models.
2023
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. [Code]
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization.
ArXiv preprint
2023
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. [Paper]
-
Judgelm: Fine-tuned large language models are scalable judges.
ArXiv preprint
2023
Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]
-
Generative judge for evaluating alignment.
ArXiv preprint
2023
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]
-
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models.
ArXiv preprint
2023
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. [Paper]
-
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models.
ArXiv preprint
2024
Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, and Ding Chen. [Paper]
-
MacGyver: Are Large Language Models Creative Problem Solvers?
NAACL
2024
Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas Griffiths, and Faeze Brahman. [Paper]
-
Guiding LLMs the right way: fast, non-invasive constrained generation.
ICML
2024
Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. [Paper]
-
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models.
ArXiv preprint
2024
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. [Paper]
-
SGLang: Efficient Execution of Structured Language Model Programs.
NeurIPS
2025
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. [Paper]
-
Reasoning with Language Model is Planning with World Model.
EMNLP
2023
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. [Paper]
-
Speculative rag: Enhancing retrieval augmented generation through drafting.
ArXiv preprint
2024
Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, et al. [Paper]
-
Agent-as-a-Judge: Evaluate Agents with Agents.
ArXiv preprint
2024
Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. [Paper]
-
Reasoning with Language Model is Planning with World Model.
EMNLP
2023
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. [Paper]
-
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.
NeurIPS
2023
Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. [Paper]
-
Large language models are not fair evaluators.
ACL
2024
Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. [Paper]
-
Wider and deeper llm networks are fairer llm evaluators.
ArXiv preprint
2023
Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. [Paper]
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
NeurIPS
2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. [Paper]
-
SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation.
Blog
2023
Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, and Minjoon Seo. [Blog]
-
Shepherd: A Critic for Language Model Generation.
ArXiv preprint
2023
Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean O'Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. [Paper]
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization.
ArXiv preprint
2023
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. [Paper]
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment.
ArXiv preprint
2023
Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. [Paper]
-
Rrhf: Rank responses to align language models with human feedback without tears.
ArXiv preprint
2023
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. [Paper]
-
Stanford Alpaca: An Instruction-following LLaMA model.
2023
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. [Code]
-
Languages are rewards: Hindsight finetuning using human feedback.
ArXiv preprint
2023
Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. [Paper]
-
The Wisdom of Hindsight Makes Language Models Better Instruction Followers.
PMLR
2023
Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, and Joseph E. Gonzalez. [Paper]
-
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision.
NeurIPS
2023
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang Gan. [Paper]
-
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.
ArXiv preprint
2023
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. [Paper]
-
Self-taught evaluators.
ArXiv preprint
2024
Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, and Xian Li. [Paper]
-
Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges.
ArXiv preprint
2023
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. [Paper]
-
Evaluating Object Hallucination in Large Vision-Language Models.
EMNLP
2023
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. [Paper]
-
Evaluation and analysis of hallucination in large vision-language models.
ArXiv preprint
2023
Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. [Paper]
-
Aligning large multimodal models with factually augmented rlhf.
ArXiv preprint
2023
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. [Paper]
-
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.
ICML
2024
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. [Paper]
-
Agent-as-a-Judge: Evaluate Agents with Agents.
ArXiv preprint
2024
Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. [Paper]
-
Reasoning with Language Model is Planning with World Model.
EMNLP
2023
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. [Paper]
-
Reflexion: language agents with verbal reinforcement learning.
NeurIPS
2023
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. [Paper]
-
Towards Reasoning in Large Language Models: A Survey.
ACL findings
2023
Jie Huang and Kevin Chen-Chuan Chang. [Paper]
-
Let's verify step by step.
ICLR
2023
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. [Paper]
-
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation.
EMNLP
2023
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. [Paper]
-
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models.
ACL findings
2024
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. [Paper]
-
GPTScore: Evaluate as You Desire.
NAACL
2024
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. [Paper]
-
G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment.
EMNLP
2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. [Paper]
-
DHP Benchmark: Are LLMs Good NLG Evaluators?
ArXiv preprint
2024
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. [Paper]
-
SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation.
NAACL findings
2024
Hangfeng He, Hongming Zhang, and Dan Roth. [Paper]
-
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation.
NAACL
2024
Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. [Paper]
-
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition.
ACL
2024
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. [Paper]
-
Are LLM-based Evaluators Confusing NLG Quality Criteria?
ACL
2024
Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, and Xiaojun Wan. [Paper]
-
Large language models are not fair evaluators.
ACL
2024
Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. [Paper]
-
Generative judge for evaluating alignment.
ArXiv preprint
2023
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]
-
Judgelm: Fine-tuned large language models are scalable judges.
ArXiv preprint
2023
Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization.
ArXiv preprint
2023
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. [Paper]
-
Aligning with human judgement: The role of pairwise preference in large language model evaluators.
COLM
2024
Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulic, Anna Korhonen, and Nigel Collier. [Paper]
-
G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment.
EMNLP
2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. [Paper]
-
DHP Benchmark: Are LLMs Good NLG Evaluators?
ArXiv preprint
2024
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. [Paper]
-
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models.
NLP4ConvAI
2023
Yen-Ting Lin and Yun-Nung Chen. [Paper]
-
CLAIR: Evaluating Image Captions with Large Language Models.
EMNLP
2023
David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. [Paper]
-
FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model.
ACL
2024
Yebin Lee, Imseong Park, and Myungjoo Kang. [Paper]
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization.
ArXiv preprint
2023
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, et al. [Paper]
-
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models.
ACL findings
2024
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. [Paper]
-
Offsetbias: Leveraging debiased data for tuning evaluators.
ArXiv preprint
2024
Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, and Sanghyuk Choi. [Paper]
-
Judgelm: Fine-tuned large language models are scalable judges.
ArXiv preprint
2023
Lianghui Zhu, Xinggang Wang, and Xinlong Wang. [Paper]
-
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation.
ACL
2024
Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. [Paper]
-
INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback.
EMNLP
2023
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, and Lei Li. [Paper]
-
Jade: A linguistics-based safety evaluation platform for llm.
ArXiv preprint
2023
Mi Zhang, Xudong Pan, and Min Yang. [Paper]
-
Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks.
EMNLP
2023
Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. [Paper]
-
On the humanity of conversational ai: Evaluating the psychological portrayal of llms.
ICLR
2023
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. [Paper]
-
Generative judge for evaluating alignment.
ArXiv preprint
2023
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. [Paper]
-
Goal-Oriented Prompt Attack and Safety Evaluation for LLMs.
ArXiv preprint
2023
Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. [Paper]
-
Benchmarking Foundation Models with Language-Model-as-an-Examiner.
NeurIPS
2023
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. [Paper]
-
FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model.
ACL
2024
Yebin Lee, Imseong Park, and Myungjoo Kang. [Paper]
-
G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment.
EMNLP
2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. [Paper]
-
DHP Benchmark: Are LLMs Good NLG Evaluators?
ArXiv preprint
2024
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, and Xia Hu. [Paper]
-
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models.
EMNLP
2023
Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. [Paper]
Alternative AI tools for LLM-as-a-Judge
Similar Open Source Tools


LLMTSCS
LLMLight is a novel framework that employs Large Language Models (LLMs) as decision-making agents for Traffic Signal Control (TSC). The framework leverages the advanced generalization capabilities of LLMs to engage in a reasoning and decision-making process akin to human intuition for effective traffic control. LLMLight has been demonstrated to be remarkably effective, generalizable, and interpretable against various transportation-based and RL-based baselines on nine real-world and synthetic datasets.

LLM-Tuning
LLM-Tuning is a collection of tools and resources for fine-tuning large language models (LLMs). It includes a library of pre-trained LoRA models, a set of tutorials and examples, and a community forum for discussion and support. LLM-Tuning makes it easy to fine-tune LLMs for a variety of tasks, including text classification, question answering, and dialogue generation. With LLM-Tuning, you can quickly and easily improve the performance of your LLMs on downstream tasks.

VLMEvalKit
VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.

LLM-SFT
LLM-SFT is a Chinese large model fine-tuning tool that supports models such as ChatGLM, LlaMA, Bloom, Baichuan-7B, and frameworks like LoRA, QLoRA, DeepSpeed, UI, and TensorboardX. It facilitates tasks like fine-tuning, inference, evaluation, and API integration. The tool provides pre-trained weights for various models and datasets for Chinese language processing. It requires specific versions of libraries like transformers and torch for different functionalities.

Scientific-LLM-Survey
Scientific Large Language Models (Sci-LLMs) is a repository that collects papers on scientific large language models, focusing on biology and chemistry domains. It includes textual, molecular, protein, and genomic languages, as well as multimodal language. The repository covers various large language models for tasks such as molecule property prediction, interaction prediction, protein sequence representation, protein sequence generation/design, DNA-protein interaction prediction, and RNA prediction. It also provides datasets and benchmarks for evaluating these models. The repository aims to facilitate research and development in the field of scientific language modeling.

AwesomeResponsibleAI
Awesome Responsible AI is a curated list of academic research, books, code of ethics, courses, data sets, frameworks, institutes, newsletters, principles, podcasts, reports, tools, regulations, and standards related to Responsible, Trustworthy, and Human-Centered AI. It covers various concepts such as Responsible AI, Trustworthy AI, Human-Centered AI, Responsible AI frameworks, AI Governance, and more. The repository provides a comprehensive collection of resources for individuals interested in ethical, transparent, and accountable AI development and deployment.

TrustEval-toolkit
TrustEval-toolkit is a dynamic and comprehensive framework for evaluating the trustworthiness of Generative Foundation Models (GenFMs) across dimensions such as safety, fairness, robustness, privacy, and more. It offers features like dynamic dataset generation, multi-model compatibility, customizable metrics, metadata-driven pipelines, comprehensive evaluation dimensions, optimized inference, and detailed reports.

VITA
VITA is an open-source interactive omni multimodal Large Language Model (LLM) capable of processing video, image, text, and audio inputs simultaneously. It stands out with features like Omni Multimodal Understanding, Non-awakening Interaction, and Audio Interrupt Interaction. VITA can respond to user queries without a wake-up word, track and filter external queries in real-time, and handle various query inputs effectively. The model utilizes state tokens and a duplex scheme to enhance the multimodal interactive experience.

KG-LLM-Papers
KG-LLM-Papers is a repository that collects papers integrating knowledge graphs (KGs) and large language models (LLMs). It serves as a comprehensive resource for research on the role of KGs in the era of LLMs, covering surveys, methods, and resources related to this integration.

OmniGibson
OmniGibson is a platform for accelerating Embodied AI research built upon NVIDIA's Omniverse platform. It features photorealistic visuals, physical realism, fluid and soft body support, large-scale high-quality scenes and objects, dynamic kinematic and semantic object states, mobile manipulator robots with modular controllers, and an OpenAI Gym interface. The platform provides a comprehensive environment for researchers to conduct experiments and simulations in the field of Embodied AI.

search_with_ai
Build your own conversation-based search with AI: a simple implementation with Node.js & Vue3. Features include built-in support for LLM providers (OpenAI, Google, Lepton, Ollama (free)), built-in support for search engines (Bing, Sogou, Google, SearXNG (free)), a customizable UI, dark mode, mobile display, local LLMs via Ollama, i18n, and continued Q&A with context.

gpt-bitcoin
The gpt-bitcoin repository is focused on creating an automated trading system for Bitcoin using GPT AI technology. It provides different versions of trading strategies utilizing various data sources such as OHLCV, Moving Averages, RSI, Stochastic Oscillator, MACD, Bollinger Bands, Orderbook Data, news data, fear/greed index, and chart images. Users can set up the system by creating a .env file with necessary API keys and installing required dependencies. The repository also includes instructions for setting up the environment on local machines and AWS EC2 Ubuntu servers. The future plan includes expanding the system to support other cryptocurrency exchanges like Bithumb, Binance, Coinbase, OKX, and Bybit.

LongLLaVA
LongLLaVA is a tool for scaling multi-modal LLMs to 1000 images efficiently via hybrid architecture. It includes stages for single-image alignment, instruction-tuning, and multi-image instruction-tuning, with evaluation through a command line interface and model inference. The tool aims to achieve GPT-4V level capabilities and beyond, providing reproducibility of results and benchmarks for efficiency and performance.

mlx-llm
mlx-llm is a library that allows you to run Large Language Models (LLMs) on Apple Silicon devices in real-time using Apple's MLX framework. It provides a simple and easy-to-use API for creating, loading, and using LLM models, as well as a variety of applications such as chatbots, fine-tuning, and retrieval-augmented generation.
For similar tasks

labelbox-python
Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

promptfoo
Promptfoo is a tool for testing and evaluating LLM output quality. With promptfoo, you can build reliable prompts, models, and RAGs with benchmarks specific to your use-case, speed up evaluations with caching, concurrency, and live reloading, score outputs automatically by defining metrics, use as a CLI, library, or in CI/CD, and use OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, or integrate custom API providers for any LLM API.

vespa
Vespa is a platform that performs operations such as selecting a subset of data in a large corpus, evaluating machine-learned models over the selected data, organizing and aggregating it, and returning it, typically in less than 100 milliseconds, all while the data corpus is continuously changing. It has been in development for many years and is used on a number of large internet services and apps which serve hundreds of thousands of queries from Vespa per second.

python-aiplatform
The Vertex AI SDK for Python is a library that provides a convenient way to use the Vertex AI API. It offers a high-level interface for creating and managing Vertex AI resources, such as datasets, models, and endpoints. The SDK also provides support for training and deploying custom models, as well as using AutoML models. With the Vertex AI SDK for Python, you can quickly and easily build and deploy machine learning models on Vertex AI.

ScandEval
ScandEval is a framework for evaluating pretrained language models on mono- or multilingual language tasks. It provides a unified interface for benchmarking models on a variety of tasks, including sentiment analysis, question answering, and machine translation. ScandEval is designed to be easy to use and extensible, making it a valuable tool for researchers and practitioners alike.

opencompass
OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark. Its main features include:
- Comprehensive support for models and datasets: pre-support for 20+ HuggingFace and API models, and an evaluation scheme covering 70+ datasets with about 400,000 questions, assessing model capabilities across five dimensions.
- Efficient distributed evaluation: a single command handles task division and distributed evaluation, completing a full evaluation of billion-scale models in just a few hours.
- Diversified evaluation paradigms: zero-shot, few-shot, and chain-of-thought evaluations, combined with standard or dialogue-style prompt templates, to elicit the best performance from various models.
- Modular design with high extensibility: new models, datasets, custom task-division strategies, and even new cluster management systems can all be added easily.
- Experiment management and reporting: config files fully record each experiment, with support for real-time reporting of results.

flower
Flower is a framework for building federated learning systems. It is designed to be customizable, extensible, framework-agnostic, and understandable. Flower can be used with any machine learning framework, for example, PyTorch, TensorFlow, Hugging Face Transformers, PyTorch Lightning, scikit-learn, JAX, TFLite, MONAI, fastai, MLX, XGBoost, Pandas for federated analytics, or even raw NumPy for users who enjoy computing gradients by hand.

thinc
Thinc is a lightweight deep learning library that offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow and MXNet. You can use Thinc as an interface layer, a standalone toolkit or a flexible way to develop new models.
For similar jobs

asreview
The ASReview project implements active learning for systematic reviews, utilizing AI-aided pipelines to assist in finding relevant texts for search tasks. It accelerates the screening of textual data with minimal human input, saving time and increasing output quality. The software offers three modes: Oracle for interactive screening, Exploration for teaching purposes, and Simulation for evaluating active learning models. ASReview LAB is designed to support decision-making in any discipline or industry by improving efficiency and transparency in screening large amounts of textual data.

NewEraAI-Papers
The NewEraAI-Papers repository provides links to collections of influential and interesting research papers from top AI conferences, along with open-source code to promote reproducibility and provide detailed implementation insights beyond the scope of the article. Users can stay up to date with the latest advances in AI research by exploring this repository. Contributions to improve the completeness of the list are welcomed, and users can create pull requests, open issues, or contact the repository owner via email to enhance the repository further.

cltk
The Classical Language Toolkit (CLTK) is a Python library that provides natural language processing (NLP) capabilities for pre-modern languages. It offers a modular processing pipeline with pre-configured defaults and supports almost 20 languages. Users can install the latest version using pip and access detailed documentation on the official website. The toolkit is designed to meet the unique needs of researchers working with historical languages, filling a void in the NLP landscape that often neglects non-spoken languages and different research goals.

Conference-Acceptance-Rate
The 'Conference-Acceptance-Rate' repository provides acceptance rates for top-tier AI-related conferences in the fields of Natural Language Processing, Computational Linguistics, Computer Vision, Pattern Recognition, Machine Learning, Learning Theory, Artificial Intelligence, Data Mining, Information Retrieval, Speech Processing, and Signal Processing. The data includes acceptance rates for long papers and short papers over several years for each conference, allowing researchers to track trends and make informed decisions about where to submit their work.

pdftochat
PDFToChat is a tool that allows users to chat with their PDF documents in seconds. It is powered by Together AI and Pinecone, with a tech stack that includes Next.js, Mixtral, M2 Bert, LangChain.js, MongoDB Atlas, Bytescale, Vercel, Clerk, and Tailwind CSS. Users can deploy the tool to Vercel or any other host by setting up Together.ai, a MongoDB Atlas database, Bytescale, Clerk, and Vercel. Planned future work includes adding a trash icon for deleting PDFs, exploring different embedding models, implementing auto scrolling, improving replies, benchmarking accuracy, researching chunking and retrieval best practices, adding a demo video, upgrading to Next.js 14, adding analytics, customizing Tailwind prose, saving chats in a Postgres DB, compressing large PDFs, implementing a custom uploader, session tracking, error handling, and support for images in PDFs.

Awesome-LLM-Strawberry
Awesome LLM Strawberry is a collection of research papers and blogs related to OpenAI Strawberry (o1) and reasoning. The repository is continuously updated to track the frontier of LLM reasoning.

Call-for-Reviewers
The `Call-for-Reviewers` repository aims to collect the latest 'call for reviewers' links from various top CS/ML/AI conferences and journals. It provides an opportunity for individuals in the computer science, machine learning, and artificial intelligence fields to gain review experience for applying for NIW/H1B/EB1 or enhancing their CV. The repository helps users stay updated with the latest research trends and engage with the academic community.

Smart-Connections-Visualizer
The Smart Connections Visualizer Plugin is a tool designed to enhance note-taking and information visualization by creating dynamic force-directed graphs that represent connections between notes or excerpts. Users can customize visualization settings, preview notes, and interact with the graph to explore relationships and insights within their notes. The plugin aims to revolutionize communication with AI and improve decision-making processes by visualizing complex information in a more intuitive and context-driven manner.