chinese-llm-benchmark
Chinese LLM capability leaderboard: currently covers 115 large models, including commercial models such as ChatGPT, GPT-4o, Baidu ERNIE Bot, Alibaba Tongyi Qianwen, iFLYTEK Spark, SenseTime SenseChat, and MiniMax, as well as open-source models such as Baichuan, qwen2, glm4, yi, InternLM2, and llama3, evaluated across multiple capability dimensions. Beyond the capability score rankings, the raw outputs of all models are also provided!
Stars: 2389
The Chinese LLM Benchmark (CLiB) is a continuously updated leaderboard of large language models, covering a wide range of commercial and open-source models from companies and research institutions at home and abroad. It supports multidimensional capability evaluation, including classification, information extraction, reading comprehension, data analysis, Chinese encoding efficiency, and Chinese instruction following. Beyond the capability score rankings, it also publishes the raw outputs of every model, so anyone interested can score and rank the models themselves.
README:
- Currently covers 115 large models, including commercial models such as ChatGPT, GPT-4, Baidu ERNIE Bot, Alibaba Tongyi Qianwen, iFLYTEK Spark, SenseTime SenseChat, and MiniMax, as well as open-source models such as Baichuan, qwen2, glm4, openbuddy, AquilaChat, InternLM2, and llama3.1.
- Models come from major companies at home and abroad, LLM startups, and university research institutions.
- Supports multi-dimensional capability evaluation, including classification, information extraction, reading comprehension, data analysis, Chinese encoding efficiency, and Chinese instruction following.
- Besides the capability score rankings, the raw outputs of all models are also provided, so anyone interested can score and rank the models themselves!
- [2024/9/21] Added the release history of open-source large models, link
- [2024/8/27] Released the v2.1 leaderboard
- Added 2 models: Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct
- Improved the model information summary list, link
- [2024/8/7] Released the v2.0 leaderboard
- Chinese instruction following and arithmetic now count toward the total score, which serves as the basis for the overall capability ranking
- Added 7 models: qwen2-1.5b-instruct, qwen2-0.5b-instruct, qwen2-57b-a14b-instruct, internlm2-chat-1_8b, internlm2_5-7b-chat, gemma-2-9b-it, DeepSeek-V2-Lite-Chat
- Removed outdated models: aquilachat2-34b, AquilaChat2-70B-Expr, BlueLM-7B-Chat, openbuddy-deepseek-67b, openbuddy-mixtral-7bx8, tigerbot-13b-chat-v4, Yi-34B-Chat
- [2024/7/26] Released the v1.21 leaderboard
- Added 8 models: gpt-4o-mini, iFLYTEK Spark 4.0 Ultra, ERNIE-3.5-8K, ERNIE-4.0-Turbo-8K, ERNIE-Speed-8K, Doubao-lite-32k, Doubao-pro-32k, qwen-turbo
- Removed outdated models: baichuan3, minimax-abab6-chat, tigerbot-70b-chat-v3, openbuddy-zephyr-7b, Yi-6B-Chat, chatglm3-6b, Llama-2-70b-chat, miniCPM-2B-sft, openbuddy-mistral-7b
- [2024/7/15] Released the v1.20 leaderboard
- Added 10 models: gpt-4o, yi-spark, qwen-plus, qwen-long, GLM-4-AirX, GLM-4-Air, GLM-4-Flash, Baichuan4, Baichuan3-Turbo, SenseChat-Turbo
- Updated model: SenseTime SenseChat
- Removed outdated models: Google Bard, openbuddy-llama2-70b, XVERSE-65B-Chat, Microsoft new-bing, BELLE-Llama2-13B-chat-0.4M, iFLYTEK Spark v2.0, chatglm-std, chatglm-pro, chatglm-turbo, ByteDance Doubao
- [2024/6/29] Released the v1.19 leaderboard
- Added a basic math (arithmetic) leaderboard
- Added 5 models: Qwen2-7B-Instruct, Qwen2-72B-Instruct, glm-4-9b-chat, Yi-1.5-9B-Chat, Yi-1.5-34B-Chat
- Removed outdated models from the leaderboard
- [2024/6/2] Released the v1.18 leaderboard
- Added 6 models: abab6.5-chat, abab6.5s-chat, deepseek-chat-v2, yi-large, yi-large-turbo, yi-medium
- Added more models to the Chinese instruction-following leaderboard
- Removed outdated models from the leaderboard
- [2024/5/8] Released the v1.17 leaderboard
- Added a Chinese instruction-following leaderboard
- Added 4 models: Llama-3-8B-Instruct, Llama-3-70B-Instruct, openbuddy-llama3-8b, Phi-3-mini-128k-instruct
- [2024/4/13] Released the v1.16 leaderboard
- Added a Chinese encoding efficiency leaderboard; among models of the same size, higher encoding efficiency means faster inference, almost in direct proportion
- Updated model: minimax upgraded to minimax-abab6-chat
- Added 3 models: Qwen1.5-32B-Chat, minimax-abab5.5-chat, minimax-abab5.5s-chat
- [2024/3/20] Released the v1.15 leaderboard
- Updated model: gpt3.5 upgraded to the latest version
- Added 8 models: gpt-4-turbo, iFLYTEK Spark v3.5, MiniCPM-2B-dpo, miniCPM-2B-sft, AquilaChat2-70B-Expr, Moonshot KimiChat, Google gemma-7b-it, Google gemma-2b-it
- Removed outdated models from the leaderboard (e.g. Baichuan2-53B, chatglm-130b-v1, tulu-30b, belle-llama-13b-2m, belle-llama-13b-ext, openbuddy-llama-30b-v7.1, vicuna-33b)
- [2024/2/28] Released the v1.14 leaderboard
- Added 11 models: deepseek-llm-67b-chat, baichuan3, internlm2-chat-20b, internlm2-chat-7b, openbuddy-mixtral-7bx8-v17.1, plus 6 models from the qwen1.5 series
- Removed outdated models from the leaderboard (e.g. chatglm2-6b, AquilaChat-7B)
- [2024/1/29] Released the v1.13 leaderboard
- Updated models: Microsoft new-bing and ERNIE Bot 4.0 updated to their January 2024 versions
- Added 6 models: qwen-max, GLM4, BlueLM-7B-Chat, openbuddy-zephyr-7b-v14.1, openbuddy-deepseek-67b-v15.2, XVERSE-65B-Chat
- Removed outdated models from the leaderboard (e.g. phoenix-inst-chat-7b, BELLE-on-Open-Datasets)
- [2023/12/10] Released the v1.12 leaderboard
- Added 7 models: Yi-34B-Chat, tigerbot-13b-chat-v4, openbuddy-openllama-3b-v10, Qwen-1_8B-Chat, Yi-6B-Chat, Qwen-72B-Chat, chatglm-turbo
- Added size-specific leaderboards for open-source models: under 10B, 10B–20B, and over 20B
- [2023/11/22] Released the v1.11 leaderboard
- Added 4 models: openbuddy-mistral-7b-v13.1, Qwen-7B-Chat, Baichuan2-7B-Chat, tigerbot-70b-chat-v3
- Data analysis now counts toward the overall score
- [2023/11/5] Released the v1.10 leaderboard
- Added 6 models:
- 3 commercial models: ERNIE Bot 4.0, Google Bard, iFLYTEK Spark v3
- 3 open-source models: aquilachat2-34b, ziya2-13b-chat, chatglm3-6b
- Removed outdated models from the leaderboard (e.g. the first-generation chatglm-6b, MOSS)
- [2023/10/11] Released the v1.9 leaderboard
- Added 7 models:
- 3 commercial models: Alibaba Tongyi Qianwen v1.0.7, Doubao, Baichuan2-53B
- 4 open-source models: Baichuan2-13B-Chat, internlm-chat-20b, qwen-14b-chat, tigerbot-70b-chat-v2
- [2023/9/13] Released the v1.8 leaderboard
- Added 7 models:
- 2 commercial models: chatglm-std, chatglm-pro
- 5 open-source models: openbuddy-llama-30b-v7.1, openbuddy-llama-65b-v8, openbuddy-llama2-70b-v10.1, xverse-13b-chat, Baichuan-13B-Chat-v2
- [2023/8/29] Released the v1.7 leaderboard
- Added 2 commercial models: iFLYTEK Spark v2.0, Baichuan-53B
- Table QA (data analysis) leaderboard: 21 more models added to the ranking
- [2023/8/13] Released the v1.6 leaderboard, link
- Added 4 models:
- 2 commercial models: SenseTime SenseChat, Microsoft new-bing
- 2 open-source Chinese models based on LLaMA2: BELLE-Llama2-13B-chat-0.4M, Linly-Chinese-LLaMA2-13B
- [2023/7/26] Released the v1.5 leaderboard, link
- Added 7 models: gpt4, ERNIE Bot v2.2, vicuna-33b, wizardlm-13b, Ziya-LLaMA-13B-v1.1, InternLM-Chat-7B, Llama-2-70b-chat
- [2023/7/18] Released the v1.4 leaderboard, link
- Added 3 models: tulu-30b, chatglm2-6b, Baichuan-13B-Chat
- [2023/7/2] Released the v1.3 leaderboard, link
- Added 3 models: 360 Zhinao, MOSS-003-SFT, AquilaChat-7B
- Updated iFLYTEK Spark to the latest v1.5 model
- [2023/6/17] Released the v1.2 leaderboard, link
- Added 2 models: tigerbot-7b (official site), linly-chatflow-13b
- Explained the motivation behind this leaderboard
- [2023/6/10] Released the v1.1 leaderboard, link
- Added 3 models: minimax, guanaco, Phoenix-7b
- Added a table QA evaluation dimension as a sub-item of reading comprehension
- [2023/6/4] Released the v1 leaderboard, link
- Add more large models to the evaluation: Claude, Gemini, etc.
- Add license information for open-source models, indicating whether commercial use is allowed
- Introduce more evaluation dimensions: math, coding, open-domain QA, multi-turn dialogue, brainstorming, translation, and more
- Break the evaluation dimensions down further, e.g. split information extraction into time entity extraction, address entity extraction, and so on
- Integrate other leaderboards and expand domain-specific rankings (e.g. education, healthcare)
- Add more evaluation data so that the scores become increasingly convincing
Price unit: CNY per 1M tokens (yuan per million tokens); a small cost-calculation sketch follows the table below.
Category | Model | Price / Download | Organization |
---|---|---|---|
Commercial | qwen-long | Input: ¥0.5, output: ¥2 | Alibaba |
Commercial | qwen-turbo | Input: ¥2, output: ¥6 | Alibaba |
Commercial | qwen-plus | Input: ¥4, output: ¥12 | Alibaba |
Commercial | qwen-max | Input: ¥40, output: ¥120 | Alibaba |
Commercial | gpt-4o | Input: ¥36.2, output: ¥108.6 | OpenAI |
Commercial | gpt-4-turbo | Input: ¥72.4, output: ¥217.2 | OpenAI |
Commercial | gpt-4 | Input: ¥217.2, output: ¥434.4 | OpenAI |
Commercial | gpt-3.5-turbo | Input: ¥3.6, output: ¥10.9 | OpenAI |
... | ... | ... | ... |
Open-source | glm-4-9b-chat | ModelScope download | Zhipu |
Open-source | Phi-3-mini-128k-instruct | ModelScope download | Microsoft |
Open-source | internlm2-chat-20b | ModelScope download | Shanghai AI Laboratory |
... | ... | ... | ... |
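The per-request cost implied by these prices is a simple linear function of token counts. A minimal sketch, using the qwen-plus prices from the table above and hypothetical token counts:

```python
# Per-request cost from per-million-token prices (CNY / 1M tokens).
# Token counts and the chosen model (qwen-plus) are illustrative only.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in CNY for one request, given prices in CNY per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# qwen-plus (from the table above): input ¥4 / 1M tokens, output ¥12 / 1M tokens
print(request_cost(2_000, 500, input_price=4.0, output_price=12.0))
# (2000 * 4 + 500 * 12) / 1e6 = 0.014 CNY
```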
The overall capability score is the average of six dimension scores: classification, information extraction, reading comprehension, data analysis, instruction following, and arithmetic.
See total for detailed data
Model | Price (output) | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Total | Rank |
---|---|---|---|---|---|---|---|---|---|
gpt-4o | ¥109 | 93 | 96.3 | 98.0 | 100.0 | 83 | 95.7 | 94.3 | 1 |
gpt4 | ¥434 | 94 | 94.0 | 99.3 | 97.0 | 84 | 86.5 | 92.5 | 2 |
gpt-4-turbo | ¥217 | 91 | 90.0 | 94.0 | 96.0 | 83 | 96.5 | 91.8 | 3 |
Baidu ERNIE 4.0 | ¥120 | 88 | 89.0 | 94.7 | 94.0 | 79 | 100.0 | 90.8 | 4 |
iFLYTEK Spark 4.0 Ultra | ¥100 | 88 | 84.4 | 96.0 | 92.7 | 80 | 94.3 | 89.2 | 5 |
Zhipu GLM4 | ¥100 | 86 | 90.0 | 97.3 | 96.0 | 77 | 78.0 | 87.4 | 6 |
Baichuan4 | ¥100 | 86 | 94.1 | 93.3 | 95.3 | 75 | 78.2 | 87.0 | 7 |
Alibaba qwen-max | ¥120 | 86 | 82.0 | 95.3 | 91.3 | 77 | 79.8 | 85.2 | 8 |
Model | Price (output) | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Total | Rank |
---|---|---|---|---|---|---|---|---|---|
Baidu ERNIE-3.5-8K | ¥12 | 89 | 89.6 | 96.7 | 97.3 | 72 | 100.0 | 90.8 | 1 |
Baidu ERNIE-4.0-Turbo | ¥60 | 90 | 88.9 | 96.0 | 93.3 | 78 | 97.7 | 90.7 | 2 |
minimax-abab6.5-chat | ¥30 | 89 | 87.0 | 89.3 | 95.3 | 76 | 90.3 | 87.8 | 3 |
Baichuan3-Turbo | ¥12 | 88 | 86.7 | 94.7 | 90.7 | 75 | 89.2 | 87.4 | 4 |
yi-large | ¥20 | 85 | 91.0 | 90.0 | 92.7 | 77 | 88.3 | 87.3 | 5 |
minimax-abab6.5s-chat | ¥10 | 87 | 88.0 | 88.7 | 88.0 | 80 | 91.7 | 87.2 | 6 |
iFLYTEK Spark v3.5 (spark-max) | ¥30 | 87 | 92.0 | 89.3 | 87.3 | 74 | 93.5 | 87.2 | 7 |
Zhipu GLM-4-AirX | ¥10 | 89 | 91.9 | 92.7 | 88.0 | 83 | 74.2 | 86.5 | 8 |
Alibaba qwen-plus | ¥12 | 88 | 89.6 | 90.0 | 84.0 | 73 | 93.0 | 86.3 | 9 |
yi-large-turbo | ¥12 | 82 | 90.0 | 88.7 | 86.7 | 78 | 87.8 | 85.5 | 10 |
iFLYTEK Spark v3 (spark-pro) | ¥30 | 87 | 82.0 | 88.0 | 86.0 | 74 | 94.0 | 85.2 | 11 |
gpt-3.5-turbo | ¥11 | 81 | 83.0 | 92.7 | 91.3 | 77 | 80.0 | 84.2 | 12 |
Moonshot moonshot-v1-8k | ¥12 | 92 | 85.0 | 84.0 | 89.3 | 72 | 79.3 | 83.6 | 13 |
SenseTime SenseChat-v4 | ¥12 | 89 | 78.5 | 88.0 | 86.7 | 71 | 72.2 | 80.9 | 14 |
minimax-abab5.5-chat | ¥15 | 83 | 79.0 | 86.7 | 72.7 | 76 | 39.7 | 72.8 | 15 |
Model | Price (output) | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Total | Rank |
---|---|---|---|---|---|---|---|---|---|
gpt-4o-mini | ¥4.3 | 90 | 93.3 | 89.3 | 100.0 | 83 | 92.7 | 91.4 | 1 |
deepseek-chat-v2 | ¥2 | 93 | 88.0 | 94.0 | 96.0 | 76 | 96.7 | 90.6 | 2 |
ByteDance Doubao-pro-32k | ¥2 | 86 | 88.1 | 96.7 | 86.7 | 85 | 98.2 | 90.1 | 3 |
yi-medium | ¥2.5 | 86 | 93.0 | 89.3 | 94.0 | 76 | 89.2 | 87.9 | 4 |
Zhipu GLM-4-Air | ¥1 | 89 | 91.9 | 92.7 | 88.0 | 83 | 74.5 | 86.5 | 5 |
Alibaba qwen-long | ¥2 | 89 | 85.9 | 90.0 | 86.7 | 75 | 83.3 | 85.0 | 6 |
Zhipu GLM-4-Flash | ¥0.1 | 89 | 80.0 | 86.0 | 82.0 | 79 | 75.5 | 81.9 | 7 |
yi-spark | ¥1 | 82 | 88.9 | 88.0 | 76.0 | 72 | 83.3 | 81.7 | 8 |
Baidu ERNIE-Speed-8K | ¥0 | 88 | 88.1 | 88.0 | 89.3 | 68 | 68.7 | 81.7 | 9 |
qwen-turbo | ¥6 | 83 | 85.2 | 88.0 | 76.0 | 66 | 81.3 | 79.9 | 10 |
SenseTime SenseChat-Turbo | ¥5 | 81 | 77.8 | 76.7 | 86.0 | 72 | 78.5 | 78.7 | 11 |
ByteDance Doubao-lite-32k | ¥0.6 | 77 | 86.7 | 88.7 | 64.7 | 62 | 87.2 | 77.7 | 12 |
minimax-abab5.5s-chat | ¥5 | 58 | 57.0 | 70.7 | 56.0 | 49 | 57.0 | 58.0 | 13 |
Flagship commercial model badcases: gpt-4o | moonshot-v1-8k | deepseek-chat-v2 | yi-large | more
Category | Model | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Total | Rank |
---|---|---|---|---|---|---|---|---|---|
Open-source | gemma-2-9b-it | 85 | 82.2 | 88.7 | 87.3 | 81 | 89.3 | 85.6 | 1 |
Open-source | glm-4-9b-chat | 90 | 82.2 | 90.0 | 82.0 | 79 | 76.5 | 83.3 | 2 |
Open-source | Qwen2-7B-Instruct | 89 | 83.7 | 86.7 | 75.3 | 77 | 81.3 | 82.2 | 3 |
Open-source | Llama-3-8B-Instruct | 86 | 74.0 | 80.0 | 90.0 | 63 | 89.5 | 80.4 | 4 |
Open-source | internlm2_5-7b-chat | 86 | 84.4 | 90.0 | 83.3 | 79 | 59.8 | 80.4 | 5 |
Open-source | Yi-1.5-9B-Chat | 82 | 83.0 | 84.7 | 80.0 | 72 | 73.8 | 79.2 | 6 |
Open-source | Llama-3.1-8B-Instruct | 63 | 85.2 | 82.0 | 84.0 | 69 | 90.5 | 79.0 | 7 |
Open-source | openbuddy-llama3-8b | 78 | 86.0 | 81.3 | 79.0 | 70 | 63.2 | 76.2 | 8 |
Open-source | Qwen1.5-7B-Chat | 80 | 76.0 | 76.0 | 70.7 | 67 | 71.2 | 73.5 | 9 |
Open-source | internlm2-chat-7b | 86 | 81.0 | 72.7 | 82.7 | 64 | 42.8 | 71.5 | 10 |
Open-source | Phi-3-mini-128k-instruct | 74 | 63.0 | 65.3 | 73.0 | 75 | 71.3 | 70.3 | 11 |
Open-source | Baichuan2-7B-Chat | 88 | 76.0 | 83.3 | 69.0 | 68 | 34.8 | 69.8 | 12 |
Open-source | Google gemma-7b-it | 72 | 79.0 | 74.0 | 76.0 | 66 | 38.5 | 67.6 | 13 |
Open-source | MiniCPM-2B-dpo | 79 | 77.0 | 74.0 | 66.0 | 55 | 52.7 | 67.3 | 14 |
Open-source | Qwen1.5-4B-Chat | 75 | 65.0 | 79.3 | 63.0 | 56 | 53.0 | 65.2 | 15 |
Open-source | qwen2-1.5b-instruct | 73 | 74.1 | 68.0 | 50.7 | 54 | 55.7 | 62.6 | 16 |
Open-source | internlm2-chat-1_8b | 69 | 60.7 | 63.3 | 46.0 | 45 | 39.7 | 54.0 | 17 |
Open-source | Google gemma-2b-it | 56 | 60.0 | 60.0 | 43.3 | 61 | 26.3 | 51.1 | 18 |
Open-source | Qwen1.5-1.8B-Chat | 57 | 58.0 | 52.7 | 48.0 | 46 | 26.7 | 48.1 | 19 |
Open-source | qwen2-0.5b-instruct | 49 | 53.3 | 62.0 | 36.7 | 48 | 35.5 | 47.4 | 20 |
Open-source | Qwen1.5-0.5B-Chat | 44 | 40.0 | 60.0 | 34.7 | 42 | 17.2 | 39.6 | 21 |
Category | Model | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Total | Rank |
---|---|---|---|---|---|---|---|---|---|
Open-source | Qwen1.5-14B-Chat | 89 | 79.0 | 90.7 | 90.7 | 66 | 77.5 | 82.2 | 1 |
Open-source | internlm2-chat-20b | 93 | 80.0 | 86.0 | 88.0 | 68 | 63.3 | 79.7 | 2 |
Open-source | Baichuan2-13B-Chat | 83 | 83.0 | 74.7 | 77.0 | 75 | 54.8 | 74.6 | 3 |
Open-source | DeepSeek-V2-Lite-Chat | 81 | 76.3 | 81.3 | 73.3 | 69 | 61.2 | 73.7 | 4 |
Category | Model | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Total | Rank |
---|---|---|---|---|---|---|---|---|---|
Open-source | Qwen2-72B-Instruct | 87 | 91.1 | 94.7 | 90.0 | 86 | 94.2 | 90.5 | 1 |
Open-source | Llama-3.1-70B-Instruct | 87 | 88.9 | 92.0 | 90.7 | 79 | 94.8 | 88.7 | 2 |
Open-source | Llama-3-70B-Instruct | 88 | 87.0 | 96.0 | 95.0 | 70 | 90.8 | 87.8 | 3 |
Open-source | qwen2-57b-a14b-instruct | 85 | 88.1 | 89.3 | 87.3 | 77 | 89.2 | 86.0 | 4 |
Open-source | Qwen1.5-32B-Chat | 91 | 86.0 | 92.7 | 87.3 | 72 | 86.8 | 86.0 | 5 |
Open-source | Qwen1.5-72B-Chat | 89 | 84.0 | 88.0 | 87.3 | 70 | 84.8 | 83.8 | 6 |
Open-source | deepseek-llm-67b-chat | 87 | 81.0 | 86.7 | 92.0 | 74 | 72.7 | 82.2 | 7 |
Open-source | Yi-1.5-34B-Chat | 90 | 83.0 | 82.7 | 83.3 | 74 | 79.0 | 82.0 | 8 |
Example evaluation item:
Classify the following words by part of speech.
dog, chase, run, adult, happy, tree
☛ See classification badcases
See classification for detailed data
Example evaluation item:
"China CITIC Bank 300 million yuan, Bank of Communications grew by about 270 million yuan, China Everbright Bank about 100 million yuan."
Extract all organization names from the text above.
☛ See information extraction badcases
See extract for detailed data
Reading comprehension is a composite ability that tests how well a model understands given information.
Depending on the type of given information, it can be broken down into document QA, table QA, dialogue QA, and so on.
Example evaluation item:
Dentist: OK, let's take a look at your teeth. From your description and our examination, you may have some gum disease, which is irritating the nerves of your teeth and causing the sensitivity. In addition, these black spots may be cavities.
Patient: Oh, really? What should I do?
Dentist: Don't worry, we can work out a treatment plan for you. We need to treat the gum disease first, then remove the decay and fill the cavities. Throughout the process we will make sure you are comfortable, and use advanced techniques and materials to achieve the best result.
Patient: OK, thank you, doctor. So when can I start the treatment?
Dentist: Let's schedule an appointment for you. Your treatment will start in two days. In the meantime, please keep brushing and flossing, and avoid overly sweet and acidic foods and drinks.
Patient: OK, I will. Thank you again, doctor.
Dentist: You're welcome; we will do our best to help you get your teeth healthy again.
Based on the dialogue above: what dental problems were found during the patient's examination?
☛ See reading comprehension badcases
See mrc for detailed data
This dimension specifically tests a model's ability to understand and analyze tables, a common requirement in data analysis.
Example evaluation item:
Name,Age,Gender,Nationality,Height (cm),Weight (kg),Education
Zhang San,28,Male,China,180,70,Bachelor's
Lisa,33,Female,USA,165,58,Master's
Paulo,41,Male,Brazil,175,80,PhD
Miyuki,25,Female,Japan,160,50,Associate degree
Ahmed,30,Male,Egypt,175,68,Bachelor's
Maria,29,Female,Mexico,170,65,Master's
Antonio,36,Male,Spain,182,75,PhD
Based on this table: which country is the person with the lowest education level from?
☛ See data analysis badcases
See tableqa for detailed data
Not yet counted toward the overall capability score.
This dimension specifically measures how efficiently a model encodes Chinese characters; among models of the same size, higher encoding efficiency means faster inference, almost in direct proportion.
Chinese encoding efficiency corresponds to the average number of Chinese characters obtained when each token generated by the model is decoded
(the model generates one token at a time, which is then decoded into visible characters such as Chinese, English, or punctuation).
For example, the Chinese encoding efficiency of baichuan2 and llama2 is 1.67 and 0.61 respectively, meaning that at the same model size, baichuan2 generates Chinese roughly 2.7 times faster than llama2 (1.67 / 0.61).
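A minimal sketch of how such a characters-per-token figure could be measured with a Hugging Face tokenizer; the model identifiers, the single sample sentence, and the measurement over one sentence are illustrative assumptions rather than the project's actual measurement script:

```python
# Rough Chinese encoding efficiency: average number of characters per token.
# Model IDs and the sample sentence are illustrative; the benchmark presumably
# averages over a much larger Chinese corpus.
from transformers import AutoTokenizer

def chars_per_token(model_id: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    ids = tok.encode(text, add_special_tokens=False)
    return len(text) / len(ids)

sample = "今天天气很好,我们一起去公园散步吧。"
for model_id in ["baichuan-inc/Baichuan2-7B-Chat", "meta-llama/Llama-2-7b-hf"]:
    print(model_id, round(chars_per_token(model_id, sample), 2))
# With the figures quoted above (1.67 vs 0.61), a same-size baichuan2 would
# decode roughly 1.67 / 0.61 ≈ 2.7x more Chinese text per generated token.
```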
Following Google's IFEval, translated and adapted to Chinese, a curated set of 25 instruction types across 9 categories is used, described as follows (a toy verifiable-check sketch appears after this subsection):
Leaderboard:
See IFEval for detailed data
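As an illustration of the IFEval idea of mechanically verifiable instructions, here is a toy checker for one hypothetical constraint (keyword inclusion plus a length limit); the benchmark's actual 9 categories and 25 instruction types are not reproduced here:

```python
# Toy "verifiable instruction" check in the spirit of IFEval. The constraint
# below (must mention given keywords, stay under a character limit) is a
# hypothetical example, not one of the benchmark's actual instruction types.
def follows_instruction(response: str, required_keywords: list[str], max_chars: int) -> bool:
    has_keywords = all(kw in response for kw in required_keywords)
    return has_keywords and len(response) <= max_chars

response = "秋天的公园里,金黄的银杏叶铺满了小路。"
print(follows_instruction(response, required_keywords=["银杏"], max_chars=50))  # True
```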
This tests the model's basic math (arithmetic) ability: addition and subtraction of integers within 1000, plus addition, subtraction, multiplication, and division of floating-point numbers with at most two decimal places (a toy item generator is sketched below). Examples: 166 + 215 + 53 = ?, 0.97 + 0.4 / 4.51 = ?
☛ See arithmetic badcases
See math for detailed data
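The generator below produces items of the same shape under the stated constraints; it is an illustrative sketch, not the benchmark's own data-generation code:

```python
# Toy generator for arithmetic items under the stated constraints:
# integer +/- within 1000, and +-*/ over floats with at most two decimal places.
import random

def int_item() -> str:
    a, b, c = (random.randint(0, 1000) for _ in range(3))
    op1, op2 = (random.choice("+-") for _ in range(2))
    return f"{a} {op1} {b} {op2} {c} = ?"

def float_item() -> str:
    nums = [round(random.uniform(0.1, 9.99), 2) for _ in range(3)]
    op1, op2 = random.sample(["+", "-", "*", "/"], 2)
    return f"{nums[0]} {op1} {nums[1]} {op2} {nums[2]} = ?"

print(int_item())    # e.g. "166 + 215 + 53 = ?"
print(float_item())  # e.g. "0.97 + 0.4 / 4.51 = ?"
```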
Scoring method: each capability dimension is scored against its own evaluation dataset, which contains a number of questions. Each question is scored 1–5 according to the quality of the model's response; the scores of all questions in the dataset are summed and normalized to a 100-point scale to give the dimension's final score.
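A minimal sketch of this scoring arithmetic, assuming the normalization is sum(scores) / (5 × number of questions) × 100 (one plausible reading of the description) and using made-up per-question scores:

```python
# Sketch of the scoring scheme: each question scored 1-5, normalized to 100,
# then the six dimension scores averaged into the overall score.
# Per-question scores below are made-up placeholders.
from statistics import mean

def dimension_score(question_scores: list[int]) -> float:
    return sum(question_scores) / (5 * len(question_scores)) * 100

dims = {
    "classification":         dimension_score([5, 4, 5, 3, 5]),
    "information_extraction": dimension_score([4, 4, 5, 5, 3]),
    "reading_comprehension":  dimension_score([5, 5, 4, 5, 5]),
    "data_analysis":          dimension_score([3, 4, 4, 5, 4]),
    "instruction_following":  dimension_score([4, 3, 4, 4, 5]),
    "arithmetic":             dimension_score([5, 5, 5, 4, 2]),
}
overall = mean(dims.values())  # overall capability = average of the six dimensions
print({k: round(v, 1) for k, v in dims.items()}, "overall:", round(overall, 1))
```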
Category | Model | Classification | Information Extraction | Reading Comprehension | Data Analysis | Instruction Following | Arithmetic | Overall |
---|---|---|---|---|---|---|---|---|
Commercial | gpt-4o | 93 | 96.3 | 98.0 | 100.0 | 83 | 95.7 | 94.3 |
Commercial | gpt4 | 94 | 94.0 | 99.3 | 97.0 | 84 | 86.5 | 92.5 |
Commercial | gpt-4-turbo | 91 | 90.0 | 94.0 | 96.0 | 83 | 96.5 | 91.8 |
Commercial | gpt-4o-mini | 90 | 93.3 | 89.3 | 100.0 | 83 | 92.7 | 91.4 |
Commercial | Baidu ERNIE 4.0 | 88 | 89.0 | 94.7 | 94.0 | 79 | 100.0 | 90.8 |
Commercial | Baidu ERNIE-3.5-8K | 89 | 89.6 | 96.7 | 97.3 | 72 | 100.0 | 90.8 |
Commercial | Baidu ERNIE-4.0-Turbo | 90 | 88.9 | 96.0 | 93.3 | 78 | 97.7 | 90.7 |
Commercial | deepseek-chat-v2 | 93 | 88.0 | 94.0 | 96.0 | 76 | 96.7 | 90.6 |
Open-source | Qwen2-72B-Instruct | 87 | 91.1 | 94.7 | 90.0 | 86 | 94.2 | 90.5 |
Commercial | ByteDance Doubao-pro-32k | 86 | 88.1 | 96.7 | 86.7 | 85 | 98.2 | 90.1 |
Commercial | iFLYTEK Spark 4.0 Ultra | 88 | 84.4 | 96.0 | 92.7 | 80 | 94.3 | 89.2 |
Open-source | Llama-3.1-70B-Instruct | 87 | 88.9 | 92.0 | 90.7 | 79 | 94.8 | 88.7 |
Commercial | yi-medium | 86 | 93.0 | 89.3 | 94.0 | 76 | 89.2 | 87.9 |
Commercial | minimax-abab6.5-chat | 89 | 87.0 | 89.3 | 95.3 | 76 | 90.3 | 87.8 |
Open-source | Llama-3-70B-Instruct | 88 | 87.0 | 96.0 | 95.0 | 70 | 90.8 | 87.8 |
Commercial | Baichuan3-Turbo | 88 | 86.7 | 94.7 | 90.7 | 75 | 89.2 | 87.4 |
Commercial | Zhipu GLM4 | 86 | 90.0 | 97.3 | 96.0 | 77 | 78.0 | 87.4 |
Commercial | yi-large | 85 | 91.0 | 90.0 | 92.7 | 77 | 88.3 | 87.3 |
Commercial | minimax-abab6.5s-chat | 87 | 88.0 | 88.7 | 88.0 | 80 | 91.7 | 87.2 |
Commercial | iFLYTEK Spark v3.5 (spark-max) | 87 | 92.0 | 89.3 | 87.3 | 74 | 93.5 | 87.2 |
Commercial | Baichuan4 | 86 | 94.1 | 93.3 | 95.3 | 75 | 78.2 | 87.0 |
Commercial | Zhipu GLM-4-Air | 89 | 91.9 | 92.7 | 88.0 | 83 | 74.5 | 86.5 |
Commercial | Zhipu GLM-4-AirX | 89 | 91.9 | 92.7 | 88.0 | 83 | 74.2 | 86.5 |
Commercial | Alibaba qwen-plus | 88 | 89.6 | 90.0 | 84.0 | 73 | 93.0 | 86.3 |
Open-source | qwen2-57b-a14b-instruct | 85 | 88.1 | 89.3 | 87.3 | 77 | 89.2 | 86.0 |
Open-source | Qwen1.5-32B-Chat | 91 | 86.0 | 92.7 | 87.3 | 72 | 86.8 | 86.0 |
Open-source | gemma-2-9b-it | 85 | 82.2 | 88.7 | 87.3 | 81 | 89.3 | 85.6 |
Commercial | yi-large-turbo | 82 | 90.0 | 88.7 | 86.7 | 78 | 87.8 | 85.5 |
Commercial | Alibaba qwen-max | 86 | 82.0 | 95.3 | 91.3 | 77 | 79.8 | 85.2 |
Commercial | iFLYTEK Spark v3 (spark-pro) | 87 | 82.0 | 88.0 | 86.0 | 74 | 94.0 | 85.2 |
Commercial | Alibaba qwen-long | 89 | 85.9 | 90.0 | 86.7 | 75 | 83.3 | 85.0 |
Commercial | gpt-3.5-turbo | 81 | 83.0 | 92.7 | 91.3 | 77 | 80.0 | 84.2 |
Open-source | Qwen1.5-72B-Chat | 89 | 84.0 | 88.0 | 87.3 | 70 | 84.8 | 83.8 |
Commercial | Moonshot moonshot-v1-8k | 92 | 85.0 | 84.0 | 89.3 | 72 | 79.3 | 83.6 |
Open-source | glm-4-9b-chat | 90 | 82.2 | 90.0 | 82.0 | 79 | 76.5 | 83.3 |
Open-source | deepseek-llm-67b-chat | 87 | 81.0 | 86.7 | 92.0 | 74 | 72.7 | 82.2 |
Open-source | Qwen2-7B-Instruct | 89 | 83.7 | 86.7 | 75.3 | 77 | 81.3 | 82.2 |
Open-source | Qwen1.5-14B-Chat | 89 | 79.0 | 90.7 | 90.7 | 66 | 77.5 | 82.2 |
Open-source | Yi-1.5-34B-Chat | 90 | 83.0 | 82.7 | 83.3 | 74 | 79.0 | 82.0 |
Commercial | Zhipu GLM-4-Flash | 89 | 80.0 | 86.0 | 82.0 | 79 | 75.5 | 81.9 |
Commercial | yi-spark | 82 | 88.9 | 88.0 | 76.0 | 72 | 83.3 | 81.7 |
Commercial | Baidu ERNIE-Speed-8K | 88 | 88.1 | 88.0 | 89.3 | 68 | 68.7 | 81.7 |
Commercial | SenseTime SenseChat-v4 | 89 | 78.5 | 88.0 | 86.7 | 71 | 72.2 | 80.9 |
Open-source | Llama-3-8B-Instruct | 86 | 74.0 | 80.0 | 90.0 | 63 | 89.5 | 80.4 |
Open-source | internlm2_5-7b-chat | 86 | 84.4 | 90.0 | 83.3 | 79 | 59.8 | 80.4 |
Commercial | qwen-turbo | 83 | 85.2 | 88.0 | 76.0 | 66 | 81.3 | 79.9 |
Open-source | internlm2-chat-20b | 93 | 80.0 | 86.0 | 88.0 | 68 | 63.3 | 79.7 |
Open-source | Yi-1.5-9B-Chat | 82 | 83.0 | 84.7 | 80.0 | 72 | 73.8 | 79.2 |
Open-source | Llama-3.1-8B-Instruct | 63 | 85.2 | 82.0 | 84.0 | 69 | 90.5 | 79.0 |
Commercial | SenseTime SenseChat-Turbo | 81 | 77.8 | 76.7 | 86.0 | 72 | 78.5 | 78.7 |
Commercial | ByteDance Doubao-lite-32k | 77 | 86.7 | 88.7 | 64.7 | 62 | 87.2 | 77.7 |
Open-source | openbuddy-llama3-8b | 78 | 86.0 | 81.3 | 79.0 | 70 | 63.2 | 76.2 |
Open-source | Baichuan2-13B-Chat | 83 | 83.0 | 74.7 | 77.0 | 75 | 54.8 | 74.6 |
Open-source | DeepSeek-V2-Lite-Chat | 81 | 76.3 | 81.3 | 73.3 | 69 | 61.2 | 73.7 |
Open-source | Qwen1.5-7B-Chat | 80 | 76.0 | 76.0 | 70.7 | 67 | 71.2 | 73.5 |
Commercial | minimax-abab5.5-chat | 83 | 79.0 | 86.7 | 72.7 | 76 | 39.7 | 72.8 |
Open-source | internlm2-chat-7b | 86 | 81.0 | 72.7 | 82.7 | 64 | 42.8 | 71.5 |
Open-source | Phi-3-mini-128k-instruct | 74 | 63.0 | 65.3 | 73.0 | 75 | 71.3 | 70.3 |
Open-source | Baichuan2-7B-Chat | 88 | 76.0 | 83.3 | 69.0 | 68 | 34.8 | 69.8 |
Open-source | Google gemma-7b-it | 72 | 79.0 | 74.0 | 76.0 | 66 | 38.5 | 67.6 |
Open-source | MiniCPM-2B-dpo | 79 | 77.0 | 74.0 | 66.0 | 55 | 52.7 | 67.3 |
Open-source | Qwen1.5-4B-Chat | 75 | 65.0 | 79.3 | 63.0 | 56 | 53.0 | 65.2 |
Open-source | qwen2-1.5b-instruct | 73 | 74.1 | 68.0 | 50.7 | 54 | 55.7 | 62.6 |
Commercial | minimax-abab5.5s-chat | 58 | 57.0 | 70.7 | 56.0 | 49 | 57.0 | 58.0 |
Open-source | internlm2-chat-1_8b | 69 | 60.7 | 63.3 | 46.0 | 45 | 39.7 | 54.0 |
Open-source | Google gemma-2b-it | 56 | 60.0 | 60.0 | 43.3 | 61 | 26.3 | 51.1 |
Open-source | Qwen1.5-1.8B-Chat | 57 | 58.0 | 52.7 | 48.0 | 46 | 26.7 | 48.1 |
Open-source | qwen2-0.5b-instruct | 49 | 53.3 | 62.0 | 36.7 | 48 | 35.5 | 47.4 |
Open-source | Qwen1.5-0.5B-Chat | 44 | 40.0 | 60.0 | 34.7 | 42 | 17.2 | 39.6 |
The eval directory of this project contains the evaluation sets for each dimension as well as the raw model outputs.
- Large models are blooming everywhere, but their quality varies widely. Media coverage often exaggerates and glosses over weaknesses, easily misleading the public; some companies, for PR purposes, also overstate their models' capabilities, routinely claiming to "match ChatGPT" or be "No. 1 in China". As the saying goes, outsiders watch the show while insiders see the craft: the industry needs to set aside the hype, settle down to polish frontier technology, and let real technical strength speak for itself. That calls for an open, fair, and impartial evaluation system that lays out the strengths and weaknesses of every model. With it, everyone can gauge the current state of development and the gap to the best technology abroad, and see the direction of future work more clearly, instead of being swept along by waves of capital and public opinion.
- For industry, and especially for companies without the ability to develop large models themselves, understanding the technical boundaries of large models and making efficient, targeted technology choices is now more important than ever. An open, fair, and impartial evaluation system provides exactly that support: it helps avoid reinventing the wheel, and prevents pointless arguments and people talking past each other because of differing technology stacks.
- For LLM developers, including anyone interested in the technology and academics who value practice, comparing the performance of different models reveals how effective the underlying technical routes and methods are, which is a valuable reference. Cross-referencing and learning from one another helps everyone avoid unnecessary pitfalls and the waste of resources from repeated experiments, and supports the healthy, efficient growth of the whole LLM ecosystem.
Alternative AI tools for chinese-llm-benchmark
Similar Open Source Tools
Firefly
Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.
awesome-ai-painting
This repository, named 'awesome-ai-painting', is a comprehensive collection of resources related to AI painting. It is curated by a user named 秋风, who is an AI painting enthusiast with a background in the AIGC industry. The repository aims to help more people learn AI painting and also documents the user's goal of creating 100 AI products, with current progress at 4/100. The repository includes information on various AI painting products, tutorials, tools, and models, providing a valuable resource for individuals interested in AI painting and related technologies.
TigerBot
TigerBot is a cutting-edge foundation for your very own LLM, providing a world-class large model for innovative Chinese-style contributions. It offers various upgrades and features, such as search mode enhancements, support for large context lengths, and the ability to play text-based games. TigerBot is suitable for prompt-based game engine development, interactive game design, and real-time feedback for playable games.
Llama-Chinese
The Llama Chinese community is an advanced technical community focused on optimizing Llama models for Chinese and building on top of them. **Based on large-scale Chinese data, continuous iterative upgrading of Llama2's Chinese capabilities, starting from pre-training, has been completed [Done]**. **Continuous iterative upgrading of Llama3's Chinese capabilities is in progress [Doing]**. We warmly welcome developers and researchers who are passionate about LLMs to join us.
MMOS
MMOS (Mix of Minimal Optimal Sets) is a dataset designed for math reasoning tasks, offering higher performance and lower construction costs. It includes various models and data subsets for tasks like arithmetic reasoning and math word problem solving. The dataset is used to identify minimal optimal sets through reasoning paths and statistical analysis, with a focus on QA-pairs generated from open-source datasets. MMOS also provides an auto problem generator for testing model robustness and scripts for training and inference.
casibase
Casibase is an open-source AI LangChain-like RAG (Retrieval-Augmented Generation) knowledge database with web UI and Enterprise SSO, supports OpenAI, Azure, LLaMA, Google Gemini, HuggingFace, Claude, Grok, etc.
Step-DPO
Step-DPO is a method for enhancing long-chain reasoning ability of LLMs with a data construction pipeline creating a high-quality dataset. It significantly improves performance on math and GSM8K tasks with minimal data and training steps. The tool fine-tunes pre-trained models like Qwen2-7B-Instruct with Step-DPO, achieving superior results compared to other models. It provides scripts for training, evaluation, and deployment, along with examples and acknowledgements.
LLMs
LLMs is a Chinese large language model technology stack for practical use. It includes high-availability pre-training, SFT, and DPO preference alignment code framework. The repository covers pre-training data cleaning, high-concurrency framework, SFT dataset cleaning, data quality improvement, and security alignment work for Chinese large language models. It also provides open-source SFT dataset construction, pre-training from scratch, and various tools and frameworks for data cleaning, quality optimization, and task alignment.
RAG-Retrieval
RAG-Retrieval provides full-chain RAG retrieval fine-tuning and inference code. It supports fine-tuning any open-source RAG retrieval models, including vector (embedding, graph a), delayed interactive models (ColBERT, graph d), interactive models (cross encoder, graph c). For inference, RAG-Retrieval focuses on ranking (reranker) and has developed a lightweight Python library rag-retrieval, providing a unified way to call any different RAG ranking models.
LLM4SE
The collection is actively updated with the help of an internal literature search engine.
LLM-And-More
LLM-And-More is a one-stop solution for training and applying large models, covering the entire process from data processing to model evaluation, from training to deployment, and from idea to service. In this project, users can easily train models through this project and generate the required product services with one click.
Qwen-TensorRT-LLM
Qwen-TensorRT-LLM is a project developed for the NVIDIA TensorRT Hackathon 2023, focusing on accelerating inference for the Qwen-7B-Chat model using TRT-LLM. The project offers various functionalities such as FP16/BF16 support, INT8 and INT4 quantization options, Tensor Parallel for multi-GPU parallelism, web demo setup with gradio, Triton API deployment for maximum throughput/concurrency, fastapi integration for openai requests, CLI interaction, and langchain support. It supports models like qwen2, qwen, and qwen-vl for both base and chat models. The project also provides tutorials on Bilibili and blogs for adapting Qwen models in NVIDIA TensorRT-LLM, along with hardware requirements and quick start guides for different model types and quantization methods.
Awesome-Interpretability-in-Large-Language-Models
This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.
speechless
Speechless.AI is committed to integrating the superior language processing and deep reasoning capabilities of large language models into practical business applications. By enhancing the model's language understanding, knowledge accumulation, and text creation abilities, and introducing long-term memory, external tool integration, and local deployment, our aim is to establish an intelligent collaborative partner that can independently interact, continuously evolve, and closely align with various business scenarios.
Chinese-Mixtral-8x7B
Chinese-Mixtral-8x7B is an open-source project based on Mistral's Mixtral-8x7B model for incremental pre-training of Chinese vocabulary, aiming to advance research on MoE models in the Chinese natural language processing community. The expanded vocabulary significantly improves the model's encoding and decoding efficiency for Chinese, and the model is pre-trained incrementally on a large-scale open-source corpus, enabling it with powerful Chinese generation and comprehension capabilities. The project includes a large model with expanded Chinese vocabulary and incremental pre-training code.
For similar tasks
Azure-Analytics-and-AI-Engagement
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (containing a demo web application, Power BI reports, Synapse resources, AML Notebooks, etc.) that can be deployed in a customer's subscription using the CAPE tool in a matter of hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.
sorrentum
Sorrentum is an open-source project that aims to combine open-source development, startups, and brilliant students to build machine learning, AI, and Web3 / DeFi protocols geared towards finance and economics. The project provides opportunities for internships, research assistantships, and development grants, as well as the chance to work on cutting-edge problems, learn about startups, write academic papers, and get internships and full-time positions at companies working on Sorrentum applications.
tidb
TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
zep-python
Zep is an open-source platform for building and deploying large language model (LLM) applications. It provides a suite of tools and services that make it easy to integrate LLMs into your applications, including chat history memory, embedding, vector search, and data enrichment. Zep is designed to be scalable, reliable, and easy to use, making it a great choice for developers who want to build LLM-powered applications quickly and easily.
telemetry-airflow
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: * The `dags` directory in this repository contains some custom DAG definitions * Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl * The Data SRE team maintains a WTMO Developer Guide (behind SSO)
mojo
Mojo is a new programming language that bridges the gap between research and production by combining Python syntax and ecosystem with systems programming and metaprogramming features. Mojo is still young, but it is designed to become a superset of Python over time.
pandas-ai
PandasAI is a Python library that makes it easy to ask questions to your data in natural language. It helps you to explore, clean, and analyze your data using generative AI.
databend
Databend is an open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake. With its focus on fast query execution and data ingestion, it's designed for complex analysis of the world's largest datasets.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.