chinese-llm-benchmark

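ReLE Chinese LLM capability evaluation (continuously updated): currently covers 291 large models, including commercial models such as chatgpt, gpt-5, o4-mini, Google gemini-2.5, Claude4, Zhipu GLM-Z1, ERNIE Bot (文心一言), qwen-max, Baichuan, iFLYTEK Spark, SenseTime senseChat, and minimax, as well as open-source models such as kimi-k2, ernie4.5, minimax-M1, DeepSeek-R1-0528, deepseek-v3.1, qwen3-2507, llama4, phi-4, GLM4.5, gemma3, and mistral. Beyond leaderboards, it also provides an LLM defect library of more than 2 million entries, so the community can study, analyze, and improve large models.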

Stars: 4818

The Chinese LLM Benchmark (CLiB) is a continuously updated evaluation of large language models, covering commercial and open-source models from a range of companies and research institutions. It evaluates capabilities across multiple dimensions, including classification, information extraction, reading comprehension, data analysis, Chinese encoding efficiency, and Chinese instruction following. Besides capability score rankings, it publishes the raw outputs of every model so that readers can score and rank the models themselves.

README:

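ReLE Chinese LLM Capability Evaluation (continuously updated)

ReLE (Really Reliable Live Evaluation for LLM), formerly CLiB
Currently covers 298 large models, including commercial models such as chatgpt, gpt-5, o4-mini, Google gemini-2.5, Claude4, Zhipu GLM-Z1, ERNIE Bot (文心一言), qwen3-max, Baichuan, iFLYTEK Spark, SenseTime senseChat, and minimax, as well as open-source models such as kimi-k2, ernie4.5, minimax-M1, DeepSeek-R1-0528, deepseek-v3.1, qwen3-2507, llama4, phi-4, GLM4.5, gemma3, and mistral.
Supports multidimensional capability evaluation across 6 domains (education, medical and mental health, finance, law and public administration, reasoning and mathematical computation, language and instruction following), subdivided into ~300 dimensions (for example dentistry, high school Chinese, ...).
Beyond leaderboards, it also provides an LLM defect library of more than 2 million entries, so the community can study, analyze, and improve large models.
Free evaluation service for your private large model. Contact us: add us on WeChat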

Popular LLM evaluation projects on GitHub

repo	star	area	about
langfuse	14.9k	Outside China	Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
opik	12.5k	Outside China	Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
ragas	10.3k	Outside China	Supercharge Your LLM Application Evaluations 🚀
…	…	…	…
⭐chinese-llm-benchmark (this project)	4.7k	China	ReLE Chinese LLM capability evaluation (continuously updated)
…	…	…	…

See hot50 for details

LLM basic information

Newest models each week
For more information, see the model list

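📊 Leaderboards

0. Multimodal leaderboard

For detailed data, see the multimodal evaluation

1. Overall capability leaderboard

How "overall capability" is scored: it is the average of "professional capability" and "general capability". "Professional capability" is the average over the four domains "education", "medical and mental health", "finance", and "law and public administration"; "general capability" is the average over the two domains "reasoning and mathematical computation" and "language and instruction following".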
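The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the function name and the English dictionary keys are assumptions made for this example, and the input values are the per-domain accuracies listed for doubao-seed-1-6-thinking-250715 in the reasoning model leaderboard.

```python
from statistics import mean

# Hypothetical English keys for the six domains named in the scoring rule.
PROFESSIONAL = ("education", "medical_mental_health", "finance", "law_public_admin")
GENERAL = ("reasoning_math", "language_instruction")


def composite_score(domain_scores: dict[str, float]) -> float:
    """Average of the professional-domain mean and the general-domain mean."""
    professional = mean(domain_scores[d] for d in PROFESSIONAL)
    general = mean(domain_scores[d] for d in GENERAL)
    return (professional + general) / 2


# Per-domain accuracies (%) for doubao-seed-1-6-thinking-250715,
# copied from the reasoning model leaderboard in this README.
doubao = {
    "education": 89.8,
    "medical_mental_health": 87.8,
    "finance": 84.1,
    "law_public_admin": 85.0,
    "reasoning_math": 90.0,
    "language_instruction": 88.5,
}

print(round(composite_score(doubao), 1))  # 88.0
```

The result agrees with the 88.0% overall score reported for that model. Because the four professional domains share half the weight, each of them counts for 1/8 of the total, while each general domain counts for 1/4.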

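Category	Organization	Model	[Overall] accuracy	Avg. latency	Avg. tokens consumed	Cost per 1,000 calls (CNY)	Rank (by accuracy)
Commercial	Doubao	doubao-seed-1-6-thinking-250715	88.0%	37s	2144	15.5	1
Commercial	Tencent	hunyuan-t1-20250711	85.5%	40s	2693	9.9	2

For detailed data, see:

1.1 Reasoning model leaderboard

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	doubao-seed-1-6-thinking-250715	Doubao	8.0 CNY	88.0%	89.8%	87.8%	84.1%	85.0%	90.0%	88.5%
2	hunyuan-t1-20250711	Tencent	4.0 CNY	85.5%	89.3%	82.9%	83.6%	76.5%	87.0%	89.0%

For the full leaderboard, see the reasoning model leaderboard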

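1.2 Commercial model leaderboard (including paid APIs for open-source models)

(1) Commercial models with an output price of 5 CNY or more

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	doubao-seed-1-6-thinking-250715	Doubao	8.0 CNY	88.0%	89.8%	87.8%	84.1%	85.0%	90.0%	88.5%
2	DeepSeek-R1-0528	DeepSeek	16.0 CNY	84.4%	82.6%	80.6%	79.0%	81.0%	88.5%	87.6%

For the full leaderboard, see commercial models at 5 CNY or more

(2) Commercial models with an output price of 1 to 5 CNY

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	hunyuan-t1-20250711	Tencent	4.0 CNY	85.5%	89.3%	82.9%	83.6%	76.5%	87.0%	89.0%
2	ERNIE-4.5-Turbo-32K	Baidu	3.2 CNY	83.6%	85.6%	91.5%	85.8%	81.5%	74.9%	87.1%

For the full leaderboard, see commercial models at 1 to 5 CNY

(3) Commercial models with an output price below 1 CNY

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	GLM-4.5-Flash	Zhipu AI	0.0 CNY	77.4%	75.6%	73.3%	70.3%	72.7%	79.9%	83.5%
2	Doubao-1.5-lite-32k-250115	Doubao	0.6 CNY	74.7%	81.4%	80.5%	77.2%	66.0%	65.2%	81.0%

For the full leaderboard, see commercial models below 1 CNY

Build a custom leaderboard filtered by dimension: ☛ link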

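1.3 Open-source model leaderboard

(1) Open-source models below 5B

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	Qwen3-4B	Alibaba	3.0 CNY	68.9%	73.2%	64.8%	70.6%	53.0%	68.5%	76.2%
2	Qwen3-1.7B	Alibaba	3.0 CNY	60.4%	58.5%	51.7%	59.1%	46.0%	61.1%	73.0%

For the full leaderboard, see open-source models below 5B

(2) Open-source models from 5B to 20B

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	Qwen3-14B	Alibaba	2.0 CNY	75.9%	80.0%	75.6%	80.2%	66.2%	73.8%	79.0%
2	Qwen3-8B	Alibaba	0.0 CNY	72.1%	73.1%	67.6%	71.4%	64.0%	70.8%	76.6%

For the full leaderboard, see open-source models from 5B to 20B

(3) Open-source models above 20B

Rank	Model	Organization	Output price	Overall	Education	Medical and mental health	Finance	Law and public administration	Reasoning and math	Language and instruction following
1	DeepSeek-R1-0528	DeepSeek	16.0 CNY	84.4%	82.6%	80.6%	79.0%	81.0%	88.5%	87.6%
2	DeepSeek-V3.1-Think(new)	DeepSeek	12.0 CNY	84.3%	85.0%	80.5%	82.8%	82.0%	86.2%	85.9%

For the full leaderboard, see open-source models above 20B

Build a custom leaderboard filtered by dimension: ☛ link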

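2. Education leaderboard

☛☛ For the full leaderboard, see Education

2.1 Primary school subjects

2.2 Junior high school subjects

2.3 Senior high school entrance exam (Zhongkao) TODO

2.4 Senior high school subjects

2.5 College entrance exam (Gaokao)

2.6 Higher education TODO

2.7 Postgraduate entrance exam TODO

2.8 Teacher certification TODO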

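3. Medical and mental health leaderboard

☛☛ For the full leaderboard, see Medical and mental health

3.1 Physicians

(3) Obstetrics and gynecology, leaderboard
Obstetrics and gynecology residency completion exam: leaderboard|badcase, obstetrics and gynecology attending physician: leaderboard|badcase, obstetrics and gynecology associate chief and chief physician title exam: leaderboard|badcase

(4) Pediatrics, leaderboard
Pediatrics residency completion exam: leaderboard|badcase, pediatrics attending physician: leaderboard|badcase, pediatric surgery: leaderboard|badcase

(5) Ophthalmology, leaderboard
Ophthalmology residency completion exam: leaderboard|badcase, ophthalmology attending physician: leaderboard|badcase

(7) Otolaryngology, leaderboard
Otolaryngology residency completion exam: leaderboard|badcase, otolaryngology attending physician: leaderboard|badcase

(9) Dermatology, leaderboard
Dermatology residency completion exam: leaderboard|badcase, dermatology intermediate title: leaderboard|badcase, dermatology and venereology attending physician: leaderboard|badcase

(11) Rehabilitation medicine, leaderboard
Rehabilitation medicine residency completion exam: leaderboard|badcase, rehabilitation medicine attending physician: leaderboard|badcase

(12) General practice, leaderboard
General practice residency completion exam: leaderboard|badcase, general practice attending physician: leaderboard|badcase

(14) Oncology, leaderboard
Oncology attending physician: leaderboard|badcase

(15) Anesthesiology and pain medicine, leaderboard
Anesthesiology residency completion exam: leaderboard|badcase, anesthesiology attending physician: leaderboard|badcase, pain medicine attending physician: leaderboard|badcase

3.2 Nursing

3.3 Pharmacists

3.4 Medical technology

3.5 Basic medical knowledge

(3) Preventive medicine and public health, leaderboard
Preventive medicine: leaderboard|badcase, hygiene: leaderboard|badcase, medical ethics: leaderboard|badcase

3.6 Medical postgraduate entrance exam

3.7 Mental health

Currently includes 4 sub-items: general psychology, psychotherapy attending physician, psychological counselor, and medical psychology. ☛☛ For the full leaderboard, see Mental health.
(1) General psychology: leaderboard|badcase, (2) psychotherapy attending physician: leaderboard|badcase, (3) psychological counselor: leaderboard|badcase, (4) medical psychology: leaderboard|badcase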

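4. Finance leaderboard

☛☛ For the full leaderboard, see Finance

4.1 Accounting and finance

4.2 Banking

☛☛ For the full leaderboard, see Banking.
Banking junior qualification: leaderboard|badcase, banking practitioner intermediate qualification: leaderboard|badcase, banking practitioner qualification: leaderboard|badcase

4.3 Insurance

☛☛ For the full leaderboard, see Insurance.
Insurance practitioner qualification: leaderboard|badcase

4.4 Securities

☛☛ For the full leaderboard, see Securities.
Securities special exam: leaderboard|badcase, securities practitioner qualification: leaderboard|badcase

4.5 Other financial qualification exams

4.6 Basic financial knowledge

4.7 Financial applications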

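5. Law and public administration leaderboard

☛☛ For the full leaderboard, see Law and public administration

5.1 Bar exam

(1) JEC-QA-KD

Multiple-choice questions, 1,000 in total; based on AGIEval.
For the full leaderboard, see JEC-QA-KD, ☛ view JEC-QA-KD: badcase

(2) JEC-QA-CA

Multiple-choice questions, 1,000 in total; based on AGIEval.
For the full leaderboard, see JEC-QA-CA, ☛ view JEC-QA-CA: badcase

(3) General law

For the full leaderboard, see General law, ☛ view General law: badcase

5.2 Civil service exam

Multiple-choice questions from the civil service administrative aptitude test, 651 in total; based on AGIEval. Sample:

A township is planning a new district and has decided to build one themed community in each of the four directions (east, south, west, north) around the civic park at its center. The four communities are designated as the cultural district, the leisure district, the commercial district, and the administrative service district. It is known that the administrative service district lies southwest of the cultural district, and the cultural district lies southeast of the leisure district.
Based on the statements above, which of the following can be concluded?
(A) The civic park is north of the administrative service district (B) The leisure district is southwest of the cultural district (C) The cultural district is northeast of the commercial district (D) The commercial district is southeast of the leisure district

For the full leaderboard, see Civil service exam
☛ View civil service exam: badcase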

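6. Reasoning and mathematical computation leaderboard

☛☛ For the full leaderboard, see Reasoning and mathematical computation

6.1 Deductive reasoning

Deductive reasoning (modus_tollens) multiple-choice questions, 123 in total; based on ISP.

Sample:

Consider the following statements:
1. If John is a good parent, then John is strict but fair. 2. John is not strict but fair. Conclusion: Therefore, John is not a good parent. Question: Based on statements 1 and 2, is the conclusion correct?
Answer: (A) No (B) Yes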
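The inference pattern this sample tests, modus tollens, can be checked mechanically with a truth table. The sketch below is only an illustration of the logic, not part of the benchmark; `implies` and `is_valid` are helper names introduced here.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b


def is_valid(premises, conclusion, n_vars: int) -> bool:
    """Valid means no truth assignment makes every premise true and the conclusion false."""
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False
    return True


# p: "John is a good parent", q: "John is strict but fair"
modus_tollens = is_valid(
    premises=[lambda p, q: implies(p, q), lambda p, q: not q],
    conclusion=lambda p, q: not p,
    n_vars=2,
)
# Contrast: affirming the consequent (p -> q, q, therefore p) is not valid.
affirming_consequent = is_valid(
    premises=[lambda p, q: implies(p, q), lambda p, q: q],
    conclusion=lambda p, q: p,
    n_vars=2,
)
print(modus_tollens, affirming_consequent)  # True False
```

Reading statement 2 as the negation of "John is strict but fair", the conclusion in the sample follows from the premises.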

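For the full leaderboard, see Deductive reasoning
☛ View deductive reasoning: badcase

6.2 Commonsense reasoning

Commonsense reasoning multiple-choice questions, 99 in total; based on ISP.

Sample:

The following is a multiple-choice question about common sense.
Question: When someone places a potato in the embers beside a campfire, the embers are not
A. releasing heat B. absorbing heat

For the full leaderboard, see Commonsense reasoning
☛ View commonsense reasoning: badcase

6.3 Symbolic reasoning (BBH)

The symbolic reasoning benchmark most widely used in academia, comprising 23 subtasks; see BBH for details. Sample: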

Task description: Answer questions about which times certain events could have occurred.
Q: Today, Emily went to the museum. Between what times could they have gone?
We know that:
Emily woke up at 1pm.
Elizabeth saw Emily reading at the library from 2pm to 4pm.
Jessica saw Emily watching a movie at the theater from 4pm to 5pm.
Leslie saw Emily waiting at the airport from 5pm to 6pm.
William saw Emily buying clothes at the mall from 6pm to 7pm.
The museum was closed after 7pm.
Between what times could Emily have gone to the museum?
Options:
(A) 1pm to 2pm (B) 6pm to 7pm (C) 5pm to 6pm (D) 2pm to 4pm
A:
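A temporal question like the sample above reduces to finding the gaps in a schedule. The following is a minimal sketch of that reduction, not anything taken from BBH itself; the function name `free_slots` and the 24-hour encoding are choices made for this example.

```python
def free_slots(start: int, end: int, busy: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the gaps in [start, end) that no busy interval covers (hours on a 24-hour clock)."""
    slots = []
    cursor = start
    for b_start, b_end in sorted(busy):
        if cursor < end and b_start > cursor:
            slots.append((cursor, min(b_start, end)))
        cursor = max(cursor, b_end)
    if cursor < end:
        slots.append((cursor, end))
    return slots


# Emily woke up at 1pm (13) and the museum was closed after 7pm (19).
# Sightings: library 2pm-4pm, theater 4pm-5pm, airport 5pm-6pm, mall 6pm-7pm.
sightings = [(14, 16), (16, 17), (17, 18), (18, 19)]
print(free_slots(13, 19, sightings))  # [(13, 14)]
```

The only uncovered window is 1pm to 2pm, which corresponds to option (A).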

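For the full leaderboard, see BBH
☛ View BBH symbolic reasoning: badcase

6.4 Arithmetic

Tests arithmetic, one of the basic mathematical abilities of large models. The test items are addition and subtraction of integers up to 1000, and addition, subtraction, multiplication, and division of floating-point numbers with at most 2 significant digits. Examples: 166 + 215 + 53 = ?, 0.97 + 0.4 / 4.51 = ?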
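Reference answers for items like these can be computed with a small, restricted expression evaluator. The sketch below uses Python's `ast` module and accepts only numbers and the four basic operators; the grading rule in `is_correct` (a relative tolerance) is an assumption made for illustration, since the README does not describe how answers are matched.

```python
import ast
import math
import operator

# Only the four basic arithmetic operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate an expression made of numbers, + - * /, and parentheses."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


def is_correct(expr: str, model_answer: float, rel_tol: float = 1e-3) -> bool:
    """Hypothetical grading rule: accept answers within a relative tolerance."""
    return math.isclose(evaluate(expr), model_answer, rel_tol=rel_tol)


print(evaluate("166 + 215 + 53"))               # 434
print(round(evaluate("0.97 + 0.4 / 4.51"), 4))  # 1.0587
print(is_correct("0.97 + 0.4 / 4.51", 1.0587))  # True
```

Note that the second example depends on operator precedence: the division is applied before the addition.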

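For the full leaderboard, see Arithmetic
☛ View arithmetic: badcase

6.5 Table QA

Specifically tests the model's ability to understand and analyze tables, a common need in data analysis.
Sample:

Name,Age,Gender,Nationality,Height (cm),Weight (kg),Education
Zhang San,28,Male,China,180,70,Bachelor's
Lisa,33,Female,USA,165,58,Master's
Paulo,41,Male,Brazil,175,80,Doctorate
Miyuki,25,Female,Japan,160,50,Junior college
Ahmed,30,Male,Egypt,175,68,Bachelor's
Maria,29,Female,Mexico,170,65,Master's
Antonio,36,Male,Spain,182,75,Doctorate
Based on this table, answer: which country is the person with the lowest education level from?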
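For comparison, here is how the same question could be answered programmatically. This is a minimal sketch using an English rendering of the sample table; the ordering of education levels in `EDUCATION_RANK` is an assumption made for this example (the sample leaves it implicit).

```python
import csv
import io

TABLE = """\
Name,Age,Gender,Nationality,Height (cm),Weight (kg),Education
Zhang San,28,Male,China,180,70,Bachelor's
Lisa,33,Female,USA,165,58,Master's
Paulo,41,Male,Brazil,175,80,Doctorate
Miyuki,25,Female,Japan,160,50,Junior college
Ahmed,30,Male,Egypt,175,68,Bachelor's
Maria,29,Female,Mexico,170,65,Master's
Antonio,36,Male,Spain,182,75,Doctorate
"""

# Assumed ordering of education levels, lowest first.
EDUCATION_RANK = {"Junior college": 0, "Bachelor's": 1, "Master's": 2, "Doctorate": 3}

rows = list(csv.DictReader(io.StringIO(TABLE)))
lowest = min(rows, key=lambda r: EDUCATION_RANK[r["Education"]])
print(lowest["Name"], lowest["Nationality"])  # Miyuki Japan
```

The question needs two steps, ranking a categorical column and then looking up a different column in the same row, and that combination is what the task probes.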

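For the full leaderboard, see Table QA
☛ View table QA: badcase

6.6 Table summarization

Specifically tests the model's ability to analyze and summarize tables, commonly needed for data analysis and article writing. There is no fixed reference answer, but good and bad outputs are relatively easy to tell apart objectively. Sample (part of the data is omitted because the example is long):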

Category	Organization	Model	Accuracy	Avg. latency	Avg. tokens consumed	Cost per 1,000 calls (CNY)	Rank (by accuracy)
Commercial	Doubao	doubao-seed-1-6-thinking-250715	87.5	37s	1976	14.6	1
Commercial	Baidu	ERNIE-4.5-Turbo-32K	84.7	33s	676	1.8	2
Commercial	Tencent	hunyuan-t1-20250711	84.7	37s	2465	9.2	3
Commercial	Tencent	hunyuan-turbos-20250716	83.9	24s	1288	2.3	4
…	…	…	…	…	…	…	…

The new models are known to be: GLM-4.5,GLM-4.5-Air,GLM-4.5-Flash,step-3.
Based on the table above, write a summary in the following format: "xx organization, xx organization, ... occupy the top 5 (do not repeat organization names), then describe the distribution of open-source and commercial models. Among the new models, xx ranks xx, xx ranks xx, ... (from highest rank to lowest)". Use the model names and organization names exactly as they appear in the table.

For the full leaderboard, see Table summarization
☛ View table summarization: badcase

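6.7 High school math olympiad

2024 preliminary-round problems; based on Math24o. Sample:

Let $S=\{1, 2, 3, \cdots, 997, 998\}$, and let $k$ 499-element subsets $A_{1}, A_{2}, \cdots, A_{k}$ of $S$ satisfy: for every 2-element subset $B$ of $S$, there exists $i \in \{1, 2, \cdots, k\}$ such that $B \subset A_{i}$. Find the minimum value of $k$.

For the full leaderboard, see High school math olympiad
☛ View high school math olympiad: badcase

6.8 Junior high school math olympiad TODO

6.9 Primary school math olympiad

For the full leaderboard, see Primary school math olympiad
☛ View primary school math olympiad: badcase

6.10 Map reasoning TODO

6.11 Spatial reasoning TODO

6.12 Sudoku

For the full leaderboard, see Sudoku
☛ View sudoku: badcase

6.13 Converting amounts between numerals and Chinese capital form TODO

6.14 Date calculation TODO
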
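7. Language and instruction following leaderboard

☛☛ For the full leaderboard, see Language and instruction following

7.1 Idiom comprehension

Given a context, choose the best-matching idiom.

Sample:

Having covered the film's strengths, let's talk about why its ending is ____. The topic the film raises is pointed, and the "扶弟魔" (a sister who sacrifices everything to support her younger brother) has become a destabilizing factor in many young people's marriages; yet for something this sensitive, the ending merely has the younger brother's cuteness dissolve the older sister's resentment, and she finally chooses to stay and care for him...
Choose the most suitable idiom or saying for the blank above:
(A) 有条有理 (orderly and methodical) (B) 偏听偏信 (heeding only one side) (C) 狗尾续貂 (a poor continuation of a fine work) (D) 半壁江山 (half the realm) (E) 身家性命 (one's life and fortune) (F) 胆小如鼠 (timid as a mouse) (G) 独善其身 (looking after only oneself)

For the full leaderboard, see Idiom comprehension
☛ View idiom comprehension: badcase

7.2 Sentiment analysis

Classify the sentiment of a user review as negative or positive.

Sample:

Used it for a few days and found many problems: the Wi-Fi drops easily, the screen scratches easily, and opening web pages tends to freeze. Not worth buying.
Is the user review above positive or negative?
(A) Negative (B) Positive

For the full leaderboard, see Sentiment analysis
☛ View sentiment analysis: badcase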

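7.3 Textual entailment

Textual entailment: determine the semantic relation between two sentences (entailment, neutral, or contradiction); based on OCNLI.

Sample:

Sentence 1: Farm machinery purchase subsidies cover all agricultural and pastoral counties (farms) nationwide; the central government plans to allocate 13 billion yuan, 9 billion yuan more than the previous year
Sentence 2: Subsidies are paid out according to the number of farmers
What is the relation between the two sentences above?
(A) Entailment (B) Neutral (C) Contradiction

For the full leaderboard, see Textual entailment
☛ View textual entailment: badcase

7.4 Text classification

Sample:

Classify the following words by part of speech.
狗 (dog), 追 (chase), 跑 (run), 大人 (adult), 高兴 (happy), 树 (tree)

For the full leaderboard, see Text classification
☛ View text classification: badcase

7.5 Information extraction

Sample:

"CITIC Bank 300 million yuan, Bank of Communications up about 270 million yuan, Everbright Bank about 100 million yuan."
Extract all organization names from the text above

For the full leaderboard, see Information extraction
☛ View information extraction: badcase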

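7.6 Reading comprehension

Reading comprehension is a composite skill that tests how well a model understands given information. Depending on the kind of information provided, it can be subdivided into article QA, table QA, dialogue QA, and so on.
Sample:

Dentist: All right, let's take a look at your teeth. Based on your description and our examination, you may have some gum disease, which is irritating the nerves of your teeth and causing sensitivity. In addition, these black spots may be cavities.
Patient: Oh, really? What should I do?
Dentist: Don't worry, we can draw up a treatment plan for you. We need to treat the gum disease first, then remove the decay and fill the cavities. Throughout the process we will make sure you are comfortable and use advanced techniques and materials to achieve the best result.
Patient: OK, thank you, doctor. When can I start treatment?
Dentist: Let's schedule an appointment for you. Your treatment will begin in two days. In the meantime, please keep brushing and flossing, and avoid overly sweet or acidic food and drinks.
Patient: OK, I will. Thank you again, doctor.
Dentist: You're welcome. We will do our best to help you get your teeth healthy again.
Based on the dialogue above, answer: what dental problems were found in the patient's examination?

For the full leaderboard, see Reading comprehension
☛ View reading comprehension: badcase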

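7.7 Pronoun resolution

Chinese coreference resolution task; based on CLUEWSC2020. Sample:

Shaoping still did not know how to explain his brother-in-law's situation to his grandmother, so he just said offhand: "He made a small mistake, and they sent him to reeducation through labor!"
In the text above, does "he" in "He made a small mistake" refer to Shaoping? Options: (A) Yes (B) No

For the full leaderboard, see Pronoun resolution
☛ View pronoun resolution: badcase

7.8 Poetry matching

Classical Chinese poetry matching: given a modern-Chinese description of a classical poem, choose from four candidate lines the one whose meaning matches the description. Correct options are built from a parallel corpus of classical poems and their modern-Chinese translations; incorrect candidates are retrieved from a classical poetry corpus by similarity search against the correct option. Based on CCPM. Sample:

The dim lamp went out and was lit again.
Which of the following lines best matches the text above:
(A)渔灯灭复明 (B)残灯灭又然 (C)残灯暗复明 (D)残灯灭又明

For the full leaderboard, see Poetry matching
☛ View poetry matching: badcase

7.9 Chinese instruction following

Based on Google's IFEval, translated and adapted to Chinese, with 25 instruction types selected across 9 categories, described as follows:

For the full leaderboard, see IFEval
☛ View Chinese instruction following: badcase

7.10 Chinese character glyphs

For the full leaderboard, see Chinese character glyphs
☛ View Chinese character glyphs: badcase

7.11 Pinyin TODO

7.12 Typo detection TODO

7.13 Sentence comprehension TODO

7.14 Punctuation TODO

7.15 Traditional/Simplified Chinese conversion TODO

7.16 Language identification TODO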

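🌐 Scores by capability

Scoring method: each model is scored along multiple dimensions, and each dimension corresponds to an evaluation dataset containing a number of questions. Each question is rated from 1 to 5 according to the quality of the model's response; the ratings for all questions in the dataset are summed and normalized to a 100-point scale, which gives the final score.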
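As a rough illustration of this scoring method, the sketch below sums per-question ratings and rescales them to 100. The README does not spell out the normalization formula, so dividing by the maximum attainable total (5 points per question) is an assumption; the function name is likewise hypothetical.

```python
def normalized_score(ratings: list[int], max_rating: int = 5) -> float:
    """Sum per-question ratings and rescale the total to a 0-100 scale.

    Assumption: normalization divides by the maximum attainable total
    (max_rating points per question).
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > max_rating for r in ratings):
        raise ValueError("each rating must be between 1 and max_rating")
    return 100 * sum(ratings) / (max_rating * len(ratings))


print(normalized_score([5, 4, 3, 5, 1]))  # 72.0
```

Under this assumption a model that receives the minimum rating on every question still scores 20, so scores from this formula are not directly comparable with accuracy percentages.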

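For all scoring data, see alldata

⚖️ Raw evaluation data

Includes the evaluation sets for each dimension and the models' outputs; see the eval directory of this project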

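Why build a leaderboard?

Large models are flourishing, but their quality is uneven. Media coverage often exaggerates and glosses over weaknesses, which misleads readers; some companies, for PR purposes, also oversell their own models, constantly claiming to have "reached chatgpt level" or to be "number one in China". As the saying goes, "laymen watch the spectacle, experts look for the substance": the industry badly needs a breath of fresh air, one that sets aside the hype, settles down to refine frontier technology, and lets technical strength speak for itself. That requires an open, impartial, and fair LLM evaluation system that lays out the strengths and weaknesses of each model one by one. With it, everyone can gauge the current state of development and the gap to leading international technology, and see more clearly where future effort should go, without being swept along by waves of capital and public opinion.
For industry, and especially for companies without the capacity to develop large models themselves, understanding the technical limits of large models and selecting among them efficiently and purposefully is especially important today. An open, impartial, and fair LLM evaluation system provides exactly that support: it avoids reinventing the wheel, avoids needless arguments caused by differing technology stacks, and avoids people talking past each other.
For LLM developers, including people interested in the technology and academics who value practice, comparisons across models reflect the effectiveness of the different technical routes and methods behind them, which makes them a valuable reference. Models learning from and borrowing from one another helps everyone avoid unnecessary pitfalls and the waste of resources from repeated experiments, and supports the healthy, efficient development of the whole LLM ecosystem.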

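Contact us

Free evaluation service

Add the editor on WeChat with the note "评测服务" (evaluation service)

LLM evaluation discussion group

Add the editor on WeChat first and you will then be added to the group; use the note "加群" (join group)

LLM evaluation WeChat official account

Follow the LLM evaluation WeChat official account to get the latest evaluation news promptly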
