TigerBot

TigerBot: A multi-language multi-task LLM

Stars: 2250

TigerBot is a family of open-source, multilingual, multi-task large language models intended as a foundation for building your own LLM. Recent releases add source links to the search-augmented mode, support for 100K-token context lengths, and a text-based idiom-chain game that uses the model, driven by a suitable prompt, as its game engine.

README:

TigerBot

A cutting-edge foundation for your very own LLM.

Contributing to Chinese-style innovation with a world-class foundation model.

🌐 TigerBot • 🤗 Hugging Face • 💻ModelScope

Chinese | English

Latest releases

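[12/28/2024] TigerBot released MedCT (Medical Clinical Terminology), which it describes as the world's first Chinese clinical terminology system, together with the corresponding medical NER model and medical foundation model. The models set a new SOTA on both Chinese and English medical NER tasks, and real-world annotated clinical data was released alongside them.
- MedCT Github
- MedCT Huggingface
[06/01/2024] TigerBot product upgrade: search mode (search augmented) now includes links to information sources.

[03/18/2024] TigerBot upgrade: support for 100K context length (70b and 13b chat) 🔥 [models]; the TigerBot medical LLM was accepted to the AAAI 2024 Spring Symposium on Clinical Foundation Models [paper].

# Adjust max_input_length/max_generate_length to fit your hardware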
export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-70b-chat-v6 --rope_scaling yarn --rope_factor 8 --max_input_length 37888 --max_generate_length 62112
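
As a quick sanity check on the command above, the two length flags add up to exactly 100,000 tokens, which matches the advertised 100K context. The helper name `total_context` is introduced here purely for illustration and is not part of the TigerBot codebase.

```python
# Illustrative check only; `total_context` is a hypothetical helper.
def total_context(max_input_length: int, max_generate_length: int) -> int:
    """Total tokens one request may occupy (prompt plus generation)."""
    return max_input_length + max_generate_length


# Values taken from the streamlit command above.
budget = total_context(37888, 62112)
print(budget)  # 100000
```

When lowering one flag to fit smaller hardware, the other can be raised by the same amount without changing the total budget.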

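[01/18/2024] The new generation of TigerBot base models was trained on what we believe is the largest, highest-quality Chinese corpus. To test this, we ran a fun experiment: Tigerbot-70b-chat played 40 rounds of idiom chain (成语接龙) against GPT-4-Turbo, each side opening 20 rounds. In each round, the player who fails to continue the chain with a valid idiom loses (judged against an authoritative idiom dictionary). The final score was Tigerbot:GPT-4 = 24:16. Building on this, Tigerbot released its first LLM-based mini-game, [TigerBot Idiom Chain]; scan the QR code below on a mobile device to play for a chance to win a top-spec iPhone Pro Max. 🔥
- An LLM's generative ability, world knowledge, and randomness make text games more playable and more informative;
- a suitable prompt elicits the model's underlying knowledge, which amounts to building the game engine;
- with front-end design for interactivity and instant feedback, a playable mini-game is complete.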
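
The rule of the idiom-chain game described above can be sketched as a small referee. This is a minimal illustration, not TigerBot's game engine: the tiny `IDIOMS` set stands in for an authoritative dictionary, `is_valid_move` and `first_invalid_move` are hypothetical helpers, and we assume the common variant in which each idiom must start with the last character of the previous one and may not be repeated.

```python
from typing import List, Set

# Toy stand-in for an authoritative idiom dictionary (assumption).
IDIOMS = {"一心一意", "意气风发", "发扬光大", "大显身手"}


def is_valid_move(previous: str, candidate: str, used: Set[str]) -> bool:
    """Valid if the candidate is a known, unused idiom that starts with
    the last character of the previous idiom."""
    return (
        candidate in IDIOMS
        and candidate not in used
        and candidate[0] == previous[-1]
    )


def first_invalid_move(chain: List[str]) -> int:
    """Return the index of the first losing move, or -1 if the chain holds.
    The opening idiom is taken as given and is not checked."""
    used = {chain[0]}
    for i in range(1, len(chain)):
        if not is_valid_move(chain[i - 1], chain[i], used):
            return i
        used.add(chain[i])
    return -1


print(first_invalid_move(["一心一意", "意气风发", "发扬光大", "大显身手"]))  # -1
print(first_invalid_move(["一心一意", "发扬光大"]))  # 1
```

In a real game the model would propose each candidate and a referee of this kind would decide who loses the round.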
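[12/29/2023] Tigerbot published a technical report (arXiv preprint) sharing technical details and observations from our work on rapid LLM development and real-world deployment 🔥 [paper]
[12/08/2023] New Tigerbot family models released, bigger and better 🔥 [model download][evaluation]
- Tigerbot-70b base (v2) and chat (v4) updated; the chat model's overall capability improves 19% over the previous version;
- Tigerbot-13b base (v3) and chat (v5) updated; the chat model's overall capability improves 16% over the previous version;
- Tigerbot-180b base and chat (v2) updated; the base model outperforms Bloom by 7%, and the chat model outperforms Bloomz/Bloomz-mt by 20%;
- Training data: 500B tokens of pretraining data, with a knowledge cutoff of August 2023. More high-quality data was added, including WanJuan (万卷), arXiv, Chinese textbooks, and legal and patent data;
- Sequence length: both 70b-chat and 13b-chat come in versions trained at 4k length, which can be extrapolated to 32k at inference;
- Training method: pretraining uses optimized TP/PP algorithms; alignment uses grouped sft, rejection-sampling, and dpo (a technical report will be shared later);
- All products (web, app, mini-program) and the api are upgraded to the latest and strongest Tigerbot-70b-chat-v4-4k; the product side adds search (search augmented) and document (document augmented) RAG features.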
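[10/19/2023] Long(16k)-Tigerbot released
- The Tigerbot-13/70B web/api interfaces support 16k-token length (roughly 20k characters, or a 20-page pdf or word document; paper-type documents can be passed directly as context).
- Based on the YaRN method, the RoPE representation is extrapolated at inference by keeping high-frequency dimensions unchanged and interpolating low-frequency dimensions, so that long sequences can be processed.
- Tigerbot optimized the TGI framework to compute cos/sin values separately for each request from its input sequence length and maximum generation length, while leaving results unchanged when the total length is at most 2048. API users therefore need to choose a suitable max_new_token value, and different max_new_token values may produce different results.
- Tigerbot will continue to improve long-sequence support at the training stage.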
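
The idea of keeping high-frequency RoPE dimensions and interpolating low-frequency ones can be sketched in plain Python. This is a hedged illustration of YaRN-style frequency scaling, not TigerBot's TGI patch: the function names, the ramp bounds `alpha=1` and `beta=32` (values commonly used for Llama-family models), and the use of 2048 as the original length are assumptions, and YaRN's attention-temperature term is left out. `request_scale` is likewise a hypothetical reading of the per-request computation, shown only to make clear why a different max_new_token can change the output.

```python
import math
from typing import List


def yarn_inv_freq(dim: int, scale: float, original_len: int = 2048,
                  base: float = 10000.0, alpha: float = 1.0,
                  beta: float = 32.0) -> List[float]:
    """Per-dimension RoPE inverse frequencies after YaRN-style scaling."""
    out = []
    for i in range(0, dim, 2):
        inv_freq = base ** (-i / dim)          # standard RoPE frequency
        wavelength = 2 * math.pi / inv_freq
        rotations = original_len / wavelength  # full turns inside original_len
        # ramp == 1 keeps the frequency (high-frequency dimensions);
        # ramp == 0 fully interpolates it (low-frequency dimensions).
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append(ramp * inv_freq + (1.0 - ramp) * inv_freq / scale)
    return out


def request_scale(input_len: int, max_new_tokens: int,
                  original_len: int = 2048) -> float:
    """Hypothetical per-request factor: no scaling while the total fits."""
    return max(1.0, (input_len + max_new_tokens) / original_len)


scaled = yarn_inv_freq(dim=128, scale=request_scale(4096, 4096))
plain = yarn_inv_freq(dim=128, scale=1.0)
print(scaled[0] == plain[0], scaled[-1] < plain[-1])  # True True
```

Under these assumptions a request whose prompt plus max_new_token stays within 2048 gets a factor of 1.0 and therefore the unscaled frequencies.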
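[9/27/2023] Tigerbot-70b-chat-api releases function calling: [tech report][tigerbot-api]
- {user prompt, functions def} --> TigerBot --> 3rd party API --> TigerBot --> natural answers with accurate results.
- Tigerbot was not trained on structured-extraction data specific to function calling, yet it already shows decent understanding and extraction ability; we therefore believe that, with fine-tuning on domain data, function calling can approach production quality.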
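
The flow above can be sketched with a stubbed model. Everything here is a placeholder chosen for illustration: `fake_model`, `get_weather`, and the JSON call format are assumptions, and the real tigerbot-api request and response formats are not shown in this document.

```python
import json
from typing import Callable, Dict


def get_weather(city: str) -> str:
    """Stand-in for a third-party API (hypothetical)."""
    return f"Sunny, 25°C in {city}"


FUNCTIONS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def fake_model(prompt: str) -> str:
    """Placeholder for the LLM: first emits a JSON function call,
    then turns a tool result into a natural-language answer."""
    if prompt.startswith("RESULT:"):
        return "Here is what I found: " + prompt[len("RESULT:"):].strip()
    return json.dumps({"name": "get_weather", "arguments": {"city": "Shanghai"}})


def answer(user_prompt: str) -> str:
    call = json.loads(fake_model(user_prompt))             # 1. model picks a function
    result = FUNCTIONS[call["name"]](**call["arguments"])  # 2. call the 3rd-party API
    return fake_model("RESULT: " + result)                 # 3. model phrases the answer


print(answer("What's the weather in Shanghai?"))
```

The point of the second model pass is that the final wording comes from the model while the facts come from the API result.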
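[9/26/2023] Tigerbot-70b-chat (v3) and Tigerbot-13b-chat (v4) updated: [model download]
- smaller batch for finer-grained gradient updates, global_batch_size=66 (pretrain_gbs=1920, v2_gbs=240). We believe that, given high-quality data, the gbs for alignment fine-tuning can be as small as ~100K tokens; denser updates and more steps lead to lower loss (see the train and validation loss in the figure below).
- Added high-quality alignment data with better diversity, richness, and formatting; based on early user feedback, removed known dirty data and prompts that do not match natural user behavior from the original alignment data;
- Across 10+ benchmarks, overall capability exceeds both the previous version and Llama-2, reaching SOTA.
[9/15/2023] Tigerbot-70b-chat (v2) and Tigerbot-13b-chat (v3) updated: [model download]
- Uses less but higher-quality data: about 5M instruction-completion examples covering 100+ task types and matching the natural user distribution;
- Multi-dimensional alignment with 10K human-annotated examples, covering factuality, creativity, richness, safety, and format;
- Across 10+ benchmarks, overall Chinese and English capability exceeds both the previous version and Llama-2, reaching SOTA.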
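[9/06/2023] Tigerbot-70b released, still open source and free for commercial use: [paper][model download] 🔥
- Tigerbot-70b-base: continued pretraining on top of Llama-2-70b; on 10 mainstream benchmarks such as mmlu, overall capability exceeds Llama-2-70b, reaching industry SOTA;
  - trained on 300 billion tokens of high-quality multilingual data,
  - uses GQA, flash-attn, RoPE, holistic-training, and other techniques,
  - trained with tensor/pipeline-partition techniques, reaching the SOTA computational efficiency reported in the Llama-2 paper;
- Tigerbot-70b-chat: built on Tigerbot-70b-base, with sft on 20M instruction-completion examples and rejection-sampling alignment on a 10K human-annotated gold set;
- Tigerbot-70b-chat-api is opened at the same time and remains free for education and research developers.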
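[8/25/2023] TigerBot updates the 13b-base model: [model download][evaluation]
- TigerBot-13B-base: doubled the training data to 600B tokens, with targeted additions of high-quality Chinese and English mathematical-reasoning and scientific-literature data, and improved cleaning of Chinese web data (format, colloquial language, knowledge content, and so on). Across 13 mainstream benchmarks, overall English capability is 5% better than Llama-2-13b, and Chinese more than 30% better. [evaluation]
- Opened an automatic evaluation system based on opencompass to promote reproducibility. [evaluation]
[8/21/2023] TigerBot releases updated 7b and 13b base/chat models: [model download][evaluation]
- TigerBot-7B-base: incrementally pretrained on 300B tokens on top of Llama-2-7B, adding the Chinese, code, and reasoning data that Llama-2 lacks, and using the holistic training method; across 13 mainstream Chinese and English benchmarks it outperforms Llama-2-7B by 33% and leads comparable open-source models in China and abroad;
- TigerBot-7B-chat: instruction fine-tuning (sft) and rejection-sampling alignment (rs-hil) on top of TigerBot-7B-base with 20M multi-task examples; across 13 mainstream Chinese and English benchmarks it outperforms Llama-2-7B-chat by 29% and likewise leads comparable open-source models in China and abroad;
- TigerBot-13B-chat: instruction fine-tuning on top of TigerBot-13B-base with 20M multi-task examples (13b-v1 used 5M), with a better balance between Chinese and English ability; across 13 mainstream Chinese and English benchmarks it outperforms Llama-2-13B-chat by 15% and leads comparable open-source models in China and abroad;
- The tigerbot-api is updated accordingly [tigerbot-api].
[8/19/2023] TigerBot inference (tigerbot.com and tigerbot-api) now uses TGI, reaching 3x QPS and 2x response speed.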

https://github.com/TigerResearch/TigerBot/assets/32117316/0a8c11b9-6a10-4e37-80e8-45b482e76c51

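[8/08/2023] TigerBot 2023.08 (V3) release: TigerBot is pleased to release the TigerBot-13B model. It continues training from Llama-2 with TigerBot's accumulated technology and data, retaining Llama-2's strong English ability while filling its gap in Chinese: it exceeds Llama-2 by 49% on mainstream Chinese tasks and is competitive among comparable open-source models. 🔥 [paper]
- TigerBot-13B-base: continued pretraining on 300B tokens on top of Llama-2-13B, with the Chinese vocabulary expanded to a 60K vocabulary and holistic training used so that the model acquires about 90% of its instruction-following ability directly during pretraining. It exceeds Llama-2-13B-base by 7% on mainstream English benchmarks and by 49% in overall capability on Chinese tests, placing it among the leading open-source base models in China and abroad. [evaluation][model download]
- TigerBot-13B-chat: fine-tuned from TigerBot-13B-base on 5M instruction examples, with rejection sampling fine-tune used to align with human needs. It reaches 101% of Llama-2-13B-chat on mainstream English benchmarks and exceeds Llama-2-13B-chat by 47% in overall capability on Chinese tests, likewise placing it among the leading open-source models in China and abroad. The chat model can be run with python infer.py --model_path TigerResearch/tigerbot-13b-chat. [evaluation][model download]
- TigerBot-API: the chat and summarization api will be upgraded automatically to TigerBot-13B-chat; it is free for research and education users, and prices are unchanged for commercial developers. [tigerbot-api]
[8/03/2023] TigerBot is compatible with the OpenAI interface. [tigerbot-api]
[7/26/2023] TigerBot opens the search-api [tigerbot-api]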

[7/08/2023] TigerBot 2023.07 (V2) release [paper] 🔥
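- tigerbot-7b-base (v2): fully pretrained on 1.5TB of high-quality data (4 weeks on a thousand GPUs, about 3 million in compute cost), outperforming comparable bloom/llama models by 15-30% on public Chinese and English evaluations; [evaluation][model download]
- tigerbot-7b-sft (v2): sft-v2 is fine-tuned from base-v2 and fully trained on 20 million examples / 20G of high-quality, cleaned, and proportioned data, outperforming sft-v1 by 9.3% on 9 public-corpus evaluations; [evaluation][model download]

  The new models can be loaded with the following code: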
```
import transformers

# Users who downloaded an older version should pass `force_download=True` to avoid using the stale cache
model_sft = transformers.AutoModelForCausalLM.from_pretrained('TigerResearch/tigerbot-7b-sft', force_download=True)
model_base = transformers.AutoModelForCausalLM.from_pretrained('TigerResearch/tigerbot-7b-base', force_download=True)
```
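- tigerbot adds a search mode that connects to mainstream Chinese and English search engines and structured tools (such as weather, stock market, calculator), opening up LLM+search use cases; the chat-api with internet search switch is opened at the same time; [TigerBot with search mode (default off) 🌏][paper]
- tigerbot adds a streaming generation mode, and the chat-api with streaming switch is opened at the same time; [TigerBot][TigerBot-API]
- tigerbot-api fully upgraded, opening the api tools commonly used in LLM application development, including: LLM (chat, plugin, finetune), text (embedding, summarization, pdf2text), vision (text2image). [TigerBot-API]
[6/27/2023] PEFT TigerBot with QLoRA: fine-tune tigerbot-7b-sft with qlora on a single 3090, with a 16x speedup and/or a 3/4 reduction in GPU requirements, while also preventing overfitting to downstream data. [code] [paper][model download]

[6/26/2023] TigerBot now is on desktop! A chatbot built with TigerBot and the Svelte framework; thanks to @SaraiQX!
[6/20/2023] How to use Tigerbot's cloud api in place of openai in langchain (sample code); thanks to @wordweb!

[6/13/2023] plug-in api upgrade, exposing: number of search results, prompt prefix, and tf-idf/embedding mixture weights
[6/13/2023] Models now support faster local download
[6/13/2023] TigerBot now is on QQ! A QQ-group knowledge-base bot built on the TigerBot LLM with a local knowledge base; thanks to @wordweb!
[6/09/2023] Added stream infer and web demo; thanks to @Tlntin!
[6/08/2023] TigerBot now runs on colab, windows, langchain, and webui; thanks to @wordweb @runfuture!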

Installation

conda create --name tigerbot python=3.8
conda activate tigerbot
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

git clone https://github.com/TigerResearch/TigerBot
cd TigerBot
pip install -r requirements.txt

Model download

Model	Version	Architecture	Disk size (GB)	Note
tigerbot-70b-base	v2 [🤗][🤖]	llama-2	129	From llama-2-70b weights
	v1 [🤗][🤖]	llama-2	129	From llama-2-70b weights
tigerbot-70b-chat	v6 [🤗][🤖]	llama-2	129	From tigerbot-70b-base v2
	v4-4k [🤗][🤖]	llama-2	129	From tigerbot-70b-base v2
	v4 [🤗][🤖]	llama-2	129	From tigerbot-70b-base v2
	v3 [🤗][🤖]	llama-2	129	From tigerbot-70b-base v1
	v2 [🤗][🤖]	llama-2	129	From tigerbot-70b-base v1
	v1 [🤗]	llama-2	129	From tigerbot-70b-base v1
tigerbot-70b-chat-4bit	v6 [🤗]	llama-2	37	From tigerbot-70b-chat v6
	v4 [🤗]	llama-2	37	From tigerbot-70b-chat v4
	v3 [🤗]	llama-2	37	From tigerbot-70b-chat v3
	v2 [🤗]	llama-2	37	From tigerbot-70b-chat v2
	v1 [🤗]	llama-2	37	From tigerbot-70b-chat v1
tigerbot-13b-base	v3 [🤗][🤖]	llama-2	26.6	From llama-2-13b weights
	v2 [🤗][🤖]	llama-2	26.6	From llama-2-13b weights
	v1 [🤗]	llama-2	26.6	From llama-2-13b weights
tigerbot-13b-chat	v6 [🤗][🤖]	llama-2	26.6	From tigerbot-13b-base v3
	v5-4k [🤗][🤖]	llama-2	26.6	From tigerbot-13b-base v3
	v5 [🤗][🤖]	llama-2	26.6	From tigerbot-13b-base v3
	v4 [🤗][🤖]	llama-2	26.6	From tigerbot-13b-base v2
	v3 [🤗][🤖]	llama-2	26.6	From tigerbot-13b-base v2
	v2 [🤗]	llama-2	26.6	From tigerbot-13b-base v2
	v1 [🤗]	llama-2	26.6	From tigerbot-13b-base v1
tigerbot-13b-chat-4bit	v6 [🤗]	llama-2	11.5	From tigerbot-13b-chat v6
	v5 [🤗]	llama-2	11.5	From tigerbot-13b-chat v5-4k
	v4 [🤗]	llama-2	11.5	From tigerbot-13b-chat v4
tigerbot-7b-base	v3 [🤗][🤖]	llama-2	13.9	From llama-2-7b weights
	v2 [🤗]	bloom	16.2	From bloom weights
	v1 [🤗]	bloom	16.2	From bloom weights
tigerbot-7b-chat	v3 [🤗][🤖]	llama-2	13.9	From tigerbot-7b-base v3
	v2 [🤗]	bloom	16.2	From tigerbot-7b-base v2
	v1 [🤗]	bloom	16.2	From tigerbot-7b-base v1
tigerbot-7b-chat-8bit	v3 [🤗]	llama-2	10.8	From tigerbot-7b-chat v3
tigerbot-7b-chat-4bit	v3 [🤗]	llama-2	6.5	From tigerbot-7b-chat v3
tigerbot-180b-base	v2 [🤗][🤖]	bloom	347.6	From bloom weights
tigerbot-180b-chat	v2 [🤗][🤖]	bloom	347.6	From tigerbot-180b-base v2
	v1 [🤗]	bloom	347.6	From bloom weights

Inference

CLI

CUDA_VISIBLE_DEVICES=0 python infer.py --model_path tigerbot-13b-chat --max_input_length 1024 --max_generate_length 1024 --streaming True

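Parameters:

--model_path: path to the model
--model_type=chat: base/chat
--max_input_length=1024: maximum input length
--max_generate_length=1024: maximum output length
--rope_scaling=None: length-extrapolation method (dynamic/yarn supported now)
--rope_factor=8.0: extrapolation factor
~~--streaming: streaming output~~

Type clear to reset the conversation history, or exit to end the inference session.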
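
For readers who want to script around these flags, the documented names and defaults can be mirrored in a small `argparse` parser. This is a sketch built only from the parameter list above, not the actual contents of `infer.py`; the `choices` restrictions and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented CLI flags (illustrative, not infer.py itself)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented flags")
    parser.add_argument("--model_path", required=True, help="path to the model")
    parser.add_argument("--model_type", default="chat", choices=["base", "chat"])
    parser.add_argument("--max_input_length", type=int, default=1024)
    parser.add_argument("--max_generate_length", type=int, default=1024)
    parser.add_argument("--rope_scaling", default=None, choices=["dynamic", "yarn"])
    parser.add_argument("--rope_factor", type=float, default=8.0)
    return parser


args = build_parser().parse_args(
    ["--model_path", "tigerbot-13b-chat", "--rope_scaling", "yarn"]
)
print(args.model_type, args.max_input_length, args.rope_scaling, args.rope_factor)
# chat 1024 yarn 8.0
```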

WebPage

export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-13b-chat

Launch the web_demo for a quantized model

export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/exllamav2_web_demo.py -- --model_path TigerResearch/tigerbot-70b-chat-v4-4bit-exl2

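Parameters are the same as for the CLI.

Local API

The CLI and WebPage are demos only. TGI implements engineering features such as mixed batching and a request queue; for heavy inference workloads, we recommend serving through the TGI image.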

docker run --gpus '"device=0,1,2,3"' -d -p 8080:80 -v PATH-TO-MODEL-DIR:/model ghcr.io/huggingface/text-generation-inference:1.1.1 --model-id /model --max-total-tokens=1024 --max-input-length=1024 --max-batch-prefill-tokens=1024

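Choose parameters according to model size and hardware. In general, 7B/13B needs 1 x A100 40G, and 70B needs 4 x A100.

Note that when serving with TGI, generation-control parameters must be set in each request.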
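
Because generation settings travel with each request, a client has to build them into every call. The sketch below prepares a request for TGI's `/generate` endpoint using only the standard library; the port 8080 comes from the `docker run` mapping above, while the sampling values and the `build_generate_request` helper are illustrative assumptions. The prompt is sent as-is; any prompt template a chat model may expect is outside the scope of this sketch.

```python
import json
import urllib.request


def build_generate_request(prompt: str, base_url: str = "http://localhost:8080",
                           max_new_tokens: int = 256, temperature: float = 0.8,
                           top_p: float = 0.95) -> urllib.request.Request:
    """Build (but do not send) a TGI /generate request with per-request settings."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    return urllib.request.Request(
        base_url + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Hello")
# To send it (requires the container above to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
```

Keeping `max_new_tokens` plus the prompt length within the server's `--max-total-tokens` avoids requests being rejected.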

Quantization

exllamav2 quantized inference

Load [TigerResearch/tigerbot-70b-chat-v4-4bit-exl2] with exllamav2 for faster inference.

# Install exllamav2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt

# Run inference
CUDA_VISIBLE_DEVICES=0 python other_infer/exllamav2_hf_infer.py --model_path ${MODEL_PATH}

MODEL_PATH is the path to the quantized model, e.g. TigerResearch/tigerbot-70b-chat-v4-4bit-exl2

To use the quantization method above, upgrade packages such as transformers and bitsandbytes to the latest versions (currently transformers==4.33.1 and bitsandbytes==0.41.1 work).

pip install -U transformers bitsandbytes

Loading a dynamically quantized model

This approach quantizes the model online at inference time.

CUDA_VISIBLE_DEVICES=0 python other_infer/quant_infer.py --model_path ${MODEL_DIR} --wbit 8

Training

Pretraining

Install DeepSpeed before starting training

git clone git@github.com:microsoft/DeepSpeed.git
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log

Adjust TORCH_CUDA_ARCH_LIST to match the architecture of the GPU you run on. To find the value, run:

CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"

If the result is (8, 0), then TORCH_CUDA_ARCH_LIST="8.0"
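
The mapping from the capability tuple to the environment variable value is just "major.minor". A small sketch, where `arch_list_from_capability` is a hypothetical helper name:

```python
from typing import Tuple


def arch_list_from_capability(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into a TORCH_CUDA_ARCH_LIST value."""
    major, minor = capability
    return f"{major}.{minor}"


# On a machine with PyTorch and a GPU, pass torch.cuda.get_device_capability() instead.
print(arch_list_from_capability((8, 0)))  # 8.0
```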

Start training

Training tigerbot-7b requires at least 1 x A100 (40GB); training tigerbot-180b requires at least 16 x A100 (40GB).

deepspeed \
--include="localhost:0,1,2,3" \
./train_clm.py \
--deepspeed ./ds_config/ds_config_zero3.json \
--model_name_or_path TigerResearch/tigerbot-7b-base \
--dataset_name TigerResearch/dev_pretrain \
--do_train \
--output_dir ./ckpt-clm \
--overwrite_output_dir \
--preprocess_num_workers 8 \
--num_train_epochs 5 \
--learning_rate 1e-5 \
--evaluation_strategy steps \
--eval_steps 10 \
--bf16 True \
--save_strategy steps \
--save_steps 10 \
--save_total_limit 2 \
--logging_steps 10 \
--tf32 True \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2
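
For orientation, the global batch size implied by this command can be worked out from the flags. The sketch assumes `gradient_accumulation_steps` stays at 1, the Hugging Face `Trainer` default, since the command does not set it; `global_batch_size` is a helper name introduced here.

```python
def global_batch_size(num_gpus: int, per_device_batch_size: int,
                      gradient_accumulation_steps: int = 1) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return num_gpus * per_device_batch_size * gradient_accumulation_steps


# --include="localhost:0,1,2,3" selects 4 GPUs; --per_device_train_batch_size is 2.
print(global_batch_size(4, 2))  # 8
```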

Fine-tuning

Start training

deepspeed \
--include="localhost:0,1,2,3" \
./train_sft.py \
--deepspeed ./ds_config/ds_config_zero3.json \
--model_name_or_path TigerResearch/tigerbot-7b-base \
--dataset_name TigerResearch/dev_sft \
--do_train \
--output_dir ./ckpt-sft \
--overwrite_output_dir \
--preprocess_num_workers 8 \
--num_train_epochs 5 \
--learning_rate 1e-5 \
--evaluation_strategy steps \
--eval_steps 10 \
--bf16 True \
--save_strategy steps \
--save_steps 10 \
--save_total_limit 2 \
--logging_steps 10 \
--tf32 True \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2

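Evaluation

We evaluate automatically on classic Chinese and English benchmarks, 13 tasks in total, covering code, commonsense reasoning, reading comprehension, math, natural language understanding, and more. The automatic evaluation system is built on opencompass (thanks to @opencompass).

# Install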
cd opencompass
pip install -e .

# Download the datasets into data/
wget https://github.com/InternLM/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip

# Run the evaluation task:
CUDA_VISIBLE_DEVICES=0,1,2 python run.py configs/eval_tigerbot_13b.py -w outputs/tigerbot-13b-base

The total score is the average of the scores across task categories.
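
As a worked example of that aggregation, the snippet below averages per-category scores. The category names and numbers are made up purely for illustration and are not TigerBot results; `total_score` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict


def total_score(category_scores: Dict[str, float]) -> float:
    """Unweighted average over task categories."""
    return mean(category_scores.values())


# Made-up numbers for illustration only.
example = {"code": 30.0, "reasoning": 60.0, "reading": 70.0, "math": 40.0, "nlu": 50.0}
print(total_score(example))  # 50.0
```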

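Evaluation results for chat models

Evaluation results for base models

Open-source datasets

Pretraining data

Following the GPT3 pretraining data distribution, we collected Chinese books, internet, and encyclopedia data, then filtered it by data-source quality score and tf-idf soft deduping, reducing 20TB of data to 2TB while preserving the proportions of languages and categories. From this we randomly sampled 100G of data to open-source:

Chinese open-source pretraining set - 55G, containing Chinese books, Chinese internet, and Chinese encyclopedia data - download [hugging face]

English open-source pretraining set - 51G, containing English books, English internet, and English encyclopedia data - download [hugging face]

Type	Disk size	Source
Chinese books	12G	in-house
Chinese internet	25G	in-house
Chinese encyclopedia	19G	in-house
English books	22G	open-source
English internet	6.9G	open-source
English encyclopedia	22G	open-source
Total	106G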
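
The tf-idf soft deduping step mentioned above can be illustrated with a small pure-Python sketch. This is not the pipeline used to build the dataset: the whitespace tokenizer (which would not work for unsegmented Chinese text), the smoothed idf formula, the 0.9 threshold, and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse tf-idf vectors using whitespace tokens and smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def soft_dedup(docs: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a document only if it is not too similar to any already kept one."""
    vectors = tfidf_vectors(docs)
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]


docs = ["the cat sat on the mat", "the cat sat on the mat",
        "stock prices rose sharply today"]
print(len(soft_dedup(docs)))  # 2
```

The pairwise comparison is quadratic in the number of documents, so at the terabyte scale described here some form of approximate matching would be needed; the sketch only shows the similarity criterion.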

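The proportions of the full pretraining data are shown in the figure:

Breakdown of Chinese books and code:

Fine-tuning data

Data collection

The fine-tuning data used by the model was collected as follows:

a. From the natural distribution of user instructions, human annotators summarized 10 major categories and 120 subcategories of tasks, e.g. factual QA, open-ended writing, syntax analysis, and code editing;

b. self-instruct: following the Alpaca self-instruct method, we expanded the Chinese and English seed_tasks and added seed questions reflecting Chinese usage, then generated 2M Chinese examples (0.5M open-sourced in this release) and 0.1M English examples (50k open-sourced in this release);

c. human-labeling: QA data compiled and processed from human-written questions and answers and from web collection, marked [in-house] in the open-source list; part of the data is released this time;

d. open-source data cleaning: converted and cleaned from various public datasets; [in-house*] marks data obtained through secondary development on the raw data, while [open-source] datasets generally started as fairly well-formed QA data and needed only simple cleaning;

e. The overall data distribution matches the natural distribution of user instructions.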

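Data cleaning

- Because data quality varies across sources, and data generated by Alpaca Self-Instruct also has various problems, we carried out careful manual verification and classification and distilled a comprehensive, systematic set of data-cleaning rules and methods.
- The rules fall into two broad classes: **filter rules** and **cleaning rules**. Items that hit a filter rule are discarded, whereas cleaning rules process items and keep the data we need.
- Throughout data curation and accumulation, we keep iterating on and refining the cleaning rules.
- The general cleaning rules are described below:

a. Filter - sensitive-word rule: using an accumulated sensitive-word lexicon, discard items involving politics, pornography, violence, terrorism, and the like;

b. Filter - invalid input/output: these rules specifically clean up defects in Self-Instruct generated data, with separate rules for inputs and outputs that discard invalid items;

An invalid input looks like "<一段文本>" ("<a piece of text>"); an invalid output looks like "[图画]" ("[picture]");

c. Cleaning - keyword rule: replace data according to a curated keyword/regex list, including removing special marker characters, removing invisible characters, removing tags, and converting between Traditional and Simplified Chinese;

d. Cleaning - special-logic rule: these rules clean up particular phenomena, such as an instruction that is repeated in the input, as shown below:

{"instruction": "Describe how to make braised pork (红烧肉). Please provide the ingredients and detailed steps.", "input": "Please describe how to make braised pork, providing the ingredients and detailed steps.", ...}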
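
The four rule types above can be sketched as a single pass over one data item. This is an illustration under stated assumptions, not the project's cleaning code: the word list is a placeholder, the regexes cover only a few invisible characters and simple tags, Traditional/Simplified conversion is omitted, and the instruction/input duplicate check uses `difflib` with an arbitrary 0.8 threshold.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

SENSITIVE_WORDS = {"badword"}        # placeholder lexicon (assumption)
INVALID_INPUTS = {"<一段文本>"}       # examples quoted in the rules above
INVALID_OUTPUTS = {"[图画]"}
INVISIBLE = re.compile("[\u200b\u200e\u200f\ufeff]")  # zero-width chars, BOM
TAGS = re.compile(r"<[^>]+>")
NON_WORD = re.compile(r"[\W_]+")


def normalize(text: str) -> str:
    """Drop punctuation and whitespace so near-identical strings compare equal."""
    return NON_WORD.sub("", text).lower()


def is_duplicate(instruction: str, inp: str, threshold: float = 0.8) -> bool:
    a, b = normalize(instruction), normalize(inp)
    return bool(a) and bool(b) and SequenceMatcher(None, a, b).ratio() >= threshold


def clean_item(item: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the cleaned item, or None if a filter rule discards it."""
    text = item["instruction"] + item["input"] + item["output"]
    if any(word in text for word in SENSITIVE_WORDS):                 # rule a
        return None
    if item["input"] in INVALID_INPUTS or item["output"] in INVALID_OUTPUTS:  # rule b
        return None
    cleaned = {k: TAGS.sub("", INVISIBLE.sub("", v)).strip()          # rule c
               for k, v in item.items()}
    if is_duplicate(cleaned["instruction"], cleaned["input"]):        # rule d
        cleaned["input"] = ""
    return cleaned


print(clean_item({"instruction": "Translate this.", "input": "translate this",
                  "output": "ok"}))
# {'instruction': 'Translate this.', 'input': '', 'output': 'ok'}
```

Filter rules run before cleaning rules here so that discarded items cost no further work; the real ordering in the project is not stated.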

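Open-sourced data

- Instruction datasets: 1.2 million QA pairs currently open-sourced, 1.1G on disk (the datasets are published on huggingface; details and downloads for the Chinese and English instruction sets are in the table below)
- Chinese fine-tuning instruction collection - 530,000 entries - download [hugging face]
- English fine-tuning instruction collection - 670,000 entries - download [hugging face]

Type	Language	Dataset	Count	Source
alpaca Chinese	Chinese	tigerbot-alpaca-zh-0.5m	0.5m	in-house
Encyclopedia QA	Chinese	tigerbot-wiki-qa-1k	1k	in-house
Classic-literature QA	Chinese	tigerbot-book-qa-1k	1k	in-house
Riddles	Chinese	tigerbot-riddle-qa-1k	1k	in-house
Reading comprehension	Chinese	tigerbot-superclue-c3-zh-5k	5k	in-house*
QA	Chinese	tigerbot-HC3-zh-12k	12k	open-source
Zhihu QA	Chinese	tigerbot-zhihu-zh-10k	10k	open-source
alpaca English	English	tigerbot-alpaca-en-50k	50k	in-house
Brainstorming	English	tigerbot-dolly-Brainstorming-en-1.7k	1.7k	open-source
Classification	English	tigerbot-dolly-Classification-en-2k	2k	open-source
Code	English	tigerbot-kaggle-leetcodesolutions-en-2k	2k	in-house*
Recipe generation	English	tigerbot-kaggle-recipes-en-2k	2k	open-source
Medical note generation	English	tigerbot-mt-note-generation-en	450	open-source
Multi-turn dialogue	English	tigerbot-OIG-multichat-en-50k	50k	in-house*
General QA	English	tigerbot-stackexchange-qa-en-0.5m	0.5m	open-source
wiki QA	English	tigerbot-wiki-qa-bart-en-10k	10k	open-source
How-to tutorials	English	tigerbot-youtube-howto-en-50k	50k	open-source
Total			1.2 million entries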
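
The per-dataset counts in the table can be checked against the stated totals with a few lines of Python. The `parse_count` helper is introduced here for illustration; the counts are copied from the table rows.

```python
def parse_count(text: str) -> int:
    """Parse table counts such as '0.5m', '12k', or '450'."""
    units = {"k": 1_000, "m": 1_000_000}
    suffix = text[-1].lower()
    if suffix in units:
        return round(float(text[:-1]) * units[suffix])
    return int(text)


# Counts from the Chinese rows, then the English rows, in table order.
chinese = ["0.5m", "1k", "1k", "1k", "5k", "12k", "10k"]
english = ["50k", "1.7k", "2k", "2k", "2k", "450", "50k", "0.5m", "10k", "50k"]

zh_total = sum(parse_count(c) for c in chinese)
en_total = sum(parse_count(c) for c in english)
print(zh_total, en_total, zh_total + en_total)  # 530000 668150 1198150
```

The sums round to the 530,000, 670,000, and 1.2 million figures quoted for the Chinese set, the English set, and the whole collection.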

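More datasets are being prepared for release...

Domain data

Finance, law, and encyclopedia domain data are released as external data sources for rethink.

Type	Count
Finance - research reports	20,000 documents
Finance - financial reports	2,500 documents
Law	11 categories, 55,000 provisions
Encyclopedia	100,000 entries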

Tigerbot API

chat

Example

plugin

Example

finetune

Example

embedding

Example

summarization

Example

pdf2text

Example

text2image

Example

Other

Use cases

![image](./image/api/case-1.png) ![image](image/api/case-2.png) ![image](image/api/case-3.png) ![image](image/api/case-4.png) ![image](image/api/case-5.png) ![image](image/api/case-6.png)

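Developer ecosystem

A chatbot built with TigerBot and the Svelte framework that can be quickly packaged as a desktop app
A QQ-group knowledge-base bot built on the TigerBot LLM with a local knowledge base
Applications of TigerBot, ChatGLM, and other LLMs built on a local knowledge base
Run TigerBot on Colab - try TigerBot on Colab
Run TigerBot on Windows - use TigerBot on Windows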

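Join us

Product link

https://www.tigerbot.com

Phone

021-63888086

Email

[email protected]

WeChat discussion group

Limitations and disclaimer

The current models may generate hallucinated, misleading, or discriminatory content. Please use content generated by the TigerBot family of models with caution, and do not disseminate harmful generated content. If you intend to use the models publicly or commercially, the service provider is responsible for any adverse effects or harmful speech arising from the model service; the developers of this project bear no responsibility for any harm or loss resulting from use of this project (including but not limited to its data, models, and code).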

For Tasks:

play text games, develop game engine, design interactive games, generate text content, provide real-time feedback

For Jobs:

data scientist, ai researcher, software engineer, game developer, linguist

Alternative AI tools for TigerBot

Similar Open Source Tools

TigerBot

github

: 2.2k

Firefly

Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.

github

: 4.8k

JiwuChat

JiwuChat is a lightweight multi-platform chat application built on Tauri2 and Nuxt3, with various real-time messaging features, AI group chat bots (such as 'iFlytek Spark', 'KimiAI' etc.), WebRTC audio-video calling, screen sharing, and AI shopping functions. It supports seamless cross-device communication, covering text, images, files, and voice messages, also supporting group chats and customizable settings. It provides light/dark mode for efficient social networking.

github

: 400

MEGREZ

MEGREZ is a modern and elegant open-source high-performance computing platform that efficiently manages GPU resources. It allows for easy container instance creation, supports multiple nodes/multiple GPUs, modern UI environment isolation, customizable performance configurations, and user data isolation. The platform also comes with pre-installed deep learning environments, supports multiple users, features a VSCode web version, resource performance monitoring dashboard, and Jupyter Notebook support.

github

: 77

EmoLLM

EmoLLM is a series of large language models for mental health counseling, instruction fine-tuned from existing LLMs, that support the **understanding-supporting-helping users** chain of psychological counseling.

github

: 1.3k

AstrBot

AstrBot is a powerful and versatile tool that leverages the capabilities of large language models (LLMs) like GPT-3, GPT-3.5, and GPT-4 to enhance communication and automate tasks. It seamlessly integrates with popular messaging platforms such as QQ, QQ Channel, and Telegram, enabling users to harness the power of AI within their daily conversations and workflows.

github

: 6.6k

chatluna

Chatluna is a machine learning model plugin that provides chat services with large language models. It is highly extensible, supports multiple output formats, and offers features like custom conversation presets, rate limiting, and context awareness. Users can deploy Chatluna under Koishi without additional configuration. The plugin supports various models/platforms like OpenAI, Azure OpenAI, Google Gemini, and more. It also provides preset customization using YAML files and allows for easy forking and development within Koishi projects. However, the project lacks web UI, HTTP server, and project documentation, inviting contributions from the community.

github

: 287

Llama-Chinese

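The Llama Chinese community is an advanced technical community focused on optimizing Llama models for Chinese and building on top of them. **Starting from pretraining on large-scale Chinese data, it has continuously iterated on the Chinese capabilities of the Llama2 model [Done]**. **It is now continuously iterating on the Chinese capabilities of the Llama3 model [Doing]**. Developers and researchers who are enthusiastic about LLMs are warmly welcome to join.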

github

: 14.5k

crazyai-ml

The 'crazyai-ml' repository is a collection of resources related to machine learning, specifically focusing on explaining artificial intelligence models. It includes articles, code snippets, and tutorials covering various machine learning algorithms, data analysis, model training, and deployment. The content aims to provide a comprehensive guide for beginners in the field of AI, offering practical implementations and insights into popular machine learning packages and model tuning techniques. The repository also addresses the integration of AI models and frontend-backend concepts, making it a valuable resource for individuals interested in AI applications.

github

: 184

gpt_server

The GPT Server project leverages the basic capabilities of FastChat to provide the capabilities of an openai server. It perfectly adapts more models, optimizes models with poor compatibility in FastChat, and supports loading vllm, LMDeploy, and hf in various ways. It also supports all sentence_transformers compatible semantic vector models, including Chat templates with function roles, Function Calling (Tools) capability, and multi-modal large models. The project aims to reduce the difficulty of model adaptation and project usage, making it easier to deploy the latest models with minimal code changes.

github

: 163

HivisionIDPhotos

HivisionIDPhoto is a practical algorithm for intelligent ID photo creation. It utilizes a comprehensive model workflow to recognize, cut out, and generate ID photos for various user photo scenarios. The tool offers lightweight cutting, standard ID photo generation based on different size specifications, six-inch layout photo generation, beauty enhancement (waiting), and intelligent outfit swapping (waiting). It aims to solve emergency ID photo creation issues.

github

: 10.3k

LLMs-from-scratch-CN

This repository is a Chinese translation of the GitHub project 'LLMs-from-scratch', including detailed markdown notes and related Jupyter code. The translation process aims to maintain the accuracy of the original content while optimizing the language and expression to better suit Chinese learners' reading habits. The repository features detailed Chinese annotations for all Jupyter code, aiding users in practical implementation. It also provides various supplementary materials to expand knowledge. The project focuses on building Large Language Models (LLMs) from scratch, covering fundamental constructions like Transformer architecture, sequence modeling, and delving into deep learning models such as GPT and BERT. Each part of the project includes detailed code implementations and learning resources to help users construct LLMs from scratch and master their core technologies.

github

: 130

awesome-ai-painting

This repository, named 'awesome-ai-painting', is a comprehensive collection of resources related to AI painting. It is curated by a user named 秋风, who is an AI painting enthusiast with a background in the AIGC industry. The repository aims to help more people learn AI painting and also documents the user's goal of creating 100 AI products, with current progress at 4/100. The repository includes information on various AI painting products, tutorials, tools, and models, providing a valuable resource for individuals interested in AI painting and related technologies.

github

: 11.0k

MedicalGPT

MedicalGPT trains a medical GPT model with a ChatGPT-style training pipeline, implementing pretraining, supervised fine-tuning, RLHF (reward modeling and reinforcement learning), and DPO (Direct Preference Optimization).

github

: 3.6k

jiwu-mall-chat-tauri

Jiwu Chat Tauri APP is a desktop chat application based on Nuxt3 + Tauri + Element Plus framework. It provides a beautiful user interface with integrated chat and social functions. It also supports AI shopping chat and global dark mode. Users can engage in real-time chat, share updates, and interact with AI customer service through this application.

github

: 151

AstrBot

github

: 7.0k

For similar tasks

TigerBot

github

: 2.2k

obsidian-arcana

Arcana is a plugin for Obsidian that offers a collection of AI-powered tools inspired by famous historical figures to enhance creativity and productivity. It includes tools for conversation, text-to-speech transcription, speech-to-text replies, metadata markup, text generation, file moving, flashcard generation, auto tagging, and note naming. Users can interact with these tools using the command palette and sidebar views, with an OpenAI API key required for usage. The plugin aims to assist users in various note-taking and knowledge management tasks within the Obsidian vault environment.

github

: 78

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675
