
LLaMA-Factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Stars: 59051

LLaMA Factory is a unified framework for fine-tuning 100+ large language models (LLMs) with various methods, including pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO. It features integrated algorithms like GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning, as well as practical tricks like FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA. LLaMA Factory provides experiment monitors like LlamaBoard, TensorBoard, Wandb, MLflow, etc., and supports faster inference with OpenAI-style API, Gradio UI and CLI with vLLM worker. Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.
README:
👋 Join our WeChat, NPU, Lab4AI, LLaMA Factory Online user group.
[ English | 中文 ]
Fine-tuning a large language model can be easy as...
https://github.com/user-attachments/assets/3991a3a8-4276-4d30-9cab-4cb0c4b9b99e
Choose your path:
- Documentation (WIP): https://llamafactory.readthedocs.io/en/latest/
- Documentation (AMD GPU): https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/fine_tune/llama_factory_llama3.html
- Colab (free): https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
- Local machine: Please refer to usage
- PAI-DSW (free trial): https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory
- Alaya NeW (cloud GPU deal): https://docs.alayanew.com/docs/documents/useGuide/LLaMAFactory/mutiple/?utm_source=LLaMA-Factory
- Official Course: https://www.lab4ai.cn/course/detail?id=7c13e60f6137474eb40f6fd3983c0f46&utm_source=LLaMA-Factory
- LLaMA Factory Online: https://www.llamafactory.com.cn/?utm_source=LLaMA-Factory
[!NOTE] Except for the above links, all other websites are unauthorized third-party websites. Please carefully use them.
- Features
- Blogs
- Changelog
- Supported Models
- Supported Training Approaches
- Provided Datasets
- Requirement
- Getting Started
- Projects using LLaMA Factory
- License
- Citation
- Acknowledgement
- Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Qwen2-VL, DeepSeek, Yi, Gemma, ChatGLM, Phi, etc.
- Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.
- Scalable resources: 16-bit full-tuning, freeze-tuning, LoRA and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
- Advanced algorithms: GaLore, BAdam, APOLLO, Adam-mini, Muon, OFT, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ and PiSSA.
- Practical tricks: FlashAttention-2, Unsloth, Liger Kernel, RoPE scaling, NEFTune and rsLoRA.
- Wide tasks: Multi-turn dialogue, tool using, image understanding, visual grounding, video recognition, audio understanding, etc.
- Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, SwanLab, etc.
- Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker or SGLang worker.
Support Date | Model Name |
---|---|
Day 0 | Qwen3 / Qwen2.5-VL / Gemma 3 / GLM-4.1V / InternLM 3 / MiniCPM-o-2.6 |
Day 1 | Llama 3 / GLM-4 / Mistral Small / PaliGemma2 / Llama 4 |
- 💡 Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge (English)
- Fine-tune a mental health LLM using LLaMA-Factory (Chinese)
- Fine-tune GPT-OSS for Role-Playing using LLaMA-Factory (Chinese)
- A One-Stop Code-Free Model Reinforcement Learning and Deployment Platform based on LLaMA-Factory and EasyR1 (Chinese)
- How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod (English)
All Blogs
- Fine-tune Llama3.1-70B for Medical Diagnosis using LLaMA-Factory (Chinese)
- Fine-tune Qwen2.5-VL for Autonomous Driving using LLaMA-Factory (Chinese)
- LLaMA Factory: Fine-tuning the DeepSeek-R1-Distill-Qwen-7B Model for News Classifier (Chinese)
- A One-Stop Code-Free Model Fine-Tuning & Deployment Platform based on SageMaker and LLaMA-Factory (Chinese)
- LLaMA Factory Multi-Modal Fine-Tuning Practice: Fine-Tuning Qwen2-VL for Personal Tourist Guide (Chinese)
- LLaMA Factory: Fine-tuning Llama3 for Role-Playing (Chinese)
[25/08/22] We supported OFT and OFTv2. See examples for usage.
[25/08/20] We supported fine-tuning the Intern-S1-mini models. See PR #8976 to get started.
[25/08/06] We supported fine-tuning the GPT-OSS models. See PR #8826 to get started.
Full Changelog
[25/07/02] We supported fine-tuning the GLM-4.1V-9B-Thinking model.
[25/04/28] We supported fine-tuning the Qwen3 model family.
[25/04/21] We supported the Muon optimizer. See examples for usage. Thank @tianshijing's PR.
[25/04/16] We supported fine-tuning the InternVL3 model. See PR #7258 to get started.
[25/04/14] We supported fine-tuning the GLM-Z1 and Kimi-VL models.
[25/04/06] We supported fine-tuning the Llama 4 model. See PR #7611 to get started.
[25/03/31] We supported fine-tuning the Qwen2.5 Omni model. See PR #7537 to get started.
[25/03/15] We supported SGLang as inference backend. Try infer_backend: sglang
to accelerate inference.
[25/03/12] We supported fine-tuning the Gemma 3 model.
[25/02/24] Announcing EasyR1, an efficient, scalable and multi-modality RL training framework for efficient GRPO training.
[25/02/11] We supported saving the Ollama modelfile when exporting the model checkpoints. See examples for usage.
[25/02/05] We supported fine-tuning the Qwen2-Audio and MiniCPM-o-2.6 on audio understanding tasks.
[25/01/31] We supported fine-tuning the DeepSeek-R1 and Qwen2.5-VL models.
[25/01/15] We supported APOLLO optimizer. See examples for usage.
[25/01/14] We supported fine-tuning the MiniCPM-o-2.6 and MiniCPM-V-2.6 models. Thank @BUAADreamer's PR.
[25/01/14] We supported fine-tuning the InternLM 3 models. Thank @hhaAndroid's PR.
[25/01/10] We supported fine-tuning the Phi-4 model.
[24/12/21] We supported using SwanLab for experiment tracking and visualization. See this section for details.
[24/11/27] We supported fine-tuning the Skywork-o1 model and the OpenO1 dataset.
[24/10/09] We supported downloading pre-trained models and datasets from the Modelers Hub. See this tutorial for usage.
[24/09/19] We supported fine-tuning the Qwen2.5 models.
[24/08/30] We supported fine-tuning the Qwen2-VL models. Thank @simonJJJ's PR.
[24/08/27] We supported Liger Kernel. Try enable_liger_kernel: true
for efficient training.
[24/08/09] We supported Adam-mini optimizer. See examples for usage. Thank @relic-yuexi's PR.
[24/07/04] We supported contamination-free packed training. Use neat_packing: true
to activate it. Thank @chuan298's PR.
[24/06/16] We supported PiSSA algorithm. See examples for usage.
[24/06/07] We supported fine-tuning the Qwen2 and GLM-4 models.
[24/05/26] We supported SimPO algorithm for preference learning. See examples for usage.
[24/05/20] We supported fine-tuning the PaliGemma series models. Note that the PaliGemma models are pre-trained models, you need to fine-tune them with paligemma
template for chat completion.
[24/05/18] We supported KTO algorithm for preference learning. See examples for usage.
[24/05/14] We supported training and inference on the Ascend NPU devices. Check installation section for details.
[24/04/26] We supported fine-tuning the LLaVA-1.5 multimodal LLMs. See examples for usage.
[24/04/22] We provided a Colab notebook for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check Llama3-8B-Chinese-Chat and Llama3-Chinese for details.
[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples for usage.
[24/04/16] We supported BAdam optimizer. See examples for usage.
[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2, more benchmarks can be found in this page.
[24/03/31] We supported ORPO. See examples for usage.
[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!
[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples for usage.
[24/03/13] We supported LoRA+. See examples for usage.
[24/03/07] We supported GaLore optimizer. See examples for usage.
[24/03/07] We integrated vLLM for faster and concurrent inference. Try infer_backend: vllm
to enjoy 270% inference speed.
[24/02/28] We supported weight-decomposed LoRA (DoRA). Try use_dora: true
to activate DoRA training.
[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples for usage.
[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.
[24/01/18] We supported agent tuning for most models, equipping model with tool using abilities by fine-tuning with dataset: glaive_toolcall_en
.
[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try use_unsloth: true
argument to activate unsloth patch. It achieves 170% speed in our benchmark, check this page for details.
[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.
[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub. See this tutorial for usage.
[23/10/21] We supported NEFTune trick for fine-tuning. Try neftune_noise_alpha: 5
argument to activate NEFTune.
[23/09/27] We supported $S^2$-Attn proposed by LongLoRA for the LLaMA models. Try shift_attn: true
argument to enable shift short attention.
[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See examples for usage.
[23/09/10] We supported FlashAttention-2. Try flash_attn: fa2
argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.
[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try rope_scaling: linear
argument in training and rope_scaling: dynamic
argument at inference to extrapolate the position embeddings.
[23/08/11] We supported DPO training for instruction-tuned models. See examples for usage.
[23/07/31] We supported dataset streaming. Try streaming: true
and max_steps: 10000
arguments to load your dataset in streaming mode.
[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.
[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py
to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.
[23/07/09] We released FastEdit ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.
[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.
[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.
[23/06/03] We supported quantized training and inference (aka QLoRA). See examples for usage.
[!TIP] If you cannot use the latest feature, please pull the latest code and install LLaMA-Factory again.
Model | Model size | Template |
---|---|---|
Baichuan 2 | 7B/13B | baichuan2 |
BLOOM/BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | - |
ChatGLM3 | 6B | chatglm3 |
Command R | 35B/104B | cohere |
DeepSeek (Code/MoE) | 7B/16B/67B/236B | deepseek |
DeepSeek 2.5/3 | 236B/671B | deepseek3 |
DeepSeek R1 (Distill) | 1.5B/7B/8B/14B/32B/70B/671B | deepseekr1 |
Falcon | 7B/11B/40B/180B | falcon |
Falcon-H1 | 0.5B/1.5B/3B/7B/34B | falcon_h1 |
Gemma/Gemma 2/CodeGemma | 2B/7B/9B/27B | gemma/gemma2 |
Gemma 3/Gemma 3n | 270M/1B/4B/6B/8B/12B/27B | gemma3/gemma3n |
GLM-4/GLM-4-0414/GLM-Z1 | 9B/32B | glm4/glmz1 |
GLM-4.1V | 9B | glm4v |
GLM-4.5/GLM-4.5V | 106B/355B | glm4_moe/glm4v_moe |
GPT-2 | 0.1B/0.4B/0.8B/1.5B | - |
GPT-OSS | 20B/120B | gpt |
Granite 3.0-3.3 | 1B/2B/3B/8B | granite3 |
Granite 4 | 7B | granite4 |
Hunyuan | 7B | hunyuan |
Index | 1.9B | index |
InternLM 2-3 | 7B/8B/20B | intern2 |
InternVL 2.5-3.5 | 1B/2B/4B/8B/14B/30B/38B/78B/241B | intern_vl |
InternLM/Intern-S1-mini | 8B | intern_s1 |
Kimi-VL | 16B | kimi_vl |
Llama | 7B/13B/33B/65B | - |
Llama 2 | 7B/13B/70B | llama2 |
Llama 3-3.3 | 1B/3B/8B/70B | llama3 |
Llama 4 | 109B/402B | llama4 |
Llama 3.2 Vision | 11B/90B | mllama |
LLaVA-1.5 | 7B/13B | llava |
LLaVA-NeXT | 7B/8B/13B/34B/72B/110B | llava_next |
LLaVA-NeXT-Video | 7B/34B | llava_next_video |
MiMo | 7B | mimo |
MiniCPM 1-4.1 | 0.5B/1B/2B/4B/8B | cpm/cpm3/cpm4 |
MiniCPM-o-2.6/MiniCPM-V-2.6 | 8B | minicpm_o/minicpm_v |
Ministral/Mistral-Nemo | 8B/12B | ministral |
Mistral/Mixtral | 7B/8x7B/8x22B | mistral |
Mistral Small | 24B | mistral_small |
OLMo | 1B/7B | - |
PaliGemma/PaliGemma2 | 3B/10B/28B | paligemma |
Phi-1.5/Phi-2 | 1.3B/2.7B | - |
Phi-3/Phi-3.5 | 4B/14B | phi |
Phi-3-small | 7B | phi_small |
Phi-4 | 14B | phi4 |
Pixtral | 12B | pixtral |
Qwen (1-2.5) (Code/Math/MoE/QwQ) | 0.5B/1.5B/3B/7B/14B/32B/72B/110B | qwen |
Qwen3 (MoE/Instruct/Thinking/Next) | 0.6B/1.7B/4B/8B/14B/32B/80B/235B | qwen3/qwen3_nothink |
Qwen2-Audio | 7B | qwen2_audio |
Qwen2.5-Omni | 3B/7B | qwen2_omni |
Qwen2-VL/Qwen2.5-VL/QVQ | 2B/3B/7B/32B/72B | qwen2_vl |
Seed (OSS/Coder) | 8B/36B | seed_oss/seed_coder |
Skywork o1 | 8B | skywork_o1 |
StarCoder 2 | 3B/7B/15B | - |
TeleChat2 | 3B/7B/35B/115B | telechat2 |
XVERSE | 7B/13B/65B | xverse |
Yi/Yi-1.5 (Code) | 1.5B/6B/9B/34B | yi |
Yi-VL | 6B/34B | yi_vl |
Yuan 2 | 2B/51B/102B | yuan |
[!NOTE] For the "base" models, the
template
argument can be chosen fromdefault
,alpaca
,vicuna
etc. But make sure to use the corresponding template for the "instruct/chat" models.Remember to use the SAME template in training and inference.
*: You should install the
transformers
from main branch and useDISABLE_VERSION_CHECK=1
to skip version check.**: You need to install a specific version of
transformers
to use the corresponding model.
Please refer to constants.py for a full list of models we supported.
You also can add a custom chat template to template.py.
Approach | Full-tuning | Freeze-tuning | LoRA | QLoRA | OFT | QOFT |
---|---|---|---|---|---|---|
Pre-Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Supervised Fine-Tuning | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Reward Modeling | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
PPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
DPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
KTO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
ORPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
SimPO Training | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
[!TIP] The implementation details of PPO can be found in this blog.
Pre-training datasets
Supervised fine-tuning datasets
- Identity (en&zh)
- Stanford Alpaca (en)
- Stanford Alpaca (zh)
- Alpaca GPT4 (en&zh)
- Glaive Function Calling V2 (en&zh)
- LIMA (en)
- Guanaco Dataset (multilingual)
- BELLE 2M (zh)
- BELLE 1M (zh)
- BELLE 0.5M (zh)
- BELLE Dialogue 0.4M (zh)
- BELLE School Math 0.25M (zh)
- BELLE Multiturn Chat 0.8M (zh)
- UltraChat (en)
- OpenPlatypus (en)
- CodeAlpaca 20k (en)
- Alpaca CoT (multilingual)
- OpenOrca (en)
- SlimOrca (en)
- MathInstruct (en)
- Firefly 1.1M (zh)
- Wiki QA (en)
- Web QA (zh)
- WebNovel (zh)
- Nectar (en)
- deepctrl (en&zh)
- Advertise Generating (zh)
- ShareGPT Hyperfiltered (en)
- ShareGPT4 (en&zh)
- UltraChat 200k (en)
- Infinity Instruct (zh)
- AgentInstruct (en)
- LMSYS Chat 1M (en)
- Evol Instruct V2 (en)
- Cosmopedia (en)
- STEM (zh)
- Ruozhiba (zh)
- Neo-sft (zh)
- Magpie-Pro-300K-Filtered (en)
- Magpie-ultra-v0.1 (en)
- WebInstructSub (en)
- OpenO1-SFT (en&zh)
- Open-Thoughts (en)
- Open-R1-Math (en)
- Chinese-DeepSeek-R1-Distill (zh)
- LLaVA mixed (en&zh)
- Pokemon-gpt4o-captions (en&zh)
- Open Assistant (de)
- Dolly 15k (de)
- Alpaca GPT4 (de)
- OpenSchnabeltier (de)
- Evol Instruct (de)
- Dolphin (de)
- Booksum (de)
- Airoboros (de)
- Ultrachat (de)
Preference datasets
Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.
pip install --upgrade huggingface_hub
huggingface-cli login
Mandatory | Minimum | Recommend |
---|---|---|
python | 3.9 | 3.10 |
torch | 2.0.0 | 2.6.0 |
torchvision | 0.15.0 | 0.21.0 |
transformers | 4.49.0 | 4.50.0 |
datasets | 2.16.0 | 3.2.0 |
accelerate | 0.34.0 | 1.2.1 |
peft | 0.14.0 | 0.15.1 |
trl | 0.8.6 | 0.9.6 |
Optional | Minimum | Recommend |
---|---|---|
CUDA | 11.6 | 12.2 |
deepspeed | 0.10.0 | 0.16.4 |
bitsandbytes | 0.39.0 | 0.43.1 |
vllm | 0.4.3 | 0.8.2 |
flash-attn | 2.5.6 | 2.7.2 |
* estimated
Method | Bits | 7B | 14B | 30B | 70B |
x B |
---|---|---|---|---|---|---|
Full (bf16 or fp16 ) |
32 | 120GB | 240GB | 600GB | 1200GB |
18x GB |
Full (pure_bf16 ) |
16 | 60GB | 120GB | 300GB | 600GB |
8x GB |
Freeze/LoRA/GaLore/APOLLO/BAdam/OFT | 16 | 16GB | 32GB | 64GB | 160GB |
2x GB |
QLoRA / QOFT | 8 | 10GB | 20GB | 40GB | 80GB |
x GB |
QLoRA / QOFT | 4 | 6GB | 12GB | 24GB | 48GB |
x/2 GB |
QLoRA / QOFT | 2 | 4GB | 8GB | 16GB | 24GB |
x/4 GB |
[!IMPORTANT] Installation is mandatory.
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
Extra dependencies available: torch, torch-npu, metrics, deepspeed, liger-kernel, bitsandbytes, hqq, eetq, gptq, aqlm, vllm, sglang, galore, apollo, badam, adam-mini, qwen, minicpm_v, openmind, swanlab, dev
docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest
This image is built on Ubuntu 22.04 (x86_64), CUDA 12.4, Python 3.11, PyTorch 2.6.0, and Flash-attn 2.7.4.
Find the pre-built images: https://hub.docker.com/r/hiyouga/llamafactory/tags
Please refer to build docker to build the image yourself.
Setting up a virtual environment with uv
Create an isolated Python environment with uv:
uv sync --extra torch --extra metrics --prerelease=allow
Run LLaMA-Factory in the isolated environment:
uv run --prerelease=allow llamafactory-cli train examples/train_lora/llama3_lora_pretrain.yaml
For Windows users
You need to manually install the GPU version of PyTorch on the Windows platform. Please refer to the official website and the following command to install PyTorch with CUDA support:
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
python -c "import torch; print(torch.cuda.is_available())"
If you see True
then you have successfully installed PyTorch with CUDA support.
Try dataloader_num_workers: 0
if you encounter Can't pickle local object
error.
If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of bitsandbytes
library, which supports CUDA 11.1 to 12.2, please select the appropriate release version based on your CUDA version.
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
To enable FlashAttention-2 on the Windows platform, please use the script from flash-attention-windows-wheel to compile and install it by yourself.
For Ascend NPU users
To install LLaMA Factory on Ascend NPU devices, please upgrade Python to version 3.10 or higher and specify extra dependencies: pip install -e ".[torch-npu,metrics]"
. Additionally, you need to install the Ascend CANN Toolkit and Kernels. Please follow the installation tutorial or use the following commands:
# replace the url according to your CANN version and devices
# install CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C20SPC702/Ascend-cann-toolkit_8.0.0.alpha002_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.0.alpha002_linux-"$(uname -i)".run --install
# install CANN Kernels
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C20SPC702/Ascend-cann-kernels-910b_8.0.0.alpha002_linux-"$(uname -i)".run
bash Ascend-cann-kernels-910b_8.0.0.alpha002_linux-"$(uname -i)".run --install
# set env variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
Requirement | Minimum | Recommend |
---|---|---|
CANN | 8.0.RC1 | 8.0.0.alpha002 |
torch | 2.1.0 | 2.4.0 |
torch-npu | 2.1.0 | 2.4.0.post2 |
deepspeed | 0.13.2 | 0.13.2 |
vllm-ascend | - | 0.7.3 |
Remember to use ASCEND_RT_VISIBLE_DEVICES
instead of CUDA_VISIBLE_DEVICES
to specify the device to use.
If you cannot infer model on NPU devices, try setting do_sample: false
in the configurations.
Download the pre-built Docker images: 32GB | 64GB
To use QLoRA based on bitsandbytes on Ascend NPU, please follow these 3 steps:
- Manually compile bitsandbytes: Refer to the installation documentation for the NPU version of bitsandbytes to complete the compilation and installation. The compilation requires a cmake version of at least 3.22.1 and a g++ version of at least 12.x.
# Install bitsandbytes from source
# Clone bitsandbytes repo, Ascend NPU backend is currently enabled on multi-backend-refactor branch
git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes/
# Install dependencies
pip install -r requirements-dev.txt
# Install the dependencies for the compilation tools. Note that the commands for this step may vary depending on the operating system. The following are provided for reference
apt-get install -y build-essential cmake
# Compile & install
cmake -DCOMPUTE_BACKEND=npu -S .
make
pip install .
- Install transformers from the main branch.
git clone -b main https://github.com/huggingface/transformers.git
cd transformers
pip install .
- Set
double_quantization: false
in the configuration. You can refer to the example.
Please refer to data/README.md for checking the details about the format of dataset files. You can use datasets on HuggingFace / ModelScope / Modelers hub, load the dataset in local disk, or specify a path to s3/gcs cloud storage.
[!NOTE] Please update
data/dataset_info.json
to use your custom dataset.
You can also use Easy Dataset, DataFlow and GraphGen to create synthetic data for fine-tuning.
Use the following 3 commands to run LoRA fine-tuning, inference and merging of the Llama3-8B-Instruct model, respectively.
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
See examples/README.md for advanced usage (including distributed training).
[!TIP] Use
llamafactory-cli help
to show help information.Read FAQs first if you encounter any problems.
Fine-Tuning with LLaMA Board GUI (powered by Gradio)
llamafactory-cli webui
Read our documentation.
For CUDA users:
cd docker/docker-cuda/
docker compose up -d
docker compose exec llamafactory bash
For Ascend NPU users:
cd docker/docker-npu/
docker compose up -d
docker compose exec llamafactory bash
For AMD ROCm users:
cd docker/docker-rocm/
docker compose up -d
docker compose exec llamafactory bash
Build without Docker Compose
For CUDA users:
docker build -f ./docker/docker-cuda/Dockerfile \
--build-arg PIP_INDEX=https://pypi.org/simple \
--build-arg EXTRAS=metrics \
-t llamafactory:latest .
docker run -dit --ipc=host --gpus=all \
-p 7860:7860 \
-p 8000:8000 \
--name llamafactory \
llamafactory:latest
docker exec -it llamafactory bash
For Ascend NPU users:
docker build -f ./docker/docker-npu/Dockerfile \
--build-arg PIP_INDEX=https://pypi.org/simple \
--build-arg EXTRAS=torch-npu,metrics \
-t llamafactory:latest .
docker run -dit --ipc=host \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-p 7860:7860 \
-p 8000:8000 \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--name llamafactory \
llamafactory:latest
docker exec -it llamafactory bash
For AMD ROCm users:
docker build -f ./docker/docker-rocm/Dockerfile \
--build-arg PIP_INDEX=https://pypi.org/simple \
--build-arg EXTRAS=metrics \
-t llamafactory:latest .
docker run -dit --ipc=host \
-p 7860:7860 \
-p 8000:8000 \
--device /dev/kfd \
--device /dev/dri \
--name llamafactory \
llamafactory:latest
docker exec -it llamafactory bash
Use Docker volumes
You can uncomment VOLUME [ "/root/.cache/huggingface", "/app/shared_data", "/app/output" ]
in the Dockerfile to use data volumes.
When building the Docker image, use -v ./hf_cache:/root/.cache/huggingface
argument to mount the local directory to the container. The following data volumes are available.
-
hf_cache
: Utilize Hugging Face cache on the host machine. -
shared_data
: The directionary to store datasets on the host machine. -
output
: Set export dir to this location so that the merged result can be accessed directly on the host machine.
API_PORT=8000 llamafactory-cli api examples/inference/llama3.yaml infer_backend=vllm vllm_enforce_eager=true
[!TIP] Visit this page for API document.
Examples: Image understanding | Function calling
If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.
export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows
Train the model by specifying a model ID of the ModelScope Hub as the model_name_or_path
. You can find a full list of model IDs at ModelScope Hub, e.g., LLM-Research/Meta-Llama-3-8B-Instruct
.
You can also use Modelers Hub to download models and datasets.
export USE_OPENMIND_HUB=1 # `set USE_OPENMIND_HUB=1` for Windows
Train the model by specifying a model ID of the Modelers Hub as the model_name_or_path
. You can find a full list of model IDs at Modelers Hub, e.g., TeleAI/TeleChat-7B-pt
.
To use Weights & Biases for logging experimental results, you need to add the following arguments to yaml files.
report_to: wandb
run_name: test_run # optional
Set WANDB_API_KEY
to your key when launching training tasks to log in with your W&B account.
To use SwanLab for logging experimental results, you need to add the following arguments to yaml files.
use_swanlab: true
swanlab_run_name: test_run # optional
When launching training tasks, you can log in to SwanLab in three ways:
- Add
swanlab_api_key=<your_api_key>
to the yaml file, and set it to your API key. - Set the environment variable
SWANLAB_API_KEY
to your API key. - Use the
swanlab login
command to complete the login.
If you have a project that should be incorporated, please contact via email or create a pull request.
Click to show
- Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [arxiv]
- Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [arxiv]
- Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [arxiv]
- Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [arxiv]
- Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [arxiv]
- Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. KDD 2024. [arxiv]
- Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. ACL 2024. [arxiv]
- Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. [arxiv]
- Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. [arxiv]
- Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. [arxiv]
- Yang et al. LaCo: Large Language Model Pruning via Layer Collaps. 2024. [arxiv]
- Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. [arxiv]
- Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. [arxiv]
- Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. ACL 2024 Findings. [arxiv]
- Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [arxiv]
- Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [arxiv]
- Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [arxiv]
- Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. ACL 2024. [arxiv]
- Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [arxiv]
- Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [arxiv]
- Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [arxiv]
- Wu et al. Large Language Models are Parallel Multilingual Learners. 2024. [arxiv]
- Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [arxiv]
- Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [arxiv]
- Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. COLING 2024. [arxiv]
- Zan et al. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. 2024. [arxiv]
- Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [arxiv]
- Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [arxiv]
- Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [arxiv]
- Ma et al. Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation. ICML 2024. [arxiv]
- Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [arxiv]
- Shang et al. How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models. 2024. [arxiv]
- Huang et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. 2024. [arxiv]
- Deng et al. Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction. 2024. [arxiv]
- Acikgoz et al. Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. 2024. [arxiv]
- Zhang et al. Small Language Models Need Strong Verifiers to Self-Correct Reasoning. ACL 2024 Findings. [arxiv]
- Zhou et al. FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering. NAACL 2024. [arxiv]
- Xu et al. Large Language Models for Cyber Security: A Systematic Literature Review. 2024. [arxiv]
- Dammu et al. "They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations. 2024. [arxiv]
- Yi et al. A safety realignment framework via subspace-oriented model fusion for large language models. 2024. [arxiv]
- Lou et al. SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling. 2024. [arxiv]
- Zhang et al. Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners. 2024. [arxiv]
- Zhang et al. TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models. 2024. [arxiv]
- Zihong Chen. Sentence Segmentation and Sentence Punctuation Based on XunziALLM. 2024. [paper]
- Gao et al. The Best of Both Worlds: Toward an Honest and Helpful Large Language Model. 2024. [arxiv]
- Wang and Song. MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset. 2024. [arxiv]
- Hu et al. Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models. 2024. [arxiv]
- Ge et al. Time Sensitive Knowledge Editing through Efficient Finetuning. ACL 2024. [arxiv]
- Tan et al. Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions. 2024. [arxiv]
- Song et al. Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. 2024. [arxiv]
- Gu et al. RWKV-CLIP: A Robust Vision-Language Representation Learner. 2024. [arxiv]
- Chen et al. Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees. 2024. [arxiv]
- Zhu et al. Are Large Language Models Good Statisticians?. 2024. [arxiv]
- Li et al. Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning. 2024. [arxiv]
- Ding et al. IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce. 2024. [arxiv]
- He et al. COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities. 2024. [arxiv]
- Lin et al. FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving. 2024. [arxiv]
- Treutlein et al. Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. 2024. [arxiv]
- Feng et al. SS-Bench: A Benchmark for Social Story Generation and Evaluation. 2024. [arxiv]
- Feng et al. Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement. 2024. [arxiv]
- Liu et al. Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals. 2024. [arxiv]
- Iyer et al. Exploring Very Low-Resource Translation with LLMs: The University of Edinburgh's Submission to AmericasNLP 2024 Translation Task. AmericasNLP 2024. [paper]
- Li et al. Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring. 2024. [arxiv]
- Yang et al. Financial Knowledge Large Language Model. 2024. [arxiv]
- Lin et al. DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging. 2024. [arxiv]
- Bako et al. Evaluating the Semantic Profiling Abilities of LLMs for Natural Language Utterances in Data Visualization. 2024. [arxiv]
- Huang et al. RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization. 2024. [arxiv]
- Jiang et al. LLM-Collaboration on Automatic Science Journalism for the General Audience. 2024. [arxiv]
- Inouye et al. Applied Auto-tuning on LoRA Hyperparameters. 2024. [paper]
- Qi et al. Research on Tibetan Tourism Viewpoints information generation system based on LLM. 2024. [arxiv]
- Xu et al. Course-Correction: Safety Alignment Using Synthetic Preferences. 2024. [arxiv]
- Sun et al. LAMBDA: A Large Model Based Data Agent. 2024. [arxiv]
- Zhu et al. CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare. 2024. [arxiv]
- Yu et al. Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment. 2024. [arxiv]
- Xie et al. The Power of Personalized Datasets: Advancing Chinese Composition Writing for Elementary School through Targeted Model Fine-Tuning. IALP 2024. [paper]
- Liu et al. Instruct-Code-Llama: Improving Capabilities of Language Model in Competition Level Code Generation by Online Judge Feedback. ICIC 2024. [paper]
- Wang et al. Cybernetic Sentinels: Unveiling the Impact of Safety Data Selection on Model Security in Supervised Fine-Tuning. ICIC 2024. [paper]
- Xia et al. Understanding the Performance and Estimating the Cost of LLM Fine-Tuning. 2024. [arxiv]
- Zeng et al. Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions. 2024. [arxiv]
- Xia et al. Using Pre-trained Language Model for Accurate ESG Prediction. FinNLP 2024. [paper]
- Liang et al. I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm. 2024. [arxiv]
- Bai et al. Aligning Large Language Model with Direct Multi-Preference Optimization for Recommendation. CIKM 2024. [paper]
- Zhang et al. CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling. ACL 2024. [paper]
- StarWhisper: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
- DISC-LawLLM: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
- Sunsimiao: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
- CareGPT: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
- MachineMindset: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.
- Luminia-13B-v3: A large language model specialized in generate metadata for stable diffusion. [demo]
- Chinese-LLaVA-Med: A multimodal large language model specialized in Chinese medical domain, based on LLaVA-1.5-7B.
- AutoRE: A document-level relation extraction system based on large language models.
- NVIDIA RTX AI Toolkit: SDKs for fine-tuning LLMs on Windows PC for NVIDIA RTX.
- LazyLLM: An easy and lazy way for building multi-agent LLMs applications and supports model fine-tuning via LLaMA Factory.
- RAG-Retrieval: A full pipeline for RAG retrieval model fine-tuning, inference, and distillation. [blog]
- 360-LLaMA-Factory: A modified library that supports long sequence SFT & DPO using ring attention.
- Sky-T1: An o1-like model fine-tuned by NovaSky AI with very small cost.
- WeClone: One-stop solution for creating your digital avatar from chat logs.
- EmoLLM: A project about large language models (LLMs) and mental health.
This repository is licensed under the Apache-2.0 License.
Please follow the model licenses to use the corresponding model weights: Baichuan 2 / BLOOM / ChatGLM3 / Command R / DeepSeek / Falcon / Gemma / GLM-4 / GPT-2 / Granite / Index / InternLM / Llama / Llama 2 / Llama 3 / Llama 4 / MiniCPM / Mistral/Mixtral/Pixtral / OLMo / Phi-1.5/Phi-2 / Phi-3/Phi-4 / Qwen / Skywork / StarCoder 2 / TeleChat2 / XVERSE / Yi / Yi-1.5 / Yuan 2
If this work is helpful, please kindly cite as:
@inproceedings{zheng2024llamafactory,
title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Zhangchi Feng and Yongqiang Ma},
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
address={Bangkok, Thailand},
publisher={Association for Computational Linguistics},
year={2024},
url={http://arxiv.org/abs/2403.13372}
}
This repo benefits from PEFT, TRL, QLoRA and FastChat. Thanks for their wonderful works.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for LLaMA-Factory
Similar Open Source Tools

LLaMA-Factory
LLaMA Factory is a unified framework for fine-tuning 100+ large language models (LLMs) with various methods, including pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO. It features integrated algorithms like GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning, as well as practical tricks like FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA. LLaMA Factory provides experiment monitors like LlamaBoard, TensorBoard, Wandb, MLflow, etc., and supports faster inference with OpenAI-style API, Gradio UI and CLI with vLLM worker. Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.

helicone
Helicone is an open-source observability platform designed for Language Learning Models (LLMs). It logs requests to OpenAI in a user-friendly UI, offers caching, rate limits, and retries, tracks costs and latencies, provides a playground for iterating on prompts and chat conversations, supports collaboration, and will soon have APIs for feedback and evaluation. The platform is deployed on Cloudflare and consists of services like Web (NextJs), Worker (Cloudflare Workers), Jawn (Express), Supabase, and ClickHouse. Users can interact with Helicone locally by setting up the required services and environment variables. The platform encourages contributions and provides resources for learning, documentation, and integrations.

llama.cpp
llama.cpp is a C++ implementation of LLaMA, a large language model from Meta. It provides a command-line interface for inference and can be used for a variety of tasks, including text generation, translation, and question answering. llama.cpp is highly optimized for performance and can be run on a variety of hardware, including CPUs, GPUs, and TPUs.

InternLM
InternLM is a powerful language model series with features such as 200K context window for long-context tasks, outstanding comprehensive performance in reasoning, math, code, chat experience, instruction following, and creative writing, code interpreter & data analysis capabilities, and stronger tool utilization capabilities. It offers models in sizes of 7B and 20B, suitable for research and complex scenarios. The models are recommended for various applications and exhibit better performance than previous generations. InternLM models may match or surpass other open-source models like ChatGPT. The tool has been evaluated on various datasets and has shown superior performance in multiple tasks. It requires Python >= 3.8, PyTorch >= 1.12.0, and Transformers >= 4.34 for usage. InternLM can be used for tasks like chat, agent applications, fine-tuning, deployment, and long-context inference.

llama.cpp
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud. It provides a Plain C/C++ implementation without any dependencies, optimized for Apple silicon via ARM NEON, Accelerate and Metal frameworks, and supports various architectures like AVX, AVX2, AVX512, and AMX. It offers integer quantization for faster inference, custom CUDA kernels for NVIDIA GPUs, Vulkan and SYCL backend support, and CPU+GPU hybrid inference. llama.cpp is the main playground for developing new features for the ggml library, supporting various models and providing tools and infrastructure for LLM deployment.

achatbot
achatbot is a factory tool that allows users to create chat bots with various functionalities such as llm (language models), asr (automatic speech recognition), tts (text-to-speech), vad (voice activity detection), ocr (optical character recognition), and object detection. The tool provides a structured project with features like chat bots for cmd, grpc, and http servers. It supports various chat bot processors, transport connectors, and AI modules for different tasks. Users can run chat bots locally or deploy them on cloud services like vercel, Cloudflare, AWS Lambda, or Docker. The tool also includes UI components for easy deployment and service architecture diagrams for reference.

swift
SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) supports training, inference, evaluation and deployment of nearly **200 LLMs and MLLMs** (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by [PEFT](https://github.com/huggingface/peft), we also provide a complete **Adapters library** to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts. To facilitate use by users unfamiliar with deep learning, we provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners. Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.

intel-extension-for-transformers
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the below key features and examples: * Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor) * Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754)) * Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa) * [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of [plugins](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/advanced_features.md) such as [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), and [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md). This framework supports Intel Gaudi2/CPU/GPU. * [Inference](https://github.com/intel/neural-speed/tree/main) of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting [GPT-NEOX](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox), [LLAMA](https://github.com/intel/neural-speed/tree/main/neural_speed/models/llama), [MPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/mpt), [FALCON](https://github.com/intel/neural-speed/tree/main/neural_speed/models/falcon), [BLOOM-7B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/bloom), [OPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/opt), [ChatGLM2-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/chatglm), [GPT-J-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptj), and [Dolly-v2-3B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox). Support AMX, VNNI, AVX512F and AVX2 instruction set. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html).

NewEraAI-Papers
The NewEraAI-Papers repository provides links to collections of influential and interesting research papers from top AI conferences, along with open-source code to promote reproducibility and provide detailed implementation insights beyond the scope of the article. Users can stay up to date with the latest advances in AI research by exploring this repository. Contributions to improve the completeness of the list are welcomed, and users can create pull requests, open issues, or contact the repository owner via email to enhance the repository further.

Hands-On-Large-Language-Models-CN
Hands-On Large Language Models CN(ZH) is a Chinese version of the book 'Hands-On Large Language Models' by Jay Alammar and Maarten Grootendorst. It provides detailed code annotations and additional insights, offers Notebook versions suitable for Chinese network environments, utilizes openbayes for free GPU access, allows convenient environment setup with vscode, and includes accompanying Chinese language videos on platforms like Bilibili and YouTube. The book covers various chapters on topics like Tokens and Embeddings, Transformer LLMs, Text Classification, Text Clustering, Prompt Engineering, Text Generation, Semantic Search, Multimodal LLMs, Text Embedding Models, Fine-tuning Models, and more.

Steel-LLM
Steel-LLM is a project to pre-train a large Chinese language model from scratch using over 1T of data to achieve a parameter size of around 1B, similar to TinyLlama. The project aims to share the entire process including data collection, data processing, pre-training framework selection, model design, and open-source all the code. The goal is to enable reproducibility of the work even with limited resources. The name 'Steel' is inspired by a band '万能青年旅店' and signifies the desire to create a strong model despite limited conditions. The project involves continuous data collection of various cultural elements, trivia, lyrics, niche literature, and personal secrets to train the LLM. The ultimate aim is to fill the model with diverse data and leave room for individual input, fostering collaboration among users.

cia
CIA is a powerful open-source tool designed for data analysis and visualization. It provides a user-friendly interface for processing large datasets and generating insightful reports. With CIA, users can easily explore data, perform statistical analysis, and create interactive visualizations to communicate findings effectively. Whether you are a data scientist, analyst, or researcher, CIA offers a comprehensive set of features to streamline your data analysis workflow and uncover valuable insights.

VideoLLaMA2
VideoLLaMA 2 is a project focused on advancing spatial-temporal modeling and audio understanding in video-LLMs. It provides tools for multi-choice video QA, open-ended video QA, and video captioning. The project offers model zoo with different configurations for visual encoder and language decoder. It includes training and evaluation guides, as well as inference capabilities for video and image processing. The project also features a demo setup for running a video-based Large Language Model web demonstration.

PocketFlow
Pocket Flow is a 100-line minimalist LLM framework designed for (Multi-)Agents, Workflow, RAG, etc. It provides a core abstraction for LLM projects by focusing on computation and communication through a graph structure and shared store. The framework aims to support the development of LLM Agents, such as Cursor AI, by offering a minimal and low-level approach that is well-suited for understanding and usage. Users can install Pocket Flow via pip or by copying the source code, and detailed documentation is available on the project website.

duckduckgpt
DuckDuckGPT brings the magic of ChatGPT to DDG (powered by GPT-4!). DuckDuckGPT is a browser extension that allows you to use ChatGPT within DuckDuckGo. This means you can ask ChatGPT questions, get help with tasks, and generate creative content, all without leaving DuckDuckGo. DuckDuckGPT is easy to use. Once you have installed the extension, simply type your question into the DuckDuckGo search bar and hit enter. ChatGPT will then generate a response that will appear below the search results. DuckDuckGPT is a powerful tool that can help you with a wide variety of tasks. Here are just a few examples of what you can use it for: * Get help with research * Write essays and other creative content * Translate languages * Get coding help * Answer trivia questions * And much more! DuckDuckGPT is still in development, but it is already a very powerful tool. As GPT-4 continues to improve, DuckDuckGPT will only get better. So if you are looking for a way to make your DuckDuckGo searches more productive, be sure to give DuckDuckGPT a try.

bravegpt
BraveGPT is a userscript that brings the power of ChatGPT to Brave Search. It allows users to engage with a conversational AI assistant directly within their search results, providing instant and personalized responses to their queries. BraveGPT is powered by GPT-4, the latest and most advanced language model from OpenAI, ensuring accurate and comprehensive answers. With BraveGPT, users can ask questions, get summaries, generate creative content, and more, all without leaving the Brave Search interface. The tool is easy to install and use, making it accessible to users of all levels. BraveGPT is a valuable addition to the Brave Search experience, enhancing its capabilities and providing users with a more efficient and informative search experience.
For similar tasks

litgpt
LitGPT is a command-line tool designed to easily finetune, pretrain, evaluate, and deploy 20+ LLMs **on your own data**. It features highly-optimized training recipes for the world's most powerful open-source large-language-models (LLMs).

llm-engine
Scale's LLM Engine is an open-source Python library, CLI, and Helm chart that provides everything you need to serve and fine-tune foundation models, whether you use Scale's hosted infrastructure or do it in your own cloud infrastructure using Kubernetes.

LLaMA-Factory
LLaMA Factory is a unified framework for fine-tuning 100+ large language models (LLMs) with various methods, including pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO. It features integrated algorithms like GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning, as well as practical tricks like FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA. LLaMA Factory provides experiment monitors like LlamaBoard, TensorBoard, Wandb, MLflow, etc., and supports faster inference with OpenAI-style API, Gradio UI and CLI with vLLM worker. Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7 times faster training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.

LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of Large Language Models (LLMs) and build intelligent applications that push the boundaries of natural language understanding. This GitHub repository provides in-depth articles, codebase mastery, LLM PlayLab, and resources for cost analysis and network visualization. It covers various aspects of LLMs, including NLP, models, training, evaluation metrics, open LLMs, and more. The repository also includes a collection of code examples and tutorials to help users build and deploy LLM-based applications.

llm-foundry
LLM Foundry is a codebase for training, finetuning, evaluating, and deploying LLMs for inference with Composer and the MosaicML platform. It is designed to be easy-to-use, efficient _and_ flexible, enabling rapid experimentation with the latest techniques. You'll find in this repo: * `llmfoundry/` - source code for models, datasets, callbacks, utilities, etc. * `scripts/` - scripts to run LLM workloads * `data_prep/` - convert text data from original sources to StreamingDataset format * `train/` - train or finetune HuggingFace and MPT models from 125M - 70B parameters * `train/benchmarking` - profile training throughput and MFU * `inference/` - convert models to HuggingFace or ONNX format, and generate responses * `inference/benchmarking` - profile inference latency and throughput * `eval/` - evaluate LLMs on academic (or custom) in-context-learning tasks * `mcli/` - launch any of these workloads using MCLI and the MosaicML platform * `TUTORIAL.md` - a deeper dive into the repo, example workflows, and FAQs

EdgeChains
EdgeChains is an open-source chain-of-thought engineering framework tailored for Large Language Models (LLMs)- like OpenAI GPT, LLama2, Falcon, etc. - With a focus on enterprise-grade deployability and scalability. EdgeChains is specifically designed to **orchestrate** such applications. At EdgeChains, we take a unique approach to Generative AI - we think Generative AI is a deployment and configuration management challenge rather than a UI and library design pattern challenge. We build on top of a tech that has solved this problem in a different domain - Kubernetes Config Management - and bring that to Generative AI. Edgechains is built on top of jsonnet, originally built by Google based on their experience managing a vast amount of configuration code in the Borg infrastructure.

llm-action
This repository provides a comprehensive guide to large language models (LLMs), covering various aspects such as training, fine-tuning, compression, and applications. It includes detailed tutorials, code examples, and explanations of key concepts and techniques. The repository is maintained by Liguo Dong, an AI researcher and engineer with expertise in LLM research and development.

llm-on-openshift
This repository provides resources, demos, and recipes for working with Large Language Models (LLMs) on OpenShift using OpenShift AI or Open Data Hub. It includes instructions for deploying inference servers for LLMs, such as vLLM, Hugging Face TGI, Caikit-TGIS-Serving, and Ollama. Additionally, it offers guidance on deploying serving runtimes, such as vLLM Serving Runtime and Hugging Face Text Generation Inference, in the Single-Model Serving stack of Open Data Hub or OpenShift AI. The repository also covers vector databases that can be used as a Vector Store for Retrieval Augmented Generation (RAG) applications, including Milvus, PostgreSQL+pgvector, and Redis. Furthermore, it provides examples of inference and application usage, such as Caikit, Langchain, Langflow, and UI examples.
For similar jobs

ChatFAQ
ChatFAQ is an open-source comprehensive platform for creating a wide variety of chatbots: generic ones, business-trained, or even capable of redirecting requests to human operators. It includes a specialized NLP/NLG engine based on a RAG architecture and customized chat widgets, ensuring a tailored experience for users and avoiding vendor lock-in.

anything-llm
AnythingLLM is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions.

ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

mikupad
mikupad is a lightweight and efficient language model front-end powered by ReactJS, all packed into a single HTML file. Inspired by the likes of NovelAI, it provides a simple yet powerful interface for generating text with the help of various backends.

glide
Glide is a cloud-native LLM gateway that provides a unified REST API for accessing various large language models (LLMs) from different providers. It handles LLMOps tasks such as model failover, caching, key management, and more, making it easy to integrate LLMs into applications. Glide supports popular LLM providers like OpenAI, Anthropic, Azure OpenAI, AWS Bedrock (Titan), Cohere, Google Gemini, OctoML, and Ollama. It offers high availability, performance, and observability, and provides SDKs for Python and NodeJS to simplify integration.

onnxruntime-genai
ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high level `generate()` method, or run each iteration of the model in a loop. It supports greedy/beam search and TopP, TopK sampling to generate token sequences, has built in logits processing like repetition penalties, and allows for easy custom scoring.

firecrawl
Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown. It crawls all accessible subpages and provides clean markdown for each, without requiring a sitemap. The API is easy to use and can be self-hosted. It also integrates with Langchain and Llama Index. The Python SDK makes it easy to crawl and scrape websites in Python code.