unsloth
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek, Qwen, Llama, Gemma, TTS 2x faster with 70% less VRAM.
Stars: 51884
Unsloth is a free, open-source tool for fine-tuning large language models (LLMs) 2-5x faster with 80% less memory. It can fine-tune LLMs such as Gemma, Mistral, Llama, TinyLlama, and CodeLlama 34B. Unsloth supports 4-bit and 16-bit QLoRA/LoRA fine-tuning via bitsandbytes, as well as DPO (Direct Preference Optimization), PPO, and reward modelling. It is compatible with Hugging Face's TRL, Trainer, Seq2SeqTrainer, and plain PyTorch code, and runs on NVIDIA GPUs from 2018 onward (minimum CUDA Capability 7.0).
README:
Notebooks are beginner-friendly. Read our guide: add your dataset, run all cells, then deploy your trained model.
| Model | Performance | Memory use |
|---|---|---|
| gpt-oss (20B) | 1.5x faster | 70% less |
| gpt-oss (20B): GRPO | 2x faster | 80% less |
| Qwen3: Advanced GRPO | 2x faster | 50% less |
| Qwen3-VL (8B): GSPO | 1.5x faster | 80% less |
| Gemma 3 (4B) Vision | 1.7x faster | 60% less |
| Gemma 3n (e4B) | 1.5x faster | 50% less |
| embeddinggemma (300M) | 2x faster | 20% less |
| Mistral Ministral 3 (3B) | 1.5x faster | 60% less |
| Llama 3.1 (8B) Alpaca | 2x faster | 70% less |
| Llama 3.2 Conversational | 2x faster | 70% less |
| Orpheus-TTS (3B) | 1.5x faster | 50% less |
- See all our notebooks for: Kaggle, GRPO, TTS, embedding & Vision
- See all our models and all our notebooks
- See detailed documentation for Unsloth here
pip install unsloth

For Windows, `pip install unsloth` works only if you have PyTorch installed. Read our Windows Guide.
Use our official Unsloth Docker image, unsloth/unsloth. Read our Docker Guide.
For RTX 50x, B200, 6000 GPUs: pip install unsloth. Read our Blackwell Guide and DGX Spark Guide for more details.
- Embedding models: Unsloth now supports ~1.8-3.3x faster embedding fine-tuning. Blog • Notebooks
- New 7x longer context RL vs. all other setups, via our new batching algorithms. Blog
- New RoPE & MLP Triton Kernels & Padding Free + Packing: 3x faster training & 30% less VRAM. Blog
- 500K Context: Training a 20B model with >500K context is now possible on an 80GB GPU. Blog
- FP8 Reinforcement Learning: You can now do FP8 GRPO on consumer GPUs. Blog • Notebook
- DeepSeek-OCR: Fine-tune to improve language understanding by 89%. Guide • Notebook
- Docker: Use Unsloth with no setup & environment issues with our new image. Guide • Docker image
- Vision RL: You can now train VLMs with GRPO or GSPO in Unsloth! Read guide
- gpt-oss by OpenAI: Read our RL blog, Flex Attention blog and gpt-oss Guide. 20B works on 14GB VRAM. 120B on 65GB.
Click for more news
- Quantization-Aware Training: We collaborated with PyTorch, recovering ~70% accuracy. Read blog
- Memory-efficient RL: We're introducing even better RL. Our new kernels & algorithms allow faster RL with 50% less VRAM & 10× more context. Read blog
- Mistral 3: Run Ministral 3 or Devstral 2 and fine-tune with vision/RL sudoku notebooks. Guide • Notebooks
- Gemma 3n by Google: Read Blog. We uploaded GGUFs, 4-bit models.
- Text-to-Speech (TTS) is now supported, including `sesame/csm-1b`, and STT `openai/whisper-large-v3`.
- Qwen3 is now supported. Qwen3-30B-A3B fits on 17.5GB VRAM.
- Introducing Dynamic 2.0 quants that set new benchmarks on 5-shot MMLU & Aider Polyglot.
- EVERYTHING is now supported - all models (TTS, BERT, Mamba), FFT, etc. MultiGPU is now supported. Enable FFT with `full_finetuning = True`, 8-bit with `load_in_8bit = True`.
- 📣 DeepSeek-R1 - run or fine-tune them with our guide. All model uploads: here.
- 📣 Introducing Long-context Reasoning (GRPO) in Unsloth. Train your own reasoning model with just 5GB VRAM. Transform Llama, Phi, Mistral etc. into reasoning LLMs!
- 📣 Introducing Unsloth Dynamic 4-bit Quantization! We dynamically opt not to quantize certain parameters and this greatly increases accuracy while only using <10% more VRAM than BnB 4-bit. See our collection on Hugging Face here.
- 📣 Llama 4 by Meta, including Scout & Maverick are now supported.
- 📣 Phi-4 by Microsoft: We also fixed bugs in Phi-4 and uploaded GGUFs, 4-bit.
- 📣 Vision models now supported! Llama 3.2 Vision (11B), Qwen 2.5 VL (7B) and Pixtral (12B) 2409
- 📣 Llama 3.3 (70B), Meta's latest model is supported.
- 📣 We worked with Apple to add Cut Cross Entropy. Unsloth now supports 89K context for Meta's Llama 3.3 (70B) on a 80GB GPU - 13x longer than HF+FA2. For Llama 3.1 (8B), Unsloth enables 342K context, surpassing its native 128K support.
- 📣 We found and helped fix a gradient accumulation bug! Please update Unsloth and transformers.
- 📣 We cut memory usage by a further 30% and now support 4x longer context windows!
| Type | Links |
|---|---|
| Reddit | Join Reddit community |
| 📚 Documentation & Wiki | Read Our Docs |
| X (Twitter) | Follow us on X |
| 💾 Installation | Pip & Docker Install |
| 🔮 Our Models | Unsloth Catalog |
| ✍️ Blog | Read our Blogs |
- Supports full-finetuning, pretraining, 4-bit, 16-bit and FP8 training
- Supports all models including TTS, multimodal, embedding and more! Any model that works in transformers, works in Unsloth.
- The most efficient library for Reinforcement Learning (RL), using 80% less VRAM. Supports GRPO, GSPO, DrGRPO, DAPO etc.
- 0% loss in accuracy - no approximation methods - all exact.
- Export and deploy your model to GGUF, llama.cpp, vLLM, SGLang and Hugging Face.
- Supports NVIDIA (since 2018), AMD and Intel GPUs. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc)
- Works on Linux, WSL and Windows
- All kernels written in OpenAI's Triton language. Manual backprop engine.
- If you trained a model with 🦥Unsloth, you can use this cool sticker!
You can also see our docs for more detailed installation and updating instructions here.
Unsloth supports Python 3.13 or lower.
Install with pip (recommended) for Linux devices:
pip install unsloth
To update Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
See here for advanced pip install instructions.
1. Install NVIDIA Video Driver: You should install the latest driver for your GPU. Download drivers here: NVIDIA GPU Driver.

2. Install Visual Studio C++: You will need Visual Studio with C++ installed. By default, C++ is not installed with Visual Studio, so make sure you select all of the C++ options. Also select the options for the Windows 10/11 SDK. For detailed instructions with options, see here.

3. Install CUDA Toolkit: Follow the instructions to install the CUDA Toolkit.

4. Install PyTorch: You will need the version of PyTorch that is compatible with your CUDA drivers, so make sure to select it carefully. Install PyTorch.

5. Install Unsloth:

pip install unsloth

For advanced installation instructions or if you see weird errors during installations:
First try using an isolated environment, then pip install unsloth:

python -m venv unsloth
source unsloth/bin/activate
pip install unsloth

- Install `torch` and `triton`. Go to https://pytorch.org to install them. For example: `pip install torch torchvision torchaudio triton`
- Confirm that CUDA is installed correctly. Try `nvcc`. If that fails, you need to install `cudatoolkit` or CUDA drivers.
- Install `xformers` manually via:

pip install ninja
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

Check if `xformers` succeeded with `python -m xformers.info`. Go to https://github.com/facebookresearch/xformers. Another option is to install `flash-attn` for Ampere GPUs and ignore `xformers`.
- For GRPO runs, you can try installing `vllm` and seeing if `pip install vllm` succeeds.
- Double check that your versions of Python, CUDA, cuDNN, `torch`, `triton`, and `xformers` are compatible with one another. The PyTorch Compatibility Matrix may be useful.
- Finally, install `bitsandbytes` and check it with `python -m bitsandbytes`.
⚠️ Only use Conda if you have it. If not, use pip. We support Python 3.10, 3.11, 3.12 and 3.13.

conda create --name unsloth_env python==3.12 -y
conda activate unsloth_env

Use nvidia-smi to find your CUDA version, e.g. 13.0, which becomes cu130:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
pip3 install unsloth

If you're looking to install Conda in a Linux environment, read here, or run the below 🔽
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

⚠️ Do **NOT** use this if you have Conda. Pip is a bit more complex since there are dependency issues. The pip command differs for torch 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10 and for different CUDA versions.
For other torch versions, we support torch211, torch212, torch220, torch230, torch240, torch250, torch260, torch270, torch280, torch290 and torch2100; for CUDA versions, we support cu118, cu121 and cu124. For Ampere devices (A100, H100, RTX 3090) and above, use cu118-ampere, cu121-ampere or cu124-ampere. Note: torch 2.10 only supports CUDA 12.6, 12.8, and 13.0.
For example, if you have torch 2.4 and CUDA 12.1, use:
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Another example, if you have torch 2.9 and CUDA 13.0, use:
pip install --upgrade pip
pip install "unsloth[cu130-torch290] @ git+https://github.com/unslothai/unsloth.git"

Another example, if you have torch 2.10 and CUDA 12.6, use:
pip install --upgrade pip
pip install "unsloth[cu126-torch2100] @ git+https://github.com/unslothai/unsloth.git"

And other examples:
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

Or, run the below in a terminal to get the optimal pip installation command:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

Or, run the below manually in a Python REPL:
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
import re
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v < V('2.3.0'): x = 'cu{}{}-torch220'
elif v < V('2.4.0'): x = 'cu{}{}-torch230'
elif v < V('2.5.0'): x = 'cu{}{}-torch240'
elif v < V('2.5.1'): x = 'cu{}{}-torch250'
elif v <= V('2.5.1'): x = 'cu{}{}-torch251'
elif v < V('2.7.0'): x = 'cu{}{}-torch260'
elif v < V('2.7.9'): x = 'cu{}{}-torch270'
elif v < V('2.8.0'): x = 'cu{}{}-torch271'
elif v < V('2.8.9'): x = 'cu{}{}-torch280'
elif v < V('2.9.1'): x = 'cu{}{}-torch290'
elif v < V('2.9.2'): x = 'cu{}{}-torch291'
elif v < V('2.10.1'): x = 'cu{}{}-torch2100'
else: raise RuntimeError(f"Torch = {v} too new!")
if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v >= V('2.10.0') and cuda not in ("12.6", "12.8", "13.0"): raise RuntimeError(f"Torch 2.10 requires CUDA 12.6, 12.8, or 13.0! Got CUDA = {cuda}")
x = x.format(cuda.replace(".", ""), "-ampere" if False else "") # is_ampere is broken due to flash-attn
print(f'pip install --upgrade pip && pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git" --no-build-isolation')

You can use our pre-built Docker container with all dependencies to use Unsloth instantly with no setup required. Read our guide.
This container requires installing NVIDIA's Container Toolkit.
docker run -d -e JUPYTER_PASSWORD="mypassword" \
-p 8888:8888 -p 2222:22 \
-v $(pwd)/work:/workspace/work \
--gpus all \
unsloth/unsloth

Access Jupyter Lab at http://localhost:8888 and start fine-tuning!
- Go to our official Documentation for running models, saving to GGUF, checkpointing, evaluation and more!
- Read our Guides for: Fine-tuning, Reinforcement Learning, Text-to-Speech (TTS), Vision and any model.
- We support Hugging Face's transformers, TRL, Trainer, Seq2SeqTrainer and plain PyTorch code.
Unsloth example code to fine-tune gpt-oss-20b:
from unsloth import FastLanguageModel, FastModel, FastVisionModel
import torch
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4-bit pre-quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/gpt-oss-20b-unsloth-bnb-4bit", #or choose any model
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/gpt-oss-20b",
max_seq_length = max_seq_length, # Choose any for long context!
load_in_4bit = True, # 4-bit quantization. False = 16-bit LoRA.
load_in_8bit = False, # 8-bit quantization
load_in_16bit = False, # 16-bit LoRA
full_finetuning = False, # Use for full fine-tuning.
trust_remote_code = False, # Enable to support new models
# token = "hf_...", # use one if using gated models
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
tokenizer = tokenizer,
args = SFTConfig(
max_seq_length = max_seq_length,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://unsloth.ai/docs for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM or SGLang
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
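
Building on point (1) in the comments above, here is a minimal, hedged sketch of exporting and sanity-checking the fine-tuned model. The method names (`save_pretrained_merged`, `save_pretrained_gguf`, `FastLanguageModel.for_inference`) follow the Unsloth saving docs, but treat the exact signatures, output paths and the `q4_k_m` quantization choice as assumptions to verify against your installed version.

```python
# Hedged sketch: continues from the `model` and `tokenizer` objects trained above.
# Method names follow the Unsloth saving docs; verify them against your installed version.

# (a) Merge the LoRA adapter into 16-bit weights for serving with vLLM or SGLang
model.save_pretrained_merged("outputs/merged_16bit", tokenizer, save_method = "merged_16bit")

# (b) Export a llama.cpp-compatible GGUF file (q4_k_m is one common quantization choice)
model.save_pretrained_gguf("outputs/gguf", tokenizer, quantization_method = "q4_k_m")

# (c) Quick generation check with the adapter still attached
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)  # switches Unsloth into its faster inference mode
inputs = tokenizer(["What is 2 + 2?"], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True))
```

The merged 16-bit folder is intended for vLLM/SGLang-style serving, while the GGUF file targets llama.cpp-based runtimes.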
RL including GRPO, GSPO, FP8 training, DrGRPO, DAPO, PPO, Reward Modelling and Online DPO all work with Unsloth. Read our Reinforcement Learning Guide or our advanced RL docs for batching, generation & training parameters. A minimal GRPO sketch follows the notebook list below.
List of RL notebooks:
- gpt-oss GRPO notebook: Link
- FP8 Qwen3-8B GRPO notebook (L4): Link
- Qwen3-VL GSPO notebook: Link
- Advanced Qwen3 GRPO notebook: Link
- ORPO notebook: Link
- DPO Zephyr notebook: Link
- KTO notebook: Link
- SimPO notebook: Link
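
To make the GRPO flow concrete, here is a minimal, hedged sketch that pairs an Unsloth-loaded model with TRL's `GRPOTrainer`. The toy prompts and the length-based reward are illustrative assumptions, not part of the official notebooks; `fast_inference` mirrors the vLLM-backed setting used in the GRPO notebooks but should be checked against your Unsloth and vLLM versions (it can be omitted to fall back to regular generation).

```python
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Load a base model with 4-bit weights and LoRA adapters.
# fast_inference = True assumes vLLM is installed; drop it if not.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B",   # assumption: any Unsloth-supported base model
    max_seq_length = 1024,
    load_in_4bit = True,
    fast_inference = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Toy prompt dataset; GRPOTrainer expects a "prompt" column.
dataset = Dataset.from_dict({"prompt": ["Explain LoRA in one sentence."] * 64})

# Toy reward: prefer completions close to 200 characters (purely illustrative).
def reward_length(completions, **kwargs):
    return [-abs(len(c) - 200) / 200.0 for c in completions]

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [reward_length],
    args = GRPOConfig(
        per_device_train_batch_size = 4,
        num_generations = 4,            # group size sampled per prompt
        max_prompt_length = 256,
        max_completion_length = 256,
        max_steps = 30,
        logging_steps = 1,
        output_dir = "grpo_outputs",
    ),
    train_dataset = dataset,
)
trainer.train()
```

Real runs replace the toy reward with task-specific reward functions (correctness, formatting, etc.) as shown in the notebooks above.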
- For our most detailed benchmarks, read our Llama 3.3 Blog.
- Benchmarking of Unsloth was also conducted by 🤗Hugging Face.
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down); a configuration sketch follows the table below:
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 😊 Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |
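
For readers who want to see how the benchmark configuration above maps onto code, here is a hedged sketch of a comparable run: rank-32 QLoRA on all linear projections, batch size 2, gradient accumulation 4. The model name and the `tatsu-lab/alpaca` dataset (which ships a pre-formatted `text` column) are illustrative assumptions, not the exact benchmark artifacts.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

max_seq_length = 2048

# QLoRA setup mirroring the benchmark description: 4-bit base model, rank-32 adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",  # assumption: any supported base model
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                                    # rank used in the benchmarks above
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],  # all linear layers
)

# Alpaca-style dataset; tatsu-lab/alpaca provides a ready-made "text" column
dataset = load_dataset("tatsu-lab/alpaca", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        per_device_train_batch_size = 2,       # as in the benchmark
        gradient_accumulation_steps = 4,       # as in the benchmark
        max_steps = 60,
        output_dir = "outputs",
    ),
)
trainer.train()
```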
We tested Llama 3.1 (8B) Instruct and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.
| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
|---|---|---|
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 40 GB | 153,977 | 12,264 |
| 48 GB | 191,728 | 15,502 |
| 80 GB | 342,733 | 28,454 |
We tested Llama 3.3 (70B) Instruct on a 80GB A100 and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.
| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
|---|---|---|
| 48 GB | 12,106 | OOM |
| 80 GB | 89,389 | 6,916 |
You can cite the Unsloth repo as follows:
@software{unsloth,
author = {Daniel Han, Michael Han and Unsloth team},
title = {Unsloth},
url = {https://github.com/unslothai/unsloth},
year = {2023}
}

- The llama.cpp library that lets users save models with Unsloth
- The Hugging Face team and their libraries: transformers and TRL
- The PyTorch and Torch AO teams for their contributions
- And of course for every single person who has contributed or has used Unsloth!