one-click-llms

one-click-llms

One click templates for inferencing Language Models

Stars: 107

Visit
 screenshot

The one-click-llms repository provides templates for quickly setting up an API for language models. It includes advanced inferencing scripts for function calling and offers various models for text generation and fine-tuning tasks. Users can choose between Runpod and Vast.AI for different GPU configurations, with recommendations for optimal performance. The repository also supports Trelis Research and offers templates for different model sizes and types, including multi-modal APIs and chat models.

README:

one-click-llms

[!TIP] Post a new issue if you would like other templates. Quickly boot up an API endpoint for a given language, vision or speech/transcription model.

Built by Trelis Research YouTube, Newsletter, Inferencing Scripts

Runpod One-Click Templates

[!TIP] To support the Trelis Research YouTube channel, you can sign up for an account with this link. Trelis is supported by a commission when you use one-click templates.

GPU Choices/Recommendations (last updated Oct 15 2024):

  1. VALUE and best UI: A40 on Runpod (48 GB VRAM) ~$0.39/hr.
  2. Higher Speed: H100 PCI or SXM (80 GB VRAM) - best for fp8 models, but expensive.

Fine-tuning Notebook Setup

  • CUDA 12.1 one-click template here

Inference Engines

  • [Transcription] Faster Whisper Server (Transcription only)
  • [LLMs] SGLang is the fastest across all batch sizes.
  • [LLMs and Multi-modal LLMs] vLLM and TGI are close on speed for small batches.
  • [Multi-modal LLM] Moondream API (tiny vision + text language model).
  • [LLMs] Nvidia NIM (paid service from Nvidia): a bit slower than SGLang. Also inconvenient to use as it requires login.

Faster Whisper

SGLang (from lmsys)

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

Note: The vLLM image has compatibility issues with certain Runpod CUDA drivers, leading to issues on certain pods. A6000 Ada is typically an option that works.

[!IMPORTANT] Note: vLLM runs into issues sometimes if the pod template does not have the correct CUDA drivers. Unfortunately there is no way to know when picking a GPU. An issue has been raised here. As an alternative, you can run TGI (and even query in openai style, guide here). TGI is faster than vLLM and recommended in general. Note however, that TGI does not automatically apply the chat template to the prompt when using the OpenAI style endpoint.

Text Generation Inference:

llama.cpp One-click templates:

Nvidia NIM

MoonDream Multi-modal API (openai-ish)

HuggingFace Speech-to-Speech

[!TIP] As of July 23rd 2024, function calling fine-tuned models are being deprecated in favour of a one-shot approach with stronger models. Find the "Tool Use" video on the Trelis YouTube Channel for more info.

Changelog

15Oct2024:

  • Add whisper turbo endpoint
  • Deprecate Vast.AI templates.

20Jul2023:

  • Update the ./llama-server.sh command in line with breaking changes to llama.cpp

Feb 16 2023:

  • Added a Mamba one click template.

Jan 21 2023:

  • Swapped Runpod to before Vast.AI as user experience is much better with Runpod.

Jan 9 2023:

  • Added Mixtral Instruct AWQ TGI

Dec 30 2023:

  • Support gated models by adding HUGGING_FACE_HUB_TOKEN env variable.
  • Speed up downloading using HuggingFace API.

Dec 29 2023:

  • Add in one-click llama.cpp server template.

Vast AI One-Click Templates (DEPRECATED AS OF OCTOBER 15TH 2024).

[!TIP] To support the Trelis Research YouTube channel, you can sign up for an account with this affiliate link. Trelis is supported by a commission when you use one-click templates.

Fine-tuning Notebook Setup

  • CUDA 12.1 one-click template here.

Text Generation Inference (fastest):

vLLM (requires an A100 or H100 or A6000, i.e. ampere architecture):

llama.cpp One-click templates:

Function-calling One-Click Templates

One-click templates for function-calling are located on the HuggingFace model cards. Check out the collection here.

HuggingFace Speech-to-Speech

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for one-click-llms

Similar Open Source Tools

For similar tasks

For similar jobs