infinity
Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali
Stars: 1651
Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. It is developed under the MIT License and powers inference behind Gradient.ai. The API allows users to deploy models from SentenceTransformers, offers fast inference backends utilizing various accelerators, dynamic batching for efficient processing, correct and tested implementation, and easy-to-use API built on FastAPI with Swagger documentation. Users can embed text, rerank documents, and perform text classification tasks using the tool. Infinity supports various models from Huggingface and provides flexibility in deployment via CLI, Docker, Python API, and cloud services like dstack. The tool is suitable for tasks like embedding, reranking, and text classification.
README:
Infinity is a high-throughput, low-latency REST API for serving text-embeddings, reranking models, clip, clap and colpali. Infinity is developed under MIT License.
- Deploy any model from HuggingFace: deploy any embedding, reranking, clip and sentence-transformer model from HuggingFace
- Fast inference backends: The inference server is built on top of PyTorch, optimum (ONNX/TensorRT) and CTranslate2, using FlashAttention to get the most out of your NVIDIA CUDA, AMD ROCm, CPU, AWS INF2 or Apple MPS accelerator. Infinity uses dynamic batching and runs tokenization in dedicated worker threads.
- Multi-modal and multi-model: Mix-and-match multiple models. Infinity orchestrates them.
- Tested implementation: Unit and end-to-end tested. Embeddings via infinity are correctly embedded. Lets API users create embeddings till infinity and beyond.
- Easy to use: Built on FastAPI. Infinity CLI v2 allows launching of all arguments via Environment variable or argument. OpenAPI aligned to OpenAI's API specs. View the docs at https://michaelfeil.github.io/infinity on how to get started.
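Dynamic batching groups requests that arrive close together into a single model call. The sketch below illustrates the general idea with asyncio only; it is not Infinity's implementation, and `DynamicBatcher`, `submit`, `run` and `demo` are hypothetical names chosen for this example.

```python
import asyncio


class DynamicBatcher:
    """Collects queued items and processes them together in one batch call."""

    def __init__(self, batch_fn, max_batch_size=8):
        self._batch_fn = batch_fn  # takes a list of items, returns a list of results
        self._max_batch_size = max_batch_size
        self._queue = asyncio.Queue()

    async def submit(self, item):
        # Each caller gets a future that the worker resolves later.
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self):
        while True:
            # Wait for the first item, then drain whatever else is already queued.
            batch = [await self._queue.get()]
            while len(batch) < self._max_batch_size and not self._queue.empty():
                batch.append(self._queue.get_nowait())
            try:
                results = self._batch_fn([item for item, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo():
    # Stand-in "model": returns the length of each input string.
    batcher = DynamicBatcher(lambda texts: [len(t) for t in texts])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(t) for t in ["a", "bb", "ccc"]))
    worker.cancel()
    return results


print(asyncio.run(demo()))  # [1, 2, 3]
```

Callers only await `submit`; the batch size adapts to how many requests are waiting when the worker wakes up.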
- [2024/11] AMD, CPU, ONNX docker images
- [2024/10] pip install infinity_client
- [2024/07] Inference deployment example via Modal and a free GPU deployment
- [2024/06] Support for multi-modal: clip, text-classification & launch all arguments from env variables
- [2024/05] launch multiple models using the v2 cli, including --api-key
- [2024/03] infinity has experimental support for int8 (cpu/cuda) and fp8 (H100/MI300)
- [2024/03] Docs are online: https://michaelfeil.github.io/infinity/latest/
- [2024/02] Community meetup at the Run:AI Infra Club
- [2024/01] TensorRT / ONNX inference
- [2023/10] Initial release
pip install infinity-emb[all]
After your pip install, with your venv active, you can run the CLI directly.
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
Check the v2 --help command to get a description for all parameters.
infinity_emb v2 --help
Instead of installing the CLI via pip, you may also use Docker to run michaelf34/infinity.
Make sure you mount your accelerator (i.e. install nvidia-docker and activate with --gpus all).
port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data
docker run -it --gpus all \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest \
v2 \
--model-id $model1 \
--model-id $model2 \
--port $port
The cache path inside the docker container is set by the environment variable HF_HOME.
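Once a container is running, embeddings can be requested over HTTP. The following is a minimal sketch using only the standard library. It assumes the server listens on http://localhost:7997 and exposes an OpenAI-style `/embeddings` route that accepts `model` and `input` fields and returns `{"data": [{"embedding": [...]}]}`; check the Swagger UI of your deployment for the exact schema. The helper names are hypothetical.

```python
import json
import urllib.request


def build_embedding_request(base_url, model, texts):
    """Build (but do not send) an OpenAI-style embeddings request."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url, model, texts):
    """Send the request and return one vector per input text.

    Assumes an OpenAI-style response body (see the lead-in).
    """
    request = build_embedding_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]


# Example call against a running server:
# vectors = embed("http://localhost:7997", "michaelfeil/bge-small-en-v1.5", ["Paris is in France."])
```

If the server was started with an API key, an `Authorization` header would presumably be needed as well; that detail is not covered here.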
Docker container for CPU
Use the `latest-cpu` image or `x.x.x-cpu` for a slimmer image. Run it like any other CPU-only docker image. Optimum/ONNX is often the preferred engine.
docker run -it \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest-cpu \
v2 \
--engine optimum \
--model-id $model1 \
--model-id $model2 \
--port $port
Docker Container for ROCm (MI200 Series and MI300 Series)
Use the `latest-rocm` image or `x.x.x-rocm` for ROCm-compatible inference. **This image is currently not built via CI/CD (too large); consider pinning to an exact version.** Make sure ROCm is correctly installed and ready to use with Docker. Visit the docs for more info.
Docker Container for Onnx-GPU, Cuda Extensions, TensorRT
Use the `latest-trt-onnx` image or `x.x.x-trt-onnx` for NVIDIA-compatible inference. **This image is currently not built via CI/CD (too large); consider pinning to an exact version.**
This image has support for:
- ONNX-CUDA "CUDAExecutionProvider"
- ONNX-TensorRT "TensorrtExecutionProvider" (may not always work due to version mismatch with ORT)
- CudaExtensions and packages, e.g. Tri-Dao's pip install flash-attn package when using PyTorch.
- nvcc compiler support
docker run -it \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest-trt-onnx \
v2 \
--engine optimum \
--device cuda \
--model-id $model1 \
--port $port
Launching multiple models at once
Since infinity_emb>=0.0.34, you can use the cli v2 method to launch multiple models at the same time.
Check out infinity_emb v2 --help for all args and validation.
Multiple Model CLI Playbook:
- CLI options can be repeated, e.g. v2 --model-id model/id1 --model-id model/id2 --batch-size 8 --batch-size 4. This will create two models, model/id1 and model/id2.
- Or adapt the defaults by setting ENV variables separated by ;: INFINITY_MODEL_ID="model/id1;model/id2;" && INFINITY_BATCH_SIZE="8;4;"
- Single items are broadcast to the --model-id length: v2 --model-id model/id1 --model-id model/id2 --batch-size 8 gives both models a batch size of 8.
- Everything is broadcast to the number of --model-id entries. API requests are routed to the --served-model-name/--model-id.
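The broadcasting rule for repeated options can be sketched in a few lines of Python. This is an illustration of the behaviour described in the playbook, not Infinity's own parsing code; `split_values` and `broadcast` are hypothetical helpers.

```python
def split_values(raw):
    """Split a ';'-separated value, ignoring the trailing empty item."""
    return [item for item in raw.split(";") if item]


def broadcast(values, n_models):
    """Repeat a single value for every model; otherwise require one value per model."""
    if len(values) == 1:
        return values * n_models
    if len(values) != n_models:
        raise ValueError(f"expected 1 or {n_models} values, got {len(values)}")
    return values


model_ids = split_values("model/id1;model/id2;")
batch_sizes = broadcast(split_values("8"), len(model_ids))
print(list(zip(model_ids, batch_sizes)))  # [('model/id1', '8'), ('model/id2', '8')]
```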
Using environment variables instead of the cli
All CLI arguments are also launchable via environment variables. Environment variables start with INFINITY_{UPPER_CASE_SNAKE_CASE} and often match the --{lower-case-kebab-case} cli arguments.
The following two are equivalent:
- CLI
infinity_emb v2 --model-id BAAI/bge-base-en-v1.5
- ENV-CLI:
export INFINITY_MODEL_ID="BAAI/bge-base-en-v1.5" && infinity_emb v2
Multiple arguments can be used via ; syntax: INFINITY_MODEL_ID="model/id1;model/id2;"
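The naming convention can be expressed as a small helper. This sketch assumes the simple kebab-case to upper-snake-case rule holds for the flag in question; since the names only "often" match, verify against infinity_emb v2 --help. Both function names are hypothetical.

```python
def flag_to_env(flag):
    """Map a CLI flag such as '--model-id' to 'INFINITY_MODEL_ID' (illustrative rule)."""
    return "INFINITY_" + flag.lstrip("-").replace("-", "_").upper()


def env_from_flags(flags):
    """Join repeated values with ';' (including a trailing ';') as in the multi-model syntax."""
    return {flag_to_env(flag): ";".join(values) + ";" for flag, values in flags.items()}


print(flag_to_env("--model-id"))  # INFINITY_MODEL_ID
print(env_from_flags({"--model-id": ["model/id1", "model/id2"], "--batch-size": ["8", "4"]}))
```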
API Key
Supply an `--api-key secret123` via CLI or ENV INFINITY_API_KEY="secret123".
Choosing the fastest engine
With the command --engine torch, the model must be compatible with https://github.com/UKPLab/sentence-transformers/ and AutoModel.
With the command --engine optimum, there must be an ONNX file. Models from https://huggingface.co/Xenova are recommended.
With the command --engine ctranslate2, only BERT models are supported.
Telemetry opt-out
See which telemetry is collected: https://michaelfeil.eu/infinity/main/telemetry/
# Disable
export INFINITY_ANONYMOUS_USAGE_STATS="0"
Infinity aims to be the inference server supporting the most functionality for embeddings, reranking and related RAG tasks. Infinity tests 15+ architectures and all of the cases below in the GitHub CI. Click on the sections below to find tasks and validated example models.
Text Embeddings
Text embeddings measure the relatedness of text strings. Embeddings are used for search, clustering and recommendations. Think of it as a privately deployed version of OpenAI's text embeddings. https://platform.openai.com/docs/guides/embeddings
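Relatedness between two embeddings is commonly computed as cosine similarity. A minimal standard-library sketch follows; the vectors are toy values, not real model output, and the choice of cosine similarity is an assumption, since the right metric depends on the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```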
Tested embedding models:
- mixedbread-ai/mxbai-embed-large-v1
- WhereIsAI/UAE-Large-V1
- BAAI/bge-base-en-v1.5
- Alibaba-NLP/gte-large-en-v1.5
- jinaai/jina-embeddings-v2-base-code
- sentence-transformers/all-MiniLM-L6-v2
- intfloat/multilingual-e5-large-instruct
- intfloat/multilingual-e5-small
- jinaai/jina-embeddings-v3
- BAAI/bge-m3, no sparse
- decoder-based models. Keep in mind that they are ~20-100x larger (&slower) than bert-small models:
Other models:
- Most embedding models are likely supported: https://huggingface.co/models?pipeline_tag=feature-extraction&other=text-embeddings-inference&sort=trending
- Check MTEB leaderboard for models https://huggingface.co/spaces/mteb/leaderboard.
Reranking
Given a query and a list of documents, reranking orders the documents from most to least semantically relevant to the query. Think of it as a locally deployed version of https://docs.cohere.com/reference/rerank
Tested reranking models:
- mixedbread-ai/mxbai-rerank-xsmall-v1
- Alibaba-NLP/gte-multilingual-reranker-base
- BAAI/bge-reranker-base
- BAAI/bge-reranker-large
- BAAI/bge-reranker-v2-m3
- jinaai/jina-reranker-v1-turbo-en
Other reranking models:
- Reranking models supported by infinity are bert-style classification models with one category.
- Most reranking models are likely supported: https://huggingface.co/models?pipeline_tag=text-classification&other=text-embeddings-inference&sort=trending
- https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=rerank
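Turning per-document relevance scores into a ranked list is a plain sort. The sketch below assumes you already have one score per document, in the same order as the documents; `rank_documents` is a hypothetical helper and the scores are made up.

```python
def rank_documents(docs, scores, top_n=None):
    """Return (original_index, document, score) tuples, most relevant first."""
    if len(docs) != len(scores):
        raise ValueError("need exactly one score per document")
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    ranked = [(i, docs[i], scores[i]) for i in order]
    return ranked if top_n is None else ranked[:top_n]


docs = ["unrelated text", "Paris is in France!", "infinity_emb is a Python package"]
for index, doc, score in rank_documents(docs, [0.1, 0.5, 0.9], top_n=2):
    print(index, round(score, 2), doc)
```

Keeping the original index makes it easy to map results back to the candidate list returned by a vector search.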
Multi-modal and cross-modal - image and audio embeddings
Specialized embedding models that allow for image<->text or image<->audio search. Typically, these models allow for text<->text, text<->other and other<->other search, with accuracy tradeoffs when going cross-modal.
Image<->text models can be used for e.g. photo-gallery search, where users can type in keywords to find photos, or use a photo to find related images. Audio<->text models are less popular, and can be used e.g. to find songs based on a text description or to find related songs.
Tested image<->text models:
- wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M
- jinaai/jina-clip-v1
- google/siglip-so400m-patch14-384
- Models of type: ClipModel / SiglipModel in config.json
Tested audio<->text models:
- Clap Models from LAION
- A limited number of open-source organizations train these models
- Note: The sampling rate of the audio data needs to match the model
Not supported:
- Plain vision models e.g. nomic-ai/nomic-embed-vision-v1.5
ColBert-style late-interaction Embeddings
ColBert Embeddings don't perform any special Pooling methods, but return the raw **token embeddings**. The **token embeddings** are then scored with the MaxSim metric in a VectorDB (Qdrant / Vespa).
For usage via the RestAPI, late-interaction embeddings may best be transported via base64 encoding.
Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing
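For intuition, MaxSim scoring takes, for every query token, the highest similarity to any document token, and sums those maxima. A minimal pure-Python sketch with toy vectors follows; it assumes dot-product similarity (equivalent to cosine similarity when the token embeddings are L2-normalized), and a vector database would do this far more efficiently.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]  # two query token embeddings
doc = [[1.0, 0.0], [0.5, 0.5]]    # two document token embeddings
print(maxsim(query, doc))  # 1.5
```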
Tested colbert models:
ColPali-style late-interaction Image<->Text Embeddings
Similar usage to ColBert, but operating on image<->text input instead of text only.
For usage via the RestAPI, late-interaction embeddings may best be transported via base64 encoding.
Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing
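Base64 transport of a token-embedding matrix can be sketched with the standard library. The layout used here (row-major, little-endian float32) is an assumption made for illustration; confirm the dtype and byte order your server actually emits before decoding real responses. Both helper names are hypothetical.

```python
import base64
import struct


def encode_embeddings(token_embeddings):
    """Pack a tokens x dim matrix as little-endian float32 and base64-encode it."""
    flat = [value for row in token_embeddings for value in row]
    return base64.b64encode(struct.pack(f"<{len(flat)}f", *flat)).decode("ascii")


def decode_embeddings(encoded, dim):
    """Inverse of encode_embeddings: returns a list of rows of length dim."""
    raw = base64.b64decode(encoded)
    flat = struct.unpack(f"<{len(raw) // 4}f", raw)
    if len(flat) % dim != 0:
        raise ValueError("payload length is not a multiple of the embedding dimension")
    return [list(flat[i:i + dim]) for i in range(0, len(flat), dim)]


matrix = [[0.5, -1.0], [2.0, 0.25]]
assert decode_embeddings(encode_embeddings(matrix), dim=2) == matrix
```

Compared with JSON lists of floats, a packed payload is considerably smaller, which matters when every token of every document carries its own vector.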
Tested ColPali/ColQwen models:
- vidore/colpali-v1.2-merged
- michaelfeil/colqwen2-v0.1
- No lora adapters supported, only "merged" models.
Text classification
BERT-style multi-label text classification: classifies input text into distinct categories.
Tested models:
- ProsusAI/finbert, financial news classification
- SamLowe/roberta-base-go_emotions, text to emotion categories.
- bert-style text-classification models with more than one label in config.json
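With a multi-label classifier, each input gets a score per label, and the caller decides which labels to keep. The sketch below assumes each prediction is a list of `{"label": ..., "score": ...}` dicts; that shape, the threshold and the sample values are illustrative assumptions, so inspect the output of your model first.

```python
def top_labels(predictions, threshold=0.5):
    """For each input, return the labels scoring at or above the threshold, best first."""
    results = []
    for label_scores in predictions:
        ordered = sorted(label_scores, key=lambda p: p["score"], reverse=True)
        results.append([p["label"] for p in ordered if p["score"] >= threshold])
    return results


# Made-up scores for two inputs
sample = [
    [{"label": "joy", "score": 0.9}, {"label": "anger", "score": 0.05}],
    [{"label": "boredom", "score": 0.4}, {"label": "neutral", "score": 0.6}],
]
print(top_labels(sample))  # [['joy'], ['neutral']]
```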
Instead of the cli & RestAPI, use infinity's interface via the Python API.
This gives you the most flexibility. The Python API builds on asyncio with its await/async features, to allow concurrent processing of requests. Arguments of the CLI are also available via Python.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine):
    # the context manager starts and stops the engine
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    await engine.astop()

asyncio.run(embed_text(array[0]))
Reranking gives you a score for similarity between a query and multiple documents. Use it in conjunction with a VectorDB+Embeddings, or standalone for a small number of documents. Please select a model from Hugging Face that is compatible with AutoModelForSequenceClassification and has a single output class.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

query = "What is the python package infinity_emb?"
docs = [
    "This is a document not related to the python package infinity_emb, hence...",
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!",
]
# AsyncEngineArray takes a list of EngineArgs and supports indexing
array = AsyncEngineArray.from_args(
    [EngineArgs(model_name_or_path="mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine):
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # or handle the async start / stop yourself.
    await engine.astart()
    ranking, usage = await engine.rerank(query=query, docs=docs)
    await engine.astop()

asyncio.run(rerank(array[0]))
When using the CLI, use this command to launch rerankers:
infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1
CLIP models are able to encode images and text at the same time.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()

asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
CLAP models are able to encode audio and text at the same time.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import requests
import soundfile as sf
import io

sentences = ["This is awesome.", "I am bored."]
url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
# download the audio file as raw bytes
raw_bytes = requests.get(url, stream=True).content
audios = [raw_bytes]
engine_args = EngineArgs(
    model_name_or_path="laion/clap-htsat-unfused",
    dtype="float32",
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_audio, _ = await engine.audio_embed(audios=audios)
    await engine.astop()

asyncio.run(embed(array["laion/clap-htsat-unfused"]))
Use text classification with Infinity's classify feature, which allows for sentiment analysis, emotion detection, and more classification tasks.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path="SamLowe/roberta-base-go_emotions",
    engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])

async def classifier(engine: AsyncEmbeddingEngine):
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    predictions, usage = await engine.classify(sentences=sentences)
    await engine.astop()

asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))
Infinity provides generated client code for client-side RestAPI usage.
If you want to call a remote infinity instance via RestAPI, install the following package locally:
pip install infinity_client
For more information, check out the Client Readme https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client
- Serverless deployments at Runpod
- Truefoundry Cognita
- Langchain example
- imitater - A unified language model server built upon vllm and infinity.
- Dwarves Foundation: Deployment examples using Modal.com
- infiniflow/Ragflow
- SAP Core AI
- gpt_server - gpt_server is an open-source framework designed for production-level deployment of LLMs (Large Language Models) or Embeddings.
- KubeAI: Kubernetes AI Operator for inferencing
- LangChain
- Batched, modification of the batching algorithm in Infinity
View the docs at https://michaelfeil.github.io/infinity on how to get started.
After startup, the Swagger UI will be available under {url}:{port}/docs, in this case http://localhost:7997/docs. You can also find an interactive preview here: https://infinity.modal.michaelfeil.eu/docs (and https://michaelfeil-infinity.hf.space/docs)
Install via Poetry 1.8.1, Python3.11 on Ubuntu 22.04
cd libs/infinity_emb
poetry install --extras all --with lint,test
To pass the CI:
cd libs/infinity_emb
make precommit
All contributions must be made in a way to be compatible with the MIT License of this repo.
@software{feil_2023_11630143,
author = {Feil, Michael},
title = {Infinity - To Embeddings and Beyond},
month = oct,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.11630143},
url = {https://doi.org/10.5281/zenodo.11630143}
}