infinity

Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali

Stars: 1651

Visit

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. It is developed under the MIT License and powers inference behind Gradient.ai. The API allows users to deploy models from SentenceTransformers, offers fast inference backends utilizing various accelerators, dynamic batching for efficient processing, correct and tested implementation, and easy-to-use API built on FastAPI with Swagger documentation. Users can embed text, rerank documents, and perform text classification tasks using the tool. Infinity supports various models from Huggingface and provides flexibility in deployment via CLI, Docker, Python API, and cloud services like dstack. The tool is suitable for tasks like embedding, reranking, and text classification.

README:

Infinity ♾️

Infinity is a high-throughput, low-latency REST API for serving text-embeddings, reranking models, clip, clap and colpali. Infinity is developed under MIT License.

Why Infinity

Deploy any model from HuggingFace: deploy any embedding, reranking, clip and sentence-transformer model from HuggingFace
Fast inference backends: The inference server is built on top of PyTorch, optimum (ONNX/TensorRT) and CTranslate2, using FlashAttention to get the most out of your NVIDIA CUDA, AMD ROCM, CPU, AWS INF2 or APPLE MPS accelerator. Infinity uses dynamic batching and tokenization dedicated in worker threads.
Multi-modal and multi-model: Mix-and-match multiple models. Infinity orchestrates them.
Tested implementation: Unit and end-to-end tested. Embeddings via infinity are correctly embedded. Lets API users create embeddings till infinity and beyond.
Easy to use: Built on FastAPI. Infinity CLI v2 allows launching of all arguments via Environment variable or argument. OpenAPI aligned to OpenAI's API specs. View the docs at https://michaelfeil.github.io/infinity on how to get started.

Latest News 🔥

[2024/11] AMD, CPU, ONNX docker images
[2024/10] pip install infinity_client
[2024/07] Inference deployment example via Modal and a free GPU deployment
[2024/06] Support for multi-modal: clip, text-classification & launch all arguments from env variables
[2024/05] launch multiple models using the v2 cli, including --api-key
[2024/03] infinity supports experimental int8 (cpu/cuda) and fp8 (H100/MI300) support
[2024/03] Docs are online: https://michaelfeil.github.io/infinity/latest/
[2024/02] Community meetup at the Run:AI Infra Club
[2024/01] TensorRT / ONNX inference
[2023/10] Initial release

Getting started

Launch the cli via pip install

pip install infinity-emb[all]

After your pip install, with your venv active, you can run the CLI directly.

infinity_emb v2 --model-id BAAI/bge-small-en-v1.5

Check the v2 --help command to get a description for all parameters.

infinity_emb v2 --help

Launch the CLI using a pre-built docker container (recommended)

Instead of installing the CLI via pip, you may also use docker to run michaelf34/infinity. Make sure you mount your accelerator ( i.e. install nvidia-docker and activate with --gpus all).

port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data

docker run -it --gpus all \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest \
 v2 \
 --model-id $model1 \
 --model-id $model2 \
 --port $port

The cache path inside the docker container is set by the environment variable HF_HOME.

Specialized docker images

Docker container for CPU

Use the `latest-cpu` image or `x.x.x-cpu` for slimer image. Run like any other cpu-only docker image. Optimum/Onnx is often the prefered engine.

docker run -it \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest-cpu \
v2 \
--engine optimum \
--model-id $model1 \
--model-id $model2 \
--port $port

Docker Container for ROCm (MI200 Series and MI300 Series)

Use the `latest-rocm` image or `x.x.x-rocm` for rocm compatible inference. **This image is currently not build via CI/CD (to large), consider pinning to exact version.** Make sure you have ROCm is correctly installed and ready to use with Docker.

Visit Docs for more info.

Docker Container for Onnx-GPU, Cuda Extensions, TensorRT

Use the `latest-trt-onnx` image or `x.x.x-trt-onnx` for nvidia compatible inference. **This image is currently not build via CI/CD (to large), consider pinning to exact version.**

This image has support for:

ONNX-Cuda "CudaExecutionProvider"
ONNX-TensorRT "TensorRTExecutionProvider" (may not always work due to version mismatch with ORT)
CudaExtensions and packages, e.g. Tri-Dao's pip install flash-attn package when using Pytorch.
nvcc compiler support

docker run -it \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest-trt-onnx \
v2 \
--engine optimum \
--device cuda \
--model-id $model1 \
--port $port

Advanced CLI usage

Launching multiple models at once

Since infinity_emb>=0.0.34, you can use cli v2 method to launch multiple models at the same time. Checkout infinity_emb v2 --help for all args and validation.

Multiple Model CLI Playbook:

1. cli options can be repeated e.g. v2 --model-id model/id1 --model-id model/id2 --batch-size 8 --batch-size 4. This will create two models model/id1 and model/id2
1. or adapt the defaults by setting ENV Variables separated by ;: INFINITY_MODEL_ID="model/id1;model/id2;" && INFINITY_BATCH_SIZE="8;4;"
1. single items are broadcasted to --model-id length, v2 --model-id model/id1 --model-id/id2 --batch-size 8 making both models have batch-size 8.
1. Everything is broadcasted to the number of --model-id + API requests are routed to the --served-model-name/--model-id

Using environment variables instead of the cli

All CLI arguments are also launchable via environment variables.

Environment variables start with INFINITY_{UPPER_CASE_SNAKE_CASE} and often match the --{lower-case-kebab-case} cli arguments.

The following two are equivalent:

CLI infinity_emb v2 --model-id BAAI/bge-base-en-v1.5
ENV-CLI: export INFINITY_MODEL_ID="BAAI/bge-base-en-v1.5" && infinity_emb v2

Multiple arguments can be used via ; syntax: INFINITY_MODEL_ID="model/id1;model/id2;"

API Key

Supply an `--api-key secret123` via CLI or ENV INFINITY_API_KEY="secret123".

Chosing the fastest engine

With the command --engine torch the model must be compatible with https://github.com/UKPLab/sentence-transformers/ and AutoModel

With the command --engine optimum, there must be an onnx file. Models from https://huggingface.co/Xenova are recommended.

With the command --engine ctranslate2 - only BERT models are supported.

Telemetry opt-out

See which telemetry is collected: https://michaelfeil.eu/infinity/main/telemetry/

# Disable
export INFINITY_ANONYMOUS_USAGE_STATS="0"

Supported Tasks and Models by Infinity

Infinity aims to be the inference server supporting most functionality for embeddings, reranking and related RAG tasks. The following Infinity tests 15+ architectures and all of the below cases in the Github CI. Click on the sections below to find tasks and validated example models.

Text Embeddings

Text embeddings measure the relatedness of text strings. Embeddings are used for search, clustering, recommendations. Think about a private deployed version of openai's text embeddings. https://platform.openai.com/docs/guides/embeddings

Tested embedding models:

mixedbread-ai/mxbai-embed-large-v1
WhereIsAI/UAE-Large-V1
BAAI/bge-base-en-v1.5
Alibaba-NLP/gte-large-en-v1.5
jinaai/jina-embeddings-v2-base-code
sentence-transformers/all-MiniLM-L6-v2
intfloat/multilingual-e5-large-instruct
intfloat/multilingual-e5-small
jinaai/jina-embeddings-v3
BAAI/bge-m3, no sparse
decoder-based models. Keep in mind that they are ~20-100x larger (&slower) than bert-small models:

Other models:

Most embedding model are likely supported: https://huggingface.co/models?pipeline_tag=feature-extraction&other=text-embeddings-inference&sort=trending
Check MTEB leaderboard for models https://huggingface.co/spaces/mteb/leaderboard.

Reranking

Given a query and a list of documents, Reranking indexes the documents from most to least semantically relevant to the query. Think like a locally deployed version of https://docs.cohere.com/reference/rerank

Tested reranking models:

Other reranking models:

Reranking Models supported by infinity are bert-style classification Models with one category.
Most reranking model are likely supported: https://huggingface.co/models?pipeline_tag=text-classification&other=text-embeddings-inference&sort=trending
https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=rerank

Multi-modal and cross-modal - image and audio embeddings

Specialized embedding models that allow for image<->text or image<->audio search. Typically, these models allow for text<->text, text<->other and other<->other search, with accuracy tradeoffs when going cross-modal.

Image<->text models can be used for e.g. photo-gallery search, where users can type in keywords to find photos, or use a photo to find related images. Audio<->text models are less popular, and can be e.g. used to find music songs based on a text description or related music songs.

Tested image<->text models:

wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M
jinaai/jina-clip-v1
google/siglip-so400m-patch14-384
Models of type: ClipModel / SiglipModel in config.json

Tested audio<->text models:

Clap Models from LAION
limited number open source organizations training these models
- Note: The sampling rate of the audio data needs to match the model *

Not supported:

Plain vision models e.g. nomic-ai/nomic-embed-vision-v1.5

ColBert-style late-interaction Embeddings

ColBert Embeddings don't perform any special Pooling methods, but return the raw **token embeddings**. The **token embeddings** are then to be scored with the MaxSim Metric in a VectorDB (Qdrant / Vespa)

For usage via the RestAPI, late-interaction embeddings may best be transported via base64 encoding. Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing

Tested colbert models:

ColPali-style late-interaction Image<->Text Embeddings

Similar usage to ColBert, but scanning over an image<->text instead of only text.

Tested ColPali/ColQwen models:

vidore/colpali-v1.2-merged
michaelfeil/colqwen2-v0.1
No lora adapters supported, only "merged" models.

Text classification

A bert-style multi-label text classification. Classifies it into distinct categories.

Tested models:

ProsusAI/finbert, financial news classification
SamLowe/roberta-base-go_emotions, text to emotion categories.
bert-style text-classifcation models with more than >1 label in config.json

Infinity usage via the Python API

Instead of the cli & RestAPI use infinity's interface via the Python API. This gives you most flexibility. The Python API builds on asyncio with its await/async features, to allow concurrent processing of requests. Arguments of the CLI are also available via Python.

Embeddings

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["Embed this is sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
  EngineArgs(model_name_or_path = "BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine): 
    async with engine: 
        embeddings, usage = await engine.embed(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    await engine.astop()
asyncio.run(embed_text(array[0]))

Reranking

Reranking gives you a score for similarity between a query and multiple documents. Use it in conjunction with a VectorDB+Embeddings, or as standalone for small amount of documents. Please select a model from huggingface that is a AutoModelForSequenceClassification compatible model with one class classification.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "What is the python package infinity_emb?"
docs = ["This is a document not related to the python package infinity_emb, hence...", 
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!"]
array = AsyncEmbeddingEngine.from_args(
  [EngineArgs(model_name_or_path = "mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine): 
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # or handle the async start / stop yourself.
    await engine.astart()
    ranking, usage = await engine.rerank(query=query, docs=docs)
    await engine.astop()

asyncio.run(rerank(array[0]))

When using the CLI, use this command to launch rerankers:

infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1

Image-Embeddings: CLIP models

CLIP models are able to encode images and text at the same time.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M", 
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine): 
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()

asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))

Audio-Embeddings: CLAP models

CLAP models are able to encode audio and text at the same time.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import requests
import soundfile as sf
import io

sentences = ["This is awesome.", "I am bored."]

url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content

audios = [raw_bytes]
engine_args = EngineArgs(
    model_name_or_path = "laion/clap-htsat-unfused",
    dtype="float32", 
    engine="torch"

)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine): 
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embedding_audios = await engine.audio_embed(audios=audios)
    await engine.astop()

asyncio.run(embed(array["laion/clap-htsat-unfused"]))

Text Classification

Use text classification with Infinity's classify feature, which allows for sentiment analysis, emotion detection, and more classification tasks.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path = "SamLowe/roberta-base-go_emotions", 
    engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])

async def classifier(engine: AsyncEmbeddingEngine): 
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    predictions, usage = await engine.classify(sentences=sentences)
    await engine.astop()
asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))

Infinity usage via the Python Client

Infinity has a generated client code for RestAPI client side usage.

If you want to call a remote infinity instance via RestAPI, install the following package locally:

pip install infinity_client

For more information, check out the Client Readme https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client

Integrations:

Documentation

View the docs at https:///michaelfeil.github.io/infinity on how to get started. After startup, the Swagger Ui will be available under {url}:{port}/docs, in this case http://localhost:7997/docs. You can also find a interactive preview here: https://infinity.modal.michaelfeil.eu/docs (and https://michaelfeil-infinity.hf.space/docs)

Contribute and Develop

Install via Poetry 1.8.1, Python3.11 on Ubuntu 22.04

cd libs/infinity_emb
poetry install --extras all --with lint,test

To pass the CI:

cd libs/infinity_emb
make precommit

All contributions must be made in a way to be compatible with the MIT License of this repo.

Citation

@software{feil_2023_11630143,
  author       = {Feil, Michael},
  title        = {Infinity - To Embeddings and Beyond},
  month        = oct,
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.11630143},
  url          = {https://doi.org/10.5281/zenodo.11630143}
}

💚 Current contributors

For Tasks:

Click tags to check more tools for each tasks

embed text rerank documents perform text classification deploy models utilize accelerators

For Jobs:

data scientist machine learning engineer ai researcher nlp engineer data analyst

Alternative AI tools for infinity

Similar Open Source Tools

infinity

github

: 1.7k

BrowserGym

BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.

github

: 456

rig

Rig is a Rust library designed for building scalable, modular, and user-friendly applications powered by large language models (LLMs). It provides full support for LLM completion and embedding workflows, offers simple yet powerful abstractions for LLM providers like OpenAI and Cohere, as well as vector stores such as MongoDB and in-memory storage. With Rig, users can easily integrate LLMs into their applications with minimal boilerplate code.

github

: 3.4k

luna-ai

Luna AI is a virtual streamer driven by a 'brain' composed of ChatterBot, GPT, Claude, langchain, chatglm, text-generation-webui, 讯飞星火, 智谱AI. It can interact with viewers in real-time during live streams on platforms like Bilibili, Douyin, Kuaishou, Douyu, or chat with you locally. Luna AI uses natural language processing and text-to-speech technologies like Edge-TTS, VITS-Fast, elevenlabs, bark-gui, VALL-E-X to generate responses to viewer questions and can change voice using so-vits-svc, DDSP-SVC. It can also collaborate with Stable Diffusion for drawing displays and loop custom texts. This project is completely free, and any identical copycat selling programs are pirated, please stop them promptly.

github

: 1.2k

Apollo

Apollo is a multilingual medical LLM that covers English, Chinese, French, Hindi, Spanish, Hindi, and Arabic. It is designed to democratize medical AI to 6B people. Apollo has achieved state-of-the-art results on a variety of medical NLP tasks, including question answering, medical dialogue generation, and medical text classification. Apollo is easy to use and can be integrated into a variety of applications, making it a valuable tool for healthcare professionals and researchers.

github

: 155

OutofFocus

Out of Focus v1.0 is a flexible tool in Gradio for image manipulation through prompt manipulation by reconstruction via diffusion inversion process. Users can modify images using this tool, which is the first version of the Image modification tool by Out of AI.

github

: 173

evalscope

Eval-Scope is a framework designed to support the evaluation of large language models (LLMs) by providing pre-configured benchmark datasets, common evaluation metrics, model integration, automatic evaluation for objective questions, complex task evaluation using expert models, reports generation, visualization tools, and model inference performance evaluation. It is lightweight, easy to customize, supports new dataset integration, model hosting on ModelScope, deployment of locally hosted models, and rich evaluation metrics. Eval-Scope also supports various evaluation modes like single mode, pairwise-baseline mode, and pairwise (all) mode, making it suitable for assessing and improving LLMs.

github

: 692

autogen

AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.

github

: 42.5k

ST-LLM

ST-LLM is a temporal-sensitive video large language model that incorporates joint spatial-temporal modeling, dynamic masking strategy, and global-local input module for effective video understanding. It has achieved state-of-the-art results on various video benchmarks. The repository provides code and weights for the model, along with demo scripts for easy usage. Users can train, validate, and use the model for tasks like video description, action identification, and reasoning.

github

: 99

TempCompass

TempCompass is a benchmark designed to evaluate the temporal perception ability of Video LLMs. It encompasses a diverse set of temporal aspects and task formats to comprehensively assess the capability of Video LLMs in understanding videos. The benchmark includes conflicting videos to prevent models from relying on single-frame bias and language priors. Users can clone the repository, install required packages, prepare data, run inference using examples like Video-LLaVA and Gemini, and evaluate the performance of their models across different tasks such as Multi-Choice QA, Yes/No QA, Caption Matching, and Caption Generation.

github

: 71

ScaleLLM

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama3, Gemma, Bloom, GPT-NeoX, and more. ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our **_Roadmap_** for more details. ## Key Features * High Efficiency: Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like Flash Attention, Paged Attention, Continuous batching, and more. * Tensor Parallelism: Utilizes tensor parallelism for efficient model execution. * OpenAI-compatible API: An efficient golang rest api server that compatible with OpenAI. * Huggingface models: Seamless integration with most popular HF models, supporting safetensors. * Customizable: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models. * Production Ready: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.

github

: 418

libllm

libLLM is an open-source project designed for efficient inference of large language models (LLM) on personal computers and mobile devices. It is optimized to run smoothly on common devices, written in C++14 without external dependencies, and supports CUDA for accelerated inference. Users can build the tool for CPU only or with CUDA support, and run libLLM from the command line. Additionally, there are API examples available for Python and the tool can export Huggingface models.

github

: 137

Open-Sora-Plan

Open-Sora-Plan is a project that aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). The project is still in its early stages, but the team is working hard to improve it and make it more accessible to the open-source community. The project is currently focused on training an unconditional model on a landscape dataset, but the team plans to expand the scope of the project in the future to include text2video experiments, training on video2text datasets, and controlling the model with more conditions.

github

: 11.8k

LocalAI

LocalAI is a free and open-source OpenAI alternative that acts as a drop-in replacement REST API compatible with OpenAI (Elevenlabs, Anthropic, etc.) API specifications for local AI inferencing. It allows users to run LLMs, generate images, audio, and more locally or on-premises with consumer-grade hardware, supporting multiple model families and not requiring a GPU. LocalAI offers features such as text generation with GPTs, text-to-audio, audio-to-text transcription, image generation with stable diffusion, OpenAI functions, embeddings generation for vector databases, constrained grammars, downloading models directly from Huggingface, and a Vision API. It provides a detailed step-by-step introduction in its Getting Started guide and supports community integrations such as custom containers, WebUIs, model galleries, and various bots for Discord, Slack, and Telegram. LocalAI also offers resources like an LLM fine-tuning guide, instructions for local building and Kubernetes installation, projects integrating LocalAI, and a how-tos section curated by the community. It encourages users to cite the repository when utilizing it in downstream projects and acknowledges the contributions of various software from the community.

github

: 31.5k

beta9

Beta9 is an open-source platform for running scalable serverless GPU workloads across cloud providers. It allows users to scale out workloads to thousands of GPU or CPU containers, achieve ultrafast cold-start for custom ML models, automatically scale to zero to pay for only what is used, utilize flexible distributed storage, distribute workloads across multiple cloud providers, and easily deploy task queues and functions using simple Python abstractions. The platform is designed for launching remote serverless containers quickly, featuring a custom, lazy loading image format backed by S3/FUSE, a fast redis-based container scheduling engine, content-addressed storage for caching images and files, and a custom runc container runtime.

github

: 744

VideoRefer

VideoRefer Suite is a tool designed to enhance the fine-grained spatial-temporal understanding capabilities of Video Large Language Models (Video LLMs). It consists of three primary components: Model (VideoRefer) for perceiving, reasoning, and retrieval for user-defined regions at any specified timestamps, Dataset (VideoRefer-700K) for high-quality object-level video instruction data, and Benchmark (VideoRefer-Bench) to evaluate object-level video understanding capabilities. The tool can understand any object within a video.

github

: 157

For similar tasks

ai-on-gke

This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE). Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers: Infrastructure orchestration that support GPUs and TPUs for training and serving workloads at scale Flexible integration with distributed computing and data processing frameworks Support for multiple teams on the same infrastructure to maximize utilization of resources

github

: 280

ray

Ray is a unified framework for scaling AI and Python applications. It consists of a core distributed runtime and a set of AI libraries for simplifying ML compute, including Data, Train, Tune, RLlib, and Serve. Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing ecosystem of community integrations. With Ray, you can seamlessly scale the same code from a laptop to a cluster, making it easy to meet the compute-intensive demands of modern ML workloads.

github

: 36.4k

labelbox-python

Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

github

: 135

djl

Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. It is designed to be easy to get started with and simple to use for Java developers. DJL provides a native Java development experience and allows users to integrate machine learning and deep learning models with their Java applications. The framework is deep learning engine agnostic, enabling users to switch engines at any point for optimal performance. DJL's ergonomic API interface guides users with best practices to accomplish deep learning tasks, such as running inference and training neural networks.

github

: 4.1k

mlflow

MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc), wherever you currently run ML code (e.g. in notebooks, standalone applications or the cloud). MLflow's current components are: * `MLflow Tracking `_: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI. * `MLflow Projects `_: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others. * `MLflow Models `_: A model packaging format and tools that let you easily deploy the same model (from any ML library) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML and AWS SageMaker. * `MLflow Model Registry `_: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.

github

: 19.9k

tt-metal

TT-NN is a python & C++ Neural Network OP library. It provides a low-level programming model, TT-Metalium, enabling kernel development for Tenstorrent hardware.

github

: 786

burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.

github

: 10.2k

awsome-distributed-training

This repository contains reference architectures and test cases for distributed model training with Amazon SageMaker Hyperpod, AWS ParallelCluster, AWS Batch, and Amazon EKS. The test cases cover different types and sizes of models as well as different frameworks and parallel optimizations (Pytorch DDP/FSDP, MegatronLM, NemoMegatron...).

github

: 230

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675