jina
☁️ Build multimodal AI applications with cloud-native stack
Stars: 20966
Jina is a tool that allows users to build multimodal AI services and pipelines using cloud-native technologies. It provides a Pythonic experience for serving ML models and transitioning from local deployment to advanced orchestration frameworks like Docker-Compose, Kubernetes, or Jina AI Cloud. Users can build and serve models for any data type and deep learning framework, design high-performance services with easy scaling, serve LLMs while streaming their output, integrate with Docker containers via Executor Hub, and host on CPU/GPU using Jina AI Cloud. Jina also offers advanced orchestration and scaling capabilities, a smooth transition to the cloud, and easy scalability and concurrency for applications. Users can deploy to their own cloud or system with Kubernetes and Docker Compose integration, or deploy to JCloud for autoscaling and monitoring.
README:
Build multimodal AI applications with cloud-native technologies
Jina lets you build multimodal AI services and pipelines that communicate via gRPC, HTTP and WebSockets, then scale them up and deploy to production. You can focus on your logic and algorithms, without worrying about the infrastructure complexity.
Jina provides a smooth Pythonic experience for serving ML models, transitioning from local deployment to advanced orchestration frameworks like Docker-Compose, Kubernetes, or Jina AI Cloud. Jina makes advanced solution engineering and cloud-native technologies accessible to every developer.
- Build and serve models for any data type and any mainstream deep learning framework.
- Design high-performance services, with easy scaling, duplex client-server streaming, batching, dynamic batching, async/non-blocking data processing and any protocol.
- Serve LLM models while streaming their output.
- Docker container integration via Executor Hub, OpenTelemetry/Prometheus observability.
- Streamlined CPU/GPU hosting via Jina AI Cloud.
- Deploy to your own cloud or system with our Kubernetes and Docker Compose integration.
Wait, how is Jina different from FastAPI?
Jina's value proposition may seem quite similar to that of FastAPI. However, there are several fundamental differences:
Data structure and communication protocols
- FastAPI communication relies on Pydantic, while Jina relies on DocArray, which allows Jina to expose its services over multiple protocols. Support for the gRPC protocol is especially useful for data-intensive applications such as embedding services, where embeddings and tensors can be serialized more efficiently.
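For illustration, here is a minimal sketch of a DocArray schema that carries a tensor alongside text; the schema and field names are made up for this example and are not taken from the README:

from docarray import BaseDoc
from docarray.typing import NdArray


class EmbeddingDoc(BaseDoc):
    # text payload plus a fixed-size embedding vector; tensors like this
    # serialize efficiently over gRPC (protobuf), unlike a plain JSON body
    text: str
    embedding: NdArray[768]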
Advanced orchestration and scaling capabilities
- Jina allows you to easily containerize and orchestrate your services and models, providing concurrency and scalability.
- Jina lets you deploy applications formed from multiple microservices that can be containerized and scaled independently.
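As an illustrative sketch (the Executor names and document schema below are placeholders, not taken from this README), independently scaled, chained services look like this in the Python API:

from docarray import BaseDoc, DocList
from jina import Executor, Flow, requests


class MyDoc(BaseDoc):
    text: str


class EncoderExecutor(Executor):
    @requests
    def encode(self, docs: DocList[MyDoc], **kwargs) -> DocList[MyDoc]:
        return docs  # placeholder logic


class IndexerExecutor(Executor):
    @requests
    def index(self, docs: DocList[MyDoc], **kwargs) -> DocList[MyDoc]:
        return docs  # placeholder logic


# each step of the pipeline gets its own replica count,
# so the two services scale independently
flow = (
    Flow(port=12345)
    .add(uses=EncoderExecutor, replicas=3)
    .add(uses=IndexerExecutor, replicas=1)
)

with flow:
    flow.block()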
Journey to the cloud
- Jina provides a smooth transition from local development (using DocArray), to local serving using Deployment and Flow, to production-ready services that use Kubernetes' capacity to orchestrate the lifetime of containers.
- By using Jina AI Cloud you have access to scalable and serverless deployments of your applications in one command.
pip install jina
Find more install options for Apple Silicon and Windows.
Jina has three fundamental layers:
- Data layer: BaseDoc and DocList (from DocArray) are the input/output formats in Jina.
- Serving layer: An Executor is a Python class that transforms and processes Documents. By simply wrapping your models into an Executor, you allow them to be served and scaled by Jina. The Gateway is the service that connects all Executors inside a Flow.
- Orchestration layer: Deployment serves a single Executor, while a Flow serves Executors chained into a pipeline.
The full glossary is explained here.
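To make the three layers concrete, here is a minimal toy sketch; the document class, Executor, and port below are purely illustrative and not part of the README:

from docarray import BaseDoc, DocList            # data layer
from jina import Executor, requests, Deployment  # serving + orchestration layers


class GreetDoc(BaseDoc):
    text: str


class Upcase(Executor):
    @requests
    def upcase(self, docs: DocList[GreetDoc], **kwargs) -> DocList[GreetDoc]:
        for doc in docs:
            doc.text = doc.text.upper()
        return docs


# a Deployment serves this single Executor; a Flow would chain several
dep = Deployment(uses=Upcase, port=12346)  # arbitrary port for this toy example

with dep:
    dep.block()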
Let's build a fast, reliable and scalable gRPC-based AI service. In Jina we call this an Executor. Our simple Executor will wrap the StableLM LLM from Stability AI. We'll then use a Deployment to serve it.
Note A Deployment serves just one Executor. To combine multiple Executors into a pipeline and serve that, use a Flow.
Let's implement the service's logic:
executor.py
from jina import Executor, requests
from docarray import DocList, BaseDoc
from transformers import pipeline


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


class StableLM(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.generator = pipeline(
            'text-generation', model='stabilityai/stablelm-base-alpha-3b'
        )

    @requests
    def generate(self, docs: DocList[Prompt], **kwargs) -> DocList[Generation]:
        generations = DocList[Generation]()
        prompts = docs.text
        llm_outputs = self.generator(prompts)
        for prompt, output in zip(prompts, llm_outputs):
            # each pipeline output is a list of dicts with a 'generated_text' key
            generations.append(
                Generation(prompt=prompt, text=output[0]['generated_text'])
            )
        return generations
Then we deploy it with either the Python API or YAML:
Python API: deployment.py
from jina import Deployment
from executor import StableLM

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    dep.block()

YAML: deployment.yml
jtype: Deployment
with:
  uses: StableLM
  py_modules:
    - executor.py
  timeout_ready: -1
  port: 12345

And run the YAML Deployment with the CLI:
Use Jina Client to make requests to the service:
from jina import Client
from docarray import DocList, BaseDoc


class Prompt(BaseDoc):
    text: str


class Generation(BaseDoc):
    prompt: str
    text: str


prompt = Prompt(
    text='suggest an interesting image generation prompt for a mona lisa variant'
)

client = Client(port=12345)  # use port from output above
response = client.post(on='/', inputs=[prompt], return_type=DocList[Generation])

print(response[0].text)
a steampunk version of the Mona Lisa, incorporating mechanical gears, brass elements, and Victorian era clothing details
Note In a notebook, you can't use deployment.block() and then use the client to make requests. Please refer to the Colab link above for reproducible Jupyter Notebook code snippets.
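In a single plain Python script, one pattern is to send requests from inside the with block instead of calling dep.block(); a minimal sketch reusing the StableLM Executor and the schemas defined in executor.py above (the prompt text is made up for illustration):

from docarray import DocList
from jina import Client, Deployment
from executor import StableLM, Prompt, Generation  # schemas defined in executor.py above

dep = Deployment(uses=StableLM, timeout_ready=-1, port=12345)

with dep:
    # the Deployment stays up for the duration of the with block,
    # so requests can be sent from the same process without blocking
    client = Client(port=12345)
    response = client.post(
        on='/',
        inputs=[Prompt(text='a steampunk mona lisa prompt idea')],
        return_type=DocList[Generation],
    )
    print(response[0].text)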
Sometimes you want to chain microservices together into a pipeline. That's where a Flow comes in.
A Flow is a DAG pipeline composed of a set of steps. It orchestrates a set of Executors and a Gateway to offer an end-to-end service.
Note If you just want to serve a single Executor, you can use a Deployment.
For instance, let's combine our StableLM language model with a Stable Diffusion image generation model. Chaining these services together into a Flow will give us a service that will generate images based on a prompt generated by the LLM.
text_to_image.py
import numpy as np
from jina import Executor, requests
from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc


class Generation(BaseDoc):
    prompt: str
    text: str


class TextToImage(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        from diffusers import StableDiffusionPipeline
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
        ).to("cuda")

    @requests
    def generate_image(self, docs: DocList[Generation], **kwargs) -> DocList[ImageDoc]:
        images = self.pipe(
            docs.text
        ).images  # images here are in PIL format (https://pillow.readthedocs.io/en/stable/)
        result = DocList[ImageDoc]()
        for image in images:
            # append one ImageDoc per generated image
            result.append(ImageDoc(tensor=np.array(image)))
        return result
Build the Flow with either Python or YAML:
Python API: flow.py
from jina import Flow
from executor import StableLM
from text_to_image import TextToImage

flow = (
    Flow(port=12345)
    .add(uses=StableLM, timeout_ready=-1)
    .add(uses=TextToImage, timeout_ready=-1)
)

with flow:
    flow.block()

YAML: flow.yml
jtype: Flow
with:
  port: 12345
executors:
  - uses: StableLM
    timeout_ready: -1
    py_modules:
      - executor.py
  - uses: TextToImage
    timeout_ready: -1
    py_modules:
      - text_to_image.py

Then run the YAML Flow with the CLI:
Then, use Jina Client to make requests to the Flow:
from jina import Client
from docarray import DocList, BaseDoc
from docarray.documents import ImageDoc


class Prompt(BaseDoc):
    text: str


prompt = Prompt(
    text='suggest an interesting image generation prompt for a mona lisa variant'
)

client = Client(port=12345)  # use port from output above
response = client.post(on='/', inputs=[prompt], return_type=DocList[ImageDoc])

response[0].display()
Why not just use standard Python to build that service and pipeline? Jina accelerates time to market of your application by making it more scalable and cloud-native. Jina also handles the infrastructure complexity in production and other Day-2 operations so that you can focus on the data application itself.
Increase your application's throughput with scalability features out of the box, like replicas, shards and dynamic batching.
Let's scale a Stable Diffusion Executor deployment with replicas and dynamic batching:
- Create two replicas, with a GPU assigned for each.
- Enable dynamic batching to process incoming parallel requests together with the same model inference.
Normal Deployment
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py

Scaled Deployment
jtype: Deployment
with:
  uses: TextToImage
  timeout_ready: -1
  py_modules:
    - text_to_image.py
  env:
    CUDA_VISIBLE_DEVICES: RR
  replicas: 2
  uses_dynamic_batching: # configure dynamic batching
    /default:
      preferred_batch_size: 10
      timeout: 200
Assuming your machine has two GPUs, using the scaled deployment YAML will give better throughput compared to the normal deployment.
These features apply to both Deployment YAML and Flow YAML. Thanks to the YAML syntax, you can inject deployment configurations regardless of Executor code.
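The replication part can also be sketched with the Python API; this is only an illustration that assumes the replicas and env keyword arguments mirror the YAML keys above, with two GPUs available, while dynamic batching is left to the YAML configuration shown earlier:

from jina import Deployment
from text_to_image import TextToImage

# two replicas of the Executor; round-robin GPU assignment is driven by
# the CUDA_VISIBLE_DEVICES=RR environment setting, as in the YAML above
dep = Deployment(
    uses=TextToImage,
    timeout_ready=-1,
    replicas=2,
    env={'CUDA_VISIBLE_DEVICES': 'RR'},
)

with dep:
    dep.block()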
In order to deploy your solutions to the cloud, you need to containerize your services. Jina provides the Executor Hub, a tool that streamlines this process and takes much of the hassle off your hands. It also lets you share Executors publicly or privately.
You just need to structure your Executor in a folder:
TextToImage/
├── executor.py
├── config.yml
├── requirements.txt
config.yml
jtype: TextToImage
py_modules:
  - executor.py
metas:
  name: TextToImage
  description: Text to Image generation Executor based on StableDiffusion
  url:
  keywords: []

requirements.txt
diffusers
accelerate
transformers
Then push the Executor to the Hub by running: jina hub push TextToImage
This will give you a URL that you can use in your Deployment and Flow to reference the pushed Executor's container.
jtype: Flow
with:
  port: 12345
executors:
  - uses: jinaai+docker://<user-id>/StableLM
  - uses: jinaai+docker://<user-id>/TextToImage
Using Kubernetes with Jina is easy:
jina export kubernetes flow.yml ./my-k8s
kubectl apply -R -f my-k8s
And so is Docker Compose:
jina export docker-compose flow.yml docker-compose.yml
docker-compose up
Note You can also export Deployment YAML to Kubernetes and Docker Compose.
That's not all. We also support OpenTelemetry, Prometheus, and Jaeger.
What cloud-native technology is still challenging to you? Tell us and we'll handle the complexity and make it easy for you.
You can also deploy a Flow to JCloud, where you can easily enjoy autoscaling, monitoring and more with a single command.
First, turn the flow.yml file into a JCloud-compatible YAML by specifying resource requirements and using containerized Hub Executors.
Then, use the jina cloud deploy command to deploy to the cloud:
wget https://raw.githubusercontent.com/jina-ai/jina/master/.github/getting-started/jcloud-flow.yml
jina cloud deploy jcloud-flow.yml
Warning
Make sure to delete/clean up the Flow once you are done with this tutorial to save resources and credits.
Read more about deploying Flows to JCloud.
Large Language Models can power a wide range of applications, from chatbots to assistants and intelligent systems. However, these models can be heavy and slow, and your users want systems that are both intelligent and fast!
Large language models work by turning your questions into tokens and then generating new tokens one at a time until they decide that generation should stop. This means you want to stream the output tokens generated by a large language model to the client. In this tutorial, we will discuss how to achieve this with Streaming Endpoints in Jina.
The first step is to define the streaming service schemas, as you would do in any other service framework. The input to the service is the prompt and the maximum number of tokens to generate, while the output is simply the token ID:
from docarray import BaseDoc


class PromptDocument(BaseDoc):
    prompt: str
    max_tokens: int


class ModelOutputDocument(BaseDoc):
    token_id: int
    generated_text: str
Our service depends on a large language model. As an example, we will use the gpt2 model. This is how you would load such a model in your Executor:
from jina import Executor, requests
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


class TokenStreamingExecutor(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
Our streaming endpoint accepts a PromptDocument as input and streams ModelOutputDocuments. To stream a document back to the client, use the yield keyword in the endpoint implementation. Therefore, we use the model to generate up to max_tokens tokens and yield them until the generation stops:
class TokenStreamingExecutor(Executor):
    ...

    @requests(on='/stream')
    async def task(self, doc: PromptDocument, **kwargs) -> ModelOutputDocument:
        input = tokenizer(doc.prompt, return_tensors='pt')
        input_len = input['input_ids'].shape[1]
        for _ in range(doc.max_tokens):
            output = self.model.generate(**input, max_new_tokens=1)
            if output[0][-1] == tokenizer.eos_token_id:
                break
            yield ModelOutputDocument(
                token_id=output[0][-1],
                generated_text=tokenizer.decode(
                    output[0][input_len:], skip_special_tokens=True
                ),
            )
            input = {
                'input_ids': output,
                'attention_mask': torch.ones(1, len(output[0])),
            }
Learn more about streaming endpoints from the Executor documentation.
The final step is to serve the Executor and send requests using the client. To serve the Executor using gRPC:
from jina import Deployment

with Deployment(uses=TokenStreamingExecutor, port=12345, protocol='grpc') as dep:
    dep.block()
To send requests from a client:
import asyncio
from jina import Client


async def main():
    client = Client(port=12345, protocol='grpc', asyncio=True)
    async for doc in client.stream_doc(
        on='/stream',
        inputs=PromptDocument(prompt='what is the capital of France ?', max_tokens=10),
        return_type=ModelOutputDocument,
    ):
        print(doc.generated_text)


asyncio.run(main())
The
The capital
The capital of
The capital of France
The capital of France is
The capital of France is Paris
The capital of France is Paris.
- Join our Discord community and chat with other community members about ideas.
- Subscribe to the latest video tutorials on our YouTube channel
Jina is backed by Jina AI and licensed under Apache-2.0.
Similar Open Source Tools
BentoML
BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It comes with everything you need for serving optimization, model packaging, and production deployment.
lantern
Lantern is an open-source PostgreSQL database extension designed to store vector data, generate embeddings, and handle vector search operations efficiently. It introduces a new index type called 'lantern_hnsw' for vector columns, which speeds up 'ORDER BY ... LIMIT' queries. Lantern utilizes the state-of-the-art HNSW implementation called usearch. Users can easily install Lantern using Docker, Homebrew, or precompiled binaries. The tool supports various distance functions, index construction parameters, and operator classes for efficient querying. Lantern offers features like embedding generation, interoperability with pgvector, parallel index creation, and external index graph generation. It aims to provide superior performance metrics compared to other similar tools and has a roadmap for future enhancements such as cloud-hosted version, hardware-accelerated distance metrics, industry-specific application templates, and support for version control and A/B testing of embeddings.
giskard
Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.
storm
STORM is a LLM system that writes Wikipedia-like articles from scratch based on Internet search. While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage. **Try out our [live research preview](https://storm.genie.stanford.edu/) to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**
llama_index
LlamaIndex is a data framework for building LLM applications. It provides tools for ingesting, structuring, and querying data, as well as integrating with LLMs and other tools. LlamaIndex is designed to be easy to use for both beginner and advanced users, and it provides a comprehensive set of features for building LLM applications.
lhotse
Lhotse is a Python library designed to make speech and audio data preparation flexible and accessible. It aims to attract a wider community to speech processing tasks by providing a Python-centric design and an expressive command-line interface. Lhotse offers standard data preparation recipes, PyTorch Dataset classes for speech tasks, and efficient data preparation for model training with audio cuts. It supports data augmentation, feature extraction, and feature-space cut mixing. The tool extends Kaldi's data preparation recipes with seamless PyTorch integration, human-readable text manifests, and convenient Python classes.
Vitron
Vitron is a unified pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing static images and dynamic videos. It addresses challenges in existing vision LLMs such as superficial instance-level understanding, lack of unified support for images and videos, and insufficient coverage across various vision tasks. The tool requires Python >= 3.8, Pytorch == 2.1.0, and CUDA Version >= 11.8 for installation. Users can deploy Gradio demo locally and fine-tune their models for specific tasks.
web-llm
WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
ShortcutsBench
ShortcutsBench is a project focused on collecting and analyzing workflows created in the Shortcuts app, providing a dataset of shortcut metadata, source files, and API information. It aims to study the integration of large language models with Apple devices, particularly focusing on the role of shortcuts in enhancing user experience. The project offers insights for Shortcuts users, enthusiasts, and researchers to explore, customize workflows, and study automated workflows, low-code programming, and API-based agents.
crewAI-tools
This repository provides a guide for setting up tools for crewAI agents to enhance functionality. It offers steps to equip agents with ready-to-use tools and create custom ones. Tools are expected to return strings for generating responses. Users can create tools by subclassing BaseTool or using the tool decorator. Contributions are welcome to enrich the toolset, and guidelines are provided for contributing. The development setup includes installing dependencies, activating virtual environment, setting up pre-commit hooks, running tests, static type checking, packaging, and local installation. The goal is to empower AI solutions through advanced tooling.
distilabel
Distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency. It helps you synthesize data and provide AI feedback to improve the quality of your AI models. With Distilabel, you can: * **Synthesize data:** Generate synthetic data to train your AI models. This can help you to overcome the challenges of data scarcity and bias. * **Provide AI feedback:** Get feedback from AI models on your data. This can help you to identify errors and improve the quality of your data. * **Improve your AI output quality:** By using Distilabel to synthesize data and provide AI feedback, you can improve the quality of your AI models and get better results.
patchwork
PatchWork is an open-source framework designed for automating development tasks using large language models. It enables users to automate workflows such as PR reviews, bug fixing, security patching, and more through a self-hosted CLI agent and preferred LLMs. The framework consists of reusable atomic actions called Steps, customizable LLM prompts known as Prompt Templates, and LLM-assisted automations called Patchflows. Users can run Patchflows locally in their CLI/IDE or as part of CI/CD pipelines. PatchWork offers predefined patchflows like AutoFix, PRReview, GenerateREADME, DependencyUpgrade, and ResolveIssue, with the flexibility to create custom patchflows. Prompt templates are used to pass queries to LLMs and can be customized. Contributions to new patchflows, steps, and the core framework are encouraged, with chat assistants available to aid in the process. The roadmap includes expanding the patchflow library, introducing a debugger and validation module, supporting large-scale code embeddings, parallelization, fine-tuned models, and an open-source GUI. PatchWork is licensed under AGPL-3.0 terms, while custom patchflows and steps can be shared using the Apache-2.0 licensed patchwork template repository.
KnowAgent
KnowAgent is a tool designed for Knowledge-Augmented Planning for LLM-Based Agents. It involves creating an action knowledge base, converting action knowledge into text for model understanding, and a knowledgeable self-learning phase to continually improve the model's planning abilities. The tool aims to enhance agents' potential for application in complex situations by leveraging external reservoirs of information and iterative processes.
rtdl-num-embeddings
This repository provides the official implementation of the paper 'On Embeddings for Numerical Features in Tabular Deep Learning'. It focuses on transforming scalar continuous features into vectors before integrating them into the main backbone of tabular neural networks, showcasing improved performance. The embeddings for continuous features are shown to enhance the performance of tabular DL models and are applicable to various conventional backbones, offering efficiency comparable to Transformer-based models. The repository includes Python packages for practical usage, exploration of metrics and hyperparameters, and reproducing reported results for different algorithms and datasets.
superduper
superduper.io is a Python framework that integrates AI models, APIs, and vector search engines directly with existing databases. It allows hosting of models, streaming inference, and scalable model training/fine-tuning. Key features include integration of AI with data infrastructure, inference via change-data-capture, scalable model training, model chaining, simple Python interface, Python-first approach, working with difficult data types, feature storing, and vector search capabilities. The tool enables users to turn their existing databases into centralized repositories for managing AI model inputs and outputs, as well as conducting vector searches without the need for specialized databases.
For similar tasks
sample-apps
Vespa is an open-source search and AI engine that provides a unified platform for building and deploying search and AI applications. Vespa sample applications showcase various use cases and features of Vespa, including basic search, recommendation, semantic search, image search, text ranking, e-commerce search, question answering, search-as-you-type, and ML inference serving.
For similar jobs
llm-resource
llm-resource is a comprehensive collection of high-quality resources for Large Language Models (LLM). It covers various aspects of LLM including algorithms, training, fine-tuning, alignment, inference, data engineering, compression, evaluation, prompt engineering, AI frameworks, AI basics, AI infrastructure, AI compilers, LLM application development, LLM operations, AI systems, and practical implementations. The repository aims to gather and share valuable resources related to LLM for the community to benefit from.
LitServe
LitServe is a high-throughput serving engine designed for deploying AI models at scale. It generates an API endpoint for models, handles batching, streaming, and autoscaling across CPU/GPUs. LitServe is built for enterprise scale with a focus on minimal, hackable code-base without bloat. It supports various model types like LLMs, vision, time-series, and works with frameworks like PyTorch, JAX, Tensorflow, and more. The tool allows users to focus on model performance rather than serving boilerplate, providing full control and flexibility.
how-to-optim-algorithm-in-cuda
This repository documents how to optimize common algorithms based on CUDA. It includes subdirectories with code implementations for specific optimizations. The optimizations cover topics such as compiling PyTorch from source, NVIDIA's reduce optimization, OneFlow's elementwise template, fast atomic add for half data types, upsample nearest2d optimization in OneFlow, optimized indexing in PyTorch, OneFlow's softmax kernel, linear attention optimization, and more. The repository also includes learning resources related to deep learning frameworks, compilers, and optimization techniques.
aiac
AIAC is a library and command line tool to generate Infrastructure as Code (IaC) templates, configurations, utilities, queries, and more via LLM providers such as OpenAI, Amazon Bedrock, and Ollama. Users can define multiple 'backends' targeting different LLM providers and environments using a simple configuration file. The tool allows users to ask a model to generate templates for different scenarios and composes an appropriate request to the selected provider, storing the resulting code to a file and/or printing it to standard output.
ENOVA
ENOVA is an open-source service for Large Language Model (LLM) deployment, monitoring, injection, and auto-scaling. It addresses challenges in deploying stable serverless LLM services on GPU clusters with auto-scaling by deconstructing the LLM service execution process and providing configuration recommendations and performance detection. Users can build and deploy LLM with few command lines, recommend optimal computing resources, experience LLM performance, observe operating status, achieve load balancing, and more. ENOVA ensures stable operation, cost-effectiveness, efficiency, and strong scalability of LLM services.
vidur
Vidur is a high-fidelity and extensible LLM inference simulator designed for capacity planning, deployment configuration optimization, testing new research ideas, and studying system performance of models under different workloads and configurations. It supports various models and devices, offers chrome trace exports, and can be set up using mamba, venv, or conda. Users can run the simulator with various parameters and monitor metrics using wandb. Contributions are welcome, subject to a Contributor License Agreement and adherence to the Microsoft Open Source Code of Conduct.
AI-System-School
AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.