ollama-operator

Yet another operator for running large language models on Kubernetes with ease. Powered by Ollama! 🐫

Ollama Operator is a Kubernetes operator designed to make it easy to run large language models on Kubernetes clusters. It simplifies deploying and managing multiple models on the same cluster and provides an easy-to-use interface for doing so. With support for various Kubernetes environments and seamless integration with Ollama models, APIs, and CLI, Ollama Operator streamlines the deployment and management of language models. By leveraging llama.cpp, it eliminates the need to worry about Python environments and CUDA drivers, making it a reliable tool for running large language models on Kubernetes.

README:


Ollama Operator


Yet another operator for running large language models on Kubernetes with ease. 🙀

Powered by Ollama! 🐫

While Ollama is a powerful tool for running large language models locally, and its CLI offers a user experience much like the Docker CLI, it is not yet possible to replicate that experience on Kubernetes, especially when it comes to running multiple models on the same cluster with lots of resources and configurations.

That's where the Ollama Operator kicks in:

  • Install the operator on your Kubernetes cluster
  • Apply the needed CRDs
  • Create your models
  • Wait for the models to be fetched and loaded, that's it!

Thanks to the great work of llama.cpp, there is no need to worry about Python environments or CUDA drivers.

The journey to large language models, AIGC, localized agents, 🦜🔗 Langchain and more is just a few steps away!

Features

  • ✅ Ability to run multiple models on the same cluster.
  • ✅ Compatible with all Ollama models, APIs, and CLI.
  • ✅ Able to run on general Kubernetes clusters, K3s clusters (Raspberry Pi, TrueNAS SCALE, etc.), kind, minikube, etc. You name it!
  • ✅ Easy to install, uninstall, and upgrade.
  • ✅ Pull the image once, share it across the entire node (just like normal container images).
  • ✅ Easy to expose with existing Kubernetes services, Ingress, etc. (see the sketch after this list).
  • ✅ Doesn't require any additional dependencies, just Kubernetes.
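
For the exposure point above, here is a minimal sketch (not from the upstream docs) of fronting a deployed model with an Ingress. It assumes an NGINX ingress controller, a model named phi deployed as in the Getting started section below, and that the operator's Service exposes Ollama's default port 11434:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-model-phi
spec:
  ingressClassName: nginx
  rules:
    - host: phi.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama-model-phi   # Service created by the operator for the phi model
                port:
                  number: 11434          # assumed: Ollama's default port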

Getting started

Install operator

kubectl apply -f https://raw.githubusercontent.com/nekomeowww/ollama-operator/main/dist/install.yaml

Wait for the operator to be ready

kubectl wait --for=jsonpath='{.status.readyReplicas}'=1 deployment/ollama-operator-controller-manager -n ollama-operator-system

Deploy a model

[!NOTE] You can also use the kollama CLI that ships natively with Ollama Operator, which makes it easier to interact with the operator.

Install kollama CLI:

go install github.com/nekomeowww/ollama-operator/cmd/kollama@latest

Deploying a model can be done with the following command:

kollama deploy phi --expose --node-port 30001

More information can be found at CLI
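
As a quick sanity check (a sketch; <node-ip> is a placeholder for one of your nodes' addresses, and 30001 is the NodePort chosen above), you can reach the standard Ollama HTTP API directly, or point the Ollama CLI at the exposed endpoint:

curl http://<node-ip>:30001/api/generate -d '{"model": "phi", "prompt": "Why is the sky blue?"}'

OLLAMA_HOST=<node-ip>:30001 ollama run phi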

[!IMPORTANT] Working with kind?

The default provisioned StorageClass in kind is standard, and it only supports the ReadWriteOnce access mode. Therefore, if you need to run the operator with kind, you should specify persistentVolume with accessMode: ReadWriteOnce in the Model CRD:

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi
spec:
  image: phi
  persistentVolume:
    accessMode: ReadWriteOnce

Let's create a Model CR for the model phi:

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi
spec:
  image: phi

Apply the Model CR to your Kubernetes cluster:

kubectl apply -f ollama-model-phi.yaml

Wait for the model to be ready:

kubectl wait --for=jsonpath='{.status.readyReplicas}'=1 deployment/ollama-model-phi

Access the model

  1. Ready! Now let's forward the ports to access the model (Ollama listens on port 11434):
kubectl port-forward svc/ollama-model-phi 11434:11434
  2. Interact with the model:
ollama run phi
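
With the port-forward from step 1 in place, you can also talk to the model through Ollama's standard HTTP API instead of the CLI, for example:

curl http://localhost:11434/api/chat -d '{"model": "phi", "messages": [{"role": "user", "content": "Hello!"}]}'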

Full options

apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi
spec:
  # Scale the model to 2 replicas
  replicas: 2
  # Use the model image `phi`
  image: phi
  imagePullPolicy: IfNotPresent
  storageClassName: local-path
  # If you have your own PersistentVolumeClaim created
  persistentVolumeClaim: your-pvc
  # If you need to specify the access mode for the PersistentVolume
  persistentVolume:
    accessMode: ReadWriteOnce
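
Once a Model has been applied, you can check on it through the CRD itself. A small sketch, assuming the CRD's plural resource name is models (the Kubernetes default for kind: Model):

kubectl get models
kubectl describe model phi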

Supported models

Unlock the ability to run the following models with Ollama Operator on Kubernetes:

[!TIP] By the power of Modelfile, backed by Ollama, you can create and bundle any model of your own, as long as it is a GGUF-formatted model.
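
As a rough illustration of that workflow (the file name and namespace below are placeholders; the Modelfile syntax and ollama commands are standard Ollama), you could bundle a local GGUF file and push it to a registry the cluster can pull from:

# Modelfile — bundle a local GGUF weights file into an Ollama model image
FROM ./my-model.Q4_K_M.gguf

# create the model image locally, then push it so the cluster can pull it
ollama create <your-namespace>/my-model -f Modelfile
ollama push <your-namespace>/my-model

The resulting image name can then be used as the image field in the Model CR.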

| Model | Parameters | Size | Model image | Full model image URL | Multi-modal | Uncensored |
| --- | --- | --- | --- | --- | --- | --- |
| Phi-3 Mini | 3.8B | 2.3GB | phi3 | registry.ollama.ai/library/phi3 | | |
| Llama 3 | 8B | 4.7GB | llama3 | registry.ollama.ai/library/llama3 | | |
| Dolphin Llama 3 | 8B | 4.7GB | dolphin-llama3 | registry.ollama.ai/library/dolphin-llama3 | | ✅ |
| WizardLM-2 | 7B | 4.1GB | wizardlm2 | registry.ollama.ai/library/wizardlm2 | | |
| Llama 2 | 7B | 3.8GB | llama2 | registry.ollama.ai/library/llama2 | | |
| Mistral | 7B | 4.1GB | mistral | registry.ollama.ai/library/mistral | | |
| Mixtral 8x7B | 8x7B | 26GB | mixtral:8x7b | registry.ollama.ai/library/mixtral:8x7b | | |
| Mixtral 8x22B | 8x22B | 80GB | mixtral:8x22b | registry.ollama.ai/library/mixtral:8x22b | | |
| Command R | 35B | 20GB | command-r | registry.ollama.ai/library/command-r | | |
| Command R Plus | 104B | 59GB | command-r-plus | registry.ollama.ai/library/command-r-plus | | |
| Dolphin Phi | 2.7B | 1.6GB | dolphin-phi | registry.ollama.ai/library/dolphin-phi | | ✅ |
| Phi-2 | 2.7B | 1.7GB | phi | registry.ollama.ai/library/phi | | |
| Neural Chat | 7B | 4.1GB | neural-chat | registry.ollama.ai/library/neural-chat | | |
| Starling | 7B | 4.1GB | starling-lm | registry.ollama.ai/library/starling-lm | | |
| Code Llama | 7B | 3.8GB | codellama | registry.ollama.ai/library/codellama | | |
| Llama 2 Uncensored | 7B | 3.8GB | llama2-uncensored | registry.ollama.ai/library/llama2-uncensored | | ✅ |
| Llama 2 13B | 13B | 7.3GB | llama2:13b | registry.ollama.ai/library/llama2:13b | | |
| Llama 2 70B | 70B | 39GB | llama2:70b | registry.ollama.ai/library/llama2:70b | | |
| Orca Mini | 3B | 1.9GB | orca-mini | registry.ollama.ai/library/orca-mini | | |
| Vicuna | 7B | 3.8GB | vicuna | registry.ollama.ai/library/vicuna | | |
| LLaVA | 7B | 4.5GB | llava | registry.ollama.ai/library/llava | ✅ | |
| Gemma 2B | 2B | 1.4GB | gemma:2b | registry.ollama.ai/library/gemma:2b | | |
| Gemma 7B | 7B | 4.8GB | gemma:7b | registry.ollama.ai/library/gemma:7b | | |

The full list of available images can be found at the Ollama Library.

[!WARNING] You should have at least 8 GB of RAM available on your node to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

[!WARNING] The actual size of downloaded large language models is huge compared to the size of typical container images.

  1. A fast and stable network connection is recommended for downloading the models.
  2. Efficient storage is required to store the models if you want to run models larger than 13B.

Architecture Overview

There are two major components that the Ollama Operator creates:

  1. Model Inferencing Server: The model inferencing server runs the model and serves the model's API. It is created as a Deployment in the Kubernetes cluster.
  2. Model Image Storage: The model image storage stores the model images in a PersistentVolume. It is created as a StatefulSet along with a PersistentVolumeClaim in the Kubernetes cluster.
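
As a rough way to see these pieces on a running cluster (assuming the resources are named after the model, as with the phi example above):

kubectl get deployment,statefulset,svc,pvc | grep ollama-model-phi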

[!NOTE] The image created by Ollama's Modelfile is a valid OCI-format image. However, due to its incompatible contentType value and the overall structure of a Modelfile image compared to a general container image, it cannot be run directly by a general container runtime. Therefore, a standalone service/deployment for Model Image Storage has to be persisted on the Kubernetes cluster in order to hold and cache previously downloaded model images.

The detailed resources it creates, and the relationships between them are shown in the following diagram:

Contributing

Acknowledgements

Grateful thanks to the following projects and their authors and contributors:

It is because of their hard work and contributions that this program exists.
