clearml-server
ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
Stars: 364
ClearML Server is the backend service infrastructure for ClearML, facilitating collaboration and experiment management. It includes a web app, RESTful API, and file server for storing images and models. Users can deploy ClearML Server using Docker, AWS EC2 AMI, or Kubernetes. The system design supports single IP or sub-domain configurations with specific open ports. The ClearML-Agent Services container allows launching long-lasting jobs for use cases such as auto-scaler services, controllers, optimizers, and applications. Advanced functionality includes web login authentication and a non-responsive experiments watchdog. Upgrading ClearML Server involves stopping containers, backing up data, downloading the latest docker-compose.yml file, configuring ClearML-Agent Services, and spinning up the docker containers. Community support is available through the ClearML FAQ, Stack Overflow, GitHub issues, and email contact.
README:
ClearML - Auto-Magical Suite of tools to streamline your ML workflow: Experiment Manager, ML-Ops and Data-Management
Note regarding Apache Log4j2 Remote Code Execution (RCE) Vulnerability - CVE-2021-44228 - ESA-2021-31
According to Elasticsearch's latest report, supported versions of Elasticsearch (6.8.9+, 7.8+) used with recent versions of the JDK (JDK9+) are not susceptible to either remote code execution or information leakage, due to Elasticsearch's usage of the Java Security Manager.
As the latest version of ClearML Server uses Elasticsearch 7.10+ with JDK15, it is not affected by these vulnerabilities.
As a precaution, we've upgraded the ES version to 7.16.2 and added the mitigation recommended by Elasticsearch to our latest docker-compose.yml file.
While previous Elasticsearch versions (5.6.11+, 6.4.0+ and 7.0.0+) used by older ClearML Server versions are only susceptible to the information leakage vulnerability (which in any case does not permit access to data within the Elasticsearch cluster), we still recommend upgrading to the latest version of ClearML Server. Alternatively, you can apply the mitigation as implemented in our latest docker-compose.yml file.
Update 15 December: A further vulnerability (CVE-2021-45046) was disclosed on December 14th. Elasticsearch's guidance remains unchanged by this new vulnerability, so ClearML Server is not affected.
Update 22 December: To keep up with Elasticsearch's recommendations, we've upgraded the ES version to the newly released 7.16.2.
The ClearML Server is the backend service infrastructure for ClearML. It allows multiple users to collaborate and manage their experiments. ClearML offers a free hosted service, which is maintained by ClearML and open to anyone. In order to host your own server, you will need to launch the ClearML Server and point ClearML to it.
The ClearML Server contains the following components:
- The ClearML Web-App, a single-page UI for experiment management and browsing
- RESTful API for:
  - Documenting and logging experiment information, statistics and results
  - Querying experiments history, logs and results
- Locally-hosted file server for storing images and models, making them easily accessible using the Web-App
You can quickly deploy your ClearML Server using Docker, AWS EC2 AMI, or Kubernetes.
The ClearML Server has two supported configurations:
- Single IP (domain) with the following open ports:
  - Web application on port 8080
  - API service on port 8008
  - File storage service on port 8081
- Sub-Domain configuration with default http/s ports (80 or 443):
  - Web application on sub-domain: app.*.*
  - API service on sub-domain: api.*.*
  - File storage service on sub-domain: files.*.*
The ports 8080/8081/8008 must be available for the ClearML Server services.
For example, to see if port 8080 is in use:
- Linux or macOS:

  sudo lsof -Pn -i4 | grep :8080 | grep LISTEN

- Windows:

  netstat -an | find /i "8080"
Launch the ClearML Server in any of the following formats (a quick Docker launch sketch follows the list):
- Pre-built AWS EC2 AMI
- Pre-built GCP Custom Image
- Pre-built Docker Image
- Kubernetes
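For the Docker option, a minimal launch sketch follows (hedged: the full setup also creates per-service data sub-directories and has additional prerequisites; consult the deployment documentation for the complete steps):

  # allow Elasticsearch enough memory-mapped areas (Linux requirement)
  sudo sysctl -w vm.max_map_count=262144
  # create the base data and log directories (the full setup uses per-service sub-directories)
  sudo mkdir -p /opt/clearml/data /opt/clearml/logs
  # fetch the latest docker-compose file and start all services detached
  curl https://raw.githubusercontent.com/allegroai/trains-server/master/docker/docker-compose.yml -o docker-compose.yml
  docker-compose -f docker-compose.yml up -d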
In order to set up the ClearML client to work with your ClearML Server:
- Run the clearml-init command for an interactive setup.

- Or manually edit the ~/clearml.conf file, making sure the server settings (api_server, web_server, files_server) are configured correctly, for example:

  api {
      # API server on port 8008
      api_server: "http://localhost:8008"
      # web server on port 8080
      web_server: "http://localhost:8080"
      # file server on port 8081
      files_server: "http://localhost:8081"
  }
Note: If you have set up your ClearML Server in a sub-domain configuration, there is no need to specify a port number; it will be inferred from the http/s scheme.
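In a sub-domain configuration, the same settings would instead look like the following sketch (example.com is a placeholder for your own domain):

  api {
      api_server: "https://api.example.com"
      web_server: "https://app.example.com"
      files_server: "https://files.example.com"
  }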
After launching the ClearML Server and configuring the ClearML client to use the ClearML Server,
you can use ClearML in your experiments and view them in the ClearML Server web UI,
for example at http://localhost:8080.
For more information about the ClearML client, see ClearML.
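As a quick sanity check that the client can reach the server, a minimal sketch (the project and task names below are arbitrary placeholders):

  pip install clearml
  clearml-init    # interactive setup; paste the server URLs / credentials when prompted
  # register an empty experiment; it should appear in the web UI momentarily
  python -c "from clearml import Task; Task.init(project_name='examples', task_name='connectivity-test')"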
As of version 0.15 of ClearML Server, dockerized deployment includes a ClearML-Agent Services container running as part of the docker container collection.
ClearML-Agent Services is an extension of ClearML-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases, for example: an auto-scaler service (spinning up instances when the need arises and the budget allows), controllers (implementing pipelines and more sophisticated DevOps logic), optimizers (such as hyper-parameter optimization or sweeping), and applications (such as interactive Bokeh apps for increased data transparency).
The ClearML-Agent Services container will spin up any task enqueued into the dedicated services queue.
Every task launched by ClearML-Agent Services will be registered as a new node in the system,
providing tracking and transparency capabilities.
You can also run the ClearML-Agent Services manually (a sketch follows below); see details in ClearML-Agent services mode.
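For reference, a hedged sketch of launching the agent in services mode manually (assuming the clearml-agent package is installed and configured):

  clearml-agent daemon --services-mode --queue services --docker --detached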
Note: It is the user's responsibility to make sure the proper tasks are pushed into the services queue.
Do not enqueue training / inference tasks into the services queue, as it will put unnecessary load on the server.
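For illustration, a hedged sketch of enqueuing a service-type task from the command line using the clearml-task CLI (the project, repository URL, and script path are hypothetical):

  clearml-task --project DevOps --name cleanup-service \
      --repo https://github.com/your-org/your-repo.git \
      --script services/cleanup.py --queue services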
The ClearML Server provides a few additional useful features, which can be manually enabled:
- Web login authentication
- Non-responsive experiments watchdog
To restart the ClearML Server, you must first stop the containers and then restart them:

  docker-compose down
  docker-compose -f docker-compose.yml up

ClearML Server releases are also reflected in the docker compose configuration file.
We strongly encourage you to keep your ClearML Server up to date with the current release.
Note: The following upgrade instructions use the Linux OS as an example.
To upgrade your existing ClearML Server deployment:
- Shut down the docker containers:

  docker-compose down
- We highly recommend backing up your data directory before upgrading. Assuming your data directory is /opt/clearml, to archive all data into ~/clearml_backup.tgz execute:

  sudo tar czvf ~/clearml_backup.tgz /opt/clearml/data

  Restore instructions: to restore this example backup, execute (the archive stores paths relative to the filesystem root, so extract at /):

  sudo rm -R /opt/clearml/data
  sudo tar -xzf ~/clearml_backup.tgz -C /
- Download the latest docker-compose.yml file:

  curl https://raw.githubusercontent.com/allegroai/trains-server/master/docker/docker-compose.yml -o docker-compose.yml
- Configure the ClearML-Agent Services (not supported on Windows installations). If CLEARML_HOST_IP is not provided, ClearML-Agent Services will use the external public address of the ClearML Server. If CLEARML_AGENT_GIT_USER / CLEARML_AGENT_GIT_PASS are not provided, ClearML-Agent Services will not be able to access any private repositories for running service tasks:

  export CLEARML_HOST_IP=server_host_ip_here
  export CLEARML_AGENT_GIT_USER=git_username_here
  export CLEARML_AGENT_GIT_PASS=git_password_here
- Spin up the docker containers; the latest ClearML Server build will be pulled automatically:

  docker-compose -f docker-compose.yml pull
  docker-compose -f docker-compose.yml up
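Once the containers are up, a quick hedged way to verify the deployment (debug.ping is the API server's health-check endpoint in recent ClearML Server versions):

  docker ps                               # all clearml-* containers should show as running
  curl http://localhost:8008/debug.ping   # the API server should return an OK response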
If something goes wrong along the way, check our FAQ: Common Docker Upgrade Errors.
If you have any questions, see the ClearML FAQ, or tag your questions on Stack Overflow with the 'clearml' tag.
For feature requests or bug reports, please use GitHub issues.
Additionally, you can always find us at [email protected]
Server Side Public License v1.0
The ClearML Server relies on both MongoDB and Elasticsearch. With the recent changes in both MongoDB's and Elasticsearch's OSS licenses, we feel it is our responsibility as a member of the community to support the projects we love and cherish. We believe the cause for the license change in both cases is more than just, and chose SSPL because it is the more general and flexible of the two licenses.
This is our way to say - we support you guys!
Alternative AI tools for clearml-server
Similar Open Source Tools
TaskingAI
TaskingAI brings Firebase's simplicity to AI-native app development. The platform enables the creation of GPTs-like multi-tenant applications using a wide range of LLMs from various providers. It features distinct, modular functions such as Inference, Retrieval, Assistant, and Tool, seamlessly integrated to enhance the development process. TaskingAI's cohesive design ensures an efficient, intelligent, and user-friendly experience in AI application development.
open-webui
Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs. For more information, be sure to check out our Open WebUI Documentation.
restai
RestAI is an AIaaS (AI as a Service) platform that allows users to create and consume AI agents (projects) using a simple REST API. It supports various types of agents, including RAG (Retrieval-Augmented Generation), RAGSQL (RAG for SQL), inference, vision, and router. RestAI features automatic VRAM management, support for any public LLM supported by LlamaIndex or any local LLM supported by Ollama, a user-friendly API with Swagger documentation, and a frontend for easy access. It also provides evaluation capabilities for RAG agents using deepeval.
chatnio
Chat Nio is a next-generation AIGC one-stop business solution that combines the advantages of frontend-oriented lightweight deployment projects with powerful API distribution systems. It offers rich model support, beautiful UI design, complete Markdown support, multi-theme support, internationalization support, text-to-image support, powerful conversation sync, model market & preset system, rich file parsing, full model internet search, Progressive Web App (PWA) support, comprehensive backend management, multiple billing methods, innovative model caching, and additional features. The project aims to address limitations in conversation synchronization, billing, file parsing, conversation URL sharing, channel management, and API call support found in existing AIGC commercial sites, while also providing a user-friendly interface design and C-end features.
higress
Higress is an open-source cloud-native API gateway built on the core of Istio and Envoy, based on Alibaba's internal practice of Envoy Gateway. It is designed for AI-native API gateway, serving AI businesses such as Tongyi Qianwen APP, Bailian Big Model API, and Machine Learning PAI platform. Higress provides capabilities to interface with LLM model vendors, AI observability, multi-model load balancing/fallback, AI token flow control, and AI caching. It offers features for AI gateway, Kubernetes Ingress gateway, microservices gateway, and security protection gateway, with advantages in production-level scalability, stream processing, extensibility, and ease of use.
DesktopCommanderMCP
Desktop Commander MCP is a server that allows the Claude desktop app to execute long-running terminal commands on your computer and manage processes through Model Context Protocol (MCP). It is built on top of MCP Filesystem Server to provide additional search and replace file editing capabilities. The tool enables users to execute terminal commands with output streaming, manage processes, perform full filesystem operations, and edit code with surgical text replacements or full file rewrites. It also supports vscode-ripgrep based recursive code or text search in folders.
middleware
Middleware is an open-source engineering management tool that helps engineering leaders measure and analyze team effectiveness using DORA metrics. It integrates with CI/CD tools, automates DORA metric collection and analysis, visualizes key performance indicators, provides customizable reports and dashboards, and integrates with project management platforms. Users can set up Middleware using Docker or manually, generate encryption keys, set up backend and web servers, and access the application to view DORA metrics. The tool calculates DORA metrics using GitHub data, including Deployment Frequency, Lead Time for Changes, Mean Time to Restore, and Change Failure Rate. Middleware aims to provide DORA metrics to users based on their Git data, simplifying the process of tracking software delivery performance and operational efficiency.
DevoxxGenieIDEAPlugin
Devoxx Genie is a Java-based IntelliJ IDEA plugin that integrates with local and cloud-based LLM providers to aid in reviewing, testing, and explaining project code. It supports features like code highlighting, chat conversations, and adding files/code snippets to context. Users can modify REST endpoints and LLM parameters in settings, including support for cloud-based LLMs. The plugin requires IntelliJ version 2023.3.4 and JDK 17. Building and publishing the plugin is done using Gradle tasks. Users can select an LLM provider, choose code, and use commands like review, explain, or generate unit tests for code analysis.
inngest
Inngest is a platform that offers durable functions to replace queues, state management, and scheduling for developers. It allows writing reliable step functions faster without dealing with infrastructure. Developers can create durable functions using various language SDKs, run a local development server, deploy functions to their infrastructure, sync functions with the Inngest Platform, and securely trigger functions via HTTPS. Inngest Functions support retrying, scheduling, and coordinating operations through triggers, flow control, and steps, enabling developers to build reliable workflows with robust support for various operations.
AIOStreams
AIOStreams is a versatile tool that combines streams from various addons into one platform, offering extensive customization options. Users can change result formats, filter results by various criteria, remove duplicates, prioritize services, sort results, specify size limits, and more. The tool scrapes results from selected addons, applies user configurations, and presents the results in a unified manner. It simplifies the process of finding and accessing desired content from multiple sources, enhancing user experience and efficiency.
clearml
ClearML is a suite of tools designed to streamline the machine learning workflow. It includes an experiment manager, MLOps/LLMOps, data management, and model serving capabilities. ClearML is open-source and offers a free tier hosting option. It supports various ML/DL frameworks and integrates with Jupyter Notebook and PyCharm. ClearML provides extensive logging capabilities, including source control info, execution environment, hyper-parameters, and experiment outputs. It also offers automation features, such as remote job execution and pipeline creation. ClearML is designed to be easy to integrate, requiring only two lines of code to add to existing scripts. It aims to improve collaboration, visibility, and data transparency within ML teams.
actionbook
Actionbook is a browser action engine designed for AI agents, providing up-to-date action manuals and DOM structure to enable instant website operations without guesswork. It offers faster execution, token savings, resilient automation, and universal compatibility, making it ideal for building reliable browser agents. Actionbook integrates seamlessly with AI coding assistants and offers three integration methods: CLI, MCP Server, and JavaScript SDK. The tool is well-documented and actively developed in a monorepo setup using pnpm workspaces and Turborepo.
LLMstudio
LLMstudio by TensorOps is a platform that offers prompt engineering tools for accessing models from providers like OpenAI, VertexAI, and Bedrock. It provides features such as Python Client Gateway, Prompt Editing UI, History Management, and Context Limit Adaptability. Users can track past runs, log costs and latency, and export history to CSV. The tool also supports automatic switching to larger-context models when needed. Coming soon features include side-by-side comparison of LLMs, automated testing, API key administration, project organization, and resilience against rate limits. LLMstudio aims to streamline prompt engineering, provide execution history tracking, and enable effortless data export, offering an evolving environment for teams to experiment with advanced language models.
trigger.dev
Trigger.dev is an open source platform and SDK for creating long-running background jobs. It provides features like JavaScript and TypeScript SDK, no timeouts, retries, queues, schedules, observability, React hooks, Realtime API, custom alerts, elastic scaling, and works with existing tech stack. Users can create tasks in their codebase, deploy tasks using the SDK, manage tasks in different environments, and have full visibility of job runs. The platform offers a trace view of every task run for detailed monitoring. Getting started is easy with account creation, project setup, and onboarding instructions. Self-hosting and development guides are available for users interested in contributing or hosting Trigger.dev.
mattermost-plugin-agents
The Mattermost Agents Plugin integrates AI capabilities directly into your Mattermost workspace, allowing users to run local LLMs on their infrastructure or connect to cloud providers. It offers multiple AI assistants with specialized personalities, thread and channel summarization, action item extraction, meeting transcription, semantic search, smart reactions, direct conversations with AI assistants, and flexible LLM support. The plugin comes with comprehensive documentation, installation instructions, system requirements, and development guidelines for users to interact with AI features and configure LLM providers.
For similar tasks
deliberate-lab
Deliberate Lab is a platform designed for conducting online research experiments focusing on human and LLM group dynamics. It provides documentation for researchers and developers, offers a quick start guide for developers, and includes a technical paper. The platform allows users to run real-time human-AI social experiments and provides tools for analyzing group interactions and dynamics.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
- Self-contained, with no need for a DBMS or cloud service
- OpenAPI interface, easy to integrate with existing infrastructure (e.g. Cloud IDE)
- Supports consumer-grade GPUs
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.
