
cluster-toolkit
Cluster Toolkit is an open-source software offered by Google Cloud which makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Stars: 285

Cluster Toolkit is an open-source software by Google Cloud for deploying AI/ML and HPC environments on Google Cloud. It allows easy deployment following best practices, with high customization and extensibility. The toolkit includes tutorials, examples, and documentation for various modules designed for AI/ML and HPC use cases.
README:
Cluster Toolkit is an open-source software offered by Google Cloud which makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Cluster Toolkit allows customers to deploy turnkey AI/ML and HPC environments (compute, networking, storage, etc.) following Google Cloud best-practices, in a repeatable manner. The Cluster Toolkit is designed to be highly customizable and extensible, and intends to address the AI/ML and HPC deployment needs of a broad range of customers.
The Cluster Toolkit is an integral part of Google Cloud AI Hypercomputer. Documentation concerning AI Hypercomputer solutions is available for GKE and for Slurm.
The Toolkit provides tutorials, examples, and comprehensive developer documentation for a suite of modules that have been designed for AI/ML and HPC use cases.
For end-user guides and how-to information, please refer to the Google Cloud Docs.
Running through the quickstart tutorial is the recommended path to get started with the Cluster Toolkit.
If a self-directed path is preferred, you can use the following commands to build the gcluster binary:
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit
cd cluster-toolkit
make
./gcluster --version
./gcluster --help
NOTE: You may need to install dependencies first.
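Once the binary is built, a minimal sketch of typical usage, assuming the bundled examples/hpc-slurm.yaml blueprint and a placeholder project ID, looks like the following; see the quickstart tutorial for the full walkthrough, including authentication and API setup.
# create a deployment folder from a blueprint, supplying your project ID as a deployment variable
./gcluster create examples/hpc-slurm.yaml --vars project_id=<YOUR_PROJECT_ID>
# provision the generated deployment (named after the blueprint's deployment_name)
./gcluster deploy hpc-slurm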
Learn about the components that make up the Cluster Toolkit and more on how it works on the Google Cloud Docs Product Overview.
Terraform can discover credentials for authenticating to Google Cloud Platform in several ways. We will summarize Terraform's documentation for using gcloud from your workstation and for automatically finding credentials in cloud environments. We do not recommend following HashiCorp's instructions for downloading service account keys.
You can generate cloud credentials associated with your Google Cloud account using the following command:
gcloud auth application-default login
You will be prompted to open your web browser and authenticate to Google Cloud and make your account accessible from the command-line. Once this command completes, Terraform will automatically use your "Application Default Credentials."
If you receive failure messages containing "quota project" you should change the quota project associated with your Application Default Credentials with the following command and provide your current project ID as the argument:
gcloud auth application-default set-quota-project ${PROJECT-ID}
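To sanity-check that Application Default Credentials are in place, one option (assuming gcloud is installed) is to print an access token; an error here usually means the login step above needs to be repeated:
# prints a short-lived OAuth token if ADC is configured correctly
gcloud auth application-default print-access-token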
In virtualized settings, the cloud credentials of accounts can be attached directly to the execution environment. For example: a VM or a container can have service accounts attached to them. The Google Cloud Shell is an interactive command line environment which inherits the credentials of the user logged in to the Google Cloud Console.
Many of the above examples are easily executed within a Cloud Shell environment. Be aware that Cloud Shell has several limitations, in particular an inactivity timeout that will close running shells after 20 minutes. Please consider it only for blueprints that are quickly deployed.
The Cluster Toolkit officially supports the following VM images:
- HPC Rocky Linux 8
- Debian 11
- Ubuntu 20.04 LTS
For more information on these and other images, see docs/vm-images.md.
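As a quick sketch for inspecting these images yourself (assuming the public image project cloud-hpc-image-public, which hosts the HPC VM images), you can look up the current image in a family with gcloud:
# show the latest image in the HPC Rocky Linux 8 family
gcloud compute images describe-from-family hpc-rocky-linux-8 --project cloud-hpc-image-public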
Warning: Slurm Terraform modules cannot be directly used on the standard OS images. They must be used in combination with images built for the versioned release of the Terraform module.
The Cluster Toolkit provides modules and examples for implementing pre-built and custom Slurm VM images; see Slurm on GCP.
The Toolkit contains "validator" functions that perform basic tests of the blueprint to ensure that deployment variables are valid and that the AI/ML and HPC environment can be provisioned in your Google Cloud project. Further information can be found in dedicated documentation.
In a new GCP project there are several APIs that must be enabled to deploy your cluster. These will be caught when you perform terraform apply, but you can save time by enabling them upfront.
See Google Cloud Docs for instructions.
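As a hedged example, the exact set of APIs depends on the modules in your blueprint, but services such as Compute Engine and Filestore are commonly needed and can be enabled up front with gcloud:
# enable commonly required services in the target project
gcloud services enable compute.googleapis.com file.googleapis.com storage.googleapis.com --project <YOUR_PROJECT_ID>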
You may need to request additional quota to be able to deploy and use your cluster.
See Google Cloud Docs for more information.
You can view your billing reports for your cluster on the Cloud Billing Reports page. To view the Cloud Billing reports for your Cloud Billing account, including viewing the cost information for all of the Cloud projects that are linked to the account, you need a role that includes the billing.accounts.getSpendingInformation permission on your Cloud Billing account.
To view the Cloud Billing reports for your Cloud Billing account:
- In the Google Cloud Console, go to Navigation Menu > Billing.
- At the prompt, choose the Cloud Billing account for which you'd like to view reports. The Billing Overview page opens for the selected billing account.
- In the Billing navigation menu, select Reports.
On the right side, expand the Filters view and then filter by label, specifying the key ghpc_deployment (or ghpc_blueprint) and the desired value.
Confirm that you have properly set up Google Cloud credentials.
Please see the dedicated troubleshooting guide for Slurm.
When terraform apply fails, Terraform generally provides a useful error message. Here are some common reasons for the deployment to fail:
- GCP Access: The credentials being used to call terraform apply do not have access to the GCP project. This can be fixed by granting access in IAM & Admin.
- Disabled APIs: The GCP project must have the proper APIs enabled. See Enable GCP APIs.
- Insufficient Quota: The GCP project does not have enough quota to provision the requested resources. See GCP Quotas.
- Filestore resource limit: When regularly deploying Filestore instances with a new VPC you may see an error during deployment such as: System limit for internal resources has been reached. See this doc for the solution.
- Required permission not found:
  - Example: Required 'compute.projects.get' permission for 'projects/... forbidden
  - Credentials may not be set, or are not set correctly. Please follow instructions at Cloud credentials on your workstation.
  - Ensure proper permissions are set in the cloud console IAM section.
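When the cause is not obvious from the Terraform error, a couple of quick checks from the command line (a sketch, assuming gcloud is installed and <PROJECT_ID> is your project) can narrow it down:
# list the APIs currently enabled in the project
gcloud services list --enabled --project <PROJECT_ID>
# review the project's IAM policy to confirm your account has the required roles
gcloud projects get-iam-policy <PROJECT_ID>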
If terraform destroy fails with an error such as the following:
│ Error: Error when reading or editing Subnetwork: googleapi: Error 400: The subnetwork resource 'projects/<project_name>/regions/<region>/subnetworks/<subnetwork_name>' is already being used by 'projects/<project_name>/zones/<zone>/instances/<instance_name>', resourceInUseByAnotherResource
or
│ Error: Error waiting for Deleting Network: The network resource 'projects/<project_name>/global/networks/<vpc_network_name>' is already being used by 'projects/<project_name>/global/firewalls/<firewall_rule_name>'
These errors indicate that the VPC network cannot be destroyed because resources were added outside of Terraform and that those resources depend upon the network. These resources should be deleted manually. The first message indicates that a new VM has been added to a subnetwork within the VPC network. The second message indicates that a new firewall rule has been added to the VPC network. If your error message does not look like these, examine it carefully to identify the type of resource to delete and its unique name. In the two messages above, the resource names appear toward the end of the error message. The following links will take you directly to the areas within the Cloud Console for managing VMs and Firewall rules. Make certain that your project ID is selected in the drop-down menu at the top-left.
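As a hedged sketch of the manual cleanup, substituting the resource names reported in your error message, the stray resources can also be removed from the command line before re-running terraform destroy:
# delete a VM that was added to the subnetwork outside of Terraform
gcloud compute instances delete <instance_name> --zone <zone> --project <project_name>
# delete a firewall rule that was added to the VPC network outside of Terraform
gcloud compute firewall-rules delete <firewall_rule_name> --project <project_name>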
The deployment will be created with the following directory structure:
<<OUTPUT_PATH>>/<<DEPLOYMENT_NAME>>/{<<DEPLOYMENT_GROUPS>>}/
If an output directory is provided with the --output/-o flag, the deployment directory will be created in the output directory, represented as <<OUTPUT_PATH>> here. If not provided, <<OUTPUT_PATH>> will default to the current working directory.
The deployment directory is created in <<OUTPUT_PATH>> as a directory matching the provided deployment_name deployment variable (vars) in the blueprint.
Within the deployment directory are directories representing each deployment group in the blueprint, named the same as the group field for each element in deployment_groups.
Each deployment group directory contains all of the configuration scripts and modules needed to deploy. The modules live in a directory named modules, with each module in a subdirectory named the same as its source module; for example, the vpc module is in a directory named vpc.
A hidden directory containing meta information and backups is also created and named .ghpc.
From the hpc-slurm.yaml example, we get the following deployment directory:
hpc-slurm/
  primary/
    main.tf
    modules/
    providers.tf
    terraform.tfvars
    variables.tf
    versions.tf
  .ghpc/
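As a sketch of how this deployment directory is typically consumed (assuming the hpc-slurm example above), you can either let gcluster drive the deployment or run standard Terraform commands against a deployment group directory:
# deploy all groups in the deployment with gcluster
./gcluster deploy hpc-slurm
# or work with an individual deployment group using Terraform directly
terraform -chdir=hpc-slurm/primary init
terraform -chdir=hpc-slurm/primary apply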
See Cloud Docs on Installing Dependencies.
The Toolkit supports Packer templates in the contemporary HCL2 file format and not in the legacy JSON file format. We require the use of Packer 1.7.9 or above, and recommend using the latest release.
The Toolkit's Packer template module documentation describes input variables and their behavior. An image-building example and usage instructions are provided. The example integrates Packer, Terraform and startup-script runners to demonstrate the power of customizing images using the same scripts that can be applied at boot-time.
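As a minimal sketch, assuming a Packer deployment group generated by the Toolkit at <deployment_name>/<group_name>/<module_name>, the image build follows the usual Packer HCL2 workflow:
cd <deployment_name>/<group_name>/<module_name>
# install required plugins declared in the HCL2 template
packer init .
# check the template and variables
packer validate .
# build the custom VM image
packer build .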
The following setup is in addition to the dependencies needed to build and run the Cluster Toolkit.
Please use the pre-commit hooks configured in this repository to ensure that all changes are validated, tested and properly documented before pushing code changes. The pre-commit hooks configured in the Cluster Toolkit have a set of dependencies that need to be installed before they will pass successfully.
Follow these steps to install and set up pre-commit in your cloned repository:
- Install pre-commit using the instructions from the pre-commit website.
- Install TFLint using the instructions from the TFLint documentation. NOTE: The version of TFLint must be compatible with the Google plugin version identified in tflint.hcl. Versions of the plugin >=0.20.0 should use tflint>=0.40.0. These versions are readily available via GitHub or package managers. Please review the TFLint Ruleset for Google Release Notes for up-to-date requirements.
- Install ShellCheck using the instructions from the ShellCheck documentation.
- The other dev dependencies can be installed by running the following command in the project root directory:
make install-dev-deps
- Pre-commit is enabled on a repo-by-repo basis by running the following command in the project root directory:
pre-commit install
Now pre-commit is configured to automatically run before you commit.
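To check that the hooks are working, or to validate existing files before your first commit, you can also run them manually across the whole repository:
# run every configured pre-commit hook against all files in the repo
pre-commit run --all-files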
While macOS is a supported environment for building and executing the Toolkit, it is not supported for Toolkit development due to GNU-specific shell scripts. If developing on a Mac, a workaround is to install GNU tooling such as coreutils and findutils from a package manager like Homebrew or Conda.
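For example, assuming Homebrew is installed, the GNU utilities can be added with:
# install GNU coreutils and findutils alongside the BSD versions shipped with macOS
brew install coreutils findutils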
Please refer to the contributing file in our GitHub repository, or to Google’s Open Source documentation.
Alternative AI tools for cluster-toolkit
Similar Open Source Tools


airbroke
Airbroke is an open-source error catcher tool designed for modern web applications. It provides a PostgreSQL-based backend with an Airbrake-compatible HTTP collector endpoint and a React-based frontend for error management. The tool focuses on simplicity, maintaining a small database footprint even under heavy data ingestion. Users can ask AI about issues, replay HTTP exceptions, and save/manage bookmarks for important occurrences. Airbroke supports multiple OAuth providers for secure user authentication and offers occurrence charts for better insights into error occurrences. The tool can be deployed in various ways, including building from source, using Docker images, deploying on Vercel, Render.com, Kubernetes with Helm, or Docker Compose. It requires Node.js, PostgreSQL, and specific system resources for deployment.

HackBot
HackBot is an AI-powered cybersecurity chatbot designed to provide accurate answers to cybersecurity-related queries, conduct code analysis, and scan analysis. It utilizes the Meta-LLama2 AI model through the 'LlamaCpp' library to respond coherently. The chatbot offers features like local AI/Runpod deployment support, cybersecurity chat assistance, interactive interface, clear output presentation, static code analysis, and vulnerability analysis. Users can interact with HackBot through a command-line interface and utilize it for various cybersecurity tasks.

serverless-pdf-chat
The serverless-pdf-chat repository contains a sample application that allows users to ask natural language questions of any PDF document they upload. It leverages serverless services like Amazon Bedrock, AWS Lambda, and Amazon DynamoDB to provide text generation and analysis capabilities. The application architecture involves uploading a PDF document to an S3 bucket, extracting metadata, converting text to vectors, and using a LangChain to search for information related to user prompts. The application is not intended for production use and serves as a demonstration and educational tool.

aisheets
Hugging Face AI Sheets is an open-source tool for building, enriching, and transforming datasets using AI models with no code. It can be deployed locally or on the Hub, providing access to thousands of open models. Users can easily generate datasets, run data generation scripts, and customize inference endpoints for text generation. The tool supports custom LLMs and offers advanced configuration options for authentication, inference, and miscellaneous settings. With AI Sheets, users can leverage the power of AI models without writing any code, making dataset management and transformation efficient and accessible.

trinityX
TrinityX is an open-source HPC, AI, and cloud platform designed to provide all services required in a modern system, with full customization options. It includes default services like Luna node provisioner, OpenLDAP, SLURM or OpenPBS, Prometheus, Grafana, OpenOndemand, and more. TrinityX also sets up NFS-shared directories, OpenHPC applications, environment modules, HA, and more. Users can install TrinityX on Enterprise Linux, configure network interfaces, set up passwordless authentication, and customize the installation using Ansible playbooks. The platform supports HA, OpenHPC integration, and provides detailed documentation for users to contribute to the project.

vector-vein
VectorVein is a no-code AI workflow software inspired by LangChain and langflow, aiming to combine the powerful capabilities of large language models and enable users to achieve intelligent and automated daily workflows through simple drag-and-drop actions. Users can create powerful workflows without the need for programming, automating all tasks with ease. The software allows users to define inputs, outputs, and processing methods to create customized workflow processes for various tasks such as translation, mind mapping, summarizing web articles, and automatic categorization of customer reviews.

ultimate-rvc
Ultimate RVC is an extension of AiCoverGen, offering new features and improvements for generating audio content using RVC. It is designed for users looking to integrate singing functionality into AI assistants/chatbots/vtubers, create character voices for songs or books, and train voice models. The tool provides easy setup, voice conversion enhancements, TTS functionality, voice model training suite, caching system, UI improvements, and support for custom configurations. It is available for local and Google Colab use, with a PyPI package for easy access. The tool also offers CLI usage and customization through environment variables.

lumigator
Lumigator is an open-source platform developed by Mozilla.ai to help users select the most suitable language model for their specific needs. It supports the evaluation of summarization tasks using sequence-to-sequence models such as BART and BERT, as well as causal models like GPT and Mistral. The platform aims to make model selection transparent, efficient, and empowering by providing a framework for comparing LLMs using task-specific metrics to evaluate how well a model fits a project's needs. Lumigator is in the early stages of development and plans to expand support to additional machine learning tasks and use cases in the future.

ai-goat
AI Goat is a tool designed to help users learn about AI security through a series of vulnerable LLM CTF challenges. It allows users to run everything locally on their system without the need for sign-ups or cloud fees. The tool focuses on exploring security risks associated with large language models (LLMs) like ChatGPT, providing practical experience for security researchers to understand vulnerabilities and exploitation techniques. AI Goat uses the Vicuna LLM, derived from Meta's LLaMA and ChatGPT's response data, to create challenges that involve prompt injections, insecure output handling, and other LLM security threats. The tool also includes a prebuilt Docker image, ai-base, containing all necessary libraries to run the LLM and challenges, along with an optional CTFd container for challenge management and flag submission.

reai-ida
RevEng.AI IDA Pro Plugin is a tool that integrates with the RevEng.AI platform to provide various features such as uploading binaries for analysis, downloading analysis logs, renaming function names, generating AI summaries, synchronizing functions between local analysis and the platform, and configuring plugin settings. Users can upload files for analysis, synchronize function names, rename functions, generate block summaries, and explain function behavior using this plugin. The tool requires IDA Pro v8.0 or later with Python 3.9 and higher. It relies on the 'reait' package for functionality.

atomic_agents
Atomic Agents is a modular and extensible framework designed for creating powerful applications. It follows the principles of Atomic Design, emphasizing small and single-purpose components. Leveraging Pydantic for data validation and serialization, the framework offers a set of tools and agents that can be combined to build AI applications. It depends on the Instructor package and supports various APIs like OpenAI, Cohere, Anthropic, and Gemini. Atomic Agents is suitable for developers looking to create AI agents with a focus on modularity and flexibility.

eureka-ml-insights
The Eureka ML Insights Framework is a repository containing code designed to help researchers and practitioners run reproducible evaluations of generative models efficiently. Users can define custom pipelines for data processing, inference, and evaluation, as well as utilize pre-defined evaluation pipelines for key benchmarks. The framework provides a structured approach to conducting experiments and analyzing model performance across various tasks and modalities.

reai-ghidra
The RevEng.AI Ghidra Plugin by RevEng.ai allows users to interact with their API within Ghidra for Binary Code Similarity analysis to aid in Reverse Engineering stripped binaries. Users can upload binaries, rename functions above a confidence threshold, and view similar functions for a selected function.

serena
Serena is a powerful coding agent that integrates with existing LLMs to provide essential semantic code retrieval and editing tools. It is free to use and does not require API keys or subscriptions. Serena can be used for coding tasks such as analyzing, planning, and editing code directly on your codebase. It supports various programming languages and offers semantic code analysis capabilities through language servers. Serena can be integrated with different LLMs using the model context protocol (MCP) or Agno framework. The tool provides a range of functionalities for code retrieval, editing, and execution, making it a versatile coding assistant for developers.

chronon
Chronon is a platform that simplifies and improves ML workflows by providing a central place to define features, ensuring point-in-time correctness for backfills, simplifying orchestration for batch and streaming pipelines, offering easy endpoints for feature fetching, and guaranteeing and measuring consistency. It offers benefits over other approaches by enabling the use of a broad set of data for training, handling large aggregations and other computationally intensive transformations, and abstracting away the infrastructure complexity of data plumbing.
For similar jobs

AirGo
AirGo is a simple and easy-to-use proxy service management system with a separated frontend and backend, supporting multiple users and multiple protocols. It supports vless, vmess, shadowsocks, and hysteria2.

mosec
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backends and microservices. It bridges the gap between any machine learning model you just trained and an efficient online service API.
- Highly performant: web layer and task coordination built with Rust 🦀, which offers blazing speed in addition to efficient CPU utilization powered by async I/O
- Ease of use: user interface purely in Python 🐍, by which users can serve their models in an ML framework-agnostic manner using the same code as they do for offline testing
- Dynamic batching: aggregate requests from different users for batched inference and distribute results back
- Pipelined stages: spawn multiple processes for pipelined stages to handle CPU/GPU/IO mixed workloads
- Cloud friendly: designed to run in the cloud, with model warmup, graceful shutdown, and Prometheus monitoring metrics, easily managed by Kubernetes or any container orchestration system
- Do one thing well: focus on the online serving part; users can pay attention to model optimization and business logic

llm-code-interpreter
The 'llm-code-interpreter' repository is a deprecated plugin that provides a code interpreter on steroids for ChatGPT by E2B. It gives ChatGPT access to a sandboxed cloud environment with capabilities like running any code, accessing Linux OS, installing programs, using filesystem, running processes, and accessing the internet. The plugin exposes commands to run shell commands, read files, and write files, enabling various possibilities such as running different languages, installing programs, starting servers, deploying websites, and more. It is powered by the E2B API and is designed for agents to freely experiment within a sandboxed environment.

pezzo
Pezzo is a fully cloud-native and open-source LLMOps platform that allows users to observe and monitor AI operations, troubleshoot issues, save costs and latency, collaborate, manage prompts, and deliver AI changes instantly. It supports various clients for prompt management, observability, and caching. Users can run the full Pezzo stack locally using Docker Compose, with prerequisites including Node.js 18+, Docker, and a GraphQL Language Feature Support VSCode Extension. Contributions are welcome, and the source code is available under the Apache 2.0 License.

learn-generative-ai
Learn Cloud Applied Generative AI Engineering (GenEng) is a course focusing on the application of generative AI technologies in various industries. The course covers topics such as the economic impact of generative AI, the role of developers in adopting and integrating generative AI technologies, and the future trends in generative AI. Students will learn about tools like OpenAI API, LangChain, and Pinecone, and how to build and deploy Large Language Models (LLMs) for different applications. The course also explores the convergence of generative AI with Web 3.0 and its potential implications for decentralized intelligence.

gcloud-aio
This repository contains shared codebase for two projects: gcloud-aio and gcloud-rest. gcloud-aio is built for Python 3's asyncio, while gcloud-rest is a threadsafe requests-based implementation. It provides clients for Google Cloud services like Auth, BigQuery, Datastore, KMS, PubSub, Storage, and Task Queue. Users can install the library using pip and refer to the documentation for usage details. Developers can contribute to the project by following the contribution guide.

fluid
Fluid is an open source Kubernetes-native Distributed Dataset Orchestrator and Accelerator for data-intensive applications, such as big data and AI applications. It implements dataset abstraction, scalable cache runtime, automated data operations, elasticity and scheduling, and is runtime platform agnostic. Key concepts include Dataset and Runtime. Prerequisites include Kubernetes version > 1.16, Golang 1.18+, and Helm 3. The tool offers features like accelerating remote file accessing, machine learning, accelerating PVC, preloading dataset, and on-the-fly dataset cache scaling. Contributions are welcomed, and the project is under the Apache 2.0 license with a vendor-neutral approach.

aiges
AIGES is a core component of the Athena Serving Framework, designed as a universal encapsulation tool for AI developers to deploy AI algorithm models and engines quickly. By integrating AIGES, you can deploy AI algorithm models and engines rapidly and host them on the Athena Serving Framework, utilizing supporting auxiliary systems for networking, distribution strategies, data processing, etc. The Athena Serving Framework aims to accelerate the cloud service of AI algorithm models and engines, providing multiple guarantees for cloud service stability through cloud-native architecture. You can efficiently and securely deploy, upgrade, scale, operate, and monitor models and engines without focusing on underlying infrastructure and service-related development, governance, and operations.