holmesgpt
Your 24/7 On-Call AI Agent - Solve Alerts Faster with Automatic Correlations, Investigations, and More
Stars: 1328
HolmesGPT is an open-source DevOps assistant powered by OpenAI or any tool-calling LLM of your choice. It helps in troubleshooting Kubernetes, incident response, ticket management, automated investigation, and runbook automation in plain English. The tool connects to existing observability data, is compliance-friendly, provides transparent results, supports extensible data sources, runbook automation, and integrates with existing workflows. Users can install HolmesGPT using Brew, prebuilt Docker container, Python Poetry, or Docker. The tool requires an API key for functioning and supports OpenAI, Azure AI, and self-hosted LLMs.
README:
HolmesGPT is an AI agent for investigating problems in your cloud, finding the root cause, and suggesting remediations. It has dozens of built-in integrations for cloud providers, observability tools, and on-call systems.
HolmesGPT has been submitted to the CNCF as a sandbox project (view status). You can learn more about HolmesGPT's maintainers and adopters here.
How it Works |
Installation |
LLM Providers |
YouTube Demo |
HolmesGPT connects AI models with live observability data and organizational knowledge. It uses an agentic loop to analyze data from multiple sources and identify possible root causes.
HolmesGPT integrates with popular observability and cloud platforms. The following data sources ("toolsets") are built-in. Add your own.
| Data Source | Status | Notes |
|---|---|---|
| ✅ | Get status, history and manifests and more of apps, projects and clusters | |
| ✅ | Fetch events, instances, slow query logs and more | |
| ✅ | Private runbooks and documentation | |
| ✅ | Retrieve logs for any resource | |
| ✅ | Date and time-related operations | |
| ✅ | Get images, logs, events, history and more | |
| 🟡 Beta | Remediate alerts by opening pull requests with fixes | |
| 🟡 Beta | Fetches log data from datadog | |
| ✅ | Query logs for Kubernetes resources or any query | |
| ✅ | Fetch trace info, debug issues like high latency in application. | |
| ✅ | Release status, chart metadata, and values | |
| ✅ | Public runbooks, community docs etc | |
| ✅ | Fetch metadata, list consumers and topics or find lagging consumer groups | |
| ✅ | Pod logs, K8s events, and resource status (kubectl describe) | |
| 🟡 Beta | Investigate alerts, query tracing data | |
| ✅ | Query health, shard, and settings related info of one or more clusters | |
| ✅ | Investigate alerts, query metrics and generate PromQL queries | |
| ✅ | Info about partitions, memory/disk alerts to troubleshoot split-brain scenarios and more | |
| ✅ | Multi-cluster monitoring, historical change data, user-configured runbooks, PromQL graphs and more | |
| ✅ | Team knowledge base and runbooks on demand |
HolmesGPT can fetch alerts/tickets to investigate from external systems, then write the analysis back to the source or Slack.
| Integration | Status | Notes |
|---|---|---|
| Slack | 🟡 Beta | Demo. Tag HolmesGPT bot in any Slack message |
| Prometheus/AlertManager | ✅ | Robusta SaaS or HolmesGPT CLI |
| PagerDuty | ✅ | HolmesGPT CLI only |
| OpsGenie | ✅ | HolmesGPT CLI only |
| Jira | ✅ | HolmesGPT CLI only |
| GitHub | ✅ | HolmesGPT CLI only |
Read the installation documentation to learn how to install HolmesGPT.
Read the LLM Providers documentation to learn how to set up your LLM API key.
- In the Robusta SaaS: Go to platform.robusta.dev and use Holmes from your browser
- With HolmesGPT CLI: setup an LLM API key and ask Holmes a question 👇
holmes ask "what pods are unhealthy and why?"You can also provide files as context:
holmes ask "summarize the key points in this document" -f ./mydocument.txtYou can also load the prompt from a file using the --prompt-file option:
holmes ask --prompt-file ~/long-prompt.txt
Enter interactive mode to ask follow-up questions:
```bash
holmes ask "what pods are unhealthy and why?" --interactive
# or
holmes ask "what pods are unhealthy and why?" -iAlso supported:
HolmesGPT CLI: investigate Prometheus alerts
Pull alerts from AlertManager and investigate them with HolmesGPT:
holmes investigate alertmanager --alertmanager-url http://localhost:9093
# if on Mac OS and using the Holmes Docker image👇
# holmes investigate alertmanager --alertmanager-url http://docker.for.mac.localhost:9093To investigate alerts in your browser, sign up for a free trial of Robusta SaaS.
Optional: port-forward to AlertManager before running the command mentioned above (if running Prometheus inside Kubernetes)
kubectl port-forward alertmanager-robusta-kube-prometheus-st-alertmanager-0 9093:9093 &HolmesGPT CLI: investigate PagerDuty and OpsGenie alerts
holmes investigate opsgenie --opsgenie-api-key <OPSGENIE_API_KEY>
holmes investigate pagerduty --pagerduty-api-key <PAGERDUTY_API_KEY>
# to write the analysis back to the incident as a comment
holmes investigate pagerduty --pagerduty-api-key <PAGERDUTY_API_KEY> --updateFor more details, run holmes investigate <source> --help
HolmesGPT can investigate many issues out of the box, with no customization or training. Optionally, you can extend Holmes to improve results:
Custom Data Sources: Add data sources (toolsets) to improve investigations
- If using Robusta SaaS: See here
- If using the CLI: Use
-tflag with custom toolset files or add to~/.holmes/config.yaml
Custom Runbooks: Give HolmesGPT instructions for known alerts:
- If using Robusta SaaS: Use the Robusta UI to add runbooks
- If using the CLI: Use
-rflag with custom runbook files or add to~/.holmes/config.yaml
You can save common settings and API Keys in a config file to avoid passing them from the CLI each time:
Reading settings from a config file
You can save common settings and API keys in config file for re-use. Place the config file in ~/.holmes/config.yaml` or pass it using the --config
You can view an example config file with all available settings here.
HolmesGPT supports transformers to process large tool outputs before sending them to your primary LLM. This feature helps manage context window limits while preserving essential information.
The most common transformer is llm_summarize, which uses a fast secondary model to summarize lengthy outputs from tools like kubectl describe, log queries, or metrics collection.
📖 Learn more: Tool Output Transformers Documentation
By design, HolmesGPT has read-only access and respects RBAC permissions. It is safe to run in production environments.
We do not train HolmesGPT on your data. Data sent to Robusta SaaS is private to your account.
For extra privacy, bring an API key for your own AI model.
Because HolmesGPT relies on LLMs, it relies on a suite of pytest based evaluations to ensure the prompt and HolmesGPT's default set of tools work as expected with LLMs.
- Introduction to HolmesGPT's evals.
- Write your own evals.
- Use Braintrust to view analyze results (optional).
Distributed under the MIT License. See LICENSE.txt for more information.
Join our community meetings to discuss the HolmesGPT roadmap and share feedback:
📅 First Community Meeting: Thursday, August 21, 2025
- Time: 8:00-9:00 AM PT / 11:00 AM-12:00 PM ET / 8:30-9:30 PM IST
- Where: Google Meet
- Agenda: Roadmap discussion, community feedback, and Q&A
📝 Meeting Notes | 📋 Full Details
If you have any questions, feel free to message us on robustacommunity.slack.com
Please read our CONTRIBUTING.md for guidelines and instructions.
For help, contact us on Slack or ask DeepWiki AI your questions.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for holmesgpt
Similar Open Source Tools
holmesgpt
HolmesGPT is an open-source DevOps assistant powered by OpenAI or any tool-calling LLM of your choice. It helps in troubleshooting Kubernetes, incident response, ticket management, automated investigation, and runbook automation in plain English. The tool connects to existing observability data, is compliance-friendly, provides transparent results, supports extensible data sources, runbook automation, and integrates with existing workflows. Users can install HolmesGPT using Brew, prebuilt Docker container, Python Poetry, or Docker. The tool requires an API key for functioning and supports OpenAI, Azure AI, and self-hosted LLMs.
pilot
Pilot is an AI tool designed to streamline the process of handling tickets from GitHub, Linear, Jira, or Asana. It plans the implementation, writes the code, runs tests, and opens a PR for you to review and merge. With features like Autopilot, Epic Decomposition, Self-Review, and more, Pilot aims to automate the ticket handling process and reduce the time spent on prioritizing and completing tasks. It integrates with various platforms, offers intelligence features, and provides real-time visibility through a dashboard. Pilot is free to use, with costs associated with Claude API usage. It is designed for bug fixes, small features, refactoring, tests, docs, and dependency updates, but may not be suitable for large architectural changes or security-critical code.
qwen-code
Qwen Code is an open-source AI agent optimized for Qwen3-Coder, designed to help users understand large codebases, automate tedious work, and expedite the shipping process. It offers an agentic workflow with rich built-in tools, a terminal-first approach with optional IDE integration, and supports both OpenAI-compatible API and Qwen OAuth authentication methods. Users can interact with Qwen Code in interactive mode, headless mode, IDE integration, and through a TypeScript SDK. The tool can be configured via settings.json, environment variables, and CLI flags, and offers benchmark results for performance evaluation. Qwen Code is part of an ecosystem that includes AionUi and Gemini CLI Desktop for graphical interfaces, and troubleshooting guides are available for issue resolution.
runhouse
Runhouse is a tool that allows you to build, run, and deploy production-quality AI apps and workflows on your own compute. It provides simple, powerful APIs for the full lifecycle of AI development, from research to evaluation to production to updates to scaling to management, and across any infra. By automatically packaging your apps into scalable, secure, and observable services, Runhouse can also turn otherwise redundant AI activities into common reusable components across your team or company, which improves cost, velocity, and reproducibility.
Acontext
Acontext is a context data platform designed for production AI agents, offering unified storage, built-in context management, and observability features. It helps agents scale from local demos to production without the need to rebuild context infrastructure. The platform provides solutions for challenges like scattered context data, long-running agents requiring context management, and tracking states from multi-modal agents. Acontext offers core features such as context storage, session management, disk storage, agent skills management, and sandbox for code execution and analysis. Users can connect to Acontext, install SDKs, initialize clients, store and retrieve messages, perform context engineering, and utilize agent storage tools. The platform also supports building agents using end-to-end scripts in Python and Typescript, with various templates available. Acontext's architecture includes client layer, backend with API and core components, infrastructure with PostgreSQL, S3, Redis, and RabbitMQ, and a web dashboard. Join the Acontext community on Discord and follow updates on GitHub.
spacebot
Spacebot is an AI agent designed for teams, communities, and multi-user environments. It splits the monolith into specialized processes that delegate tasks, allowing it to handle concurrent conversations, execute tasks, and respond to multiple users simultaneously. Built for Discord, Slack, and Telegram, Spacebot can run coding sessions, manage files, automate web browsing, and search the web. Its memory system is structured and graph-connected, enabling productive knowledge synthesis. With capabilities for task execution, messaging, memory management, scheduling, model routing, and extensible skills, Spacebot offers a comprehensive solution for collaborative work environments.
traceroot
TraceRoot is a tool that helps engineers debug production issues 10× faster using AI-powered analysis of traces, logs, and code context. It accelerates the debugging process with AI-powered insights, integrates seamlessly into the development workflow, provides real-time trace and log analysis, code context understanding, and intelligent assistance. Features include ease of use, LLM flexibility, distributed services, AI debugging interface, and integration support. Users can get started with TraceRoot Cloud for a 7-day trial or self-host the tool. SDKs are available for Python and JavaScript/TypeScript.
goclaw
goclaw is a powerful AI Agent framework written in Go language. It provides a complete tool system for FileSystem, Shell, Web, and Browser with Docker sandbox support and permission control. The framework includes a skill system compatible with OpenClaw and AgentSkills specifications, supporting automatic discovery and environment gating. It also offers persistent session storage, multi-channel support for Telegram, WhatsApp, Feishu, QQ, and WeWork, flexible configuration with YAML/JSON support, multiple LLM providers like OpenAI, Anthropic, and OpenRouter, WebSocket Gateway, Cron scheduling, and Browser automation based on Chrome DevTools Protocol.
koog
Koog is a Kotlin-based framework for building and running AI agents entirely in idiomatic Kotlin. It allows users to create agents that interact with tools, handle complex workflows, and communicate with users. Key features include pure Kotlin implementation, MCP integration, embedding capabilities, custom tool creation, ready-to-use components, intelligent history compression, powerful streaming API, persistent agent memory, comprehensive tracing, flexible graph workflows, modular feature system, scalable architecture, and multiplatform support.
jat
JAT is a complete, self-contained environment for agentic development, offering task management, agent orchestration, code editor, git integration, and terminal access all in a single IDE. It allows users to connect various external sources like RSS, Slack, Telegram, and Gmail to create tasks and spawn agents automatically. JAT supports hands-on supervision of agents or autonomous operation. The tool provides features such as multi-agent management, task management, smart question UI, epic swarm for parallel agent spawning, autonomous triggers, task scheduling, error recovery, and a skill marketplace. JAT is designed to be a control tower for managing a swarm of agents, whether actively supervised or running autonomously.
rover
Rover is a command-line tool for managing and deploying Docker containers. It provides a simple and intuitive interface to interact with Docker images and containers, allowing users to easily build, run, and manage their containerized applications. With Rover, users can streamline their development workflow by automating container deployment and management tasks. The tool is designed to be lightweight and easy to use, making it ideal for developers and DevOps professionals looking to simplify their container management processes.
oh-my-opencode
OhMyOpenCode is a free and open-source tool that enhances coding productivity by providing an agent harness for orchestrating multiple models and tools. It offers features like background agents, LSP/AST tools, curated MCPs, and compatibility with various agents like Claude Code. The tool aims to boost productivity, automate tasks, and streamline the coding process for users. It is highly extensible and customizable, catering to both hackers and non-hackers alike, with a focus on enhancing the development experience and performance.
emqx
EMQX is a highly scalable and reliable MQTT platform designed for IoT data infrastructure. It supports various protocols like MQTT 5.0, 3.1.1, and 3.1, as well as MQTT-SN, CoAP, LwM2M, and MQTT over QUIC. EMQX allows connecting millions of IoT devices, processing messages in real time, and integrating with backend data systems. It is suitable for applications in AI, IoT, IIoT, connected vehicles, smart cities, and more. The tool offers features like massive scalability, powerful rule engine, flow designer, AI processing, robust security, observability, management, extensibility, and a unified experience with the Business Source License (BSL) 1.1.
ai-manus
AI Manus is a general-purpose AI Agent system that supports running various tools and operations in a sandbox environment. It offers deployment with minimal dependencies, supports multiple tools like Terminal, Browser, File, Web Search, and messaging tools, allocates separate sandboxes for tasks, manages session history, supports stopping and interrupting conversations, file upload and download, and is multilingual. The system also provides user login and authentication. The project primarily relies on Docker for development and deployment, with model capability requirements and recommended Deepseek and GPT models.
stagehand
Stagehand is an AI web browsing framework that simplifies and extends web automation using three simple APIs: act, extract, and observe. It aims to provide a lightweight, configurable framework without complex abstractions, allowing users to automate web tasks reliably. The tool generates Playwright code based on atomic instructions provided by the user, enabling natural language-driven web automation. Stagehand is open source, maintained by the Browserbase team, and supports different models and model providers for flexibility in automation tasks.
Sage
Sage is a production-ready, modular, and intelligent multi-agent orchestration framework for complex problem solving. It intelligently breaks down complex tasks into manageable subtasks through seamless agent collaboration. Sage provides Deep Research Mode for comprehensive analysis and Rapid Execution Mode for quick task completion. It offers features like intelligent task decomposition, agent orchestration, extensible tool system, dual execution modes, interactive web interface, advanced token tracking, rich configuration, developer-friendly APIs, and robust error recovery mechanisms. Sage supports custom workflows, multi-agent collaboration, custom agent development, agent flow orchestration, rule preferences system, message manager for smart token optimization, task manager for comprehensive state management, advanced file system operations, advanced tool system with plugin architecture, token usage & cost monitoring, and rich configuration system. It also includes real-time streaming & monitoring, advanced tool development, error handling & reliability, performance monitoring, MCP server integration, and security features.
For similar tasks
holmesgpt
HolmesGPT is an open-source DevOps assistant powered by OpenAI or any tool-calling LLM of your choice. It helps in troubleshooting Kubernetes, incident response, ticket management, automated investigation, and runbook automation in plain English. The tool connects to existing observability data, is compliance-friendly, provides transparent results, supports extensible data sources, runbook automation, and integrates with existing workflows. Users can install HolmesGPT using Brew, prebuilt Docker container, Python Poetry, or Docker. The tool requires an API key for functioning and supports OpenAI, Azure AI, and self-hosted LLMs.
comfyui_LLM_party
COMFYUI LLM PARTY is a node library designed for LLM workflow development in ComfyUI, an extremely minimalist UI interface primarily used for AI drawing and SD model-based workflows. The project aims to provide a complete set of nodes for constructing LLM workflows, enabling users to easily integrate them into existing SD workflows. It features various functionalities such as API integration, local large model integration, RAG support, code interpreters, online queries, conditional statements, looping links for large models, persona mask attachment, and tool invocations for weather lookup, time lookup, knowledge base, code execution, web search, and single-page search. Users can rapidly develop web applications using API + Streamlit and utilize LLM as a tool node. Additionally, the project includes an omnipotent interpreter node that allows the large model to perform any task, with recommendations to use the 'show_text' node for display output.
cog-comfyui
Cog-ComfyUI is a tool designed to run ComfyUI workflows on Replicate. It allows users to easily integrate their own workflows into their app or website using the Replicate API. The tool includes popular model weights and custom nodes, with the option to request more custom nodes or models. Users can get their API JSON, gather input files, and use custom LoRAs from CivitAI or HuggingFace. Additionally, users can run their workflows and set up their own dedicated instances for better performance and control. The tool provides options for private deployments, forking using Cog, or creating new models from the train tab on Replicate. It also offers guidance on developing locally and running the Web UI from a Cog container.
openclaw-mission-control
OpenClaw Mission Control is a centralized operations and governance platform for running OpenClaw across teams and organizations. It provides unified visibility, approval controls, and gateway-aware orchestration. The platform offers work orchestration, agent and gateway management, approval-driven governance, and API-backed automation. It is designed for multi-team agent operations, human-in-the-loop execution, distributed runtime control, audit and incident review, and API-backed process integration. Mission Control is suitable for platform teams running OpenClaw, operations and engineering teams needing approval and auditability controls, and organizations wanting API-accessible operations with a usable web UI.
For similar jobs
AirGo
AirGo is a front and rear end separation, multi user, multi protocol proxy service management system, simple and easy to use. It supports vless, vmess, shadowsocks, and hysteria2.
mosec
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and the efficient online service API. * **Highly performant** : web layer and task coordination built with Rust 🦀, which offers blazing speed in addition to efficient CPU utilization powered by async I/O * **Ease of use** : user interface purely in Python 🐍, by which users can serve their models in an ML framework-agnostic manner using the same code as they do for offline testing * **Dynamic batching** : aggregate requests from different users for batched inference and distribute results back * **Pipelined stages** : spawn multiple processes for pipelined stages to handle CPU/GPU/IO mixed workloads * **Cloud friendly** : designed to run in the cloud, with the model warmup, graceful shutdown, and Prometheus monitoring metrics, easily managed by Kubernetes or any container orchestration systems * **Do one thing well** : focus on the online serving part, users can pay attention to the model optimization and business logic
llm-code-interpreter
The 'llm-code-interpreter' repository is a deprecated plugin that provides a code interpreter on steroids for ChatGPT by E2B. It gives ChatGPT access to a sandboxed cloud environment with capabilities like running any code, accessing Linux OS, installing programs, using filesystem, running processes, and accessing the internet. The plugin exposes commands to run shell commands, read files, and write files, enabling various possibilities such as running different languages, installing programs, starting servers, deploying websites, and more. It is powered by the E2B API and is designed for agents to freely experiment within a sandboxed environment.
pezzo
Pezzo is a fully cloud-native and open-source LLMOps platform that allows users to observe and monitor AI operations, troubleshoot issues, save costs and latency, collaborate, manage prompts, and deliver AI changes instantly. It supports various clients for prompt management, observability, and caching. Users can run the full Pezzo stack locally using Docker Compose, with prerequisites including Node.js 18+, Docker, and a GraphQL Language Feature Support VSCode Extension. Contributions are welcome, and the source code is available under the Apache 2.0 License.
learn-generative-ai
Learn Cloud Applied Generative AI Engineering (GenEng) is a course focusing on the application of generative AI technologies in various industries. The course covers topics such as the economic impact of generative AI, the role of developers in adopting and integrating generative AI technologies, and the future trends in generative AI. Students will learn about tools like OpenAI API, LangChain, and Pinecone, and how to build and deploy Large Language Models (LLMs) for different applications. The course also explores the convergence of generative AI with Web 3.0 and its potential implications for decentralized intelligence.
gcloud-aio
This repository contains shared codebase for two projects: gcloud-aio and gcloud-rest. gcloud-aio is built for Python 3's asyncio, while gcloud-rest is a threadsafe requests-based implementation. It provides clients for Google Cloud services like Auth, BigQuery, Datastore, KMS, PubSub, Storage, and Task Queue. Users can install the library using pip and refer to the documentation for usage details. Developers can contribute to the project by following the contribution guide.
fluid
Fluid is an open source Kubernetes-native Distributed Dataset Orchestrator and Accelerator for data-intensive applications, such as big data and AI applications. It implements dataset abstraction, scalable cache runtime, automated data operations, elasticity and scheduling, and is runtime platform agnostic. Key concepts include Dataset and Runtime. Prerequisites include Kubernetes version > 1.16, Golang 1.18+, and Helm 3. The tool offers features like accelerating remote file accessing, machine learning, accelerating PVC, preloading dataset, and on-the-fly dataset cache scaling. Contributions are welcomed, and the project is under the Apache 2.0 license with a vendor-neutral approach.
aiges
AIGES is a core component of the Athena Serving Framework, designed as a universal encapsulation tool for AI developers to deploy AI algorithm models and engines quickly. By integrating AIGES, you can deploy AI algorithm models and engines rapidly and host them on the Athena Serving Framework, utilizing supporting auxiliary systems for networking, distribution strategies, data processing, etc. The Athena Serving Framework aims to accelerate the cloud service of AI algorithm models and engines, providing multiple guarantees for cloud service stability through cloud-native architecture. You can efficiently and securely deploy, upgrade, scale, operate, and monitor models and engines without focusing on underlying infrastructure and service-related development, governance, and operations.
