apo

APO is a comprehensive observability platform that combines OpenTelemetry with eBPF and leverages LLMs to enable automated analysis and troubleshooting 🚀.

Stars: 277

AutoPilot Observability (APO) is an out-of-the-box observability platform with one-click installation. APO's OneAgent installs tracing probes with one click and no configuration, and collects logs from the scene of application faults, infrastructure metrics, network metrics for applications and their downstream dependencies, and Kubernetes events. It also collects causality metrics using eBPF. APO bundles OpenTelemetry probes, otel-collector, Jaeger, ClickHouse, and VictoriaMetrics, which reduces configuration work for users. By integrating eBPF with the OpenTelemetry ecosystem, APO significantly reduces data storage volume. It also offers eBPF-based guided troubleshooting that helps users pinpoint fault causes on a single page.

README:

APO - The Intelligent Observability Platform

Visit autopilotobservability.com for more details.

🚀 Introduction

APO (AutoPilot Observability) redefines modern observability by seamlessly combining AI and deep system insights. Powered by state-of-the-art Large Language Models (LLMs), APO empowers teams to decode complex system behaviors, rapidly pinpoint root causes, and automate diagnostic workflows. APO’s agentic AI workflows, built on a purpose-designed data plane, put you in control, enabling custom automated diagnostics that fit your unique needs.

APO offers the following highlights:

Agentic Workflows for Observability: Low-code orchestration that transforms your expertise into the dynamic core powering the intelligent agents.
LLM-native data plane: A data plane designed for LLMs and deeply integrated with AI agents.
Seamless Data Source Integration: Supports frictionless connection to existing data sources, empowering users to leverage our revolutionary data plane without any system modifications.
Full-Stack Coverage: Monitor logs, traces, and metrics seamlessly across your entire technology stack for comprehensive observability.
10X Cost-Effective: Save operational costs through streamlined processes, intelligent data handling, and efficient resource allocation.

✨ Features

Agentic Workflows for Observability

Low-code orchestration that transforms your expertise into the dynamic core powering the intelligent agents.

Design a personalized AI agent for observability.
Build troubleshooting workflows with a guided experience.
Customize automated diagnostic workflows.
Experience advanced cross-domain data correlation.

Out-of-the-Box Troubleshooting Workflows

APO comes with a variety of built-in intelligent workflows. You can customize your own workflows with your expertise to enable automated troubleshooting and intelligent operations.

APO has integrated expert knowledge into its workflows, with "Alert Events" featuring two deeply integrated workflows: Alert Validity Analysis and Root Cause Analysis. These workflows automatically analyze alert causes and reduce the workload of alert handling.

Alert Validity Analysis Workflow: This workflow helps you identify which alerts require immediate attention among numerous notifications. With its assistance, you can quickly focus on critical alerts. Additionally, you can design more sensitive alert rules to gather more context information when incidents occur, which will aid in subsequent troubleshooting.
Root Cause Analysis Workflow: When an alert is received, this workflow automatically retrieves alert context, such as related hosts, services, or pods, searches their metrics and anomalies, and conducts comprehensive root cause analysis using Polaris metrics to help you resolve incidents faster.

All built-in workflows can be modified according to your specific needs and scenarios.
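
The Root Cause Analysis steps described above (retrieve the alert context, look up metrics for the related entities, rank the anomalies) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not APO's implementation: the `Alert` and `Finding` types, the `find_anomalies` and `analyze_root_cause` functions, the metric names, and the static thresholds are all hypothetical, and the sketch uses neither Polaris metrics nor an LLM.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    # Entities the alert refers to, e.g. hosts, services, or pods (hypothetical IDs).
    entities: list[str]


@dataclass
class Finding:
    entity: str
    metric: str
    value: float
    threshold: float


def find_anomalies(entity, metrics, thresholds):
    """Return a Finding for every metric of `entity` that exceeds its threshold."""
    findings = []
    for metric, value in metrics.get(entity, {}).items():
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            findings.append(Finding(entity, metric, value, limit))
    return findings


def analyze_root_cause(alert, metrics, thresholds):
    """Collect anomalies for the alert's entities, largest relative excess first.

    Assumes every threshold is a positive number.
    """
    findings = []
    for entity in alert.entities:
        findings.extend(find_anomalies(entity, metrics, thresholds))
    # Rank by how far each value exceeds its threshold, relative to the threshold.
    findings.sort(key=lambda f: (f.value - f.threshold) / f.threshold, reverse=True)
    return findings


if __name__ == "__main__":
    # Illustrative data only; real inputs would come from the platform's data sources.
    metrics = {
        "pod/checkout-1": {"cpu_usage": 0.95, "error_rate": 0.02},
        "host/node-a": {"cpu_usage": 0.40, "disk_latency_ms": 30.0},
    }
    thresholds = {"cpu_usage": 0.80, "error_rate": 0.05, "disk_latency_ms": 20.0}
    alert = Alert("HighLatency", ["pod/checkout-1", "host/node-a"])
    for finding in analyze_root_cause(alert, metrics, thresholds):
        print(finding)
```

Ranking by relative excess rather than absolute difference keeps metrics with different units comparable. A real workflow would likely replace the static thresholds with the platform's own anomaly detection tools.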

Built-in Data Query and Anomaly Detection Tools

Given the abundance of multi-model data in the observability domain, APO provides a suite of data query and anomaly detection tools that everyone can simply drag and drop to use.

Result Verifiability

To prevent unverifiable results caused by large model hallucinations, we offer visual data charts during workflow execution. You can view execution results and data charts at every step. Additionally, cross-validation with eBPF Polaris metrics and multi-source metrics further enhances result reliability.

LLM-native data plane

API-centric service map: APO provides granular visibility into API endpoints within applications, creating clear service dependency maps for specific business flows. Our intelligent similarity algorithms prevent topology sprawl by condensing similar nodes while preserving detailed information in tabular views. Navigate effortlessly between node details with intuitive click-through navigation.
Anomaly events with cross-domain data correlation: Because observability data is diverse in structure and massive in scale, feeding it directly into large models is impractical. APO’s approach transforms varied data into anomaly events and correlates them with the service map while capturing essential contextual details. This enriched data stream enables precise anomaly detection and cross-domain correlation, empowering the system to uncover subtle issues and deliver deeper, actionable insights.
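
To make the idea of reducing raw data to anomaly events concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not APO's algorithm: the `AnomalyEvent` type, the `to_anomaly_events` and `group_by_node` functions, and the z-score rule against a fixed baseline are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AnomalyEvent:
    node: str        # service-map node (e.g. service plus endpoint) the event belongs to
    metric: str
    timestamp: int
    value: float
    baseline: float  # baseline mean the value was compared against


def to_anomaly_events(node, metric, samples, z_threshold=3.0, baseline_size=5):
    """Turn (timestamp, value) samples into events for points far from a baseline.

    The first `baseline_size` samples define the baseline mean and standard
    deviation; later samples whose z-score exceeds `z_threshold` become events.
    """
    baseline_values = [value for _, value in samples[:baseline_size]]
    if not baseline_values:
        return []
    mu = mean(baseline_values)
    sigma = pstdev(baseline_values)
    events = []
    for ts, value in samples[baseline_size:]:
        if sigma == 0:
            # A flat baseline: any deviation at all counts as anomalous.
            is_anomalous = value != mu
        else:
            is_anomalous = abs(value - mu) / sigma > z_threshold
        if is_anomalous:
            events.append(AnomalyEvent(node, metric, ts, value, mu))
    return events


def group_by_node(events):
    """Index events by service-map node so each node carries its own context."""
    by_node = {}
    for event in events:
        by_node.setdefault(event.node, []).append(event)
    return by_node


if __name__ == "__main__":
    # Illustrative latency samples in milliseconds, one per interval.
    samples = [(0, 100.0), (1, 102.0), (2, 98.0), (3, 101.0), (4, 99.0),
               (5, 100.0), (6, 400.0)]
    events = to_anomaly_events("checkout:/api/pay", "latency_ms", samples)
    print(group_by_node(events))
```

The point of the sketch is the shape of the output: a handful of small, node-tagged events is far easier to hand to a large model than the full sample stream.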

Zero-Touch Tracing Agent Instrumentation

With OneAgent technology, APO supports the automatic instrumentation of multi-language OpenTelemetry agents across traditional and containerized environments, eliminating manual configuration overhead.

All-in-One Observability Hub

Experience complete visibility with APO's unified platform, bringing together traces, metrics, logs, and events in one cohesive view.

Rapid Fault Chain Analysis

APO's intelligent correlation of delay patterns, error rates, and log anomalies quickly surfaces relevant time windows for detailed investigation through logs and traces.
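
One simple way to picture this kind of correlation is to flag the intervals in which several signals fire at once. The Python sketch below does that; it is a hypothetical illustration rather than APO's method, and the `suspicious_windows` function, its parameters, and the fixed limits are assumptions made for the example.

```python
def suspicious_windows(latency_ms, error_rate, log_anomalies,
                       latency_limit, error_limit, min_signals=2):
    """Return (start, end) index ranges where at least `min_signals` signals fire.

    The three inputs are equal-length per-interval series, for example one
    value per minute. `log_anomalies` holds a count of anomalous log lines.
    """
    if not (len(latency_ms) == len(error_rate) == len(log_anomalies)):
        raise ValueError("series must have equal length")
    windows = []
    start = None
    for i, (lat, err, logs) in enumerate(zip(latency_ms, error_rate, log_anomalies)):
        # Booleans add up as 0 or 1, so this counts how many signals fire.
        signals = (lat > latency_limit) + (err > error_limit) + (logs > 0)
        if signals >= min_signals:
            if start is None:
                start = i
        elif start is not None:
            windows.append((start, i - 1))
            start = None
    if start is not None:
        windows.append((start, len(latency_ms) - 1))
    return windows


if __name__ == "__main__":
    # Illustrative per-minute series.
    latency = [100, 120, 900, 950, 110, 800]
    errors = [0.0, 0.01, 0.2, 0.3, 0.0, 0.0]
    logs = [0, 0, 3, 0, 0, 2]
    print(suspicious_windows(latency, errors, logs, latency_limit=500, error_limit=0.05))
```

The returned index ranges are the time windows worth opening in logs and traces for closer inspection.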

📊 Why APO?

Traditional Observability Tools	APO
Data overload and manual analysis	Simplified, actionable insights
Limited automation and customization	Fully customizable, automated workflows
Complicated agent installation	Zero-touch tracing agent instrumentation
Black-box AIOps with poor explainability	Transparent, explainable recommendations
Vendor lock-in	Open source and extensible design

🔧 Getting Started

Begin your journey with APO here.

📘 Documentation

Explore our comprehensive guides here.

🌐 Contributing

APO is open source, and we welcome contributions! Whether it’s fixing bugs, adding new features, or improving documentation, your input is valuable. Here’s how you can contribute:

Fork the repository.
Create a feature branch.
Commit your changes and push.
Submit a pull request with detailed explanations.

🛡️ License

APO is licensed under the Apache-2.0 License.

❤️ Join Our Community

Join the growing community of developers and engineers transforming observability with APO. Connect with us:

Slack: Join our Slack
GitHub: GitHub

Ready to transform your observability? Start with APO today! 🚀

For Tasks:

analyze fault causes, install OpenTelemetry probes, locate performance issues, monitor network metrics, troubleshoot application faults

For Jobs:

observability engineer, site reliability engineer, DevOps engineer, system administrator, cloud infrastructure architect

Alternative AI tools for apo

Similar Open Source Tools

agentUniverse

agentUniverse is a framework for developing multi-agent applications powered by large language models. It provides the essential components for building a single agent, along with a multi-agent collaboration mechanism for customizing collaboration patterns. Developers can easily construct multi-agent applications and share pattern practices from different fields. The framework includes pre-installed collaboration patterns such as PEER and DOE for complex task breakdown and data-intensive tasks.

GitHub stars: 787

llmariner

LLMariner is an extensible open source platform built on Kubernetes to simplify the management of generative AI workloads. It enables efficient handling of training and inference data within clusters, with OpenAI-compatible APIs for seamless integration with a wide range of AI-driven applications.

GitHub stars: 63

langwatch

LangWatch is a monitoring and analytics platform designed to track, visualize, and analyze interactions with Large Language Models (LLMs). It offers real-time telemetry to optimize LLM cost and latency, a user-friendly interface for deep insights into LLM behavior, user analytics for engagement metrics, detailed debugging capabilities, and guardrails to monitor LLM outputs for issues like PII leaks and toxic language. The platform supports OpenAI and LangChain integrations, simplifying the process of tracing LLM calls and generating API keys for usage. LangWatch also provides documentation for easy integration and self-hosting options for interested users.

GitHub stars: 1.3k

languine

Languine is a CLI tool that helps developers streamline the localization process by providing AI-powered translations, automation features, and developer-centric design. It allows users to easily manage translation files, maintain consistency in tone and style, and save time by automating tasks. With support for over 100 languages and smart detection capabilities, Languine simplifies the localization workflow for developers.

GitHub stars: 1.3k

awesome-gpt-security

Awesome GPT + Security is a curated list of security tools, experimental cases, and other interesting things related to LLMs or GPT. It includes tools for integrated security, auditing, reconnaissance, offensive security, detecting security issues, preventing security breaches, social engineering, reverse engineering, investigating security incidents, fixing security vulnerabilities, assessing security posture, and more. The list also includes experimental cases, academic research, blogs, and fun projects related to GPT security. Additionally, it provides resources on GPT security standards, bypassing security policies, bug bounty programs, cracking GPT APIs, and plugin security.

GitHub stars: 459

languine

Languine is a CLI tool powered by AI that helps developers streamline the localization process by providing AI-powered translations, automation features, consistent localization, developer-centric design, and time-saving workflows. It automates the identification of translation keys, supports multiple file formats, delivers accurate translations in over 100 languages, aligns translations with the original text's tone and intent, extracts translation keys from codebase, and supports hooks for content formatting with Biome or Prettier. Languine is designed to simplify and enhance the localization experience for developers.

GitHub stars: 1.7k

agentsociety

AgentSociety is an advanced framework designed for building agents in urban simulation environments. It integrates LLMs' planning, memory, and reasoning capabilities to generate realistic behaviors. The framework supports dataset-based, text-based, and rule-based environments with interactive visualization. It includes tools for interviews, surveys, interventions, and metric recording tailored for social experimentation.

GitHub stars: 229

LAMBDA

LAMBDA is a code-free multi-agent data analysis system that utilizes large models to address data analysis challenges in complex data-driven applications. It allows users to perform complex data analysis tasks through human language instruction, seamlessly generate and debug code using two key agent roles, integrate external models and algorithms, and automatically generate reports. The system has demonstrated strong performance on various machine learning datasets, enhancing data science practice by integrating human and artificial intelligence.

GitHub stars: 344

higress

Higress is an open-source cloud-native API gateway built on the core of Istio and Envoy, based on Alibaba's internal experience with Envoy Gateway. It is designed as an AI-native API gateway and serves AI businesses such as the Tongyi Qianwen APP, the Bailian Big Model API, and the Machine Learning PAI platform. Higress provides capabilities for interfacing with LLM model vendors, AI observability, multi-model load balancing and fallback, AI token flow control, and AI caching. It can act as an AI gateway, Kubernetes Ingress gateway, microservices gateway, and security protection gateway, with advantages in production-level scalability, stream processing, extensibility, and ease of use.

GitHub stars: 4.3k

CodeFuse-muAgent

CodeFuse-muAgent is a Multi-Agent framework designed to streamline Standard Operating Procedure (SOP) orchestration for agents. It integrates toolkits, code libraries, knowledge bases, and sandbox environments for rapid construction of complex Multi-Agent interactive applications. The framework enables efficient execution and handling of multi-layered and multi-dimensional tasks.

GitHub stars: 181

img-prompt

IMGPrompt is an AI prompt editor tailored for image and video generation tools like Stable Diffusion, Midjourney, DALL·E, FLUX, and Sora. It offers a clean interface for viewing and combining prompts with translations in multiple languages. The tool includes features like smart recommendations, translation, random color generation, prompt tagging, interactive editing, categorized tag display, character count, and localization. Users can enhance their creative workflow by simplifying prompt creation and boosting efficiency.

GitHub stars: 180

ClicShopping_V3

ClicShoppingAI is a powerful open-source e-commerce solution that supports B2B, B2C, and B2B-B2C. Integrated with generative artificial intelligence systems such as GPT and Ollama, it helps merchants increase turnover and competitiveness for free. With AI capabilities, it optimizes inventory, offers personalized recommendations, and provides top-notch customer service. The solution is modular, lightweight, and user-friendly, with a seamless, responsive design for all devices. Installation is easy, and ongoing development is supported by the community. Features include GPT API integration, generative AI functionality, real-time safety stock prediction, WYSIWYG product description creation, image editor management, full SEO optimization, payment and shipping modules, an extension system, GDPR compliance, multi-language support, and more.

GitHub stars: 51

GenAI_Agents

GenAI Agents is a comprehensive repository for developing and implementing Generative AI (GenAI) agents, ranging from simple conversational bots to complex multi-agent systems. It serves as a valuable resource for learning, building, and sharing GenAI agents, offering tutorials, implementations, and a platform for showcasing innovative agent creations. The repository covers a wide range of agent architectures and applications, providing step-by-step tutorials, ready-to-use implementations, and regular updates on advancements in GenAI technology.

GitHub stars: 10.3k

project-lakechain

Project Lakechain is a cloud-native, AI-powered framework for building document processing pipelines on AWS. It provides a composable API with built-in middlewares for common tasks, scalable architecture, cost efficiency, GPU and CPU support, and the ability to create custom transform middlewares. With ready-made examples and emphasis on modularity, Lakechain simplifies the deployment of scalable document pipelines for tasks like metadata extraction, NLP analysis, text summarization, translations, audio transcriptions, computer vision, and more.

GitHub stars: 109

awesome-mlops

Awesome MLOps is a curated list of tools related to Machine Learning Operations, covering areas such as AutoML, CI/CD for Machine Learning, Data Cataloging, Data Enrichment, Data Exploration, Data Management, Data Processing, Data Validation, Data Visualization, Drift Detection, Feature Engineering, Feature Store, Hyperparameter Tuning, Knowledge Sharing, Machine Learning Platforms, Model Fairness and Privacy, Model Interpretability, Model Lifecycle, Model Serving, Model Testing & Validation, Optimization Tools, Simplification Tools, Visual Analysis and Debugging, and Workflow Tools. The repository provides a comprehensive collection of tools and resources for individuals and teams working in the field of MLOps.

GitHub stars: 3.7k

For similar jobs

felafax

Felafax is a framework designed to tune LLaMa3.1 on Google Cloud TPUs for cost efficiency and seamless scaling. It provides a Jupyter notebook for continued-training and fine-tuning open source LLMs using XLA runtime. The goal of Felafax is to simplify running AI workloads on non-NVIDIA hardware such as TPUs, AWS Trainium, AMD GPU, and Intel GPU. It supports various models like LLaMa-3.1 JAX Implementation, LLaMa-3/3.1 PyTorch XLA, and Gemma2 Models optimized for Cloud TPUs with full-precision training support.

GitHub stars: 549

llms-txt-hub

The llms.txt hub is a centralized repository for llms.txt implementations and resources, facilitating interactions between LLM-powered tools and services with documentation and codebases. It standardizes documentation access, enhances AI model interpretation, improves AI response accuracy, and sets boundaries for AI content interaction across various projects and platforms.

GitHub stars: 119

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

GitHub stars: 405

ai-on-gke

This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE). Run optimized AI/ML workloads with GKE's platform orchestration capabilities. A robust AI/ML platform considers the following layers: infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale; flexible integration with distributed computing and data processing frameworks; and support for multiple teams on the same infrastructure to maximize utilization of resources.

GitHub stars: 280

tidb

TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.

GitHub stars: 37.1k

nvidia_gpu_exporter

NVIDIA GPU exporter for Prometheus that uses the `nvidia-smi` binary to gather metrics.

GitHub stars: 1.1k

tracecat

Tracecat is an open-source automation platform for security teams. It's designed to be simple but powerful, with a focus on AI features and a practitioner-obsessed UI/UX. Tracecat can be used to automate a variety of tasks, including phishing email investigation, evidence collection, and remediation plan generation.

GitHub stars: 2.6k