apo
APO is an one-stop observability platform combining OpenTelemetry with eBPF. Leveraging LLM capabilities to enable auto-pilot analyzing and troubleshooting 🚀.
Stars: 226
AutoPilot Observability (APO) is an out-of-the-box observability platform that provides one-click installation and ready-to-use capabilities. APO's OneAgent supports one-click configuration-free installation of Tracing probes, collects application fault scene logs, infrastructure metrics, network metrics of applications and downstream dependencies, and Kubernetes events. It supports collecting causality metrics based on eBPF implementation. APO integrates OpenTelemetry probes, otel-collector, Jaeger, ClickHouse, and VictoriaMetrics, reducing user configuration work. APO innovatively integrates eBPF technology with the OpenTelemetry ecosystem, significantly reducing data storage volume. It offers guided troubleshooting using eBPF technology to assist users in pinpointing fault causes on a single page.
README:
欢迎访问 在线 Demo 体验产品。更多详情请查看官方网站。
- 开箱即用的可观测性平台:APO 致力于提供一键安装、开箱即用的可观测性平台。APO 的 OneAgent 支持一键免配置安装 Tracing 探针,支持采集应用的故障现场日志、基础设施指标、应用和下游依赖的网络指标以及Kubernetes 事件,支持采集基于 eBPF 实现的北极星因果指标等数据。支持使用 Jaeger UI 查询 Tracing 数据,使用 PromQL 从 VictoriaMetrics 中查询 Metrics 数据并展示在 Grafana 上。
- 集成 eBPF 技术的 OpenTelemetry 发行版:APO 整合了 OpenTelemetry 探针、otel-collector、Jaeger、ClickHouse 和 VictoriaMetrics,通过 OneAgent 技术支持在传统服务器和容器环境自动安装多语言的 OpenTelemetry 探针,减少用户配置工作。APO 创新性地将 eBPF 技术与 OpenTelemetry 生态融合,基于回溯采样算法,大大降低了数据的存储量。
- 基于 eBPF 的向导式排障产品:AutoPilot 表示辅助用户定界定位故障原因。APO 提供了向导式辅助用户定界定位故障原因的界面,通过使用 APO 的 OneAgent 安装 OpenTelemetry 探针,OneAgent 能够分析环境故障,基于 eBPF 技术保留故障现场数据并展示在向导式排障界面上。eBPF 技术将向导式排障界面辅助用户在一个页面上完成故障原因的定界定位。
通过 OneAgent 技术,支持在传统服务器和容器环境自动安装多语言的 OpenTelemetry 探针,减少用户配置工作。
APO 集成了链路、指标、日志和事件等数据,能够一站式解决可观测性和故障定位的需求。
APO 将相同应用的不同接口调用区分开,清楚地给出应用执行某类业务时的调用关系,相同的应用节点可能会按照调用顺序出现多次。完整拓扑结构太复杂,没有实现拓扑本身应该具有的“地图导航”引导用户找到疑似故障节点的功能,因此 APO 利用延时曲线相似度来收缩相似度较低的节点,更多节点采用表格形式展示,避免拓扑过于复杂无法分析。当用户需要查看下游依赖节点时,可以点击节点名快速切换到不同节点的详情页面。
在请求延时发生故障时,很多节点都会被级联的影响到,从传统告警中看是很多节点都有告警,在APO中,每个节点都会将其下游依赖的延时进行相似度曲线匹配,从而找到延时最相似的节点,最相似的节点是根因的可疑性更高,这里的下游依赖包括直接下游和下游依赖的依赖。
单纯的分析链路数据会留下很多盲区,难以快速判断延时升高时是自身导致还是依赖导致。北极星因果指标主因算法能够直接给出延时波动是由何种原因导致的,给出了故障原因的方向。例如下图给出的主因是对外网络调用延时变化导致了应用延时变化,结合网络延时指标可以判断出原因到底是网络延时变化还是下游节点延时变化。 利用eBPF技术采集的北极星因果指标直接给出故障定位方向,大致分为以下几个方向:
- 代码嵌套循环太多
- 代码锁(垃圾回收、数据连接池争抢锁)
- 存储故障(磁盘故障)
- 下游依赖(中间件故障、服务故障);需要根据网络指标排除网络质量故障
- CPU资源不足(其它程序抢占了CPU或者容器的limit配置不合理)
- 内存不足
根据延时、错误率和日志错误数量曲线可以快速定位故障可能发生时间点,从而查看时间点附近的日志或链路数据。
故障现场的热数据 vs. 传统可观测性数据
- 对于异常(慢或者错)链路数据进行回溯采样,同时保留同时间段的日志(称之为故障现场日志),这些故障现场的数据在 APO 中被称之为热数据。回溯采样算法可以参考算法论文。
- 传统可观测性的保留方式称之为冷数据保留方式。
基于故障现场的热数据关联带来了不同数据之间跳转的极致流畅的体验。此外基于故障现场的热数据为未来优化存储、节约可观测性的成本带来了更多的可能性。
APO 利用 eBPF 技术为原始数据生成并保存了“索引”数据,“索引“数据的大小明显小于原始数据,即使原始数据膨胀,APO 仍然能够通过查询索引数据快速给出可观测性数据结果。同时利用 eBPF 技术在内核中为不同类型的数据增加元信息标签,增强数据间的关联性,使用户在不同数据之间能够无缝跳转。
相比于其他可观测性方案,APO 在保存相同时间周期的数据时,占用更少的存储成本同时能够更快地给出数据结果。
- 日志错误数指标:传统日志告警思路是根据日志内容为“Exception”或者“错误”进行告警,但有时日志会经常输出 Exception,业务却一切正常,如果按照日志内容告警,则会发生误告警。为了避免这个问题,APO 只匹配“Exception”或者“错误”日志,并将这些数量变成指标,如果该指标短时间飙升,可以起到日志告警的作用,提示用户程序执行过程有问题。
- 网络质量的事件化:APO 基于 ping 包时延的方式来快速判断网络质量好坏,如果ping包延时超出预定阀值,APO页面上的网络质量状态会给出提示。
- 北极星因果指标:北极星因果指标是一个应用执行业务过程中,各资源类型延时分布的指标化,基于延时分布能够利用算法判断延时的飙升是应用自身的问题还是依赖的问题。应用自身的问题也能给出故障原因的方向,是代码循环、锁还是资源不足等原因。
- 告警信息的关联:传统告警信息与可观测性数据的联动很少,APO中将基础设施状态、网络质量状态、K8s 事件状态转换成示警信息,帮助用户在同一个界面中更快地判断故障原因。
查看安装文档快速开始使用APO。
请查看官方文档.
下图为 APO 组件的数据流图:
详细的架构与组件介绍请查看架构文档。
如果您在使用过程中有任何问题,欢迎通过下方的方式联系我们:
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for apo
Similar Open Source Tools
apo
AutoPilot Observability (APO) is an out-of-the-box observability platform that provides one-click installation and ready-to-use capabilities. APO's OneAgent supports one-click configuration-free installation of Tracing probes, collects application fault scene logs, infrastructure metrics, network metrics of applications and downstream dependencies, and Kubernetes events. It supports collecting causality metrics based on eBPF implementation. APO integrates OpenTelemetry probes, otel-collector, Jaeger, ClickHouse, and VictoriaMetrics, reducing user configuration work. APO innovatively integrates eBPF technology with the OpenTelemetry ecosystem, significantly reducing data storage volume. It offers guided troubleshooting using eBPF technology to assist users in pinpointing fault causes on a single page.
LLMForEverybody
LLMForEverybody is a comprehensive repository covering various aspects of large language models (LLMs) including pre-training, architecture, optimizers, activation functions, attention mechanisms, tokenization, parallel strategies, training frameworks, deployment, fine-tuning, quantization, GPU parallelism, prompt engineering, agent design, RAG architecture, enterprise deployment challenges, evaluation metrics, and current hot topics in the field. It provides detailed explanations, tutorials, and insights into the workings and applications of LLMs, making it a valuable resource for researchers, developers, and enthusiasts interested in understanding and working with large language models.
AIBotPublic
AIBotPublic is an open-source version of AIBotPro, a comprehensive AI tool that provides various features such as knowledge base construction, AI drawing, API hosting, and more. It supports custom plugins and parallel processing of multiple files. The tool is built using bootstrap4 for the frontend, .NET6.0 for the backend, and utilizes technologies like SqlServer, Redis, and Milvus for database and vector database functionalities. It integrates third-party dependencies like Baidu AI OCR, Milvus C# SDK, Google Search, and more to enhance its capabilities.
Awesome-Lists
Awesome-Lists is a curated list of awesome lists across various domains of computer science and beyond, including programming languages, web development, data science, and more. It provides a comprehensive index of articles, books, courses, open source projects, and other resources. The lists are organized by topic and subtopic, making it easy to find the information you need. Awesome-Lists is a valuable resource for anyone looking to learn more about a particular topic or to stay up-to-date on the latest developments in the field.
chatgpt-plus
ChatGPT-PLUS is an open-source AI assistant solution based on AI large language model API, with a built-in operational management backend for easy deployment. It integrates multiple large language models from platforms like OpenAI, Azure, ChatGLM, Xunfei Xinghuo, and Wenxin Yanyan. Additionally, it includes MidJourney and Stable Diffusion AI drawing features. The system offers a complete open-source solution with ready-to-use frontend and backend applications, providing a seamless typing experience via Websocket. It comes with various pre-trained role applications such as Xiaohongshu writer, English translation master, Socrates, Confucius, Steve Jobs, and weekly report assistant to meet various chat and application needs. Users can enjoy features like Suno Wensheng music, integration with MidJourney/Stable Diffusion AI drawing, personal WeChat QR code for payment, built-in Alipay and WeChat payment functions, support for various membership packages and point card purchases, and plugin API integration for developing powerful plugins using large language model functions.
geekai
GeekAI is an open-source AI assistant solution based on AI large language model API, featuring a complete system with ready-to-use front-end and back-end management, providing a seamless typing experience via Websocket. It integrates various pre-trained character applications like Xiaohongshu writing assistant, English translation master, Socrates, Confucius, Steve Jobs, and weekly report assistant. The tool supports multiple large language models from platforms like OpenAI, Azure, Wenxin Yanyan, Xunfei Xinghuo, and Tsinghua ChatGLM. Additionally, it includes MidJourney and Stable Diffusion AI drawing functionalities for creating various artworks such as text-based images, face swapping, and blending images. Users can utilize personal WeChat QR codes for payment without the need for enterprise payment channels, and the tool offers integrated payment options like Alipay and WeChat Pay with support for multiple membership packages and point card purchases. It also features a plugin API for developing powerful plugins using large language model functions, including built-in plugins for Weibo hot search, today's headlines, morning news, and AI drawing functions.
Awesome-Lists-and-CheatSheets
Awesome-Lists is a curated index of selected resources spanning various fields including programming languages and theories, web and frontend development, server-side development and infrastructure, cloud computing and big data, data science and artificial intelligence, product design, etc. It includes articles, books, courses, examples, open-source projects, and more. The repository categorizes resources according to the knowledge system of different domains, aiming to provide valuable and concise material indexes for readers. Users can explore and learn from a wide range of high-quality resources in a systematic way.
KrillinAI
KrillinAI is a video subtitle translation and dubbing tool based on AI large models, featuring speech recognition, intelligent sentence segmentation, professional translation, and one-click deployment of the entire process. It provides a one-stop workflow from video downloading to the final product, empowering cross-language cultural communication with AI. The tool supports multiple languages for input and translation, integrates features like automatic dependency installation, video downloading from platforms like YouTube and Bilibili, high-speed subtitle recognition, intelligent subtitle segmentation and alignment, custom vocabulary replacement, professional-level translation engine, and diverse external service selection for speech and large model services.
AI-YinMei
AI-YinMei is an AI virtual anchor Vtuber development tool (N card version). It supports fastgpt knowledge base chat dialogue, a complete set of solutions for LLM large language models: [fastgpt] + [one-api] + [Xinference], supports docking bilibili live broadcast barrage reply and entering live broadcast welcome speech, supports Microsoft edge-tts speech synthesis, supports Bert-VITS2 speech synthesis, supports GPT-SoVITS speech synthesis, supports expression control Vtuber Studio, supports painting stable-diffusion-webui output OBS live broadcast room, supports painting picture pornography public-NSFW-y-distinguish, supports search and image search service duckduckgo (requires magic Internet access), supports image search service Baidu image search (no magic Internet access), supports AI reply chat box [html plug-in], supports AI singing Auto-Convert-Music, supports playlist [html plug-in], supports dancing function, supports expression video playback, supports head touching action, supports gift smashing action, supports singing automatic start dancing function, chat and singing automatic cycle swing action, supports multi scene switching, background music switching, day and night automatic switching scene, supports open singing and painting, let AI automatically judge the content.
aippt_PresentationGen
A SpringBoot web application that generates PPT files using a llm. The tool preprocesses single-page templates and dynamically combines them to generate PPTX files with text replacement functionality. It utilizes technologies such as SpringBoot, MyBatis, MySQL, Redis, WebFlux, Apache POI, Aspose Slides, OSS, and Vue2. Users can deploy the tool by configuring various parameters in the application.yml file and setting up necessary resources like MySQL, OSS, and API keys. The tool also supports integration with open-source image libraries like Unsplash for adding images to the presentations.
Interview-for-Algorithm-Engineer
This repository provides a collection of interview questions and answers for algorithm engineers. The questions are organized by topic, and each question includes a detailed explanation of the answer. This repository is a valuable resource for anyone preparing for an algorithm engineering interview.
ComfyUI-BRIA_AI-RMBG
ComfyUI-BRIA_AI-RMBG is an unofficial implementation of the BRIA Background Removal v1.4 model for ComfyUI. The tool supports batch processing, including video background removal, and introduces a new mask output feature. Users can install the tool using ComfyUI Manager or manually by cloning the repository. The tool includes nodes for automatically loading the Removal v1.4 model and removing backgrounds. Updates include support for batch processing and the addition of a mask output feature.
ChatPDF
ChatPDF is a knowledge question and answer retrieval tool based on local LLM. It supports various open-source LLM models like ChatGLM3-6b, Chinese-LLaMA-Alpaca-2, Baichuan, YI, and multiple file formats including PDF, docx, markdown, txt. The tool optimizes RAG accuracy, Chinese chunk segmentation, embedding using text2vec's sentence embedding, retrieval matching with rank_BM25, and introduces reranker module for reranking candidate sets. It also enhances candidate chunk extension context, supports custom RAG models, and provides a Gradio-based RAG conversation page for seamless dialogue.
cf-proxy-ex
Cloudflare Proxy EX is a tool that provides Cloudflare super proxy, OpenAI/ChatGPT proxy, Github acceleration, and online proxy services. It allows users to create a worker in Cloudflare website by copying the content from worker.js file, and add their domain name before any URL to use the tool. The tool is an improvement based on gaboolic's cloudflare-reverse-proxy, offering features like removing '/proxy/', handling redirection events, modifying headers, converting relative paths to absolute paths, and more. It aims to enhance proxy functionality and address issues faced by some websites. However, users are advised not to log in to any website through the online proxy due to potential security risks.
FeedCraft
FeedCraft is a powerful tool to process your rss feeds as a middleware. Use it to translate your feed, extract fulltext, emulate browser to render js-heavy page, use llm such as google gemini to generate brief for your rss article, use natural language to filter your rss feed, and more! It is an open-source tool that can be self-deployed and used with any RSS reader. It supports AI-powered processing using Open AI compatible LLMs, custom prompt, saving rules to apply to different RSS sources, portable mode for on-the-go usage, and dock mode for advanced customization of RSS sources and processing parameters.
chatgpt.js-chrome-starter
chatgpt.js-chrome-starter is a starting point for developing Chrome extensions using chatgpt.js. It provides a template with installation instructions and tips for creating extensions that leverage the ChatGPT technology. The repository includes sample screenshots and references to advanced Chrome API methods for developers to explore.
For similar tasks
apo
AutoPilot Observability (APO) is an out-of-the-box observability platform that provides one-click installation and ready-to-use capabilities. APO's OneAgent supports one-click configuration-free installation of Tracing probes, collects application fault scene logs, infrastructure metrics, network metrics of applications and downstream dependencies, and Kubernetes events. It supports collecting causality metrics based on eBPF implementation. APO integrates OpenTelemetry probes, otel-collector, Jaeger, ClickHouse, and VictoriaMetrics, reducing user configuration work. APO innovatively integrates eBPF technology with the OpenTelemetry ecosystem, significantly reducing data storage volume. It offers guided troubleshooting using eBPF technology to assist users in pinpointing fault causes on a single page.
For similar jobs
felafax
Felafax is a framework designed to tune LLaMa3.1 on Google Cloud TPUs for cost efficiency and seamless scaling. It provides a Jupyter notebook for continued-training and fine-tuning open source LLMs using XLA runtime. The goal of Felafax is to simplify running AI workloads on non-NVIDIA hardware such as TPUs, AWS Trainium, AMD GPU, and Intel GPU. It supports various models like LLaMa-3.1 JAX Implementation, LLaMa-3/3.1 PyTorch XLA, and Gemma2 Models optimized for Cloud TPUs with full-precision training support.
apo
AutoPilot Observability (APO) is an out-of-the-box observability platform that provides one-click installation and ready-to-use capabilities. APO's OneAgent supports one-click configuration-free installation of Tracing probes, collects application fault scene logs, infrastructure metrics, network metrics of applications and downstream dependencies, and Kubernetes events. It supports collecting causality metrics based on eBPF implementation. APO integrates OpenTelemetry probes, otel-collector, Jaeger, ClickHouse, and VictoriaMetrics, reducing user configuration work. APO innovatively integrates eBPF technology with the OpenTelemetry ecosystem, significantly reducing data storage volume. It offers guided troubleshooting using eBPF technology to assist users in pinpointing fault causes on a single page.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
ai-on-gke
This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE). Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers: Infrastructure orchestration that support GPUs and TPUs for training and serving workloads at scale Flexible integration with distributed computing and data processing frameworks Support for multiple teams on the same infrastructure to maximize utilization of resources
tidb
TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
nvidia_gpu_exporter
Nvidia GPU exporter for prometheus, using `nvidia-smi` binary to gather metrics.
tracecat
Tracecat is an open-source automation platform for security teams. It's designed to be simple but powerful, with a focus on AI features and a practitioner-obsessed UI/UX. Tracecat can be used to automate a variety of tasks, including phishing email investigation, evidence collection, and remediation plan generation.
openinference
OpenInference is a set of conventions and plugins that complement OpenTelemetry to enable tracing of AI applications. It provides a way to capture and analyze the performance and behavior of AI models, including their interactions with other components of the application. OpenInference is designed to be language-agnostic and can be used with any OpenTelemetry-compatible backend. It includes a set of instrumentations for popular machine learning SDKs and frameworks, making it easy to add tracing to your AI applications.