awesome-AIOps
A curated list of awesome academic research and industrial materials about Artificial Intelligence for IT Operations (AIOps).
Stars: 163
awesome-AIOps is a curated list of academic research and industrial materials related to Artificial Intelligence for IT Operations (AIOps). It includes competitions, white papers, blogs, tutorials, benchmarks, tools, companies, academic materials, talks, workshops, papers, and courses covering aspects of AIOps such as anomaly detection, root cause analysis, incident management, microservices, and dependency tracing.
README:
China (& HK SAR) | | | |
---|---|---|---|
Michael R. Lyu, CUHK | Dongmei Zhang, Microsoft | Pengfei Chen, SYSU | Dan Pei, Tsinghua |
Xin Peng, Fudan | | | |
USA | | | |
Ryan Huang, JHU | Yingnong Dang, Microsoft | Christina Delimitrou, MIT EECS | |
Europe | | | |
Odej Kao, TU Berlin | | | |
Australia | | | |
Hongyu Zhang, UON | | | |
- [AIOps Challenge] A series of AIOps competitions hosted by Tsinghua University
- [PAKDD2020] Alibaba AIOps Competition
- [VMware] Proactive Incident and Problem Management
- [GREATOPS Efficient Operations Community] White paper: Enterprise-Level AIOps Implementation Recommendations
- [Awesome Open Source] Aiops Handbook
- [Moogsoft] What is AIOps?
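- [Tsinghua University] Dan Pei (Tsinghua): 15 Principles for Putting AIOps into Practice
- [Tsinghua University] Dan Pei (Tsinghua): The Last Mile of Delivering AIOps Results
- [Alibaba Cloud] Intelligent Network Analysis Based on Big Data: Qitian (齐天)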
- [Microsoft] Advancing Azure service quality with artificial intelligence: AIOps
- [Grafana] GrafanaCON: Grafana Observability Conference 2022
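- [InfoQ] Will 2023 Be the "Breakout Year" for Observability Demand?
- [Alibaba] Alibaba Cloud's Zhang Jianfeng on the New Computing Architecture: The Cloud Is Reshaping the World of Hardware, Software, and Devices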
- [Cornell] DeathStarBench (An open-source benchmark suite for cloud microservices)
- [Google Cloud] Online Boutique (A microservices demo application)
- [Fudan] Train Ticket (A benchmark microservice system)
- [Weaveworks] Sock Shop (A microservices demo application)
- [Log Analytics] LogPAI
- [AI for Cloud Operation] OpsPAI
- [Outlier Detection] PyOD
- [Anomaly Detection] ADTK
- [Anomaly Detection] PySAD
- [Online Machine Learning] River
- [Online Machine Learning] scikit-multiflow
- [Fault Injection] Chaos Mesh
- [Fault Injection] ChaosBlade
- [Container Monitoring] cAdvisor
- [Performance Monitoring] Netdata
- [Anomaly Detection Labeling Tool] Microsoft TagAnomaly
- [Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
- [Performance Testing Tool] Locust
- [Alibaba Java Diagnostic Tool] Arthas
- Datadog: A monitoring and security platform for cloud applications
- BizSeer (必示)
- Rizhiyi (日志易)
- Bonree (博睿数据)
- TINGYUN (听云): An end-to-end, full-platform application performance management system
- Loom Systems
- ICSE21 Workshop on Cloud Intelligence
- AAAI-20 Workshop on Cloud Intelligence
- AIOPS 2020 (International Workshop on Artificial Intelligence for IT Operations)
- [arXiv '23] AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
- [CSUR '22] Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey
- [ASE '22] Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
- [arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
- [CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
- [ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
- [arXiv '20] A Systematic Mapping Study in AIOps
- [ICSE '19] AIOps: Real-World Challenges and Research Innovations
- [HotOS '19] What bugs cause production cloud incidents?
- [ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
- [ASE '13] Software analytics for incident management of online services: An experience report
- [arXiv '22] Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- [ASPLOS '19] An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems
- [ISSTA '24] LILAC: Log Parsing using LLMs with Adaptive Parsing Cache
- [arXiv '24] Exploring LLM-based Agents for Root Cause Analysis
- [arXiv '24] Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides
- [arXiv '24] Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4
- [arXiv '23] Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
- [arXiv '23] OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models
- [arXiv '23] Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
- [arXiv '23] Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study
- [arXiv '23] Assess and Summarize: Improve Outage Understanding with Large Language Models
- [arXiv '23] Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering
- [arXiv '23] Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
- [SoCC '19] A System-Wide Debugging Assistant Powered by Natural Language Processing
- [ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
- [ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
- [arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
- [APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications
- [ASPLOS '21] Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
- [ICDCS '21] Defuse: A Dependency-Guided Function Scheduler to Mitigate Cold Starts on FaaS Platforms
- [FSE '20] Graph-based trace analysis for microservice architecture understanding and problem diagnosis
- [OSDI '20] FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices
- [ESEC/FSE '19] Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs
- [TSE '18] Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
- [ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
- [NSDI '07] X-Trace: A Pervasive Network Tracing Framework
- [HotNets '06] Discovering Dependencies for Network Management
- [ICSE '23] CONAN: Diagnosing Batch Failures for Cloud Systems
- [ISSRE '22] Share or Not Share? Towards the Practicability of Deep Models for Unsupervised Anomaly Detection in Modern Online Systems [code]
- [ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
- [KDD '19] Time-Series Anomaly Detection Service at Microsoft
- [ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
- [CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
- [SIGCOMM '23] Murphy: Performance Diagnosis of Distributed Cloud Applications
- [FSE '23] Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
- [OSDI '18] Capturing and Enhancing In Situ System Observability for Failure Detection
- [ATC '23] AutoARTS: Taxonomy, Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure
- [ICSE '23] Incident-aware Duplicate Ticket Aggregation for Cloud Systems
- [SoCC '22] How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service
- [DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
- [USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
- [ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
- [ASE '21] Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings
- [SIGCOMM '20] Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing
- [ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- [ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
- [ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
- [ESEC/FSE '20] Real-time incident prediction for online service systems
- [ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
- [ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
- [HotOS '19] What bugs cause production cloud incidents?
- [ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
- [ICSE '19] An empirical investigation of incident triage for online service systems
- [WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
- [KDD '14] Correlating Events with Time Series for Incident Diagnosis
- [FAST '23] Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems [data]
- [DSN '21] General Feature Selection for Failure Prediction in Large-scale SSD Deployment
- [TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
- [ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
- [VLDB '20] Diagnosing root causes of intermittent slow queries in cloud databases
- [USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
- [NSDI '18] Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure
- [ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
- [USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error
- [NSDI '22] CloudCluster: Unearthing the Functional Structure of a Cloud Service
- [OSDI '20] Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
- [SOSP '21] Understanding and Detecting Software Upgrade Failures in Distributed Systems
- [NSDI '20] Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
- [CUHK] Loghub
- [Microsoft Azure] Azure Public Dataset
- [Tsinghua] AIOps Challenge Dataset
- [Google] Cluster Traces
- [Backblaze] Hard Drive Dataset
- [Baidu] SMART Dataset of PAKDD CUP 2020
- [Alibaba] SSD SMART logs and failure data
- [Alibaba] Alibaba Cluster Trace Program
- [CloudWise] GAIA Dataset
- [Huawei Cloud] Serverless traces
- [Coursera] Cloud-Based Network Design & Management Techniques
- [Tsinghua] AIOps Course of Tsinghua
Alternative AI tools for awesome-AIOps
Similar Open Source Tools
AI-System-School
AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.
Efficient-LLMs-Survey
This repository provides a systematic and comprehensive review of research on efficient LLMs. We organize the literature in a taxonomy of three main categories, covering distinct yet interconnected topics from **model-centric**, **data-centric**, and **framework-centric** perspectives. We hope our survey and this GitHub repository can serve as valuable resources to help researchers and practitioners gain a systematic understanding of research developments in efficient LLMs and inspire them to contribute to this field.
llm-continual-learning-survey
This repository is a continually updated survey of Continual Learning of Large Language Models (CL-LLMs). It covers topics such as continual pre-training, domain-adaptive pre-training, continual fine-tuning, model refinement, model alignment, multimodal LLMs, and miscellaneous aspects. The survey includes a collection of relevant papers, each focusing on a different area within the field.
Awesome-LLM-Compression
Awesome LLM compression research papers and tools to accelerate LLM training and inference.
Awesome_Mamba
Awesome Mamba is a curated collection of groundbreaking research papers and articles on Mamba Architecture, a pioneering framework in deep learning known for its selective state spaces and efficiency in processing complex data structures. The repository offers a comprehensive exploration of Mamba architecture through categorized research papers covering various domains like visual recognition, speech processing, remote sensing, video processing, activity recognition, image enhancement, medical imaging, reinforcement learning, natural language processing, 3D recognition, multi-modal understanding, time series analysis, graph neural networks, point cloud analysis, and tabular data handling.
LLM-Tool-Survey
This repository contains a collection of papers related to tool learning with large language models (LLMs). The papers are organized according to the survey paper 'Tool Learning with Large Language Models: A Survey'. The survey focuses on the benefits and implementation of tool learning with LLMs, covering aspects such as task planning, tool selection, tool calling, response generation, benchmarks, evaluation, challenges, and future directions in the field. It aims to provide a comprehensive understanding of tool learning with LLMs and inspire further exploration in this emerging area.
Awesome-LLM-Survey
This repository, Awesome-LLM-Survey, serves as a comprehensive collection of surveys related to Large Language Models (LLM). It covers various aspects of LLM, including instruction tuning, human alignment, LLM agents, hallucination, multi-modal capabilities, and more. Researchers are encouraged to contribute by updating information on their papers to benefit the LLM survey community.
Awesome-Code-LLM
Everything-LLMs-And-Robotics
The Everything-LLMs-And-Robotics repository is the world's largest GitHub repository focusing on the intersection of Large Language Models (LLMs) and Robotics. It provides educational resources, research papers, project demos, and Twitter threads related to LLMs, Robotics, and their combination. The repository covers topics such as reasoning, planning, manipulation, instructions and navigation, simulation frameworks, perception, and more, showcasing the latest advancements in the field.
Awesome-LLM-Robotics
This repository contains a curated list of **papers using Large Language/Multi-Modal Models for Robotics/RL**, based on a template from awesome-Implicit-NeRF-Robotics. It covers surveys, reasoning, planning, manipulation, instructions and navigation, and simulation frameworks. Contributions of additional papers are welcome via pull request or email.
awesome-llm-security
Awesome LLM Security is a curated collection of tools, documents, and projects related to Large Language Model (LLM) security. It covers various aspects of LLM security including white-box, black-box, and backdoor attacks, defense mechanisms, platform security, and surveys. The repository provides resources for researchers and practitioners interested in understanding and safeguarding LLMs against adversarial attacks. It also includes a list of tools specifically designed for testing and enhancing LLM security.
Awesome-Robotics-3D
Awesome-Robotics-3D is a curated list of 3D vision papers related to the robotics domain, focusing on large models such as LLMs and VLMs. It includes papers on policy learning, pretraining, VLMs and LLMs, representations, simulations, datasets, and benchmarks. The repository is maintained by Zubair Irshad and welcomes contributions and suggestions for adding papers. It serves as a resource for researchers and practitioners in robotics and computer vision.
ABigSurveyOfLLMs
ABigSurveyOfLLMs is a repository that compiles surveys on Large Language Models (LLMs) to provide a comprehensive overview of the field. It includes surveys on various aspects of LLMs such as transformers, alignment, prompt learning, data management, evaluation, societal issues, safety, misinformation, attributes of LLMs, efficient LLMs, learning methods for LLMs, multimodal LLMs, knowledge-based LLMs, extension of LLMs, LLMs applications, and more. The repository aims to help individuals quickly understand the advancements and challenges in the field of LLMs through a collection of recent surveys and research papers.
Awesome-Quantization-Papers
This repo contains a comprehensive paper list on **Model Quantization** for efficient deep learning, drawn from AI conferences, journals, and arXiv. As a highlight, we categorize the papers by model structure and application scenario, and label the quantization methods with keywords.
For similar tasks
awesome-LLM-AIOps
The 'awesome-LLM-AIOps' repository is a curated list of academic research and industrial materials related to Large Language Models (LLM) and Artificial Intelligence for IT Operations (AIOps). It covers various topics such as incident management, log analysis, root cause analysis, incident mitigation, and incident postmortem analysis. The repository provides a comprehensive collection of papers, projects, and tools related to the application of LLM and AI in IT operations, offering valuable insights and resources for researchers and practitioners in the field.
qdrant
Qdrant is a vector similarity search engine and vector database. It is written in Rust, which makes it fast and reliable even under high load. Qdrant can be used for a variety of applications, including semantic search, image search, product recommendations, chatbots, and anomaly detection. Its features include payload storage and filtering, hybrid search with sparse vectors, vector quantization and on-disk storage, and distributed deployment, along with highlighted features such as query planning, payload indexes, SIMD hardware acceleration, async I/O, and write-ahead logging. Qdrant is available as a fully managed cloud service or as open-source software that can be deployed on-premises.
SynapseML
SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. It provides simple, composable, and distributed APIs for various machine learning tasks such as text analytics, vision, anomaly detection, and more. Built on Apache Spark, SynapseML allows seamless integration of models into existing workflows. It supports training and evaluation on single-node, multi-node, and resizable clusters, enabling scalability without resource wastage. Compatible with Python, R, Scala, Java, and .NET, SynapseML abstracts over different data sources for easy experimentation. Requires Scala 2.12, Spark 3.4+, and Python 3.8+.
mlx-vlm
MLX-VLM is a package designed for running Vision LLMs on Mac systems using MLX. It provides a convenient way to install and utilize the package for processing large language models related to vision tasks. The tool simplifies the process of running LLMs on Mac computers, offering a seamless experience for users interested in leveraging MLX for vision-related projects.
Java-AI-Book-Code
The Java-AI-Book-Code repository contains code examples for the 2020 edition of 'Practical Artificial Intelligence With Java'. It is a comprehensive update of the previous 2013 edition, featuring new content on deep learning, knowledge graphs, anomaly detection, linked data, genetic algorithms, search algorithms, and more. The repository serves as a valuable resource for Java developers interested in AI applications and provides practical implementations of various AI techniques and algorithms.
Awesome-AI-Data-Guided-Projects
A curated list of data science & AI guided projects to start building your portfolio. The repository contains guided projects covering various topics such as large language models, time series analysis, computer vision, natural language processing (NLP), and data science. Each project provides detailed instructions on how to implement specific tasks using different tools and technologies.
AI-Security-and-Privacy-Events
AI-Security-and-Privacy-Events is a curated list of academic events focusing on AI security and privacy. It includes seminars, conferences, workshops, tutorials, special sessions, and covers various topics such as NLP & LLM Security, Privacy and Security in ML, Machine Learning Security, AI System with Confidential Computing, Adversarial Machine Learning, and more.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud-native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI, and vLLM. BricksLLM aims to provide enterprise-level infrastructure that can power any LLM production use case. Use cases include setting LLM usage limits for users on different pricing tiers; tracking LLM usage on a per-user and per-organization basis; blocking or redacting requests containing PII; improving LLM reliability with failovers, retries, and caching; and distributing API keys with rate limits and cost limits, both for internal development/production use and for students.
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.