awesome-AIOps

A curated list of awesome academic research and industrial materials about Artificial Intelligence for IT Operations (AIOps).

Stars: 163

awesome-AIOps is a curated list of academic research and industrial materials related to Artificial Intelligence for IT Operations (AIOps). It includes competitions, white papers, blogs, tutorials, benchmarks, tools, companies, talks, workshops, papers, and courses covering aspects of AIOps such as anomaly detection, root cause analysis, incident management, microservices, and dependency tracing.

README:

Researchers
Industrial Materials
Academic Materials
- Talks
- Workshops
Papers
Datasets
Others
- Courses

Researchers

China (& HK SAR)
Michael R. Lyu, CUHK
Dongmei Zhang, Microsoft
Pengfei Chen, SYSU
Dan Pei, Tsinghua
Xin Peng, Fudan
USA
Ryan Huang, JHU
Yingnong Dang, Microsoft
Christina Delimitrou, MIT EECS
Europe
Odej Kao, TU Berlin
Australia
Hongyu Zhang, UON

Industrial Materials

Tools

[Log Analytics] LogPAI
[AI for Cloud Operation] OpsPAI
[Outlier Detection] PyOD
[Anomaly Detection] ADTK
[Anomaly Detection] PySAD
[Online Machine Learning] River
[Online Machine Learning] scikit-multiflow
[Fault Injection] Chaos Mesh
[Fault Injection] ChaosBlade
[Container Monitoring] cAdvisor
[Performance Monitoring] Netdata
[Anomaly Detection Labeling Tool] Microsoft TagAnomaly
[Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
[Performance Testing Tool] Locust
[Alibaba Java Diagnostic Tool] Arthas

Companies

Datadog: A monitoring and security platform for cloud applications
必示 Bizseer
日志易 (Rizhiyi)
博睿数据 (Bonree Data)
听云 TINGYUN: An end-to-end application performance management system for all platforms
Loom Systems

Academic Materials

Talks

[Michael R. Lyu] Reliability-Driven AIOps for Cloud Resilience (Keynote talk at ICSE '21)

Workshops

Papers

Survey & Empirical Study

[arXiv '23] AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
[CSUR '22] Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey
[ASE '22] Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
[arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
[CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
[ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
[arXiv '20] A Systematic Mapping Study in AIOps
[ICSE '19] AIOps: Real-World Challenges and Research Innovations
[HotOS '19] What bugs cause production cloud incidents?
[ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
[ASE '13] Software analytics for incident management of online services: An experience report

Benchmarks

(Large) Language Models for IT Operations

[ISSTA '24] LILAC: Log Parsing using LLMs with Adaptive Parsing Cache
[arXiv '24] Exploring LLM-based Agents for Root Cause Analysis
[arXiv '24] Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides
[arXiv '24] Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4
[arXiv '23] Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
[arXiv '23] OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models
[arXiv '23] Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
[arXiv '23] Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study
[arXiv '23] Assess and Summarize: Improve Outage Understanding with Large Language Models
[arXiv '23] Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering
[arXiv '23] Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
[SoCC '19] A System-Wide Debugging Assistant Powered by Natural Language Processing

Knowledge Graph for AIOps

[ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
[ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
[arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
[APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications

Microservices and Serverless

Dependency and Tracing

[ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
[NSDI '07] X-Trace: A Pervasive Network Tracing Framework
[HotNets '06] Discovering Dependencies for Network Management

Anomaly/Failure Detection

[ICSE '23] CONAN: Diagnosing Batch Failures for Cloud Systems
[ISSRE '22] Share or Not Share? Towards the Practicability of Deep Models for Unsupervised Anomaly Detection in Modern Online Systems [code]
[ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
[KDD '19] Time-Series Anomaly Detection Service at Microsoft
[ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
[CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning

Root Cause Analysis

Incident and Alarm Management

[USENIX ATC '23] AutoARTS: Taxonomy, Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure
[ICSE '23] Incident-aware Duplicate Ticket Aggregation for Cloud Systems
[SoCC '22] How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service
[DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
[USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
[ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
[ASE '21] Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings
[SIGCOMM '20] Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing
[ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
[ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
[ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
[ESEC/FSE '20] Real-time incident prediction for online service systems
[ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
[ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
[HotOS '19] What bugs cause production cloud incidents?
[ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
[ICSE '19] An empirical investigation of incident triage for online service systems
[WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
[KDD '14] Correlating Events with Time Series for Incident Diagnosis

Node, Disk, and Storage

[FAST '23] Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems [data]
[DSN '21] General Feature Selection for Failure Prediction in Large-scale SSD Deployment
[TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
[ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
[VLDB '20] Diagnosing root causes of intermittent slow queries in cloud databases
[USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
[NSDI '18] Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure
[ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
[USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error

VM Analysis and Management

Deployment

Datasets

[CUHK] Loghub
[Microsoft Azure] Azure Public Dataset
[Tsinghua] AIOps Challenge Dataset
[Google] Cluster Traces
[Backblaze] Hard Drive Dataset
[Baidu] SMART Dataset of PAKDD CUP 2020
[Alibaba] SSD SMART logs and failure data
[Alibaba] Alibaba Cluster Trace Program
[CloudWise] GAIA Dataset
[Huawei Cloud] Serverless traces

Others

Courses

[Coursera] Cloud-Based Network Design & Management Techniques
[Tsinghua] AIOps Course of Tsinghua

For Tasks:

analyze incidents, monitor performance, detect anomalies, manage alarms, optimize deployments

For Jobs:

data scientist, ai engineer, it operations analyst, cloud architect, systems reliability engineer

Alternative AI tools for awesome-AIOps

Similar Open Source Tools

awesome-AIOps

GitHub stars: 163

AI-System-School

AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.

GitHub stars: 2.6k

llm-continual-learning-survey

This repository is a continually updated survey of Continual Learning of Large Language Models (CL-LLMs). It covers continual pre-training, domain-adaptive pre-training, continual fine-tuning, model refinement, model alignment, multimodal LLMs, and miscellaneous aspects, and collects relevant papers for each of these areas.

GitHub stars: 215

Efficient-LLMs-Survey

This repository provides a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspectives, respectively. We hope our survey and this GitHub repository can serve as valuable resources to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

GitHub stars: 1.1k

Awesome-LLM-Compression

Awesome LLM compression research papers and tools to accelerate LLM training and inference.

GitHub stars: 1.4k

Awesome-TimeSeries-SpatioTemporal-LM-LLM

Awesome-TimeSeries-SpatioTemporal-LM-LLM is a curated list of Large (Language) Models and Foundation Models for Temporal Data, including Time Series, Spatio-temporal, and Event Data. The repository aims to summarize recent advances in Large Models and Foundation Models for Time Series and Spatio-Temporal Data with resources such as papers, code, and data. It covers various applications like General Time Series Analysis, Transportation, Finance, Healthcare, Event Analysis, Climate, Video Data, and more. The repository also includes related resources, surveys, and papers on Large Language Models, Foundation Models, and their applications in AIOps.

GitHub stars: 944

Awesome-Efficient-AIGC

This repository, Awesome Efficient AIGC, collects efficient approaches for AI-generated content (AIGC) to cope with its huge demand for computing resources. It includes efficient Large Language Models (LLMs), Diffusion Models (DMs), and more. The repository is continuously improving and welcomes contributions of works like papers and repositories that are missed by the collection.

GitHub stars: 145

LLM-Tool-Survey

This repository contains a collection of papers related to tool learning with large language models (LLMs). The papers are organized according to the survey paper 'Tool Learning with Large Language Models: A Survey'. The survey focuses on the benefits and implementation of tool learning with LLMs, covering aspects such as task planning, tool selection, tool calling, response generation, benchmarks, evaluation, challenges, and future directions in the field. It aims to provide a comprehensive understanding of tool learning with LLMs and inspire further exploration in this emerging area.

GitHub stars: 220

Awesome-LLM-Post-training

The Awesome-LLM-Post-training repository is a curated collection of influential papers, code implementations, benchmarks, and resources related to Large Language Models (LLMs) Post-Training Methodologies. It covers various aspects of LLMs, including reasoning, decision-making, reinforcement learning, reward learning, policy optimization, explainability, multimodal agents, benchmarks, tutorials, libraries, and implementations. The repository aims to provide a comprehensive overview and resources for researchers and practitioners interested in advancing LLM technologies.

GitHub stars: 1.2k

Awesome-LLM-Survey

This repository, Awesome-LLM-Survey, serves as a comprehensive collection of surveys related to Large Language Models (LLM). It covers various aspects of LLM, including instruction tuning, human alignment, LLM agents, hallucination, multi-modal capabilities, and more. Researchers are encouraged to contribute by updating information on their papers to benefit the LLM survey community.

GitHub stars: 223

Awesome-Code-LLM

GitHub stars: 2.3k

LLM-Agent-Survey

LLM-Agent-Survey is a comprehensive repository that provides a curated list of papers related to Large Language Model (LLM) agents. The repository categorizes papers based on LLM-Profiled Roles and includes high-quality publications from prestigious conferences and journals. It aims to offer a systematic understanding of LLM-based agents, covering topics such as tool use, planning, and feedback learning. The repository also includes unpublished papers with insightful analysis and novelty, marked for future updates. Users can explore a wide range of surveys, tool use cases, planning workflows, and benchmarks related to LLM agents.

GitHub stars: 113

Everything-LLMs-And-Robotics

The Everything-LLMs-And-Robotics repository is the world's largest GitHub repository focusing on the intersection of Large Language Models (LLMs) and Robotics. It provides educational resources, research papers, project demos, and Twitter threads related to LLMs, Robotics, and their combination. The repository covers topics such as reasoning, planning, manipulation, instructions and navigation, simulation frameworks, perception, and more, showcasing the latest advancements in the field.

GitHub stars: 718

Awesome-LLM-Robotics

This repository contains a curated list of papers using Large Language/Multi-Modal Models for Robotics/RL. It is organized into surveys, reasoning, planning, manipulation, instructions and navigation, and simulation frameworks.

GitHub stars: 3.5k

awesome-ai4db-paper

The 'awesome-ai4db-paper' repository is a curated paper list focusing on AI for database (AI4DB) theory, frameworks, resources, and tools for data engineers. It includes a collection of research papers related to learning-based query optimization, training data set preparation, cardinality estimation, query-driven approaches, data-driven techniques, hybrid methods, pretraining models, plan hints, cost models, SQL embedding, join order optimization, query rewriting, end-to-end systems, text-to-SQL conversion, traditional database technologies, storage solutions, learning-based index design, and a learning-based configuration advisor. The repository aims to provide a comprehensive resource for individuals interested in AI applications in the field of database management.

GitHub stars: 53

awesome-llm-security

Awesome LLM Security is a curated collection of tools, documents, and projects related to Large Language Model (LLM) security. It covers various aspects of LLM security including white-box, black-box, and backdoor attacks, defense mechanisms, platform security, and surveys. The repository provides resources for researchers and practitioners interested in understanding and safeguarding LLMs against adversarial attacks. It also includes a list of tools specifically designed for testing and enhancing LLM security.

GitHub stars: 777

For similar tasks

awesome-AIOps

GitHub stars: 163

awesome-LLM-AIOps

The 'awesome-LLM-AIOps' repository is a curated list of academic research and industrial materials related to Large Language Models (LLM) and Artificial Intelligence for IT Operations (AIOps). It covers various topics such as incident management, log analysis, root cause analysis, incident mitigation, and incident postmortem analysis. The repository provides a comprehensive collection of papers, projects, and tools related to the application of LLM and AI in IT operations, offering valuable insights and resources for researchers and practitioners in the field.

GitHub stars: 54

robusta

Robusta is a tool designed to enhance Prometheus notifications for Kubernetes environments. It offers features such as smart grouping to reduce notification spam, AI investigation for alert analysis, alert enrichment with additional data like pod logs, self-healing capabilities for defining auto-remediation rules, advanced routing options, problem detection without PromQL, change-tracking for Kubernetes resources, auto-resolve functionality, and integration with various external systems like Slack, Teams, and Jira. Users can utilize Robusta with or without Prometheus, and it can be installed alongside existing Prometheus setups or as part of an all-in-one Kubernetes observability stack.

GitHub stars: 2.7k

qdrant

Qdrant is a vector similarity search engine and vector database. It is written in Rust, which makes it fast and reliable even under high load. Qdrant can be used for a variety of applications, including semantic search, image search, product recommendations, chatbots, and anomaly detection. Its features include payload storage and filtering, hybrid search with sparse vectors, vector quantization and on-disk storage, and distributed deployment, along with highlighted features such as query planning, payload indexes, SIMD hardware acceleration, async I/O, and write-ahead logging. Qdrant is available as a fully managed cloud service or as open-source software that can be deployed on-premises.

GitHub stars: 22.9k

SynapseML

SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. It provides simple, composable, and distributed APIs for various machine learning tasks such as text analytics, vision, anomaly detection, and more. Built on Apache Spark, SynapseML allows seamless integration of models into existing workflows. It supports training and evaluation on single-node, multi-node, and resizable clusters, enabling scalability without resource wastage. Compatible with Python, R, Scala, Java, and .NET, SynapseML abstracts over different data sources for easy experimentation. It requires Scala 2.12, Spark 3.4+, and Python 3.8+.

GitHub stars: 5.0k

mlx-vlm

MLX-VLM is a package designed for running Vision LLMs on Mac systems using MLX. It provides a convenient way to install and utilize the package for processing large language models related to vision tasks. The tool simplifies the process of running LLMs on Mac computers, offering a seamless experience for users interested in leveraging MLX for vision-related projects.

GitHub stars: 1.1k

Java-AI-Book-Code

The Java-AI-Book-Code repository contains code examples for the 2020 edition of 'Practical Artificial Intelligence With Java'. It is a comprehensive update of the previous 2013 edition, featuring new content on deep learning, knowledge graphs, anomaly detection, linked data, genetic algorithms, search algorithms, and more. The repository serves as a valuable resource for Java developers interested in AI applications and provides practical implementations of various AI techniques and algorithms.

GitHub stars: 244

Awesome-AI-Data-Guided-Projects

A curated list of data science & AI guided projects to start building your portfolio. The repository contains guided projects covering various topics such as large language models, time series analysis, computer vision, natural language processing (NLP), and data science. Each project provides detailed instructions on how to implement specific tasks using different tools and technologies.

GitHub stars: 83

For similar jobs

sweep

Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

GitHub stars: 7.1k

teams-ai

The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

GitHub stars: 502

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

GitHub stars: 159

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

GitHub stars: 620

chatbot-ui

Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

GitHub stars: 27.7k

BricksLLM

BricksLLM is a cloud-native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise-level infrastructure that can power any LLM production use case. Here are some use cases for BricksLLM:
- Set LLM usage limits for users on different pricing tiers
- Track LLM usage on a per-user and per-organization basis
- Block or redact requests containing PII
- Improve LLM reliability with failovers, retries and caching
- Distribute API keys with rate limits and cost limits for internal development/production use cases
- Distribute API keys with rate limits and cost limits for students

GitHub stars: 953

uAgents

uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

GitHub stars: 1.3k

griptape

Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.

GitHub stars: 2.2k
