AI-System-School
🚀 AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.
Stars: 2569
AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.
README:
- Video Tutorials [YouTube] [bilibili] [小红书]
- We are preparing a new website, [Lets Go AI], for this repo!
Path to System for AI [Whitepaper You Must Read]
A curated list of research in machine learning systems, with links to code where available. A team now maintains this project, and pull requests using our template are very welcome.
- Video System
- AutoML System
- Edge AI
- GNN System
- Federated Learning System
- Deep Reinforcement Learning System
- OSDI
- SOSP
- SIGCOMM
- NSDI
- MLSys
- ATC
- EuroSys
- Middleware
- SoCC
- TinyML
- Toward Highly Available, Intelligent Cloud and ML Systems [Slide]
- A curated list of awesome System Design articles, videos and resources for distributed computing, AKA Big Data. [GitHub]
- awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning [GitHub]
- Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
- Ananthanarayanan, Rajagopal, et al. (USENIX OpML 2019)
- How (and How Not) to Write a Good Systems Paper [Advice]
- Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
- Hazelwood, Kim, et al. (HPCA 2018)
- Infrastructure for Usable Machine Learning: The Stanford DAWN Project
- Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
- Hidden technical debt in machine learning systems [Paper]
- Sculley, David, et al. (NIPS 2015)
- End-to-end arguments in system design [Paper]
- Saltzer, Jerome H., David P. Reed, and David D. Clark. (ACM TOCS 1984)
- System Design for Large Scale Machine Learning [Thesis]
- Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
- Park, Jongsoo, Maxim Naumov, Protonu Basu, et al. (arXiv 2018)
- Summary: This paper presents a characterization of DL models and then derives new design principles for DL hardware.
- A Berkeley View of Systems Challenges for AI [Paper]
- Computer Architecture: A Quantitative Approach [Must read]
- Distributed Machine Learning Patterns [Website]
- Streaming Systems [Book]
- Kubernetes in Action (started reading) [Book]
- Machine Learning Systems: Designs that scale [Website]
- Trust in Machine Learning [Website]
- Automated Machine Learning in Action [Website]
- ScaledML 2020: Learn from the best minds in the machine learning community. [Video]
- Jeff Dean: "Achieving Rapid Response Times in Large Online Services" Keynote - Velocity 2014 [YouTube] (see the hedged-request sketch after this list)
- From Research to Production with PyTorch [Video]
- Introduction to Microservices, Docker, and Kubernetes [YouTube]
- ICML Keynote: Lessons Learned from Helping 200,000 non-ML experts use ML [Video]
- Adaptive & Multitask Learning Systems [Website]
- Systems thinking. A TED talk. [YouTube]
- Flexible systems are the next frontier of machine learning. Jeff Dean [YouTube]
- Is It Time to Rewrite the Operating System in Rust? [YouTube]
- InfoQ: AI, ML and Data Engineering [YouTube] (a good place to start)
- Netflix: Human-centric Machine Learning Infrastructure [InfoQ]
- SysML 2019: [YouTube]
- ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
- ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
- A New Golden Age for Computer Architecture History, Challenges, and Opportunities. David Patterson [YouTube]
- How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
- SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
- SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]
- AutoML Basics: Automated Machine Learning in Action. Qingquan Song, Haifeng Jin, Xia Hu [YouTube]
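A recurring theme in the Jeff Dean talks above is taming tail latency with hedged requests: if a reply has not arrived after a short delay, send a backup copy of the request to another replica and take whichever answer comes back first. A minimal asyncio sketch of the idea, with all names and latency numbers illustrative rather than taken from the talk:

```python
import asyncio
import random

async def query_replica(replica_id: int) -> str:
    # Stand-in for an RPC to one replica; latency is simulated with a heavy tail.
    await asyncio.sleep(random.expovariate(1 / 0.05))  # ~50 ms mean
    return f"result from replica {replica_id}"

async def hedged_request(replicas: list[int], hedge_after: float = 0.1) -> str:
    # Ask the primary first; if no reply within `hedge_after` seconds,
    # issue a backup request and return whichever finishes first.
    primary = asyncio.create_task(query_replica(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    backup = asyncio.create_task(query_replica(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # a real system would also cancel the RPC server-side
    return done.pop().result()

print(asyncio.run(hedged_request([0, 1])))
```

In practice the hedge delay is often set near the latency tail (e.g. around the 95th percentile) so that only a small fraction of requests ever trigger a backup.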
- CS692 Seminar: Systems for Machine Learning, Machine Learning for Systems [GitHub]
- Topics in Networks: Machine Learning for Networking and Systems, Autumn 2019 [Course Website]
- CS6465: Emerging Cloud Technologies and Systems Challenges [Cornell]
- CS294: AI For Systems and Systems For AI [UC Berkeley, Spring] (strongly recommended); see also [Machine Learning Systems (Fall 2019)]
- CSE 599W: Systems for ML. [Chen Tianqi] [University of Washington]
- EECS 598: Systems for AI (W'21). [Mosharaf Chowdhury] [Systems for AI (W'21)]
- Tutorial code on how to build your own Deep Learning System in 2k Lines [GitHub] (see the autodiff sketch after this list)
- CSE 291F: Advanced Data Analytics and ML Systems. [UCSD]
- CSci 8980: Machine Learning in Computer Systems [University of Minnesota, Twin Cities]
- Mu Li (MXNet, Parameter Server): Introduction to Deep Learning [best DL course, I think] [Book]
- 10-605: Machine Learning with Large Datasets. [CMU]
- CS 329S: Machine Learning Systems Design. [Stanford]
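Several of the courses above, and the 2k-line tutorial in particular, revolve around the same core mechanism: reverse-mode automatic differentiation over a computation graph. As a flavor of what that boils down to, here is a minimal scalar autodiff sketch; it illustrates the technique and is not code from any of the listed courses:

```python
class Value:
    """A scalar that remembers how it was computed, for reverse-mode autodiff."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # Values this node was computed from
        self._grad_fns = grad_fns  # local derivative w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Topologically sort the graph, then push gradients from output to inputs.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, fn in zip(v._parents, v._grad_fns):
                parent.grad += fn(v.grad)

x, y = Value(2.0), Value(3.0)
z = x * y + x   # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Production frameworks apply the same pattern to tensors, with fused kernels and memory planning, but the topological sort plus chain-rule accumulation above is the essence.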
- Parallelizing across multiple CPU/GPUs to speed up deep learning inference at the edge [Amazon Blog]
- Building Robust Production-Ready Deep Learning Vision Models in Minutes [Blog]
- Deploy Machine Learning Models with Keras, FastAPI, Redis and Docker [Blog]
- How to Deploy a Machine Learning Model: Creating a production-ready API using FastAPI + Uvicorn [Blog] [GitHub] (see the serving sketch after this list)
- Deploying a Machine Learning Model as a REST API [Blog]
- Continuous Delivery for Machine Learning [Blog]
- Kubernetes CheatSheets In A4 [GitHub]
- A Gentle Introduction to Kubernetes [Blog]
- Train and Deploy Machine Learning Model With Web Interface - Docker, PyTorch & Flask [GitHub]
- Learning Kubernetes, The Chinese Taoist Way [GitHub]
- Data pipelines, Luigi, Airflow: everything you need to know [Blog]
- The Deep Learning Toolset — An Overview [Blog]
- Summary of CSE 599W: Systems for ML [Chinese Blog]
- Polyaxon, Argo and Seldon for Model Training, Package and Deployment in Kubernetes [Blog]
- Overview of the different approaches to putting Machine Learning (ML) models in production [Blog]
- Being a Data Scientist does not make you a Software Engineer [Part 1]; Architecting a Machine Learning Pipeline [Part 2]
- Model Serving in PyTorch [Blog]
- Machine learning in Netflix [Medium]
- SciPy Conference Materials (slides, repo) [GitHub]
- After Spark, UC Berkeley introduces Ray, a new-generation AI compute engine [Blog]
- What knowledge background do you need to understand or work on research in machine learning / deep learning systems? [Zhihu]
- Learn Kubernetes in Under 3 Hours: A Detailed Guide to Orchestrating Containers [Blog] [GitHub]
- data-engineer-roadmap: Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups [GitHub]
- TensorFlow Serving + Docker + Tornado: fast production-grade deployment of machine learning models [Blog]
- Colossal-AI: A Unified Deep Learning System for Big Model Era [Blog] [GitHub]
- Data Engineer Roadmap [Scaler Blogs]
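Several of the deployment posts above share one pattern: load a trained model once at startup, then expose a small HTTP prediction endpoint. A minimal sketch of that pattern with FastAPI; the schema and the stand-in model are placeholders, not taken from any of the linked posts:

```python
# Minimal model-serving sketch. Run with: uvicorn app:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]   # placeholder input schema

class PredictResponse(BaseModel):
    prediction: float

# In a real service this would be e.g. joblib.load("model.pkl"),
# executed once at startup rather than on every request.
def fake_model(features: list[float]) -> float:
    return sum(features) / max(len(features), 1)

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    return PredictResponse(prediction=fake_model(req.features))
```

Once running, `POST /predict` with a JSON body like `{"features": [1.0, 2.0, 3.0]}` returns the prediction; containerizing an app like this is what the Docker- and Kubernetes-oriented posts above then cover.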
Similar Open Source Tools
Awesome-TimeSeries-SpatioTemporal-LM-LLM
Awesome-TimeSeries-SpatioTemporal-LM-LLM is a curated list of Large (Language) Models and Foundation Models for Temporal Data, including Time Series, Spatio-temporal, and Event Data. The repository aims to summarize recent advances in Large Models and Foundation Models for Time Series and Spatio-Temporal Data with resources such as papers, code, and data. It covers various applications like General Time Series Analysis, Transportation, Finance, Healthcare, Event Analysis, Climate, Video Data, and more. The repository also includes related resources, surveys, and papers on Large Language Models, Foundation Models, and their applications in AIOps.
llm-continual-learning-survey
This repository is an updating survey for Continual Learning of Large Language Models (CL-LLMs), providing a comprehensive overview of various aspects related to the continual learning of large language models. It covers topics such as continual pre-training, domain-adaptive pre-training, continual fine-tuning, model refinement, model alignment, multimodal LLMs, and miscellaneous aspects. The survey includes a collection of relevant papers, each focusing on different areas within the field of continual learning of large language models.
Efficient-LLMs-Survey
This repository provides a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from **model-centric**, **data-centric**, and **framework-centric** perspectives, respectively. We hope our survey and this GitHub repository can serve as valuable resources to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.
awesome-AIOps
awesome-AIOps is a curated list of academic research and industrial materials related to Artificial Intelligence for IT Operations (AIOps). It includes resources such as competitions, white papers, blogs, tutorials, benchmarks, tools, companies, academic materials, talks, workshops, papers, and courses covering various aspects of AIOps like anomaly detection, root cause analysis, incident management, microservices, dependency tracing, and more.
Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review
This repository is a collection of papers and resources related to recommendation systems, focusing on foundation models, transferable recommender systems, large language models, and multimodal recommender systems. It explores questions such as the necessity of ID embeddings, the shift from matching to generating paradigms, and the future of multimodal recommender systems. The papers cover various aspects of recommendation systems, including pretraining, user representation, dataset benchmarks, and evaluation methods. The repository aims to provide insights and advancements in the field of recommendation systems through literature reviews, surveys, and empirical studies.
Awesome_Mamba
Awesome Mamba is a curated collection of groundbreaking research papers and articles on Mamba Architecture, a pioneering framework in deep learning known for its selective state spaces and efficiency in processing complex data structures. The repository offers a comprehensive exploration of Mamba architecture through categorized research papers covering various domains like visual recognition, speech processing, remote sensing, video processing, activity recognition, image enhancement, medical imaging, reinforcement learning, natural language processing, 3D recognition, multi-modal understanding, time series analysis, graph neural networks, point cloud analysis, and tabular data handling.
LearnPrompt
LearnPrompt is a permanent, free, open-source AIGC course platform that currently supports various tools like ChatGPT, Agent, Midjourney, Runway, Stable Diffusion, AI digital humans, AI voice & music, and large model fine-tuning. The platform offers features such as multilingual support, comment sections, daily selections, and submissions. Users can explore different modules, including sound cloning, RAG, GPT-SoVits, and OpenAI Sora world model. The platform aims to continuously update and provide tutorials, examples, and knowledge systems related to AI technologies.
Awesome-LLM4RS-Papers
This paper list covers Large Language Model-enhanced recommender systems and also collects some related work. Keywords: recommendation system, large language models.
Awesome-LLM-Survey
This repository, Awesome-LLM-Survey, serves as a comprehensive collection of surveys related to Large Language Models (LLM). It covers various aspects of LLM, including instruction tuning, human alignment, LLM agents, hallucination, multi-modal capabilities, and more. Researchers are encouraged to contribute by updating information on their papers to benefit the LLM survey community.
Awesome-LLM-Compression
Awesome LLM compression research papers and tools to accelerate LLM training and inference.
glossAPI
The glossAPI project aims to develop a Greek language model as open-source software, with code licensed under EUPL and data under Creative Commons BY-SA. The project focuses on collecting and evaluating open text sources in Greek, with efforts to prioritize and gather textual data sets. The project encourages contributions through the CONTRIBUTING.md file and provides resources in the wiki for viewing and modifying recorded sources. It also welcomes ideas and corrections through issue submissions. The project emphasizes the importance of open standards, ethically secured data, privacy protection, and addressing digital divides in the context of artificial intelligence and advanced language technologies.
awesome-deeplogic
Awesome deep logic is a curated list of papers and resources focusing on integrating symbolic logic into deep neural networks. It includes surveys, tutorials, and research papers that explore the intersection of logic and deep learning. The repository aims to provide valuable insights and knowledge on how logic can be used to enhance reasoning, knowledge regularization, weak supervision, and explainability in neural networks.
rllm
rLLM (relationLLM) is a Pytorch library for Relational Table Learning (RTL) with LLMs. It breaks down state-of-the-art GNNs, LLMs, and TNNs as standardized modules and facilitates novel model building in a 'combine, align, and co-train' way using these modules. The library is LLM-friendly, processes various graphs as multiple tables linked by foreign keys, introduces new relational table datasets, and is supported by students and teachers from Shanghai Jiao Tong University and Tsinghua University.
RPGMaker_LLM_Translator
An offline Japanese translator for RPGMaker games, built on Mtool and the Sakura model, that provides high-quality translations without a network connection. The Sakura-13B-Galgame translation model is recommended; currently supported versions are Sakura v0.8/v0.9/v0.10pre0.
awesome-mcp-servers
A curated list of awesome Model Context Protocol (MCP) servers that enable AI models to securely interact with local and remote resources through standardized server implementations. The list focuses on production-ready and experimental servers extending AI capabilities through file access, database connections, API integrations, and other contextual services.
For similar tasks
netsaur
Netsaur is a powerful machine learning library for Deno, offering a lightweight and easy-to-use neural network solution. It is blazingly fast and efficient, providing a simple API for creating and training neural networks. Netsaur can run on both CPU and GPU, making it suitable for serverless environments. With Netsaur, users can quickly build and deploy machine learning models for various applications with minimal dependencies. This library is perfect for both beginners and experienced machine learning practitioners.
AI-PhD-S24
AI-PhD-S24 is a mono-repo for the PhD course 'AI for Business Research' at CUHK Business School in Spring 2024. The course aims to provide a basic understanding of machine learning and artificial intelligence concepts/methods used in business research, showcase how ML/AI is utilized in business research, and introduce state-of-the-art AI/ML technologies. The course includes scribed lecture notes, class recordings, and covers topics like AI/ML fundamentals, DL, NLP, CV, unsupervised learning, and diffusion models.
RTL-Coder
RTL-Coder is a tool designed to outperform GPT-3.5 in RTL code generation by providing a fully open-source dataset and a lightweight solution. It targets Verilog code generation and offers an automated flow to generate a large labeled dataset with over 27,000 diverse Verilog design problems and answers. The tool addresses the data availability challenge in IC design-related tasks and can be used for various applications beyond LLMs. The tool includes four RTL code generation models available on the HuggingFace platform, each with specific features and performance characteristics. Additionally, RTL-Coder introduces a new LLM training scheme based on code quality feedback to further enhance model performance and reduce GPU memory consumption.
long-context-attention
Long-Context-Attention (YunChang) is a unified sequence parallel approach that combines the strengths of DeepSpeed-Ulysses-Attention and Ring-Attention to provide a versatile and high-performance solution for long context LLM model training and inference. It addresses the limitations of both methods by offering no limitation on the number of heads, compatibility with advanced parallel strategies, and enhanced performance benchmarks. The tool is verified in Megatron-LM and offers best practices for 4D parallelism, making it suitable for various attention mechanisms and parallel computing advancements.
Graph-Reasoning-LLM
This repository, GraphWiz, focuses on developing an instruction-following Language Model (LLM) for solving graph problems. It includes GraphWiz LLMs with strong graph problem-solving abilities, GraphInstruct dataset with over 72.5k training samples across nine graph problem tasks, and models like GPT-4 and Mistral-7B for comparison. The project aims to map textual descriptions of graphs and structures to solve various graph problems explicitly in natural language.
effective_llm_alignment
This is a super customizable, concise, user-friendly, and efficient toolkit for training and aligning LLMs. It provides support for various methods such as SFT, Distillation, DPO, ORPO, CPO, SimPO, SMPO, Non-pair Reward Modeling, Special prompts basket format, Rejection Sampling, Scoring using RM, Effective FAISS Map-Reduce Deduplication, LLM scoring using RM, NER, CLIP, Classification, and STS. The toolkit offers key libraries like PyTorch, Transformers, TRL, Accelerate, FSDP, DeepSpeed, and tools for result logging with wandb or clearml. It allows mixing datasets, generation and logging in wandb/clearml, vLLM batched generation, and aligns models using the SMPO method.
For similar jobs
llm-resource
llm-resource is a comprehensive collection of high-quality resources for Large Language Models (LLM). It covers various aspects of LLM including algorithms, training, fine-tuning, alignment, inference, data engineering, compression, evaluation, prompt engineering, AI frameworks, AI basics, AI infrastructure, AI compilers, LLM application development, LLM operations, AI systems, and practical implementations. The repository aims to gather and share valuable resources related to LLM for the community to benefit from.
LitServe
LitServe is a high-throughput serving engine designed for deploying AI models at scale. It generates an API endpoint for models, handles batching, streaming, and autoscaling across CPU/GPUs. LitServe is built for enterprise scale with a focus on minimal, hackable code-base without bloat. It supports various model types like LLMs, vision, time-series, and works with frameworks like PyTorch, JAX, Tensorflow, and more. The tool allows users to focus on model performance rather than serving boilerplate, providing full control and flexibility.
how-to-optim-algorithm-in-cuda
This repository documents how to optimize common algorithms based on CUDA. It includes subdirectories with code implementations for specific optimizations. The optimizations cover topics such as compiling PyTorch from source, NVIDIA's reduce optimization, OneFlow's elementwise template, fast atomic add for half data types, upsample nearest2d optimization in OneFlow, optimized indexing in PyTorch, OneFlow's softmax kernel, linear attention optimization, and more. The repository also includes learning resources related to deep learning frameworks, compilers, and optimization techniques.
aiac
AIAC is a library and command line tool to generate Infrastructure as Code (IaC) templates, configurations, utilities, queries, and more via LLM providers such as OpenAI, Amazon Bedrock, and Ollama. Users can define multiple 'backends' targeting different LLM providers and environments using a simple configuration file. The tool allows users to ask a model to generate templates for different scenarios and composes an appropriate request to the selected provider, storing the resulting code to a file and/or printing it to standard output.
ENOVA
ENOVA is an open-source service for Large Language Model (LLM) deployment, monitoring, injection, and auto-scaling. It addresses challenges in deploying stable serverless LLM services on GPU clusters with auto-scaling by deconstructing the LLM service execution process and providing configuration recommendations and performance detection. Users can build and deploy LLM with few command lines, recommend optimal computing resources, experience LLM performance, observe operating status, achieve load balancing, and more. ENOVA ensures stable operation, cost-effectiveness, efficiency, and strong scalability of LLM services.
jina
Jina is a tool that allows users to build multimodal AI services and pipelines using cloud-native technologies. It provides a Pythonic experience for serving ML models and transitioning from local deployment to advanced orchestration frameworks like Docker-Compose, Kubernetes, or Jina AI Cloud. Users can build and serve models for any data type and deep learning framework, design high-performance services with easy scaling, serve LLM models while streaming their output, integrate with Docker containers via Executor Hub, and host on CPU/GPU using Jina AI Cloud. Jina also offers advanced orchestration and scaling capabilities, a smooth transition to the cloud, and easy scalability and concurrency features for applications. Users can deploy to their own cloud or system with Kubernetes and Docker Compose integration, and even deploy to JCloud for autoscaling and monitoring.
vidur
Vidur is a high-fidelity and extensible LLM inference simulator designed for capacity planning, deployment configuration optimization, testing new research ideas, and studying system performance of models under different workloads and configurations. It supports various models and devices, offers chrome trace exports, and can be set up using mamba, venv, or conda. Users can run the simulator with various parameters and monitor metrics using wandb. Contributions are welcome, subject to a Contributor License Agreement and adherence to the Microsoft Open Source Code of Conduct.
AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.