AI-System-School

🚀 AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.

Stars: 2569

AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.

README:

AI System School

💫💫💫 System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI)

Updates:

Path to System for AI [Whitepaper You Must Read]

A curated list of research in machine learning systems, with links to code where available. A team now maintains this project. Pull requests are very welcome; please use our template.

AI system

System for AI (Ordered by Category)

ML / DL Infra

LLM Infra

Domain-Specific Infra

System for ML/LLM Conference

Conference

  • OSDI
  • SOSP
  • SIGCOMM
  • NSDI
  • MLSys
  • ATC
  • EuroSys
  • Middleware
  • SoCC
  • TinyML

General Resources

Survey

  • Toward Highly Available, Intelligent Cloud and ML Systems [Slide]
  • A curated list of awesome System Designing articles, videos and resources for distributed computing, AKA Big Data. [GitHub]
  • awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning [GitHub]
  • Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
    • Ananthanarayanan, Rajagopal, et al. 2019 USENIX Conference on Operational Machine Learning (OpML 19). 2019.
  • How (and How Not) to Write a Good Systems Paper [Advice]
  • Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
    • Hazelwood, Kim, et al. (HPCA 2018)
  • Infrastructure for Usable Machine Learning: The Stanford DAWN Project
    • Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
  • Hidden technical debt in machine learning systems [Paper]
    • Sculley, David, et al. (NIPS 2015)
  • End-to-end arguments in system design [Paper]
    • Saltzer, Jerome H., David P. Reed, and David D. Clark.
  • System Design for Large Scale Machine Learning [Thesis]
  • Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
    • Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
    • Summary: This paper presents a characterization of DL models and then presents new design principles for DL hardware.
  • A Berkeley View of Systems Challenges for AI [Paper]

Book

  • Computer Architecture: A Quantitative Approach [Must read]
  • Distributed Machine Learning Patterns [Website]
  • Streaming Systems [Book]
  • Kubernetes in Action (start to read) [Book]
  • Machine Learning Systems: Designs that scale [Website]
  • Trust in Machine Learning [Website]
  • Automated Machine Learning in Action [Website]

Video

  • ScaledML 2020: Learn from the best minds in the machine learning community. [Video]
  • Jeff Dean: "Achieving Rapid Response Times in Large Online Services" Keynote - Velocity 2014 [YouTube]
  • From Research to Production with PyTorch [Video]
  • Introduction to Microservices, Docker, and Kubernetes [YouTube]
  • ICML Keynote: Lessons Learned from Helping 200,000 non-ML experts use ML [Video]
  • Adaptive & Multitask Learning Systems [Website]
  • System thinking. A TED talk. [YouTube]
  • Flexible systems are the next frontier of machine learning. Jeff Dean [YouTube]
  • Is It Time to Rewrite the Operating System in Rust? [YouTube]
  • InfoQ: AI, ML and Data Engineering [YouTube]
    • Start to watch.
  • Netflix: Human-centric Machine Learning Infrastructure [InfoQ]
  • SysML 2019: [YouTube]
  • ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
  • ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
  • A New Golden Age for Computer Architecture: History, Challenges, and Opportunities. David Patterson [YouTube]
  • How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
  • SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
  • SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]
  • AutoML Basics: Automated Machine Learning in Action. Qingquan Song, Haifeng Jin, Xia Hu [YouTube]

Course

Blog

  • Parallelizing across multiple CPU/GPUs to speed up deep learning inference at the edge [Amazon Blog]
  • Building Robust Production-Ready Deep Learning Vision Models in Minutes [Blog]
  • Deploy Machine Learning Models with Keras, FastAPI, Redis and Docker [Blog]
  • How to Deploy a Machine Learning Model -- Creating a production-ready API using FastAPI + Uvicorn [Blog] [GitHub]
  • Deploying a Machine Learning Model as a REST API [Blog]
  • Continuous Delivery for Machine Learning [Blog]
  • Kubernetes CheatSheets In A4 [GitHub]
  • A Gentle Introduction to Kubernetes [Blog]
  • Train and Deploy Machine Learning Model With Web Interface - Docker, PyTorch & Flask [GitHub]
  • Learning Kubernetes, The Chinese Taoist Way [GitHub]
  • Data pipelines, Luigi, Airflow: everything you need to know [Blog]
  • The Deep Learning Toolset — An Overview [Blog]
  • Summary of CSE 599W: Systems for ML [Chinese Blog]
  • Polyaxon, Argo and Seldon for Model Training, Package and Deployment in Kubernetes [Blog]
  • Overview of the different approaches to putting Machine Learning (ML) models in production [Blog]
  • Being a Data Scientist does not make you a Software Engineer [Part1] Architecting a Machine Learning Pipeline [Part2]
  • Model Serving in PyTorch [Blog]
  • Machine learning in Netflix [Medium]
  • SciPy Conference Materials (slides, repo) [GitHub]
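  • After Spark, UC Berkeley launches a new-generation AI computing engine: Ray [Blog]
  • What knowledge do you need to understand or do research on machine learning / deep learning systems? [Zhihu]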
  • Learn Kubernetes in Under 3 Hours: A Detailed Guide to Orchestrating Containers [Blog] [GitHub]
  • data-engineer-roadmap: Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups [GitHub]
  • TensorFlow Serving + Docker + Tornado: fast production-grade deployment of machine learning models [Blog]
  • Colossal-AI: A Unified Deep Learning System for Big Model Era [Blog] [GitHub]
  • Data Engineer Roadmap [Scaler Blogs]
