AI-System-School
🚀 AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.
Stars: 2569
AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.
README:
- Video Tutorials [YouTube] [bilibili] [小红书]
- We are preparing a new website, [Lets Go AI], for this repo!
Path to System for AI [Whitepaper You Must Read]
A curated list of research in machine learning systems, with links to code where available. A team now maintains this project, and pull requests using our template are very welcome.
- Video System
- AutoML System
- Edge AI
- GNN System
- Federated Learning System
- Deep Reinforcement Learning System
- OSDI
- SOSP
- SIGCOMM
- NSDI
- MLSys
- ATC
- EuroSys
- Middleware
- SoCC
- TinyML
- Toward Highly Available, Intelligent Cloud and ML Systems [Slide]
- A curated list of awesome System Design articles, videos and resources for distributed computing, AKA Big Data. [GitHub]
- awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning [GitHub]
- Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
- Ananthanarayanan, Rajagopal, et al. (USENIX OpML 2019)
- How (and How Not) to Write a Good Systems Paper [Advice]
- Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
- Hazelwood, Kim, et al. (HPCA 2018)
- Infrastructure for Usable Machine Learning: The Stanford DAWN Project
- Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
- Hidden technical debt in machine learning systems [Paper]
- Sculley, David, et al. (NIPS 2015)
- End-to-end arguments in system design [Paper]
- Saltzer, Jerome H., David P. Reed, and David D. Clark. (ACM TOCS 1984)
- System Design for Large Scale Machine Learning [Thesis]
- Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
- Park, Jongsoo, Maxim Naumov, Protonu Basu, et al. (arXiv 2018)
- Summary: This paper presents a characterization of DL models and then derives new design principles for DL hardware.
- A Berkeley View of Systems Challenges for AI [Paper]
- Computer Architecture: A Quantitative Approach [Must read]
- Distributed Machine Learning Patterns [Website]
- Streaming Systems [Book]
- Kubernetes in Action (started reading) [Book]
- Machine Learning Systems: Designs that scale [Website]
- Trust in Machine Learning [Website]
- Automated Machine Learning in Action [Website]
- ScaledML 2020: Learn from the best minds in the machine learning community. [Video]
- Jeff Dean: "Achieving Rapid Response Times in Large Online Services" Keynote - Velocity 2014 [YouTube] (see the hedged-request sketch after this list)
- From Research to Production with PyTorch [Video]
- Introduction to Microservices, Docker, and Kubernetes [YouTube]
- ICML Keynote: Lessons Learned from Helping 200,000 non-ML experts use ML [Video]
- Adaptive & Multitask Learning Systems [Website]
- Systems thinking. A TED talk. [YouTube]
- Flexible systems are the next frontier of machine learning. Jeff Dean [YouTube]
- Is It Time to Rewrite the Operating System in Rust? [YouTube]
- InfoQ: AI, ML and Data Engineering [YouTube] (a good place to start)
- Netflix: Human-centric Machine Learning Infrastructure [InfoQ]
- SysML 2019: [YouTube]
- ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
- ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
- A New Golden Age for Computer Architecture History, Challenges, and Opportunities. David Patterson [YouTube]
- How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
- SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
- SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]
- AutoML Basics: Automated Machine Learning in Action. Qingquan Song, Haifeng Jin, Xia Hu [YouTube]
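A recurring theme in the Jeff Dean talks above is taming tail latency with hedged requests: if a reply has not arrived after a short delay, send a backup copy of the request to another replica and take whichever answer comes back first. A minimal asyncio sketch of the idea, with all names and latency numbers illustrative rather than taken from the talk:

```python
import asyncio
import random

async def query_replica(replica_id: int) -> str:
    # Stand-in for an RPC to one replica; latency is simulated with a heavy tail.
    await asyncio.sleep(random.expovariate(1 / 0.05))  # ~50 ms mean
    return f"result from replica {replica_id}"

async def hedged_request(replicas: list[int], hedge_after: float = 0.1) -> str:
    # Ask the primary first; if no reply within `hedge_after` seconds,
    # issue a backup request and return whichever finishes first.
    primary = asyncio.create_task(query_replica(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    backup = asyncio.create_task(query_replica(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # a real system would also cancel the RPC server-side
    return done.pop().result()

print(asyncio.run(hedged_request([0, 1])))
```

In practice the hedge delay is often set near the latency tail (e.g. around the 95th percentile) so that only a small fraction of requests ever trigger a backup.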
- CS692 Seminar: Systems for Machine Learning, Machine Learning for Systems [GitHub]
- Topics in Networks: Machine Learning for Networking and Systems, Autumn 2019 [Course Website]
- CS6465: Emerging Cloud Technologies and Systems Challenges [Cornell]
- CS294: AI For Systems and Systems For AI [UC Berkeley, Spring] (strongly recommended); see also [Machine Learning Systems (Fall 2019)]
- CSE 599W: Systems for ML. [Chen Tianqi] [University of Washington]
- EECS 598: Systems for AI (W'21). [Mosharaf Chowdhury] [Systems for AI (W'21)]
- Tutorial code on how to build your own Deep Learning System in 2k Lines [GitHub] (see the autodiff sketch after this list)
- CSE 291F: Advanced Data Analytics and ML Systems. [UCSD]
- CSci 8980: Machine Learning in Computer Systems [University of Minnesota, Twin Cities]
- Mu Li (MXNet, Parameter Server): Introduction to Deep Learning [best DL course, I think] [Book]
- 10-605: Machine Learning with Large Datasets. [CMU]
- CS 329S: Machine Learning Systems Design. [Stanford]
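Several of the courses above, and the 2k-line tutorial in particular, revolve around the same core mechanism: reverse-mode automatic differentiation over a computation graph. As a flavor of what that boils down to, here is a minimal scalar autodiff sketch; it illustrates the technique and is not code from any of the listed courses:

```python
class Value:
    """A scalar that remembers how it was computed, for reverse-mode autodiff."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # Values this node was computed from
        self._grad_fns = grad_fns  # local derivative w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Topologically sort the graph, then push gradients from output to inputs.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, fn in zip(v._parents, v._grad_fns):
                parent.grad += fn(v.grad)

x, y = Value(2.0), Value(3.0)
z = x * y + x   # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Production frameworks apply the same pattern to tensors, with fused kernels and memory planning, but the topological sort plus chain-rule accumulation above is the essence.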
- Parallelizing across multiple CPU/GPUs to speed up deep learning inference at the edge [Amazon Blog]
- Building Robust Production-Ready Deep Learning Vision Models in Minutes [Blog]
- Deploy Machine Learning Models with Keras, FastAPI, Redis and Docker [Blog]
- How to Deploy a Machine Learning Model: Creating a production-ready API using FastAPI + Uvicorn [Blog] [GitHub] (see the serving sketch after this list)
- Deploying a Machine Learning Model as a REST API [Blog]
- Continuous Delivery for Machine Learning [Blog]
- Kubernetes CheatSheets In A4 [GitHub]
- A Gentle Introduction to Kubernetes [Blog]
- Train and Deploy Machine Learning Model With Web Interface - Docker, PyTorch & Flask [GitHub]
- Learning Kubernetes, The Chinese Taoist Way [GitHub]
- Data pipelines, Luigi, Airflow: everything you need to know [Blog]
- The Deep Learning Toolset — An Overview [Blog]
- Summary of CSE 599W: Systems for ML [Chinese Blog]
- Polyaxon, Argo and Seldon for Model Training, Package and Deployment in Kubernetes [Blog]
- Overview of the different approaches to putting Machine Learning (ML) models in production [Blog]
- Being a Data Scientist does not make you a Software Engineer [Part 1]; Architecting a Machine Learning Pipeline [Part 2]
- Model Serving in PyTorch [Blog]
- Machine learning in Netflix [Medium]
- SciPy Conference Materials (slides, repo) [GitHub]
- After Spark, UC Berkeley introduces Ray, a new-generation AI compute engine [Blog]
- What knowledge background do you need to understand or work on research in machine learning / deep learning systems? [Zhihu]
- Learn Kubernetes in Under 3 Hours: A Detailed Guide to Orchestrating Containers [Blog] [GitHub]
- data-engineer-roadmap: Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups [GitHub]
- TensorFlow Serving + Docker + Tornado: fast production-grade deployment of machine learning models [Blog]
- Colossal-AI: A Unified Deep Learning System for Big Model Era [Blog] [GitHub]
- Data Engineer Roadmap [Scaler Blogs]
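Several of the deployment posts above share one pattern: load a trained model once at startup, then expose a small HTTP prediction endpoint. A minimal sketch of that pattern with FastAPI; the schema and the stand-in model are placeholders, not taken from any of the linked posts:

```python
# Minimal model-serving sketch. Run with: uvicorn app:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]   # placeholder input schema

class PredictResponse(BaseModel):
    prediction: float

# In a real service this would be e.g. joblib.load("model.pkl"),
# executed once at startup rather than on every request.
def fake_model(features: list[float]) -> float:
    return sum(features) / max(len(features), 1)

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    return PredictResponse(prediction=fake_model(req.features))
```

Once running, `POST /predict` with a JSON body like `{"features": [1.0, 2.0, 3.0]}` returns the prediction; containerizing an app like this is what the Docker- and Kubernetes-oriented posts above then cover.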
Similar Open Source Tools
Awesome-TimeSeries-SpatioTemporal-LM-LLM
Awesome-TimeSeries-SpatioTemporal-LM-LLM is a curated list of Large (Language) Models and Foundation Models for Temporal Data, including Time Series, Spatio-temporal, and Event Data. The repository aims to summarize recent advances in Large Models and Foundation Models for Time Series and Spatio-Temporal Data with resources such as papers, code, and data. It covers various applications like General Time Series Analysis, Transportation, Finance, Healthcare, Event Analysis, Climate, Video Data, and more. The repository also includes related resources, surveys, and papers on Large Language Models, Foundation Models, and their applications in AIOps.
llm-continual-learning-survey
This repository is an updating survey for Continual Learning of Large Language Models (CL-LLMs), providing a comprehensive overview of various aspects related to the continual learning of large language models. It covers topics such as continual pre-training, domain-adaptive pre-training, continual fine-tuning, model refinement, model alignment, multimodal LLMs, and miscellaneous aspects. The survey includes a collection of relevant papers, each focusing on different areas within the field of continual learning of large language models.
Efficient-LLMs-Survey
This repository provides a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from **model-centric**, **data-centric**, and **framework-centric** perspectives, respectively. We hope our survey and this GitHub repository can serve as valuable resources to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.
awesome-AIOps
awesome-AIOps is a curated list of academic research and industrial materials related to Artificial Intelligence for IT Operations (AIOps). It includes resources such as competitions, white papers, blogs, tutorials, benchmarks, tools, companies, academic materials, talks, workshops, papers, and courses covering various aspects of AIOps like anomaly detection, root cause analysis, incident management, microservices, dependency tracing, and more.
Recommendation-Systems-without-Explicit-ID-Features-A-Literature-Review
This repository is a collection of papers and resources related to recommendation systems, focusing on foundation models, transferable recommender systems, large language models, and multimodal recommender systems. It explores questions such as the necessity of ID embeddings, the shift from matching to generating paradigms, and the future of multimodal recommender systems. The papers cover various aspects of recommendation systems, including pretraining, user representation, dataset benchmarks, and evaluation methods. The repository aims to provide insights and advancements in the field of recommendation systems through literature reviews, surveys, and empirical studies.
Awesome_Mamba
Awesome Mamba is a curated collection of groundbreaking research papers and articles on Mamba Architecture, a pioneering framework in deep learning known for its selective state spaces and efficiency in processing complex data structures. The repository offers a comprehensive exploration of Mamba architecture through categorized research papers covering various domains like visual recognition, speech processing, remote sensing, video processing, activity recognition, image enhancement, medical imaging, reinforcement learning, natural language processing, 3D recognition, multi-modal understanding, time series analysis, graph neural networks, point cloud analysis, and tabular data handling.
LearnPrompt
LearnPrompt is a permanent, free, open-source AIGC course platform that currently supports various tools like ChatGPT, Agent, Midjourney, Runway, Stable Diffusion, AI digital humans, AI voice & music, and large model fine-tuning. The platform offers features such as multilingual support, comment sections, daily selections, and submissions. Users can explore different modules, including sound cloning, RAG, GPT-SoVits, and OpenAI Sora world model. The platform aims to continuously update and provide tutorials, examples, and knowledge systems related to AI technologies.
Awesome-LLM4RS-Papers
This paper list covers Large Language Model-enhanced recommender systems and also collects some related work. Keywords: recommendation system, large language models.
Awesome-LLM-Survey
This repository, Awesome-LLM-Survey, serves as a comprehensive collection of surveys related to Large Language Models (LLM). It covers various aspects of LLM, including instruction tuning, human alignment, LLM agents, hallucination, multi-modal capabilities, and more. Researchers are encouraged to contribute by updating information on their papers to benefit the LLM survey community.
Awesome-LLM-Compression
Awesome LLM compression research papers and tools to accelerate LLM training and inference.
glossAPI
The glossAPI project aims to develop a Greek language model as open-source software, with code licensed under EUPL and data under Creative Commons BY-SA. The project focuses on collecting and evaluating open text sources in Greek, with efforts to prioritize and gather textual data sets. The project encourages contributions through the CONTRIBUTING.md file and provides resources in the wiki for viewing and modifying recorded sources. It also welcomes ideas and corrections through issue submissions. The project emphasizes the importance of open standards, ethically secured data, privacy protection, and addressing digital divides in the context of artificial intelligence and advanced language technologies.
awesome-deeplogic
Awesome deep logic is a curated list of papers and resources focusing on integrating symbolic logic into deep neural networks. It includes surveys, tutorials, and research papers that explore the intersection of logic and deep learning. The repository aims to provide valuable insights and knowledge on how logic can be used to enhance reasoning, knowledge regularization, weak supervision, and explainability in neural networks.
rllm
rLLM (relationLLM) is a Pytorch library for Relational Table Learning (RTL) with LLMs. It breaks down state-of-the-art GNNs, LLMs, and TNNs as standardized modules and facilitates novel model building in a 'combine, align, and co-train' way using these modules. The library is LLM-friendly, processes various graphs as multiple tables linked by foreign keys, introduces new relational table datasets, and is supported by students and teachers from Shanghai Jiao Tong University and Tsinghua University.
RPGMaker_LLM_Translator
An offline Japanese translator for RPGMaker games, built on Mtool and the Sakura model, that provides high-quality translations without a network connection. The Sakura-13B-Galgame translation model is recommended; currently supported versions are Sakura v0.8/v0.9/v0.10pre0.
awesome-mcp-servers
A curated list of awesome Model Context Protocol (MCP) servers that enable AI models to securely interact with local and remote resources through standardized server implementations. The list focuses on production-ready and experimental servers extending AI capabilities through file access, database connections, API integrations, and other contextual services.
For similar tasks
netsaur
Netsaur is a powerful machine learning library for Deno, offering a lightweight and easy-to-use neural network solution. It is blazingly fast and efficient, providing a simple API for creating and training neural networks. Netsaur can run on both CPU and GPU, making it suitable for serverless environments. With Netsaur, users can quickly build and deploy machine learning models for various applications with minimal dependencies. This library is perfect for both beginners and experienced machine learning practitioners.
AI-PhD-S24
AI-PhD-S24 is a mono-repo for the PhD course 'AI for Business Research' at CUHK Business School in Spring 2024. The course aims to provide a basic understanding of machine learning and artificial intelligence concepts/methods used in business research, showcase how ML/AI is utilized in business research, and introduce state-of-the-art AI/ML technologies. The course includes scribed lecture notes, class recordings, and covers topics like AI/ML fundamentals, DL, NLP, CV, unsupervised learning, and diffusion models.
RTL-Coder
RTL-Coder is a tool designed to outperform GPT-3.5 in RTL code generation by providing a fully open-source dataset and a lightweight solution. It targets Verilog code generation and offers an automated flow to generate a large labeled dataset with over 27,000 diverse Verilog design problems and answers. The tool addresses the data availability challenge in IC design-related tasks and can be used for various applications beyond LLMs. The tool includes four RTL code generation models available on the HuggingFace platform, each with specific features and performance characteristics. Additionally, RTL-Coder introduces a new LLM training scheme based on code quality feedback to further enhance model performance and reduce GPU memory consumption.
long-context-attention
Long-Context-Attention (YunChang) is a unified sequence parallel approach that combines the strengths of DeepSpeed-Ulysses-Attention and Ring-Attention to provide a versatile and high-performance solution for long context LLM model training and inference. It addresses the limitations of both methods by offering no limitation on the number of heads, compatibility with advanced parallel strategies, and enhanced performance benchmarks. The tool is verified in Megatron-LM and offers best practices for 4D parallelism, making it suitable for various attention mechanisms and parallel computing advancements.
Graph-Reasoning-LLM
This repository, GraphWiz, focuses on developing an instruction-following Language Model (LLM) for solving graph problems. It includes GraphWiz LLMs with strong graph problem-solving abilities, GraphInstruct dataset with over 72.5k training samples across nine graph problem tasks, and models like GPT-4 and Mistral-7B for comparison. The project aims to map textual descriptions of graphs and structures to solve various graph problems explicitly in natural language.
effective_llm_alignment
This is a super customizable, concise, user-friendly, and efficient toolkit for training and aligning LLMs. It provides support for various methods such as SFT, Distillation, DPO, ORPO, CPO, SimPO, SMPO, Non-pair Reward Modeling, Special prompts basket format, Rejection Sampling, Scoring using RM, Effective FAISS Map-Reduce Deduplication, LLM scoring using RM, NER, CLIP, Classification, and STS. The toolkit offers key libraries like PyTorch, Transformers, TRL, Accelerate, FSDP, DeepSpeed, and tools for result logging with wandb or clearml. It allows mixing datasets, generation and logging in wandb/clearml, vLLM batched generation, and aligns models using the SMPO method.
For similar jobs
llm-resource
llm-resource is a comprehensive collection of high-quality resources for Large Language Models (LLM). It covers various aspects of LLM including algorithms, training, fine-tuning, alignment, inference, data engineering, compression, evaluation, prompt engineering, AI frameworks, AI basics, AI infrastructure, AI compilers, LLM application development, LLM operations, AI systems, and practical implementations. The repository aims to gather and share valuable resources related to LLM for the community to benefit from.
LitServe
LitServe is a high-throughput serving engine designed for deploying AI models at scale. It generates an API endpoint for models, handles batching, streaming, and autoscaling across CPU/GPUs. LitServe is built for enterprise scale with a focus on minimal, hackable code-base without bloat. It supports various model types like LLMs, vision, time-series, and works with frameworks like PyTorch, JAX, Tensorflow, and more. The tool allows users to focus on model performance rather than serving boilerplate, providing full control and flexibility.
how-to-optim-algorithm-in-cuda
This repository documents how to optimize common algorithms based on CUDA. It includes subdirectories with code implementations for specific optimizations. The optimizations cover topics such as compiling PyTorch from source, NVIDIA's reduce optimization, OneFlow's elementwise template, fast atomic add for half data types, upsample nearest2d optimization in OneFlow, optimized indexing in PyTorch, OneFlow's softmax kernel, linear attention optimization, and more. The repository also includes learning resources related to deep learning frameworks, compilers, and optimization techniques.
aiac
AIAC is a library and command line tool to generate Infrastructure as Code (IaC) templates, configurations, utilities, queries, and more via LLM providers such as OpenAI, Amazon Bedrock, and Ollama. Users can define multiple 'backends' targeting different LLM providers and environments using a simple configuration file. The tool allows users to ask a model to generate templates for different scenarios and composes an appropriate request to the selected provider, storing the resulting code to a file and/or printing it to standard output.
ENOVA
ENOVA is an open-source service for Large Language Model (LLM) deployment, monitoring, injection, and auto-scaling. It addresses challenges in deploying stable serverless LLM services on GPU clusters with auto-scaling by deconstructing the LLM service execution process and providing configuration recommendations and performance detection. Users can build and deploy LLM with few command lines, recommend optimal computing resources, experience LLM performance, observe operating status, achieve load balancing, and more. ENOVA ensures stable operation, cost-effectiveness, efficiency, and strong scalability of LLM services.
jina
Jina is a tool that allows users to build multimodal AI services and pipelines using cloud-native technologies. It provides a Pythonic experience for serving ML models and transitioning from local deployment to advanced orchestration frameworks like Docker-Compose, Kubernetes, or Jina AI Cloud. Users can build and serve models for any data type and deep learning framework, design high-performance services with easy scaling, serve LLM models while streaming their output, integrate with Docker containers via Executor Hub, and host on CPU/GPU using Jina AI Cloud. Jina also offers advanced orchestration and scaling capabilities, a smooth transition to the cloud, and easy scalability and concurrency features for applications. Users can deploy to their own cloud or system with Kubernetes and Docker Compose integration, and even deploy to JCloud for autoscaling and monitoring.
vidur
Vidur is a high-fidelity and extensible LLM inference simulator designed for capacity planning, deployment configuration optimization, testing new research ideas, and studying system performance of models under different workloads and configurations. It supports various models and devices, offers chrome trace exports, and can be set up using mamba, venv, or conda. Users can run the simulator with various parameters and monitor metrics using wandb. Contributions are welcome, subject to a Contributor License Agreement and adherence to the Microsoft Open Source Code of Conduct.
AI System School is a curated list of research in machine learning systems, focusing on ML/DL infra, LLM infra, domain-specific infra, ML/LLM conferences, and general resources. It provides resources such as data processing, training systems, video systems, autoML systems, and more. The repository aims to help users navigate the landscape of AI systems and machine learning infrastructure, offering insights into conferences, surveys, books, videos, courses, and blogs related to the field.