executorch
On-device AI across mobile, embedded and edge for PyTorch
Stars: 4293
ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices. Key value propositions of ExecuTorch are: * **Portability:** Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers. * **Productivity:** Enabling developers to use the same toolchains and SDK from PyTorch model authoring and conversion, to debugging and deployment to a wide variety of platforms. * **Performance:** Providing end users with a seamless and high-performance experience due to a lightweight runtime and utilizing full hardware capabilities such as CPUs, NPUs, and DSPs.
README:
ExecuTorch is PyTorch's unified solution for deploying AI models on-device—from smartphones to microcontrollers—built for privacy, performance, and portability. It powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, Ray-Ban Meta Smart Glasses, and more.
Deploy LLMs, vision, speech, and multimodal models with the same PyTorch APIs you already know—accelerating research to production with seamless model export, optimization, and deployment. No manual C++ rewrites. No format conversions. No vendor lock-in.
📘 Table of Contents
- 🔒 Native PyTorch Export — Direct export from PyTorch. No .onnx, .tflite, or intermediate format conversions. Preserve model semantics.
- ⚡ Production-Proven — Powers billions of users at Meta with real-time on-device inference.
- 💾 Tiny Runtime — 50KB base footprint. Runs on microcontrollers to high-end smartphones.
- 🚀 12+ Hardware Backends — Open-source acceleration for Apple, Qualcomm, ARM, MediaTek, Vulkan, and more.
- 🎯 One Export, Multiple Backends — Switch hardware targets with a single line change. Deploy the same model everywhere.
ExecuTorch uses ahead-of-time (AOT) compilation to prepare PyTorch models for edge deployment:
-
🧩 Export — Capture your PyTorch model graph with
torch.export() -
⚙️ Compile — Quantize, optimize, and partition to hardware backends →
.pte -
🚀 Execute — Load
.pteon-device via lightweight C++ runtime
Models use a standardized Core ATen operator set. Partitioners delegate subgraphs to specialized hardware (NPU/GPU) with CPU fallback.
Learn more: How ExecuTorch Works • Architecture Guide
pip install executorchFor platform-specific setup (Android, iOS, embedded systems), see the Quick Start documentation for additional info.
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
# 1. Export your PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)
# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
exported_program,
partitioner=[XnnpackPartitioner()] # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()
# 3. Save for deployment
with open("model.pte", "wb") as f:
f.write(program.buffer)
# Test locally via ExecuTorch runtime's pybind API (optional)
from executorch.runtime import Runtime
runtime = Runtime.get()
method = runtime.load_program("model.pte").load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>
Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);import ExecuTorch
let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))Export Llama models using the export_llm script or Optimum-ExecuTorch:
# Using export_llm
python -m executorch.extension.llm.export.export_llm --model llama3_2 --output llama.pte
# Using Optimum-ExecuTorch
optimum-cli export executorch \
--model meta-llama/Llama-3.2-1B \
--task text-generation \
--recipe xnnpack \
--output_dir llama_modelRun on-device with the LLM runner API:
#include <executorch/extension/llm/runner/text_llm_runner.h>
auto runner = create_llama_runner("llama.pte", "tiktoken.bin");
executorch::extension::llm::GenerationConfig config{
.seq_len = 128, .temperature = 0.8f};
runner->generate("Hello, how are you?", config);import ExecuTorchLLM
let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
$0.sequenceLength = 128
}) { token in
print(token, terminator: "")
}Kotlin (Android) — API Docs • Demo App
val llmModule = LlmModule("llama.pte", "tiktoken.bin", 0.8f)
llmModule.load()
llmModule.generate("Hello, how are you?", 128, object : LlmCallback {
override fun onResult(result: String) { print(result) }
override fun onStats(stats: String) { }
})For multimodal models (vision, audio), use the MultiModal runner API which extends the LLM runner to handle image and audio inputs alongside text. See Llava and Voxtral examples.
See examples/models/llama for complete workflow including quantization, mobile deployment, and advanced options.
Next Steps:
- 📖 Step-by-step tutorial — Complete walkthrough for your first model
- ⚡ Colab notebook — Try ExecuTorch instantly in your browser
- 🤖 Deploy Llama models — LLM workflow with quantization and mobile demos
| Platform | Supported Backends |
|---|---|
| Android | XNNPACK, Vulkan, Qualcomm, MediaTek, Samsung Exynos |
| iOS | XNNPACK, MPS, CoreML (Neural Engine) |
| Linux / Windows | XNNPACK, OpenVINO, CUDA (experimental) |
| macOS | XNNPACK, MPS, Metal (experimental) |
| Embedded / MCU | XNNPACK, ARM Ethos-U, NXP, Cadence DSP |
See Backend Documentation for detailed hardware requirements and optimization guides. For desktop/laptop GPU inference with CUDA and Metal, see the Desktop Guide. For Zephyr RTOS integration, see the Zephyr Guide.
ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devices, and partner deployments. View success stories →
LLMs: Llama 3.2/3.1/3, Qwen 3, Phi-4-mini, LiquidAI LFM2
Multimodal: Llava (vision-language), Voxtral (audio-language), Gemma (vision-language)
Vision/Speech: MobileNetV2, DeepLabV3, Whisper
Resources: examples/ directory • executorch-examples out-of-tree demos • Optimum-ExecuTorch for HuggingFace models • Unsloth for fine-tuned LLM deployment
ExecuTorch provides advanced capabilities for production deployment:
- Quantization — Built-in support via torchao for 8-bit, 4-bit, and dynamic quantization
- Memory Planning — Optimize memory usage with ahead-of-time allocation strategies
- Developer Tools — ETDump profiler, ETRecord inspector, and model debugger
- Selective Build — Strip unused operators to minimize binary size
- Custom Operators — Extend with domain-specific kernels
- Dynamic Shapes — Support variable input sizes with bounded ranges
See Advanced Topics for quantization techniques, custom backends, and compiler passes.
- Documentation Home — Complete guides and tutorials
- API Reference — Python, C++, Java/Kotlin APIs
- Backend Integration — Build custom hardware backends
- Troubleshooting — Common issues and solutions
We welcome contributions from the community!
- 💬 GitHub Discussions — Ask questions and share ideas
- 🎮 Discord — Chat with the team and community
- 🐛 Issues — Report bugs or request features
- 🤝 Contributing Guide — Guidelines and codebase structure
ExecuTorch is BSD licensed, as found in the LICENSE file.
Part of the PyTorch ecosystem
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for executorch
Similar Open Source Tools
executorch
ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices. Key value propositions of ExecuTorch are: * **Portability:** Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers. * **Productivity:** Enabling developers to use the same toolchains and SDK from PyTorch model authoring and conversion, to debugging and deployment to a wide variety of platforms. * **Performance:** Providing end users with a seamless and high-performance experience due to a lightweight runtime and utilizing full hardware capabilities such as CPUs, NPUs, and DSPs.
mistral.rs
Mistral.rs is a fast LLM inference platform written in Rust. We support inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.
BrowserAI
BrowserAI is a tool that allows users to run large language models (LLMs) directly in the browser, providing a simple, fast, and open-source solution. It prioritizes privacy by processing data locally, is cost-effective with no server costs, works offline after initial download, and offers WebGPU acceleration for high performance. It is developer-friendly with a simple API, supports multiple engines, and comes with pre-configured models for easy use. Ideal for web developers, companies needing privacy-conscious AI solutions, researchers experimenting with browser-based AI, and hobbyists exploring AI without infrastructure overhead.
superagentx
SuperAgentX is a lightweight open-source AI framework designed for multi-agent applications with Artificial General Intelligence (AGI) capabilities. It offers goal-oriented multi-agents with retry mechanisms, easy deployment through WebSocket, RESTful API, and IO console interfaces, streamlined architecture with no major dependencies, contextual memory using SQL + Vector databases, flexible LLM configuration supporting various Gen AI models, and extendable handlers for integration with diverse APIs and data sources. It aims to accelerate the development of AGI by providing a powerful platform for building autonomous AI agents capable of executing complex tasks with minimal human intervention.
indexify
Indexify is an open-source engine for building fast data pipelines for unstructured data (video, audio, images, and documents) using reusable extractors for embedding, transformation, and feature extraction. LLM Applications can query transformed content friendly to LLMs by semantic search and SQL queries. Indexify keeps vector databases and structured databases (PostgreSQL) updated by automatically invoking the pipelines as new data is ingested into the system from external data sources. **Why use Indexify** * Makes Unstructured Data **Queryable** with **SQL** and **Semantic Search** * **Real-Time** Extraction Engine to keep indexes **automatically** updated as new data is ingested. * Create **Extraction Graph** to describe **data transformation** and extraction of **embedding** and **structured extraction**. * **Incremental Extraction** and **Selective Deletion** when content is deleted or updated. * **Extractor SDK** allows adding new extraction capabilities, and many readily available extractors for **PDF**, **Image**, and **Video** indexing and extraction. * Works with **any LLM Framework** including **Langchain**, **DSPy**, etc. * Runs on your laptop during **prototyping** and also scales to **1000s of machines** on the cloud. * Works with many **Blob Stores**, **Vector Stores**, and **Structured Databases** * We have even **Open Sourced Automation** to deploy to Kubernetes in production.
steedos-platform
Steedos Platform is an enterprise-grade implementation of the ObjectStack architecture, combining Metadata Driven Architecture with Generative AI. It provides a Universal Metadata Standard (ObjectQL) for AI to generate complex applications instantly. The platform consists of ObjectQL (Protocol), ObjectOS (Engine), and Object UI (Renderer), offering AI data modeling, generative UI, and logic & automation capabilities. Users can transition from a monolithic structure to a monorepo workspace with core type definitions, ORM, frontend rendering engine, and modular business application packages. Steedos Platform aims to redefine software development by leveraging AI and metadata to build applications faster and more efficiently.
ChatDev
ChatDev is a virtual software company powered by intelligent agents like CEO, CPO, CTO, programmer, reviewer, tester, and art designer. These agents collaborate to revolutionize the digital world through programming. The platform offers an easy-to-use, highly customizable, and extendable framework based on large language models, ideal for studying collective intelligence. ChatDev introduces innovative methods like Iterative Experience Refinement and Experiential Co-Learning to enhance software development efficiency. It supports features like incremental development, Docker integration, Git mode, and Human-Agent-Interaction mode. Users can customize ChatChain, Phase, and Role settings, and share their software creations easily. The project is open-source under the Apache 2.0 License and utilizes data licensed under CC BY-NC 4.0.
nodetool
NodeTool is a platform designed for AI enthusiasts, developers, and creators, providing a visual interface to access a variety of AI tools and models. It simplifies access to advanced AI technologies, offering resources for content creation, data analysis, automation, and more. With features like a visual editor, seamless integration with leading AI platforms, model manager, and API integration, NodeTool caters to both newcomers and experienced users in the AI field.
sdk
The Kubeflow SDK is a set of unified Pythonic APIs that simplify running AI workloads at any scale without needing to learn Kubernetes. It offers consistent APIs across the Kubeflow ecosystem, enabling users to focus on building AI applications rather than managing complex infrastructure. The SDK provides a unified experience, simplifies AI workloads, is built for scale, allows rapid iteration, and supports local development without a Kubernetes cluster.
tingly-box
Tingly Box is a tool that helps in deciding which model to call, compressing context, and routing requests efficiently. It offers secure, reliable, and customizable functional extensions. With features like unified API, smart routing, context compression, auto API translation, blazing fast performance, flexible authentication, visual control panel, and client-side usage stats, Tingly Box provides a comprehensive solution for managing AI models and tokens. It supports integration with various IDEs, CLI tools, SDKs, and AI applications, making it versatile and easy to use. The tool also allows seamless integration with OAuth providers like Claude Code, enabling users to utilize existing quotas in OpenAI-compatible tools. Tingly Box aims to simplify AI model management and usage by providing a single endpoint for multiple providers with minimal configuration, promoting seamless integration with SDKs and CLI tools.
seline
Seline is a local-first AI desktop application that integrates conversational AI, visual generation tools, vector search, and multi-channel connectivity. It allows users to connect WhatsApp, Telegram, or Slack to create always-on bots with full context and background task delivery. The application supports multi-channel connectivity, deep research mode, local web browsing with Puppeteer, local knowledge and privacy features, visual and creative tools, automation and agents, developer experience enhancements, and more. Seline is actively developed with a focus on improving user experience and functionality.
BrowserAI
BrowserAI is a production-ready tool that allows users to run AI models directly in the browser, offering simplicity, speed, privacy, and open-source capabilities. It provides WebGPU acceleration for fast inference, zero server costs, offline capability, and developer-friendly features. Perfect for web developers, companies seeking privacy-conscious AI solutions, researchers experimenting with browser-based AI, and hobbyists exploring AI without infrastructure overhead. The tool supports various AI tasks like text generation, speech recognition, and text-to-speech, with pre-configured popular models ready to use. It offers a simple SDK with multiple engine support and seamless switching between MLC and Transformers engines.
surogate
Surogate is a high-performance, mixed-precision LLM pre-training and fine-tuning framework designed for practical hardware limits. It combines a native C++/CUDA execution engine, a low-overhead Python frontend, and a highly optimized multi-threaded scheduler to achieve industry-leading Speed-Of-Light (SOL) utilization on NVIDIA GPUs. Surogate Studio is the no-code companion for enterprise-grade LLMOps, offering features like pre-training, fine-tuning, native multi-GPU and multi-Node training, smart CPU offloading, pre-built training recipes, mixed-precision training, and adaptive training. It supports various NVIDIA GPUs and provides a unified platform for managing AI infrastructure and operations.
AionUi
AionUi is a user interface library for building modern and responsive web applications. It provides a set of customizable components and styles to create visually appealing user interfaces. With AionUi, developers can easily design and implement interactive web interfaces that are both functional and aesthetically pleasing. The library is built using the latest web technologies and follows best practices for performance and accessibility. Whether you are working on a personal project or a professional application, AionUi can help you streamline the UI development process and deliver a seamless user experience.
layra
LAYRA is the world's first visual-native AI automation engine that sees documents like a human, preserves layout and graphical elements, and executes arbitrarily complex workflows with full Python control. It empowers users to build next-generation intelligent systems with no limits or compromises. Built for Enterprise-Grade deployment, LAYRA features a modern frontend, high-performance backend, decoupled service architecture, visual-native multimodal document understanding, and a powerful workflow engine.
local-cocoa
Local Cocoa is a privacy-focused tool that runs entirely on your device, turning files into memory to spark insights and power actions. It offers features like fully local privacy, multimodal memory, vector-powered retrieval, intelligent indexing, vision understanding, hardware acceleration, focused user experience, integrated notes, and auto-sync. The tool combines file ingestion, intelligent chunking, and local retrieval to build a private on-device knowledge system. The ultimate goal includes more connectors like Google Drive integration, voice mode for local speech-to-text interaction, and a plugin ecosystem for community tools and agents. Local Cocoa is built using Electron, React, TypeScript, FastAPI, llama.cpp, and Qdrant.
For similar tasks
executorch
ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of PyTorch models to edge devices. Key value propositions of ExecuTorch are: * **Portability:** Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers. * **Productivity:** Enabling developers to use the same toolchains and SDK from PyTorch model authoring and conversion, to debugging and deployment to a wide variety of platforms. * **Performance:** Providing end users with a seamless and high-performance experience due to a lightweight runtime and utilizing full hardware capabilities such as CPUs, NPUs, and DSPs.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
agentcloud
AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.
oss-fuzz-gen
This framework generates fuzz targets for real-world `C`/`C++` projects with various Large Language Models (LLM) and benchmarks them via the `OSS-Fuzz` platform. It manages to successfully leverage LLMs to generate valid fuzz targets (which generate non-zero coverage increase) for 160 C/C++ projects. The maximum line coverage increase is 29% from the existing human-written targets.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
Azure-Analytics-and-AI-Engagement
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customer’s subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.
