
ipex-llm
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.
Stars: 7638

README:
IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
[!NOTE]
- IPEX-LLM provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc.
- 70+ models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, DeepSeek, Qwen, ChatGLM, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.
- [2025/03] We added support for Gemma3 model in the latest llama.cpp Portable Zip.
- [2025/03] We can now run DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon using the latest llama.cpp Portable Zip.
- [2025/02] We added support of llama.cpp Portable Zip for Intel GPU (both Windows and Linux) and NPU (Windows only).
- [2025/02] We added support of Ollama Portable Zip to directly run Ollama on Intel GPU for both Windows and Linux (without the need of manual installations).
- [2025/02] We added support for running vLLM 0.6.6 on Intel Arc GPUs.
- [2025/01] We added the guide for running ipex-llm on Intel Arc B580 GPU.
- [2025/01] We added support for running Ollama 0.5.4 on Intel GPU.
- [2024/12] We added both Python and C++ support for Intel Core Ultra NPU (including 100H, 200V, 200K and 200H series).
More updates
- [2024/11] We added support for running vLLM 0.6.2 on Intel Arc GPUs.
- [2024/07] We added support for running Microsoft's GraphRAG using local LLM on Intel GPU; see the quickstart guide here.
- [2024/07] We added extensive support for Large Multimodal Models, including StableDiffusion, Phi-3-Vision, Qwen-VL, and more.
- [2024/07] We added FP6 support on Intel GPU.
- [2024/06] We added experimental NPU support for Intel Core Ultra processors; see the examples here.
- [2024/06] We added extensive support of pipeline parallel inference, which makes it easy to run large-sized LLM using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running RAGFlow with ipex-llm on Intel GPU.
- [2024/05] ipex-llm now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- [2024/05] You can now easily run ipex-llm inference, serving and finetuning using the Docker images.
- [2024/05] You can now install ipex-llm on Windows using just "one command".
- [2024/04] You can now run Open WebUI on Intel GPU using ipex-llm; see the quickstart here.
- [2024/04] You can now run Llama 3 on Intel GPU using llama.cpp and ollama with ipex-llm; see the quickstart here.
- [2024/04] ipex-llm now supports Llama 3 on both Intel GPU and CPU.
- [2024/04] ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
- [2024/02] ipex-llm now supports directly loading models from ModelScope (魔搭).
- [2024/02] ipex-llm added initial INT2 support (based on llama.cpp IQ2 mechanism), which makes it possible to run large-sized LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use ipex-llm through Text-Generation-WebUI GUI.
- [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- [2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
- [2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- [2023/09] ipex-llm tutorial is released.
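The [2024/02] INT2 entry above says low-bit weights make it possible to run a model such as Mixtral-8x7B on a GPU with 16GB VRAM. A rough back-of-the-envelope check follows. It assumes about 46.7 billion parameters for Mixtral-8x7B (a figure not stated in this README) and counts weights only, ignoring the KV cache, activations and quantization metadata; `weight_memory_gib` is a hypothetical helper, not part of ipex-llm.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: roughly 46.7e9 parameters for Mixtral-8x7B (not from this README).
MIXTRAL_PARAMS = 46.7e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_memory_gib(MIXTRAL_PARAMS, bits):.1f} GiB")
```

Under these assumptions only the 2-bit row falls below 16GB. Real IQ2-style formats store slightly more than 2 bits per weight, so the estimate is best read as a lower bound.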
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
Intel Core Ultra iGPU | Intel Core Ultra NPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
---|---|---|---|
Ollama (Mistral-7B, Q4_K) | HuggingFace (Llama3.2-3B, SYM_INT4) | TextGeneration-WebUI (Llama3-8B, FP8) | llama.cpp (DeepSeek-R1-Distill-Qwen-32B, Q4_K) |
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the ipex-llm performance benchmark yourself.
Please see the Perplexity result below (tested on Wikitext dataset using the script here).
Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
---|---|---|---|---|---|---|
Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
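To make the table easier to read, the sketch below computes each format's perplexity change relative to the fp16 column for three of the rows above. The numbers are copied from the table; the helper name `relative_increase` is an illustrative choice, not part of ipex-llm.

```python
# Perplexity values copied from three rows of the table above.
PERPLEXITY = {
    "Llama-2-7B-chat-hf": {"sym_int4": 6.364, "q4_k": 6.218, "fp6": 6.092,
                           "fp8_e5m2": 6.180, "fp8_e4m3": 6.098, "fp16": 6.096},
    "Mistral-7B-Instruct-v0.2": {"sym_int4": 5.365, "q4_k": 5.320, "fp6": 5.270,
                                 "fp8_e5m2": 5.273, "fp8_e4m3": 5.246, "fp16": 5.244},
    "Llama-3.1-8B-Instruct": {"sym_int4": 6.705, "q4_k": 6.566, "fp6": 6.338,
                              "fp8_e5m2": 6.383, "fp8_e4m3": 6.325, "fp16": 6.267},
}


def relative_increase(row: dict, fmt: str, baseline: str = "fp16") -> float:
    """Percentage change in perplexity of `fmt` relative to `baseline`."""
    return (row[fmt] - row[baseline]) / row[baseline] * 100.0


for model, row in PERPLEXITY.items():
    deltas = ", ".join(
        f"{fmt}: {relative_increase(row, fmt):+.2f}%"
        for fmt in row
        if fmt != "fp16"
    )
    print(f"{model} -> {deltas}")
```

Lower perplexity is better, so a positive percentage means the format loses some quality relative to fp16 on this benchmark, and a value near zero means the two are effectively on par.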
- Ollama: running Ollama on Intel GPU without the need of manual installations
- llama.cpp: running llama.cpp on Intel GPU without the need of manual installations
- Arc B580: running ipex-llm on Intel Arc B580 GPU for Ollama, llama.cpp, PyTorch, HuggingFace, etc.
- NPU: running ipex-llm on Intel NPU in both Python/C++ or llama.cpp API
- PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using Python interface of ipex-llm) on Intel GPU for Windows and Linux
- vLLM: running ipex-llm in vLLM on both Intel GPU and CPU
- FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
- Serving on multiple Intel GPUs: running ipex-llm serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
- Axolotl: running ipex-llm in Axolotl for LLM finetuning
- Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU

- GPU Inference in C++: running llama.cpp, ollama, etc., with ipex-llm on Intel GPU
- GPU Inference in Python: running HuggingFace transformers, LangChain, LlamaIndex, ModelScope, etc. with ipex-llm on Intel GPU
- vLLM on GPU: running vLLM serving with ipex-llm on Intel GPU
- vLLM on CPU: running vLLM serving with ipex-llm on Intel CPU
- FastChat on GPU: running FastChat serving with ipex-llm on Intel GPU
- VSCode on GPU: running and developing ipex-llm applications in Python using VSCode on Intel GPU

- GraphRAG: running Microsoft's GraphRAG using local LLM with ipex-llm
- RAGFlow: running RAGFlow (an open-source RAG engine) with ipex-llm
- LangChain-Chatchat: running LangChain-Chatchat (Knowledge Base QA using RAG pipeline) with ipex-llm
- Coding copilot: running Continue (coding copilot in VSCode) with ipex-llm
- Open WebUI: running Open WebUI with ipex-llm
- PrivateGPT: running PrivateGPT to interact with documents with ipex-llm
- Dify platform: running ipex-llm in Dify (production-ready LLM app development platform)

- Windows GPU: installing ipex-llm on Windows with Intel GPU
- Linux GPU: installing ipex-llm on Linux with Intel GPU
- For more details, please refer to the full installation guide

- INT4 inference: INT4 LLM inference on Intel GPU and CPU
- FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
- INT8 inference: INT8 LLM inference on Intel GPU and CPU
- INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
- FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
- BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Low-bit models: saving and loading ipex-llm low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)
- GGUF: directly loading GGUF models into ipex-llm
- AWQ: directly loading AWQ models into ipex-llm
- GPTQ: directly loading GPTQ models into ipex-llm
- Tutorials
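As an illustration of the low-bit inference and save/load items above, here is a minimal sketch using the Python interface of ipex-llm. It assumes ipex-llm is installed and an Intel GPU is available as the `xpu` device; the helper names `load_and_save_low_bit` and `reload_low_bit` and their arguments are hypothetical, and the linked examples remain the authoritative reference.

```python
def load_and_save_low_bit(model_path: str, save_dir: str, low_bit: str = "sym_int4"):
    """Load a HuggingFace checkpoint with low-bit optimization, then save it.

    ipex-llm is imported lazily so this file can be imported without it.
    """
    from ipex_llm.transformers import AutoModelForCausalLM

    # load_in_low_bit selects the low-bit format (e.g. "sym_int4") at load time.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit=low_bit,
        trust_remote_code=True,
    )
    # Persist the already-converted weights so later loads skip the conversion.
    model.save_low_bit(save_dir)
    return model


def reload_low_bit(save_dir: str, device: str = "xpu"):
    """Reload a previously saved low-bit model and move it to the given device."""
    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.load_low_bit(save_dir)
    return model.to(device)
```

A call such as `load_and_save_low_bit("path/to/model", "path/to/low-bit-dir")` followed by `reload_low_bit("path/to/low-bit-dir")` would exercise both steps; both paths are placeholders.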
Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
Model | CPU Example | GPU Example | NPU Example |
---|---|---|---|
LLaMA | link1, link2 | link | |
LLaMA 2 | link1, link2 | link | Python link, C++ link |
LLaMA 3 | link | link | Python link, C++ link |
LLaMA 3.1 | link | link | |
LLaMA 3.2 | link | Python link, C++ link | |
LLaMA 3.2-Vision | link | ||
ChatGLM | link | ||
ChatGLM2 | link | link | |
ChatGLM3 | link | link | |
GLM-4 | link | link | |
GLM-4V | link | link | |
GLM-Edge | link | Python link | |
GLM-Edge-V | link | ||
Mistral | link | link | |
Mixtral | link | link | |
Falcon | link | link | |
MPT | link | link | |
Dolly-v1 | link | link | |
Dolly-v2 | link | link | |
Replit Code | link | link | |
RedPajama | link1, link2 | ||
Phoenix | link1, link2 | ||
StarCoder | link1, link2 | link | |
Baichuan | link | link | |
Baichuan2 | link | link | Python link |
InternLM | link | link | |
InternVL2 | link | ||
Qwen | link | link | |
Qwen1.5 | link | link | |
Qwen2 | link | link | Python link, C++ link |
Qwen2.5 | link | Python link, C++ link | |
Qwen-VL | link | link | |
Qwen2-VL | link | ||
Qwen2-Audio | link | ||
Aquila | link | link | |
Aquila2 | link | link | |
MOSS | link | ||
Whisper | link | link | |
Phi-1_5 | link | link | |
Flan-t5 | link | link | |
LLaVA | link | link | |
CodeLlama | link | link | |
Skywork | link | ||
InternLM-XComposer | link | ||
WizardCoder-Python | link | ||
CodeShell | link | ||
Fuyu | link | ||
Distil-Whisper | link | link | |
Yi | link | link | |
BlueLM | link | link | |
Mamba | link | link | |
SOLAR | link | link | |
Phixtral | link | link | |
InternLM2 | link | link | |
RWKV4 | link | ||
RWKV5 | link | ||
Bark | link | link | |
SpeechT5 | link | ||
DeepSeek-MoE | link | ||
Ziya-Coding-34B-v1.0 | link | ||
Phi-2 | link | link | |
Phi-3 | link | link | |
Phi-3-vision | link | link | |
Yuan2 | link | link | |
Gemma | link | link | |
Gemma2 | link | ||
DeciLM-7B | link | link | |
Deepseek | link | link | |
StableLM | link | link | |
CodeGemma | link | link | |
Command-R/cohere | link | link | |
CodeGeeX2 | link | link | |
MiniCPM | link | link | Python link, C++ link |
MiniCPM3 | link | ||
MiniCPM-V | link | ||
MiniCPM-V-2 | link | link | |
MiniCPM-Llama3-V-2_5 | link | Python link | |
MiniCPM-V-2_6 | link | link | Python link |
MiniCPM-o-2_6 | link | ||
Janus-Pro | link | ||
Moonlight | link | ||
StableDiffusion | link | ||
Bce-Embedding-Base-V1 | Python link | ||
Speech_Paraformer-Large | Python link |
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
Alternative AI tools for ipex-llm
Similar Open Source Tools

ipex-llm
IPEX-LLM is a PyTorch library for running Large Language Models (LLMs) on Intel CPUs and GPUs with very low latency. It provides seamless integration with various LLM frameworks and tools, including llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, and more. IPEX-LLM has been optimized and verified on over 50 LLM models, including LLaMA, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, and RWKV. It supports a range of low-bit inference formats, including INT4, FP8, FP4, INT8, INT2, FP16, and BF16, as well as finetuning capabilities for LoRA, QLoRA, DPO, QA-LoRA, and ReLoRA. IPEX-LLM is actively maintained and updated with new features and optimizations, making it a valuable tool for researchers, developers, and anyone interested in exploring and utilizing LLMs.

agentica
Agentica is a human-centric framework for building large language model agents. It provides functionalities for planning, memory management, tool usage, and supports features like reflection, planning and execution, RAG, multi-agent, multi-role, and workflow. The tool allows users to quickly code and orchestrate agents, customize prompts, and make API calls to various services. It supports API calls to OpenAI, Azure, Deepseek, Moonshot, Claude, Ollama, and Together. Agentica aims to simplify the process of building AI agents by providing a user-friendly interface and a range of functionalities for agent development.

vlmrun-cookbook
VLM Run Cookbook is a repository containing practical examples and tutorials for extracting structured data from images, videos, and documents using Vision Language Models (VLMs). It offers comprehensive Colab notebooks demonstrating real-world applications of VLM Run, with complete code and documentation for easy adaptation. The examples cover various domains such as financial documents and TV news analysis.

DownEdit
DownEdit is a fast and powerful program for downloading and editing videos from platforms like TikTok, Douyin, and Kuaishou. It allows users to effortlessly grab videos, make bulk edits, and utilize advanced AI features for generating videos, images, and sounds in bulk. The tool offers features like video, photo, and sound editing, downloading videos without watermarks, bulk AI generation, and AI editing for content enhancement.

DownEdit
DownEdit is a powerful program that allows you to download videos from various social media platforms such as TikTok, Douyin, Kuaishou, and more. With DownEdit, you can easily download videos from user profiles and edit them in bulk. You have the option to flip the videos horizontally or vertically throughout the entire directory with just a single click. Stay tuned for more exciting features coming soon!

ASTRA.ai
Astra.ai is a multimodal agent powered by TEN, showcasing its capabilities in speech, vision, and reasoning through RAG from local documentation. It provides a platform for developing AI agents with features like RTC transportation, extension store, workflow builder, and local deployment. Users can build and test agents locally using Docker and Node.js, with prerequisites including Agora App ID, Azure's speech-to-text and text-to-speech API keys, and OpenAI API key. The platform offers advanced customization options through config files and API keys setup, enabling users to create and deploy their AI agents for various tasks.

PocketFlow
Pocket Flow is a 100-line minimalist LLM framework designed for (Multi-)Agents, Workflow, RAG, etc. It provides a core abstraction for LLM projects by focusing on computation and communication through a graph structure and shared store. The framework aims to support the development of LLM Agents, such as Cursor AI, by offering a minimal and low-level approach that is well-suited for understanding and usage. Users can install Pocket Flow via pip or by copying the source code, and detailed documentation is available on the project website.

bytedesk
Bytedesk is an AI-powered customer service and team instant messaging tool that offers features like enterprise instant messaging, online customer service, large model AI assistant, and local area network file transfer. It supports multi-level organizational structure, role management, permission management, chat record management, seating workbench, work order system, seat management, data dashboard, manual knowledge base, skill group management, real-time monitoring, announcements, sensitive words, CRM, report function, and integrated customer service workbench services. The tool is designed for team use with easy configuration throughout the company, and it allows file transfer across platforms using WiFi/hotspots without the need for internet connection.

cursor-free-vip
Cursor Free VIP is a tool designed to automatically bypass Cursor's membership check, upgrade to 'pro' membership, support Windows and macOS systems, send Token requests in real-time, and reset Cursor's configuration. It provides a seamless experience for users to access premium features without the need for manual upgrades or configuration changes. The tool aims to simplify the process of accessing advanced functionalities offered by Cursor, enhancing user experience and productivity.

PureChat
PureChat is a chat application integrated with ChatGPT. Its features include builds powered by Vite5, screenshot generation and copying of chat records, an IM instant messaging SDK for sessions, automatic light and dark mode switching based on the system theme, Markdown rendering with code highlighting and link recognition, GitHub quick login, integration of large language models such as ChatGPT and Ollama with streaming output, preset prompts and context, and Electron desktop app versions for macOS and Windows; more features are under development. Environment setup requires Node.js 18.20+. Clone the code with 'git clone https://github.com/Hyk260/PureChat.git', install dependencies with 'pnpm install', start the project with 'pnpm dev', and build with 'pnpm build'.

build_MiniLLM_from_scratch
This repository aims to build a low-parameter LLM model through pretraining, fine-tuning, model rewarding, and reinforcement learning stages to create a chat model capable of simple conversation tasks. It features using the bert4torch training framework, seamless integration with transformers package for inference, optimized file reading during training to reduce memory usage, providing complete training logs for reproducibility, and the ability to customize robot attributes. The chat model supports multi-turn conversations. The trained model currently only supports basic chat functionality due to limitations in corpus size, model scale, SFT corpus size, and quality.

helicone
Helicone is an open-source observability platform designed for Large Language Models (LLMs). It logs requests to OpenAI in a user-friendly UI, offers caching, rate limits, and retries, tracks costs and latencies, provides a playground for iterating on prompts and chat conversations, supports collaboration, and will soon have APIs for feedback and evaluation. The platform is deployed on Cloudflare and consists of services like Web (NextJs), Worker (Cloudflare Workers), Jawn (Express), Supabase, and ClickHouse. Users can interact with Helicone locally by setting up the required services and environment variables. The platform encourages contributions and provides resources for learning, documentation, and integrations.

Step-DPO
Step-DPO is a method for enhancing long-chain reasoning ability of LLMs with a data construction pipeline creating a high-quality dataset. It significantly improves performance on math and GSM8K tasks with minimal data and training steps. The tool fine-tunes pre-trained models like Qwen2-7B-Instruct with Step-DPO, achieving superior results compared to other models. It provides scripts for training, evaluation, and deployment, along with examples and acknowledgements.

HivisionIDPhotos
HivisionIDPhoto is a practical algorithm for intelligent ID photo creation. It utilizes a comprehensive model workflow to recognize, cut out, and generate ID photos for various user photo scenarios. The tool offers lightweight cutting, standard ID photo generation based on different size specifications, six-inch layout photo generation, beauty enhancement (waiting), and intelligent outfit swapping (waiting). It aims to solve emergency ID photo creation issues.

pr-agent
PR-Agent is a tool designed to assist in efficiently reviewing and handling pull requests by providing AI feedback and suggestions. It offers various tools such as Review, Describe, Improve, Ask, Update CHANGELOG, and more, with the ability to run them via different interfaces like CLI, PR Comments, or automatically triggering them when a new PR is opened. The tool supports multiple git platforms and models, emphasizing real-life practical usage and modular, customizable tools.