llm-scaler
Stars: 150
LLM Scaler is a GenAI solution for text, image, and video generation running on Intel® Arc™ Pro B60 GPUs. It leverages standard frameworks such as vLLM, ComfyUI, SGLang Diffusion, and Xinference, ensuring optimal performance for state-of-the-art GenAI models on Arc Pro B60 GPUs.
README:
LLM Scaler is a GenAI solution for text generation, image generation, video generation, and more, running on Intel® Arc™ Pro B60 GPUs. LLM Scaler leverages standard frameworks such as vLLM, ComfyUI, SGLang Diffusion, and Xinference, and ensures the best performance for state-of-the-art GenAI models running on Arc Pro B60 GPUs.
- [2026.01] We released `intel/llm-scaler-vllm:1.3` (or `intel/llm-scaler-vllm:0.11.1-b7`) with vLLM 0.11.1 and PyTorch 2.9 support, support for various new models, and performance improvements.
- [2026.01] We released `intel/llm-scaler-omni:0.1.0-b5` with Python 3.12 and PyTorch 2.9 support, various ComfyUI workflows, and more SGLang Diffusion support.
- [2025.12] We released `intel/llm-scaler-vllm:1.2`, the same image as `intel/llm-scaler-vllm:0.10.2-b6`.
- [2025.12] We released `intel/llm-scaler-omni:0.1.0-b4` to support ComfyUI workflows for Z-Image-Turbo and Hunyuan-Video-1.5 T2V/I2V with multi-XPU, and experimental SGLang Diffusion support.
- [2025.11] We released `intel/llm-scaler-vllm:0.10.2-b6` to support Qwen3-VL (Dense/MoE), Qwen3-Omni, Qwen3-30B-A3B (MoE Int4), MinerU 2.5, ERNIE-4.5-VL, etc.
- [2025.11] We released `intel/llm-scaler-vllm:0.10.2-b5` to support gpt-oss models, and `intel/llm-scaler-omni:0.1.0-b3` to support more ComfyUI workflows and Windows installation.
- [2025.10] We released `intel/llm-scaler-omni:0.1.0-b2` to support more models with ComfyUI workflows and Xinference.
- [2025.09] We released `intel/llm-scaler-vllm:0.10.0-b3` to support more models (MinerU, MiniCPM-V-4.5, etc.), and `intel/llm-scaler-omni:0.1.0-b1` to enable the first omni GenAI models using ComfyUI and Xinference on Arc Pro B60 GPUs.
- [2025.08] We released `intel/llm-scaler-vllm:1.0`.
llm-scaler-vllm supports running text generation models using vLLM, featuring:
- CCL support (P2P or USM)
- INT4 and FP8 quantized online serving
- Embedding and Reranker model support
- Multi-Modal model support
- Omni model support
- Tensor Parallel, Pipeline Parallel and Data Parallel
- Finding maximum Context Length
- Multi-Modal WebUI
- BPE-Qwen tokenizer
Please follow the instructions in the Getting Started guide to use llm-scaler-vllm.
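Since llm-scaler-vllm is built on vLLM, a running container exposes vLLM's OpenAI-compatible HTTP API for online serving. Below is a minimal client sketch; the port (8000) and served model name are assumptions to adjust for your deployment.

```python
# Minimal sketch: query an llm-scaler-vllm container through vLLM's
# OpenAI-compatible API. Assumes the server is reachable at localhost:8000
# and was launched with Qwen/Qwen3-8B; adjust both for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="none",                       # vLLM ignores the key unless configured
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What does tensor parallelism do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The same client works for any of the language or multimodal models in the table below, once the corresponding server is launched.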
| Category | Model Name | FP16 | Dynamic Online FP8 | Dynamic Online Int4 | MXFP4 | Notes |
|---|---|---|---|---|---|---|
| Language Model | openai/gpt-oss-20b | | | | ✅ | |
| Language Model | openai/gpt-oss-120b | | | | ✅ | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-V2-Lite | ✅ | ✅ | | | `export VLLM_MLA_DISABLE=1` |
| Language Model | deepseek-ai/deepseek-coder-33b-instruct | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/Qwen3-8B | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/Qwen3-14B | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/Qwen3-32B | ✅ | ✅ | ✅ | ||
| Language MOE Model | Qwen/Qwen3-30B-A3B | ✅ | ✅ | ✅ | ||
| Language MOE Model | Qwen/Qwen3-235B-A22B | ✅ | ||||
| Language MOE Model | Qwen/Qwen3-Coder-30B-A3B-Instruct | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/QwQ-32B | ✅ | ✅ | ✅ | ||
| Language Model | mistralai/Ministral-8B-Instruct-2410 | ✅ | ✅ | ✅ | ||
| Language Model | mistralai/Mixtral-8x7B-Instruct-v0.1 | ✅ | ✅ | ✅ | ||
| Language Model | meta-llama/Llama-3.1-8B | ✅ | ✅ | ✅ | ||
| Language Model | meta-llama/Llama-3.1-70B | ✅ | ✅ | ✅ | ||
| Language Model | baichuan-inc/Baichuan2-7B-Chat | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | baichuan-inc/Baichuan2-13B-Chat | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | THUDM/CodeGeex4-All-9B | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | zai-org/GLM-4-9B-0414 | ✅ | | | | use bfloat16 |
| Language Model | zai-org/GLM-4-32B-0414 | ✅ | | | | use bfloat16 |
| Language MOE Model | zai-org/GLM-4.5-Air | ✅ | ✅ | |||
| Language Model | ByteDance-Seed/Seed-OSS-36B-Instruct | ✅ | ✅ | ✅ | ||
| Language Model | miromind-ai/MiroThinker-v1.5-30B | ✅ | ✅ | ✅ | ||
| Language Model | tencent/Hunyuan-0.5B-Instruct | ✅ | ✅ | ✅ | | follow the guide here |
| Language Model | tencent/Hunyuan-7B-Instruct | ✅ | ✅ | ✅ | | follow the guide here |
| Multimodal Model | Qwen/Qwen2-VL-7B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen2.5-VL-7B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen2.5-VL-32B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen2.5-VL-72B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen3-VL-4B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen3-VL-8B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal MOE Model | Qwen/Qwen3-VL-30B-A3B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | openbmb/MiniCPM-V-2_6 | ✅ | ✅ | ✅ | ||
| Multimodal Model | openbmb/MiniCPM-V-4 | ✅ | ✅ | ✅ | ||
| Multimodal Model | openbmb/MiniCPM-V-4_5 | ✅ | ✅ | ✅ | ||
| Multimodal Model | OpenGVLab/InternVL2-8B | ✅ | ✅ | ✅ | ||
| Multimodal Model | OpenGVLab/InternVL3-8B | ✅ | ✅ | ✅ | ||
| Multimodal Model | OpenGVLab/InternVL3_5-8B | ✅ | ✅ | ✅ | ||
| Multimodal MOE Model | OpenGVLab/InternVL3_5-30B-A3B | ✅ | ✅ | ✅ | ||
| Multimodal Model | rednote-hilab/dots.ocr | ✅ | ✅ | ✅ | ||
| Multimodal Model | ByteDance-Seed/UI-TARS-7B-DPO | ✅ | ✅ | ✅ | ||
| Multimodal Model | google/gemma-3-12b-it | ✅ | | | | use bfloat16 |
| Multimodal Model | google/gemma-3-27b-it | ✅ | | | | use bfloat16 |
| Multimodal Model | THUDM/GLM-4v-9B | ✅ | ✅ | ✅ | | with `--hf-overrides` and chat_template |
| Multimodal Model | zai-org/GLM-4.1V-9B-Base | ✅ | ✅ | ✅ | ||
| Multimodal Model | zai-org/GLM-4.1V-9B-Thinking | ✅ | ✅ | ✅ | ||
| Multimodal Model | zai-org/Glyph | ✅ | ✅ | ✅ | ||
| Multimodal Model | opendatalab/MinerU2.5-2509-1.2B | ✅ | ✅ | ✅ | ||
| Multimodal Model | baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ✅ | ✅ | ✅ | ||
| Multimodal Model | zai-org/GLM-4.6V-Flash | ✅ | ✅ | ✅ | | `pip install transformers==5.0.0rc0` first |
| Multimodal Model | PaddlePaddle/PaddleOCR-VL | ✅ | ✅ | ✅ | | follow the guide here |
| Multimodal Model | deepseek-ai/DeepSeek-OCR | ✅ | ✅ | ✅ | ||
| Multimodal Model | moonshotai/Kimi-VL-A3B-Thinking-2506 | ✅ | ✅ | ✅ | ||
| omni | Qwen/Qwen2.5-Omni-7B | ✅ | ✅ | ✅ | ||
| omni | Qwen/Qwen3-Omni-30B-A3B-Instruct | ✅ | ✅ | ✅ | ||
| audio | openai/whisper-medium | ✅ | ✅ | ✅ | ||
| audio | openai/whisper-large-v3 | ✅ | ✅ | ✅ | ||
| Embedding Model | Qwen/Qwen3-Embedding-8B | ✅ | ✅ | ✅ | ||
| Embedding Model | BAAI/bge-m3 | ✅ | ✅ | ✅ | ||
| Embedding Model | BAAI/bge-large-en-v1.5 | ✅ | ✅ | ✅ | ||
| Reranker Model | Qwen/Qwen3-Reranker-8B | ✅ | ✅ | ✅ | ||
| Reranker Model | BAAI/bge-reranker-large | ✅ | ✅ | ✅ | ||
| Reranker Model | BAAI/bge-reranker-v2-m3 | ✅ | ✅ | ✅ |
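For the embedding and reranker models above, the vLLM server also exposes an embeddings route alongside chat completions, and a rerank route for reranker models. A hedged sketch follows; the port, model names, and the exact rerank path/payload are assumptions to verify against the Getting Started guide for your image version (each model is served by its own instance).

```python
# Sketch: embeddings and reranking against an llm-scaler-vllm server.
# Port 8000 and the model identifiers are placeholders for your deployment.
from openai import OpenAI
import requests

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Embeddings via the OpenAI-compatible /v1/embeddings route.
emb = client.embeddings.create(
    model="BAAI/bge-m3",
    input=["What is llm-scaler?", "Intel Arc Pro B60 GPU"],
)
print(len(emb.data), "vectors, dim", len(emb.data[0].embedding))

# Reranking: vLLM provides a rerank route for reranker models; the path and
# payload shown here are an assumption to verify for your version.
r = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is llm-scaler?",
        "documents": ["GenAI serving on Intel Arc GPUs", "A pasta recipe"],
    },
)
print(r.json())
```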
llm-scaler-omni supports image, voice, and video generation, among other tasks, featuring Omni Studio mode (using ComfyUI) and Omni Serving mode (via SGLang Diffusion or Xinference).
Please follow the instructions in the Getting Started guide to use llm-scaler-omni.
Demos: Qwen-Image text-to-image and multi-B60 Wan2.2-T2V-14B video generation.
Omni Studio supports Image Generation/Editing, Video Generation, Audio Generation, 3D Generation, etc.
| Model Category | Model | Type |
|---|---|---|
| Image Generation | Qwen-Image, Qwen-Image-Edit | Text-to-Image, Image Editing |
| Image Generation | Stable Diffusion 3.5 | Text-to-Image, ControlNet |
| Image Generation | Z-Image-Turbo | Text-to-Image |
| Image Generation | Flux.1, Flux.1 Kontext dev | Text-to-Image, Multi-Image Reference, ControlNet |
| Video Generation | Wan2.2 TI2V 5B, Wan2.2 T2V 14B, Wan2.2 I2V 14B | Text-to-Video, Image-to-Video |
| Video Generation | Wan2.2 Animate 14B | Video Animation |
| Video Generation | HunyuanVideo 1.5 8.3B | Text-to-Video, Image-to-Video |
| Video Generation | LTX-2 | Text-to-Video, Image-to-Video |
| 3D Generation | Hunyuan3D 2.1 | Text/Image-to-3D |
| Audio Generation | VoxCPM1.5 | Text-to-Speech |
Please check ComfyUI Support for more details.
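Omni Studio workflows are normally driven from the ComfyUI web interface, but ComfyUI also accepts workflows over its HTTP API, which is handy for scripting batch generation. A rough sketch, assuming ComfyUI listens on its default port 8188 and `workflow.json` was exported via ComfyUI's "Save (API Format)" option:

```python
# Sketch: submit an exported ComfyUI workflow to a running Omni Studio
# (ComfyUI) instance. The port 8188 and the workflow file are assumptions
# to adapt to your setup.
import json
import requests

with open("workflow.json") as f:       # exported via "Save (API Format)"
    workflow = json.load(f)

resp = requests.post("http://localhost:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print("queued prompt:", resp.json()["prompt_id"])
```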
Omni Serving supports Image Generation, Audio Generation etc.
- Image Generation (`/v1/images/generations`): Stable Diffusion 3.5, Flux.1-dev
- Text to Speech (`/v1/audio/speech`): Kokoro 82M
- Speech to Text (`/v1/audio/transcriptions`): whisper-large-v3
Please check Xinference Support for more details.
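Because the serving routes above follow the OpenAI API paths, the standard OpenAI client can target them. A minimal sketch; the base URL, port, and model identifiers below are placeholders, so take the real values from your Xinference / SGLang Diffusion configuration.

```python
# Sketch: call Omni Serving's OpenAI-style endpoints. The port and model
# identifiers are placeholders for your deployment's configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="none")

# Image generation (/v1/images/generations)
img = client.images.generate(
    model="stable-diffusion-3.5",
    prompt="a lighthouse at dawn",
)
print(img.data[0].url or "received base64-encoded image")

# Text to speech (/v1/audio/speech); voice name is a placeholder.
speech = client.audio.speech.create(
    model="Kokoro-82M",
    voice="default",
    input="Hello from llm-scaler.",
)
with open("hello.mp3", "wb") as f:
    f.write(speech.content)
```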
- Please check out the Docker image releases for llm-scaler-vllm and llm-scaler-omni
- Please report a bug or raise a feature request by opening a GitHub Issue
For similar tasks
generative-ai
This repository contains notebooks, code samples, sample apps, and other resources that demonstrate how to use, develop and manage generative AI workflows using Generative AI on Google Cloud, powered by Vertex AI. For more Vertex AI samples, please visit the Vertex AI samples Github repository.
AISuperDomain
Aila Desktop Application is a powerful tool that integrates multiple leading AI models into a single desktop application. It allows users to interact with various AI models simultaneously, providing diverse responses and insights to their inquiries. With its user-friendly interface and customizable features, Aila empowers users to engage with AI seamlessly and efficiently. Whether you're a researcher, student, or professional, Aila can enhance your AI interactions and streamline your workflow.
generative-ai-for-beginners
This course has 18 lessons. Each lesson covers its own topic, so start wherever you like! Lessons are labeled either "Learn" lessons, explaining a Generative AI concept, or "Build" lessons, explaining a concept with code examples in both **Python** and **TypeScript** when possible. Each lesson also includes a "Keep Learning" section with additional learning tools.

**What You Need**

* Access to the Azure OpenAI Service **OR** OpenAI API (only required to complete coding lessons)
* Basic knowledge of Python or TypeScript is helpful (for absolute beginners, check out these Python and TypeScript courses)
* A GitHub account to fork this entire repo to your own GitHub account

We have created a **Course Setup** lesson to help you with setting up your development environment. Don't forget to star (🌟) this repo to find it easier later.

## 🧠 Ready to Deploy?

If you are looking for more advanced code samples, check out our collection of Generative AI Code Samples in both **Python** and **TypeScript**.

## 🗣️ Meet Other Learners, Get Support

Join our official AI Discord server to meet and network with other learners taking this course and get support.

## 🚀 Building a Startup?

Sign up for Microsoft for Startups Founders Hub to receive **free OpenAI credits** and up to **$150k towards Azure credits to access OpenAI models through Azure OpenAI Services**.

## 🙏 Want to help?

Do you have suggestions, or have you found spelling or code errors? Raise an issue or create a pull request.

## 📂 Each lesson includes:

* A short video introduction to the topic
* A written lesson located in the README
* Python and TypeScript code samples supporting Azure OpenAI and OpenAI API
* Links to extra resources to continue your learning

## 🗃️ Lessons

| | Lesson Link | Description | Additional Learning |
| :-: | :-: | :-- | :-- |
| 00 | Course Setup | **Learn:** How to Setup Your Development Environment | Learn More |
| 01 | Introduction to Generative AI and LLMs | **Learn:** Understanding what Generative AI is and how Large Language Models (LLMs) work. | Learn More |
| 02 | Exploring and comparing different LLMs | **Learn:** How to select the right model for your use case | Learn More |
| 03 | Using Generative AI Responsibly | **Learn:** How to build Generative AI Applications responsibly | Learn More |
| 04 | Understanding Prompt Engineering Fundamentals | **Learn:** Hands-on Prompt Engineering Best Practices | Learn More |
| 05 | Creating Advanced Prompts | **Learn:** How to apply prompt engineering techniques that improve the outcome of your prompts. | Learn More |
| 06 | Building Text Generation Applications | **Build:** A text generation app using Azure OpenAI | Learn More |
| 07 | Building Chat Applications | **Build:** Techniques for efficiently building and integrating chat applications. | Learn More |
| 08 | Building Search Apps Vector Databases | **Build:** A search application that uses Embeddings to search for data. | Learn More |
| 09 | Building Image Generation Applications | **Build:** An image generation application | Learn More |
| 10 | Building Low Code AI Applications | **Build:** A Generative AI application using Low Code tools | Learn More |
| 11 | Integrating External Applications with Function Calling | **Build:** What is function calling and its use cases for applications | Learn More |
| 12 | Designing UX for AI Applications | **Learn:** How to apply UX design principles when developing Generative AI Applications | Learn More |
| 13 | Securing Your Generative AI Applications | **Learn:** The threats and risks to AI systems and methods to secure these systems. | Learn More |
| 14 | The Generative AI Application Lifecycle | **Learn:** The tools and metrics to manage the LLM Lifecycle and LLMOps | Learn More |
| 15 | Retrieval Augmented Generation (RAG) and Vector Databases | **Build:** An application using a RAG Framework to retrieve embeddings from a Vector Database | Learn More |
| 16 | Open Source Models and Hugging Face | **Build:** An application using open source models available on Hugging Face | Learn More |
| 17 | AI Agents | **Build:** An application using an AI Agent Framework | Learn More |
| 18 | Fine-Tuning LLMs | **Learn:** The what, why and how of fine-tuning LLMs | Learn More |
cog-comfyui
Cog-comfyui allows users to run ComfyUI workflows on Replicate. ComfyUI is a visual programming tool for creating and sharing generative art workflows. With cog-comfyui, users can access a variety of pre-trained models and custom nodes to create their own unique artworks. The tool is easy to use and does not require any coding experience. Users simply need to upload their API JSON file and any necessary input files, and then click the "Run" button. Cog-comfyui will then generate the output image or video file.
ai-notes
Notes on AI state of the art, with a focus on generative and large language models. These are the "raw materials" for the https://lspace.swyx.io/ newsletter. This repo used to be called https://github.com/sw-yx/prompt-eng, but was renamed because Prompt Engineering is Overhyped. This is now an AI Engineering notes repo.
llms-with-matlab
This repository contains example code to demonstrate how to connect MATLAB to the OpenAI™ Chat Completions API (which powers ChatGPT™) as well as OpenAI Images API (which powers DALL·E™). This allows you to leverage the natural language processing capabilities of large language models directly within your MATLAB environment.
xef
xef.ai is a one-stop library designed to bring the power of modern AI to applications and services. It offers integration with Large Language Models (LLM), image generation, and other AI services. The library is packaged in two layers: core libraries for basic AI services integration and integrations with other libraries. xef.ai aims to simplify the transition to modern AI for developers by providing an idiomatic interface, currently supporting Kotlin. Inspired by LangChain and Hugging Face, xef.ai may transmit source code and user input data to third-party services, so users should review privacy policies and take precautions. Libraries are available in Maven Central under the `com.xebia` group, with `xef-core` as the core library. Developers can add these libraries to their projects and explore examples to understand usage.
CushyStudio
CushyStudio is a generative AI platform designed for creatives of any level to effortlessly create stunning images, videos, and 3D models. It offers CushyApps, a collection of visual tools tailored for different artistic tasks, and CushyKit, an extensive toolkit for custom apps development and task automation. Users can dive into the AI revolution, unleash their creativity, share projects, and connect with a vibrant community. The platform aims to simplify the AI art creation process and provide a user-friendly environment for designing interfaces, adding custom logic, and accessing various tools.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
* Self-contained, with no need for a DBMS or cloud service.
* OpenAPI interface, easy to integrate with existing infrastructure (e.g., Cloud IDE).
* Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.