llm-scaler
Stars: 150
LLM Scaler is a GenAI solution for text, image, and video generation running on Intel® Arc™ Pro B60 GPUs. It leverages standard frameworks such as vLLM, ComfyUI, SGLang Diffusion, and Xinference, ensuring optimal performance for state-of-the-art GenAI models on Arc Pro B60 GPUs.
README:
LLM Scaler is a GenAI solution for text generation, image generation, video generation, and more, running on Intel® Arc™ Pro B60 GPUs. LLM Scaler leverages standard frameworks such as vLLM, ComfyUI, SGLang Diffusion, and Xinference, and ensures the best performance for state-of-the-art GenAI models running on Arc Pro B60 GPUs.
- [2026.01] We released `intel/llm-scaler-vllm:1.3` (or `intel/llm-scaler-vllm:0.11.1-b7`) with vLLM 0.11.1 and PyTorch 2.9 support, support for various new models, and performance improvements.
- [2026.01] We released `intel/llm-scaler-omni:0.1.0-b5` with Python 3.12 and PyTorch 2.9 support, various ComfyUI workflows, and more SGLang Diffusion support.
- [2025.12] We released `intel/llm-scaler-vllm:1.2`, the same image as `intel/llm-scaler-vllm:0.10.2-b6`.
- [2025.12] We released `intel/llm-scaler-omni:0.1.0-b4` to support ComfyUI workflows for Z-Image-Turbo and Hunyuan-Video-1.5 T2V/I2V with multi-XPU, and experimental SGLang Diffusion support.
- [2025.11] We released `intel/llm-scaler-vllm:0.10.2-b6` to support Qwen3-VL (Dense/MoE), Qwen3-Omni, Qwen3-30B-A3B (MoE Int4), MinerU 2.5, ERNIE-4.5-VL, etc.
- [2025.11] We released `intel/llm-scaler-vllm:0.10.2-b5` to support gpt-oss models, and `intel/llm-scaler-omni:0.1.0-b3` to support more ComfyUI workflows and Windows installation.
- [2025.10] We released `intel/llm-scaler-omni:0.1.0-b2` to support more models with ComfyUI workflows and Xinference.
- [2025.09] We released `intel/llm-scaler-vllm:0.10.0-b3` to support more models (MinerU, MiniCPM-V-4.5, etc.), and `intel/llm-scaler-omni:0.1.0-b1` to enable the first omni GenAI models using ComfyUI and Xinference on Arc Pro B60 GPUs.
- [2025.08] We released `intel/llm-scaler-vllm:1.0`.
llm-scaler-vllm supports running text generation models using vLLM, featuring:
- CCL support (P2P or USM)
- INT4 and FP8 quantized online serving
- Embedding and Reranker model support
- Multi-Modal model support
- Omni model support
- Tensor Parallel, Pipeline Parallel and Data Parallel
- Finding maximum Context Length
- Multi-Modal WebUI
- BPE-Qwen tokenizer
Please follow the instructions in the Getting Started guide to use llm-scaler-vllm.
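Since llm-scaler-vllm is built on vLLM, a running container exposes vLLM's OpenAI-compatible HTTP API for online serving. Below is a minimal client sketch; the port (8000) and served model name are assumptions to adjust for your deployment.

```python
# Minimal sketch: query an llm-scaler-vllm container through vLLM's
# OpenAI-compatible API. Assumes the server is reachable at localhost:8000
# and was launched with Qwen/Qwen3-8B; adjust both for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="none",                       # vLLM ignores the key unless configured
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What does tensor parallelism do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The same client works for any of the language or multimodal models in the table below, once the corresponding server is launched.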
| Category | Model Name | FP16 | Dynamic Online FP8 | Dynamic Online Int4 | MXFP4 | Notes |
|---|---|---|---|---|---|---|
| Language Model | openai/gpt-oss-20b | | | | ✅ | |
| Language Model | openai/gpt-oss-120b | | | | ✅ | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | ✅ | ✅ | ✅ | ||
| Language Model | deepseek-ai/DeepSeek-V2-Lite | ✅ | ✅ | | | `export VLLM_MLA_DISABLE=1` |
| Language Model | deepseek-ai/deepseek-coder-33b-instruct | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/Qwen3-8B | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/Qwen3-14B | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/Qwen3-32B | ✅ | ✅ | ✅ | ||
| Language MOE Model | Qwen/Qwen3-30B-A3B | ✅ | ✅ | ✅ | ||
| Language MOE Model | Qwen/Qwen3-235B-A22B | ✅ | ||||
| Language MOE Model | Qwen/Qwen3-Coder-30B-A3B-Instruct | ✅ | ✅ | ✅ | ||
| Language Model | Qwen/QwQ-32B | ✅ | ✅ | ✅ | ||
| Language Model | mistralai/Ministral-8B-Instruct-2410 | ✅ | ✅ | ✅ | ||
| Language Model | mistralai/Mixtral-8x7B-Instruct-v0.1 | ✅ | ✅ | ✅ | ||
| Language Model | meta-llama/Llama-3.1-8B | ✅ | ✅ | ✅ | ||
| Language Model | meta-llama/Llama-3.1-70B | ✅ | ✅ | ✅ | ||
| Language Model | baichuan-inc/Baichuan2-7B-Chat | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | baichuan-inc/Baichuan2-13B-Chat | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | THUDM/CodeGeex4-All-9B | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | zai-org/GLM-4-9B-0414 | ✅ | | | | use bfloat16 |
| Language Model | zai-org/GLM-4-32B-0414 | ✅ | | | | use bfloat16 |
| Language MOE Model | zai-org/GLM-4.5-Air | ✅ | ✅ | |||
| Language Model | ByteDance-Seed/Seed-OSS-36B-Instruct | ✅ | ✅ | ✅ | ||
| Language Model | miromind-ai/MiroThinker-v1.5-30B | ✅ | ✅ | ✅ | ||
| Language Model | tencent/Hunyuan-0.5B-Instruct | ✅ | ✅ | ✅ | | follow the guide here |
| Language Model | tencent/Hunyuan-7B-Instruct | ✅ | ✅ | ✅ | | follow the guide here |
| Multimodal Model | Qwen/Qwen2-VL-7B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen2.5-VL-7B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen2.5-VL-32B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen2.5-VL-72B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen3-VL-4B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | Qwen/Qwen3-VL-8B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal MOE Model | Qwen/Qwen3-VL-30B-A3B-Instruct | ✅ | ✅ | ✅ | ||
| Multimodal Model | openbmb/MiniCPM-V-2_6 | ✅ | ✅ | ✅ | ||
| Multimodal Model | openbmb/MiniCPM-V-4 | ✅ | ✅ | ✅ | ||
| Multimodal Model | openbmb/MiniCPM-V-4_5 | ✅ | ✅ | ✅ | ||
| Multimodal Model | OpenGVLab/InternVL2-8B | ✅ | ✅ | ✅ | ||
| Multimodal Model | OpenGVLab/InternVL3-8B | ✅ | ✅ | ✅ | ||
| Multimodal Model | OpenGVLab/InternVL3_5-8B | ✅ | ✅ | ✅ | ||
| Multimodal MOE Model | OpenGVLab/InternVL3_5-30B-A3B | ✅ | ✅ | ✅ | ||
| Multimodal Model | rednote-hilab/dots.ocr | ✅ | ✅ | ✅ | ||
| Multimodal Model | ByteDance-Seed/UI-TARS-7B-DPO | ✅ | ✅ | ✅ | ||
| Multimodal Model | google/gemma-3-12b-it | ✅ | | | | use bfloat16 |
| Multimodal Model | google/gemma-3-27b-it | ✅ | | | | use bfloat16 |
| Multimodal Model | THUDM/GLM-4v-9B | ✅ | ✅ | ✅ | | with `--hf-overrides` and chat_template |
| Multimodal Model | zai-org/GLM-4.1V-9B-Base | ✅ | ✅ | ✅ | ||
| Multimodal Model | zai-org/GLM-4.1V-9B-Thinking | ✅ | ✅ | ✅ | ||
| Multimodal Model | zai-org/Glyph | ✅ | ✅ | ✅ | ||
| Multimodal Model | opendatalab/MinerU2.5-2509-1.2B | ✅ | ✅ | ✅ | ||
| Multimodal Model | baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ✅ | ✅ | ✅ | ||
| Multimodal Model | zai-org/GLM-4.6V-Flash | ✅ | ✅ | ✅ | | `pip install transformers==5.0.0rc0` first |
| Multimodal Model | PaddlePaddle/PaddleOCR-VL | ✅ | ✅ | ✅ | | follow the guide here |
| Multimodal Model | deepseek-ai/DeepSeek-OCR | ✅ | ✅ | ✅ | ||
| Multimodal Model | moonshotai/Kimi-VL-A3B-Thinking-2506 | ✅ | ✅ | ✅ | ||
| omni | Qwen/Qwen2.5-Omni-7B | ✅ | ✅ | ✅ | ||
| omni | Qwen/Qwen3-Omni-30B-A3B-Instruct | ✅ | ✅ | ✅ | ||
| audio | openai/whisper-medium | ✅ | ✅ | ✅ | ||
| audio | openai/whisper-large-v3 | ✅ | ✅ | ✅ | ||
| Embedding Model | Qwen/Qwen3-Embedding-8B | ✅ | ✅ | ✅ | ||
| Embedding Model | BAAI/bge-m3 | ✅ | ✅ | ✅ | ||
| Embedding Model | BAAI/bge-large-en-v1.5 | ✅ | ✅ | ✅ | ||
| Reranker Model | Qwen/Qwen3-Reranker-8B | ✅ | ✅ | ✅ | ||
| Reranker Model | BAAI/bge-reranker-large | ✅ | ✅ | ✅ | ||
| Reranker Model | BAAI/bge-reranker-v2-m3 | ✅ | ✅ | ✅ |
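For the embedding and reranker models above, the vLLM server also exposes an embeddings route alongside chat completions, and a rerank route for reranker models. A hedged sketch follows; the port, model names, and the exact rerank path/payload are assumptions to verify against the Getting Started guide for your image version (each model is served by its own instance).

```python
# Sketch: embeddings and reranking against an llm-scaler-vllm server.
# Port 8000 and the model identifiers are placeholders for your deployment.
from openai import OpenAI
import requests

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Embeddings via the OpenAI-compatible /v1/embeddings route.
emb = client.embeddings.create(
    model="BAAI/bge-m3",
    input=["What is llm-scaler?", "Intel Arc Pro B60 GPU"],
)
print(len(emb.data), "vectors, dim", len(emb.data[0].embedding))

# Reranking: vLLM provides a rerank route for reranker models; the path and
# payload shown here are an assumption to verify for your version.
r = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is llm-scaler?",
        "documents": ["GenAI serving on Intel Arc GPUs", "A pasta recipe"],
    },
)
print(r.json())
```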
llm-scaler-omni supports image, voice, and video generation, among other tasks, featuring Omni Studio mode (using ComfyUI) and Omni Serving mode (via SGLang Diffusion or Xinference).
Please follow the instructions in the Getting Started guide to use llm-scaler-omni.
Demos: Qwen-Image text-to-image and multi-B60 Wan2.2-T2V-14B video generation.
Omni Studio supports Image Generation/Editing, Video Generation, Audio Generation, 3D Generation, etc.
| Model Category | Model | Type |
|---|---|---|
| Image Generation | Qwen-Image, Qwen-Image-Edit | Text-to-Image, Image Editing |
| Image Generation | Stable Diffusion 3.5 | Text-to-Image, ControlNet |
| Image Generation | Z-Image-Turbo | Text-to-Image |
| Image Generation | Flux.1, Flux.1 Kontext dev | Text-to-Image, Multi-Image Reference, ControlNet |
| Video Generation | Wan2.2 TI2V 5B, Wan2.2 T2V 14B, Wan2.2 I2V 14B | Text-to-Video, Image-to-Video |
| Video Generation | Wan2.2 Animate 14B | Video Animation |
| Video Generation | HunyuanVideo 1.5 8.3B | Text-to-Video, Image-to-Video |
| Video Generation | LTX-2 | Text-to-Video, Image-to-Video |
| 3D Generation | Hunyuan3D 2.1 | Text/Image-to-3D |
| Audio Generation | VoxCPM1.5 | Text-to-Speech |
Please check ComfyUI Support for more details.
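Omni Studio workflows are normally driven from the ComfyUI web interface, but ComfyUI also accepts workflows over its HTTP API, which is handy for scripting batch generation. A rough sketch, assuming ComfyUI listens on its default port 8188 and `workflow.json` was exported via ComfyUI's "Save (API Format)" option:

```python
# Sketch: submit an exported ComfyUI workflow to a running Omni Studio
# (ComfyUI) instance. The port 8188 and the workflow file are assumptions
# to adapt to your setup.
import json
import requests

with open("workflow.json") as f:       # exported via "Save (API Format)"
    workflow = json.load(f)

resp = requests.post("http://localhost:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
print("queued prompt:", resp.json()["prompt_id"])
```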
Omni Serving supports Image Generation, Audio Generation etc.
- Image Generation (`/v1/images/generations`): Stable Diffusion 3.5, Flux.1-dev
- Text to Speech (`/v1/audio/speech`): Kokoro 82M
- Speech to Text (`/v1/audio/transcriptions`): whisper-large-v3
Please check Xinference Support for more details.
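Because the serving routes above follow the OpenAI API paths, the standard OpenAI client can target them. A minimal sketch; the base URL, port, and model identifiers below are placeholders, so take the real values from your Xinference / SGLang Diffusion configuration.

```python
# Sketch: call Omni Serving's OpenAI-style endpoints. The port and model
# identifiers are placeholders for your deployment's configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="none")

# Image generation (/v1/images/generations)
img = client.images.generate(
    model="stable-diffusion-3.5",
    prompt="a lighthouse at dawn",
)
print(img.data[0].url or "received base64-encoded image")

# Text to speech (/v1/audio/speech); voice name is a placeholder.
speech = client.audio.speech.create(
    model="Kokoro-82M",
    voice="default",
    input="Hello from llm-scaler.",
)
with open("hello.mp3", "wb") as f:
    f.write(speech.content)
```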
- Please check out the Docker image releases for llm-scaler-vllm and llm-scaler-omni
- Please report a bug or raise a feature request by opening a GitHub Issue
For similar tasks
generative-ai
This repository contains notebooks, code samples, sample apps, and other resources that demonstrate how to use, develop and manage generative AI workflows using Generative AI on Google Cloud, powered by Vertex AI. For more Vertex AI samples, please visit the Vertex AI samples Github repository.
AISuperDomain
Aila Desktop Application is a powerful tool that integrates multiple leading AI models into a single desktop application. It allows users to interact with various AI models simultaneously, providing diverse responses and insights to their inquiries. With its user-friendly interface and customizable features, Aila empowers users to engage with AI seamlessly and efficiently. Whether you're a researcher, student, or professional, Aila can enhance your AI interactions and streamline your workflow.
generative-ai-for-beginners
This course has 18 lessons. Each lesson covers its own topic, so start wherever you like! Lessons are labeled either "Learn" lessons, explaining a Generative AI concept, or "Build" lessons, explaining a concept with code examples in both **Python** and **TypeScript** when possible. Each lesson also includes a "Keep Learning" section with additional learning tools.

**What You Need**

* Access to the Azure OpenAI Service **OR** OpenAI API (only required to complete coding lessons)
* Basic knowledge of Python or TypeScript is helpful (for absolute beginners, check out these Python and TypeScript courses)
* A GitHub account to fork this entire repo to your own GitHub account

We have created a **Course Setup** lesson to help you with setting up your development environment. Don't forget to star (🌟) this repo to find it easier later.

## 🧠 Ready to Deploy?

If you are looking for more advanced code samples, check out our collection of Generative AI Code Samples in both **Python** and **TypeScript**.

## 🗣️ Meet Other Learners, Get Support

Join our official AI Discord server to meet and network with other learners taking this course and get support.

## 🚀 Building a Startup?

Sign up for Microsoft for Startups Founders Hub to receive **free OpenAI credits** and up to **$150k towards Azure credits to access OpenAI models through Azure OpenAI Services**.

## 🙏 Want to help?

Do you have suggestions, or have you found spelling or code errors? Raise an issue or create a pull request.

## 📂 Each lesson includes:

* A short video introduction to the topic
* A written lesson located in the README
* Python and TypeScript code samples supporting Azure OpenAI and OpenAI API
* Links to extra resources to continue your learning

## 🗃️ Lessons

| | Lesson Link | Description | Additional Learning |
| :-: | :-: | :-- | :-- |
| 00 | Course Setup | **Learn:** How to Setup Your Development Environment | Learn More |
| 01 | Introduction to Generative AI and LLMs | **Learn:** Understanding what Generative AI is and how Large Language Models (LLMs) work. | Learn More |
| 02 | Exploring and comparing different LLMs | **Learn:** How to select the right model for your use case | Learn More |
| 03 | Using Generative AI Responsibly | **Learn:** How to build Generative AI Applications responsibly | Learn More |
| 04 | Understanding Prompt Engineering Fundamentals | **Learn:** Hands-on Prompt Engineering Best Practices | Learn More |
| 05 | Creating Advanced Prompts | **Learn:** How to apply prompt engineering techniques that improve the outcome of your prompts. | Learn More |
| 06 | Building Text Generation Applications | **Build:** A text generation app using Azure OpenAI | Learn More |
| 07 | Building Chat Applications | **Build:** Techniques for efficiently building and integrating chat applications. | Learn More |
| 08 | Building Search Apps Vector Databases | **Build:** A search application that uses Embeddings to search for data. | Learn More |
| 09 | Building Image Generation Applications | **Build:** An image generation application | Learn More |
| 10 | Building Low Code AI Applications | **Build:** A Generative AI application using Low Code tools | Learn More |
| 11 | Integrating External Applications with Function Calling | **Build:** What is function calling and its use cases for applications | Learn More |
| 12 | Designing UX for AI Applications | **Learn:** How to apply UX design principles when developing Generative AI Applications | Learn More |
| 13 | Securing Your Generative AI Applications | **Learn:** The threats and risks to AI systems and methods to secure these systems. | Learn More |
| 14 | The Generative AI Application Lifecycle | **Learn:** The tools and metrics to manage the LLM Lifecycle and LLMOps | Learn More |
| 15 | Retrieval Augmented Generation (RAG) and Vector Databases | **Build:** An application using a RAG Framework to retrieve embeddings from a Vector Database | Learn More |
| 16 | Open Source Models and Hugging Face | **Build:** An application using open source models available on Hugging Face | Learn More |
| 17 | AI Agents | **Build:** An application using an AI Agent Framework | Learn More |
| 18 | Fine-Tuning LLMs | **Learn:** The what, why and how of fine-tuning LLMs | Learn More |
cog-comfyui
Cog-comfyui allows users to run ComfyUI workflows on Replicate. ComfyUI is a visual programming tool for creating and sharing generative art workflows. With cog-comfyui, users can access a variety of pre-trained models and custom nodes to create their own unique artworks. The tool is easy to use and does not require any coding experience. Users simply need to upload their API JSON file and any necessary input files, and then click the "Run" button. Cog-comfyui will then generate the output image or video file.
ai-notes
Notes on AI state of the art, with a focus on generative and large language models. These are the "raw materials" for the https://lspace.swyx.io/ newsletter. This repo used to be called https://github.com/sw-yx/prompt-eng, but was renamed because Prompt Engineering is Overhyped. This is now an AI Engineering notes repo.
llms-with-matlab
This repository contains example code to demonstrate how to connect MATLAB to the OpenAI™ Chat Completions API (which powers ChatGPT™) as well as OpenAI Images API (which powers DALL·E™). This allows you to leverage the natural language processing capabilities of large language models directly within your MATLAB environment.
xef
xef.ai is a one-stop library designed to bring the power of modern AI to applications and services. It offers integration with Large Language Models (LLM), image generation, and other AI services. The library is packaged in two layers: core libraries for basic AI services integration and integrations with other libraries. xef.ai aims to simplify the transition to modern AI for developers by providing an idiomatic interface, currently supporting Kotlin. Inspired by LangChain and Hugging Face, xef.ai may transmit source code and user input data to third-party services, so users should review privacy policies and take precautions. Libraries are available in Maven Central under the `com.xebia` group, with `xef-core` as the core library. Developers can add these libraries to their projects and explore examples to understand usage.
CushyStudio
CushyStudio is a generative AI platform designed for creatives of any level to effortlessly create stunning images, videos, and 3D models. It offers CushyApps, a collection of visual tools tailored for different artistic tasks, and CushyKit, an extensive toolkit for custom apps development and task automation. Users can dive into the AI revolution, unleash their creativity, share projects, and connect with a vibrant community. The platform aims to simplify the AI art creation process and provide a user-friendly environment for designing interfaces, adding custom logic, and accessing various tools.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
* Self-contained, with no need for a DBMS or cloud service.
* OpenAPI interface, easy to integrate with existing infrastructure (e.g., Cloud IDE).
* Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.