
vlmrun-hub
A hub for various industry-specific schemas to be used with VLMs.
Stars: 419

VLM Run Hub is a repository of pre-defined, industry-specific Pydantic schemas for extracting structured data from unstructured visual content such as images, videos, and documents with Vision Language Models (VLMs). Each schema gives a VLM a strongly-typed, validated output contract, so extraction results can be plugged directly into automation pipelines. The schemas are model-agnostic and work across providers such as OpenAI, Gemini, and Ollama, making the hub suitable for both small experiments and production deployments.
README:
Welcome to VLM Run Hub, a comprehensive repository of pre-defined Pydantic schemas for extracting structured data from unstructured visual domains such as images, videos, and documents. Designed for Vision Language Models (VLMs) and optimized for real-world use cases, VLM Run Hub simplifies the integration of visual ETL into your workflows.
While vision models like OpenAI's GPT-4o and Anthropic's Claude Vision excel in exploratory tasks like "chat with images," they often lack practicality for automation and integration, where strongly-typed, validated outputs are crucial.
The Structured Outputs API (popularized by GPT-4o, Gemini) addresses this by constraining LLMs to return data in precise, strongly-typed formats such as Pydantic models. This eliminates complex parsing and validation, ensuring outputs conform to expected types and structures. These schemas can be nested and include complex types like lists and dictionaries, enabling seamless integration with existing systems while leveraging the full capabilities of the model.
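For example, here is a minimal sketch of such a nested schema; the Address and Receipt models below are hypothetical illustrations, not schemas from this hub:

from typing import Optional
from pydantic import BaseModel

class Address(BaseModel):
    street: Optional[str] = None
    city: Optional[str] = None

class Receipt(BaseModel):
    merchant: Optional[str] = None
    merchant_address: Optional[Address] = None  # nested model
    line_items: list[str] = []                  # complex types such as lists
    total: Optional[float] = None

# Any JSON-like dict returned by a VLM can be validated against the schema:
receipt = Receipt.model_validate({"merchant": "Acme", "total": 12.5, "line_items": ["coffee"]})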
- Easy to use: Pydantic is a well-understood and battle-tested data model for structured data.
- Batteries included: Each schema in this repo has been validated across real-world industry use cases, from healthcare to finance to media, saving you weeks of development effort.
- Automatic data validation: Built-in Pydantic validation ensures your extracted data is clean, accurate, and reliable, reducing errors and simplifying downstream workflows (see the sketch after this list).
- Type safety: With Pydantic's type safety and compatibility with tools like mypy and pyright, you can build composable, modular systems that are robust and maintainable.
- Model-agnostic: Use the same schema with multiple VLM providers; no need to rewrite prompts for different VLMs.
- Optimized for visual ETL: Purpose-built for extracting structured data from images, videos, and documents, this repo bridges the gap between unstructured data and actionable insights.
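As a sketch of that built-in validation (using a hypothetical LineItem model, not a schema from this hub), malformed model output fails fast instead of silently propagating downstream:

from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int
    total_price: float

try:
    # "two" is not a valid integer, so validation raises immediately
    LineItem.model_validate({"description": "Service", "quantity": "two", "total_price": 200.0})
except ValidationError as e:
    print(e)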
The VLM Run Hub maintains a comprehensive catalog of all available schemas in the vlmrun/hub/catalog.yaml file. The catalog is automatically validated to ensure consistency and completeness of schema documentation. We refer the developer to the catalog-spec.yaml for the full YAML specification. Here are some featured schemas:
- Documents: document.bank-statement, document.invoice, document.receipt, document.resume, document.us-drivers-license, document.utility-bill, document.w2-form
- Other industry-specific schemas: healthcare.medical-insurance-card, retail.ecommerce-product-caption, media.tv-news, aerospace.remote-sensing
If you have a new schema you want to add to the catalog, please refer to the SCHEMA-GUIDELINES.md for the full guidelines.
Let's say we want to extract invoice metadata from an invoice image. You can readily use our Invoice schema, defined under vlmrun.hub.schemas.document.invoice, with any VLM of your choosing.
For a comprehensive walkthrough of available schemas and their usage, check out our Schema Showcase Notebook.
pip install vlmrun-hub
With Instructor / OpenAI
import instructor
from openai import OpenAI
from vlmrun.hub.schemas.document.invoice import Invoice
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.invoice/invoice_1.jpg"
client = instructor.from_openai(
OpenAI(), mode=instructor.Mode.MD_JSON
)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{ "role": "user", "content": [
{"type": "text", "text": "Extract the invoice in JSON."},
{"type": "image_url", "image_url": {"url": IMAGE_URL}, "detail": "auto"}
]}
],
response_model=Invoice,
temperature=0,
)
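Because Instructor returns a validated Invoice instance rather than raw text, the result can be serialized directly, for example:

print(response.model_dump_json(indent=2))

This produces the JSON shown below.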
JSON Response:
{
"invoice_id": "9999999",
"period_start": null,
"period_end": null,
"invoice_issue_date": "2023-11-11",
"invoice_due_date": null,
"order_id": null,
"customer_id": null,
"issuer": "Anytown, USA",
"issuer_address": {
"street": "123 Main Street",
"city": "Anytown",
"state": "USA",
"postal_code": "01234",
"country": null
},
"customer": "Fred Davis",
"customer_email": "[email protected]",
"customer_phone": "(800) 123-4567",
"customer_billing_address": {
"street": "1335 Martin Luther King Jr Ave",
"city": "Dunedin",
"state": "FL",
"postal_code": "34698",
"country": null
},
"customer_shipping_address": {
"street": "249 Windward Passage",
"city": "Clearwater",
"state": "FL",
"postal_code": "33767",
"country": null
},
"items": [
{
"description": "Service",
"quantity": 1,
"currency": null,
"unit_price": 200.0,
"total_price": 200.0
},
{
"description": "Parts AAA",
"quantity": 1,
"currency": null,
"unit_price": 100.0,
"total_price": 100.0
},
{
"description": "Parts BBB",
"quantity": 2,
"currency": null,
"unit_price": 50.0,
"total_price": 100.0
}
],
"subtotal": 400.0,
"tax": null,
"total": 400.0,
"currency": null,
"notes": "",
"others": null
}
With VLM Run
import requests
from vlmrun.hub.schemas.document.invoice import Invoice
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.invoice/invoice_1.jpg"
json_data = {
"image": IMAGE_URL,
"model": "vlm-1",
"domain": "document.invoice",
"json_schema": Invoice.model_json_schema(),
}
response = requests.post(
f"https://api.vlm.run/v1/image/generate",
headers={"Authorization": f"Bearer <your-api-key>"},
json=json_data,
)
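The HTTP response can then be validated back into the same schema. A minimal sketch, assuming the endpoint returns the extracted fields directly as the JSON body (the actual response envelope may differ; consult the VLM Run API docs):

response.raise_for_status()
invoice = Invoice.model_validate(response.json())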
With OpenAI Structured Outputs
from openai import OpenAI
from vlmrun.hub.schemas.document.invoice import Invoice
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.invoice/invoice_1.jpg"
client = OpenAI()
completion = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": [
{"type": "text", "text": "Extract the invoice in JSON."},
{"type": "image_url", "image_url": {"url": IMAGE_URL}, "detail": "auto"}
]},
],
response_format=Invoice,
temperature=0,
)
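The parsed, type-checked result is then available on the message:

invoice = completion.choices[0].message.parsed  # an Invoice instance (None if the model refused)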
When working with the OpenAI Structured Outputs API, you need to ensure that the response_format is a valid Pydantic model with the supported types.
Locally with Ollama
Note: For certain vlmrun.common utilities, you will need to install our main Python SDK via pip install vlmrun.
from ollama import chat
from vlmrun.common.image import encode_image
from vlmrun.common.utils import remote_image
from vlmrun.hub.schemas.document.invoice import Invoice
IMAGE_URL = "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/document.invoice/invoice_1.jpg"
img = remote_image(IMAGE_URL)
chat_response = chat(
model="llama3.2-vision:11b",
format=Invoice.model_json_schema(),
messages=[
{
"role": "user",
"content": "Extract the invoice in JSON.",
"images": [encode_image(img, format="JPEG").split(",")[1]],
},
],
options={
"temperature": 0
},
)
response = Invoice.model_validate_json(
chat_response.message.content
)
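At this point, response is a validated Invoice, so its fields are plain typed attributes:

print(response.invoice_id, response.total)  # e.g. 9999999 400.0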
We periodically run popular VLMs on each of the examples & schemas in the catalog.yaml file and publish the results in the benchmarks directory.
Provider | Model | Date | Results |
---|---|---|---|
OpenAI | gpt-4o-2024-11-20 | 2025-01-09 | link |
OpenAI | gpt-4o-mini-2024-07-18 | 2025-01-09 | link |
Gemini | gemini-2.0-flash-exp | 2025-01-10 | link |
Ollama | llama3.2-vision:11b | 2025-01-10 | link |
Ollama | Qwen2.5-VL-7B-Instruct:Q4_K_M_benxh | 2025-02-20 | link |
Ollama + Instructor | Qwen2.5-VL-7B-Instruct:Q4_K_M_benxh | 2025-02-20 | link |
Microsoft | phi-4 | 2025-01-10 | link |
Schemas are organized by industry for easy navigation:
vlmrun
└── hub
    ├── schemas
    │   ├── <industry>
    │   │   ├── <use-case-1>.py
    │   │   ├── <use-case-2>.py
    │   │   └── ...
    │   ├── aerospace
    │   │   └── remote_sensing.py
    │   ├── document        # all document schemas are here
    │   │   ├── invoice.py
    │   │   ├── us_drivers_license.py
    │   │   └── ...
    │   ├── healthcare
    │   │   └── medical_insurance_card.py
    │   ├── retail
    │   │   └── ecommerce_product_caption.py
    │   └── contrib         # all contributions are welcome here!
    │       └── <schema-name>.py
    └── version.py
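Each file maps directly to a dotted import path; for example, document/invoice.py above is the vlmrun.hub.schemas.document.invoice module used throughout the examples.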
We're building this hub for the community, and contributions are always welcome! Follow the CONTRIBUTING and SCHEMA-GUIDELINES.md to get started.
- Send us an email at [email protected] or join our Discord for help.
- Follow us on Twitter and LinkedIn to keep up to date on our products.
Alternative AI tools for vlmrun-hub
Similar Open Source Tools

vlmrun-hub
VLMRun Hub is a versatile tool for managing and running virtual machines in a centralized manner. It provides a user-friendly interface to easily create, start, stop, and monitor virtual machines across multiple hosts. With VLMRun Hub, users can efficiently manage their virtualized environments and streamline their workflow. The tool offers flexibility and scalability, making it suitable for both small-scale personal projects and large-scale enterprise deployments.

functionary
Functionary is a language model that interprets and executes functions/plugins. It determines when to execute functions, whether in parallel or serially, and understands their outputs. Function definitions are given as JSON Schema Objects, similar to OpenAI GPT function calls. It offers documentation and examples on functionary.meetkai.com. The newest model, meetkai/functionary-medium-v3.1, is ranked 2nd in the Berkeley Function-Calling Leaderboard. Functionary supports models with different context lengths and capabilities for function calling and code interpretation. It also provides grammar sampling for accurate function and parameter names. Users can deploy Functionary models serverlessly using Modal.com.

lego-ai-parser
Lego AI Parser is an open-source application that uses OpenAI to parse visible text of HTML elements. It is built on top of FastAPI, ready to set up as a server, and make calls from any language. It supports preset parsers for Google Local Results, Amazon Listings, Etsy Listings, Wayfair Listings, BestBuy Listings, Costco Listings, Macy's Listings, and Nordstrom Listings. Users can also design custom parsers by providing prompts, examples, and details about the OpenAI model under the classifier key.

AICentral
AI Central is a powerful tool designed to take control of your AI services with minimal overhead. It is built on Asp.Net Core and dotnet 8, offering fast web-server performance. The tool enables advanced Azure APIm scenarios, PII stripping logging to Cosmos DB, token metrics through Open Telemetry, and intelligent routing features. AI Central supports various endpoint selection strategies, proxying asynchronous requests, custom OAuth2 authorization, circuit breakers, rate limiting, and extensibility through plugins. It provides an extensibility model for easy plugin development and offers enriched telemetry and logging capabilities for monitoring and insights.

RagaAI-Catalyst
RagaAI Catalyst is a comprehensive platform designed to enhance the management and optimization of LLM projects. It offers features such as project management, dataset management, evaluation management, trace management, prompt management, synthetic data generation, and guardrail management. These functionalities enable efficient evaluation and safeguarding of LLM applications.

firecrawl-mcp-server
Firecrawl MCP Server is a Model Context Protocol (MCP) server implementation that integrates with Firecrawl for web scraping capabilities. It supports features like scrape, crawl, search, extract, and batch scrape. It provides web scraping with JS rendering, URL discovery, web search with content extraction, automatic retries with exponential backoff, credit usage monitoring, comprehensive logging system, support for cloud and self-hosted FireCrawl instances, mobile/desktop viewport support, and smart content filtering with tag inclusion/exclusion. The server includes configurable parameters for retry behavior and credit usage monitoring, rate limiting and batch processing capabilities, and tools for scraping, batch scraping, checking batch status, searching, crawling, and extracting structured information from web pages.

langcorn
LangCorn is an API server that enables you to serve LangChain models and pipelines with ease, leveraging the power of FastAPI for a robust and efficient experience. It offers features such as easy deployment of LangChain models and pipelines, ready-to-use authentication functionality, high-performance FastAPI framework for serving requests, scalability and robustness for language processing applications, support for custom pipelines and processing, well-documented RESTful API endpoints, and asynchronous processing for faster response times.

pipecat-flows
Pipecat Flows is a framework designed for building structured conversations in AI applications. It allows users to create both predefined conversation paths and dynamically generated flows, handling state management and LLM interactions. The framework includes a Python module for building conversation flows and a visual editor for designing and exporting flow configurations. Pipecat Flows is suitable for scenarios such as customer service scripts, intake forms, personalized experiences, and complex decision trees.

AIGODLIKE-ComfyUI-Translation
A plugin for multilingual translation of ComfyUI. It translates the resident menu bar, search bar, right-click context menu, nodes, and more.

manga-image-translator
Translate texts in manga/images. Some manga/images will never be translated, therefore this project was born. It provides an image/manga translation pipeline with an online demo, pip/venv, Poetry, and Docker installation options, web server, CLI, and API modes, batch processing, Nvidia GPU support, and a range of configurable translator backends (including GPT-based translation).

beelzebub
Beelzebub is an advanced honeypot framework designed to provide a highly secure environment for detecting and analyzing cyber attacks. It offers a low code approach for easy implementation and utilizes virtualization techniques powered by OpenAI Generative Pre-trained Transformer. Key features include OpenAI Generative Pre-trained Transformer acting as Linux virtualization, SSH Honeypot, HTTP Honeypot, TCP Honeypot, Prometheus openmetrics integration, Docker integration, RabbitMQ integration, and kubernetes support. Beelzebub allows easy configuration for different services and ports, enabling users to create custom honeypot scenarios. The roadmap includes developing Beelzebub into a robust PaaS platform. The project welcomes contributions and encourages adherence to the Code of Conduct for a supportive and respectful community.

arxiv-mcp-server
The ArXiv MCP Server acts as a bridge between AI assistants and arXiv's research repository, enabling AI models to search for and access papers programmatically through the Model Context Protocol (MCP). It offers features like paper search, access, listing, local storage, and research prompts. Users can install it via Smithery or manually for Claude Desktop. The server provides tools for paper search, download, listing, and reading, along with specialized prompts for paper analysis. Configuration can be done through environment variables, and testing is supported with a test suite. The tool is released under the MIT License and is developed by the Pearl Labs Team.

chatgpt-exporter
A script to export the chat history of ChatGPT. Supports exporting to text, HTML, Markdown, PNG, and JSON formats. Also allows for exporting multiple conversations at once.

llm_finetuning
This repository provides a comprehensive set of tools for fine-tuning large language models (LLMs) using various techniques, including full parameter training, LoRA (Low-Rank Adaptation), and P-Tuning V2. It supports a wide range of LLM models, including Qwen, Yi, Llama, and others. The repository includes scripts for data preparation, training, and inference, making it easy for users to fine-tune LLMs for specific tasks. Additionally, it offers a collection of pre-trained models and provides detailed documentation and examples to guide users through the process.

scylla
Scylla is an intelligent proxy pool tool designed for humanists, enabling users to extract content from the internet and build their own Large Language Models in the AI era. It features automatic proxy IP crawling and validation, an easy-to-use JSON API, a simple web-based user interface, HTTP forward proxy server, Scrapy and requests integration, and headless browser crawling. Users can start using Scylla with just one command, making it a versatile tool for various web scraping and content extraction tasks.

llmproxy
llmproxy is a reverse proxy for LLM API based on Cloudflare Worker, supporting platforms like OpenAI, Gemini, and Groq. The interface is compatible with the OpenAI API specification and can be directly accessed using the OpenAI SDK. It provides a convenient way to interact with various AI platforms through a unified API endpoint, enabling seamless integration and usage in different applications.
For similar jobs

minio
MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.

ai-on-gke
This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE). Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers: infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale; flexible integration with distributed computing and data processing frameworks; and support for multiple teams on the same infrastructure to maximize utilization of resources.

kong
Kong, or Kong API Gateway, is a cloud-native, platform-agnostic, scalable API Gateway distinguished for its high performance and extensibility via plugins. It also provides advanced AI capabilities with multi-LLM support. By providing functionality for proxying, routing, load balancing, health checking, authentication (and more), Kong serves as the central layer for orchestrating microservices or conventional API traffic with ease. Kong runs natively on Kubernetes thanks to its official Kubernetes Ingress Controller.

AI-in-a-Box
AI-in-a-Box is a curated collection of solution accelerators that can help engineers establish their AI/ML environments and solutions rapidly and with minimal friction, while maintaining the highest standards of quality and efficiency. It provides essential guidance on the responsible use of AI and LLM technologies, specific security guidance for Generative AI (GenAI) applications, and best practices for scaling OpenAI applications within Azure. The available accelerators include: Azure ML Operationalization in-a-box, Edge AI in-a-box, Doc Intelligence in-a-box, Image and Video Analysis in-a-box, Cognitive Services Landing Zone in-a-box, Semantic Kernel Bot in-a-box, NLP to SQL in-a-box, Assistants API in-a-box, and Assistants API Bot in-a-box.

awsome-distributed-training
This repository contains reference architectures and test cases for distributed model training with Amazon SageMaker Hyperpod, AWS ParallelCluster, AWS Batch, and Amazon EKS. The test cases cover different types and sizes of models as well as different frameworks and parallel optimizations (Pytorch DDP/FSDP, MegatronLM, NemoMegatron...).

generative-ai-cdk-constructs
The AWS Generative AI Constructs Library is an open-source extension of the AWS Cloud Development Kit (AWS CDK) that provides multi-service, well-architected patterns for quickly defining solutions in code to create predictable and repeatable infrastructure, called constructs. The goal of AWS Generative AI CDK Constructs is to help developers build generative AI solutions using pattern-based definitions for their architecture. The patterns defined in AWS Generative AI CDK Constructs are high level, multi-service abstractions of AWS CDK constructs that have default configurations based on well-architected best practices. The library is organized into logical modules using object-oriented techniques to create each architectural pattern model.

model_server
OpenVINO™ Model Server (OVMS) is a high-performance system for serving models. Implemented in C++ for scalability and optimized for deployment on Intel architectures, the model server uses the same architecture and API as TensorFlow Serving and KServe while applying OpenVINO for inference execution. Inference service is provided via gRPC or REST API, making deploying new algorithms and AI experiments easy.

dify-helm
Deploy langgenius/dify, an LLM based chat bot app on kubernetes with helm chart.