visual-reasoning-playground
AI-powered visual reasoning tools for broadcast & ProAV. PTZ camera tracking, object detection, scene analysis using Moondream VLM. By StreamGeeks & PTZOptics.
Stars: 83
AI-powered visual reasoning tools for broadcast, live streaming, and ProAV professionals. The Visual Reasoning Playground provides 17 ready-to-use tools demonstrating real-world applications of Vision Language Models (VLMs) using Moondream. From PTZ camera auto-tracking to multimodal audio+video automation, the tools offer functionalities like scene description, object detection, gesture control, smart counting, scene analysis, zone monitoring, color matching, multimodal fusion, smart photography, PTZ tracking, tracking comparison, scoreboard extraction, scoreboard OCR, framing assistance, PTZ color tuning, multimodal studio automation, voice triggers, and OBS plugin integration. The tools are designed to streamline tasks in live streaming, broadcast automation, camera control, content creation workflows, security monitoring, and more.
README:
AI-powered visual reasoning tools for broadcast, live streaming, and ProAV professionals.
17 ready-to-use tools demonstrating real-world applications of Vision Language Models (VLMs) using Moondream. From PTZ camera auto-tracking to multimodal audio+video automation.
Try All Tools Online Now - No installation required!
Playground Mode: All tools work without a camera! Sample videos included for testing.
From the book: Visual Reasoning AI for Broadcast and ProAV by Paul Richards
Author: Paul Richards - Co-CEO at PTZOptics | Chief Streaming Officer at StreamGeeks
Traditional computer vision requires training custom models for each task. Visual Reasoning uses pre-trained Vision Language Models that understand natural language - just describe what you want to detect.
Old way: Train a model on 10,000 images of "person at podium"
New way: Just ask "Is there a person standing at the podium?"
Perfect for:
- Live streaming & broadcast automation
- PTZ camera control & auto-tracking
- Smart conference rooms
- Security & monitoring
- Content creation workflows
- OBS & vMix integration
Tool 1: Scene Describer - Try it now
Natural language descriptions of any scene in real-time.

Camera Frame → Moondream API /caption → "A person at a desk with a laptop..."

Folder: 01-scene-describer/
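The flow above can be sketched in a few lines of Python. This is a minimal illustration of packaging a frame for a caption-style call; the JSON field names and the data-URI encoding are assumptions for illustration, not the documented Moondream API.

```python
# Hypothetical sketch: package a JPEG camera frame as a JSON body for a
# /caption-style request. Field names are illustrative assumptions.
import base64
import json

def build_caption_request(jpeg_bytes: bytes, length: str = "short") -> dict:
    """Wrap a camera frame in a JSON body with a base64 data URI."""
    return {
        "image_url": "data:image/jpeg;base64,"
                     + base64.b64encode(jpeg_bytes).decode("ascii"),
        "length": length,
    }

body = build_caption_request(b"\xff\xd8fake-jpeg-bytes")
print(json.dumps(body)[:60])
```

In the real tools this step is handled by the shared `moondream-client.js`, so individual tools never build requests by hand.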
Tool 2: Detection Boxes - Try it now
Draw bounding boxes around any object you describe.

Camera Frame → Moondream API /detect ("person", "mug") → video feed with colored bounding boxes

Folder: 02-detection-boxes/
Tool 3: Gesture OBS Control - Try it now
Control OBS scene switching with hand gestures.

Camera Frame → Moondream API ("thumbs up?" → YES/NO) → OBS WebSocket → OBS Studio scene switch (Scene 1 → 2)

OBS Script Available! Install directly in OBS Studio: moondream-gesture-control.py
Folder: 03-gesture-obs/
Tool 5: Smart Counter - Try it now
Count objects entering or exiting across a virtual line.

Camera Frame → track objects across a defined line → IN: 12, OUT: 8, TOTAL: +4

Folder: 05-smart-counter/
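The core of line-crossing counting is pure geometry: compare a tracked centroid's position on consecutive frames against the virtual line. A sketch with a horizontal line (the IN/OUT direction naming is an assumption):

```python
# Count crossings of a horizontal virtual line at y = line_y.
# Moving downward past the line counts IN, upward counts OUT.
def update_counts(prev_y, curr_y, line_y, counts):
    if prev_y < line_y <= curr_y:
        counts["in"] += 1
    elif curr_y < line_y <= prev_y:
        counts["out"] += 1
    return counts

counts = {"in": 0, "out": 0}
track = [100, 140, 210, 260]          # one object's centroid y per frame
for prev, curr in zip(track, track[1:]):
    update_counts(prev, curr, line_y=200, counts=counts)
print(counts, "net:", counts["in"] - counts["out"])
```

Checking the transition between frames, rather than position alone, is what keeps an object lingering on the line from being counted repeatedly.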
Tool 6: Scene Analyzer - Try it now
Ask questions about what the camera sees.

Camera Frame + question ("How many people?") → Moondream API /query → "Yes, there are 3 people in the room"

Folder: 06-scene-analyzer/
Tool 7: Zone Monitor - Try it now
Draw custom zones, get alerts when objects enter.

Camera view with Zone A drawn → person enters the zone → webhook trigger → alert!

Folder: 07-zone-monitor/
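The zone check itself is simple; the subtlety is alerting only on the transition into the zone. A sketch with rectangular zones (the real tool lets you draw the zone; the webhook call is stubbed out here):

```python
# Fire an alert only on the outside -> inside transition, so an object
# standing in the zone doesn't spam the webhook every frame.
def in_zone(x, y, zone):
    x0, y0, x1, y1 = zone
    return x0 <= x <= x1 and y0 <= y <= y1

def check_entry(was_inside, x, y, zone):
    inside = in_zone(x, y, zone)
    fired = inside and not was_inside   # alert only on the entering edge
    return inside, fired

zone_a = (50, 50, 200, 150)
inside, fired = check_entry(False, 120, 100, zone_a)
print(inside, fired)   # entered: alert fires
inside, fired = check_entry(inside, 130, 110, zone_a)
print(inside, fired)   # still inside: no repeat alert
```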
Tool 10: Color Matcher - Try it now
Match your camera's color settings to a reference image.

Reference image + camera feed → Moondream analyzes both → suggested adjustments (WB: +200K, Sat: -10, Exp: +0.5)

Folder: 10-color-matcher/
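To make the idea concrete, here is a toy version of the comparison step: diff the mean RGB of reference and feed and turn the difference into coarse hints. The thresholds and the mapping to WB/exposure are assumptions for illustration; the actual tool delegates the analysis to the VLM.

```python
# Illustrative only: compare mean RGB of reference vs camera feed and
# emit coarse adjustment hints. Not the tool's real algorithm.
def mean_rgb(pixels):
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def suggest_adjustments(ref_pixels, cam_pixels):
    ref, cam = mean_rgb(ref_pixels), mean_rgb(cam_pixels)
    hints = []
    # Reference warmer (red-blue balance) than the feed -> warm up WB.
    if (ref[0] - ref[2]) - (cam[0] - cam[2]) > 10:
        hints.append("WB: warmer")
    # Reference brighter overall -> raise exposure.
    if sum(ref) / 3 - sum(cam) / 3 > 10:
        hints.append("Exposure: +")
    return hints

ref = [(200, 180, 150)] * 4           # warm, bright reference
cam = [(150, 150, 160)] * 4           # cool, darker feed
print(suggest_adjustments(ref, cam))
```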
Tool 12: Multimodal Fusion - Try it now
Combine audio + video for intelligent automation.

Camera (video) + microphone (speech) → fusion engine (confidence: 95%) → trigger automation

Example: "Start meeting" + people visible = HIGH confidence → trigger
Folder: 12-multimodal-fusion/
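The example above can be sketched as a fusion rule: the trigger fires only when the audio phrase matches and the video check agrees, with a combined confidence. The geometric-mean combiner and the 0.9 threshold are illustrative assumptions.

```python
# Sketch of audio+video fusion: both modalities must agree, and the
# combined confidence must clear a threshold.
def fuse(audio_conf: float, video_conf: float) -> float:
    """Geometric mean: a weak signal in either modality drags the score down."""
    return (audio_conf * video_conf) ** 0.5

def should_trigger(heard_phrase, people_visible, audio_conf, video_conf,
                   threshold=0.9):
    if heard_phrase != "start meeting" or not people_visible:
        return False
    return fuse(audio_conf, video_conf) >= threshold

print(should_trigger("start meeting", True, 0.95, 0.95))   # fires
print(should_trigger("start meeting", False, 0.99, 0.99))  # no people: blocked
```

Requiring both modalities is what makes this more robust than either alone: a TV saying "start meeting" in an empty room does nothing.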
Tool 13: Smart AI Photographer - Try it now
Auto-capture photos when AI detects your target.

Camera Frame → Moondream API /detect ("person smiling") → target found? YES → capture → photo gallery + download

Folder: 13-smart-photographer/
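A capture gate of this kind typically debounces the detector so one noisy frame doesn't fire the shutter. A sketch, where requiring 3 consecutive detections is an illustrative choice, not the tool's documented behavior:

```python
# Require the target in N consecutive frames before capturing, then
# reset so a held pose doesn't burst-capture.
class CaptureGate:
    def __init__(self, frames_required: int = 3):
        self.frames_required = frames_required
        self.streak = 0

    def update(self, target_found: bool) -> bool:
        """Feed one frame's detection result; True means capture now."""
        self.streak = self.streak + 1 if target_found else 0
        if self.streak >= self.frames_required:
            self.streak = 0
            return True
        return False

gate = CaptureGate()
results = [True, True, False, True, True, True]
print([gate.update(r) for r in results])
```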
Featured: PTZ Auto-Tracker - Try it now
Autonomous PTZ camera tracking using AI vision.

PTZOptics camera → Moondream API /detect ("red shirt") → calculate pan/tilt commands → PTZOptics API moves the camera → repeat

Folder: PTZOptics-Moondream-Tracker/
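The "calculate pan/tilt" step is a proportional controller: the offset between the detected box centre and the frame centre becomes a move command. A sketch, where the gain, dead zone, and ±24 speed clamp are illustrative assumptions rather than PTZOptics values:

```python
# Proportional pan/tilt from a bounding box: error = subject offset
# from frame centre, scaled by a gain and clamped to a speed range.
def track_step(box, frame_w, frame_h, gain=24, dead_zone=0.05):
    x0, y0, x1, y1 = box
    # Normalised error in [-0.5, 0.5]: how far the subject is off centre.
    err_x = (x0 + x1) / 2 / frame_w - 0.5
    err_y = (y0 + y1) / 2 / frame_h - 0.5
    def speed(err):
        if abs(err) < dead_zone:
            return 0                  # close enough: hold still
        return max(-24, min(24, round(err * gain)))
    return speed(err_x), speed(err_y)   # (pan, tilt)

# Subject right of centre -> positive pan, subject near vertical centre -> no tilt.
print(track_step((1200, 300, 1400, 700), frame_w=1920, frame_h=1080))
```

The dead zone matters with a cloud VLM in the loop: with ~200 ms round trips, chasing tiny errors makes the camera hunt back and forth instead of settling.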
Tool 14: Tracking Comparison - Try it now
Compare MediaPipe (local CV) vs Moondream (cloud VLM) for PTZ tracking.

Camera Frame → MediaPipe in the browser (local, ~10 ms) vs Moondream API (cloud, ~200 ms) → compare!

See the tradeoffs: latency, accuracy, and flexibility side-by-side.
Folder: 14-tracking-comparison/
Tool 4: Scoreboard Extractor - Try it now
Extract scores from physical scoreboards using AI vision.

Scoreboard camera → Moondream API ("Read score") → HOME: 24, AWAY: 18, QTR: 3 → graphics overlay

Folder: 04-scoreboard-extractor/
Tool 4b: Scoreboard OCR - Try it now
Extract scores using local Tesseract.js OCR - no API key needed.

Scoreboard camera → Tesseract.js (local, region-based OCR) → HOME: 24, AWAY: 18, QTR: 3

Compare approaches! Use this alongside Tool 4 to see VLM vs OCR tradeoffs.
Folder: 04b-scoreboard-ocr/
Tool 8: Framing Assistant - Try it now
AI-powered framing suggestions for PTZ cameras.

Camera view → AI overlays a suggested frame around the subject → "Move camera UP 5°, zoom IN 10% for better composition"

Folder: 08-framing-assistant/
Tool 9: PTZ Color Tuner - Try it now
Direct PTZ camera color control via API with AI-assisted adjustments.

PTZOptics camera → Moondream AI analyzes the scene → recommended adjustments → PTZOptics API applies settings → repeat

Folder: 09-ptz-color-tuner/
Tool 11: Multimodal Studio - Try it now
Full production automation: PTZ + OBS + Audio + AI.

PTZOptics camera + microphone (voice) + OBS Studio → Multimodal Studio controller → PTZ move, OBS scene switch, webhook

Voice: "Camera 2, close up" → PTZ moves + OBS switches
Folder: 11-multimodal-studio/
Tool 15: Voice Triggers - Try it now
Speech-to-text automation with Whisper AI running entirely in-browser.

Microphone input → Whisper AI (in-browser) → transcript ("switch to camera two") → trigger rules (phrase → action) → execute action (log/alert/OBS)

Key Features:
- No API key needed - Whisper runs locally via WebGPU/WASM
- ~40MB model - downloads once, cached in the browser
- Trigger rules - map phrases to actions
- Privacy-first - audio never leaves your device

Folder: 15-voice-triggers/
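The phrase-to-action step can be sketched as below. Substring matching after normalisation, and the rule set itself, are illustrative assumptions; the tool lets you define your own rules.

```python
# Match a normalised transcript against trigger rules. The rules and the
# action strings are hypothetical examples.
RULES = [
    ("switch to camera two", "obs.set_scene('Camera 2')"),
    ("start recording",      "obs.start_record()"),
]

def match_rules(transcript: str):
    """Lowercase, trim, and strip trailing punctuation before matching."""
    text = transcript.lower().strip().rstrip(".!?")
    return [action for phrase, action in RULES if phrase in text]

print(match_rules("Okay, switch to camera two!"))
print(match_rules("unrelated chatter"))      # no match
```

Normalising before matching is what lets "Okay, switch to camera two!" hit the same rule as the bare phrase: speech-to-text output varies in casing and punctuation.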
OBS Plugin: Visual Reasoning AI - Try it now
Complete AI control panel as an OBS Browser Dock.

The dock shows tabs for Gestures, Describe, and Auto-Switch above a live camera preview with gesture detection. Gesture mappings (thumbs up → Scene: Wide Shot, thumbs down → Scene: Close Up) and auto-switch rules ("whiteboard" → Whiteboard Cam, "standing" → Full Body Shot) drive OBS Studio actions such as scene switching and start/stop recording.

Folder: obs-visual-reasoning/
- Get Your API Key - Sign up at console.moondream.ai (free tier available)
- Open Any Tool - Visit the Visual Reasoning Playground
- Enter Your API Key - Paste it once, and you're ready to go!
Important: Clone the full repository; individual tool folders won't work alone because they depend on shared libraries in shared/.

git clone https://github.com/streamgeeks/visual-reasoning-playground.git
cd visual-reasoning-playground
python server.py

Then open http://localhost:8000 and select any tool. The included server.py enables CORS so sample videos work with AI detection.
Every tool includes both business and personal examples:
| Tool | Business Use | Personal Use |
|---|---|---|
| Scene Describer | Patient fall detection | Fridge inventory for recipes |
| Detection Boxes | Manufacturing QA | "Where are my keys?" |
| PTZ Auto-Tracker | Speaker tracking at events | Pet cam follows your dog |
| Smart Counter | Retail foot traffic analytics | Count kids going outside |
| Scene Analyzer | Security: "Anyone in restricted area?" | "Is my garage door open?" |
| Zone Monitor | Warehouse safety alerts | Driveway arrival notifications |
| Color Assistant | Multi-cam color matching | Match YouTuber's style |
| Multimodal Fusion | Smart conference room | Voice-controlled smart home |
These tools are designed to integrate with your existing workflow:
| Platform | Integration |
|---|---|
| OBS Studio | WebSocket triggers, scene switching, native Python script |
| vMix | HTTP API commands, input control |
| PTZOptics | Full API 2.0 support for all PTZ cameras |
| NDI | Works with NDI video sources |
| Webhooks | Trigger any HTTP endpoint |
| Home Assistant | Smart home automation |
Control OBS scenes with hand gestures - runs natively inside OBS Studio!
Installation:
- Download moondream-gesture-control.py
- In OBS: Tools → Scripts → + → Select the .py file
- Configure your Moondream API key and gesture mappings
- Enable detection and start gesturing!
Features:
- Thumbs up → Switch to Scene A
- Thumbs down → Switch to Scene B
- Configurable detection interval and cooldown
- Debug mode for troubleshooting
- No browser required - runs entirely within OBS
Requirements:
- OBS Studio 28.0 or later
- Moondream API key (get one free)
- Webcam
Try before installing: Use the web demo to test gesture detection before installing the OBS script.
All tools follow a consistent pattern: Video → AI → Action
Shared utilities in shared/:
- moondream-client.js - Unified API client with detect, caption, query, point methods
- video-source-adapter.js - Toggle between live camera and sample videos
- api-key-manager.js - Secure API key storage and validation
- styles.css - Consistent dark theme UI components
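The Video → AI → Action pattern reduces to a small polling loop. A language-neutral sketch in Python (the shared utilities are JavaScript in the repo); `analyze` stands in for a client call such as detect, caption, or query, and is a hypothetical placeholder here:

```python
# Skeleton of the harness pattern: grab a frame, call the model at a
# fixed rate, hand the result to an action (OBS, PTZ, webhook, log).
import time

def run_harness(get_frame, analyze, act, rate_per_sec=1.0, max_frames=3):
    interval = 1.0 / rate_per_sec
    for _ in range(max_frames):
        frame = get_frame()           # Video: current camera frame
        result = analyze(frame)       # AI: one API call per interval
        act(result)                   # Action: do something with the result
        time.sleep(interval)

log = []
run_harness(get_frame=lambda: "frame",
            analyze=lambda f: {"seen": "person"},
            act=log.append,
            rate_per_sec=100.0)       # fast rate so the demo finishes quickly
print(log)
```

The fixed interval is also the cost knob: the detection-rate slider described below simply changes `rate_per_sec`.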
visual-reasoning-playground/
├── index.html                       # Landing page with all tools
├── server.py                        # Local dev server (CORS enabled)
├── shared/                          # Reusable utilities for all tools
│
├── 01-scene-describer/              # Natural language scene descriptions
├── 02-detection-boxes/              # Bounding box visualization
├── 03-gesture-obs/                  # Gesture-based OBS control
├── 04-scoreboard-extractor/         # Score extraction (VLM approach)
├── 04b-scoreboard-ocr/              # Score extraction (Tesseract OCR)
├── 05-smart-counter/                # Object counting across line
├── 06-scene-analyzer/               # Visual Q&A chat
├── 07-zone-monitor/                 # Zone-based alerts
├── 08-framing-assistant/            # PTZ framing suggestions
├── 09-ptz-color-tuner/              # PTZ color control
├── 10-color-matcher/                # Color matching to reference
├── 11-multimodal-studio/            # Full PTZ+OBS+voice automation
├── 12-multimodal-fusion/            # Audio+video fusion engine
├── 13-smart-photographer/           # Auto-capture on detection
├── 14-tracking-comparison/          # MediaPipe vs Moondream test
├── 15-voice-triggers/               # Voice command automation
│
├── PTZOptics-Moondream-Tracker/     # Featured PTZ auto-tracking
├── obs-visual-reasoning/            # OBS Browser Dock plugin
├── 00-visual-reasoning-harness/     # Harness pattern documentation
│
└── assets/                          # Sample videos & color profiles
    ├── sample-videos/               # Demo videos for playground mode
    └── color-profiles/              # Reference images for color tool
See CONTRIBUTING.md for details on adding new tools.
Moondream charges per API call. Control costs with the rate slider in each tool:
| Detection Rate | API Calls/Hour | Best For |
|---|---|---|
| 0.5/sec | 1,800 | Static scenes, budget-conscious |
| 1.0/sec | 3,600 | General use (default) |
| 2.0/sec | 7,200 | Active scenes |
| 3.0/sec | 10,800 | Fast action, sports |
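The API-call column is just rate × 3600; with a per-call price you can turn the slider setting into an hourly cost estimate. The $0.001/call figure below is a placeholder, not Moondream's actual pricing.

```python
# Detection rate -> API calls per hour, plus a cost estimate with a
# hypothetical per-call price.
def calls_per_hour(rate_per_sec: float) -> int:
    return int(rate_per_sec * 3600)

def hourly_cost(rate_per_sec: float, price_per_call: float) -> float:
    return calls_per_hour(rate_per_sec) * price_per_call

for rate in (0.5, 1.0, 2.0, 3.0):
    print(rate, calls_per_hour(rate))          # matches the table above
print(f"${hourly_cost(1.0, 0.001):.2f}/hour at the default rate")
```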
All Tools:
- Moondream API Key (free tier available)
- Modern browser (Chrome recommended)
- Local web server
Tool-Specific:
- PTZ Auto-Tracker, Framing Assistant, Color Tuner: PTZOptics camera with network access
- Multimodal Studio, Multimodal Fusion, Voice Triggers: Microphone for speech recognition
- Gesture OBS Control, OBS Plugin: OBS Studio with WebSocket Server enabled
Visual Reasoning AI for Broadcast and ProAV by Paul Richards covers:
- Complete theory behind Vision Language Models
- Step-by-step tool building tutorials
- Production deployment strategies
- Industry-specific applications
Get your copy at VisualReasoning.ai/book
- VisualReasoning.ai - Book, online course, and free tools
- Moondream Documentation - API reference & guides
- PTZOptics API 2.0 - Camera control documentation
- StreamGeeks Academy - Live streaming education
- StreamGeeks Discord - Get help, share projects
- PTZOptics Support - Camera-specific questions
Found a bug? Have an idea? PRs welcome!
- Fork this repo
- Create a feature branch
- Submit a pull request
MIT License - Use freely in personal and commercial projects.
Built by Paul Richards
Co-CEO at PTZOptics | Chief Streaming Officer at StreamGeeks