
🤖 AskUI Vision Agent


Enable AI agents to control your desktop (Windows, MacOS, Linux), mobile (Android, iOS) and HMI devices

Join the AskUI Discord.


📖 Introduction

AskUI Vision Agent is a powerful automation framework that enables you and AI agents to control your desktop, mobile, and HMI devices and automate tasks. It combines support for multiple AI models, multi-platform compatibility, and enterprise-ready features.

https://github.com/user-attachments/assets/a74326f2-088f-48a2-ba1c-4d94d327cbdf

🎯 Key Features

  • Support for Windows, Linux, MacOS, Android and iOS device automation (Citrix supported)
  • Support for single-step UI automation commands (RPA-like) as well as agentic, intent-based instructions (see the sketch below this list)
  • In-background automation on Windows machines (agent can create a second session; you do not have to watch it take over mouse and keyboard)
  • Flexible model use (hot-swapping of models) and infrastructure for re-teaching models (available on-premise)
  • Secure deployment of agents in enterprise environments
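As a minimal sketch of the two command styles (the element text and the instruction below are illustrative placeholders; click, type, and act are the same calls used in the Quickstart further down):

from askui import VisionAgent

with VisionAgent() as agent:
    # Single-step commands: each call performs exactly one deterministic UI action.
    agent.click("Submit")        # locate an element by its visible text and click it
    agent.type("Hello, world!")  # type into the currently focused element
    # Agentic instruction: the model plans and executes the required steps itself.
    agent.act("Open the settings dialog and enable dark mode")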

📦 Installation

AskUI Python Package

pip install "askui[all]"

Requires Python >=3.10
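To sanity-check the installation, you can try importing the package (a quick check, not an official command):

python -c "from askui import VisionAgent; print('askui is ready')"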

AskUI Agent OS

Agent OS is a device controller that allows agents to take screenshots, move the mouse, click, and type on the keyboard across any operating system. It is installed on a desktop OS but can also control connected mobile and HMI devices.

It offers powerful features like

  • multi-screen support,
  • support for all major operating systems (incl. Windows, MacOS and Linux),
  • process visualizations,
  • real Unicode character typing,
  • and more; features like application selection, in-background automation, and video streaming will be released soon.

Windows

AMD64

AskUI Installer for AMD64

ARM64

AskUI Installer for ARM64

Linux

⚠️ Warning: Agent OS currently does not work on Wayland. Switch to XOrg to use it.

AMD64

curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-AMD64-Web.run

ARM64

curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-Linux-ARM64-Web.run

MacOS

⚠️ Warning: Agent OS currently does not work on MacOS with Intel chips (x86_64/amd64 architecture). Switch to a Mac with Apple Silicon (arm64 architecture), e.g., M1, M2, M3, etc.

ARM64

curl -L -o /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run https://files.askui.com/releases/Installer/Latest/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run
bash /tmp/AskUI-Suite-Latest-User-Installer-MacOS-ARM64-Web.run

🚀 Quickstart

🧑 Control your devices

Double-click wherever the cursor currently is:

from askui import VisionAgent

with VisionAgent() as agent:
    agent.click(button="left", repeat=2)

By default, the agent works within the context of the selected display, which defaults to the primary display.
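If you work with multiple displays and want the agent to target a different one, the display can be selected when creating the agent. A minimal sketch, assuming a display parameter on VisionAgent (verify the exact API in the official docs):

from askui import VisionAgent

# Assumption for illustration: select the second display instead of the primary one.
with VisionAgent(display=2) as agent:
    agent.click(button="left", repeat=2)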

Run the script with python <file path>, e.g., python test.py, to see if it works.

🤖 Let AI agents control your devices

In order to let AI agents control your devices, you need to be able to connect to an AI model (provider). We host some models ourselves and support several others, e.g., Anthropic, OpenRouter, and Hugging Face, out of the box. If you want to use a model provider or model that is not supported, you can easily plug in your own (see Custom Models).

For this example, we will use AskUI as the model provider to get started easily.

πŸ” Sign up with AskUI

Sign up at hub.askui.com to:

  • Activate your free trial (no credit card required)
  • Get your workspace ID and access token

βš™οΈ Configure environment variables

Linux & MacOS
export ASKUI_WORKSPACE_ID=<your-workspace-id-here>
export ASKUI_TOKEN=<your-token-here>
Windows PowerShell
$env:ASKUI_WORKSPACE_ID="<your-workspace-id-here>"
$env:ASKUI_TOKEN="<your-token-here>"
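For quick local experiments, you can also set these variables from within the script itself before creating the agent. A minimal sketch with placeholder values (for real projects, prefer exporting them in your shell or using a secrets manager):

import os

# Placeholder credentials; never hard-code real tokens in committed code.
os.environ["ASKUI_WORKSPACE_ID"] = "<your-workspace-id-here>"
os.environ["ASKUI_TOKEN"] = "<your-token-here>"

from askui import VisionAgent  # import after the variables are set

with VisionAgent() as agent:
    agent.click(button="left")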

💻 Example

from askui import VisionAgent

with VisionAgent(log_level="DEBUG") as agent:
    # Give the agent complex, multi-step instructions. (It may have trouble with virtual
    # displays out of the box, so make sure no browser is open on a virtual display the
    # agent may not see.)
    agent.act(
        "Look for a browser on the current device (checking all available displays, "
        "making sure window has focus),"
        " open a new window or tab and navigate to https://docs.askui.com"
        " and click on 'Search...' to open search panel. If the search panel is already "
        "opened, empty the search field so I can start a fresh search."
    )
    agent.type("Introduction")
    # Locates elements by text (you can also use images, natural language descriptions, coordinates, etc. to
    # describe what to click on)
    agent.click(
        "Documentation > Tutorial > Introduction",
    )
    first_paragraph = agent.get(
        "What does the first paragraph of the introduction say?"
    )
    print("\n--------------------------------")
    print("FIRST PARAGRAPH:\n")
    print(first_paragraph)
    print("--------------------------------\n\n")

Run the script with python <file path>, e.g., python test.py.

Note: The log_level parameter is set to DEBUG to give you a better picture of what is happening. By default, it is set to INFO, which produces fewer logs.

If you see a lot of logs and the first paragraph of the introduction in the console, congratulations! You've successfully let AI agents control your device to automate a task! If you have any issues, please check the documentation or join our Discord for support.

📚 Further Documentation

Aside from our official documentation, we also have some additional guides and examples under the docs folder that you may find useful, for example:

  • Chat - How to interact with agents through a chat
  • Direct Tool Use - How to use the tools directly, e.g., the clipboard, the Agent OS, etc.
  • Extracting Data - How to extract data from the screen and documents
  • MCP - How to use MCP servers to extend the capabilities of an agent
  • Observability - Logging and reporting
  • Telemetry - Which data we gather and how to disable it
  • Using Models - How to use different models including how to register your own custom models

🤝 Contributing

We'd love your help! Contributions, ideas, and feedback are always welcome. A proper contribution guide is coming soonβ€”stay tuned!

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
