agent-device

CLI to control iOS and Android devices for AI agents

Stars: 530

Visit

CLI tool for controlling iOS and Android devices for AI agents, with core commands like open, back, home, press, and more. It supports minimal dependencies, TypeScript execution on Node 22+, and is in early development. The tool allows for automation flows, session management, semantic finding, assertions, replay updates, and settings helpers for simulators. It also includes backends for iOS snapshots, app resolution, iOS-specific notes, testing, and building. Contributions are welcome, and the project is maintained by Callstack, a group of React and React Native enthusiasts.

README:

agent-device

CLI to control iOS and Android devices for AI agents influenced by Vercel’s agent-browser.

The project is in early development and considered experimental. Pull requests are welcome!

Features

Platforms: iOS (simulator + limited device support) and Android (emulator + device).
Core commands: open, back, home, app-switcher, press, long-press, focus, type, fill, scroll, scrollintoview, wait, alert, screenshot, close.
Inspection commands: snapshot (accessibility tree).
Device tooling: adb (Android), simctl/devicectl (iOS via Xcode).
Minimal dependencies; TypeScript executed directly on Node 22+ (no build step).

Install

npm install -g agent-device

Or use it without installing:

npx agent-device open SampleApp

Quick Start

Use refs for agent-driven exploration and normal automation flows.

agent-device open Contacts --platform ios # creates session on iOS Simulator
agent-device snapshot
agent-device click @e5
agent-device fill @e6 "John"
agent-device fill @e7 "Doe"
agent-device click @e3
agent-device close

CLI Usage

agent-device <command> [args] [--json]

Basic flow:

agent-device open SampleApp
agent-device snapshot
agent-device click @e7
agent-device fill @e8 "hello"
agent-device close SampleApp

Debug flow:

agent-device trace start
agent-device snapshot -s "Sample App"
agent-device find label "Wi-Fi" click
agent-device trace stop ./trace.log

Coordinates:

All coordinate-based commands (press, long-press, focus, fill) use device coordinates with origin at top-left.
X increases to the right, Y increases downward.

Command Index

open, close, home, back, app-switcher
snapshot, find, get
click, focus, type, fill, press, long-press, scroll, scrollintoview, is
alert, wait, screenshot
trace start, trace stop
settings wifi|airplane|location on|off
appstate, apps, devices, session list

Backends (iOS snapshots)

Backend	Speed	Accuracy	Requirements
`xctest`	Fast	High	No Accessibility permission required
`ax`	Fast	Medium	Accessibility permission for the terminal app, not recommended

Notes:

Default backend is xctest on iOS.
Scope snapshots with -s "<label>" or -s @ref.
If XCTest returns 0 nodes (e.g., foreground app changed), agent-device falls back to AX when available.

Flags:

--version, -V print version and exit
--platform ios|android
--device <name>
--udid <udid> (iOS)
--serial <serial> (Android)
--activity <component> (Android; package/Activity or package/.Activity)
--session <name>
--verbose for daemon and runner logs
--json for structured output
--backend ax|xctest (snapshot only; defaults to xctest on iOS)

Skills

Install the automation skills listed in SKILL.md.

npx skills add https://github.com/callstackincubator/agent-device --skill agent-device

Sessions:

open starts a session. Without args boots/activates the target device/simulator without launching an app.
All interaction commands require an open session.
If a session is already open, open <app> switches the active app and updates the session app bundle.
close stops the session and releases device resources. Pass an app to close it explicitly, or omit to just close the session.
Use --session <name> to manage multiple sessions.
Session scripts are written to ~/.agent-device/sessions/<session>-<timestamp>.ad when recording is enabled with --save-script.
Deterministic replay is .ad-based; use replay --update (-u) to update selector drift and rewrite the replay file in place.

Find (semantic):

find <text> <action> [value] finds by any text (label/value/identifier) using a scoped snapshot.
find text|label|value|role|id <value> <action> [value] for specific locators.
Actions: click (default), fill, type, focus, get text, get attrs, wait [timeout], exists.

Assertions:

is predicates: visible, hidden, exists, editable, selected, text.
is text uses exact equality.

Replay update:

replay <path> runs deterministic replay from .ad scripts.
replay -u <path> attempts selector updates on failures and atomically rewrites the same file.
Refs are the default/core mechanism for interactive agent flows.
Update targets: click, fill, get, is, wait.
Selector matching is a replay-update internal: replay parses .ad lines into actions, tries them, snapshots on failure, resolves a better selector, then rewrites that failing line.

Update examples:

# Before (stale selector)
click "id=\"old_continue\" || label=\"Continue\""

# After replay -u (rewritten in place)
click "id=\"auth_continue\" || label=\"Continue\""

# Before (ref-based action from discovery)
snapshot -i -c -s "Continue"
click @e13 "Continue"

# After replay -u (upgraded to selector-based action)
snapshot -i -c -s "Continue"
click "id=\"auth_continue\" || label=\"Continue\""

Android fill reliability:

fill clears the current value, then enters text.
type enters text into the focused field without clearing.
fill now verifies the entered value on Android.
If value does not match, agent-device clears the field and retries once with slower typing.
This reduces IME-related character swaps on long strings (e.g. emails and IDs).

Settings helpers (simulators):

settings wifi on|off
settings airplane on|off
settings location on|off (iOS uses per-app permission for the current session app) Note: iOS wifi/airplane toggles status bar indicators, not actual network state. Airplane off clears status bar overrides.

App state:

appstate shows the foreground app/activity (Android). On iOS it uses the current session app when available, otherwise it falls back to a snapshot-based guess (AX first, XCTest if AX can’t identify).
apps --metadata returns app list with minimal metadata.

Debug

agent-device trace start
agent-device trace stop ./trace.log
The trace log includes snapshot logs and XCTest runner logs for the session.
Built-in retries cover transient runner connection failures, AX snapshot hiccups, and Android UI dumps.
For snapshot issues (missing elements), compare with --raw flag for unaltered output and scope with -s "<label>".

App resolution

Bundle/package identifiers are accepted directly (e.g., com.apple.Preferences).
Human-readable names are resolved when possible (e.g., Settings).
Built-in aliases include Settings for both platforms.

iOS notes

Input commands (press, type, scroll, etc.) are supported only on simulators in v1 and use the XCTest runner.
alert and scrollintoview use the XCTest runner and are simulator-only in v1.
Real device support (including snapshots) is on the roadmap for iOS.

Testing

pnpm test

Build

pnpm build

Environment selectors:

ANDROID_DEVICE=Pixel_9_Pro_XL or ANDROID_SERIAL=emulator-5554
IOS_DEVICE="iPhone 17 Pro" or IOS_UDID=<udid>

Test screenshots are written to:

test/screenshots/android-settings.png
test/screenshots/ios-settings.png

Contributing

See CONTRIBUTING.md.

Made at Callstack

agent-device is an open source project and will always remain free to use. Callstack is a group of React and React Native geeks. Contact us at [email protected] if you need any help with these technologies or just want to say hi.

For Tasks:

Click tags to check more tools for each tasks

automate app interactions manage device sessions find elements semantically update replay scripts control simulator settings

For Jobs:

automation engineer mobile app tester software developer quality assurance analyst ai agent developer

Alternative AI tools for agent-device

Similar Open Source Tools

agent-device

github

: 530

chat-ui

A chat interface using open source models, eg OpenAssistant or Llama. It is a SvelteKit app and it powers the HuggingChat app on hf.co/chat.

github

: 10.5k

CodexMonitor

CodexMonitor is a Tauri app designed for managing multiple Codex agents in local workspaces. It offers features such as workspace and thread management, composer and agent controls, Git and GitHub integration, file and prompt handling, as well as a user-friendly UI and experience. The tool requires Node.js, Rust toolchain, CMake, LLVM/Clang, Codex CLI, Git CLI, and optionally GitHub CLI. It supports iOS with Tailscale setup, and provides instructions for iOS support, prerequisites, simulator usage, USB device deployment, and release builds. The project structure includes frontend and backend components, with persistent data storage, settings, and feature configurations. Tauri IPC surface enables various functionalities like settings management, workspace operations, thread handling, Git/GitHub interactions, prompts management, terminal/dictation/notifications, and remote backend helpers.

github

: 2.4k

slack-mcp-server

Slack MCP Server is a Model Context Protocol server for Slack Workspaces, offering powerful features like Stealth and OAuth Modes, Enterprise Workspaces Support, Channel and Thread Support, Smart History, Search Messages, Safe Message Posting, DM and Group DM support, Embedded user information, Cache support, and multiple transport options. It provides tools like conversations_history, conversations_replies, conversations_add_message, conversations_search_messages, and channels_list for managing messages, threads, adding messages, searching messages, and listing channels. The server also exposes directory resources for workspace metadata access. The tool is designed to enhance Slack workspace functionality and improve user experience.

github

: 696

shell-pilot

Shell-pilot is a simple, lightweight shell script designed to interact with various AI models such as OpenAI, Ollama, Mistral AI, LocalAI, ZhipuAI, Anthropic, Moonshot, and Novita AI from the terminal. It enhances intelligent system management without any dependencies, offering features like setting up a local LLM repository, using official models and APIs, viewing history and session persistence, passing input prompts with pipe/redirector, listing available models, setting request parameters, generating and running commands in the terminal, easy configuration setup, system package version checking, and managing system aliases.

github

: 73

ChatDBG

ChatDBG is an AI-based debugging assistant for C/C++/Python/Rust code that integrates large language models into a standard debugger (`pdb`, `lldb`, `gdb`, and `windbg`) to help debug your code. With ChatDBG, you can engage in a dialog with your debugger, asking open-ended questions about your program, like `why is x null?`. ChatDBG will _take the wheel_ and steer the debugger to answer your queries. ChatDBG can provide error diagnoses and suggest fixes. As far as we are aware, ChatDBG is the _first_ debugger to automatically perform root cause analysis and to provide suggested fixes.

github

: 825

olah

Olah is a self-hosted lightweight Huggingface mirror service that implements mirroring feature for Huggingface resources at file block level, enhancing download speeds and saving bandwidth. It offers cache control policies and allows administrators to configure accessible repositories. Users can install Olah with pip or from source, set up the mirror site, and download models and datasets using huggingface-cli. Olah provides additional configurations through a configuration file for basic setup and accessibility restrictions. Future work includes implementing an administrator and user system, OOS backend support, and mirror update schedule task. Olah is released under the MIT License.

github

: 132

chatgpt-subtitle-translator

This tool utilizes the OpenAI ChatGPT API to translate text, with a focus on line-based translation, particularly for SRT subtitles. It optimizes token usage by removing SRT overhead and grouping text into batches, allowing for arbitrary length translations without excessive token consumption while maintaining a one-to-one match between line input and output.

github

: 347

hume-python-sdk

The Hume AI Python SDK allows users to integrate Hume APIs directly into their Python applications. Users can access complete documentation, quickstart guides, and example notebooks to get started. The SDK is designed to provide support for Hume's expressive communication platform built on scientific research. Users are encouraged to create an account at beta.hume.ai and stay updated on changes through Discord. The SDK may undergo breaking changes to improve tooling and ensure reliable releases in the future.

github

: 158

aiolauncher_scripts

AIO Launcher Scripts is a collection of Lua scripts that can be used with AIO Launcher to enhance its functionality. These scripts can be used to create widget scripts, search scripts, and side menu scripts. They provide various functions such as displaying text, buttons, progress bars, charts, and interacting with app widgets. The scripts can be used to customize the appearance and behavior of the launcher, add new features, and interact with external services.

github

: 81

orbiton

Orbiton is a text editor and simple IDE designed with minimal annoyance in mind, not highly configurable to help users stay focused, and supports rapid edit-format-compile cycles. It is suitable for writing git commit messages, editing README.md and TODO.md files, writing Markdown and exporting to HTML or PDF, learning programming languages, editing files within larger projects, solving Advent of Code tasks, and providing a distraction-free environment for writing. The tool offers unique features like smart cursor movement, paste and copy shortcuts, portal for copying lines across files, code building and formatting shortcuts, and more.

github

: 523

cheating-based-prompt-engine

This is a vulnerability mining engine purely based on GPT, requiring no prior knowledge base, no fine-tuning, yet its effectiveness can overwhelmingly surpass most of the current related research. The core idea revolves around being task-driven, not question-driven, driven by prompts, not by code, and focused on prompt design, not model design. The essence is encapsulated in one word: deception. It is a type of code understanding logic vulnerability mining that fully stimulates the capabilities of GPT, suitable for real actual projects.

github

: 185

k8sgpt

K8sGPT is a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI.

github

: 6.4k

godot-llm

Godot LLM is a plugin that enables the utilization of large language models (LLM) for generating content in games. It provides functionality for text generation, text embedding, multimodal text generation, and vector database management within the Godot game engine. The plugin supports features like Retrieval Augmented Generation (RAG) and integrates llama.cpp-based functionalities for text generation, embedding, and multimodal capabilities. It offers support for various platforms and allows users to experiment with LLM models in their game development projects.

github

: 80

forge

Forge is a powerful open-source tool for building modern web applications. It provides a simple and intuitive interface for developers to quickly scaffold and deploy projects. With Forge, you can easily create custom components, manage dependencies, and streamline your development workflow. Whether you are a beginner or an experienced developer, Forge offers a flexible and efficient solution for your web development needs.

github

: 4.6k

bilingual_book_maker

The bilingual_book_maker is an AI translation tool that uses ChatGPT to assist users in creating multi-language versions of epub/txt/srt files and books. It supports various models like gpt-4, gpt-3.5-turbo, claude-2, palm, llama-2, azure-openai, command-nightly, and gemini. Users need ChatGPT or OpenAI token, epub/txt books, internet access, and Python 3.8+. The tool provides options to specify OpenAI API key, model selection, target language, proxy server, context addition, translation style, and more. It generates bilingual books in epub format after translation. Users can test translations, set batch size, tweak prompts, and use different models like DeepL, Google Gemini, Tencent TranSmart, and more. The tool also supports retranslation, translating specific tags, and e-reader type specification. Docker usage is available for easy setup.

github

: 7.8k

For similar tasks

agent-device

github

: 530

For similar jobs

goat

GOAT (Great Onchain Agent Toolkit) is an open-source framework designed to simplify the process of making AI agents perform onchain actions by providing a provider-agnostic solution that abstracts away the complexities of interacting with blockchain tools such as wallets, token trading, and smart contracts. It offers a catalog of ready-made blockchain actions for agent developers and allows dApp/smart contract developers to develop plugins for easy access by agents. With compatibility across popular agent frameworks, support for multiple blockchains and wallet providers, and customizable onchain functionalities, GOAT aims to streamline the integration of blockchain capabilities into AI agents.

github

: 535

typedai

TypedAI is a TypeScript-first AI platform designed for developers to create and run autonomous AI agents, LLM based workflows, and chatbots. It offers advanced autonomous agents, software developer agents, pull request code review agent, AI chat interface, Slack chatbot, and supports various LLM services. The platform features configurable Human-in-the-loop settings, functional callable tools/integrations, CLI and Web UI interface, and can be run locally or deployed on the cloud with multi-user/SSO support. It leverages the Python AI ecosystem through executing Python scripts/packages and provides flexible run/deploy options like single user mode, Firestore & Cloud Run deployment, and multi-user SSO enterprise deployment. TypedAI also includes UI examples, code examples, and automated LLM function schemas for seamless development and execution of AI workflows.

github

: 1.0k

appworld

AppWorld is a high-fidelity execution environment of 9 day-to-day apps, operable via 457 APIs, populated with digital activities of ~100 people living in a simulated world. It provides a benchmark of natural, diverse, and challenging autonomous agent tasks requiring rich and interactive coding. The repository includes implementations of AppWorld apps and APIs, along with tests. It also introduces safety features for code execution and provides guides for building agents and extending the benchmark.

github

: 170

mcp-agent

mcp-agent is a simple, composable framework designed to build agents using the Model Context Protocol. It handles the lifecycle of MCP server connections and implements patterns for building production-ready AI agents in a composable way. The framework also includes OpenAI's Swarm pattern for multi-agent orchestration in a model-agnostic manner, making it the simplest way to build robust agent applications. It is purpose-built for the shared protocol MCP, lightweight, and closer to an agent pattern library than a framework. mcp-agent allows developers to focus on the core business logic of their AI applications by handling mechanics such as server connections, working with LLMs, and supporting external signals like human input.

github

: 7.4k

openrouter-kit

OpenRouter Kit is a powerful TypeScript/JavaScript library for interacting with the OpenRouter API. It simplifies working with LLMs by providing a high-level API for chats, dialogue history management, tool calls with error handling, security module, and cost tracking. Ideal for building chatbots, AI agents, and integrating LLMs into applications.

github

: 59

starknet-agentic

Open-source stack for giving AI agents wallets, identity, reputation, and execution rails on Starknet. `starknet-agentic` is a monorepo with Cairo smart contracts for agent wallets, identity, reputation, and validation, TypeScript packages for MCP tools, A2A integration, and payment signing, reusable skills for common Starknet agent capabilities, and examples and docs for integration. It provides contract primitives + runtime tooling in one place for integrating agents. The repo includes various layers such as Agent Frameworks / Apps, Integration + Runtime Layer, Packages / Tooling Layer, Cairo Contract Layer, and Starknet L2. It aims for portability of agent integrations without giving up Starknet strengths, with a cross-chain interop strategy and skills marketplace. The repository layout consists of directories for contracts, packages, skills, examples, docs, and website.

github

: 57

SwiftAgent

A type-safe, declarative framework for building AI agents in Swift, SwiftAgent is built on Apple FoundationModels. It allows users to compose agents by combining Steps in a declarative syntax similar to SwiftUI. The framework ensures compile-time checked input/output types, native Apple AI integration, structured output generation, and built-in security features like permission, sandbox, and guardrail systems. SwiftAgent is extensible with MCP integration, distributed agents, and a skills system. Users can install SwiftAgent with Swift 6.2+ on iOS 26+, macOS 26+, or Xcode 26+ using Swift Package Manager.

github

: 73

agent-device

github

: 530