
BrowserGym
🌎💪 BrowserGym, a Gym environment for web task automation
Stars: 456

BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.
README:
🛠️ Setup - 🏋 Usage - 💻 Demo - 🌐 Ecosystem - 🚀 AgentLab - 🌟 Contributors - 📄 Paper - 📝 Citation
pip install browsergym
[!WARNING] BrowserGym is meant to provide an open, easy-to-use and extensible framework to accelerate the field of web agent research. It is not meant to be a consumer product. Use with caution!
[!TIP] 🚀 Check out AgentLab✨ ! A seamless framework to implement, test, and evaluate your web agents on all BrowserGym benchmarks.
https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85
Example of a GPT4-V agent executing openended tasks (top row, chat interactive), as well as WebArena and WorkArena tasks (bottom row).
BrowserGym includes the following benchmarks by default:
- MiniWoB
- WebArena
- VisualWebArena
- WorkArena
- AssistantBench
- WebLINX (static benchmark)
Designing new web benchmarks with BrowserGym is easy, and simply requires to inherit the AbstractBrowserTask
class.
To use browsergym, install one of the following packages:
pip install browsergym # (recommended) everything below
pip install browsergym-experiments # experiment utilities (agent, loop, benchmarks) + everything below
pip install browsergym-core # core functionalities only (no benchmark, just the openended task)
pip install browsergym-miniwob # core + miniwob
pip install browsergym-webarena # core + webarena
pip install browsergym-visualwebarena # core + visualwebarena
pip install browsergym-workarena # core + workarena
pip install browsergym-assistantbench # core + assistantbench
pip install weblinx-browsergym # core + weblinx
Then setup playwright by running
playwright install chromium
Finally, each benchmark comes with its own specific setup that requires to follow additional steps.
- for MiniWoB++, see miniwob/README.md
- for WebArena, see webarena/README.md
- for VisualWebArena, see visualwebarena/README.md
- for WorkArena, see WorkArena
- for AssistantBench, see assistantbench/README.md
To install browsergym locally for development, use the following commands:
git clone [email protected]:ServiceNow/BrowserGym.git
cd BrowserGym
make install
Contributions are welcome! 😊
Boilerplate code to run an agent on an interactive, open-ended task:
import gymnasium as gym
import browsergym.core # register the openended task as a gym environment
# start an openended environment
env = gym.make(
"browsergym/openended",
task_kwargs={"start_url": "https://www.google.com/"}, # starting URL
wait_for_user_message=True, # wait for a user message after each agent message sent to the chat
)
# run the environment <> agent loop until termination
obs, info = env.reset()
while True:
action = ... # implement your agent here
obs, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
break
# release the environment
env.close()
MiniWoB
import gymnasium as gym
import browsergym.miniwob # register miniwob tasks as gym environments
# start a miniwob task
env = gym.make("browsergym/miniwob.choose-list")
...
# list all the available miniwob tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/miniwob")]
print("\n".join(env_ids))
WorkArena
import gymnasium as gym
import browsergym.workarena # register workarena tasks as gym environments
# start a workarena task
env = gym.make("browsergym/workarena.servicenow.order-ipad-pro")
...
# list all the available workarena tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
WebArena
import gymnasium as gym
import browsergym.webarena # register webarena tasks as gym environments
# start a webarena task
env = gym.make("browsergym/webarena.310")
...
# list all the available webarena tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/webarena")]
print("\n".join(env_ids))
VisualWebArena
import gymnasium as gym
import browsergym.webarena # register webarena tasks as gym environments
# start a visualwebarena task
env = gym.make("browsergym/visualwebarena.721")
...
# list all the available visualwebarena tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/visualwebarena")]
print("\n".join(env_ids))
AssistantBench
import gymnasium as gym
import browsergym.workarena # register assistantbench tasks as gym environments
# start an assistantbench task
env = gym.make("browsergym/assistantbench.validation.3")
...
# list all the available assistantbench tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
If you want to experiment with a demo agent in BrowserGym, follow these steps
# conda setup
conda env create -f demo_agent/environment.yml
conda activate demo_agent
# or pip setup
pip install -r demo_agent/requirements.txt
# then download the browser for playwright
playwright install chromium
Our demo agent uses openai
as a backend, be sure to set your OPENAI_API_KEY
.
Launch the demo agent as follows
# openended (interactive chat mode)
python demo_agent/run_demo.py --task_name openended --start_url https://www.google.com
# miniwob
python demo_agent/run_demo.py --task_name miniwob.click-test
# workarena
python demo_agent/run_demo.py --task_name workarena.servicenow.order-standard-laptop
# webarena
python demo_agent/run_demo.py --task_name webarena.4
# visualwebarena
python demo_agent/run_demo.py --task_name visualwebarena.398
You can customize your experience by changing the model_name
to your preferred LLM (it uses gpt-4o-mini
by default), adding screenshots for your VLMs with use_screenshot
, and much more!
python demo_agent/run_demo.py --help
- AgentLab: Seamlessly run agents on benchmarks, collect and analyse traces.
- WorkArena(++): A benchmark for web agents on the ServiceNow platform.
- WebArena: A benchmark of realistic web tasks on self-hosted domains.
- VisualWebArena: A benchmark of realistic visual web tasks on self-hosted domains.
- MiniWoB(++): A collection of over 100 web tasks on synthetic web pages.
- WebLINX: A dataset of real-world web interaction traces.
- AssistantBench: A benchmark of realistic and time-consuming tasks on the open web.
Please use the following BibTeX to cite our work:
@inproceedings{workarena2024,
title = {{W}ork{A}rena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?},
author = {Drouin, Alexandre and Gasse, Maxime and Caccia, Massimo and Laradji, Issam H. and Del Verme, Manuel and Marty, Tom and Vazquez, David and Chapados, Nicolas and Lacoste, Alexandre},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {11642--11662},
year = {2024},
editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
volume = {235},
series = {Proceedings of Machine Learning Research},
month = {21--27 Jul},
publisher = {PMLR},
url = {https://proceedings.mlr.press/v235/drouin24a.html},
}
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for BrowserGym
Similar Open Source Tools

BrowserGym
BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.

claude-code.nvim
Claude Code Neovim Plugin is a seamless integration between Claude Code AI assistant and Neovim. It allows users to toggle Claude Code in a terminal window with a single key press, automatically detect and reload files modified by Claude Code, provide real-time buffer updates when files are changed externally, offer customizable window position and size, integrate with which-key, use git project root as working directory, maintain a modular code structure, provide type annotations with LuaCATS for better IDE support, offer configuration validation, and include a testing framework for reliability. The plugin creates a terminal buffer running the Claude Code CLI, sets up autocommands to detect file changes on disk, automatically reloads files modified by Claude Code, provides keymaps and commands for toggling the terminal, and detects git repositories to set the working directory to the git root.

complexity
Complexity is a community-driven, open-source, and free third-party extension that enhances the features of Perplexity.ai. It provides various UI/UX/QoL tweaks, LLM/Image gen model selectors, a customizable theme, and a prompts library. The tool intercepts network traffic to alter the behavior of the host page, offering a solution to the limitations of Perplexity.ai. Users can install Complexity from Chrome Web Store, Mozilla Add-on, or build it from the source code.

modelscope-agent
ModelScope-Agent is a customizable and scalable Agent framework. A single agent has abilities such as role-playing, LLM calling, tool usage, planning, and memory. It mainly has the following characteristics: - **Simple Agent Implementation Process**: Simply specify the role instruction, LLM name, and tool name list to implement an Agent application. The framework automatically arranges workflows for tool usage, planning, and memory. - **Rich models and tools**: The framework is equipped with rich LLM interfaces, such as Dashscope and Modelscope model interfaces, OpenAI model interfaces, etc. Built in rich tools, such as **code interpreter**, **weather query**, **text to image**, **web browsing**, etc., make it easy to customize exclusive agents. - **Unified interface and high scalability**: The framework has clear tools and LLM registration mechanism, making it convenient for users to expand more diverse Agent applications. - **Low coupling**: Developers can easily use built-in tools, LLM, memory, and other components without the need to bind higher-level agents.

openvino.genai
The GenAI repository contains pipelines that implement image and text generation tasks. The implementation uses OpenVINO capabilities to optimize the pipelines. Each sample covers a family of models and suggests certain modifications to adapt the code to specific needs. It includes the following pipelines: 1. Benchmarking script for large language models 2. Text generation C++ samples that support most popular models like LLaMA 2 3. Stable Diffuison (with LoRA) C++ image generation pipeline 4. Latent Consistency Model (with LoRA) C++ image generation pipeline

infinity
Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. It is developed under the MIT License and powers inference behind Gradient.ai. The API allows users to deploy models from SentenceTransformers, offers fast inference backends utilizing various accelerators, dynamic batching for efficient processing, correct and tested implementation, and easy-to-use API built on FastAPI with Swagger documentation. Users can embed text, rerank documents, and perform text classification tasks using the tool. Infinity supports various models from Huggingface and provides flexibility in deployment via CLI, Docker, Python API, and cloud services like dstack. The tool is suitable for tasks like embedding, reranking, and text classification.

docling
Docling is a tool that bundles PDF document conversion to JSON and Markdown in an easy, self-contained package. It can convert any PDF document to JSON or Markdown format, understand detailed page layout, reading order, recover table structures, extract metadata such as title, authors, references, and language, and optionally apply OCR for scanned PDFs. The tool is designed to be stable, lightning fast, and suitable for macOS and Linux environments.

J.A.R.V.I.S
J.A.R.V.I.S (Just A Rather Very Intelligent System) is an advanced AI assistant inspired by Iron Man's Jarvis, designed to assist with various tasks, from navigating websites to controlling your PC with natural language commands.

CodeTF
CodeTF is a Python transformer-based library for code large language models (Code LLMs) and code intelligence. It provides an interface for training and inferencing on tasks like code summarization, translation, and generation. The library offers utilities for code manipulation across various languages, including easy extraction of code attributes. Using tree-sitter as its core AST parser, CodeTF enables parsing of function names, comments, and variable names. It supports fast model serving, fine-tuning of LLMs, various code intelligence tasks, preprocessed datasets, model evaluation, pretrained and fine-tuned models, and utilities to manipulate source code. CodeTF aims to facilitate the integration of state-of-the-art Code LLMs into real-world applications, ensuring a user-friendly environment for code intelligence tasks.

fiftyone
FiftyOne is an open-source tool designed for building high-quality datasets and computer vision models. It supercharges machine learning workflows by enabling users to visualize datasets, interpret models faster, and improve efficiency. With FiftyOne, users can explore scenarios, identify failure modes, visualize complex labels, evaluate models, find annotation mistakes, and much more. The tool aims to streamline the process of improving machine learning models by providing a comprehensive set of features for data analysis and model interpretation.

RD-Agent
RD-Agent is a tool designed to automate critical aspects of industrial R&D processes, focusing on data-driven scenarios to streamline model and data development. It aims to propose new ideas ('R') and implement them ('D') automatically, leading to solutions of significant industrial value. The tool supports scenarios like Automated Quantitative Trading, Data Mining Agent, Research Copilot, and more, with a framework to push the boundaries of research in data science. Users can create a Conda environment, install the RDAgent package from PyPI, configure GPT model, and run various applications for tasks like quantitative trading, model evolution, medical prediction, and more. The tool is intended to enhance R&D processes and boost productivity in industrial settings.

esp-ai
ESP-AI provides a complete AI conversation solution for your development board, including IAT+LLM+TTS integration solutions for ESP32 series development boards. It can be injected into projects without affecting existing ones. By providing keys from platforms like iFlytek, Jiling, and local services, you can run the services without worrying about interactions between services or between development boards and services. The project's server-side code is based on Node.js, and the hardware code is based on Arduino IDE.

typedai
TypedAI is a TypeScript-first AI platform designed for developers to create and run autonomous AI agents, LLM based workflows, and chatbots. It offers advanced autonomous agents, software developer agents, pull request code review agent, AI chat interface, Slack chatbot, and supports various LLM services. The platform features configurable Human-in-the-loop settings, functional callable tools/integrations, CLI and Web UI interface, and can be run locally or deployed on the cloud with multi-user/SSO support. It leverages the Python AI ecosystem through executing Python scripts/packages and provides flexible run/deploy options like single user mode, Firestore & Cloud Run deployment, and multi-user SSO enterprise deployment. TypedAI also includes UI examples, code examples, and automated LLM function schemas for seamless development and execution of AI workflows.

scikit-llm
Scikit-LLM is a tool that seamlessly integrates powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks. It allows users to leverage large language models for various text analysis applications within the familiar scikit-learn framework. The tool simplifies the process of incorporating advanced language processing capabilities into machine learning pipelines, enabling users to benefit from the latest advancements in natural language processing.

aircrack-ng
Aircrack-ng is a comprehensive suite of tools designed to evaluate the security of WiFi networks. It covers various aspects of WiFi security, including monitoring, attacking (replay attacks, deauthentication, fake access points), testing WiFi cards and driver capabilities, and cracking WEP and WPA PSK. The tools are command line-based, allowing for extensive scripting and have been utilized by many GUIs. Aircrack-ng primarily works on Linux but also supports Windows, macOS, FreeBSD, OpenBSD, NetBSD, Solaris, and eComStation 2.
For similar tasks

BrowserGym
BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.
For similar jobs

weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.