![BrowserGym](/statics/github-mark.png)
BrowserGym
🌎💪 BrowserGym, a Gym environment for web task automation
Stars: 456
![screenshot](/screenshots_githubs/ServiceNow-BrowserGym.jpg)
BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.
README:
🛠️ Setup - 🏋 Usage - 💻 Demo - 🌐 Ecosystem - 🚀 AgentLab - 🌟 Contributors - 📄 Paper - 📝 Citation
pip install browsergym
[!WARNING] BrowserGym is meant to provide an open, easy-to-use and extensible framework to accelerate the field of web agent research. It is not meant to be a consumer product. Use with caution!
[!TIP] 🚀 Check out AgentLab✨ ! A seamless framework to implement, test, and evaluate your web agents on all BrowserGym benchmarks.
https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85
Example of a GPT4-V agent executing openended tasks (top row, chat interactive), as well as WebArena and WorkArena tasks (bottom row).
BrowserGym includes the following benchmarks by default:
- MiniWoB
- WebArena
- VisualWebArena
- WorkArena
- AssistantBench
- WebLINX (static benchmark)
Designing new web benchmarks with BrowserGym is easy, and simply requires to inherit the AbstractBrowserTask
class.
To use browsergym, install one of the following packages:
pip install browsergym # (recommended) everything below
pip install browsergym-experiments # experiment utilities (agent, loop, benchmarks) + everything below
pip install browsergym-core # core functionalities only (no benchmark, just the openended task)
pip install browsergym-miniwob # core + miniwob
pip install browsergym-webarena # core + webarena
pip install browsergym-visualwebarena # core + visualwebarena
pip install browsergym-workarena # core + workarena
pip install browsergym-assistantbench # core + assistantbench
pip install weblinx-browsergym # core + weblinx
Then setup playwright by running
playwright install chromium
Finally, each benchmark comes with its own specific setup that requires to follow additional steps.
- for MiniWoB++, see miniwob/README.md
- for WebArena, see webarena/README.md
- for VisualWebArena, see visualwebarena/README.md
- for WorkArena, see WorkArena
- for AssistantBench, see assistantbench/README.md
To install browsergym locally for development, use the following commands:
git clone [email protected]:ServiceNow/BrowserGym.git
cd BrowserGym
make install
Contributions are welcome! 😊
Boilerplate code to run an agent on an interactive, open-ended task:
import gymnasium as gym
import browsergym.core # register the openended task as a gym environment
# start an openended environment
env = gym.make(
"browsergym/openended",
task_kwargs={"start_url": "https://www.google.com/"}, # starting URL
wait_for_user_message=True, # wait for a user message after each agent message sent to the chat
)
# run the environment <> agent loop until termination
obs, info = env.reset()
while True:
action = ... # implement your agent here
obs, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
break
# release the environment
env.close()
MiniWoB
import gymnasium as gym
import browsergym.miniwob # register miniwob tasks as gym environments
# start a miniwob task
env = gym.make("browsergym/miniwob.choose-list")
...
# list all the available miniwob tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/miniwob")]
print("\n".join(env_ids))
WorkArena
import gymnasium as gym
import browsergym.workarena # register workarena tasks as gym environments
# start a workarena task
env = gym.make("browsergym/workarena.servicenow.order-ipad-pro")
...
# list all the available workarena tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
WebArena
import gymnasium as gym
import browsergym.webarena # register webarena tasks as gym environments
# start a webarena task
env = gym.make("browsergym/webarena.310")
...
# list all the available webarena tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/webarena")]
print("\n".join(env_ids))
VisualWebArena
import gymnasium as gym
import browsergym.webarena # register webarena tasks as gym environments
# start a visualwebarena task
env = gym.make("browsergym/visualwebarena.721")
...
# list all the available visualwebarena tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/visualwebarena")]
print("\n".join(env_ids))
AssistantBench
import gymnasium as gym
import browsergym.workarena # register assistantbench tasks as gym environments
# start an assistantbench task
env = gym.make("browsergym/assistantbench.validation.3")
...
# list all the available assistantbench tasks
env_ids = [id for id in gym.envs.registry.keys() if id.startswith("browsergym/workarena")]
print("\n".join(env_ids))
If you want to experiment with a demo agent in BrowserGym, follow these steps
# conda setup
conda env create -f demo_agent/environment.yml
conda activate demo_agent
# or pip setup
pip install -r demo_agent/requirements.txt
# then download the browser for playwright
playwright install chromium
Our demo agent uses openai
as a backend, be sure to set your OPENAI_API_KEY
.
Launch the demo agent as follows
# openended (interactive chat mode)
python demo_agent/run_demo.py --task_name openended --start_url https://www.google.com
# miniwob
python demo_agent/run_demo.py --task_name miniwob.click-test
# workarena
python demo_agent/run_demo.py --task_name workarena.servicenow.order-standard-laptop
# webarena
python demo_agent/run_demo.py --task_name webarena.4
# visualwebarena
python demo_agent/run_demo.py --task_name visualwebarena.398
You can customize your experience by changing the model_name
to your preferred LLM (it uses gpt-4o-mini
by default), adding screenshots for your VLMs with use_screenshot
, and much more!
python demo_agent/run_demo.py --help
- AgentLab: Seamlessly run agents on benchmarks, collect and analyse traces.
- WorkArena(++): A benchmark for web agents on the ServiceNow platform.
- WebArena: A benchmark of realistic web tasks on self-hosted domains.
- VisualWebArena: A benchmark of realistic visual web tasks on self-hosted domains.
- MiniWoB(++): A collection of over 100 web tasks on synthetic web pages.
- WebLINX: A dataset of real-world web interaction traces.
- AssistantBench: A benchmark of realistic and time-consuming tasks on the open web.
Please use the following BibTeX to cite our work:
@inproceedings{workarena2024,
title = {{W}ork{A}rena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?},
author = {Drouin, Alexandre and Gasse, Maxime and Caccia, Massimo and Laradji, Issam H. and Del Verme, Manuel and Marty, Tom and Vazquez, David and Chapados, Nicolas and Lacoste, Alexandre},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {11642--11662},
year = {2024},
editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
volume = {235},
series = {Proceedings of Machine Learning Research},
month = {21--27 Jul},
publisher = {PMLR},
url = {https://proceedings.mlr.press/v235/drouin24a.html},
}
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for BrowserGym
Similar Open Source Tools
![BrowserGym Screenshot](/screenshots_githubs/ServiceNow-BrowserGym.jpg)
BrowserGym
BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.
![catai Screenshot](/screenshots_githubs/withcatai-catai.jpg)
catai
CatAI is a tool that allows users to run GGUF models on their computer with a chat UI. It serves as a local AI assistant inspired by Node-Llama-Cpp and Llama.cpp. The tool provides features such as auto-detecting programming language, showing original messages by clicking on user icons, real-time text streaming, and fast model downloads. Users can interact with the tool through a CLI that supports commands for installing, listing, setting, serving, updating, and removing models. CatAI is cross-platform and supports Windows, Linux, and Mac. It utilizes node-llama-cpp and offers a simple API for asking model questions. Additionally, developers can integrate the tool with node-llama-cpp@beta for model management and chatting. The configuration can be edited via the web UI, and contributions to the project are welcome. The tool is licensed under Llama.cpp's license.
![modelscope-agent Screenshot](/screenshots_githubs/modelscope-modelscope-agent.jpg)
modelscope-agent
ModelScope-Agent is a customizable and scalable Agent framework. A single agent has abilities such as role-playing, LLM calling, tool usage, planning, and memory. It mainly has the following characteristics: - **Simple Agent Implementation Process**: Simply specify the role instruction, LLM name, and tool name list to implement an Agent application. The framework automatically arranges workflows for tool usage, planning, and memory. - **Rich models and tools**: The framework is equipped with rich LLM interfaces, such as Dashscope and Modelscope model interfaces, OpenAI model interfaces, etc. Built in rich tools, such as **code interpreter**, **weather query**, **text to image**, **web browsing**, etc., make it easy to customize exclusive agents. - **Unified interface and high scalability**: The framework has clear tools and LLM registration mechanism, making it convenient for users to expand more diverse Agent applications. - **Low coupling**: Developers can easily use built-in tools, LLM, memory, and other components without the need to bind higher-level agents.
![openvino.genai Screenshot](/screenshots_githubs/openvinotoolkit-openvino.genai.jpg)
openvino.genai
The GenAI repository contains pipelines that implement image and text generation tasks. The implementation uses OpenVINO capabilities to optimize the pipelines. Each sample covers a family of models and suggests certain modifications to adapt the code to specific needs. It includes the following pipelines: 1. Benchmarking script for large language models 2. Text generation C++ samples that support most popular models like LLaMA 2 3. Stable Diffuison (with LoRA) C++ image generation pipeline 4. Latent Consistency Model (with LoRA) C++ image generation pipeline
![LocalAI Screenshot](/screenshots_githubs/mudler-LocalAI.jpg)
LocalAI
LocalAI is a free and open-source OpenAI alternative that acts as a drop-in replacement REST API compatible with OpenAI (Elevenlabs, Anthropic, etc.) API specifications for local AI inferencing. It allows users to run LLMs, generate images, audio, and more locally or on-premises with consumer-grade hardware, supporting multiple model families and not requiring a GPU. LocalAI offers features such as text generation with GPTs, text-to-audio, audio-to-text transcription, image generation with stable diffusion, OpenAI functions, embeddings generation for vector databases, constrained grammars, downloading models directly from Huggingface, and a Vision API. It provides a detailed step-by-step introduction in its Getting Started guide and supports community integrations such as custom containers, WebUIs, model galleries, and various bots for Discord, Slack, and Telegram. LocalAI also offers resources like an LLM fine-tuning guide, instructions for local building and Kubernetes installation, projects integrating LocalAI, and a how-tos section curated by the community. It encourages users to cite the repository when utilizing it in downstream projects and acknowledges the contributions of various software from the community.
![starwhale Screenshot](/screenshots_githubs/star-whale-starwhale.jpg)
starwhale
Starwhale is an MLOps/LLMOps platform that brings efficiency and standardization to machine learning operations. It streamlines the model development lifecycle, enabling teams to optimize workflows around key areas like model building, evaluation, release, and fine-tuning. Starwhale abstracts Model, Runtime, and Dataset as first-class citizens, providing tailored capabilities for common workflow scenarios including Models Evaluation, Live Demo, and LLM Fine-tuning. It is an open-source platform designed for clarity and ease of use, empowering developers to build customized MLOps features tailored to their needs.
![RD-Agent Screenshot](/screenshots_githubs/microsoft-RD-Agent.jpg)
RD-Agent
RD-Agent is a tool designed to automate critical aspects of industrial R&D processes, focusing on data-driven scenarios to streamline model and data development. It aims to propose new ideas ('R') and implement them ('D') automatically, leading to solutions of significant industrial value. The tool supports scenarios like Automated Quantitative Trading, Data Mining Agent, Research Copilot, and more, with a framework to push the boundaries of research in data science. Users can create a Conda environment, install the RDAgent package from PyPI, configure GPT model, and run various applications for tasks like quantitative trading, model evolution, medical prediction, and more. The tool is intended to enhance R&D processes and boost productivity in industrial settings.
![J.A.R.V.I.S Screenshot](/screenshots_githubs/AnubhavChaturvedi-GitHub-J.A.R.V.I.S.jpg)
J.A.R.V.I.S
J.A.R.V.I.S (Just A Rather Very Intelligent System) is an advanced AI assistant inspired by Iron Man's Jarvis, designed to assist with various tasks, from navigating websites to controlling your PC with natural language commands.
![docling Screenshot](/screenshots_githubs/DS4SD-docling.jpg)
docling
Docling is a tool that bundles PDF document conversion to JSON and Markdown in an easy, self-contained package. It can convert any PDF document to JSON or Markdown format, understand detailed page layout, reading order, recover table structures, extract metadata such as title, authors, references, and language, and optionally apply OCR for scanned PDFs. The tool is designed to be stable, lightning fast, and suitable for macOS and Linux environments.
![CodeTF Screenshot](/screenshots_githubs/salesforce-CodeTF.jpg)
CodeTF
CodeTF is a Python transformer-based library for code large language models (Code LLMs) and code intelligence. It provides an interface for training and inferencing on tasks like code summarization, translation, and generation. The library offers utilities for code manipulation across various languages, including easy extraction of code attributes. Using tree-sitter as its core AST parser, CodeTF enables parsing of function names, comments, and variable names. It supports fast model serving, fine-tuning of LLMs, various code intelligence tasks, preprocessed datasets, model evaluation, pretrained and fine-tuned models, and utilities to manipulate source code. CodeTF aims to facilitate the integration of state-of-the-art Code LLMs into real-world applications, ensuring a user-friendly environment for code intelligence tasks.
![beyondllm Screenshot](/screenshots_githubs/aiplanethub-beyondllm.jpg)
beyondllm
Beyond LLM offers an all-in-one toolkit for experimentation, evaluation, and deployment of Retrieval-Augmented Generation (RAG) systems. It simplifies the process with automated integration, customizable evaluation metrics, and support for various Large Language Models (LLMs) tailored to specific needs. The aim is to reduce LLM hallucination risks and enhance reliability.
![autogen Screenshot](/screenshots_githubs/microsoft-autogen.jpg)
autogen
AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.
![esp-ai Screenshot](/screenshots_githubs/wangzongming-esp-ai.jpg)
esp-ai
ESP-AI provides a complete AI conversation solution for your development board, including IAT+LLM+TTS integration solutions for ESP32 series development boards. It can be injected into projects without affecting existing ones. By providing keys from platforms like iFlytek, Jiling, and local services, you can run the services without worrying about interactions between services or between development boards and services. The project's server-side code is based on Node.js, and the hardware code is based on Arduino IDE.
![stable-diffusion.cpp Screenshot](/screenshots_githubs/leejet-stable-diffusion.cpp.jpg)
stable-diffusion.cpp
The stable-diffusion.cpp repository provides an implementation for inferring stable diffusion in pure C/C++. It offers features such as support for different versions of stable diffusion, lightweight and dependency-free implementation, various quantization support, memory-efficient CPU inference, GPU acceleration, and more. Users can download the built executable program or build it manually. The repository also includes instructions for downloading weights, building from scratch, using different acceleration methods, running the tool, converting weights, and utilizing various features like Flash Attention, ESRGAN upscaling, PhotoMaker support, and more. Additionally, it mentions future TODOs and provides information on memory requirements, bindings, UIs, contributors, and references.
![scikit-llm Screenshot](/screenshots_githubs/iryna-kondr-scikit-llm.jpg)
scikit-llm
Scikit-LLM is a tool that seamlessly integrates powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks. It allows users to leverage large language models for various text analysis applications within the familiar scikit-learn framework. The tool simplifies the process of incorporating advanced language processing capabilities into machine learning pipelines, enabling users to benefit from the latest advancements in natural language processing.
![aircrack-ng Screenshot](/screenshots_githubs/aircrack-ng-aircrack-ng.jpg)
aircrack-ng
Aircrack-ng is a comprehensive suite of tools designed to evaluate the security of WiFi networks. It covers various aspects of WiFi security, including monitoring, attacking (replay attacks, deauthentication, fake access points), testing WiFi cards and driver capabilities, and cracking WEP and WPA PSK. The tools are command line-based, allowing for extensive scripting and have been utilized by many GUIs. Aircrack-ng primarily works on Linux but also supports Windows, macOS, FreeBSD, OpenBSD, NetBSD, Solaris, and eComStation 2.
For similar tasks
![BrowserGym Screenshot](/screenshots_githubs/ServiceNow-BrowserGym.jpg)
BrowserGym
BrowserGym is an open, easy-to-use, and extensible framework designed to accelerate web agent research. It provides benchmarks like MiniWoB, WebArena, VisualWebArena, WorkArena, AssistantBench, and WebLINX. Users can design new web benchmarks by inheriting the AbstractBrowserTask class. The tool allows users to install different packages for core functionalities, experiments, and specific benchmarks. It supports the development setup and offers boilerplate code for running agents on various tasks. BrowserGym is not a consumer product and should be used with caution.
For similar jobs
![weave Screenshot](/screenshots_githubs/wandb-weave.jpg)
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
![LLMStack Screenshot](/screenshots_githubs/trypromptly-LLMStack.jpg)
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
![VisionCraft Screenshot](/screenshots_githubs/VisionCraft-org-VisionCraft.jpg)
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
![kaito Screenshot](/screenshots_githubs/Azure-kaito.jpg)
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
![PyRIT Screenshot](/screenshots_githubs/Azure-PyRIT.jpg)
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
![tabby Screenshot](/screenshots_githubs/TabbyML-tabby.jpg)
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
![spear Screenshot](/screenshots_githubs/isl-org-spear.jpg)
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
![Magick Screenshot](/screenshots_githubs/Oneirocom-Magick.jpg)
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.