WindowsAgentArena

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.

Stars: 147

Visit

Windows Agent Arena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, enabling testing of agentic AI workflows across various tasks. WAA supports deploying agents at scale using Azure ML cloud infrastructure, allowing parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes.

README:

Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.

WAA supports the deployment of agents at scale using the Azure ML cloud infrastructure, allowing for the parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes, not days.

📢 Updates

2024-09-13: We released our paper, code, project page, and blog post. Check it out!

📚 Citation

Our technical report paper can be found here. If you find this environment useful, please consider citing our work:

@article{bonatti2024windows,
author = { Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon, and Abdali, Sara and Li, Yinheng and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
title = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
institution = {Microsoft},
year = {2024},
month = {September}, 
}

☝️ Pre-requisites:

Docker daemon installed and running. On Windows, we recommend using Docker with WSL 2.
An OpenAI or Azure OpenAI API Key.
Python 3.9 - we recommend using Conda and creating an adhoc python environment for running the scripts. For creating a new environment run conda create -n winarena python=3.9.

Clone the repository and install dependencies:

git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena

# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt

💻 Local deployment (WSL or Linux)

1. Configuration file

Create a new config.json at the root of the project with the necessary keys (from OpenAI or Azure endpoints):

{
    "OPENAI_API_KEY": "<OPENAI_API_KEY>", // if you are using OpenAI endpoint
    "AZURE_API_KEY": "<AZURE_API_KEY>",  // if you are using Azure endpoint
    "AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/", // if you are using Azure endpoint
}

2. Prepare the Windows Arena Docker image

To use the default docker image from Docker Hub:

docker pull windowsarena/winarena:latest

(Optional) 2.1 Build the Windows Arena Docker image locally:

To build your own image from scratch (optional):

cd scripts
./build-container-image.sh

For a list of parameters that can be changed during building of the docker images:

./build-container-image.sh --help

3. Prepare the Windows 11 image

3.1 Download Windows 11 Evaluation .iso file:

Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]
After downloading, rename the file to setup.iso and copy it to the directory WindowsAgentArena/src/win-arena-container/vm/image

3.2 Automatic Setup of the Windows 11 golden image:

Before running the arena, you need to prepare a new WAA snapshot (also referred as WAA golden image). This 30GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. This VM additionally hosts a Python server which receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.

To prepare the gold snapshot, run once:

cd ./scripts
./run-local.sh --prepare-image true

During development, if you want to include any changes made in the src/win-arena-container directory in the WAA golden image, please ensure to specify the flag --skip-build false (default to true). This will ensure that a new container image is built instead than using the prebuilt windowsarena/winarena:latest image.

You can check the VM install screen by accessing http://localhost:8006 in your browser (unless you have provided an alternate --browser-port parameter). The preparation process is fully automated and will take around 20 minutes.

Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.

At the end, you should expect the Docker container named winarena to gracefully terminate as shown from the below logs.

You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:

NOTES:

If you have previously run an installation process and want to do it again from scratch, make sure to delete the content of storage.
We recommend copying this storage folder to a safe location outside of the repository in case you or the agent accidentally corrupt the VM at some point and you want to avoid a fresh setup.
Depending on your docker settings, you might have to run the above command with sudo.
Running on WSL2? If you encounter the error /bin/bash: bad interpreter: No such file or directory, we recommend converting the bash scripts from DOS/Windows format to Unix format:

cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +

4. Deploying the agent in the arena

4.1 Running the base benchmark

The entire setup runs inside a docker container. The entry point for the agent is the src/win-arena-container/run.py script (copied to /client/run.py in the container). The Windows OS runs as a VM process inside the container, and they communicate via GET/POST. To run the entire setup at once, run:

cd scripts
./run-local.sh --start-client true

On your host, open your browser and go to http://localhost:8006 to see the Windows VM with the agent running.

For a list of parameters that can be changed:

./run-local.sh --help

At the end of the run you can display the results using the command:

cd src/win-arena-container/client
python show_results.py --result_dir <path_to_results_folder>

The table below provides a comparison of various combinations of hyperparameters used by the Navi agent in our study, which can be overridden by specifying --som-origin <som_origin> --a11y-backend <a11y_backend> when running the run-local.sh script:

Hyperparameter	Possible Values	Description	Recommended Complementary Value
`som_origin`	`oss`, `a11y`, `mixed-oss`	Determines how the Set-of-Mark (SoM) is achieved.	`win32` for `oss`; `uia` for `a11y`, `mixed-oss`
	`mixed-oss`	If set to any "mixed" option, the agent partially relies on the accessibility tree for SoM entities.	`uia` (more reliable but slower)
	`oss`	Uses webparse, groundingdino, and OCR (TesseractOCR) pipelines.	`win32` (faster performance)
	`a11y`	Relies on accessibility tree extraction for SoM.	`uia` (more reliable but slower)
`a11y_backend`	`win32`, `uia`	Dictates how the accessibility tree should be extracted.	`win32` for `oss`; `uia` for `a11y` and mixed types
	`win32`	Faster but less reliable accessibility tree extraction.	Use with `oss` or non-"mixed" types.
	`uia`	Slower but more reliable accessibility tree extraction.	Use with `a11y`, `mixed-oss`

4.2 Local development tips

At first sight it might seem challenging to develop/debug code running inside the docker container. However, we provide a few tips to make this process easier. Check the Development-Tips Doc for more details such as:

How to attach a VSCode window (with debugger) to the running container
How to change the agent and Windows server code from your local machine and see the changes reflected in real time in the container

🌐 Azure Deployment -> Parallelizing the benchmark

We offer a seamless way to run the Windows Agent Arena on Azure ML Compute VMs. This option will significantly reduce the time needed to test your agent in all benchmark tasks from hours/days to minutes.

1. Set up the Azure resource group:

If you don't already have an Azure subscription, you can start a free trial. Take note of the subscription id, we will use it as AZURE_SUBSCRIPTION_ID in Section 3.
In the Azure portal, create a new resource group (e.g. agents) in the region of your choice. Take note of the resource group name, we will use it as AZURE_ML_RESOURCE_GROUP in Section 3.
Within this resource group, create an Azure Machine Learning resource (e.g. name it agents_ml). Take note of the ML workspace name, we will use it as AZURE_ML_WORKSPACE_NAME in Section 3. During the creation wizard, make sure to check the boxes for automatically creating new:
- Storage Account. Note: Take note of the Storage Account name, we will use it to upload the golden image in Section 2.
- Key vault.
- Application Insights.
- [optional] Container Registry. You can use the Azure Container Registry to privately store your custom docker images without the need to push them to the public Docker Hub.

Once creation is complete, navigate to the Azure Machine Learning portal and click on your workspace (agents)

In the workspace, navigate to the Notebooks tab. In your user-assigned folder (as shown in the figure below), create a new bash (.sh) file named compute-instance-startup.sh. Copy the content of scripts/azure_files/compute-instance-startup.sh into this file and save it. This script will be used every time a new VM is launched in Azure to apply some base configurations. Take note of the path where you save the file (in the form of Users/<YOUR_USER>/compute-instance-startup.sh), we will use it to run the script in Section 3.

[Optional] You might want to ask for more compute quota for your region depending on your needs. You can do so by navigating to the Azure Quota page. As a reference, we currently use the Standard_D8_v3 VM size for our benchmarking, which falls under the Standard Dv3 Family Cluster Dedicated vCPUs category. Each VM uses 8 cores. Make sure the machine type you use supports nested virtualization.

2. Uploading Windows 11 and Docker images to Azure

Upload the Windows 11 storage folder to the Blob container associated with your default datastore. By default, the Azure ML Workspace's underlying data is backed by a Storage Account through one or more ML datastores. The default datastore, named workspaceblobstore, is created during the workspace setup and linked to a Blob container under the Azure Storage Account. You can review the association between the datastores and containers by visiting Azure ML Datastore. Once found, you can then upload the storage folder in different ways:
- Download the Azure Storage Explorer program, log in, and select the blob container. Upload the WindowsAgentArena/src/win-arena-container/vm/storage folder from your local machine after running the local setup steps.
- Alternatively, you can use the Azure CLI to upload the folder. To install the CLI, follow the steps provided here. Once installed, you can use the following command:
```
az login --use-device-code # Only needed if prompted
az storage blob upload-batch --account-name <STORAGE_ACCOUNT_NAME> --destination <CONTAINER_NAME> --source <LOCAL_FOLDER>
# For a list of parameters check: https://docs.microsoft.com/en-us/cli/azure/storage/blob?view=azure-cli-latest
```
- Alternatively, use the Azure portal interface to upload the folder. Navigate to the storage account, click on Storage browser->Blob containers, select your container, and upload the folder. This option is not recommended for large files as connections might get unstable.

[Optional] If you are not using the default windowsarena/winarena:latest image, you can upload your custom image to the Azure Container Registry. You can do so by following the Azure Container Registry documentation

az login --use-device-code
# potentially needed if commands below don't work: az acr login --name <ACR_NAME>
docker login # you will be prompted to enter your ACR credentials (username + password which can be found in the Azure portal)
docker tag <IMAGE_NAME> <ACR_NAME>.azurecr.io/<IMAGE_NAME>:<TAG>
docker push <ACR_NAME>.azurecr.io/<IMAGE_NAME>:<TAG>

3. Environment configurations and deployment

Add the additional keys to the config.json file at the root of the project:

{
    ... // Your previous configs

    "AZURE_SUBSCRIPTION_ID": "<YOUR_AZURE_SUBSCRIPTION_ID>", 
    "AZURE_ML_RESOURCE_GROUP": "<YOUR_AZURE_ML_RESOURCE_GROUP>",
    "AZURE_ML_WORKSPACE_NAME": "<YOUR_AZURE_ML_WORKSPACE_NAME>"
}

Create a new file named experiments.json to specify any parameters needed for each experiment run, including the agent to deploy and the underlying LLM model to use. You can find a reference experiments.json consisting of multiple experiments to run at scripts/experiments.json:

{
  "experiment_1": {
    "ci_startup_script_path": "Users/<YOUR_USER>/compute-instance-startup.sh", // As seen in Section 1
    "agent": "navi",
    "datastore_input_path": "storage",
    "docker_img_name": "windowsarena/winarena:latest",
    "exp_name": "experiment_1",
    "num_workers": 4,
    "use_managed_identity": false,
    "json_name": "evaluation_examples_windows/test_all.json",
    "model_name": "gpt-4-1106-vision-preview",
    "som_origin": "oss", // or a11y, or mixed-oss
    "a11y_backend": "win32" // or uia
  }
  // ...
}

(Optional) You can also generate the experiments.json by using both the --experiments_json and --update_json parameters of run_azure.py, the above JSON is equivalent to the following command:

cd scripts
python run_azure.py --experiments_json "experiments.json" --update_json --exp_name "experiment_1" --ci_startup_script_path "Users/<YOUR_USER>/compute-instance-startup.sh" --agent "navi" --json_name "evaluation_examples_windows/test_all.json" --num_workers 4 --som_origin oss --a11y_backend win32

Deploy the agent on Azure ML Compute by running:

az login --use-device-code # https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
# If multiple tenants or subscriptions, make sure to select the right ones with:
# az login --use-device-code --tenant "<YOUR_AZURE_AD_TENANT_ID>"
# az account set --subscription "<YOUR_AZURE_AD_TENANT_ID>"

# Make sure you have installed the python requirements in your conda environment
# conda activate winarena
# pip install -r requirements.txt

# From your activated conda environment:
cd scripts
python run_azure.py --experiments_json "experiments.json"

For any unfinished experiments in experiments.json, the script will:

Create <num_workers Azure Compute Instance VMs.
Run one ML Training Job named <exp_name> per VM.
Dispose the VMs once the jobs are completed.

The logs from the run will be saved in a agent_outputs folder in the same blob container where you uploaded the Windows 11 image. You can download the agent_outputs folder to your local machine and run the show_azure.py script to see the results from every experiment as a markdown table.

cd scripts
python show_azure.py --json_config "experiments.json" --result_dir <path_to_downloaded_agent_outputs_folder>

🤖 BYOA: Bring Your Own Agent

Want to test your own agents in Windows Agent Arena? You can use our default agent as a template and create your own folder under src/win-arena-container/client/mm_agents. You just need to make sure that your agent.py file features predict() and reset() functions. For more information on agent development check out the BYOA Doc.

👩‍💻 Open-source contributions

We welcome contributions to the Windows Agent Arena project. In particular, we welcome:

New open-sourced agents to be added to the benchmark
New tasks to be added to our existing categories, or new categories altogether

If you are interested in contributing, please check out our Task Development Guidelines.

❓ FAQ

What are approximate running times and costs for the benchmark?

Component	Cost	Time
Azure Standard_D8_v3 VM	~$8 ($0.38/h * 40 * 0.5h)
GPT-4V	$100	~35min with 40 VMs
GPT-4o	$100	~35min with 40 VMs
GPT-4o-mini	$15	~30min with 40 VMs

👏 Acknowledgements

OS World for the original benchmark task framework.
Dockur for the Docker infrastructure underlying WAA.
GroundingDINO for the object detection module in our Navi Agent.
NotebookLM for our AI-generated podcast.

🤝 Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

🛡️ Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

For Tasks:

Click tags to check more tools for each tasks

test agents benchmark workflows deploy agents run benchmark tasks evaluate multi-modal agents

For Jobs:

ai researcher software developer data scientist machine learning engineer cloud engineer

Alternative AI tools for WindowsAgentArena

Similar Open Source Tools

WindowsAgentArena

github

: 147

torchchat

torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.

github

: 3.5k

telemetry-airflow

This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: * The `dags` directory in this repository contains some custom DAG definitions * Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl * The Data SRE team maintains a WTMO Developer Guide (behind SSO)

github

: 185

ai-starter-kit

SambaNova AI Starter Kits is a collection of open-source examples and guides designed to facilitate the deployment of AI-driven use cases for developers and enterprises. The kits cover various categories such as Data Ingestion & Preparation, Model Development & Optimization, Intelligent Information Retrieval, and Advanced AI Capabilities. Users can obtain a free API key using SambaNova Cloud or deploy models using SambaStudio. Most examples are written in Python but can be applied to any programming language. The kits provide resources for tasks like text extraction, fine-tuning embeddings, prompt engineering, question-answering, image search, post-call analysis, and more.

github

: 215

btp-genai-starter-kit

This repository provides a quick way for users of the SAP Business Technology Platform (BTP) to learn how to use generative AI with BTP services. It guides users through setting up the necessary infrastructure, deploying AI models, and running genAI experiments on SAP BTP. The repository includes scripts, examples, and instructions to help users get started with generative AI on the SAP BTP platform.

github

: 83

LeanCopilot

Lean Copilot is a tool that enables the use of large language models (LLMs) in Lean for proof automation. It provides features such as suggesting tactics/premises, searching for proofs, and running inference of LLMs. Users can utilize built-in models from LeanDojo or bring their own models to run locally or on the cloud. The tool supports platforms like Linux, macOS, and Windows WSL, with optional CUDA and cuDNN for GPU acceleration. Advanced users can customize behavior using Tactic APIs and Model APIs. Lean Copilot also allows users to bring their own models through ExternalGenerator or ExternalEncoder. The tool comes with caveats such as occasional crashes and issues with premise selection and proof search. Users can get in touch through GitHub Discussions for questions, bug reports, feature requests, and suggestions. The tool is designed to enhance theorem proving in Lean using LLMs.

github

: 1.0k

LLM-Engineers-Handbook

The LLM Engineer's Handbook is an official repository containing a comprehensive guide on creating an end-to-end LLM-based system using best practices. It covers data collection & generation, LLM training pipeline, a simple RAG system, production-ready AWS deployment, comprehensive monitoring, and testing and evaluation framework. The repository includes detailed instructions on setting up local and cloud dependencies, project structure, installation steps, infrastructure setup, pipelines for data processing, training, and inference, as well as QA, tests, and running the project end-to-end.

github

: 2.8k

web-llm

WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.

github

: 13.1k

vectara-answer

Vectara Answer is a sample app for Vectara-powered Summarized Semantic Search (or question-answering) with advanced configuration options. For examples of what you can build with Vectara Answer, check out Ask News, LegalAid, or any of the other demo applications.

github

: 249

vulnerability-analysis

The NVIDIA AI Blueprint for Vulnerability Analysis for Container Security showcases accelerated analysis on common vulnerabilities and exposures (CVE) at an enterprise scale, reducing mitigation time from days to seconds. It enables security analysts to determine software package vulnerabilities using large language models (LLMs) and retrieval-augmented generation (RAG). The blueprint is designed for security analysts, IT engineers, and AI practitioners in cybersecurity. It requires NVAIE developer license and API keys for vulnerability databases, search engines, and LLM model services. Hardware requirements include L40 GPU for pipeline operation and optional LLM NIM and Embedding NIM. The workflow involves LLM pipeline for CVE impact analysis, utilizing LLM planner, agent, and summarization nodes. The blueprint uses NVIDIA NIM microservices and Morpheus Cybersecurity AI SDK for vulnerability analysis.

github

: 86

mcp-server-qdrant

github

: 386

bolna

Bolna is an open-source platform for building voice-driven conversational applications using large language models (LLMs). It provides a comprehensive set of tools and integrations to handle various aspects of voice-based interactions, including telephony, transcription, LLM-based conversation handling, and text-to-speech synthesis. Bolna simplifies the process of creating voice agents that can perform tasks such as initiating phone calls, transcribing conversations, generating LLM-powered responses, and synthesizing speech. It supports multiple providers for each component, allowing users to customize their setup based on their specific needs. Bolna is designed to be easy to use, with a straightforward local setup process and well-documented APIs. It is also extensible, enabling users to integrate with other telephony providers or add custom functionality.

github

: 205

garak

Garak is a free tool that checks if a Large Language Model (LLM) can be made to fail in a way that is undesirable. It probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses. Garak's a free tool. We love developing it and are always interested in adding functionality to support applications.

github

: 1.3k

unstructured

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

github

: 10.5k

garak

Garak is a vulnerability scanner designed for LLMs (Large Language Models) that checks for various weaknesses such as hallucination, data leakage, prompt injection, misinformation, toxicity generation, and jailbreaks. It combines static, dynamic, and adaptive probes to explore vulnerabilities in LLMs. Garak is a free tool developed for red-teaming and assessment purposes, focusing on making LLMs or dialog systems fail. It supports various LLM models and can be used to assess their security and robustness.

github

: 4.2k

mflux

MFLUX is a line-by-line port of the FLUX implementation in the Huggingface Diffusers library to Apple MLX. It aims to run powerful FLUX models from Black Forest Labs locally on Mac machines. The codebase is minimal and explicit, prioritizing readability over generality and performance. Models are implemented from scratch in MLX, with tokenizers from the Huggingface Transformers library. Dependencies include Numpy and Pillow for image post-processing. Installation can be done using `uv tool` or classic virtual environment setup. Command-line arguments allow for image generation with specified models, prompts, and optional parameters. Quantization options for speed and memory reduction are available. LoRA adapters can be loaded for fine-tuning image generation. Controlnet support provides more control over image generation with reference images. Current limitations include generating images one by one, lack of support for negative prompts, and some LoRA adapters not working.

github

: 1.3k

For similar tasks

WindowsAgentArena

github

: 147

multi-agent-orchestrator

Multi-Agent Orchestrator is a flexible and powerful framework for managing multiple AI agents and handling complex conversations. It intelligently routes queries to the most suitable agent based on context and content, supports dual language implementation in Python and TypeScript, offers flexible agent responses, context management across agents, extensible architecture for customization, universal deployment options, and pre-built agents and classifiers. It is suitable for various applications, from simple chatbots to sophisticated AI systems, accommodating diverse requirements and scaling efficiently.

github

: 4.6k

Upsonic

Upsonic offers a cutting-edge enterprise-ready framework for orchestrating LLM calls, agents, and computer use to complete tasks cost-effectively. It provides reliable systems, scalability, and a task-oriented structure for real-world cases. Key features include production-ready scalability, task-centric design, MCP server support, tool-calling server, computer use integration, and easy addition of custom tools. The framework supports client-server architecture and allows seamless deployment on AWS, GCP, or locally using Docker.

github

: 7.2k

AutoAgent

AutoAgent is a fully-automated and zero-code framework that enables users to create and deploy LLM agents through natural language alone. It is a top performer on the GAIA Benchmark, equipped with a native self-managing vector database, and allows for easy creation of tools, agents, and workflows without any coding. AutoAgent seamlessly integrates with a wide range of LLMs and supports both function-calling and ReAct interaction modes. It is designed to be dynamic, extensible, customized, and lightweight, serving as a personal AI assistant.

github

: 1.9k

agent-starter-pack

The agent-starter-pack is a collection of production-ready Generative AI Agent templates built for Google Cloud. It accelerates development by providing a holistic, production-ready solution, addressing common challenges in building and deploying GenAI agents. The tool offers pre-built agent templates, evaluation tools, production-ready infrastructure, and customization options. It also provides CI/CD automation and data pipeline integration for RAG agents. The starter pack covers all aspects of agent development, from prototyping and evaluation to deployment and monitoring. It is designed to simplify project creation, template selection, and deployment for agent development on Google Cloud.

github

: 304

kagent

Kagent is a Kubernetes native framework for building AI agents, designed to be easy to understand and use. It provides a flexible and powerful way to build, deploy, and manage AI agents in Kubernetes. The framework consists of agents, tools, and model configurations defined as Kubernetes custom resources, making them easy to manage and modify. Kagent is extensible, flexible, observable, declarative, testable, and has core components like a controller, UI, engine, and CLI.

github

: 432

motia

Motia is an AI agent framework designed for software engineers to create, test, and deploy production-ready AI agents quickly. It provides a code-first approach, allowing developers to write agent logic in familiar languages and visualize execution in real-time. With Motia, developers can focus on business logic rather than infrastructure, offering zero infrastructure headaches, multi-language support, composable steps, built-in observability, instant APIs, and full control over AI logic. Ideal for building sophisticated agents and intelligent automations, Motia's event-driven architecture and modular steps enable the creation of GenAI-powered workflows, decision-making systems, and data processing pipelines.

github

: 1.5k

agent-evaluation

Agent Evaluation is a generative AI-powered framework for testing virtual agents. It implements an LLM agent (evaluator) to orchestrate conversations with your own agent (target) and evaluate responses. It supports popular AWS services, allows concurrent multi-turn conversations, defines hooks for additional tasks, and can be used in CI/CD pipelines for faster delivery and stable production environments.

github

: 57

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675