WindowsAgentArena
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
Stars: 147
Windows Agent Arena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, enabling testing of agentic AI workflows across various tasks. WAA supports deploying agents at scale using Azure ML cloud infrastructure, allowing parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes.
README:
Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.
WAA supports the deployment of agents at scale using the Azure ML cloud infrastructure, allowing for the parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes, not days.
- 2024-09-13: We released our paper, code, project page, and blog post. Check it out!
Our technical report paper can be found here. If you find this environment useful, please consider citing our work:
@article{bonatti2024windows,
author = { Bonatti, Rogerio and Zhao, Dan and Bonacci, Francesco and Dupont, Dillon and Abdali, Sara and Li, Yinheng and Wagle, Justin and Koishida, Kazuhito and Bucker, Arthur and Jang, Lawrence and Hui, Zack},
title = {Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale},
institution = {Microsoft},
year = {2024},
month = {September},
}
- Docker daemon installed and running. On Windows, we recommend using Docker with WSL 2.
- An OpenAI or Azure OpenAI API Key.
- Python 3.9. We recommend using Conda and creating a dedicated Python environment for running the scripts. To create a new environment, run:
conda create -n winarena python=3.9
Clone the repository and install dependencies:
git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt
Create a new config.json at the root of the project with the necessary keys (from OpenAI or Azure endpoints):
{
"OPENAI_API_KEY": "<OPENAI_API_KEY>", // if you are using OpenAI endpoint
"AZURE_API_KEY": "<AZURE_API_KEY>", // if you are using Azure endpoint
"AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/" // if you are using Azure endpoint
}
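As a rough illustration of the "OpenAI or Azure" choice above, the following sketch checks whether an already-parsed config dictionary has one complete set of credentials. The key names come from the example; the helper names and the rule that one complete set is sufficient are assumptions for illustration, not part of the WAA codebase.

```python
from typing import Dict, List

# Key names are taken from the config.json example above. Treating a single
# complete set as sufficient is an assumption based on its inline comments.
OPENAI_KEYS = ["OPENAI_API_KEY"]
AZURE_KEYS = ["AZURE_API_KEY", "AZURE_ENDPOINT"]


def _is_set(config: Dict[str, str], key: str) -> bool:
    """A key counts as set if it is non-empty and not a <PLACEHOLDER>."""
    value = config.get(key, "")
    return bool(value) and not (value.startswith("<") and value.endswith(">"))


def missing_credentials(config: Dict[str, str]) -> List[str]:
    """Return the keys still needed; an empty list means one endpoint is complete."""
    if all(_is_set(config, k) for k in OPENAI_KEYS):
        return []
    if all(_is_set(config, k) for k in AZURE_KEYS):
        return []
    # If the Azure set is partially filled in, report what it still lacks;
    # otherwise point at the OpenAI key.
    if any(_is_set(config, k) for k in AZURE_KEYS):
        return [k for k in AZURE_KEYS if not _is_set(config, k)]
    return list(OPENAI_KEYS)


if __name__ == "__main__":
    print(missing_credentials({"AZURE_API_KEY": "abc"}))
```

Note that Python's json module rejects // comments, so the sketch takes a dictionary rather than reading the annotated example verbatim.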
To use the default docker image from Docker Hub:
docker pull windowsarena/winarena:latest
To build your own image from scratch (optional):
cd scripts
./build-container-image.sh
For a list of parameters that can be changed during building of the docker images:
./build-container-image.sh --help
- Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB]
- After downloading, rename the file to setup.iso and copy it to the directory WindowsAgentArena/src/win-arena-container/vm/image
Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. This VM additionally hosts a Python server which receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.
To prepare the golden snapshot, run once:
cd ./scripts
./run-local.sh --prepare-image true
During development, if you want changes made in the src/win-arena-container directory to be included in the WAA golden image, specify the flag --skip-build false (the default is true). This ensures that a new container image is built instead of using the prebuilt windowsarena/winarena:latest image.
You can check the VM install screen by accessing http://localhost:8006 in your browser (unless you have provided an alternate --browser-port parameter). The preparation process is fully automated and will take around 20 minutes.
Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.
At the end, you should expect the Docker container named winarena to terminate gracefully, as shown in the logs below.
You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:
NOTES:
- If you have previously run an installation process and want to do it again from scratch, make sure to delete the content of storage.
- We recommend copying this storage folder to a safe location outside of the repository in case you or the agent accidentally corrupt the VM at some point and you want to avoid a fresh setup.
- Depending on your docker settings, you might have to run the above command with sudo.
- Running on WSL2? If you encounter the error /bin/bash: bad interpreter: No such file or directory, we recommend converting the bash scripts from DOS/Windows format to Unix format:
cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +
The entire setup runs inside a docker container. The entry point for the agent is the src/win-arena-container/run.py script (copied to /client/run.py in the container). The Windows OS runs as a VM process inside the container, and the client and the VM communicate via HTTP GET/POST requests. To run the entire setup at once, run:
cd scripts
./run-local.sh --start-client true
On your host, open your browser and go to http://localhost:8006 to see the Windows VM with the agent running.
For a list of parameters that can be changed:
./run-local.sh --help
At the end of the run you can display the results using the command:
cd src/win-arena-container/client
python show_results.py --result_dir <path_to_results_folder>
The table below compares the combinations of hyperparameters used by the Navi agent in our study. They can be overridden by specifying --som-origin <som_origin> --a11y-backend <a11y_backend> when running the run-local.sh script:
Hyperparameter | Possible Values | Description | Recommended Complementary Value |
---|---|---|---|
som_origin | oss, a11y, mixed-oss | Determines how the Set-of-Mark (SoM) is achieved. | win32 for oss; uia for a11y, mixed-oss |
som_origin | mixed-oss | If set to any "mixed" option, the agent partially relies on the accessibility tree for SoM entities. | uia (more reliable but slower) |
som_origin | oss | Uses webparse, groundingdino, and OCR (TesseractOCR) pipelines. | win32 (faster performance) |
som_origin | a11y | Relies on accessibility tree extraction for SoM. | uia (more reliable but slower) |
a11y_backend | win32, uia | Dictates how the accessibility tree should be extracted. | win32 for oss; uia for a11y and mixed types |
a11y_backend | win32 | Faster but less reliable accessibility tree extraction. | Use with oss or non-"mixed" types. |
a11y_backend | uia | Slower but more reliable accessibility tree extraction. | Use with a11y, mixed-oss |
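The recommended pairings in the table above can be captured in a small lookup. This is only a sketch: the flag names --som-origin and --a11y-backend are the ones mentioned for run-local.sh, while the helper navi_flags is a hypothetical name introduced here for illustration.

```python
from typing import List

# Pairings copied from the "Recommended Complementary Value" column above.
RECOMMENDED_BACKEND = {
    "oss": "win32",       # faster performance
    "a11y": "uia",        # more reliable but slower
    "mixed-oss": "uia",   # more reliable but slower
}


def navi_flags(som_origin: str) -> List[str]:
    """Return run-local.sh flags pairing som_origin with its recommended backend."""
    if som_origin not in RECOMMENDED_BACKEND:
        raise ValueError(f"unknown som_origin: {som_origin!r}")
    return [
        "--som-origin", som_origin,
        "--a11y-backend", RECOMMENDED_BACKEND[som_origin],
    ]


if __name__ == "__main__":
    # Example: assemble the argument list for a local run with the a11y origin.
    print(" ".join(["./run-local.sh"] + navi_flags("a11y")))
```

Other combinations are still possible; the lookup encodes only the recommendation, not a restriction.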
At first sight it might seem challenging to develop/debug code running inside the docker container. However, we provide a few tips to make this process easier. Check the Development-Tips Doc for more details such as:
- How to attach a VSCode window (with debugger) to the running container
- How to change the agent and Windows server code from your local machine and see the changes reflected in real time in the container
We offer a seamless way to run the Windows Agent Arena on Azure ML Compute VMs. This option will significantly reduce the time needed to test your agent in all benchmark tasks from hours/days to minutes.
- If you don't already have an Azure subscription, you can start a free trial. Take note of the subscription id; we will use it as AZURE_SUBSCRIPTION_ID in Section 3.
- In the Azure portal, create a new resource group (e.g. agents) in the region of your choice. Take note of the resource group name; we will use it as AZURE_ML_RESOURCE_GROUP in Section 3.
- Within this resource group, create an Azure Machine Learning resource (e.g. name it agents_ml). Take note of the ML workspace name; we will use it as AZURE_ML_WORKSPACE_NAME in Section 3. During the creation wizard, make sure to check the boxes for automatically creating new:
  - Storage Account. Note: Take note of the Storage Account name; we will use it to upload the golden image in Section 2.
  - Key vault.
  - Application Insights.
  - [optional] Container Registry. You can use the Azure Container Registry to privately store your custom docker images without the need to push them to the public Docker Hub.
- Once creation is complete, navigate to the Azure Machine Learning portal and click on your workspace (agents_ml).
- In the workspace, navigate to the Notebooks tab. In your user-assigned folder (as shown in the figure below), create a new bash (.sh) file named compute-instance-startup.sh. Copy the content of scripts/azure_files/compute-instance-startup.sh into this file and save it. This script will be used every time a new VM is launched in Azure to apply some base configurations. Take note of the path where you save the file (in the form of Users/<YOUR_USER>/compute-instance-startup.sh); we will use it to run the script in Section 3.
- [Optional] You might want to ask for more compute quota for your region depending on your needs. You can do so by navigating to the Azure Quota page. As a reference, we currently use the Standard_D8_v3 VM size for our benchmarking, which falls under the Standard Dv3 Family Cluster Dedicated vCPUs category. Each VM uses 8 cores. Make sure the machine type you use supports nested virtualization.
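To estimate how much quota to request, the per-VM core count quoted above can be multiplied by the number of parallel workers. A minimal sketch follows; the function name required_vcpus is hypothetical, and the 8-core figure applies to the Standard_D8_v3 size only.

```python
CORES_PER_VM = 8  # Standard_D8_v3, as stated above


def required_vcpus(num_workers: int, cores_per_vm: int = CORES_PER_VM) -> int:
    """Dedicated vCPUs needed to run num_workers VMs in parallel."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return num_workers * cores_per_vm


if __name__ == "__main__":
    for workers in (4, 40):
        print(f"{workers} workers -> {required_vcpus(workers)} vCPUs")
```

For example, 4 workers need 32 vCPUs and 40 workers need 320 vCPUs in the corresponding VM family.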
- Upload the Windows 11 storage folder to the Blob container associated with your default datastore. By default, the Azure ML Workspace's underlying data is backed by a Storage Account through one or more ML datastores. The default datastore, named workspaceblobstore, is created during the workspace setup and linked to a Blob container under the Azure Storage Account. You can review the association between the datastores and containers by visiting Azure ML Datastore. Once found, you can then upload the storage folder in different ways:
  - Download the Azure Storage Explorer program, log in, and select the blob container. Upload the WindowsAgentArena/src/win-arena-container/vm/storage folder from your local machine after running the local setup steps.
  - Alternatively, you can use the Azure CLI to upload the folder. To install the CLI, follow the steps provided here. Once installed, you can use the following command:
az login --use-device-code # Only needed if prompted
az storage blob upload-batch --account-name <STORAGE_ACCOUNT_NAME> --destination <CONTAINER_NAME> --source <LOCAL_FOLDER>
# For a list of parameters check: https://docs.microsoft.com/en-us/cli/azure/storage/blob?view=azure-cli-latest
  - Alternatively, use the Azure portal interface to upload the folder. Navigate to the storage account, click on Storage browser->Blob containers, select your container, and upload the folder. This option is not recommended for large files as connections might get unstable.
- [Optional] If you are not using the default windowsarena/winarena:latest image, you can upload your custom image to the Azure Container Registry. You can do so by following the Azure Container Registry documentation:
az login --use-device-code
# potentially needed if commands below don't work: az acr login --name <ACR_NAME>
docker login # you will be prompted to enter your ACR credentials (username + password which can be found in the Azure portal)
docker tag <IMAGE_NAME> <ACR_NAME>.azurecr.io/<IMAGE_NAME>:<TAG>
docker push <ACR_NAME>.azurecr.io/<IMAGE_NAME>:<TAG>
- Add the additional keys to the config.json file at the root of the project:
{
... // Your previous configs
"AZURE_SUBSCRIPTION_ID": "<YOUR_AZURE_SUBSCRIPTION_ID>",
"AZURE_ML_RESOURCE_GROUP": "<YOUR_AZURE_ML_RESOURCE_GROUP>",
"AZURE_ML_WORKSPACE_NAME": "<YOUR_AZURE_ML_WORKSPACE_NAME>"
}
- Create a new file named experiments.json to specify any parameters needed for each experiment run, including the agent to deploy and the underlying LLM model to use. You can find a reference experiments.json consisting of multiple experiments to run at scripts/experiments.json:
{
"experiment_1": {
"ci_startup_script_path": "Users/<YOUR_USER>/compute-instance-startup.sh", // As seen in Section 1
"agent": "navi",
"datastore_input_path": "storage",
"docker_img_name": "windowsarena/winarena:latest",
"exp_name": "experiment_1",
"num_workers": 4,
"use_managed_identity": false,
"json_name": "evaluation_examples_windows/test_all.json",
"model_name": "gpt-4-1106-vision-preview",
"som_origin": "oss", // or a11y, or mixed-oss
"a11y_backend": "win32" // or uia
}
// ...
}
- (Optional) You can also generate the experiments.json by using both the --experiments_json and --update_json parameters of run_azure.py. The above JSON is equivalent to the following command:
cd scripts
python run_azure.py --experiments_json "experiments.json" --update_json --exp_name "experiment_1" --ci_startup_script_path "Users/<YOUR_USER>/compute-instance-startup.sh" --agent "navi" --json_name "evaluation_examples_windows/test_all.json" --num_workers 4 --som_origin oss --a11y_backend win32
- Deploy the agent on Azure ML Compute by running:
az login --use-device-code # https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
# If multiple tenants or subscriptions, make sure to select the right ones with:
# az login --use-device-code --tenant "<YOUR_AZURE_AD_TENANT_ID>"
# az account set --subscription "<YOUR_AZURE_SUBSCRIPTION_ID>"
# Make sure you have installed the python requirements in your conda environment
# conda activate winarena
# pip install -r requirements.txt
# From your activated conda environment:
cd scripts
python run_azure.py --experiments_json "experiments.json"
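If you prefer to build experiments.json programmatically, the sketch below assembles entries with the same fields as the reference example shown earlier and serializes them as plain, comment-free JSON. The helper names make_experiment and dump_experiments are hypothetical, and the defaults simply mirror the reference values; this is not how run_azure.py itself generates the file.

```python
import json
from typing import Any, Dict, List


def make_experiment(exp_name: str, ci_startup_script_path: str, **overrides: Any) -> Dict[str, Any]:
    """Build one experiment entry; defaults mirror the reference example."""
    experiment: Dict[str, Any] = {
        "ci_startup_script_path": ci_startup_script_path,
        "agent": "navi",
        "datastore_input_path": "storage",
        "docker_img_name": "windowsarena/winarena:latest",
        "exp_name": exp_name,
        "num_workers": 4,
        "use_managed_identity": False,
        "json_name": "evaluation_examples_windows/test_all.json",
        "model_name": "gpt-4-1106-vision-preview",
        "som_origin": "oss",
        "a11y_backend": "win32",
    }
    # Reject misspelled field names instead of silently adding them.
    unknown = set(overrides) - set(experiment)
    if unknown:
        raise KeyError(f"unknown experiment fields: {sorted(unknown)}")
    experiment.update(overrides)
    return experiment


def dump_experiments(experiments: List[Dict[str, Any]]) -> str:
    """Serialize the experiments keyed by exp_name, as in the reference file."""
    return json.dumps({e["exp_name"]: e for e in experiments}, indent=2)


if __name__ == "__main__":
    path = "Users/<YOUR_USER>/compute-instance-startup.sh"
    exps = [
        make_experiment("experiment_1", path),
        make_experiment("experiment_2", path, som_origin="a11y", a11y_backend="uia"),
    ]
    print(dump_experiments(exps))
```

Generating the file this way avoids trailing commas and // comments, which strict JSON parsers reject.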
For any unfinished experiments in experiments.json, the script will:
- Create <num_workers> Azure Compute Instance VMs.
- Run one ML Training Job named <exp_name> per VM.
- Dispose of the VMs once the jobs are completed.
The logs from the run will be saved in an agent_outputs folder in the same blob container where you uploaded the Windows 11 image. You can download the agent_outputs folder to your local machine and run the show_azure.py script to see the results from every experiment as a markdown table.
cd scripts
python show_azure.py --json_config "experiments.json" --result_dir <path_to_downloaded_agent_outputs_folder>
Want to test your own agents in Windows Agent Arena? You can use our default agent as a template and create your own folder under src/win-arena-container/client/mm_agents. You just need to make sure that your agent.py file features predict() and reset() functions. For more information on agent development check out the BYOA Doc.
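As a rough sketch of what such an agent.py could look like, the class below exposes the two required methods. Only the method names predict() and reset() come from the text above; the class name, the argument names, the return shape, and the "NOOP" action string are all assumptions made for illustration, so consult the BYOA Doc and the default agent for the actual interface.

```python
from typing import Any, Dict, List, Tuple


class EchoAgent:
    """Hypothetical minimal agent exposing predict() and reset()."""

    def __init__(self) -> None:
        # Instructions seen since the last reset.
        self.history: List[str] = []

    def predict(self, instruction: str, obs: Dict[str, Any]) -> Tuple[str, List[str]]:
        """Return a text response and a list of actions for the current step.

        The (instruction, obs) signature and the return shape are assumed,
        not taken from the WAA codebase.
        """
        self.history.append(instruction)
        response = f"step {len(self.history)}: {instruction}"
        actions = ["NOOP"]  # placeholder action, not a real WAA command
        return response, actions

    def reset(self) -> None:
        """Clear per-task state before the next task starts."""
        self.history.clear()


if __name__ == "__main__":
    agent = EchoAgent()
    print(agent.predict("open notepad", obs={}))
    agent.reset()
```

Keeping all per-task state in attributes that reset() clears makes it harder for information to leak from one benchmark task into the next.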
We welcome contributions to the Windows Agent Arena project. In particular, we welcome:
- New open-sourced agents to be added to the benchmark
- New tasks to be added to our existing categories, or new categories altogether
If you are interested in contributing, please check out our Task Development Guidelines.
Component | Cost | Time |
---|---|---|
Azure Standard_D8_v3 VM | ~$8 ($0.38/h * 40 * 0.5h) | |
GPT-4V | $100 | ~35min with 40 VMs |
GPT-4o | $100 | ~35min with 40 VMs |
GPT-4o-mini | $15 | ~30min with 40 VMs |
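The VM figure in the table above follows from hourly rate × number of VMs × hours. The sketch below reproduces that arithmetic and adds the model cost for a total; the function names are hypothetical, and the prices are the ones quoted in the table, which may change over time.

```python
def vm_cost(hourly_rate: float, num_vms: int, hours: float) -> float:
    """Compute cost of running num_vms VMs for the given number of hours."""
    return hourly_rate * num_vms * hours


def total_cost(model_cost: float, hourly_rate: float, num_vms: int, hours: float) -> float:
    """Model API cost plus VM compute cost for one full benchmark run."""
    return model_cost + vm_cost(hourly_rate, num_vms, hours)


if __name__ == "__main__":
    # 40 Standard_D8_v3 VMs at $0.38/h for half an hour (the table rounds to ~$8).
    print(f"VM cost: ${vm_cost(0.38, 40, 0.5):.2f}")            # prints VM cost: $7.60
    print(f"GPT-4o run: ${total_cost(100, 0.38, 40, 0.5):.2f}")  # prints GPT-4o run: $107.60
```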
- OS World for the original benchmark task framework.
- Dockur for the Docker infrastructure underlying WAA.
- GroundingDINO for the object detection module in our Navi Agent.
- NotebookLM for our AI-generated podcast.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
Similar Open Source Tools
ai-starter-kit
SambaNova AI Starter Kits is a collection of open-source examples and guides designed to facilitate the deployment of AI-driven use cases for developers and enterprises. The kits cover various categories such as Data Ingestion & Preparation, Model Development & Optimization, Intelligent Information Retrieval, and Advanced AI Capabilities. Users can obtain a free API key using SambaNova Cloud or deploy models using SambaStudio. Most examples are written in Python but can be applied to any programming language. The kits provide resources for tasks like text extraction, fine-tuning embeddings, prompt engineering, question-answering, image search, post-call analysis, and more.
torchchat
torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.
telemetry-airflow
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: * The `dags` directory in this repository contains some custom DAG definitions * Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl * The Data SRE team maintains a WTMO Developer Guide (behind SSO)
aiid
The Artificial Intelligence Incident Database (AIID) is a collection of incidents involving the development and use of artificial intelligence (AI). The database is designed to help researchers, policymakers, and the public understand the potential risks and benefits of AI, and to inform the development of policies and practices to mitigate the risks and promote the benefits of AI. The AIID is a collaborative project involving researchers from the University of California, Berkeley, the University of Washington, and the University of Toronto.
code2prompt
code2prompt is a command-line tool that converts your codebase into a single LLM prompt with a source tree, prompt templating, and token counting. It automates generating LLM prompts from codebases of any size, customizing prompt generation with Handlebars templates, respecting .gitignore, filtering and excluding files using glob patterns, displaying token count, including Git diff output, copying prompt to clipboard, saving prompt to an output file, excluding files and folders, adding line numbers to source code blocks, and more. It helps streamline the process of creating LLM prompts for code analysis, generation, and other tasks.
aiexe
aiexe is a cutting-edge command-line interface (CLI) and graphical user interface (GUI) tool that integrates powerful AI capabilities directly into your terminal or desktop. It is designed for developers, tech enthusiasts, and anyone interested in AI-powered automation. aiexe provides an easy-to-use yet robust platform for executing complex tasks with just a few commands. Users can harness the power of various AI models from OpenAI, Anthropic, Ollama, Gemini, and GROQ to boost productivity and enhance decision-making processes.
LeanCopilot
Lean Copilot is a tool that enables the use of large language models (LLMs) in Lean for proof automation. It provides features such as suggesting tactics/premises, searching for proofs, and running inference of LLMs. Users can utilize built-in models from LeanDojo or bring their own models to run locally or on the cloud. The tool supports platforms like Linux, macOS, and Windows WSL, with optional CUDA and cuDNN for GPU acceleration. Advanced users can customize behavior using Tactic APIs and Model APIs. Lean Copilot also allows users to bring their own models through ExternalGenerator or ExternalEncoder. The tool comes with caveats such as occasional crashes and issues with premise selection and proof search. Users can get in touch through GitHub Discussions for questions, bug reports, feature requests, and suggestions. The tool is designed to enhance theorem proving in Lean using LLMs.
desktop
ComfyUI Desktop is a packaged desktop application that allows users to easily use ComfyUI with bundled features like ComfyUI source code, ComfyUI-Manager, and uv. It automatically installs necessary Python dependencies and updates with stable releases. The app comes with Electron, Chromium binaries, and node modules. Users can store ComfyUI files in a specified location and manage model paths. The tool requires Python 3.12+ and Visual Studio with Desktop C++ workload for Windows. It uses nvm to manage node versions and yarn as the package manager. Users can install ComfyUI and dependencies using comfy-cli, download uv, and build/launch the code. Troubleshooting steps include rebuilding modules and installing missing libraries. The tool supports debugging in VSCode and provides utility scripts for cleanup. Crash reports can be sent to help debug issues, but no personal data is included.
web-llm
WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
BentoVLLM
BentoVLLM is an example project demonstrating how to serve and deploy open-source Large Language Models using vLLM, a high-throughput and memory-efficient inference engine. It provides a basis for advanced code customization, such as custom models, inference logic, or vLLM options. The project allows for simple LLM hosting with OpenAI compatible endpoints without the need to write any code. Users can interact with the server using Swagger UI or other methods, and the service can be deployed to BentoCloud for better management and scalability. Additionally, the repository includes integration examples for different LLM models and tools.
vectara-answer
Vectara Answer is a sample app for Vectara-powered Summarized Semantic Search (or question-answering) with advanced configuration options. For examples of what you can build with Vectara Answer, check out Ask News, LegalAid, or any of the other demo applications.
garak
Garak is a free tool that checks if a Large Language Model (LLM) can be made to fail in a way that is undesirable. It probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses. Garak's a free tool. We love developing it and are always interested in adding functionality to support applications.
garak
Garak is a vulnerability scanner designed for LLMs (Large Language Models) that checks for various weaknesses such as hallucination, data leakage, prompt injection, misinformation, toxicity generation, and jailbreaks. It combines static, dynamic, and adaptive probes to explore vulnerabilities in LLMs. Garak is a free tool developed for red-teaming and assessment purposes, focusing on making LLMs or dialog systems fail. It supports various LLM models and can be used to assess their security and robustness.
slack-bot
The Slack Bot is a tool designed to enhance the workflow of development teams by integrating with Jenkins, GitHub, GitLab, and Jira. It allows for custom commands, macros, crons, and project-specific commands to be implemented easily. Users can interact with the bot through Slack messages, execute commands, and monitor job progress. The bot supports features like starting and monitoring Jenkins jobs, tracking pull requests, querying Jira information, creating buttons for interactions, generating images with DALL-E, playing quiz games, checking weather, defining custom commands, and more. Configuration is managed via YAML files, allowing users to set up credentials for external services, define custom commands, schedule cron jobs, and configure VCS systems like Bitbucket for automated branch lookup in Jenkins triggers.
SillyTavern
SillyTavern is a user interface you can install on your computer (and Android phones) that allows you to interact with text generation AIs and chat/roleplay with characters you or the community create. SillyTavern is a fork of TavernAI 1.2.8 which is under more active development and has added many major features. At this point, they can be thought of as completely independent programs.
For similar tasks
multi-agent-orchestrator
Multi-Agent Orchestrator is a flexible and powerful framework for managing multiple AI agents and handling complex conversations. It intelligently routes queries to the most suitable agent based on context and content, supports dual language implementation in Python and TypeScript, offers flexible agent responses, context management across agents, extensible architecture for customization, universal deployment options, and pre-built agents and classifiers. It is suitable for various applications, from simple chatbots to sophisticated AI systems, accommodating diverse requirements and scaling efficiently.
agent-evaluation
Agent Evaluation is a generative AI-powered framework for testing virtual agents. It implements an LLM agent (evaluator) to orchestrate conversations with your own agent (target) and evaluate responses. It supports popular AWS services, allows concurrent multi-turn conversations, defines hooks for additional tasks, and can be used in CI/CD pipelines for faster delivery and stable production environments.
synthora
Synthora is a lightweight and extensible framework for LLM-driven Agents and ALM research. It aims to simplify the process of building, testing, and evaluating agents by providing essential components. The framework allows for easy agent assembly with a single config, reducing the effort required for tuning and sharing agents. Although in early development stages with unstable APIs, Synthora welcomes feedback and contributions to enhance its stability and functionality.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.