
debug-gym
A Text-Based Environment for Interactive Debugging

debug-gym is a text-based interactive debugging framework designed for debugging Python programs. It provides an environment where agents can interact with code repositories, use various tools like pdb and grep to investigate and fix bugs, and propose code patches. The framework supports different LLM backends such as OpenAI, Azure OpenAI, and Anthropic. Users can customize tools, manage environment states, and run agents to debug code effectively. debug-gym is modular, extensible, and suitable for interactive debugging tasks in a text-based environment.
README:
debug-gym is a text-based interactive debugging framework designed for debugging Python programs.
[Technical Report] [Project Page]
The technical report corresponds to version 1.0.0. Please see CHANGELOG.md for recent updates.
It is recommended to create and activate a conda or virtual environment. debug-gym requires Python >= 3.12:

```bash
conda create -n debug-gym python=3.12
conda activate debug-gym
```
Then, install debug-gym directly from PyPI:

```bash
pip install debug-gym
```
Alternatively, clone the repository and install locally:

```bash
git clone https://github.com/microsoft/debug-gym
cd debug-gym
pip install -e .
```
To install development dependencies, run:

```bash
pip install -e '.[dev]'
```
Set your API information in llm.yaml. First, create an LLM config template by running:

```bash
python -m debug_gym.llms.configure
```
[!TIP] Run `python -m debug_gym.llms.configure --help` for more options. By default, the template is created at $HOME/.config/debug_gym/llm.yaml, but you can specify any directory.
Then, edit this file with your endpoint and credentials. You can choose one of these authentication methods:

- For authenticating with an API key, provide `api_key`.
- For `az login` or Managed Identity authentication on Azure, remove `api_key` and include `scope` instead.
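As a rough sketch, an entry using an API key and one using scope-based Azure authentication might look like the following; the model names, endpoint URLs, and surrounding field layout are assumptions, not the exact schema produced by the configure script:

```yaml
# Illustrative llm.yaml sketch -- only api_key and scope come from the text above;
# the other field names and values are assumptions.
gpt-4o:
  endpoint: https://<your-endpoint>.openai.azure.com/    # placeholder endpoint
  api_key: "<your-api-key>"                               # API-key authentication

gpt-4o-entra:
  endpoint: https://<your-endpoint>.openai.azure.com/     # placeholder endpoint
  scope: https://cognitiveservices.azure.com/.default     # az login / Managed Identity; no api_key
```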
[!WARNING] When using open-source LLMs, e.g., via vLLM, you need to correctly set up the HF_TOKEN required by the tokenizer.
By default, debug-gym looks for the LLM config file at $HOME/.config/debug_gym/llm.yaml. You can change this behavior by exporting the environment variable LLM_CONFIG_FILE_PATH or by setting llm_config_file_path in your script config file (see Running Baselines).
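For example, to point debug-gym at a config file in a non-default location (the path below is just a placeholder):

```bash
# LLM_CONFIG_FILE_PATH overrides the default $HOME/.config/debug_gym/llm.yaml lookup.
export LLM_CONFIG_FILE_PATH=/path/to/my/llm.yaml
```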
The structure of debug-gym is as below:

```
debug_gym
├── gym
│   ├── envs
│   ├── terminal
│   └── tools
├── agents
└── llms
```
debug_gym.gym is a simulation environment. Given a code repository, an agent can iteratively interact with a set of tools, such as pdb, that are designed for investigating the code. Once it has gathered enough information, the agent can propose a patch that rewrites certain lines of the code. The terminal will subsequently execute the new code against a set of test cases.
debug_gym.agents are LLM-based debugging agents that use debug_gym.gym to interact with code repositories, gather the necessary information, and fix potential bugs. At each interaction step, the agent receives a text observation describing the environment and tool states and is expected to generate a command; the environment then returns a new text observation describing the state change caused by that command.
debug_gym.llms are the different LLM backends that can be used to instantiate agents. Currently, we support OpenAI, Azure OpenAI, and Anthropic.
[!WARNING] debug-gym has limited support on non-Linux platforms. Interactive terminal sessions using PTY (pseudo-terminal) in Docker are not fully supported on macOS or Windows. As a result, the pdb tool (see 2.1. Environment and Tools) only works on Linux.
Our base environment, RepoEnv, is an interactive environment that follows the Gymnasium paradigm. Once the environment env is instantiated, one can call env.reset() to start an episode and receive initial information. Then, one can interact with the environment using env.step(action), where action specifies one of the available tools (see below); doing so returns subsequent information (e.g., error messages, debugger stdout, etc.).
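A minimal sketch of this reset/step loop is shown below. The import path, constructor arguments, and the shape of the returned info objects are assumptions for illustration; consult the debug_gym.gym API for the real interface.

```python
# Minimal sketch of the Gymnasium-style reset/step loop described above.
# Import path and constructor arguments are assumptions, not the exact API.
from debug_gym.gym.envs import RepoEnv  # assumed import path

env = RepoEnv(path="path/to/buggy/repo")   # hypothetical constructor argument
info = env.reset()                          # start an episode, get the initial observation
print(info)                                 # e.g., instructions, directory tree, eval output

action = ...                                # a tool call chosen by an agent (or a human)
info = env.step(action)                     # new observation describing the state change
```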
One of the core designs of debug-gym is the notion of tools. Users can dynamically import tools, or develop customized tools and utilize them in the environment. Tools are modules that augment an agent's action space and observation space, or provide additional functionalities to the agent. Below is the set of tools we have implemented so far.
| Tool name | Description |
|---|---|
| listdir | Returns the directory tree at a given subdirectory. This is particularly useful when dealing with a repository with multiple files. |
| view | Changes an agent's focus to a particular source code file. This is particularly useful when dealing with a repository with multiple files. |
| eval | Runs the current code repository using the provided entrypoint (e.g., pytest), and returns the terminal's output (e.g., error messages). |
| pdb | Interactive debugger wrapping the Python pdb tool. In addition, users can choose to maintain a set of persistent breakpoints (as in some programming IDEs), which are not reset after every eval. With this feature, a new pdb debugging session is activated automatically, with all the breakpoints restored. Note that such breakpoints can be cleared by pdb commands such as cl. |
| grep | Searches for patterns in files within the repository. Supports both literal string matching and regular expressions. Can search in specific files, directories, or the entire repository. Useful for finding code patterns, function definitions, variable usage, or identifying files containing specific text. |
| rewrite | Rewrites a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
Upon importing a tool, its action space and observation space will be automatically merged into debug-gym's action space and observation space; its instruction will also be merged into the overall instruction provided to the agent (e.g., as system prompt).
Users can include a .debugignore file in the repository to specify files and directories that are not visible to debug-gym. Similarly, they can include a .debugreadonly file to specify files and directories that are read-only for the agents (e.g., the test files). Both files share the same syntax as .gitignore.
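For instance, a repository might ship with files like the following; the patterns are purely illustrative:

```
# .debugignore -- hidden from debug-gym (illustrative patterns)
.git/
__pycache__/
*.log
```

```
# .debugreadonly -- visible to agents but not editable (illustrative patterns)
tests/
conftest.py
```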
We provide the LLM-based agents below; they all have a minimal design and serve the purpose of demonstrating the debug-gym APIs.
| Agent name | Available Tools | Description |
|---|---|---|
| debug_agent | pdb, rewrite, view, eval | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
| rewrite_agent | rewrite, view, eval | A debug_agent with the pdb tool disabled (the agent keeps rewriting). |
| debug_5_agent | pdb, rewrite, view, eval | A debug_agent, but the pdb tool is only enabled after a certain number of rewrites. |
| grep_agent | grep, rewrite, view, eval | A variant of rewrite_agent that includes the grep tool for searching patterns in the codebase before making changes. |
| solution_agent | pdb, eval | An oracle agent that applies a gold patch (only works with the swebench and swesmith benchmarks for now). The agent checks that tests are failing before applying the patch, and passing after. It also checks that the pdb tool can be used as expected. |
To demonstrate how to integrate debug-gym with coding tasks and repositories, we provide example code importing two widely used benchmarks, namely aider and swebench, and a small set of minimal buggy code snippets, namely mini_nightmare.
| Benchmark name | Link |
|---|---|
| aider | https://github.com/Aider-AI/aider |
| swebench | https://github.com/princeton-nlp/SWE-bench |
| swesmith | https://github.com/SWE-bench/SWE-smith |
| mini_nightmare | A set of 10 hand-crafted minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details here. |
We use .yaml files to specify configurations. Example config files can be found in scripts/. To run an agent:

```bash
python scripts/run.py scripts/config_<benchmark name>.yaml --agent <agent name>
```
Add -v to be verbose, or --debug to enter debug mode.
[!WARNING] When using --debug, you will need to press c to continue after each reasoning step.
We can use the solution_agent to validate that your swebench and swesmith instances work as expected. This agent will apply a gold patch to the buggy code and check that the tests are failing before applying the patch, and passing after. It also checks that the pdb tool can be used as expected.

```bash
python scripts/run.py scripts/config_swebench.yaml --agent solution_agent
python scripts/run.py scripts/config_swesmith.yaml --agent solution_agent
```
We provide a human mode that enables developers to manually interact with debug-gym. To activate this mode, change the llm_name field in the config_*.yaml to "human". Once activated, at every step the environment will expect a command input (in tool-calling format). One can use the Tab key to get a list of tool-calling templates and fill in any necessary arguments.
The -p flag is a handy way to override values defined in the config file. For example, the command below will run the debug_agent agent on Aider in human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).

```bash
python scripts/run.py scripts/config_aider.yaml \
    --agent debug_agent \
    -v \
    -p debug_agent.llm_name="human" \
    -p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"
```
debug-gym allows you to fully customize the system prompt by providing a Jinja template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.
To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under system_prompt_template_file. For example:

```yaml
debug_agent:
  system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
```

Alternatively, you can provide a custom template from the command line with -p <agent>.system_prompt_template_file="<path/to/template.jinja>" (see above).
Within your Jinja template, you have access to the agent and info objects, which provide all relevant context about the current environment and agent state.
In addition to all built-in Jinja filters, two custom filters are available for use in your template:

- to_pretty_json: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.

  {{ info.tools | to_pretty_json }}

- trim_message: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The trim_message filter accepts the following arguments to control how messages are trimmed:

  - max_length: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
  - max_length_percentage: Instead of specifying an absolute number, you can provide a percentage (e.g., 0.1 for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
  - where: Specifies where to trim the message if it exceeds the limit. The default is "middle", which trims from the middle of the message. Other options are start or end.

  {{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
For example, a complete template might look like this:

```jinja
System Prompt for Debug-Gym

Task: {{ agent.system_prompt }}

Instructions:
{{ info.instructions }}

Directory Tree:
{{ info.dir_tree | trim_message(max_length=1000) }}

Current Breakpoints:
{{ info.current_breakpoints | to_pretty_json }}

{% if agent.shortcut_features() %}
Shortcut Features:
{{ agent.shortcut_features() | to_pretty_json }}
{% endif %}
```
Modify scripts/config.yaml, especially the env_kwargs, to set the path and entrypoint of the custom repository. We assume there are a .debugignore file and a .debugreadonly file within the repository that label files/folders that are not visible or not editable, respectively. As an example, we provide a buggy pytorch code repository in data/pytorch.
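A minimal sketch of the relevant part of scripts/config.yaml is shown below; only base.env_kwargs, path, and entrypoint appear in the text above, while the other keys and values are assumptions:

```yaml
# Illustrative sketch -- only base.env_kwargs, path, and entrypoint come from
# the text above; the surrounding structure and values are assumptions.
base:
  env_kwargs:
    path: data/pytorch              # path to the custom (buggy) repository
    entrypoint: python -m pytest    # command the eval tool runs against the tests
```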
Then, run the agent on the custom repository:

```bash
python scripts/run.py scripts/config.yaml --agent <agent name>
```
SWE-Smith allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) that has a similar structure to SWE-bench/SWE-smith, one can override -p base.env_kwargs.dataset_id=<dataset_id> on the command line to run the agent on that dataset. For example, to run on a local dataset:

```bash
python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
```
debug-gym's modular design makes it extensible. Users are encouraged to extend debug-gym to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are debug-gym-compatible, please refer to the Technical Report.
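Purely to sketch the idea, a tiny custom tool could look something like the following; the class attributes, method signature, and environment attribute below are assumptions for illustration, not the actual debug-gym tool interface (the Technical Report documents the real one):

```python
# Hypothetical sketch of a custom tool -- names and signatures are assumptions,
# not the real debug-gym API; see the Technical Report for the actual interface.
class WordCountTool:
    """Adds a 'wordcount' action that reports the size of a file."""

    name = "wordcount"
    instructions = "wordcount(path): report the number of lines and words in a file."

    def use(self, environment, path: str) -> str:
        # Resolve the file inside the repository exposed by the environment
        # (the working_dir attribute is an assumption) and return a text observation.
        text = (environment.working_dir / path).read_text()
        return f"{path}: {len(text.splitlines())} lines, {len(text.split())} words"
```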
We provide a set of scripts to help analyze the log files (e.g., the .jsonl files) generated by the agent.

- In the analysis folder, we provide the scripts used to generate the corresponding figures in our technical report.
- In the analysis/json_log_viewer folder, we provide a Flask app to view a .jsonl log file in the browser.
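If you prefer to inspect a log programmatically, each line of a .jsonl file is an independent JSON object, so it can be skimmed with a few lines of Python (the file path below is a placeholder):

```python
import json

# Each line of a .jsonl log is an independent JSON object.
with open("path/to/agent_log.jsonl") as f:   # placeholder path
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))         # inspect which fields the log contains
```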
```bibtex
@article{yuan2025debuggym,
  title={debug-gym: A Text-Based Environment for Interactive Debugging},
  author={Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, Marc-Alexandre C\^ot\'e},
  journal={arXiv preprint arXiv:2503.21557},
  year={2025},
  url={https://arxiv.org/abs/2503.21557}
}
```
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
This framework does not collect users' personal data. For more information about Microsoft's privacy policies, please see the Microsoft Privacy Statement.
Please see our Responsible AI Statement.
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customer’s subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.