
debug-gym
A Text-Based Environment for Interactive Debugging

debug-gym is a text-based interactive debugging framework designed for debugging Python programs. It provides an environment where agents can interact with code repositories, use various tools like pdb and grep to investigate and fix bugs, and propose code patches. The framework supports different LLM backends such as OpenAI, Azure OpenAI, and Anthropic. Users can customize tools, manage environment states, and run agents to debug code effectively. debug-gym is modular, extensible, and suitable for interactive debugging tasks in a text-based environment.
README:
debug-gym is a text-based interactive debugging framework designed for debugging Python programs.
[Technical Report] [Project Page]
The technical report corresponds to version 1.0.0. Please see CHANGELOG.md for recent updates.
It is recommended to create and activate a conda or virtual environment. debug-gym requires Python >= 3.12:

```bash
conda create -n debug-gym python=3.12
conda activate debug-gym
```
Then, install debug-gym directly from PyPI:

```bash
pip install debug-gym
```
Alternatively, clone the repository and install locally:

```bash
git clone https://github.com/microsoft/debug-gym
cd debug-gym
pip install -e .
```
To install development dependencies, run:

```bash
pip install -e '.[dev]'
```
Set your API information in llm.yaml. First, create an LLM config template by running:

```bash
python -m debug_gym.llms.configure
```
[!TIP] Run `python -m debug_gym.llms.configure --help` for more options. By default, the template is created at $HOME/.config/debug_gym/llm.yaml, but you can specify any directory.
Then, edit this file with your endpoint and credentials. You can choose one of these authentication methods:

- For authenticating with an API key, provide `api_key`.
- For `az login` or Managed Identity authentication on Azure, remove `api_key` and include `scope` instead.
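As a rough sketch, an entry using an API key and one using scope-based Azure authentication might look like the following; the model names, endpoint URLs, and surrounding field layout are assumptions, not the exact schema produced by the configure script:

```yaml
# Illustrative llm.yaml sketch -- only api_key and scope come from the text above;
# the other field names and values are assumptions.
gpt-4o:
  endpoint: https://<your-endpoint>.openai.azure.com/    # placeholder endpoint
  api_key: "<your-api-key>"                               # API-key authentication

gpt-4o-entra:
  endpoint: https://<your-endpoint>.openai.azure.com/     # placeholder endpoint
  scope: https://cognitiveservices.azure.com/.default     # az login / Managed Identity; no api_key
```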
[!WARNING] When using open-source LLMs, e.g., via vLLM, you need to correctly set up the HF_TOKEN required by the tokenizer.
By default, debug-gym looks for the LLM config file at $HOME/.config/debug_gym/llm.yaml. You can change this behavior by exporting the environment variable LLM_CONFIG_FILE_PATH or by setting llm_config_file_path in your script config file (see Running Baselines).
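For example, to point debug-gym at a config file in a non-default location (the path below is just a placeholder):

```bash
# LLM_CONFIG_FILE_PATH overrides the default $HOME/.config/debug_gym/llm.yaml lookup.
export LLM_CONFIG_FILE_PATH=/path/to/my/llm.yaml
```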
The structure of debug-gym is as below:

```
debug_gym
├── gym
│   ├── envs
│   ├── terminal
│   └── tools
├── agents
└── llms
```
debug_gym.gym is a simulation environment. Given a code repository, an agent can iteratively interact with a set of tools, such as pdb, that are designed for investigating the code. Once it has gathered enough information, the agent can propose a patch that rewrites certain lines of the code. The terminal will subsequently execute the new code against a set of test cases.
debug_gym.agents are LLM-based debugging agents that use debug_gym.gym to interact with code repositories, gather the necessary information, and fix potential bugs. At each interaction step, the agent receives a text observation describing the environment and tool states and is expected to generate a command; the environment then returns a new text observation describing the state change caused by that command.
debug_gym.llms are the different LLM backends that can be used to instantiate agents. Currently, we support OpenAI, Azure OpenAI, and Anthropic.
[!WARNING] debug-gym has limited support on non-Linux platforms. Interactive terminal sessions using PTY (pseudo-terminal) in Docker are not fully supported on macOS or Windows. As a result, the pdb tool (see 2.1. Environment and Tools) only works on Linux.
Our base environment, RepoEnv, is an interactive environment that follows the Gymnasium paradigm. Once the environment env is instantiated, one can call env.reset() to start an episode and receive initial information. Then, one can interact with the environment using env.step(action), where action specifies one of the available tools (see below); doing so returns subsequent information (e.g., error messages, debugger stdout, etc.).
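A minimal sketch of this reset/step loop is shown below. The import path, constructor arguments, and the shape of the returned info objects are assumptions for illustration; consult the debug_gym.gym API for the real interface.

```python
# Minimal sketch of the Gymnasium-style reset/step loop described above.
# Import path and constructor arguments are assumptions, not the exact API.
from debug_gym.gym.envs import RepoEnv  # assumed import path

env = RepoEnv(path="path/to/buggy/repo")   # hypothetical constructor argument
info = env.reset()                          # start an episode, get the initial observation
print(info)                                 # e.g., instructions, directory tree, eval output

action = ...                                # a tool call chosen by an agent (or a human)
info = env.step(action)                     # new observation describing the state change
```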
One of the core designs of debug-gym is the notion of tools. Users can dynamically import tools, or develop customized tools and utilize them in the environment. Tools are modules that augment an agent's action space and observation space, or provide additional functionalities to the agent. Below is the set of tools we have implemented so far.
| Tool name | Description |
|---|---|
| listdir | Returns the directory tree at a given subdirectory. This is particularly useful when dealing with a repository with multiple files. |
| view | Changes an agent's focus to a particular source code file. This is particularly useful when dealing with a repository with multiple files. |
| eval | Runs the current code repository using the provided entrypoint (e.g., pytest), and returns the terminal's output (e.g., error messages). |
| pdb | Interactive debugger wrapping the Python pdb tool. In addition, users can choose to maintain a set of persistent breakpoints (as in some programming IDEs), which are not reset after every eval. With this feature, a new pdb debugging session is activated automatically, with all the breakpoints restored. Note that such breakpoints can be cleared by pdb commands such as cl. |
| grep | Searches for patterns in files within the repository. Supports both literal string matching and regular expressions. Can search in specific files, directories, or the entire repository. Useful for finding code patterns, function definitions, variable usage, or identifying files containing specific text. |
| rewrite | Rewrites a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
Upon importing a tool, its action space and observation space will be automatically merged into debug-gym's action space and observation space; its instruction will also be merged into the overall instruction provided to the agent (e.g., as system prompt).
Users can include a .debugignore file in the repository to specify files and directories that are not visible to debug-gym. Similarly, they can include a .debugreadonly file to specify files and directories that are read-only for the agents (e.g., the test files). Both files share the same syntax as .gitignore.
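For instance, a repository might ship with files like the following; the patterns are purely illustrative:

```
# .debugignore -- hidden from debug-gym (illustrative patterns)
.git/
__pycache__/
*.log
```

```
# .debugreadonly -- visible to agents but not editable (illustrative patterns)
tests/
conftest.py
```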
We provide the LLM-based agents below; they all have a minimal design and serve the purpose of demonstrating the debug-gym APIs.
| Agent name | Available Tools | Description |
|---|---|---|
| debug_agent | pdb, rewrite, view, eval | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
| rewrite_agent | rewrite, view, eval | A debug_agent with the pdb tool disabled (the agent keeps rewriting). |
| debug_5_agent | pdb, rewrite, view, eval | A debug_agent, but the pdb tool is only enabled after a certain number of rewrites. |
| grep_agent | grep, rewrite, view, eval | A variant of rewrite_agent that includes the grep tool for searching patterns in the codebase before making changes. |
| solution_agent | pdb, eval | An oracle agent that applies a gold patch (only works with the swebench and swesmith benchmarks for now). The agent checks that tests are failing before applying the patch, and passing after. It also checks that the pdb tool can be used as expected. |
To demonstrate how to integrate debug-gym with coding tasks and repositories, we provide example code importing two widely used benchmarks, namely aider and swebench, and a small set of minimal buggy code snippets, namely mini_nightmare.
| Benchmark name | Link |
|---|---|
| aider | https://github.com/Aider-AI/aider |
| swebench | https://github.com/princeton-nlp/SWE-bench |
| swesmith | https://github.com/SWE-bench/SWE-smith |
| mini_nightmare | A set of 10 hand-crafted minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details here. |
We use .yaml files to specify configurations. Example config files can be found in scripts/. To run an agent:

```bash
python scripts/run.py scripts/config_<benchmark name>.yaml --agent <agent name>
```
Add -v to be verbose, or --debug to enter debug mode.
[!WARNING] When using --debug, you will need to press c to continue after each reasoning step.
We can use the solution_agent to validate that your swebench and swesmith instances work as expected. This agent will apply a gold patch to the buggy code and check that the tests are failing before applying the patch, and passing after. It also checks that the pdb tool can be used as expected.

```bash
python scripts/run.py scripts/config_swebench.yaml --agent solution_agent
python scripts/run.py scripts/config_swesmith.yaml --agent solution_agent
```
We provide a human mode that enables developers to manually interact with debug-gym. To activate this mode, change the llm_name field in the config_*.yaml to "human". Once activated, at every step the environment will expect a command input (in tool-calling format). One can use the Tab key to get a list of tool-calling templates and fill in any necessary arguments.
The -p flag is a handy way to override values defined in the config file. For example, the command below will run the debug_agent agent on Aider in human mode (even if the config file specifies gpt-4o). The command also overrides the default system prompt (see below for more information).

```bash
python scripts/run.py scripts/config_aider.yaml \
    --agent debug_agent \
    -v \
    -p debug_agent.llm_name="human" \
    -p debug_agent.system_prompt_template_file="scripts/templates/human_friendly_system_prompt.jinja"
```
debug-gym allows you to fully customize the system prompt by providing a Jinja template file. This enables you to control the format and content of the prompt sent to the LLM, making it easier to adapt the environment to your specific needs or research experiments.
To use a custom system prompt template, specify the path to your Jinja template file in your agent's configuration under system_prompt_template_file. For example:

```yaml
debug_agent:
  system_prompt_template_file: scripts/templates/custom_system_prompt.jinja
```

Alternatively, you can provide a custom template from the command line with -p <agent>.system_prompt_template_file="<path/to/template.jinja>" (see above).
Within your Jinja template, you have access to the agent and info objects, which provide all relevant context about the current environment and agent state.
In addition to all built-in Jinja filters, two custom filters are available for use in your template:

- to_pretty_json: Converts a Python object to a pretty-printed JSON string. Useful for displaying structured data in a readable format.

  {{ info.tools | to_pretty_json }}

- trim_message: Trims a string to fit within a token or character limit, also filtering out non-UTF8 characters. This is helpful for ensuring that large outputs (such as directory trees or evaluation results) do not exceed the LLM's context window. The trim_message filter accepts the following arguments to control how messages are trimmed:

  - max_length: The maximum number of tokens to keep in the message. If the message exceeds this length, it will be trimmed.
  - max_length_percentage: Instead of specifying an absolute number, you can provide a percentage (e.g., 0.1 for 10%) of the LLM's context window. The message will be trimmed to fit within this percentage of the model's maximum context length.
  - where: Specifies where to trim the message if it exceeds the limit. The default is "middle", which trims from the middle of the message. Other options are start or end.

  {{ info.dir_tree | trim_message(max_length_percentage=0.1, where="end") }}
For example, a complete template might look like this:

```jinja
System Prompt for Debug-Gym

Task: {{ agent.system_prompt }}

Instructions:
{{ info.instructions }}

Directory Tree:
{{ info.dir_tree | trim_message(max_length=1000) }}

Current Breakpoints:
{{ info.current_breakpoints | to_pretty_json }}

{% if agent.shortcut_features() %}
Shortcut Features:
{{ agent.shortcut_features() | to_pretty_json }}
{% endif %}
```
Modify scripts/config.yaml, especially the env_kwargs, to set the path and entrypoint of the custom repository. We assume there are a .debugignore file and a .debugreadonly file within the repository that label files/folders that are not visible or not editable, respectively. As an example, we provide a buggy pytorch code repository in data/pytorch.
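A minimal sketch of the relevant part of scripts/config.yaml is shown below; only base.env_kwargs, path, and entrypoint appear in the text above, while the other keys and values are assumptions:

```yaml
# Illustrative sketch -- only base.env_kwargs, path, and entrypoint come from
# the text above; the surrounding structure and values are assumptions.
base:
  env_kwargs:
    path: data/pytorch              # path to the custom (buggy) repository
    entrypoint: python -m pytest    # command the eval tool runs against the tests
```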
Then, run the agent on the custom repository:

```bash
python scripts/run.py scripts/config.yaml --agent <agent name>
```
SWE-Smith allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) that has a similar structure to SWE-bench/SWE-smith, one can override -p base.env_kwargs.dataset_id=<dataset_id> on the command line to run the agent on that dataset. For example, to run on a local dataset:

```bash
python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
```
debug-gym's modular design makes it extensible. Users are encouraged to extend debug-gym to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new tools that are debug-gym-compatible, please refer to the Technical Report.
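Purely to sketch the idea, a tiny custom tool could look something like the following; the class attributes, method signature, and environment attribute below are assumptions for illustration, not the actual debug-gym tool interface (the Technical Report documents the real one):

```python
# Hypothetical sketch of a custom tool -- names and signatures are assumptions,
# not the real debug-gym API; see the Technical Report for the actual interface.
class WordCountTool:
    """Adds a 'wordcount' action that reports the size of a file."""

    name = "wordcount"
    instructions = "wordcount(path): report the number of lines and words in a file."

    def use(self, environment, path: str) -> str:
        # Resolve the file inside the repository exposed by the environment
        # (the working_dir attribute is an assumption) and return a text observation.
        text = (environment.working_dir / path).read_text()
        return f"{path}: {len(text.splitlines())} lines, {len(text.split())} words"
```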
We provide a set of scripts to help analyze the log files (e.g., the .jsonl files) generated by the agent.

- In the analysis folder, we provide the scripts used to generate the corresponding figures in our technical report.
- In the analysis/json_log_viewer folder, we provide a Flask app to view a .jsonl log file in the browser.
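If you prefer to inspect a log programmatically, each line of a .jsonl file is an independent JSON object, so it can be skimmed with a few lines of Python (the file path below is a placeholder):

```python
import json

# Each line of a .jsonl log is an independent JSON object.
with open("path/to/agent_log.jsonl") as f:   # placeholder path
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))         # inspect which fields the log contains
```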
```bibtex
@article{yuan2025debuggym,
  title={debug-gym: A Text-Based Environment for Interactive Debugging},
  author={Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, Marc-Alexandre C\^ot\'e},
  journal={arXiv preprint arXiv:2503.21557},
  year={2025},
  url={https://arxiv.org/abs/2503.21557}
}
```
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
This framework does not collect users' personal data. For more information about Microsoft's privacy policies, please see the Microsoft Privacy Statement.
Please see our Responsible AI Statement.
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customer’s subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.