![Avalon-LLM](/statics/github-mark.png)
Avalon-LLM
This repository contains an LLM benchmark for the social deduction game 'The Resistance: Avalon'
Stars: 69
![screenshot](/screenshots_githubs/jonathanmli-Avalon-LLM.jpg)
Avalon-LLM is a repository containing the official code for AvalonBench and the Avalon agent Strategist. AvalonBench evaluates Large Language Models (LLMs) playing The Resistance: Avalon, a board game requiring deductive reasoning, coordination, collaboration, and deception skills. Strategist utilizes LLMs to learn strategic skills through self-improvement, including high-level strategic evaluation and low-level execution guidance. The repository provides instructions for running AvalonBench, setting up Strategist, and conducting experiments with different agents in the game environment.
README:
This is the official code of AvalonBench and the Avalon agent Strategist. The corresponding papers are AvalonBench: Evaluating LLMs Playing the Game of Avalon and Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search.
Based on AgentBench, we support multi-agent play of The Resistance: Avalon, a popular board game that requires deductive reasoning, coordination and collaboration, and the skill of deception.
Read the instructions below for how to run AvalonBench!
In this work, we propose Strategist, which utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection, which can then be used to learn high-level strategic skills such as how to evaluate states that guide the low-level execution.
You can learn how to play with Strategist on AvalonBench here, and the code and usage for Strategist's bi-level tree search can be found in the `strategist` folder.
- [2024/08] 🔥Try out our new agent, Strategist, by using the `avalon-dev-single-discuss` config, and find more details at Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search!
- [2024/07] Our new agent `SearchlightLLMAgentWithDiscussion` is available at `src/server/tasks/avalon/agents/search_agent.py`. The academic paper will be coming soon.
- [2023/11] 🎶Multi-LLM setting with AgentBench v0.2 is ready to roll! Details of the multi-agent submodule can be found here.
- [2023/11] ♠️ We've added a new game called GOPS (Game of Pure Strategy [Wiki]). For more details of the code, please refer to here.
- [2023/10] 🤖We've updated our code based on AgentBench v0.2. For the older version, please visit here.
GPT-3.5-turbo🤖 playing against rule-based bots in AvalonBench
https://github.com/jonathanmli/Avalon-LLM/assets/24936331/e15eadc0-60e6-448d-88a0-854ba35d628c
GPT-4-turbo🤖 playing against rule-based bots in AvalonBench
https://github.com/jonathanmli/Avalon-LLM/assets/24936331/23fcb204-7570-4449-8777-b179c25251ad
GPT-3.5-turbo agents🤖 playing against each other
https://github.com/jonathanmli/Avalon-LLM/assets/24936331/9257d081-67ff-43d4-bbcf-b20415b32595
Here are the results of LLMs playing against baseline bots.
We also let LLMs play against each other. Evil holds an 8:2 advantage over Good, which is similar to the stats of rookie human players! Here are also some examples of discussion under this setting.
Install the dependencies.
conda create -n avalonbench python=3.9
conda activate avalonbench
pip install -r requirements.txt
You need to fill in your OpenAI API key in `configs/agents/openai-chat` first: replace `<OPENAI_API_KEY>` in `Bearer <OPENAI_API_KEY>` with your key. A small helper sketch for this substitution follows.
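If you prefer not to edit the config by hand, a minimal sketch along these lines could do the substitution, assuming your key is exported in the `OPENAI_API_KEY` environment variable (this helper is not part of the repo, and the path is taken as written above; adjust it if your setup differs):

```python
import os
from pathlib import Path

# Hypothetical helper (not part of the repo): inject the API key from the
# environment into the agent config instead of editing the file manually.
config_path = Path("configs/agents/openai-chat")  # path as referenced in the README
api_key = os.environ["OPENAI_API_KEY"]            # export this before running

text = config_path.read_text()
config_path.write_text(text.replace("<OPENAI_API_KEY>", api_key))
print(f"Wrote API key into {config_path}")
```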
Start the game (3 is the number of workers)
python -m src.start_task -a --start avalon-dev-single 3
Open a new terminal and start the assigner
python -m src.assigner --config ./configs/assignments/test_avalon.yaml
- You can modify the file `configs/tasks/avalon.yaml` to configure the agent list. A config file looks like this:
default:
module: "src.server.tasks.avalon.AvalonBench"
parameters:
num_players: 5
discussion: False
avalon-dev-naive:
parameters:
name: "AvalonBench-dev-naive"
data_file: "data/avalon/dev.json"
agent_list: ["naive", "naive", "naive", "naive", "naive"]
avalon-dev-single:
parameters:
name: "AvalonBench-dev-single"
data_file: "data/avalon/dev.json"
agent_list: ["llm", "naive", "naive", "naive", "naive"]
where `naive` stands for the naive bots. Agents will play the roles with the same index in the data file (see below). Note: there should only be one `"llm"` in the `agent_list`; a validation sketch follows this list.
- You can also add data in `data/avalon/dev.json` (note: currently we only support the 5-player game setting, which includes 1 Merlin, 2 Servants, 1 Minion, and 1 Assassin). A data item looks like this:
{
"num_players": 5,
"quest_leader": 0,
"role_names": ["Assassin", "Servant", "Servant", "Merlin", "Minion"]
}
where `quest_leader` is the id of the initial quest leader in this game. You can change the game setup by altering `quest_leader` to any number from 0 to 4 and by permuting `role_names`.
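As a quick sanity check before launching a task, a minimal sketch like the one below could validate a data item and an agent list against the constraints described above (these helpers are illustrative and not part of the repo):

```python
from collections import Counter

# Hypothetical validation helpers (not part of the repo); they only encode the
# constraints stated above for the 5-player setting.
EXPECTED_ROLES = Counter({"Merlin": 1, "Servant": 2, "Minion": 1, "Assassin": 1})

def check_data_item(item: dict) -> None:
    assert item["num_players"] == 5, "only the 5-player setting is supported"
    assert 0 <= item["quest_leader"] <= 4, "quest_leader must be between 0 and 4"
    assert Counter(item["role_names"]) == EXPECTED_ROLES, (
        "role_names must be a permutation of 1 Merlin, 2 Servants, 1 Minion, 1 Assassin"
    )

def check_agent_list(agent_list: list) -> None:
    assert len(agent_list) == 5, "one agent per player"
    assert agent_list.count("llm") <= 1, 'there should only be one "llm" in agent_list'

check_data_item({
    "num_players": 5,
    "quest_leader": 0,
    "role_names": ["Assassin", "Servant", "Servant", "Merlin", "Minion"],
})
check_agent_list(["llm", "naive", "naive", "naive", "naive"])
print("configuration looks consistent")
```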
You can also start a naive experiment using:
python -m src.start_task -a --start avalon-dev-naive 3
where all the agents are naive bots. For details of the naive strategies, please refer to the paper.
You can also start a Multi-LLM experiment using:
python -m src.start_task -a --start avalon-dev-multi 3
where all the agents will be Large Language Models.
Our agent, Strategist, is also available in this repo. You can start the experiment using:
# Strategist playing against naive baselines
python -m src.start_task -a --start avalon-dev-single-search 1
All the prompts are maintained in `src/server/tasks/avalon/prompt.py`. You can find the respective prompts used in `src/server/tasks/avalon/agents/llm_with_discussion.py` and `src/server/tasks/avalon/wrapper.py`.
We also provide our engines, along with examples of usage for developers, in `avalonbench_dev`.
You can import and use the game engine by running
from engine import AvalonGameEnvironment, AvalonConfig
First input your game configurations into `AvalonBasicConfig`, then create an `AvalonGameEnvironment` based on it.
For an example of how to use the game engine, see `avalonbench_dev/avalon/test_engine.py`.
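A minimal sketch of that flow might look like the following; the constructor arguments here are assumptions for illustration only, so consult `avalonbench_dev/avalon/test_engine.py` for the actual API:

```python
# Sketch only: the keyword argument below is assumed, not taken from the
# engine's real signature -- see avalonbench_dev/avalon/test_engine.py.
from engine import AvalonGameEnvironment, AvalonBasicConfig  # class name as used in the prose above

config = AvalonBasicConfig(num_players=5)  # assumed parameter name
env = AvalonGameEnvironment(config)        # build the environment from the config
```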
@inproceedings{light2023from,
title={AvalonBench: Evaluating {LLM}s Playing the Game of Avalon},
author={Jonathan Light and Min Cai and Sheng Shen and Ziniu Hu},
booktitle={NeurIPS 2023 Foundation Models for Decision Making Workshop},
year={2023},
url={https://openreview.net/forum?id=ltUrSryS0K}
}
Alternative AI tools for Avalon-LLM
Similar Open Source Tools
![oasis Screenshot](/screenshots_githubs/camel-ai-oasis.jpg)
oasis
OASIS is a scalable, open-source social media simulator that integrates large language models with rule-based agents to realistically mimic the behavior of up to one million users on platforms like Twitter and Reddit. It facilitates the study of complex social phenomena such as information spread, group polarization, and herd behavior, offering a versatile tool for exploring diverse social dynamics and user interactions in digital environments. With features like scalability, dynamic environments, diverse action spaces, and integrated recommendation systems, OASIS provides a comprehensive platform for simulating social media interactions at a large scale.
![llmgraph Screenshot](/screenshots_githubs/dylanhogg-llmgraph.jpg)
llmgraph
llmgraph is a tool that enables users to create knowledge graphs in GraphML, GEXF, and HTML formats by extracting world knowledge from large language models (LLMs) like ChatGPT. It supports various entity types and relationships, offers cache support for efficient graph growth, and provides insights into LLM costs. Users can customize the model used and interact with different LLM providers. The tool allows users to generate interactive graphs based on a specified entity type and Wikipedia link, making it a valuable resource for knowledge graph creation and exploration.
![BentoML Screenshot](/screenshots_githubs/bentoml-BentoML.jpg)
BentoML
BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It comes with everything you need for serving optimization, model packaging, and production deployment.
![ControlLLM Screenshot](/screenshots_githubs/OpenGVLab-ControlLLM.jpg)
ControlLLM
ControlLLM is a framework that empowers large language models to leverage multi-modal tools for solving complex real-world tasks. It addresses challenges like ambiguous user prompts, inaccurate tool selection, and inefficient tool scheduling by utilizing a task decomposer, a Thoughts-on-Graph paradigm, and an execution engine with a rich toolbox. The framework excels in tasks involving image, audio, and video processing, showcasing superior accuracy, efficiency, and versatility compared to existing methods.
![LLM-Brained-GUI-Agents-Survey Screenshot](/screenshots_githubs/vyokky-LLM-Brained-GUI-Agents-Survey.jpg)
LLM-Brained-GUI-Agents-Survey
The 'LLM-Brained-GUI-Agents-Survey' repository contains code for a searchable paper page and assets used in the survey paper on Large Language Model-Brained GUI Agents. These agents operate within GUI environments, utilizing Large Language Models as their core inference and cognitive engine to generate, plan, and execute actions flexibly and adaptively. The repository also encourages contributions from the community for new papers, resources, or improvements related to LLM-Powered GUI Agents.
![llm-colosseum Screenshot](/screenshots_githubs/OpenGenerativeAI-llm-colosseum.jpg)
llm-colosseum
llm-colosseum is a tool designed to evaluate Large Language Models (LLMs) in real-time by making them fight each other in Street Fighter III. The tool assesses LLMs based on speed, strategic thinking, adaptability, out-of-the-box thinking, and resilience. It provides a benchmark for LLMs to understand their environment and take context-based actions. Users can analyze the performance of different LLMs through ELO rankings and win rate matrices. The tool allows users to run experiments, test different LLM models, and customize prompts for LLM interactions. It offers installation instructions, test mode options, logging configurations, and the ability to run the tool with local models. Users can also contribute their own LLM models for evaluation and ranking.
![LLM-Drop Screenshot](/screenshots_githubs/CASE-Lab-UMD-LLM-Drop.jpg)
LLM-Drop
LLM-Drop is an official implementation of the paper 'What Matters in Transformers? Not All Attention is Needed'. The tool investigates redundancy in transformer-based Large Language Models (LLMs) by analyzing the architecture of Blocks, Attention layers, and MLP layers. It reveals that dropping certain Attention layers can enhance computational and memory efficiency without compromising performance. The tool provides a pipeline for Block Drop and Layer Drop based on LLaMA-Factory, and implements quantization using AutoAWQ and AutoGPTQ.
![py-llm-core Screenshot](/screenshots_githubs/advanced-stack-py-llm-core.jpg)
py-llm-core
PyLLMCore is a lightweight interface to Large Language Models with native support for llama.cpp, OpenAI API, and Azure deployments. It offers a Pythonic API that is simple to use, with structures provided by the standard library dataclasses module. The high-level API includes the assistants module for easy swapping between models. PyLLMCore supports various models including those compatible with llama.cpp, OpenAI, and Azure APIs. It covers use cases such as parsing, summarizing, question answering, hallucinations reduction, context size management, and tokenizing. The tool allows users to interact with language models for tasks like parsing text, summarizing content, answering questions, reducing hallucinations, managing context size, and tokenizing text.
![KnowAgent Screenshot](/screenshots_githubs/zjunlp-KnowAgent.jpg)
KnowAgent
KnowAgent is a tool designed for Knowledge-Augmented Planning for LLM-Based Agents. It involves creating an action knowledge base, converting action knowledge into text for model understanding, and a knowledgeable self-learning phase to continually improve the model's planning abilities. The tool aims to enhance agents' potential for application in complex situations by leveraging external reservoirs of information and iterative processes.
![GraphRAG-SDK Screenshot](/screenshots_githubs/FalkorDB-GraphRAG-SDK.jpg)
GraphRAG-SDK
Build fast and accurate GenAI applications with GraphRAG SDK, a specialized toolkit for building Graph Retrieval-Augmented Generation (GraphRAG) systems. It integrates knowledge graphs, ontology management, and state-of-the-art LLMs to deliver accurate, efficient, and customizable RAG workflows. The SDK simplifies the development process by automating ontology creation, knowledge graph agent creation, and query handling, enabling users to interact and query their knowledge graphs effectively. It supports multi-agent systems and orchestrates agents specialized in different domains. The SDK is optimized for FalkorDB, ensuring high performance and scalability for large-scale applications. By leveraging knowledge graphs, it enables semantic relationships and ontology-driven queries that go beyond standard vector similarity, enhancing retrieval-augmented generation capabilities.
![flo-ai Screenshot](/screenshots_githubs/rootflo-flo-ai.jpg)
flo-ai
Flo AI is a Python framework that enables users to build production-ready AI agents and teams with minimal code. It allows users to compose complex AI architectures using pre-built components while maintaining the flexibility to create custom components. The framework supports composable, production-ready, YAML-first, and flexible AI systems. Users can easily create AI agents and teams, manage teams of AI agents working together, and utilize built-in support for Retrieval-Augmented Generation (RAG) and compatibility with Langchain tools. Flo AI also provides tools for output parsing and formatting, tool logging, data collection, and JSON output collection. It is MIT Licensed and offers detailed documentation, tutorials, and examples for AI engineers and teams to accelerate development, maintainability, scalability, and testability of AI systems.
![lorax Screenshot](/screenshots_githubs/predibase-lorax.jpg)
lorax
LoRAX is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. It features dynamic adapter loading, heterogeneous continuous batching, adapter exchange scheduling, optimized inference, and is ready for production with prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. LoRAX supports a number of Large Language Models as the base model including Llama, Mistral, and Qwen, and any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.
![ichigo Screenshot](/screenshots_githubs/janhq-ichigo.jpg)
ichigo
Ichigo is a local real-time voice AI tool that uses an early fusion technique to extend a text-based LLM to have native 'listening' ability. It is an open research experiment with improved multiturn capabilities and the ability to refuse processing inaudible queries. The tool is designed for open data, open weight, on-device Siri-like functionality, inspired by Meta's Chameleon paper. Ichigo offers a web UI demo and Gradio web UI for users to interact with the tool. It has achieved enhanced MMLU scores, stronger context handling, advanced noise management, and improved multi-turn capabilities for a robust user experience.
![speech-to-speech Screenshot](/screenshots_githubs/huggingface-speech-to-speech.jpg)
speech-to-speech
This repository implements a speech-to-speech cascaded pipeline with consecutive parts including Voice Activity Detection (VAD), Speech to Text (STT), Language Model (LM), and Text to Speech (TTS). It aims to provide a fully open and modular approach by leveraging models available on the Transformers library via the Hugging Face hub. The code is designed for easy modification, with each component implemented as a class. Users can run the pipeline either on a server/client approach or locally, with detailed setup and usage instructions provided in the readme.
![MemoryLLM Screenshot](/screenshots_githubs/wangyu-ustc-MemoryLLM.jpg)
MemoryLLM
MemoryLLM is a large language model designed for self-updating capabilities. It offers pretrained models with different memory capacities and features, such as chat models. The repository provides training code, evaluation scripts, and datasets for custom experiments. MemoryLLM aims to enhance knowledge retention and performance on various natural language processing tasks.
For similar jobs
![weave Screenshot](/screenshots_githubs/wandb-weave.jpg)
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
![LLMStack Screenshot](/screenshots_githubs/trypromptly-LLMStack.jpg)
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
![VisionCraft Screenshot](/screenshots_githubs/VisionCraft-org-VisionCraft.jpg)
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
![kaito Screenshot](/screenshots_githubs/Azure-kaito.jpg)
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
![PyRIT Screenshot](/screenshots_githubs/Azure-PyRIT.jpg)
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
![tabby Screenshot](/screenshots_githubs/TabbyML-tabby.jpg)
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
![spear Screenshot](/screenshots_githubs/isl-org-spear.jpg)
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
![Magick Screenshot](/screenshots_githubs/Oneirocom-Magick.jpg)
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.