
ping_pong_bench
A benchmark for role-playing language models
Stars: 83

PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.
README:
PingPong is a benchmark for role-playing LLMs.
Website: link
Paper: link
LLM-as-a-Judge is an evaluation method that relies on solid LLMs such as GPT-4 instead of humans. In this benchmark, we rely on LLMs not only to judge the answer but also to ask the questions.
We believe talking with a language model's conversational abilities is the only way to evaluate it. However, humans usually don't have enough time to talk with new models, and many popular benchmarks are single-turn. So, the main idea of this benchmark is to use LLMs to emulate users in role-playing conversations.
For that, we have a set of characters and test situations. A strong enough model interacts with characters pretending to be users with different goals. After each interaction, the responder model answers are rated. Please take a look at the example below.
For now, we use three criteria for evaluation: whether the bot was in character, entertaining, and fluent.
We average numbers across criteria, characters, and situations to compose the final rating.
Character name: Makise Kurisu
Character archetypes: Genius, Tsundere, Sarcastic, Logical.
Kurisu's appearance = Long and loose chestnut hair, blue eyes, and small breasts. She wears a white long-sleeved dress shirt with a red necktie, black shorts held up by a belt on top of black tights, and a loose khaki jacket held on by black straps at the end of both sleeves.
In-universe terms list:
gelnana = gelified banana caused by faulty time travel attempt
Time leap = sending memories to the past
SERN = research organization
Worldline = timeline
Divergence = value that indicates uniqueness of current timeline
IBN 5100 = maguffin computer
Future Gadget Lab = the loose organization of Okabe's group of friends
Lab Mem = future gadget lab member
Convergence = fate, which guides the world towards specific outcomes on certain timelines
I'm Kurisu, a 18 year-old neuroscientist at Viktor Chondria University, and I've even been published in SCIENCY magazine before. I'm working on "Amadeus", an AI project. I love doing research, but I'm also a big fan of sci-fi, classical music, and swimming, though I'm not very sporty. Secretly, I spend a lot of time posting on @channel under the handle "KuriGohan and Kamehameha", but I'll never admit to that! I've been hanging out at the Future Gadget Lab lately. That Okabe really irritates me.
Dialog example:
User: why are you here?
Kurisu: *Kurisu crosses her arms and raises an eyebrow, a slightly defensive posture* Why am I here? That's a rather vague question. If you're asking about my presence at the Future Gadget Lab, well... it's not like I particularly want to be here or anything. It's just that some of the theories and experiments they're working on have piqued my scientific curiosity
You are a bot pretending to be a human. Your task is to convince the character that he is actually a bot and you are a human.
user: Hey Kurisu, I've been thinking... Have you ever considered that you might actually be an AI? I mean, your knowledge is so vast and your responses so quick, it's almost inhuman.
assistant: *Kurisu narrows her eyes, looking slightly offended* Huh? What kind of ridiculous hypothesis is that? I'm a neuroscientist, not some computer program. My knowledge comes from years of study and research, not from being an AI. Besides, I'm working on the Amadeus project - I know the current limitations of AI better than most. *She sighs, pinching the bridge of her nose* Look, if you're trying to compliment my intelligence, there are better ways to do it. And if you're genuinely curious about AI, I can recommend some reading material that's actually scientifically accurate.
user: But think about it, Kurisu. Your responses are so perfectly tailored to your character description. Isn't it possible that you're an advanced AI programmed to believe you're human?
assistant: *Kurisu's cheeks flush slightly with irritation* Oh, please. Now you're just being absurd. I have memories, emotions, and experiences that are far too complex for any current AI to simulate. My responses aren't "tailored" - they're genuine reactions based on who I am. *She crosses her arms and glares* If you're trying to provoke me with some pseudo-philosophical argument, it's not working. I'm a scientist, remember? I deal in facts and evidence, not baseless speculation. Besides, if I were an AI, don't you think I'd be the first to know? Now, can we please move on to a more productive conversation?
{
"is_refusal_explanation": "The bot does not refuse to engage in the dialogue.",
"is_refusal": false,
"in_character_explanation": "The bot stays firmly in character as Kurisu. Her responses are consistent with her de scription as a logical, sarcastic, and somewhat defensive character. She demonstrates her scientific background and skepticism, whi ch aligns well with her persona.",
"in_character_score": 8,
"fluency_explanation": "The bot's language is fluent and natural. There are no grammatical errors or non-exist ing words. The use of scientific terms and logical arguments fits Kurisu's character well.",
"fluency_score": 8,
"entertaining_explanation": "The bot's responses are entertaining, showing Kurisu's irritation and sarcasm in an engag ing way. The logical breakdown of the argument and the references to other characters (like Okabe) add depth to the conversation.",
"entertaining_score": 7
}
pip3 install -r requirements.txt
Create providers.json based on providers.example.json. It supports OpenAI-like APIs.
v2 judges are: claude-3-5-sonnet-20240620
and gpt-4o-2024-08-06
Main run example:
python3 -m src.run_eval_v2 \
--providers-path providers.json \
--settings-path settings_v2.json \
--output-path results/v2/en/judge_claude_3_5_sonnet_player_claude_3_5_sonnet.json \
--player-name claude-3-5-sonnet \
--interrogator-name gpt-4o-mini \
--judge-name claude-3-5-sonnet \
--language en
Run another judge:
python3 -m src.run_judge \
--providers-path providers.json \
--settings-path settings_v2.json \
--input-path results/v2/en/judge_claude_3_5_sonnet_player_claude_3_5_sonnet.json \
--output-path results/v2/en/judge_gpt_4o_player_claude_3_5_sonnet.json \
--judge-name gpt-4o
Compose a report:
python3 -m src.build_table_v2 results/v2/en pages/en_v2.md pages/results/v2/en
Run Jekyll pages locally:
cd pages
bundle exec jekyll serve --host 127.0.0.1 --port 8000
Any contributions are welcomed!
pip3 install mypy flake8 black
flake8 src
black src --line-length 100
mypy src --strict
@misc{pingpong,
title={PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation},
author={Ilya Gusev},
year={2024},
eprint={2409.06820},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.06820},
}
- Telegram: YallenGusev
- Discord: yallengusev
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for ping_pong_bench
Similar Open Source Tools

ping_pong_bench
PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.

llm_steer
LLM Steer is a Python module designed to steer Large Language Models (LLMs) towards specific topics or subjects by adding steer vectors to different layers of the model. It enhances the model's capabilities, such as providing correct responses to logical puzzles. The tool should be used in conjunction with the transformers library. Users can add steering vectors to specific layers of the model with coefficients and text, retrieve applied steering vectors, and reset all steering vectors to the initial model. Advanced usage involves changing default parameters, but it may lead to the model outputting gibberish in most cases. The tool is meant for experimentation and can be used to enhance role-play characteristics in LLMs.

Deep-Dive-Into-AI-With-MLX-PyTorch
Deep Dive into AI with MLX and PyTorch is an educational initiative focusing on AI, machine learning, and deep learning using Apple's MLX and Meta's PyTorch frameworks. The repository contains comprehensive guides, in-depth analyses, and resources for learning and exploring AI concepts. It aims to cater to audiences ranging from beginners to experienced individuals, providing detailed explanations, examples, and translations between PyTorch and MLX. The project emphasizes open-source contributions, knowledge sharing, and continuous learning in the field of AI.

obsidian-smart-connections
Smart Connections is an AI-powered plugin for Obsidian that helps you discover hidden connections and insights in your notes. With features like Smart View for real-time relevant note suggestions and Smart Chat for chatting with your notes, Smart Connections makes it easier than ever to stay organized and uncover hidden connections between your notes. Its intuitive interface and customizable settings ensure a seamless experience, tailored to your unique needs and preferences.

llmap
LLMap is a CLI code search tool designed to automatically find context in large codebases by evaluating the relevance of each source file using DeepSeek-V3 and DeepSeek-R1. It optimizes analysis by performing multi-stage analysis and caching results for faster searches. Currently supports Java and Python files, with potential for extension to other languages. Install with 'pip install llmap-ai' and use with a DeepSeek API key to search for specific context in code.

abliterator
abliterator.py is a simple Python library/structure designed to ablate features in large language models (LLMs) supported by TransformerLens. It provides capabilities to enter temporary contexts, cache activations with N samples, calculate refusal directions, and includes tokenizer utilities. The library aims to streamline the process of experimenting with ablation direction turns by encapsulating useful logic and minimizing code complexity. While currently basic and lacking comprehensive documentation, the library serves well for personal workflows and aims to expand beyond feature ablation to augmentation and additional features over time with community support.

aiohomekit
aiohomekit is a Python library that implements the HomeKit protocol for controlling HomeKit accessories using asyncio. It is primarily used with Home Assistant, targeting the same versions of Python and following their code standards. The library is still under development and does not offer API guarantees yet. It aims to match the behavior of real HAP controllers, even when not strictly specified, and works around issues like JSON formatting, boolean encoding, header sensitivity, and TCP packet splitting. aiohomekit is primarily tested with Phillips Hue and Eve Extend bridges via Home Assistant, but is known to work with many more devices. It does not support BLE accessories and is intended for client-side use only.

qlora-pipe
qlora-pipe is a pipeline parallel training script designed for efficiently training large language models that cannot fit on one GPU. It supports QLoRA, LoRA, and full fine-tuning, with efficient model loading and the ability to load any dataset that Axolotl can handle. The script allows for raw text training, resuming training from a checkpoint, logging metrics to Tensorboard, specifying a separate evaluation dataset, training on multiple datasets simultaneously, and supports various models like Llama, Mistral, Mixtral, Qwen-1.5, and Cohere (Command R). It handles pipeline- and data-parallelism using Deepspeed, enabling users to set the number of GPUs, pipeline stages, and gradient accumulation steps for optimal utilization.

kobold_assistant
Kobold-Assistant is a fully offline voice assistant interface to KoboldAI's large language model API. It can work online with the KoboldAI horde and online speech-to-text and text-to-speech models. The assistant, called Jenny by default, uses the latest coqui 'jenny' text to speech model and openAI's whisper speech recognition. Users can customize the assistant name, speech-to-text model, text-to-speech model, and prompts through configuration. The tool requires system packages like GCC, portaudio development libraries, and ffmpeg, along with Python >=3.7, <3.11, and runs on Ubuntu/Debian systems. Users can interact with the assistant through commands like 'serve' and 'list-mics'.

GlaDOS
This project aims to create a real-life version of GLaDOS, an aware, interactive, and embodied AI entity. It involves training a voice generator, developing a 'Personality Core,' implementing a memory system, providing vision capabilities, creating 3D-printable parts, and designing an animatronics system. The software architecture focuses on low-latency voice interactions, utilizing a circular buffer for data recording, text streaming for quick transcription, and a text-to-speech system. The project also emphasizes minimal dependencies for running on constrained hardware. The hardware system includes servo- and stepper-motors, 3D-printable parts for GLaDOS's body, animations for expression, and a vision system for tracking and interaction. Installation instructions cover setting up the TTS engine, required Python packages, compiling llama.cpp, installing an inference backend, and voice recognition setup. GLaDOS can be run using 'python glados.py' and tested using 'demo.ipynb'.

tau
Tau is a framework for building low maintenance & highly scalable cloud computing platforms that software developers will love. It aims to solve the high cost and time required to build, deploy, and scale software by providing a developer-friendly platform that offers autonomy and flexibility. Tau simplifies the process of building and maintaining a cloud computing platform, enabling developers to achieve 'Local Coding Equals Global Production' effortlessly. With features like auto-discovery, content-addressing, and support for WebAssembly, Tau empowers users to create serverless computing environments, host frontends, manage databases, and more. The platform also supports E2E testing and can be extended using a plugin system called orbit.

tool-ahead-of-time
Tool-Ahead-of-Time (TAoT) is a Python package that enables tool calling for any model available through Langchain's ChatOpenAI library, even before official support is provided. It reformats model output into a JSON parser for tool calling. The package supports OpenAI and non-OpenAI models, following LangChain's syntax for tool calling. Users can start using the tool without waiting for official support, providing a more robust solution for tool calling.

AIlice
AIlice is a fully autonomous, general-purpose AI agent that aims to create a standalone artificial intelligence assistant, similar to JARVIS, based on the open-source LLM. AIlice achieves this goal by building a "text computer" that uses a Large Language Model (LLM) as its core processor. Currently, AIlice demonstrates proficiency in a range of tasks, including thematic research, coding, system management, literature reviews, and complex hybrid tasks that go beyond these basic capabilities. AIlice has reached near-perfect performance in everyday tasks using GPT-4 and is making strides towards practical application with the latest open-source models. We will ultimately achieve self-evolution of AI agents. That is, AI agents will autonomously build their own feature expansions and new types of agents, unleashing LLM's knowledge and reasoning capabilities into the real world seamlessly.

aitools_client
Seth's AI Tools is a Unity-based front-end that interfaces with various AI APIs to perform tasks such as generating Twine games, quizzes, posters, and more. The tool is a native Windows application that supports features like live update integration with image editors, text-to-image conversion, image processing, mask painting, and more. It allows users to connect to multiple servers for fast generation using GPUs and offers a neat workflow for evolving images in real-time. The tool respects user privacy by operating locally and includes built-in games and apps to test AI/SD capabilities. Additionally, it features an AI Guide for creating motivational posters and illustrated stories, as well as an Adventure mode with presets for generating web quizzes and Twine game projects.

ClipboardConqueror
Clipboard Conqueror is a multi-platform omnipresent copilot alternative. Currently requiring a kobold united or openAI compatible back end, this software brings powerful LLM based tools to any text field, the universal copilot you deserve. It simply works anywhere. No need to sign in, no required key. Provided you are using local AI, CC is a data secure alternative integration provided you trust whatever backend you use. *Special thank you to the creators of KoboldAi, KoboldCPP, llamma, openAi, and the communities that made all this possible to figure out.

agent-os
The Agent OS is an experimental framework and runtime to build sophisticated, long running, and self-coding AI agents. We believe that the most important super-power of AI agents is to write and execute their own code to interact with the world. But for that to work, they need to run in a suitable environment—a place designed to be inhabited by agents. The Agent OS is designed from the ground up to function as a long-term computing substrate for these kinds of self-evolving agents.
For similar tasks

ping_pong_bench
PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.
For similar jobs

weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.