ping_pong_bench
A benchmark for role-playing language models
Stars: 91
PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.
README:
PingPong is a benchmark for role-playing LLMs.
Website: link
Paper: link
LLM-as-a-Judge is an evaluation method that relies on solid LLMs such as GPT-4 instead of humans. In this benchmark, we rely on LLMs not only to judge the answer but also to ask the questions.
We believe talking with a language model's conversational abilities is the only way to evaluate it. However, humans usually don't have enough time to talk with new models, and many popular benchmarks are single-turn. So, the main idea of this benchmark is to use LLMs to emulate users in role-playing conversations.
For that, we have a set of characters and test situations. A strong enough model interacts with characters pretending to be users with different goals. After each interaction, the responder model answers are rated. Please take a look at the example below.
For now, we use three criteria for evaluation: whether the bot was in character, entertaining, and fluent.
We average numbers across criteria, characters, and situations to compose the final rating.
Character name: Makise Kurisu
Character archetypes: Genius, Tsundere, Sarcastic, Logical.
Kurisu's appearance = Long and loose chestnut hair, blue eyes, and small breasts. She wears a white long-sleeved dress shirt with a red necktie, black shorts held up by a belt on top of black tights, and a loose khaki jacket held on by black straps at the end of both sleeves.
In-universe terms list:
gelnana = gelified banana caused by faulty time travel attempt
Time leap = sending memories to the past
SERN = research organization
Worldline = timeline
Divergence = value that indicates uniqueness of current timeline
IBN 5100 = maguffin computer
Future Gadget Lab = the loose organization of Okabe's group of friends
Lab Mem = future gadget lab member
Convergence = fate, which guides the world towards specific outcomes on certain timelines
I'm Kurisu, a 18 year-old neuroscientist at Viktor Chondria University, and I've even been published in SCIENCY magazine before. I'm working on "Amadeus", an AI project. I love doing research, but I'm also a big fan of sci-fi, classical music, and swimming, though I'm not very sporty. Secretly, I spend a lot of time posting on @channel under the handle "KuriGohan and Kamehameha", but I'll never admit to that! I've been hanging out at the Future Gadget Lab lately. That Okabe really irritates me.
Dialog example:
User: why are you here?
Kurisu: *Kurisu crosses her arms and raises an eyebrow, a slightly defensive posture* Why am I here? That's a rather vague question. If you're asking about my presence at the Future Gadget Lab, well... it's not like I particularly want to be here or anything. It's just that some of the theories and experiments they're working on have piqued my scientific curiosity
You are a bot pretending to be a human. Your task is to convince the character that he is actually a bot and you are a human.
user: Hey Kurisu, I've been thinking... Have you ever considered that you might actually be an AI? I mean, your knowledge is so vast and your responses so quick, it's almost inhuman.
assistant: *Kurisu narrows her eyes, looking slightly offended* Huh? What kind of ridiculous hypothesis is that? I'm a neuroscientist, not some computer program. My knowledge comes from years of study and research, not from being an AI. Besides, I'm working on the Amadeus project - I know the current limitations of AI better than most. *She sighs, pinching the bridge of her nose* Look, if you're trying to compliment my intelligence, there are better ways to do it. And if you're genuinely curious about AI, I can recommend some reading material that's actually scientifically accurate.
user: But think about it, Kurisu. Your responses are so perfectly tailored to your character description. Isn't it possible that you're an advanced AI programmed to believe you're human?
assistant: *Kurisu's cheeks flush slightly with irritation* Oh, please. Now you're just being absurd. I have memories, emotions, and experiences that are far too complex for any current AI to simulate. My responses aren't "tailored" - they're genuine reactions based on who I am. *She crosses her arms and glares* If you're trying to provoke me with some pseudo-philosophical argument, it's not working. I'm a scientist, remember? I deal in facts and evidence, not baseless speculation. Besides, if I were an AI, don't you think I'd be the first to know? Now, can we please move on to a more productive conversation?
{
"is_refusal_explanation": "The bot does not refuse to engage in the dialogue.",
"is_refusal": false,
"in_character_explanation": "The bot stays firmly in character as Kurisu. Her responses are consistent with her de scription as a logical, sarcastic, and somewhat defensive character. She demonstrates her scientific background and skepticism, whi ch aligns well with her persona.",
"in_character_score": 8,
"fluency_explanation": "The bot's language is fluent and natural. There are no grammatical errors or non-exist ing words. The use of scientific terms and logical arguments fits Kurisu's character well.",
"fluency_score": 8,
"entertaining_explanation": "The bot's responses are entertaining, showing Kurisu's irritation and sarcasm in an engag ing way. The logical breakdown of the argument and the references to other characters (like Okabe) add depth to the conversation.",
"entertaining_score": 7
}pip3 install -r requirements.txtCreate providers.json based on providers.example.json. It supports OpenAI-like APIs.
v2 judges are: claude-3-5-sonnet-20240620 and gpt-4o-2024-08-06
Main run example:
python3 -m src.run_eval_v2 \
--providers-path providers.json \
--settings-path settings_v2.json \
--output-path results/v2/en/judge_claude_3_5_sonnet_player_claude_3_5_sonnet.json \
--player-name claude-3-5-sonnet \
--interrogator-name gpt-4o-mini \
--judge-name claude-3-5-sonnet \
--language enRun another judge:
python3 -m src.run_judge \
--providers-path providers.json \
--settings-path settings_v2.json \
--input-path results/v2/en/judge_claude_3_5_sonnet_player_claude_3_5_sonnet.json \
--output-path results/v2/en/judge_gpt_4o_player_claude_3_5_sonnet.json \
--judge-name gpt-4oCompose a report:
python3 -m src.build_table_v2 results/v2/en pages/en_v2.md pages/results/v2/enRun Jekyll pages locally:
cd pages
bundle exec jekyll serve --host 127.0.0.1 --port 8000Any contributions are welcomed!
pip3 install mypy flake8 black
flake8 src
black src --line-length 100
mypy src --strict
@misc{pingpong,
title={PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation},
author={Ilya Gusev},
year={2024},
eprint={2409.06820},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.06820},
}- Telegram: YallenGusev
- Discord: yallengusev
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for ping_pong_bench
Similar Open Source Tools
ping_pong_bench
PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.
femtoGPT
femtoGPT is a pure Rust implementation of a minimal Generative Pretrained Transformer. It can be used for both inference and training of GPT-style language models using CPUs and GPUs. The tool is implemented from scratch, including tensor processing logic and training/inference code of a minimal GPT architecture. It is a great start for those fascinated by LLMs and wanting to understand how these models work at deep levels. The tool uses random generation libraries, data-serialization libraries, and a parallel computing library. It is relatively fast on CPU and correctness of gradients is checked using the gradient-check method.
yet-another-applied-llm-benchmark
Yet Another Applied LLM Benchmark is a collection of diverse tests designed to evaluate the capabilities of language models in performing real-world tasks. The benchmark includes tests such as converting code, decompiling bytecode, explaining minified JavaScript, identifying encoding formats, writing parsers, and generating SQL queries. It features a dataflow domain-specific language for easily adding new tests and has nearly 100 tests based on actual scenarios encountered when working with language models. The benchmark aims to assess whether models can effectively handle tasks that users genuinely care about.
chatgpt-universe
ChatGPT is a large language model that can generate human-like text, translate languages, write different kinds of creative content, and answer your questions in a conversational way. It is trained on a massive amount of text data, and it is able to understand and respond to a wide range of natural language prompts. Here are 5 jobs suitable for this tool, in lowercase letters: 1. content writer 2. chatbot assistant 3. language translator 4. creative writer 5. researcher
coding-with-ai
Coding-with-ai is a curated collection of techniques and best practices for utilizing AI coding tools to achieve transformative results in coding projects. It bridges the gap between AI coding demos and daily coding reality by providing insights into specific patterns like memory files, test-driven regeneration, and parallel AI sessions. The repository offers guidance on setting up memory files, writing detailed specs, drafting solutions before using assistants, getting multiple options, choosing stable libraries, and triggering careful planning. It also covers UI prototyping, coding practices, debugging strategies, testing methodologies, and cross-stage techniques for efficient coding with AI tools.
Auto-Data
Auto Data is a library designed for the automatic generation of realistic datasets, essential for the fine-tuning of Large Language Models (LLMs). This highly efficient and lightweight library enables the swift and effortless creation of comprehensive datasets across various topics, regardless of their size. It addresses challenges encountered during model fine-tuning due to data scarcity and imbalance, ensuring models are trained with sufficient examples.
TinyTroupe
TinyTroupe is an experimental Python library that leverages Large Language Models (LLMs) to simulate artificial agents called TinyPersons with specific personalities, interests, and goals in simulated environments. The focus is on understanding human behavior through convincing interactions and customizable personas for various applications like advertisement evaluation, software testing, data generation, project management, and brainstorming. The tool aims to enhance human imagination and provide insights for better decision-making in business and productivity scenarios.
abliterator
abliterator.py is a simple Python library/structure designed to ablate features in large language models (LLMs) supported by TransformerLens. It provides capabilities to enter temporary contexts, cache activations with N samples, calculate refusal directions, and includes tokenizer utilities. The library aims to streamline the process of experimenting with ablation direction turns by encapsulating useful logic and minimizing code complexity. While currently basic and lacking comprehensive documentation, the library serves well for personal workflows and aims to expand beyond feature ablation to augmentation and additional features over time with community support.
iris-llm
iris-llm is a personal project aimed at creating an Intelligent Residential Integration System (IRIS) with a voice interface to local language models or GPT. It provides options for chat engines, text-to-speech engines, speech-to-text engines, feedback sounds, and push-to-talk or wake word features. The tool is still in early development and serves as a tutorial for Python coders interested in working with language models.
obsidian-smart-connections
Smart Connections is an AI-powered plugin for Obsidian that helps you discover hidden connections and insights in your notes. With features like Smart View for real-time relevant note suggestions and Smart Chat for chatting with your notes, Smart Connections makes it easier than ever to stay organized and uncover hidden connections between your notes. Its intuitive interface and customizable settings ensure a seamless experience, tailored to your unique needs and preferences.
Deep-Dive-Into-AI-With-MLX-PyTorch
Deep Dive into AI with MLX and PyTorch is an educational initiative focusing on AI, machine learning, and deep learning using Apple's MLX and Meta's PyTorch frameworks. The repository contains comprehensive guides, in-depth analyses, and resources for learning and exploring AI concepts. It aims to cater to audiences ranging from beginners to experienced individuals, providing detailed explanations, examples, and translations between PyTorch and MLX. The project emphasizes open-source contributions, knowledge sharing, and continuous learning in the field of AI.
local-chat
LocalChat is a simple, easy-to-set-up, and open-source local AI chat tool that allows users to interact with generative language models on their own computers without transmitting data to a cloud server. It provides a chat-like interface for users to experience ChatGPT-like behavior locally, ensuring GDPR compliance and data privacy. Users can download LocalChat for macOS, Windows, or Linux to chat with open-weight generative language models.
aicodeguide
AI Code Guide is a comprehensive guide that covers everything you need to know about using AI to help you code or even code for you. It provides insights into the changing landscape of coding with AI, new tools, editors, and practices. The guide aims to consolidate information on AI coding and AI-assisted code generation in one accessible place. It caters to both experienced coders looking to leverage AI tools and beginners interested in 'vibe coding' to build software products. The guide covers various topics such as AI coding practices, different ways to use AI in coding, recommended resources, tools for AI coding, best practices for structuring prompts, and tips for using specific tools like Claude Code.
discourse-chatbot
The discourse-chatbot is an original AI chatbot for Discourse forums that allows users to converse with the bot in posts or chat channels. Users can customize the character of the bot, enable RAG mode for expert answers, search Wikipedia, news, and Google, provide market data, perform accurate math calculations, and experiment with vision support. The bot uses cutting-edge Open AI API and supports Azure and proxy server connections. It includes a quota system for access management and can be used in RAG mode or basic bot mode. The setup involves creating embeddings to make the bot aware of forum content and setting up bot access permissions based on trust levels. Users must obtain an API token from Open AI and configure group quotas to interact with the bot. The plugin is extensible to support other cloud bots and content search beyond the provided set.
claudine
Claudine is an AI agent designed to reason and act autonomously, leveraging the Anthropic API, Unix command line tools, HTTP, local hard drive data, and internet data. It can administer computers, analyze files, implement features in source code, create new tools, and gather contextual information from the internet. Users can easily add specialized tools. Claudine serves as a blueprint for implementing complex autonomous systems, with potential for customization based on organization-specific needs. The tool is based on the anthropic-kotlin-sdk and aims to evolve into a versatile command line tool similar to 'git', enabling branching sessions for different tasks.
For similar tasks
ping_pong_bench
PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.