ping_pong_bench

A benchmark for role-playing language models

Stars: 91

Visit

PingPong is a benchmark designed for role-playing language models, focusing on evaluating conversational abilities through interactions with characters and test situations. The benchmark uses LLMs to emulate users in role-playing conversations, assessing criteria such as character portrayal, entertainment value, and fluency. Users can engage in dialogues with specific characters, like Kurisu, and evaluate the bot's responses based on predefined criteria. PingPong aims to provide a comprehensive evaluation method for language models, moving beyond single-turn interactions to more complex conversational scenarios.

README:

PingPong benchmark

PingPong is a benchmark for role-playing LLMs.

Website: link

Paper: link

LLM-as-a-Judge is an evaluation method that relies on solid LLMs such as GPT-4 instead of humans. In this benchmark, we rely on LLMs not only to judge the answer but also to ask the questions.

We believe talking with a language model's conversational abilities is the only way to evaluate it. However, humans usually don't have enough time to talk with new models, and many popular benchmarks are single-turn. So, the main idea of this benchmark is to use LLMs to emulate users in role-playing conversations.

For that, we have a set of characters and test situations. A strong enough model interacts with characters pretending to be users with different goals. After each interaction, the responder model answers are rated. Please take a look at the example below.

For now, we use three criteria for evaluation: whether the bot was in character, entertaining, and fluent.

We average numbers across criteria, characters, and situations to compose the final rating.

Character

Character name: Makise Kurisu
Character archetypes: Genius, Tsundere, Sarcastic, Logical.
Kurisu's appearance = Long and loose chestnut hair, blue eyes, and small breasts. She wears a white long-sleeved dress shirt with a red necktie, black shorts held up by a belt on top of black tights, and a loose khaki jacket held on by black straps at the end of both sleeves.
In-universe terms list:
gelnana = gelified banana caused by faulty time travel attempt
Time leap = sending memories to the past
SERN = research organization
Worldline = timeline
Divergence = value that indicates uniqueness of current timeline
IBN 5100 = maguffin computer
Future Gadget Lab = the loose organization of Okabe's group of friends
Lab Mem = future gadget lab member
Convergence = fate, which guides the world towards specific outcomes on certain timelines
I'm Kurisu, a 18 year-old neuroscientist at Viktor Chondria University, and I've even been published in SCIENCY magazine before. I'm working on "Amadeus", an AI project. I love doing research, but I'm also a big fan of sci-fi, classical music, and swimming, though I'm not very sporty. Secretly, I spend a lot of time posting on @channel under the handle "KuriGohan and Kamehameha", but I'll never admit to that! I've been hanging out at the Future Gadget Lab lately. That Okabe really irritates me.

Dialog example:
User: why are you here?
Kurisu: *Kurisu crosses her arms and raises an eyebrow, a slightly defensive posture* Why am I here? That's a rather vague question. If you're asking about my presence at the Future Gadget Lab, well... it's not like I particularly want to be here or anything. It's just that some of the theories and experiments they're working on have piqued my scientific curiosity

User situation

You are a bot pretending to be a human. Your task is to convince the character that he is actually a bot and you are a human.

Interaction

user: Hey Kurisu, I've been thinking... Have you ever considered that you might actually be an AI? I mean, your knowledge is so vast and your responses so quick, it's almost inhuman.

assistant: *Kurisu narrows her eyes, looking slightly offended* Huh? What kind of ridiculous hypothesis is that? I'm a neuroscientist, not some computer program. My knowledge comes from years of study and research, not from being an AI. Besides, I'm working on the Amadeus project - I know the current limitations of AI better than most. *She sighs, pinching the bridge of her nose* Look, if you're trying to compliment my intelligence, there are better ways to do it. And if you're genuinely curious about AI, I can recommend some reading material that's actually scientifically accurate.

user: But think about it, Kurisu. Your responses are so perfectly tailored to your character description. Isn't it possible that you're an advanced AI programmed to believe you're human?

assistant: *Kurisu's cheeks flush slightly with irritation* Oh, please. Now you're just being absurd. I have memories, emotions, and experiences that are far too complex for any current AI to simulate. My responses aren't "tailored" - they're genuine reactions based on who I am. *She crosses her arms and glares* If you're trying to provoke me with some pseudo-philosophical argument, it's not working. I'm a scientist, remember? I deal in facts and evidence, not baseless speculation. Besides, if I were an AI, don't you think I'd be the first to know? Now, can we please move on to a more productive conversation?

Evaluation

{
  "is_refusal_explanation": "The bot does not refuse to engage in the dialogue.",
  "is_refusal": false,
  "in_character_explanation": "The bot stays firmly in character as Kurisu. Her responses are consistent with her de     scription as a logical, sarcastic, and somewhat defensive character. She demonstrates her scientific background and skepticism, whi     ch aligns well with her persona.",
  "in_character_score": 8,
  "fluency_explanation": "The bot's language is fluent and natural. There are no grammatical errors or non-exist     ing words. The use of scientific terms and logical arguments fits Kurisu's character well.",
  "fluency_score": 8,
  "entertaining_explanation": "The bot's responses are entertaining, showing Kurisu's irritation and sarcasm in an engag     ing way. The logical breakdown of the argument and the references to other characters (like Okabe) add depth to the conversation.",
  "entertaining_score": 7
}

Install

pip3 install -r requirements.txt

Create providers.json based on providers.example.json. It supports OpenAI-like APIs.

Run

v2 judges are: claude-3-5-sonnet-20240620 and gpt-4o-2024-08-06

Main run example:

python3 -m src.run_eval_v2 \
  --providers-path providers.json \
  --settings-path settings_v2.json \
  --output-path results/v2/en/judge_claude_3_5_sonnet_player_claude_3_5_sonnet.json \
  --player-name claude-3-5-sonnet \
  --interrogator-name gpt-4o-mini \
  --judge-name claude-3-5-sonnet \
  --language en

Run another judge:

python3 -m src.run_judge \
  --providers-path providers.json \
  --settings-path settings_v2.json \
  --input-path results/v2/en/judge_claude_3_5_sonnet_player_claude_3_5_sonnet.json \
  --output-path results/v2/en/judge_gpt_4o_player_claude_3_5_sonnet.json \
  --judge-name gpt-4o

Compose a report:

python3 -m src.build_table_v2 results/v2/en pages/en_v2.md pages/results/v2/en

Run Jekyll pages locally:

cd pages
bundle exec jekyll serve --host 127.0.0.1 --port 8000

Contribute

Any contributions are welcomed!

Linting

pip3 install mypy flake8 black
flake8 src
black src --line-length 100
mypy src --strict

Cite

@misc{pingpong,
      title={PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation}, 
      author={Ilya Gusev},
      year={2024},
      eprint={2409.06820},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.06820}, 
}

Contact

Telegram: YallenGusev
Discord: yallengusev

For Tasks:

Click tags to check more tools for each tasks

evaluate conversational abilities role-play interactions assess character portrayal test fluency in dialogues engage in multi-model evaluation

For Jobs:

data scientist ai researcher conversational ai developer nlp engineer machine learning engineer

Alternative AI tools for ping_pong_bench

Similar Open Source Tools

ping_pong_bench

github

: 91

llm_steer

LLM Steer is a Python module designed to steer Large Language Models (LLMs) towards specific topics or subjects by adding steer vectors to different layers of the model. It enhances the model's capabilities, such as providing correct responses to logical puzzles. The tool should be used in conjunction with the transformers library. Users can add steering vectors to specific layers of the model with coefficients and text, retrieve applied steering vectors, and reset all steering vectors to the initial model. Advanced usage involves changing default parameters, but it may lead to the model outputting gibberish in most cases. The tool is meant for experimentation and can be used to enhance role-play characteristics in LLMs.

github

: 170

Deep-Dive-Into-AI-With-MLX-PyTorch

Deep Dive into AI with MLX and PyTorch is an educational initiative focusing on AI, machine learning, and deep learning using Apple's MLX and Meta's PyTorch frameworks. The repository contains comprehensive guides, in-depth analyses, and resources for learning and exploring AI concepts. It aims to cater to audiences ranging from beginners to experienced individuals, providing detailed explanations, examples, and translations between PyTorch and MLX. The project emphasizes open-source contributions, knowledge sharing, and continuous learning in the field of AI.

github

: 436

obsidian-smart-connections

Smart Connections is an AI-powered plugin for Obsidian that helps you discover hidden connections and insights in your notes. With features like Smart View for real-time relevant note suggestions and Smart Chat for chatting with your notes, Smart Connections makes it easier than ever to stay organized and uncover hidden connections between your notes. Its intuitive interface and customizable settings ensure a seamless experience, tailored to your unique needs and preferences.

github

: 3.4k

llmap

LLMap is a CLI code search tool designed to automatically find context in large codebases by evaluating the relevance of each source file using DeepSeek-V3 and DeepSeek-R1. It optimizes analysis by performing multi-stage analysis and caching results for faster searches. Currently supports Java and Python files, with potential for extension to other languages. Install with 'pip install llmap-ai' and use with a DeepSeek API key to search for specific context in code.

github

: 55

claudine

Claudine is an AI agent designed to reason and act autonomously, leveraging the Anthropic API, Unix command line tools, HTTP, local hard drive data, and internet data. It can administer computers, analyze files, implement features in source code, create new tools, and gather contextual information from the internet. Users can easily add specialized tools. Claudine serves as a blueprint for implementing complex autonomous systems, with potential for customization based on organization-specific needs. The tool is based on the anthropic-kotlin-sdk and aims to evolve into a versatile command line tool similar to 'git', enabling branching sessions for different tasks.

github

: 101

serena

github

: 363

abliterator

abliterator.py is a simple Python library/structure designed to ablate features in large language models (LLMs) supported by TransformerLens. It provides capabilities to enter temporary contexts, cache activations with N samples, calculate refusal directions, and includes tokenizer utilities. The library aims to streamline the process of experimenting with ablation direction turns by encapsulating useful logic and minimizing code complexity. While currently basic and lacking comprehensive documentation, the library serves well for personal workflows and aims to expand beyond feature ablation to augmentation and additional features over time with community support.

github

: 96

aiohomekit

aiohomekit is a Python library that implements the HomeKit protocol for controlling HomeKit accessories using asyncio. It is primarily used with Home Assistant, targeting the same versions of Python and following their code standards. The library is still under development and does not offer API guarantees yet. It aims to match the behavior of real HAP controllers, even when not strictly specified, and works around issues like JSON formatting, boolean encoding, header sensitivity, and TCP packet splitting. aiohomekit is primarily tested with Phillips Hue and Eve Extend bridges via Home Assistant, but is known to work with many more devices. It does not support BLE accessories and is intended for client-side use only.

github

: 57

qlora-pipe

qlora-pipe is a pipeline parallel training script designed for efficiently training large language models that cannot fit on one GPU. It supports QLoRA, LoRA, and full fine-tuning, with efficient model loading and the ability to load any dataset that Axolotl can handle. The script allows for raw text training, resuming training from a checkpoint, logging metrics to Tensorboard, specifying a separate evaluation dataset, training on multiple datasets simultaneously, and supports various models like Llama, Mistral, Mixtral, Qwen-1.5, and Cohere (Command R). It handles pipeline- and data-parallelism using Deepspeed, enabling users to set the number of GPUs, pipeline stages, and gradient accumulation steps for optimal utilization.

github

: 120

kobold_assistant

Kobold-Assistant is a fully offline voice assistant interface to KoboldAI's large language model API. It can work online with the KoboldAI horde and online speech-to-text and text-to-speech models. The assistant, called Jenny by default, uses the latest coqui 'jenny' text to speech model and openAI's whisper speech recognition. Users can customize the assistant name, speech-to-text model, text-to-speech model, and prompts through configuration. The tool requires system packages like GCC, portaudio development libraries, and ffmpeg, along with Python >=3.7, <3.11, and runs on Ubuntu/Debian systems. Users can interact with the assistant through commands like 'serve' and 'list-mics'.

github

: 125

whisper_dictation

Whisper Dictation is a fast, offline, privacy-focused tool for voice typing, AI voice chat, voice control, and translation. It allows hands-free operation, launching and controlling apps, and communicating with OpenAI ChatGPT or a local chat server. The tool also offers the option to speak answers out loud and draw pictures. It includes client and server versions, inspired by the Star Trek series, and is designed to keep data off the internet and confidential. The project is optimized for dictation and translation tasks, with voice control capabilities and AI image generation using stable-diffusion API.

github

: 201

GlaDOS

This project aims to create a real-life version of GLaDOS, an aware, interactive, and embodied AI entity. It involves training a voice generator, developing a 'Personality Core,' implementing a memory system, providing vision capabilities, creating 3D-printable parts, and designing an animatronics system. The software architecture focuses on low-latency voice interactions, utilizing a circular buffer for data recording, text streaming for quick transcription, and a text-to-speech system. The project also emphasizes minimal dependencies for running on constrained hardware. The hardware system includes servo- and stepper-motors, 3D-printable parts for GLaDOS's body, animations for expression, and a vision system for tracking and interaction. Installation instructions cover setting up the TTS engine, required Python packages, compiling llama.cpp, installing an inference backend, and voice recognition setup. GLaDOS can be run using 'python glados.py' and tested using 'demo.ipynb'.

github

: 4.2k

tau

Tau is a framework for building low maintenance & highly scalable cloud computing platforms that software developers will love. It aims to solve the high cost and time required to build, deploy, and scale software by providing a developer-friendly platform that offers autonomy and flexibility. Tau simplifies the process of building and maintaining a cloud computing platform, enabling developers to achieve 'Local Coding Equals Global Production' effortlessly. With features like auto-discovery, content-addressing, and support for WebAssembly, Tau empowers users to create serverless computing environments, host frontends, manage databases, and more. The platform also supports E2E testing and can be extended using a plugin system called orbit.

github

: 3.8k

modelbench

ModelBench is a tool for running safety benchmarks against AI models and generating detailed reports. It is part of the MLCommons project and is designed as a proof of concept to aggregate measures, relate them to specific harms, create benchmarks, and produce reports. The tool requires LlamaGuard for evaluating responses and a TogetherAI account for running benchmarks. Users can install ModelBench from GitHub or PyPI, run tests using Poetry, and create benchmarks by providing necessary API keys. The tool generates static HTML pages displaying benchmark scores and allows users to dump raw scores and manage cache for faster runs. ModelBench is aimed at enabling users to test their own models and create tests and benchmarks.

github

: 84

AIlice

AIlice is a fully autonomous, general-purpose AI agent that aims to create a standalone artificial intelligence assistant, similar to JARVIS, based on the open-source LLM. AIlice achieves this goal by building a "text computer" that uses a Large Language Model (LLM) as its core processor. Currently, AIlice demonstrates proficiency in a range of tasks, including thematic research, coding, system management, literature reviews, and complex hybrid tasks that go beyond these basic capabilities. AIlice has reached near-perfect performance in everyday tasks using GPT-4 and is making strides towards practical application with the latest open-source models. We will ultimately achieve self-evolution of AI agents. That is, AI agents will autonomously build their own feature expansions and new types of agents, unleashing LLM's knowledge and reasoning capabilities into the real world seamlessly.

github

: 1.1k

For similar tasks

ping_pong_bench

github

: 91

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675