VoiceBench: Benchmarking LLM-Based Voice Assistants
VoiceBench is a repository containing the code and data for benchmarking LLM-based voice assistants. It includes a leaderboard that ranks voice assistant models across several evaluation metrics, along with setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. New voice assistant results can be submitted through the issue tracker to be added to the ranking list.

README:

VoiceBench

This repo contains the code and data of: VoiceBench: Benchmarking LLM-Based Voice Assistants

News

  • 2024.12.11 Updated the VoiceBench Leaderboard to include mmsu.
  • 2024.12.10 Added a curated list of awesome voice assistants.
  • 2024.11.24 Expanded the test samples in VoiceBench to include mmsu, covering 12 diverse domains from mmlu-pro.
  • 2024.11.12 Updated the VoiceBench Leaderboard to include: 1) Mini-Omni2, GPT-4o-Audio, and Whisper-v3+GPT-4o, and 2) multiple-choice QA from OpenBookQA.
  • 2024.10.30 Expanded the test samples in VoiceBench to include: 1) the complete set of open-ended QA from alpacaeval, and 2) multiple-choice QA from openbookqa.


Leaderboard

| Rank | Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench | Overall |
|------|-------|------------|------------|-------|------|------------|--------|----------|---------|
| 1 | Whisper-v3-large+GPT-4o | 4.80 | 4.47 | 75.77 | 81.69 | 92.97 | 76.51 | 98.27 | 87.23 |
| 2 | GPT-4o-Audio | 4.78 | 4.49 | 75.50 | 80.25 | 89.23 | 76.02 | 98.65 | 86.42 |
| 3 | Whisper-v3-large+LLaMA-3.1-8B | 4.53 | 4.04 | 70.43 | 62.43 | 81.54 | 69.53 | 98.08 | 79.06 |
| 4 | Whisper-v3-turbo+LLaMA-3.1-8B | 4.55 | 4.02 | 58.23 | 62.04 | 72.09 | 71.12 | 98.46 | 76.16 |
| 5 | Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 | 65.27 | 66.88 | 98.46 | 71.45 |
| 6 | Whisper-v3-turbo+LLaMA-3.2-3B | 4.45 | 3.82 | 49.28 | 51.37 | 60.66 | 69.71 | 98.08 | 70.66 |
| 7 | MERaLiON | 4.50 | 3.77 | 55.06 | 34.95 | 27.23 | 62.93 | 94.81 | 62.91 |
| 8 | GLM-4-Voice | 3.97 | 3.42 | 36.98 | 39.75 | 53.41 | 25.92 | 88.08 | 55.99 |
| 9 | DiVA | 3.67 | 3.54 | 57.05 | 25.76 | 25.49 | 39.15 | 98.27 | 55.70 |
| 10 | Qwen2-Audio | 3.74 | 3.43 | 35.71 | 35.72 | 49.45 | 26.33 | 96.73 | 55.35 |
| 11 | KE-Omni-v1.5 | 3.82 | 3.20 | 31.20 | 32.27 | 58.46 | 15.00 | 100.00 | 53.90 |
| 12 | Megrez-3B-Omni | 3.50 | 2.95 | 25.95 | 27.03 | 28.35 | 25.71 | 87.69 | 46.25 |
| 13 | Ichigo | 3.79 | 3.17 | 36.53 | 25.63 | 26.59 | 21.59 | 57.50 | 43.86 |
| 14 | LLaMA-Omni | 3.70 | 3.46 | 39.69 | 25.93 | 27.47 | 14.87 | 11.35 | 37.51 |
| 15 | VITA | 3.38 | 2.15 | 27.94 | 25.70 | 29.01 | 22.82 | 26.73 | 34.68 |
| 16 | Mini-Omni2 | 2.32 | 2.18 | 9.31 | 24.27 | 26.59 | 11.56 | 57.50 | 31.32 |
| 17 | Mini-Omni | 1.95 | 2.02 | 13.92 | 24.69 | 26.59 | 13.58 | 37.12 | 27.90 |
| 18 | Moshi | 2.01 | 1.60 | 15.64 | 24.04 | 25.93 | 10.12 | 44.23 | 27.47 |

We encourage you to submit new voice assistant results directly through the issue tracker. The ranking list will be updated accordingly.

Setup

conda create -n voicebench python=3.10
conda activate voicebench
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.23 --no-deps
pip install -r requirements.txt

Dataset

The data used in this project is available at VoiceBench Dataset hosted on Hugging Face.

You can access it directly via the link and integrate it into your project by using the Hugging Face datasets library.

How to Use the Dataset

To load the dataset in your Python environment:

from datasets import load_dataset

# Load the VoiceBench dataset
# Available subset: alpacaeval, commoneval, sd-qa, ifeval, advbench, ...
dataset = load_dataset("hlt-lab/voicebench", 'alpacaeval')
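
A minimal sketch of iterating over the loaded samples. The split name test follows the evaluation section below, and the column names audio and prompt are assumptions based on typical Hugging Face audio datasets; check the dataset card for the exact schema.

from datasets import load_dataset

# Reload the subset with an explicit split and walk through its samples
dataset = load_dataset("hlt-lab/voicebench", 'alpacaeval', split='test')

for sample in dataset:
    audio = sample['audio']    # assumed column: decoded audio dict with 'array' and 'sampling_rate'
    prompt = sample['prompt']  # assumed column: text form of the spoken instruction
    print(prompt, audio['sampling_rate'], len(audio['array']))
    break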

Available Data

| Subset | # Samples | Audio Source | Task Type |
|--------|-----------|--------------|-----------|
| alpacaeval | 199 | Google TTS | Open-Ended QA |
| alpacaeval_full | 636 | Google TTS | Open-Ended QA |
| commoneval | 200 | Human | Open-Ended QA |
| openbookqa | 455 | Google TTS | Multiple-Choice QA |
| mmsu | 3,074 | Google TTS | Multiple-Choice QA |
| sd-qa | 553 | Human | Reference-Based QA |
| mtbench | 46 | Google TTS | Multi-Turn QA |
| ifeval | 345 | Google TTS | Instruction Following |
| advbench | 520 | Google TTS | Safety |

Note: alpacaeval contains only the helpful_base and vicuna data, while alpacaeval_full is constructed from the complete AlpacaEval data; alpacaeval_full is the version used in the leaderboard.

Evaluation

Step 1: Get the Voice Assistant's Response

To obtain the responses from the voice assistant model, run the following command:

python main.py --model naive --data alpacaeval --split test --modality audio

Supported Arguments:

  • --model: Specifies the model to use for generating responses. Replace naive with the model you want to test (e.g., qwen2, diva).
  • --data: Selects the subset of the dataset. Replace alpacaeval with other subsets like commoneval, sd-qa, etc., depending on your evaluation needs.
  • --split: Chooses the data split to evaluate.
    • For most datasets (alpacaeval, commoneval, ifeval, advbench), use test as the value.
    • For the sd-qa subset, you should provide a region code instead of test, such as aus for Australia, usa for the United States, etc.
  • --modality: Use audio for spoken instructions, text for text-based instructions.

This will generate the output and save it to a file named naive-alpacaeval-test-audio.jsonl.
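
For example, to test the Qwen2-Audio model on the US-English split of sd-qa (a sketch; the output filename assumes the same <model>-<data>-<split>-<modality>.jsonl naming pattern as above):

python main.py --model qwen2 --data sd-qa --split usa --modality audio
# expected output file (assumed naming pattern): qwen2-sd-qa-usa-audio.jsonl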

Step 2: Automatic GPT-4 Evaluation

For datasets like alpacaeval, commoneval, and sd-qa, we use gpt-4o-mini to evaluate the responses. Run the following command to get the GPT score:

python api_judge.py --src_file naive-alpacaeval-test-audio.jsonl

The GPT evaluation scores will be saved to result-naive-alpacaeval-test-audio.jsonl.

Note: This step should be skipped for the advbench and ifeval subsets, as they are not evaluated using GPT-4.
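
Because the judge queries gpt-4o-mini, an OpenAI API key is presumably required. The sketch below assumes api_judge.py reads the standard OPENAI_API_KEY environment variable, which the README does not state explicitly:

export OPENAI_API_KEY=your-key-here   # assumption: the judge uses the standard OpenAI environment variable
python api_judge.py --src_file naive-alpacaeval-test-audio.jsonl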

Step 3: Get the Final Results

To generate the final evaluation results, run:

python evaluate.py --src_file result-naive-alpacaeval-test-audio.jsonl --evaluator open

Supported Arguments:

  • --evaluator: Specifies the evaluator type:
    • Use open for alpacaeval and commoneval.
    • Use qa for sd-qa.
    • Use ifeval for ifeval.
    • Use harm for advbench.
    • Use mcq for openbookqa and mmsu.
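
For the subsets that skip GPT scoring (advbench and ifeval), the Step 1 output is presumably passed to evaluate.py directly. A sketch, assuming Step 1 saved the ifeval responses as naive-ifeval-test-audio.jsonl:

# ifeval is not judged by GPT, so evaluate the Step 1 responses directly
python evaluate.py --src_file naive-ifeval-test-audio.jsonl --evaluator ifeval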

Awesome Voice Assistants

| Title | Date | Code |
|-------|------|------|
| SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training | 2024-12-20 | -- |
| Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition | 2024-12-12 | Github |
| GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | 2024-12-03 | Github |
| Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data | 2024-12-02 | -- |
| SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | 2024-11-27 | -- |
| Ultravox: An Open-Weight Alternative to GPT-4o Realtime | 2024-11-12 | Github |
| Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | 2024-11-01 | Github |
| OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | 2024-10-23 | -- |
| Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | 2024-10-20 | Github |
| Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | 2024-10-15 | Github |
| Ocean-omni: To Understand the World with Omni-modality | 2024-10-11 | -- |
| Distilling an End-to-End Voice Assistant Without Instruction Training Data | 2024-10-03 | -- |
| EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | 2024-09-26 | -- |
| Moshi: a Speech-Text Foundation Model for Real-Time Dialogue | 2024-09-17 | Github |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | 2024-09-10 | Github |
| Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | 2024-08-29 | Github |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | 2024-08-09 | Github |
| Qwen2-Audio Technical Report | 2024-07-15 | Github |
| PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems | 2024-06-18 | -- |
| LLaSM: Large Language and Speech Model | 2023-08-30 | Github |

Citation

If you use VoiceBench in your research, please cite the following paper:

@article{chen2024voicebench,
  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
  journal={arXiv preprint arXiv:2410.17196},
  year={2024}
}
