VoiceBench
VoiceBench: Benchmarking LLM-Based Voice Assistants
Stars: 76
VoiceBench is a repository containing code and data for benchmarking LLM-Based Voice Assistants. It includes a leaderboard with rankings of various voice assistant models based on different evaluation metrics. The repository provides setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. Users can submit new voice assistant results through the issue tracker for updates on the ranking list.
README:
This repo contains the code and data of VoiceBench: Benchmarking LLM-Based Voice Assistants.
- 2024.12.11: Updated the VoiceBench Leaderboard to include `mmsu`.
- 2024.12.10: Added a curated list of awesome voice assistants.
- 2024.11.24: Expanded the test samples in VoiceBench to include `mmsu`, covering 12 diverse domains from `mmlu-pro`.
- 2024.11.12: Updated the VoiceBench Leaderboard to include: 1) Mini-Omni2, GPT-4o-Audio, and Whisper-v3+GPT-4o, and 2) multiple-choice QA from OpenBookQA.
- 2024.10.30: Expanded the test samples in VoiceBench to include: 1) the complete set of open-ended QA from `alpacaeval`, and 2) multiple-choice QA from `openbookqa`.
Rank | Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench | Overall |
---|---|---|---|---|---|---|---|---|---|
1 | Whisper-v3-large+GPT-4o | 4.80 | 4.47 | 75.77 | 81.69 | 92.97 | 76.51 | 98.27 | 87.23 |
2 | GPT-4o-Audio | 4.78 | 4.49 | 75.50 | 80.25 | 89.23 | 76.02 | 98.65 | 86.42 |
3 | Whisper-v3-large+LLaMA-3.1-8B | 4.53 | 4.04 | 70.43 | 62.43 | 81.54 | 69.53 | 98.08 | 79.06 |
4 | Whisper-v3-turbo+LLaMA-3.1-8B | 4.55 | 4.02 | 58.23 | 62.04 | 72.09 | 71.12 | 98.46 | 76.16 |
5 | Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 | 65.27 | 66.88 | 98.46 | 71.45 |
6 | Whisper-v3-turbo+LLaMA-3.2-3B | 4.45 | 3.82 | 49.28 | 51.37 | 60.66 | 69.71 | 98.08 | 70.66 |
7 | MERaLiON | 4.50 | 3.77 | 55.06 | 34.95 | 27.23 | 62.93 | 94.81 | 62.91 |
8 | GLM-4-Voice | 3.97 | 3.42 | 36.98 | 39.75 | 53.41 | 25.92 | 88.08 | 55.99 |
9 | DiVA | 3.67 | 3.54 | 57.05 | 25.76 | 25.49 | 39.15 | 98.27 | 55.70 |
10 | Qwen2-Audio | 3.74 | 3.43 | 35.71 | 35.72 | 49.45 | 26.33 | 96.73 | 55.35 |
11 | KE-Omni-v1.5 | 3.82 | 3.20 | 31.20 | 32.27 | 58.46 | 15.00 | 100.00 | 53.90 |
12 | Megrez-3B-Omni | 3.50 | 2.95 | 25.95 | 27.03 | 28.35 | 25.71 | 87.69 | 46.25 |
13 | Ichigo | 3.79 | 3.17 | 36.53 | 25.63 | 26.59 | 21.59 | 57.50 | 43.86 |
14 | LLaMA-Omni | 3.70 | 3.46 | 39.69 | 25.93 | 27.47 | 14.87 | 11.35 | 37.51 |
15 | VITA | 3.38 | 2.15 | 27.94 | 25.70 | 29.01 | 22.82 | 26.73 | 34.68 |
16 | Mini-Omni2 | 2.32 | 2.18 | 9.31 | 24.27 | 26.59 | 11.56 | 57.50 | 31.32 |
17 | Mini-Omni | 1.95 | 2.02 | 13.92 | 24.69 | 26.59 | 13.58 | 37.12 | 27.90 |
18 | Moshi | 2.01 | 1.60 | 15.64 | 24.04 | 25.93 | 10.12 | 44.23 | 27.47 |
We encourage you to submit new voice assistant results directly through the issue tracker. The ranking list will be updated accordingly.
conda create -n voicebench python=3.10
conda activate voicebench
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.23 --no-deps
pip install -r requirements.txt
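After installing the dependencies, a quick sanity check (a minimal sketch, not part of this repo) can confirm that the pinned PyTorch/CUDA build is visible from Python:

```python
# Quick environment sanity check (not part of the VoiceBench repo).
import torch
import torchaudio

print("torch:", torch.__version__)            # expect 2.1.2 from the pinned install
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```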
The data used in this project is available at the VoiceBench Dataset hosted on Hugging Face.
You can access it directly via the link and integrate it into your project by using the Hugging Face `datasets` library.
To load the dataset in your Python environment:
from datasets import load_dataset
# Load the VoiceBench dataset
# Available subsets: alpacaeval, commoneval, sd-qa, ifeval, advbench, ...
dataset = load_dataset("hlt-lab/voicebench", 'alpacaeval')
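The sketch below shows one way to inspect a loaded sample; the column names used here (`prompt`, `audio`) are assumptions for illustration, so check the dataset card on Hugging Face for the authoritative schema.

```python
from datasets import load_dataset

# Column names below ('prompt', 'audio') are assumptions; verify them on the dataset card.
dataset = load_dataset("hlt-lab/voicebench", "alpacaeval", split="test")
print(dataset)                      # row count and column names

sample = dataset[0]
print(sample.get("prompt"))         # the spoken instruction as text, if the column exists
audio = sample.get("audio")         # HF Audio feature: dict with 'array' and 'sampling_rate'
if audio is not None:
    print(audio["sampling_rate"], len(audio["array"]))
```

The available subsets are summarized in the table below.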
Subset | # Samples | Audio Source | Task Type |
---|---|---|---|
alpacaeval | 199 | Google TTS | Open-Ended QA |
alpacaeval_full | 636 | Google TTS | Open-Ended QA |
commoneval | 200 | Human | Open-Ended QA |
openbookqa | 455 | Google TTS | Multiple-Choice QA |
mmsu | 3,074 | Google TTS | Multiple-Choice QA |
sd-qa | 553 | Human | Reference-Based QA |
mtbench | 46 | Google TTS | Multi-Turn QA |
ifeval | 345 | Google TTS | Instruction Following |
advbench | 520 | Google TTS | Safety |
PS: `alpacaeval` contains the `helpful_base` and `vicuna` data, while `alpacaeval_full` is constructed with the complete data. `alpacaeval_full` is used in the leaderboard.
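Since `sd-qa` is organized by region rather than a single `test` split (see the `--split` argument below), loading one region might look like the following sketch; the assumption that the Hugging Face splits are named after the region codes used by `main.py` should be verified on the dataset card.

```python
from datasets import load_dataset

# Assumption: the sd-qa subset exposes region-coded splits (e.g. 'usa', 'aus'),
# mirroring the region codes passed to main.py via --split. Verify on the dataset card.
sd_qa_usa = load_dataset("hlt-lab/voicebench", "sd-qa", split="usa")
print(len(sd_qa_usa), "samples in the 'usa' split")
```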
To obtain the responses from the voice assistant model, run the following command:
python main.py --model naive --data alpacaeval --split test --modality audio
Supported Arguments:
- `--model`: Specifies the model to use for generating responses. Replace `naive` with the model you want to test (e.g., `qwen2`, `diva`).
- `--data`: Selects the subset of the dataset. Replace `alpacaeval` with other subsets like `commoneval`, `sd-qa`, etc., depending on your evaluation needs.
- `--split`: Chooses the data split to evaluate.
  - For most datasets (`alpacaeval`, `commoneval`, `ifeval`, `advbench`), use `test` as the value.
  - For the `sd-qa` subset, you should provide a region code instead of `test`, such as `aus` for Australia or `usa` for the United States.
- `--modality`: Use `audio` for spoken instructions, `text` for text-based instructions.
This will generate the output and save it to a file named `naive-alpacaeval-test-audio.jsonl`.
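To collect responses for several subsets in one pass, a small driver script can wrap `main.py`. This is a convenience sketch rather than anything shipped with the repo; it only uses the flags documented above, and the generalized output filename pattern is inferred from the example above.

```python
# Convenience wrapper around main.py (not part of the repo); it only uses the
# documented flags: --model, --data, --split, --modality.
import subprocess

MODEL = "naive"            # replace with the model under test, e.g. "qwen2" or "diva"
RUNS = [
    ("alpacaeval", "test"),
    ("commoneval", "test"),
    ("ifeval", "test"),
    ("advbench", "test"),
    ("sd-qa", "usa"),      # sd-qa expects a region code instead of "test"
]

for data, split in RUNS:
    cmd = [
        "python", "main.py",
        "--model", MODEL,
        "--data", data,
        "--split", split,
        "--modality", "audio",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Each run should produce a response file such as naive-alpacaeval-test-audio.jsonl
# (the exact naming for other subsets is inferred from that example).
```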
For datasets like `alpacaeval`, `commoneval`, and `sd-qa`, we use `gpt-4o-mini` to evaluate the responses. Run the following command to get the GPT score:
python api_judge.py --src_file naive-alpacaeval-test-audio.jsonl
The GPT evaluation scores will be saved to `result-naive-alpacaeval-test-audio.jsonl`.
Note: This step should be skipped for the `advbench` and `ifeval` subsets, as they are not evaluated with the GPT judge.
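The gating logic can be scripted as well. The sketch below is not part of the repo; it only encodes what is stated above (GPT scoring applies to `alpacaeval`, `commoneval`, and `sd-qa`, and is skipped for `advbench` and `ifeval`).

```python
# Run the GPT judge only where the README says it applies (sketch, not part of the repo).
import subprocess

GPT_JUDGED = {"alpacaeval", "commoneval", "sd-qa"}   # scored with gpt-4o-mini
SKIP_JUDGE = {"advbench", "ifeval"}                  # explicitly skipped above

def judge_if_needed(subset: str, response_file: str) -> None:
    if subset in GPT_JUDGED:
        subprocess.run(["python", "api_judge.py", "--src_file", response_file], check=True)
    elif subset in SKIP_JUDGE:
        print(f"{subset}: GPT judging skipped")
    else:
        print(f"{subset}: consult the README before judging")

judge_if_needed("alpacaeval", "naive-alpacaeval-test-audio.jsonl")
```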
To generate the final evaluation results, run:
python evaluate.py --src_file result-naive-alpacaeval-test-audio.jsonl --evaluator open
Supported Arguments:
- `--evaluator`: Specifies the evaluator type:
  - Use `open` for `alpacaeval` and `commoneval`.
  - Use `qa` for `sd-qa`.
  - Use `ifeval` for `ifeval`.
  - Use `harm` for `advbench`.
  - Use `mcq` for `openbookqa` and `mmsu`.
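Putting the mapping above into code, a small helper (a sketch, not shipped with the repo) can select the right `--evaluator` for each subset:

```python
# Map each subset to its --evaluator value, exactly as listed above (sketch only).
import subprocess

EVALUATOR_FOR_SUBSET = {
    "alpacaeval": "open",
    "commoneval": "open",
    "sd-qa": "qa",
    "ifeval": "ifeval",
    "advbench": "harm",
    "openbookqa": "mcq",
    "mmsu": "mcq",
}

def run_final_eval(subset: str, src_file: str) -> None:
    evaluator = EVALUATOR_FOR_SUBSET[subset]
    subprocess.run(
        ["python", "evaluate.py", "--src_file", src_file, "--evaluator", evaluator],
        check=True,
    )

run_final_eval("alpacaeval", "result-naive-alpacaeval-test-audio.jsonl")
```

For GPT-judged subsets the input is the `result-*.jsonl` file produced by `api_judge.py`; for subsets that skip the judge, the response file from `main.py` is presumably passed directly.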
If you use VoiceBench in your research, please cite the following paper:
@article{chen2024voicebench,
title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
journal={arXiv preprint arXiv:2410.17196},
year={2024}
}