AudioLLM

Audio Large Language Models

Stars: 71


AudioLLMs is a curated collection of research papers on the development, implementation, and evaluation of language models for audio data. The repository aims to give researchers and practitioners a comprehensive resource for exploring the latest advancements in AudioLLMs. It includes models for speech interaction, speech recognition, speech translation, audio generation, and more. It also covers methodologies such as multitask AudioLLMs and the segment-level Q-Former, evaluation benchmarks such as AudioBench and AIR-Bench, and adversarial attacks such as VoiceJailbreak.

README:

AudioLLMs

This repository is a curated collection of research papers focused on the development, implementation, and evaluation of language models for audio data. Our goal is to provide researchers and practitioners with a comprehensive resource to explore the latest advancements in AudioLLMs. Contributions and suggestions for new papers are highly encouraged!

Models

| Date | Model | Key Affiliations | Paper | Link |
| --- | --- | --- | --- | --- |
| 2024-10 | DiVA | Georgia Tech, Stanford | Distilling an End-to-End Voice Assistant Without Instruction Training Data | Paper / Project |
| 2024-09 | Moshi | Kyutai | Moshi: a speech-text foundation model for real-time dialogue | Paper / Code |
| 2024-09 | LLaMA-Omni | CAS | LLaMA-Omni: Seamless Speech Interaction with Large Language Models | Paper / Code |
| 2024-09 | Ultravox | fixie-ai | GitHub Open Source | Code |
| 2024-08 | Mini-Omni | Tsinghua | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Paper / Code |
| 2024-08 | Typhoon-Audio | Typhoon | Typhoon-Audio Preview Release | Page |
| 2024-08 | USDM | SNU | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation | Paper |
| 2024-08 | MooER | Moore Threads | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Paper / Code |
| 2024-07 | GAMA | UMD | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Paper / Code |
| 2024-07 | LLaST | CUHK-SZ | LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models | Paper / Code |
| 2024-07 | CompA | University of Maryland | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | Paper / Code / Project |
| 2024-07 | Qwen2-Audio | Alibaba | Qwen2-Audio Technical Report | Paper / Code |
| 2024-07 | FunAudioLLM | Alibaba | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs | Paper / Code / Demo |
| 2024-06 | DeSTA | NTU-Taiwan, Nvidia | DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment | Paper / Code |
| 2024-05 | AudioChatLlama | Meta | AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs | Paper |
| 2024-05 | Audio Flamingo | Nvidia | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | Paper / Code |
| 2024-05 | SpeechVerse | AWS | SpeechVerse: A Large-scale Generalizable Audio Language Model | Paper |
| 2024-04 | SALMONN | Tsinghua | SALMONN: Towards Generic Hearing Abilities for Large Language Models | Paper / Code / Demo |
| 2024-03 | WavLLM | CUHK | WavLLM: Towards Robust and Adaptive Speech Large Language Model | Paper / Code |
| 2024-02 | SLAM-LLM | MoE Key Lab of Artificial Intelligence | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity | Paper / Code |
| 2024-01 | Pengi | Microsoft | Pengi: An Audio Language Model for Audio Tasks | Paper / Code |
| 2023-12 | Qwen-Audio | Alibaba | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models | Paper / Code / Demo |
| 2023-12 | LTU-AS | MIT | Joint Audio and Speech Understanding | Paper / Code / Demo |
| 2023-10 | Speech-LLaMA | Microsoft | On decoder-only architecture for speech-to-text and large language model integration | Paper |
| 2023-10 | UniAudio | CUHK | An Audio Foundation Model Toward Universal Audio Generation | Paper / Code / Demo |
| 2023-09 | LLaSM | LinkSoul.AI | LLaSM: Large Language and Speech Model | Paper / Code |
| 2023-06 | AudioPaLM | Google | AudioPaLM: A Large Language Model that Can Speak and Listen | Paper / Demo |
| 2023-05 | VioLA | Microsoft | VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation | Paper |
| 2023-05 | SpeechGPT | Fudan | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | Paper / Code / Demo |
| 2023-04 | AudioGPT | Zhejiang Uni | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper / Code |
| 2022-09 | AudioLM | Google | AudioLM: a Language Modeling Approach to Audio Generation | Paper / Demo |

Models (language + audio + other modalities)

| Date | Model | Key Affiliations | Paper | Link |
| --- | --- | --- | --- | --- |
| 2024-09 | EMOVA | HKUST | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | Paper / Demo |
| 2023-11 | CoDi-2 | UC Berkeley | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Paper / Code / Demo |
| 2023-06 | Macaw-LLM | Tencent | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | Paper / Code |

Methodology

| Date | Name | Key Affiliations | Paper | Link |
| --- | --- | --- | --- | --- |
| 2024-09 | AudioBERT | Postech | AudioBERT: Audio Knowledge Augmented Language Model | Paper / Code |
| 2024-09 | MoWE-Audio | A*STAR | MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders | Paper |
| 2024-09 | - | Tsinghua SIGS | Comparing Discrete and Continuous Space LLMs for Speech Recognition | Paper |
| 2024-06 | Speech ReaLLM | Meta | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time | Paper |
| 2023-09 | Segment-level Q-Former | Tsinghua | Connecting Speech Encoder and Large Language Model for ASR | Paper |
| 2023-07 | - | Meta | Prompting Large Language Models with Speech Recognition Abilities | Paper |

Adversarial Attacks

| Date | Name | Key Affiliations | Paper | Link |
| --- | --- | --- | --- | --- |
| 2024-05 | VoiceJailbreak | CISPA | Voice Jailbreak Attacks Against GPT-4o | Paper |

Evaluation

| Date | Name | Key Affiliations | Paper | Link |
| --- | --- | --- | --- | --- |
| 2024-08 | MuChoMusic | UPF, QMUL, UMG | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models | Paper / Code |
| 2024-06 | AudioBench | A*STAR | AudioBench: A Universal Benchmark for Audio Large Language Models | Paper / Code / LeaderBoard |
| 2024-05 | AIR-Bench | ZJU, Alibaba | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Paper / Code |
| 2023-09 | Dynamic-SUPERB | NTU-Taiwan, etc. | Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech | Paper / Code |

Audio Model

Audio models are distinct from Audio Large Language Models; the work below evaluates audio models rather than AudioLLMs.

Evaluation

| Date | Name | Key Affiliations | Paper | Link |
| --- | --- | --- | --- | --- |
| 2024-09 | Salmon | Hebrew University of Jerusalem | A Suite for Acoustic Language Model Evaluation | Paper / Code |
