Awesome-Audio-LLM


Audio Large Language Models


Awesome-Audio-LLM is a curated collection of models, methods, benchmarks, surveys, and studies for audio and language processing. It gathers research papers and released models from a wide range of institutions, covering topics such as bridging audio and language with large language models, speech emotion recognition, voice assistants, and spoken-dialogue evaluation. It serves as a comprehensive resource for anyone working at the intersection of audio and language; for a concrete starting point, see the usage sketch below.
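
The following is a minimal sketch of querying one of the listed models, Qwen2-Audio, through the Hugging Face transformers API. The model ID, chat format, and processor calls are modeled on the public Qwen2-Audio model card, and `sample.wav` is a placeholder for your own audio file; treat this as a sketch to verify against that card, since other entries in the list ship their own loading code.

```python
# Minimal sketch: asking an audio LLM (Qwen2-Audio) a question about a clip.
# The model ID and chat format follow the public model card; "sample.wav"
# is a placeholder path.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# One chat turn pairing an audio clip with a text question.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "What can you hear in this clip?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens and decode only the newly generated reply.
reply = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```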


🌟🌟🌟 Found interesting work, or want your work on board? Raise an Issue or a Pull Request! :)

Contributors

We thank the following contributors for their valuable contributions: zwenyu, Yuan-ManX, chaoweihuang, Liu-Tianchi, Sakshi113.

Table of Contents

Timeline Visualization

Abbreviations with Links

Models and Methods

  • 【2024-12】-【MERaLiON-AudioLLM】-【I2R, A*STAR, Singapore】-【Type: Model】

    • MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
    • Author(s): Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw
    • Paper / Hugging Face Model / Demo
  • 【2024-11】-【Taiwanese AudioLLM】-【National Taiwan University】-【Type: Model】

    • Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
    • Author(s): Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, Yi-Cheng Lin, Yu-Xiang Lin, Wei-Chih Chen, Ho Lam Chung, Chun-Yi Kuan, Wei-Ping Huang, Ke-Han Lu, Tzu-Quan Lin, Hsiu-Hsuan Wang, En-Pei Hu, Chan-Jan Hsu, Liang-Hsuan Tseng, I-Hsiang Chiu, Ulin Sanga, Xuanjun Chen, Po-chun Hsu, Shu-wen Yang, Hung-yi Lee
    • Paper
  • 【2024-10】-【SPIRIT LM】-【Meta】-【Type: Model】

    • SPIRIT LM: Interleaved Spoken and Written Language Model
    • Author(s): Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux
    • GitHub stars
    • Paper / Demo
  • 【2024-10】-【DiVA】-【Georgia Tech, Stanford】-【Type: Model】

    • Distilling an End-to-End Voice Assistant Without Instruction Training Data
    • Author(s): William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang
    • GitHub stars
    • Paper / Demo
  • 【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】-【Type: Model】

    • Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
    • Author(s): Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar, Ozlem Kalinli
    • Paper
  • 【2024-09】-【Moshi】-【Kyutai】-【Type: Model】

    • Moshi: a speech-text foundation model for real-time dialogue
    • Author(s): Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
    • GitHub stars
    • Paper
  • 【2024-09】-【LLaMA-Omni】-【Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)】-【Type: Model】

    • LLaMA-Omni: Seamless Speech Interaction with Large Language Models
    • Author(s): Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
    • GitHub stars
    • Paper
  • 【2024-09】-【Ultravox】-【Fixie.ai】-【Type: Model】

    • Ultravox: A Fast Multimodal LLM for Real-Time Voice
    • Author(s): Not listed
    • GitHub stars
  • 【2024-09】-【MoWE-Audio】-【A*STAR】-【Type: Model】

    • MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
    • Author(s): Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
    • Paper
  • 【2024-09】-【AudioBERT】-【POSTECH, Inha University】-【Type: Model】

    • AudioBERT: Audio Knowledge Augmented Language Model
    • Author(s): Hyunjong Ok, Suho Yoo, Jaeho Lee
    • GitHub stars
    • Paper
  • 【2024-09】-【DeSTA2】-【National Taiwan University, NVIDIA】-【Type: Model】

    • Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
    • Author(s): Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
    • GitHub stars
    • Paper
  • 【2024-09】-【ASRCompare】-【Tsinghua University, Tencent AI Lab】-【Type: Model】

    • Comparing Discrete and Continuous Space LLMs for Speech Recognition
    • Author(s): Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu
    • GitHub stars
    • Paper
  • 【2024-08】-【MooER】-【Moore Threads】-【Type: Model】

    • MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
    • Author(s): Zhenlin Liang, Junhao Xu, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang
    • GitHub stars
    • Paper
  • 【2024-08】-【Mini-Omni】-【Tsinghua University】-【Type: Model】

    • Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
    • Author(s): Zhifei Xie, Changqiao Wu
    • GitHub stars
    • Paper
  • 【2024-07】-【FunAudioLLM】-【Alibaba】-【Type: Model】

    • FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
    • Author(s): Not listed
    • GitHub stars
    • Paper / Demo
  • 【2024-07】-【Qwen2-Audio】-【Alibaba Group】-【Type: Model】

    • Qwen2-Audio Technical Report
    • Author(s): Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou
    • GitHub stars
    • Paper
  • 【2024-07】-【GAMA】-【University of Maryland, College Park】-【Type: Model】

    • GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
    • Author(s): Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
    • GitHub stars
    • Paper / Demo
  • 【2024-07】-【LLaST】-【The Chinese University of Hong Kong, Shenzhen; Shanghai AI Laboratory; Nara Institute of Science and Technology, Japan】-【Type: Model】

    • LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
    • Author(s): Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura
    • GitHub stars
    • Paper
  • 【2024-07】-【Decoder-only LLMs for STT】-【NTU-Taiwan, Meta】-【Type: Research】

    • Investigating Decoder-only Large Language Models for Speech-to-text Translation
    • Author(s): Not listed
    • Paper
  • 【2024-07】-【CompA】-【University of Maryland, College Park; Adobe, USA; NVIDIA, Bangalore, India】-【Type: Model】

    • CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
    • Author(s): Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
    • GitHub stars
    • Paper / Demo
  • 【2024-06】-【DeSTA】-【NTU-Taiwan, Nvidia】-【Type: Model】

    • DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
    • Author(s): Not listed
    • GitHub stars
    • Paper
  • 【2024-06】-【Speech ReaLLM】-【Meta】-【Type: Model】

    • Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
    • Author(s): Not listed
    • Paper
  • 【2024-05】-【Audio Flamingo】-【Nvidia】-【Type: Model】

    • Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
    • Author(s): Not listed
    • GitHub stars
    • Paper
  • 【2024-04】-【SALMONN】-【Tsinghua】-【Type: Model】

    • SALMONN: Towards Generic Hearing Abilities for Large Language Models
    • Author(s): Not listed
    • GitHub stars
    • Paper / Demo
  • 【2024-03】-【WavLLM】-【CUHK】-【Type: Model】

    • WavLLM: Towards Robust and Adaptive Speech Large Language Model
    • Author(s): Not listed
    • GitHub stars
    • Paper
  • 【2024-02】-【SLAM-LLM】-【Shanghai Jiao Tong University (SJTU)】-【Type: Model】

    • An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
    • Author(s): Not listed
    • GitHub stars
    • Paper
  • 【2024-01】-【Pengi】-【Microsoft】-【Type: Model】

    • Pengi: An Audio Language Model for Audio Tasks
    • Author(s): Not listed
    • GitHub stars
    • Paper
  • 【2023-12】-【Qwen-Audio】-【Alibaba】-【Type: Model】

    • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
    • Author(s): Not listed
    • GitHub stars
    • Paper / Demo
  • 【2023-10】-【UniAudio】-【Chinese University of Hong Kong (CUHK)】-【Type: Model】

    • UniAudio: An Audio Foundation Model Toward Universal Audio Generation
    • Author(s): Not listed
    • GitHub stars
    • Paper / Demo
  • 【2023-09】-【LLaSM】-【LinkSoul.AI】-【Type: Model】

    • LLaSM: Large Language and Speech Model
    • Author(s): Not listed
    • GitHub stars
    • Paper
  • 【2023-09】-【Segment-level Q-Former】-【Tsinghua University, ByteDance】-【Type: Model】

    • Connecting Speech Encoder and Large Language Model for ASR
    • Author(s): Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
    • Paper
  • 【2023-07】-【Prompting LLMs with Speech Recognition】-【Meta】-【Type: Model】

    • Prompting Large Language Models with Speech Recognition Abilities
    • Author(s): Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
    • Paper
  • 【2023-05】-【SpeechGPT】-【Fudan University】-【Type: Model】

    • SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
    • Author(s): Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
    • GitHub stars
    • Paper / Demo
  • 【2023-04】-【AudioGPT】-【Zhejiang University】-【Type: Model】

    • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
    • Author(s): Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe
    • GitHub stars
    • Paper

Benchmark

  • 【2024-12】-【ADU-Bench】-【Tsinghua University, University of Oxford】-【Type: Benchmark】

    • Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
    • Author(s): Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu
    • Paper
  • 【2024-11】-【Dynamic-SUPERB Phase-2】-【National Taiwan University, University of Texas at Austin, Carnegie Mellon University, Nanyang Technological University, Toyota Technological Institute at Chicago, Université du Québec (INRS-EMT), NVIDIA, ASAPP, Renmin University of China】-【Type: Evaluation Framework】

    • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
    • Author(s): Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee
    • GitHub stars
    • Paper / Other Link
  • 【2024-10】-【VoiceBench】-【National University of Singapore】-【Type: Benchmark】

    • VoiceBench: Benchmarking LLM-Based Voice Assistants
    • Author(s): Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li
    • GitHub stars
    • Paper
  • 【2024-10】-【MMAU】-【University of Maryland】-【Type: Benchmark】

    • MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
    • Author(s): S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
    • GitHub stars
    • Paper / Other Link
  • 【2024-09】-【SALMon】-【Hebrew University of Jerusalem】-【Type: Benchmark】

    • A Suite for Acoustic Language Model Evaluation
    • Author(s): Gallil Maimon, Amit Roth, Yossi Adi
    • GitHub stars
    • Paper / Demo
  • 【2024-08】-【MuChoMusic】-【UPF, QMUL, UMG】-【Type: Benchmark】

    • MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
    • Author(s): Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, Dmitry Bogdanov
    • GitHub stars
    • Paper
  • 【2024-07】-【AudioEntailment】-【CMU, Microsoft】-【Type: Benchmark】

    • Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
    • Author(s): Soham Deshmukh, Shuo Han, Hazim Bukhari, Benjamin Elizalde, Hannes Gamper, Rita Singh, Bhiksha Raj
    • GitHub stars
    • Paper
  • 【2024-06】-【SD-Eval】-【CUHK, Bytedance】-【Type: Benchmark】

    • SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
    • Author(s): Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu
    • GitHub stars
    • Paper
  • 【2024-06】-【AudioBench】-【A*STAR, Singapore】-【Type: Benchmark】

    • AudioBench: A Universal Benchmark for Audio Large Language Models
    • Author(s): Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen
    • GitHub stars
    • Paper / Demo
  • 【2024-05】-【AIR-Bench】-【ZJU, Alibaba】-【Type: Benchmark】

    • AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
    • Author(s): Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou
    • GitHub stars
    • Paper
  • 【2023-09】-【Dynamic-SUPERB】-【NTU-Taiwan, etc.】-【Type: Benchmark】

    • Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
    • Author(s): Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, Roshan Sharma, Shinji Watanabe, Bhiksha Ramakrishnan, Shady Shehata, Hung-yi Lee
    • GitHub stars
    • Paper
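
Note that many of the benchmarks above score ASR-style tasks with word error rate (WER). As a reference for what that metric computes, here is a minimal sketch using the jiwer library; this illustrates the metric only, not the scoring code of any benchmark listed here, and each benchmark applies its own text normalization before scoring.

```python
# Illustration only: word error rate (WER), the metric behind many
# ASR-style benchmark tasks. Real benchmarks normalize text (casing,
# punctuation) before scoring.
import jiwer

references = ["the cat sat on the mat", "audio language models are fun"]
hypotheses = ["the cat sat on a mat", "audio language models are fun"]

# WER = (substitutions + insertions + deletions) / total reference words
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")  # one substitution in 11 words
```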

Multimodal

  • 【2024-09】-【EMOVA】-【HKUST】-【Type: Model】

    • EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
    • Author(s): Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu
    • Paper / Demo
  • 【2023-11】-【CoDi-2】-【UC Berkeley】-【Type: Model】

    • CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
    • Author(s): Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal
    • GitHub stars
    • Paper / Demo
  • 【2023-06】-【Macaw-LLM】-【Tencent】-【Type: Model】

    • Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
    • Author(s): Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu
    • GitHub stars
    • Paper

Survey

  • 【2024-11】-【WavChat-Survey】-【Zhejiang University】-【Type: Survey】

    • WavChat: A Survey of Spoken Dialogue Models
    • Author(s): Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao
    • Paper
  • 【2024-10】-【SpeechLLM-Survey】-【SJTU, AISpeech】-【Type: Survey】

    • A Survey on Speech Large Language Models
    • Author(s): Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu
    • Paper
  • 【2024-10】-【SpeechLM-Survey】-【CUHK, Tencent】-【Type: Survey】

    • Recent Advances in Speech Language Models: A Survey
    • Author(s): Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King
    • Paper

Study

  • 【2024-06】-【Audio Hallucination】-【NTU-Taiwan】-【Type: Research】
    • Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
    • Author(s): Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee
    • GitHub stars
    • Paper

Safety

  • 【2024-05】-【VoiceJailbreak】-【CISPA】-【Type: Method】
    • Voice Jailbreak Attacks Against GPT-4o
    • Author(s): Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang
    • GitHub stars
    • Paper
