speech-trident

speech-trident

Awesome speech/audio LLMs, representation learning, and codec models

Stars: 636

Visit
 screenshot

Speech Trident is a repository focusing on speech/audio large language models, covering representation learning, neural codec, and language models. It explores speech representation models, speech neural codec models, and speech large language models. The repository includes contributions from various researchers and provides a comprehensive list of speech/audio language models, representation models, and codec models.

README:

🔱 Speech Trident - Awesome Speech LM

Speech Trident

In this repository, we survey three crucial areas: (1) representation learning, (2) neural codec, and (3) language models that contribute to speech/audio large language models.

1.âš¡ Speech Representation Models: These models focus on learning structural speech representations, which can then be quantized into discrete speech tokens, often refer to semantic tokens.

2.âš¡ Speech Neural Codec Models: These models are designed to learn speech and audio discrete tokens, often referred to as acoustic tokens, while maintaining reconstruction ability and low bitrate.

3.âš¡ Speech Large Language Models: These models are trained on top of speech and acoustic tokens in a language modeling approach. They demonstrate proficiency in tasks on speech understanding and speech generation.

🔱 Contributors


Kai-Wei Chang

Haibin Wu

Wei-Cheng Tseng

Kehan Lu

Chun-Yi Kuan

Hung-yi Lee

🔱 Speech/Audio Language Models

Date Model Name Paper Title Link
2024-09 Moshi Moshi: a speech-text foundation model for real-time dialogue paper
2024-09 Takin AudioLLM Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models paper
2024-09 FireRedTTS FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications paper
2024-09 LLaMA-Omni LLaMA-Omni: Seamless Speech Interaction with Large Language Models paper
2024-09 MaskGCT MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer paper
2024-09 SSR-Speech SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis paper
2024-09 MoWE-Audio MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders paper
2024-08 Mini-Omni Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming paper
2024-08 Make-A-Voice 2 Language Model Can Listen While Speaking paper
2024-08 LSLM Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learner paper
2024-06 SimpleSpeech SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models paper
2024-06 UniAudio 1.5 UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner paper
2024-06 VALL-E R VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment paper
2024-06 VALL-E 2 VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers paper
2024-06 GPST Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer paper
2024-04 CLaM-TTS CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech paper
2024-04 RALL-E RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis paper
2024-04 WavLLM WavLLM: Towards Robust and Adaptive Speech Large Language Model paper
2024-02 MobileSpeech MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech paper
2024-02 SLAM-ASR An Embarrassingly Simple Approach for LLM with Strong ASR Capacity paper
2024-02 AnyGPT AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling paper
2024-02 SpiRit-LM SpiRit-LM: Interleaved Spoken and Written Language Model paper
2024-02 USDM Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation paper
2024-02 BAT BAT: Learning to Reason about Spatial Sounds with Large Language Models paper
2024-02 Audio Flamingo Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities paper
2024-02 Text Description to speech Natural language guidance of high-fidelity text-to-speech with synthetic annotations paper
2024-02 GenTranslate GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators paper
2024-02 Base-TTS BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data paper
2024-02 -- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition paper
2024-01 -- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition paper
2024-01 ELLA-V ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering paper
2023-12 Seamless Seamless: Multilingual Expressive and Streaming Speech Translation paper
2023-11 Qwen-Audio Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models paper
2023-10 LauraGPT LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT paper
2023-10 SALMONN SALMONN: Towards Generic Hearing Abilities for Large Language Models paper
2023-10 UniAudio UniAudio: An Audio Foundation Model Toward Universal Audio Generation paper
2023-10 Whispering LLaMA Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition paper
2023-09 VoxtLM Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks paper
2023-09 LTU-AS Joint Audio and Speech Understanding paper
2023-09 SLM SLM: Bridge the thin gap between speech and text foundation models paper
2023-09 -- Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting paper
2023-08 SpeechGen SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts paper
2023-08 SpeechX SpeechX: Neural Codec Language Model as a Versatile Speech Transformer paper
2023-08 LLaSM Large Language and Speech Model paper
2023-08 SeamlessM4T Massively Multilingual & Multimodal Machine Translation paper
2023-07 Speech-LLaMA On decoder-only architecture for speech-to-text and large language model integration paper
2023-07 LLM-ASR(temp.) Prompting Large Language Models with Speech Recognition Abilities paper
2023-06 AudioPaLM AudioPaLM: A Large Language Model That Can Speak and Listen paper
2023-05 Make-A-Voice Make-A-Voice: Unified Voice Synthesis With Discrete Representation paper
2023-05 Spectron Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM paper
2023-05 TWIST Textually Pretrained Speech Language Models paper
2023-05 Pengi Pengi: An Audio Language Model for Audio Tasks paper
2023-05 SoundStorm Efficient Parallel Audio Generation paper
2023-05 LTU Joint Audio and Speech Understanding paper
2023-05 SpeechGPT Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities paper
2023-05 VioLA Unified Codec Language Models for Speech Recognition, Synthesis, and Translation paper
2023-05 X-LLM X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages paper
2023-03 Google USM Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages paper
2023-03 VALL-E X Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling paper
2023-02 SPEAR-TTS Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision paper
2023-01 VALL-E Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers paper
2022-12 Whisper Robust Speech Recognition via Large-Scale Weak Supervision paper
2022-10 AudioGen AudioGen: Textually Guided Audio Generation paper
2022-09 AudioLM AudioLM: a Language Modeling Approach to Audio Generation paper
2022-05 Wav2Seq Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages paper
2022-04 Unit mBART Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation paper
2022-03 d-GSLM Generative Spoken Dialogue Language Modeling paper
2021-10 SLAM SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training paper
2021-09 p-GSLM Text-Free Prosody-Aware Generative Spoken Language Modeling paper
2021-02 GSLM Generative Spoken Language Modeling from Raw Audio paper

🔱 Speech/Audio Representation Models

Date Model Name Paper Title Link
2024-09 NEST-RQ NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training paper
2024-01 EAT Self-Supervised Pre-Training with Efficient Audio Transformer paper
2023-10 MR-HuBERT Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction paper
2023-10 SpeechFlow Generative Pre-training for Speech with Flow Matching paper
2023-09 WavLabLM Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning paper
2023-08 W2v-BERT 2.0 Massively Multilingual & Multimodal Machine Translation paper
2023-07 Whisper-AT Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers paper
2023-06 ATST Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks paper
2023-05 SPIN Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering paper
2023-05 DinoSR Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning paper
2023-05 NFA Self-supervised neural factor analysis for disentangling utterance-level speech representations paper
2022-12 Data2vec 2.0 Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language paper
2022-12 BEATs Audio Pre-Training with Acoustic Tokenizers paper
2022-11 MT4SSL MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets paper
2022-08 DINO Non-contrastive self-supervised learning of utterance-level speech representations paper
2022-07 Audio-MAE Masked Autoencoders that Listen paper
2022-04 MAESTRO Matched Speech Text Representations through Modality Matching paper
2022-03 MAE-AST Masked Autoencoding Audio Spectrogram Transformer paper
2022-03 LightHuBERT Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT paper
2022-02 Data2vec A General Framework for Self-supervised Learning in Speech, Vision and Language paper
2021-10 WavLM WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing paper
2021-08 W2v-BERT Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training paper
2021-07 mHuBERT Direct speech-to-speech translation with discrete units paper
2021-06 HuBERT Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units paper
2021-03 BYOL-A Self-Supervised Learning for General-Purpose Audio Representation paper
2020-12 DeCoAR2.0 DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization paper
2020-07 TERA TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech paper
2020-06 Wav2vec2.0 wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations paper
2019-10 APC Generative Pre-Training for Speech with Autoregressive Predictive Coding paper
2018-07 CPC Representation Learning with Contrastive Predictive Coding paper

🔱 Speech/Audio Codec Models

Date Model Name Paper Title Link
2024-09 Mimi Moshi: a speech-text foundation model for real-time dialogue paper
2024-09 NDVQ NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization paper
2024-09 SoCodec SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis paper
2024-09 BigCodec BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec paper
2024-08 X-Codec Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model paper
2024-08 WavTokenizer WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling paper
2024-07 Super-Codec SuperCodec: A Neural Speech Codec with Selective Back-Projection Network paper
2024-06 Single-Codec Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation paper
2024-06 SQ-Codec SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models paper
2024-06 PQ-VAE Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder paper
2024-06 LLM-Codec UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner paper
2024-05 HILCodec HILCodec: High Fidelity and Lightweight Neural Audio Codec paper
2024-04 SemantiCodec SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound paper
2024-04 PromptCodec PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders paper
2024-04 ESC ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers paper
2024-03 FACodec NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models paper
2024-02 AP-Codec APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding paper
2024-02 Language-Codec Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models paper
2024-01 ScoreDec ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter paper
2023-11 HierSpeech++ HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis paper
2023-10 TiCodec FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES paper
2023-09 RepCodec RepCodec: A Speech Representation Codec for Speech Tokenization paper
2023-09 FunCodec FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec paper
2023-08 SpeechTokenizer Speechtokenizer: Unified speech tokenizer for speech large language models paper
2023-06 VOCOS VOCOS: CLOSING THE GAP BETWEEN TIME-DOMAIN AND FOURIER-BASED NEURAL VOCODERS FOR HIGH-QUALITY AUDIO SYNTHESIS paper
2023-06 Descript-audio-codec High-Fidelity Audio Compression with Improved RVQGAN paper
2023-05 AudioDec Audiodec: An open-source streaming highfidelity neural audio codec paper
2023-05 HiFi-Codec Hifi-codec: Group-residual vector quantization for high fidelity audio codec paper
2023-03 LMCodec LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models paper
2022-11 Disen-TF-Codec Disentangled Feature Learning for Real-Time Neural Speech Coding paper
2022-10 EnCodec High fidelity neural audio compression paper
2022-07 S-TFNet Cross-Scale Vector Quantization for Scalable Neural Speech Coding paper
2022-01 TFNet End-to-End Neural Speech Coding for Real-Time Communications paper
2021-07 SoundStream SoundStream: An End-to-End Neural Audio Codec paper

🔱 Interspeech 2024 Survey Talk

Professor Hung-Yi Lee will be giving a talk as part of the Interspeech 2024 survey talk titled Challenges in Developing Spoken Language Models. The topic will cover nowday's speech/audio large language models.

🔱 ICASSP 2024 Tutorial Information

I (Kai-Wei Chang) will be giving a talk as part of the ICASSP 2024 tutorial titled Parameter-Efficient and Prompt Learning for Speech and Language Foundation Models. The topic will cover nowday's speech/audio large language models.

Tutorial speakers:

  • Dr. Huck Yang (NVIDIA)
  • Dr. Pin-Yu Chen (IBM Research)
  • Prof. Hung-yi Lee (National Taiwan University)
  • Kai-Wei Chang (National Taiwan University)
  • Cheng-Han Chiang (National Taiwan University)

See you in Seoul!

🔱 Update: The Tutorial was successfully conducted at ICASSP 2024. Thank all attendees for their participation. The slides from my presentation is available at https://kwchang.org/talks/. Please feel free to reach out to me for any discussions.

🔱 Related Repository

Citation

If you find this repository useful, please consider citing the following papers.

@article{wu2024codec,
  title={Codec-SUPERB: An In-Depth Analysis of Sound Codec Models},
  author={Wu, Haibin and Chung, Ho-Lam and Lin, Yi-Cheng and Wu, Yuan-Kuei and Chen, Xuanjun and Pai, Yu-Chi and Wang, Hsiu-Hsuan and Chang, Kai-Wei and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13071},
  year={2024}
}
@article{wu2024towards,
  title={Towards audio language modeling-an overview},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kai-wei and Chung, Ho-Lam and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13236},
  year={2024}
}

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for speech-trident

Similar Open Source Tools

For similar tasks

For similar jobs