LLaSA_training

LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis

LLaSA_training is a repository for training LLaMA-based speech synthesis models on large amounts of open-source speech data. It provides finetuning instructions, pre-trained models for multilingual speech synthesis, and tools for training, data downloading, and data processing, built around a unified tokenizer that covers both text and speech sequences. The models and the accompanying codec can also be used directly on the Hugging Face platform.

README:

Update (2025-02-13): Added Llasa finetuning instructions. You can try the finetuning results here:

Update (2025-02-07): Our paper has been released! The Llasa 1B multilingual version has also been released!

Training

torchrun --nproc_per_node=8 train_tts.py config.json 

or

sbatch run_slurm.sh
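
train_tts.py takes a single JSON config argument, and torchrun launches one copy of it per GPU. As a rough orientation, here is a minimal sketch, assuming a conventional PyTorch distributed setup, of how such an entry point might consume the config path; the config keys and overall structure are assumptions, not the repository's actual schema.

```python
# Minimal sketch (not the repository's actual code) of a torchrun-launched
# training entry point that reads the config.json path from the command line.
import json
import os
import sys

import torch
import torch.distributed as dist


def main() -> None:
    config_path = sys.argv[1]  # e.g. "config.json"
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)  # hypothetical keys: model path, learning rate, batch size, ...

    # torchrun exports RANK / WORLD_SIZE / LOCAL_RANK for every worker process.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    if dist.get_rank() == 0:
        print(f"Loaded config with keys: {sorted(config)}")

    # ... build the tokenizer, model, and dataloader from `config`, then run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```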

Data

You can download tokenized open-source speech data here. This includes LibriHeavy, Emilia (in both Chinese and English), and WenetSpeech4TTS, totaling approximately 160,000 hours of open-source data.

Our models are trained on 250,000 hours of speech data. Of this, 160,000 hours come from the open-source datasets mentioned above, while the remaining 90,000 hours are from internal datasets, which are not yet available for open-source release.

Data instruction

Text_sequence is encoded with the Llama text tokenizer, for example Llama-3.2-1B-Instruct.

Speech_sequence is extracted with X-codec2. We shift the value of each speech token by adding len(text tokenizer) + 8 special tokens, thereby forming a unified tokenizer that covers both speech and text.
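
In other words, a speech token with raw codec index k is stored as k + len(text tokenizer) + 8 in the combined vocabulary, so text IDs and speech IDs never collide. Below is a minimal sketch of that offset rule, assuming the Llama-3.2-1B-Instruct tokenizer; the helper names are illustrative and not part of the repository.

```python
# Sketch of the unified-vocabulary mapping described above, assuming a Llama
# text tokenizer and raw X-codec2 speech codes. The "+8 special tokens" offset
# follows the note above; the exact layout in the released models may differ.
from transformers import AutoTokenizer

text_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

NUM_SPECIAL_TOKENS = 8
SPEECH_OFFSET = len(text_tokenizer) + NUM_SPECIAL_TOKENS  # first ID of the speech range


def shift_speech_tokens(speech_codes):
    """Map raw X-codec2 codes into the unified text+speech ID space."""
    return [code + SPEECH_OFFSET for code in speech_codes]


def unshift_speech_tokens(unified_ids):
    """Inverse mapping: recover raw codec codes from unified IDs."""
    return [tok - SPEECH_OFFSET for tok in unified_ids if tok >= SPEECH_OFFSET]


# Raw codec codes 0, 1, 2 land just above the text vocabulary and the 8 special tokens.
print(shift_speech_tokens([0, 1, 2]))
```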

Direct use on Hugging Face

Codec: xcodec2

Llasa-collections: Llasa-collections
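
Since the released checkpoints are standard causal language models over the unified vocabulary, they can be loaded with transformers. A hedged sketch follows, assuming a repo id such as HKUSTAudio/Llasa-1B; verify the exact model names against the Llasa collections before running.

```python
# Usage sketch: loading a Llasa checkpoint from the Hugging Face Hub with
# transformers. The repo id "HKUSTAudio/Llasa-1B" is an assumption; check the
# Llasa collections page for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HKUSTAudio/Llasa-1B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

# Generated speech tokens must then be decoded back into a waveform with the
# xcodec2 codec listed above; its API is not sketched here to avoid guessing
# at signatures.
```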
