
YuE
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Stars: 1136

YuE (乐) is an open-source foundation model designed for music generation, specifically transforming lyrics into full songs. It can generate complete songs in various genres and vocal styles, ensuring a polished and cohesive result. The model requires significant GPU memory for generating long sequences and recommends specific configurations for optimal performance. Users can customize the number of sessions for memory usage. The tool provides a quickstart guide for generating music using Transformers and includes tips for execution time and tag selection. The project is licensed under Creative Commons Attribution Non Commercial 4.0.
README:
Demo 🎶 | 📑 Paper (coming soon)
YuE-s1-7B-anneal-en-cot 🤗 | YuE-s1-7B-anneal-en-icl 🤗 | YuE-s1-7B-anneal-jp-kr-cot 🤗
YuE-s1-7B-anneal-jp-kr-icl 🤗 | YuE-s1-7B-anneal-zh-cot 🤗 | YuE-s1-7B-anneal-zh-icl 🤗
YuE-s2-1B-general 🤗 | YuE-upsampler 🤗
Our model's name is YuE (乐). In Chinese, the word means "music" and "happiness." Some of you may find words that start with Yu hard to pronounce. If so, you can just call it "yeah." We wrote a song with our model's name, see here.
YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs (lyrics2song). It can generate a complete song, lasting several minutes, that includes both a catchy vocal track and accompaniment track. YuE is capable of modeling diverse genres/languages/vocal techniques. Please visit the Demo Page for amazing vocal performance.
- 2025.01.29 🎉: We have updated the license description. We ENCOURAGE artists and content creators to sample and incorporate outputs generated by our model into their own works, and even monetize them. The only requirement is to credit our name: YuE by HKUST/M-A-P (alphabetic order).
- 2025.01.28 🫶: Thanks to Fahd for creating a tutorial on how to quickly get started with YuE. Here is his demonstration.
- 2025.01.26 🔥: We have released the YuE series.
- [ ] Support dual-track ICL mode.
- [ ] Support gradio interface. https://github.com/multimodal-art-projection/YuE/issues/1
- [ ] Support transformers tensor parallel. https://github.com/multimodal-art-projection/YuE/issues/7
- [ ] Online serving on huggingface space.
- [ ] Example finetune code for enabling BPM control using 🤗 Transformers.
- [ ] Support stemgen mode https://github.com/multimodal-art-projection/YuE/issues/21
- [ ] Support seeding https://github.com/multimodal-art-projection/YuE/issues/20
YuE requires significant GPU memory for generating long sequences. Below are the recommended configurations:
- For GPUs with 24GB memory or less: Run up to 2 sessions concurrently to avoid out-of-memory (OOM) errors.
- For full song generation (many sessions, e.g., 4 or more): Use GPUs with at least 80GB memory, e.g., an H800, an A100, or multiple RTX 4090s with tensor parallelism.
To customize the number of sessions, the interface allows you to specify the desired session count. By default, the model runs 2 sessions (1 verse + 1 chorus) to avoid OOM issues.
On an H800 GPU, generating 30s audio takes 150 seconds. On an RTX 4090 GPU, generating 30s audio takes approximately 360 seconds.
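If you are unsure which of the configurations above applies to your machine, you can check available GPU memory before picking a session count (assuming an NVIDIA GPU with the standard driver tools installed; this check is not part of the YuE scripts):
# Print each GPU's name, total memory, and free memory to decide between the 2-session default and a full-song run.
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv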
Quick start VIDEO TUTORIAL by Fahd: Link here. We recommend watching this video if you are not familiar with machine learning or the command line.
Make sure to properly install FlashAttention 2 to reduce VRAM usage.
# We recommend using conda to create a new environment.
conda create -n yue python=3.8 # Python >=3.8 is recommended.
conda activate yue
# install cuda >= 11.8
conda install pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
# For saving GPU memory, FlashAttention 2 is mandatory.
# Without it, long audio may lead to out-of-memory (OOM) errors.
# Be careful about matching the cuda version and flash-attn version
pip install flash-attn --no-build-isolation
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://github.com/multimodal-art-projection/YuE.git
cd YuE/inference/
git clone https://huggingface.co/m-a-p/xcodec_mini_infer
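As a quick sanity check before running inference, you can verify that PyTorch sees your GPU and that flash-attn imports cleanly (a minimal sketch, not an official setup step):
# Should print True plus the flash_attn version; if this fails, revisit the CUDA / flash-attn install above.
python -c "import torch, flash_attn; print(torch.cuda.is_available(), flash_attn.__version__)"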
Now generate music with YuE using 🤗 Transformers. Make sure steps 1 and 2 are properly set up.
Note:
- Set --run_n_segments to the number of lyric sections if you want to generate a full song. Additionally, you can increase --stage2_batch_size based on your available GPU memory.
- You may customize the prompt in genre.txt and lyrics.txt. See the prompt engineering guide here.
- LM ckpts will be automatically downloaded from huggingface.
# This is the CoT mode.
cd YuE/inference/
python infer.py \
--stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
--stage2_model m-a-p/YuE-s2-1B-general \
--genre_txt genre.txt \
--lyrics_txt lyrics.txt \
--run_n_segments 2 \
--stage2_batch_size 4 \
--output_dir ./output \
--cuda_idx 0 \
--max_new_tokens 3000
If you want to use music in-context learning (provide a reference song), enable --use_audio_prompt, --prompt_start_time, and --prompt_end_time to specify the audio segment.
Note:
- ICL requires a different ckpt, e.g. m-a-p/YuE-s1-7B-anneal-en-icl.
- Music ICL generally requires a 30s audio segment. The model will write new songs in a similar style to the provided audio, which may improve musicality.
- We have 4 modes for ICL: mix, vocal, instrumental, and dual-track.
- We currently only support mix mode.
- Dual-track mode works the best; it will be supported in the inference code soon.
# This is the ICL mode. Currently only mix-ICL is supported.
cd YuE/inference/
python infer.py \
--stage1_model m-a-p/YuE-s1-7B-anneal-en-icl \
--stage2_model m-a-p/YuE-s2-1B-general \
--genre_txt genre.txt \
--lyrics_txt lyrics.txt \
--run_n_segments 2 \
--stage2_batch_size 4 \
--output_dir ./output \
--cuda_idx 0 \
--max_new_tokens 3000 \
--audio_prompt_path {YOUR_AUDIO_FILE} \
--prompt_start_time 0 \
--prompt_end_time 30
The prompt consists of three parts: genre tags, lyrics, and reference audio.
- An example genre tagging prompt can be found here.
- A stable tagging prompt usually consists of five components: genre, instrument, mood, gender, and timbre. All five should be included if possible, separated by spaces (space delimiter).
- Although our tags have an open vocabulary, we have provided the top 200 most commonly used tags. It is recommended to select tags from this list for more stable results.
- The order of the tags is flexible. For example, a stable genre tagging prompt might look like: "inspiring female uplifting pop airy vocal electronic bright vocal vocal." (See the sketch after this list.)
- Additionally, we have introduced the "Mandarin" and "Cantonese" tags to distinguish between Mandarin and Cantonese, as their lyrics often share similarities.
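A minimal sketch of what genre.txt could contain, using the example tag string from the list above (the filename matches the --genre_txt argument used in the commands in this README; the particular tags are only illustrative):
# One line of space-delimited tags covering genre, instrument, mood, gender, and timbre.
echo "inspiring female uplifting pop airy vocal electronic bright vocal vocal" > genre.txt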
- An example lyric prompt can be found here.
- We support multiple languages, including but not limited to English, Mandarin Chinese, Cantonese, Japanese, and Korean. The default top language distribution during the annealing phase is revealed in issue 12. A language ID on a specific annealing checkpoint indicates that we have adjusted the mixing ratio to enhance support for that language.
- The lyrics prompt should be divided into sessions, with structure labels (e.g., [verse], [chorus], [bridge], [outro]) prepended. Each session should be separated by two newline characters "\n\n" (see the sketch after this list).
- DO NOT put too many words in a single segment, since each session is around 30s (--max_new_tokens 3000 by default).
- We find that the [intro] label is less stable, so we recommend starting with [verse] or [chorus].
- For generating music with no vocal, see issue 18.
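A minimal sketch of a lyrics.txt layout following the rules above: structure labels prepended, sessions separated by a blank line ("\n\n"), and starting with [verse] rather than [intro]. The filename matches the --lyrics_txt argument; the lyric lines themselves are placeholders:
# Two short sessions (1 verse + 1 chorus), matching the default --run_n_segments 2.
cat > lyrics.txt << 'EOF'
[verse]
Your first verse lyrics go here
Keep each session short enough for roughly 30 seconds of audio

[chorus]
Your chorus lyrics go here
A blank line separates each session
EOF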
- The audio prompt is optional. Providing reference audio for ICL usually increases the good case rate, but results in less diversity, since the generated token space is bounded by the reference audio. CoT only (no reference) will produce more diverse output.
- We find that dual-track ICL mode gives the best musicality and prompt following. We will support this mode soon.
- Using the chorus part of the music as the prompt will result in better musicality (see the example after this list).
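For example, if the chorus of your reference track runs from roughly 60s to 90s, you can point the ICL flags at that window instead of the opening 30 seconds (a sketch only: the audio path and timestamps are placeholders; the flags are the same ones used in the ICL command above):
# Feed the chorus (60s-90s) of the reference song as the ICL audio segment.
cd YuE/inference/
python infer.py \
--stage1_model m-a-p/YuE-s1-7B-anneal-en-icl \
--stage2_model m-a-p/YuE-s2-1B-general \
--genre_txt genre.txt \
--lyrics_txt lyrics.txt \
--run_n_segments 2 \
--stage2_batch_size 4 \
--output_dir ./output \
--cuda_idx 0 \
--max_new_tokens 3000 \
--audio_prompt_path ./reference_song.mp3 \
--prompt_start_time 60 \
--prompt_end_time 90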
- Our models are licensed under Creative Commons Attribution Non Commercial 4.0, meaning the model weights themselves CANNOT be used for commercial purposes.
- However, we ENCOURAGE artists and content creators to sample and incorporate outputs generated by our model into their own works, and even monetize them. The only requirement is to credit our name: YuE by HKUST/M-A-P (alphabetic order).
- We DO NOT assume any responsibility for any misuse of this model, including but not limited to illegal, malicious, or unethical activities.
- Users are solely responsible for any content generated with the model and any consequences arising from its use.
The project is co-led by HKUST and M-A-P (alphabetic order). Thanks also to moonshot.ai, bytedance, 01.ai, and geely for supporting the project. A friendly link to the HKUST Audio group's huggingface space.
We deeply appreciate all the support we received along the way. Long live open-source AI!
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
@misc{yuan2025yue,
title={YuE: Open Music Foundation Models for Full-Song Generation},
author={Ruibin Yuan and Hanfeng Lin and Shawn Guo and Ge Zhang and Jiahao Pan and Yongyi Zang and Haohe Liu and Xingjian Du and Xeron Du and Zhen Ye and Tianyu Zheng and Yinghao Ma and Minghao Liu and Lijun Yu and Zeyue Tian and Ziya Zhou and Liumeng Xue and Xingwei Qu and Yizhi Li and Tianhao Shen and Ziyang Ma and Shangda Wu and Jun Zhan and Chunhui Wang and Yatian Wang and Xiaohuan Zhou and Xiaowei Chi and Xinyue Zhang and Zhenzhu Yang and Yiming Liang and Xiangzhou Wang and Shansong Liu and Lingrui Mei and Peng Li and Yong Chen and Chenghua Lin and Xie Chen and Gus Xia and Zhaoxiang Zhang and Chao Zhang and Wenhu Chen and Xinyu Zhou and Xipeng Qiu and Roger Dannenberg and Jiaheng Liu and Jian Yang and Stephen Huang and Wei Xue and Xu Tan and Yike Guo},
howpublished={\url{https://github.com/multimodal-art-projection/YuE}},
year={2025},
note={GitHub repository}
}