RAVE
Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder
Stars: 1192
RAVE is a variational autoencoder for fast and high-quality neural audio synthesis. It can be used to generate new audio samples from a given dataset, or to modify the style of existing audio samples. RAVE is easy to use and can be trained on a variety of audio datasets. It is also computationally efficient, making it suitable for real-time applications.
README:
Official implementation of RAVE: A variational autoencoder for fast and high-quality neural audio synthesis (article link) by Antoine Caillon and Philippe Esling.
If you use RAVE as part of a music performance or installation, be sure to cite either this repository or the article!
If you want to share, discuss, or ask things about RAVE, you can do so on our Discord server!
Please check the FAQ before posting an issue!
RAVE VST: the RAVE VST for Windows, macOS, and Linux is available in beta on the corresponding Forum IRCAM webpage. For problems, please open an issue here or on the Forum IRCAM discussion page.
Tutorials: new tutorials are available on the Forum IRCAM webpage, and video versions are coming soon!
- Tutorial: Neural Synthesis in a DAW with RAVE
- Tutorial: Neural Synthesis in Max 8 with RAVE
- Tutorial: Training RAVE models on custom data
The original implementation of the RAVE model can be restored using
git checkout v1
Install RAVE using
pip install acids-rave
Warning: it is strongly advised to install torch and torchaudio before acids-rave, so that you can choose the appropriate version of torch for your hardware on the library website. For future compatibility with new devices (and modern Python environments), acids-rave does not enforce torch==1.13 anymore.
You will need ffmpeg on your computer. You can install it locally inside your virtual environment using
conda install ffmpeg
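Before installing acids-rave, you can quickly verify that torch, torchaudio, and ffmpeg are visible from your environment. This is an illustrative sanity check, not part of RAVE:

```python
# Sanity check for the setup described above (illustrative, not part of RAVE).
import shutil

import torch
import torchaudio

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchaudio:", torchaudio.__version__)
print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
```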
A Colab to train RAVEv2 is now available thanks to hexorcismos!
Training a RAVE model usually involves 3 separate steps, namely dataset preparation, training and export.
You can now prepare a dataset using two methods: regular and lazy. Lazy preprocessing allows RAVE to be trained directly on the raw files (i.e. mp3, ogg) without converting them first. Warning: lazy dataset loading will increase your CPU load by a large margin during training, especially on Windows. It can however be useful when training on a large audio corpus that would not fit on a hard drive when uncompressed. In any case, prepare your dataset using
rave preprocess --input_path /audio/folder --output_path /dataset/path --channels X (--lazy)
RAVEv2 has many different configurations. The improved version of v1 is called v2, and can be trained with
rave train --config v2 --db_path /dataset/path --out_path /model/out --name give_a_name --channels X
We also provide a discrete configuration, similar to SoundStream or EnCodec
rave train --config discrete ...
By default, RAVE is built with non-causal convolutions. If you want to make the model causal (hence lowering the overall latency of the model), you can use the causal mode
rave train --config discrete --config causal ...
New in 2.3, data augmentations are also available to improve the model's generalization in low-data regimes. You can add data augmentation by passing augmentation configuration files with the --augment keyword
rave train --config v2 --augment mute --augment compress
Many other configuration files are available in rave/configs and can be combined. Here is a list of all the available configurations & augmentations:
| Type | Name | Description |
|---|---|---|
| Architecture | v1 | Original continuous model (minimum GPU memory: 8 GB) |
| | v2 | Improved continuous model (faster, higher quality) (minimum GPU memory: 16 GB) |
| | v2_small | v2 with a smaller receptive field, adapted adversarial training, and a noise generator; adapted for timbre transfer for stationary signals (minimum GPU memory: 8 GB) |
| | v2_nopqmf | (experimental) v2 without PQMF in the generator (more efficient for bending purposes) (minimum GPU memory: 16 GB) |
| | v3 | v2 with Snake activation, Descript discriminator, and Adaptive Instance Normalization for real style transfer (minimum GPU memory: 32 GB) |
| | discrete | Discrete model (similar to SoundStream or EnCodec) (minimum GPU memory: 18 GB) |
| | onnx | Noiseless v1 configuration for ONNX usage (minimum GPU memory: 6 GB) |
| | raspberry | Lightweight configuration compatible with realtime Raspberry Pi 4 inference (minimum GPU memory: 5 GB) |
| Regularization (v2 only) | default | Variational autoencoder objective (ELBO) |
| | wasserstein | Wasserstein autoencoder objective (MMD) |
| | spherical | Spherical autoencoder objective |
| Discriminator | spectral_discriminator | Use the MultiScale discriminator from EnCodec |
| Others | causal | Use causal convolutions |
| | noise | Enables the V2 noise synthesizer |
| | hybrid | Enables mel-spectrogram input |
| Augmentations | mute | Randomly mutes data batches (default prob: 0.1); forces the model to learn silence |
| | compress | Randomly compresses the waveform (equivalent to light non-linear amplification of batches) |
| | gain | Applies a random gain to the waveform (default range: [-6, 3]) |
Once trained, export your model to a torchscript file using
rave export --run /path/to/your/run (--streaming)
Setting the --streaming flag will enable cached convolutions, making the model compatible with realtime processing. If you forget to use the streaming mode and try to load the model in Max, you will hear clicking artifacts.
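Exported models can also be used directly from Python. The following is a minimal sketch, assuming the torchscript file exposes encode, decode, and forward on audio shaped (batch, channels, samples), as recent RAVE exports do; the model path is a placeholder:

```python
import torch

torch.set_grad_enabled(False)

# Load a model produced by `rave export` (placeholder path).
model = torch.jit.load("/path/to/your/run.ts")

x = torch.randn(1, 1, 2**16)   # one mono buffer of 65536 samples
z = model.encode(x)            # audio -> latent trajectory
x_hat = model.decode(z)        # latent -> audio
y = model(x)                   # forward = encode then decode

print(z.shape, x_hat.shape, y.shape)
```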
For discrete models, we redirect the user to the msprior library here. However, as this library is still experimental, the prior from version 1.x has been re-integrated in v2.3.
To train a prior for a pretrained RAVE model:
rave train_prior --model /path/to/your/run --db_path /path/to/your_preprocessed_data --out_path /path/to/output
This will train a prior over the latent space of the pretrained model /path/to/your/run, and save the model and TensorBoard logs to the folder /path/to/output.
To script a prior along with a RAVE model, export your model by passing the --prior keyword with your pretrained prior:
rave export --run /path/to/your/run --prior /path/to/your/prior (--streaming)
Several pretrained streaming models are available here. We'll keep the list updated with new models.
This section presents how RAVE can be loaded inside nn~ in order to be used live with Max/MSP or PureData.
A pretrained RAVE model named darbouka.gin available on your computer can be loaded inside nn~ using the following syntax, where the default method is set to forward (i.e. encode then decode):
This does the same thing as the following patch, which chains the encode and decode methods explicitly, but slightly faster.
Having explicit access to the latent representation yielded by RAVE allows us to interact with the representation using Max/MSP or PureData signal processing tools:
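The same kind of latent manipulation can be sketched in Python. Here we bias one latent dimension before decoding; the model file name is hypothetical, and the encode/decode interface is assumed as in the export example above:

```python
import torch

torch.set_grad_enabled(False)
model = torch.jit.load("darbouka.ts")  # hypothetical exported model file

x = torch.randn(1, 1, 2**16)           # stand-in for a real audio buffer
z = model.encode(x)                    # latents shaped (batch, latent_dim, time)
z[:, 0, :] += 2.0                      # bias the first latent, like adding a signal in nn~
x_mod = model.decode(z)                # decode the modified latent trajectory
```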
By default, RAVE can be used as a style transfer tool, based on the large compression ratio of the model. We recently added a technique inspired by StyleGAN to include Adaptive Instance Normalization in the reconstruction process, effectively allowing source and target styles to be defined directly inside Max/MSP or PureData, using the attribute system of nn~.
Other attributes, such as enable or gpu, can enable/disable computation or use the GPU to speed things up (still experimental).
A batch generation script has been released in v2.3 to allow the transformation of large numbers of files:
rave generate model_path path_1 path_2 --out out_path
where model_path is the path to your trained model (original or scripted), path_X a list of audio files or directories, and out_path the output directory for the generations.
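For reference, a rough Python equivalent of this batch transformation might look as follows. This is an illustrative sketch only (the actual CLI handles more formats, resampling, and options), and the file names are placeholders:

```python
from pathlib import Path

import torch
import torchaudio

torch.set_grad_enabled(False)
model = torch.jit.load("model.ts")        # placeholder for your exported model

out_dir = Path("out_path")
out_dir.mkdir(exist_ok=True)

for f in sorted(Path("path_1").glob("*.wav")):
    x, sr = torchaudio.load(str(f))       # (channels, samples); must match the model's channels/rate
    y = model(x.unsqueeze(0)).squeeze(0)  # forward pass = encode then decode
    torchaudio.save(str(out_dir / f.name), y, sr)
```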
If you have questions, want to share your experience with RAVE, or share musical pieces made with the model, you can use the Discussion tab!
Demonstration of what you can do with RAVE and the nn~ external for Max/MSP!
Using nn~ for PureData, RAVE can be used in realtime on embedded platforms!
Question: my preprocessing is stuck, showing 0it[00:00, ?it/s]
Answer: this means that the audio files in your dataset are too short to provide a sufficient temporal scope to RAVE. Try decreasing the signal window with --num_signal XXX (samples) during preprocess, without forgetting afterwards to add --n_signal XXX (samples) during train.
Question: during training I got an exception resembling ValueError: n_components=128 must be between 0 and min(n_samples, n_features)=64 with svd_solver='full'
Answer: this means that your dataset does not have enough data batches to compute the internal latent PCA, which requires at least 128 examples (and therefore batches).
This work is conducted at IRCAM and has been funded by the following projects:
- ANR MakiMono
- ACTOR
- DAFNE+ N° 101061548