voice-pro

The best gradio web-ui for ai transcription, translation and TTS. Automatic subtitle creation using faster whisper. Easy one click installation. Fully portable.

Stars: 233

Visit

Voice-Pro is an integrated solution for subtitles, translation, and TTS. It offers features like multilingual subtitles, live translation, vocal remover, and supports OpenAI Whisper and Open-Source Translator. The tool provides a Studio tab for various functions, Whisper Caption tab for subtitle creation, Translate tab for translation, TTS tab for text-to-speech, Live Translation tab for real-time voice recognition, and Batch tab for processing multiple files. Users can download YouTube videos, improve voice recognition accuracy, create automatic subtitles, and produce multilingual videos with ease. The tool is easy to install with one-click and offers a Web-UI for user convenience.

README:

Voice-Pro

🌍 한국어 ∙ English ∙ 中文简体 ∙ 中文繁體 ∙ 日本語

The best gradio web-ui for transcription, translation and tts. Easy one click installation. Fully portable.

Introduction

Voice-Pro is an integrated solution for subtitles, translation, and TTS.
Add multilingual subtitles and multilingual audio to your video with Voice-Pro. Expansion into the global market is possible!
You watch world news every morning? Then, try using the live translation feature. It supports real-time translation, just like what you see on YouTube.
Voice-Pro is equipped with Vocal Remover provided by UVR5 and Meta's Demucs engine.
Voice-Pro uses OpenAI Whisper and the free Open-Source Translator and Open-Source TTS.
Voice-Pro can be easily installed with one click and provides Gradio Web-UI.
Experience the highest level of On-Device AI Voice technology.

Run screen

https://github.com/user-attachments/assets/27b4e79c-7b29-4efd-80c3-5757fa5f71e4

main function

Studio tab
- Provides integrated environment for YouTube downloader, noise removal, subtitles, translation, and TTS
- All video/audio formats supported by ffmpeg can be used
- Selectable output audio format (wav, flac, mp3)
- Speech recognition and subtitle creation for 100 languages
- Select subtitle creation options suitable for PC performance (Whisper Model & Compute Type)
- Translation into over 100 languages and voice generation through TTS
- The BGM and sound effects from the original video are maintained in the multilingual video.
- Supports TTS voice speed, volume, and pitch adjustment

Whisper Caption tab
- A tab dedicated to creating subtitles. Supports over 90 languages
- Display subtitles created with the video
- World-Level Highlight function provided
- Denoise function provided (1-Demucs, 2-MDXNet)
Translate tab
- Dedicated tab for translation. Supports over 100 languages
- Supports subtitle files (ass, ssa, srt, mpl2, tmp, vtt, microdvd, json)
- Direct text input is also possible
- Automatically detects the language of uploaded files
TTS tab
- TTS-only tab. Supports over 100 languages and 400 voices
- Supports subtitle files (ass, ssa, srt, mpl2, tmp, vtt, microdvd, json)
- Direct text input is also possible
- Automatically detects the language of uploaded files
- Pitch, volume, and speed adjustable
Live Translation tab
- Real-time voice recognition & translation support
- Select audio input source such as Mic, Speaker, etc.
- Provides the ability to save captured audio, recognized subtitles, and translated subtitles
Batch tab
- Batch processing for large amounts of files
- Subtitles, translation, TTS

Features

You can download YouTube videos (mp4, webm) and save them as audio files (mp3, wav, flac).
You can increase the accuracy of voice recognition by removing noise and separating vocals. MDX-Net and Meta's Demucs are used.
Provides automatic subtitle creation, machine translation, and TTS functions through AI voice recognition.
You can easily produce multilingual videos.
One-click installation. Once installed, you can use it permanently at no additional cost. (※ Free version has 30 minute limit on usage time)
Provides Web-UI. Google Chrome browser is recommended.

Execution environment

OS: Windows 10/11 (64bits) ※ Linux and Mac OS are not supported.
CPU: Intel processor 2GHz or higher (or equivalent compatible)
RAM: 4GB or more
HDD: At least 20GB of free space during installation
GPU: NVIDIA graphics card supporting CUDA 12.1 recommended. VRAM 4GB or more. 8GB or more recommended.
Internet connection required (installation and translation work)

Install and run

step 1. Package preparation

A. Paid version
- Unzip the compressed file (voice-pro-x.zip) included in the USB to an appropriate location on your computer.
- Or, copy the already unzipped folder (voice-pro-x) to an appropriate location on your computer.
B. Free version
- [](https://github.com/abus-aikorea/voice-pro/ Download and unzip the latest release (Source code (zip)) from
- Or, download source code with git clone

git clone https://github.com/abus-aikorea/voice-pro.git

step 2. Install and run the program

Run configure.bat
- Install ffmpeg and CUDA (if using NVIDIA GPU) on Windows.
- You only need to run it the first time.
Run start.bat
- Start Voice-Pro. Web-UI will run automatically.
- When running for the first time, Voice-Pro is installed first.
- Voice-Pro installation requires an Internet connection, and depending on the system, installation may take more than an hour.
- Never close the Windows-Command window during installation.
- If a problem occurs during installation, delete the installer_files folder and run start.bat again.

step 3. Uninstall program

Run uninstall.bat:
- Remove the installer_files folder.
- Remove ffmepg and CUDA packages installed on Windows (if selected)
Voice-Pro has portable installation as standard. To uninstall the program, deleting the installation folder is sufficient.

Tips & Tricks

If Browser does not run automatically

Close the Windows-Commnad window and run start.bat again.
Run the browser directly and enter the address displayed in the Windows-Command window (e.g. http://127.0.0.1:7892) in the address bar.

If a CUDA Out-Of-Memory error occurs

Check the GPU memory status in Windows Task Manager - Performance tab.
Set the Denoise level to 0 or 1. Denoise level 2 requires at least 8GB of GPU memory.
Set Compute Type to int type. The float type has better quality, but requires more GPU memory.

How to improve the quality of subtitles?

The quality of subtitles tends to improve with larger Whisper models, but this is not necessarily the case. large > medium > small > base > tiny
Among compute types, float type has good performance. The int type is a model that reduces GPU usage and increases speed through model quantization. On the other hand, performance decreases.
If you increase the denoise level, more background sounds will be removed, and only the remaining voice will be used for voice recognition. It does not always guarantee good results.

caution

When Windows Defender mistakenly recognizes a batch file as a Trojan, this is often called a 'False Positive'. To solve this problem, you can go through the following steps:

File exception handling: In Windows Defender, you can set certain files or processes to skip security scanning. To do this, follow the steps below:
- Click the ‘Start’ button and go to ‘Settings’.
- Click ‘Update & Security’.
- Select ‘Windows Security’ and go to ‘Virus & threat protection’.
- Click ‘Manage Virus & Threat Protection Settings’.
- Select 'Add exception' in 'Virus & threat protection settings'.
- Select 'File or Folder', find the batch file in question and add it as an exception.
Temporarily disable Windows Defender: This may be a temporary solution. However, you must be careful when using this method as it may expose your computer to other threats.
Report the problem to anti-virus software: If you are sure that the file is not a Trojan horse, you can report it to Microsoft as a False Positive. Microsoft will review this and take any necessary action.

Contact us

e-mail: [email protected]
homepage(Korean): https://abuskorea.imweb.me
Amazon(US): https://www.amazon.com/dp/B0DBR69JPL
Amazon(Japan): https://www.amazon.co.jp/dp/B0DBVRJ542
Amazon(Singapore): https://www.amazon.sg/dp/B0DCGKL8R4
Amazon(UAE): https://www.amazon.ae/dp/B0DCGKM7FF
네이버 스마트스토어 (S/W): https://smartstore.naver.com/abus/products/10385660040
네이버 스마트스토어 (Solution): https://smartstore.naver.com/abus/products/10298346364

YouTube

Product Information: https://youtube.com/playlist?list=PLwx5dnMDVC9Y7dAjm9r26CZUw1uU5VIeq&si=873MgzUtu4POE9jO
Home Karaoke (Pop): https://youtube.com/playlist?list=PLwx5dnMDVC9bVxfGo58U-R-w3fUHqwiD6&si=aWRDfF8TxFp2oAR0
Home Karaoke (K-Pop): https://youtube.com/playlist?list=PLwx5dnMDVC9Z8kB01tQKfzTysaCCxC3C8&si=1_-9p722rd_JXpzv
Home Karaoke (J-Pop): https://youtube.com/playlist?list=PLwx5dnMDVC9apyxrP9LE9PiT821G7lJXk&si=0a474CP7ZIjMoGN9

Credits

FacebookResearch Demucs: https://github.com/facebookresearch/demucs
yt-dlp: https://github.com/yt-dlp/yt-dlp
gradio: https://github.com/gradio-app/gradio

Copyright

by ABUS

For Tasks:

Click tags to check more tools for each tasks

create subtitles translate videos generate voice remove noise download youtube videos

For Jobs:

transcription specialist language translator video editor ai voice technology developer audio engineer

Alternative AI tools for voice-pro

Similar Open Source Tools

voice-pro

github

: 233

comfyui_LLM_Polymath

github

: 54

kollektiv

Kollektiv is a Retrieval-Augmented Generation (RAG) system designed to enable users to chat with their favorite documentation easily. It aims to provide LLMs with access to the most up-to-date knowledge, reducing inaccuracies and improving productivity. The system utilizes intelligent web crawling, advanced document processing, vector search, multi-query expansion, smart re-ranking, AI-powered responses, and dynamic system prompts. The technical stack includes Python/FastAPI for backend, Supabase, ChromaDB, and Redis for storage, OpenAI and Anthropic Claude 3.5 Sonnet for AI/ML, and Chainlit for UI. Kollektiv is licensed under a modified version of the Apache License 2.0, allowing free use for non-commercial purposes.

github

: 74

eole

EOLE is an open language modeling toolkit based on PyTorch. It aims to provide a research-friendly approach with a comprehensive yet compact and modular codebase for experimenting with various types of language models. The toolkit includes features such as versatile training and inference, dynamic data transforms, comprehensive large language model support, advanced quantization, efficient finetuning, flexible inference, and tensor parallelism. EOLE is a work in progress with ongoing enhancements in configuration management, command line entry points, reproducible recipes, core API simplification, and plans for further simplification, refactoring, inference server development, additional recipes, documentation enhancement, test coverage improvement, logging enhancements, and broader model support.

github

: 106

TaskingAI

TaskingAI brings Firebase's simplicity to **AI-native app development**. The platform enables the creation of GPTs-like multi-tenant applications using a wide range of LLMs from various providers. It features distinct, modular functions such as Inference, Retrieval, Assistant, and Tool, seamlessly integrated to enhance the development process. TaskingAI’s cohesive design ensures an efficient, intelligent, and user-friendly experience in AI application development.

github

: 6.1k

Simplifine

Simplifine is an open-source library designed for easy LLM finetuning, enabling users to perform tasks such as supervised fine tuning, question-answer finetuning, contrastive loss for embedding tasks, multi-label classification finetuning, and more. It provides features like WandB logging, in-built evaluation tools, automated finetuning parameters, and state-of-the-art optimization techniques. The library offers bug fixes, new features, and documentation updates in its latest version. Users can install Simplifine via pip or directly from GitHub. The project welcomes contributors and provides comprehensive documentation and support for users.

github

: 65

Director

Director is a framework to build video agents that can reason through complex video tasks like search, editing, compilation, generation, etc. It enables users to summarize videos, search for specific moments, create clips instantly, integrate GenAI projects and APIs, add overlays, generate thumbnails, and more. Built on VideoDB's 'video-as-data' infrastructure, Director is perfect for developers, creators, and teams looking to simplify media workflows and unlock new possibilities.

github

: 791

qapyq

qapyq is an image viewer and AI-assisted editing tool designed to help curate datasets for generative AI models. It offers features such as image viewing, editing, captioning, batch processing, and AI assistance. Users can perform tasks like cropping, scaling, editing masks, tagging, and applying sorting and filtering rules. The tool supports state-of-the-art captioning and masking models, with options for model settings, GPU acceleration, and quantization. qapyq aims to streamline the process of preparing images for training AI models by providing a user-friendly interface and advanced functionalities.

github

: 106

DevoxxGenieIDEAPlugin

Devoxx Genie is a Java-based IntelliJ IDEA plugin that integrates with local and cloud-based LLM providers to aid in reviewing, testing, and explaining project code. It supports features like code highlighting, chat conversations, and adding files/code snippets to context. Users can modify REST endpoints and LLM parameters in settings, including support for cloud-based LLMs. The plugin requires IntelliJ version 2023.3.4 and JDK 17. Building and publishing the plugin is done using Gradle tasks. Users can select an LLM provider, choose code, and use commands like review, explain, or generate unit tests for code analysis.

github

: 414

gemini-android

Gemini Android is a repository showcasing Google's Generative AI on Android using Stream Chat SDK for Compose. It demonstrates the Gemini API for Android, implements UI elements with Jetpack Compose, utilizes Android architecture components like Hilt and AppStartup, performs background tasks with Kotlin Coroutines, and integrates chat systems with Stream Chat Compose SDK for real-time event handling. The project also provides technical content, instructions on building the project, tech stack details, architecture overview, modularization strategies, and a contribution guideline. It follows Google's official architecture guidance and offers a real-world example of app architecture implementation.

github

: 303

efficient-recorder

Efficient Recorder is a battery-life friendly tool designed to stream video, screen, mic, and system audio to any S3-compatible cloud storage service. It captures audio, screenshots, and webcam photos at configurable fps, utilizing low-energy volume detection for audio recording. The tool streams data to a configurable S3 endpoint or a custom server using MinIO. It aims to be storage and battery efficient, providing queued upload processing and minimal system resource overhead. The tool requires SoX for audio recording and webcam capture tools for operation. Users can specify various command line options for customization, such as enabling screenshot and webcam capture with specific intervals and image quality settings.

github

: 148

coding-aider

Coding-Aider is a plugin for IntelliJ IDEA that seamlessly integrates Aider's AI-powered coding assistance into the IDE. It boosts productivity by offering rapid access for precision code generation and refactoring, with complete control over the context utilized by the LLM. The plugin provides various features such as AI-powered coding assistance, intuitive access through keyboard shortcuts, persistent file management, dual execution modes, Git integration, real-time progress tracking, multi-file support, web crawling, clipboard image support, and various specialized actions. It also supports structured mode and plans for managing complex features, working directory support, summarized output, and the ability to specify additional arguments for Aider commands. Coding-Aider addresses limitations in existing IntelliJ plugins by offering optimized token usage, a feature-rich terminal interface, a wide range of commands, and robust recovery mechanisms with seamless Git integration.

github

: 66

llm-answer-engine

This repository contains the code and instructions needed to build a sophisticated answer engine that leverages the capabilities of Groq, Mistral AI's Mixtral, Langchain.JS, Brave Search, Serper API, and OpenAI. Designed to efficiently return sources, answers, images, videos, and follow-up questions based on user queries, this project is an ideal starting point for developers interested in natural language processing and search technologies.

github

: 4.5k

clearml-server

ClearML Server is a backend service infrastructure for ClearML, facilitating collaboration and experiment management. It includes a web app, RESTful API, and file server for storing images and models. Users can deploy ClearML Server using Docker, AWS EC2 AMI, or Kubernetes. The system design supports single IP or sub-domain configurations with specific open ports. ClearML-Agent Services container allows launching long-lasting jobs and various use cases like auto-scaler service, controllers, optimizer, and applications. Advanced functionality includes web login authentication and non-responsive experiments watchdog. Upgrading ClearML Server involves stopping containers, backing up data, downloading the latest docker-compose.yml file, configuring ClearML-Agent Services, and spinning up docker containers. Community support is available through ClearML FAQ, Stack Overflow, GitHub issues, and email contact.

github

: 364

motia

Motia is an AI agent framework designed for software engineers to create, test, and deploy production-ready AI agents quickly. It provides a code-first approach, allowing developers to write agent logic in familiar languages and visualize execution in real-time. With Motia, developers can focus on business logic rather than infrastructure, offering zero infrastructure headaches, multi-language support, composable steps, built-in observability, instant APIs, and full control over AI logic. Ideal for building sophisticated agents and intelligent automations, Motia's event-driven architecture and modular steps enable the creation of GenAI-powered workflows, decision-making systems, and data processing pipelines.

github

: 1.5k

LocalAIVoiceChat

LocalAIVoiceChat is an experimental alpha software that enables real-time voice chat with a customizable AI personality and voice on your PC. It integrates Zephyr 7B language model with speech-to-text and text-to-speech libraries. The tool is designed for users interested in state-of-the-art voice solutions and provides an early version of a local real-time chatbot.

github

: 362

For similar tasks

gpt-subtrans

GPT-Subtrans is an open-source subtitle translator that utilizes large language models (LLMs) as translation services. It supports translation between any language pairs that the language model supports. Note that GPT-Subtrans requires an active internet connection, as subtitles are sent to the provider's servers for translation, and their privacy policy applies.

github

: 418

auto-subs

Auto-subs is a tool designed to automatically transcribe editing timelines using OpenAI Whisper and Stable-TS for extreme accuracy. It generates subtitles in a custom style, is completely free, and runs locally within Davinci Resolve. It works on Mac, Linux, and Windows, supporting both Free and Studio versions of Resolve. Users can jump to positions on the timeline using the Subtitle Navigator and translate from any language to English. The tool provides a user-friendly interface for creating and customizing subtitles for video content.

github

: 799

VideoLingo

VideoLingo is an all-in-one video translation and localization dubbing tool designed to generate Netflix-level high-quality subtitles. It aims to eliminate stiff machine translation, multiple lines of subtitles, and can even add high-quality dubbing, allowing knowledge from around the world to be shared across language barriers. Through an intuitive Streamlit web interface, the entire process from video link to embedded high-quality bilingual subtitles and even dubbing can be completed with just two clicks, easily creating Netflix-quality localized videos. Key features and functions include using yt-dlp to download videos from Youtube links, using WhisperX for word-level timeline subtitle recognition, using NLP and GPT for subtitle segmentation based on sentence meaning, summarizing intelligent term knowledge base with GPT for context-aware translation, three-step direct translation, reflection, and free translation to eliminate strange machine translation, checking single-line subtitle length and translation quality according to Netflix standards, using GPT-SoVITS for high-quality aligned dubbing, and integrating package for one-click startup and one-click output in streamlit.

github

: 12.1k

voice-pro

github

: 233

ai-no-jimaku-gumi

AI no jimaku gumi is a command-line utility designed to assist in video translation. It supports translating subtitles using AI models and provides options for different translation and subtitle sources. Users can easily set up the tool by following the installation steps and use it to translate videos to different languages with customizable settings. The tool currently supports DeepL and llm translation backends and SRT subtitle export. It aims to simplify the process of adding subtitles to videos by leveraging AI technology.

github

: 130

tts-generation-webui

TTS Generation WebUI is a comprehensive tool that provides a user-friendly interface for text-to-speech and voice cloning tasks. It integrates various AI models such as Bark, MusicGen, AudioGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, and MAGNeT. The tool offers one-click installers, Google Colab demo, videos for guidance, and extra voices for Bark. Users can generate audio outputs, manage models, caches, and system space for AI projects. The project is open-source and emphasizes ethical and responsible use of AI technology.

github

: 1.6k

GlaDOS

This project aims to create a real-life version of GLaDOS, an aware, interactive, and embodied AI entity. It involves training a voice generator, developing a 'Personality Core,' implementing a memory system, providing vision capabilities, creating 3D-printable parts, and designing an animatronics system. The software architecture focuses on low-latency voice interactions, utilizing a circular buffer for data recording, text streaming for quick transcription, and a text-to-speech system. The project also emphasizes minimal dependencies for running on constrained hardware. The hardware system includes servo- and stepper-motors, 3D-printable parts for GLaDOS's body, animations for expression, and a vision system for tracking and interaction. Installation instructions cover setting up the TTS engine, required Python packages, compiling llama.cpp, installing an inference backend, and voice recognition setup. GLaDOS can be run using 'python glados.py' and tested using 'demo.ipynb'.

github

: 4.2k

orcish-ai-nextjs-framework

The Orcish AI Next.js Framework is a powerful tool that leverages OpenAI API to seamlessly integrate AI functionalities into Next.js applications. It allows users to generate text, images, and text-to-speech based on specified input. The framework provides an easy-to-use interface for utilizing AI capabilities in application development.

github

: 129

For similar jobs

voice-pro

github

: 233

metavoice-src

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities: * Emotional speech rhythm and tone in English. * Zero-shot cloning for American & British voices, with 30s reference audio. * Support for (cross-lingual) voice cloning with finetuning. * We have had success with as little as 1 minute training data for Indian speakers. * Synthesis of arbitrary length text

github

: 3.1k

suno-api

Suno AI API is an open-source project that allows developers to integrate the music generation capabilities of Suno.ai into their own applications. The API provides a simple and convenient way to generate music, lyrics, and other audio content using Suno.ai's powerful AI models. With Suno AI API, developers can easily add music generation functionality to their apps, websites, and other projects.

github

: 1.7k

bark.cpp

Bark.cpp is a C/C++ implementation of the Bark model, a real-time, multilingual text-to-speech generation model. It supports AVX, AVX2, and AVX512 for x86 architectures, and is compatible with both CPU and GPU backends. Bark.cpp also supports mixed F16/F32 precision and 4-bit, 5-bit, and 8-bit integer quantization. It can be used to generate realistic-sounding audio from text prompts.

github

: 696

NSMusicS

NSMusicS is a local music software that is expected to support multiple platforms with AI capabilities and multimodal features. The goal of NSMusicS is to integrate various functions (such as artificial intelligence, streaming, music library management, cross platform, etc.), which can be understood as similar to Navidrome but with more features than Navidrome. It wants to become a plugin integrated application that can almost have all music functions.

github

: 713

ai-voice-cloning

This repository provides a tool for AI voice cloning, allowing users to generate synthetic speech that closely resembles a target speaker's voice. The tool is designed to be user-friendly and accessible, with a graphical user interface that guides users through the process of training a voice model and generating synthetic speech. The tool also includes a variety of features that allow users to customize the generated speech, such as the pitch, volume, and speaking rate. Overall, this tool is a valuable resource for anyone interested in creating realistic and engaging synthetic speech.

github

: 268

RVC_CLI

**RVC_CLI: Retrieval-based Voice Conversion Command Line Interface** This command-line interface (CLI) provides a comprehensive set of tools for voice conversion, enabling you to modify the pitch, timbre, and other characteristics of audio recordings. It leverages advanced machine learning models to achieve realistic and high-quality voice conversions. **Key Features:** * **Inference:** Convert the pitch and timbre of audio in real-time or process audio files in batch mode. * **TTS Inference:** Synthesize speech from text using a variety of voices and apply voice conversion techniques. * **Training:** Train custom voice conversion models to meet specific requirements. * **Model Management:** Extract, blend, and analyze models to fine-tune and optimize performance. * **Audio Analysis:** Inspect audio files to gain insights into their characteristics. * **API:** Integrate the CLI's functionality into your own applications or workflows. **Applications:** The RVC_CLI finds applications in various domains, including: * **Music Production:** Create unique vocal effects, harmonies, and backing vocals. * **Voiceovers:** Generate voiceovers with different accents, emotions, and styles. * **Audio Editing:** Enhance or modify audio recordings for podcasts, audiobooks, and other content. * **Research and Development:** Explore and advance the field of voice conversion technology. **For Jobs:** * Audio Engineer * Music Producer * Voiceover Artist * Audio Editor * Machine Learning Engineer **AI Keywords:** * Voice Conversion * Pitch Shifting * Timbre Modification * Machine Learning * Audio Processing **For Tasks:** * Convert Pitch * Change Timbre * Synthesize Speech * Train Model * Analyze Audio

github

: 71

openvino-plugins-ai-audacity

OpenVINO™ AI Plugins for Audacity* are a set of AI-enabled effects, generators, and analyzers for Audacity®. These AI features run 100% locally on your PC -- no internet connection necessary! OpenVINO™ is used to run AI models on supported accelerators found on the user's system such as CPU, GPU, and NPU. * **Music Separation**: Separate a mono or stereo track into individual stems -- Drums, Bass, Vocals, & Other Instruments. * **Noise Suppression**: Removes background noise from an audio sample. * **Music Generation & Continuation**: Uses MusicGen LLM to generate snippets of music, or to generate a continuation of an existing snippet of music. * **Whisper Transcription**: Uses whisper.cpp to generate a label track containing the transcription or translation for a given selection of spoken audio or vocals.

github

: 885