bailing
百聆 是一个类似GPT-4o的语音对话机器人,通过ASR+LLM+TTS实现,时延低至800ms,低配置也可运行,支持打断
Stars: 185
Bailing is an open-source voice assistant designed for natural conversations with users. It combines Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), Large Language Model (LLM), and Text-to-Speech (TTS) technologies to provide a high-quality voice interaction experience similar to GPT-4o. Bailing aims to achieve GPT-4o-like conversation effects without the need for GPU, making it suitable for various edge devices and low-resource environments. The project features efficient open-source models, modular design allowing for module replacement and upgrades, support for memory function, tool integration for information retrieval and task execution via voice commands, and efficient task management with progress tracking and reminders.
README:
百聆 是一个开源的语音对话助手,旨在通过语音与用户进行自然的对话。该项目结合了语音识别 (ASR)、语音活动检测 (VAD)、大语言模型 (LLM) 和语音合成 (TTS) 技术,这是一个类似GPT-4o的语音对话机器人,通过ASR+LLM+TTS实现,提供高质量的语音对话体验,端到端时延800ms。百聆旨在无需GPU的情况下,实现类GPT-4o的对话效果,适用于各种边缘设备和低资源环境。
- 高效开源模型:百聆使用了多个开源模型,确保高效、可靠的语音对话体验。
- 无需GPU:通过优化,可本地部署,仍能提供类GPT-4的性能表现。
- 模块化设计:ASR、VAD、LLM和TTS模块相互独立,可以根据需求进行替换和升级。
- 支持记忆功能: 具备持续学习能力,能够记忆用户的偏好与历史对话,提供个性化的互动体验。
- 支持工具调用: 灵活集成外部工具,用户可通过语音直接请求信息或执行操作,提升助手的实用性。
- 支持任务管理: 高效管理用户任务,能够跟踪进度、设置提醒,并提供动态更新,确保用户不错过任何重要事项。
百聆通过以下技术组件实现语音对话功能:
- ASR: 使用 FunASR 进行自动语音识别,将用户的语音转换为文本。
- VAD: 使用 silero-vad 进行语音活动检测,以确保只处理有效的语音片段。
- LLM: 使用 deepseek 作为大语言模型来处理用户输入并生成响应,极具性价比。
- TTS: 使用 edge-tts ChatTTS MacOS say进行文本到语音的转换,将生成的文本响应转换为自然流畅的语音。
Robot 负责高效的任务管理与记忆管理,能够智能地处理用户的打断请求,同时实现各个模块之间的无缝协调与连接,以确保流畅的交互体验。
播放器状态 | 是否说话 | 说明 |
---|---|---|
播放中 | 未说话 | 正常 |
播放中 | 说话 | 打断场景 |
未播放 | 未说话 | 正常 |
未播放 | 说话 | VAD判断,ASR识别 |
- 语音输入:通过 FunASR 进行准确的语音识别。
- 语音活动检测:使用 silero-vad 过滤无效音频,提升识别效率。
- 智能对话生成:依靠 deepseek 提供的强大语言理解能力生成自然的文本回复,极具性价比。
- 语音输出:通过 edge-tts 将文本转为语音,为用户提供逼真的听觉反馈。
- 支持打断:灵活配置打断策略,能够识别关键字和语音打断,确保用户在对话中的即时反馈与控制,提高交互流畅度。
- 支持记忆功能: 具备持续学习能力,能够记忆用户的偏好与历史对话,提供个性化的互动体验。
- 支持工具调用: 灵活集成外部工具,用户可通过语音直接请求信息或执行操作,提升助手的实用性。
- 支持任务管理: 高效管理用户任务,能够跟踪进度、设置提醒,并提供动态更新,确保用户不错过任何重要事项。
- 高质量语音对话:整合了优秀的ASR、LLM和TTS技术,确保语音对话的流畅性和准确性。
- 轻量化设计:无需高性能硬件即可运行,适用于资源受限的环境。
- 完全开源:百聆完全开源,鼓励社区贡献与二次开发。
请确保你的开发环境中安装了以下工具和库:
- Python 3.8 或更高版本
-
pip
包管理器 - FunASR、silero-vad、deepseek、edge-tts 所需的依赖库
-
克隆项目仓库:
git clone https://github.com/wwbin2017/bailing.git cd bailing
-
安装所需依赖:
pip install -r requirements.txt
-
配置环境变量:
- 打开config/config.yaml 配置ASR LLM等相关配置
- 下载SenseVoiceSmall到目录models/SenseVoiceSmall SenseVoiceSmall下载地址
- 去deepseek官网,获取配置api_key,deepseek获取api_key,当然也可以配置openai、qwen、gemini、01yi等其他模型
-
运行项目:
cd server python server.py # 启动后端服务,也可不执行这一步
python main.py
- 启动应用后,系统会等待语音输入。
- 通过 FunASR 将用户语音转为文本。
- silero-vad 进行语音活动检测,确保只处理有效语音。
- deepseek 处理文本输入,并生成智能回复。
- edge-tts, ChatTTS, MacOs say 将生成的文本转换为语音,并播放给用户。
- [x] 基本语音对话功能
- [x] 支持插件调用
- [x] 任务管理
- [x] Rag & Agent
- [x] Memory
- [ ] 支持语音唤醒
- [ ] 强化WebSearch
- [ ] 支持WebRTC
未来,百聆将升华为一款类JARVIS个人助手,仿佛一位贴心的智囊,具备无与伦比的记忆力与前瞻性的任务管理能力。依托于尖端的RAG与Agent技术,它将精确掌控您的事务与知识,化繁为简。只需轻声一语,例如“帮我查找最近新闻”或“总结大模型的最新进展”,百聆便会迅速响应,智能分析,实时跟踪,并将成果优雅地呈现给您。想象一下,您拥有的不仅是一名助手,而是一个深谙您需求的智慧伙伴,伴您在未来的每个重要瞬间,助您洞察万象,决胜千里。
函数名 | 描述 | 功能 | 示例 |
---|---|---|---|
get_weather |
获取某个地点的天气信息 | 提供地点名称后,返回该地点的天气情况 | 用户说:“杭州天气怎么样?” → zhejiang/hangzhou
|
ielts_speaking_practice |
IELTS(雅思)口语练习 | 生成雅思口语练习题目和对话,帮助用户进行雅思口语练习 | - |
get_day_of_week |
获取当前的星期几或日期 | 当用户询问当前时间、日期或者星期几时,返回相应的信息 | 用户说:“今天星期几?” → 返回当前的星期几 |
schedule_task |
创建一个定时任务 | 用户可以指定任务的执行时间和内容,定时提醒用户 | 用户说:“每天早上8点提醒我喝水。” → time: '08:00', content: '提醒我喝水'
|
open_application |
在 Mac 电脑上打开指定的应用程序 | 用户可以指定应用程序的名称,脚本将在 Mac 上启动相应的应用 | 用户说:“打开Safari。” → application_name: 'Safari'
|
web_search |
在网上搜索指定的关键词 | 根据用户提供的搜索内容,返回相应的搜索结果 | 用户说:“搜索最新的科技新闻。” → query: '最新的科技新闻'
|
欢迎任何形式的贡献!如果你对百聆项目有改进建议或发现问题,请通过 GitHub Issues 进行反馈或提交 Pull Request。
该项目基于 MIT 许可证 开源。你可以自由地使用、修改和分发此项目,但需要保留原始许可证声明。
如有任何疑问或建议,请联系:
- GitHub Issues: 项目问题追踪
百聆 (Bailing) 是一个开源项目,旨在用于个人学习和研究目的。使用本项目时,请注意以下免责声明:
- 个人用途:本项目仅用于个人学习和研究,不适用于商业用途或生产环境。
- 风险和责任:使用百聆 (Bailing) 可能会导致数据丢失、系统故障或其他问题。我们对因使用本项目而导致的任何损失、损害或问题不承担任何责任。
- 支持:本项目不提供任何形式的技术支持或保证。用户应自行承担使用本项目的风险。
在使用本项目之前,请确保您已了解并接受这些免责声明。如果您不同意这些条款,请不要使用本项目。
感谢您的理解与支持!
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for bailing
Similar Open Source Tools
bailing
Bailing is an open-source voice assistant designed for natural conversations with users. It combines Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), Large Language Model (LLM), and Text-to-Speech (TTS) technologies to provide a high-quality voice interaction experience similar to GPT-4o. Bailing aims to achieve GPT-4o-like conversation effects without the need for GPU, making it suitable for various edge devices and low-resource environments. The project features efficient open-source models, modular design allowing for module replacement and upgrades, support for memory function, tool integration for information retrieval and task execution via voice commands, and efficient task management with progress tracking and reminders.
Speech-AI-Forge
Speech-AI-Forge is a project developed around TTS generation models, implementing an API Server and a WebUI based on Gradio. The project offers various ways to experience and deploy Speech-AI-Forge, including online experience on HuggingFace Spaces, one-click launch on Colab, container deployment with Docker, and local deployment. The WebUI features include TTS model functionality, speaker switch for changing voices, style control, long text support with automatic text segmentation, refiner for ChatTTS native text refinement, various tools for voice control and enhancement, support for multiple TTS models, SSML synthesis control, podcast creation tools, voice creation, voice testing, ASR tools, and post-processing tools. The API Server can be launched separately for higher API throughput. The project roadmap includes support for various TTS models, ASR models, voice clone models, and enhancer models. Model downloads can be manually initiated using provided scripts. The project aims to provide inference services and may include training-related functionalities in the future.
ChuanhuChatGPT
Chuanhu Chat is a user-friendly web graphical interface that provides various additional features for ChatGPT and other language models. It supports GPT-4, file-based question answering, local deployment of language models, online search, agent assistant, and fine-tuning. The tool offers a range of functionalities including auto-solving questions, online searching with network support, knowledge base for quick reading, local deployment of language models, GPT 3.5 fine-tuning, and custom model integration. It also features system prompts for effective role-playing, basic conversation capabilities with options to regenerate or delete dialogues, conversation history management with auto-saving and search functionalities, and a visually appealing user experience with themes, dark mode, LaTeX rendering, and PWA application support.
Awesome-ChatTTS
Awesome-ChatTTS is an official recommended guide for ChatTTS beginners, compiling common questions and related resources. It provides a comprehensive overview of the project, including official introduction, quick experience options, popular branches, parameter explanations, voice seed details, installation guides, FAQs, and error troubleshooting. The repository also includes video tutorials, discussion community links, and project trends analysis. Users can explore various branches for different functionalities and enhancements related to ChatTTS.
HaE
HaE is a framework project in the field of network security (data security) that combines artificial intelligence (AI) large models to achieve highlighting and information extraction of HTTP messages (including WebSocket). It aims to reduce testing time, focus on valuable and meaningful messages, and improve vulnerability discovery efficiency. The project provides a clear and visual interface design, simple interface interaction, and centralized data panel for querying and extracting information. It also features built-in color upgrade algorithm, one-click export/import of data, and integration of AI large models API for optimized data processing.
chats
Sdcb Chats is a powerful and flexible frontend for large language models, supporting multiple functions and platforms. Whether you want to manage multiple model interfaces or need a simple deployment process, Sdcb Chats can meet your needs. It supports dynamic management of multiple large language model interfaces, integrates visual models to enhance user interaction experience, provides fine-grained user permission settings for security, real-time tracking and management of user account balances, easy addition, deletion, and configuration of models, transparently forwards user chat requests based on the OpenAI protocol, supports multiple databases including SQLite, SQL Server, and PostgreSQL, compatible with various file services such as local files, AWS S3, Minio, Aliyun OSS, Azure Blob Storage, and supports multiple login methods including Keycloak SSO and phone SMS verification.
Awesome-LLM-RAG-Application
Awesome-LLM-RAG-Application is a repository that provides resources and information about applications based on Large Language Models (LLM) with Retrieval-Augmented Generation (RAG) pattern. It includes a survey paper, GitHub repo, and guides on advanced RAG techniques. The repository covers various aspects of RAG, including academic papers, evaluation benchmarks, downstream tasks, tools, and technologies. It also explores different frameworks, preprocessing tools, routing mechanisms, evaluation frameworks, embeddings, security guardrails, prompting tools, SQL enhancements, LLM deployment, observability tools, and more. The repository aims to offer comprehensive knowledge on RAG for readers interested in exploring and implementing LLM-based systems and products.
ChatGPT-Next-Web-Pro
ChatGPT-Next-Web-Pro is a tool that provides an enhanced version of ChatGPT-Next-Web with additional features and functionalities. It offers complete ChatGPT-Next-Web functionality, file uploading and storage capabilities, drawing and video support, multi-modal support, reverse model support, knowledge base integration, translation, customizations, and more. The tool can be deployed with or without a backend, allowing users to interact with AI models, manage accounts, create models, manage API keys, handle orders, manage memberships, and more. It supports various cloud services like Aliyun OSS, Tencent COS, and Minio for file storage, and integrates with external APIs like Azure, Google Gemini Pro, and Luma. The tool also provides options for customizing website titles, subtitles, icons, and plugin buttons, and offers features like voice input, file uploading, real-time token count display, and more.
lawyer-llama
Lawyer LLaMA is a large language model that has been specifically trained on legal data, including Chinese laws, regulations, and case documents. It has been fine-tuned on a large dataset of legal questions and answers, enabling it to understand and respond to legal inquiries in a comprehensive and informative manner. Lawyer LLaMA is designed to assist legal professionals and individuals with a variety of law-related tasks, including: * **Legal research:** Quickly and efficiently search through vast amounts of legal information to find relevant laws, regulations, and case precedents. * **Legal analysis:** Analyze legal issues, identify potential legal risks, and provide insights on how to proceed. * **Document drafting:** Draft legal documents, such as contracts, pleadings, and legal opinions, with accuracy and precision. * **Legal advice:** Provide general legal advice and guidance on a wide range of legal matters, helping users understand their rights and options. Lawyer LLaMA is a powerful tool that can significantly enhance the efficiency and effectiveness of legal research, analysis, and decision-making. It is an invaluable resource for lawyers, paralegals, law students, and anyone else who needs to navigate the complexities of the legal system.
ai_quant_trade
The ai_quant_trade repository is a comprehensive platform for stock AI trading, offering learning, simulation, and live trading capabilities. It includes features such as factor mining, traditional strategies, machine learning, deep learning, reinforcement learning, graph networks, and high-frequency trading. The repository provides tools for monitoring stocks, stock recommendations, and deployment tools for live trading. It also features new functionalities like sentiment analysis using StructBERT, reinforcement learning for multi-stock trading with a 53% annual return, automatic factor mining with 5000 factors, customized stock monitoring software, and local deep reinforcement learning strategies.
SwanLab
SwanLab is an open-source, lightweight AI experiment tracking tool that provides a platform for tracking, comparing, and collaborating on experiments, aiming to accelerate the research and development efficiency of AI teams by 100 times. It offers a friendly API and a beautiful interface, combining hyperparameter tracking, metric recording, online collaboration, experiment link sharing, real-time message notifications, and more. With SwanLab, researchers can document their training experiences, seamlessly communicate and collaborate with collaborators, and machine learning engineers can develop models for production faster.
llm-action
This repository provides a comprehensive guide to large language models (LLMs), covering various aspects such as training, fine-tuning, compression, and applications. It includes detailed tutorials, code examples, and explanations of key concepts and techniques. The repository is maintained by Liguo Dong, an AI researcher and engineer with expertise in LLM research and development.
HivisionIDPhotos
HivisionIDPhoto is a practical algorithm for intelligent ID photo creation. It utilizes a comprehensive model workflow to recognize, cut out, and generate ID photos for various user photo scenarios. The tool offers lightweight cutting, standard ID photo generation based on different size specifications, six-inch layout photo generation, beauty enhancement (waiting), and intelligent outfit swapping (waiting). It aims to solve emergency ID photo creation issues.
AirPower4T
AirPower4T is a development base library based on Vue3 TypeScript Element Plus Vite, using decorators, object-oriented, Hook and other front-end development methods. It provides many common components and some feedback components commonly used in background management systems, and provides a lot of enums and decorators.
sanic-web
Sanic-Web is a lightweight, end-to-end, and easily customizable large model application project built on technologies such as Dify, Ollama & Vllm, Sanic, and Text2SQL. It provides a one-stop solution for developing large model applications, supporting graphical data-driven Q&A using ECharts, handling table-based Q&A with CSV files, and integrating with third-party RAG systems for general knowledge Q&A. As a lightweight framework, Sanic-Web enables rapid iteration and extension to facilitate the quick implementation of large model projects.
how-to-optim-algorithm-in-cuda
This repository documents how to optimize common algorithms based on CUDA. It includes subdirectories with code implementations for specific optimizations. The optimizations cover topics such as compiling PyTorch from source, NVIDIA's reduce optimization, OneFlow's elementwise template, fast atomic add for half data types, upsample nearest2d optimization in OneFlow, optimized indexing in PyTorch, OneFlow's softmax kernel, linear attention optimization, and more. The repository also includes learning resources related to deep learning frameworks, compilers, and optimization techniques.
For similar tasks
M.I.L.E.S
M.I.L.E.S. (Machine Intelligent Language Enabled System) is a voice assistant powered by GPT-4 Turbo, offering a range of capabilities beyond existing assistants. With its advanced language understanding, M.I.L.E.S. provides accurate and efficient responses to user queries. It seamlessly integrates with smart home devices, Spotify, and offers real-time weather information. Additionally, M.I.L.E.S. possesses persistent memory, a built-in calculator, and multi-tasking abilities. Its realistic voice, accurate wake word detection, and internet browsing capabilities enhance the user experience. M.I.L.E.S. prioritizes user privacy by processing data locally, encrypting sensitive information, and adhering to strict data retention policies.
bailing
Bailing is an open-source voice assistant designed for natural conversations with users. It combines Automatic Speech Recognition (ASR), Voice Activity Detection (VAD), Large Language Model (LLM), and Text-to-Speech (TTS) technologies to provide a high-quality voice interaction experience similar to GPT-4o. Bailing aims to achieve GPT-4o-like conversation effects without the need for GPU, making it suitable for various edge devices and low-resource environments. The project features efficient open-source models, modular design allowing for module replacement and upgrades, support for memory function, tool integration for information retrieval and task execution via voice commands, and efficient task management with progress tracking and reminders.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.