micdrop
Micdrop is a set of packages for Node.js and the browser that simplify voice conversations with AI systems.
Stars: 51
Micdrop is an open source TypeScript package for building real-time voice conversations with AI agents. It handles the complexity on both the browser and server sides and provides implementations for various AI providers. Users can customize voice applications by selecting APIs, reducing costs, and controlling conversation flow. Features include microphone handling, advanced audio playback, WebSocket communication, and support for AI components like Large Language Models, Text-to-Speech, and Speech-to-Text.
README:
Micdrop is a set of open source TypeScript packages for building real-time voice conversations with AI agents. It handles the complexity on both the browser and server sides (microphone, speaker, VAD, network communication, etc.) and provides ready-to-use implementations for various AI providers.
- @micdrop/client - Browser library handling microphone input, audio playback, and real-time communication
- @micdrop/server - Server implementation for audio streaming and AI integration orchestration

- @micdrop/openai - OpenAI integration providing LLM agent and speech-to-text capabilities
- @micdrop/ai-sdk - AI SDK agent compatible with a lot of LLM providers
- @micdrop/elevenlabs - ElevenLabs text-to-speech integration with streaming support
- @micdrop/cartesia - Cartesia text-to-speech integration for real-time voice synthesis
- @micdrop/gradium - Gradium text-to-speech integration with WebSocket streaming
- @micdrop/mistral - Mistral AI agent integration for conversation handling
- @micdrop/gladia - Gladia speech-to-text integration for audio transcription

- @micdrop/react - React hooks for Micdrop

- demo-client - Example web application with React
- demo-server - Example server with fastify
See the author Godefroy de Compreignac talking about Micdrop and voice AI in this video:
While real-time multimodal models (voice-to-voice) offer impressive capabilities, they often come with limitations in terms of customization and cost. Micdrop takes a different approach by:
- Allowing you to choose the best-in-class API for each component:
  - Select specific voices from TTS providers
  - Use different LLMs optimized for your use case
  - Pick STT engines suited for specific languages/accents
- Reducing costs by letting you:
  - Use more cost-effective API providers
  - Mix open source and commercial solutions
  - Control exactly when APIs are called
- Providing granular control over the conversation flow
- Supporting a wider range of languages and voices through specialized providers
This modular approach gives you the flexibility to build voice applications that are both powerful and cost-effective.
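To make the modular approach concrete, here is a minimal TypeScript sketch of one conversation turn built from three swappable components. The names used here (`SpeechToText`, `Agent`, `TextToSpeech`, `VoicePipeline`, `runTurn`) and the stub providers are hypothetical and chosen only for illustration; they are not Micdrop's actual API.

```typescript
// Hypothetical interfaces for illustration only; NOT Micdrop's real API.
type Message = { role: "user" | "assistant"; content: string };

interface SpeechToText {
  transcribe(audio: Float32Array): Promise<string>;
}

interface Agent {
  reply(history: Message[]): Promise<string>;
}

interface TextToSpeech {
  synthesize(text: string): Promise<Float32Array>;
}

interface VoicePipeline {
  stt: SpeechToText;
  agent: Agent;
  tts: TextToSpeech;
}

// One turn: user audio -> transcript -> agent answer -> synthesized audio.
// The history array is updated in place with both messages.
async function runTurn(
  pipeline: VoicePipeline,
  history: Message[],
  audio: Float32Array
): Promise<Float32Array> {
  const transcript = await pipeline.stt.transcribe(audio);
  history.push({ role: "user", content: transcript });

  const answer = await pipeline.agent.reply(history);
  history.push({ role: "assistant", content: answer });

  return pipeline.tts.synthesize(answer);
}

// Stub providers standing in for real STT, LLM, and TTS services.
const fakeStt: SpeechToText = {
  transcribe: async (audio) => `heard ${audio.length} samples`,
};

const echoAgent: Agent = {
  reply: async (history) => {
    const last = history[history.length - 1];
    return last ? `You said: ${last.content}` : "Hello!";
  },
};

const silentTts: TextToSpeech = {
  // Returns one silent sample per character, just to produce some audio.
  synthesize: async (text) => new Float32Array(text.length),
};
```

Because each stage depends only on an interface, a different provider can be substituted for any stub without touching `runTurn`, which is the kind of flexibility the list above describes.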
- Microphone handling with:
  - Streaming support
  - Voice Activity Detection (VAD)
- Advanced audio playback with:
  - Streaming support
  - Device selection and control
- WebSocket communication
- AI implementations provided for OpenAI, ElevenLabs, Gradium, Mistral, Gladia, and more
- Bring your own AI components (framework agnostic)
  - Large Language Models (LLM)
  - Text-to-Speech (TTS)
  - Speech-to-Text (STT)
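For readers unfamiliar with the Voice Activity Detection item in the feature list above, the following TypeScript sketch shows the basic idea using a simple energy threshold with a short hangover. It is a deliberately simplified stand-in: the `EnergyVad` class, the `rms` helper, and the default values are assumptions made for illustration, not how Micdrop or ricky0123/vad detect speech.

```typescript
// Root-mean-square energy of one audio frame (samples expected in [-1, 1]).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  frame.forEach((sample) => {
    sum += sample * sample;
  });
  return Math.sqrt(sum / frame.length);
}

// Hypothetical energy-based detector, for illustration only.
class EnergyVad {
  private threshold: number;
  private hangoverFrames: number;
  private speaking = false;
  private quietFrames = 0;

  // threshold: RMS level at or above which a frame counts as speech.
  // hangoverFrames: quiet frames tolerated before speech is considered over.
  constructor(threshold = 0.02, hangoverFrames = 3) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Feed one frame; returns true while speech is considered active.
  process(frame: Float32Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.speaking = true;
      this.quietFrames = 0;
    } else if (this.speaking) {
      this.quietFrames += 1;
      if (this.quietFrames > this.hangoverFrames) {
        this.speaking = false;
      }
    }
    return this.speaking;
  }
}
```

The hangover keeps short pauses between words from ending the utterance, so downstream speech-to-text calls are made only when the speaker has likely finished.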
For detailed development instructions, including how to build, test, and publish packages, please see DEVELOPMENT.md.
MIT License - see the LICENSE file for details
Originally developed for Raconte.ai and open sourced by Lonestone (GitHub)
Thanks to ricky0123/vad for their work on voice activity detection.
Alternative AI tools for micdrop
Similar Open Source Tools
refly
Refly.AI is an open-source AI-native creation engine that empowers users to transform ideas into production-ready content. It features a free-form canvas interface with multi-threaded conversations, knowledge base integration, contextual memory, intelligent search, WYSIWYG AI editor, and more. Users can leverage AI-powered capabilities, context memory, knowledge base integration, quotes, and AI document editing to enhance their content creation process. Refly offers both cloud and self-hosting options, making it suitable for individuals, enterprises, and organizations. The tool is designed to facilitate human-AI collaboration and streamline content creation workflows.
Genkit
Genkit is an open-source framework for building full-stack AI-powered applications, used in production by Google's Firebase. It provides SDKs for JavaScript/TypeScript (Stable), Go (Beta), and Python (Alpha) with a unified interface for integrating AI models from providers such as Google, OpenAI, Anthropic, and Ollama. Its streamlined APIs for multimodal content, structured outputs, tool calling, and agentic workflows help you rapidly build chatbots, automations, and recommendation systems. Genkit also offers text and image generation, structured data generation, prompt templating, persisted chat interfaces, AI workflows, and AI-powered data retrieval (RAG).
genkit
Firebase Genkit (beta) is a framework with powerful tooling to help app developers build, test, deploy, and monitor AI-powered features with confidence. Genkit is cloud optimized and code-centric, integrating with many services that have free tiers to get started. It provides a unified API for generation, context-aware AI features, evaluation of AI workflows, extensibility through plugins, easy deployment to Firebase or Google Cloud, observability and monitoring with OpenTelemetry, and a developer UI for prototyping and testing AI features locally. Genkit works seamlessly with Firebase or Google Cloud projects through official plugins and templates.
saga-reader
Saga Reader is an AI-driven think tank-style reader that automatically retrieves information from the internet based on user-specified topics and preferences. It uses cloud or local large models to summarize and provide guidance, and it includes an AI-driven interactive companion reading function, allowing you to discuss and exchange ideas with AI about the content you've read. Saga Reader is completely free and open-source, meaning all data is securely stored on your own computer and is not controlled by third-party service providers. Additionally, you can manage your subscription keywords based on your interests and preferences without being disturbed by advertisements and commercialized content.
ten_framework
TEN Framework, short for Transformative Extensions Network, is the world's first real-time multimodal AI agent framework. It offers native support for high-performance, real-time multimodal interactions, supports multiple languages and platforms, enables edge-cloud integration, provides flexibility beyond model limitations, and allows for real-time agent state management. The framework facilitates the development of complex AI applications that transcend the limitations of large models by offering a drag-and-drop programming approach. It is suitable for scenarios like simultaneous interpretation, speech-to-text conversion, multilingual chat rooms, audio interaction, and audio-visual interaction.
magic
Magic is an open-source all-in-one AI productivity platform designed to help enterprises quickly build and deploy AI applications, aiming for a 100x increase in productivity. It consists of various AI products and infrastructure tools, such as Super Magic, Magic IM, Magic Flow, and more. Super Magic is a general-purpose AI Agent for complex task scenarios, while Magic Flow is a visual AI workflow orchestration system. Magic IM is an enterprise-grade AI Agent conversation system for internal knowledge management. Teamshare OS is a collaborative office platform integrating AI capabilities. The platform provides cloud services, enterprise solutions, and a self-hosted community edition for users to leverage its features.
agent-zero
Agent Zero is a personal, organic agentic framework designed to be dynamic, transparent, customizable, and interactive. It uses the computer as a tool to accomplish tasks, with features like general-purpose assistant, computer as a tool, multi-agent cooperation, customizable and extensible framework, and communication skills. The tool is fully Dockerized, with Speech-to-Text and TTS capabilities, and offers real-world use cases like financial analysis, Excel automation, API integration, server monitoring, and project isolation. Agent Zero can be dangerous if not used properly and is prompt-based, guided by the prompts folder. The tool is extensively documented and has a changelog highlighting various updates and improvements.
languine
Languine is a CLI tool that helps developers streamline the localization process by providing AI-powered translations, automation features, and developer-centric design. It allows users to easily manage translation files, maintain consistency in tone and style, and save time by automating tasks. With support for over 100 languages and smart detection capabilities, Languine simplifies the localization workflow for developers.
agents
Inkeep Agents is a platform for building AI agents using a No-Code Visual Builder or TypeScript SDK. It allows technical and non-technical teams to create and manage agents collaboratively. The platform supports real-time AI chat assistants and AI workflow automation, with features like multi-agent architecture, MCP tools, UI component library, and observability via Traces UI & OpenTelemetry. Inkeep Agents is interoperable, allowing seamless editing and management of agents in both visual and code formats.
agno
Agno is a lightweight library for building multi-modal Agents. It is designed with core principles of simplicity, uncompromising performance, and agnosticism, allowing users to create blazing fast agents with minimal memory footprint. Agno supports any model, any provider, and any modality, making it a versatile container for AGI. Users can build agents with lightning-fast agent creation, model agnostic capabilities, native support for text, image, audio, and video inputs and outputs, memory management, knowledge stores, structured outputs, and real-time monitoring. The library enables users to create autonomous programs that use language models to solve problems, improve responses, and achieve tasks with varying levels of agency and autonomy.
Stellar-Chat
Stellar Chat is a multi-modal chat application that enables users to create custom agents and integrate with local language models and OpenAI models. It provides capabilities for generating images, visual recognition, text-to-speech, and speech-to-text functionalities. Users can engage in multimodal conversations, create custom agents, search messages and conversations, and integrate with various applications for enhanced productivity. The project is part of the '100 Commits' competition, challenging participants to make meaningful commits daily for 100 consecutive days.
nexent
Nexent is a powerful tool for analyzing and visualizing network traffic data. It provides comprehensive insights into network behavior, helping users to identify patterns, anomalies, and potential security threats. With its user-friendly interface and advanced features, Nexent is suitable for network administrators, cybersecurity professionals, and anyone looking to gain a deeper understanding of their network infrastructure.
Revornix
Revornix is an information management tool designed for the AI era. It allows users to conveniently integrate all visible information and generates comprehensive reports at specific times. The tool offers cross-platform availability, all-in-one content aggregation, document transformation & vectorized storage, native multi-tenancy, localization & open-source features, smart assistant & built-in MCP, seamless LLM integration, and multilingual & responsive experience for users.
neuro-san-studio
Neuro SAN Studio is an open-source library for building agent networks across various industries. It simplifies the development of collaborative AI systems by enabling users to create sophisticated multi-agent applications using declarative configuration files. The tool offers features like data-driven configuration, adaptive communication protocols, safe data handling, dynamic agent network designer, flexible tool integration, robust traceability, and cloud-agnostic deployment. It has been used in various use-cases such as automated generation of multi-agent configurations, airline policy assistance, banking operations, market analysis in consumer packaged goods, insurance claims processing, intranet knowledge management, retail operations, telco network support, therapy vignette supervision, and more.
For similar jobs
design-studio
Tiledesk Design Studio is an open-source, no-code development platform for creating chatbots and conversational apps. It offers a user-friendly, drag-and-drop interface with pre-ready actions and integrations. The platform combines the power of LLM/GPT AI with a flexible 'graph' approach for creating conversations and automations with ease. Users can automate customer conversations, prototype conversations, integrate ChatGPT, enhance user experience with multimedia, provide personalized product recommendations, set conditions, use random replies, connect to other tools like HubSpot CRM, integrate with WhatsApp, send emails, and seamlessly enhance existing setups.
telegram-llm
A Telegram LLM bot that allows users to deploy their own Telegram bot in 3 simple steps by creating a flow function, configuring access to the Telegram bot, and connecting to an LLM backend. Users need to sign into flows.network, have a bot token from Telegram, and an OpenAI API key. The bot can be customized with ChatGPT prompts and integrated with OpenAI and Telegram for various functionalities.
LogChat
LogChat is an open-source and free AI chat client that supports various chat models and technologies such as ChatGPT, iFlytek Spark (讯飞星火), DeepSeek, LLM, TTS, STT, and Live2D. The tool provides a user-friendly interface designed using Qt Creator and can be used on Windows systems without any additional environment requirements. Users can interact with different AI models, perform voice synthesis and recognition, and customize Live2D character models. LogChat also offers features like language translation, AI platform integration, and menu items like screenshot editing, clock, and application launcher.
AI-Agent-Starter-Kit
AI Agent Starter Kit is a modern full-stack AI-enabled template using Next.js for frontend and Express.js for backend, with Telegram and OpenAI integrations. It offers AI-assisted development, smart environment variable setup assistance, intelligent error resolution, context-aware code completion, and built-in debugging helpers. The kit provides a structured environment for developers to interact with AI tools seamlessly, enhancing the development process and productivity.
bolt-python-ai-chatbot
The 'bolt-python-ai-chatbot' is a Slack chatbot app template that allows users to integrate AI-powered conversations into their Slack workspace. Users can interact with the bot in conversations and threads, send direct messages for private interactions, use commands to communicate with the bot, customize bot responses, and store user preferences. The app supports integration with Workflow Builder, custom language models, and different AI providers like OpenAI, Anthropic, and Google Cloud Vertex AI. Users can create user objects, manage user states, and select from various AI models for communication.
MCPSpy
MCPSpy is a command-line tool leveraging eBPF technology to monitor Model Context Protocol (MCP) communication at the kernel level. It provides real-time visibility into JSON-RPC 2.0 messages exchanged between MCP clients and servers, supporting Stdio and HTTP transports. MCPSpy offers security analysis, debugging, performance monitoring, compliance assurance, and learning opportunities for understanding MCP communications. The tool consists of eBPF programs, an eBPF loader, an HTTP session manager, an MCP protocol parser, and output handlers for console display and JSONL output.
chatless
Chatless is a modern AI chat desktop application built on Tauri and Next.js. It supports multiple AI providers, can connect to local Ollama models, and offers document parsing and knowledge base functions. All data is stored locally to protect user privacy. The application is lightweight, simple, starts quickly, and consumes minimal resources.
Windows-MCP
Windows-MCP is a lightweight, open-source project that enables seamless integration between AI agents and the Windows operating system. Acting as an MCP server, it bridges the gap between LLMs and the Windows operating system, allowing agents to perform tasks such as file navigation, application control, UI interaction, QA testing, and more. It supports any LLM without relying on traditional computer vision techniques, offers a rich toolset for UI automation, is customizable and extendable, provides real-time interaction with low latency, includes a DOM mode for browser automation, and supports various tools for interacting with Windows applications and system components.
