gemini-multimodal-playground

Build realtime voice and video agents with Google's new Gemini 2.0 (API is free for now)

Stars: 167

Gemini Multimodal Playground is a basic Python app for voice conversations with Google's Gemini 2.0 AI model. It features real-time voice input and text-to-speech responses. Users can configure settings through the GUI and interact with Gemini by speaking into the microphone. The application provides options for voice selection, system prompt customization, and enabling Google search. Troubleshooting tips are available for handling audio feedback loop issues that may occur during interactions.

README:

Gemini Multimodal Playground ✨

A Python application for having voice and video conversations with Google's new Gemini 2.0 model. Features real-time voice and video input and audio responses. Available in two versions: a full-stack web application and a standalone Python script.

Full-Stack Version

https://github.com/user-attachments/assets/a81abaa5-2e70-42a9-857c-5ffbff22f821

Getting Your Gemini API Key

  1. Go to Google AI Studio
  2. Sign in with your Google account
  3. Click "Create API Key"
  4. Copy the generated API key and paste it into the appropriate .env file

Prerequisites

  1. Python 3.12 or higher
  2. Node.js 18 or higher
  3. A Google Cloud account
  4. A Gemini API key
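
As a quick sanity check of the first prerequisite, a small script along these lines can confirm the interpreter version before anything is installed. This is only a sketch: `meets_minimum` and `MINIMUM_PYTHON` are illustrative names, not part of the project.

```python
import sys

# Minimum version taken from the prerequisites list; the names here are illustrative.
MINIMUM_PYTHON = (3, 12)


def meets_minimum(version=None, minimum=MINIMUM_PYTHON):
    """Return True if `version` (default: the running interpreter) is new enough."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor) so patch releases do not matter.
    return tuple(version[:2]) >= tuple(minimum)


print("Python version OK" if meets_minimum() else "Python 3.12 or higher is required")
```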

Backend Setup

  1. Clone this repository
  2. Create a virtual environment and activate it:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install the required packages:
pip install -r requirements.txt
  4. Create a .env file in the root directory with your API key:
GEMINI_API_KEY=your_api_key_here
  5. Start the backend server:
python backend/main.py
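
If the backend fails to start, it can help to confirm that the key in the .env file is actually readable. The sketch below parses a simple KEY=value file using only the standard library. It is an assumption about the file format, not the project's own loading code (which may rely on a dotenv library instead), and `load_env_file` and `get_api_key` are hypothetical helper names.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder")
    return key
```

Calling `get_api_key()` from the directory that contains the .env file should return the key, or raise an error if the placeholder value was never replaced.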

Frontend Setup

  1. Navigate to the frontend directory:
cd frontend
  2. Install dependencies:
npm install
  3. Start the development server:
npm run dev
  4. Open http://localhost:3000 in your browser

Standalone Version

https://github.com/user-attachments/assets/82228033-fcfb-4730-9723-3ed09e1979a2

Prerequisites

Same as above, except that only the Python-related requirements are needed (Node.js is not required), plus Tkinter:

  • On Ubuntu/Debian: sudo apt-get install python3-tk
  • On Fedora: sudo dnf install python3-tkinter
  • On macOS & Windows: Usually already included with Python (for example, with the python.org installers)
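
To check whether Tkinter is visible to the interpreter you plan to use, a short script like the following can be run inside the virtual environment. It is a minimal sketch and `tkinter_available` is an illustrative name. Running `python -m tkinter` is another option; it opens a small test window when Tkinter is installed correctly.

```python
import importlib.util


def tkinter_available():
    """Return True if both the tkinter package and its _tkinter extension can be located."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("tkinter", "_tkinter")
    )


if tkinter_available():
    print("Tkinter found")
else:
    print("Tkinter missing: install it using the command for your platform above")
```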

Installation

  1. Clone this repository or download the standalone folder
  2. Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install the required packages:
pip install -r requirements.txt
  4. Create a .env file in the standalone directory with your API key:
GEMINI_API_KEY=your_api_key_here

Running the Standalone Application

  1. Make sure your virtual environment is activated
  2. Run the script:
python standalone.py

Configuration Options

Both versions provide several configuration options:

  • System Prompt: The initial instructions given to Gemini about its role and behavior
  • Voice: Choose from different voice options for Gemini's responses:
    • Puck
    • Charon
    • Kore
    • Fenrir
    • Aoede
  • Enable Google Search: Allows Gemini to search the internet for current information
  • Allow Interruptions: Enables interrupting Gemini while it's speaking
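
The options above can be pictured as a small settings object. The sketch below is illustrative only: the class name `PlaygroundConfig`, the field names, and the default values are assumptions and do not come from the project's code. Only the list of voices is taken from this README.

```python
from dataclasses import dataclass

# Voice names as listed in this README.
VOICES = ("Puck", "Charon", "Kore", "Fenrir", "Aoede")


@dataclass
class PlaygroundConfig:  # hypothetical name; defaults below are assumptions
    system_prompt: str = "You are a helpful assistant."
    voice: str = "Puck"
    google_search: bool = False
    allow_interruptions: bool = True

    def __post_init__(self):
        # Reject voices that are not in the documented list.
        if self.voice not in VOICES:
            raise ValueError(
                f"Unknown voice {self.voice!r}; expected one of {', '.join(VOICES)}"
            )


config = PlaygroundConfig(voice="Kore", google_search=True)
print(config)
```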

Troubleshooting

  • Audio feedback loop issue - Gemini may interrupt itself when it detects its own voice output through your microphone. This occurs because the application processes all incoming audio, including Gemini's responses. To prevent this feedback loop, either:
    1. Disable the "Allow Interruptions" option in settings
    2. Use headphones/earphones to prevent your microphone from picking up Gemini's audio output
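
One way to picture why disabling "Allow Interruptions" breaks the loop: if microphone audio is not forwarded while Gemini is speaking, the model never hears its own output. The sketch below illustrates that idea with a hypothetical `MicGate` helper. It is not the application's actual implementation, which may handle this differently.

```python
class MicGate:
    """Hypothetical helper that drops microphone chunks while the model's audio is playing."""

    def __init__(self, allow_interruptions=True):
        self.allow_interruptions = allow_interruptions
        self.model_speaking = False  # set True while response audio is playing

    def should_forward(self):
        # With interruptions enabled, everything is forwarded (feedback is possible).
        # With interruptions disabled, mic audio is dropped while the model speaks.
        return self.allow_interruptions or not self.model_speaking

    def filter(self, chunk):
        """Return the chunk if it should be sent to the model, otherwise None."""
        return chunk if self.should_forward() else None


gate = MicGate(allow_interruptions=False)
gate.model_speaking = True
print(gate.filter(b"\x00\x01"))  # None: the chunk is dropped while the model speaks
gate.model_speaking = False
print(gate.filter(b"\x00\x01"))  # the chunk is forwarded again
```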
