multimodal-chat
A multimodal chat interface with many tools.
Stars: 70
Yet Another Chatbot is a sophisticated multimodal chat interface powered by advanced AI models and equipped with a variety of tools. This chatbot can search and browse the web in real-time, query Wikipedia for information, perform news and map searches, execute Python code, compose long-form articles mixing text and images, generate, search, and compare images, analyze documents and images, search and download arXiv papers, save conversations as text and audio files, manage checklists, and track personal improvements. It offers tools for web interaction, Wikipedia search, Python scripting, content management, image handling, arXiv integration, conversation generation, file management, personal improvement, and checklist management.
README:
A multimodal chat interface with access to many tools.
YAIA is a sophisticated multimodal chat interface powered by advanced AI models and equipped with a variety of tools. It can:
- Search and browse the web in real-time
- Query Wikipedia for information
- Perform news and map searches
- Safely execute Python code that can produce text and images such as charts and diagrams
- Compose long-form articles mixing text and images
- Generate, search, and compare images
- Analyze documents and images
- Search and download arXiv papers
- Generate and save conversations as text and audio files
- Save files to the output directory
- Track personal improvements
- Manage checklists for task tracking
Here are examples of how to use various tools:
- Web Search: "Search the web for recent advancements in quantum computing."
- Wikipedia: "Find Wikipedia articles about the history of artificial intelligence."
- Python Scripting: "Create a Python script to generate a bar chart of global CO2 emissions by country."
- Sketchbook: "Start a new sketchbook and write an introduction about how to compute Pi with numerical methods."
- Image Generation: "Generate an image of a futuristic city with flying cars and tall skyscrapers."
- Image Search: "Search the image catalog for pictures of endangered species."
- arXiv Integration: "Search for recent research papers on deep learning in natural language processing."
- Conversation Generation: "Create a conversation between three experts discussing how to set up multimodal RAG."
- File Management: "Save a summary of our discussion about climate change to a file named 'climate_change_summary.txt'."
- Personal Improvement: "Here's a suggestion: to improve answers, search for official sources."
- Checklist: "Start a new checklist for a project on implementing a machine learning model."
- Web Interaction:
  - DuckDuckGo Text Search: Performs web searches
  - DuckDuckGo News Search: Searches for recent news articles
  - DuckDuckGo Maps Search: Searches for locations and businesses
  - DuckDuckGo Images Search: Searches for publicly available images
  - Web Browser: Browses websites and retrieves their content
- Wikipedia Tools:
  - Wikipedia Search: Finds relevant Wikipedia pages
  - Wikipedia Geodata Search: Locates Wikipedia articles by geographic location
  - Wikipedia Page Retriever: Fetches full Wikipedia page content
- Python Scripting:
  - Runs Python scripts for computations, testing, and output generation, including text and images
  - Python modules can be added to the Python interpreter
  - Python code runs in a secure environment provided by AWS Lambda
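The Lambda-backed interpreter can be pictured as wrapping a script into an event payload, invoking the function, and reading the captured output back from the response. The sketch below is illustrative only: the function name and payload field names are assumptions, not the project's actual code.

```python
import json

# Hypothetical payload shape for the code-execution Lambda; the real field
# names live in LambdaFunction/ and may differ.
def build_event(script: str) -> dict:
    return {"input_script": script}

def parse_result(response_payload: bytes) -> str:
    # The function is assumed to return JSON with an "output" field
    # containing the captured stdout of the script.
    body = json.loads(response_payload)
    return body.get("output", "")

# With boto3, the invocation would look roughly like:
#   lambda_client = boto3.client("lambda")
#   resp = lambda_client.invoke(
#       FunctionName="multimodal-chat-code-execution",  # assumed name
#       Payload=json.dumps(build_event("print(2 + 2)")),
#   )
#   print(parse_result(resp["Payload"].read()))
```

Running untrusted, AI-generated code in a Lambda function with minimal permissions keeps it isolated from the machine hosting the chat interface.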
- Content Management:
  - Personal Archive: Stores and retrieves text, Markdown, or HTML content, using a semantic database
  - Sketchbook: Manages a multi-page sketchbook for writing and reviewing long-form content. Supports multiple output formats:
    - Markdown (.md): For easy reading and editing
    - Word Document (.docx): For document editing
- Image Handling:
  - Image Generation: Creates images based on text prompts
  - Image Catalog Search: Searches images by description
  - Image Similarity Search: Finds similar images based on a reference image
  - Random Images: Retrieves random images from the catalog
  - Get Image by ID: Retrieves a specific image from the catalog using its ID
  - Image Catalog Count: Returns the total number of images in the catalog
  - Download Image: Adds images from URLs to the catalog
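Similarity search of this kind typically compares embedding vectors: each image gets a vector, and the closest vectors are the most similar images. A minimal, self-contained sketch of the idea (not the project's code, which presumably delegates this to its OpenSearch multimodal index):

```python
import math

# Conceptual sketch of embedding-based similarity search.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(query: list[float], catalog: dict[str, list[float]]) -> str:
    # Return the catalog ID whose embedding is closest to the query embedding.
    return max(catalog, key=lambda image_id: cosine_similarity(query, catalog[image_id]))
```

The same mechanism serves both text-to-image search (embed the description) and image-to-image search (embed the reference image).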
- arXiv Integration:
  - Search and download arXiv papers
  - Store paper content in the archive for easy retrieval
- Conversation Generation:
  - Transform content into a conversation between three people
  - Generate audio files for the conversation using text-to-speech
- File Management:
  - Save File: Saves text content to a file with a specified name in the output directory
- Personal Improvement:
  - Track suggestions and mistakes for future enhancements
- Checklist:
  - Manage task lists with the ability to add items, mark them as completed, and review progress
For a comprehensive list of available tools and their usage, refer to ./Config/tools.json.
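The exact schema of that file isn't shown here, so the following is only a guess at what one entry could look like, modeled on the toolSpec/inputSchema structure that Amazon Bedrock's Converse API uses for tool definitions; the tool name and fields are placeholders:

```python
import json

# Hypothetical tool definition; the actual schema in ./Config/tools.json
# may differ. Bedrock Converse tool specs follow this general shape.
tool_spec = {
    "toolSpec": {
        "name": "wikipedia_search",
        "description": "Find relevant Wikipedia pages for a query.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms"}
                },
                "required": ["query"],
            }
        },
    }
}

print(json.dumps(tool_spec, indent=2))
```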
- A container tool: Docker or Finch (to install Finch, follow the instructions here)
- Python 3.12 or newer
- AWS account with appropriate permissions to access Amazon Bedrock, AWS Lambda, and Amazon ECR
- Clone the repository:
  git clone https://github.com/danilop/multimodal-chat
  cd multimodal-chat
- Create and activate a virtual environment (optional but recommended):
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required packages:
  pip install -r requirements.txt
- Set up the AWS Lambda function for code execution:
  cd LambdaFunction
  ./deploy_lambda_function.sh
  cd ..
- To use Selenium for web browsing, install ChromeDriver. Using Homebrew:
  brew install --cask chromedriver
- To output audio, install ffmpeg. Using Homebrew:
  brew install ffmpeg
You can either use a local OpenSearch instance or connect to a remote server. For local setup:
- Navigate to the OpenSearch directory:
  cd OpenSearch/
- Set the admin password (first-time setup). This step creates the .env and opensearch_env.sh files:
  ./set_password.sh
- Start OpenSearch locally (it needs access to the .env file):
  ./opensearch_start.sh
- Ensure OpenSearch (2 nodes + dashboard) starts correctly by checking the output
- To update OpenSearch, download the new container images using this script:
  ./opensearch_update.sh
For a remote server setup, update the client creation code in the main script.
To change the password, delete the container using Finch or Docker, then set a new password.
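As a hedged illustration of what that client creation change could involve (assuming the opensearch-py client; the host, port, user, and environment variable name below are placeholders, not values from the project):

```python
import os

# Hypothetical helper: build the arguments for opensearchpy.OpenSearch(...)
# when pointing at a remote cluster instead of the local containers.
def remote_client_kwargs(host: str) -> dict:
    # Assumed to be exported by the shell, analogous to what
    # OpenSearch/opensearch_env.sh does for the local setup.
    password = os.environ.get("OPENSEARCH_PASSWORD", "admin")
    return {
        "hosts": [{"host": host, "port": 443}],
        "http_auth": ("admin", password),
        "use_ssl": True,
        "verify_certs": True,
    }

# Then, in the main script:
#   client = OpenSearch(**remote_client_kwargs("search.example.com"))
```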
Default models for text, images, and embeddings are in the Config/config.ini file. The models to use are specified using Amazon Bedrock model IDs or cross-region inference profile IDs. You need permissions and access to these models as described in "Access foundation models" in the Amazon Bedrock documentation.
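For illustration only, such a configuration could look like the sketch below. The section and key names are assumptions (check the repo's actual Config/config.ini), but the values show the two ID styles: a cross-region inference profile ID and plain Bedrock model IDs.

```ini
; Hypothetical sketch -- section and key names are assumptions.
[models]
text_model = us.anthropic.claude-3-5-sonnet-20240620-v1:0
image_generation_model = amazon.titan-image-generator-v1
text_embedding_model = amazon.titan-embed-text-v2:0
multimodal_embedding_model = amazon.titan-embed-image-v1
```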
This section assumes OpenSearch is running locally in another terminal window as described before.
- Load the OpenSearch admin password into the environment:
  source OpenSearch/opensearch_env.sh
- Run the application:
  python multimodal_chat.py
- To reset the text and multimodal indexes (note: this doesn't delete images in ./Images/):
  python multimodal_chat.py --reset-index
- Open a web browser and navigate to http://127.0.0.1:7860/ to start chatting.
Here are a few examples of what you can do with this application.
In this demo:
- Browse websites using Selenium and specific tools for DuckDuckGo (search, news, geosearch) and Wikipedia
- Use the semantic text archive tool to archive documents and retrieve by keywords
In this demo:
- Use a multimodal index and the local file system to manage an image catalog
- Store images with a generated description
- Retrieve images by text description (semantic search)
- Retrieve images by similarity to another image
- Retrieve random images
In this demo:
- Generate images from a textual description
- The text-to-image prompt is generated from chat instructions
- This approach allows the overall conversation to improve the prompt
In this demo:
- Run AI-generated code to solve problems
- For security, code runs in an AWS Lambda function with basic permissions
- Deployed via a container image so Python modules can easily be added
- Python only, but easily extensible
In this demo:
- A tool to help write long-form text such as articles and blog posts
- Provides sequential access to text split into pages
- Mitigates the "asymmetry" between a model's input and output sizes
In this demo:
- The best results come from using more than one tool together
- Start with a sketchbook to write a long article
- The article contains code snippets
- A review pass runs and tests all code snippets, updating each page to fix the code (if needed) and add actual results
- If you encounter issues with OpenSearch, check the connection settings and ensure the service is running
- For AWS Lambda function errors, verify your AWS credentials and permissions
- If image processing fails, ensure you have the necessary libraries installed and check file permissions
Contributions to YAIA are welcome! Please refer to the contributing guidelines for more information on how to submit pull requests, report issues, or request features.
This project is licensed under the MIT License. See the LICENSE file for details.
- Combine multiple tools for complex tasks. For example, use the web search to find information, then use the sketchbook to write a summary, and finally generate a conversation about the topic.
- When working with images, you can generate new images, search for existing ones, or download images from the web to add to your catalog.
- Use the arXiv integration to stay up-to-date with the latest research in your field of interest.
- The conversation generation tool is great for creating engaging content or preparing for presentations.
- Regularly check and update your personal improvements to track your progress and areas for growth.
For more detailed information on specific components or advanced usage, please refer to the inline documentation in the source code.