ROSGPT_Vision

Commanding robots using only Language Models' prompts

Stars: 74

Visit

ROSGPT_Vision is a new robotic framework designed to command robots using only two prompts: a Visual Prompt for visual semantic features and an LLM Prompt to regulate robotic reactions. It is based on the Prompting Robotic Modalities (PRM) design pattern and is used to develop CarMate, a robotic application for monitoring driver distractions and providing real-time vocal notifications. The framework leverages state-of-the-art language models to facilitate advanced reasoning about image data and offers a unified platform for robots to perceive, interpret, and interact with visual data through natural language. LangChain is used for easy customization of prompts, and the implementation includes the CarMate application for driver monitoring and assistance.

README:

ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts

Bilel Benjdira, Anis Koubaa and Anas M. Ali

Robotics and Internet of Things Lab (RIOTU Lab), Prince Sultan University, Saudi Arabia

Inspired by ROSGPT. Both projects aim to bridge the gap between robotics, natural language understanding, and image analysis.

Collaborators who want to participate in this project, are very welcome.

ROSGPT_Vision is a new robotic framework dsigned to command robots using only two prompts:
- a Visual Prompt (for visual semantic features), and
- an LLM Prompt (to regulate robotic reactions).
It is based on a new robotic design pattern: Prompting Robotic Modalities (PRM).
ROSGPT_Vision is used to develop CarMate, a robotic application for monitoring driver distractions and providing real-time vocal notifications. It showcases cost-effective development.
We demonstrated how to optimize the prompting strategies to improve the application.
LangChain framework is used by to easily customize prompts.
More details are described in the academic paper "ROSGPT_Vision: Commanding Robots using only Language Models' Prompts".

Video Demo

An illustrative video demonstration of ROSGPT_Vision is provided:

Overview
ROSGPT_Vision diagram
Prompting Robotic Modalities (PRM) Design Pattern
CarMate Application
Installation
Usage
Citation
License
Acknowledgement
Contribute

Overview

ROSGPT_Vision offers a unified platform that allows robots to perceive, interpret, and interact with visual data through natural language. The framework leverages state-of-the-art language models, including LLAVA, MiniGPT-4, and Caption-Anything, to facilitate advanced reasoning about image data. LangChain is used for easy customization of the prompts. The provided implementation includes the CarMate application, a driver monitoring and assistance system designed to ensure safe and efficient driving experiences.

ROSGPT_Vision diagram

Prompting Robotic Modalities (PRM) Design Pattern

A new design approach emphasizing modular and individualized sensory queries.
Uses specific Modality Language Models (MLM) for textual interpretations of inputs, like the Vision Language Model (VLM) for visual data.
Ensures precise data collection by treating each sensory input separately.
Task Modality's Role: Serves as the central coordinator, synthesizing data from various modalities.

** for more information go to

CarMate Application

CarMate is a complete application for monitoring driver behavior which was developed just by setting two prompts in the YAML file. It automatically analyses the input video using the Visual prompt, analyses what should be done using the LLM prompt, and gives an instant alert to the driver when needed.

These are the prompts used to develop the application, without needing extra code:

The Visual prompt:

Visual prompt: "Describe the driver’s current level of focus 
on driving based on the visual cues, Answer with one short sentence."

The LLM prompt:

LLM prompt:"Consider the following ontology: You must write your Reply 
with one short sentence. Behave as a carmate that surveys the driver 
and gives him advice and instruction to drive safely. You will be given 
human language prompts describing an image. Your task is to provide 
	appropriate instructions to the driver based on the description."

We can see three examples of scenarios, got during the driving:

Scenario 1: The driver is using phone

We can see in the top box the description generated by the image semantics module for the input image using the Visual prompt. Meanwhile, the second box generates the alert that should be given to the driver using the LLM prompt.

Scenario 2: The driver is taking pictures

Scenario 3: The driver is drinking

Installation

To use ROSGPT_Vision, follow these steps:

1. Prepare the code and the environment

Git clone our repository, creating a python environment and ativate it via the following command

  git clone https://github.com/bilel-bj/ROSGPT_Vision.git
  cd ROSGPT_Vision
  git clone https://github.com/Vision-CAIR/MiniGPT-4.git
  git clone https://github.com/haotian-liu/LLaVA.git
  conda env create -f environment.yml
  conda activate ROSGPT_Vision

2. Install the required dependencies

You can run image_semantics.py by install all required dependencies from LLAVA, MiniGPT-4 and Caption-Anything.
Ensure the installation of all requisite dependencies for ROS2.

Usage

To regulate all parameters associated with ROSGPT_Vision, modifications can be made within the corresponding .yaml file.

The YAML contains 6 main sections of configurations parameters:

Task_name: This field specifies the name of the task that the ROS system is configured to perform.
ROSGPT_Vision_Camera_Node: This section contains the configuration for the ROSGPT_Vision_Camera_Node.
Image_Description_Method: This field specifies the method used by the node to generate descriptions from images. It can be one of the currently developed methods: MiniGPT4, LLaVA, or SAM. The configurations needed for everyone of them is put separately at the end of this file.
Vision_prompt: This field specifies the prompt used to guide the image description process.
Output_video: This field specifies the path or the name of where to save the output video file.
GPT_Consultation_Node: This section contains the configuration for the GPT_Consultation_Node.
- llm_prompt: This field specifies the prompt used to guide the language model.
- GPT_temperature: This field specifies the temperature parameter for the GPT model, which controls the randomness of the model's output.
MiniGPT4_parameters: This section contains the configuration for the MiniGPT4 model. It should be clearly set if the model is used in this task, otherwise it could be empty.
- configuration: This field specifies the path for the configuration file of MiniGPT4.
- temperature_miniGPT4: This field specifies the temperature parameter for the MiniGPT4 model.
llava_parameters: This section contains the configuration for the llavA model (if used).
- temperature_llavA: This field specifies the temperature parameter for the llavA model.
SAM_parameters: This section contains the configuration for the SAM model.
- weights_SAM: This field specifies the weights used by the SAM model.

Run in Terminal local machine

run first terminal :

        colcon build --packages-select rosgpt_vision
		    source install/setup.bash
		    python3 src/rosgpt_vision/rosgpt_vision/rosgpt_vision_node_web_cam.py
		    python3 src/rosgpt_vision/rosgpt_vision/ROSGPT_Vision_Camera_Node.py /home/anas/ros2_ws/src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml

run second terminal:

        colcon build --packages-select rosgpt_vision 
		    source install/setup.bash
		    python3 src/rosgpt_vision/rosgpt_vision/ROSGPT_Vision_GPT_Consultation_Node.py /home/anas/ros2_ws/src/rosgpt_vision/rosgpt_vision/cfg/driver_phone_usage.yaml

run third terminal:

bash ros2 topic echo /Image_Description

run fourth terminal:

bash ros2 topic echo /GPT_Consultation

Citation

@misc{benjdira2023rosgptvision,
  title={ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts}, 
  author={Bilel Benjdira and Anis Koubaa and Anas M. Ali},
  year={2023},
  eprint={2308.11236},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
  }

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License. You are free to use, share, and adapt this material for non-commercial purposes, as long as you provide attribution to the original author(s) and the source.

Acknowledgement

The codes are based on ROSGPT, LLAVA, MiniGPT-4, Caption-Anything and SAM. Please also follow their licenses. Thanks for their awesome works.

Contribute

As this project is still under progress, contributions are welcome! To contribute, please follow these steps:

Fork the repository on GitHub.
Create a new branch for your feature or bugfix.
Commit your changes and push them to your fork.
Create a pull request to the main repository.

Before submitting your pull request, please ensure that your changes do not break the build and adhere to the project's coding style.

For any questions or suggestions, please open an issue on the GitHub issue tracker.

For Tasks:

Click tags to check more tools for each tasks

monitor driver distractions provide real-time notifications analyze driver behavior generate image descriptions interact with visual data

For Jobs:

robotic engineer artificial intelligence researcher software developer data scientist machine learning engineer

Alternative AI tools for ROSGPT_Vision

Similar Open Source Tools

ROSGPT_Vision

github

: 74

open-webui-tools

Open WebUI Tools Collection is a set of tools for structured planning, arXiv paper search, Hugging Face text-to-image generation, prompt enhancement, and multi-model conversations. It enhances LLM interactions with academic research, image generation, and conversation management. Tools include arXiv Search Tool and Hugging Face Image Generator. Function Pipes like Planner Agent offer autonomous plan generation and execution. Filters like Prompt Enhancer improve prompt quality. Installation and configuration instructions are provided for each tool and pipe.

github

: 131

greenmask

Greenmask is a powerful open-source utility designed for logical database backup dumping, anonymization, synthetic data generation, and restoration. It is highly customizable, stateless, and backward-compatible with existing PostgreSQL utilities. Greenmask supports advanced subset systems, deterministic transformers, dynamic parameters, transformation conditions, and more. It is cross-platform, database type safe, extensible, and supports parallel execution and various storage options. Ideal for backup and restoration tasks, anonymization, transformation, and data masking.

github

: 1.3k

Advanced-Prompt-Generator

This project is an LLM-based Advanced Prompt Generator designed to automate the process of prompt engineering by enhancing given input prompts using large language models (LLMs). The tool can generate advanced prompts with minimal user input, leveraging LLM agents for optimized prompt generation. It supports gpt-4o or gpt-4o-mini, offers FastAPI & Docker deployment for efficiency, provides a Gradio interface for easy testing, and is hosted on Hugging Face Spaces for quick demos. Users can expand model support to offer more variety and flexibility.

github

: 85

ai-notes

Notes on AI state of the art, with a focus on generative and large language models. These are the "raw materials" for the https://lspace.swyx.io/ newsletter. This repo used to be called https://github.com/sw-yx/prompt-eng, but was renamed because Prompt Engineering is Overhyped. This is now an AI Engineering notes repo.

github

: 5.1k

agent-zero

Agent Zero is a personal and organic AI framework designed to be dynamic, organically growing, and learning as you use it. It is fully transparent, readable, comprehensible, customizable, and interactive. The framework uses the computer as a tool to accomplish tasks, with no single-purpose tools pre-programmed. It emphasizes multi-agent cooperation, complete customization, and extensibility. Communication is key in this framework, allowing users to give proper system prompts and instructions to achieve desired outcomes. Agent Zero is capable of dangerous actions and should be run in an isolated environment. The framework is prompt-based, highly customizable, and requires a specific environment to run effectively.

github

: 5.4k

persian-license-plate-recognition

The Persian License Plate Recognition (PLPR) system is a state-of-the-art solution designed for detecting and recognizing Persian license plates in images and video streams. Leveraging advanced deep learning models and a user-friendly interface, it ensures reliable performance across different scenarios. The system offers advanced detection using YOLOv5 models, precise recognition of Persian characters, real-time processing capabilities, and a user-friendly GUI. It is well-suited for applications in traffic monitoring, automated vehicle identification, and similar fields. The system's architecture includes modules for resident management, entrance management, and a detailed flowchart explaining the process from system initialization to displaying results in the GUI. Hardware requirements include an Intel Core i5 processor, 8 GB RAM, a dedicated GPU with at least 4 GB VRAM, and an SSD with 20 GB of free space. The system can be installed by cloning the repository and installing required Python packages. Users can customize the video source for processing and run the application to upload and process images or video streams. The system's GUI allows for parameter adjustments to optimize performance, and the Wiki provides in-depth information on the system's architecture and model training.

github

: 345

peridyno

PeriDyno is a CUDA-based, highly parallel physics engine targeted at providing real-time simulation of physical environments for intelligent agents. It is designed to be easy to use and integrate into existing projects, and it provides a wide range of features for simulating a variety of physical phenomena. PeriDyno is open source and available under the Apache 2.0 license.

github

: 254

KG-LLM-MDQA

This repository contains code and demo for Knowledge Graph Prompting for Multi-Document Question Answering. It includes modules for data collection, training DPR and MDR models, fine-tuning T5 and LLaMA, and reproducing KGP-LLM algorithm. The workflow involves document collection, knowledge graph construction, fine-tuning models, and reproducing main table results. The repository provides instructions for environment setup, folder architecture, and running different modules.

github

: 290

AiTextDetectionBypass

ParaGenie is a script designed to automate the process of paraphrasing articles using the undetectable.ai platform. It allows users to convert lengthy content into unique paraphrased versions by splitting the input text into manageable chunks and processing each chunk individually. The script offers features such as automated paraphrasing, multi-file support for TXT, DOCX, and PDF formats, customizable chunk splitting methods, Gmail-based registration for seamless paraphrasing, purpose-specific writing support, readability level customization, anonymity features for user privacy, error handling and recovery, and output management for easy access and organization of paraphrased content.

github

: 57

OpenCAGE

OpenCAGE is an open-source modding toolkit for Alien: Isolation, enabling custom scripting, configuration, and content modification through graphical interfaces. It includes tools for editing assets, configurations, scripts, behaviour trees, launching the game, and managing backups. The project is constantly evolving with a roadmap that includes features like contextual script editing, content porter, new level creator, mod installers, 3D viewer improvements, navmesh generation, skinned meshes support, sound import/export, and more. OpenCAGE is supported financially by the community and welcomes code contributions.

github

: 278

postgresml

PostgresML is a powerful Postgres extension that seamlessly combines data storage and machine learning inference within your database. It enables running machine learning and AI operations directly within PostgreSQL, leveraging GPU acceleration for faster computations, integrating state-of-the-art large language models, providing built-in functions for text processing, enabling efficient similarity search, offering diverse ML algorithms, ensuring high performance, scalability, and security, supporting a wide range of NLP tasks, and seamlessly integrating with existing PostgreSQL tools and client libraries.

github

: 6.1k

Controllable-RAG-Agent

This repository contains a sophisticated deterministic graph-based solution for answering complex questions using a controllable autonomous agent. The solution is designed to ensure that answers are solely based on the provided data, avoiding hallucinations. It involves various steps such as PDF loading, text preprocessing, summarization, database creation, encoding, and utilizing large language models. The algorithm follows a detailed workflow involving planning, retrieval, answering, replanning, content distillation, and performance evaluation. Heuristics and techniques implemented focus on content encoding, anonymizing questions, task breakdown, content distillation, chain of thought answering, verification, and model performance evaluation.

github

: 951

prompty

Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers. The primary goal is to accelerate the developer inner loop. This repository contains the Prompty Language Specification and a documentation site. The Visual Studio Code extension offers a prompt playground to streamline the prompt engineering process.

github

: 820

ai-chat-android

AI Chat Android demonstrates Google's Generative AI on Android with Firebase Realtime Database. It showcases Gemini API integration, Jetpack Compose UI elements, Android architecture components with Hilt, Kotlin Coroutines for background tasks, and Firebase Realtime Database integration for real-time events. The project follows Google's official architecture guidance with a modularized structure for reusability, parallel building, and decentralized focusing.

github

: 88

XLearning

XLearning is a scheduling platform for big data and artificial intelligence, supporting various machine learning and deep learning frameworks. It runs on Hadoop Yarn and integrates frameworks like TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, XGBoost. XLearning offers scalability, compatibility, multiple deep learning framework support, unified data management based on HDFS, visualization display, and compatibility with code at native frameworks. It provides functions for data input/output strategies, container management, TensorBoard service, and resource usage metrics display. XLearning requires JDK >= 1.7 and Maven >= 3.3 for compilation, and deployment on CentOS 7.2 with Java >= 1.7 and Hadoop 2.6, 2.7, 2.8.

github

: 1.7k

For similar tasks

ROSGPT_Vision

github

: 74

ComfyUI-Ollama-Describer

ComfyUI-Ollama-Describer is an extension for ComfyUI that enables the use of LLM models provided by Ollama, such as Gemma, Llava (multimodal), Llama2, Llama3, or Mistral. It requires the Ollama library for interacting with large-scale language models, supporting GPUs using CUDA and AMD GPUs on Windows, Linux, and Mac. The extension allows users to run Ollama through Docker and utilize NVIDIA GPUs for faster processing. It provides nodes for image description, text description, image captioning, and text transformation, with various customizable parameters for model selection, API communication, response generation, and model memory management.

github

: 70

For similar jobs

Detection-and-Classification-of-Alzheimers-Disease

This tool is designed to detect and classify Alzheimer's Disease using Deep Learning and Machine Learning algorithms on an early basis, which is further optimized using the Crow Search Algorithm (CSA). Alzheimer's is a fatal disease, and early detection is crucial for patients to predetermine their condition and prevent its progression. By analyzing MRI scanned images using Artificial Intelligence technology, this tool can classify patients who may or may not develop AD in the future. The CSA algorithm, combined with ML algorithms, has proven to be the most effective approach for this purpose.

github

: 68

Co-LLM-Agents

This repository contains code for building cooperative embodied agents modularly with large language models. The agents are trained to perform tasks in two different environments: ThreeDWorld Multi-Agent Transport (TDW-MAT) and Communicative Watch-And-Help (C-WAH). TDW-MAT is a multi-agent environment where agents must transport objects to a goal position using containers. C-WAH is an extension of the Watch-And-Help challenge, which enables agents to send messages to each other. The code in this repository can be used to train agents to perform tasks in both of these environments.

github

: 202

awesome-synthetic-datasets

This repository focuses on organizing resources for building synthetic datasets using large language models. It covers important datasets, libraries, tools, tutorials, and papers related to synthetic data generation. The goal is to provide pragmatic and practical resources for individuals interested in creating synthetic datasets for machine learning applications.

github

: 169

ai-devices

AI Devices Template is a project that serves as an AI-powered voice assistant utilizing various AI models and services to provide intelligent responses to user queries. It supports voice input, transcription, text-to-speech, image processing, and function calling with conditionally rendered UI components. The project includes customizable UI settings, optional rate limiting using Upstash, and optional tracing with Langchain's LangSmith for function execution. Users can clone the repository, install dependencies, add API keys, start the development server, and deploy the application. Configuration settings can be modified in `app/config.tsx` to adjust settings and configurations for the AI-powered voice assistant.

github

: 116

ROSGPT_Vision

github

: 74

AIBotPublic

AIBotPublic is an open-source version of AIBotPro, a comprehensive AI tool that provides various features such as knowledge base construction, AI drawing, API hosting, and more. It supports custom plugins and parallel processing of multiple files. The tool is built using bootstrap4 for the frontend, .NET6.0 for the backend, and utilizes technologies like SqlServer, Redis, and Milvus for database and vector database functionalities. It integrates third-party dependencies like Baidu AI OCR, Milvus C# SDK, Google Search, and more to enhance its capabilities.

github

: 199

LLMGA

LLMGA (Multimodal Large Language Model-based Generation Assistant) is a tool that leverages Large Language Models (LLMs) to assist users in image generation and editing. It provides detailed language generation prompts for precise control over Stable Diffusion (SD), resulting in more intricate and precise content in generated images. The tool curates a dataset for prompt refinement, similar image generation, inpainting & outpainting, and visual question answering. It offers a two-stage training scheme to optimize SD alignment and a reference-based restoration network to alleviate texture, brightness, and contrast disparities in image editing. LLMGA shows promising generative capabilities and enables wider applications in an interactive manner.

github

: 305

MetaAgent

MetaAgent is a multi-agent collaboration platform designed to build, manage, and deploy multi-modal AI agents without the need for coding. Users can easily create AI agents by editing a yml file or using the provided UI. The platform supports features such as building LLM-based AI agents, multi-modal interactions with users using texts, audios, images, and videos, creating a company of agents for complex tasks like drawing comics, vector database and knowledge embeddings, and upcoming features like UI for creating and using AI agents, fine-tuning, and RLHF. The tool simplifies the process of creating and deploying AI agents for various tasks.

github

: 58

ROSGPT_Vision

README:

ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts

Video Demo

Table of Contents

Overview

ROSGPT_Vision diagram

Prompting Robotic Modalities (PRM) Design Pattern

CarMate Application

Scenario 1: The driver is using phone

Scenario 2: The driver is taking pictures

Scenario 3: The driver is drinking

Installation

To use ROSGPT_Vision, follow these steps:

Usage

Citation

License

Acknowledgement

Contribute

For Tasks:

For Jobs:

Alternative AI tools for ROSGPT_Vision

Similar Open Source Tools

ROSGPT_Vision

open-webui-tools

greenmask

Advanced-Prompt-Generator

ai-notes

agent-zero

persian-license-plate-recognition

peridyno

KG-LLM-MDQA

AiTextDetectionBypass

OpenCAGE

postgresml

Controllable-RAG-Agent

prompty

ai-chat-android

XLearning

For similar tasks

ROSGPT_Vision

ComfyUI-Ollama-Describer

For similar jobs

Detection-and-Classification-of-Alzheimers-Disease

Co-LLM-Agents

awesome-synthetic-datasets

ai-devices

ROSGPT_Vision

AIBotPublic

LLMGA

MetaAgent