Grounded_3D-LLM
Code&Data for Grounded 3D-LLM with Referent Tokens
Grounded 3D-LLM is a unified generative framework that utilizes referent tokens to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It transforms 3D vision tasks into language formats through task-specific prompts, curating grounded language datasets and employing Contrastive Language-Scene Pre-training (CLASP) to bridge the gap between 3D vision and language models. The model covers tasks like 3D visual question answering, dense captioning, object detection, and language grounding.
README:
This repository contains the official implementation of "Grounded 3D-LLM with Referent Tokens".
Prior studies on 3D scene comprehension have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (LMMs) to consolidate various 3D visual tasks within a unified generative framework. The model utilizes a series of "referent tokens" to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data arbitrarily. 3D vision tasks are naturally transformed into language formats through task-specific prompts. To effectively associate the scene with text, we curate the grounded language datasets either from human-annotated sources or by bootstrapping existing object labels at the phrase level. We then employ Contrastive Language-Scene Pre-training (CLASP) to bridge the divide between 3D vision and language models, thus facilitating the use of referent tokens in subsequent language modeling tasks. Our comprehensive evaluation covers open-ended tasks like 3D visual question answering and dense captioning, as well as closed-ended tasks such as object detection and language grounding.
Please refer to the data visualization page for detailed instructions on the minimal setup for visualizing the grounded scene caption dataset.
Grounded 3D-LLM is trained using 4 or 8 NVIDIA A100 GPUs. Please refer to the installation page for detailed installation scripts for model training.
Please download all the scene-language datasets from HuggingFace. The datasets are listed as follows:
Dataset | # for Train | # for Eval |
---|---|---|
ScanRefer | 36639 | 9503 |
Scan2Cap | 36639 | 9503 |
ScanQA | 26516 | 9402 |
Object-Description | 28197 | 7912 |
GroundedSceneCaption | 84301 | -- |
EmbodiedPlanning | 3500 | -- |
EmbodiedDialogue | 129799 | -- |
GlobalSceneCaption | 4065 | -- |
3D-LLM | 27627 | -- |
Alpaca | 51865 | -- |
Please download the pretrained weights from HuggingFace and place them in `$ROOT_PATH/pretrained/`.
Please download the pretrained LLM weights (Tiny-Vicuna-1B) and store them in `$ROOT_PATH/pretrained/llm_weight/Tiny-Vicuna-1B/`.
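As a rough sketch, the LLM weights could be fetched from the command line. This assumes the `huggingface_hub` CLI is installed and provides `huggingface-cli download`. The function name `download_llm_weights` is hypothetical and not part of this repository. The repository ID is the Tiny-Vicuna-1B source named in the folder layout.

```shell
# Sketch: fetch Tiny-Vicuna-1B into the folder this README expects.
# Assumes the huggingface_hub CLI (huggingface-cli) is on PATH.
# download_llm_weights is a hypothetical helper, not part of the repository.
download_llm_weights() {
    local root="${1:?usage: download_llm_weights ROOT_PATH}"
    local target="$root/pretrained/llm_weight/Tiny-Vicuna-1B"
    mkdir -p "$target"
    huggingface-cli download Jiayi-Pan/Tiny-Vicuna-1B --local-dir "$target"
}
```

Calling `download_llm_weights "$ROOT_PATH"` creates the target folder if needed and downloads into it.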
If you would like to utilize our pretrained model checkpoints, they can be obtained from HuggingFace. Please save these in the checkpoint directory located at `$ROOT_PATH/saved`.
Steps | Model Checkpoints |
---|---|
1 | Mask3D-CLIP |
2 | Mask3D-CLASP |
3 | Grounded 3D-LLM |
After completing the downloads, the root folder should be organized as follows:
ROOT_PATH
├── data  # data
│   ├── langdata
│   │   ├── groundedscenecaption_format.json
│   │   ├── scanrefer_format.json
│   │   ├── scanqa_format.json
│   │   └── ...
│   ├── processed
│   └── rawscannet
│       ├── scans
│       └── scannetv2-labels.combined.tsv
├── pretrained  # pretrained weights for model training
│   ├── bert-base-uncased  # bert pretrained weights
│   ├── label_clip_features.pth  # clip's text features for scannet-200 class names
│   └── llm_weight
│       └── Tiny-Vicuna-1B  # pretrained weights from https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B
└── saved  # model checkpoints saved path
    ├── step1_mask3d_clip_4GPUS
    ├── step2_mask3d_lang_4GPUS
    └── step3_mask3d_lang_4GPUS
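Before training, it may help to confirm that the folder matches the layout above. The following is a minimal sketch, not a script shipped with the repository. The function name `check_layout` is hypothetical, and the path list simply mirrors the entries shown in the tree.

```shell
# Sketch: report which expected entries are missing under ROOT_PATH.
# check_layout is a hypothetical helper; the list mirrors the tree above.
check_layout() {
    local root="${1:?usage: check_layout ROOT_PATH}"
    local missing=0
    local rel
    for rel in \
        data/langdata \
        data/processed \
        data/rawscannet/scans \
        data/rawscannet/scannetv2-labels.combined.tsv \
        pretrained/bert-base-uncased \
        pretrained/label_clip_features.pth \
        pretrained/llm_weight/Tiny-Vicuna-1B \
        saved
    do
        if [ ! -e "$root/$rel" ]; then
            echo "missing: $root/$rel"
            missing=1
        fi
    done
    # Return non-zero if any entry was missing.
    return "$missing"
}
```

Running `check_layout "$ROOT_PATH"` prints one line per missing entry and returns a non-zero status if anything is absent.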
Step 1: Pre-train the detector:
bash final_scripts/step1_pretrain_detector.sh
Step 2: After training the detector, pre-train it using Contrastive Language-Scene Pre-training (CLASP):
bash final_scripts/step2_pretrain_3d-clasp.sh
Step 3: Train Grounded 3D-LLM:
bash final_scripts/step3_train_grounded3dllm.sh
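Each stage builds on the previous one, so one option is to run the three scripts above in order and stop at the first failure. The wrapper below is only a sketch. `run_all_stages` is a hypothetical name, and it assumes the current directory is the repository root.

```shell
# Sketch: run the three training stages in order, stopping on the first failure.
# run_all_stages is a hypothetical helper; run it from the repository root.
run_all_stages() {
    local script
    for script in \
        final_scripts/step1_pretrain_detector.sh \
        final_scripts/step2_pretrain_3d-clasp.sh \
        final_scripts/step3_train_grounded3dllm.sh
    do
        echo "running $script"
        bash "$script" || { echo "failed: $script" >&2; return 1; }
    done
}
```

Stopping early avoids starting a later stage without the checkpoint the earlier stage was meant to produce.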
The model checkpoints will be saved in `saved/step3_mask3d_lang_4GPUS/last-checkpoint.pth`, and the inference results will be stored in `saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/`.
Note that to run inference only, you can simply comment out the training command in these scripts.
To evaluate all the respective results, run the following command:
bash final_scripts/eval_llm.sh ./saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/
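Since the results folder is named after the run's timestamp, a small guard can catch a mistyped value before evaluation starts. This is a sketch under assumptions. `eval_run` is a hypothetical helper, it expects to be run from the repository root, and the timestamp is passed in explicitly because its format is not specified here.

```shell
# Sketch: evaluate one inference run, checking that its results folder exists.
# eval_run is a hypothetical helper; run it from the repository root.
eval_run() {
    local ts="${1:?usage: eval_run TIMESTAMP}"
    local result_dir="./saved/step3_mask3d_lang_4GPUS/$ts/"
    if [ ! -d "$result_dir" ]; then
        echo "no inference results at $result_dir" >&2
        return 1
    fi
    bash final_scripts/eval_llm.sh "$result_dir"
}
```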
To interact with Grounded 3D-LLM via the demo chat, first run the model inference and ensure that the `scene_features` are saved in `saved/step3_mask3d_lang_4GPUS/scene_features`. After that, launch the Gradio demo chat by running the following command:
bash web_chat_demo/web_chat_demo.sh
Please note that the visualization of the related segmentation masks is not yet supported in the Gradio demo.
- [x] Release Grounded Scene Caption data (ScanNet).
- [x] Release data visualizer.
- [x] Release data generation code.
- [x] Release pre-trained checkpoints.
- [x] Release Grounded 3D-LLM training and evaluation.
- [ ] Demo supports mask visualization.
Many thanks to the following open-source projects: