NExT-GPT

Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

NExT-GPT is an end-to-end multimodal large language model that can process input and generate output in various combinations of text, image, video, and audio. It leverages existing pre-trained models and diffusion models with end-to-end instruction tuning. The repository contains code, data, and model weights for NExT-GPT, allowing users to work with different modalities and perform tasks like encoding, understanding, reasoning, and generating multimodal content.

README:

NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu, Hao Fei*, Leigang Qu, Wei Ji, and Tat-Seng Chua. (*Corresponding author)

NExT++, School of Computing, National University of Singapore

This repository hosts the code, data, and model weights of NExT-GPT, the first end-to-end MM-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, audio, and beyond.

Note: the previous codebase has been moved into NExT-GPT-Lagacy. Please refer to this new codebase for all training and tuning procedures.

🎉 News

[x] [2023.09.15] 🚀🚀 Release the code of NExT-GPT in version 7b_tiva_v0.
[x] [2023.09.27] 🔨🧩 Added modality-blended batch sampler.
[x] [2023.10.01] 📢📢 Release the T2M instruction dataset.
[x] [2023.10.04] 👏👏 Release the checkpoint of NExT-GPT in version 7b_tiva_v0.
[x] [2023.10.15] 🔨🚀 Update of NExT-GPT in version 7b_tiva_v0.
[x] [2024.10.07] 👏👏 Release the data and the corresponding construction methods; please refer to DATA_README.md for more details.

👉 TODO

[ ] Updating NExT-GPT in more types&sizes of LLMs.
[ ] Empowering NExT-GPT with more modalities of inputs&outputs.
[ ] ...

Example Demos

Here we showcase examples generated from NExT-GPT. For more examples, kindly visit the webpage, or the online live demo.

https://github.com/NExT-GPT/NExT-GPT/assets/18722770/0c2b3d88-a533-4899-ab44-65580fe54538

https://github.com/NExT-GPT/NExT-GPT/assets/18722770/eb1319a6-38aa-4546-a96e-163207e7de93

https://github.com/NExT-GPT/NExT-GPT/assets/18722770/36bec0ad-9bad-4bcf-bc37-92b028f1bc6a

Brief Introduction

NExT-GPT is built on top of an existing pre-trained LLM, multimodal encoders, and SoTA diffusion models, with end-to-end instruction tuning.

Multimodal Encoding Stage. Established encoders encode inputs in various modalities, and a projection layer maps these representations into language-like representations that the LLM can comprehend.
LLM Understanding and Reasoning Stage. An existing open-source LLM serves as the core that processes the input for semantic understanding and reasoning. The LLM not only generates text tokens directly but also produces unique “modality signal” tokens that instruct the decoding layers whether to output multimodal content and, if so, which content to output.
Multimodal Generation Stage. Receiving the multimodal signals with specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal token representations into representations that the downstream multimodal decoders can understand.
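
The role of the modality signal tokens in the stages above can be illustrated with a toy router. This is a minimal sketch and not the repository's implementation: the token names `[IMG]`, `[AUD]`, `[VID]` and the function `route_signal_tokens` are hypothetical placeholders chosen for illustration only.

```python
from typing import Dict, List

# Hypothetical signal tokens; the real vocabulary is defined by the model, not here.
SIGNAL_TOKENS: Dict[str, str] = {"[IMG]": "image", "[AUD]": "audio", "[VID]": "video"}


def route_signal_tokens(generated: List[str]) -> Dict[str, object]:
    """Split LLM output into plain text and the modalities it asks to generate."""
    text_tokens: List[str] = []
    modalities: List[str] = []
    for token in generated:
        modality = SIGNAL_TOKENS.get(token)
        if modality is None:
            text_tokens.append(token)        # ordinary text token
        elif modality not in modalities:
            modalities.append(modality)      # signal token: request this decoder once
    return {"text": " ".join(text_tokens), "modalities": modalities}


result = route_signal_tokens(["Here", "is", "a", "cat", "[IMG]"])
print(result)
```

When no signal token is present, the `modalities` list stays empty and only text is returned, which mirrors the "if any" condition in the generation stage.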

For more technical details, kindly refer to the paper.

Getting Started

1. Code Structure

.
|-- NExT-GPT-Lagacy       # the previous version of the model
|-- assets
|-- checkpoints           # save the pretraining and tuning checkpoints
|-- data  
|   |-- IT_data
|   |   |-- MosIT_data
|   |   |-- T+X-T_data    # text+[image/audio/video] to text instruction data
|   |   `-- T-T+X_data    # synthesized text to text+[image/audio/video] instruction data
|   |-- T_X_pair_data     # text-[image/audio/video] pair data
|   |   |-- audiocap
|   |   |-- cc3m
|   |   `-- webvid
|   |-- embed 
|   `-- prepare_data.py
|-- figures
|-- merge_lora_weights.py
|-- nextgpt
|   |-- __init__.py
|   |-- constants.py
|   |-- conversation.py
|   |-- dataset
|   |   |-- __init__.py
|   |   |-- audio_processor.py
|   |   |-- base_dataset.py
|   |   |-- catalog.py
|   |   |-- concat_dataset.py
|   |   |-- dataset_utils.py
|   |   `-- sampler.py
|   |-- mm_utils.py
|   |-- model
|   |   |-- __init__.py
|   |   |-- apply_delta.py
|   |   |-- builder.py
|   |   |-- consolidate.py
|   |   |-- language_model
|   |   |-- make_delta.py
|   |   |-- multimodal_decoder
|   |   |-- multimodal_encoder
|   |   |-- multimodal_projector
|   |   |-- nextgpt_arch.py
|   |   `-- utils.py
|   `-- utils.py
|-- scripts
|   |-- finetune.sh
|   |-- pretrain_dec.sh
|   |-- pretrain_enc.sh
|   |-- zero2.json
|   |-- zero3.json
|   `-- zero3_offload.json
|-- LICENSE.md
|-- README.md
|-- nextgpt_trainer.py
|-- predict.py
|-- preprocess_embeddings.py
|-- requirements.txt
|-- train.py
|-- train_mem.py
`-- training_utils.py

2. Environment Preparation

Please first clone the repo and install the required environment, which can be done by running the following commands:

conda create -n nextgpt python=3.8

conda activate nextgpt

# CUDA 12.1
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

git clone https://github.com/NExT-GPT/NExT-GPT.git
cd NExT-GPT

pip install -r requirements.txt
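
After installation, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names (`torch`, `torchvision`, `torchaudio`) are an assumption about how the packages are registered in your environment.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust them if your environment differs.
for name, version in installed_versions(["torch", "torchvision", "torchaudio"]).items():
    print(f"{name}: {version or 'not installed'}")
```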

3. Training/Adapting NExT-GPT on Your Own

3.1. Preparing Pre-trained Checkpoint

NExT-GPT is built on the following existing models. Please follow the instructions to prepare the checkpoints.

ImageBind is the unified image/video/audio encoder. Download the pre-trained checkpoint (version huge) from here, then put the imagebind_huge.pth file at [./pretrain_ckpt/imagebind].
Vicuna: prepare the pre-trained Vicuna from [here], then put the model at [./pretrain_ckpt/vicuna-7b-v1.5/].
Image Diffusion is used to generate images. NExT-GPT uses Stable Diffusion, version v2 (downloaded automatically).
Audio Diffusion is used to produce audio content. NExT-GPT employs AudioLDM, version l-full (downloaded automatically).
Video Diffusion is used for video generation. NExT-GPT employs ZeroScope, version v2_576w (downloaded automatically).
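
Before training, you may want to verify that the manually downloaded checkpoints are where the instructions above expect them. This is a minimal sketch: `missing_checkpoints` is a hypothetical helper, and the ImageBind location is assumed to be `./pretrain_ckpt/imagebind`, by analogy with the Vicuna path.

```python
from pathlib import Path
from typing import List

# Locations from the instructions above, relative to the repository root.
# The ImageBind directory is an assumption (see the lead-in).
EXPECTED = [
    Path("pretrain_ckpt/imagebind/imagebind_huge.pth"),
    Path("pretrain_ckpt/vicuna-7b-v1.5"),
]


def missing_checkpoints(root: Path) -> List[Path]:
    """Return the expected checkpoint paths that do not exist under root."""
    return [path for path in EXPECTED if not (root / path).exists()]


for path in missing_checkpoints(Path(".")):
    print(f"missing: {path}")
```

The diffusion models are not checked here because, per the list above, they are downloaded automatically.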

3.2. Preparing Dataset

Please download the following datasets used for model training:

A) T-X pairs data

CC3M of text-image pairs, please follow this instruction [here]. Then put the data at [./data/T-X_pair_data/cc3m].
WebVid of text-video pairs, see the [instruction]. The file should be saved at [./data/T-X_pair_data/webvid].
AudioCap of text-audio pairs, see the [instruction]. Save the data in [./data/T-X_pair_data/audiocap].

B) Instruction data

T+X-T
- LLaVA visual instruction data: download it from here, then put it at [./data/IT_data/T+X-T_data/llava].
- Alpaca textual instruction data: download it from here, then put it at [./data/IT_data/T+X-T_data/alpaca/].
- VideoChat video instruction data: download it from here, then put it at [./data/IT_data/T+X-T_data/videochat/].
Side note: after downloading the datasets, please run prepare_data.py to preprocess them.
T-T+X (T2M)
- The T-T+X instruction datasets (T2M) are saved at [./data/IT_data/T-T+X_data].
MosIT
- Download the files from here and put them in [./data/IT_data/MosIT_data/]. (We are in the process of finalizing the data and handling the copyright issue.)

3.3. Precomputing Embeddings

In decoding-side alignment training, we minimize the distance between the representations of the signal tokens and those of the captions. To save time and memory, we precompute the text embeddings for image, audio and video captions using the text encoder within the respective diffusion models.
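
The objective described above can be sketched in plain Python. This is only an illustration under assumptions: mean squared error stands in for "distance" (this README does not name the metric), and `alignment_loss` is a hypothetical helper, not a function from the repository.

```python
from typing import Sequence


def alignment_loss(signal_repr: Sequence[float], caption_embed: Sequence[float]) -> float:
    """Mean squared distance between a projected signal-token representation
    and a precomputed caption embedding (assumed metric, for illustration)."""
    if len(signal_repr) != len(caption_embed):
        raise ValueError("representations must have the same dimension")
    squared = ((s - c) ** 2 for s, c in zip(signal_repr, caption_embed))
    return sum(squared) / len(signal_repr)


# The caption embedding is computed once and reused at every training step,
# so the diffusion model's text encoder does not need to run during training.
cached_caption_embed = [0.5, -1.0, 2.0]
print(alignment_loss([0.5, -1.0, 2.0], cached_caption_embed))  # identical vectors give 0.0
print(alignment_loss([1.5, -1.0, 2.0], cached_caption_embed))
```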

Please run this command before the following training of NExT-GPT, where the produced embedding file will be saved at [./data/embed].

python preprocess_embeddings.py ./data/T-X_pair_data/cc3m/cc3m_generation.json image ./data/embed/ stabilityai/stable-diffusion-2

Arguments:

args[1]: path of the caption file;
args[2]: modality, one of image, video, or audio;
args[3]: path where the embedding file is saved;
args[4]: name of the corresponding pre-trained diffusion model.
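
The four positional arguments listed above can be validated before any expensive work starts. The sketch below is hypothetical and is not the code of preprocess_embeddings.py: the names `EmbedArgs` and `parse_embed_args` are invented for illustration, and the example paths are placeholders.

```python
from typing import List, NamedTuple

MODALITIES = ("image", "video", "audio")


class EmbedArgs(NamedTuple):
    caption_path: str   # args[1]: path of the caption file
    modality: str       # args[2]: image, video, or audio
    save_path: str      # args[3]: where the embedding file is saved
    model_name: str     # args[4]: pre-trained diffusion model name


def parse_embed_args(argv: List[str]) -> EmbedArgs:
    """Validate a sys.argv-style list and return the named arguments."""
    if len(argv) != 5:
        raise ValueError("expected: <caption_file> <modality> <save_dir> <diffusion_model>")
    args = EmbedArgs(*argv[1:])
    if args.modality not in MODALITIES:
        raise ValueError(f"modality must be one of {MODALITIES}, got {args.modality!r}")
    return args


example = parse_embed_args([
    "preprocess_embeddings.py",
    "data/T-X_pair_data/cc3m/cc3m_generation.json",  # placeholder path
    "image",
    "data/embed/",                                   # placeholder path
    "stabilityai/stable-diffusion-2",
])
print(example.modality, example.model_name)
```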

3.4. Training NExT-GPT

First, refer to the base configuration file [training_utils.py] for the basic system settings of all modules, and to nextgpt/dataset/catalog.py for the dataset configuration. The whole NExT-GPT training involves 3 steps:

Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing the ImageBind, LLM, output projection layer.
```
# Encoding-side LLM-centric Multimodal Alignment
bash scripts/pretrain_enc.sh
```
Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing the ImageBind, LLM, and input projection layers.
```
# Decoding-side Instruction-following Alignment
bash scripts/pretrain_dec.sh
```
Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer and 3) the output projection layer on the instruction dataset.
```
# Instruction Tuning
bash scripts/finetune.sh
```
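
The three steps differ mainly in which modules receive gradients. The table below restates that schedule as a small, self-contained sketch; the module labels and the helper `frozen_modules` are hypothetical names for illustration, with stage keys borrowed from the script names in the scripts directory.

```python
from typing import Dict, FrozenSet, List

MODULES = ("imagebind", "llm", "input_projector", "output_projector")

# Trainable parts per stage, following the three steps described above.
# "llm_lora" stands for LoRA adapters added to the otherwise frozen LLM.
TRAINABLE: Dict[str, FrozenSet[str]] = {
    "pretrain_enc": frozenset({"input_projector"}),
    "pretrain_dec": frozenset({"output_projector"}),
    "finetune": frozenset({"llm_lora", "input_projector", "output_projector"}),
}


def frozen_modules(stage: str) -> List[str]:
    """List the base modules whose weights stay fixed in the given stage."""
    trainable = TRAINABLE[stage]
    return [module for module in MODULES if module not in trainable]


for stage in TRAINABLE:
    print(stage, "-> frozen:", frozen_modules(stage))
```

Note that in the instruction-tuning stage the base LLM weights still count as frozen here, since only the LoRA adapters are updated.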

4. Running NExT-GPT System

4.1. Preparing Checkpoints

First, load the pre-trained NExT-GPT system.

Step-1: load the frozen parameters. Please refer to 3.1 Preparing Pre-trained Checkpoint.
Step-2: load the tunable parameters. Please put the NExT-GPT system at ./checkpoints/nextgpt-v1.5-7b. You may either 1) use parameters you trained yourself, or 2) download our checkpoints from Huggingface.

4.2. Run the Prediction

Upon completion of the checkpoint loading, you can run the prediction via:

python predict.py

5. Fine-tuning Your Own System

5.1. Dataset

You can define your own dataset: refer to base_dataset.py, then add a catalog entry for the dataset in catalog.py, including the target and parameters.
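
A catalog entry that pairs a target with its parameters can be pictured as follows. This is a hypothetical sketch of the general pattern only: the dictionary layout, the entry name `toy_pairs`, and the helper `build_from_catalog` are assumptions, and the actual format expected by catalog.py may differ.

```python
import importlib
from typing import Any, Dict

# Hypothetical catalog: each entry names a class ("target") and its keyword arguments ("params").
CATALOG: Dict[str, Dict[str, Any]] = {
    "toy_pairs": {
        "target": "collections.OrderedDict",  # stand-in for your own dataset class
        "params": {"caption": "a cat", "image": "cat.jpg"},
    },
}


def build_from_catalog(name: str) -> Any:
    """Import the target class by its dotted path and instantiate it with params."""
    entry = CATALOG[name]
    module_name, class_name = entry["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**entry["params"])


dataset = build_from_catalog("toy_pairs")
print(type(dataset).__name__, dict(dataset))
```

Keeping the class path and its parameters in data rather than code means a new dataset can be registered without touching the training loop.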

5.2. Model Framework

Multimodal Encoder: You can plug in your own multimodal encoder in the multimodal_encoder directory and add the corresponding code in builder.py.
Multimodal Decoder: You can add your own multimodal decoder in the multimodal_decoder directory and modify the corresponding code in builder.py.
Projector: You can design your own input and output projectors in the multimodal_projector directory.

5.3. Fine-tuning

You can pre-define the model, data, and training parameters in training_utils.py. Please refer to finetune.sh for fine-tuning your own model.

Contact

For any questions or feedback, feel free to contact Shengqiong Wu and Hao Fei.

Citation

If you find NExT-GPT useful in your research or applications, please kindly cite:

@article{wu2023nextgpt,
  title={NExT-GPT: Any-to-Any Multimodal LLM},
  author={Shengqiong Wu and Hao Fei and Leigang Qu and Wei Ji and Tat-Seng Chua},
  journal = {CoRR},
  volume = {abs/2309.05519},
  year={2023}
}

Acknowledgements

You may refer to the related work that serves as the foundation for our framework and code repository: Vicuna, ImageBind, Stable Diffusion, AudioLDM, and Zeroscope. We also partially draw inspiration from PandaGPT, GILL, CoDi, Video-LLaMA, LLaVA, and MiniGPT-4. Thanks for their wonderful work.

License Notices

This repository is under BSD 3-Clause License. NExT-GPT is a research project intended for non-commercial use only. One must NOT use the code of NExT-GPT for any illegal, harmful, violent, racist, or sexual purposes. One is strictly prohibited from engaging in any activity that will potentially violate these guidelines. Any potential commercial use of this code should be approved by the authors.
