awesome-VLLMs

This repository collects papers on VLLM applications. New papers are added on an irregular basis.

Awesome Visual Large Language Models (VLLMs)

Vision language models (VLMs) have emerged as powerful tools for learning unified embedding spaces that integrate vision and language. Inspired by the success of large language models (LLMs), which have demonstrated remarkable reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining significant attention for developing both general-purpose and specialized VLMs.

In this repository, we provide a comprehensive summary of the current literature from an application-oriented perspective. We hope this resource serves as a valuable reference for the VLLM research community.

If you are interested in this project, you can contribute to this repo by opening pull requests 😊😊😊

📢 News

🚀 What's New in This Update:

  • [2025.3.2]: 🔥 Adding two projects on complex reasoning: R1-V and VLM-R1!
  • [2025.2.23]: 🔥 Adding one video-to-action paper and one vision-to-text paper!
  • [2025.2.1]: 🔥 Adding four video-to-text papers!
  • [2025.1.22]: 🔥 Adding one video-to-text paper!
  • [2025.1.17]: 🔥 Adding three video-to-text papers, thanks to Enxin for the contributions!
  • [2025.1.14]: 🔥 Adding two complex reasoning papers and one video-to-text paper!
  • [2025.1.13]: 🔥 Adding one VFM survey paper!
  • [2025.1.12]: 🔥 Adding one efficient MLLM paper!
  • [2025.1.9]: 🔥🔥🔥 Adding one efficient MLLM survey!
  • [2025.1.7]: 🔥🔥🔥 Our survey paper is released! Please check this link for more information. We also added more tool management papers to our paper list.
  • [2025.1.6]: 🔥 We added one OS Agent survey paper to our paper list, along with a new category: complex reasoning!
  • [2025.1.4]: 🔥 We updated the general domain and egocentric video papers in our paper list, thanks to Wentao for the contributions!
  • [2025.1.2]: 🔥 We added more interpretation papers to our paper list, thanks to Ruoyu for the contributions!
  • [2024.12.15]: 🔥 We released our VLLM application paper list repo!

🌈 Table of Contents

Existing VLM surveys

VLM surveys

Title Venue Date Code Project
Star
Foundation Models Defining a New Era in Vision: A Survey and Outlook
T-PAMI 2025-1-9 Github Project
Star
Vision-Language Models for Vision Tasks: A Survey
T-PAMI 2024-8-8 Github Project
Star
Vision + Language Applications: A Survey
CVPRW 2023-5-24 Github Project
Vision-and-Language Pretrained Models: A Survey
IJCAI (survey track) 2022-5-3 Github Project

🎯Back to Top

MLLM surveys

Title Venue Date Code Project
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
ArXiv 2024-12-27 Github Project
Towards Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
ArXiv 2024-12-3 Github Project
Star
A Survey on Multimodal Large Language Models
T-PAMI 2024-11-29 Github Project
Star
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
ArXiv 2024-11-22 Github Project
A Survey on Multimodal Large Language Models
National Science Review 2024-11-12 Github Project
Video Understanding with Large Language Models: A Survey
ArXiv 2024-6-24 Github Project
Star
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
ArXiv 2024-9-21 Github Project
Star
The Revolution of Multimodal Large Language Models: A Survey
ArXiv 2024-6-6 Github Project
Star
Efficient Multimodal Large Language Models: A Survey
ArXiv 2024-5-17 Github Project
Star
A Survey on Hallucination in Large Vision-Language Models
ArXiv 2024-5-6 Github Project
Star
Hallucination of Multimodal Large Language Models: A Survey
ArXiv 2024-4-29 Github Project
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
ArXiv 2024-4-12 Github Project
Star
MM-LLMs: Recent Advances in MultiModal Large Language Models
ArXiv 2024-2-20 Github Project
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
ArXiv 2024-1-18 Github Project
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
ArXiv 2023-12-27 Github Project
Multimodal Large Language Models: A Survey
BigData 2023-12-15 Github Project

🎯Back to Top

Vision-to-text

Image-to-text

General domain

General ability
Name Title Venue Date Code Project
InternVL2.5 Star
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
ArXiv 2024-12-17 Github Project
CompCap CompCap: Improving Multimodal Large Language Models with Composite Captions ArXiv 2024-12-06 Github Project
NVILA NVILA: Efficient Frontier Visual Language Models ArXiv 2024-12-05 Github Project
Molmo and PixMo Star
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
ArXiv 2024-09-25 Github Project
Qwen2-VL Star
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
ArXiv 2024-09-18 Github Project
mPLUG-Owl3 Star
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
ArXiv 2024-08-09 Github Project
LLaVA-OneVision Star
LLaVA-OneVision: Easy Visual Task Transfer
ArXiv 2024-08-06 Github Project
VILA$^2$ VILA$^2$: VILA Augmented VILA ArXiv 2024-07-24 Github Project
EVLM EVLM: An Efficient Vision-Language Model for Visual Understanding ArXiv 2024-07-19 Github Project
MG-LLaVA MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning ArXiv 2024-06-27 Github Project
Cambrian-1 Star
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
ArXiv 2024-06-24 Github Project
Ovis Star
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
ArXiv 2024-05-31 Github Project
ConvLLaVA Star
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
ArXiv 2024-05-24 Github Project
Meteor Star
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
NeurIPS 2024-05-24 Github Project
CuMo Star
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
ArXiv 2024-05-09 Github Project
Mini-Gemini Star
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
ArXiv 2024-03-27 Github Project
MM1 MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training ArXiv 2024-03-14 Github Project
DeepSeek-VL Star
DeepSeek-VL: Towards Real-World Vision-Language Understanding
ArXiv 2024-03-08 Github Project
InternLM-XComposer2 Star
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
ArXiv 2024-01-29 Github Project
MoE-LLaVA Star
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
ArXiv 2024-01-29 Github Project
InternVL Star
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2023-12-21 Github Project
VILA Star
VILA: On Pre-training for Visual Language Models
ArXiv 2023-12-12 Github Project
Vary Star
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
ECCV 2023-12-11 Github Project
Honeybee Star
Honeybee: Locality-enhanced Projector for Multimodal LLM
CVPR 2023-11-11 Github Project
OtterHD Star
OtterHD: A High-Resolution Multi-modality Model
ArXiv 2023-11-07 Github Project
mPLUG-Owl2 Star
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
CVPR 2023-11-07 Github Project
Fuyu Star
Fuyu-8B: A Multimodal Architecture for AI Agents
ArXiv 2023-10-17 Github Project
MiniGPT-v2 Star
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
ArXiv 2023-10-14 Github Project
LLaVA 1.5 Star
Improved Baselines with Visual Instruction Tuning
ArXiv 2023-10-05 Github Project
InternLM-XComposer Star
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
ArXiv 2023-09-26 Github Project
Qwen-VL Star
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
ArXiv 2023-08-24 Github Project
StableLLaVA Star
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
ArXiv 2023-08-20 Github Project
BLIVA Star
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
AAAI 2023-08-19 Github Project
SVIT Star
SVIT: Scaling up Visual Instruction Tuning
ArXiv 2023-07-09 Github Project
LaVIN Star
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
NeurIPS 2023-05-24 Github Project
InstructBLIP Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
NeurIPS 2023-05-11 Github Project
MultiModal-GPT Star
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
ArXiv 2023-05-08 Github Project
Otter Star
Otter: A Multi-Modal Model with In-Context Instruction Tuning
ArXiv 2023-05-05 Github Project
mPLUG-Owl Star
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
ArXiv 2023-04-27 Github Project
LLaMA-Adapter V2 Star
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
ArXiv 2023-04-28 Github Project
MiniGPT-4 Star
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
NeurIPS 2023-04-20 Github Project
LLaVA Star
Visual Instruction Tuning
NeurIPS 2023-04-17 Github Project
LLaMA-Adapter Star
LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention
ICLR 2023-03-28 Github Project
Kosmos-1 Star
Language Is Not All You Need: Aligning Perception with Language Models
NeurIPS 2023-02-27 Github Project
Flamingo Star
Flamingo: a Visual Language Model for Few-Shot Learning
NeurIPS 2022-04-29 Github Project

🎯Back to Top

REC
Name Title Venue Date Code Project
ChatRex Star
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
ArXiv 2024-11-27 Github Project
Griffon-G Star
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
ArXiv 2024-10-21 Github Project
Ferret Star
Ferret: Refer and Ground Anything Anywhere at Any Granularity
ICLR 2024-10-11 Github Project
OMG-LLaVA Star
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
NeurIPS 2024-06-27 Github Project
VisionLLMv2 Star
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
ArXiv 2024-06-12 Github Project
Groma Star
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
ECCV 2024-04-19 Github Project
Griffonv2 Star
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
ArXiv 2024-03-14 Github Project
ASMv2 Star
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
ECCV 2024-02-29 Github Project
SPHINX-X SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ArXiv 2024-02-08 Github Project
ChatterBox Star
ChatterBox: Multi-round Multimodal Referring and Grounding
ArXiv 2024-01-24 Github Project
LEGO Star
LEGO: Language Enhanced Multi-modal Grounding Model
ArXiv 2024-01-12 Github Project
GroundingGPT Star
GroundingGPT: Language Enhanced Multi-modal Grounding Model
ACL 2024-01-11 Github Project
BuboGPT Star
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
ArXiv 2024-07-17 Github Project
Ferret-v2 Star
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
COLM 2024-04-11 Github Project
InfMLLM Star
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
NeurIPS 2024-02-07 Github Project
VistaLLM Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model ECCV 2023-12-19 Github Project
LLaVA-Grounding Star
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
ArXiv 2023-12-05 Github Project
Lenna Star
Lenna: Language Enhanced Reasoning Detection Assistant
ArXiv 2023-12-05 Github Project
Griffon Star
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
ECCV 2023-11-24 Github Project
Lion Star
Lion: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
CVPR 2023-11-20 Github Project
SPHINX SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models ArXiv 2023-11-13 Github Project
NExT-Chat Star
NExT-Chat: An LMM for Chat, Detection and Segmentation
ArXiv 2023-11-08 Github Project
GLaMM Star
GLaMM: Pixel Grounding Large Multimodal Model
CVPR 2023-11-06 Github Project
CogVLM Star
CogVLM: Visual Expert for Pretrained Language Models
ArXiv 2023-11-06 Github Project
Pink Star
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
CVPR 2023-10-01 Github Project
PVIT Star
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
ArXiv 2023-08-25 Github Project
ASM Star
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
ICLR 2023-08-03 Github Project
Shikra Star
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
ArXiv 2023-06-27 Github Project
Kosmos-2 Star
Kosmos-2: Grounding Multimodal Large Language Models to the World
ICLR 2023-06-26 Github Project
ChatSpot Star
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
ArXiv 2023-07-18 Github Project
GPT4RoI Star
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
IJCAI 2023-07-07 Github Project
ContextDET Star
Contextual Object Detection with Multimodal Large Language Models
ArXiv 2023-05-29 Github Project
DetGPT Star
DetGPT: Detect What You Need via Reasoning
ArXiv 2023-05-23 Github Project
VisionLLM Star
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
ArXiv 2023-05-18 Github Project

🎯Back to Top

RES
Name Title Venue Date Code Project
OMG-LLaVA Star
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
NeurIPS 2024-06-27 Github Project
VisionLLMv2 Star
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
ArXiv 2024-06-12 Github Project
LLM-Seg LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning CVPR 2024-04-12 Github Project
PSALM Star
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
ECCV 2024-03-21 Github Project
GROUNDHOG Star
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
CVPR 2024-02-26 Github Project
GELLA Generalizable Entity Grounding via Assistance of Large Language Model ECCV 2024-02-04 Github Project
OMG-Seg Star
OMG-Seg: Is One Model Good Enough For All Segmentation?
CVPR 2024-01-18 Github Project
LISA++ Star
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
ArXiv 2023-12-28 Github Project
VistaLLM Star
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
ECCV 2023-12-19 Github Project
Osprey Star
Osprey: Pixel Understanding with Visual Instruction Tuning
CVPR 2023-12-15 Github Project
GSVA Star
GSVA: Generalized Segmentation via Multimodal Large Language Models
CVPR 2023-12-05 Github Project
PixelLM Star
PixelLM: Pixel Reasoning with Large Multimodal Model
CVPR 2023-12-04 Github Project
PixelLLM Star
PixelLM: Pixel Reasoning with Large Multimodal Model
ECCV 2023-12-04 Github Project
LLaFS Star
LLaFS: When Large Language Models Meet Few-Shot Segmentation
CVPR 2023-11-28 Github Project
NExT-Chat Star
NExT-Chat: An LMM for Chat, Detection and Segmentation
ArXiv 2023-11-08 Github Project
GLaMM Star
GLaMM: Pixel Grounding Large Multimodal Model
CVPR 2023-11-06 Github Project
LISA Star
LISA: Reasoning Segmentation via Large Language Model
CVPR 2023-08-01 Github Project
ContextDET Star
Contextual Object Detection with Multimodal Large Language Models
ArXiv 2023-05-29 Github Project
VisionLLM Star
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
ArXiv 2023-05-18 Github Project

🎯Back to Top

OCR
Name Title Venue Date Code Project
TextHawk2 Star
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
ArXiv 2024-10-07 Github Project
DocKylin Star
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
AAAI 2024-06-27 Github Project
StrucTexTv3 StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond ArXiv 2024-05-31 Github Project
Fox Star
Focus Anywhere for Fine-grained Multi-page Document Understanding
ArXiv 2024-05-23 Github Project
TextMonkey Star
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
ArXiv 2024-05-07 Github Project
TinyChart Star
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
ACL 2024-04-25 Github Project
TextHawk Star
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
ArXiv 2024-04-14 Github Project
HRVDA HRVDA: High-Resolution Visual Document Assistant CVPR 2024-04-10 Github Project
InternLM-XComposer2-4KHD Star
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
NeurIPS 2024-04-09 Github Project
LayoutLLM LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding CVPR 2024-04-08 Github Project
ViTLP Star
Visually Guided Generative Text-Layout Pre-training for Document Intelligence
NAACL 2024-03-25 Github Project
mPLUG-DocOwl 1.5 Star
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
ArXiv 2024-03-19 Github Project
DoCo Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models CVPR 2024-02-29 Github Project
TGDoc Star
Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
ArXiv 2023-11-22 Github Project
DocPedia DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding ArXiv 2023-11-20 Github Project
UReader Star
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
ACL 2023-10-08 Github Project
UniDoc UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding ArXiv 2023-08-19 Github Project
mPLUG-DocOwl Star
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
ArXiv 2023-07-04 Github Project
LLaVAR Star
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
ArXiv 2023-06-29 Github Project

🎯Back to Top

Retrieval
Name Title Venue Date Code Project
EchoSight Star
EchoSight: Advancing Visual-Language Models with Wiki Knowledge
EMNLP 2024-07-17 Github Project
FROMAGe Star
Grounding Language Models to Images for Multimodal Inputs and Outputs
ICML 2024-01-31 Github Project
Wiki-LLaVA Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs CVPR 2023-04-23 Github Project
UniMuR Unified Embeddings for Multimodal Retrieval via Frozen LLMs ICML 2019-05-08 Github Project

🎯Back to Top

VLLM+X

Remote sensing
Name Title Venue Date Code Project
VHM Star
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis
ArXiv 2024-11-06 Github Project
LHRS-Bot Star
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
ECCV 2024-07-16 Github Project
Popeye Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
J-STARS 2024-06-13 Github Project
RS-LLaVA Star
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
Remote Sens. 2024-04-23 Github Project
EarthGPT Star
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
TGRS 2024-03-08 Github Project
RS-CapRet Large Language Models for Captioning and Retrieving Remote Sensing Images
ArXiv 2024-02-09 Github Project
SkyEyeGPT Star
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
ArXiv 2024-01-18 Github Project
GeoChat Star
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
CVPR 2023-11-24 Github Project
RSGPT Star
RSGPT: A Remote Sensing Vision Language Model and Benchmark
ArXiv 2023-07-28 Github Project

🎯Back to Top

Medical
Name Title Venue Date Code Project
UMed-LVLM Training Medical Large Vision-Language Models with Abnormal-Aware Feedback
ArXiv 2025-01-02 Github Project
PMC-VQA Star
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
ArXiv 2024-09-08 Github Project
MedVersa A Generalist Learner for Multifaceted Medical Image Interpretation
ArXiv 2024-05-13 Github Project
PeFoMed Star
PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging
ArXiv 2024-04-16 Github Project
RaDialog Star
RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance
ArXiv 2023-11-30 Github Project
Med-Flamingo Star
Med-Flamingo: a Multimodal Medical Few-shot Learner
ML4H 2023-07-27 Github Project
XrayGPT Star
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
BioNLP 2023-06-13 Github Project
LLaVA-Med Star
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
NeurIPS 2023-06-01 Github Project
CXR-RePaiR-Gen Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models
MLHC 2023-05-05 Github Project

🎯Back to Top

Science and math
Name Title Venue Date Code Project
MAVIS Star
MAVIS: Mathematical Visual Instruction Tuning
ECCV 2024-11-01 Github Project
Math-LLaVA Star
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
EMNLP 2024-10-08 Github Project
MathVerse Star
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
ECCV 2024-08-18 Github Project
We-Math Star
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ArXiv 2024-07-01 Github Project
CMMaTH CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
ArXiv 2024-06-28 Github Project
GeoEval Star
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
ACL 2024-05-17 Github Project
FigurA11y Star
FigurA11y: AI Assistance for Writing Scientific Alt Text
IUI 2024-04-05 Github Project
MathVista Star
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
ICLR 2024-01-21 Github Project
mPLUG-PaperOwl Star
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
ACM MM 2024-01-09 Github Project
G-LLaVA Star
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
ArXiv 2023-12-18 Github Project
T-SciQ Star
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering
AAAI 2023-12-18 Github Project
ScienceQA Star
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
NeurIPS 2022-10-17 Github Project

🎯Back to Top

Graphics and UI
Name Title Venue Date Code Project
Graphist Star
Graphic Design with Large Multimodal Model
ArXiv 2024-04-22 Github Project
Ferret-UI Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
ECCV 2024-04-08 Github Project
CogAgent Star
CogAgent: A Visual Language Model for GUI Agents
CVPR 2023-12-21 Github Project

🎯Back to Top

Financial analysis
Name Title Venue Date Code Project
FinTral FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
ACL 2024-06-14 Github Project
FinVis-GPT Star
FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis
ArXiv 2023-07-31 Github Project

🎯Back to Top

Video-to-text

General domain

Name Title Venue Date Code Project
TPO Star
Temporal Preference Optimization for Long-Form Video Understanding
ArXiv 2025-1-23 Github Project
VideoLLaMA3 Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
ArXiv 2025-1-22 Github Project
Aria Star
Aria: An Open Multimodal Native Mixture-of-Experts Model
ArXiv 2024-12-17 Github Project
Apollo Apollo: An Exploration of Video Understanding in Large Multimodal Models ArXiv 2024-12-13 Github Project
LinVT Star
LinVT: Empower Your Image-level Large Language Model to Understand Videos
ArXiv 2024-12-11 Github Project
Video-LLaMA2 Star
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
ArXiv 2024-10-30 Github Project
LLaVA-OneVision Star
LLaVA-OneVision: Easy Visual Task Transfer
ArXiv 2024-10-26 Github Project
Oryx Star
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
ICLR 2024-10-22 Github Project
LongVU Star
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
ArXiv 2024-10-22 Github Project
AuroraCap Star
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
ArXiv 2024-10-4 Github Project
LLaVA-Video Star
Video Instruction Tuning With Synthetic Data
ArXiv 2024-10-04 Github Project
SlowFast-LLaVA Star
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
ArXiv 2024-9-15 Github Project
InternVideo2 Star
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
ArXiv 2024-8-14 Github Project
mPLUG-Owl3 Star
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
ArXiv 2024-08-13 Github Project
Goldfish Star
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
ECCV 2024-07-17 Github Project
VoT Star
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
ICML 2024-07-17 Github Project
Flash-VStream Star
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
ArXiv 2024-06-30 Github Project
LLaVA-Next-Video Star
LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
online 2024-04-30 Github Project
PLLaVA Star
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
ArXiv 2024-4-29 Github Project
MovieChat+ Star
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
ArXiv 2024-4-26 Github Project
MiniGPT4-Video Star
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
CVPR Workshop 2024-04-04 Github Project
ST-LLM Star
ST-LLM: Large Language Models Are Effective Temporal Learners
ECCV 2024-03-30 Github Project
LLaMA-VID Star
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
ECCV 2023-11-28 Github Project
MovieChat Star
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
CVPR 2023-7-31 Github Project
Video-LLaMA Star
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
EMNLP 2023-10-25 Github Project
Vid2Seq Star
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
CVPR 2023-03-21 Github Project
LaViLa Star
Learning Video Representations from Large Language Models
CVPR 2022-12-08 Github Project
VideoBERT Star
VideoBERT: A Joint Model for Video and Language Representation Learning
ICCV 2019-09-11 Github Project

🎯Back to Top

Video conversation

Name Title Venue Date Code Project
Video-LLaVA Star
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
EMNLP 2024-10-01 Github Project
BT-Adapter Star
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
CVPR 2024-06-27 Github Project
VideoGPT+ Star
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
ArXiv 2024-06-13 Github Project
Video-ChatGPT Star
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
ACL 2024-06-10 Github Project
MVBench Star
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024-05-23 Github Project
LVChat Star
LVCHAT: Facilitating Long Video Comprehension
ArXiv 2024-02-19 Github Project
VideoChat Star
VideoChat: Chat-Centric Video Understanding
ArXiv 2024-01-04 Github Project
Valley Star
Valley: Video Assistant with Large Language model Enhanced abilitY
ArXiv 2023-10-08 Github Project

🎯Back to Top

Egocentric view

Name Title Venue Date Code Project
PALM Star
PALM: Predicting Actions through Language Models
CVPR Workshop 2024-07-18 Github Project
GPT4Ego GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition ArXiv 2024-05-11 Github Project
AntGPT Star
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
ICLR 2024-04-01 Github Project
LEAP LEAP: LLM-Generation of Egocentric Action Programs ArXiv 2023-11-29 Github Project
LLM-Inner-Speech Star
Egocentric Video Comprehension via Large Language Model Inner Speech
CVPR Workshop 2023-06-18 Github Project
LLM-Brain LLM as A Robotic Brain: Unifying Egocentric Memory and Control ArXiv 2023-04-25 Github Project
LaViLa Star
Learning Video Representations from Large Language Models
CVPR 2022-12-08 Github Project

🎯Back to Top

Vision-to-action

Autonomous driving

Perception

Name Title Venue Date Code Project
DriveLM Star
DriveLM: Driving with Graph Visual Question Answering
ECCV 2024-7-17 Github Project
Talk2BEV Star
Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving
ICRA 2024-5-13 Github Project
NuScenes-QA Star
NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario
AAAI 2024-3-24 Github Project
DriveMLM Star
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
ArXiv 2023-12-25 Github Project
LiDAR-LLM LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
CoRR 2023-12-21 Github Project
Dolphins Star
Dolphins: Multimodal Language Model for Driving
ArXiv 2023-12-1 Github Project

🎯Back to Top

Planning

Name Title Venue Date Code Project
DriveGPT4 DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model
RAL 2024-8-7 Github Project
SurrealDriver Star
SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers' Driving-thinking Data
ArXiv 2024-7-22 Github Project
DriveVLM DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
CoRL 2024-6-25 Github Project
DiLu Star
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models
ICLR 2024-2-22 Github Project
LMDrive Star
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
CVPR 2023-12-21 Github Project
GPT-Driver Star
GPT-Driver: Learning to Drive with GPT
NeurIPS Workshop 2023-12-5 Github Project
ADriver-I ADriver-I: A General World Model for Autonomous Driving
ArXiv 2023-11-22 Github Project

🎯Back to Top

Prediction

Name Title Venue Date Code Project
Senna Star
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
ArXiv 2024-10-29 Github Project
BEV-InMLLM Star
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Model
CVPR 2024-1-2 Github Project
Prompt4Driving Star
Language Prompt for Autonomous Driving
ArXiv 2023-9-8 Github Project

🎯Back to Top

Embodied AI

Perception

Name Title Venue Date Code Project
Wonderful-Team Star
Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs
ArXiv 2024-12-4 Github Project
AffordanceLLM Star
AffordanceLLM: Grounding Affordance from Vision Language Models
CVPR 2024-4-17 Github Project
3DVisProg Star
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
CVPR 2024-3-23 Github Project
REPLAN REPLAN: Robotic Replanning with Perception and Language Models
ArXiv 2024-2-20 Github Project
PaLM-E PaLM-E: An Embodied Multimodal Language Model
ICML 2023-3-6 Github Project

🎯Back to Top

Manipulation

Name Title Venue Date Code Project
OpenVLA Star
OpenVLA: An Open-Source Vision-Language-Action Model
ArXiv 2024-9-5 Github Project
LLARVA Star
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
CoRL 2024-6-17 Github Project
RT-X Star
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
ArXiv 2024-6-1 Github Project
RoboFlamingo Vision-Language Foundation Models as Effective Robot Imitators
ICLR 2024-2-5 Github Project
VoxPoser Star
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
CoRL 2023-11-2 Github Project
ManipLLM Star
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
CVPR 2023-12-24 Github Project
RT-2 RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
ArXiv 2023-7-28 Github Project
Instruct2Act Star
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
ArXiv 2023-5-24 Github Project

🎯Back to Top

Planning

Name Title Venue Date Code Project
LLaRP Star
Large Language Models as Generalizable Policies for Embodied Tasks
ICLR 2024-4-16 Github Project
MP5 Star
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
CVPR 2024-3-24 Github Project
LL3DA Star
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
CVPR 2023-11-30 Github Project
EmbodiedGPT Star
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
NeurIPS 2023-11-2 Github Project
ELLM Star
Guiding Pretraining in Reinforcement Learning with Large Language Models
ICML 2023-9-15 Github Project
3D-LLM Star
3D-LLM: Injecting the 3D World into Large Language Models
NeurIPS 2023-7-24 Github Project
NLMap Star
Open-vocabulary Queryable Scene Representations for Real World Planning
ICRA 2023-7-4 Github Project

🎯Back to Top

Navigation

Name Title Venue Date Code Project
ConceptGraphs Star
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
ICRA 2024-5-13 Github Project
RILA RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation
CVPR 2024-4-27 Github Project
EMMA Star
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
CVPR 2024-3-29 Github Project
VLN-VER Star
Volumetric Environment Representation for Vision-Language Navigation
CVPR 2024-3-24 Github Project
MultiPLY Star
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
CVPR 2024-1-16 Github Project

🎯Back to Top

Automated tool management

Name Title Venue Date Code Project
Falcon-UI Falcon-UI: Understanding GUI Before Following User Instructions arXiv 2024-12-12 Github Project
AGENTTREK AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials arXiv 2024-12-12 Github Project
Aguvis Star
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
arXiv 2024-12-12 Github Project
ScribeAgent Star
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data
ArXiv 2024-12-5 Github Project
ShowUI Star
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
NeurIPS Workshop 2024-11-26 Github Project
MultiUI Star
Harnessing Webpage UIs for Text-Rich Visual Understanding
ArXiv 2024-11-6 Github Project
EDGE EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data ArXiv 2024-11-2 Github Project
AndroidLab Star
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
NeurIPS Workshop 2024-10-30 Github Project
OS-ATLAS Star
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
ArXiv 2024-10-30 Github Project
AutoGLM AutoGLM: Autonomous Foundation Agents for GUIs ArXiv 2024-10-30 Github Project
Ferret-UI 2 Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms ArXiv 2024-10-24 Github Project
Tool-LMM Star
Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning
arXiv 2024-1-19 Github Project
CLOVA Star
CLOVA: A Closed-loop Visual Assistant with Tool Usage and Update
CVPR 2023-12-18 Github Project
CRAFT Star
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
arXiv 2023-9-29 Github Project
Confucius Star
Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum
AAAI 2023-8-27 Github Project
AVIS Avis: Autonomous visual information seeking with large language model agent NeurIPS 2023-6-13 Github Project
GPT4Tools Star
GPT4Tools: Teaching large language model to use tools via self-instruction
NeurIPS 2023-5-30 Github Project
ToolkenGPT Star
ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings
NeurIPS 2023-5-19 Github Project
Chameleon Star
Chameleon: Plug-and-play compositional reasoning with large language models
NeurIPS 2023-4-19 Github Project
HuggingGPT Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
NeurIPS 2023-3-30 Github Project
TaskMatrix.AI TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs Intelligent Computing (AAAS) 2023-3-29 Github Project
MM-ReACT Star
MM-ReACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2023-3-20 Github Project
ViperGPT Star
ViperGPT: Visual Inference via Python Execution for Reasoning
ICCV 2023-3-14 Github Project
Mind’s Eye Mind’s Eye: Grounded Language Model Reasoning through Simulation arXiv 2022-10-11 GitHub Project

🎯Back to Top

Text-to-vision

Text-to-image

Name Title Venue Date Code Project
LLMGA Star
LLMGA: Multimodal Large Language Model based Generation Assistant
ECCV 2024-7-27 GitHub Project
Emu Star
Generative Pretraining in Multimodality
ICLR 2024-5-8 GitHub Project
Kosmos-G Kosmos-G: Generating Images in Context with Multimodal Large Language Models ICLR 2024-4-26 GitHub Project
LaVIT Star
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
ICLR 2024-3-22 GitHub Project
MiniGPT-5 Star
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
ArXiv 2024-3-15 GitHub Project
LMD Star
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
TMLR 2024-3-4 GitHub Project
DiffusionGPT Star
DiffusionGPT: LLM-Driven Text-to-Image Generation System
ArXiv 2024-1-18 GitHub Project
VL-GPT Star
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
ArXiv 2023-12-4 GitHub Project
CoDi-2 Star
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
CVPR 2023-11-30 GitHub Project
SEED-LLAMA Star
Making LLaMA SEE and Draw with SEED Tokenizer
CVPR 2023-10-3 GitHub Project
JAM Jointly Training Large Autoregressive Multimodal Models ICLR 2023-9-28 GitHub Project
CM3Leon Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning ArXiv 2023-9-5 GitHub Project
SEED Star
Planting a SEED of Vision in Large Language Model
ICLR 2023-8-12 GitHub Project
GILL Star
Generating Images with Multimodal Language Models
NeurIPS 2023-5-26 GitHub Project

🎯Back to Top

Text-to-3D

Name Title Venue Date Code Project
3DGPT Star
3D-GPT: Procedural 3D Modeling with Large Language Models
ArXiv 2024-5-29 GitHub Project
Holodeck Star
Holodeck: Language Guided Generation of 3D Embodied AI Environments
CVPR 2024-4-22 GitHub Project
LLMR Star
LLMR: Real-time Prompting of Interactive Worlds using Large Language Models
ACM CHI 2024-3-22 GitHub Project
GPT4Point Star
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
ArXiv 2023-12-1 GitHub Project
ShapeGPT Star
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
ArXiv 2023-12-1 GitHub Project
MeshGPT Star
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
ArXiv 2023-11-27 GitHub Project
LI3D Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback NeurIPS 2023-5-26 GitHub Project

🎯Back to Top

Text-to-video

Name Title Venue Date Code Project
Mora Star
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
ArXiv 2024-10-3 GitHub Project
VideoStudio Star
VideoStudio: Generating Consistent-Content and Multi-Scene Videos
ECCV 2024-9-16 GitHub Project
VideoDirectorGPT Star
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
COLM 2024-7-12 GitHub Project
VideoPoet VideoPoet: A Large Language Model for Zero-Shot Video Generation ICML 2024-6-4 GitHub Project
MAGVIT-v2 Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation ICLR 2024-3-29 GitHub Project
LLM-groundedDiffusion Star
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
TMLR 2023-11-27 GitHub Project
SVD Star
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
TMLR 2023-11-27 GitHub Project
Free-Bloom Star
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
NeurIPS 2023-9-25 GitHub Project

🎯Back to Top

Other applications

Face

Name Title Venue Date Code Project
Face-MLLM Face-MLLM: A Large Face Perception Model arXiv 2024-10-28 Github Project
ExpLLM ExpLLM: Towards Chain of Thought for Facial Expression Recognition arXiv 2024-9-4 Github Project
EMO-LLaMA Star
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning
arXiv 2024-8-21 Github Project
EmoLA Star
Facial Affective Behavior Analysis with Instruction Tuning
ECCV 2024-7-12 Github Project
EmoLLM Star
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models
ArXiv 2024-6-29 Github Project

🎯Back to Top

Anomaly Detection

Name Title Venue Date Code Project
HAWK Star
HAWK: Learning to Understand Open-World Video Anomalies
NeurIPS 2024-5-27 Github Project
CUVA Star
Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
CVPR 2024-5-6 Github Project
LAVAD Star
Harnessing Large Language Models for Training-free Video Anomaly Detection
CVPR 2024-4-1 Github Project

🎯Back to Top

Gaming

Name Title Venue Date Code Project
ADAM Star
Adam: An Embodied Causal Agent in Open-World Environments
ArXiv 2024-10-29 Github Project
VARP Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case ArXiv 2024-09-19 Github Project
DLLM Star
World Models with Hints of Large Language Models for Goal Achieving
ArXiv 2024-06-11 Github Project
MineDreamer Star
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
NeurIPS 2024 Workshop 2024-03-18 Github Project
HAS Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation ICLR 2024-03-13 Github Project
CRADLE Star
CRADLE: Empowering Foundation Agents Towards General Computer Control
ArXiv 2024-03-05 Github Project
Atari-GPT Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games ArXiv 2024-03-05 Github Project
MP5 Star
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
CVPR 2023-12-12 Github Project
STEVE Star
See and Think: Embodied Agent in Virtual Environment
ECCV 2023-11-26 Github Project
STEVE-EYE Star
Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
ICLR 2023-10-20 Github Project
JARVIS-1 Star
JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
ArXiv 2023-10-11 Github Project

🎯Back to Top

Efficiency

Name Title Venue Date Code Project
LLaVA-Mini Star
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
ArXiv 2025-1-7 Github Project
Dynamic-VLM Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM ArXiv 2024-12-12 Github Project
PVC Star
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
ArXiv 2024-12-12 Github Project
iLLaVA Star
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
ArXiv 2024-12-8 Github Project
VTC-CLS Star
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
ArXiv 2024-12-8 Github Project
NegToMe Star
Negative Token Merging: Image-based Adversarial Feature Guidance
ArXiv 2024-12-5 Github Project
VisionZip Star
VisionZip: Longer is Better but Not Necessary in Vision Language Models
ArXiv 2024-12-5 Github Project
AIM Star
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
ArXiv 2024-12-4 Github Project
Dynamic-LLaVA Star
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
ArXiv 2024-12-3 Github Project
ATP-LLaVA ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models ArXiv 2024-11-30 Github Project
YOPO Star
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
ArXiv 2024-11-30 Github Project
DyCoke Star
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
ArXiv 2024-11-22 Github Project
LLaVA-MR LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval ArXiv 2024-11-21 Github Project
FoPru FoPru: Focal Pruning for Efficient Large Vision-Language Models ArXiv 2024-11-21 Github Project
FocusLLaVA FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression ArXiv 2024-11-21 Github Project
RLT Star
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
NeurIPS 2024-11-7 Github Project
LLaVolta Star
Efficient Large Multi-modal Models via Visual Context Compression
NeurIPS 2024-11-6 Github Project
QueCC Star
Inference Optimal VLMs Need Only One Visual Token but Larger Models
ArXiv 2024-11-5 Github Project
PyramidDrop Star
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
ArXiv 2024-10-22 Github Project
Victor Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers ArXiv 2024-10-17 Github Project
AVG-LLaVA Star
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
ArXiv 2024-10-4 Github Project
TRIM Star
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
COLING 2024-9-28 Github Project
TokenPacker Star
TokenPacker: Efficient Visual Projector for Multimodal LLM
ArXiv 2024-8-28 Github Project
MaVEn MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model NeurIPS 2024-8-26 Github Project
HiRED Star
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
AAAI 2024-8-20 Github Project
VoCo-LLaMA Star
VoCo-LLaMA: Towards Vision Compression with Large Language Models
ArXiv 2024-6-18 Github Project
DeCo Star
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
ArXiv 2024-5-31 Github Project
LLaVA-PruMerge Star
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
ArXiv 2024-5-22 Github Project
FastV Star
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
ECCV 2024-5-5 Github Project
LLaVA-HR Star
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
ArXiv 2024-3-5 Github Project

🎯Back to Top

Security

Name Title Venue Date Code Project
SynthVLM Star
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models
ArXiv 2024-8-10 Github Project
WolfMLLM Star
The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative
ArXiv 2024-6-3 Github Project
AttackMLLM Synthvlm: High-efficiency and high-quality synthetic data for vision language models ICLRW 2024-5-16 Github Project
OODCV Star
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
ECCV 2023-11-27 Github Project
InjectMLLM Star
(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
ArXiv 2023-10-3 Github Project
AdvMLLM On the Adversarial Robustness of Multi-Modal Foundation Models ICCVW 2023-8-21 Github Project

🎯Back to Top

Interpretability and explainability

Name Title Venue Date Code Project
MultiTrust Star
MULTITRUST: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
ArXiv 2024-12-6 Github Project
XL-VLMs Star
A Concept-Based Explainability Framework for Large Multimodal Models
NeurIPS 2024-11-30 Github Project
VPS Star
Interpreting Object-level Foundation Models via Visual Precision Search
ArXiv 2024-11-25 Github Project
SAE
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
ArXiv 2024-11-22 Github Project
MLLM-Probe
Probing Multimodal Large Language Models for Global and Local Semantic Representations
ArXiv 2024-11-21 Github Project
LexVLA Star
Unified Lexical Representation for Interpretable Visual-Language Alignment
NeurIPS 2024-11-11 Github Project
MUB Star
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
ArXiv 2024-11-5 Github Project
LLaVA-CAM Star
From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
ArXiv 2024-10-17 Github Project
LLaVA-Interp Star
Towards Interpreting Visual Information Processing in Vision-Language Models
ArXiv 2024-10-9 Github Project
MINER Star
MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models
ArXiv 2024-10-7 Github Project
VL-Interpret Star
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
ArXiv 2024-10-3 Github Project
MMNeuron Star
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
ArXiv 2024-10-1 Github Project
MLLM-ONTO Star
Enhancing Explainability in Multimodal Large Language Models Using Ontological Context
ArXiv 2024-9-27 Github Project
EAGLE Star
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
ArXiv 2024-8-28 Github Project
MLLM-Law Star
Law of Vision Representation in MLLMs
ArXiv 2024-8-24 Code Project
VALE VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models ArXiv 2024-8-23 Code Project
DistTrain DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models ArXiv 2024-8-15 Github Project
MLLM-Projection Star
Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space
ArXiv 2024-8-9 Github Project
Reason2Drive Star
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
ECCV 2024-7-20 Github Project
LVLM-LP Star
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
ECCV 2024-7-17 Github Project
CLIP-Neurons Star
Interpreting the Second-Order Effects of Neurons in CLIP
ArXiv 2024-6-24 Github Project
LVLM-Interpret Star
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
ArXiv 2024-6-24 Github Project
Holmes-VAD Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM ArXiv 2024-6-18 Github Project
MMNeurons Star
Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers
ACL 2024-6-11 Github Project
DeCo Star
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
ICML 2024-5-31 Github Project
MAIA Star
A Multimodal Automated Interpretability Agent
ICML 2024-4-22 Github Project
CDL Star
Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
ArXiv 2024-4-19 Github Project
OLIVE Star
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
NAACL 2024-4-3 Github Project
OPERA Star
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2024-3-12 Github Project
RLHF-V Star
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
CVPR 2024-3-8 Github Project
POVID Star
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
ArXiv 2024-2-18 Github Project
HA-DPO Star
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
ArXiv 2024-2-6 Github Project
BenchLMM Star
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
ECCV 2023-12-6 Github Project
VCD Star
VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
CVPR 2023-11-28 Github Project
LLaVA-RLHF Star
LLaVA-RLHF: Aligning Large Multimodal Models with Factually Augmented RLHF
ArXiv 2023-9-25 Github Project

🎯Back to Top

Complex reasoning

Name Title Venue Date Code Project
VLM-R1 Star
VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
None 2025-2-15 Github Project
R1-V Star
R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3
Blog 2025-2-3 Github Project
LlamaV-o1 Star
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
ArXiv 2025-1-10 Github Project
Virgo Star
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
ArXiv 2025-1-3 Github Project
Mulberry Star
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
ArXiv 2024-12-31 Github Project
LLaVA-CoT Star
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
ArXiv 2024-11-25 Github Project

🎯Back to Top

Contributors

Thanks to all the contributors!
