awesome-VLLMs

This repository collects papers on VLLM applications. We add new papers periodically.

Stars: 52

README:

Awesome Visual Large Language Models (VLLMs)

🔥🔥🔥 Visual Large Language Models for Generalized and Specialized Applications

Vision language models (VLMs) have emerged as powerful tools for learning unified embedding spaces that integrate vision and language. Inspired by the success of large language models (LLMs), which have demonstrated remarkable reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining significant attention for developing both general-purpose and specialized VLMs.

In this repository, we provide a comprehensive summary of the current literature from an application-oriented perspective. We hope this resource serves as a valuable reference for the VLLM research community.

If you are interested in this project, you can contribute to this repo by opening pull requests 😊😊😊

📢 News

🚀 What's New in This Update:

[2025.3.2]: 🔥 Adding two projects on complex reasoning: R1-V and VLM-R1!
[2025.2.23]: 🔥 Adding one video-to-action paper and one vision-to-text paper!
[2025.2.1]: 🔥 Adding four video-to-text papers!
[2025.1.22]: 🔥 Adding one video-to-text paper!
[2025.1.17]: 🔥 Adding three video-to-text papers, thanks to contributions from Enxin!
[2025.1.14]: 🔥 Adding two complex reasoning papers and one video-to-text paper!
[2025.1.13]: 🔥 Adding one VFM survey paper!
[2025.1.12]: 🔥 Adding one efficient MLLM paper!
[2025.1.9]: 🔥🔥🔥 Adding one efficient MLLM survey!
[2025.1.7]: 🔥🔥🔥 Our survey paper is released! Please check this link for more information. We also added more tool management papers to our paper list.
[2025.1.6]: 🔥 We added one OS Agent survey paper to our paper list, along with a new category: complex reasoning!
[2025.1.4]: 🔥 We updated the general domain and egocentric video papers in our paper list, thanks to contributions from Wentao!
[2025.1.2]: 🔥 We added more interpretation papers to our paper list, thanks to contributions from Ruoyu!
[2024.12.15]: 🔥 We released our VLLM application paper list repo!

🌈 Table of Contents

Visual Large Language Models for Generalized and Specialized Applications

Existing VLM surveys

VLM surveys

Title	Venue	Date	Code	Project
Foundation Models Defining a New Era in Vision: A Survey and Outlook	T-PAMI	2025-1-9	Github	Project
Vision-Language Models for Vision Tasks: A Survey	T-PAMI	2024-8-8	Github	Project
Vision + Language Applications: A Survey	CVPRW	2023-5-24	Github	Project
Vision-and-Language Pretrained Models: A Survey	IJCAI (survey track)	2022-5-3	Github	Project

🎯Back to Top

MLLM surveys

Title	Venue	Date	Code	Project
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use	ArXiv	2024-12-27	Github	Project
Towards Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey	ArXiv	2024-12-3	Github	Project
A Survey on Multimodal Large Language Models	T-PAMI	2024-11-29	Github	Project
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs	ArXiv	2024-11-22	Github	Project
A Survey on Multimodal Large Language Models	National Science Review	2024-11-12	Github	Project
Video Understanding with Large Language Models: A Survey	ArXiv	2024-6-24	Github	Project
A Survey on Multimodal Benchmarks: In the Era of Large AI Models	ArXiv	2024-9-21	Github	Project
The Revolution of Multimodal Large Language Models: A Survey	ArXiv	2024-6-6	Github	Project
Efficient Multimodal Large Language Models: A Survey	ArXiv	2024-5-17	Github	Project
A Survey on Hallucination in Large Vision-Language Models	ArXiv	2024-5-6	Github	Project
Hallucination of Multimodal Large Language Models: A Survey	ArXiv	2024-4-29	Github	Project
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions	ArXiv	2024-4-12	Github	Project
MM-LLMs: Recent Advances in MultiModal Large Language Models	ArXiv	2024-2-20	Github	Project
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning	ArXiv	2024-1-18	Github	Project
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey	ArXiv	2023-12-27	Github	Project
Multimodal Large Language Models: A Survey	BigData	2023-12-15	Github	Project

🎯Back to Top

Vision-to-text

Image-to-text

General domain

General ability

Name	Title	Venue	Date	Code	Project
InternVL2.5	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	ArXiv	2024-12-17	Github	Project
CompCap	CompCap: Improving Multimodal Large Language Models with Composite Captions	ArXiv	2024-12-06	Github	Project
NVILA	NVILA: Efficient Frontier Visual Language Models	ArXiv	2024-12-05	Github	Project
Molmo and PixMo	Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models	ArXiv	2024-09-25	Github	Project
Qwen2-VL	Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution	ArXiv	2024-09-18	Github	Project
mPLUG-Owl3	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	ArXiv	2024-08-09	Github	Project
LLaVA-OneVision	LLaVA-OneVision: Easy Visual Task Transfer	ArXiv	2024-08-06	Github	Project
VILA$^2$	VILA$^2$: VILA Augmented VILA	ArXiv	2024-07-24	Github	Project
EVLM	EVLM: An Efficient Vision-Language Model for Visual Understanding	ArXiv	2024-07-19	Github	Project
MG-LLaVA	MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning	ArXiv	2024-06-27	Github	Project
Cambrian-1	Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs	ArXiv	2024-06-24	Github	Project
Ovis	Ovis: Structural Embedding Alignment for Multimodal Large Language Model	ArXiv	2024-05-31	Github	Project
ConvLLaVA	ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models	ArXiv	2024-05-24	Github	Project
Meteor	Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models	NeurIPS	2024-05-24	Github	Project
CuMo	CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts	ArXiv	2024-05-09	Github	Project
Mini-Gemini	Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models	ArXiv	2024-03-27	Github	Project
MM1	MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training	ArXiv	2024-03-14	Github	Project
DeepSeek-VL	DeepSeek-VL: Towards Real-World Vision-Language Understanding	ArXiv	2024-03-08	Github	Project
InternLM-XComposer2	InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model	ArXiv	2024-01-29	Github	Project
MoE-LLaVA	MoE-LLaVA: Mixture of Experts for Large Vision-Language Models	ArXiv	2024-01-29	Github	Project
InternVL	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	CVPR	2023-12-21	Github	Project
VILA	VILA: On Pre-training for Visual Language Models	ArXiv	2023-12-12	Github	Project
Vary	Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models	ECCV	2023-12-11	Github	Project
Honeybee	Honeybee: Locality-enhanced Projector for Multimodal LLM	CVPR	2023-11-11	Github	Project
OtterHD	OtterHD: A High-Resolution Multi-modality Model	ArXiv	2023-11-07	Github	Project
mPLUG-Owl2	mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	CVPR	2023-11-07	Github	Project
Fuyu	Fuyu-8B: A Multimodal Architecture for AI Agents	ArXiv	2023-10-17	Github	Project
MiniGPT-v2	MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning	ArXiv	2023-10-14	Github	Project
LLaVA 1.5	Improved Baselines with Visual Instruction Tuning	ArXiv	2023-10-05	Github	Project
InternLM-XComposer	InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition	ArXiv	2023-09-26	Github	Project
Qwen-VL	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	ArXiv	2023-08-24	Github	Project
StableLLaVA	StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data	ArXiv	2023-08-20	Github	Project
BLIVA	BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions	AAAI	2023-08-19	Github	Project
SVIT	SVIT: Scaling up Visual Instruction Tuning	ArXiv	2023-07-09	Github	Project
LaVIN	Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models	NeurIPS	2023-05-24	Github	Project
InstructBLIP	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	NeurIPS	2023-05-11	Github	Project
MultiModal-GPT	MultiModal-GPT: A Vision and Language Model for Dialogue with Humans	ArXiv	2023-05-08	Github	Project
Otter	Otter: A Multi-Modal Model with In-Context Instruction Tuning	ArXiv	2023-05-05	Github	Project
mPLUG-Owl	mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	ArXiv	2023-04-27	Github	Project
LLaMA-Adapter V2	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	ArXiv	2023-04-28	Github	Project
MiniGPT-4	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	NeurIPS	2023-04-20	Github	Project
LLaVA	Visual Instruction Tuning	NeurIPS	2023-04-17	Github	Project
LLaMA-Adapter	LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention	ICLR	2023-03-28	Github	Project
Kosmos-1	Language Is Not All You Need: Aligning Perception with Language Models	NeurIPS	2023-02-27	Github	Project
Flamingo	Flamingo: a Visual Language Model for Few-Shot Learning	NeurIPS	2022-04-29	Github	Project

🎯Back to Top

REC

Name	Title	Venue	Date	Code	Project
ChatRex	ChatRex: Taming Multimodal LLM for Joint Perception and Understanding	ArXiv	2024-11-27	Github	Project
Griffon-G	Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models	ArXiv	2024-10-21	Github	Project
Ferret	Ferret: Refer and Ground Anything Anywhere at Any Granularity	ICLR	2024-10-11	Github	Project
OMG-LLaVA	OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding	NeurIPS	2024-06-27	Github	Project
VisionLLMv2	VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks	ArXiv	2024-06-12	Github	Project
Groma	Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	ECCV	2024-04-19	Github	Project
Griffonv2	Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring	ArXiv	2024-03-14	Github	Project
ASMv2	The All-Seeing Project V2: Towards General Relation Comprehension of the Open World	ECCV	2024-02-29	Github	Project
SPHINX-X	SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models	ArXiv	2024-02-08	Github	Project
ChatterBox	ChatterBox: Multi-round Multimodal Referring and Grounding	ArXiv	2024-01-24	Github	Project
LEGO	LEGO: Language Enhanced Multi-modal Grounding Model	ArXiv	2024-01-12	Github	Project
GroundingGPT	GroundingGPT: Language Enhanced Multi-modal Grounding Model	ACL	2024-01-11	Github	Project
BuboGPT	BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	ArXiv	2024-07-17	Github	Project
Ferret-v2	Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models	COLM	2024-04-11	Github	Project
InfMLLM	InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory	NeurIPS	2024-02-07	Github	Project
VistaLLM	Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model	ECCV	2023-12-19	Github	Project
LLaVA-Grounding	LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models	ArXiv	2023-12-05	Github	Project
Lenna	Lenna: Language Enhanced Reasoning Detection Assistant	ArXiv	2023-12-05	Github	Project
Griffon	Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models	ECCV	2023-11-24	Github	Project
Lion	Lion: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge	CVPR	2023-11-20	Github	Project
SPHINX	SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models	ArXiv	2023-11-13	Github	Project
NExT-Chat	NExT-Chat: An LMM for Chat, Detection and Segmentation	ArXiv	2023-11-08	Github	Project
GLaMM	GLaMM: Pixel Grounding Large Multimodal Model	CVPR	2023-11-06	Github	Project
CogVLM	CogVLM: Visual Expert for Pretrained Language Models	ArXiv	2023-11-06	Github	Project
Pink	Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs	CVPR	2023-10-01	Github	Project
PVIT	Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models	ArXiv	2023-08-25	Github	Project
ASM	The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World	ICLR	2023-08-03	Github	Project
Shikra	Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic	ArXiv	2023-06-27	Github	Project
Kosmos-2	KOSMOS-2: Grounding Multimodal Large Language Models to the World	ICLR	2023-06-26	Github	Project
ChatSpot	ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning	ArXiv	2023-07-18	Github	Project
GPT4RoI	GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest	IJCAI	2023-07-07	Github	Project
ContextDET	Contextual Object Detection with Multimodal Large Language Models	ArXiv	2023-05-29	Github	Project
DetGPT	DetGPT: Detect What You Need via Reasoning	ArXiv	2023-05-23	Github	Project
VisionLLM	VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks	ArXiv	2023-05-18	Github	Project

🎯Back to Top

RES

Name	Title	Venue	Date	Code	Project
OMG-LLaVA	OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding	NeurIPS	2024-06-27	Github	Project
VisionLLMv2	VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks	ArXiv	2024-06-12	Github	Project
LLM-Seg	LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning	CVPR	2024-04-12	Github	Project
PSALM	PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model	ECCV	2024-03-21	Github	Project
GROUNDHOG	GROUNDHOG: Grounding Large Language Models to Holistic Segmentation	CVPR	2024-02-26	Github	Project
GELLA	Generalizable Entity Grounding via Assistance of Large Language Model	ECCV	2024-02-04	Github	Project
OMG-Seg	OMG-Seg: Is One Model Good Enough For All Segmentation?	CVPR	2024-01-18	Github	Project
LISA++	LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model	ArXiv	2023-12-28	Github	Project
VistaLLM	Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model	ECCV	2023-12-19	Github	Project
Osprey	Osprey: Pixel Understanding with Visual Instruction Tuning	CVPR	2023-12-15	Github	Project
GSVA	GSVA: Generalized Segmentation via Multimodal Large Language Models	CVPR	2023-12-05	Github	Project
PixelLM	PixelLM: Pixel Reasoning with Large Multimodal Model	CVPR	2023-12-04	Github	Project
PixelLLM	PixelLM: Pixel Reasoning with Large Multimodal Model	ECCV	2023-12-04	Github	Project
LLaFS	LLaFS: When Large Language Models Meet Few-Shot Segmentation	CVPR	2023-11-28	Github	Project
NExT-Chat	NExT-Chat: An LMM for Chat, Detection and Segmentation	ArXiv	2023-11-08	Github	Project
GLaMM	GLaMM: Pixel Grounding Large Multimodal Model	CVPR	2023-11-06	Github	Project
LISA	LISA: Reasoning Segmentation via Large Language Model	CVPR	2023-08-01	Github	Project
ContextDET	Contextual Object Detection with Multimodal Large Language Models	ArXiv	2023-05-29	Github	Project
VisionLLM	VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks	ArXiv	2023-05-18	Github	Project

🎯Back to Top

OCR

Name	Title	Venue	Date	Code	Project
TextHawk2	TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens	ArXiv	2024-10-07	Github	Project
DocKylin	DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming	AAAI	2024-06-27	Github	Project
StrucTexTv3	StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond	ArXiv	2024-05-31	Github	Project
Fox	Focus Anywhere for Fine-grained Multi-page Document Understanding	ArXiv	2024-05-23	Github	Project
TextMonkey	TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document	ArXiv	2024-05-07	Github	Project
TinyChart	TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning	ACL	2024-04-25	Github	Project
TextHawk	TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models	ArXiv	2024-04-14	Github	Project
HRVDA	HRVDA: High-Resolution Visual Document Assistant	CVPR	2024-04-10	Github	Project
InternLM-XComposer2-4KHD	InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD	NeurIPS	2024-04-09	Github	Project
LayoutLLM	LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding	CVPR	2024-04-08	Github	Project
ViTLP	Visually Guided Generative Text-Layout Pre-training for Document Intelligence	NAACL	2024-03-25	Github	Project
mPLUG-DocOwl 1.5	mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding	ArXiv	2024-03-19	Github	Project
DoCo	Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models	CVPR	2024-02-29	Github	Project
TGDoc	Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs	ArXiv	2023-11-22	Github	Project
DocPedia	DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding	ArXiv	2023-11-20	Github	Project
UReader	UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model	ACL	2023-10-08	Github	Project
UniDoc	UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding	ArXiv	2023-08-19	Github	Project
mPLUG-DocOwl	mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding	ArXiv	2023-07-04	Github	Project
LLaVAR	LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding	ArXiv	2023-06-29	Github	Project

🎯Back to Top

Retrieval

Name	Title	Venue	Date	Code	Project
EchoSight	EchoSight: Advancing Visual-Language Models with Wiki Knowledge	EMNLP	2024-07-17	Github	Project
FROMAGe	Grounding Language Models to Images for Multimodal Inputs and Outputs	ICML	2024-01-31	Github	Project
Wiki-LLaVA	Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs	CVPR	2023-04-23	Github	Project
UniMuR	Unified Embeddings for Multimodal Retrieval via Frozen LLMs	ICML	2019-05-08	Github	Project

🎯Back to Top

VLLM+X

Remote sensing

Name	Title	Venue	Date	Code	Project
VHM	VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis	ArXiv	2024-11-06	Github	Project
LHRS-Bot	LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model	ECCV	2024-07-16	Github	Project
Popeye	Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery	J-STARS	2024-06-13	Github	Project
RS-LLaVA	RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery	Remote Sens.	2024-04-23	Github	Project
EarthGPT	EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain	TGRS	2024-03-08	Github	Project
RS-CapRet	Large Language Models for Captioning and Retrieving Remote Sensing Images	ArXiv	2024-02-09	Github	Project
SkyEyeGPT	SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	ArXiv	2024-01-18	Github	Project
GeoChat	GeoChat: Grounded Large Vision-Language Model for Remote Sensing	CVPR	2023-11-24	Github	Project
RSGPT	RSGPT: A Remote Sensing Vision Language Model and Benchmark	ArXiv	2023-07-28	Github	Project

🎯Back to Top

Medical

Name	Title	Venue	Date	Code	Project
UMed-LVLM	Training Medical Large Vision-Language Models with Abnormal-Aware Feedback	ArXiv	2025-01-02	Github	Project
PMC-VQA	PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering	ArXiv	2024-09-08	Github	Project
MedVersa	A Generalist Learner for Multifaceted Medical Image Interpretation	ArXiv	2024-05-13	Github	Project
PeFoMed	PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging	ArXiv	2024-04-16	Github	Project
RaDialog	RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance	ArXiv	2023-11-30	Github	Project
Med-Flamingo	Med-Flamingo: a Multimodal Medical Few-shot Learner	ML4H	2023-07-27	Github	Project
XrayGPT	XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models	BioNLP	2023-06-13	Github	Project
LLaVA-Med	LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day	NeurIPS	2023-06-01	Github	Project
CXR-RePaiR-Gen	Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models	MLHC	2023-05-05	Github	Project

🎯Back to Top

Science and math

Name	Title	Venue	Date	Code	Project
MAVIS	MAVIS: Mathematical Visual Instruction Tuning	ECCV	2024-11-01	Github	Project
Math-LLaVA	Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models	EMNLP	2024-10-08	Github	Project
MathVerse	MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?	ECCV	2024-08-18	Github	Project
We-Math	We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?	ArXiv	2024-07-01	Github	Project
CMMaTH	CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models	ArXiv	2024-06-28	Github	Project
GeoEval	GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving	ACL	2024-05-17	Github	Project
FigurA11y	FigurA11y: AI Assistance for Writing Scientific Alt Text	IUI	2024-04-05	Github	Project
MathVista	MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts	ICLR	2024-01-21	Github	Project
mPLUG-PaperOwl	mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model	ACM MM	2024-01-09	Github	Project
G-LLaVA	G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model	ArXiv	2023-12-18	Github	Project
T-SciQ	T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering	AAAI	2023-12-18	Github	Project
ScienceQA	Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering	NeurIPS	2022-10-17	Github	Project

🎯Back to Top

Graphics and UI

Name	Title	Venue	Date	Code	Project
Graphist	Graphic Design with Large Multimodal Model	ArXiv	2024-04-22	Github	Project
Ferret-UI	Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs	ECCV	2024-04-08	Github	Project
CogAgent	CogAgent: A Visual Language Model for GUI Agents	CVPR	2023-12-21	Github	Project

🎯Back to Top

Financial analysis

Name	Title	Venue	Date	Code	Project
FinTral	FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models	ACL	2024-06-14	Github	Project
FinVis-GPT	FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis	ArXiv	2023-07-31	Github	Project

🎯Back to Top

Video-to-text

General domain

Name	Title	Venue	Date	Code	Project
TPO	Temporal Preference Optimization for Long-Form Video Understanding	ArXiv	2025-1-23	Github	Project
VideoLLaMA3	VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding	ArXiv	2025-1-22	Github	Project
Aria	Aria: An Open Multimodal Native Mixture-of-Experts Model	ArXiv	2024-12-17	Github	Project
Apollo	Apollo: An Exploration of Video Understanding in Large Multimodal Models	ArXiv	2024-12-13	Github	Project
LinVT	LinVT: Empower Your Image-level Large Language Model to Understand Videos	ArXiv	2024-12-11	Github	Project
Video-LLaMA2	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	ArXiv	2024-10-30	Github	Project
LLaVA-OneVision	LLaVA-OneVision: Easy Visual Task Transfer	ArXiv	2024-10-26	Github	Project
Oryx	Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	ICLR	2024-10-22	Github	Project
LongVU	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	ArXiv	2024-10-22	Github	Project
AuroraCap	AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark	ArXiv	2024-10-4	Github	Project
LLaVA-Video	Video Instruction Tuning With Synthetic Data	ArXiv	2024-10-04	Github	Project
SlowFast-LLaVA	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	ArXiv	2024-9-15	Github	Project
InternVideo2	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	ArXiv	2024-8-14	Github	Project
mPLUG-Owl3	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	ArXiv	2024-08-13	Github	Project
Goldfish	Goldfish: Vision-Language Understanding of Arbitrarily Long Videos	ECCV	2024-07-17	Github	Project
VoT	Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition	ICML	2024-07-17	Github	Project
Flash-VStream	Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	ArXiv	2024-06-30	Github	Project
LLaVA-Next-Video	LLaVA-NeXT: A Strong Zero-shot Video Understanding Model	online	2024-04-30	Github	Project
PLLaVA	PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	ArXiv	2024-4-29	Github	Project
MovieChat+	MovieChat+: Question-aware Sparse Memory for Long Video Question Answering	ArXiv	2024-4-26	Github	Project
MiniGPT4-Video	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	CVPR Workshop	2024-04-04	Github	Project
ST-LLM	ST-LLM: Large Language Models Are Effective Temporal Learners	ECCV	2024-03-30	Github	Project
LLaMA-VID	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	ECCV	2023-11-28	Github	Project
MovieChat	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	CVPR	2023-7-31	Github	Project
Video-LLaMA	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	EMNLP	2023-10-25	Github	Project
Vid2Seq	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	CVPR	2023-03-21	Github	Project
LaViLa	Learning Video Representations from Large Language Models	CVPR	2022-12-08	Github	Project
VideoBERT	VideoBERT: A Joint Model for Video and Language Representation Learning	ICCV	2019-09-11	Github	Project

🎯Back to Top

Video conversation

Name	Title	Venue	Date	Code	Project
Video-LLaVA	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	EMNLP	2024-10-01	Github	Project
BT-Adapter	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	CVPR	2024-06-27	Github	Project
VideoGPT+	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	ArXiv	2024-06-13	Github	Project
Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	ACL	2024-06-10	Github	Project
MVBench	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	CVPR	2024-05-23	Github	Project
LVChat	LVCHAT: Facilitating Long Video Comprehension	ArXiv	2024-02-19	Github	Project
VideoChat	VideoChat: Chat-Centric Video Understanding	ArXiv	2024-01-04	Github	Project
Valley	Valley: Video Assistant with Large Language model Enhanced abilitY	ArXiv	2023-10-08	Github	Project

🎯Back to Top

Egocentric view

Name	Title	Venue	Date	Code	Project
PALM	PALM: Predicting Actions through Language Models	CVPR Workshop	2024-07-18	Github	Project
GPT4Ego	GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition	ArXiv	2024-05-11	Github	Project
AntGPT	AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?	ICLR	2024-04-01	Github	Project
LEAP	LEAP: LLM-Generation of Egocentric Action Programs	ArXiv	2023-11-29	Github	Project
LLM-Inner-Speech	Egocentric Video Comprehension via Large Language Model Inner Speech	CVPR Workshop	2023-06-18	Github	Project
LLM-Brain	LLM as A Robotic Brain: Unifying Egocentric Memory and Control	ArXiv	2023-04-25	Github	Project
LaViLa	Learning Video Representations from Large Language Models	CVPR	2022-12-08	Github	Project

🎯Back to Top

Vision-to-action

Autonomous driving

Perception

Name	Title	Venue	Date	Code	Project
DriveLM	DriveLM: Driving with Graph Visual Question Answering	ECCV	2024-7-17	Github	Project
Talk2BEV	Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving	ICRA	2024-5-13	Github	Project
NuScenes-QA	NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario	AAAI	2024-3-24	Github	Project
DriveMLM	DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving	ArXiv	2023-12-25	Github	Project
LiDAR-LLM	LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding	CoRR	2023-12-21	Github	Project
Dolphins	Dolphins: Multimodal Language Model for Driving	ArXiv	2023-12-1	Github	Project

🎯Back to Top

Planning

Name	Title	Venue	Date	Code	Project
DriveGPT4	DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model	RAL	2024-8-7	Github	Project
SurrealDriver	SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers’ Driving-thinking Data	ArXiv	2024-7-22	Github	Project
DriveVLM	DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models	CoRL	2024-6-25	Github	Project
DiLu	DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models	ICLR	2024-2-22	Github	Project
LMDrive	LMDrive: Closed-Loop End-to-End Driving with Large Language Models	CVPR	2023-12-21	Github	Project
GPT-Driver	GPT-Driver: Learning to Drive with GPT	NeurIPS Workshop	2023-12-5	Github	Project
ADriver-I	ADriver-I: A General World Model for Autonomous Driving	ArXiv	2023-11-22	Github	Project

🎯Back to Top

Prediction

Name	Title	Venue	Date	Code	Project
Senna	Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving	ArXiv	2024-10-29	Github	Project
BEV-InMLLM	Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Model	CVPR	2024-1-2	Github	Project
Prompt4Driving	Language Prompt for Autonomous Driving	ArXiv	2023-9-8	Github	Project

🎯Back to Top

Embodied AI

Perception

Name	Title	Venue	Date	Code	Project
Wonderful-Team	Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs	ArXiv	2024-12-4	Github	Project
AffordanceLLM	AffordanceLLM: Grounding Affordance from Vision Language Models	CVPR	2024-4-17	Github	Project
3DVisProg	Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding	CVPR	2024-3-23	Github	Project
REPLAN	REPLAN: Robotic Replanning with Perception and Language Models	ArXiv	2024-2-20	Github	Project
PaLM-E	PaLM-E: An Embodied Multimodal Language Model	ICML	2023-3-6	Github	Project

🎯Back to Top

Manipulation

Name	Title	Venue	Date	Code	Project
OpenVLA	OpenVLA: An Open-Source Vision-Language-Action Model	ArXiv	2024-9-5	Github	Project
LLARVA	LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning	CoRL	2024-6-17	Github	Project
RT-X	Open X-Embodiment: Robotic Learning Datasets and RT-X Models	ArXiv	2024-6-1	Github	Project
RoboFlamingo	Vision-Language Foundation Models as Effective Robot Imitators	ICLR	2024-2-5	Github	Project
VoxPoser	VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models	CoRL	2023-11-2	Github	Project
ManipLLM	ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation	CVPR	2023-12-24	Github	Project
RT-2	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	ArXiv	2023-7-28	Github	Project
Instruct2Act	Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model	ArXiv	2023-5-24	Github	Project

🎯Back to Top

Planning

Name	Title	Venue	Date	Code	Project
LLaRP	Large Language Models as Generalizable Policies for Embodied Tasks	ICLR	2024-4-16	Github	Project
MP5	MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception	CVPR	2024-3-24	Github	Project
LL3DA	LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning	CVPR	2023-11-30	Github	Project
EmbodiedGPT	EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought	NeurIPS	2023-11-2	Github	Project
ELLM	Guiding Pretraining in Reinforcement Learning with Large Language Models	ICML	2023-9-15	Github	Project
3D-LLM	3D-LLM: Injecting the 3D World into Large Language Models	NeurIPS	2023-7-24	Github	Project
NLMap	Open-vocabulary Queryable Scene Representations for Real World Planning	ICRA	2023-7-4	Github	Project

🎯Back to Top

Navigation

Name	Title	Venue	Date	Code	Project
ConceptGraphs	ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning	ICRA	2024-5-13	Github	Project
RILA	RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation	CVPR	2024-4-27	Github	Project
EMMA	Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld	CVPR	2024-3-29	Github	Project
VLN-VER	Volumetric Environment Representation for Vision-Language Navigation	CVPR	2024-3-24	Github	Project
MultiPLY	MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World	CVPR	2024-1-16	Github	Project

🎯Back to Top

Automated tool management

Name	Title	Venue	Date	Code	Project
Falcon-UI	Falcon-UI: Understanding GUI Before Following User Instructions	arXiv	2024-12-12	Github	Project
AGENTTREK	AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials	arXiv	2024-12-12	Github	Project
Aguvis	Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction	arXiv	2024-12-12	Github	Project
ScribeAgent	ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data	ArXiv	2024-12-5	Github	Project
ShowUI	ShowUI: One Vision-Language-Action Model for GUI Visual Agent	NeurIPS Workshop	2024-11-26	Github	Project
MultiUI	Harnessing Webpage UIs for Text-Rich Visual Understanding	ArXiv	2024-11-6	Github	Project
EDGE	EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data	ArXiv	2024-11-2	Github	Project
AndroidLab	AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents	NeurIPS Workshop	2024-10-30	Github	Project
OS-ATLAS	OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	ArXiv	2024-10-30	Github	Project
AutoGLM	AutoGLM: Autonomous Foundation Agents for GUIs	ArXiv	2024-10-30	Github	Project
Ferret-UI 2	Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms	ArXiv	2024-10-24	Github	Project
Tool-LMM	Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning	arXiv	2024-1-19	Github	Project
CLOVA	CLOVA: A Closed-loop Visual Assistant with Tool Usage and Update	CVPR	2023-12-18	Github	Project
CRAFT	CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets	arXiv	2023-9-29	Github	Project
Confucius	Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum	AAAI	2023-8-27	Github	Project
AVIS	Avis: Autonomous visual information seeking with large language model agent	NeurIPS	2023-6-13	Github	Project
GPT4Tools	GPT4Tools: Teaching large language model to use tools via self-instruction	NeurIPS	2023-5-30	Github	Project
ToolkenGPT	ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings	NeurIPS	2023-5-19	Github	Project
Chameleon	Chameleon: Plug-and-play compositional reasoning with large language models	NeurIPS	2023-4-19	Github	Project
HuggingGPT	HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face	NeurIPS	2023-3-30	Github	Project
TaskMatrix.AI	TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs	Intelligent Computing (AAAS)	2023-3-29	Github	Project
MM-ReACT	MM-ReACT: Prompting ChatGPT for Multimodal Reasoning and Action	arXiv	2023-3-20	Github	Project
ViperGPT	ViperGPT: Visual Inference via Python Execution for Reasoning	ICCV	2023-3-14	Github	Project
MIND’S EYE	Mind’s Eye: Grounded Language Model Reasoning through Simulation	arXiv	2022-10-11	GitHub	Project

🎯Back to Top

Text-to-vision

Text-to-image

Name	Title	Venue	Date	Code	Project
LLMGA	LLMGA: Multimodal Large Language Model based Generation Assistant	ECCV	2024-7-27	GitHub	Project
Emu	Generative Pretraining in Multimodality	ICLR	2024-5-8	GitHub	Project
Kosmos-G	Kosmos-G: Generating Images in Context with Multimodal Large Language Models	ICLR	2024-4-26	GitHub	Project
LaVIT	Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization	ICLR	2024-3-22	GitHub	Project
MiniGPT-5	MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens	ArXiv	2024-3-15	GitHub	Project
LMD	LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models	TMLR	2024-3-4	GitHub	Project
DiffusionGPT	DiffusionGPT: LLM-Driven Text-to-Image Generation System	ArXiv	2024-1-18	GitHub	Project
VL-GPT	VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation	ArXiv	2023-12-4	GitHub	Project
CoDi-2	CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation	CVPR	2023-11-30	GitHub	Project
SEED-LLAMA	Making LLaMA SEE and Draw with SEED Tokenizer	CVPR	2023-10-3	GitHub	Project
JAM	Jointly Training Large Autoregressive Multimodal Models	ICLR	2023-9-28	GitHub	Project
CM3Leon	Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning	ArXiv	2023-9-5	GitHub	Project
SEED	Planting a SEED of Vision in Large Language Model	ICLR	2023-8-12	GitHub	Project
GILL	Generating Images with Multimodal Language Models	NeurIPS	2023-5-26	GitHub	Project

🎯Back to Top

Text-to-3D

Name	Title	Venue	Date	Code	Project
3DGPT	3D-GPT: Procedural 3D Modeling with Large Language Models	ArXiv	2024-5-29	GitHub	Project
Holodeck	Holodeck: Language Guided Generation of 3D Embodied AI Environments	CVPR	2024-4-22	GitHub	Project
LLMR	LLMR: Real-time Prompting of Interactive Worlds using Large Language Models	ACM CHI	2024-3-22	GitHub	Project
GPT4Point	GPT4Point: A Unified Framework for Point-Language Understanding and Generation	ArXiv	2023-12-1	GitHub	Project
ShapeGPT	ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model	ArXiv	2023-12-1	GitHub	Project
MeshGPT	MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers	ArXiv	2023-11-27	GitHub	Project
LI3D	Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback	NeurIPS	2023-5-26	GitHub	Project

🎯Back to Top

Text-to-video

Name	Title	Venue	Date	Code	Project
Mora	Mora: Enabling Generalist Video Generation via A Multi-Agent Framework	ArXiv	2024-10-3	GitHub	Project
VideoStudio	VideoStudio: Generating Consistent-Content and Multi-Scene Videos	ECCV	2024-9-16	GitHub	Project
VideoDirectorGPT	VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning	COLM	2024-7-12	GitHub	Project
VideoPoet	VideoPoet: A Large Language Model for Zero-Shot Video Generation	ICML	2024-6-4	GitHub	Project
MAGVIT-v2	Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation	ICLR	2024-3-29	GitHub	Project
LLM-groundedDiffusion	LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models	TMLR	2023-11-27	GitHub	Project
SVD	Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	TMLR	2023-11-27	GitHub	Project
Free-Bloom	Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator	NeurIPS	2023-9-25	GitHub	Project

🎯Back to Top

Other applications

Face

Name	Title	Venue	Date	Code	Project
Face-MLLM	Face-MLLM: A Large Face Perception Model	arXiv	2024-10-28	Github	Project
ExpLLM	ExpLLM: Towards Chain of Thought for Facial Expression Recognition	arXiv	2024-9-4	Github	Project
EMO-LLaMA	EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning	arXiv	2024-8-21	Github	Project
EmoLA	Facial Affective Behavior Analysis with Instruction Tuning	ECCV	2024-7-12	Github	Project
EmoLLM	EmoLLM: Multimodal Emotional Understanding Meets Large Language Models	ArXiv	2024-6-29	Github	Project

🎯Back to Top

Anomaly Detection

Name	Title	Venue	Date	Code	Project
HAWK	HAWK: Learning to Understand Open-World Video Anomalies	NeurIPS	2024-5-27	Github	Project
CUVA	Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly	CVPR	2024-5-6	Github	Project
LAVAD	Harnessing Large Language Models for Training-free Video Anomaly Detection	CVPR	2024-4-1	Github	Project

🎯Back to Top

Gaming

Name	Title	Venue	Date	Code	Project
ADAM	Adam: An Embodied Causal Agent in Open-World Environments	ArXiv	2024-10-29	Github	Project
VARP	Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case	ArXiv	2024-09-19	Github	Project
DLLM	World Models with Hints of Large Language Models for Goal Achieving	ArXiv	2024-06-11	Github	Project
MineDreamer	MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control	NeurIPS 2024 Workshop	2024-03-18	Github	Project
HAS	Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation	ICLR	2024-03-13	Github	Project
CRADLE	CRADLE: Empowering Foundation Agents Towards General Computer Control	ArXiv	2024-03-05	Github	Project
Atari-GPT	Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games	ArXiv	2024-03-05	Github	Project
MP5	MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception	CVPR	2023-12-12	Github	Project
STEVE	See and Think: Embodied Agent in Virtual Environment	ECCV	2023-11-26	Github	Project
STEVE-EYE	Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds	ICLR	2023-10-20	Github	Project
JARVIS-1	JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models	ArXiv	2023-10-11	Github	Project

🎯Back to Top

Efficiency

Name	Title	Venue	Date	Code	Project
LLaVA-Mini	LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token	ArXiv	2025-1-7	Github	Project
Dynamic-VLM	Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM	ArXiv	2024-12-12	Github	Project
PVC	PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models	ArXiv	2024-12-12	Github	Project
iLLaVA	iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models	ArXiv	2024-12-8	Github	Project
VTC-CLS	[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs	ArXiv	2024-12-8	Github	Project
NegToMe	Negative Token Merging: Image-based Adversarial Feature Guidance	ArXiv	2024-12-5	Github	Project
VisionZip	VisionZip: Longer is Better but Not Necessary in Vision Language Models	ArXiv	2024-12-5	Github	Project
AIM	AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning	ArXiv	2024-12-4	Github	Project
Dynamic-LLaVA	Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification	ArXiv	2024-12-3	Github	Project
ATP-LLaVA	ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models	ArXiv	2024-11-30	Github	Project
YOPO	Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See	ArXiv	2024-11-30	Github	Project
DyCoke	DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models	ArXiv	2024-11-22	Github	Project
LLaVA-MR	LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval	ArXiv	2024-11-21	Github	Project
FoPru	FoPru: Focal Pruning for Efficient Large Vision-Language Models	ArXiv	2024-11-21	Github	Project
FocusLLaVA	FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression	ArXiv	2024-11-21	Github	Project
RLT	Don't Look Twice: Faster Video Transformers with Run-Length Tokenization	NeurIPS	2024-11-7	Github	Project
LLaVolta	Efficient Large Multi-modal Models via Visual Context Compression	NeurIPS	2024-11-6	Github	Project
QueCC	Inference Optimal VLMs Need Only One Visual Token but Larger Models	ArXiv	2024-11-5	Github	Project
PyramidDrop	PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction	ArXiv	2024-10-22	Github	Project
Victor	Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers	ArXiv	2024-10-17	Github	Project
AVG-LLaVA	AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity	ArXiv	2024-10-4	Github	Project
TRIM	Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs	COLING	2024-9-28	Github	Project
TokenPacker	TokenPacker: Efficient Visual Projector for Multimodal LLM	ArXiv	2024-8-28	Github	Project
MaVEn	MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model	NeurIPS	2024-8-26	Github	Project
HiRED	HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments	AAAI	2024-8-20	Github	Project
VoCo-LLaMA	VoCo-LLaMA: Towards Vision Compression with Large Language Models	ArXiv	2024-6-18	Github	Project
DeCo	DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models	ArXiv	2024-5-31	Github	Project
LLaVA-PruMerge	LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models	ArXiv	2024-5-22	Github	Project
FastV	An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models	ECCV	2024-5-5	Github	Project
LLaVA-HR	Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models	ArXiv	2024-3-5	Github	Project

🎯Back to Top

Security

Name	Title	Venue	Date	Code	Project
SynthVLM	Synthvlm: High-efficiency and high-quality synthetic data for vision language models	ArXiv	2024-8-10	Github	Project
WolfMLLM	The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative	ArXiv	2024-6-3	Github	Project
AttackMLLM	Synthvlm: High-efficiency and high-quality synthetic data for vision language models	ICLRW	2024-5-16	Github	Project
OODCV	How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs	ECCV	2023-11-27	Github	Project
InjectMLLM	(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs	ArXiv	2023-10-3	Github	Project
AdvMLLM	On the Adversarial Robustness of Multi-Modal Foundation Models	ICCVW	2023-8-21	Github	Project

🎯Back to Top

Interpretability and explainability

Name	Title	Venue	Date	Code	Project
MultiTrust	MULTITRUST: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models	ArXiv	2024-12-6	Github	Project
XL-VLMs	A Concept-Based Explainability Framework for Large Multimodal Models	NeurIPS	2024-11-30	Github	Project
VPS	Interpreting Object-level Foundation Models via Visual Precision Search	ArXiv	2024-11-25	Github	Project
SAE	Large Multi-modal Models Can Interpret Features in Large Multi-modal Models	ArXiv	2024-11-22	Github	Project
MLLM-Probe	Probing Multimodal Large Language Models for Global and Local Semantic Representations	ArXiv	2024-11-21	Github	Project
LexVLA	Unified Lexical Representation for Interpretable Visual-Language Alignment	NeurIPS	2024-11-11	Github	Project
MUB	Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios	ArXiv	2024-11-5	Github	Project
LLaVA-CAM	From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks	ArXiv	2024-10-17	Github	Project
LLaVA-Interp	Towards Interpreting Visual Information Processing in Vision-Language Models	ArXiv	2024-10-9	Github	Project
MINER	MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models	ArXiv	2024-10-7	Github	Project
VL-Interpret	Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations	ArXiv	2024-10-3	Github	Project
MMNeuron	MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model	ArXiv	2024-10-1	Github	Project
MLLM-ONTO	Enhancing Explainability in Multimodal Large Language Models Using Ontological Context	ArXiv	2024-9-27	Github	Project
EAGLE	Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders	ArXiv	2024-8-28	Github	Project
MLLM-Law	Law of Vision Representation in MLLMs	ArXiv	2024-8-24	Code	Project
VALE	VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models	ArXiv	2024-8-23	Code	Project
DistTrain	DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models	ArXiv	2024-8-15	Github	Project
MLLM-Projection	Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space	ArXiv	2024-8-9	Github	Project
Reason2Drive	Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving	ECCV	2024-7-20	Github	Project
LVLM-LP	The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?	ECCV	2024-7-17	Github	Project
CLIP-Neurons	Interpreting the Second-Order Effects of Neurons in CLIP	ArXiv	2024-6-24	Github	Project
LVLM-Interpret	LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models	ArXiv	2024-6-24	Github	Project
Holmes-VAD	Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM	ArXiv	2024-6-18	Github	Project
MMNeurons	Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers	ACL	2024-6-11	Github	Project
DeCo	DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models	ICML	2024-5-31	Github	Project
MAIA	A Multimodal Automated Interpretability Agent	ICML	2024-4-22	Github	Project
CDL	Pre-trained Vision-Language Models Learn Discoverable Visual Concepts	ArXiv	2024-4-19	Github	Project
OLIVE	What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases	NAACL	2024-4-3	Github	Project
OPERA	OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation	CVPR	2024-3-12	Github	Project
RLHF-V	RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback	CVPR	2024-3-8	Github	Project
POVID	Aligning Modalities in Vision Large Language Models via Preference Fine-tuning	ArXiv	2024-2-18	Github	Project
HA-DPO	Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization	ArXiv	2024-2-6	Github	Project
BenchLMM	BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models	ECCV	2023-12-6	Github	Project
VCD	VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding	CVPR	2023-11-28	Github	Project
LLaVA-RLHF	LLaVA-RLHF: Aligning Large Multimodal Models with Factually Augmented RLHF	ArXiv	2023-9-25	Github	Project

🎯Back to Top

Complex reasoning

Name	Title	Venue	Date	Code	Project
VLM-R1	VLM-R1: A stable and generalizable R1-style Large Vision-Language Model	None	2025-2-15	Github	Project
R1-V	R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3	Blog	2025-2-3	Github	Project
LlamaV-o1	LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs	ArXiv	2025-1-10	Github	Project
Virgo	Virgo: A Preliminary Exploration on Reproducing o1-like MLLM	ArXiv	2025-1-3	Github	Project
Mulberry	Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search	ArXiv	2024-12-31	Github	Project
LLaVA-CoT	LLaVA-CoT: Let Vision Language Models Reason Step-by-Step	ArXiv	2024-11-25	Github	Project

🎯Back to Top

Contributors

Thanks to all the contributors!
