
awesome-VLLMs
This repository collects papers on VLLM applications and is updated with new papers irregularly.
Stars: 52

README:
Vision language models (VLMs) have emerged as powerful tools for learning unified embedding spaces that integrate vision and language. Inspired by the success of large language models (LLMs), which have demonstrated remarkable reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining significant attention for both general-purpose and specialized applications.
In this repository, we provide a comprehensive summary of the current literature from an application-oriented perspective. We hope this resource serves as a valuable reference for the VLLM research community.
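
For readers new to the area, here is a minimal sketch of what a unified vision-language embedding space looks like in practice. It is not drawn from any paper in this list; it assumes the Hugging Face `transformers` CLIP API and network access to fetch the `openai/clip-vit-base-patch32` checkpoint and a sample image. An image and several candidate captions are embedded into the same space and scored by similarity.

```python
# Illustrative sketch only (not tied to any listed paper): score an image
# against a few captions in CLIP's shared vision-language embedding space.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example COCO image commonly used in the transformers documentation.
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
captions = ["a photo of two cats", "a photo of a dog", "a diagram of a robot"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores in the unified space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```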
If you are interested in this project, you can contribute to this repo via pull requests!
What's New in This Update:
- [2025.3.2]: 🔥 Added two projects on complex reasoning: R1-V and VLM-R1!
- [2025.2.23]: 🔥 Added one video-to-action paper and one vision-to-text paper!
- [2025.2.1]: 🔥 Added four video-to-text papers!
- [2025.1.22]: 🔥 Added one video-to-text paper!
- [2025.1.17]: 🔥 Added three video-to-text papers, thanks to Enxin for the contributions!
- [2025.1.14]: 🔥 Added two complex reasoning papers and one video-to-text paper!
- [2025.1.13]: 🔥 Added one VFM survey paper!
- [2025.1.12]: 🔥 Added one efficient MLLM paper!
- [2025.1.9]: 🔥🔥🔥 Added one efficient MLLM survey!
- [2025.1.7]: 🔥🔥🔥 Our survey paper is released! Please check this link for more information. We added more tool management papers to our paper list.
- [2025.1.6]: 🔥 Added one OS Agent survey paper to our paper list, and a new category: complex reasoning!
- [2025.1.4]: 🔥 Updated the general domain and egocentric video papers in our paper list, thanks to Wentao for the contributions!
- [2025.1.2]: 🔥 Added more interpretation papers to our paper list, thanks to Ruoyu for the contributions!
- [2024.12.15]: 🔥 We released our VLLM application paper list repo!
- Visual Large Language Models for Generalized and Specialized Applications

Title | Venue | Date | Code | Project |
---|---|---|---|---|
Foundation Models Defining a New Era in Vision: A Survey and Outlook | T-PAMI | 2025-1-9 | Github | Project |
Vision-Language Models for Vision Tasks: A Survey | T-PAMI | 2024-8-8 | Github | Project |
Vision + Language Applications: A Survey | CVPRW | 2023-5-24 | Github | Project |
Vision-and-Language Pretrained Models: A Survey | IJCAI (survey track) | 2022-5-3 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
EchoSight | EchoSight: Advancing Visual-Language Models with Wiki Knowledge | EMNLP | 2024-07-17 | Github | Project |
FROMAGe | Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2024-01-31 | Github | Project |
Wiki-LLaVA | Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR | 2023-04-23 | Github | Project |
UniMuR | Unified Embeddings for Multimodal Retrieval via Frozen LLMs | ICML | 2019-05-08 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Graphist | Graphic Design with Large Multimodal Model | ArXiv | 2024-04-22 | Github | Project |
Ferret-UI | Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | ECCV | 2024-04-08 | Github | Project |
CogAgent | CogAgent: A Visual Language Model for GUI Agents | CVPR | 2023-12-21 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
FinTral | FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models | ACL | 2024-06-14 | Github | Project |
FinVis-GPT | FinVis-GPT: A Multimodal Large Language Model for Financial Chart Analysis | ArXiv | 2023-07-31 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | EMNLP | 2024-10-01 | Github | Project |
BT-Adapter | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | CVPR | 2024-06-27 | Github | Project |
VideoGPT+ | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | ArXiv | 2024-06-13 | Github | Project |
Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ACL | 2024-06-10 | Github | Project |
MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | CVPR | 2024-05-23 | Github | Project |
LVChat | LVCHAT: Facilitating Long Video Comprehension | ArXiv | 2024-02-19 | Github | Project |
VideoChat | VideoChat: Chat-Centric Video Understanding | ArXiv | 2024-01-04 | Github | Project |
Valley | Valley: Video Assistant with Large Language model Enhanced abilitY | ArXiv | 2023-10-08 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
PALM | PALM: Predicting Actions through Language Models | CVPR Workshop | 2024-07-18 | Github | Project |
GPT4Ego | GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition | ArXiv | 2024-05-11 | Github | Project |
AntGPT | AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? | ICLR | 2024-04-01 | Github | Project |
LEAP | LEAP: LLM-Generation of Egocentric Action Programs | ArXiv | 2023-11-29 | Github | Project |
LLM-Inner-Speech | Egocentric Video Comprehension via Large Language Model Inner Speech | CVPR Workshop | 2023-06-18 | Github | Project |
LLM-Brain | LLM as A Robotic Brain: Unifying Egocentric Memory and Control | ArXiv | 2023-04-25 | Github | Project |
LaViLa | Learning Video Representations from Large Language Models | CVPR | 2022-12-08 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
DriveLM | DriveLM: Driving with Graph Visual Question Answering | ECCV | 2024-7-17 | Github | Project |
Talk2BEV | Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving | ICRA | 2024-5-13 | Github | Project |
NuScenes-QA | NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario | AAAI | 2024-3-24 | Github | Project |
DriveMLM | DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving | ArXiv | 2023-12-25 | Github | Project |
LiDAR-LLM | LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding | CoRR | 2023-12-21 | Github | Project |
Dolphins | Dolphins: Multimodal Language Model for Driving | ArXiv | 2023-12-1 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
DriveGPT4 | DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model | RAL | 2024-8-7 | Github | Project |
SurrealDriver | SurrealDriver: Designing LLM-powered Generative Driver Agent Framework based on Human Drivers' Driving-thinking Data | ArXiv | 2024-7-22 | Github | Project |
DriveVLM | DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | CoRL | 2024-6-25 | Github | Project |
DiLu | DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | ICLR | 2024-2-22 | Github | Project |
LMDrive | LMDrive: Closed-Loop End-to-End Driving with Large Language Models | CVPR | 2023-12-21 | Github | Project |
GPT-Driver | GPT-Driver: Learning to Drive with GPT | NeurIPS Workshop | 2023-12-5 | Github | Project |
ADriver-I | ADriver-I: A General World Model for Autonomous Driving | ArXiv | 2023-11-22 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Senna | Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | ArXiv | 2024-10-29 | Github | Project |
BEV-InMLLM | Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models | CVPR | 2024-1-2 | Github | Project |
Prompt4Driving | Language Prompt for Autonomous Driving | ArXiv | 2023-9-8 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Wonderful-Team | Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs | ArXiv | 2024-12-4 | Github | Project |
AffordanceLLM | AffordanceLLM: Grounding Affordance from Vision Language Models | CVPR | 2024-4-17 | Github | Project |
3DVisProg | Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | CVPR | 2024-3-23 | Github | Project |
REPLAN | REPLAN: Robotic Replanning with Perception and Language Models | ArXiv | 2024-2-20 | Github | Project |
PaLM-E | PaLM-E: An Embodied Multimodal Language Model | ICML | 2023-3-6 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
OpenVLA | OpenVLA: An Open-Source Vision-Language-Action Model | ArXiv | 2024-9-5 | Github | Project |
LLARVA | LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning | CoRL | 2024-6-17 | Github | Project |
RT-X | Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ArXiv | 2024-6-1 | Github | Project |
RoboFlamingo | Vision-Language Foundation Models as Effective Robot Imitators | ICLR | 2024-2-5 | Github | Project |
VoxPoser | VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models | CoRL | 2023-11-2 | Github | Project |
ManipLLM | ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation | CVPR | 2023-12-24 | Github | Project |
RT-2 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | ArXiv | 2023-7-28 | Github | Project |
Instruct2Act | Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model | ArXiv | 2023-5-24 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
LLaRP | Large Language Models as Generalizable Policies for Embodied Tasks | ICLR | 2024-4-16 | Github | Project |
MP5 | MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception | CVPR | 2024-3-24 | Github | Project |
LL3DA | LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | CVPR | 2023-11-30 | Github | Project |
EmbodiedGPT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS | 2023-11-2 | Github | Project |
ELLM | Guiding Pretraining in Reinforcement Learning with Large Language Models | ICML | 2023-9-15 | Github | Project |
3D-LLM | 3D-LLM: Injecting the 3D World into Large Language Models | NeurIPS | 2023-7-24 | Github | Project |
NLMap | Open-vocabulary Queryable Scene Representations for Real World Planning | ICRA | 2023-7-4 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
ConceptGraphs | ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning | ICRA | 2024-5-13 | Github | Project |
RILA | RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation | CVPR | 2024-4-27 | Github | Project |
EMMA | Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld | CVPR | 2024-3-29 | Github | Project |
VLN-VER | Volumetric Environment Representation for Vision-Language Navigation | CVPR | 2024-3-24 | Github | Project |
MultiPLY | MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World | CVPR | 2024-1-16 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
3DGPT | 3D-GPT: Procedural 3D Modeling with Large Language Models | ArXiv | 2024-5-29 | GitHub | Project |
Holodeck | Holodeck: Language Guided Generation of 3D Embodied AI Environments | CVPR | 2024-4-22 | GitHub | Project |
LLMR | LLMR: Real-time Prompting of Interactive Worlds using Large Language Models | ACM CHI | 2024-3-22 | GitHub | Project |
GPT4Point | GPT4Point: A Unified Framework for Point-Language Understanding and Generation | ArXiv | 2023-12-1 | GitHub | Project |
ShapeGPT | ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | ArXiv | 2023-12-1 | GitHub | Project |
MeshGPT | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers | ArXiv | 2023-11-27 | GitHub | Project |
LI3D | Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback | NeurIPS | 2023-5-26 | GitHub | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
Face-MLLM | Face-MLLM: A Large Face Perception Model | ArXiv | 2024-10-28 | Github | Project |
ExpLLM | ExpLLM: Towards Chain of Thought for Facial Expression Recognition | ArXiv | 2024-9-4 | Github | Project |
EMO-LLaMA | EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning | ArXiv | 2024-8-21 | Github | Project |
EmoLA | Facial Affective Behavior Analysis with Instruction Tuning | ECCV | 2024-7-12 | Github | Project |
EmoLLM | EmoLLM: Multimodal Emotional Understanding Meets Large Language Models | ArXiv | 2024-6-29 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
HAWK | HAWK: Learning to Understand Open-World Video Anomalies | NeurIPS | 2024-5-27 | Github | Project |
CUVA | Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly | CVPR | 2024-5-6 | Github | Project |
LAVAD | Harnessing Large Language Models for Training-free Video Anomaly Detection | CVPR | 2024-4-1 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
SynthVLM | SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models | ArXiv | 2024-8-10 | Github | Project |
WolfMLLM | The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | ArXiv | 2024-6-3 | Github | Project |
AttackMLLM | Synthvlm: High-efficiency and high-quality synthetic data for vision language models | ICLRW | 2024-5-16 | Github | Project |
OODCV | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | ECCV | 2023-11-27 | Github | Project |
InjectMLLM | (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs | ArXiv | 2023-10-3 | Github | Project |
AdvMLLM | On the Adversarial Robustness of Multi-Modal Foundation Models | ICCVW | 2023-8-21 | Github | Project |

Name | Title | Venue | Date | Code | Project |
---|---|---|---|---|---|
VLM-R1 | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | None | 2025-2-15 | Github | Project |
R1-V | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 | Blog | 2025-2-3 | Github | Project |
LlamaV-o1 | LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | ArXiv | 2025-1-10 | Github | Project |
Virgo | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM | ArXiv | 2025-1-3 | Github | Project |
Mulberry | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | ArXiv | 2024-12-31 | Github | Project |
LLaVA-CoT | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | ArXiv | 2024-11-25 | Github | Project |
Thanks to all the contributors!