awesome-vla-for-ad

🌐 Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Stars: 287

Visit

README:

😎 Awesome VLA for Autonomous Driving

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, whose hand-crafted interfaces and rule-based components often struggle in complex, dynamic, or long-tailed scenarios. Their cascaded structure also amplifies upstream perception errors, undermining downstream planning and control.

This survey reviews vision-action (VA) models and vision-language-action (VLA) models for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks, and organize existing methods into two principal paradigms:

End-to-End VLA, which integrates perception, reasoning, and planning within a single model.
Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners).

For more details, kindly refer to our 📚 Paper, 🌐 Project Page, and 🤗 HuggingFace Leaderboard.

📚 Citation

If you find this work helpful for your research, please kindly consider citing our paper:

@article{survey_vla4ad,
    title   = {Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future},
    author  = {Tianshuai Hu and Xiaolu Liu and Song Wang and Yiyao Zhu and Ao Liang and Lingdong Kong and Guoyang Zhao and Zeying Gong and Jun Cen and Zhiyu Huang and Xiaoshuai Hao and Linfeng Li and Hang Song and Xiangtai Li and Jun Ma and Shaojie Shen and Jianke Zhu and Dacheng Tao and Ziwei Liu and Junwei Liang},
    journal = {arXiv preprint arXiv:2512.16760},
    year    = {2025},
}

1. Vision-Action Models
2. Vision-Language-Action Models
3. Datasets & Benchmarks
- Vision-Action Datasets
- Vision-Language-Action Datasets
4. Applications
5. Other Resources

1. Vision-Action Models

1️⃣ Action-Only Models

Model	Paper	Venue	Website	GitHub

`LBC`	Learning by Cheating	CoRL 2020	-
`Latent-DRL`	End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances	CVPR 2020	-	-
`NEAT`	NEAT: Neural Attention Fields for End-to-End Autonomous Driving	ICCV 2021	-
`Roach`	End-to-End Urban Driving by Imitating a Reinforcement Learning Coach	ICCV 2021
`WoR`	Learning to Drive from A World on Rails	ICCV 2021
`TCP`	Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline	NeurIPS 2022	-
`Urban-Driver`	Urban Driver: Learning to Drive from Real-world Demonstrations Using Policy Gradients	CoRL 2022
`LAV`	Learning from All Vehicles	CVPR 2022
`TransFuser`	TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving	TPAMI 2023	-
`GRI`	GRI: General Reinforced Imitation and its Application to Vision-Based Autonomous Driving	Robotics 2023	-	-
`BEVPlanner`	Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?	CVPR 2024	-
`Raw2Drive`	Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)	NeurIPS 2025	-	-
`RAD`	RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning	NeurIPS 2025		-
`TrajDiff`	TrajDiff: End-to-End Autonomous Driving without Perception Annotation	arXiv 2025	-
`SimScale`	SimScale: Learning to Drive via Real-World Simulation at Scale	arXiv 2025

2️⃣ Perception-Action Models

Model	Paper	Venue	Website	GitHub

`ST-P3`	ST-P3: End-to-End Vision-Based Autonomous Driving via Spatial-Temporal Feature Learning	ECCV 2022	-
`UniAD`	Planning-Oriented Autonomous Driving	CVPR 2023	-
`VAD`	VAD: Vectorized Scene Representation for Efficient Autonomous Driving	ICCV 2023	-
`OccNet`	Scene as Occupancy	ICCV 2023	-
`GenAD`	GenAD: Generative End-to-End Autonomous Driving	ECCV 2024	-
`PARA-Drive`	PARA-Drive: Parallelized Architecture for Real-Time Autonomous Driving	CVPR 2024		-
`Hydra-MDP`	Hydra-MDP: End-to-End Multimodal Planning with Multi-Target Hydra-Distillation	CVPRW 2024
`SparseAD`	SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving	arXiv 2024	-	-
`GaussianAD`	GaussianAD: Gaussian-Centric End-to-End Autonomous Driving	arXiv 2024	-	-
`DiFSD`	DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving	arXiv 2024	-
`DriveTransformer`	DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving	ICLR 2025	-
`SparseDrive`	SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation	ICRA 2025	-
`DiffusionDrive`	DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving	CVPR 2025	-
`GoalFlow`	GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving	CVPR 2025
`GuideFlow`	GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving	arXiv 2025	-
`ETA`	ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models	arXiv 2025	-
`Geo`	Spatial Retrieval Augmented Autonomous Driving	arXiv 2025	-	-
`DiffusionDriveV2`	DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving	arXiv 2025	-
`NaviHydra`	NaviHydra: Controllable Navigation-Guided End-to-End Autonomous Driving with Hydra Distillation	arXiv 2025	-	-
`Mimir`	Mimir: Hierarchical Goal-Driven Diffusion with Uncertainty Propagation for End-to-End Autonomous Driving	arXiv 2025	-
`FROST-Drive`	FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder	arXiv 2026	-	-
`DrivoR`	Driving on Registers	arXiv 2026
`SPS`	See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection	arXiv 2026	-	-

3️⃣ Image-Based World Models

Model	Paper	Venue	Website	GitHub

`DriveDreamer`	DriveDreamer: Towards Real-World-Driven World Models for Autonomous Driving	ECCV 2024
`GenAD`	GenAD: Generalized Predictive Model for Autonomous Driving	CVPR 2024	-
`Drive-WM`	Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving	CVPR 2024
`DrivingWorld`	DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT	arXiv 2024
`Imagine-2-Drive`	Imagine-2-Drive: Leveraging High-Fidelity World Models via Multi-Modal Diffusion Policies	IROS 2025		-
`DrivingGPT`	DrivingGPT: Unifying Driving World Modeling and Planning with Multi-Modal Autoregressive Transformers	ICCV 2025		-
`Epona`	Epona: Autoregressive Diffusion World Model for Autonomous Driving	ICCV 2025
`VaViM`	VaViM and VaVAM: Autonomous Driving through Video Generative Modeling	arXiv 2025
`UniDrive-WM`	UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving	arXiv 2026		-

4️⃣ Occupancy-Based World Models

Model	Paper	Venue	Website	GitHub

`OccWorld`	OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving	ECCV 2024
`NeMo`	Neural Volumetric World Models for Autonomous Driving	ECCV 2024	-	-
`OccVAR`	OCCVAR: Scalable 4D Occupancy Prediction via Next-Scale Prediction	OpenReview 2024	-	-
`RenderWorld`	RenderWorld: World Model with Self-Supervised 3D Label	arXiv 2024	-	-
`DFIT-OccWorld`	An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training	arXiv 2024	-	-
`Drive-OccWorld`	Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving	AAAI 2025
`T³Former`	Temporal Triplane Transformers as Occupancy World Models	arXiv 2025	-	-
`AD-R1`	AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models	arXiv 2025	-	-
`SparseOccVLA`	SparseOccVLA: Bridging Occupancy and Vision-Language Models via Sparse Queries for Unified 4D Scene Understanding and Planning	arXiv 2026	-

5️⃣ Latent-Based World Models

Model	Paper	Venue	Website	GitHub

`Covariate-Shift`	Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models	arXiv 2024	-	-
`World4Drive`	World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model	ICCV 2025	-	-
`WoTE`	End-to-End Driving with Online Trajectory Evaluation via BEV World Model	ICCV 2025	-
`LAW`	Enhancing End-to-End Autonomous Driving with Latent World Model	ICLR 2025	-
`SSR`	Navigation-Guided Sparse Scene Representation for End-to-End Autonomous Driving	ICLR 2025	-
`Echo-Planning`	Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back	arXiv 2025	-	-
`SeerDrive`	Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution	NeurIPS 2025	-

2. Vision-Language-Action Models

1️⃣ Textual Action Generator

Model	Paper	Venue	Website	GitHub

`DriveMLM`	DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving	arXiv 2023	-
`RAG-Driver`	RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model	RSS 2024
`RDA-Driver`	Making Large Language Models Better Planners with Reasoning-Decision Alignment	ECCV 2024	-	-
`DriveLM`	DriveLM: Driving with Graph Visual Question Answering	ECCV 2024
`DriveGPT4`	DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model	RA-L 2024		-
`DriVLMe`	DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experience	IROS 2024
`LLaDA`	Driving Everywhere with Large Language Model Policy Adaptation	CVPR 2024
`VLAAD`	VLAAD: Vision and Language Assistant for Autonomous Driving	WACVW 2024	-
`OccLLaMA`	OccLLaMA: A Unified Occupancy-Language-Action World Model for Understanding and Generation Tasks in Autonomous Driving	arXiv 2024		-
`Doe-1`	Doe-1: Closed-Loop Autonomous Driving with Large World Model	arXiv 2024
`LINGO-2`	LINGO-2: Driving with Natural Language	-		-
`SafeAuto`	SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models	ICML 2025	-
`OpenEMMA`	OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving	WACV 2025	-
`ReasonPlan`	ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving	CoRL 2025	-
`WKER`	World Knowledge-Enhanced Reasoning Using Instruction-Guided Interactor in Autonomous Driving	AAAI 2025	-	-
`OmniDrive`	OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning	CVPR 2025	-
`S4-Driver`	S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation	CVPR 2025		-
`Occ-LLM`	Occ-LLM: Enhancing Autonomous Driving with Occupancy-BasedLarge Language Models	ICRA 2025	-	-
`DriveBench`	Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives	ICCV 2025
`FutureSightDrive`	FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving	NeurIPS 2025
`ImpromptuVLA`	Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models	NeurIPS 2025
`Sce2DriveX`	Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning	RA-L 2025	-	-
`EMMA`	EMMA: End-to-End Multimodal Model for Autonomous Driving	TMLR 2025		-
`DriveAgent-R1`	DriveAgent-R1: Advancing VLM-Based Autonomous Driving with Hybrid Thinking and Active Perception	arXiv 2025	-	-
`Drive-R1`	Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning	arXiv 2025	-	-
`FastDriveVLA`	FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-Based Token Pruning	arXiv 2025	-	-
`WiseAD`	WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model	arXiv 2025
`AutoDrive-R²`	AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving	arXiv 2025	-	-
`OmniReason`	OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving	arXiv 2025	-	-
`OpenREAD`	OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic	arXiv 2025	-
`dVLM-AD`	dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning	arXiv 2025	-	-
`PLA`	A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving	arXiv 2025	-	-
`AlphaDrive`	AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning	arXiv 2025	-
`CoReVLA`	CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine	arXiv 2025
`WAM-Diff`	WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving	arXiv 2025	-

2️⃣ Numerical Action Generator

Model	Paper	Venue	Website	GitHub

`LMDrive`	LMDrive: Closed-Loop End-to-End Driving with Large Language Models	CVPR 2024
`BEVDriver`	BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving	IROS 2025	-	-
`CoVLA-Agent`	CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving	WACV 2025		-
`ORION`	ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation	ICCV 2025
`SimLingo`	SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment	CVPR 2025
`DriveGPT4-V2`	DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving	CVPR 2025	-	-
`AutoVLA`	AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning	NeurIPS 2025
`DriveMoE`	DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving	arXiv 2025
`DSDrive`	DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning	arXiv 2025	-	-
`OccVLA`	OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision.	arXiv 2025	-	-
`VDRive`	VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-End Autonomous Driving	arXiv 2025	-	-
`ReflectDrive`	Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving	arXiv 2025	-
`E3AD`	E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving	arXiv 2025	-	-
`LCDrive`	Latent Chain-of-Thought World Modeling for End-to-End Driving	arXiv 2025	-	-
`Alpamayo-R1`	Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail	arXiv 2025	-	-
`UniUGP`	UniUGP: Unifying understanding, generation, and planing for end-to-end autonomous driving.	arXiv 2025	-	-
`MindDrive`	MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving	arXiv 2025	-	-
`AdaThinkDrive`	AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving	arXiv 2025	-	-
`Percept-WAM`	Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving	arXiv 2025	-	-
`Reasoning-VLA`	Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving	arXiv 2025	-	-
`SpaceDrive`	SpaceDrive: Infusing Spatial Awareness into VLM-Based Autonomous Driving	arXiv 2025	-	-
`OpenDriveVLA`	OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model	AAAI 2026
`WAM-Flow`	WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving	arXiv 2025

3️⃣ Explicit Action Guidance

Model	Paper	Venue	Website	GitHub

`DriveVLM`	DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models	CoRL 2024		-
`LeapAD`	Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving	NeurIPS 2024
`FasionAD`	FASIONAD: Fast and Slow Fusion Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback	arXiv 2024	-	-
`Senna`	Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving	arXiv 2024	-
`DualAD`	DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving	IROS 2025
`DME-Driver`	DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving	AAAI 2025	-	-
`SOLVE`	SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving	CVPR 2025	-	-
`ReAL-AD`	ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving	ICCV 2025		-
`LeapVAD`	LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking	TNNLS 2025	-	-
`DiffVLA`	DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving	arXiv 2025	-	-
`FasionAD++`	FASIONAD++: Integrating High-Level Instruction and Information Bottleneck in Fast-Slow fusion Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback	arXiv 2025	-	-

4️⃣ Implicit Representations Transfer

Model	Paper	Venue	Website	GitHub

`VLP`	VLP: Vision Language Planning for Autonomous Driving	CVPR 2024	-	-
`VLM-AD`	VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision	CoRL 2025	-	-
`DiMA`	Distilling Multi-modal Large Language Models for Autonomous Driving	CVPR 2025	-	-
`DINO-Foresight`	DINO-Foresight: Looking into the Future with DINO	NeurIPS 2025
`ALN-P3`	ALN-P3: Unified Language Alignment for Perception, Prediction, and Planning in Autonomous Driving	arXiv 2025	-	-
`VERDI`	VERDI: VLM-Embedded Reasoning for Autonomous Driving	arXiv 2025	-	-
`VLM-E2E`	VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion	arXiv 2025	-	-
`ReCogDrive`	ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving	arXiv 2025
`InsightDrive`	InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving	arXiv 2025	-
`NetRoller`	NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving	arXiv 2025	-
`ViLaD`	ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving	arXiv 2025	-	-
`OmniScene`	OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving	arXiv 2025	-	-
`LMAD`	LMAD: Integrated End-to-End VisionLanguage Model for Explainable Autonomous Driving	arXiv 2025	-	-

3. Datasets & Benchmarks

1️⃣ Vision-Action Datasets

Dataset	Paper	Venue	Website	GitHub

`BDD100K`	BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning	CVPR 2020
`nuScenes`	nuScenes: A Multimodal Dataset for Autonomous Driving	CVPR 2020		-
`Waymo`	Scalability in Perception for Autonomous Driving: Waymo Open Dataset	CVPR 2020
`nuPlan`	nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles	arXiv 2021
`Argoverse 2`	Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting	NeurIPS 2021
`Bench2Drive`	Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-to-End Autonomous Driving	NeurIPS 2024	-
`RoboBEV`	Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving	TPAMI 2025	-
`WOD-E2E`	WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-Tail Scenarios	arXiv 2025

2️⃣ Vision-Language-Action Datasets

Dataset	Paper	Venue	Website	GitHub

`BDD-X`	Textual Explanations for Self-Driving Vehicles	ECCV 2018	-
`Talk2Car`	Talk2Car: Predicting Physical Trajectories for Natural Language Commands	IEEE Access 2022	-
`SDN`	DOROTHIE: Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents	EMNLP 2022	-
`DriveMLM`	DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving	arXiv 2023	-
`LMDrive`	LMDrive: Closed-Loop End-to-End Driving with Large Language Models	CVPR 2024
`DriveLM-nuScenes`	DriveLM: Driving with Graph Visual Question Answering	ECCV 2024
`HBD`	DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving	AAAI 2025	-	-
`VLAAD`	VLAAD: Vision and Language Assistant for Autonomous Driving	WACVW 2024	-
`SUP-AD`	DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models	CoRL 2024		-
`NuInstruct`	Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models	CVPR 2024	-
`WOMD-Reasoning`	WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving	ICML 2025
`DriveCoT`	DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving	arXiv 2024		-
`Reason2Drive`	Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving	ECCV 2024	-
`DriveBench`	Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives	ICCV 2025
`MetaAD`	AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning	arXiv 2025
`OmniDrive`	OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning	CVPR 2025	-
`NuInteract`	Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving	arXiv 2025	-	-
`DriveAction`	DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models	arXiv 2025	-	-
`ImpromptuVLA`	Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models	arXiv 2025
`CoVLA`	CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving	WACV 2025		-
`OmniReason-nuScenes`	OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving	arXiv 2025	-	-
`OmniReason-B2D`	OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving	arXiv 2025	-	-

4. Applications

5. Other Resources

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for awesome-vla-for-ad

Similar Open Source Tools

awesome-vla-for-ad

github

: 287

Awesome-RL-for-LRMs

This repository contains a collection of awesome resources for reinforcement learning in language models. It includes tutorials, code implementations, research papers, and tools to help researchers and practitioners explore and apply reinforcement learning techniques in natural language processing tasks. Whether you are a beginner or an expert in the field, this repository aims to provide valuable insights and guidance to enhance your understanding and implementation of reinforcement learning in language models.

github

: 1.4k

llm-paper-daily

github

: 793

Awesome-Interpretability-in-Large-Language-Models

This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.

github

: 230

Steel-LLM

Steel-LLM is a project to pre-train a large Chinese language model from scratch using over 1T of data to achieve a parameter size of around 1B, similar to TinyLlama. The project aims to share the entire process including data collection, data processing, pre-training framework selection, model design, and open-source all the code. The goal is to enable reproducibility of the work even with limited resources. The name 'Steel' is inspired by a band '万能青年旅店' and signifies the desire to create a strong model despite limited conditions. The project involves continuous data collection of various cultural elements, trivia, lyrics, niche literature, and personal secrets to train the LLM. The ultimate aim is to fill the model with diverse data and leave room for individual input, fostering collaboration among users.

github

: 560

oumi

Oumi is an open-source platform for building state-of-the-art foundation models, offering tools for data preparation, training, evaluation, and deployment. It supports training and fine-tuning models with various parameters, working with text and multimodal models, synthesizing and curating training data, deploying models efficiently, evaluating models comprehensively, and running on different platforms. Oumi provides a consistent API, reliability, and flexibility for research purposes.

github

: 8.9k

LLMs

LLMs is a Chinese large language model technology stack for practical use. It includes high-availability pre-training, SFT, and DPO preference alignment code framework. The repository covers pre-training data cleaning, high-concurrency framework, SFT dataset cleaning, data quality improvement, and security alignment work for Chinese large language models. It also provides open-source SFT dataset construction, pre-training from scratch, and various tools and frameworks for data cleaning, quality optimization, and task alignment.

github

: 97

cgft-llm

The cgft-llm repository is a collection of video tutorials and documentation for implementing large models. It provides guidance on topics such as fine-tuning llama3 with llama-factory, lightweight deployment and quantization using llama.cpp, speech generation with ChatTTS, introduction to Ollama for large model deployment, deployment tools for vllm and paged attention, and implementing RAG with llama-index. Users can find detailed code documentation and video tutorials for each project in the repository.

github

: 2.0k

Hands-On-Large-Language-Models-CN

Hands-On Large Language Models CN(ZH) is a Chinese version of the book 'Hands-On Large Language Models' by Jay Alammar and Maarten Grootendorst. It provides detailed code annotations and additional insights, offers Notebook versions suitable for Chinese network environments, utilizes openbayes for free GPU access, allows convenient environment setup with vscode, and includes accompanying Chinese language videos on platforms like Bilibili and YouTube. The book covers various chapters on topics like Tokens and Embeddings, Transformer LLMs, Text Classification, Text Clustering, Prompt Engineering, Text Generation, Semantic Search, Multimodal LLMs, Text Embedding Models, Fine-tuning Models, and more.

github

: 244

Awesome-LLM-Resources-List

Awesome LLM Resources is a curated collection of resources for Large Language Models (LLMs) covering various aspects such as serverless hosting, accessing off-the-shelf models via API, local inference, LLM serving frameworks, open-source LLM web chat UIs, renting GPUs for fine-tuning, fine-tuning with no-code UI, fine-tuning frameworks, OS agentic/AI workflow, AI agents, co-pilots, voice API, open-source TTS models, OS RAG frameworks, research papers on chain-of-thought prompting, CoT implementations, CoT fine-tuned models & datasets, and more.

github

: 126

phoenix

Phoenix is a tool that provides MLOps and LLMOps insights at lightning speed with zero-config observability. It offers a notebook-first experience for monitoring models and LLM Applications by providing LLM Traces, LLM Evals, Embedding Analysis, RAG Analysis, and Structured Data Analysis. Users can trace through the execution of LLM Applications, evaluate generative models, explore embedding point-clouds, visualize generative application's search and retrieval process, and statistically analyze structured data. Phoenix is designed to help users troubleshoot problems related to retrieval, tool execution, relevance, toxicity, drift, and performance degradation.

github

: 8.5k

InternVL

InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM. It is a vision-language foundation model that can perform various tasks, including: **Visual Perception** - Linear-Probe Image Classification - Semantic Segmentation - Zero-Shot Image Classification - Multilingual Zero-Shot Image Classification - Zero-Shot Video Classification **Cross-Modal Retrieval** - English Zero-Shot Image-Text Retrieval - Chinese Zero-Shot Image-Text Retrieval - Multilingual Zero-Shot Image-Text Retrieval on XTD **Multimodal Dialogue** - Zero-Shot Image Captioning - Multimodal Benchmarks with Frozen LLM - Multimodal Benchmarks with Trainable LLM - Tiny LVLM InternVL has been shown to achieve state-of-the-art results on a variety of benchmarks. For example, on the MMMU image classification benchmark, InternVL achieves a top-1 accuracy of 51.6%, which is higher than GPT-4V and Gemini Pro. On the DocVQA question answering benchmark, InternVL achieves a score of 82.2%, which is also higher than GPT-4V and Gemini Pro. InternVL is open-sourced and available on Hugging Face. It can be used for a variety of applications, including image classification, object detection, semantic segmentation, image captioning, and question answering.

github

: 6.5k

TigerBot

TigerBot is a cutting-edge foundation for your very own LLM, providing a world-class large model for innovative Chinese-style contributions. It offers various upgrades and features, such as search mode enhancements, support for large context lengths, and the ability to play text-based games. TigerBot is suitable for prompt-based game engine development, interactive game design, and real-time feedback for playable games.

github

: 2.2k

go-cyber

Cyber is a superintelligence protocol that aims to create a decentralized and censorship-resistant internet. It uses a novel consensus mechanism called CometBFT and a knowledge graph to store and process information. Cyber is designed to be scalable, secure, and efficient, and it has the potential to revolutionize the way we interact with the internet.

github

: 353

VoiceBench

VoiceBench is a repository containing code and data for benchmarking LLM-Based Voice Assistants. It includes a leaderboard with rankings of various voice assistant models based on different evaluation metrics. The repository provides setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. Users can submit new voice assistant results through the issue tracker for updates on the ranking list.

github

: 119

Chinese-Mixtral-8x7B

Chinese-Mixtral-8x7B is an open-source project based on Mistral's Mixtral-8x7B model for incremental pre-training of Chinese vocabulary, aiming to advance research on MoE models in the Chinese natural language processing community. The expanded vocabulary significantly improves the model's encoding and decoding efficiency for Chinese, and the model is pre-trained incrementally on a large-scale open-source corpus, enabling it with powerful Chinese generation and comprehension capabilities. The project includes a large model with expanded Chinese vocabulary and incremental pre-training code.

github

: 635

For similar tasks

No tools available

For similar jobs

No tools available

awesome-vla-for-ad

README:

😎 Awesome VLA for Autonomous Driving

📚 Citation

Table of Contents

1. Vision-Action Models

1️⃣ Action-Only Models

2️⃣ Perception-Action Models

3️⃣ Image-Based World Models

4️⃣ Occupancy-Based World Models

5️⃣ Latent-Based World Models

2. Vision-Language-Action Models

1️⃣ Textual Action Generator

2️⃣ Numerical Action Generator

3️⃣ Explicit Action Guidance

4️⃣ Implicit Representations Transfer

3. Datasets & Benchmarks

1️⃣ Vision-Action Datasets

2️⃣ Vision-Language-Action Datasets

4. Applications

5. Other Resources

For Tasks:

For Jobs:

Alternative AI tools for awesome-vla-for-ad

Similar Open Source Tools

awesome-vla-for-ad

Awesome-RL-for-LRMs

llm-paper-daily

Awesome-Interpretability-in-Large-Language-Models

Steel-LLM

oumi

LLMs

cgft-llm

Hands-On-Large-Language-Models-CN

Awesome-LLM-Resources-List

phoenix

InternVL

TigerBot

go-cyber

VoiceBench

Chinese-Mixtral-8x7B

For similar tasks

For similar jobs