Awesome-Embodied-Agent-with-LLMs
This is a curated list of "Embodied AI or agent with Large Language Models" research, maintained by haonan.
Watch this repository for the latest updates and feel free to raise pull requests if you find some interesting papers!
[2024/6/28] Created a new section on agent self-evolution research. 🤖
[2024/6/07] Add Mobile-Agent-v2, a mobile device operation assistant with effective navigation via multi-agent collaboration. 🚀
[2024/5/13] Add "Learning Interactive Real-World Simulators", winner of an Outstanding Paper Award at ICLR 2024 🥇.
[2024/4/24] Add "A Survey on Self-Evolution of Large Language Models", a systematic survey on self-evolution in LLMs! 💥
[2024/4/16] Add some CVPR 2024 papers.
[2024/4/15] Add MetaGPT, accepted for oral presentation (top 1.2%) at ICLR 2024, ranking #1 in the LLM-based Agent category. 🚀
[2024/3/13] Add CRADLE, an interesting paper exploring LLM-based agents in Red Dead Redemption II! 🎮
- Survey
- Self-Evolving Agents
- Advanced Agent Applications
- LLMs with RL or World Model
- Planning and Manipulation or Pretraining
- Multi-Agent Learning and Coordination
- Vision and Language Navigation
- Detection
- 3D Grounding
- Interactive Embodied Learning
- Rearrangement
- Benchmark
- Simulator
- Others
Figure 1. Trend of Embodied Agent with LLMs. [1]
Figure 2. An envisioned agent society. [2]
Survey
A Survey on Vision-Language-Action Models for Embodied AI [arXiv 2024.03]
The Chinese University of Hong Kong, Huawei Noah’s Ark Lab -
Large Multimodal Agents: A Survey [arXiv 2024.02] [Github]
Junlin Xie♣♡ Zhihong Chen♣♡ Ruifei Zhang♣♡ Xiang Wan♣ Guanbin Li♠
♡The Chinese University of Hong Kong, Shenzhen ♣Shenzhen Research Institute of Big Data, ♠Sun Yat-sen University -
A Survey on Self-Evolution of Large Language Models [arXiv 2024.04]
Key Lab of HCST (PKU), MOE; School of Computer Science, Peking University; Alibaba Group; Nanyang Technological University -
Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2024.01]
Stanford University; Microsoft Research, Redmond; University of California, Los Angeles; University of Washington; Microsoft Gaming -
Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents [arXiv 2023.11]
Shanghai Jiao Tong University, Amazon Web Services, Yale University -
The Rise and Potential of Large Language Model Based Agents: A Survey [arXiv 2023.09]
Fudan NLP Group, miHoYo Inc -
A Survey on LLM-based Autonomous Agents [arXiv 2023.08]
Gaoling School of Artificial Intelligence, Renmin University of China
Self-Evolving Agents
AGENTGYM: Evolving Large Language Model-based Agents across Diverse Environments [arXiv 2024.06] [Github] [Project page]
Fudan NLP Lab & Fudan Vision and Learning Lab -
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models [arXiv 2024.06] [Github]
Fangzhi Xu♢♡, Qiushi Sun2,♡, Kanzhi Cheng1, Jun Liu♢, Yu Qiao♡, Zhiyong Wu♡
♢Xi’an Jiaotong University, ♡Shanghai Artificial Intelligence Laboratory, 1The University of Hong Kong, 2Nanjing University -
Symbolic Learning Enables Self-Evolving Agents [arXiv 2024.06] [Github]
Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang
AIWaves Inc.
Advanced Agent Applications
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [arXiv 2024] [Github]
Junyang Wang1, Haiyang Xu2, Haitao Jia1, Xi Zhang2, Ming Yan2, Weizhou Shen2, Ji Zhang2, Fei Huang2, Jitao Sang1
1Beijing Jiaotong University 2Alibaba Group -
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family [ICLR 2024 Workshop LLM Agents] [Github]
Junyang Wang1, Haiyang Xu2, Jiabo Ye2, Ming Yan2, Weizhou Shen2, Ji Zhang2, Fei Huang2, Jitao Sang1
1Beijing Jiaotong University 2Alibaba Group -
[Machinascript-for-robots] [Github]
Build LLM-powered robots in your garage with MachinaScript For Robots! -
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model [CVPR 2024] [Github]
Lirui Zhao1,2, Yue Yang2,4, Kaipeng Zhang2, Wenqi Shao2, Yuxin Zhang1, Yu Qiao2, Ping Luo2,3, Rongrong Ji1
1Xiamen University, 2OpenGVLab, Shanghai AI Laboratory, 3The University of Hong Kong, 4Shanghai Jiao Tong University -
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework [ICLR 2024 (oral)]
DeepWisdom; AI Initiative, King Abdullah University of Science and Technology; Xiamen University; The Chinese University of Hong Kong, Shenzhen; Nanjing University; University of Pennsylvania; University of California, Berkeley; The Swiss AI Lab IDSIA/USI/SUPSI -
AppAgent: Multimodal Agents as Smartphone Users [Project page] [Github]
Chi Zhang∗ Zhao Yang∗ Jiaxuan Liu∗ Yucheng Han Xin Chen Zebiao Huang Bin Fu Gang Yu†
Tencent
LLMs with RL or World Model
Learning Interactive Real-World Simulators [ICLR 2024 (Outstanding Papers)] [Project Page]
Sherry Yang1,2, Yilun Du3, Kamyar Ghasemipour2, Jonathan Tompson2, Leslie Kaelbling3, Dale Schuurmans2, Pieter Abbeel1
1UC Berkeley, 2Google DeepMind, 3MIT -
Robust agents learn causal world models [ICLR 2024]
Jonathan Richens*, Tom Everitt
Google DeepMind -
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld [CVPR 2024] [Github]
Yijun Yang1,5,4, Tianyi Zhou2, Kanxue Li3, Dapeng Tao3, Lvsong Li4, Li Shen4, Xiaodong He4, Jing Jiang5, Yuhui Shi1
1Southern University of Science and Technology, 2University of Maryland, College Park, 3Yunnan University, 4JD Explore Academy, 5University of Technology Sydney -
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning [NeurIPS 2023] [Project Page][Github]
Lin Guan1, Karthik Valmeekam1, Sarath Sreedharan2, Subbarao Kambhampati1
1School of Computing & AI Arizona State University Tempe, AZ, 2Department of Computer Science Colorado State University Fort Collins, CO -
Eureka: Human-Level Reward Design via Coding Large Language Models [NeurIPS 2023 Workshop ALOE Spotlight] [Project page] [Github]
Jason Ma1,2, William Liang2, Guanzhi Wang1,3, De-An Huang1, Osbert Bastani2, Dinesh Jayaraman2, Yuke Zhu1,4, Linxi "Jim" Fan1, Anima Anandkumar1,3
1NVIDIA; 2UPenn; 3Caltech; 4UT Austin -
- RLAdapter: Bridging Large Language Models to Reinforcement Learning in Open Worlds [arXiv 2023]
- Can Language Agents Be Alternatives to PPO? A Preliminary Empirical Study on OpenAI Gym [arXiv 2023]
- RoboGPT: An intelligent agent of making embodied long-term decisions for daily instruction tasks [arXiv 2023]
- Aligning Agents like Large Language Models [arXiv 2023]
- AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents [ICLR 2024 spotlight]
- STARLING: Self-supervised Training of Text-based Reinforcement Learning Agent with Large Language Models [arXiv 2023]
- Text2Reward: Dense Reward Generation with Language Models for Reinforcement Learning [ICLR 2024 spotlight]
- Leveraging Large Language Models for Optimised Coordination in Textual Multi-Agent Reinforcement Learning [arXiv 2023]
- Online Continual Learning for Interactive Instruction Following Agents [ICLR 2024]
- ADAPTER-RL: Adaptation of Any Agent using Reinforcement Learning [arXiv 2023]
- Informing Reinforcement Learning Agents by Grounding Natural Language to Markov Decision Processes [arXiv 2023]
- Learning to Model the World with Language [arXiv 2023]
- MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning [ICLR 2024]
Language Reward Modulation for Pretraining Reinforcement Learning [arXiv 2023] [Github]
Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel
UC Berkeley -
Guiding Pretraining in Reinforcement Learning with Large Language Models [ICML 2023]
Yuqing Du1*, Olivia Watkins1*, Zihan Wang2, Cédric Colas3,4, Trevor Darrell1, Pieter Abbeel1, Abhishek Gupta2, Jacob Andreas3
1Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA 2University of Washington, Seattle 3Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory 4Inria, Flowers Laboratory.
Planning and Manipulation or Pretraining
Voyager: An Open-Ended Embodied Agent with Large Language Models [NeurIPS 2023 Workshop ALOE Spotlight] [Project page] [Github]
Guanzhi Wang1,2, Yuqi Xie3, Yunfan Jiang4, Ajay Mandlekar1, Chaowei Xiao1,5, Yuke Zhu1,3, Linxi Fan1, Anima Anandkumar1,2
1NVIDIA, 2Caltech, 3UT Austin, 4Stanford, 5UW Madison -
Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [ACL 2024][Github]
Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, Weiming Lu -
Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives [ACL 2024]
Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu -
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control [arXiv 2024] [Project Page]
Enshen Zhou1,2 Yiran Qin1,3 Zhenfei Yin1,4 Yuzhou Huang3 Ruimao Zhang3 Lu Sheng2 Yu Qiao1 Jing Shao1
1Shanghai Artificial Intelligence Laboratory, 2The Chinese University of Hong Kong, Shenzhen, 3Beihang University, 4The University of Sydney -
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception [CVPR 2024] [Project Page]
Yiran Qin1,2 Enshen Zhou1,3 Qichang Liu1,4 Zhenfei Yin1,5 Lu Sheng3 Ruimao Zhang2 Yu Qiao1 Jing Shao1
1Shanghai Artificial Intelligence Laboratory, 2The Chinese University of Hong Kong, Shenzhen, 3Beihang University, 4Tsinghua University, 5The University of Sydney -
RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation [CVPR 2024]
Zeyuan Yang1, Jiageng Liu, Peihao Chen2, Anoop Cherian3, Tim Marks3, Jonathan Le Roux3, Chuang Gan4
1Tsinghua University, 2South China University of Technology, 3Mitsubishi Electric Research Labs (MERL), 4MIT-IBM Watson AI Lab -
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study [arXiv 2024] [Project Page] [Code]
Weihao Tan2, Ziluo Ding1, Wentao Zhang2, Boyu Li1, Bohan Zhou3, Junpeng Yue3, Haochong Xia2, Jiechuan Jiang3, Longtao Zheng2, Xinrun Xu1, Yifei Bi1, Pengjie Gu2,
1Beijing Academy of Artificial Intelligence (BAAI), China; 2Nanyang Technological University, Singapore; 3School of Computer Science, Peking University, China -
See and Think: Embodied Agent in Virtual Environment [arXiv 2023]
Zhonghan Zhao1*, Wenhao Chai2*, Xuan Wang1*, Li Boyi1, Shengyu Hao1, Shidong Cao1, Tian Ye3, Jenq-Neng Hwang2, Gaoang Wang1
1Zhejiang University 2University of Washington 3Hong Kong University of Science and Technology (GZ) -
Agent Instructs Large Language Models to be General Zero-Shot Reasoners [arXiv 2023]
Nicholas Crispino1, Kyle Montgomery1, Fankun Zeng1, Dawn Song2, Chenguang Wang1
1Washington University in St. Louis, 2UC Berkeley -
JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models [NeurIPS 2023] [Project Page]
Zihao Wang1,2 Shaofei Cai1,2 Anji Liu3 Yonggang Jin4 Jinbing Hou4 Bowei Zhang5 Haowei Lin1,2 Zhaofeng He4 Zilong Zheng6 Yaodong Yang1 Xiaojian Ma6† Yitao Liang1†
1Institute for Artificial Intelligence, Peking University, 2School of Intelligence Science and Technology, Peking University, 3Computer Science Department, University of California, Los Angeles, 4Beijing University of Posts and Telecommunications, 5School of Electronics Engineering and Computer Science, Peking University, 6Beijing Institute for General Artificial Intelligence (BIGAI) -
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents [NeurIPS 2023]
Zihao Wang1,2 Shaofei Cai1,2 Guanzhou Chen3 Anji Liu4 Xiaojian Ma4 Yitao Liang1,5†
1Institute for Artificial Intelligence, Peking University, 2School of Intelligence Science and Technology, Peking University, 3School of Computer Science, Beijing University of Posts and Telecommunications, 4Computer Science Department, University of California, Los Angeles, 5Beijing Institute for General Artificial Intelligence (BIGAI) -
CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society [NeurIPS 2023] [Github] [Project page]
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem
King Abdullah University of Science and Technology (KAUST) -
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [arXiv 2022] [Github] [Project page]
Wenlong Huang1, Pieter Abbeel1, Deepak Pathak2, Igor Mordatch3
1UC Berkeley, 2Carnegie Mellon University, 3Google -
FILM: Following Instructions in Language with Modular Methods [ICLR 2022] [Github] [Project page]
So Yeon Min1, Devendra Singh Chaplot2, Pradeep Ravikumar1, Yonatan Bisk1, Ruslan Salakhutdinov1
1Carnegie Mellon University 2Facebook AI Research -
Embodied Task Planning with Large Language Models [arXiv 2023] [Github] [Project page] [Demo] [Huggingface Model]
Zhenyu Wu1, Ziwei Wang2,3, Xiuwei Xu2,3, Jiwen Lu2,3, Haibin Yan1*
1School of Automation, Beijing University of Posts and Telecommunications, 2Department of Automation, Tsinghua University, 3Beijing National Research Center for Information Science and Technology -
SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning [arXiv 2023]
Yue Wu1,4*, Shrimai Prabhumoye2, So Yeon Min1, Yonatan Bisk1, Ruslan Salakhutdinov1, Amos Azaria3, Tom Mitchell1, Yuanzhi Li1,4
1Carnegie Mellon University, 2NVIDIA, 3Ariel University, 4Microsoft Research -
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning [CVPR 2022 (Oral)] [Project page] [Github]
Santhosh Kumar Ramakrishnan1,2, Devendra Singh Chaplot1, Ziad Al-Halah2, Jitendra Malik1,3, Kristen Grauman1,2
1Facebook AI Research, 2UT Austin, 3UC Berkeley -
Moving Forward by Moving Backward: Embedding Action Impact over Action Semantics [ICLR 2023] [Project page] [Github]
Kuo-Hao Zeng1, Luca Weihs2, Roozbeh Mottaghi1, Ali Farhadi1
1Paul G. Allen School of Computer Science & Engineering, University of Washington, 2PRIOR @ Allen Institute for AI -
Modeling Dynamic Environments with Scene Graph Memory [ICML 2023]
Andrey Kurenkov1, Michael Lingelbach1, Tanmay Agarwal1, Emily Jin1, Chengshu Li1, Ruohan Zhang1, Li Fei-Fei1, Jiajun Wu1, Silvio Savarese2, Roberto Martín-Martín3
1Department of Computer Science, Stanford University 2Salesforce AI Research 3Department of Computer Science, University of Texas at Austin. -
Reasoning with Language Model is Planning with World Model [arXiv 2023]
Shibo Hao∗♣, Yi Gu∗♣, Haodi Ma♢, Joshua Jiahua Hong♣, Zhen Wang♣ ♠, Daisy Zhe Wang♢, Zhiting Hu♣
♣UC San Diego, ♢University of Florida, ♠Mohamed bin Zayed University of Artificial Intelligence -
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [arXiv 2022]
Robotics at Google, Everyday Robots -
Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling [ICML 2023]
Kolby Nottingham1 Prithviraj Ammanabrolu2 Alane Suhr2 Yejin Choi3,2 Hannaneh Hajishirzi3,2 Sameer Singh1,2 Roy Fox1
1Department of Computer Science, University of California Irvine 2Allen Institute for Artificial Intelligence 3Paul G. Allen School of Computer Science -
Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents [ICCV 2023]
Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim1,*, Cheolhong Min, Jonghyun Choi†
Yonsei University 1Gwangju Institute of Science and Technology -
Inner Monologue: Embodied Reasoning through Planning with Language Models [CoRL 2022] [Project page]
Robotics at Google -
Language Models Meet World Models: Embodied Experiences Enhance Language Models [arXiv 2023] [Twitter]
Jiannan Xiang∗♠, Tianhua Tao∗♠♣, Yi Gu♠, Tianmin Shu♢, Zirui Wang♠, Zichao Yang♡, Zhiting Hu♠
♠UC San Diego, ♣UIUC, ♢MIT, ♡Carnegie Mellon University -
AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [arXiv 2023] [Video]
Chuhao Jin1*, Wenhui Tan1*, Jiange Yang2*, Bei Liu3†, Ruihua Song1, Limin Wang2, Jianlong Fu3†
1Renmin University of China, 2Nanjing University, 3Microsoft Research -
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution [CoRL 2021] [Project page] [Poster]
Valts Blukis1,2, Chris Paxton1, Dieter Fox1,3, Animesh Garg1,4, Yoav Artzi2
1NVIDIA 2Cornell University 3University of Washington 4University of Toronto, Vector Institute -
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models [ICCV 2023] [Project page] [Github]
Chan Hee Song1, Jiaman Wu1, Clayton Washington1, Brian M. Sadler2, Wei-Lun Chao1, Yu Su1
1The Ohio State University, 2DEVCOM ARL -
Code as Policies: Language Model Programs for Embodied Control [arXiv 2023] [Project page] [Github] [Blog] [Colab]
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng
Robotics at Google -
3D-LLM: Injecting the 3D World into Large Language Models [arXiv 2023]
1Yining Hong, 2Haoyu Zhen, 3Peihao Chen, 4Shuhong Zheng, 5Yilun Du, 6Zhenfang Chen, 6,7Chuang Gan
1UCLA 2SJTU 3SCUT 4UIUC 5MIT 6MIT-IBM Watson AI Lab 7UMass Amherst -
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2023] [Project page] [Online Demo]
Wenlong Huang1, Chen Wang1, Ruohan Zhang1, Yunzhu Li1,2, Jiajun Wu1, Li Fei-Fei1
1Stanford University 2University of Illinois Urbana-Champaign -
PaLM-E: An Embodied Multimodal Language Model [ICML 2023] [Project page]
1Robotics at Google 2TU Berlin 3Google Research -
Large Language Models as Commonsense Knowledge for Large-Scale Task Planning [arXiv 2023]
Zirui Zhao, Wee Sun Lee, David Hsu
School of Computing National University of Singapore -
An Embodied Generalist Agent in 3D World [ICML 2024]
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
Beijing Institute for General Artificial Intelligence (BIGAI)
Multi-Agent Learning and Coordination
Building Cooperative Embodied Agents Modularly with Large Language Models [ICLR 2024] [Project page] [Github]
Hongxin Zhang1*, Weihua Du2*, Jiaming Shan3, Qinhong Zhou1, Yilun Du4, Joshua B. Tenenbaum4, Tianmin Shu4, Chuang Gan1,5
1University of Massachusetts Amherst, 2Tsinghua University, 3Shanghai Jiao Tong University, 4MIT, 5MIT-IBM Watson AI Lab -
War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars [arXiv 2023]
Wenyue Hua1*, Lizhou Fan2*, Lingyao Li2, Kai Mei1, Jianchao Ji1, Yingqiang Ge1, Libby Hemphill2, Yongfeng Zhang1
1Rutgers University, 2University of Michigan -
MindAgent: Emergent Gaming Interaction [arXiv 2023]
Ran Gong*1† Qiuyuan Huang*2‡ Xiaojian Ma*1 Hoi Vo3 Zane Durante†4 Yusuke Noda3 Zilong Zheng5 Song-Chun Zhu1,5,6,7,8 Demetri Terzopoulos1 Li Fei-Fei4 Jianfeng Gao2
1UCLA; 2Microsoft Research, Redmond; 3Xbox Team, Microsoft; 4Stanford; 5BIGAI; 6PKU; 7THU; 8UCLA -
Demonstration-free Autonomous Reinforcement Learning via Implicit and Bidirectional Curriculum [ICML 2023]
Jigang Kim*1,2 Daesol Cho*1,2 H. Jin Kim1,3
1Seoul National University, 2Artificial Intelligence Institute of Seoul National University (AIIS), 3Automation and Systems Research Institute (ASRI).
Note: This paper mainly focuses on reinforcement learning for Embodied AI. -
Adaptive Coordination in Social Embodied Rearrangement [ICML 2023]
Andrew Szot1,2 Unnat Jain1 Dhruv Batra1,2 Zsolt Kira2 Ruta Desai1 Akshara Rai1
1Meta AI 2Georgia Institute of Technology.
Vision and Language Navigation
IndoorSim-to-OutdoorReal: Learning to Navigate Outdoors without any Outdoor Experience [arXiv 2023]
Joanne Truong1,2, April Zitkovich1, Sonia Chernova2, Dhruv Batra2,3, Tingnan Zhang1, Jie Tan1, Wenhao Yu1
1Robotics at Google 2Georgia Institute of Technology 3Meta AI -
ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation [ICML 2023]
Kaiwen Zhou1, Kaizhi Zheng1, Connor Pryor1, Yilin Shen2, Hongxia Jin2, Lise Getoor1, Xin Eric Wang1
1University of California, Santa Cruz 2Samsung Research America. -
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models [arXiv 2023]
Gengze Zhou1 Yicong Hong2 Qi Wu1
1The University of Adelaide 2The Australian National University -
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [arXiv 2023] [Github]
Siyuan Huang1,2 Zhengkai Jiang4 Hao Dong3 Yu Qiao2 Peng Gao2 Hongsheng Li5
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3CFCS, School of CS, PKU, 4University of Chinese Academy of Sciences, 5The Chinese University of Hong Kong
Detection
DetGPT: Detect What You Need via Reasoning [arXiv 2023]
Renjie Pi1∗ Jiahui Gao2* Shizhe Diao1∗ Rui Pan1 Hanze Dong1 Jipeng Zhang1 Lewei Yao1 Jianhua Han3 Hang Xu2 Lingpeng Kong2 Tong Zhang1
1The Hong Kong University of Science and Technology 2The University of Hong Kong 3Shanghai Jiao Tong University
3D Grounding
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [arXiv 2023]
Jianing Yang1, Xuweiyi Chen1, Shengyi Qian1, Nikhil Madaan, Madhavan Iyengar1, David F. Fouhey1,2, Joyce Chai1
1University of Michigan, 2New York University -
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment [ICCV 2023]
Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
Beijing Institute for General Artificial Intelligence (BIGAI)
Interactive Embodied Learning
Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning [ICML 2023]
Thomas Carta1*, Clément Romac1,2, Thomas Wolf2, Sylvain Lamprier3, Olivier Sigaud4, Pierre-Yves Oudeyer1
1Inria (Flowers), University of Bordeaux, 2Hugging Face, 3Univ Angers, LERIA, SFR MATHSTIC, F-49000, 4Sorbonne University, ISIR -
Learning Affordance Landscapes for Interaction Exploration in 3D Environments [NeurIPS 2020] [Project page]
Tushar Nagarajan, Kristen Grauman
UT Austin and Facebook AI Research -
Embodied Question Answering in Photorealistic Environments with Point Cloud Perception [CVPR 2019 (oral)] [Slides]
Erik Wijmans1†, Samyak Datta1, Oleksandr Maksymets2†, Abhishek Das1, Georgia Gkioxari2, Stefan Lee1, Irfan Essa1, Devi Parikh1,2, Dhruv Batra1,2
1Georgia Institute of Technology, 2Facebook AI Research -
Multi-Target Embodied Question Answering [CVPR 2019]
Licheng Yu1, Xinlei Chen3, Georgia Gkioxari3, Mohit Bansal1, Tamara L. Berg1,3, Dhruv Batra2,3
1University of North Carolina at Chapel Hill 2Georgia Tech 3Facebook AI -
Neural Modular Control for Embodied Question Answering [CoRL 2018 (Spotlight)] [Project page] [Github]
Abhishek Das1, Georgia Gkioxari2, Stefan Lee1, Devi Parikh1,2, Dhruv Batra1,2
1Georgia Institute of Technology 2Facebook AI Research -
Embodied Question Answering [CVPR 2018 (oral)] [Project page] [Github]
Abhishek Das1, Samyak Datta1, Georgia Gkioxari2, Stefan Lee1, Devi Parikh2,1, Dhruv Batra2
1Georgia Institute of Technology, 2Facebook AI Research
Rearrangement
A Simple Approach for Visual Room Rearrangement: 3D Mapping and Semantic Search [ICLR 2023]
1Brandon Trabucco, 2Gunnar A Sigurdsson, 2Robinson Piramuthu, 2,3Gaurav S. Sukhatme, 1Ruslan Salakhutdinov
1CMU, 2Amazon Alexa AI, 3University of Southern California
Benchmark
SmartPlay: A Benchmark for LLMs as Intelligent Agents [ICLR 2024] [Github]
Yue Wu1,2, Xuan Tang1, Tom Mitchell1, Yuanzhi Li1,2
1Carnegie Mellon University, 2Microsoft Research -
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation [arXiv 2023] [Project page] [Github]
Yufei Wang1, Zhou Xian1, Feng Chen2, Tsun-Hsuan Wang3, Yian Wang4, Katerina Fragkiadaki1, Zackory Erickson1, David Held1, Chuang Gan4,5
1CMU, 2Tsinghua IIIS, 3MIT CSAIL, 4UMass Amherst, 5MIT-IBM AI Lab -
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [ICLR 2021] [Project page] [Github]
Mohit Shridhar† Xingdi Yuan♡ Marc-Alexandre Côté♡ Yonatan Bisk‡ Adam Trischler♡ Matthew Hausknecht♣
†University of Washington ♡Microsoft Research, Montréal ‡Carnegie Mellon University ♣Microsoft Research -
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks [CVPR 2020] [Project page] [Github]
Mohit Shridhar1 Jesse Thomason1 Daniel Gordon1 Yonatan Bisk1,2,3 Winson Han3 Roozbeh Mottaghi1,3 Luke Zettlemoyer1 Dieter Fox1,4
1Paul G. Allen School of Computer Sci. & Eng., Univ. of Washington, 2Language Technologies Institute @ Carnegie Mellon University, 3Allen Institute for AI, 4NVIDIA -
VIMA: Robot Manipulation with Multimodal Prompts [ICML 2023] [Project page] [Github] [VIMA-Bench]
Yunfan Jiang1 Agrim Gupta1† Zichen Zhang2† Guanzhi Wang3,4† Yongqiang Dou5 Yanjun Chen1 Li Fei-Fei1 Anima Anandkumar3,4 Yuke Zhu3,6‡ Linxi Fan3‡ -
SQA3D: Situated Question Answering in 3D Scenes [ICLR 2023] [Project page] [Slides] [Github]
Xiaojian Ma2 Silong Yong1,3* Zilong Zheng1 Qing Li1 Yitao Liang1,4 Song-Chun Zhu1,2,3,4 Siyuan Huang1
1Beijing Institute for General Artificial Intelligence (BIGAI) 2UCLA 3Tsinghua University 4Peking University -
IQA: Visual Question Answering in Interactive Environments [CVPR 2018] [Github] [Demo video (YouTube)]
Daniel Gordon1 Aniruddha Kembhavi2 Mohammad Rastegari2,4 Joseph Redmon1 Dieter Fox1,3 Ali Farhadi1,2
1Paul G. Allen School of Computer Science, University of Washington 2Allen Institute for Artificial Intelligence 3Nvidia 4Xnor.ai -
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments [ICCV 2021] [Project page] [Github]
Difei Gao1,2, Ruiping Wang1,2,3, Ziyi Bai1,2, Xilin Chen1,2
1Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 2University of Chinese Academy of Sciences, 3Beijing Academy of Artificial Intelligence
Simulator
LEGENT: Open Platform for Embodied Agents [ACL 2024] [Project page] [Github]
Tsinghua University -
AI2-THOR: An Interactive 3D Environment for Visual AI [arXiv 2022] [Project page] [Github]
Allen Institute for AI, University of Washington, Stanford University, Carnegie Mellon University -
iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [IROS 2021] [Project page] [Github]
Bokui Shen*, Fei Xia* et al. -
Habitat: A Platform for Embodied AI Research [ICCV 2019] [Project page] [Habitat-Sim] [Habitat-Lab] [Habitat Challenge]
Facebook AI Research, Facebook Reality Labs, Georgia Institute of Technology, Simon Fraser University, Intel Labs, UC Berkeley -
Habitat 2.0: Training Home Assistants to Rearrange their Habitat [NeurIPS 2021] [Project page]
Facebook AI Research, Georgia Tech, Intel Research, Simon Fraser University, UC Berkeley
Others
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models [ICLR 2023]
Google Research, Brain Team -
ReAct: Synergizing Reasoning and Acting in Language Models [ICLR 2023]
Shunyu Yao1∗, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2
1Department of Computer Science, Princeton University, 2Google Research, Brain Team
Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models [arXiv 2023]
Virginia Tech, Microsoft -
Graph of Thoughts: Solving Elaborate Problems with Large Language Models [arXiv 2023]
ETH Zurich, Cledar, Warsaw University of Technology -
Tree of Thoughts: Deliberate Problem Solving with Large Language Models [arXiv 2023]
Shunyu Yao1, Dian Yu2, Jeffrey Zhao2, Izhak Shafran2, Thomas L. Griffiths1, Yuan Cao2, Karthik Narasimhan1
1Princeton University, 2Google DeepMind -
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [NeurIPS 2022]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou
Google Research, Brain Team -
MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [NeurIPS 2022] [Github] [Project page] [Knowledge Base]
Linxi Fan1, Guanzhi Wang2*, Yunfan Jiang3*, Ajay Mandlekar1, Yuncong Yang4, Haoyi Zhu5, Andrew Tang4, De-An Huang1, Yuke Zhu1,6†, Anima Anandkumar1,2†
1NVIDIA, 2Caltech, 3Stanford, 4Columbia, 5SJTU, 6UT Austin -
Distilling Internet-Scale Vision-Language Models into Embodied Agents [ICML 2023]
Theodore Sumers1∗ Kenneth Marino2 Arun Ahuja2 Rob Fergus2 Ishita Dasgupta2 -
LISA: Reasoning Segmentation via Large Language Model [arXiv 2023] [Github] [Huggingface Models] [Dataset] [Online Demo]
Xin Lai1 Zhuotao Tian2 Yukang Chen1 Yanwei Li1 Yuhui Yuan3 Shu Liu2 Jiaya Jia1,2
1The Chinese University of Hong Kong 2SmartMore 3MSRA
[1] Trend pic from this repo.
[2] Figure from this paper: The Rise and Potential of Large Language Model Based Agents: A Survey.