Awesome-AI-Papers

This repository collects papers and code in the field of AI. The contents are organized into the following parts:

Table of Contents

  ├─ NLP/  
  │  ├─ Word2Vec/  
  │  ├─ Seq2Seq/           
  │  └─ Pretraining/  
  │    ├─ Large Language Model/          
  │    ├─ LLM Application/ 
  │      ├─ AI Agent/          
  │      ├─ Academic/          
  │      ├─ Code/       
  │      ├─ Financial Application/
  │      ├─ Information Retrieval/  
  │      ├─ Math/     
  │      ├─ Medicine and Law/   
  │      ├─ Recommend System/      
  │      └─ Tool Learning/             
  │    ├─ LLM Technique/ 
  │      ├─ Alignment/          
  │      ├─ Context Length/          
  │      ├─ Corpus/       
  │      ├─ Evaluation/
  │      ├─ Hallucination/  
  │      ├─ Inference/     
  │      ├─ MoE/   
  │      ├─ PEFT/     
  │      ├─ Prompt Learning/   
  │      ├─ RAG/       
  │      └─ Reasoning and Planning/       
  │    ├─ LLM Theory/       
  │    └─ Chinese Model/             
  ├─ CV/  
  │  ├─ CV Application/          
  │  ├─ Contrastive Learning/         
  │  ├─ Foundation Model/ 
  │  ├─ Generative Model (GAN and VAE)/          
  │  ├─ Image Editing/          
  │  ├─ Object Detection/          
  │  ├─ Semantic Segmentation/            
  │  └─ Video/          
  ├─ Multimodal/       
  │  ├─ Audio/          
  │  ├─ BLIP/         
  │  ├─ CLIP/        
  │  ├─ Diffusion Model/   
  │  ├─ Multimodal LLM/          
  │  ├─ Text2Image/          
  │  ├─ Text2Video/            
  │  └─ Survey/           
  ├─ Reinforcement Learning/ 
  ├─ GNN/ 
  └─ Transformer Architecture/          

NLP

1. Word2Vec

  • Efficient Estimation of Word Representations in Vector Space, Mikolov et al., arxiv 2013. [paper]
  • Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., NIPS 2013. [paper]
  • Distributed representations of sentences and documents, Le and Mikolov, ICML 2014. [paper]
  • Word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, Goldberg and Levy, arxiv 2014. [paper]
  • word2vec Parameter Learning Explained, Rong, arxiv 2014. [paper]
  • GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP 2014. [paper][code]
  • fastText: Bag of Tricks for Efficient Text Classification, Joulin et al., arxiv 2016. [paper][code]
  • ELMo: Deep Contextualized Word Representations, Peters et al., NAACL 2018. [paper]
  • Distilling the Knowledge in a Neural Network, Hinton et al., arxiv 2015. [paper][FitNets]
  • BPE: Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL 2016. [paper][code]
  • Byte-Level BPE: Neural Machine Translation with Byte-Level Subwords, Wang et al., arxiv 2019. [paper][code]
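
A minimal sketch of training skip-gram embeddings with negative sampling via the gensim library (assuming gensim is installed; the toy corpus is illustrative):

  from gensim.models import Word2Vec

  # Toy corpus: each sentence is a list of tokens.
  sentences = [
      ["the", "quick", "brown", "fox", "jumps"],
      ["distributed", "representations", "of", "words", "and", "phrases"],
  ]

  # sg=1 selects skip-gram, negative=5 enables negative sampling (Mikolov et al., 2013).
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, negative=5, epochs=10)

  vector = model.wv["words"]                      # 100-dimensional embedding for a token
  print(model.wv.most_similar("words", topn=3))   # nearest neighbours in embedding space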

2. Seq2Seq

  • Generating Sequences With Recurrent Neural Networks, Graves, arxiv 2013. [paper]
  • Sequence to Sequence Learning with Neural Networks, Sutskever et al., NeurIPS 2014. [paper]
  • Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR 2015. [paper][code]
  • On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, Cho et al., arxiv 2014. [paper]
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., arxiv 2014. [paper]
  • [fairseq][fairseq2][pytorch-seq2seq]
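
A minimal GRU encoder-decoder sketch in PyTorch, illustrating the sequence-to-sequence pattern of Sutskever et al. (2014); it is a toy example, not drawn from any of the codebases above:

  import torch
  import torch.nn as nn

  class Seq2Seq(nn.Module):
      def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
          super().__init__()
          self.src_emb = nn.Embedding(src_vocab, emb)
          self.tgt_emb = nn.Embedding(tgt_vocab, emb)
          self.encoder = nn.GRU(emb, hid, batch_first=True)
          self.decoder = nn.GRU(emb, hid, batch_first=True)
          self.proj = nn.Linear(hid, tgt_vocab)

      def forward(self, src, tgt):
          _, h = self.encoder(self.src_emb(src))       # final hidden state summarizes the source
          out, _ = self.decoder(self.tgt_emb(tgt), h)  # teacher forcing: feed gold target tokens
          return self.proj(out)                        # per-step logits over the target vocabulary

  model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
  logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 9)))
  print(logits.shape)  # torch.Size([2, 9, 1000])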

3. Pretraining

3.1 Large Language Model

3.2 LLM Application

3.2.1 AI Agent
3.2.2 Academic
3.2.3 Code
3.2.4 Financial Application
3.2.5 Information Retrieval
3.2.6 Math
3.2.7 Medicine and Law
3.2.8 Recommend System
3.2.9 Tool Learning
  • Tool Learning with Foundation Models, Qin et al., arxiv 2023. [paper][code]

  • Tool Learning with Large Language Models: A Survey, Qu et al., arxiv 2024. [paper][code]

  • Toolformer: Language Models Can Teach Themselves to Use Tools, Schick et al., arxiv 2023. [paper][toolformer-pytorch][conceptofmind/toolformer][xrsrke/toolformer][Graph_Toolformer]

  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, Qin et al., ICLR 2024 Spotlight. [paper][code][StableToolBench]

  • Gorilla: Large Language Model Connected with Massive APIs, Patil et al., arxiv 2023. [paper][code]

  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, Shen et al., NeurIPS 2023. [paper][code]

  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction, Yang et al., arxiv 2023. [paper][code]

  • RestGPT: Connecting Large Language Models with Real-World RESTful APIs, Song et al., arxiv 2023. [paper][code]

  • LLMCompiler: An LLM Compiler for Parallel Function Calling, Kim et al., ICML 2024. [paper][code]

  • Large Language Models as Tool Makers, Cai et al., arxiv 2023. [paper][code]

  • ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, Tang et al., arxiv 2023. [paper][code][ToolQA][toolbench]

  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search, Zhuang et al., arxiv 2023. [paper][code]

  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, Lu et al., NeurIPS 2023. [paper][code]

  • ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios, Ye et al., arxiv 2024. [paper][code]

  • AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls, Du et al., arxiv 2024. [paper][code]

  • LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, Wang et al., arxiv 2024. [paper][code]

  • What Are Tools Anyway? A Survey from the Language Model Perspective, Wang et al., arxiv 2024. [paper]

  • ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, Lu et al., arxiv 2024. [paper][code][API-Bank][ToolHop][ComplexFuncBench]

  • Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval, Chen et al., arxiv 2024. [paper]

  • ToolACE: Winning the Points of LLM Function Calling, Liu et al., arxiv 2024. [paper][ToolGen]

  • Hammer: Robust Function-Calling for On-Device Language Models via Function Masking, Lin et al., arxiv 2024. [paper][code]

  • [functionary][ToolLearningPapers][awesome-tool-llm]
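
The systems above share a simple control loop: the model either emits a tool call, which the runtime executes and feeds back, or a final answer. A toy sketch of that loop, where call_llm is a hypothetical stand-in for any chat-completion API and the tool registry is illustrative:

  import json

  TOOLS = {
      "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only; never eval untrusted input
  }

  def run_agent(question, call_llm, max_steps=5):
      # call_llm(messages) is assumed to return either {"tool": name, "arguments": "..."} or {"final": "..."}.
      messages = [{"role": "user", "content": question}]
      for _ in range(max_steps):
          reply = call_llm(messages)
          if "final" in reply:
              return reply["final"]                              # the model chose to answer directly
          result = TOOLS[reply["tool"]](reply["arguments"])      # execute the requested tool
          messages.append({"role": "tool",
                           "content": json.dumps({"tool": reply["tool"], "result": result})})
      return "stopped after max_steps"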

3.3 LLM Technique

3.3.1 Alignment
3.3.2 Context Length
  • ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, Press et al., ICLR 2022. [paper][code]
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation, Chen et al., arxiv 2023. [paper]
  • Scaling Transformer to 1M tokens and beyond with RMT, Bulatov et al., AAAI 2024. [paper][code][LM-RMT]
  • RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text, Zhou et al., arxiv 2023. [paper][code]
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens, Ding et al., arxiv 2023. [paper][code][unofficial code]
  • Focused Transformer: Contrastive Training for Context Scaling, Tworkowski et al., NeurIPS 2023. [paper][code]
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, Chen et al., ICLR 2024 Oral. [paper][code]
  • StreamingLLM: Efficient Streaming Language Models with Attention Sinks, Xiao et al., ICLR 2024. [paper][code][SwiftInfer][SwiftInfer blog]
  • YaRN: Efficient Context Window Extension of Large Language Models, Peng et al., ICLR 2024. [paper][code][LM-Infinite]
  • Ring Attention with Blockwise Transformers for Near-Infinite Context, Liu et al., ICLR 2024. [paper][code][ring-attention-pytorch][ring-flash-attention][local-attention][tree_attention]
  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression, Jiang et al., ACL 2024. [paper][code]
  • LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, Ding et al., arxiv 2024. [paper][code]
  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning, Jin et al., arxiv 2024. [paper][code]
  • The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey, Pawar et al., arxiv 2024. [paper][Awesome-LLM-Long-Context-Modeling]
  • Data Engineering for Scaling Language Models to 128K Context, Fu et al., arxiv 2024. [paper][code]
  • CEPE: Long-Context Language Modeling with Parallel Context Encoding, Yen et al., ACL 2024. [paper][code]
  • Training-Free Long-Context Scaling of Large Language Models, An et al., ICML 2024. [paper][code]
  • InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory, Xiao et al., NeurIPS 2024. [paper][code]
  • Counting-Stars: A Simple, Efficient, and Reasonable Strategy for Evaluating Long-Context Large Language Models, Song et al., arxiv 2024. [paper][code][LLMTest_NeedleInAHaystack][RULER][LooGLE][LongBench][google-deepmind/loft]
  • Infini-Transformer: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, Munkhdalai et al., arxiv 2024. [paper][infini-transformer-pytorch][InfiniTransformer][infini-mini-transformer][megalodon][InfiniteHiP]
  • Activation Beacon: Long Context Compression with Activation Beacon, Zhang et al., ICLR 2025. [paper][code][Extending Llama-3's Context Ten-Fold Overnight]
  • Make Your LLM Fully Utilize the Context, An et al., arxiv 2024. [paper][code]
  • CoPE: Contextual Position Encoding: Learning to Count What's Important, Golovneva et al., arxiv 2024. [paper][rope_cope]
  • Scaling Granite Code Models to 128K Context, Stallone et al., arxiv 2024. [paper][code][granite-3.1-language-models]
  • Generalizing an LLM from 8k to 1M Context using Qwen-Agent, Qwen Team, 2024. [blog][Qwen2.5-1M]
  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Bai et al., arxiv 2024. [paper][code][LongCite][LongReward][context-cite][OmniThink][SelfCite]
  • A failed experiment: Infini-Attention, and why we should keep trying, HuggingFace Blog, 2024. [blog][Magic Blog]
  • Why Does the Effective Context Length of LLMs Fall Short, An et al., arxiv 2024. [paper][code][rotary-embedding-torch]
  • How to Train Long-Context Language Models (Effectively), Gao et al., arxiv 2024. [paper][code]
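
Several of the entries above change how position enters attention. As one concrete example, a sketch of the ALiBi bias (Press et al., ICLR 2022): a per-head linear penalty on query-key distance that is added to the attention scores, letting models extrapolate beyond the training length:

  import torch

  def alibi_bias(num_heads, seq_len):
      # Geometric head slopes 2^(-8/n), 2^(-16/n), ... as in the paper (assumes num_heads is a power of two).
      slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
      pos = torch.arange(seq_len)
      rel = pos[None, :] - pos[:, None]                 # rel[i, j] = j - i (<= 0 for past keys)
      bias = slopes[:, None, None] * rel[None, :, :]    # farther-back keys get a larger penalty
      causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
      return bias + causal                              # (num_heads, seq_len, seq_len)

  bias = alibi_bias(num_heads=8, seq_len=16)            # add to q @ k.T / sqrt(d) before the softmax
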
3.3.3 Corpus
3.3.4 Evaluation
3.3.5 Hallucination
  • Extrinsic Hallucinations in LLMs, Lilian Weng, 2024. [blog]
  • Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Zhang et al., arxiv 2023. [paper][code]
  • A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, Huang et al., arxiv 2023. [paper][code][Awesome-MLLM-Hallucination]
  • The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, Li et al., arxiv 2024. [paper][code]
  • FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, Chen et al., arxiv 2023. [paper][code][OlympicArena][FActScore]
  • Chain-of-Verification Reduces Hallucination in Large Language Models, Dhuliawala et al., arxiv 2023. [paper][code]
  • HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models, Guan et al., CVPR 2024. [paper][code]
  • Woodpecker: Hallucination Correction for Multimodal Large Language Models, Yin et al., arxiv 2023. [paper][code]
  • OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, Huang et al., CVPR 2024 Highlight. [paper][code]
  • TrustLLM: Trustworthiness in Large Language Models, Sun et al., arxiv 2024. [paper][code]
  • SAFE: Long-form factuality in large language models, Wei et al., arxiv 2024. [paper][code]
  • RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models, Hu et al., arxiv 2024. [paper][code][HaluAgent][LLMsKnow]
  • Detecting hallucinations in large language models using semantic entropy, Farquhar et al., Nature 2024. [paper][semantic_uncertainty][long_hallucinations][Semantic Uncertainty ICLR 2023][Lynx-hallucination-detection]
  • A Survey on the Honesty of Large Language Models, Li et al., arxiv 2024. [paper][code]
  • LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations, Orgad et al., arxiv 2024. [paper][code]
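
A sketch of the semantic-entropy idea from Farquhar et al. (Nature 2024): sample several answers, cluster them by bidirectional entailment, and treat high entropy over the clusters as a confabulation signal. Here sample_answers and entails are hypothetical stand-ins for an LLM sampler and an NLI model:

  import math

  def semantic_entropy(question, sample_answers, entails, n=10):
      answers = sample_answers(question, n)                      # n stochastic generations
      clusters = []
      for a in answers:
          for c in clusters:
              if entails(a, c[0]) and entails(c[0], a):          # mutual entailment => same meaning
                  c.append(a)
                  break
          else:
              clusters.append([a])                               # start a new semantic cluster
      probs = [len(c) / len(answers) for c in clusters]
      return -sum(p * math.log(p) for p in probs)                # high entropy suggests confabulation
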
3.3.6 Inference
3.3.7 MoE
3.3.8 PEFT (Parameter-efficient Fine-tuning)
3.3.9 Prompt Learning
3.3.10 RAG (Retrieval Augmented Generation)
Text Embedding
3.3.11 Reasoning and Planning
Survey

3.4 LLM Theory

3.5 Chinese Model


CV

  • CS231n: Deep Learning for Computer Vision [link]

1. Basic for CV

  • AlexNet: ImageNet Classification with Deep Convolutional Neural Networks, Krizhevsky et al., NIPS 2012. [paper]
  • VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan et al., ICLR 2015. [paper]
  • GoogLeNet: Going Deeper with Convolutions, Szegedy et al., CVPR 2015. [paper]
  • ResNet: Deep Residual Learning for Image Recognition, He et al., CVPR 2016 Best Paper. [paper][code][resnet_inference.py]
  • DenseNet: Densely Connected Convolutional Networks, Huang et al., CVPR 2017 Oral. [paper][code]
  • EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Tan et al., ICML 2019. [paper][code][EfficientNet-PyTorch][noisystudent]
  • BYOL: Bootstrap your own latent: A new approach to self-supervised Learning, Grill et al., arxiv 2020. [paper][code][byol-pytorch][simsiam]
  • ConvNeXt: A ConvNet for the 2020s, Liu et al., CVPR 2022. [paper][code][ConvNeXt-V2]
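
A minimal PyTorch sketch of the residual (basic) block from ResNet (He et al., 2016), the key idea behind several of the models above:

  import torch
  import torch.nn as nn

  class BasicBlock(nn.Module):
      def __init__(self, channels):
          super().__init__()
          self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
          self.bn1 = nn.BatchNorm2d(channels)
          self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
          self.bn2 = nn.BatchNorm2d(channels)
          self.relu = nn.ReLU(inplace=True)

      def forward(self, x):
          out = self.relu(self.bn1(self.conv1(x)))
          out = self.bn2(self.conv2(out))
          return self.relu(out + x)                   # identity shortcut keeps gradients flowing

  y = BasicBlock(64)(torch.randn(1, 64, 56, 56))      # shape preserved: (1, 64, 56, 56)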

2. Contrastive Learning

  • MoCo: Momentum Contrast for Unsupervised Visual Representation Learning, He et al., CVPR 2020. [paper][code]

  • SimCLR: A Simple Framework for Contrastive Learning of Visual Representations, Chen et al., PMLR 2020. [paper][code]

  • CoCa: Contrastive Captioners are Image-Text Foundation Models, Yu et al., arxiv 2022. [paper][CoCa-pytorch][multimodal]

  • DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., arxiv 2023. [paper][code]

  • FeatUp: A Model-Agnostic Framework for Features at Any Resolution, Fu et al., ICLR 2024. [paper][code]

  • InfoNCE Loss: Representation Learning with Contrastive Predictive Coding, Oord et al., arxiv 2018. [paper][unofficial code]
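
A sketch of the InfoNCE loss (Oord et al., 2018) behind SimCLR-, MoCo-, and CLIP-style training: positive pairs sit on the diagonal of a similarity matrix and every other sample in the batch acts as a negative:

  import torch
  import torch.nn.functional as F

  def info_nce(queries, keys, temperature=0.07):
      # queries, keys: (N, D) embeddings where (queries[i], keys[i]) is the positive pair.
      q = F.normalize(queries, dim=-1)
      k = F.normalize(keys, dim=-1)
      logits = q @ k.t() / temperature                 # (N, N) cosine-similarity matrix
      labels = torch.arange(q.size(0), device=q.device)
      return F.cross_entropy(logits, labels)           # classify each query against its own key

  loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))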

3. CV Application

4. Foundation Model

5. Generative Model (GAN and VAE)

6. Image Editing

  • InstructPix2Pix: Learning to Follow Image Editing Instructions, Brooks et al., CVPR 2023 Highlight. [paper][code]

  • Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold, Pan et al., SIGGRAPH 2023. [paper][code]

  • DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing, Shi et al., arxiv 2023. [paper][code]

  • DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models, Mou et al., ICLR 2024 Spotlight. [paper][code]

  • DragAnything: Motion Control for Anything using Entity Representation, Wu et al., ECCV 2024. [paper][code][Framer][SG-I2V][Go-with-the-Flow]

  • LEDITS++: Limitless Image Editing using Text-to-Image Models, Brack et al., arxiv 2023. [paper][code][demo]

  • Diffusion Model-Based Image Editing: A Survey, Huang et al., arxiv 2024. [paper][code]

  • PromptFix: You Prompt and We Fix the Photo, Yu et al., NeurIPS 2024. [paper][code]

  • MimicBrush: Zero-shot Image Editing with Reference Imitation, Chen et al., arxiv 2024. [paper][code][EchoMimic][echomimic_v2]

  • A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models, Shuai et al., arxiv 2024. [paper][code]

  • Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models, Atzmon et al., arxiv 2024. [paper]

  • MagicQuill: An Intelligent Interactive Image Editing System, Liu et al., arxiv 2024. [paper][code]

  • BrushEdit: All-In-One Image Inpainting and Editing, Li et al., arxiv 2024. [paper][code][DiffuEraser]

  • [EditAnything][ComfyUI-UltraEdit-ZHO][libcom][Awesome-Image-Composition][RF-Solver-Edit]

7. Object Detection

  • DETR: End-to-End Object Detection with Transformers, Carion et al., arxiv 2020. [paper][code][detrex][RT-DETR]

  • Focus-DETR: Less is More: Focus Attention for Efficient DETR, Zheng et al., arxiv 2023. [paper][code]

  • U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection, Qin et al., arxiv 2020. [paper][code]

  • YOLO: You Only Look Once: Unified, Real-Time Object Detection, Redmon et al., arxiv 2015. [paper]

  • YOLOX: Exceeding YOLO Series in 2021, Ge et al., arxiv 2021. [paper][code]

  • Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism, Wang et al., arxiv 2023. [paper][code]

  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, Liu et al., ECCV 2024. [paper][code][DINO-X][OV-DINO][OmDet][groundingLMM][Awesome-Visual-Grounding]

  • YOLO-World: Real-Time Open-Vocabulary Object Detection, Cheng et al., CVPR 2024. [paper][code]

  • YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information, Wang et al., arxiv 2024. [paper][code]

  • T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy, Jiang et al., arxiv 2024. [paper][code][ChatRex]

  • YOLOv10: Real-Time End-to-End Object Detection, Wang et al., arxiv 2024. [paper][code][YOLOv12]

  • D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement, Peng et al., ICLR 2025. [paper][code]

  • [detectron2][yolov5][mmdetection][mmdetection3d][detrex][Ultralytics YOLO11][AlphaPose]
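
A quick inference sketch with the ultralytics package listed above (assuming it is installed; the checkpoint name and image path are placeholders):

  from ultralytics import YOLO

  model = YOLO("yolov8n.pt")                 # downloads pretrained weights on first use
  results = model("path/to/image.jpg")       # returns one Results object per image
  for r in results:
      for box in r.boxes:
          print(model.names[int(box.cls)], float(box.conf), box.xyxy.tolist())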

8. Semantic Segmentation

9. Video

  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, Tong et al., NeurIPS 2022 Spotlight. [paper][code]

  • Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts, Zhao et al., arxiv 2024. [paper][code]

  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation, Wang et al., arxiv 2024. [paper]

  • [V-JEPA][I-JEPA][jepa-intuitive-physics][DINO-WM]

  • VideoMamba: State Space Model for Efficient Video Understanding, Li et al., ECCV 2024. [paper][code]

  • VideoChat: Chat-Centric Video Understanding, Li et al., CVPR 2024 Highlight. [paper][code]

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, Maaz et al., ACL 2024. [paper][code][Video-LLaMA][MovieChat][Chat-UniVi]

  • MVBench: A Comprehensive Multi-modal Video Understanding Benchmark, Li et al., CVPR 2024 Highlight. [paper][code][PhyGenBench]

  • OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer, Zhang et al., EMNLP 2024. [paper][code]

  • Tarsier: Recipes for Training and Evaluating Large Video Description Models, Wang et al., arxiv 2024. [paper][code][Tarsier2]

  • MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions, Ju et al., arxiv 2024. [paper][code]

  • MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling, Men et al., arxiv 2024. [paper][code][MIMO-pytorch][StableV2V]

  • Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding, Shu et al., arxiv 2024. [paper][code][LongVU][VisionZip][TimeChat]

  • Enhance-A-Video: Better Generated Video for Free, Luo et al., arxiv 2025. [paper][code][VideoSys][Magic-1-For-1]

  • Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, Yuan et al., arxiv 2025. [paper][code]

  • VideoWorld: Exploring Knowledge Learning from Unlabeled Videos, Ren et al., arxiv 2025. [paper][code][LWM][iVideoGPT]

  • [Awesome-LLMs-for-Video-Understanding]

10. Survey for CV

  • ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy, Vishniakov et al., arxiv 2023. [paper][code]
  • Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey, Xin et al., arxiv 2024. [paper][code]

Multimodal

1. Audio

2. BLIP

  • ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, Li et al., NeurIPS 2021. [paper][code]
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, Li et al., ICML 2022. [paper][code][laion-coco]
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Li et al., ICML 2023. [paper][code]
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, Dai et al., arxiv 2023. [paper][code]
  • X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning, Panagopoulou et al., arxiv 2023. [paper][code]
  • xGen-MM (BLIP-3): A Family of Open Large Multimodal Models, Xue et al., arxiv 2024. [paper][code]
  • xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations, Qin et al., arxiv 2024. [paper][code]
  • xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs, Ryoo et al., arxiv 2024. [paper]
  • LAVIS: A Library for Language-Vision Intelligence, Li et al., arxiv 2022. [paper][code]
  • VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, Bao et al., NeurIPS 2022. [paper][code]
  • BEiT: BERT Pre-Training of Image Transformers, Bao et al., ICLR 2022 Oral presentation. [paper][code]
  • BeiT-V3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, Wang et al., CVPR 2023. [paper][code]
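
An image-captioning sketch with BLIP-2 through Hugging Face transformers, assuming the library, a GPU, and a local image are available; the checkpoint id follows the public BLIP-2 release:

  import torch
  from PIL import Image
  from transformers import Blip2Processor, Blip2ForConditionalGeneration

  processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
  model = Blip2ForConditionalGeneration.from_pretrained(
      "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
  ).to("cuda")

  image = Image.open("path/to/image.jpg")
  inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
  caption_ids = model.generate(**inputs, max_new_tokens=30)
  print(processor.decode(caption_ids[0], skip_special_tokens=True))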

3. CLIP

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., ICML 2021. [paper][code][open_clip][clip-as-service][SigLIP][EVA][DIVA][Clip-Forge]
  • DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al., arxiv 2022. [paper][code]
  • GLIPv2: Unifying Localization and Vision-Language Understanding, Zhang et al., NeurIPS 2022. [paper][code][GLIGEN]
  • SigLIP: Sigmoid Loss for Language Image Pre-Training, Zhai et al., arxiv 2023. [paper][SigLIP 2][siglip]
  • EVA-CLIP: Improved Training Techniques for CLIP at Scale, Sun et al., arxiv 2023. [paper][code][EVA-CLIP-18B]
  • Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, Yang et al., arxiv 2022. [paper][code]
  • MetaCLIP: Demystifying CLIP Data, Xu et al., ICLR 2024 Spotlight. [paper][code]
  • Alpha-CLIP: A CLIP Model Focusing on Wherever You Want, Sun et al., arxiv 2023. [paper][code][Bootstrap3D]
  • MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, Tong et al., arxiv 2024. [paper][code]
  • MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training, Vasu et al., CVPR 2024. [paper][code]
  • Long-CLIP: Unlocking the Long-Text Capability of CLIP, Zhang et al., ECCV 2024. [paper][code][Inf-CLIP]
  • CLOC: Contrastive Localized Language-Image Pre-Training, Chen et al., arxiv 2024. [paper]
  • LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation, Huang et al., arxiv 2024. [paper][code]
  • SuperClass: Classification Done Right for Vision-Language Pre-Training, Huang et al., NeurIPS 2024. [paper][code]
  • AIM-v2: Multimodal Autoregressive Pre-training of Large Vision Encoders, Fini et al., arxiv 2024. [paper][code]
  • Scaling Pre-training to One Hundred Billion Data for Vision Language Models, Wang et al., arxiv 2025. [paper]
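
A zero-shot classification sketch with the original CLIP checkpoint via Hugging Face transformers (Radford et al., 2021): encode the image and a set of text prompts, then softmax the image-text similarities (image path and labels are placeholders):

  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  image = Image.open("path/to/image.jpg")
  labels = ["a photo of a cat", "a photo of a dog"]
  inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
  probs = model(**inputs).logits_per_image.softmax(dim=-1)   # image-text similarity -> class probabilities
  print(dict(zip(labels, probs[0].tolist())))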

4. Diffusion Model

5. Multimodal LLM

6. Text2Image

  • DALL-E: Zero-Shot Text-to-Image Generation, Ramesh et al., arxiv 2021. [paper][code]

  • DALL-E3: Improving Image Generation with Better Captions, Betker et al., OpenAI 2023. [paper][code][blog][Glyph-ByT5]

  • ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models, Zhang et al., ICCV 2023 Marr Prize. [paper][code][ControlNet_Plus_Plus][ControlNeXt][ControlAR][OminiControl][ROICtrl]

  • T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Mou et al., AAAI 2024. [paper][code]

  • AnyText: Multilingual Visual Text Generation And Editing, Tuo et al., arxiv 2023. [paper][code]

  • RPG: Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs, Yang et al., ICML 2024. [paper][code][IterComp]

  • LAION-5B: An open large-scale dataset for training next generation image-text models, Schuhmann et al., NeurIPS 2022. [paper][code][blog][laion-coco][multimodal_textbook][kangas]

  • DeepFloyd IF: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., arxiv 2022. [paper][code]

  • Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al., NeurIPS 2022. [paper][unofficial code]

  • Instruct-Imagen: Image Generation with Multi-modal Instruction, Hu et al., arxiv 2024. [paper][Imagen 3]

  • CogView: Mastering Text-to-Image Generation via Transformers, Ding et al., NeurIPS 2021. [paper][code][ImageReward]

  • CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, Ding et al., arxiv 2022. [paper][code]

  • CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion, Zheng et al., ECCV 2024. [paper][code]

  • TextDiffuser: Diffusion Models as Text Painters, Chen et al., arxiv 2023. [paper][code]

  • TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering, Chen et al., arxiv 2023. [paper][code]

  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, Chen et al., arxiv 2023. [paper][code]

  • PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models, Chen et al., arxiv 2024. [paper][code]

  • PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation, Chen et al., arxiv 2024. [paper][code]

  • IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, Ye et al., arxiv 2023. [paper][code][ID-Animator][InstantID]

  • Controllable Generation with Text-to-Image Diffusion Models: A Survey, Cao et al., arxiv 2024. [paper][code]

  • StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, Zhou et al., NeurIPS 2024. [paper][code][AutoStudio][story-adapter]

  • Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, Li et al., arxiv 2024. [paper][code][Hunyuan3D-1][Hunyuan3D-2][xDiT]

  • GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation, Li et al., CVPR 2024. [paper][t2v_metrics][VQAScore]

  • [Kolors][Kolors-Virtual-Try-On][EVLM: An Efficient Vision-Language Model for Visual Understanding]

  • EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models, Zhao et al., NeurIPS 2024. [paper][code]

  • Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens, Fan et al., arxiv 2024. [paper]

  • Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis, Bai et al., arxiv 2024. [paper][code]

  • SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers, Xie et al., ICLR 2025. [paper][code]

  • [flux][x-flux][x-flux-comfyui][FLUX.1-dev-LoRA][qwen2vl-flux][1.58-bit FLUX][3DIS]
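
A minimal text-to-image sketch with the diffusers library (assuming it is installed and a GPU is available; the checkpoint id and prompt are placeholders):

  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
  ).to("cuda")
  image = pipe("a watercolor painting of a lighthouse at dusk", num_inference_steps=30).images[0]
  image.save("lighthouse.png")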

7. Text2Video

8. Survey for Multimodal

  • A Survey on Multimodal Large Language Models, Yin et al., arxiv 2023. [paper][Awesome-Multimodal-Large-Language-Models][MME-Survey]
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants, Li et al., arxiv 2023. [paper][cvinw_readings]
  • From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities, Lu et al., arxiv 2024. [paper][Leaderboards]
  • Efficient Multimodal Large Language Models: A Survey, Jin et al., arxiv 2024. [paper][code]
  • An Introduction to Vision-Language Modeling, Bordes et al., arxiv 2024. [paper]
  • Building and better understanding vision-language models: insights and future directions, Laurençon et al., arxiv 2024. [paper]
  • Video Understanding with Large Language Models: A Survey, Tang et al., arxiv 2023. [paper][code]
  • Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey, Chen et al., arxiv 2024. [paper][code]

9. Other

  • Fuyu-8B: A Multimodal Architecture for AI Agents, Bavishi et al., Adept blog 2023. [blog][model]
  • Otter: A Multi-Modal Model with In-Context Instruction Tuning, Li et al., arxiv 2023. [paper][code]
  • OtterHD: A High-Resolution Multi-modality Model, Li et al., arxiv 2023. [paper][code][model]
  • CM3leon: Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, Yu et al., arxiv 2023. [paper][Unofficial Implementation]
  • MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer, Tian et al., arxiv 2024. [paper][code]
  • CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations, Qi et al., arxiv 2024. [paper][code]
  • SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models, Gao et al., arxiv 2024. [paper][code]
  • Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers, Gao et al., arxiv 2024. [paper][code]
  • Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining, Liu et al., arxiv 2024. [paper][code][Lumina-Video]
  • LWM: World Model on Million-Length Video And Language With RingAttention, Liu et al., arxiv 2024. [paper][code]
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models, Chameleon Team, arxiv 2024. [paper][code][X-Prompt]
  • SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation, Ge et al., arxiv 2024. [paper][code][SEED][SEED-Story]

Reinforcement Learning

1. Basic for RL

2. LLM for decision making

  • Decision Transformer: Reinforcement Learning via Sequence Modeling, Chen et al., NeurIPS 2021. [paper][code]
  • Trajectory Transformer: Offline Reinforcement Learning as One Big Sequence Modeling Problem, Janner et al., NeurIPS 2021. [paper][code]
  • Guiding Pretraining in Reinforcement Learning with Large Language Models, Du et al., ICML 2023. [paper][code]
  • Introspective Tips: Large Language Model for In-Context Decision Making, Chen et al., arxiv 2023. [paper]
  • Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions, Chebotar et al., CoRL 2023. [paper][Unofficial Implementation]
  • Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods, Cao et al., arxiv 2024. [paper]
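
Decision Transformer conditions each timestep on the return-to-go (the sum of future rewards) rather than the immediate reward; a minimal sketch of that preprocessing step, assuming an undiscounted episodic setting:

  def returns_to_go(rewards, gamma=1.0):
      rtg, running = [], 0.0
      for r in reversed(rewards):
          running = r + gamma * running     # accumulate reward from the end of the episode backwards
          rtg.append(running)
      return list(reversed(rtg))

  print(returns_to_go([1.0, 0.0, 2.0, 1.0]))   # [4.0, 3.0, 3.0, 1.0]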

GNN

Survey for GNN


Transformer Architecture
