LLM-Discrete-Tokenization-Survey

This repository accompanies the survey paper "Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey". The survey covers the techniques, applications, and challenges of discrete tokenization for multimodal large language models (LLMs): vector quantization, product quantization, and related methods for tokenizing modalities such as text, image, audio, video, and graph data, together with the integration of discrete tokens into LLMs for tasks including image generation, speech recognition, recommendation, and multimodal understanding and generation.

Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

The official GitHub page for the survey paper "Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey". The paper is currently under review.

[arXiv](https://arxiv.org/abs/2507.22920)

fig_1

fig_2

fig_12

tab_LLM

tab_MLLM

2 Fundamental Techniques

fig_3

fig_4

2.1 Vanilla Vector Quantization (VQ)

  1. 2017_NeurIPS_VQ-VAE_Neural Discrete Representation Learning. [arXiv]

  2. 2017_NeurIPS_SHVQ_Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. [arXiv]

  3. 2018_arXiv_Theory and Experiments on Vector Quantized Autoencoders. [arXiv]

  4. 2019_NeurIPS_VQ-VAE-2_Generating Diverse High-Fidelity Images with VQ-VAE-2. [NeurIPS] [arXiv]

  5. 2020_AAAI_soft VQ-VAE_Vector quantization-based regularization for autoencoders. [AAAI] [arXiv]

  6. 2020_NeurIPS_HQA_Hierarchical Quantized Autoencoders. [NeurIPS] [arXiv]

  7. 2021_CVPR_VQGAN_Taming Transformers for High-Resolution Image Synthesis. [CVPR] [arXiv] [Homepage]

  8. 2022_arXiv_HC-VQ_Homology-constrained vector quantization entropy regularizer. [arXiv]

  9. 2022_ICML_SQ-VAE_SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization. [ICML--Slides] [arXiv]

  10. 2023_CVPR_Reg-VQ_Regularized Vector Quantization for Tokenized Image Synthesis. [CVPR] [arXiv]

  11. 2023_ICCV_CVQ-VAE_Online Clustered Codebook. [ICCV] [arXiv]

  12. 2023_ICML_Straightening Out the Straight-Through Estimator: Overcoming Optimization Challenges in Vector Quantized Networks. [ICML] [arXiv]

  13. 2023_ICML_VQ-WAE_Vector Quantized Wasserstein Auto-Encoder. [ICML] [arXiv]

  14. 2024_arXiv_HyperVQ_HyperVQ: MLR-based Vector Quantization in Hyperbolic Space. [arXiv]

  15. 2024_arXiv_IBQ_Scalable Image Tokenization with Index Backpropagation Quantization. [arXiv]

  16. 2024_arXiv_SimVQ_Addressing representation collapse in vector quantized models with one linear layer. [arXiv]

  17. 2024_NeurIPS_VQGAN-LC_Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%. [arXiv]

  18. 2024_L4DC_SCQ_Soft Convex Quantization: Revisiting Vector Quantization with Convex Optimization. [PMLR] [arXiv]

  19. 2024_TMLR_HQ-VAE_HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes. [arXiv]

  20. 2025_ICLR_rotation trick_Restructuring vector quantization with the rotation trick. [ICLR--Slides] [arXiv]
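
The toy sketch below illustrates the nearest-codeword assignment that the VQ-VAE family shares. It is a self-contained illustration (random codebook, NumPy only), not the implementation of any paper above; training details such as the straight-through estimator, commitment loss, and EMA codebook updates are omitted.

```python
# Minimal sketch of vanilla vector quantization (VQ): each input vector is
# replaced by its nearest codeword in a codebook. The codebook is random here
# for illustration; in VQ-VAE it is trained jointly with the encoder/decoder.
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 16, 8, 5            # codebook size, code dimension, number of inputs
codebook = rng.normal(size=(K, D))
x = rng.normal(size=(N, D))   # stand-in for encoder outputs

# Nearest-codeword assignment under L2 distance.
dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K)
codes = dists.argmin(axis=1)                                     # discrete token ids
x_q = codebook[codes]                                            # quantized vectors

print("token ids:", codes)
print("quantization error:", float(((x - x_q) ** 2).mean()))
```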

2.2 Residual Vector Quantization (RVQ)

fig_5

  1. 2010_Sensors_RVQ_Approximate nearest neighbor search by residual vector quantization. [Sensors]

  2. 2014_arXiv_SQ_Stacked Quantizers for Compositional Vector Compression. [arXiv]

  3. 2014_IEEE Multimedia_PRVQ_Projected residual vector quantization for ANN search. [IEEE]

  4. 2015_arXiv_IRVQ_Improved Residual Vector Quantization for High-dimensional Approximate Nearest Neighbor Search. [arXiv]

  5. 2015_arXiv_TRQ_Transformed Residual Quantization for Approximate Nearest Neighbor Search. [arXiv]

  6. 2016_Neurocomputing_RVQ-P RVQ-NP_Parametric and nonparametric residual vector quantization optimizations for ANN search. [ACM]

  7. 2016_TKDE_CompQ_Competitive Quantization for Approximate Nearest Neighbor Search. [IEEE TKDE]

  8. 2017_Multimedia Systems_ERVQ_Optimized residual vector quantization for efficient approximate nearest neighbor search. [Springer]

  9. 2017_Multimedia_GRVQ_Generalized Residual Vector Quantization and Aggregating Tree for Large Scale Search. [IEEE Transactions on Multimedia]

  10. 2024_ICML_QINCo_Residual Quantization with Implicit Neural Codebooks. [ICML] [arXiv]

  11. 2025_arXiv_Qinco2_Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks. [arXiv]
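
As a companion to the papers above, the following toy sketch shows the residual quantization loop that RVQ-style methods build on: each stage quantizes what the previous stages failed to reconstruct. The codebooks are random and purely illustrative.

```python
# Minimal sketch of residual vector quantization (RVQ): the input is quantized
# in stages, each stage encoding the residual left by the previous stages with
# its own codebook.
import numpy as np

rng = np.random.default_rng(0)
M, K, D = 4, 32, 8                       # stages, codewords per stage, dimension
codebooks = rng.normal(size=(M, K, D))
x = rng.normal(size=(D,))

residual, codes, approx = x.copy(), [], np.zeros(D)
for m in range(M):
    d = ((residual[None, :] - codebooks[m]) ** 2).sum(-1)
    idx = int(d.argmin())
    codes.append(idx)
    approx += codebooks[m, idx]
    residual = x - approx                # what later stages still have to explain

print("codes per stage:", codes)
print("reconstruction error:", float(((x - approx) ** 2).mean()))
```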

2.3 Product Quantization (PQ)

fig_6

  1. 2011_TPAMI_PQ_Product Quantization for Nearest Neighbor Search. [IEEE TPAMI]

  2. 2013_CVPR_CKM_Cartesian k-means. [CVPR]

  3. 2013_CVPR_OPQ_Optimized product quantization for approximate nearest neighbor search. [IEEE CVPR] [CVPR]

  4. 2014_CVPR_LOPQ_Locally optimized product quantization for approximate nearest neighbor search. [IEEE CVPR] [CVPR]

  5. 2014_ICML_CQ_Composite Quantization for Approximate Nearest Neighbor Search. [ICML]

  6. 2015_CVPR_SQ1 SQ2_Sparse Composite Quantization. [CVPR]

  7. 2015_ICIP_PTQ_Product tree quantization for approximate nearest neighbor search. [IEEE ICIP]

  8. 2015_TKDE_OCKM_Optimized Cartesian K-Means. [IEEE TKDE] [arXiv]

  9. 2016_CVPR_SQ_Supervised Quantization for Similarity Search. [CVPR] [arXiv]

  10. 2018_TKDE_Online PQ_Online Product Quantization. [IEEE TKDE]

  11. 2019_CVPR_DPQ_End-to-End Supervised Product Quantization for Image Search and Retrieval. [IEEE CVPR] [CVPR] [arXiv]

  12. 2020_ICDM_Online OPQ_Online Optimized Product Quantization. [IEEE ICDM]

  13. 2020_ICML_DPQ_Differentiable Product Quantization for End-to-End Embedding Compression. [ICML] [arXiv]

  14. 2022_ICDE_VAQ_Fast Adaptive Similarity Search through Variance-Aware Quantization. [IEEE ICDE]

  15. 2023_ESWA_RVPQ_Residual Vector Product Quantization for Approximate Nearest Neighbor Search. [Springer]

  16. 2023_WWW_DOPQ_Differentiable Optimized Product Quantization and Beyond. [WWW]

  17. 2024_AAAI_HiHPQ_HiHPQ: Hierarchical Hyperbolic Product Quantization for Unsupervised Image Retrieval. [ACM AAAI] [arXiv]
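
The toy sketch below shows the subspace decomposition at the heart of product quantization: the vector is split into sub-vectors that are quantized independently, giving K^m composite codewords while storing only m*K entries. The sub-codebooks are random for illustration; real systems typically learn them with per-subspace k-means.

```python
# Minimal sketch of product quantization (PQ): split the vector into m
# sub-vectors and quantize each one with its own small codebook.
import numpy as np

rng = np.random.default_rng(0)
m, K, d_sub = 4, 16, 2                       # sub-quantizers, codewords each, sub-dim
D = m * d_sub
codebooks = rng.normal(size=(m, K, d_sub))
x = rng.normal(size=(D,))

subvecs = x.reshape(m, d_sub)
codes = [int(((sv[None, :] - cb) ** 2).sum(-1).argmin())
         for sv, cb in zip(subvecs, codebooks)]
x_q = np.concatenate([codebooks[j, c] for j, c in enumerate(codes)])

print("PQ codes:", codes)                     # one small integer per subspace
print("reconstruction error:", float(((x - x_q) ** 2).mean()))
```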

2.4 Additive Vector Quantization (AQ)

fig_7

  1. 2014_CVPR_AQ_Additive Quantization for Extreme Vector Compression. [IEEE CVPR] [CVPR]

  2. 2016_ECCV_LSQ_Revisiting Additive Quantization. [Springer] [ECCV]

  3. 2018_ECCV_LSQ++_LSQ++: Lower Running Time and Higher Recall in Multi-codebook Quantization. [ECCV]

  4. 2021_KDD_Online AQ_Online Additive Quantization. [ACM KDD] [KDD]

2.5 Finite Scalar Quantization (FSQ)

fig_8

  1. 2024_ICLR_FSQ_Finite Scalar Quantization: VQ-VAE Made Simple. [ICLR] [arXiv]
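
Finite scalar quantization replaces the learned codebook with per-dimension rounding onto a small fixed grid. The sketch below is a minimal illustration with made-up level counts; odd level counts keep the rounding symmetric, and the half-step offset FSQ uses for even counts is omitted.

```python
# Minimal sketch of finite scalar quantization (FSQ): squash each latent
# dimension to a bounded range, round it to one of L_i levels, and read the
# rounded tuple as a single token id. No codebook is learned.
import numpy as np

levels = np.array([7, 5, 5])                 # illustrative L_i levels per dimension
z = np.array([1.3, -0.2, 0.7])               # stand-in for encoder output

half = (levels - 1) / 2.0
z_bounded = np.tanh(z) * half                # map each dim into [-half, half]
z_q = np.round(z_bounded)                    # per-dimension scalar quantization

digits = (z_q + half).astype(int)            # shift levels to 0..L_i-1
token = 0
for d, L in zip(digits, levels):             # mixed-radix encoding of the tuple
    token = token * L + d

print("quantized levels:", digits, "-> token id:", token)
```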

2.6 Look-up Free Quantization (LFQ)

fig_9

  1. 2024_ICLR_MAGVIT-v2_Language Model Beats Diffusion--Tokenizer is Key to Visual Generation. [ICLR] [arXiv]
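
Look-up free quantization reduces the latent to a low dimension and binarizes each coordinate by its sign, so the token id is the resulting bit pattern and no codebook lookup is needed. The sketch below shows only this quantization step; the entropy and commitment objectives used during training in MAGVIT-v2 are omitted.

```python
# Minimal sketch of look-up free quantization (LFQ): quantize each latent
# dimension independently to {-1, +1} and read the sign pattern as the token.
import numpy as np

z = np.array([0.4, -1.2, 0.05, -0.3, 0.9, 0.7, -0.1, 0.2])  # stand-in latent

bits = (z > 0).astype(int)                 # 1 if positive, else 0
z_q = 2 * bits - 1                         # quantized latent in {-1, +1}
token = int("".join(map(str, bits)), 2)    # binary code -> integer token id

print("bits:", bits, "-> token id:", token)
print("quantized latent:", z_q)
```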

2.7 Binary Spherical Quantization (BSQ)

fig_10

  1. 2025_ICLR_BSQ_Image and Video Tokenization with Binary Spherical Quantization. [arXiv]
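
Binary spherical quantization projects the latent onto the unit hypersphere before binarizing it, so the quantized code also lies on the sphere. The following is a minimal illustrative sketch of that step only; the projection layers and training losses of the BSQ paper are omitted.

```python
# Minimal sketch of binary spherical quantization (BSQ): L2-normalize the
# (projected) latent, then quantize each coordinate to +/- 1/sqrt(d) so the
# quantized vector also has unit norm; the token is the d-bit sign pattern.
import numpy as np

z = np.array([0.4, -1.2, 0.05, -0.3, 0.9, 0.7])   # stand-in projected latent
d = z.shape[0]

u = z / np.linalg.norm(z)                          # point on the unit sphere
z_q = np.sign(u) / np.sqrt(d)                      # quantized point, also unit norm
bits = (u > 0).astype(int)
token = int("".join(map(str, bits)), 2)

print("token id:", token, "| ||z_q|| =", float(np.linalg.norm(z_q)))
```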

2.8 Graph Anchor-Relation Compositional Tokenization

  1. 2022_ICLR_NodePiece_NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs. [ICLR] [arXiv]

  2. 2023_AAAI_EARL_Entity-Agnostic Representation Learning for Parameter-Efficient Knowledge Graph Embedding. [ACM AAAI] [arXiv]

  3. 2023_EMNLP_RandomEQ_Random Entity Quantization for Parameter-Efficient Compositional Knowledge Graph Representation. [EMNLP] [arXiv]
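
In anchor-relation compositional tokenization, an entity is described by a small shared vocabulary of anchor nodes and relation types instead of having its own embedding. The sketch below is a schematic toy example in the spirit of NodePiece; the graph, anchor set, and tokenization details are invented for illustration only.

```python
# Schematic sketch of anchor-relation compositional tokenization: a node is
# represented by the ids of its k nearest anchor nodes (with hop distances)
# plus its incident relation types.
from collections import deque

edges = {                      # node -> list of (relation, neighbor)
    "a": [("r1", "b"), ("r2", "c")],
    "b": [("r1", "a"), ("r3", "d")],
    "c": [("r2", "a")],
    "d": [("r3", "b")],
}
anchors = ["a", "c"]           # small set of anchor nodes shared by all entities
k = 2

def hops_to_anchors(start):
    """BFS hop distance from `start` to every reachable node."""
    dist, frontier = {start: 0}, deque([start])
    while frontier:
        u = frontier.popleft()
        for _, v in edges.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

def tokenize(node):
    dist = hops_to_anchors(node)
    nearest = sorted(anchors, key=lambda a: dist.get(a, float("inf")))[:k]
    anchor_tokens = [(a, dist.get(a)) for a in nearest]
    relation_tokens = sorted({r for r, _ in edges.get(node, [])})
    return anchor_tokens + relation_tokens

print(tokenize("d"))   # e.g. [('a', 2), ('c', 3), 'r3']
```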

tab_1

3 Classical Applications without LLMs

3.1 Image

  1. 2017_CVPR_DVSQ_Deep Visual-Semantic Quantization for Efficient Image Retrieval. [CVPR]

  2. 2019_IJCAI_DPQ_Beyond Product Quantization: Deep Progressive Quantization for Image Retrieval. [IJCAI] [arXiv]

  3. 2021_ICCV_SPQ_Self-supervised Product Quantization for Deep Unsupervised Image Retrieval. [ICCV] [arXiv]

  4. 2022_arXiv_BEiT v2_BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. [arXiv]

  5. 2022_AAAI_MeCoQ_Contrastive Quantization with Code Memory for Unsupervised Image Retrieval. [AAAI] [arXiv]

  6. 2022_CVPR_MaskGIT_MaskGIT: Masked Generative Image Transformer. [CVPR] [arXiv] [Github]

  7. 2022_CVPR_RQ-VAE RQ-Transformer_AutoRegressive Image Generation using Residual Quantization. [CVPR] [arXiv]

  8. 2022_ICLR_BEiT_BEiT: BERT Pre-Training of Image Transformers. [ICLR] [arXiv]

  9. 2022_ICLR_ViT-VQGAN_Vector-Quantized Image Modeling with Improved VQGAN. [ICLR] [arXiv]

  10. 2022_NeurIPS_MoVQ_MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation. [NeurIPS] [arXiv]

  11. 2023_CVPR_DQ-VAE_Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization. [CVPR] [arXiv]

  12. 2023_CVPR_MAGE_MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis. [CVPR] [arXiv] [Github]

  13. 2023_CVPR_MQ-VAE_Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation. [CVPR] [arXiv]

  14. 2023_ICCV_Efficient-VQGAN_Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers. [ICCV] [IEEE ICCV] [arXiv]

  15. 2024_CVPR_SeQ-GAN_Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis. [CVPR] [arXiv]

  16. 2024_ICLR_ClusterMIM_On the Role of Discrete Tokenization in Visual Representation Learning. [ICLR] [arXiv]

  17. 2024_NeurIPS_TiTok_An Image is Worth 32 Tokens for Reconstruction and Generation. [NeurIPS] [arXiv]

  18. 2024_NeurIPS_VAR_Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. [NeurIPS] [arXiv]

  19. 2024_NeurIPS_VQ-KD_Image Understanding Makes for A Good Tokenizer for Image Generation. [NeurIPS] [arXiv]

  20. 2024_TMLR_MaskBit_MaskBit: Embedding-free Image Generation via Bit Tokens. [Poster] [arXiv]

  21. 2025_arXiv_FlowMo_Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization. [arXiv]

  22. 2025_CVPR_MergeVQ_MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization. [CVPR] [arXiv] [Github]

  23. 2025_ICLR_DnD-Transformer_A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-Grained Image Generation. [ICLR] [arXiv]

3.2 Sound

  1. 2022_TASLP_SoundStream_SoundStream: An End-to-End Neural Audio Codec. [IEEE] [arXiv] [Github]

  2. 2023_arXiv_HiFi-Codec_HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec. [arXiv]

  3. 2023_NeurIPS_DAC_High-Fidelity Audio Compression with Improved RVQGAN. [NeurIPS] [arXiv]

  4. 2023_TMLR_EnCodec_High Fidelity Neural Audio Compression. [arXiv]

  5. 2024_J-STSP_SemantiCodec_SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound. [arXiv]

  6. 2025_arXiv_QINCODEC_QINCODEC: Neural Audio Compression with Implicit Neural Codebooks. [arXiv]

  7. 2025_arXiv_SQCodec_One Quantizer is Enough: Toward a Lightweight Audio Codec. [arXiv]

  8. 2025_arXiv_UniCodec_UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook. [arXiv] [Github]

  9. 2025_IEEE Signal Processing Letters_StreamCodec_A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication. [arXiv]

  10. 2020_ICLR_vq-wav2vec_vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. [ICLR] [arXiv]

  11. 2020_NeurIPS_wav2vec 2.0_wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. [NeurIPS] [arXiv]

  12. 2023_ICASSP_LMCodec_LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models. [IEEE ICASSP] [arXiv]

  13. 2024_ICLR_SpeechTokenizer_SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. [ICLR] [arXiv]

  14. 2025_ICASSP_LFSC_Low Frame-rate Speech Codec: A Codec Designed for Fast High-quality Speech LLM Training and Inference. [IEEE ICASSP] [arXiv]

  15. 2025_ICLR_TAAE_Scaling Transformers for Low-Bitrate High-Quality Speech Coding. [ICLR]

3.3 Graph

  1. 2020_AAAI_SNEQ_SNEQ: Semi-Supervised Attributed Network Embedding with Attention-Based Quantisation. [AAAI]

  2. 2020_ACL_TS-CL_Knowledge Graph Embedding Compression. [ACL]

  3. 2021_CIKM_LightKG_A Lightweight Knowledge Graph Embedding Framework for Efficient Inference and Storage. [ACM CIKM]

  4. 2021_TNNLS_d-SNEQ_Semi-Supervised Network Embedding with Differentiable Deep Quantization. [IEEE TNNLS]

  5. 2023_ICLR_Mole-BERT_Mole-BERT: Rethinking Pre-Training Graph Neural Networks for Molecules. [ICLR]

  6. 2023_NeurIPS_IMoLD_Learning Invariant Molecular Representation in Latent Discrete Space. [NeurIPS] [arXiv]

  7. 2024_ICLR_MAPE-PPI_MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding. [ICLR] [arXiv]

  8. 2024_ICLR_VQGraph_VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs. [ICLR] [arXiv]

  9. 2024_NeurIPS_GFT_GFT: Graph Foundation Model with Transferable Tree Vocabulary. [NeurIPS] [arXiv]

  10. 2024_TMLR_DGAE_Discrete Graph Auto-Encoder. [OpenReview TMLR] [arXiv]

  11. 2025_arXiv_GT-SVQ_GT-SVQ: A Linear-Time Graph Transformer for Node Classification Using Spiking Vector Quantization. [arXiv]

  12. 2025_AAAI_GLAD_GLAD: Improving Latent Graph Generative Modeling with Simple Quantization. [AAAI] [ICML Workshop] [arXiv]

  13. 2025_ICLR_GQT_Learning Graph Quantized Tokenizers. [ICLR] [arXiv]

  14. 2025_ICLR_NID_Node Identifiers: Compact, Discrete Representations for Efficient Graph Learning. [ICLR] [arXiv]

  15. 2025_WWW_HQA-GAE_Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection. [WWW] [arXiv]

3.4 Video

  1. 2021_arXiv_VideoGPT_VideoGPT: Video Generation using VQ-VAE and Transformers. [arXiv] [Github]

  2. 2022_ECCV_TATS_Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer. [ACM ECCV] [arXiv]

  3. 2023_CVPR_MAGVIT_MAGVIT: Masked Generative Video Transformer. [CVPR] [arXiv]

  4. 2023_ICLR_Phenaki_Phenaki: Variable Length Video Generation From Open Domain Textual Descriptions. [ICLR] [arXiv]

  5. 2024_arXiv_LARP_LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior. [arXiv] [Github]

  6. 2024_arXiv_VidTok_VidTok: A Versatile and Open-Source Video Tokenizer. [arXiv]

  7. 2024_arXiv_VQ-NeRV_VQ-NeRV: A Vector Quantized Neural Representation for Videos. [arXiv] [Github]

  8. 2024_ICLR_MAGVIT-v2_Language Model Beats Diffusion: Tokenizer is Key to Visual Generation. [arXiv] [Github]

  9. 2024_NeurIPS_OmniTokenizer_OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation. [NeurIPS] [arXiv]

  10. 2025_arXiv_SweetTok_SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization. [arXiv]

  11. 2025_arXiv_TVC_TVC: Tokenized Video Compression with Ultra-Low Bitrate. [arXiv]

  12. 2025_ICLR_BSQ_Image and Video Tokenization with Binary Spherical Quantization. [arXiv] [Github]

3.5 Action

  1. 2023_CoRL_SAQ_Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning. [arXiv] [Github]

  2. 2024_ICML_PRISE_PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control. [ICML] [arXiv] [Github]

3.6 Text + Image

  1. 2021_ICML_DALL-E_Zero-Shot Text-to-Image Generation. [ICML] [arXiv]

  2. 2021_NeurIPS_CogView_CogView: Mastering Text-to-Image Generation via Transformers. [NeurIPS] [arXiv] [Github]

  3. 2022_CVPR_VQ-Diffusion_Vector Quantized Diffusion Model for Text-to-Image Synthesis. [CVPR] [IEEE CVPR] [arXiv] [Github]

  4. 2022_ECCV_Make-A-Scene_Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. [ECCV] [arXiv] [Github]

  5. 2023_CVPR_NUWA-LIP_NUWA-LIP: Language-guided Image Inpainting with Defect-free VQGAN. [CVPR] [arXiv] [Github]

  6. 2023_ICLR_Unified-IO_Unified-IO: A unified model for vision, language, and multi-modal tasks. [ICLR] [arXiv]

  7. 2023_ICML_Muse_Muse: Text-to-Image Generation via Masked Generative Transformer. [ICML] [arXiv]

  8. 2024_arXiv_TexTok_Language-Guided Image Tokenization for Generation. [arXiv]

  9. 2024_NeurIPS_LG-VQ_LG-VQ: Language-Guided Codebook Learning. [NeurIPS] [arXiv] [Github]

  10. 2025_arXiv_TokLIP_TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation. [arXiv] [Github]

  11. 2025_arXiv_UniTok_UniTok: A Unified Tokenizer for Visual Generation and Understanding. [arXiv] [Github]

  12. 2025_AAAI_MyGO_MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. [arXiv] [Github]

  13. 2025_ICLR_HART_HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. [arXiv] [Github]

3.7 Text + Sound

  1. 2023_ICLR_AudioGen_AudioGen: Textually Guided Audio Generation. [ICLR] [arXiv] [Github] [HuggingFace]

  2. 2024_arXiv_Spectral Codecs_Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis. [arXiv]

  3. 2023_arXiv_VALL-E X_Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling. [arXiv]

  4. 2023_arXiv_VALL-E_Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. [arXiv]

  5. 2024_arXiv_RALL-E_RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis. [arXiv]

  6. 2024_arXiv_SimpleSpeech 2_SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models. [IEEE] [arXiv]

  7. 2024_ICML_NaturalSpeech 3_NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. [ICML] [arXiv] [Microsoft] [Github]

  8. 2024_Interspeech_SimpleSpeech_SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models. [arXiv] [Github]

  9. 2024_Interspeech_Single-Codec_Single-Codec: Single-Codebook Speech Codec towards High-Performance. [arXiv] [Github]

  10. 2025_ICLR_HALL-E_HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis. [ICLR] [arXiv]

3.8 Audio + Video

  1. 2025_CVIU_VQ-MAE-AV_A vector quantized masked autoencoder for audiovisual speech emotion recognition. [arXiv]

3.9 Audio + Motion

  1. 2024_CVPR_ProTalk_Towards Variable and Coordinated Holistic Co-Speech Motion Generation. [CVPR] [arXiv] [Github]

3.10 Audio + Image + Video

  1. 2025_AAAI_VQTalker_VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization. [AAAI] [arXiv] [Github]

3.11 Text + Image + Video + Action

  1. 2024_arXiv_WorldDreamer_WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens. [arXiv] [Github]

3.12 Complex Modality in RecSys

  1. 2020_WWW_MGQE_Learning Multi-granular Quantized Embeddings for Large-Vocab Categorical Features in Recommender Systems. [WWW] [arXiv]

  2. 2023_NeurlPS_TIGER_Recommender Systems with Generative Retrieval. [NeurIPS] [arXiv] [HuggingFace]

  3. 2023_TOIS_ReFRS_ReFRS: Resource-efficient Federated Recommender System for Dynamic and Diversified User Preferences. [ACM TOIS] [arXiv]

  4. 2023_WWW_VQ-Rec_Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. [WWW] [arXiv] [Github]

  5. 2024_KDD_EAGER_EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration. [KDD] [arXiv] [Github]

  6. 2024_RecSys_CoST_CoST: Contrastive Quantization based Semantic Tokenization for Generative Recommendation. [ACM RecSys] [arXiv]

fig_11

4 LLM-based Single-Modality Applications

4.1 Image

  1. 2023_NeurIPS_SPAE_SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs. [NeurIPS] [ACM NeurIPS] [arXiv] [Github]

  2. 2023_NeurIPS_LQAE_Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment. [NeurIPS] [arXiv]

  3. 2024_arXiv_LlamaGen_Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. [arXiv] [Github]

  4. 2024_CVPR_V2T Tokenizer_Beyond Text: Frozen Large Language Models in Visual Signal Comprehension. [CVPR] [IEEE CVPR] [arXiv]

  5. 2024_ICML_StrokeNUWA_StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis. [ICML] [ACM ICML] [arXiv] [Github] [HuggingFace]

  6. 2025_arXiv_V2Flow_V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation. [arXiv] [Github]

4.2 Audio

  1. 2023_NeurIPS_TWIST_Textually Pretrained Speech Language Models. [NeurIPS] [arXiv]

  2. 2024_arXiv_JTFS LM_Comparing Discrete and Continuous Space LLMs for Speech Recognition. [arXiv] [Github]

  3. 2024_arXiv_SSVC_Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations. [arXiv]

  4. 2025_ICLR_WavTokenizer_WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling. [OpenReview] [arXiv] [GitHub]

  5. 2024_arXiv_GLM-4-Voice_GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot. [arXiv] [GitHub] [HuggingFace]

4.3 Graph

  1. 2024_arXiv_NT-LLM_NT-LLM: A Novel Node Tokenizer for Integrating Graph Structure into Large Language Models. [arXiv]

  2. 2025_AAAI_Dr.E_Multi-View Empowered Structural Graph Wordification for Language Models. [AAAI] [arXiv]

4.4 Action

  1. 2024_CVPR_LLM-AR_LLMs are Good Action Recognizers. [CVPR] [IEEE CVPR] [arXiv]

4.5 Complex Modality in RecSys

  1. 2024_arXiv_ETEGRec_Generative Recommender with End-to-End Learnable Item Tokenization. [arXiv]

  2. 2024_arXiv_META ID_Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens. [ICML] [arXiv]

  3. 2024_arXiv_QARM_QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou. [arXiv]

  4. 2024_arXiv_STORE_Streamlining Semantic Tokenization and Generative Recommendation with A Single LLM. [arXiv]

  5. 2024_arXiv_TokenRec_TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendations. [arXiv] [Github]

  6. 2024_CIKM_ColaRec_Content-Based Collaborative Generation for Recommender Systems. [ACM CIKM] [arXiv] [Github]

  7. 2024_CIKM_LETTER_Learnable Item Tokenization for Generative Recommendation. [ACM CIKM] [arXiv] [Github]

  8. 2024_ICDE_LC-Rec_Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation. [arXiv] [Github] [HuggingFace]

  9. 2025_arXiv_UTGRec_Universal Item Tokenization for Transferable Generative Recommendation. [arXiv] [Github]

  10. 2025_AAAI_Semantic Convergence: Harmonizing Recommender Systems via Two-Stage Alignment and Behavioral Semantic Tokenization. [AAAI] [arXiv]

  11. 2025_WWW_EAGER-LLM_EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration. [WWW] [arXiv] [Github]

  12. 2025_WWW_ED2_Unleash LLMs Potential for Sequential Recommendation by Coordinating Dual Dynamic Index Mechanism. [WWW] [arXiv]

5 LLM-based Multi-Modality Applications

5.1 Text + Image

  1. 2023_arXiv_SEED_Planting a SEED of Vision in Large Language Model. [arXiv] [Github]

  2. 2024_arXiv_Chameleon_Chameleon: Mixed-Modal Early-Fusion Foundation Models. [arXiv] [Github]

  3. 2024_arXiv_ILLUME_ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance. [arXiv]

  4. 2024_arXiv_Janus_Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. [CVPR] [arXiv]

  5. 2024_arXiv_Lumina-mGPT_Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. [arXiv] [Github]

  6. 2024_arXiv_MUSE-VL_MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding. [arXiv]

  7. 2024_arXiv_Show-o_Show-o: One single transformer to unify multimodal understanding and generation. [arXiv] [Github]

  8. 2024_arXiv_Liquid_Liquid: Language Models are Scalable and Unified Multi-modal Generators. [arXiv] [Github]

  9. 2024_ICLR_LaVIT_Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. [ICLR] [arXiv] [Github]

  10. 2024_ICLR_SEED-LLaMA_Making LLaMA SEE and Draw with SEED Tokenizer. [ICLR] [arXiv] [Github]

  11. 2024_ICML_Libra_Libra: Building Decoupled Vision System on Large Language Models. [ICML] [arXiv] [Github]

  12. 2024_ICML_Morph-Tokens_Auto-Encoding Morph-Tokens for Multimodal LLM. [ICML] [arXiv] [Github]

  13. 2025_arXiv_DDT-LLaMA_Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens. [arXiv] [Github]

  14. 2025_arXiv_ETT_End-to-End Vision Tokenizer Tuning. [arXiv] [HuggingFace]

  15. 2025_arXiv_FashionM3_FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model. [arXiv]

  16. 2025_arXiv_HiMTok_HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model. [arXiv] [Github] [HuggingFace]

  17. 2025_arXiv_ILLUME+_ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement. [arXiv] [Github]

  18. 2025_arXiv_Janus-Pro_Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. [arXiv]

  19. 2025_arXiv_QLIP_QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation. [arXiv] [Github]

  20. 2025_arXiv_SemHiTok_SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation. [arXiv]

  21. 2025_arXiv_UniToken_UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding. [arXiv] [Github]

  22. 2025_arXiv_Token-Shuffle_Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models. [arXiv] [Github] [HuggingFace]

  23. 2025_AAAI_MARS_MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis. [AAAI] [arXiv] [Github]

  24. 2025_CVPR_TokenFlow_TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation. [CVPR] [arXiv] [Github]

  25. 2025_ICLR_ClawMachine_ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding. [ICLR] [arXiv]

  26. 2025_arXiv_Unicode2_Unicode2: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation. [arXiv]

5.2 Text + Audio

  1. 2023_arXiv_AudioPaLM_AudioPaLM: A Large Language Model That Can Speak and Listen. [arXiv] [HuggingFace] [Youtube]

  2. 2023_arXiv_LauraGPT_LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT. [arXiv] [Github]

  3. 2023_EMNLP_SpeechGPT_SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. [EMNLP] [arXiv] [Github] [HuggingFace]

  4. 2024_arXiv_SpeechGPT-Gen_SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation. [arXiv] [Github] [Demo]

  5. 2024_arXiv_CosyVoice 2_CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. [arXiv] [Github]

  6. 2024_arXiv_CosyVoice_CosyVoice: A Scalable Multilingual Zero-shot Text-to-Speech Synthesizer based on Supervised Semantic Tokens. [arXiv] [Github]

  7. 2024_arXiv_IntrinsicVoice_IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities. [arXiv]

  8. 2024_arXiv_Moshi_Moshi: a speech-text foundation model for real-time dialogue. [arXiv] [Github]

  9. 2024_arXiv_OmniFlatten_OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation. [arXiv]

  10. 2024_arXiv_T5-TTS_Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment. [arXiv] [Github]

  11. 2024_ICASSP_VoxtLM_VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks. [IEEE ICASSP] [arXiv] [Github]

  12. 2024_IEEE Signal Processing Letters_MSRT_Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization. [IEEE Signal Processing Letters]

  13. 2024_Interspeech_DiscreteSLU_DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding. [arXiv] [HuggingFace]

  14. 2024_MM_GPT-Talker_Generative Expressive Conversational Speech Synthesis. [ACM MM] [arXiv] [Github]

  15. 2025_arXiv_Spark-TTS_Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens. [arXiv] [Github]

  16. 2025_arXiv_Kimi-Audio_Kimi-Audio Technical Report. [arXiv] [Github] [HuggingFace]

5.3 Text + Video

  1. 2024_arXiv_Loong_Loong: Generating Minute-level Long Videos with Autoregressive Language Models. [arXiv] [Github] [HuggingFace]

  2. 2024_ICML_Video-LaVIT_Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization. [ICML] [ACM ICML] [arXiv] [Github]

  3. 2025_arXiv_HiTVideo_HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models. [arXiv] [Github]

5.4 Text + Graph

  1. 2024_arXiv_UniMoT_UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation. [arXiv] [Github]

  2. 2024_ICML Workshop_HIGHT_Improving Graph-Language Alignment with Hierarchical Graph Tokenization. [OpenReview ICML Workshop]

  3. 2025_arXiv_MedTok_Multimodal Medical Code Tokenizer. [arXiv]

  4. 2025_arXiv_SSQR_Self-Supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models. [arXiv] [HuggingFace]

5.5 Text + Motion

  1. 2024_arXiv_MotionGlot_MotionGlot: A Multi-Embodied Motion Generation Model. [arXiv] [Demo] [Github]

  2. 2024_CVPR_AvatarGPT_AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond. [CVPR] [IEEE CVPR] [arXiv] [Github]

  3. 2024_ECCV_SemGrasp_SemGrasp: Semantic Grasp Generation via Language Aligned Discretization. [ACM ECCV] [arXiv]

  4. 2024_IV_Walk-the-Talk_Walk-the-Talk: LLM driven Pedestrian Motion Generation. [IEEE IV]

5.6 Text + Image + Audio

  1. 2023_arXiv_TEAL_TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models. [arXiv]

  2. 2024_arXiv_DMLM_Discrete Multimodal Transformers with A Pretrained Large Language Model for Mixed-Supervision Speech Processing. [arXiv]

  3. 2024_ACL_AnyGPT_AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. [ACL] [arXiv] [Github] [HuggingFace] [Youtube] [Bilibili]

5.7 Text + Image + Video

  1. 2024_arXiv_Emu3_Emu3: Next-Token Prediction is All You Need. [arXiv] [Github] [Youtube]

  2. 2025_ICLR_LWM_World Model on Million-Length Video and Language with Blockwise RingAttention. [ICLR] [arXiv] [Github] [HuggingFace] [YouTube]

  3. 2025_ICLR_VILA-U_VILA-U: A Unified Foundation Model Integrating Visual Understanding and Generation. [ICLR] [arXiv] [HuggingFace]

5.8 Text + Audio + Motion

  1. 2025_ICCGV_LLM Gesticulator_LLM Gesticulator: Leveraging Large Language Models for Scalable and Controllable Co-Speech Gesture Synthesis. [arXiv] [Github]

5.9 Text + Image + Audio + Action

  1. 2024_CVPR_Unified-IO 2_Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action. [CVPR] [IEEE CVPR] [arXiv] [HuggingFace]

5.10 Text + Image + Video + Audio

  1. 2024_arXiv_MIO_MIO: A Foundation Model on Multimodal Tokens. [arXiv] [HuggingFace] [YouTube]

  2. 2024_ICML_VideoPoet_VideoPoet: A Large Language Model for Zero-Shot Video Generation. [ICML] [arXiv] [HuggingFace] [Youtube]

fig_13

6 Challenges and Future Directions

Related Survey

  1. 1988_IEEE Transactions on Communications_Survey_Image Coding Using Vector Quantization: A Review.

  2. 1989_IEEE Transactions on Information Theory_Survey_High-Resolution Quantization Theory and the Vector Quantizer Advantage.

  3. 1990_IEEE Transactions on Information Theory_Survey_A Study of Vector Quantization for Noisy Channels.

  4. 1996_TIP_Survey_Advances in Residual Vector Quantization: A Review.

  5. 1998_IEEE Transactions on Information Theory_Survey_Quantization.

  6. 2006_IEEE Potentials_Survey_A review of vector quantization techniques.

  7. 2010_IHMSP_Survey_A Survey of VQ Codebook Generation.

  8. 2018_MTA_Survey_A Survey of Product Quantization.

  9. 2019_Frontiers of Information Technology & Electronic Engineering_Survey_Vector Quantization: A Review.

  10. 2023_ACM Computing Surveys_Survey_Embedding in Recommender Systems: A Survey.

  11. 2024_arXiv_Survey_Vector Quantization for Recommender Systems: A Review and Outlook.

  12. 2024_ACM Computing Surveys_Survey_Embedding Compression in Recommender Systems: A Survey.

  13. 2024_Data Engineering_Survey_High-Dimensional Vector Quantization: General Framework, Recent Advances, and Future Directions.

  14. 2024_arXiv_Survey_Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey.

  15. 2024_PrePrints_Survey_Continuous or Discrete, That is the Question: A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension.

  16. 2025_arXiv_Survey_A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models.

  17. 2025_arXiv_Survey_From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval.

  18. 2025_arXiv_Survey_Recent Advances in Discrete Speech Tokens: A Review.

  19. 2025_TMLR_Survey_Autoregressive Models in Vision: A Survey.

Acknowledgements

The following works were recommended by their authors after the initial release of this survey and will be added to the revised version of the paper and to this repository. We sincerely thank the authors for their contributions to enriching the coverage of this work.

  1. 2024_arXiv_Liquid_Liquid: Language Models are Scalable and Unified Multi-modal Generators. (5.1 Text + Image)

    Authors: Junfeng Wu et al. (HUST & ByteDance)
    [arXiv] [Github] (Added on August 9, 2025, following the authors’ kind recommendation)

📖 Citation

If you compare with, build on, or use aspects of this work, please cite the following:

@article{li2025discrete,
  title={Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey},
  author={Li, Jindong and Fu, Yali and Liu, Jiahong and Cao, Linxiao and Ji, Wei and Yang, Menglin and King, Irwin and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2507.22920},
  year={2025}
}
