Consistency_LLM

[ICML 2024] CLLMs: Consistency Large Language Models

Consistency Large Language Models (CLLMs) are a family of efficient parallel decoders that reduce inference latency by decoding multiple tokens in parallel. The models are trained to perform efficient Jacobi decoding, mapping any randomly initialized token sequence to the same result as auto-regressive decoding in as few steps as possible. CLLMs have shown significant improvements in generation speed on a variety of tasks, achieving up to 3.4 times faster generation, and they integrate seamlessly with other techniques for efficient Large Language Model (LLM) inference, without the need for draft models or architectural modifications.

README:

CLLM

 Consistency Large Language Models: A Family of Efficient Parallel Decoders

| Paper | Blog |


Consistency large language models (CLLMs) are a new family of models capable of reducing inference latency by efficiently decoding $n$ tokens in parallel. This decoding method is called Jacobi decoding, which improves inference efficiency in comparison with conventional auto-regressive (AR) decoding. CLLMs are trained with the objective of performing efficient Jacobi decoding by mapping any randomly initialized $n$-token sequence to the same result as AR decoding in as few steps as possible.
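To make the Jacobi decoding loop concrete, here is a minimal, hypothetical Python sketch (not the repository's implementation; all names and defaults are illustrative): starting from a random guess for the next $n$ tokens, the whole block is refined in parallel with greedy predictions until it stops changing, i.e., reaches the fixed point that greedy AR decoding would also produce.

import torch

def jacobi_decode(model, prefix_ids, n_tokens, max_iters=64):
    # Minimal sketch of Jacobi decoding for a single n-token block.
    # `model` is assumed to be a Hugging Face causal LM and `prefix_ids`
    # a tensor of shape (1, prefix_len); EOS handling and multi-block
    # generation are omitted for brevity.
    guess = torch.randint(0, model.config.vocab_size, (1, n_tokens),
                          device=prefix_ids.device)
    for _ in range(max_iters):
        input_ids = torch.cat([prefix_ids, guess], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Greedy prediction for every position of the guessed block,
        # taken from the logits of the positions directly before it.
        new_guess = logits[:, prefix_ids.shape[1] - 1:-1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):  # fixed point reached
            return new_guess
        guess = new_guess
    return guess

A vanilla AR-trained LLM typically needs many such iterations before the block converges; CLLM training teaches the model to jump to this fixed point in far fewer iterations, which is where the speedup comes from.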

Experimental results demonstrate the effectiveness of CLLMs, showing $2.4\times$ to $3.4\times$ improvements in generation speed across a variety of tasks.

A demo of a CLLM achieving a significant speedup ($\sim3\times$) in generation speed while solving a basic math problem is available in the project repository.


News 🔥

  • [2024/3] CLLMs are integrated into FastChat!
  • [2024/2] The CLLM paper is now available on arXiv, and CLLM model checkpoints are released on the Hugging Face Hub.

Introduction

Consistency Large Language Models (CLLMs) are a family of efficient parallel decoders refined from pre-trained LLMs.

Compared with existing fast decoding techniques, CLLMs achieve fast parallel decoding without the need for:

  • Draft models
  • Architectural modifications/auxiliary model components

This introduces a number of advantages for CLLMs:

  • CLLMs avoid the complexity of obtaining a 'good' draft model and managing two different models in a single system.
  • CLLMs share the same architecture as the target LLMs and require no additional engineering effort when adapting the technique to different models.
  • CLLMs can be integrated seamlessly with other techniques for efficient LLM inference (e.g. Lookahead Decoding) to achieve even greater speedups.

Installation

  1. Environment setup:
conda create -n cllm python=3.10
conda activate cllm
  2. Clone this repository and build from source:
git clone git@github.com:hao-ai-lab/Consistency_LLM.git
cd Consistency_LLM
  3. Install dependencies:
pip install -r requirements.txt
pip install flash-attn==2.4.1
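As an optional sanity check (this command is just a suggestion, not part of the official setup), you can confirm that flash-attn was built and imports correctly:

python -c "import flash_attn; print(flash_attn.__version__)"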

Model Weights

Target Pre-trained Models

| Size | Dataset | Huggingface Repo |
| ---- | ------- | ---------------- |
| 7B | ShareGPT | cllm/vicuna-7b-sharegpt-gpt4-48k |
| 7B | GSM8K (Math) | GAIR/Abel-7B-001 |
| 7B | Spider (Text-to-SQL) | cllm/deepseekcoder-7b-instruct-spider |
| 7B | Code-Search-Net Python | cllm/deepseekcoder_7b_codesearch_net_python |

CLLMs

| Size | Dataset | Huggingface Repo |
| ---- | ------- | ---------------- |
| 7B | ShareGPT | cllm/consistency-llm-7b-sharegpt48k |
| 7B | GSM8K (Math) | cllm/consistency-llm-7b-math |
| 7B | Spider (Text-to-SQL) | cllm/consistency-llm-7b-spider |
| 7B | Code-Search-Net Python | cllm/consistency-llm-7b-codesearchnet |
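Because CLLMs keep their target models' architecture, the released checkpoints should load with the standard transformers API. A minimal sketch follows; the dtype and device settings are just one reasonable choice, and device_map="auto" assumes accelerate is installed:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Any repo id from the CLLM table above should work here.
model_id = "cllm/consistency-llm-7b-math"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)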

Usage

Inference

bash applications/run_chat_cllm.sh {model_path} {cllm_type}

cllm_type can take one of the values spider, python, gsm8k, or sharegpt.
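For example, chatting with the math CLLM from the table above might look like the following (the model path is shown as its Hub repo id; a local checkpoint directory should work the same way):

bash applications/run_chat_cllm.sh cllm/consistency-llm-7b-math gsm8k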

Training

  1. Collect Jacobi trajectory:
  • Method 1: Directly download Jacobi trajectory to data/collected_jacobi_trajectory/ from our Huggingface Hub page.
  • Method 2 (generate trajectories suited to your own target model and dataset): some raw datasets contain additional information (e.g. database dependencies) or cannot be loaded directly from the Hugging Face Hub (for example, Spider and ShareGPT), and these need to be placed in data/raw_data first. Then run scripts/generate_trajectory.sh; the training dataset for a CLLM will be saved in data/collected_jacobi_trajectory/.

For example, for the gsm8k dataset, run:

# max_new_tokens corresponds to the size of n_token_sequence
CUDA_VISIBLE_DEVICES=0 bash scripts/generate_trajectory.sh {filename} {model_path} {n_token_seq_size} {max_new_seq_len}
Other command options:
--filename: path to the raw dataset, currently supporting {data/raw_data/spider, code_search_net, data/raw_data/gsm8k_train.jsonl, data/raw_data/ShareGPT_V3_unfiltered_cleaned_split.json}
--data_size: maximum number of prompts used to extract Jacobi trajectories
--use_aug: use data augmentation technique
--use_labels: add dataset's labels to the output file
  2. Train a CLLM (filled-in examples of both steps are shown below):
bash scripts/train_cllm.sh {model_path} {trajectory_file} {output_path} {n_token_seq_size}
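For illustration, filled-in versions of the two commands above for the GSM8K setting might look like the following. The n_token_seq_size of 16 and max_new_seq_len of 1024 are assumptions for illustration rather than tuned recommendations, and the output path is hypothetical; {trajectory_file} stands for whatever file step 1 produced.

# Step 1: collect Jacobi trajectories from the GSM8K target model
CUDA_VISIBLE_DEVICES=0 bash scripts/generate_trajectory.sh data/raw_data/gsm8k_train.jsonl GAIR/Abel-7B-001 16 1024
# Step 2: train the CLLM on the collected trajectories
bash scripts/train_cllm.sh GAIR/Abel-7B-001 data/collected_jacobi_trajectory/{trajectory_file} models/cllm-7b-gsm8k 16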

Evaluation

We follow the same settings as human-eval, Spider, MT-bench, and GSM8K to evaluate CLLMs' generation quality. Example code for evaluating CLLMs' throughput (measured in tokens/s), fast-forwarded token count, and stationary token count can be found in the eval folder. Take the GSM8K dataset as an example:

To test the speedup, run:

CUDA_VISIBLE_DEVICES=0 bash eval/gsm8k/speedup.sh {model_path} {target_model_path} {max_new_tokens}
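For instance, measuring the math CLLM against its AR target model with a 1024-token generation cap (an arbitrary illustrative value) might look like:

CUDA_VISIBLE_DEVICES=0 bash eval/gsm8k/speedup.sh cllm/consistency-llm-7b-math GAIR/Abel-7B-001 1024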

To test the accuracy, run:

CUDA_VISIBLE_DEVICES=0 python eval/gsm8k/acc.py --model_dir path_to_cllm --temperature 0.0 --top_p 1.0 --output_file_name 'cllm_generated_gsm8k.jsonl' \
--dev_set "gsm8k" --prompt_type math-single --max_new_tokens_for_consistency 16 --max_tokens 1024 --use_consistency_decoding

Citation

This is the official project repository for the following paper. If you find this repository helpful, please cite:

@misc{kou2024cllms,
      title={CLLMs: Consistency Large Language Models}, 
      author={Siqi Kou and Lanxiang Hu and Zhezhi He and Zhijie Deng and Hao Zhang},
      year={2024},
      eprint={2403.00835},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
