LMCache

Redis for LLMs

LMCache is a serving engine extension designed to reduce time to first token (TTFT) and increase throughput, particularly in long-context scenarios. It stores the key-value (KV) caches of reusable texts across storage tiers such as GPU memory, CPU DRAM, and local disk, so the KV cache of any repeated text can be reused by any serving engine instance. Combined with vLLM, LMCache delivers significant delay savings and GPU-cycle reduction in many large language model (LLM) use cases, such as multi-round question answering and retrieval-augmented generation (RAG). LMCache integrates with the latest vLLM release for both online serving and offline inference, supports sharing KV caches across multiple vLLM instances, and provides stable support for non-prefix KV caches along with user and developer documentation.

README:


| Blog | Documentation | Join Slack | Interest Form | Official Email |

💡 What is LMCache?

TL;DR - Redis for LLMs.

LMCache is an LLM serving engine extension that reduces TTFT and increases throughput, especially in long-context scenarios. By storing the KV caches of reusable texts across locations including GPU memory, CPU DRAM, and local disk, LMCache can reuse the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance. Thus, LMCache saves precious GPU cycles and reduces response delay for users.
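
As a concrete illustration, the sketch below configures the CPU DRAM and local-disk tiers through LMCache's `LMCACHE_*` environment variables. The option names follow the LMCache documentation at the time of writing, but they vary between releases, so treat them as assumptions and check the documentation for your version.

```python
import os

# Hypothetical tier configuration via LMCache's LMCACHE_* environment
# variables (names assumed from the documentation; verify for your version).
os.environ["LMCACHE_CHUNK_SIZE"] = "256"                   # tokens per KV-cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"                   # offload KV caches to CPU DRAM
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"           # DRAM budget in GB
os.environ["LMCACHE_LOCAL_DISK"] = "file:///tmp/lmcache/"  # spill tier on local disk
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "10.0"         # disk budget in GB
```

With tiers like these in place, KV caches evicted from GPU memory can be served from DRAM or disk instead of being recomputed.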

Combined with vLLM, LMCache achieves 3-10x delay savings and GPU-cycle reduction in many LLM use cases, including multi-round QA and RAG.

Try LMCache with the pre-built vLLM Docker images here.

🚀 Performance snapshot


💻 Installation and Quickstart

Please refer to our detailed documentation for LMCache V1 and LMCache V0.
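
For a flavor of the workflow, here is a minimal offline-inference sketch that wires LMCache into vLLM through a KV-transfer connector, following the pattern in the LMCache V1 examples. The connector name, import paths, and constructor arguments differ between vLLM releases, so treat the specifics below as assumptions and defer to the documentation.

```python
# Assumes `pip install lmcache vllm` and an LMCACHE_* environment
# configured as in the tier sketch above.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Register LMCache as vLLM's KV-cache connector. The connector name
# "LMCacheConnectorV1" is taken from the LMCache V1 examples and may
# differ in other versions.
kv_config = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any vLLM-supported model
    kv_transfer_config=kv_config,
    gpu_memory_utilization=0.8,
)

# A long shared context (e.g., retrieved documents) whose KV cache is
# computed once, stored by LMCache, and reused for the second query.
context = "(long shared document text here)\n" * 500
queries = ["Question 1: ...", "Question 2: ..."]

params = SamplingParams(temperature=0.0, max_tokens=64)
for query in queries:
    output = llm.generate([context + query], params)[0]
    print(output.outputs[0].text)
```

The second query should skip prefill for the shared context, which is where the TTFT savings come from.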

Interested in Connecting?

Fill out the interest form and our team will reach out to you! https://forms.gle/mQfQDUXbKfp2St1z7

🛣️ News and Milestones

  • [x] LMCache V1, with vLLM integration and the following features, is live 🔥
    • High performance CPU KVCache offloading
    • Disaggregated prefill
    • P2P KVCache sharing (a configuration sketch follows this list)
  • [x] LMCache is supported in the vLLM production stack ecosystem
  • [x] User and developer documentation
  • [x] Stable support for non-prefix KV caches
  • [x] Support for installation through pip and integration with the latest vLLM
  • [x] First release of LMCache
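
For the P2P KVCache sharing item above, the documented pattern is to run a standalone cache server that multiple vLLM+LMCache instances point at. The `lmcache_server` command, the `lm://` URL scheme, and the option names below are taken from the LMCache examples and are assumptions that may change between releases.

```python
import os

# First, start a standalone cache server for the instances to share, e.g.:
#   $ lmcache_server localhost 65432
# (command name assumed from the LMCache examples; verify for your version).

# Then point each vLLM+LMCache instance at the shared backend before
# constructing the LLM, so their KV caches land in a common store:
os.environ["LMCACHE_REMOTE_URL"] = "lm://localhost:65432"  # shared KV store
os.environ["LMCACHE_REMOTE_SERDE"] = "cachegen"            # serializer; "naive" is also documented
```

Any instance configured this way can then hit KV caches produced by its peers instead of recomputing them.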

📖 Blogs and documentation

Our latest blog posts and documentation pages are available online.

Community meeting

The community meeting for LMCache is co-hosted with the community meeting for the vLLM production stack project.

Meeting Details:

Meetings alternate weekly between the two times. All are welcome to join!

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use LMCache for your research, please cite our papers:

@inproceedings{liu2024cachegen,
  title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}

@article{cheng2024large,
  title={Do Large Language Models Need a Content Delivery Network?},
  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
  journal={arXiv preprint arXiv:2409.13761},
  year={2024}
}

@article{yao2024cacheblend,
  title={CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  journal={arXiv preprint arXiv:2405.16444},
  year={2024}
}

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.
