<img alt="LightLLM" src="https://github.com/ModelTC/LightLLM/raw/v1.1.0/assets/logo_new.png" width=90%>
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.
English Docs | 中文文档 | Blogs
Learn more in the release blogs: v1.0.0 blog.
Please refer to the FAQ for more information.
We welcome any cooperation and contribution. If a project requires LightLLM's support, please contact us via email or create a pull request.
Projects based on LightLLM or referencing LightLLM components:
- LazyLLM
- LoongServe, Peking University
- OmniKV, Ant Group
- vLLM (uses some LightLLM kernels)
- SGLang (uses some LightLLM kernels)
- ParrotServe, Microsoft
- Aphrodite (uses some LightLLM kernels)
- S-LoRA
Also, LightLLM's pure-Python design and token-level KV Cache management make it easy to use as the basis for research projects.
Academic works based on or using parts of LightLLM:
- ParrotServe (OSDI’24)
- S-LoRA (MLSys’24)
- LoongServe (SOSP’24)
- ByteDance’s CXL (EuroSys’24)
- VTC (OSDI’24)
- OmniKV (ICLR’25)
- CaraServe, LoRATEE, FastSwitch ...
For further information and discussion, join our Discord server. New members are welcome, and we look forward to your contributions!
This repository is released under the Apache-2.0 license.
We learned a lot from the following projects when developing LightLLM:
- Faster Transformer
- Text Generation Inference
- vLLM
- SGLang
- flashinfer
- Flash Attention 1&2
- OpenAI Triton
We have published a number of papers on components or features of LightLLM. If you use LightLLM in your work, please consider citing the relevant paper.
Constrained decoding: accepted by ACL 2025, where it received an Outstanding Paper Award:
@inproceedings{
anonymous2025pre,
title={Pre\${\textasciicircum}3\$: Enabling Deterministic Pushdown Automata for Faster Structured {LLM} Generation},
author={Anonymous},
booktitle={Submitted to ACL Rolling Review - February 2025},
year={2025},
url={https://openreview.net/forum?id=g1aBeiyZEi},
note={under review}
}
Request scheduler: accepted by ASPLOS’25:
@inproceedings{gong2025past,
title={Past-Future Scheduler for LLM Serving under SLA Guarantees},
author={Gong, Ruihao and Bai, Shihao and Wu, Siyu and Fan, Yunqian and Wang, Zaijun and Li, Xiuhong and Yang, Hailong and Liu, Xianglong},
booktitle={Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages={798--813},
year={2025}
}