
github.com/LMCache/LMCache @v0.5.0-cu129 sqlite


15,596 symbols · 68,677 edges · 1,046 files · 8,055 documented (52%)

README

<img src="https://github.com/LMCache/LMCache/raw/v0.5.0-cu129/asset/logo.png" alt="lmcache logo" width="45%">

A KV Cache Management Layer for Scalable LLM Inference


Blog | Documentation | Join Slack | Community Meeting | Roadmap

Updates

[2026/05] 🔥 Agentic workload benchmark on AMD MI300X (blog).
[2026/04] 🔥 LMCache's new multiprocess (MP) architecture release (blog).
[2026/03] LMCache at GTC 2026 (post).
[2026/01] LMCache multi-node P2P CPU memory sharing, from experimental feature to production (blog).

[2025/11] LMCache x CoreWeave accelerate efficient LLM inference for Cohere (blog).
[2025/10] LMCache joins the PyTorch Foundation and Tensormesh unveiled (blog, PyTorch).
[2025/09] NVIDIA Dynamo integrates LMCache, accelerating LLM inference (blog).
[2025/08] 🎉 LMCache hits 5,000+ GitHub stars (blog).
[2025/08] LMCache supports gpt-oss (20B/120B) on day 1 (blog).
[2025/07] Get faster LLM inference and cheaper responses with LMCache and Redis (Redis blog).
[2025/07] LMCache extends its turbo-boost to multimodal models in vLLM V1 (blog).
[2025/06] LLM Production Stack goes cross-hardware: AMD, Arm and Ascend (blog).

About

LMCache is a KV cache management layer for LLM inference. It turns KV cache from a temporary state into reusable AI-native knowledge that can be stored persistently, reused across multiple serving engines, monitored with an observability stack, and transformed for better generation quality. As a result, LMCache reduces TTFT (time-to-first-token) and improves throughput, especially for long-context agentic, multi-turn conversation, and knowledge-augmented workloads (e.g., RAG).

LMCache is vendor-neutral. It can be used as a KV cache layer for a range of mainstream open-source serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers. The vendor neutrality allows users to freely switch between serving engines and storage vendors, while reusing the stored KV caches.

LMCache Deployment Modes

Key features

Engine-independent deployment: LMCache, as a standalone daemon process, manages KV cache independently from the inference engine process, so that KV cache will not be lost even if the inference engine crashes (i.e., no fate-sharing with engines).
Persistent, tiered KV cache offloading and reuse: Move KV caches out of GPU memory into a tiered storage hierarchy spanning CPU memory, local storage, and remote backends, enabling reuse across requests, sessions, and engine instances to reduce repeated prefill computation and improve TTFT.
Production-level KV cache observability: LMCache provides a rich set of KV cache observability metrics, including typical Kubernetes metrics (health monitoring, performance diagnostics), KV-cache-specific metrics (request-level and token-level prefix cache hits, lifecycle, request-level KV cache performance), management metrics (user-specific usage), and more.
Pluggable storage and transport backends: Easily integrate remote storage and KV transfer backends through a unified interface, enabling KV cache offloading and sharing across storage providers. Through this interface, LMCache supports storage backends including CPU RAM, local disk (SSD), Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GDS.
Non-prefix KV reuse: Extend KV reuse beyond prefix caching by reusing cached KV blocks at any position in the prompt. This leverages CacheBlend to selectively recompute tokens for quality recovery.
PD disaggregation and KV transfer: Support KV cache transfer from prefill workers to decode workers over NVLink, RDMA, or TCP through transport layers such as NIXL.
Pluggable KV transformation: Implement compression, token dropping, and custom serialization through a flexible SERDE (serializer/deserializer) interface that is kept simple for researchers to extend.
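To make the tiered offloading and prefix reuse described above more concrete, here is a minimal, self-contained sketch. It is not LMCache's implementation or API: `Tier`, `TieredCache`, `chunk_keys`, and the chunk size are hypothetical names and values chosen for illustration, and plain bytes stand in for KV tensors. The idea it shows is that each chunk of tokens is keyed by a hash of the whole prefix up to that chunk, entries evicted from a fast tier are demoted to a slower one, and a lookup reports how many leading tokens could skip prefill.

```python
import hashlib
from collections import OrderedDict
from typing import List, Optional, Sequence, Tuple

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for the example


def chunk_keys(tokens: Sequence[int], chunk: int = CHUNK_TOKENS) -> List[str]:
    """One key per full chunk; each key hashes the entire prefix up to that chunk."""
    keys = []
    h = hashlib.sha256()
    for end in range(chunk, len(tokens) + 1, chunk):
        h.update(",".join(map(str, tokens[end - chunk:end])).encode() + b";")
        keys.append(h.copy().hexdigest())
    return keys


class Tier:
    """A bounded LRU store standing in for one storage tier (e.g. CPU RAM, disk)."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key: str, value: bytes) -> Optional[Tuple[str, bytes]]:
        """Store the value; return the evicted (key, value) if the tier overflowed."""
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            return self.items.popitem(last=False)  # evict least recently used
        return None


class TieredCache:
    """Tiers ordered fastest first; evictions are demoted to the next tier."""

    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers

    def put(self, key: str, value: bytes, level: int = 0) -> None:
        evicted = self.tiers[level].put(key, value)
        if evicted is not None and level + 1 < len(self.tiers):
            self.put(evicted[0], evicted[1], level + 1)

    def get(self, key: str) -> Optional[Tuple[str, bytes]]:
        for level, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if level > 0:  # promote a hit from a slower tier to the fastest one
                    del tier.items[key]
                    self.put(key, value)
                return tier.name, value
        return None

    def longest_prefix_hit(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV chunks are already cached."""
        hit = 0
        for i, key in enumerate(chunk_keys(tokens)):
            if self.get(key) is None:
                break
            hit = (i + 1) * CHUNK_TOKENS
        return hit


if __name__ == "__main__":
    cache = TieredCache([Tier("cpu", 2), Tier("disk", 8)])
    prompt = list(range(12))
    for key in chunk_keys(prompt):
        cache.put(key, b"kv-bytes")  # placeholder payload instead of real KV tensors
    # A follow-up request sharing the same 12-token prefix can reuse all three chunks.
    print(cache.longest_prefix_hit(prompt + [99, 100]))
```

Because each key covers the full prefix, a changed token early in the prompt invalidates every later chunk, which is the limitation that non-prefix reuse is meant to address.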

LMCache is becoming an integral layer in the LLM inference ecosystem, with community-driven integration with serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers:

LMCache ecosystem

Getting Started

To use LMCache, install the lmcache package with your package manager, for example pip:

pip install lmcache
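After installing, a quick way to confirm that the package is visible to the current Python environment is to query its installed version. The sketch below uses only the standard library (`importlib.metadata`); the helper name `installed_version` is our own, and it assumes the distribution is registered under the name `lmcache`, as the pip command above suggests.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "lmcache") -> Optional[str]:
    """Return the installed version of the package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version:
        print(f"lmcache {version} is installed")
    else:
        print("lmcache is not installed in this environment")
```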

For more setup options and examples, see:

- Installation
- Quickstart
- LMCache Recipes
- CLI Reference
- Benchmarking Guide
- Production Deployment

Contributing

We welcome and value contributions and collaborations. Join us in improving LMCache. Check out the Contributing Guide or join our Slack community to get started.

Adoption and Partnerships

LMCache has a growing community of developers, researchers, industry adopters, and partners building the next generation of efficient LLM inference systems.

<img alt="LMCache Adoption and Partnerships" src="https://github.com/LMCache/LMCache/raw/v0.5.0-cu129/asset/partner_light.png">

As an independent open-source project, LMCache is becoming the de facto standard for KV cache management in LLM inference. Its continued development and community work are supported in part by Tensormesh.

Citation

LMCache builds on research in KV cache management, including cache reuse, offloading, compression, and serving optimization. If you use LMCache in your research, please cite the LMCache paper and related work.

@article{cheng2025lmcache,
  title={LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference},
  author={Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen},
  journal={arXiv preprint arXiv:2510.09665},
  year={2025}
}

License

The LMCache codebase is licensed under Apache License 2.0. See the LICENSE file for details.

Core symbols (most depended-on inside this repo)

publish · called by 455 · lmcache/v1/mp_observability/event_bus.py

debug · called by 401 · lmcache/v1/multiprocess/modules/management.py

get · called by 314 · lmcache/v1/compute/blend/utils.py, lmcache/v1/kv_codec/asym_k16_v8.py

pop · called by 241 · lmcache/v1/storage_backend/nixl_storage_backend.py

data_ptr · called by 211 · lmcache/v1/memory_management.py

tensor · called by 185 · lmcache/v1/memory_management.py

Shape

Method 9,413

Function 3,958

Class 2,008

Route 181

Struct 36

Languages

Python 97%

Go 3%

TypeScript 1%

Modules (by API surface)

lmcache/v1/memory_management.py · 270 symbols

lmcache/v1/cache_controller/message.py · 123 symbols

tests/v1/distributed/test_native_connector_l2_adapter.py · 119 symbols

lmcache/v1/storage_backend/nixl_storage_backend.py · 119 symbols

tests/v1/distributed/test_l2_adapter_factory.py · 102 symbols

tests/v1/distributed/test_l1_manager.py · 95 symbols

tests/test_utils.py · 95 symbols

tests/v1/test_config.py · 89 symbols

tests/v1/native_storage_ops/test_bitmap.py · 87 symbols

tests/conftest.py · 84 symbols

lmcache/observability.py · 81 symbols

tests/v1/storage_backend/test_gds_backend.py · 80 symbols

Dependencies (from manifests, versioned)

cel.dev/expr v0.24.0 · 1×

github.com/Masterminds/semver/v3 v3.4.0 · 1×

github.com/antlr4-go/antlr/v4 v4.13.0 · 1×

github.com/beorn7/perks v1.0.1 · 1×

github.com/blang/semver/v4 v4.0.0 · 1×

github.com/cenkalti/backoff/v4 v4.3.0 · 1×

github.com/cespare/xxhash/v2 v2.3.0 · 1×

github.com/davecgh/go-spew v1.1.1 · 1×

github.com/emicklei/go-restful/v3 v3.12.2 · 1×

github.com/evanphx/json-patch/v5 v5.9.11 · 1×

github.com/felixge/httpsnoop v1.0.4 · 1×

github.com/fsnotify/fsnotify v1.9.0 · 1×

For agents

$ claude mcp add LMCache \
  -- python -m otcore.mcp_server <graph>
