Class KvCacheConfig

github.com/NVIDIA/TensorRT-LLM · tensorrt_llm/llmapi/llm_args.py:1628–1772

Configuration for the KV cache.

@PybindMirror.mirror_pybind_fields(_KvCacheConfig)
class KvCacheConfig(StrictBaseModel, PybindMirror):
    """
    Configuration for the KV cache.
    """
    enable_block_reuse: bool = Field(
        default=True,
        description=
        "Controls if KV cache blocks can be reused for different requests.")
    max_tokens: Optional[int] = Field(
        default=None,
        description=
        "The maximum number of tokens that should be stored in the KV cache. If both `max_tokens` and `free_gpu_memory_fraction` are specified, memory corresponding to the minimum will be used."
    )
    max_attention_window: Optional[List[int]] = Field(
        default=None,
        description=
        "Size of the attention window for each sequence. Only the last tokens will be stored in the KV cache. If the number of elements in `max_attention_window` is less than the number of layers, `max_attention_window` will be repeated multiple times to the number of layers."
    )
    sink_token_length: Optional[int] = Field(
        default=None,
        description=
        "Number of sink tokens (tokens to always keep in attention window).")
    free_gpu_memory_fraction: Optional[float] = Field(
        default=0.9,
        description=
        "The fraction of GPU memory fraction that should be allocated for the KV cache. Default is 90%. If both `max_tokens` and `free_gpu_memory_fraction` are specified, memory corresponding to the minimum will be used."
    )
    host_cache_size: Optional[int] = Field(
        default=None,
        description=
        "Size of the host cache in bytes. If both `max_tokens` and `host_cache_size` are specified, memory corresponding to the minimum will be used."
    )
    onboard_blocks: bool = Field(
        default=True, description="Controls if blocks are onboarded.")
    cross_kv_cache_fraction: Optional[float] = Field(
        default=None,
        description=
        "The fraction of the KV Cache memory should be reserved for cross attention. If set to p, self attention will use 1-p of KV Cache memory and cross attention will use p of KV Cache memory. Default is 50%. Should only be set when using encoder-decoder model."
    )
    secondary_offload_min_priority: Optional[int] = Field(
        default=None,
        description=
        "Only blocks with priority > mSecondaryOfflineMinPriority can be offloaded to secondary memory."
    )
    event_buffer_max_size: int = Field(
        default=0,
        description=
        "Maximum size of the event buffer. If set to 0, the event buffer will not be used."
    )
    attention_dp_events_gather_period_ms: int = Field(
        default=5,
        description=
        "The period in milliseconds to gather attention DP events across ranks."
    )
    enable_partial_reuse: bool = Field(
        default=True,
        description=
        "Whether blocks that are only partially matched can be reused.")
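
Three of the field descriptions above state small sizing rules: the smaller of `max_tokens` and `free_gpu_memory_fraction` wins, `max_attention_window` is repeated to cover every layer, and `cross_kv_cache_fraction` splits the cache between self and cross attention. The sketch below restates those rules in plain Python so they can be checked by hand. It is not TensorRT-LLM code. The function names and the `free_gpu_bytes`, `bytes_per_token`, and `total_blocks` parameters are hypothetical, and the cyclic repetition order and the rounding are assumptions, since the descriptions do not specify them.

```python
from typing import List, Optional, Tuple


def expand_attention_windows(windows: List[int], num_layers: int) -> List[int]:
    """Repeat `windows` until there is one entry per layer.

    Assumption: repetition is cyclic, so layer i gets windows[i % len(windows)].
    """
    if not windows:
        raise ValueError("windows must not be empty")
    return [windows[i % len(windows)] for i in range(num_layers)]


def effective_max_tokens(max_tokens: Optional[int],
                         free_gpu_memory_fraction: Optional[float],
                         free_gpu_bytes: int,
                         bytes_per_token: int) -> int:
    """Token budget when one or both limits are set; the smaller one wins.

    `free_gpu_bytes` and `bytes_per_token` are hypothetical inputs used only
    to turn the memory fraction into a token count for this illustration.
    """
    candidates = []
    if max_tokens is not None:
        candidates.append(max_tokens)
    if free_gpu_memory_fraction is not None:
        budget_bytes = int(free_gpu_bytes * free_gpu_memory_fraction)
        candidates.append(budget_bytes // bytes_per_token)
    if not candidates:
        raise ValueError("at least one limit must be set")
    return min(candidates)


def split_cross_attention(total_blocks: int,
                          cross_kv_cache_fraction: float) -> Tuple[int, int]:
    """Return (self_attention_blocks, cross_attention_blocks) for fraction p.

    Assumption: the cross-attention share is rounded down to whole blocks.
    """
    cross = int(total_blocks * cross_kv_cache_fraction)
    return total_blocks - cross, cross


if __name__ == "__main__":
    print(expand_attention_windows([4096, 1024], 5))
    print(effective_max_tokens(8192, 0.5, 1_000_000, 100))
    print(split_cross_attention(1000, 0.25))
```

In this sketch, a two-element window list applied to five layers alternates between the two sizes, and a `max_tokens` of 8192 is overridden by the memory fraction whenever the fraction allows fewer tokens.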

Callers 15

test_eagle3_output_consistency_4gpus · Function · 0.90

model_fn · Function · 0.90

test_connector_multi_request · Function · 0.90

test_llm_args_type_default · Method · 0.90

test_llm_args_type_tensorrt · Method · 0.90

verify_disaggregated · Function · 0.90

test_disaggregated_llama_context_capacity · Function · 0.90

test_disaggregated_spec_dec_batch_slot_limit · Function · 0.90

test_pp2_ray · Method · 0.90

test_fp8 · Method · 0.90

test_fp8_4gpus · Method · 0.90

test_eagle3 · Method · 0.90

Calls 1

Field · Function · 0.85

Tested by 15

test_eagle3_output_consistency_4gpus · Function · 0.72

model_fn · Function · 0.72

test_connector_multi_request · Function · 0.72

test_llm_args_type_default · Method · 0.72

test_llm_args_type_tensorrt · Method · 0.72

verify_disaggregated · Function · 0.72

test_disaggregated_llama_context_capacity · Function · 0.72

test_disaggregated_spec_dec_batch_slot_limit · Function · 0.72

test_pp2_ray · Method · 0.72

test_fp8 · Method · 0.72

test_fp8_4gpus · Method · 0.72

test_eagle3 · Method · 0.72