hub / github.com/NVIDIA/TensorRT-LLM / kv_cache_quantize

Function kv_cache_quantize

tensorrt_llm/quantization/quantize.py:556–566
(model)

Source from the content-addressed store, hash-verified


# Now consider the kv cache is enabled for all layers
def kv_cache_quantize(model):
    for name, module in model.named_modules():
        if isinstance(module,
                      (Attention, SmoothQuantAttention, Fp8RowwiseAttention)):
            # for dequant
            module.kv_cache_scaling_factor = Parameter(shape=(1, ),
                                                       dtype='float32')
            # for quant
            module.kv_cache_rcp_scaling_factor = Parameter(shape=(1, ),
                                                           dtype='float32')
    return model


def quantize(model, quant_config: Union[QuantConfig, LayerQuantConfig]):
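
The comments in the listing mark one parameter "for dequant" and its reciprocal "for quant". A minimal sketch of how such a pair could be used, assuming symmetric int8 quantization with a per-tensor scale of amax / 127. That scheme, the sample values, and the helper names quantize_int8 and dequantize_int8 are illustrative assumptions, not taken from TensorRT-LLM:

```python
def quantize_int8(values, rcp_scale):
    # Quantize path: multiply by the reciprocal scale, round, clamp to int8.
    return [max(-128, min(127, round(v * rcp_scale))) for v in values]


def dequantize_int8(quantized, scale):
    # Dequantize path: multiply the stored integers by the scale.
    return [q * scale for q in quantized]


# Hypothetical KV-cache values; real values come from the attention layers.
values = [0.5, -1.27, 0.02, 1.0]

# Assumed calibration: map the largest magnitude onto the int8 limit.
amax = max(abs(v) for v in values)
scale = amax / 127.0      # plays the role of kv_cache_scaling_factor
rcp_scale = 1.0 / scale   # plays the role of kv_cache_rcp_scaling_factor

quantized = quantize_int8(values, rcp_scale)
restored = dequantize_int8(quantized, scale)

for original, approx in zip(values, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Storing the reciprocal alongside the scale lets the quantize path use a multiplication instead of a division, which is one plausible reason for keeping both parameters.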

Callers 1

quantize · Function · 0.85

Calls 2

Parameter · Class · 0.85
named_modules · Method · 0.80

Tested by

no test coverage detected
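
A test for this function could check that attention modules receive both parameters and that other modules are left alone. The sketch below does not import tensorrt_llm. FakeParameter, FakeAttention, FakeMLP, FakeModel and kv_cache_quantize_sketch are hypothetical stand-ins that mirror the listing, so it shows the shape of such a test rather than the project's actual test setup:

```python
from dataclasses import dataclass


@dataclass
class FakeParameter:
    # Stand-in for the library's Parameter class (assumption).
    shape: tuple
    dtype: str


class FakeAttention:
    # Stand-in for Attention and its quantized variants (assumption).
    pass


class FakeMLP:
    # A non-attention module that must stay untouched.
    pass


class FakeModel:
    def __init__(self):
        self.layers = {
            "layers.0.attention": FakeAttention(),
            "layers.0.mlp": FakeMLP(),
        }

    def named_modules(self):
        # Yields (name, module) pairs, like the method used in the listing.
        return iter(self.layers.items())


def kv_cache_quantize_sketch(model, attention_types=(FakeAttention,)):
    # Mirrors the listed function, with the stand-in types swapped in.
    for _name, module in model.named_modules():
        if isinstance(module, attention_types):
            module.kv_cache_scaling_factor = FakeParameter(
                shape=(1,), dtype="float32")
            module.kv_cache_rcp_scaling_factor = FakeParameter(
                shape=(1,), dtype="float32")
    return model


def test_attention_modules_get_scaling_parameters():
    source = FakeModel()
    model = kv_cache_quantize_sketch(source)
    attention = model.layers["layers.0.attention"]
    mlp = model.layers["layers.0.mlp"]

    assert model is source  # the model is modified in place and returned
    assert attention.kv_cache_scaling_factor.shape == (1,)
    assert attention.kv_cache_rcp_scaling_factor.dtype == "float32"
    assert not hasattr(mlp, "kv_cache_scaling_factor")


test_attention_modules_get_scaling_parameters()
```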