hub / github.com/NVIDIA/TensorRT-LLM / dynamic_quantize

Function dynamic_quantize

tensorrt_llm/quantization/functional.py:1396–1427


(
        x: Tensor,
        double_scale: Tensor,
        axis: int = -1,
        block_size: int = 16,
        data_qtype: trt.DataType = trt.fp4,
        scale_qtype: trt.DataType = trt.fp8)
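
The signature implies a relationship between the input shape, `axis`, and `block_size`. The following is a minimal sketch of that bookkeeping in plain Python. It assumes (this page does not confirm it) that the block scale tensor holds one entry per block along the quantized axis and that the axis length must be a multiple of `block_size`. The helper name `expected_shapes` is hypothetical and is not part of TensorRT-LLM.

```python
from typing import Sequence, Tuple


def expected_shapes(input_shape: Sequence[int],
                    axis: int = -1,
                    block_size: int = 16) -> Tuple[int, Tuple[int, ...], Tuple[int, ...]]:
    """Hypothetical helper: returns (normalized axis, data shape, scale shape).

    Assumption: one scale per block along `axis`, so the scale tensor has
    the input shape with that dimension divided by `block_size`.
    """
    rank = len(input_shape)
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} is out of range for rank {rank}")
    # Same normalization that dynamic_quantize applies to negative axes.
    if axis < 0:
        axis = rank + axis
    if input_shape[axis] % block_size != 0:
        raise ValueError("axis length must be a multiple of block_size")
    scale_shape = list(input_shape)
    scale_shape[axis] //= block_size
    return axis, tuple(input_shape), tuple(scale_shape)
```

For example, under these assumptions `expected_shapes((4, 64))` normalizes `axis=-1` to `1` and yields a scale shape with four blocks per row.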

Source from the content-addressed store, hash-verified

def dynamic_quantize(
        x: Tensor,
        double_scale: Tensor,
        axis: int = -1,
        block_size: int = 16,
        data_qtype: trt.DataType = trt.fp4,
        scale_qtype: trt.DataType = trt.fp8) -> Tuple[Tensor, Tensor]:
    '''
    Parameters:
        x : Tensor (On GPU)
            The input tensor.
        double_scale : Tensor (On GPU)
            The global per-tensor scaling factor. It should contain only 1 element.
        axis : int
            The axis to quantize. Default is -1 (the last axis).
        block_size : int
            The block size for quantization. Default is 16.
        data_qtype : trt.DataType
            The data type for quantized data. Default is FP4.
        scale_qtype : trt.DataType
            The data type for block scale. Default is FP8.
    Returns:
        A tuple of two tensors: quantized tensor and block scale tensor.
    '''
    # Convert a negative axis into the equivalent non-negative index.
    if axis < 0:
        axis = len(x.shape) + axis
    dynq = default_trtnet().add_dynamic_quantize(x.trt_tensor, axis, block_size,
                                                 data_qtype, scale_qtype)
    # The global scale is attached as the layer's second input (index 1).
    dynq.set_input(1, double_scale.trt_tensor)
    # Output 0 is the quantized data; output 1 is the per-block scale.
    quantized = _create_tensor(dynq.get_output(0), dynq)
    scale = _create_tensor(dynq.get_output(1), dynq)
    return quantized, scale
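
The function itself only wires up a TensorRT layer, so the arithmetic behind "quantized tensor and block scale tensor" is not visible in the source. The NumPy sketch below illustrates one plausible two-level block quantization scheme. It is an emulation under stated assumptions, not the TensorRT kernel: it assumes the FP4 type is E2M1 (largest magnitude 6), that each block scale is stored relative to the global scale, and it leaves the block scale unrounded rather than casting it to FP8. The name `block_quantize_reference` and the scale convention are hypothetical.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is handled separately).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0


def block_quantize_reference(x, global_scale, axis=-1, block_size=16):
    """Hypothetical NumPy emulation of two-level block quantization.

    Returns (quantized values snapped to the E2M1 grid, per-block scales).
    Assumed convention: x ~= quantized * block_scale * global_scale.
    """
    if axis < 0:
        axis = x.ndim + axis
    if x.shape[axis] % block_size != 0:
        raise ValueError("axis length must be a multiple of block_size")

    # Move the quantized axis last and split it into blocks.
    moved = np.moveaxis(x, axis, -1)
    blocks = moved.reshape(*moved.shape[:-1], -1, block_size)

    # One scale per block, stored relative to the global scale (assumption).
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    block_scale = amax / FP4_MAX / global_scale

    # Divide by the effective scale; all-zero blocks keep a divisor of 1.
    effective = block_scale * global_scale
    scaled = blocks / np.where(effective == 0, 1.0, effective)

    # Snap each magnitude to the nearest representable E2M1 value.
    nearest = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    snapped = np.sign(scaled) * FP4_E2M1_GRID[nearest]

    quantized = np.moveaxis(snapped.reshape(moved.shape), -1, axis)
    scales = np.moveaxis(block_scale[..., 0], -1, axis)
    return quantized, scales
```

Multiplying the quantized values by the matching block scale and the global scale gives an approximate reconstruction of the input, which is the role a dequantization step would play under the same assumptions.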

Callers 2

forward · Method · 0.85

Calls 3

default_trtnet · Function · 0.85

_create_tensor · Function · 0.85

get_output · Method · 0.45

Tested by

no test coverage detected