def dynamic_quantize(
        x: Tensor,
        double_scale: Tensor,
        axis: int = -1,
        block_size: int = 16,
        data_qtype: trt.DataType = trt.fp4,
        scale_qtype: trt.DataType = trt.fp8) -> Tuple[Tensor, Tensor]:
    '''
    Parameters:
        x : Tensor (On GPU)
            The input tensor.
        double_scale : Tensor (On GPU)
            The global per-tensor scaling factor. It should contain only 1 element.
        axis : int
            The axis to quantize. Default is -1 (the last axis).
        block_size : int
            The block size for quantization. Default is 16.
        data_qtype : trt.DataType
            The data type for quantized data. Default is FP4.
        scale_qtype : trt.DataType
            The data type for block scale. Default is FP8.
    Returns:
        A tuple of two tensors: quantized tensor and block scale tensor.
    '''
    if axis < 0:
        axis = len(x.shape) + axis
    dynq = default_trtnet().add_dynamic_quantize(x.trt_tensor, axis, block_size,
                                                 data_qtype, scale_qtype)
    dynq.set_input(1, double_scale.trt_tensor)
    quantized = _create_tensor(dynq.get_output(0), dynq)
    scale = _create_tensor(dynq.get_output(1), dynq)
    return quantized, scale


def block_double_dequantize(x: Tensor,
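The docstring of `dynamic_quantize` describes a two-level scheme: one scale per block of `block_size` elements along `axis`, plus a single global `double_scale`. The pure-Python sketch below illustrates that idea on one row of values. It is not the TensorRT kernel. The names `FP4_GRID`, `round_to_fp4` and `blockwise_quantize_sketch` are hypothetical, the formula `amax / 6.0` for the block scale is an assumption based on the largest FP4 (E2M1) magnitude, and the cast of the stored scale to FP8 is skipped.

```python
from typing import List, Tuple

# Magnitudes representable in FP4 (E2M1); 6.0 is the largest.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0


def round_to_fp4(value: float) -> float:
    """Snap a value to the nearest FP4 magnitude, keeping its sign (hypothetical helper)."""
    magnitude = min(FP4_GRID, key=lambda g: abs(g - abs(value)))
    return -magnitude if value < 0 else magnitude


def blockwise_quantize_sketch(
        row: List[float],
        double_scale: float,
        block_size: int = 16) -> Tuple[List[float], List[float]]:
    """Quantize one row block by block; returns (quantized, block_scales)."""
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of block_size")
    quantized: List[float] = []
    block_scales: List[float] = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Assumed rule: the per-block scale maps the largest magnitude onto FP4_MAX.
        scale = amax / FP4_MAX if amax > 0 else 1.0
        # The stored scale is expressed relative to the global factor.
        # A real kernel would also cast it to scale_qtype (FP8); skipped here.
        block_scales.append(scale / double_scale)
        quantized.extend(round_to_fp4(v / scale) for v in block)
    return quantized, block_scales


row = [float(i - 16) for i in range(32)]  # two blocks of 16 values
q, s = blockwise_quantize_sketch(row, double_scale=0.5)
print(len(q), len(s))  # prints: 32 2
```

Under these assumptions the quantized output has as many elements as the input, and there is one scale per block, which matches the two tensors the function returns. An approximate value is recovered as `q[i] * s[i // block_size] * double_scale`.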