MCPcopy
hub / github.com/NVIDIA/TensorRT-LLM / forward

Method forward

tensorrt_llm/quantization/layers.py:670–694  ·  view source on GitHub ↗
(self, x, lora_runtime_params=None, all_reduce_params=None)

Source from the content-addressed store, hash-verified

668 self.tllm_to_externel_key_dict = {"weight": ["weight", "weight_scale"]}
669
670 def forward(self, x, lora_runtime_params=None, all_reduce_params=None):
671 assert lora_runtime_params is None, "lora is not supported on SmoothQuantRowLinear now"
672 x, per_token_scale = x
673 x = fp8_rowwise_gemm(x, self.weight.value, per_token_scale,
674 self.per_channel_scale.value,
675 self.quant_mode.has_fp8_rowwise(),
676 self.quant_mode.has_fp8_rowwise())
677
678 if self.tp_size > 1 and self.tp_group is not None:
679 need_bias = self.bias is not None
680 fuse_bias_into_all_reduce = need_bias and (
681 all_reduce_params
682 is not None) and (all_reduce_params.fusion_op
683 == AllReduceFusionOp.RESIDUAL_RMS_NORM)
684 if fuse_bias_into_all_reduce:
685 all_reduce_params.bias = self.bias.value
686 x = allreduce(x, self.tp_group, all_reduce_params=all_reduce_params)
687 if need_bias and not fuse_bias_into_all_reduce:
688 x = x + self.bias.value
689 return x
690
691 if self.bias is not None:
692 x = x + self.bias.value
693
694 return x
695
696 def postprocess(self, tllm_key, weights, **kwargs):
697 if "per_channel_scale" in tllm_key:

Callers

nothing calls this directly

Calls 3

fp8_rowwise_gemmFunction · 0.70
allreduceFunction · 0.50
has_fp8_rowwiseMethod · 0.45

Tested by

no test coverage detected