Add an operation that performs a collective all-reduce. Let 'world_size' be the length of the 'group' list. This function creates a layer that computes the sum of 'world_size' tensors distributed among the 'world_size' participating ranks (one GPU per rank). The list 'group' contains the identifiers of the ranks participating in the collective operation.
allreduce(
    tensor: Tensor,
    group: List[int],
    all_reduce_params: Optional[AllReduceParams] = AllReduceParams()
) -> Tensor
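
To make the semantics concrete, here is a minimal sketch of what a sum all-reduce computes, simulated on the host with NumPy. It is not the TensorRT-LLM plugin or NCCL: the names simulate_allreduce and rank_tensors are hypothetical and exist only for this illustration, and a dict keyed by rank id stands in for the per-GPU tensors.

```python
import numpy as np


def simulate_allreduce(rank_tensors, group):
    """Simulate a sum all-reduce over the ranks listed in `group`.

    rank_tensors: dict mapping a rank id to a 1D NumPy array. This is a
        hypothetical stand-in for the tensor held by each GPU.
    group: list of rank ids taking part in the collective.

    Returns a dict mapping every participating rank to its own copy of
    the element-wise sum, mirroring the "replicated output" behavior.
    """
    shapes = {rank_tensors[rank].shape for rank in group}
    if len(shapes) != 1:
        raise ValueError("all participating ranks must hold tensors of the same shape")

    # Element-wise sum of the world_size tensors (world_size == len(group)).
    total = np.sum([rank_tensors[rank] for rank in group], axis=0)

    # Each rank receives an independent copy of the reduced tensor.
    return {rank: total.copy() for rank in group}


inputs = {
    0: np.array([1.0, 2.0]),
    1: np.array([10.0, 20.0]),
    2: np.array([100.0, 200.0]),
}
outputs = simulate_allreduce(inputs, group=[0, 1, 2])
```

In this sketch every rank in the group ends up with the same summed tensor, with the same shape as its input. The real operation performs the reduction across GPUs rather than in a single process.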

def allreduce(
    tensor: Tensor,
    group: List[int],
    all_reduce_params: Optional[AllReduceParams] = AllReduceParams()
) -> Tensor:
    '''
    Add an operation that performs a collective all-reduce.

    Let 'world_size' be the length of the 'group' list. This function
    creates a layer that computes the sum of 'world_size' tensors distributed
    among the 'world_size' participating ranks (one GPU per rank).

    The list 'group' contains the identifiers of the ranks participating in
    the collective operation.

    The tensors in the different ranks must be 1D tensors (or views) and the output
    tensor will have that same shape. The output tensor will be replicated on
    the 'world_size' ranks.

    This operation is implemented using a plugin that wraps the NCCL all-reduce
    collective operation. See
    https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allreduce
    for details.

    Parameters:
        tensor : Tensor
            The input tensor.

        group : List[int]
            The ranks participating in the all-reduce operation.

        all_reduce_params : Optional[AllReduceParams]
            The all-reduce parameters, including the strategy. If None, a
            default AllReduceParams() is used.
            The NCCL strategy delegates the all-reduce to NCCL, while ONESHOT and
            TWOSHOT are custom latency-optimal algorithms.
            AUTO chooses amongst the three based on a message-size heuristic.

    Returns:
        The tensor produced by that layer.
    '''

    global allreduce_ub_counter
    allreduce_ub_counter += 1

    if all_reduce_params is None:
        all_reduce_params = AllReduceParams()
    all_reduce_params.update_strategy()

    # TODO(TRTLLM-996): remove this WAR when custom allreduce is supported
    # for encoder models in C++ runtime.
    workspace = None
    if all_reduce_params.strategy != AllReduceStrategy.NCCL and all_reduce_params.strategy != AllReduceStrategy.UB:
        if current_all_reduce_helper().workspace is None:
            all_reduce_params.strategy = AllReduceStrategy.NCCL_SYMMETRIC
        else:
            workspace = current_all_reduce_helper().workspace.trt_tensor
    if all_reduce_params.strategy == AllReduceStrategy.UB:
        tensor.mark_output("allreduce_ub_0_" + str(allreduce_ub_counter))
    dtype = default_net().plugin_config.nccl_plugin
    layer, allreduce_plg_creator, pfc = create_allreduce_plugin(