hub / github.com/NVIDIA/TensorRT-LLM / allreduce

Function allreduce

tensorrt_llm/functional.py:4048–4141

Add an operation that performs a collective all-reduce. Let 'world_size' be the length of the 'group' list. This function creates a layer that computes the sum of 'world_size' tensors distributed among the 'world_size' participating ranks (one GPU per rank). The list 'group' contains the identifiers of the ranks participating in the collective operation.

(
    tensor: Tensor,
    group: List[int],
    all_reduce_params: Optional[AllReduceParams] = AllReduceParams()
)
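The reduction described above can be illustrated without any GPU. The following is a minimal sketch of the semantics only, assuming plain Python lists stand in for 1D tensors; `simulate_allreduce_sum` and the `per_rank` dictionary are hypothetical names for illustration and are not part of TensorRT-LLM or NCCL.

```python
from typing import Dict, List


def simulate_allreduce_sum(per_rank: Dict[int, List[float]],
                           group: List[int]) -> Dict[int, List[float]]:
    """Reference semantics only: element-wise sum over the ranks in
    `group`, with the result replicated on each of those ranks."""
    if not group:
        raise ValueError("group must contain at least one rank")
    length = len(per_rank[group[0]])
    if any(len(per_rank[r]) != length for r in group):
        raise ValueError("all participating tensors must have the same shape")
    # Sum position i across every participating rank.
    total = [sum(per_rank[r][i] for r in group) for i in range(length)]
    # Each participating rank receives its own copy of the reduced tensor.
    return {r: list(total) for r in group}


# Three ranks (world_size == 3), each holding a 1D tensor of the same shape.
inputs = {0: [1.0, 2.0], 1: [10.0, 20.0], 2: [100.0, 200.0]}
outputs = simulate_allreduce_sum(inputs, group=[0, 1, 2])
```

After the call, every rank in `group` holds the same summed tensor, which matches the "replicated on the 'world_size' ranks" behaviour stated in the docstring. The real function builds a network layer backed by a plugin rather than computing anything eagerly.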

Source from the content-addressed store, hash-verified

def allreduce(
    tensor: Tensor,
    group: List[int],
    all_reduce_params: Optional[AllReduceParams] = AllReduceParams()
) -> Tensor:
    '''
    Add an operation that performs a collective all-reduce.

    Let's define 'world_size' as the length of the 'group' list. That functions
    creates a layer to compute the sum of 'world_size' tensors distributed
    amongst the 'world_size' participating ranks (one GPU per rank).

    The list 'group' contains the identifiers of the ranks participating into
    the collective operation.

    The tensors in the different ranks must be 1D tensors (or views) and the output
    tensor will have that same shape. The output tensor will be replicated on
    the 'world_size' ranks.

    That operation is implemented using a plugin that wraps the NCCL all-reduce
    collective operation. See
    https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allreduce
    for details.

    Parameters:
        tensor : Tensor
            The input tensor.

        group : List[int]
            The ranks participating into the all-reduce operation.

        strategy: AllReduceStrategy
            NCCL delegates all-reduce to NCCL while ONESHOT and TWOSHOT are custom latency-optimal algorithms.
            AUTO chooses amongst the three based on a message-size heuristic.

    Returns:
        The tensor produced by that layer.
    '''

    global allreduce_ub_counter
    allreduce_ub_counter += 1

    if all_reduce_params is None:
        all_reduce_params = AllReduceParams()
    all_reduce_params.update_strategy()

    # TODO(TRTLLM-996): remove this WAR when custom allreduce is supported
    # for encoder models in C++ runtime.
    workspace = None
    if all_reduce_params.strategy != AllReduceStrategy.NCCL and all_reduce_params.strategy != AllReduceStrategy.UB:
        if current_all_reduce_helper().workspace is None:
            all_reduce_params.strategy = AllReduceStrategy.NCCL_SYMMETRIC
        else:
            workspace = current_all_reduce_helper().workspace.trt_tensor
    if all_reduce_params.strategy == AllReduceStrategy.UB:
        tensor.mark_output("allreduce_ub_0_" + str(allreduce_ub_counter))
    dtype = default_net().plugin_config.nccl_plugin
    layer, allreduce_plg_creator, pfc = create_allreduce_plugin(
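The strategy and workspace branch in the excerpt can be read in isolation. The sketch below restates that branch as a pure function, assuming a stand-in `Strategy` enum whose member names are taken from the excerpt and docstring; the enum values, the `resolve_strategy` name, and the tuple return are hypothetical and do not reflect the library's actual API.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class Strategy(Enum):
    """Stand-in for AllReduceStrategy; the member values are arbitrary."""
    NCCL = auto()
    ONESHOT = auto()
    TWOSHOT = auto()
    AUTO = auto()
    UB = auto()
    NCCL_SYMMETRIC = auto()


def resolve_strategy(requested: Strategy,
                     workspace: Optional[object]
                     ) -> Tuple[Strategy, Optional[object]]:
    """Return (effective strategy, workspace to pass on), following the
    branch structure shown in the excerpt."""
    if requested not in (Strategy.NCCL, Strategy.UB):
        if workspace is None:
            # No workspace available: the excerpt switches the strategy
            # to NCCL_SYMMETRIC and passes no workspace.
            return Strategy.NCCL_SYMMETRIC, None
        # A workspace exists, so the requested strategy keeps it.
        return requested, workspace
    # NCCL and UB leave the workspace unset in the excerpt.
    return requested, None


fallback = resolve_strategy(Strategy.ONESHOT, None)
custom = resolve_strategy(Strategy.TWOSHOT, "workspace-handle")
plain_nccl = resolve_strategy(Strategy.NCCL, "workspace-handle")
```

Here `"workspace-handle"` is a placeholder for whatever `current_all_reduce_helper().workspace.trt_tensor` returns in the real code. The sketch omits the `mark_output` call made for the UB strategy and everything after the truncated `create_allreduce_plugin(` call.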

Callers 8

test_allreduce · Method · 0.90
test_allreduce · Method · 0.90
forward_allreduce · Function · 0.90
test_allreduce · Method · 0.90
embedding · Function · 0.70
func · Function · 0.50
func · Function · 0.50
calc_fused_allreduce · Function · 0.50

Calls 13

update_strategy · Method · 0.95
has_scale · Method · 0.95
AllReduceParams · Class · 0.85
default_net · Function · 0.85
create_allreduce_plugin · Function · 0.85
default_trtnet · Function · 0.85
str_dtype_to_trt · Function · 0.85
_add_plugin_info · Function · 0.85
_create_tensor · Function · 0.85
mark_output · Method · 0.80
cast · Method · 0.80

Tested by 6

test_allreduce · Method · 0.72
test_allreduce · Method · 0.72
forward_allreduce · Function · 0.72
test_allreduce · Method · 0.72
func · Function · 0.40
calc_fused_allreduce · Function · 0.40