hub / github.com/NVIDIA/TensorRT-LLM / allreduce

Function allreduce

tensorrt_llm/functional.py:4048–4141

Add an operation that performs a collective all-reduce. Let 'world_size' be the length of the 'group' list. This function creates a layer that computes the sum of 'world_size' tensors distributed among the 'world_size' participating ranks (one GPU per rank). The list 'group' contains the identifiers of the ranks participating in the collective operation.

(
    tensor: Tensor,
    group: List[int],
    all_reduce_params: Optional[AllReduceParams] = AllReduceParams()
)
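The semantics described above (an element-wise sum across ranks, with the result replicated on every participating rank) can be illustrated without any GPU or NCCL dependency. The following is a minimal pure-Python sketch, not the TensorRT-LLM implementation: the helper name `simulate_allreduce` and the dictionary-of-lists representation of per-rank tensors are assumptions made purely for illustration.

```python
from typing import Dict, List


def simulate_allreduce(rank_tensors: Dict[int, List[float]],
                       group: List[int]) -> Dict[int, List[float]]:
    """Hypothetical CPU model of a sum all-reduce over the ranks in `group`."""
    inputs = [rank_tensors[rank] for rank in group]
    length = len(inputs[0])
    # Every rank must contribute a tensor of the same shape.
    if any(len(t) != length for t in inputs):
        raise ValueError("all ranks must hold tensors of the same shape")
    # Element-wise sum over the len(group) input tensors.
    total = [sum(t[i] for t in inputs) for i in range(length)]
    # The output is replicated: each participating rank gets its own copy.
    return {rank: list(total) for rank in group}


per_rank = {0: [1.0, 2.0], 1: [10.0, 20.0], 2: [100.0, 200.0]}
result = simulate_allreduce(per_rank, group=[0, 1, 2])
```

After the call, every rank in `group` holds the same summed tensor, and the output has the same shape as each input.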

Source from the content-addressed store, hash-verified

def allreduce(
    tensor: Tensor,
    group: List[int],
    all_reduce_params: Optional[AllReduceParams] = AllReduceParams()
) -> Tensor:
    '''
    Add an operation that performs a collective all-reduce.

    Let's define 'world_size' as the length of the 'group' list. This function
    creates a layer to compute the sum of 'world_size' tensors distributed
    amongst the 'world_size' participating ranks (one GPU per rank).

    The list 'group' contains the identifiers of the ranks participating in
    the collective operation.

    The tensors in the different ranks must be 1D tensors (or views) and the output
    tensor will have that same shape. The output tensor will be replicated on
    the 'world_size' ranks.

    That operation is implemented using a plugin that wraps the NCCL all-reduce
    collective operation. See
    https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allreduce
    for details.

    Parameters:
        tensor : Tensor
            The input tensor.

        group : List[int]
            The ranks participating in the all-reduce operation.

        strategy: AllReduceStrategy
            NCCL delegates all-reduce to NCCL while ONESHOT and TWOSHOT are custom latency-optimal algorithms.
            AUTO chooses amongst the three based on a message-size heuristic.

    Returns:
        The tensor produced by that layer.
    '''

    global allreduce_ub_counter
    allreduce_ub_counter += 1

    if all_reduce_params is None:
        all_reduce_params = AllReduceParams()
    all_reduce_params.update_strategy()

    # TODO(TRTLLM-996): remove this WAR when custom allreduce is supported
    # for encoder models in C++ runtime.
    workspace = None
    if all_reduce_params.strategy != AllReduceStrategy.NCCL and all_reduce_params.strategy != AllReduceStrategy.UB:
        if current_all_reduce_helper().workspace is None:
            all_reduce_params.strategy = AllReduceStrategy.NCCL_SYMMETRIC
        else:
            workspace = current_all_reduce_helper().workspace.trt_tensor
    if all_reduce_params.strategy == AllReduceStrategy.UB:
        tensor.mark_output("allreduce_ub_0_" + str(allreduce_ub_counter))
    dtype = default_net().plugin_config.nccl_plugin
    layer, allreduce_plg_creator, pfc = create_allreduce_plugin(
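The workspace fallback in the listing can be read in isolation: any strategy other than NCCL or UB needs a custom all-reduce workspace, and when none is available the strategy is switched to NCCL_SYMMETRIC. The sketch below models only that decision. The `Strategy` enum and the `resolve_strategy` helper are hypothetical stand-ins, not part of TensorRT-LLM; the member names mirror those that appear in the listing and its docstring, and the enum values are arbitrary.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class Strategy(Enum):
    """Hypothetical stand-in for AllReduceStrategy (values are arbitrary)."""
    NCCL = auto()
    UB = auto()
    ONESHOT = auto()
    TWOSHOT = auto()
    AUTO = auto()
    NCCL_SYMMETRIC = auto()


def resolve_strategy(requested: Strategy,
                     workspace: Optional[object]
                     ) -> Tuple[Strategy, Optional[object]]:
    """Return the (strategy, workspace) pair the listing would end up using."""
    if requested not in (Strategy.NCCL, Strategy.UB):
        if workspace is None:
            # No custom all-reduce workspace: fall back to NCCL_SYMMETRIC.
            return Strategy.NCCL_SYMMETRIC, None
        # Custom strategies keep the request and use the workspace.
        return requested, workspace
    # NCCL and UB never take the workspace in the listing.
    return requested, None


fallback = resolve_strategy(Strategy.ONESHOT, None)
custom = resolve_strategy(Strategy.TWOSHOT, "ws")
plain = resolve_strategy(Strategy.NCCL, "ws")
```

Here `fallback` resolves to NCCL_SYMMETRIC with no workspace, `custom` keeps TWOSHOT together with its workspace, and `plain` stays on NCCL and ignores the workspace.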

Callers 8

test_allreduce · Method · 0.90

forward_allreduce · Function · 0.90

test_allreduce · Method · 0.90

embedding · Function · 0.70

func · Function · 0.50

calc_fused_allreduce · Function · 0.50

Calls 13

update_strategy · Method · 0.95

has_scale · Method · 0.95

AllReduceParams · Class · 0.85

current_all_reduce_helper · Function · 0.85

default_net · Function · 0.85

create_allreduce_plugin · Function · 0.85

default_trtnet · Function · 0.85

str_dtype_to_trt · Function · 0.85

_add_plugin_info · Function · 0.85

_create_tensor · Function · 0.85

mark_output · Method · 0.80

cast · Method · 0.80

Tested by 6

test_allreduce · Method · 0.72

forward_allreduce · Function · 0.72

test_allreduce · Method · 0.72

func · Function · 0.40

calc_fused_allreduce · Function · 0.40