hub / github.com/deepspeedai/DeepSpeed / _create_expert_and_data_parallel

Function _create_expert_and_data_parallel

deepspeed/utils/groups.py:240–382

Create expert and data parallel groups. When mp_size is None or 1, the legacy consecutive ordering is used (backward compatible). When mp_size > 1 and mp_mode == "tp", ranks use TP-strided ordering. When mp_size > 1 and mp_mode == "sp", ranks use consecutive ordering. Note: the caller of this function is responsible for checking whether the groups already exist.

_create_expert_and_data_parallel(expert_parallel_size_,
                                 mp_size=None,
                                 pp_size=None,
                                 mp_mode="tp",
                                 use_data_before_expert_parallel_=False)
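The three cases in the summary can be read as a small dispatch on mp_size and mp_mode. The sketch below is illustrative only: the helper name select_rank_ordering and the returned labels are hypothetical and do not exist in DeepSpeed, and raising on an unrecognized mp_mode is an assumption, since the summary does not say how other values are handled.

```python
def select_rank_ordering(mp_size=None, mp_mode="tp"):
    """Return a label for the rank ordering described in the docstring.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    # None is treated as 1, as the Args section of the docstring states.
    effective_mp_size = 1 if mp_size is None else mp_size

    if effective_mp_size <= 1:
        return "legacy-consecutive"  # backward-compatible path
    if mp_mode == "tp":
        return "tp-strided"
    if mp_mode == "sp":
        return "consecutive"
    # Assumption: unknown modes are rejected; the source summary is silent here.
    raise ValueError(f"unsupported mp_mode: {mp_mode!r}")


if __name__ == "__main__":
    for args in [(None, "tp"), (1, "sp"), (2, "tp"), (2, "sp")]:
        print(args, "->", select_rank_ordering(*args))
```

Note that mp_mode only matters once mp_size exceeds 1; with mp_size None or 1 both modes fall through to the legacy path.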

Source from the content-addressed store, hash-verified

def _create_expert_and_data_parallel(expert_parallel_size_,
                                     mp_size=None,
                                     pp_size=None,
                                     mp_mode="tp",
                                     use_data_before_expert_parallel_=False):
    """Create expert and data parallel groups.

    When mp_size is None or 1: legacy consecutive ordering (backward compatible).
    When mp_size > 1 and mp_mode=="tp": TP-strided rank ordering.
    When mp_size > 1 and mp_mode=="sp": consecutive rank ordering.

    Note: Caller of this function is responsible to check if the groups already exist.

    Example - E + D parallel (legacy path)
    world_size = 16
    expert_parallel_size = 2 # number of experts in same group
    expert_data_parallel_group = [0,2,4,6,8,10,12,14], [1,3,5,7,9,11,13,15] - all reduce is only on MoE params
    expert_parallel_group = [0,1], [2,3], [4,5], [6,7], [8,9], ..., [14,15] - no all reduce, but all to all
    data_parallel_group = [0,1,...,15] - all reduce is only on non-MoE

    Args:
        expert_parallel_size_ (int): Expert parallel group size.
        mp_size (int, optional): Model parallel size (TP or SP). None treated as 1.
        pp_size (int, optional): Pipeline parallel size. None falls back to mpu.
        mp_mode (str): "tp" for TP-strided ordering, "sp" for consecutive ordering.
        use_data_before_expert_parallel_ (bool): Use the D + E instead of E + D topology.
    """
    assert dist.is_initialized()

    # Resolve parameters for backward compat
    effective_mp_size = 1 if mp_size is None else mp_size

    log_dist(f'Creating expert and data parallel groups with size {expert_parallel_size_}', ranks=[0])
    world_size = dist.get_world_size()

    # Resolve pp_size
    if pp_size is not None:
        pp_world_size = pp_size
    else:
        pp_world_size = 1 if mpu is None else bwc_pipeline_parallel_world_size(mpu)

    rank = dist.get_rank()

    pp_stride = world_size // pp_world_size
    _ensure_divisibility(pp_stride, expert_parallel_size_)

    group_name = f"ep_size_{expert_parallel_size_}"

    global _EXPERT_DATA_PARALLEL_GROUP
    global _EXPERT_DATA_PARALLEL_GROUP_RANKS
    global _EXPERT_PARALLEL_GROUP
    global _EXPERT_PARALLEL_GROUP_RANKS

    # Legacy path: mp_size <= 1 (preserves exact original behavior)
    if effective_mp_size <= 1:
        ep_stride = pp_stride // expert_parallel_size_

        # Build the expert data parallel groups.
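The listing stops before the group-building loops, so the following is a minimal sketch of how the legacy E + D layout from the docstring example can be enumerated, not the elided source. It reuses the pp_stride and ep_stride arithmetic shown above; the function name legacy_expert_groups is hypothetical, it only returns rank lists (no dist.new_group calls), it raises ValueError where the source calls _ensure_divisibility, and the per-pipeline-stage handling is an assumption.

```python
def legacy_expert_groups(world_size, expert_parallel_size, pp_world_size=1):
    """Enumerate rank lists for the legacy E + D layout.

    Hypothetical helper for illustration; no process groups are created.
    Returns (expert_data_parallel_groups, expert_parallel_groups).
    """
    # Same arithmetic as the listing: ranks per pipeline stage.
    pp_stride = world_size // pp_world_size
    # Stand-in for _ensure_divisibility(pp_stride, expert_parallel_size_).
    if pp_stride % expert_parallel_size != 0:
        raise ValueError(f"{pp_stride} is not divisible by {expert_parallel_size}")
    ep_stride = pp_stride // expert_parallel_size

    expert_data_parallel = []
    expert_parallel = []
    # Assumption: groups are built independently inside each pipeline stage.
    for stage_start in range(0, world_size, pp_stride):
        stage_end = stage_start + pp_stride
        # Expert data parallel: every expert_parallel_size-th rank in the stage.
        for i in range(expert_parallel_size):
            ranks = range(stage_start + i, stage_end, expert_parallel_size)
            expert_data_parallel.append(list(ranks))
        # Expert parallel: consecutive blocks of expert_parallel_size ranks.
        for i in range(ep_stride):
            first = stage_start + i * expert_parallel_size
            expert_parallel.append(list(range(first, first + expert_parallel_size)))
    return expert_data_parallel, expert_parallel


if __name__ == "__main__":
    edp, ep = legacy_expert_groups(world_size=16, expert_parallel_size=2)
    print("expert_data_parallel_group:", edp)
    print("expert_parallel_group:", ep)
```

With world_size = 16 and expert_parallel_size = 2 this yields the two strided expert data parallel groups from the docstring and eight consecutive expert parallel pairs, from [0, 1] through [14, 15].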

Callers

nothing calls this directly

Calls 8

log_dist · Function · 0.90
bwc_pipeline_parallel_world_size · Function · 0.90
_ensure_divisibility · Function · 0.85
get_world_size · Method · 0.80
append · Method · 0.80
is_initialized · Method · 0.45
get_rank · Method · 0.45
new_group · Method · 0.45

Tested by

no test coverage detected
