MCPcopy
hub / github.com/FareedKhan-dev/train-llm-from-scratch / ddp_wrap

Function ddp_wrap

src/post_training/distributed.py:61–75  ·  view source on GitHub ↗

Wrap a model in DDP when running multi-GPU; return it unchanged otherwise. Only the *trainable* model should be wrapped. Reference / old-policy / reward models carry no gradients and must NOT be wrapped (they are replicated identically per rank). Pass ``find_unused_parameters=True`` fo

(model: torch.nn.Module, ctx: DDPContext, find_unused_parameters: bool = False)

Source from the content-addressed store, hash-verified

59
60
61def ddp_wrap(model: torch.nn.Module, ctx: DDPContext, find_unused_parameters: bool = False) -> torch.nn.Module:
62 """Wrap a model in DDP when running multi-GPU; return it unchanged otherwise.
63
64 Only the *trainable* model should be wrapped. Reference / old-policy / reward models
65 carry no gradients and must NOT be wrapped (they are replicated identically per rank).
66
67 Pass ``find_unused_parameters=True`` for a model where some parameters do not receive a
68 gradient on every step (for example the reward model, which uses the backbone's
69 ``forward_hidden`` and a reward head but never its ``lm_head``). Without it, DDP raises
70 a "did not get a gradient" error on the first backward.
71 """
72 if not ctx.enabled:
73 return model
74 device_ids = [ctx.local_rank] if ctx.device.startswith("cuda") else None
75 return DDP(model, device_ids=device_ids, find_unused_parameters=find_unused_parameters)
76
77
78def is_main_process(ctx: DDPContext) -> bool:

Callers 6

mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90
mainFunction · 0.90

Calls

no outgoing calls

Tested by

no test coverage detected