hub / github.com/deepspeedai/DeepSpeedExamples / backward

Method backward

Megatron-LM/fp16/fp16.py:494–554


(self, loss, update_master_grads=True, retain_graph=False)

Source

        return retval

    def backward(self, loss, update_master_grads=True, retain_graph=False):
        """
        :attr:`backward` performs the following conceptual steps:

        1. fp32_loss = loss.float() (see the Note below)
        2. scaled_loss = fp32_loss*loss_scale
        3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending on how your model was defined).
        4. fp16 grads are then copied to the master params' ``.grad`` attributes (see the Warning below), which are guaranteed to be fp32.
        5. Finally, master grads are divided by loss_scale.

        In this way, after :attr:`backward`, the master params have fresh gradients,
        and :attr:`step` may be called.

        .. note::
            :attr:`backward` internally converts the loss to fp32 before applying the loss scale.
            This provides some additional safety against overflow if the user has supplied an
            fp16 loss value.
            However, for maximum overflow safety, the user should
            compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to
            :attr:`backward`.

        .. warning::
            The gradients found in a model's leaves after the call to
            :attr:`backward` should not be regarded as valid in general,
            because it's possible
            they have been scaled (and in the case of dynamic loss scaling,
            the scale factor may change over time).
            If the user wants to inspect gradients after a call to :attr:`backward`,
            only the master gradients should be regarded as valid. These can be retrieved via
            :attr:`inspect_master_grad_data()`.

        Args:
            loss: The loss output by the user's model. loss may be either float or half (but see the Note above).
            update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`.
            retain_graph (bool, optional, default=False): Forwards the usual ``retain_graph=True`` option to the internal call to ``loss.backward``. If ``retain_graph`` is being used to accumulate gradient values from multiple backward passes before calling ``optimizer.step``, passing ``update_master_grads=False`` is also recommended (see Example below).

        Example::

            # Ordinary operation:
            optimizer.backward(loss)

            # Naive operation with multiple losses (technically valid, but less efficient):
            # fp32 grads will be correct after the second call, but
            # the first call incurs an unnecessary fp16->fp32 grad copy.
            optimizer.backward(loss1)
            optimizer.backward(loss2)

            # More efficient way to handle multiple losses:
            # The fp16->fp32 grad copy is delayed until fp16 grads from all
            # losses have been accumulated.
            optimizer.backward(loss1, update_master_grads=False)
            optimizer.backward(loss2, update_master_grads=False)
            optimizer.update_master_grads()
        """
        # To consider: try multiple backward passes using retain_graph=True to find
        # a loss scale that works. After you find a loss scale that works, do a final dummy
        # backward pass with retain_graph=False to tear down the graph. Doing this would avoid
        # discarding the iteration, but probably wouldn't improve overall efficiency.
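
The numeric effect of steps 2 to 5 in the docstring can be sketched without PyTorch. The following is a minimal illustration, not code from fp16.py. It assumes Python's struct half-precision format ("e") as a stand-in for fp16 gradient storage and plain Python floats as a stand-in for the fp32 master grads. The helper names to_fp16 and backward_sketch are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def backward_sketch(true_grads, loss_scale):
    """Emulate steps 2-5 for a list of 'true' gradient values.

    Hypothetical helper for illustration only; not part of fp16.py.
    """
    # Steps 2-3: backpropagating the scaled loss produces gradients that are
    # multiplied by loss_scale; here they are stored with fp16 precision.
    fp16_grads = [to_fp16(g * loss_scale) for g in true_grads]
    # Step 4: copy the fp16 grads into the master grads
    # (Python floats stand in for fp32 in this sketch).
    master_grads = [float(g) for g in fp16_grads]
    # Step 5: divide the master grads by loss_scale to undo the scaling.
    return [g / loss_scale for g in master_grads]


# Gradients this small are below the fp16 representable range.
tiny = [1e-8, -2e-8]

unscaled = backward_sketch(tiny, loss_scale=1.0)    # both underflow to zero
scaled = backward_sketch(tiny, loss_scale=1024.0)   # both survive the round trip

print(unscaled)
print(scaled)
```

With loss_scale=1.0 both gradients are flushed to zero when stored in fp16, so the master grads carry no signal. With loss_scale=1024.0 the scaled values are representable in fp16, and dividing the master grads by the scale recovers the original values to within 1%.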

Callers 7

train · Function · 0.45

cifar10_tutorial.py · File · 0.45

cifar10_deepspeed.py · File · 0.45

main · Function · 0.45

backward_step · Function · 0.45

Calls 1

update_master_grads · Method · 0.95

Tested by

no test coverage detected