hub / github.com/deepspeedai/DeepSpeedExamples / backward

Method backward

Megatron-LM/fp16/fp16.py:494–554

(self, loss, update_master_grads=True, retain_graph=False)

def backward(self, loss, update_master_grads=True, retain_graph=False):
    """
    :attr:`backward` performs the following conceptual steps:

    1. fp32_loss = loss.float() (see first Note below)
    2. scaled_loss = fp32_loss*loss_scale
    3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending on how your model was defined).
    4. fp16 grads are then copied to the master params' ``.grad`` attributes (see the Warning below), which are guaranteed to be fp32.
    5. Finally, master grads are divided by loss_scale.

    In this way, after :attr:`backward`, the master params have fresh gradients,
    and :attr:`step` may be called.

    .. note::
        :attr:`backward` internally converts the loss to fp32 before applying the loss scale.
        This provides some additional safety against overflow if the user has supplied an
        fp16 loss value.
        However, for maximum overflow safety, the user should
        compute the loss criterion (MSE, cross entropy, etc) in fp32 before supplying it to
        :attr:`backward`.

    .. warning::
        The gradients found in a model's leaves after the call to
        :attr:`backward` should not be regarded as valid in general,
        because it's possible
        they have been scaled (and in the case of dynamic loss scaling,
        the scale factor may change over time).
        If the user wants to inspect gradients after a call to :attr:`backward`,
        only the master gradients should be regarded as valid. These can be retrieved via
        :attr:`inspect_master_grad_data()`.

    Args:
        loss: The loss output by the user's model. loss may be either float or half (but see first Note above).
        update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`.
        retain_graph (bool, optional, default=False): Forwards the usual ``retain_graph=True`` option to the internal call to ``loss.backward``. If ``retain_graph`` is being used to accumulate gradient values from multiple backward passes before calling ``optimizer.step``, passing ``update_master_grads=False`` is also recommended (see Example below).

    Example::

        # Ordinary operation:
        optimizer.backward(loss)

        # Naive operation with multiple losses (technically valid, but less efficient):
        # fp32 grads will be correct after the second call, but
        # the first call incurs an unnecessary fp16->fp32 grad copy.
        optimizer.backward(loss1)
        optimizer.backward(loss2)

        # More efficient way to handle multiple losses:
        # The fp16->fp32 grad copy is delayed until fp16 grads from all
        # losses have been accumulated.
        optimizer.backward(loss1, update_master_grads=False)
        optimizer.backward(loss2, update_master_grads=False)
        optimizer.update_master_grads()
    """
    # To consider: try multiple backward passes using retain_graph=True to find
    # a loss scale that works. After you find a loss scale that works, do a final dummy
    # backward pass with retain_graph=False to tear down the graph. Doing this would avoid
    # discarding the iteration, but probably wouldn't improve overall efficiency.
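
The conceptual steps in the docstring can be traced on a single gradient value with plain Python. This is a minimal sketch, not the code in fp16.py: it assumes the standard library's `struct` half-precision format (`"e"`) as a stand-in for fp16 rounding and a Python float as a stand-in for an fp32 master gradient, and `to_fp16` and `backward_sketch` are hypothetical helpers. The docstring itself only discusses overflow; the sketch illustrates the commonly cited complementary reason for scaling, namely that a very small gradient rounds to zero in fp16 unless it is scaled up first.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def backward_sketch(true_grad: float, loss_scale: float):
    """Trace one gradient value through steps 2-5 (hypothetical helper)."""
    # Steps 2-3: backward on the scaled loss leaves a scaled grad in an fp16 leaf.
    fp16_grad = to_fp16(true_grad * loss_scale)
    # Step 4: copy into the master grad (a Python float stands in for fp32).
    master_grad = float(fp16_grad)
    # Step 5: divide the master grad by the loss scale.
    master_grad /= loss_scale
    return fp16_grad, master_grad


tiny = 1e-8  # smaller than the smallest positive fp16 value (about 6e-8)
_, unscaled = backward_sketch(tiny, loss_scale=1.0)
_, rescued = backward_sketch(tiny, loss_scale=1024.0)
print(unscaled, rescued)
```

With a loss scale of 1 the gradient is lost before the copy in step 4; with a scale of 1024 the master gradient recovers the value to within fp16 rounding error. The fp16 leaf value returned first is the scaled one, which is why the warning says only master gradients should be inspected.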

Callers 7

train (Function) · 0.45
main (Function) · 0.45
main (Function) · 0.45
backward_step (Function) · 0.45
backward_step (Function) · 0.45

Calls 1

update_master_grads (Method) · 0.95
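
The single outgoing call, `update_master_grads`, is the copy that `update_master_grads=False` defers. The toy class below is a hypothetical stand-in written only to count copies; it is not the optimizer from fp16.py, it tracks one scalar gradient instead of tensors, and its `copies` counter and `grad_of_loss` argument are invented for illustration. It assumes, as the docstring describes, that each `backward` call accumulates scaled gradients into the leaf and that the copy to the master gradient also divides by the loss scale.

```python
class ToyFP16Optimizer:
    """Hypothetical stand-in that mimics the documented call pattern."""

    def __init__(self, loss_scale=128.0):
        self.loss_scale = loss_scale
        self.fp16_grad = 0.0     # stands in for a model leaf's .grad
        self.master_grad = None  # stands in for the fp32 master .grad
        self.copies = 0          # number of fp16->fp32 copies performed

    def backward(self, grad_of_loss, update_master_grads=True):
        # scaled_loss.backward() accumulates a scaled gradient into the leaf.
        self.fp16_grad += grad_of_loss * self.loss_scale
        if update_master_grads:
            self.update_master_grads()

    def update_master_grads(self):
        # Copy the accumulated leaf grad to the master grad, then unscale it.
        self.master_grad = self.fp16_grad / self.loss_scale
        self.copies += 1


# Naive pattern: every backward call triggers a copy.
naive = ToyFP16Optimizer()
naive.backward(0.25)
naive.backward(0.5)

# Deferred pattern: one copy after all losses have been accumulated.
deferred = ToyFP16Optimizer()
deferred.backward(0.25, update_master_grads=False)
deferred.backward(0.5, update_master_grads=False)
deferred.update_master_grads()

print(naive.copies, deferred.copies, naive.master_grad, deferred.master_grad)
```

Both patterns end with the same master gradient; the deferred one simply performs fewer copies, which is the efficiency argument made in the docstring's Example.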

Tested by

no test coverage detected