| 492 | return retval |
| 493 | |
| 494 | def backward(self, loss, update_master_grads=True, retain_graph=False): |
| 495 | """ |
| 496 | :attr:`backward` performs the following conceptual steps: |
| 497 | |
| 498 | 1. fp32_loss = loss.float() (see first Note below) |
| 499 | 2. scaled_loss = fp32_loss*loss_scale |
| 500 | 3. scaled_loss.backward(), which accumulates scaled gradients into the ``.grad`` attributes of the model's leaves (which may be fp16, fp32, or a mixture, depending on how your model was defined). |
| 501 | 4. fp16 grads are then copied to the master params' ``.grad`` attributes (see the Warning below), which are guaranteed to be fp32. |
| 502 | 5. Finally, master grads are divided by loss_scale. |
| 503 | |
| 504 | In this way, after :attr:`backward`, the master params have fresh gradients, |
| 505 | and :attr:`step` may be called. |
| 506 | |
| 507 | .. note:: |
| 508 | :attr:`backward` internally converts the loss to fp32 before applying the loss scale. |
| 509 | This provides some additional safety against overflow if the user has supplied an |
| 510 | fp16 loss value. |
| 511 | However, for maximum overflow safety, the user should |
| 512 | compute the loss criterion (MSE, cross entropy, etc.) in fp32 before supplying it to |
| 513 | :attr:`backward`. |
| 514 | |
| 515 | .. warning:: |
| 516 | The gradients found in a model's leaves after the call to |
| 517 | :attr:`backward` should not be regarded as valid in general, |
| 518 | because it's possible |
| 519 | they have been scaled (and in the case of dynamic loss scaling, |
| 520 | the scale factor may change over time). |
| 521 | If the user wants to inspect gradients after a call to :attr:`backward`, |
| 522 | only the master gradients should be regarded as valid. These can be retrieved via |
| 523 | :attr:`inspect_master_grad_data()`. |
| 524 | |
| 525 | Args: |
| 526 | loss: The loss output by the user's model. loss may be either float or half (but see first Note above). |
| 527 | update_master_grads (bool, optional, default=True): Option to copy fp16 grads to fp32 grads on this call. By setting this to False, the user can delay the copy, which is useful to eliminate redundant fp16->fp32 grad copies if :attr:`backward` is being called on multiple losses in one iteration. If set to False, the user becomes responsible for calling :attr:`update_master_grads` before calling :attr:`step`. |
| 528 | retain_graph (bool, optional, default=False): Forwards the usual ``retain_graph=True`` option to the internal call to ``loss.backward``. If ``retain_graph`` is being used to accumulate gradient values from multiple backward passes before calling ``optimizer.step``, passing ``update_master_grads=False`` is also recommended (see Example below). |
| 529 | |
| 530 | Example:: |
| 531 | |
| 532 | # Ordinary operation: |
| 533 | optimizer.backward(loss) |
| 534 | |
| 535 | # Naive operation with multiple losses (technically valid, but less efficient): |
| 536 | # fp32 grads will be correct after the second call, but |
| 537 | # the first call incurs an unnecessary fp16->fp32 grad copy. |
| 538 | optimizer.backward(loss1) |
| 539 | optimizer.backward(loss2) |
| 540 | |
| 541 | # More efficient way to handle multiple losses: |
| 542 | # The fp16->fp32 grad copy is delayed until fp16 grads from all |
| 543 | # losses have been accumulated. |
| 544 | optimizer.backward(loss1, update_master_grads=False) |
| 545 | optimizer.backward(loss2, update_master_grads=False) |
| 546 | optimizer.update_master_grads() |
| 547 | """ |
| 548 | # To consider: try multiple backward passes using retain_graph=True to find |
| 549 | # a loss scale that works. After you find a loss scale that works, do a final dummy |
| 550 | # backward pass with retain_graph=False to tear down the graph. Doing this would avoid |
| 551 | # discarding the iteration, but probably wouldn't improve overall efficiency. |
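The five conceptual steps in the docstring can be illustrated numerically without any deep learning framework. The sketch below is not the optimizer's implementation: it uses Python's standard `struct` module as a stand-in for fp16/fp32 rounding, and the names `to_fp16`, `to_fp32`, `fits_in_fp16`, `true_grad`, and the chosen `loss_scale` value are all hypothetical, picked only to make underflow and overflow visible.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (stand-in for .half())."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value (stand-in for .float())."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fits_in_fp16(x: float) -> bool:
    """Return True if x is within fp16 range.

    struct raises OverflowError on overflow; a real fp16 tensor would
    silently become inf instead.
    """
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False


# Hypothetical values, chosen only for illustration.
loss_scale = 2.0 ** 16
loss = 4.0
true_grad = 1e-8  # below the smallest positive fp16 value (2**-24, about 6e-8)

# Steps 1-2: convert the loss to fp32 first, then apply the loss scale.
# The scaled loss is representable in fp32 but would overflow fp16.
scaled_loss = to_fp32(loss) * loss_scale  # 262144.0
scaled_loss_fits_fp16 = fits_in_fp16(scaled_loss)

# Without loss scaling, this gradient underflows to zero in an fp16 leaf.
unscaled_leaf_grad = to_fp16(true_grad)

# Step 3: backward on the scaled loss yields gradients multiplied by
# loss_scale (chain rule), which now survive fp16 storage.
scaled_leaf_grad = to_fp16(true_grad * loss_scale)

# Steps 4-5: copy to an fp32 master gradient, then divide by loss_scale.
master_grad = to_fp32(scaled_leaf_grad) / loss_scale

print(unscaled_leaf_grad, scaled_leaf_grad, master_grad)
```

Under these assumptions, the unscaled fp16 gradient is lost entirely, while the master gradient recovers the original value up to fp16 rounding error. The leaf gradient itself still carries the scale factor, which is why only the master gradients should be inspected after `backward`.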