## Learned Step Size Pass (Network Fine-Tuning, LSQ)

Learned Step Size optimization, a training-based optimization pass that tunes weights and scales for high-precision quantization. [This method was proposed by Steven K. Esser et al.](https://arxiv.org/pdf/1902.08153.pdf)

class LearnedStepSizePass(TrainingBasedPass):
    """
    ## Learned Step Size Pass (Network Fine-Tuning, LSQ)

    Learned Step Size optimization, a training-based optimization pass that tunes weights and scales for high-precision quantization.

    [This method was proposed by Steven K. Esser et al.](https://arxiv.org/pdf/1902.08153.pdf)

    This is an alternative version of LSQ. This pass splits your graph into multiple trainable blocks, and each block is trained separately.
    Warning: PPQ Learned Step Size minimizes only the output loss of each block, so after training the intermediate results inside a block may drift far from their original values.

    PPQ Learned Step Size optimization requires 256 ~ 2048 samples for fine-tuning your network; data labels are not necessary. All training data is cached in GPU memory or CPU memory for acceleration.

    The training loss is computed as:

        let: Y = W * X + b

        Quant(Y, scale_Y) = Quant(W, scale_W) * Quant(X, scale_X) + b

        loss = loss_func(Y, Quant(Y, scale_Y)) # loss between the fp output and the int8 output, which is why labeled data is not needed.

    The derivatives of Quant(y, scale_Y) with respect to y and scale_Y are:

        if y > scale_Y * -128 and y < scale_Y * 127:

            dQuant(y, scale_Y)/dy = 1  # straight-through estimator

            dQuant(y, scale_Y)/dscale_Y = (Quant(y, scale_Y) - y) / scale_Y

        if y <= scale_Y * -128:

            dQuant(y, scale_Y)/dy = 0

            dQuant(y, scale_Y)/dscale_Y = -128

        if y >= scale_Y * 127:

            dQuant(y, scale_Y)/dy = 0

            dQuant(y, scale_Y)/dscale_Y = 127

    ### Parameters:

    * interested_layers(List[str]):

            A list of operation names; only the layers listed in this parameter will be trained.

            If interested_layers is None, all Conv and Gemm layers will be trained.

    * steps(int):

            Number of training steps for fine-tuning your network; default is 500.

    * block_size(int):

            PPQ Learned Step Size optimization first splits your graph into blocks;
            each block is then fine-tuned separately.
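
The loss and derivative rules in the docstring can be sketched for a single scalar in plain Python. This is a minimal illustration under stated assumptions, not PPQ's implementation: the helper names `quantize`, `lsq_gradients`, `linear`, and `label_free_loss` are hypothetical, squared error stands in for the unspecified `loss_func`, and the in-range scale gradient follows the LSQ paper's form, `round(y / scale) - y / scale`. Python's built-in `round` rounds ties to even, which may differ from the rounding mode of a given backend.

```python
from typing import Sequence, Tuple

QMIN, QMAX = -128, 127  # signed int8 range used in the docstring formulas


def quantize(v: float, scale: float) -> float:
    """Fake-quantize v: snap to the int8 grid, clamp, and rescale to float."""
    q = max(QMIN, min(QMAX, round(v / scale)))
    return q * scale


def lsq_gradients(y: float, scale: float) -> Tuple[float, float]:
    """Return (dQuant/dy, dQuant/dscale) under the straight-through estimator."""
    if y <= scale * QMIN:  # clipped at the lower bound
        return 0.0, float(QMIN)
    if y >= scale * QMAX:  # clipped at the upper bound
        return 0.0, float(QMAX)
    # In range: the gradient w.r.t. y passes through unchanged, and the scale
    # gradient is the rounding error measured in units of the step size.
    return 1.0, (quantize(y, scale) - y) / scale


def linear(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Y = WX + b for a single output unit."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def label_free_loss(w: Sequence[float], x: Sequence[float], b: float,
                    scale_w: float, scale_x: float, scale_y: float) -> float:
    """Squared error between the float output and the quantized output.

    Squared error is an assumption; the docstring leaves loss_func open.
    """
    y_fp = linear(w, x, b)
    wq = [quantize(wi, scale_w) for wi in w]
    xq = [quantize(xi, scale_x) for xi in x]
    y_q = quantize(linear(wq, xq, b), scale_y)
    return (y_fp - y_q) ** 2
```

Because the loss compares the float output with the quantized output of the same input, it needs calibration samples but no labels. Values that saturate the int8 range contribute no gradient to `y` and push `scale_Y` only through the constant `-128` or `127` term.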
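
The interaction between `block_size` and `interested_layers` can be pictured with a deliberately simplified sketch. It assumes the graph has already been flattened into a topologically ordered list of layer names, which is a simplification: the real pass works on a graph, and its block-splitting rule is not described in this section. The functions `split_into_blocks` and `select_trainable`, and the layer names, are hypothetical.

```python
from typing import List, Optional


def split_into_blocks(layers: List[str], block_size: int) -> List[List[str]]:
    """Chunk an ordered layer list into blocks of at most block_size layers."""
    if block_size < 1:
        raise ValueError("block_size must be >= 1")
    return [layers[i:i + block_size] for i in range(0, len(layers), block_size)]


def select_trainable(block: List[str],
                     interested_layers: Optional[List[str]] = None) -> List[str]:
    """Keep only layers named in interested_layers; None keeps every layer."""
    if interested_layers is None:
        return list(block)
    wanted = set(interested_layers)
    return [name for name in block if name in wanted]


# Hypothetical layer names, for illustration only.
layers = ["conv1", "conv2", "conv3", "gemm1", "gemm2"]
for block in split_into_blocks(layers, block_size=2):
    trainable = select_trainable(block, interested_layers=["conv2", "gemm1"])
    print(block, "->", trainable)
```

In this sketch each block is handled independently, which mirrors the docstring's statement that every block is fine-tuned separately against its own output loss.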