import { commentary, DimStyle, IWalkthroughArgs, moveCameraTo, setInitialCamera } from "./WalkthroughTools";

export function walkthrough03_LayerNorm(args: IWalkthroughArgs) {
    let { walkthrough: wt, layout, state, tools: { afterTime, c_str, c_blockRef, c_dimRef, breakAfter, cleanup } } = args;
    let { C } = layout.shape;

    if (wt.phase !== Phase.Input_Detail_LayerNorm) {
        return;
    }

    let ln = layout.blocks[0].ln1;

    setInitialCamera(state, new Vec3(-6.680, 0.000, -65.256), new Vec3(281.000, 9.000, 2.576));
    wt.dimHighlightBlocks = [layout.residual0, ...ln.cubes];

    commentary(wt, null, 0)`

The ${c_blockRef('_input embedding_', state.layout.residual0)} matrix from the previous section is the input to our first Transformer block.

The first step in the Transformer block is to apply _layer normalization_ to this matrix. This is an
operation that normalizes the values in each column of the matrix separately.`;
    breakAfter();

    let t_moveCamera = afterTime(null, 1.0);
    let t_hideExtra = afterTime(null, 1.0, 1.0);
    let t_moveInputEmbed = afterTime(null, 1.0);
    let t_moveCameraClose = afterTime(null, 0.5);

    breakAfter();
    commentary(wt)`
Normalization is an important step in the training of deep neural networks, and it helps improve the
stability of the model during training.

We can regard each column separately, so let's focus on the 4th column (${c_dimRef('t = 3', DimStyle.T)}) for now.`;

    breakAfter();
    let t_focusColumn = afterTime(null, 0.5);

    // mu ascii: \u03bc
    // sigma ascii: \u03c3
    breakAfter();
    commentary(wt)`
The goal is to make the average value in the column equal to 0 and the standard deviation equal to 1. To do this,
we find both of these quantities (${c_blockRef('mean (\u03bc)', ln.lnAgg1)} & ${c_blockRef('std dev (\u03c3)', ln.lnAgg2)}) for the column and then subtract the average and divide by the standard deviation.`;

    breakAfter();

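    // Illustrative sketch only, not part of the walkthrough logic: the per-column
    // normalization described in the commentary above. The helper name and its
    // signature are hypothetical. It assumes the population variance (divide by the
    // column length) and a small epsilon, defaulting to 1e-5, added before the
    // square root so that a constant column does not cause division by zero.
    function layerNormColumnSketch(column: number[], eps: number = 1e-5): number[] {
        let mean = column.reduce((sum, x) => sum + x, 0) / column.length;
        let variance = column.reduce((sum, x) => sum + (x - mean) ** 2, 0) / column.length;
        let invStdDev = 1 / Math.sqrt(variance + eps);
        // Subtract the mean, then divide by the (epsilon-stabilized) standard deviation.
        return column.map(x => (x - mean) * invStdDev);
    }
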
    let t_calcMuAgg = afterTime(null, 0.5);
    let t_calcVarAgg = afterTime(null, 0.5);

    // 1e-5 as 1x10^-5, but with superscript: 1x10<sup>-5</sup>

    breakAfter();
    commentary(wt)`
The notation we use here is E[x] for the average and Var[x] for the variance (of the column of length ${c_dimRef('C', DimStyle.C)}). The
variance is simply the standard deviation squared. The epsilon term (ε = ${<>1×10<sup>-5</sup></>}) is there to prevent division by zero.

We compute and store these values in our aggregation layer since we're applying them to all values in the column.

Finally, once we have the normalized values, we multiply each element in the column by a learned