hub / github.com/bbycroft/llm-viz / walkthrough03_LayerNorm

Function walkthrough03_LayerNorm

src/llm/walkthrough/Walkthrough03_LayerNorm.tsx:12–193
(args: IWalkthroughArgs)

Source from the content-addressed store, hash-verified

import { commentary, DimStyle, IWalkthroughArgs, moveCameraTo, setInitialCamera } from "./WalkthroughTools";

export function walkthrough03_LayerNorm(args: IWalkthroughArgs) {
    let { walkthrough: wt, layout, state, tools: { afterTime, c_str, c_blockRef, c_dimRef, breakAfter, cleanup } } = args;
    let { C } = layout.shape;

    if (wt.phase !== Phase.Input_Detail_LayerNorm) {
        return;
    }

    let ln = layout.blocks[0].ln1;

    setInitialCamera(state, new Vec3(-6.680, 0.000, -65.256), new Vec3(281.000, 9.000, 2.576));
    wt.dimHighlightBlocks = [layout.residual0, ...ln.cubes];

    commentary(wt, null, 0)`

The ${c_blockRef('_input embedding_', state.layout.residual0)} matrix from the previous section is the input to our first Transformer block.

The first step in the Transformer block is to apply _layer normalization_ to this matrix. This is an
operation that normalizes the values in each column of the matrix separately.`;
    breakAfter();

    let t_moveCamera = afterTime(null, 1.0);
    let t_hideExtra = afterTime(null, 1.0, 1.0);
    let t_moveInputEmbed = afterTime(null, 1.0);
    let t_moveCameraClose = afterTime(null, 0.5);

    breakAfter();
    commentary(wt)`
Normalization is an important step in the training of deep neural networks, and it helps improve the
stability of the model during training.

We can regard each column separately, so let's focus on the 4th column (${c_dimRef('t = 3', DimStyle.T)}) for now.`;

    breakAfter();
    let t_focusColumn = afterTime(null, 0.5);

    // mu ascii: \u03bc
    // sigma ascii: \u03c3
    breakAfter();
    commentary(wt)`
The goal is to make the average value in the column equal to 0 and the standard deviation equal to 1. To do this,
we find both of these quantities (${c_blockRef('mean (\u03bc)', ln.lnAgg1)} & ${c_blockRef('std dev (\u03c3)', ln.lnAgg2)}) for the column and then subtract the average and divide by the standard deviation.`;

    breakAfter();

    let t_calcMuAgg = afterTime(null, 0.5);
    let t_calcVarAgg = afterTime(null, 0.5);

    // 1e-5 as 1x10^-5, but with superscript: 1x10<sup>-5</sup>

    breakAfter();
    commentary(wt)`
The notation we use here is E[x] for the average and Var[x] for the variance (of the column of length ${c_dimRef('C', DimStyle.C)}). The
variance is simply the standard deviation squared. The epsilon term (ε = ${<>1&times;10<sup>-5</sup></>}) is there to prevent division by zero.

We compute and store these values in our aggregation layer since we're applying them to all values in the column.

Finally, once we have the normalized values, we multiply each element in the column by a learned
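
The arithmetic that the commentary in this excerpt describes (per-column E[x], Var[x], and ε = 1×10⁻⁵) can be illustrated with a small standalone sketch. The names `EPSILON`, `columnStats`, and `normalizeColumn` are hypothetical and do not come from llm-viz, and the sketch assumes the population variance (dividing by C), which the excerpt does not state explicitly. It stops at the normalized values, which is as far as the commentary shown here goes.

```typescript
// Hypothetical helpers for illustration only; not part of llm-viz.
const EPSILON = 1e-5; // the ε term from the commentary; avoids division by zero

interface ColumnStats {
    mean: number;     // E[x]
    variance: number; // Var[x]; assumed to be the population variance (divide by C)
}

// Compute E[x] and Var[x] for a single column of length C.
function columnStats(column: number[]): ColumnStats {
    const C = column.length;
    if (C === 0) {
        throw new Error("column must contain at least one value");
    }
    const mean = column.reduce((sum, x) => sum + x, 0) / C;
    const variance = column.reduce((sum, x) => {
        const d = x - mean;
        return sum + d * d;
    }, 0) / C;
    return { mean, variance };
}

// Subtract the mean and divide by sqrt(Var[x] + ε), element by element.
function normalizeColumn(column: number[], eps: number = EPSILON): number[] {
    const { mean, variance } = columnStats(column);
    const invStd = 1 / Math.sqrt(variance + eps);
    return column.map((x) => (x - mean) * invStd);
}

const exampleColumn = [1, 2, 3, 4];
const normalizedColumn = normalizeColumn(exampleColumn);
console.log(normalizedColumn, columnStats(normalizedColumn));
```

Computing the two statistics once and reusing them for every element matches the commentary's point about storing them in an aggregation layer. Because ε is added under the square root, the variance of the output is very slightly below 1, and a constant column maps to all zeros instead of producing NaN.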

Callers 1

runWalkthrough · Function · 0.90

Calls 15

setInitialCamera · Function · 0.90
commentary · Function · 0.90
moveCameraTo · Function · 0.90
lerp · Function · 0.90
splitGrid · Function · 0.90
drawDependences · Function · 0.90
drawDataFlow · Function · 0.90
clamp · Function · 0.90
findSubBlocks · Function · 0.90
startProcessBefore · Function · 0.90
processUpTo · Function · 0.90
c_blockRef · Function · 0.85
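
In the source, `commentary(wt)` is followed directly by a backtick string, and `c_blockRef` / `c_dimRef` appear inside its `${...}` slots. This is TypeScript's tagged template syntax: the call returns a function, and that function receives the literal text chunks and the interpolated values separately. The sketch below shows only that language mechanism. `CommentaryLog`, `ref`, and `commentaryTag` are hypothetical stand-ins, and nothing here is claimed to match how llm-viz implements `commentary`.

```typescript
// Hypothetical stand-ins for illustration; none of this is llm-viz code.
interface CommentaryLog {
    entries: string[];
}

// Stand-in for a reference helper such as c_dimRef: returns a marker for the slot.
function ref(label: string): string {
    return `[${label}]`;
}

// Calling commentaryTag(log) returns the tag function applied to the template.
function commentaryTag(log: CommentaryLog) {
    return (strings: TemplateStringsArray, ...values: unknown[]): string => {
        let rendered = "";
        strings.forEach((chunk, i) => {
            rendered += chunk;            // literal text between the slots
            if (i < values.length) {
                rendered += String(values[i]); // the i-th interpolated value
            }
        });
        log.entries.push(rendered);
        return rendered;
    };
}

const walkthroughLog: CommentaryLog = { entries: [] };
const renderedText = commentaryTag(walkthroughLog)`Focus on the 4th column (${ref("t = 3")}) for now.`;
console.log(renderedText);
```

Because the tag function sees the interpolated values as objects and not as pre-joined text, a real implementation is free to keep them as structured references (for example to highlight a block) instead of flattening them to strings as this sketch does.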

Tested by

no test coverage detected