Function walkthrough03_LayerNorm

src/llm/walkthrough/Walkthrough03_LayerNorm.tsx:12–193

(args: IWalkthroughArgs)

Source from the content-addressed store, hash-verified

10	import { commentary, DimStyle, IWalkthroughArgs, moveCameraTo, setInitialCamera } from "./WalkthroughTools";
11
12	export function walkthrough03_LayerNorm(args: IWalkthroughArgs) {
13	let { walkthrough: wt, layout, state, tools: { afterTime, c_str, c_blockRef, c_dimRef, breakAfter, cleanup } } = args;
14	let { C } = layout.shape;
15
16	if (wt.phase !== Phase.Input_Detail_LayerNorm) {
17	return;
18	}
19
20	let ln = layout.blocks[0].ln1;
21
22	setInitialCamera(state, new Vec3(-6.680, 0.000, -65.256), new Vec3(281.000, 9.000, 2.576));
23	wt.dimHighlightBlocks = [layout.residual0, ...ln.cubes];
24
25	commentary(wt, null, 0)`
26
27	The ${c_blockRef('_input embedding_', state.layout.residual0)} matrix from the previous section is the input to our first Transformer block.
28
29	The first step in the Transformer block is to apply _layer normalization_ to this matrix. This is an
30	operation that normalizes the values in each column of the matrix separately.`;
31	breakAfter();
32
33	let t_moveCamera = afterTime(null, 1.0);
34	let t_hideExtra = afterTime(null, 1.0, 1.0);
35	let t_moveInputEmbed = afterTime(null, 1.0);
36	let t_moveCameraClose = afterTime(null, 0.5);
37
38	breakAfter();
39	commentary(wt)`
40	Normalization is an important step in the training of deep neural networks, and it helps improve the
41	stability of the model during training.
42
43	We can regard each column separately, so let's focus on the 4th column (${c_dimRef('t = 3', DimStyle.T)}) for now.`;
44
45	breakAfter();
46	let t_focusColumn = afterTime(null, 0.5);
47
48	// mu ascii: \u03bc
49	// sigma ascii: \u03c3
50	breakAfter();
51	commentary(wt)`
52	The goal is to make the average value in the column equal to 0 and the standard deviation equal to 1. To do this,
53	we find both of these quantities (${c_blockRef('mean (\u03bc)', ln.lnAgg1)} & ${c_blockRef('std dev (\u03c3)', ln.lnAgg2)}) for the column and then subtract the average and divide by the standard deviation.`;
54
55	breakAfter();
56
57	let t_calcMuAgg = afterTime(null, 0.5);
58	let t_calcVarAgg = afterTime(null, 0.5);
59
60	// 1e-5 as 1x10^-5, but with superscript: 1x10<sup>-5</sup>
61
62	breakAfter();
63	commentary(wt)`
64	The notation we use here is E[x] for the average and Var[x] for the variance (of the column of length ${c_dimRef('C', DimStyle.C)}). The
65	variance is simply the standard deviation squared. The epsilon term (ε = ${<>1×10<sup>-5</sup></>}) is there to prevent division by zero.
66
67	We compute and store these values in our aggregation layer since we're applying them to all values in the column.
68
69	Finally, once we have the normalized values, we multiply each element in the column by a learned
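
The commentary in the listing above describes normalizing one column: compute E[x] and Var[x], subtract the mean, and divide by the square root of the variance plus ε. A minimal sketch of that arithmetic follows. The function `layerNormColumn` and its parameter names are hypothetical and do not appear in the llm-viz source. The sketch also assumes the population variance (dividing by C), which the listing does not state.

```typescript
// Illustrative sketch only: layerNormColumn is a hypothetical helper,
// not a function from the llm-viz repository.
function layerNormColumn(x: number[], eps: number = 1e-5): number[] {
    const C = x.length;
    if (C === 0) {
        return [];
    }
    // E[x]: the average value of the column.
    const mean = x.reduce((sum, v) => sum + v, 0) / C;
    // Var[x]: the average squared deviation from the mean.
    // Assumption: population variance (divide by C, not C - 1).
    const variance = x.reduce((sum, v) => sum + (v - mean) ** 2, 0) / C;
    // eps keeps the denominator non-zero when every value is identical.
    const invStd = 1 / Math.sqrt(variance + eps);
    return x.map(v => (v - mean) * invStd);
}

const column = [1, 2, 3, 4];
const normalized = layerNormColumn(column);
```

After this step the values in `normalized` have a mean of approximately 0 and a standard deviation very close to 1. A constant column such as `[5, 5, 5]` maps to all zeros rather than producing a division by zero, which is the role of ε.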

Callers 1

runWalkthrough · Function · 0.90

Calls 15

setInitialCamera · Function · 0.90

commentary · Function · 0.90

moveCameraTo · Function · 0.90

lerp · Function · 0.90

splitGrid · Function · 0.90

drawDependences · Function · 0.90

drawDataFlow · Function · 0.90

clamp · Function · 0.90

findSubBlocks · Function · 0.90

startProcessBefore · Function · 0.90

processUpTo · Function · 0.90

c_blockRef · Function · 0.85

Tested by

no test coverage detected