| 1216 | |
| 1217 | |
| 1218 | class BatchNorm1D(LayerBase): |
| 1219 | def __init__(self, momentum=0.9, epsilon=1e-5, optimizer=None): |
| 1220 | """ |
| 1221 | A batch normalization layer for 1D inputs. |
| 1222 | |
| 1223 | Notes |
| 1224 | ----- |
| 1225 | BatchNorm is an attempt to address the problem of internal covariate |
| 1226 | shift (ICS) during training by normalizing layer inputs. |
| 1227 | |
| 1228 | ICS refers to the change in the distribution of layer inputs during |
| 1229 | training as a result of the changing parameters of the previous |
| 1230 | layer(s). ICS can make it difficult to train models with saturating |
| 1231 | nonlinearities, and in general can slow training by requiring a lower |
| 1232 | learning rate. |
| 1233 | |
| 1234 | Equations [train]:: |
| 1235 | |
| 1236 | Y = scaler * norm(X) + intercept |
| 1237 | norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon) |
| 1238 | |
| 1239 | Equations [test]:: |
| 1240 | |
| 1241 | Y = scaler * running_norm(X) + intercept |
| 1242 | running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon) |
| 1243 | |
| 1244 | In contrast to :class:`LayerNorm1D`, the BatchNorm layer calculates |
| 1245 | the mean and variance across the *batch* rather than the output features. |
| 1246 | This has two disadvantages: |
| 1247 | |
| 1248 | 1. It is highly affected by batch size: smaller mini-batch sizes |
| 1249 | increase the variance of the estimates for the global mean and |
| 1250 | variance. |
| 1251 | |
| 1252 | 2. It is difficult to apply in RNNs -- one must fit a separate |
| 1253 | BatchNorm layer for *each* time-step. |
| 1254 | |
| 1255 | Parameters |
| 1256 | ---------- |
| 1257 | momentum : float |
| 1258 | The momentum term for the running mean/running variance calculations. |
| 1259 | The closer this is to 1, the less weight will be given to the |
| 1260 | mean/variance of the current batch (i.e., higher smoothing). Default is |
| 1261 | 0.9. |
| 1262 | epsilon : float |
| 1263 | A small smoothing constant to use during computation of ``norm(X)`` |
| 1264 | to avoid divide-by-zero errors. Default is 1e-5. |
| 1265 | optimizer : str, :doc:`Optimizer <numpy_ml.neural_nets.optimizers>` object, or None |
| 1266 | The optimization strategy to use when performing gradient updates |
| 1267 | within the :meth:`update` method. If None, use the :class:`SGD |
| 1268 | <numpy_ml.neural_nets.optimizers.SGD>` optimizer with |
| 1269 | default parameters. Default is None. |
| 1270 | |
| 1271 | Attributes |
| 1272 | ---------- |
| 1273 | X : list |
| 1274 | Running list of inputs to the :meth:`forward <numpy_ml.neural_nets.LayerBase.forward>` method since the last call to :meth:`update <numpy_ml.neural_nets.LayerBase.update>`. Only updated if the ``retain_derived`` argument was set to True. |
| 1275 | gradients : dict |