How to Train Your ResNet 7: Batch Norm
We investigate how batch normalisation helps optimisation (spoiler: it involves internal covariate shift...). Along the way we meet some bad initialisations, degenerate networks and spiky Hessians.
One Twitter thread responding to the post puts it well: "1/ The blog post is a nice demonstration of some phenomena which, luckily, we have a deep literature on. So allow me to relay some of my current knowledge. Here's a graphical summary."
Daniel Jiwoong Im gives a probabilistic account of the same effect: "The corresponding loss of variance within each channel follows from the Law of total variance which states that the total variance, integrating over the weights distribution – which is preserved by design under the He-et-al initialisation …" In other words, the law of total variance splits the total output variance, Var(y) = E_W[Var(y|W)] + Var_W(E[y|W]), into the average within-network (per-channel) variance plus the variance of the per-network means across weight draws. Since He initialisation preserves the total, any growth in the second term must come at the expense of variance within each channel.
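This decomposition is easy to check numerically. The sketch below is my own illustration, not from the tweet — the widths, depths and sample counts are arbitrary choices. It pushes a batch of inputs through many independently He-initialised deep ReLU nets and verifies that the total per-channel variance equals the mean within-net variance plus the variance of the per-net means:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights):
    # propagate a batch of inputs through a bias-free deep ReLU net
    for W in weights:
        x = relu(x @ W.T)
    return x

width, depth, n_inputs, n_nets = 100, 10, 200, 50
outputs = []
for _ in range(n_nets):
    # He-et-al initialisation: std = sqrt(2 / fan_in), chosen so that the
    # *total* variance is preserved through each ReLU layer
    weights = [rng.normal(0.0, np.sqrt(2.0 / width), (width, width))
               for _ in range(depth)]
    x = rng.normal(size=(n_inputs, width))
    outputs.append(forward(x, weights))
outputs = np.stack(outputs)  # shape: (n_nets, n_inputs, width)

# law of total variance, per channel: Var(y) = E_W[Var(y|W)] + Var_W(E[y|W])
total = outputs.var(axis=(0, 1))            # variance over nets AND inputs
within = outputs.var(axis=1).mean(axis=0)   # mean over nets of per-net variance
between = outputs.mean(axis=1).var(axis=0)  # variance over nets of per-net means
assert np.allclose(total, within + between)

# fraction of variance that survives *within* a channel, i.e. across the batch
print(f"within-channel fraction: {within.mean() / total.mean():.3f}")
```

The printed fraction is the share of total variance that remains within a channel; increasing `depth` should shrink it, which is exactly the loss of within-channel variance the quote describes.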
Another cool analysis observes: "Experimentally, #BatchNorm increases the robustness of parameters with respect to adversarial perturbations and hence improves the conditioning of the Hessian significantly."
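The conditioning effect can be illustrated in miniature — loosely, since real batch norm acts at every layer of a deep net, while this toy applies the batch-norm operation (standardise each channel over the batch) just once, at the input of a linear least-squares model whose Hessian we can write down exactly. The problem sizes and feature scales below are arbitrary assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy least-squares problem with badly scaled, shifted features;
# the Hessian of the squared loss is H = X.T @ X / n, independent of the targets
n = 1000
X = rng.normal(size=(n, 3)) * np.array([1.0, 10.0, 100.0]) \
    + np.array([0.0, 5.0, -50.0])

def hessian_condition(X):
    H = X.T @ X / len(X)
    return np.linalg.cond(H)

# batch norm at the input layer: zero mean, unit variance per channel,
# computed over the batch
X_bn = (X - X.mean(axis=0)) / X.std(axis=0)

print(f"condition number raw: {hessian_condition(X):.1f}, "
      f"normalised: {hessian_condition(X_bn):.2f}")
```

After normalisation the Hessian is close to the identity, so its condition number drops from the thousands to near 1 — a much weaker statement than the tweet's (which concerns deep networks), but it shows the mechanism in its simplest form.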