We investigate how batch normalisation helps optimisation (spoiler: it involves internal covariate shift...). Along the way we meet some bad initialisations, degenerate networks and spiky Hessians.

Greg Yang

1/ The blog post myrtle.ai/how-to-train-y… is a nice demonstration of some phenomena which, luckily, we have a deep literature on. So allow me to relay some of my current knowledge. Here's a graphical summary again. pic.twitter.com/O8o15Q4caz


"The corresponding loss of variance within each channel follows from the law of total variance, which states that the total variance (preserved by design under the He et al. initialisation) decomposes into the variance of the per-channel means plus the mean of the within-channel variances; as the former grows with depth, the latter must shrink."
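A minimal numpy sketch of this decomposition (the network widths, depth, and batch size here are illustrative choices, not taken from the post): propagate a batch through a deep He-initialised ReLU net, then check that the pooled variance splits exactly into variance-of-channel-means plus mean-of-within-channel-variances.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_relu_layer(x, fan_in, fan_out):
    # He et al. initialisation: W ~ N(0, 2/fan_in), chosen so that the
    # second moment of the activations is preserved through the ReLU
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    return np.maximum(x @ W, 0.0)

# push a batch through a deep stack of layers
batch, width, depth = 512, 256, 10
x = rng.normal(size=(batch, width))
for _ in range(depth):
    x = he_relu_layer(x, width, width)

total_var = x.var()                  # variance pooled over batch and channels
var_of_means = x.mean(axis=0).var()  # variance of the per-channel means
mean_of_vars = x.var(axis=0).mean()  # mean of the within-channel variances

# law of total variance (ANOVA identity): the two sides agree exactly
print(total_var, var_of_means + mean_of_vars)
```

Printing the two per-layer terms inside the loop shows the shift the quote describes: with depth, more of the (roughly preserved) total variance migrates into the per-channel means, at the expense of the within-channel variance.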

Another cool analysis by @dcpage3: myrtle.ai/how-to-train-y…
Experimentally, #BatchNorm increases the robustness of the loss to adversarial perturbations of the parameters, and hence significantly improves the conditioning of the Hessian.
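One concrete mechanism behind this robustness, sketched below under assumed shapes and a plain (no learned scale/shift) batch norm: normalisation makes a layer's output invariant to rescaling of its incoming weights, so the loss is flat along those parameter directions, which would otherwise contribute sharp curvature to the Hessian.

```python
import numpy as np

rng = np.random.default_rng(1)

def batchnorm(z, eps=1e-5):
    # batch normalisation without learned affine parameters:
    # standardise each channel over the batch dimension
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

x = rng.normal(size=(128, 64))
W = rng.normal(size=(64, 32))

# rescaling the incoming weights is a large parameter perturbation,
# but batch norm absorbs it: the normalised outputs barely move
out1 = batchnorm(x @ W)
out2 = batchnorm(x @ (10.0 * W))

print(np.abs(out1 - out2).max())  # ~0, up to the eps in the denominator
```

Without the `batchnorm` call, the same perturbation would scale the outputs (and hence the downstream loss) by a factor of 10, which is the kind of sharp direction the quoted tweet says batch norm removes.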