We investigate how batch normalisation helps optimisation (spoiler: it involves internal covariate shift...). Along the way we meet some bad initialisations, degenerate networks and spiky Hessians.

"The corresponding loss of variance within each channel follows from the Law of total variance which states that the total variance, integrating over the weights distribution – which is preserved by design under the He-et-al initialisation"
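As a rough illustration of the law of total variance mentioned in the quote (this is not the author's code; the array shapes, names and the toy data are assumptions made up for the example), the total variance of a set of activations splits into the average variance within each channel plus the variance of the channel means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: 8 channels, 1000 samples per channel.
# Each channel gets its own random offset so that the channel means differ.
n_channels, n_samples = 8, 1000
offsets = rng.normal(size=(n_channels, 1))
x = offsets + 0.5 * rng.normal(size=(n_channels, n_samples))

total = x.var()                 # variance over all entries (ddof=0)
within = x.var(axis=1).mean()   # mean of the per-channel variances
between = x.mean(axis=1).var()  # variance of the per-channel means

# Law of total variance; the split is exact here because every
# channel contains the same number of samples.
assert np.isclose(total, within + between)
print(f"total={total:.4f}  within={within:.4f}  between={between:.4f}")
```

If the total is held fixed, any variance that ends up in the spread of the channel means (`between`) is no longer available within each channel (`within`), which is one way to read the "loss of variance within each channel" in the quote.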

Experimentally, #BatchNorm increases the robustness of parameters with respect to adversarial perturbations and hence significantly improves the conditioning of the Hessian.