1/ The blog post is a nice demonstration of some phenomena which, luckily, we have a deep literature on. So allow me to relay some of my current knowledge. Here's a graphical summary again.
"The corresponding loss of variance within each channel follows from the Law of total variance which states that the total variance, integrating over the weights distribution – which is preserved by design under the He-et-al initialisation"
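For anyone who wants to check the quoted argument numerically, here is a minimal sketch. It is my own toy setup, not the blog post's code: a single linear channel, He-style Gaussian weights, and post-ReLU Gaussian inputs are all assumptions, and `variance_decomposition` is a hypothetical helper name. The law of total variance says Var(y) = E_w[Var(y | w)] + Var_w(E[y | w]), so any variance that ends up in the second term (per-channel means that differ between weight draws) is missing from the within-channel term.

```python
import numpy as np


def variance_decomposition(y):
    """Split the variance of y, shape (n_weight_draws, n_inputs).

    Returns (total, within, between), where
      total   = variance over all entries,
      within  = mean over weight draws of the variance across inputs,
      between = variance over weight draws of the mean across inputs.
    With equal group sizes and population variances (ddof=0) the
    identity total == within + between holds exactly.
    """
    total = y.var()
    within = y.var(axis=1).mean()
    between = y.mean(axis=1).var()
    return total, within, between


# Assumed toy setup (not taken from the blog post).
rng = np.random.default_rng(0)
fan_in, n_draws, n_inputs = 64, 200, 500

# Post-ReLU inputs: non-negative, hence non-zero mean.
x = np.maximum(rng.standard_normal((n_inputs, fan_in)), 0.0)

# He-style initialisation: weight variance 2 / fan_in.
w = rng.standard_normal((n_draws, fan_in)) * np.sqrt(2.0 / fan_in)

# One output channel per weight draw, evaluated on every input.
y = w @ x.T  # shape (n_draws, n_inputs)

total, within, between = variance_decomposition(y)
print(f"total={total:.3f}  within={within:.3f}  between={between:.3f}")
```

In this toy setup the total stays close to what the initialisation is designed to preserve, but a noticeable share of it sits in the between-weights term because the ReLU inputs have non-zero mean, so the variance inside any single channel comes out smaller than the total.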
Another cool analysis by : Experimentally, #BatchNorm increases the robustness of parameters w.r.t. adversarial perturbations and hence significantly improves the conditioning of the Hessian.