1/ The blog post is a nice demonstration of some phenomena on which, luckily, we have a deep literature. So allow me to relay some of my current knowledge. Here's a graphical summary again.
More details here
The corresponding loss of variance within each channel follows from the law of total variance: the total variance decomposes into the average variance within channels plus the variance of the channel means. Since the total variance, integrating over the weight distribution, is preserved by design under the He et al. initialisation, any gain in between-channel variance must come at the expense of within-channel variance.
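A minimal numerical sketch of this decomposition (my own toy setup, not the blog post's code; sizes and names are illustrative). With equal-sized groups and population variances, the law of total variance holds exactly: total variance = mean within-channel variance + variance of channel means.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, n_ch, n_samp = 256, 64, 10_000

# He-et-al (Kaiming) initialisation: std = sqrt(2 / fan_in),
# chosen so ReLU layers preserve activation variance in expectation.
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(n_ch, fan_in))
x = rng.normal(size=(n_samp, fan_in))
y = np.maximum(x @ W.T, 0.0)          # post-ReLU activations, shape (n_samp, n_ch)

total_var = y.var()                   # variance over all samples and channels
within    = y.var(axis=0).mean()      # E_c[ Var(y | channel c) ]
between   = y.mean(axis=0).var()      # Var_c( E[y | channel c] )

# Law of total variance: total = within + between (exact for equal group sizes)
assert np.isclose(total_var, within + between)
```

Because the total is pinned by the initialisation, printing `within` and `between` shows the trade-off directly: channels whose means drift apart necessarily have less variance left inside each channel.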
Another cool analysis by : experimentally, #BatchNorm increases the robustness of the parameters w.r.t. adversarial perturbations and hence significantly improves the conditioning of the Hessian.