Appendix E — He Initialisation

Consider a sequence of convolutional or dense layers indexed by l. The pre-activations and ReLU activations are {\bf y}_l = {\bf W}_l {\bf x}_l \quad \text{and} \quad {\bf x}_l = \max({\bf y}_{l-1},0).

Assuming the weights and activations are mutually independent, and that the weights and biases have zero mean, the variance of each element of {\bf y}_l is

\mathrm{Var}[y_l] = n_l \mathrm{Var}[w_l x_l] = n_l \mathrm{Var}[w_l]\,\mathrm{E}[x_l^2],

where n_l is the fan-in, i.e. the number of connections feeding into each element of {\bf y}_l.

For ReLU, x_l = 0 whenever y_{l-1} < 0. Since the weights are drawn from a zero-mean symmetric distribution (and the biases are zero), y_{l-1} has zero mean and a distribution symmetric around zero, so \mathrm{E}[x_l^2] = \frac{1}{2}\mathrm{E}[y_{l-1}^2] = \frac{1}{2}\mathrm{Var}[y_{l-1}], and therefore

\mathrm{Var}[y_l] = \frac{1}{2}n_l \mathrm{Var}[w_l] \mathrm{Var}[y_{l-1}].
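
As an illustrative sanity check of this recursion (not part of the original derivation), the NumPy sketch below draws zero-mean Gaussian pre-activations y_{l-1}, rectifies them, applies freshly sampled zero-mean weights, and compares the empirical variance of y_l with \frac{1}{2} n_l \mathrm{Var}[w_l] \mathrm{Var}[y_{l-1}]; the fan-in and variance values are arbitrary choices for the example.

```python
import numpy as np

# Monte-Carlo check of Var[y_l] = 0.5 * n_l * Var[w_l] * Var[y_{l-1}].
# The fan-in and variances below are arbitrary example values.
rng = np.random.default_rng(0)
n_l, var_w, var_y_prev, samples = 256, 0.01, 4.0, 10_000

y_prev = rng.normal(0.0, np.sqrt(var_y_prev), size=(samples, n_l))  # zero-mean y_{l-1}
x_l = np.maximum(y_prev, 0.0)                                       # ReLU: x_l = max(y_{l-1}, 0)
w_l = rng.normal(0.0, np.sqrt(var_w), size=(samples, n_l))          # zero-mean weights
y_l = (w_l * x_l).sum(axis=1)                                       # one output element of y_l

print("empirical Var[y_l]      :", y_l.var())
print("predicted 0.5*n_l*Vw*Vy :", 0.5 * n_l * var_w * var_y_prev)
```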

To keep the variance from growing or shrinking across layers, one can require \mathrm{Var}[w_l] = \frac{2}{n_l}, which is achieved by sampling w_l from a zero-mean Gaussian with standard deviation \sqrt{2/n_l}, i.e. w_l \sim \mathcal{N}(0, 2/n_l).
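
As a minimal sketch of this choice (layer width, depth, and batch size are arbitrary here), the snippet below propagates a random batch through a stack of ReLU layers whose weights are drawn with \mathrm{Var}[w_l] = 2/n_l and prints the pre-activation variance, which settles to a constant instead of exploding or vanishing with depth.

```python
import numpy as np

# Sketch: with Var[w_l] = 2 / n_l, the pre-activation variance stays roughly
# constant through a deep ReLU stack. Width, depth and batch size are arbitrary.
rng = np.random.default_rng(0)
n, depth, batch = 512, 20, 4_000

x = rng.normal(size=(batch, n))                            # input batch
for l in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))     # He init: std = sqrt(2/n_l)
    y = x @ W.T                                            # y_l = W_l x_l
    print(f"layer {l + 1:2d}: Var[y_l] = {y.var():.3f}")   # settles to a constant after layer 1
    x = np.maximum(y, 0.0)                                 # x_{l+1} = max(y_l, 0)
```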