A Notes
Here is a collection of additional notes to complement the handouts. This material is non-examinable… but it is not rare for these topics to come up at technical interviews.
A.1 Universal Approximation Theorem
The Universal approximation theorem (Hornik, 1991) says that ‘a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units’. The result applies for sigmoid, tanh and many other hidden layer activation functions.
The aim of this note is to provide an indication of why this is the case. To make things easier, we are going to look at 1D functions \(x \mapsto f(x)\), and show that the theorem holds for binary neurons. Binary neurons are neuron units for which the activation is a simple threshold, i.e. \([{\bf x}^{\top}{\bf w} \geq 0]\). In our case with a 1D input:
\[\begin{equation} x \mapsto [ xw_1 + w_0 \geq 0] = \begin{cases} 1 & \text{if } xw_1 + w_0 \geq 0 \\ 0 & \text{otherwise} \end{cases} \end{equation}\]
We use binary neurons here, but the idea can be extended to multiple inputs and to other activation functions (e.g. ReLU, sigmoid, etc.).
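As a quick illustration, here is a minimal NumPy sketch of such a 1D binary neuron (the function name and sample values are illustrative, not part of the notes):

```python
import numpy as np

def binary_neuron(x, w1, w0):
    """1D binary (threshold) unit: returns 1 where x*w1 + w0 >= 0, else 0."""
    return (x * w1 + w0 >= 0).astype(float)

x = np.linspace(-2, 2, 5)
print(binary_neuron(x, w1=1.0, w0=-0.5))  # steps from 0 to 1 at x = 0.5
```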
The argument starts with the observation that any given continuous function \(f\) can be approximated by a discretisation as follows:
\[\begin{equation} f(x) \approx \tilde{f}(x) = \sum_{i=0}^n f(x_i) [ x_i \leq x < x_{i+1}]\,, \end{equation}\]
where \(x \mapsto [ x_i \leq x < x_{i+1}]\) is a pulse function that is 1 only in the interval \([x_i, x_{i+1}[\) and zero outside (see the example in Fig. A.1).
The trick is that this pulse can be modelled as the addition of two binary neuron units (see Fig.A.2):
\[\begin{equation} [ x_i \leq x < x_{i+1}] = [ x - x_{i}\geq 0] - [ x - x_{i+1} \geq 0] \end{equation}\]
We can substitute this for all the pulses and regroup the resulting telescoping sum: \[\begin{equation} f(x) \approx \sum_{i=0}^n f(x_i) \left( [ x \geq x_{i}] - [ x \geq x_{i+1}] \right) = f(x_0) + \sum_{i=1}^n \left(f(x_i) - f(x_{i-1})\right) [ x \geq x_{i}] \end{equation}\]
The network corresponding to the discretisation of Fig. A.1 is illustrated in Fig. A.3 below.
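To make the construction concrete, here is a minimal NumPy sketch that builds the step-function approximation out of binary neurons (the target function, the grid, and the helper names are illustrative assumptions, not part of the notes):

```python
import numpy as np

def step(x, threshold):
    """Binary neuron [x >= threshold]."""
    return (x >= threshold).astype(float)

def approximate(f, x_grid, x):
    """Step-function approximation of f built from binary neurons.

    Implements f(x) ~ f(x_0) + sum_i (f(x_i) - f(x_{i-1})) [x >= x_i].
    """
    out = f(x_grid[0]) * np.ones_like(x)
    for i in range(1, len(x_grid)):
        out += (f(x_grid[i]) - f(x_grid[i - 1])) * step(x, x_grid[i])
    return out

f = np.sin                                # example continuous target function
x_grid = np.linspace(0, 2 * np.pi, 50)    # discretisation points x_0, ..., x_n
x = np.linspace(0, 2 * np.pi, 1000)
print(np.max(np.abs(approximate(f, x_grid, x) - f(x))))  # small approximation error
```

Refining the grid (more hidden units) makes the maximum error as small as we like, which is the intuition behind the theorem.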
A.2 Why Does \(L_1\) Regularisation Induce Sparsity?
Sparsity means that some of the weights obtained by minimising the loss function turn out to be exactly zero. Consider a least squares example where our model looks something like this:
\[\begin{equation} y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p. \end{equation}\]
As true machine learning practitioners, we haven’t done our homework, paid no attention to the nature of the problem at hand, and simply included all features in our model, including a whole bunch of features that should not be involved in the prediction. Our hope is that the optimisation/training will figure this out, and that only a sparse subset of weights will turn out to be non-zero, e.g. only \(\hat{w_0},\hat{w_1},\hat{w_2} \neq 0\), with all the other weights estimated as \(\hat{w_{j}} = 0\). In other words, we are hoping that the optimisation will perform variable selection for us. This is a potentially very useful idea for analysing models or for pruning unnecessary parts of a complex model (e.g. reducing the number of filters in a convolution block).
In reality, least squares (and most optimisation techniques) will not do this for you. Because of the variance of the measurements, the estimated weights will generally be close to zero but, crucially, never exactly zero. We could try to manually set some of the small values to zero and check the effect on the performance, but this is ad hoc and not very practical on large models. A better way to achieve sparsity is to add a regularisation term to the loss function. The idea is that, by adding a penalty when weights are non-zero, the optimisation will naturally favour zeros over small non-zero values.
Now, there are fundamental differences between applying an \(L_1\) or an \(L_2\) regularisation. The motivation for this note is to show why applying \(L_1\) induces sparsity and why applying \(L_2\) does not. We will show this for a single weight parameter \(w\), but the argument easily generalises to more parameters.
The core of the argument is that for the optimal value to be zero, the loss function, \(E(w)\), must admit a global minimum at \(w=0\), or at least a local minimum as optimisers like gradient descent might converge to a local minimum. Without regularisation, this minimum is unlikely to be at precisely zero. Figure A.4 shows an example where the loss function admits a minimum at \(\hat{w}=0.295\). Let’s see how adding a regularisation term can shift this minimum towards zero.
\(L_2\) Regularisation
Let us first look at \(L_2\) and see how the regularisation impacts the loss function \(E(w)\). The regularised loss \(E_2(w)\) can be written as follows: \[\begin{equation} E_2(w) = E(w) + \lambda_2 \frac{1}{2} w^2, \end{equation}\] where \(\lambda_2\) controls the amount of regularisation. We can see in Figure A.5 that increasing \(\lambda_2\) for our example creates a stronger local minimum (see the closeup inset) that shifts towards \(\hat{w}=0\). In Figure A.6 we have plotted the position of the local minimum/predicted weight for a wider range of \(\lambda_2\) values and we can see that the predicted weight does not reach zero.
Let’s see, in more detail, what happens near \(w=0\) using a Taylor expansion:
\[\begin{equation} E(w) \approx E(0) + \frac{dE}{dw}(0) w + \frac{1}{2}\frac{d^2E}{dw^2}(0) w^2 \end{equation}\]
The regularised expression can then be approximated as:
\[\begin{equation} E_2(w) \approx E(0) + \frac{dE}{dw}(0) w + \frac{1}{2}\left( \frac{d^2E}{dw^2}(0) + \lambda_2 \right) w^2 \end{equation}\] To have a local minimum at \(w=0\), we would need to have \(\frac{dE_2}{dw}(0)=0\), but instead we have \[\begin{equation} \frac{dE_2}{dw}(0) = \frac{dE}{dw}(0). \end{equation}\] In fact, the regularisation has no effect on the derivative at zero, thus, regardless of how much regularisation we add, we will never have \(\frac{dE_2}{dw}(0)=0\) (unless zero was already a minimum before regularisation). A local minimum near zero is however attained when
\[\begin{equation} \frac{dE_2}{dw}(w) = 0 = \frac{dE}{dw}(0) + w \left(\frac{d^2E}{dw^2}(0) + \lambda_2\right) \end{equation}\]
and it occurs at
\[\begin{equation} \hat{w} = -\frac{\frac{dE}{dw}(0)}{\frac{d^2E}{dw^2}(0) + \lambda_2 }. \end{equation}\]
We have indeed \(\lim_{\lambda_2\rightarrow +\infty}\hat{w} = 0\), but \(\hat{w}\) never actually reaches zero, except in the unlikely case where the optimal weight \(\hat{w}\) was already 0 to start with.
With the \(L_2\) regularisation, we are almost guaranteed that the optimal weights will never be zero. Thus \(L_2\) does not induce sparsity.
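To make this concrete, here is a small numerical check of the closed-form minimiser above, using the illustrative quadratic loss \(E(w) = (w - 0.295)^2\), chosen so that its unregularised minimum matches the value quoted earlier (for a quadratic the Taylor expansion is exact):

```python
# Illustrative quadratic loss with minimum at w = 0.295: E(w) = (w - 0.295)**2
dE0 = -2 * 0.295      # dE/dw at w = 0
d2E0 = 2.0            # d2E/dw2 at w = 0

for lam2 in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w_hat = -dE0 / (d2E0 + lam2)   # minimiser of the L2-regularised loss
    print(f"lambda_2 = {lam2:7.1f}  ->  w_hat = {w_hat:.6f}")
# w_hat shrinks towards 0 as lambda_2 grows, but never reaches exactly 0.
```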
\(L_1\) Regularisation
Now, let’s see how \(L_1\) regularisation does induce sparsity. The regularised loss for \(L_1\) is:
\[\begin{equation} E_1(w) = E(w) + \lambda_1 |w|. \end{equation}\]
Figure A.7 shows that increasing \(\lambda_1\) in our example also moves the local minimum closer to zero, but, this time, we can clearly see in Figure A.8 that zero becomes a local minimum for \(\lambda_1>0.4058\).
Let’s examine this in more detail. A first-order Taylor expansion of \(E\) near zero gives us:
\[\begin{equation} E_1(w) \approx E(0) + \frac{dE}{dw}(0) w + \lambda_1 | w | \end{equation}\]
This time it is a bit trickier as we have a discontinuity of the derivative at zero:
\[\begin{equation} \frac{dE_1}{dw}(w) \approx \frac{dE}{dw}(0) + \begin{cases} \lambda_1 & \text{if } w > 0 \\ -\lambda_1 & \text{if } w < 0 \end{cases}. \end{equation}\]
Note that the derivative near zero now depends on the regulariser, and a local minimum can in fact be formed at \(0\) when
\[\begin{equation} \lambda_1 > \left| \frac{dE}{dw}(0) \right|. \end{equation}\]
Thus, when we increase \(\lambda_1\), as soon as \(\lambda_1 > \left| \frac{dE}{dw}(0) \right|\), zero becomes a local minimum (in our example, we can observe this threshold in Figure A.8 at \(\lambda_1=|\frac{dE}{dw}(0)| = 0.4058\)). Depending on \(\lambda_1\) and \(E(w)\), this local minimum can also be a global minimum. The typical behaviour is that, as you gradually increase \(\lambda_1\), the optimal weight eventually snaps to zero.
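Here is a minimal numerical sketch of this snapping behaviour, again using the illustrative quadratic loss \(E(w) = (w - 0.295)^2\), for which \(|\frac{dE}{dw}(0)| = 0.59\):

```python
import numpy as np

w = np.linspace(-1, 1, 200001)        # fine grid around zero
E = (w - 0.295) ** 2                  # illustrative loss, minimum at w = 0.295

for lam1 in [0.0, 0.3, 0.59, 0.8]:
    E1 = E + lam1 * np.abs(w)         # L1-regularised loss
    w_hat = w[np.argmin(E1)]
    print(f"lambda_1 = {lam1:.2f}  ->  w_hat = {w_hat:.4f}")
# Once lambda_1 reaches |dE/dw(0)| = 0.59, the minimiser snaps to exactly 0.
```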
Take Away
A way to induce sparsity is to add a regulariser that forms a local minimum at zero. The main difference between \(L_1\) and \(L_2\) boils down to this: \(L_2\) is too smooth and does not change the derivative at zero, whereas the gradient discontinuity of \(L_1\) at zero creates a steep dent that can turn zero into a local minimum.
Note that this can be generalised to any \(L_p\) regulariser, defined as \(\|x\|_p^p= \sum_{i=1}^n |x_i|^p\) (where \(p\) is a real number, not to be confused with the number of parameters of our model; it is just the conventional notation for these norms). If \(p>1\), there is no sparsity; if \(p\leq 1\), there is.
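This behaviour is easy to observe with off-the-shelf tools; below is a quick sketch using scikit-learn’s Lasso (\(L_1\)) and Ridge (\(L_2\)) on synthetic data in which only the first two features matter (the data-generation choices are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 useful features

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 4))  # most entries are exactly 0
print("Ridge coefficients:", np.round(ridge.coef_, 4))  # small but not exactly zero
```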
A.3 Kernel Trick
In this note we look back at the kernel trick.
We start by observing that many linear machine learning methods are based on minimising something like:
\[ E({\bf w}) = \mathcal{L}( X {\bf w}, y) + \lambda \| {\bf w} \|^2 \]
For instance, in least squares: \[ \mathcal{L}( X {\bf w}, y) = \sum_{i=1}^n (y_i - {\bf x}_i^{\top} {\bf w})^2 \] and in SVM: \[ \mathcal{L}( X {\bf w}, y) = \sum_{i=1}^n [y_i=0]\max(0, {\bf x}_i^{\top} {\bf w}) + [y_i=1]\max(0, 1 - {\bf x}_i^{\top} {\bf w}) \]
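As a quick sanity check, these two losses (exactly as written above, with 0/1 labels for the SVM variant) might be coded up as follows; the function names are just for illustration:

```python
import numpy as np

def least_squares_loss(X, w, y):
    """Sum of squared residuals: sum_i (y_i - x_i^T w)^2."""
    return np.sum((y - X @ w) ** 2)

def svm_loss(X, w, y):
    """Hinge-style loss with 0/1 labels, as written in the note."""
    scores = X @ w
    return np.sum(np.where(y == 0, np.maximum(0, scores),
                                   np.maximum(0, 1 - scores)))
```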
The term \(\lambda \| {\bf w} \|^2\) is the regularisation term we already saw in linear regression.
When minimising \(E({\bf w})\), the minimiser \(\boldsymbol{\hat{\textbf{w}}}\) is necessarily of the form: \[ \boldsymbol{\hat{\textbf{w}}} = X^{\top} \alpha = \sum_{i=1}^n \alpha_i {\bf x}_i \] for some coefficient vector \(\alpha \in \mathbb{R}^n\).
Proof:
Write the candidate solution as \(\boldsymbol{\hat{\textbf{w}}} = X^{\top} \alpha + {\bf v}\), with \({\bf v}\) a component in the null space of \(X\), i.e. such that \(X{\bf v} = 0\).
We show that \(E(X^{\top} \alpha + {\bf v}) > E(X^{\top} \alpha)\) if \({\bf v} \neq 0\):
\[ \begin{aligned} E(X^{\top} \alpha + {\bf v}) &= \mathcal{L}( X X^{\top} \alpha + X{\bf v} , y) + \lambda \| X^{\top} \alpha + {\bf v}\|^2 \\ &= \mathcal{L}( X X^{\top} \alpha , y) + \lambda\left(\alpha^{\top}XX^{\top}\alpha + 2 \alpha^{\top} X {\bf v} + {\bf v}^{\top}{\bf v} \right) \\ &= \mathcal{L}( X X^{\top} \alpha , y) + \lambda \left(\alpha^{\top}XX^{\top}\alpha + {\bf v}^{\top}{\bf v} \right) \\ &> E(X^{\top} \alpha) \quad \text{if}\, {\bf v} \neq 0 \end{aligned} \]
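A quick numerical check of this argument, on a made-up over-parameterised problem (\(p > n\)) so that \(X\) has a non-trivial null space; all names and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20                      # fewer observations than features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

def E(w):
    """Regularised loss E(w) = L(Xw, y) + lam * ||w||^2, with a squared loss."""
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

alpha = rng.normal(size=n)
w0 = X.T @ alpha                   # a weight vector in the row space of X

_, _, Vt = np.linalg.svd(X)        # rows n..p-1 of Vt span the null space of X
v = Vt[-1]
print(np.allclose(X @ v, 0))       # True: v is (numerically) in the null space
print(E(w0), E(w0 + v))            # adding v only increases the regularised loss
```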
Now, if \({\bf w} = X^{\top}\alpha\), the loss can be reparameterised in terms of \(\alpha\):
\[ E({\bf w}) = E(\alpha)= \mathcal{L}(XX^{\top}\alpha, {\bf y}) + \lambda \alpha^{\top}XX^{\top}\alpha \]
We call \(K = XX^{\top}\) the Kernel Matrix. It is a matrix of dimension \(n \times n\) whose entries are the scalar products between observations: \[ K_{i,j} = {\bf x}_i ^{\top}{\bf x}_j \]
Note that the expression to minimise \[ E(\alpha) = \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha \] only contains matrices and vectors of dimension \(n \times n\) or \(n \times 1\). In fact, even if the features are of infinite dimension (\(p=+\infty\)), our reparameterised problem only depends on the number of observations \(n\).
When we transform the features \({\bf x} \rightarrow \phi({\bf x})\), the expression to minimise keeps the same form: \[ E(\alpha) = \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha \] and the only change occurs in \(K\): \[ K_{i,j} = \phi({\bf x}_i) ^{\top}\phi({\bf x}_j) \]
Thus we never really need to explicitly compute \(\phi\), we just need to know how to compute \(\phi({\bf x}_i) ^{\top}\phi({\bf x}_j)\).
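As an illustration, here is a minimal kernel ridge regression sketch: with a squared loss, the gradient of \(E(\alpha)\) vanishes for \(\alpha = (K + \lambda I)^{-1} {\bf y}\), and an RBF kernel \(k({\bf x}, {\bf x}') = \exp(-\gamma\|{\bf x} - {\bf x}'\|^2)\) plays the role of \(\phi({\bf x}_i)^{\top}\phi({\bf x}_j)\) without ever computing \(\phi\) (the kernel choice and the values of \(\gamma\) and \(\lambda\) are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||a_i - b_j||^2), i.e. phi(a_i)^T phi(b_j) for an RBF feature map."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

# Toy 1D regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lam I)^{-1} y

# Predict at new points: f(x) = sum_i alpha_i k(x, x_i)
X_test = np.linspace(-3, 3, 5)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha
print(np.round(y_pred, 3))                  # should roughly track sin(x) at the test points
print(np.round(np.sin(X_test[:, 0]), 3))
```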