Appendix C — Why Does L_1 Regularisation Induce Sparsity?
In machine learning, a model is described as being sparse when many of its learned parameters (or weights) are exactly zero. Consider a linear regression model:
\begin{equation} y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p. \end{equation}
Imagine a scenario where we have a large number of features, but we suspect that only a small subset of them are truly predictive of the output. If we train a standard regression model, it will likely assign a non-zero weight to every feature. The weights for irrelevant features might be small, but they will rarely be exactly zero. This makes the model difficult to interpret and less efficient.
Sparsity is desirable because it effectively performs automatic feature selection. By driving the weights of irrelevant features to zero, a sparse model reveals which features are most important. This leads to simpler, more interpretable, and often more robust models.
The question, then, is how to encourage a model to learn sparse solutions. While one could manually set small weights to zero after training, this is an ad-hoc approach. A more principled method is to incorporate regularisation into the loss function during training. By adding a penalty term that depends on the magnitude of the weights, we can influence the optimisation process to favour sparsity.
However, not all regularisers are created equal. This note explains why L_1 regularisation is effective at inducing sparsity, while the more common L_2 regularisation is not. We will demonstrate this for a single weight parameter, w, but the argument generalises to higher dimensions.
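Before turning to the analysis, the sketch below previews this contrast empirically. It is a minimal, hypothetical illustration using scikit-learn (the synthetic data, noise level, and penalty strengths are arbitrary choices, not taken from the main text): an L_2-penalised (ridge) and an L_1-penalised (lasso) regression are fitted to data in which only five of fifty features are truly predictive, and the number of coefficients that come out exactly zero is counted.
\begin{verbatim}
# Minimal sketch: L1 (lasso) vs L2 (ridge) regression on synthetic data in
# which only a few of many features are truly predictive. All numerical
# choices here are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 50                             # 200 samples, 50 features
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]   # only 5 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)))
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)))
\end{verbatim}
Typically the ridge fit leaves every coefficient non-zero, while the lasso fit zeroes out most of the forty-five irrelevant ones.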
The core of the argument lies in how the regularisation term alters the shape of the loss function, E(w), near the origin. For a weight to be optimally zero, the regularised loss function must have a minimum at w=0. Figure C.1 shows a typical loss function where the minimum is at some non-zero value, \hat{w}=0.295. Let us see how adding regularisation can shift this minimum to zero.
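The exact form of E(w) behind Figure C.1 is not reproduced here. For the numerical sketches in the remainder of this appendix we therefore use a simple quadratic stand-in whose minimum sits at w=0.295, with the curvature chosen so that |\frac{dE}{dw}(0)| equals the threshold 0.4058 quoted below. This is an assumption made purely for illustration; the argument does not require E(w) to be quadratic.
\begin{verbatim}
# Hypothetical stand-in for the loss E(w) of Figure C.1: a quadratic with
# its minimum at w = 0.295, and curvature chosen so that |dE/dw(0)| matches
# the L1 threshold 0.4058 quoted later. Purely illustrative.
W_HAT = 0.295
CURV = 0.4058 / 0.295      # second derivative, roughly 1.3756

def E(w):
    return 0.5 * CURV * (w - W_HAT) ** 2

def dE(w):
    return CURV * (w - W_HAT)

print(dE(0.0))             # approximately -0.4058: non-zero gradient at the origin
\end{verbatim}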
C.1 L_2 Regularisation
Let us first examine L_2 regularisation. The regularised loss function, E_2(w), is defined as:
\begin{equation} E_2(w) = E(w) + \lambda_2 \frac{1}{2} w^2, \end{equation}
where \lambda_2 > 0 is the regularisation strength. As shown in Figure C.2, increasing \lambda_2 pulls the minimum of the loss function closer to zero. However, as Figure C.3 reveals, the optimal weight \hat{w} approaches zero asymptotically but never actually reaches it (unless it was already zero to begin with).
To understand why, let us analyse the function near w=0 using a Taylor expansion of the original loss, E(w):
\begin{equation} E(w) \approx E(0) + \frac{dE}{dw}(0) w + \frac{1}{2}\frac{d^2E}{dw^2}(0) w^2. \end{equation}
The regularised loss is therefore:
\begin{equation} E_2(w) \approx E(0) + \frac{dE}{dw}(0) w + \frac{1}{2}\left( \frac{d^2E}{dw^2}(0) + \lambda_2 \right) w^2. \end{equation}
For a minimum to exist at w=0, the first derivative of the loss function must be zero at that point. The derivative of E_2(w) is:
\begin{equation} \frac{dE_2}{dw}(w) \approx \frac{dE}{dw}(0) + \left( \frac{d^2E}{dw^2}(0) + \lambda_2 \right) w. \end{equation}
At w=0, this becomes:
\begin{equation} \frac{dE_2}{dw}(0) = \frac{dE}{dw}(0). \end{equation}
This is a crucial result. The L_2 penalty is a smooth function (w^2), and its derivative at w=0 is zero. Consequently, it does not change the gradient of the loss function at the origin. If the original loss function had a non-zero gradient at w=0 (which it typically does, as seen in Figure C.1), the regularised loss will have the exact same gradient at that point. Therefore, w=0 cannot be a minimum.
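A quick finite-difference check makes this concrete (a common penalty strength \lambda = 0.5 and the step h are arbitrary illustrative choices): the slope of the L_2 penalty at the origin vanishes, whereas the slope of the L_1 penalty approached from the right is the constant \lambda itself, a contrast taken up in Section C.2.
\begin{verbatim}
# Finite-difference slopes of the two penalties at w = 0 (lambda = 0.5 is
# an arbitrary illustrative value, h is a small step to the right).
lam, h = 0.5, 1e-6
l2_slope = (0.5 * lam * h**2 - 0.0) / h    # slope of (lambda/2) w^2 at 0+
l1_slope = (lam * abs(h) - 0.0) / h        # slope of lambda |w| at 0+
print(l2_slope, l1_slope)                  # ~2.5e-07 (vanishing) and 0.5
\end{verbatim}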
The new minimum of the regularised function occurs where \frac{dE_2}{dw}(w) = 0, which gives:
\begin{equation} \hat{w} = -\frac{\frac{dE}{dw}(0)}{\frac{d^2E}{dw^2}(0) + \lambda_2 }. \end{equation}
From this expression, we can see that as \lambda_2 \rightarrow \infty, \hat{w} \rightarrow 0. The weight is shrunk towards zero, but never becomes exactly zero. Thus, L_2 regularisation does not induce sparsity.
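Using the quadratic stand-in for E(w) introduced earlier (for which the expansion above is exact), the sketch below evaluates this formula for increasing \lambda_2; the minimiser shrinks towards zero but remains strictly positive.
\begin{verbatim}
# Shrinkage under L2 for the quadratic stand-in loss (minimum at 0.295,
# curvature chosen so that |dE/dw(0)| = 0.4058).
CURV, DE0 = 0.4058 / 0.295, -0.4058        # d2E/dw2 and dE/dw at w = 0

for lam2 in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w_opt = -DE0 / (CURV + lam2)           # closed-form minimiser of E_2(w)
    print(f"lambda2 = {lam2:6.1f}  ->  w_hat = {w_opt:.6f}")
# w_hat shrinks towards 0 as lambda2 grows, but never reaches it exactly.
\end{verbatim}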
C.2 L_1 Regularisation
Now, let us consider L_1 regularisation. The regularised loss is:
\begin{equation} E_1(w) = E(w) + \lambda_1 |w|. \end{equation}
As shown in Figure C.4 and Figure C.5, the effect of the L_1 penalty is qualitatively different. As \lambda_1 increases, the minimum not only moves closer to zero; for \lambda_1 > 0.4058, it snaps to exactly zero.
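This behaviour is easy to reproduce numerically. The sketch below minimises E_1(w) on a fine grid for several values of \lambda_1, again using the quadratic stand-in for E(w) (an assumption; the figures are based on the original loss).
\begin{verbatim}
# Grid minimisation of E(w) + lambda1*|w| for the quadratic stand-in loss.
import numpy as np

W_HAT, CURV = 0.295, 0.4058 / 0.295
w = np.linspace(-1.0, 1.0, 200001)         # fine grid containing w = 0

for lam1 in [0.0, 0.2, 0.4, 0.41, 1.0]:
    E1 = 0.5 * CURV * (w - W_HAT) ** 2 + lam1 * np.abs(w)
    print(f"lambda1 = {lam1:4.2f}  ->  w_hat = {w[np.argmin(E1)]:.4f}")
# Below lambda1 ~ 0.4058 the minimiser is positive; above it, exactly zero.
\end{verbatim}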
The key difference lies in the penalty term |w|. Unlike w^2, the absolute value function is not smooth at the origin; it has a sharp “kink”. This means its derivative is discontinuous at w=0. Let us examine the derivative of the regularised loss, E_1(w):
\begin{equation} \frac{dE_1}{dw}(w) = \frac{dE}{dw}(w) + \lambda_1 \cdot \text{sgn}(w), \end{equation}
where \text{sgn}(w) is the sign function, which is +1 for w>0 and -1 for w<0. For w=0 to be a local minimum of E_1, its slope must be negative just to the left of zero and positive just to the right. Using a first-order Taylor expansion of E(w) around zero, we have \frac{dE}{dw}(w) \approx \frac{dE}{dw}(0) near the origin, so these conditions become:
\begin{equation} \frac{dE}{dw}(0) - \lambda_1 < 0 \quad \text{and} \quad \frac{dE}{dw}(0) + \lambda_1 > 0. \end{equation}
These two conditions can be combined into a single one:
\begin{equation} \lambda_1 > \left| \frac{dE}{dw}(0) \right|. \end{equation}
This result is profound. Unlike the L_2 case, the L_1 penalty introduces a constant force, \lambda_1, that pulls the weight towards zero. If this force is stronger than the gradient of the original loss function at the origin, |\frac{dE}{dw}(0)|, then the optimal solution for the weight becomes exactly zero. This is why L_1 regularisation induces sparsity.
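When E(w) itself is quadratic, say E(w) = \frac{c}{2}(w-a)^2, this condition can be made fully explicit: the L_1-regularised minimiser is the soft-thresholded value \hat{w} = \text{sgn}(a)\max(|a| - \lambda_1/c, 0), which is exactly zero once \lambda_1 \ge c|a| = |\frac{dE}{dw}(0)|. The sketch below checks this closed form against the stand-in loss used throughout (the quadratic form is an assumption; the argument above does not rely on it).
\begin{verbatim}
# Soft-thresholding: closed-form minimiser of (c/2)(w - a)^2 + lambda1*|w|.
import math

def soft_threshold(a, c, lam1):
    return math.copysign(max(abs(a) - lam1 / c, 0.0), a)

A, C = 0.295, 0.4058 / 0.295    # stand-in: minimum at 0.295, |dE/dw(0)| = 0.4058
for lam1 in [0.2, 0.4, 0.41, 1.0]:
    print(f"lambda1 = {lam1:4.2f}  ->  w_hat = {soft_threshold(A, C, lam1):.4f}")
# The minimiser hits exactly zero once lambda1 exceeds |dE/dw(0)| = 0.4058.
\end{verbatim}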
C.3 Takeaways
The ability of L_1 regularisation to produce sparse models stems from the sharp, non-differentiable “kink” in the absolute value function at the origin. This creates a constant penalty gradient that can overcome the gradient of the loss function, forcing weights to become exactly zero.
In contrast, the L_2 penalty is smooth (differentiable) at the origin and its gradient is zero at that point. It therefore cannot create a minimum at zero unless one already existed. It shrinks weights towards zero but does not perform feature selection in the same way as L_1.
This concept can be generalised to the L_p norm, defined as \|w\|_p = (\sum_i |w_i|^p)^{1/p}. Regularisers with p \le 1 (like L_1) induce sparsity, while those with p > 1 (like L_2) do not.
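The sketch below simply evaluates this norm for a few values of p. The relevant geometric point is that the per-coordinate penalty |w_i|^p has a non-vanishing slope at the origin when p \le 1 (and an unbounded one when p < 1), which is the same mechanism exploited above; for p > 1 the slope at the origin is zero, as in the L_2 case.
\begin{verbatim}
# L_p norm of a weight vector for a few values of p (illustrative only).
import numpy as np

def lp_norm(w, p):
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = np.array([0.5, -0.2, 0.0, 1.5])
for p in [0.5, 1.0, 2.0]:
    print(f"p = {p:3.1f}  ->  ||w||_p = {lp_norm(w, p):.4f}")
\end{verbatim}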