Appendix A — Relationship between Error, Loss Function and Maximum Likelihood
As we have seen in the handout on Least Squares, there is a fundamental link between the distribution of the prediction error and the type of loss function you should consider.
Very early on, Gauss connected Least Squares with the principles of probability and with the Gaussian distribution.
Recall that the linear model is: \begin{equation} \mathbf {y} = \mathbf{X} \mathbf{w} + \boldsymbol{\varepsilon} \end{equation}
The error \boldsymbol{\varepsilon} is the random variable that embodies the uncertainty of the model and accounts for the difference between the prediction {\bf x}_i^{\top}{\bf w} and the outcome y_i.
Let’s assume that the error follows a Gaussian distribution, i.e. that \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \begin{equation} p({\varepsilon}_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{{\varepsilon_i}^2}{2\sigma^2}\right) \end{equation}
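As an illustration, here is a minimal Python sketch (not part of the original derivation) that simulates data from this model; the sample size, dimension, true weights and noise level are made-up values chosen only for the example.
\begin{verbatim}
# Minimal sketch: simulate y = X w + eps with Gaussian noise.
# n, d, w_true and sigma are illustrative, made-up values.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.3
w_true = np.array([1.5, -2.0, 0.5])

X = rng.normal(size=(n, d))
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # eps_i ~ N(0, sigma^2)
y = X @ w_true + eps
\end{verbatim}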
We can measure the likelihood of observing y_i given {\bf x}_i and {\bf w}. It is given by: \begin{equation} p(y_i|{\bf x}_i, {\bf w}) = p(\varepsilon_i = {\bf x}_i^{\top}{\bf w} - y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{({\bf x}_i^{\top}{\bf w} - y_i)^2}{2\sigma^2}\right) \end{equation} Assuming independence of the error terms \varepsilon_i, the combined likelihood of observing all outputs {\bf y} given all data {\bf X} is given by
\begin{equation} \begin{aligned} p({\bf y}|{\bf X}, {\bf w}) &= \prod_{i=1}^n p(\varepsilon_i)\\ &= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\sum_{i=1}^n\frac{ \left({\bf x}_i^{\top}{\bf w} - y_i\right)^2}{2\sigma^2}\right) \end{aligned} \end{equation}
The maximum likelihood estimate \boldsymbol{\hat{\textbf{w}}}_{ML} is simply the weight vector {\bf w} that maximises the likelihood p({\bf y}|{\bf X}, {\bf w}):
\begin{equation} \boldsymbol{\hat{\textbf{w}}}_{ML} = \arg\max_{\bf w} p({\bf y}|{\bf X}, {\bf w}) \end{equation}
A more practical, but equivalent, approach is to minimise the negative log likelihood:
\begin{equation} \begin{aligned} \boldsymbol{\hat{\textbf{w}}}_{ML} &= \arg\min_{\bf w} - \log\left(p({\bf y}|{\bf X}, {\bf w})\right) \\ &= \arg\min_{\bf w} \frac{1}{2\sigma^2} \sum_{i=1}^n \left({\bf x}_i^{\top}{\bf w} - y_i\right)^2 + n \log\left(\sqrt{2\pi\sigma^2}\right) \\ &= \arg\min_{\bf w} \sum_{i=1}^n \left({\bf x}_i^{\top}{\bf w} - y_i\right)^2 \end{aligned} \end{equation}
The last step follows because the additive term n\log\left(\sqrt{2\pi\sigma^2}\right) and the factor \frac{1}{2\sigma^2} do not depend on {\bf w} and can therefore be dropped from the minimisation. Thus we’ve shown that the Least Squares estimate is in fact the Maximum Likelihood solution when the error is assumed to be Gaussian.
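This equivalence can also be checked numerically. The sketch below (a simple illustration, not part of the handout's derivation, using made-up synthetic data) fits the same data both with ordinary least squares and by minimising the Gaussian negative log likelihood with a generic optimiser; the two weight vectors agree up to the optimiser's tolerance.
\begin{verbatim}
# Numerical check: minimising the Gaussian negative log likelihood gives
# the same weights as least squares (synthetic data, illustrative values).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=sigma, size=n)

def gaussian_nll(w):
    # -log p(y | X, w) when eps_i ~ N(0, sigma^2)
    r = X @ w - y
    return np.sum(r**2) / (2 * sigma**2) + n * np.log(np.sqrt(2 * np.pi * sigma**2))

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares estimate
w_ml = minimize(gaussian_nll, x0=np.zeros(d)).x   # maximum likelihood estimate

print(np.allclose(w_ls, w_ml, atol=1e-4))         # expected: True
\end{verbatim}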
Now, let’s assume that the error follows a Laplace distribution:
\varepsilon_i \sim \mathrm{Laplace}(0, \lambda). \begin{equation} p({\varepsilon}_i) = \frac{1}{2 \lambda} \exp\left(-\frac{ |{\varepsilon_i}| }{\lambda}\right) \end{equation}
Assuming independence of the error terms \varepsilon_i, the combined likelihood of observing all outputs {\bf y} given all data {\bf X} is this time given by
\begin{equation} \begin{aligned} p({\bf y}|{\bf X}, {\bf w}) &= \prod_{i=1}^n p(\varepsilon_i)\\ &= \left(\frac{1}{2\lambda}\right)^n \exp\left(-\frac{1}{\lambda}\sum_{i=1}^n | {\bf x}_i^{\top}{\bf w} - y_i | \right) \end{aligned} \end{equation}
Taking the negative log likelihood, the only terms that depend on {\bf w} reduce to \frac{1}{\lambda}\sum_{i=1}^n | {\bf x}_i^{\top}{\bf w} - y_i |, so minimising the Mean Absolute Error (MAE) loss is equivalent to finding the maximum likelihood solution when the error follows a Laplace distribution.
Note that minimising the MAE loss is typically trickier, since the absolute value is not differentiable at zero. Convex optimisation techniques developed in the 2000s can solve these kinds of problems; the mathematics involved is beyond the scope of this module.
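As a simple illustration (not one of the dedicated convex solvers mentioned above), the sketch below fits the weights by minimising the MAE loss on synthetic data with Laplace-distributed noise, using a derivative-free optimiser precisely because the absolute value is not smooth at zero; all numerical values are made up for the example.
\begin{verbatim}
# Sketch: MAE minimisation on synthetic Laplace-noise data.
# A derivative-free optimiser (Nelder-Mead) is used here only for
# illustration; dedicated convex solvers are the practical choice.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 200, 3, 0.3                           # illustrative values
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.laplace(scale=lam, size=n)   # eps_i ~ Laplace(0, lam)

def mae_loss(w):
    # Sum of absolute errors: the Laplace NLL up to constants that do not depend on w.
    return np.sum(np.abs(X @ w - y))

w_mae = minimize(mae_loss, x0=np.zeros(d), method="Nelder-Mead").x
print(w_mae)                                      # should be close to w_true
\end{verbatim}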
A.1 Takeaways
The loss function is intimately related to the distribution of your errors. This gives us a way to check that we are using an appropriate loss function. Say you use Least Squares to find the Mean Square Error minimiser {\bf w}_{MSE}. If you compute the prediction errors for {\bf w}_{MSE}, you can then build a histogram of these errors and check that it is indeed close enough to a Gaussian distribution. If the error is far from Gaussian, it may be a good idea to use a different loss function, or to go back to the dataset and remove any spurious outliers.
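A minimal sketch of this residual check is given below, again on made-up synthetic data: fit the least-squares weights, compute the residuals, and compare their histogram against a Gaussian density whose standard deviation matches that of the residuals.
\begin{verbatim}
# Sketch of the residual check: histogram of least-squares residuals
# against a Gaussian density (synthetic data, illustrative values).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, d, sigma = 500, 3, 0.3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=sigma, size=n)

w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)   # Mean Square Error minimiser
residuals = X @ w_mse - y

plt.hist(residuals, bins=30, density=True, alpha=0.6, label="residuals")
grid = np.linspace(residuals.min(), residuals.max(), 200)
s = residuals.std()
plt.plot(grid, np.exp(-grid**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2),
         label="Gaussian with matching std")
plt.legend()
plt.show()
\end{verbatim}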