Appendix D — Kernel Trick
In this note we revisit the kernel trick.
We start from the observation that many linear machine learning methods are based on minimising an objective of the form:
E({\bf w}) = \mathcal{L}( X {\bf w}, y) + \lambda \| {\bf w} \|^2
For instance, in least squares: \mathcal{L}( X {\bf w}, y) = \sum_{i=1}^n (y_i - {\bf x}_i^{\top} {\bf w})^2 and in the SVM (with labels y_i \in \{0,1\}): \mathcal{L}( X {\bf w}, y) = \sum_{i=1}^n [y_i=0]\max(0, 1 + {\bf x}_i^{\top} {\bf w}) + [y_i=1]\max(0, 1 - {\bf x}_i^{\top} {\bf w})
The term \lambda \| {\bf w} \|^2 is the regularisation term we already saw in linear regression.
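To make this concrete, here is a minimal NumPy sketch that evaluates this objective in the least-squares case; the variable names (X, y, w, lam) and the random data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5                     # number of observations and features (illustrative)
X = rng.normal(size=(n, p))      # design matrix
y = rng.normal(size=n)           # targets
lam = 0.1                        # regularisation strength lambda

def objective(w):
    """E(w) = sum_i (y_i - x_i^T w)^2 + lambda * ||w||^2 (least-squares case)."""
    residual = y - X @ w
    return residual @ residual + lam * (w @ w)

w = rng.normal(size=p)
print(objective(w))
```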
When minimising E({\bf w}), the minimiser \boldsymbol{\hat{\textbf{w}}} is necessarily of the form: \boldsymbol{\hat{\textbf{w}}} = X^{\top} \alpha = \sum_{i=1}^n \alpha_i {\bf x}_i, i.e. it lies in the span of the observations.
Proof:
Any candidate {\bf w} can be decomposed as {\bf w} = X^{\top} \alpha + {\bf v}, where {\bf v} lies in the orthogonal complement of the rows of X, i.e. X{\bf v} = 0.
We show that E(X^{\top} \alpha + {\bf v}) > E(X^{\top} \alpha) if {\bf v} \neq 0:
\begin{aligned} E(X^{\top} \alpha + {\bf v}) &= \mathcal{L}( X X^{\top} \alpha + X{\bf v} , y) + \lambda \| X^{\top} \alpha + {\bf v}\|^2 \\ &= \mathcal{L}( X X^{\top} \alpha , y) + \lambda\left(\alpha^{\top}XX^{\top}\alpha + 2 \alpha^{\top} X {\bf v} + {\bf v}^{\top}{\bf v} \right) \\ &= \mathcal{L}( X X^{\top} \alpha , y) + \lambda \left(\alpha^{\top}XX^{\top}\alpha + {\bf v}^{\top}{\bf v} \right) \\ &> E(X^{\top} \alpha) \quad \text{if}\, {\bf v} \neq 0 \end{aligned}
where both simplifications use X{\bf v} = 0, and the final strict inequality holds because \lambda > 0 and {\bf v}^{\top}{\bf v} > 0.
Now, if {\bf w} = X^{\top}\alpha, then
E({\bf w}) = E(\alpha)= \mathcal{L}(XX^{\top}\alpha, {\bf y}) + \lambda \alpha^{\top}XX^{\top}\alpha
We call K = XX^{\top} the kernel matrix. It is an n \times n matrix whose entries are the scalar products between observations: K_{i,j} = {\bf x}_i^{\top}{\bf x}_j
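As a quick numerical check of the reparameterisation {\bf w} = X^{\top}\alpha, the following NumPy sketch (with illustrative data and variable names) verifies that the primal objective E({\bf w}) and the expression \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha coincide for the squared loss:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.1

alpha = rng.normal(size=n)       # arbitrary dual coefficients
w = X.T @ alpha                  # corresponding primal weights w = X^T alpha
K = X @ X.T                      # n x n kernel matrix

# Primal objective: ||Xw - y||^2 + lambda ||w||^2
E_primal = np.sum((X @ w - y) ** 2) + lam * (w @ w)
# Reparameterised objective: ||K alpha - y||^2 + lambda alpha^T K alpha
E_dual = np.sum((K @ alpha - y) ** 2) + lam * (alpha @ K @ alpha)

print(np.isclose(E_primal, E_dual))   # True: both parameterisations give the same value
```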
Note that the expression to minimise E(\alpha) = \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha only involves matrices and vectors of dimension n \times n or n \times 1. In fact, even if the feature space is infinite-dimensional (p=+\infty), the reparameterised problem only depends on the number of observations n.
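For example, with the squared loss one can check that \alpha = (K + \lambda I_n)^{-1} {\bf y} sets the gradient of E(\alpha) to zero, which is the usual kernel ridge regression solution. A minimal sketch, again with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.1

K = X @ X.T                                      # n x n kernel matrix (linear kernel)
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # minimiser of E(alpha) for the squared loss

# Predicting at a new point only needs scalar products with the training points.
x_new = rng.normal(size=p)
f_new = alpha @ (X @ x_new)                      # f(x_new) = sum_i alpha_i x_i^T x_new
print(f_new)
```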
When we transform the features {\bf x} \rightarrow \phi({\bf x}), the expression to minimise keeps the same form: E(\alpha) = \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha; the only change occurs in K: K_{i,j} = \phi({\bf x}_i)^{\top}\phi({\bf x}_j)
Thus we never need to compute \phi explicitly; we only need to know how to compute the scalar products \phi({\bf x}_i)^{\top}\phi({\bf x}_j), i.e. the kernel function evaluated at pairs of observations. This is precisely the kernel trick.
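As an illustration, here is a sketch using the Gaussian (RBF) kernel k({\bf x}, {\bf x}') = \exp(-\| {\bf x} - {\bf x}' \|^2 / 2\sigma^2), whose feature map \phi is infinite-dimensional: K is computed directly from pairwise distances and predictions only ever use kernel evaluations. The helper rbf_kernel, the bandwidth \sigma and the data are specific to this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam, sigma = 0.1, 1.0            # regularisation and kernel bandwidth (illustrative)

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix: entry (i, j) is exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma**2))

# phi is infinite-dimensional for the Gaussian kernel, yet K is only n x n
# and is computed directly from the data, never from phi.
K = rbf_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # squared-loss dual solution, as above

# Predictions also use only kernel evaluations between new and training points.
X_new = rng.normal(size=(3, p))
print(rbf_kernel(X_new, X, sigma) @ alpha)
```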