Appendix D — Kernel Trick

In this note we look back at the kernel trick.

We start by observing that many linear machine learning methods are based on minimising an objective of the form:

E({\bf w}) = \mathcal{L}( X {\bf w}, y) + \lambda \| {\bf w} \|^2

For instance, in least squares: \mathcal{L}( X {\bf w}, y) = \sum_{i=1}^n (y_i - {\bf x}_i^{\top} {\bf w})^2 and in the SVM: \mathcal{L}( X {\bf w}, y) = \sum_{i=1}^n [y_i=0]\max(0, {\bf x}_i^{\top} {\bf w}) + [y_i=1]\max(0, 1 - {\bf x}_i^{\top} {\bf w})

The term \lambda \| {\bf w} \|^2 is the regularisation term we already saw in linear regression.
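As a concrete illustration, here is a minimal numpy sketch (toy data and the function name are invented for illustration) that evaluates E({\bf w}) with the least-squares loss above:

```python
import numpy as np

# Toy data: n = 5 observations, p = 3 features (made-up values for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
lam = 0.1  # regularisation strength lambda

def E(w, X, y, lam):
    """Regularised least-squares objective E(w) = ||Xw - y||^2 + lambda * ||w||^2."""
    residuals = X @ w - y
    return residuals @ residuals + lam * w @ w

w = rng.normal(size=3)
print(E(w, X, y, lam))
```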

The minimiser \boldsymbol{\hat{\textbf{w}}} of E({\bf w}) is necessarily of the form: \boldsymbol{\hat{\textbf{w}}} = X^{\top} \alpha = \sum_{i=1}^n \alpha_i {\bf x}_i \quad \text{for some}\; \alpha \in \mathbb{R}^n

Proof:

Any candidate {\bf w} can be written as {\bf w} = X^{\top} \alpha + {\bf v}, with {\bf v} such that X{\bf v} = 0 (decompose {\bf w} into its component in the row space of X and the remainder, which is orthogonal to every {\bf x}_i).

We show that E(X^{\top} \alpha + {\bf v}) > E(X^{\top} \alpha) if {\bf v} \neq 0:

\begin{aligned} E(X^{\top} \alpha + {\bf v}) &= \mathcal{L}( X X^{\top} \alpha + X{\bf v} , y) + \lambda \| X^{\top} \alpha + {\bf v}\|^2 \\ &= \mathcal{L}( X X^{\top} \alpha , y) + \lambda\left(\alpha^{\top}XX^{\top}\alpha + 2 \alpha^{\top} X {\bf v} + {\bf v}^{\top}{\bf v} \right) \\ &= \mathcal{L}( X X^{\top} \alpha , y) + \lambda \left(\alpha^{\top}XX^{\top}\alpha + {\bf v}^{\top}{\bf v} \right) \quad \text{since}\; X{\bf v} = 0 \\ &> E(X^{\top} \alpha) \quad \text{if}\; {\bf v} \neq 0 \end{aligned}
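As a quick sanity check of this argument (a minimal numpy sketch with invented data, using the least-squares loss), we can verify that a null-space component {\bf v} leaves the loss unchanged and only inflates the regulariser:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 7          # fewer observations than features, so X has a non-trivial null space
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.1

def E(w):
    r = X @ w - y
    return r @ r + lam * w @ w

alpha = rng.normal(size=n)
w_row = X.T @ alpha                      # component in the row space of X

# Build a v with X v = 0 from the right singular vectors of X with zero singular value.
_, _, Vt = np.linalg.svd(X)
v = Vt[n:].T @ rng.normal(size=p - n)    # linear combination of null-space directions

print(np.allclose(X @ v, 0))             # True: v is in the null space of X
print(E(w_row + v) > E(w_row))           # True: v only adds lambda * ||v||^2
```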

Now, if we write {\bf w} = X^{\top}\alpha, then

E({\bf w}) = E(\alpha)= \mathcal{L}(XX^{\top}\alpha, {\bf y}) + \lambda \alpha^{\top}XX^{\top}\alpha

We call K = XX^{\top} the kernel matrix (also known as the Gram matrix). It is an n \times n matrix whose entries are the scalar products between observations: K_{i,j} = {\bf x}_i ^{\top}{\bf x}_j
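In code (a minimal numpy sketch with invented data), the two descriptions of K coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.normal(size=(n, p))

# Kernel (Gram) matrix of pairwise scalar products between observations.
K = X @ X.T                              # shape (n, n)

# Entry-wise definition K[i, j] = x_i . x_j gives the same matrix.
K_check = np.array([[X[i] @ X[j] for j in range(n)] for i in range(n)])
print(np.allclose(K, K_check))           # True
```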

Note that the expression to minimise E(\alpha) = \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha only contains matrices and vectors of dimension n \times n or n \times 1. In fact, even if the feature space is infinite-dimensional (p=+\infty), the reparameterised problem only depends on the number of observations n.
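For instance, with the squared loss the minimisation over \alpha has a closed form: the gradient of E(\alpha) is 2K\left((K + \lambda I)\alpha - {\bf y}\right), so any \alpha solving (K + \lambda I)\alpha = {\bf y} is a minimiser (this is kernel ridge regression). A minimal numpy sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

K = X @ X.T                                       # kernel matrix, (n, n)

# For the squared loss, E(alpha) = ||K alpha - y||^2 + lam * alpha^T K alpha
# is minimised by any alpha with (K + lam * I) alpha = y, since that zeroes
# the gradient 2 K ((K + lam * I) alpha - y).  K + lam * I is invertible for lam > 0.
alpha = np.linalg.solve(K + lam * np.eye(n), y)

w = X.T @ alpha                                   # recover w = X^T alpha if needed
y_fit = X @ w                                     # fitted values on the training inputs
print(np.allclose(y_fit, K @ alpha))              # True: the fit only needs K, not w
```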

When we transform the features {\bf x} \rightarrow \phi({\bf x}), the expression to minimise keeps the same form: E(\alpha) = \mathcal{L}(K\alpha, {\bf y}) + \lambda \alpha^{\top}K\alpha; the only change is in K: K_{i,j} = \phi({\bf x}_i) ^{\top}\phi({\bf x}_j)

Thus we never need to compute \phi explicitly; we only need to know how to compute the scalar product \phi({\bf x}_i) ^{\top}\phi({\bf x}_j). This is the kernel trick.
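To make this concrete, here is a minimal sketch (an illustrative example with the degree-2 polynomial kernel on \mathbb{R}^2, not taken from the notes) where the explicit map \phi and the direct kernel computation agree:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Same scalar product computed directly, without building phi: k(x, z) = (x^T z)^2."""
    return (x @ z) ** 2

rng = np.random.default_rng(4)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(phi(x) @ phi(z), k(x, z)))   # True: both routes give the same value
```

For kernels such as the Gaussian (RBF) kernel, the corresponding \phi is infinite-dimensional, so computing the scalar product directly is the only practical route.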