Chapter 3 Know your Classics

Before we dive into Neural Networks, we must keep in mind that Neural Nets have been around for a while and, until recently, they were not the method of choice for Machine Learning.

A zoo of algorithms exists out there, and we briefly introduce here some of the classic methods for supervised learning. In the following, we look at a few popular classification algorithms.

3.1 k-nearest neighbours

\(k\)-nearest neighbours (\(k\)-NN) is a very simple yet powerful technique. For an input \({\bf x}\), you retrieve the \(k\) nearest neighbours in the training data, then return the majority class among those \(k\) labels. You can also return a confidence, namely the proportion of the \(k\) neighbours that belong to the majority class.

For instance, in the example of Figure 3.1 below, the 3-NN prediction would be the positive class (red cross) with a 66% confidence, and the 5-NN prediction would be the negative class (blue circle) with a 60% confidence.


Figure 3.1: Example of 3-NN and 5-NN.
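As a concrete illustration, here is a minimal sketch of this procedure using scikit-learn's KNeighborsClassifier (the toy data points and the query are made up for the example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2D training set: class 1 = "red cross", class 0 = "blue circle"
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])
y_train = np.array([1, 1, 1, 0, 0, 0])

x_query = np.array([[2.5, 2.0]])

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    pred = knn.predict(x_query)[0]            # majority class among the k neighbours
    conf = knn.predict_proba(x_query).max()   # proportion of that class among the k neighbours
    print(f"{k}-NN prediction: class {pred} with confidence {conf:.0%}")
```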

We show in Figure 3.2 below a few examples of \(k\)-NN outputs on three datasets. For each dataset, we report the decision boundary maps for the 1-NN, 3-NN and 10-NN binary classifications. The colour shades correspond to different likelihoods of belonging to each class (i.e. deep blue = 100% certain to belong to class 0, deep magenta = 100% certain to belong to class 1). Each map is obtained by evaluating the prediction at every point of the 2D input plane.


Figure 3.2: Decision boundaries on 3 problems. The intensity of the shades indicates the certainty we have about the prediction.

Pros:

  • It is a non-parametric technique.
  • It works surprisingly well and you can obtain high accuracy if the training set is large enough.

Cons:

  • Finding the nearest neighbours is computationally expensive and scales poorly with the size of the training set.
  • It may generalise very badly if your training set is small.
  • You don’t learn much about the features themselves.

3.2 Decision Trees

In decision trees (Breiman et al. 1984) and their many variants, each node of the tree is associated with a region of the input space, and internal nodes partition that region into sub-regions, in a divide-and-conquer fashion, as shown in Figure 3.3.


Figure 3.3: Decision-Tree principle.

The regions are split along the axes of the input space (e.g. at each node, a decision is taken according to a binary test such as \(x_2 < 3\)).

Decision Trees do not produce probabilities of belonging to a class, but by considering multiple decision trees and aggregating their outputs, as in Ada Boost (Freund and Schapire 1995) and Random Forests (Ho 1995), we can obtain graded levels of probability (as we did with \(k\)-NN).

As can be seen in Figure 3.4, the decision maps for these techniques produce vertical and horizontal contours, which follow the axes of the input space.


Figure 3.4: Decision maps for Decision Tree, Ada Boost and Random Forest. In Ada Boost and Random Forests, multiple decision trees are used to aggregate a probability on the prediction.
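To make the aggregation idea concrete, here is a short sketch using scikit-learn (the toy data is made up for the example) that compares the class probabilities returned by a single decision tree and by a random forest:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy diagonal class boundary

tree = DecisionTreeClassifier().fit(X, y)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

x_query = np.array([[0.1, -0.05]])        # a point close to the class boundary
print(tree.predict_proba(x_query))        # typically hard 0/1 probabilities
print(forest.predict_proba(x_query))      # averaged over 100 trees: graded probabilities
```

A single fully grown tree sends the query to one leaf, so its probabilities are typically hard 0s and 1s; averaging over the trees of the forest produces the graded shades seen in Figure 3.4.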

Random Forests gained a lot of popularity before the rise of Neural Nets as they can be computed very efficiently. For instance, they were used for body-part identification in the Microsoft Kinect (Shotton et al. 2013) (see demo page).

Pros:

  • It is fast.

Cons:

  • As can be seen in Figure 3.5, decisions are taken along the axes (e.g. \(x_1<3\)), but it could be more efficient to split the classes along a diagonal (e.g. \(x_1<x_2\)):

Figure 3.5: In Decision Trees, the feature space is split along axes (e.g. \(x_1<3\)) but it could be more efficient to split the classes along a diagonal (e.g. \(x_1<x_2\)).

3.2.1 See Also

Ada Boost, Random Forests.

https://www.youtube.com/watch?v=p17C9q2M00Q

3.3 Linear SVM

Until recently, Support Vector Machines were the most popular technique around.

Like in Logistic Regression, SVM starts as a linear classifier: \[ y = [ {\bf x}^{\top}{\bf w} > 0 ] \]

The difference with Logistic Regression lies in the choice of the loss function.

Whereas Logistic Regression uses a loss based on the cross-entropy, the loss function in SVM is based on the Hinge loss:

\[ L_{SVM}( {\bf w}) = \sum_{i=1}^N [y_i=0]\max(0, 1 + {\bf x}_i^{\top} {\bf w}) + [y_i=1]\max(0, 1 - {\bf x}_i^{\top} {\bf w}) \]
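As a minimal sketch, this loss (with labels \(y_i \in \{0, 1\}\)) can be written in a few lines of NumPy:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Hinge loss of a linear classifier, for labels y in {0, 1}.

    X is the (N, p) matrix of observations and w the weight vector.
    """
    s = X @ w            # scores x_i^T w
    t = 2 * y - 1        # map labels {0, 1} -> {-1, +1}
    return np.sum(np.maximum(0.0, 1.0 - t * s))
```

Mapping the labels to \(\pm 1\) lets the two cases of the sum be written as a single \(\max(0, 1 - t_i\, {\bf x}_i^{\top}{\bf w})\) term.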

From a geometrical point of view, SVM seeks to find the hyperplane that maximises the separation between the two classes (see Figure 3.6 below).


Figure 3.6: SVM maximises the separation between classes.

There is a lot more to SVM, but this will not be covered in this module.

3.4 No Free-Lunch Theorem

Note that there is a priori no advantage in using linear SVM over logistic regression in terms of performance alone. It all depends on the type of data you have.

Recall that the choice of loss function directly relates to assumptions you make about the distribution of the prediction errors, and thus about the dataset of your problem.

This is formalised in the no free lunch theorem (Wolpert and Macready 1997), which tells us that classifiers perform equally well when averaged over all possible problems. In other words: your choice of classifier should depend on the problem at hand.


Figure 3.7: No Free Lunch Theorem.

3.5 Kernel Trick

SVM gained popularity when it became associated with the kernel trick.

3.5.1 The Problem with Feature Expansions

Recall that in linear regression, we managed to fit non-linear functions by augmenting the feature space with higher-order polynomials of the observations, e.g. \(x\), \(x^2\), \(x^3\), etc.

What we’ve done is to map the original features into a higher dimensional feature space: \(\phi: {\bf x}\mapsto \phi({\bf x})\). In our case we had:

\[ \phi ({x}) = \left( \begin{matrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \end{matrix} \right) \]

Transforming the original features into more complex ones is a key ingredient of machine learning.

The collected features are usually not optimal for linearly separating the classes, and it is often unclear how they should be transformed (see Figure 3.8). We would like the machine learning technique itself to find a mapping \(\phi\) that recombines the features so as to yield optimal class separation.


Figure 3.8: Feature mapping is used to transform the input data into a new dataset that can be solved using a linear classifier.

Another problem is that the size of the new feature vectors \(\phi({\bf x})\) could potentially grow very large.

Consider a polynomial augmentation, which expands a feature vector with all the monomials of its entries up to degree \(d\). For instance:

\[ \phi \left([ x_1 \,,\, x_2 ]^{\top}\right) = [ 1 \,,\, x_1 \,,\, x_2 \,,\, x_1 x_2 \,,\, x_1^2 \,,\, x_2^2 ]^{\top} \] \[ \phi \left([ x_1 \,,\, x_2 \,,\, x_3 ]^{\top}\right) = [ 1 \,,\, x_1 \,,\, x_2 \,,\, x_3 \,,\, x_1 x_3 \,,\, x_1 x_2 \,,\, x_2 x_3 \,,\, x_1^2 \,,\, x_2^2 \,,\, x_3^2 ]^{\top} \]

It can be shown that for input features of dimension \(p\) and a polynomial of degree \(d\), the transformed features are of dimension \(\frac{(p+d)!}{p!\, d!}\).

For example, if you have \(p=100\) features per observation and you are looking at a polynomial of order 5, the resulting feature vector has a dimension of about 100 million!
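These numbers are easy to verify; here is a quick sanity-check sketch using scikit-learn's PolynomialFeatures and Python's math.comb:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Small case: p = 2 features, degree d = 2 -> (2+2)!/(2! 2!) = 6 features
phi = PolynomialFeatures(degree=2).fit_transform(np.array([[1.0, 2.0]]))
print(phi.shape)         # (1, 6), i.e. [1, x1, x2, x1^2, x1*x2, x2^2]

# Large case: p = 100 features, degree d = 5
print(comb(100 + 5, 5))  # 96560646, i.e. roughly 100 million features
```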

Now, recall that the Least-Squares solution is given by \({\bf w} = (X^{\top}X)^{-1}X^{\top}{\bf y}\); if \(\phi({\bf x})\) is of dimension 100 million, then \(X^{\top}X\) is of size \(10^8 \times 10^8\). This is totally impractical.

We want to transform the original features into higher level features but this seems to come at the cost of greatly increasing the dimension of the original problem.

The Kernel trick offers an elegant solution to this problem and allows us to use very complex mapping functions \(\phi\) without having to ever explicitly compute them.

3.5.2 Step 1: re-parameterisation

In most machine learning algorithms, the loss function usually depends on computing the scores \({\bf x}^{\top}{\bf w}\). We can show (see note A.3) that the optimal weights \(\hat{\bf w}\) can then be re-expressed in terms of the input feature vectors:

\[ \hat{\bf w} = \sum_{i=1}^n \alpha_i {\bf x}_i, \]

where \(\alpha_i\) are new weights defining \({\bf w}\) as a linear combination of the \({\bf x}_i\) data points.

The score \({\bf x}^{\top}{\bf w}\) can now be re-written as:

\[ {\bf x}^{\top}\hat{\bf w} = \sum_{i=1}^n \alpha_i {\bf x}^{\top} {\bf x}_i, \]

where we note that the scalars \({\bf x}^{\top} {\bf x}_i\) are dot-products between feature vectors.

These new weights can be seen as a re-parameterisation of \(\hat{\bf w}\), with the loss \(E({\bf w})\) now re-expressed as \(E({\boldsymbol\alpha})\). They are sometimes called the dual coefficients in SVM.

A priori, it looks like we are making things more complicated for ourselves (and that is a bit true), but look at what happens when we use augmented features:

\[ \phi({\bf x})^{\top}{\bf w} = \sum_{i=1}^n \alpha_i \phi({\bf x})^{\top} \phi({\bf x}_i) \]

To compute \(\phi({\bf x})^{\top}\hat{\bf w}\), we only ever need to know how to compute the dot products \(\phi({\bf x})^{\top} \phi({\bf x}_i)\).
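In code, this re-parameterised score is just a weighted sum of dot products with the training points; a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def dual_score(x, X_train, alpha):
    """Score x^T w, with w re-parameterised as sum_i alpha_i x_i."""
    dots = X_train @ x          # dot products x^T x_i for every training point
    return np.dot(alpha, dots)  # sum_i alpha_i (x^T x_i)

# Equivalent to forming w explicitly:
#   w = X_train.T @ alpha;  score = x @ w
```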

3.5.3 Step 2: the Kernel Functions

Introducing the kernel function as \[ \kappa({\bf u}, {\bf v} ) = \phi({\bf u})^{\top} \phi({\bf v}), \] we can rewrite the scores as: \[ \phi({\bf x})^{\top}\hat{\bf w} = \sum_{i=1}^n \alpha_i \kappa({\bf x}, {\bf x}_i). \]

The key now is that we can define \(\kappa\) without having to explicitly define \(\phi\).

The kernel trick builds on the theory of Reproducing Kernels, which says that for a whole class of kernel functions \(\kappa\), there exists a mapping \(\phi\) such that \(\kappa({\bf u}, {\bf v} ) = \phi({\bf u})^{\top} \phi({\bf v})\). The beauty is that we do not need to know \(\phi\) or even compute it; we only need to know how to compute the kernel function \(\kappa\).

Many kernel functions are possible. For instance, the polynomial kernel is defined as: \[ \kappa({\bf u}, {\bf v}) = (r + \gamma {\bf u}^{\top} {\bf v})^d \] and one can show that this is equivalent to using a polynomial mapping as proposed earlier. Except that instead of requiring hundreds of millions of dimensions, we only need vectors of size \(n\) and to compute \(\kappa({\bf u}, {\bf v})\), which is linear in \(p\).

The most commonly used kernel is probably the Radial Basis Function (RBF) kernel: \[ \kappa({\bf u}, {\bf v}) = e^{- \gamma \| {\bf u} - {\bf v}\|^2 } \]

The induced mapping \(\phi\) is infinite-dimensional, but that is fine because we never need to evaluate \(\phi({\bf x})\).
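Both kernels are one-liners, and the kernelised score never forms \(\phi\); here is a sketch (function names and default parameter values are my own choices):

```python
import numpy as np

def poly_kernel(u, v, r=1.0, gamma=1.0, d=3):
    return (r + gamma * np.dot(u, v)) ** d

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_score(x, X_train, alpha, kernel):
    # sum_i alpha_i * kappa(x, x_i), without ever computing phi(x)
    return sum(a * kernel(x, x_i) for a, x_i in zip(alpha, X_train))
```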

3.5.4 Understanding the RBF

To have some intuition about these kernels, consider the kernel trick for a RBF kernel. The score for a particular observation \({\bf x}\) is: \[ \mathrm{score}({\bf x}) = \sum_{i=1}^n \alpha_i \kappa({\bf x}, {\bf x}_i) \]

The kernel function \(\kappa({\bf u}, {\bf v}) = e^{- \gamma \| {\bf u} - {\bf v}\|^2 }\) is a measure of similarity between observations: if the two observations are close, \(\kappa({\bf u}, {\bf v}) \approx 1\); if they are very different, \(\kappa({\bf u}, {\bf v}) \approx 0\). We can thus see it as a neighbourhood indicator function, where the scale of the neighbourhood is controlled by \(\gamma\).

(As you can imagine, this intuition is less direct for other kernels.)

Let’s choose \(\alpha_i = 1\) for positive observations and \(\alpha_i = -1\) for negative observations. This is obviously not optimal, but it is in fact close to what happens in SVM. We now have something resembling \(k\)-NN. Indeed, look at the score:

\[ \begin{aligned} \mathrm{score}({\bf x}) &= \sum_{i=1}^n \alpha_i \kappa({\bf x}, {\bf x}_i) \\ &\approx \sum_{i \in \text{neighbours of ${\bf x}$}} \begin{cases} 1 & \text{if $y_i$ positive} \\ -1 & \text{if $y_i$ negative} \end{cases} \\ &\approx \text{nb of positive neighbours of ${\bf x}$} - \text{nb of negative neighbours of ${\bf x}$} \end{aligned} \]

This makes sense: if \({\bf x}\) has more positive than negative neighbours in the dataset, then its score should be high, and its prediction positive.

Thus we have here something similar to \(k\)-NN. The main difference is that instead of retrieving a fixed number \(k\) of closest neighbours, we consider all the neighbours within some radius (controlled by \(\gamma\)).
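The analogy can be written out directly: with \(\alpha_i = \pm 1\), the kernelised score literally counts nearby positives minus nearby negatives. A toy sketch (the function name is mine):

```python
import numpy as np

def soft_knn_score(x, X_train, y_train, gamma=1.0):
    """RBF score with alpha_i = +1 for positive and -1 for negative observations."""
    alpha = 2 * y_train - 1                                  # labels {0, 1} -> {-1, +1}
    k = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))  # ~1 for close points, ~0 for far ones
    return np.dot(alpha, k)  # ~ (positive neighbours) - (negative neighbours)
```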

3.5.5 Support Vectors

In SVM, the actual values of \(\hat{\alpha}_i\) are estimated by minimising the Hinge loss.

The optimisation falls outside the scope of this course material. We could use Gradient Descent but, as it turns out, the Hinge loss makes this a constrained optimisation problem, for which we can find an off-the-shelf solver (e.g. minimize from scipy.optimize). The good news is that we can find the global minimum without having to worry about convergence issues.
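For the curious, here is a rough sketch of what such an off-the-shelf minimisation could look like. It uses an unconstrained, regularised Hinge loss over the dual coefficients purely as an illustration (the lam parameter and names are my own choices); dedicated SVM solvers actually work on a constrained formulation:

```python
import numpy as np
from scipy.optimize import minimize

def fit_alpha(K, y, lam=1e-2):
    """Sketch: fit the dual coefficients alpha by minimising a regularised Hinge loss.

    K is the n x n kernel matrix with K[i, j] = kappa(x_i, x_j); y are the {0, 1} labels.
    """
    t = 2 * y - 1                                    # labels {0, 1} -> {-1, +1}

    def objective(alpha):
        s = K @ alpha                                # scores of all training points
        hinge = np.sum(np.maximum(0.0, 1.0 - t * s))
        return hinge + lam * alpha @ K @ alpha       # penalty term encourages a large margin

    # The Hinge loss is only piecewise differentiable, so this is fine for a demo
    # but not how production SVM solvers are implemented.
    return minimize(objective, np.zeros(len(y)), method="L-BFGS-B").x
```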

After optimisation, we find that, indeed, \(-1\leq \hat{\alpha}_i \leq 1\), with the sign of \(\hat{\alpha}_i\) indicating the class membership. This thus follows a similar idea to what was proposed previously.


Figure 3.9: SVM using a RBF kernel with support vectors

Above is an example using SVM-RBF, with contour lines of the score. The thickness of the outer circle for each observation \({\bf x}_i\) is proportional to \(|\alpha_i| \leq 1\) (no black circle means \(\alpha_i=0\)).

Only a fraction of the datapoints have non-zero \(\alpha_i\). These special datapoints are called support vectors, and they typically lie near the class boundary. Only the support vectors are required to compute predictions, which means that predictions can be made (a bit) more efficiently.
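In scikit-learn, the support vectors and their dual coefficients are exposed directly after fitting; a quick sketch on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # toy circular class boundary

svc = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(svc.support_vectors_.shape)  # only a fraction of the 200 points are kept
print(svc.dual_coef_)              # the (signed) alpha_i of those support vectors
```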

3.5.6 What does it look like?


Figure 3.10: Decision Boundaries for SVM using linear and polynomial kernels.


Figure 3.11: Decision Boundaries for SVM using Gaussian kernels. The value of gamma controls the smoothness of the boundary.

3.5.7 Remarks

Support vector machines are not the only algorithm that can avail of the kernel trick. Many other linear models (including logistic regression) can be enhanced in this way. They are known as kernel methods.

A major drawback to kernel methods is that the cost of evaluating the decision function is proportional to the number of training examples, because the \(i^{th}\) observation contributes a term \(\alpha_i \kappa({\bf x},{\bf x}_i)\) to the decision function.

SVM mitigates this somewhat by learning which examples contribute the most (the support vectors).

The cost of training is, however, still high for large datasets (e.g. with tens of thousands of datapoints).

Evidence that deep learning could outperform kernel SVM on large datasets emerged in 2006, when a team led by G. Hinton demonstrated that a neural network could beat SVMs on the MNIST benchmark. The real tipping point occurred in 2012 with (Krizhevsky, Sutskever, and Hinton 2012) (see introduction).

3.6 Take Away

Neural Nets have existed for a while, but it is only since 2012 that they have started to surpass all other techniques.

Kernel based techniques have been very popular for a while as they offer an elegant way of transforming input features into more complex features that can then be linearly separated.

The problem with kernel techniques is that they cannot deal efficiently with large datasets (e.g. more than tens of thousands of observations).

3.6.1 See Also

The topic is related to Gaussian Processes, Reproducing kernel Hilbert spaces, kernel Logistic Regression.

Some useful references:

Laurent El Ghaoui’s lecture at Berkeley

Eric Kim’s python tutorial on SVM

References

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth; Brooks.
Freund, Yoav, and Robert E. Schapire. 1995. “A Desicion-Theoretic Generalization of on-Line Learning and an Application to Boosting.” In Computational Learning Theory, edited by Paul Vitányi, 23–37. Berlin, Heidelberg: Springer Berlin Heidelberg.
Ho, Tin Kam. 1995. “Random Decision Forests.” In Proceedings of 3rd International Conference on Document Analysis and Recognition, 1:278–282 vol.1. https://doi.org/10.1109/ICDAR.1995.598994.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105. Curran Associates, Inc. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
Shotton, Jamie, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. 2013. “Real-Time Human Pose Recognition in Parts from Single Depth Images.” Commun. ACM 56 (1): 116–24. https://doi.org/10.1145/2398356.2398381.
Wolpert, David H., and William G. Macready. 1997. “No Free Lunch Theorems for Optimization.” IEEE Transactions on Evolutionary Computation 1 (1): 67–82.