3  A Tour of Classic Classifiers

Before we delve into the world of neural networks, it is important to recognise that they are not a recent invention. For many years, other machine learning algorithms were the preferred methods for a wide range of tasks. In this chapter, we will briefly introduce some of the most influential classic supervised learning algorithms for classification. Given the scope of this chapter, we will only touch upon these techniques, as some would traditionally warrant dedicated modules for in-depth study.

3.1 k-Nearest Neighbours (k-NN)

The k-nearest neighbours (k-NN) algorithm is a simple yet powerful non-parametric method. To classify a new data point, {\bf x}, the algorithm identifies the k closest data points in the training set (its “neighbours”). The new data point is then assigned to the class that is most common among its k neighbours. The confidence of the prediction can be expressed as the proportion of neighbours belonging to the majority class.

For example, in Figure 3.1, if we use k=3, the prediction for the new data point (with the question mark) would be the positive class (red cross) with 66% confidence. However, if we use k=5, the prediction would be the negative class (blue circle) with 60% confidence.

Figure 3.1: An illustration of k-NN with k=3 and k=5.
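
The procedure can be sketched in a few lines with scikit-learn. This is a minimal illustration only; the toy data below is made up, so the numbers will differ from Figure 3.1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# A tiny, made-up 2D training set: 0 = negative class, 1 = positive class.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

x_new = np.array([[2.5, 2.0]])  # the point we want to classify

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # predict_proba returns the proportion of the k neighbours in each class,
    # i.e. the "confidence" described above.
    print(k, knn.predict(x_new), knn.predict_proba(x_new))
```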

Figure 3.2 shows the decision boundaries produced by k-NN for different values of k on three different datasets. The colour shading indicates the predicted probability of belonging to each class. As you can see, the decision boundaries become smoother as k increases.

Figure 3.2: Decision boundaries for k-NN on three different classification problems. The intensity of the colour indicates the confidence of the prediction.

Pros:

  • Simple and intuitive: The algorithm is easy to understand and implement.
  • Non-parametric: It makes no assumptions about the underlying data distribution.
  • Effective with large datasets: With a sufficiently large training set, k-NN can achieve high accuracy.

Cons:

  • Computationally expensive: Finding the nearest neighbours can be slow, especially with large datasets.
  • Sensitive to small datasets: The algorithm can perform poorly if the training set is small or not representative of the true data distribution.
  • Lack of interpretability: The model does not provide insights into the importance of different features.

3.2 Decision Trees

Decision trees (Breiman et al. 1984) and their more advanced variants, like Random Forests and AdaBoost, are another popular class of algorithms. A decision tree partitions the input space into a set of rectangular regions, following a “divide and conquer” strategy, as illustrated in Figure 3.3.

Figure 3.3: The principle of a decision tree.

At each internal node of the tree, a decision is made based on a simple test, such as “is feature x_2 less than 3?”. This process is repeated until a leaf node is reached, which corresponds to a specific class label.

While a single decision tree does not produce probabilistic predictions, ensemble methods like AdaBoost (Freund and Schapire 1995) and Random Forests (Ho 1995) combine the outputs of multiple decision trees to generate probabilities, similar to how we can get a confidence score with k-NN.
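
As a minimal sketch (using scikit-learn, on toy data of our own choosing), a single tree and a random forest can be compared as follows; the forest averages the votes of its trees to produce a probability:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy 2D data: class 1 iff x_2 > 3 (a single axis-aligned split suffices).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 2))
y = (X[:, 1] > 3).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

x_new = np.array([[2.0, 3.1]])
print(tree.predict(x_new))           # hard class label from one tree
print(forest.predict_proba(x_new))   # averaged vote across the 100 trees
```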

As shown in Figure 3.4, the decision boundaries of these models are composed of vertical and horizontal lines, aligned with the axes of the input space and corresponding to the tests performed (e.g. x_2 > 3, x_1 > 2, etc.).

Figure 3.4: Decision boundaries for a single Decision Tree, AdaBoost, and Random Forest. The ensemble methods produce smoother boundaries and probabilistic predictions.

Random Forests were particularly popular before the widespread adoption of neural networks due to their computational efficiency. A notable application was the real-time body part tracking in the Microsoft Kinect (Shotton et al. 2013) (see demo page).

Pros:

  • Fast and efficient: Decision trees are relatively fast to train and use for prediction.
  • Interpretable: The tree structure provides a clear and understandable representation of the decision-making process.

Cons:

  • Axis-aligned splits: Decision trees can only create splits that are parallel to the feature axes. This can be inefficient if the true decision boundary is diagonal. For instance, the tree decomposition from Figure 3.4 would have been more efficient had we used a diagonal split such as x_1 < x_2, as shown in Figure 3.5.

Figure 3.5: Decision trees can only split the feature space along the axes, but it can be better to separate the dataset with an off-axis cut (e.g. by testing x_1 < x_2).

3.3 Linear SVM

Until the rise of deep learning, Support Vector Machines (SVMs) were among the most popular classification algorithms.

Similar to Logistic Regression, a linear SVM is a linear classifier that makes predictions based on a linear combination of the input features:

y = [ {\bf x}^{\top}{\bf w} > 0 ]

The key difference between SVM and logistic regression lies in the loss function used for training.

While logistic regression uses the cross-entropy loss, SVM employs the hinge loss:

L_{SVM}( {\bf w}) = \sum_{i=1}^N [y_i=0]\max(0, 1 + {\bf x}_i^{\top} {\bf w}) + [y_i=1]\max(0, 1 - {\bf x}_i^{\top} {\bf w})
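
As a sketch, the hinge loss above can be written directly in NumPy (assuming, as in the formula, that the labels y_i are encoded as 0/1 to match the indicator brackets):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Hinge loss for 0/1 labels, as in the formula above."""
    scores = X @ w
    neg = np.maximum(0.0, 1.0 + scores[y == 0]).sum()   # [y_i = 0] terms
    pos = np.maximum(0.0, 1.0 - scores[y == 1]).sum()   # [y_i = 1] terms
    return neg + pos
```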

Geometrically, the hinge loss encourages the model to find a hyperplane that maximizes the margin, or the distance, between the two classes (see Figure 3.6).

Figure 3.6: SVM aims to find the hyperplane that maximizes the margin between the two classes.

There is much more to SVMs, but a full treatment is beyond the scope of this module.

3.4 The No-Free-Lunch Theorem

It is important to note that there is no single best classifier for all problems. The performance of a classifier depends heavily on the nature of the data.

Recall that the choice of loss function directly relates to the assumptions you make about the distribution of the prediction errors, and thus about the dataset of your problem.

This is formalised by the No-Free-Lunch Theorem (Wolpert and Macready 1997), which states that, when averaged over all possible problems, all classifiers perform equally well. In other words, the choice of classifier should always be guided by the specific characteristics of the problem at hand.

Figure 3.7: Illustration of the No-Free-Lunch Theorem.

3.5 The Kernel Trick

SVMs gained immense popularity with the introduction of the kernel trick.

3.5.1 The Challenge of Feature Expansion

Recall from our discussion of linear regression that we can fit non-linear relationships by augmenting the feature space with higher-order terms (e.g., x, x^2, x^3). This is a form of feature mapping, where we transform the original features into a higher-dimensional space: \phi: {\bf x}\mapsto \phi({\bf x}). For example:

\phi ({x}) = \left( \begin{matrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \end{matrix} \right)

Feature transformation is a fundamental concept in machine learning. The original features are often not sufficient to linearly separate the classes, and it is not always clear how to best transform them (see Figure 3.8).

Figure 3.8: Feature mapping transforms the input data into a new space where a linear classifier can be used.

A major challenge with feature expansion is that the dimensionality of the new feature space can grow very rapidly. For a polynomial expansion of degree d on an input feature vector of dimension p, the new feature vector will have a dimension of \frac{(p+d)!}{p!\, d!}.

For instance, with p=100 features and a polynomial of degree 5, the resulting feature vector would have a dimension of approximately 100 million. This makes computations, such as the least-squares solution {\bf w} = (X^{\top}X)^{-1}X^{\top}{\bf y}, completely impractical, as X^{\top}X would be a 10^8 \times 10^8 matrix.
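
This number can be checked directly; here is a quick computation with scipy, included purely for illustration:

```python
from scipy.special import comb

p, d = 100, 5
dim = comb(p + d, d, exact=True)   # (p + d)! / (p! d!)
print(dim)                         # 96_560_646, i.e. roughly 10^8
```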

The kernel trick provides an elegant solution to this problem, allowing us to work with very complex, high-dimensional feature mappings without ever explicitly computing them.

3.5.2 Step 1: Re-parameterization

In many machine learning algorithms, the loss function depends on the score, which is calculated (see previous chapter) as {\bf x}^{\top}{\bf w}. It can be shown (see Appendix D) that the optimal weight vector, \hat{\bf w}, can be expressed as a linear combination of the input feature vectors:

\hat{\bf w} = \sum_{i=1}^n \alpha_i {\bf x}_i,

where the \alpha_i are a new set of weights. These new weights are sometimes called the dual coefficients in SVM. The score can then be rewritten as:

{\bf x}^{\top}\hat{\bf w} = \sum_{i=1}^n \alpha_i {\bf x}^{\top} {\bf x}_i.

Notice that the score now depends on the dot products between feature vectors. This re-parameterization is key to the kernel trick. When we apply a feature mapping \phi, the score becomes:

\phi({\bf x})^{\top}\hat{\bf w} = \sum_{i=1}^n \alpha_i \phi({\bf x})^{\top} \phi({\bf x}_i)

To compute the score in this high-dimensional space, we only need to be able to compute the dot products \phi({\bf x})^{\top} \phi({\bf x}_i).
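
A quick NumPy check of this re-parameterization in the plain linear case (the data and coefficients below are random, made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))     # n = 10 training points in p = 4 dimensions
alpha = rng.normal(size=10)      # dual coefficients
x = rng.normal(size=4)           # a new query point

w_hat = X.T @ alpha              # w_hat = sum_i alpha_i x_i
score_primal = x @ w_hat
score_dual = alpha @ (X @ x)     # sum_i alpha_i <x, x_i>
print(np.isclose(score_primal, score_dual))  # True
```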

3.5.3 Step 2: Kernel Functions

We define a kernel function as: \kappa({\bf u}, {\bf v}) = \phi({\bf u})^{\top} \phi({\bf v}).

This allows us to rewrite the score as: \phi({\bf x})^{\top}\hat{\bf w} = \sum_{i=1}^n \alpha_i \kappa({\bf x}, {\bf x}_i).

The key here is that we can often define and compute the kernel function \kappa without ever explicitly defining or computing the feature mapping \phi. The theory of Reproducing Kernel Hilbert Spaces (RKHS) guarantees that for a wide class of kernel functions, a corresponding mapping \phi does indeed exist.

Many different kernel functions are available. For example, the polynomial kernel is defined as: \kappa({\bf u}, {\bf v}) = (\gamma {\bf u}^{\top} {\bf v} + r)^d. This kernel is equivalent to the polynomial feature mapping of degree d we discussed earlier (see (Wikipedia 2025a)), but it avoids the computational explosion in dimensionality.
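
As a sanity check, the sketch below (with our own choice of explicit feature map for d = 2 and made-up values of \gamma and r) shows that the kernel value matches an ordinary dot product in the expanded space:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)
gamma, r = 0.5, 1.0

# Kernel trick: evaluate the degree-2 polynomial kernel directly.
k_uv = (gamma * (u @ v) + r) ** 2

# One explicit feature map phi that reproduces this kernel for d = 2.
def phi(x):
    return np.concatenate([
        gamma * np.outer(x, x).ravel(),     # degree-2 terms
        np.sqrt(2 * gamma * r) * x,         # degree-1 terms
        [r],                                # constant term
    ])

print(np.isclose(k_uv, phi(u) @ phi(v)))    # True
```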

The most commonly used kernel is the Radial Basis Function (RBF) kernel (see (Wikipedia 2025b)): \kappa({\bf u}, {\bf v}) = e^{- \gamma \| {\bf u} - {\bf v}\|^2 }

The feature mapping \phi induced by the RBF kernel is infinite-dimensional, but we never need to compute it directly. A finite approximation of the mapping can be obtained by taking cosine/sine projections of the input feature vector onto a set of random directions {\bf w}_{1}, \ldots, {\bf w}_{D}:

\varphi ({\bf x})\approx{\frac {1}{\sqrt {D}}}[\cos ( {\bf w}_{1}^\top{\bf x}) ,\sin ( {\bf w}_{1}^\top{\bf x}) ,\ldots ,\cos({\bf w}_{D}^\top{\bf x}) ,\sin({\bf w}_{D}^\top{\bf x}) ]^{\top}
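
Here is a small numerical check of this approximation (a sketch only; we assume the random directions are drawn from a Gaussian whose scale matches \gamma, which is one standard choice):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, D, p = 0.5, 5000, 3
u, v = rng.normal(size=p), rng.normal(size=p)

k_exact = np.exp(-gamma * np.sum((u - v) ** 2))   # exact RBF kernel value

# Random directions w_j ~ N(0, 2*gamma*I), so that on average
# cos(w_j . (u - v)) reproduces exp(-gamma * ||u - v||^2).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, p))

def varphi(x):
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

print(k_exact, varphi(u) @ varphi(v))   # the two values are close
```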

3.5.4 Understanding the RBF Kernel

To gain some intuition for how the RBF kernel works, let us consider the score for a particular data point {\bf x}: \mathrm{score}({\bf x}) = \sum_{i=1}^n \alpha_i \kappa({\bf x}, {\bf x}_i)

The kernel function \kappa({\bf u}, {\bf v}) = e^{- \gamma \| {\bf u} - {\bf v}\|^2 } acts as a measure of similarity between two data points. If {\bf u} and {\bf v} are close, \kappa({\bf u}, {\bf v}) \approx 1. If they are far apart, \kappa({\bf u}, {\bf v}) \approx 0. The parameter \gamma controls the scale of this neighbourhood. As you can imagine, this interpretation is less intuitive for other kernels.

If we were to set \alpha_i = 1 for positive examples and \alpha_i = -1 for negative examples (which is a simplification of what SVM actually does), the score would be:

\begin{aligned} \mathrm{score}({\bf x}) &= \sum_{i=1}^n \alpha_i \kappa({\bf x}, {\bf x}_i) \\ &\approx \sum_{i \in \text{neighbours of ${\bf x}$}} \begin{cases} 1 & \text{if $y_i$ positive} \\ -1 & \text{if $y_i$ negative} \end{cases} \\ &\approx \text{number of positive neighbours of ${\bf x}$} - \text{number of negative neighbours of ${\bf x}$} \end{aligned}

This is similar to k-NN. The score is high if a data point has more positive neighbours than negative neighbours. The main difference is that instead of a fixed number of neighbours (k), we consider all neighbours within a certain radius (controlled by \gamma).

3.5.5 Support Vectors

In an SVM, the optimal values of \hat{\alpha}_i are found by minimising the hinge loss. This is a constrained optimisation problem that can be solved using off-the-shelf solvers. The solution has the property that many of the \alpha_i values are exactly zero.

The data points for which \alpha_i is non-zero are called support vectors. These are typically the points that lie closest to the decision boundary (see Figure 3.9). Only these support vectors are needed to make predictions, which can make the prediction process more efficient.

Figure 3.9: An SVM with an RBF kernel. The support vectors are the data points with non-zero alpha values. Here they are highlighted with an outer circle, whose thickness is proportional to |\alpha_i|.
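
In practice, the support vectors and their dual coefficients can be inspected after fitting. The sketch below uses scikit-learn on made-up data; the attributes support_vectors_ and dual_coef_ hold the support vectors and their signed \alpha_i values.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: class 1 iff the point lies inside a circle of radius 2.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 4).astype(int)

svm = SVC(kernel="rbf", gamma=0.5).fit(X, y)

print(svm.support_vectors_.shape)  # only a subset of the 200 training points
print(svm.dual_coef_[0, :5])       # signed alpha_i of the first few support vectors
```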

Figure 3.10 shows the decision boundaries for SVMs with different polynomial kernels. As you can see, the decision boundaries are ellipses or hyperbolas. Examples of decision boundaries for the RBF kernel are shown in Figure 3.11. We can clearly see how the gamma parameter controls the smoothness of the boundary.

Figure 3.10: Decision boundaries for SVMs with linear and polynomial kernels.
Figure 3.11: Decision boundaries for SVMs with RBF kernels. The gamma parameter controls the smoothness of the boundary.

3.5.6 Remarks

  • The kernel trick is not limited to SVMs. Many other linear models, such as logistic regression, can be “kernelised.” These are known as kernel methods.
  • A major drawback of kernel methods is that the computational cost of making predictions scales with the number of training examples (as with k-NN).
  • The training time for kernel methods can also be high for large datasets (e.g., tens of thousands of data points).

Evidence that deep learning could outperform kernel SVMs on large datasets began to emerge in 2006. The real turning point came in 2012 with the success of AlexNet (Krizhevsky, Sutskever, and Hinton 2012) in the ImageNet competition.

3.6 Takeaways

  • Neural networks have been around for a long time, but it is only since 2012 that they have started to surpass other techniques in popularity and performance.
  • Random Forests and SVMs with an RBF kernel are very effective solutions when the dataset is relatively small (e.g. fewer than tens of thousands of observations).
  • Kernel methods provide an elegant way to handle non-linear data by implicitly mapping it to a high-dimensional feature space.
  • The computational cost of kernel methods can be a significant drawback for large datasets.