4  Evaluating Classifier Performance

In the preceding chapters, we have explored a variety of classification algorithms, including Logistic Regression, Support Vector Machines (SVMs), Decision Trees, and k-Nearest Neighbours (k-NN). We have seen how each of these models learns to draw a decision boundary to separate different classes in our feature space.

Figure 4.1: Classification results for some popular classifiers on a sample dataset.

Looking at these plots gives us a qualitative sense of how the classifiers behave, but it is not enough. To build effective machine learning systems, we need to move beyond visual intuition. We need a rigorous, quantitative way to answer critical questions: How well does a given model actually perform, and which of several candidate models should we prefer?

To do this, we need to establish a set of standard, objective evaluation metrics that allow us to score and compare models in a consistent and meaningful way.

4.1 Metrics for Binary Classification

Let us begin with the most common scenario: binary classification. Here, the outcome belongs to one of two classes, which we typically label as positive (class 1) and negative (class 0). For any prediction our classifier makes, there are four possible outcomes:

  • True Positive (TP): The model correctly predicts the positive class. (Predicts 1, and the true class is 1).
  • True Negative (TN): The model correctly predicts the negative class. (Predicts 0, and the true class is 0).
  • False Positive (FP): The model incorrectly predicts the positive class. (Predicts 1, but the true class is 0). This is also known as a Type I error.
  • False Negative (FN): The model incorrectly predicts the negative class. (Predicts 0, but the true class is 1). This is also known as a Type II error.

These four outcomes form the basis of nearly all binary classification metrics.

4.1.1 The Confusion Matrix

The most fundamental tool for summarising a classifier’s performance is the confusion matrix. It is a simple table that lays out the counts of TP, TN, FP, and FN, providing a complete picture of the model’s predictions versus the actual ground truth.

              Actual: Negative (0)   Actual: Positive (1)
Predicted: 0  TN                     FN
Predicted: 1  FP                     TP

The structure of a confusion matrix.
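
In practice, these counts are rarely tallied by hand. Below is a minimal sketch using scikit-learn, with made-up label vectors purely for illustration. Note that scikit-learn's confusion_matrix places the actual classes on the rows and the predicted classes on the columns, i.e. the transpose of the layout shown above.

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and predictions, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# C[i, j] counts examples whose true class is i and predicted class is j
# (actual on rows, predicted on columns).
C = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = C.ravel()  # unpacking order for binary labels {0, 1}
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3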

For example, the confusion matrices for the classifiers shown in Figure 4.1 are as follows:

Table 4.1: Confusion Matrices for the classifiers shown in Figure 4.1
(a) RBF SVM classifier.
              Actual: 0   Actual: 1
Predicted: 0  TN=162      FN=25
Predicted: 1  FP=17       TP=196

(b) Decision Tree classifier.
              Actual: 0   Actual: 1
Predicted: 0  TN=170      FN=17
Predicted: 1  FP=29       TP=184

While the confusion matrix is comprehensive, it is often useful to distill these counts into a few key summary statistics.

4.1.2 Accuracy

Accuracy is perhaps the most intuitive metric. It measures the overall fraction of predictions that the classifier got right.

\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}

While simple, accuracy can be misleading, especially when dealing with imbalanced datasets (where one class is much more frequent than the other). For example, if a disease affects only 1% of the population, a model that always predicts “no disease” will have 99% accuracy, but it will be completely useless for its intended purpose.
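
To make this concrete, the following sketch (with synthetic labels assuming a 1% disease prevalence) shows how the trivial "always predict negative" baseline reaches 99% accuracy while detecting no positive cases at all:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, highly imbalanced labels: 10 positives among 1,000 examples.
y_true = np.array([1] * 10 + [0] * 990)

# A useless baseline that always predicts "no disease".
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- misses every positive case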

4.1.3 Precision and Recall

To get a more nuanced view, we often turn to two complementary metrics: precision and recall.

Recall, also known as Sensitivity or the True Positive Rate (TPR), answers the question: Of all the actual positive examples, what fraction did we correctly identify?

\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = p(\hat{y}=1 | y=1)

High recall is crucial in applications where failing to detect a positive case has severe consequences (e.g., medical screening, fraud detection). We want to minimise false negatives.

Precision answers the question: Of all the examples we predicted as positive, what fraction were actually positive?

\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = p(y=1 | \hat{y}=1)

High precision is important when the cost of a false positive is high (e.g., a spam filter marking an important email as spam).

There is often a trade-off between precision and recall. A model that is very aggressive in predicting positives will have high recall but may have low precision. A model that is very conservative will have high precision but may have low recall.

4.1.4 The F1 Score

The F1 score provides a way to combine precision and recall into a single number. It is the harmonic mean of the two, which tends to be closer to the smaller of the two values. It is high only when both precision and recall are high.

F_{1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\mathrm{TP}}{2\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
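
As a small worked sketch, these quantities can be computed directly from the confusion-matrix counts; the numbers below are taken from the RBF SVM matrix in Table 4.1(a), and the scikit-learn helpers shown in the comments give the same values when applied to the raw label vectors.

tp, fp, fn = 196, 17, 25  # counts from Table 4.1(a)

precision = tp / (tp + fp)                                 # 196 / 213 = 0.920
recall    = tp / (tp + fn)                                 # 196 / 221 = 0.887
f1        = 2 * precision * recall / (precision + recall)  # 0.903
print(precision, recall, f1)

# Equivalently, given label vectors y_true and y_pred:
#   from sklearn.metrics import precision_score, recall_score, f1_score
#   precision_score(y_true, y_pred); recall_score(y_true, y_pred); f1_score(y_true, y_pred)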

4.1.5 The Importance of Using Multiple Metrics

It is critical to understand that a single metric rarely tells the whole story. Relying on just one can be dangerously misleading, as a classifier can easily be designed to perform perfectly on one metric while being terrible in practice.

Example

Consider a dataset with 100 examples: 15 are positive (class 1) and 85 are negative (class 0).

  • Classifier A always predicts positive (1). Its confusion matrix is: TN=0, FN=0, FP=85, TP=15. Its recall is 15/(15+0) = 100\%, which sounds perfect! However, its precision is a dismal 15/(15+85) = 15\%.

  • Classifier B always predicts negative (0). Its confusion matrix is: TN=85, FN=15, FP=0, TP=0. Its accuracy is (85+0)/100 = 85\%, which seems quite good. But its recall is 0/(0+15) = 0\%. It fails to find any of the positive cases.

Both classifiers are useless, but you need at least two metrics (e.g., precision and recall, or recall and accuracy) to see the full picture.

Conclusion: Never evaluate a classifier with a single metric in isolation.

4.2 Visualising Performance: The ROC Curve

Many classifiers, like logistic regression, do not output a hard 0 or 1 label directly. Instead, they produce a score or probability. We then apply a threshold to this score to make the final classification (e.g., predict 1 if score > 0.5).

Changing this threshold allows us to trade off between the True Positive Rate (Recall) and the False Positive Rate (FPR), which is the proportion of negatives that are incorrectly labelled as positive.

\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = p(\hat{y}=1 | y=0)

The Receiver Operating Characteristic (ROC) curve is a powerful tool that visualises this trade-off. It is created by plotting the TPR (y-axis) against the FPR (x-axis) for every possible threshold value.

Figure 4.2: An example of Receiver Operating Characteristic (ROC) curves for four different classifiers.
  • A perfect classifier would achieve a TPR of 1 and an FPR of 0, corresponding to the top-left corner of the plot.
  • A random classifier (e.g., flipping a coin) would produce a diagonal line from (0,0) to (1,1). Any useful classifier must perform above this line.
  • The closer the curve is to the top-left corner, the better the classifier.
Figure 4.3: ROC curves for the classifiers from Figure 4.1.
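
The sketch below shows one way to produce such a curve with scikit-learn, assuming a classifier that outputs probabilities; the synthetic dataset and logistic-regression model are purely illustrative.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset, for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# TPR and FPR at every threshold implied by the scores.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC =", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()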

4.2.1 Area Under the Curve (AUC)

While the ROC curve provides a comprehensive view, it is often convenient to summarise it with a single number: the Area Under the Curve (AUC).

\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR}

The AUC can be interpreted as the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. A perfect classifier has an AUC of 1.0, while a random classifier has an AUC of 0.5.
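
This ranking interpretation can be checked directly. With the hypothetical labels and scores below, the fraction of (positive, negative) pairs in which the positive example receives the higher score (counting ties as one half) agrees with roc_auc_score.

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos, neg = scores[y_true == 1], scores[y_true == 0]
diffs = pos[:, None] - neg[None, :]  # one entry per (positive, negative) pair

# Fraction of pairs ranked correctly, with ties counted as 1/2.
auc_by_ranking = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / diffs.size

print(auc_by_ranking, roc_auc_score(y_true, scores))  # both give 8/9 = 0.889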

4.2.2 Average Precision

Similarly, one can plot a Precision-Recall curve and compute the area under it, which is known as the Average Precision (AP). This metric is particularly informative for highly imbalanced datasets where the number of negatives far outweighs the number of positives.

It is computed slightly differently from the ROC-AUC: \mathrm{AP} = \sum_{i=1}^n \mathrm{Precision}_i \times \left(\mathrm{Recall}_i-\mathrm{Recall}_{i-1}\right), where \mathrm{Precision}_i and \mathrm{Recall}_i are the precision and recall evaluated at the i-th of n threshold values T_i.
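
In code, the precision-recall curve and AP follow the same pattern as the ROC curve; continuing the illustrative y_test and scores from the ROC sketch above:

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("AP =", average_precision_score(y_test, scores))

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()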

4.3 Metrics for Multiclass Classification

When we have more than two classes, the concepts of precision and recall do not apply directly. The confusion matrix becomes a K \times K table for a K-class problem.

              Actual: 0   Actual: 1   Actual: 2
Predicted: 0  102         10          5
Predicted: 1  8           89          12
Predicted: 2  7           11          120

There are K-1 possible ways of misclassifying an example from each class, so there are K \times (K-1) types of error in total.

The most common way to adapt binary metrics to the multiclass setting is to use averaging strategies. For each class k, we can compute its own set of metrics by considering it as the “positive” class and all other classes as the “negative” class (a one-vs-rest approach). Then, we can average these per-class metrics.

  • Macro-averaging: Compute the metric independently for each class and then take the unweighted average. This treats all classes equally, regardless of their size. \mathrm{MacroPrecision} = \frac{1}{K} \sum_{k=1}^K \mathrm{Precision}_k

  • Micro-averaging: Aggregate the counts of TP, FP, and FN for all classes first, and then compute the metric from these aggregated counts. This gives equal weight to each individual prediction, so larger classes have more influence. \mathrm{MicroPrecision} = \frac{\sum \mathrm{TP}_k}{\sum \mathrm{TP}_k + \sum \mathrm{FP}_k}

Example

Given y_true = [0, 1, 2, 0, 1, 2, 2] and y_pred = [0, 2, 1, 0, 0, 1, 0]

we have \mathrm{TP}_0 = 2, \mathrm{TP}_1 = 0, \mathrm{TP}_2 = 0, \mathrm{FP}_0 = 2, \mathrm{FP}_1 = 2, \mathrm{FP}_2 = 1

\mathrm{MicroPrecision} = \frac{2 + 0 + 0}{(2 + 0 + 0) + (2 + 2 + 1)} = \frac{2}{7} \approx 0.286

\mathrm{MacroPrecision} = \frac{1}{3} \left( \frac{2}{2+2} + \frac{0}{0 + 2} + \frac{0}{0 + 1} \right) \approx 0.167
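
These hand-computed values can be checked with scikit-learn's precision_score and its averaging options:

from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 2, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred, average="micro"))  # 0.2857...
print(precision_score(y_true, y_pred, average="macro"))  # 0.1666...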

A popular macro-averaged metric is the mean Average Precision (mAP), which is the average of the Average Precision (AP) scores across all classes:

\mathrm{mAP} = \frac{1}{K} \sum_{k=1}^K \mathrm{AP}_k

4.4 The Three Essential Datasets: Training, Validation, and Testing

Now that we have our metrics, we must be careful about what data we use to compute them. A robust evaluation workflow requires splitting our data into three distinct sets:

  1. Training Set: This is the data the model learns from. The model’s parameters (e.g., the weights in logistic regression) are optimised to minimise the loss on this set.

  2. Validation (or Development) Set: This set is used to tune the model’s hyperparameters—the configuration settings that are not learned directly, such as the learning rate, the value of k in k-NN, or the strength of regularisation. We choose the hyperparameters that yield the best performance on the validation set.

  3. Test Set: This set is held out until the very end. It is used only once to provide a final, unbiased estimate of the chosen model’s performance on unseen data. You must never tune your model based on its performance on the test set. Doing so would be a form of data leakage, and your final performance metric would be an overly optimistic and invalid estimate of how the model will perform in the real world.

This separation is crucial to avoid overfitting. A model can easily memorise the training set, so good performance there means little. Evaluating on a separate validation set gives a more realistic estimate of generalisation, but because hyperparameter tuning is itself a form of training, performance on the dev set eventually becomes optimistically biased too. The held-out test set provides the final, honest assessment.
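
One common way to produce the three sets is to split twice, sketched here with scikit-learn on a synthetic dataset and an illustrative 80/10/10 split (the appropriate proportions depend on how much data you have, as discussed below):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data; in practice X and y come from your own problem.
X, y = make_classification(n_samples=10_000, random_state=0)

# First hold out 20% of the data, then split that hold-out half-and-half
# into validation and test sets, giving 80% / 10% / 10% overall.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0, stratify=y_hold)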

How large do the dev/test sets need to be?

  • Training sets: as large as you can afford.

  • Validation/Dev sets of 1,000 to 10,000 examples are common. With 100 examples you have a reasonable chance of detecting an improvement of about 5%; with 10,000 examples you can detect an improvement of about 0.1%. In short, the dev set should be at least large enough to resolve the performance differences you care about.

  • Test sets should be large enough to give high confidence in the overall performance of your system. A popular heuristic has been to reserve 30% of your data for the test set. This makes sense when you have, say, 100 to 10,000 examples, but less so when you have billions of examples. The guiding principle is that the test set should cover the full range of cases, including edge cases, that your system will encounter.

Important: the test and dev sets should contain examples of what you ultimately want to perform well on, rather than whatever data you happen to have for training.

4.5 Takeaways

  • Before starting any machine learning project, your first steps should be to define your evaluation metrics and carefully create your training, validation, and test sets.

  • For binary classification, always use a combination of metrics. Accuracy alone can be misleading. Precision, recall, and the F1 score provide a more complete picture.

  • The ROC curve and its corresponding AUC score are excellent tools for evaluating and comparing models across all possible thresholds.

  • For multiclass problems, use macro- or micro-averaging to adapt binary metrics. The confusion matrix remains a vital tool for detailed error analysis.

  • Rigorously separating your data into training, validation, and test sets is non-negotiable for building models that generalise well to new, unseen data.

Exercises

Exercise 4.1 Consider a binary classifier with the following confusion matrix:

              Actual: 0   Actual: 1
Predicted: 0  TN=16       FN=4
Predicted: 1  FP=10       TP=70

Compute the accuracy and comment on the performance of the classifier.

Exercise 4.2 Consider a multiclass classifier which produces the following results:

y_true = [0, 0, 1, 2, 2, 2, 1, 1, 0]
y_pred = [1, 0, 1, 2, 1, 0, 1, 2, 0]

Compute the confusion matrix, the accuracy, the micro-averaged precision, and the macro-averaged precision.