Chapter 4 Evaluating Classifier Performance

We have seen a number of classifiers (Logistic Regression, SVM, kernel classifiers, Decision Trees, \(k\)-NN), but we still haven’t talked about how to measure their performance.

Recall some of the results for these classifiers:

Figure 4.1: Classification Results for some of the popular classifiers.

How do we measure the performance of a classifier?

How do we compare classifiers?

We need metrics that everybody can agree on.

4.1 Metrics for Binary Classifiers

If you have a binary problem with classes 0 (e.g. negative/false/fail) and 1 (e.g. positive/true/success), each prediction falls into one of 4 possible outcomes:

True Positive: you predict \(\hat{y}=1\) and indeed \(y=1\).

True Negative: you predict \(\hat{y}=0\) and indeed \(y=0\).

False Negative: you predict \(\hat{y}=0\) but in fact \(y=1\).

False Positive: you predict \(\hat{y}=1\) but in fact \(y=0\).

In statistics, False Positives are often called type-I errors and False Negatives type-II errors.

4.1.1 Confusion Matrix

A confusion matrix is a table that reports the number of false positives, false negatives, true positives, and true negatives for each class.

Table 4.1: Confusion Matrix Definition.
              Actual: 0   Actual: 1
Predicted: 0  TN          FN
Predicted: 1  FP          TP

For instance, the confusion matrices for the classifiers from Fig. 4.1 are as follows:

Table 4.2: Confusion Matrix for the \(k\)-NN classifier from Fig. 4.1.
              Actual: 0   Actual: 1
Predicted: 0  TN=166      FN=21
Predicted: 1  FP=25       TP=188

Table 4.3: Confusion Matrix for the Logistic Regression classifier from Fig. 4.1.
              Actual: 0   Actual: 1
Predicted: 0  TN=152      FN=35
Predicted: 1  FP=42       TP=171

Table 4.4: Confusion Matrix for the Linear SVM classifier from Fig. 4.1.
              Actual: 0   Actual: 1
Predicted: 0  TN=148      FN=39
Predicted: 1  FP=41       TP=172

Table 4.5: Confusion Matrix for the RBF SVM classifier from Fig. 4.1.
              Actual: 0   Actual: 1
Predicted: 0  TN=162      FN=25
Predicted: 1  FP=17       TP=196

Table 4.6: Confusion Matrix for the Decision Tree classifier from Fig. 4.1.
              Actual: 0   Actual: 1
Predicted: 0  TN=170      FN=17
Predicted: 1  FP=29       TP=184
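In code, such a table is typically obtained with scikit-learn's confusion_matrix; below is a minimal sketch on hypothetical labels. Note that scikit-learn puts the actual classes on the rows and the predicted classes on the columns, i.e. the transpose of the layout used in the tables above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a small binary problem
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes:
#     [[TN, FP],
#      [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# The four counts can also be unpacked directly:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```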

From TP, TN, FP, FN, we can derive a number of popular metrics.

4.1.2 Recall/Sensitivity/True Positive Rate (TPR)

Recall (also called sensitivity or true positive rate) is the probability that a positive example is indeed predicted as positive. In other words, it is the proportion of positives that are correctly labelled as positives. \[ \mathrm {Recall} ={\frac {\mathrm {TP} }{P}}={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }} = p(\hat{y}=1 | y=1) \] Here \(P = \mathrm{TP} + \mathrm{FN}\) is the total number of actual positives.

4.1.3 Precision

Precision is the probability that a positive prediction is indeed positive: \[ \mathrm {Precision} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }} = p( y=1 | \hat{y}=1) \]

4.1.4 False Positive Rate (FPR)

False Positive Rate is the proportion of negatives that are incorrectly labelled as positive: \[ \begin{aligned} \mathrm {FP\ rate} &={\frac {\mathrm {FP} }{N}}={\frac {\mathrm {FP} }{\mathrm {FP} +\mathrm {TN} }} = p(\hat{y}=1 | y=0) \end{aligned} \] Here \(N = \mathrm{FP} + \mathrm{TN}\) is the total number of actual negatives.

4.1.5 Accuracy

Accuracy is the probability that a prediction is correct: \[ \begin{aligned} \mathrm {Accuracy} &={\frac {\mathrm {TP} +\mathrm {TN} }{P+N}}={\frac {\mathrm {TP} +\mathrm {TN} }{\mathrm {TP} +\mathrm {TN} +\mathrm {FP} +\mathrm {FN} }}\\ &= p(\hat{y}=1 , y=1) + p(\hat{y}=0 , y=0) \end{aligned} \]

4.1.6 F1 Score

F1 score is the harmonic mean of precision and recall: \[ F_{1}=2\cdot {\frac {\mathrm {recall} \cdot \mathrm {precision} }{\mathrm {precision} +\mathrm {recall} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}\]
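As an illustration, here is a small sketch computing these metrics directly from the counts of Table 4.2 (the \(k\)-NN classifier). In practice, scikit-learn also provides recall_score, precision_score, accuracy_score and f1_score to compute them from label vectors.

```python
# Counts taken from Table 4.2 (k-NN classifier)
TP, TN, FP, FN = 188, 166, 25, 21

recall    = TP / (TP + FN)                    # ~0.900
precision = TP / (TP + FP)                    # ~0.883
fpr       = FP / (FP + TN)                    # ~0.131
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.885
f1        = 2 * TP / (2 * TP + FP + FN)       # ~0.891
```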

4.1.7 You Need Two Metrics

Many other derived metrics exist, but remember that since there are two types of errors (false positives and false negatives), you will always need at least two metrics to really capture the performance of a classifier.

Performing well on a single metric can be meaningless: a good classifier should perform well on at least two complementary metrics.

Example 4.1 Consider a classifier with this confusion matrix:

              Actual: 0   Actual: 1
Predicted: 0  TN=70       FN=5
Predicted: 1  FP=15       TP=10

The actual number of positives (class 1) is 15 and the actual number of negatives (class 0) is 85.

The recall is TP/(TP+FN)=10/(10+5)=66.7%.

The accuracy is (TP+TN)/(TP+FN+TN+FP)=(70+10)/100=80%.

If we take a classifier (A) that always returns 1, the confusion matrix for the same problem becomes:

              Actual: 0   Actual: 1
Predicted: 0  TN=0        FN=0
Predicted: 1  FP=85       TP=15

and its recall is 15/(15+0) = 100%!!

If we take instead a classifier (B) that always returns 0, we have:

              Actual: 0   Actual: 1
Predicted: 0  TN=85       FN=15
Predicted: 1  FP=0        TP=0

and its accuracy is (85+0)/(100) = 85%!!

Clearly, both classifiers (A) and (B) are non-informative and really poor choices, but you need two metrics to see this:

For (A), the recall is 100% but the accuracy is only 15%.

For (B), the accuracy is 85% but the recall is 0%.

Conclusion: if you don’t know anything about the test set, you need two metrics.
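As a quick sanity check, here is a minimal sketch (using numpy and scikit-learn, with labels built to reproduce the 85 negatives and 15 positives of Example 4.1) that recovers the numbers above for the two degenerate classifiers:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 85 negatives (class 0) and 15 positives (class 1), as in Example 4.1
y_true = np.array([0] * 85 + [1] * 15)

# Classifier (A) always predicts 1; classifier (B) always predicts 0
y_pred_A = np.ones_like(y_true)
y_pred_B = np.zeros_like(y_true)

print(recall_score(y_true, y_pred_A), accuracy_score(y_true, y_pred_A))  # 1.0, 0.15
print(recall_score(y_true, y_pred_B), accuracy_score(y_true, y_pred_B))  # 0.0, 0.85
```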

4.1.8 ROC curve

When you start measuring the performance of a classifier, chances are that you can tune a few parameters. For instance, if your classifier is based on a linear classifier model \(y = [{\bf x}^{\top}{\bf w} > T]\), you can tune the threshold value \(T\). Increasing \(T\) means that your classifier will be more conservative about classifying points as \(y=1\).

By varying the parameter \(T\), we can produce a family of classifiers with different performances. We can report the \(\mathrm{FP}\) rate = FP/(FP+TN) and \(\mathrm{TP}\) rate = TP/(TP+FN) for each of the different values of \(T\) on a graph called the receiver operating characteristic curve, or ROC curve.
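As a minimal sketch of how such a curve is produced (on a hypothetical toy dataset, not the data of Fig. 4.1), scikit-learn's roc_curve performs the threshold sweep over the classifier scores for us:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical toy binary problem
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.decision_function(X_test)      # continuous scores x^T w + b

# roc_curve sweeps the threshold T over the scores and returns (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_test, scores)

plt.plot(fpr, tpr, label="logistic regression")
plt.plot([0, 1], [0, 1], "k--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```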

Here is an example showing the ROC curves for 4 different classifiers.

Figure 4.2: Receiver Operating Characteristic (ROC) curve.

A perfect classifier will have a TP rate of 100% and a FP rate of 0%. A random classifier will have a TP rate equal to its FP rate.

If your ROC curve is below the random classifier diagonal, then you are doing something wrong.

Below are the ROC curves for the classifiers of the earlier problem.

Figure 4.3: ROC curves for the classifiers of Fig. 4.1.

4.1.9 ROC-AUC

There are some metrics that attempt to summarise the performance of the classifier across all thresholds into a single statistic. For the ROC curve, we typically use the Area Under the Curve (AUC), which is defined as the integral of the ROC curve:

\[ \mathrm{AUC} = \int \mathrm{TPR} (\mathrm{FPR}) d \mathrm{FPR} \]

This integral is typically implemented by measuring \(\mathrm{FPR}_{i}\) and \(\mathrm{TPR}_{i}\) for a set of \(n\) different thresholds \(T_i\) and evaluating the integral via trapezoidal integration:

\[ \mathrm{AUC} = \sum_{i=2}^n \frac{1}{2} \left(\mathrm{TPR}_i+\mathrm{TPR}_{i-1}\right) \times \left(\mathrm{FPR}_i-\mathrm{FPR}_{i-1}\right) \]
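For illustration, here is a minimal sketch of this trapezoidal sum on a few hypothetical \((\mathrm{FPR}_i, \mathrm{TPR}_i)\) points; in practice you would simply call sklearn.metrics.auc on the output of roc_curve, or sklearn.metrics.roc_auc_score on the raw scores:

```python
import numpy as np

def roc_auc_trapezoid(fpr, tpr):
    """Trapezoidal integration of a ROC curve given as (FPR_i, TPR_i) points."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    order = np.argsort(fpr)                  # integrate along increasing FPR
    fpr, tpr = fpr[order], tpr[order]
    return np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr))

# Hypothetical (FPR_i, TPR_i) measurements at a handful of thresholds T_i
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.6, 0.8, 0.95, 1.0]
print(roc_auc_trapezoid(fpr, tpr))           # 0.8225 (matches sklearn.metrics.auc(fpr, tpr))
```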

4.1.10 Average Precision

Similarly, the Average Precision (AP) computes the area under the \(\mathrm{Precision}\)-\(\mathrm{Recall}\) curve. It is implemented slightly differently from the ROC-AUC: \[ \mathrm{AP} = \sum_{i=1}^n \mathrm{Precision}_i \times \left(\mathrm{Recall}_i-\mathrm{Recall}_{i-1}\right) \] where \(\mathrm{Precision}_i\) and \(\mathrm{Recall}_i\) are the precision and recall values taken at \(n\) different threshold values \(T_i\), with \(\mathrm{Recall}_0 = 0\).
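As a sketch of how this is computed in practice (on hypothetical labels and scores), scikit-learn's precision_recall_curve sweeps the thresholds, and average_precision_score applies exactly this sum:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical labels and continuous scores for a small binary problem
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# precision_recall_curve sweeps the thresholds T_i; recall is returned in
# decreasing order, hence the sign flip on the differences
precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap_manual = np.sum(precision[:-1] * -np.diff(recall))

print(ap_manual)                                # manual sum
print(average_precision_score(y_true, scores))  # same value from scikit-learn
```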

4.2 Multiclass Classifiers

Binary metrics don’t adapt nicely to problems where there are more than 2 classes. For a multiclass problem with \(K\) classes, there are \(K-1\) possible ways of misclassifying each class. Thus there are \(K \times (K-1)\) types of errors in total.

You can always present your results as a confusion matrix. For instance:

              Actual: 0   Actual: 1   Actual: 2
Predicted: 0  102         10          5
Predicted: 1  8           89          12
Predicted: 2  7           11          120

You can also think of your multiclass problem as a set of binary problems (does an observation belong to class \(k\) or not), and then aggregate the binary metrics in some way.

Two common ways of aggregating the binary metrics for multiclass problems are micro-averaging and macro-averaging.

In micro-averaging, the metric (e.g. precision, recall, F1 score) is computed from the combined true positives, true negatives, false positives, and false negatives of the \(K\) classes.

For instance the micro-averaged precision is: \[ \mathrm{micro PRE} = \frac{\mathrm{micro TP}}{\mathrm{micro TP} + \mathrm{micro FP}} \] with \(\mathrm{micro TP} = \mathrm{TP}_1 + \cdots + \mathrm{TP}_K\) and \(\mathrm{micro FP} = \mathrm{FP}_1 + \cdots + \mathrm{FP}_K\).

In macro-averaging, the performances are averaged over the classes: \[ \mathrm{macro PRE} = \frac{\mathrm{PRE}_1 + \cdots + \mathrm{PRE}_K}{K} \qquad \text{where} \quad \mathrm{PRE}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k} \]

Example 4.2 Given y_true = [0, 1, 2, 0, 1, 2, 2] and y_pred = [0, 2, 1, 0, 0, 1, 0]

we have \(\mathrm{TP}_0 = 2\), \(\mathrm{TP}_1 = 0\), \(\mathrm{TP}_2 = 0\), \(\mathrm{FP}_0 = 2\), \(\mathrm{FP}_1 = 2\), \(\mathrm{FP}_2 = 1\)

\[ \mathrm{micro PRE} = \frac{2 + 0 + 0}{ (2+0+0) + (2 + 2 + 1)} = 0.286 \]

\[ \mathrm{macro PRE} = \frac{1}{3} \left( \frac{2}{2+2} + \frac{0}{0 + 2} + \frac{0}{0 + 1} \right) = 0.167 \]
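These numbers can be checked with scikit-learn on the exact lists of Example 4.2:

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 2, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred, average="micro"))  # 0.2857...
print(precision_score(y_true, y_pred, average="macro"))  # 0.1666...
```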

A popular macro-averaging metric is the mean Average Precision (mAP).

The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes:

\[ \mathrm{mAP} = \frac{1}{K} \sum_{k=1}^K \mathrm{AP}_k \]

4.3 Training/Validation/Testing Sets

Now that we have established metrics, we still need to define the data that will be used for evaluating the metric. You usually need:

  • a Training set, that you use to train the algorithm.

  • a Dev or Validation set, that you use to tune the parameters of the algorithm.

  • a Test set, that you use to fairly assess the performance of the algorithm. You should not try to optimise for the test set.

Why so many sets? Because you want to avoid overfitting. If your model has many parameters, it is easy to overfit your training set given enough training. Thus, we need a test set to check the performance on unseen data.

Now, if you tune the parameters of your algorithm to perform better on the test set, you are likely to overfit it as well. But we want to use the test set as a way of accurately estimating the performance on unseen data. Thus we use a different set, the dev/validation set, for any parameter tuning.

Important: the test and dev sets should contain examples of what you ultimately want to perform well on, rather than whatever data you happen to have for training.
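In code, a simple way to obtain the three sets is to apply scikit-learn's train_test_split twice; here is a minimal sketch on a hypothetical dataset (the 60/20/20 proportions are only an illustrative choice, not a rule):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical dataset standing in for your own data
X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the test set, then split the remainder into train and dev
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2
```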

How large do the dev/test sets need to be?

  • Training sets: as large as you can afford.

  • Validation/Dev sets with sizes from 1,000 to 10,000 examples are common. With 100 examples, you will have a good chance of detecting an improvement of 5%. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%.

  • Test sets should be large enough to give high confidence in the overall performance of your system. One popular heuristic has been to use 30% of your data for the test set. This makes sense when you have, say, 100 to 10,000 examples, but maybe not when you have billions of examples. Basically, the test set should be large enough to catch all the possible edge cases of your system.

4.4 Take Away

When approaching a new project, your first steps should be to design your Training/Validation/Test sets and decide on the metrics. Only then should you start thinking about which classifier to train.

Remember that the metrics are usually not the be-all and end-all of the evaluation. Each metric can only look at a particular aspect of your problem. You need to monitor multiple metrics.

As the project progresses, it is expected that the datasets will be updated and that new metrics will be introduced.