2 Logistic Regression: From Lines to Probabilities
In the previous chapter, we explored Linear Regression, which models problems where the output is a continuous variable, such as a price, temperature, or height. However, many real-world problems require us to make a categorical choice, and we need to build a classifier that can answer questions like: Is this email spam or not? Does this patient have a particular disease? Which category does this news article belong to?
This chapter introduces Logistic Regression (Cox 1958), a fundamental algorithm for tackling binary classification problems, where the outcome is one of two categories (e.g., 0 or 1, true or false, pass or fail). Despite its name, logistic regression is a model for classification, not regression.
There is a vast ecosystem of classification algorithms, so why focus on this one? The reason is that logistic regression is not just a workhorse classifier in its own right; it is also the foundational building block of modern neural networks. Understanding it thoroughly will pave the way for the more complex deep learning models we will encounter later.
2.1 A Motivating Example: Predicting Exam Success
Let us start with a simple, intuitive example, adapted from Wikipedia:
A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that a student will pass the exam?
The collected data consists of pairs of (Hours Studied, Result), where the result is binary: 1 for a pass and 0 for a fail.
Our goal is to build a model that, given a number of hours studied, can predict the outcome.
2.2 Why Not Linear Regression?
Our first instinct might be to apply what we already know: linear regression. Although the output y is binary (0 or 1), we could still attempt to fit a straight line to the data using least squares.
y \approx {\bf x}^{\top}{\bf w}
For our simple 1D problem, the feature vector is {\bf x}^{\top} = [1, x] (where x is hours studied) and the weights are {\bf w}^{\top} = [w_0, w_1]. The least squares fit is shown below.
The line produces continuous values, not the 0s and 1s we need.
We could convert the output into a binary classification by applying a threshold, for instance at 0.5:
\hat{y} = [ {\bf x}^{\top}{\bf w} > 0.5 ] = \begin{cases} 1 & \text{ if ${\bf x}^{\top}{\bf w} > 0.5$} \\ 0 & \text{ otherwise} \end{cases}
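To make this concrete, here is a minimal NumPy sketch of the least-squares-plus-threshold classifier described above (the function names are ours, and the hours/results arrays it expects would be the dataset from Section 2.1):

```python
import numpy as np

def fit_linear_classifier(hours, results):
    """Fit results ~ w0 + w1 * hours by least squares (results are 0/1 labels)."""
    X = np.column_stack([np.ones_like(hours, dtype=float), hours])  # rows are [1, x]
    w, *_ = np.linalg.lstsq(X, results, rcond=None)                 # least squares fit
    return w

def predict(w, hours):
    """Classify by thresholding the fitted line at 0.5."""
    scores = w[0] + w[1] * np.asarray(hours, dtype=float)
    return (scores > 0.5).astype(int)
```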
However, this approach has a fundamental flaw. Linear regression’s loss function (MSE) tries to minimise the squared distance between the line and the data points. Consider a student who studied for 6 hours and passed. Their data point is at (6, 1). The line’s prediction might be, say, 1.2. The squared error is (1 - 1.2)^2 = 0.04. Now consider a student who studied for 10 hours and also passed. The line’s prediction might be 2.0. The squared error is (1 - 2.0)^2 = 1.0. The model is heavily penalised for this second student, even though the prediction (pass) is clearly correct.
This means that outliers or even correctly classified but distant points can disproportionately influence the position of the line, pulling it away from what might be a better decision boundary.
The core issue is that we optimised the model to make {\bf x}^{\top}{\bf w} match y, when we should have optimised it to make our classification rule [ {\bf x}^{\top}{\bf w} > 0.5 ] match y. We need a model designed for probabilities, not for direct value prediction.
2.3 A Probabilistic View of Classification with Generalised Linear Models
Let us reframe the problem. Instead of predicting the outcome directly, let us try to predict the probability of the outcome. Specifically, we want to model the conditional probability p(y=1 | {\bf x}, {\bf w}).
Setting the threshold at 0 and adding a noise term \epsilon to the score, our core model becomes:
y = [ {\bf x}^{\top}{\bf w} + \epsilon > 0 ]
The term {\bf x}^{\top}{\bf w}, often called the logit or score, provides a measure of evidence for the positive class. The score can range from -\infty (very likely to be y=0) to +\infty (very likely to be y=1). A score of 0 indicates that we are undecided between the two options.
In our toy example, the risk score is just a re-scaled version of the number of hours studied. For instance, if you study less than 1 hour, you are very likely to fail. In the general case, the risk score performs a dimensionality reduction: it combines multiple input values into a single score, which can then be compared against a threshold. Think of a buyer’s guide that combines multiple evaluations into a single score.
The key to generalised linear models is the idea that the uncertainty (\epsilon) is on the risk score itself, not directly on the outcome. That is, the error on the risk score might move the ultimate decision to either side of the threshold boundary.
We can now, as we did with least squares, take a probabilistic view of the problem and try to model the distribution of \epsilon with a known distribution.
Multiple choices are possible for the distribution of \epsilon. In logistic regression, the error \epsilon is assumed to follow a logistic distribution, and the risk score {\bf x}^{\top} {\bf w} is also called the logit.
In probit regression, the error \epsilon is assumed to follow a normal distribution, the risk score {\bf x}^{\top} {\bf w} is also called the probit.
In practice, the logistic and probit models produce almost identical results. Logistic regression is far more common in machine learning, primarily because the sigmoid function and its derivative are computationally simpler and more efficient to work with.
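As a rough numerical check of this claim (a sketch, not part of the derivation above; the scaling constant 1.702 is a standard approximation of the normal c.d.f. by a sigmoid and is not introduced in this chapter):

```python
import numpy as np
from scipy.stats import norm

t = np.linspace(-8, 8, 2001)
logistic_cdf = 1.0 / (1.0 + np.exp(-1.702 * t))   # sigmoid with rescaled input
probit_cdf = norm.cdf(t)                          # standard normal c.d.f.
print(np.max(np.abs(logistic_cdf - probit_cdf)))  # maximum gap is on the order of 0.01
```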
From now on, we’ll only look at the logistic model. Note that similar derivations could be made for any other model.
2.4 Logistic Model
Consider p(y=1|{\bf x},{\bf w}), the likelihood that the output is a success given the input features and model parameters:
\begin{aligned} p(y=1 | {\bf x},{\bf w}) &= p( {\bf x}^{\top}{\bf w} + \epsilon > 0 )\\ &= p(\epsilon > - {\bf x}^{\top}{\bf w}) \end{aligned}
Since \epsilon is symmetrically distributed around 0, it follows that
\begin{aligned} p(y=1 | {\bf x},{\bf w}) &= p( \epsilon < {\bf x}^{\top}{\bf w}) \end{aligned}
Because we have made some assumptions about the distribution of \epsilon, we are able to derive a closed-form expression for the likelihood.
The function \sigma: t \mapsto \sigma(t) = p( \epsilon < t) is the c.d.f. of the logistic distribution and is also called the logistic function or sigmoid:
\sigma(t) = \frac{1}{1 + e^{-t}}
Thus we have a simple model for the likelihood of success p(y=1 | {\bf x},{\bf w}):
p(y=1 | {\bf x},{\bf w}) = p( \epsilon < {\bf x}^{\top}{\bf w}) = \sigma({\bf x}^{\top}{\bf w}) = \frac{1}{1 + e^{-{\bf x}^{\top}{\bf w}}}
The likelihood of failure is simply given by:
p(y=0 | {\bf x},{\bf w}) = 1- p(y=1 | {\bf x},{\bf w}) = \frac{1}{1 + e^{+{\bf x}^{\top}{\bf w}}}
Exercise 2.1 Show that 1 - \sigma(t) = \sigma(-t), and therefore that p(y=0 | {\bf x}, {\bf w}) = \frac{1}{1 + e^{+{\bf x}^{\top}{\bf w}}}.
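To connect the latent-noise model with the sigmoid numerically, here is a small Monte Carlo sketch (the score value and sample size are arbitrary): we draw \epsilon from a standard logistic distribution and check that the empirical frequency of {\bf x}^{\top}{\bf w} + \epsilon > 0 matches \sigma({\bf x}^{\top}{\bf w}).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
score = 1.2                                        # an arbitrary value of x^T w
eps = rng.logistic(loc=0.0, scale=1.0, size=200_000)

print(np.mean(score + eps > 0), sigmoid(score))    # empirical vs. model p(y=1)
print(np.mean(score + eps <= 0), sigmoid(-score))  # empirical vs. model p(y=0)
```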
Below is the plot of our new probabilistic model, fitted to the student data. (We will see how to find the optimal weights {\bf w} shortly.)
The model is easy to interpret. For example, it tells us that a student who studies for 3 hours has approximately a 60% chance of passing the exam. Crucially, for students who study many hours, the probability approaches 1 and then stays there. The model is no longer penalised for being “too correct,” which solves the main issue we had with linear regression.
This brings us to an important distinction. In linear regression, the model prediction, which we denote h_{\bf w}({\bf x}), was a direct estimate of the outcome:
h_{\bf w}({\bf x}) \approx y
In logistic regression, the model prediction h_{\bf w}({\bf x}) is an estimate of the likelihood of the outcome:
h_{\bf w}({\bf x}) = p(y=1|{\bf x},{\bf w})
Thus, whereas in linear regression we try to answer the question:
What is the expected value of y given {\bf x}?
In logistic regression (and any other generalised linear model), we instead try to answer the question:
What is the probability that y=1 given {\bf x}?
Note that this approach is now very robust to including students who have studied for many hours. In the figure below, we have added to the dataset a successful student who studied for 6.2 hours. The new logistic regression estimate (see next section) is almost identical to our previous one (the magenta and red curves actually coincide).
2.5 Training: Maximum Likelihood and Cross-Entropy
How do we find the optimal weights {\bf w}? As with linear regression, we turn to the principle of Maximum Likelihood Estimation (MLE). We want to find the weights that make our observed data the most probable.
For a single training example ({\bf x}_i, y_i), the probability of observing the actual outcome y_i is:
p(y_i|{\bf x}_i, {\bf w} ) = \begin{cases} h_{\bf w}({\bf x}_i) & \text{ if $y_i=1$} \\ 1 - h_{\bf w}({\bf x}_i) & \text{ if $y_i=0$} \end{cases}
Since y_i can only be 0 or 1, we can write this more compactly:
p(y_i|{\bf x}_i, {\bf w} ) = h_{\bf w}({\bf x}_i)^{y_i} (1-h_{\bf w}({\bf x}_i))^{1 - y_i}
Assuming the training examples are independent, the likelihood of the entire dataset is the product of the individual likelihoods:
L({\bf w}) = p({\bf y} |{\bf X}, {\bf w}) = \prod_{i=1}^n h_{\bf w}({\bf x}_i)^{y_i} (1-h_{\bf w}({\bf x}_i))^{1 - y_i}
Our goal is to find the {\bf w} that maximises L({\bf w}). As before, it is mathematically more convenient to work with the logarithm of the likelihood. Maximising the log-likelihood is equivalent to minimising its negative, which gives us our loss function, E({\bf w}):
\begin{align*} E({\bf w}) &= -\log(L({\bf w})) \\ &= -\log \left( \prod_{i=1}^n h_{\bf w}({\bf x}_i)^{y_i} (1-h_{\bf w}({\bf x}_i))^{1 - y_i} \right) \\ &= - \sum_{i=1}^n \left( y_i\ \log(h_{\bf w}({\bf x}_i)) + (1 - y_i)\ \log(1 - h_{\bf w}({\bf x}_i)) \right) \end{align*}
This loss function is of fundamental importance in machine learning and is known as the binary cross-entropy. It measures the dissimilarity between the true distribution (where the probability is 1 for the correct class and 0 for the other) and the model’s predicted probability distribution. Minimising the cross-entropy loss is equivalent to maximising the likelihood of our model.
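For reference, a direct NumPy transcription of this loss (the function name is ours; {\bf X} is assumed to already contain a leading column of ones for the bias term, and a production implementation would also guard against taking the log of 0):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_cross_entropy(w, X, y):
    """E(w): negative log-likelihood of the labels y in {0, 1} under the model."""
    p = sigmoid(X @ w)                                      # h_w(x_i) for each example
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```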
2.6 Optimisation with Gradient Descent
Unlike linear regression, there is no closed-form solution for the weights {\bf w} that minimise the cross-entropy loss. We must find them using an iterative optimisation algorithm. The most common method is gradient descent.
The idea behind gradient descent is simple: we start with an initial guess for the weights, {\bf w}^{(0)}, and then repeatedly take small steps in the direction that most steeply decreases the loss function. That direction is the negative of the gradient of the loss, -\frac{\partial E}{\partial {\bf w}}({\bf w}).
The update rule for gradient descent is:
{\bf w}^{(t+1)} = {\bf w}^{(t)} - \eta \frac{\partial E}{\partial {\bf w}}({\bf w}^{(t)})
Here, \eta is the learning rate, a hyperparameter that controls the size of each step. Finding a good learning rate is a crucial part of training machine learning models.
Let us find the gradient of our cross-entropy loss. Recall that h_{\bf w}({\bf x}) = \sigma({\bf x}^{\top}{\bf w}).
Exercise 2.2 Given that the derivative of the sigmoid function is \sigma'(t) = \sigma(t)(1-\sigma(t)), show that the gradient of the binary cross-entropy loss is: \frac{\partial E}{\partial {\bf w}}({\bf w}) = \sum_{i=1}^{n} \left(h_{\bf w}({\bf x}_i) - y_i \right) {\bf x}_i
This result is remarkably simple. The term (h_{\bf w}({\bf x}_i) - y_i) is simply the prediction error for example i. The update for each weight is proportional to the sum of these errors, weighted by the corresponding input feature values.
The gradient descent algorithm for logistic regression simply iterates this update rule, starting from an initial guess and stopping after a fixed number of iterations or once the loss no longer decreases.
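Below is a minimal NumPy sketch of this training loop (a sketch only: the zero initialisation, the fixed learning rate, and the iteration count are illustrative choices, and {\bf X} is assumed to already contain a column of ones for the bias term):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, eta=0.01, n_iters=10_000):
    """Batch gradient descent on the binary cross-entropy loss.

    X: (n, d) design matrix, assumed to include a column of ones for the bias.
    y: (n,) array of 0/1 labels.
    """
    w = np.zeros(X.shape[1])            # initial guess w^(0)
    for _ in range(n_iters):
        errors = sigmoid(X @ w) - y     # h_w(x_i) - y_i for every example
        grad = X.T @ errors             # sum_i (h_w(x_i) - y_i) x_i
        w = w - eta * grad              # gradient descent step with learning rate eta
    return w
```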
2.7 Visualising the Decision Boundary
Let us consider an example with two features, x_1 and x_2. The model will learn a set of weights w_0, w_1, w_2. Our classification rule is to predict y=1 if p(y=1|{\bf x}) > 0.5. This happens when the sigmoid’s input, the logit, is positive:
{\bf x}^{\top}{\bf w} = w_0 + w_1 x_1 + w_2 x_2 > 0
The equation w_0 + w_1 x_1 + w_2 x_2 = 0 defines a line in the 2D feature space. This line is the decision boundary. All points on one side of the line are classified as 1, and all points on the other side are classified as 0.
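For plotting, the boundary can be written as x_2 in terms of x_1 whenever w_2 \neq 0. A small sketch with made-up weights:

```python
import numpy as np

def boundary_x2(w, x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w2 != 0)."""
    w0, w1, w2 = w
    return -(w0 + w1 * x1) / w2

x1 = np.linspace(-3.0, 3.0, 100)
x2 = boundary_x2(np.array([0.5, 1.0, -2.0]), x1)   # made-up weights
# Points where w0 + w1*x1 + w2*x2 > 0 are classified as 1, the rest as 0.
```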
2.8 Beyond Binary: Multiclass Classification
What if we have more than two classes? A common approach is to extend logistic regression to handle multiple categories. This is known as Multinomial Logistic Regression or Softmax Regression.
Instead of a single set of weights, we now learn a separate weight vector {\bf w}_k for each class k \in \{1, \dots, K\}. For a given input {\bf x}, we can compute a linear score {\bf x}^{\top}{\bf w}_k for each class.
To convert these K scores into a valid probability distribution (where the probabilities sum to 1), we use the softmax function, which is a generalisation of the sigmoid function:
p(y=C_k| {\bf x}, {\bf W} ) = \mathrm{softmax}({\bf z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
where {\bf z} is the vector of scores, so z_k = {\bf x}^{\top}{\bf w}_k.
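A standard NumPy implementation of the softmax for a single score vector; subtracting the maximum score before exponentiating is a common trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    """Map a vector of K scores to probabilities that sum to 1."""
    z = z - np.max(z)        # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # approximately [0.66, 0.24, 0.10]
```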
To train this model, we again use the maximum likelihood principle. This leads to a multiclass version of the cross-entropy loss, often called categorical cross-entropy:
E({\bf W}) = - \sum_{i=1}^{n} \sum_{k=1}^K [y_i=C_k]\ \log(p(y_i=C_k| {\bf x}_i,{\bf W}))
Here, [y_i=C_k] is an indicator that equals 1 if the true class of observation i is C_k, and 0 otherwise. This loss is also minimised using gradient descent.
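A direct transcription of this loss in NumPy, assuming one-hot encoded labels (the array shapes and names below are our own conventions, not prescribed by the text):

```python
import numpy as np

def categorical_cross_entropy(W, X, Y):
    """Multiclass negative log-likelihood.

    W: (d, K) matrix whose column k is the weight vector w_k.
    X: (n, d) design matrix (including a bias column of ones).
    Y: (n, K) one-hot labels, with Y[i, k] = [y_i = C_k].
    """
    Z = X @ W                                             # scores z_ik = x_i^T w_k
    Z = Z - Z.max(axis=1, keepdims=True)                  # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # row-wise softmax
    return -np.sum(Y * np.log(P))
```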
2.9 Takeaways
- Logistic Regression is a linear model for binary classification, not regression.
- It models the probability of an outcome by passing a linear combination of features (the logit) through the sigmoid (or logistic) function.
- The model is trained by minimising the binary cross-entropy loss function, which is derived from the principle of Maximum Likelihood Estimation.
- Since there is no closed-form solution, optimisation is performed iteratively using methods like gradient descent.
- The extension of Logistic Regression to more than two classes is called Multinomial Logistic Regression, which uses the softmax function and is trained by minimising the categorical cross-entropy loss.
Exercises
Exercise 2.3 Given the weights w_0=0.1, w_1=1, w_2=2, what is the probability p that an observation with feature values x_1=0.3, x_2=0.4 belongs to class 1?
Exercise 2.4 What do large values of the negative log-likelihood indicate? (select all correct answers)
- That the likelihood of the outcome to be of class 1 is high.
- That the likelihood of the outcome to be of class 0 is high.
- That the statistical model fits the data well.
- That the statistical model is a poor fit of the data.
Exercise 2.5 Consider a general linear model for a binary classification problem, whose accuracy on the training set is 100%, that is, every single output y is perfectly predicted. What is the maximum value h that the average cross-entropy on the training set can take?