8  Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a specialised type of neural architecture designed to process sequential data. Unlike traditional feedforward networks, which process inputs independently, RNNs possess a form of memory that allows them to retain information from previous inputs in a sequence to inform future predictions.

8.1 A Feedforward Network Unrolled Over Time

Sequential data appears in many domains. Obvious examples include time series such as audio signals, stock market prices, or a vehicle’s trajectory. Text, which is a sequence of words or characters, is another prime example. In fact, RNNs were particularly successful in machine translation tasks during the early phase of the deep learning revolution.

At its core, an RNN works by recursively applying a function to each element of a sequence. The network maintains a hidden state, or context, which is updated at each step. This context captures information from all previous steps. As shown in Figure 8.1, this architecture is traditionally represented using a feedback loop in the graph.

Figure 8.1: A Recurrent Neural Network shown in its compact, recursive form.

The input stream, denoted by {\bf x}, feeds into the context layer, denoted by {\bf h}. This layer then re-uses the previously computed context, {\bf h}_{t-1}, along with the current input, {\bf x}_t, to compute the new context, {\bf h}_t, and the output, {\bf y}_t.

For those with a background in signal processing, an analogy can be drawn: if convolutional layers are akin to Finite Impulse Response (FIR) filters, then RNNs are similar to Infinite Impulse Response (IIR) filters, as they incorporate feedback from previous states.

To better understand how an RNN operates, we can “unroll” or “unfold” the recursive loop (see Figure 8.2). This reveals a deep feedforward network where each layer corresponds to a single time step in the sequence.

Figure 8.2: The same Recurrent Neural Network in its unrolled, feedforward form.

A key characteristic of RNNs is that the network parameters (weights and biases) are shared across all time steps. This means we use the same set of weights, w, at every iteration. This parameter sharing makes the network efficient, as it does not need to learn a new set of parameters for each point in the sequence, and it allows the model to generalise to sequences of varying lengths.

Figure 8.3: In a simple RNN, the hidden layer is a standard fully connected layer.

Figure 8.3 shows an RNN in its most basic form, often called a simple RNN or an Elman network, where the hidden layer is simply a dense layer of neurons with a \mathrm{tanh} activation function.

We typically use a \mathrm{tanh} activation because its output is zero-centred, ranging from -1 to 1. This keeps the hidden state bounded across many time steps while still allowing its values to both increase and decrease.

The governing equations for a simple RNN at a time step t are:

\begin{equation} \begin{aligned} {\bf h}_{t}&=\tanh({\bf W}_{h}{\bf x}_{t}+{\bf U}_{h}{\bf h}_{t-1}+{\bf b}_{h})\\ {\bf y}_{t}&=\sigma _{y}({\bf W}_{y}{\bf h}_{t}+{\bf b}_{y}) \end{aligned} \end{equation}

Here, {\bf x}_t is the input vector at time t, {\bf h}_t is the hidden state vector, and {\bf y}_t is the output vector. The matrices {\bf W}_{h}, {\bf U}_{h}, and {\bf W}_{y}, along with the bias vectors {\bf b}_{h} and {\bf b}_{y}, are the parameters that the network learns. Note that these parameters are the same for all time steps. \sigma_y is the activation function for the output layer, chosen based on the specific task (e.g., softmax for classification).
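
To make the recurrence concrete, here is a minimal NumPy sketch of these equations; the weight shapes and the softmax output activation are illustrative assumptions rather than a fixed recipe.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, Wh, Uh, bh, Wy, by):
    """Run a simple RNN over a sequence xs of shape (n, p), one step at a time."""
    h = np.zeros_like(bh)                      # h_0: the initial context is empty
    ys = []
    for x_t in xs:                             # one iteration per time step
        h = np.tanh(Wh @ x_t + Uh @ h + bh)    # new context from current input and previous context
        ys.append(softmax(Wy @ h + by))        # output for this time step
    return np.array(ys), h                     # all outputs and the final context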

In Keras, we can define a simple RNN layer as follows. The input shape is typically (n, p), where n is the number of time steps in the sequence and p is the number of features at each time step.

from tensorflow.keras.layers import Input, SimpleRNN, Dense

input = Input(shape=(n, p))                         # n time steps, p features per step
h = SimpleRNN(hsize, return_sequences=True)(input)  # hidden state for every time step
output = Dense(osize, activation='softmax')(h)

The return_sequences parameter dictates whether the RNN layer returns the full sequence of hidden states (one for each time step) or only the final hidden state.

Figure 8.4 illustrates the configuration when return_sequences=True. The SimpleRNN layer outputs the hidden state for each time step. This is useful for sequence-to-sequence tasks, such as machine translation or speech recognition, where we need an output at each step of the sequence.

Figure 8.4: An unrolled RNN layer that returns the full sequence of hidden states (return_sequences=True).

The default behaviour in Keras is return_sequences=False, as shown in Figure 8.5. Here, the RNN layer outputs only the hidden state from the very last time step. This is common when we need a single summary representation of the entire sequence, for instance, in a classification task where this final state is fed into a dense layer to predict a label for the whole sequence.

input = Input(shape=(n, p))
h = SimpleRNN(hsize, return_sequences=False)(input)  # only the final hidden state
output = Dense(osize, activation='softmax')(h)

Figure 8.5: An unrolled RNN layer that returns only the final hidden state (return_sequences=False).

By setting return_sequences=True, we can stack multiple RNN layers, much like we stack convolutional layers in a CNN. The final RNN layer in the stack might have return_sequences=False if the goal is to produce a single output for the entire sequence.

input = Input(shape=(n, p))
h = SimpleRNN(hsize, return_sequences=True)(input)   # full sequence feeds the next RNN layer
k = SimpleRNN(ksize, return_sequences=False)(h)      # final state summarises the sequence
output = Dense(osize, activation='softmax')(k)

This results in the deep architecture illustrated in Figure 8.6.

Figure 8.6: An example of how multiple RNN layers can be stacked.

8.2 Application Example: Character-Level Language Modelling

Let us explore a classic application of RNNs: building a character-level language model. The goal is to predict the next character in a piece of text, given the sequence of preceding characters. This is, in essence, our first look at a generative language model. The core idea is to train an RNN on a large corpus of text and have it learn the underlying statistical patterns of the language.

Training

First, we must convert the text characters into a numerical format that the network can process. A common method is one-hot encoding, where each unique character in the vocabulary is represented by a binary vector with a single ‘1’ at the index corresponding to that character, and ‘0’s everywhere else.
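
As a concrete illustration, here is one way to build the vocabulary and one-hot encode a string in NumPy; `text` stands for the training corpus and is an assumption of this sketch.

import numpy as np

chars = sorted(set(text))                          # the vocabulary of unique characters
char_to_idx = {c: i for i, c in enumerate(chars)}

def one_hot(sequence):
    """Encode a string as a (len(sequence), vocabulary size) binary matrix."""
    x = np.zeros((len(sequence), len(chars)))
    for i, c in enumerate(sequence):
        x[i, char_to_idx[c]] = 1.0
    return x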

The training process is framed as a classification task. We feed the network a sequence of characters, for example {\bf x}_1, \dots, {\bf x}_{n-1}, and train it to predict the next character, {\bf y} = {\bf x}_{n}. The network’s output layer will typically use a softmax activation function, which produces a probability distribution over the entire vocabulary for the next character. The loss function used for training is usually cross-entropy.
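
A hedged sketch of this training setup is shown below. It reuses the `one_hot` helper and `chars` vocabulary from the previous snippet, and the window length n and hidden size hsize are assumed to be set as before; windows of n characters are used to predict the character that follows.

import numpy as np
from tensorflow.keras.layers import Input, SimpleRNN, Dense
from tensorflow.keras.models import Model

windows = range(len(text) - n)
X = np.array([one_hot(text[i:i + n]) for i in windows])       # n preceding characters
Y = np.array([one_hot(text[i + n])[0] for i in windows])      # the character that follows, one-hot

inputs = Input(shape=(n, len(chars)))
h = SimpleRNN(hsize, return_sequences=False)(inputs)          # summary of the preceding characters
outputs = Dense(len(chars), activation='softmax')(h)          # distribution over the vocabulary
model = Model(inputs, outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, Y, batch_size=128, epochs=20)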

Figure 8.7: An unrolled RNN used for next-character prediction.

So, the training objective is simple: given a sequence of previous characters, can the network accurately predict the character that comes next?

Inference

Once the RNN is trained, we can use it to generate new text, one character at a time. This process is known as inference or sampling. We begin by providing the network with an initial “seed” sequence (e.g., a few characters or words). The RNN processes this seed and outputs a probability distribution for the next character, as shown in Figure 8.8.

To generate the next character, we sample from this probability distribution. This means characters with a higher predicted probability are more likely to be chosen, but there is still an element of randomness. The newly generated character is then appended to the sequence, and this new, longer sequence is fed back into the RNN to generate the character after that. This process is repeated to generate entire sentences or even paragraphs of text.
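
The sampling loop might look as follows; `model`, `one_hot`, `chars` and n are the (assumed) objects from the training sketch, and the seed is assumed to contain at least n characters, all of which appear in the vocabulary.

import numpy as np

def generate(seed, length=200):
    generated = seed
    for _ in range(length):
        x = one_hot(generated[-n:])[np.newaxis, ...]       # encode the last n characters
        probs = model.predict(x, verbose=0)[0]
        probs = probs / probs.sum()                        # guard against floating-point drift
        next_idx = np.random.choice(len(chars), p=probs)   # sample, rather than always taking the argmax
        generated += chars[next_idx]
    return generated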

Figure 8.8: The process of generating text one character at a time using a trained RNN.

This fun application was popularised in a seminal blog post by Andrej Karpathy. We recommend visiting the post for more examples and insights into the power of RNNs. As we will see in later chapters, this fundamental idea of sequential prediction is at the heart of modern Large Language Models (LLMs).

8.3 Training: Back-Propagation Through Time

To train an RNN, we need a method to calculate the gradients of the loss function with respect to the network’s parameters. Since the parameters are shared across all time steps, the gradient at a particular time step depends on all previous time steps.

The standard algorithm for this is Back-Propagation Through Time (BPTT). It works by first unrolling the RNN into a deep feedforward network, as shown in Figure 8.9. Once unrolled, we can apply the standard back-propagation algorithm to calculate the gradients. The total gradient for a given parameter is the sum of the gradients for that parameter at each time step.
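
In modern frameworks we rarely implement BPTT by hand: differentiating the unrolled graph already accumulates, for each shared parameter, the sum of its per-step gradients. The following is a minimal TensorFlow sketch of that idea, with illustrative shapes and a made-up regression loss.

import tensorflow as tf

n, p, hsize = 20, 8, 16                            # time steps, features, hidden units (illustrative)
rnn = tf.keras.layers.SimpleRNN(hsize, return_sequences=True)
head = tf.keras.layers.Dense(p)

x = tf.random.normal((1, n, p))                    # one input sequence
target = tf.random.normal((1, n, p))               # a matching target sequence
_ = head(rnn(x))                                   # build the layers so their weights exist

with tf.GradientTape() as tape:
    y = head(rnn(x))                               # forward pass through the unrolled network
    loss = tf.reduce_mean((y - target) ** 2)

# One gradient per shared parameter, already summed over all n time steps.
grads = tape.gradient(loss, rnn.trainable_variables + head.trainable_variables)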

Figure 8.9: Back-Propagation Through Time (BPTT) involves unrolling the RNN and applying standard back-propagation.

However, BPTT has its challenges. Unrolling the network for a long sequence can result in a very deep computational graph, which can consume a large amount of GPU memory. Furthermore, the process is inherently sequential, making it difficult to parallelise and slow to train.

To mitigate these issues, a common strategy is to split the long sequence into smaller chunks and apply BPTT only on these truncated parts. This approach is called Truncated Back-Propagation Through Time (TBPTT), illustrated in Figure 8.10. While this makes training more manageable, it comes at the cost of the network’s ability to learn dependencies that span longer than the chunk size.
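
One way to approximate TBPTT in Keras is a stateful RNN trained chunk by chunk: the hidden state is carried forward across chunks, but gradients stop at each chunk boundary. The sketch below assumes a long input array `long_sequence` and aligned `targets`, both of shape (T, p), and a truncation length k.

import numpy as np
from tensorflow.keras.layers import Input, SimpleRNN, Dense
from tensorflow.keras.models import Model

k = 50                                                    # truncation length (chunk size)
inputs = Input(shape=(k, p), batch_size=1)                # stateful layers need a fixed batch size
h = SimpleRNN(hsize, stateful=True, return_sequences=True)(inputs)
outputs = Dense(p)(h)
model = Model(inputs, outputs)
model.compile(loss='mse', optimizer='adam')

for start in range(0, len(long_sequence) - k + 1, k):
    x_chunk = long_sequence[start:start + k][np.newaxis]  # shape (1, k, p)
    y_chunk = targets[start:start + k][np.newaxis]
    model.train_on_batch(x_chunk, y_chunk)                # back-propagation stops at the chunk boundary
model.reset_states()                                      # clear the carried state between independent sequences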

Figure 8.10: In Truncated BPTT, the RNN is unrolled for only a fixed number of time steps.

8.4 Dealing with Long Sequences

When unrolled, recurrent networks can become very deep. As with any deep network, training with gradient descent is susceptible to the vanishing and exploding gradient problems. As the error is propagated back through many time steps, the gradients can either shrink exponentially until they become negligible (vanish) or grow exponentially until they become unstable (explode). This makes it very difficult for simple RNNs to learn long-range dependencies in the data.

For this reason, the simple RNN architecture is rarely used in practice. Instead, we resort to more sophisticated RNN architectures that were specifically designed to address this issue, namely Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU).

8.4.1 LSTM

The Long Short-Term Memory (LSTM) architecture was introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber (Hochreiter and Schmidhuber 1997) precisely to combat the vanishing and exploding gradient problems. LSTM cells (see Figure 8.11) replace the simple hidden layer of a standard RNN. They introduce a more complex internal structure that includes a separate cell state and a series of “gates” that regulate the flow of information.

These gates—the forget gate, input gate, and output gate—allow the network to selectively add or remove information from the cell state, enabling it to remember information for very long periods. After their potential was realised around 2014, major technology companies like Google, Apple, and Microsoft began using LSTMs extensively in products for speech recognition and machine translation.
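
In Keras, an LSTM layer is a drop-in replacement for the SimpleRNN layers used earlier; a minimal sketch with the same (assumed) sizes:

from tensorflow.keras.layers import Input, LSTM, Dense

input = Input(shape=(n, p))
h = LSTM(hsize, return_sequences=False)(input)   # gated cell instead of a plain dense recurrence
output = Dense(osize, activation='softmax')(h)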

Figure 8.11: The internal architecture of a Long Short-Term Memory (LSTM) cell. (Figure by François Deloche).

8.4.2 GRU

The Gated Recurrent Unit (GRU) was introduced in 2014 (Chung et al. 2014) as a simpler alternative to the LSTM. GRUs combine the forget and input gates into a single “update gate” and merge the cell state and hidden state. This results in a model that is computationally more efficient (faster to train) because it has fewer parameters than an LSTM.

The performance of GRUs is often comparable to that of LSTMs. They may perform slightly better on smaller datasets but can be outperformed by LSTMs on larger, more complex problems. The architecture is shown in Figure 8.12.
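
A quick way to see the difference in size is to count parameters for the two layer types with the same number of units; a sketch, assuming n, p and hsize are set to concrete values:

from tensorflow.keras.layers import Input, LSTM, GRU
from tensorflow.keras.models import Model

input = Input(shape=(n, p))
lstm_model = Model(input, LSTM(hsize)(input))
gru_model = Model(input, GRU(hsize)(input))
print(lstm_model.count_params(), gru_model.count_params())   # the GRU has fewer parameters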

Figure 8.12: The internal architecture of a Gated Recurrent Unit (GRU) cell. (Figure by François Deloche).
See Also
  • J. Chung, C. Gulcehre, K. Cho and Y. Bengio (2014). “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”. (original paper)
  • Keras GRU documentation

8.4.3 Gated Units

Without delving too deeply into the equations of LSTMs and GRUs, it is useful to understand the core concept they introduce: gated units. So far, the primary way we have combined information from two units, u_1 and u_2, has been through a linear combination, w_1u_1 + w_2u_2. Gating provides an alternative mechanism based on element-wise multiplication.

Figure 8.13: A gated unit, where u_1 controls the flow of information from u_2.

A gate is typically a vector produced by a sigmoid activation function, \sigma(u_1), whose values range between 0 and 1. This gate then acts as a filter on another vector, u_2. When a value in the gate is close to 0, the corresponding feature in u_2 is blocked. When it is close to 1, the feature is allowed to pass through.

To build some intuition, consider a text processing example where u_2 is a vector representing the probability of the next word:

u_2 = \begin{bmatrix} \vdots \\ p(\text{bat --- the animal}) = 0.4 \\ p(\text{bat --- the stick}) = 0.3 \\ \vdots \end{bmatrix}

Here, the word “bat” is ambiguous. The role of the gate, \sigma(u_1), which is computed from the prior context, could be to resolve this ambiguity:

\sigma(u_1) = \begin{bmatrix} \vdots \\ 0.96 \\ 0.04 \\ \vdots \end{bmatrix}

Multiplying the two vectors element-wise effectively filters out the unwanted meaning:

u_2 \odot \sigma(u_1) = \begin{bmatrix} \vdots \\ 0.4 \\ 0.3 \\ \vdots \end{bmatrix} \odot \begin{bmatrix} \vdots \\ 0.96 \\ 0.04 \\ \vdots \end{bmatrix} = \begin{bmatrix} \vdots \\ 0.384 \\ 0.012 \\ \vdots \end{bmatrix}
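
The same computation in NumPy, keeping only the two "bat" entries:

import numpy as np

u2 = np.array([0.4, 0.3])        # p(bat, the animal), p(bat, the stick)
gate = np.array([0.96, 0.04])    # sigma(u1), computed from the prior context
print(u2 * gate)                 # [0.384 0.012]: the second meaning is suppressed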

In LSTMs and GRUs, this gating mechanism is used to control the cell state, allowing the network to learn what information to store, what to forget, and what to output at each time step.

8.5 Application: Image Caption Generator

A powerful application that combines computer vision and natural language processing is the Image Caption Generator. This model takes an image as input and automatically generates a textual description.

The process begins (see Figure 8.14) by using a pre-trained Convolutional Neural Network (CNN), such as VGG or ResNet, to extract a rich set of visual features from the input image.

We typically remove the final classification layer of the CNN, as we are interested in the high-level feature representation from one of the last fully connected layers, not the final class prediction.
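
A hedged sketch of this feature-extraction step with the pre-trained VGG16 shipped with Keras; the file name is a placeholder, and taking the fc2 layer is one common choice for the feature representation.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights='imagenet')                               # full network, including the classifier
cnn = Model(base.input, base.get_layer('fc2').output)          # keep the last FC layer, drop the prediction layer

img = image.load_img('example.jpg', target_size=(224, 224))    # placeholder file name
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = cnn.predict(x)                                      # shape (1, 4096): a numerical summary of the image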

Figure 8.14: Image Captioning. First, the input image is mapped into a vector using a pre-trained network like VGG. This vector is used as the context for the first step of an RNN. The RNN’s output makes predictions about the next word token. Sampling ends when the special token <end> is emitted.

This feature vector, which serves as a numerical summary of the image’s content, is then fed as the initial input to an RNN (typically an LSTM or GRU). The RNN’s task is to generate the caption, one word at a time.

The RNN is trained to predict the next word in the caption, given the image features and the words generated so far. During inference, we continue this process, feeding the previously generated word back as input to predict the next, until a special <end> token is generated.
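
A heavily hedged sketch of that inference loop; `caption_model`, `word_to_idx` and `idx_to_word` are hypothetical objects from a trained captioning pipeline that takes the image features and the tokens generated so far and returns a distribution over the word vocabulary.

import numpy as np

def generate_caption(features, max_len=20):
    words = ['<start>']
    for _ in range(max_len):
        seq = np.array([[word_to_idx[w] for w in words]])             # tokens generated so far
        probs = caption_model.predict([features, seq], verbose=0)[0]
        next_word = idx_to_word[int(np.argmax(probs))]                # greedy choice; sampling also works
        if next_word == '<end>':
            break
        words.append(next_word)
    return ' '.join(words[1:])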

8.6 Takeaways

Recurrent Neural Networks provide a powerful framework for modelling sequential data, finding applications in time series analysis, text processing, and video analysis. However, simple RNNs are difficult to train due to the vanishing and exploding gradient problems, which limit their ability to capture long-range dependencies.

Gated architectures like LSTMs and GRUs were developed to overcome these limitations. By using gating mechanisms to control the flow of information, they have made the training of deep recurrent models far more stable and effective. These models became the standard for many language-based tasks, including machine translation, text generation, and image captioning.

8.7 Limitations of RNNs and the Rise of Transformers

Despite their success, RNNs have fundamental limitations. Their inherently recurrent nature, processing data one step at a time, prevents effective parallelisation. This makes them slow to train on very long sequences.

Perhaps more critically, RNNs and LSTMs are not well-suited for transfer learning in the same way as CNNs. It is difficult to pre-train a general-purpose RNN on a massive dataset and then fine-tune it for a new task. Consequently, most RNN applications require training from scratch, which demands large amounts of task-specific data and significant computational resources.

The 2017 landmark paper “Attention Is All You Need” (Vaswani et al. 2017) introduced the Transformer architecture, which has since ended the predominance of RNNs in natural language processing. Transformers, built upon the Attention Mechanism, dispense with recurrence entirely. Their design allows for massive parallelisation and has proven exceptionally effective for transfer learning. This has enabled the creation of powerful pre-trained models like BERT and GPT, which can be adapted to a wide range of tasks with minimal fine-tuning. This is why we cover Transformers and these pre-trained models in the next chapters.

… but good ideas never die. Recurrent approaches made a comeback in late 2023 with Mamba, an architecture demonstrating that state-space models with recurrent properties can compete with, and sometimes outperform, Transformers, particularly in terms of computational efficiency during inference. This shows that the principles of recurrence remain an active and evolving area of research.

See Also
  • Albert Gu, Tri Dao (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”. original paper
  • see also this tutorial