Chapter 8 Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a special type of neural architecture designed for sequential data.
8.1 A Feed Forward Network Rolled Out Over Time
Sequential data can be found in any time series, such as audio signals, stock market prices, or vehicle trajectories, but also in natural language processing (text). In fact, RNNs have been particularly successful on Machine Translation tasks.
Recurrent Networks define a recursive evaluation of a function. The input stream feeds a context layer (denoted by \(h\) in the diagram). The context layer then re-uses the previously computed context values to compute the output values.
The best analogy in signal processing would be to say that, if convolutional layers are similar to FIR filters, then RNNs are similar to IIR filters.
The RNN can be unfolded to produce a classic feedforward neural net.
A key aspect of RNNs is that the network parameters \(w\) are shared across all the iterations. That is, \(w\) is fixed in time.
In its simplest form, the inner structure of the hidden layer block is a dense layer of neurons with \(\mathrm{tanh}\) activation. This is called a simple RNN architecture, or Elman network.
We usually take a \(\mathrm{tanh}\) activation as it can produce positive or negative values, allowing for increases and decreases of the state values. Also, \(\mathrm{tanh}\) bounds the state values between -1 and 1, thus avoiding a potential explosion of the state values.
The equations for this network are as follows:
\[ \begin{aligned}{\bf h}_{t}&=\tanh({\bf W}_{h}{\bf x}_{t}+{\bf U}_{h}{\bf h}_{t-1}+{\bf b}_{h})\\{\bf y}_{t}&=\sigma _{y}({\bf W}_{y}{\bf h}_{t}+{\bf b}_{y}) \end{aligned} \]
where \({\bf x}_t\) is the input vector, \({\bf h}_t\) the vector of hidden layer states, \({\bf y}_t\) the output vector, \(\sigma_y\) the output's activation function, \({\bf W}_{h}\) and \({\bf b}_h\) the weight matrix and bias vector of the hidden layer, \({\bf U}_{h}\) the matrix of feedback (recurrent) weights for \(h\), and \({\bf W}_{y}\) and \({\bf b}_y\) the weight matrix and bias vector of the output.
The parameters \({\bf W}_{h}\), \({\bf W}_{y}\), \({\bf b}_{h}\), \({\bf b}_{y}\) are shared by all input vectors \({x}_t\).
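To make the recurrence and the weight sharing concrete, here is a minimal NumPy sketch of the forward pass (the function name and variable shapes are illustrative, not taken from any particular library):

import numpy as np

def simple_rnn_forward(x, W_h, U_h, b_h, W_y, b_y):
    """Forward pass of an Elman RNN over a sequence x of shape (n, p)."""
    h = np.zeros_like(b_h)                        # initial hidden state h_0 = 0
    outputs = []
    for x_t in x:                                 # iterate over the n timesteps
        h = np.tanh(W_h @ x_t + U_h @ h + b_h)    # state update; same W_h, U_h, b_h at every step
        outputs.append(W_y @ h + b_y)             # linear output (apply sigma_y as needed)
    return np.stack(outputs), h                   # one output per timestep, plus the final state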
In Keras, we can define a simple RNN layer as follows:
from tensorflow.keras.layers import Input, SimpleRNN, Dense

input = Input(shape=(n, p))                          # n timesteps, p features per timestep
h = SimpleRNN(hsize, return_sequences=True)(input)   # one hidden state per timestep
output = Dense(osize, activation='softmax')(h)       # one output per timestep
Note that we can choose to produce a single output for the entire sequence instead of an output at each timestamp. In Keras, this would be defined as:
input = Input(shape=(n, p))
h = SimpleRNN(hs, return_sequences=False)(input)   # only the last hidden state is returned
output = Dense(os, activation='softmax')(h)        # single output for the whole sequence
And we can stack multiple RNN layers. For instance:
input = Input(shape=(n, p))
h = SimpleRNN(hs, return_sequences=True)(input)    # pass the full sequence to the next RNN layer
k = SimpleRNN(ks, return_sequences=False)(h)
output = Dense(os, activation='softmax')(k)
8.2 Application Example: Character-Level Language Modelling
The next slide presents an example application of RNNs, where we try to predict the next character given a sequence of previous characters. The idea is to train the RNN on a large corpus of text, so as to model the inner dynamics of the text.
Training. We start from a one-hot encoding of the characters. Each input of the RNN is a character from the sequence. The RNN is then used for a classification task: we try to classify the output of the sequence \({\bf x}_1,\cdots,{\bf x}_{n-1}\) as the next character \({\bf y}={\bf x}_{n}\).
Since we are using softmax with cross-entropy, the network returns the probability distribution of the next character.
We are training for a classification task: can you predict the next character based on the previous characters?
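As an illustration, a possible Keras set-up for this training stage could look as follows (here corpus is a placeholder for a long training string, and the sequence length and layer sizes are arbitrary choices):

import numpy as np
from tensorflow.keras.layers import Input, SimpleRNN, Dense
from tensorflow.keras.models import Model

chars = sorted(set(corpus))                        # character vocabulary of the training text
char_to_idx = {c: i for i, c in enumerate(chars)}
seq_len, vocab = 40, len(chars)

# build (sequence of characters, next character) pairs, one-hot encoded
X = np.zeros((len(corpus) - seq_len, seq_len, vocab), dtype=np.float32)
y = np.zeros((len(corpus) - seq_len, vocab), dtype=np.float32)
for i in range(len(corpus) - seq_len):
    for t, c in enumerate(corpus[i:i + seq_len]):
        X[i, t, char_to_idx[c]] = 1.0
    y[i, char_to_idx[corpus[i + seq_len]]] = 1.0

# classification of the next character: softmax output + cross-entropy loss
inp = Input(shape=(seq_len, vocab))
h = SimpleRNN(128, return_sequences=False)(inp)
out = Dense(vocab, activation='softmax')(h)
model = Model(inp, out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, batch_size=128, epochs=20)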
Once we have trained the RNN, we can then generate whole sentences, one character at a time. We achieve this by providing an initial sentence fragment, or seed. We then use the RNN to predict the probability distribution of the next character, and sample the next character from these probabilities. This character is appended to the sentence and the process is repeated.
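A possible sampling loop for this generation step, reusing the model and encodings from the sketch above, could be:

def generate(model, seed, n_chars=200):
    """Generate text one character at a time by sampling from the predicted distribution."""
    text = seed                                      # seed should be at least seq_len characters long
    for _ in range(n_chars):
        x = np.zeros((1, seq_len, vocab), dtype=np.float32)
        for t, c in enumerate(text[-seq_len:]):      # one-hot encode the last seq_len characters
            x[0, t, char_to_idx[c]] = 1.0
        probs = model.predict(x, verbose=0)[0]       # probability of each possible next character
        probs = probs / probs.sum()                  # renormalise against floating-point drift
        next_idx = np.random.choice(vocab, p=probs)  # sample rather than take the argmax
        text += chars[next_idx]
    return text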
The text generation process is illustrated in the next slide.
This fun application is taken from this seminal blog post by Karpathy:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/#fun-with-rnns
Check this link for results and more insight about the RNN!
8.3 Training: Back-Propagation Through Time
To train a RNN, we can unroll the network to expand it into a standard feedforward network and then apply back-propagation as per usual.
This process is called Back-Propagation Through Time (BPTT).
Note that the unrolled network can grow very large and might be hard to fit into GPU memory. Also, the process is sequential in nature, which makes it difficult to exploit parallelism.
A strategy sometimes used to speed up learning is to split the sequence into chunks and apply BPTT on these truncated parts (a short sketch of the chunking is given below). This process is called Truncated Back-Propagation Through Time.
Figure: example of unrolling the RNN with BPTT. It is possible to split the sequence into chunks and train each chunk separately (truncated BPTT).
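As a minimal sketch of this chunking step (assuming a long one-hot-encoded sequence data with per-timestep targets), one can simply reshape the sequence into fixed-length windows; in Keras, setting stateful=True on the RNN layer additionally carries the hidden state from one chunk to the next:

def make_chunks(data, targets, chunk_len):
    """Split one long sequence into fixed-length chunks for truncated BPTT."""
    n_chunks = len(data) // chunk_len
    X = data[:n_chunks * chunk_len].reshape(n_chunks, chunk_len, -1)
    y = targets[:n_chunks * chunk_len].reshape(n_chunks, chunk_len, -1)
    return X, y   # gradients are then only back-propagated within each chunk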
8.4 Dealing with Long Sequences
When unrolled, recurrent networks can grow very deep. As with any deep network, the main problem with using gradient descent is then that the error gradients can vanish (or explode) exponentially quickly. We therefore rarely use the simple RNN architecture, as it is very difficult to train. Instead, we usually resort to two alternative RNN layer architectures: LSTM and GRU.
8.4.1 LSTM
LSTM (Long Short-Term Memory) was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber (Hochreiter and Schmidhuber 1997) specifically to deal with the exploding and vanishing gradient problem. LSTM blocks are a special type of network used for the recurrent hidden layer, and an LSTM block can be used as a direct replacement for the dense layer structure of simple RNNs.
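In Keras, the LSTM layer is used as a drop-in replacement for SimpleRNN (same placeholder sizes as before):

from tensorflow.keras.layers import Input, LSTM, Dense

input = Input(shape=(n, p))
h = LSTM(hsize, return_sequences=True)(input)   # same interface as SimpleRNN
output = Dense(osize, activation='softmax')(h)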
After 2014, major technology companies including Google, Apple, and Microsoft started using LSTM in their speech recognition or Machine Translation products.
S. Hochreiter and J. Schmidhuber (1997). “Long short-term memory”. [https://goo.gl/hhBNRE]
See also Brandon Rohrer’s video: [https://youtu.be/WCUNPb-5EYI]
and colah’s blog [https://goo.gl/uc7gbn]
8.4.2 GRU
GRU (Gated Recurrent Units) were introduced in 2014 (Chung et al. 2014) as a simpler alternative to the LSTM block. Their performance is reported to be similar to that of LSTM (perhaps slightly better on smaller problems and slightly worse on bigger ones). As they have fewer parameters than LSTM, GRUs are somewhat faster to train.
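Again, in Keras the GRU layer is a drop-in replacement for SimpleRNN or LSTM:

from tensorflow.keras.layers import Input, GRU, Dense

input = Input(shape=(n, p))
h = GRU(hsize, return_sequences=True)(input)    # fewer parameters than the LSTM block
output = Dense(osize, activation='softmax')(h)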
J. Chung, C. Gulcehre, K. Cho and Y. Bengio (2014). “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”. [https://arxiv.org/abs/1412.3555]
8.4.3 Gated Units
Without going too much into the inner workings of GRU and LSTM, we note that they make use of gated units, which offer an alternative way of combining units. So far, the only way we had to combine two units \(u_1\) and \(u_2\) was through a linear combination \(w_1u_1+w_2u_2\). The gating mechanism instead combines both inputs through an element-wise multiplication, \(\sigma(u_1)\odot u_2\):
The sigmoid \(\sigma\) produces a vector of (soft) True/False conditions that filters out features of \(u_2\), based on the prediction of another sub-network, \(u_1\).
To understand why this is useful, remember that feature vectors typically contain non-negative values that indicate how strongly a feature is expressed. For instance, say that you are processing text; \(u_2\) could be used to predict the next word probabilities:
\[ u_2 = \begin{bmatrix} \vdots \\ p(\text{bat --- the animal}) = 0.4 \\ p(\text{bat --- the stick}) = 0.3 \\ \vdots \end{bmatrix} \]
Here there is some ambiguity about the meaning of “bat”. The role of the prediction \(\sigma(u_1)\) could be to specifically disambiguate this:
\[ \sigma(u_1) = \begin{bmatrix} \vdots \\ 0.96 \\ 0.04 \\ \vdots \end{bmatrix} \]
Then multiplying both vectors would filter out the unwanted features:
\[ u_2 \odot \sigma(u_1) = \begin{bmatrix} \vdots \\ 0.4 \\ 0.3 \\ \vdots \end{bmatrix} \odot \begin{bmatrix} \vdots \\ 0.96 \\ 0.04 \\ \vdots \end{bmatrix} = \begin{bmatrix} \vdots \\ 0.38 \\ 0.01 \\ \vdots \end{bmatrix} \]
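A tiny NumPy illustration of this gating, reproducing the numbers of the bat example (the gate pre-activations are made up so that the sigmoid gives roughly 0.96 and 0.04):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

u2 = np.array([0.4, 0.3])      # candidate features: p(bat, the animal), p(bat, the stick)
u1 = np.array([3.2, -3.2])     # gate pre-activations predicted by another sub-network
gate = sigmoid(u1)             # ~[0.96, 0.04]: keep the first feature, suppress the second
print(gate * u2)               # element-wise product -> ~[0.38, 0.01]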
8.6 Take Away
Recurrent Neural Networks offer a way to deal with sequences, such as in time series, video sequences, or text processing. RNNs are particularly difficult to train, as unfolding them into feedforward networks leads to very deep networks, which are prone to vanishing or exploding gradient issues.
Gated recurrent networks (LSTM, GRU) have made training much easier and have become the method of choice for most applications based on language models (e.g. image captioning, text understanding, machine translation, text generation, etc.).
8.7 Limitations of RNNs and the Rise of Transformers
One issue with the idea of recurrence is that it prevents parallel computing. Unrolling the RNN can lead to potentially very deep networks of arbitrary length, and, as each timestep depends on the previous hidden state, there is no convenient way to parallelise the computation over the sequence.
The most critical issue with RNNs/LSTMs is, however, that they are not well suited to transfer learning. It is very difficult to build on pre-trained models, as we do with CNNs. Any new application with RNNs therefore requires vast quantities of data and a tricky training process.
The 2017 landmark paper on the Attention mechanism (Vaswani et al. 2017) has since ended the architectural predominance of RNNs. Pretty much any language model now relies on the Transformer architecture, which is built on top of this Attention mechanism. The defining advantage of Attention over RNNs is that it can be used efficiently for transfer learning. This means that any application that requires a language model can now build on top of powerful pre-trained Transformer models, such as BERT, and thus avoid the lengthy and complex training of RNNs.
Attention Is All You Need [https://arxiv.org/abs/1706.03762]
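As a brief illustration of this transfer-learning workflow (using the Hugging Face transformers library, which is not covered in this chapter), a pre-trained BERT model can be reused in a couple of lines, with no RNN-style training from scratch:

from transformers import pipeline

# download and reuse a pre-trained BERT language model
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask("Recurrent networks are [MASK] to train on long sequences."))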
… but good ideas never die. Recurrent approaches made a comeback in late 2023 with Mamba, an architecture that shows that recurrent models can compete with Transformers, at a lower computational cost at inference time.
Albert Gu, Tri Dao (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”. [https://arxiv.org/abs/2312.00752]
See also this tutorial: [https://tinyurl.com/4ub385ye]