10  Attention Mechanism and Transformers

The Attention Mechanism (2015) and the Transformer model (2017), which is built upon it, have revolutionised the field of Natural Language Processing (NLP). Their influence has been so profound that they have been widely adopted in almost all Deep Learning applications, from computer vision to speech recognition.

In this chapter, we will look in detail at the Attention Mechanism and the Transformer model. As these architectures originated in the field of NLP, we will introduce them in the context of text processing, which provides a natural and intuitive setting for understanding their core concepts.

10.1 Motivation

To understand why Transformers and Attention have had such an impact, we first need to appreciate the limitations of the models that came before them, namely Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).

10.1.1 The Problem with CNNs and RNNs

Recurrent Neural Networks (such as LSTMs and GRUs) were, for a long time, the model of choice for sequence processing tasks. Their recurrent nature makes them a natural fit for handling variable-length inputs like sentences. However, they have several drawbacks:

  • The sequential nature of RNNs prohibits parallelisation. Each step depends on the previous one, making them slow to train on long sequences.
  • The context is computed from the past only, meaning the representation of a word only depends on the words that came before it. Bidirectional RNNs mitigate this, but they are even more computationally expensive.
  • There is no explicit distinction between short-term and long-term dependencies; everything is handled by the same recurrent state, which can become a bottleneck.
  • Training RNNs can be tricky due to vanishing and exploding gradient problems.
  • It is not straightforward to apply transfer learning efficiently.

On the other hand, Convolutional Neural Networks, which are dominant in computer vision, can also be applied to sequences (using 1D convolutions). They offer several advantages:

  • They can be massively parallelised, as the output at each position can be computed independently.
  • They are excellent at exploiting local dependencies (within the kernel’s receptive field). Long-range dependencies can be captured by stacking multiple layers.

However, CNNs also have their own limitations when it comes to text:

  • They are not designed to handle variable-sized inputs without padding or truncation, which can lead to a loss of information.
  • The dependencies they capture are at fixed positions relative to the current word, which is a rigid assumption for language.

10.1.2 The Problem with Positional Dependencies

Let us examine the issue of fixed positional dependencies more closely. Consider a simple 1D convolution on a sequence of feature vectors {\bf x}_i with a kernel size of 5. To simplify the argument, we will ignore cross-channel interactions:

\begin{equation} \text{output}_i = w_{-2} {\bf x}_{i-2} + w_{-1} {\bf x}_{i-1} + w_{0} {\bf x}_{i} + w_{+1} {\bf x}_{i+1} + w_{+2} {\bf x}_{i+2} + b, \end{equation}

The weight w_{-1} is always associated with the dependency relationship between the current sample and the previous one (i.e., a distance of 1 in the past). This relationship is assumed to be the same across all sentences.

Now, consider a dense (fully connected) layer, again ignoring cross-channel interactions:

\begin{equation} \text{output}_i = \sum_{j=1}^L w_{i,j} {\bf x}_{j} + b, \end{equation}

Here, we face a similar issue: the relationships are defined according to fixed absolute positions. For example, the weight w_{1,3} captures the relationship between the first and third words, and this is assumed to be the same for all sentences, regardless of their content.

However, in natural language, dependencies are not so rigid. Look at the dependency graph for a typical sentence:

Figure 10.1: Example of a Sentence Dependency Graph

The distances between related words are not fixed. For instance, the verb is not always the word immediately following the subject. Convolutional and Dense layers are not well equipped to handle such flexible relationships.

So, what is the problem? Can we not just make the network bigger?

Yes, the Universal Approximation Theorem tells us that we can always throw more filters or neurons at the problem. In theory, a large enough network could learn all possible dependency graphs. However, this is clearly not an optimal approach. It is inefficient and would require vast amounts of data.

This is where the Attention Mechanism comes to the rescue. It provides a way to learn these dependencies dynamically.

10.2 The Attention Mechanism

The Attention Mechanism was originally introduced in the context of machine translation and image captioning, where it was used to align different parts of an image with words in a sentence (Xu et al. 2015). The idea was quickly adapted to model relationships between words within a single sentence (Luong, Pham, and Manning 2015).

Since its inception, the Attention Mechanism has been iterated upon in many papers, leading to various forms (e.g., Bahdanau Attention, Luong Attention). Here, we will focus on the Scaled Dot-Product Attention used in the Transformer model, as it is arguably the most popular and influential.

10.2.1 Core Mechanism of a Dot-Product Attention Layer

Let us revisit the formulation of a Dense layer:

\begin{equation} \text{output}_i = \sum_{j=1}^L w_{i,j} {\bf x}_{j} + b, \end{equation}

The core idea of Attention is that, instead of learning a fixed set of weights w_{i,j}, we could devise a recipe to generate these weights on the fly, based on the input data itself. For instance, we could have something like this:

\begin{equation} \text{output}_i = \sum_{j=1}^L f({\bf x}_{i}, {\bf x}_{j}) {\bf x}_{j}, \end{equation}

where f would be a function that computes the weights dynamically.

Taking our previous NLP example, the word 'is' is clearly a verb and 'hearing' is a subject. We could therefore imagine that the weight w_{\text{is},\text{hearing}} could be defined based purely on the semantics of {\bf x}_\text{is} and {\bf x}_\text{hearing}, regardless of their actual positions in the sentence:

\begin{equation} w_{\text{is},\text{hearing}} = f({\bf x}_\text{is}, {\bf x}_\text{hearing}). \end{equation}

This is the central idea behind Attention. Let us now see how it is implemented in practice.

To make the explanation more generic, we will consider two sequences of vectors: a sequence of queries, {\bf q}_1, \dots, {\bf q}_{L_q}, and a sequence of keys, {\bf k}_1, \dots, {\bf k}_{L_k}. The terms keys and queries draw an analogy to a retrieval or database system (see later). In the following example, we will focus on computing the output for a single query, {\bf q}_3:

The Attention layer computes an alignment score, s, between the query {\bf q}_{3} and each of the keys, {\bf k}_{1}, \dots, {\bf k}_{4}:

Figure 10.2

Many formulae for the alignment score exist. The formula used in the Transformer paper is based on the scaled dot product of the feature vectors:

\begin{equation} s_{i,j} = {\bf q}_{i}^{\top} {\bf k}_{j} / \sqrt{d_k} \end{equation}

(Note: the normalisation by the square root of the key dimension, d_k, is an important detail that was found to help stabilise training).

The scores, s, are analogous to logits: a large score means that the query and key are highly related (in the limit, a score of +\infty would indicate a perfect match). A softmax function can then be used to normalise these scores into a set of weights that sum to 1:

\begin{equation} [w_{3,1}; w_{3,2}; w_{3,3}; w_{3,4}] = \mathrm{softmax}( [s_{3,1} ; s_{3,2} ; s_{3,3} ; s_{3,4}]) \end{equation}

Instead of combining the keys, we use these weights to form a weighted sum of a third set of vectors, the values, {\bf v}_1, \dots, {\bf v}_{L_k} (again, note the analogy to a retrieval system):

\begin{equation} \mathrm{output}_3 = w_{3,1} {\bf v}_1 + w_{3,2} {\bf v}_2 + w_{3,3} {\bf v}_3 + w_{3,4} {\bf v}_4 \end{equation}

Figure 10.3

We can repeat this operation for all other query vectors, for example, for {\bf q}_5:

Figure 10.4

In summary, an Attention layer takes three tensors as input:

  • A tensor of queries, {\bf Q}=[{\bf q}_1, \dots, {\bf q}_{L_q}]^{\top}, of size L_q \times d_q, where L_q is the length of the query sequence and d_q is the dimension of the query feature vectors.

  • A tensor of keys, {\bf K}=[{\bf k}_1, \dots, {\bf k}_{L_k}]^{\top}, of size L_k \times d_k.

  • A tensor of values, {\bf V} = [{\bf v}_1, \dots, {\bf v}_{L_k}]^{\top}, of size L_k \times d_v.

Note that the key and query dimensions must be equal (d_k = d_q), while the value dimension, d_v, can be different.

The values can be thought of as the context vectors associated with each word, similar to what we would have in an RNN. The queries and keys are different representations of the input words, used to determine how they relate to each other.

Given these three tensors, the Attention layer returns a new tensor of size L_q \times d_v, where each output vector is a weighted average of the value vectors:

\begin{equation} \text{output}_{i} = \sum_{j=1}^{L_k} w_{i,j} {\bf v}_{j} \end{equation}

On the face of it, this looks like a dense layer, as each output vector is a linear combination of the value vectors. The crucial difference is that the weights, w_{i,j}, are computed dynamically as a function of how well the query {\bf q}_i aligns with the key {\bf k}_j:

\begin{equation} s_{i,j} = {\bf q}_{i}^{\top} {\bf k}_{j} / \sqrt{d_k} \end{equation}

These scores are then normalised using a softmax function:

\begin{equation} w_{i,j} = \frac{\exp(s_{i,j})}{\sum_{l=1}^{L_k} \exp(s_{i,l})} \quad \text{so as to have $\sum_j w_{i,j} = 1$ and $0 \leq w_{i,j} \leq 1$. } \end{equation}

In other words, for each query vector {\bf q}_i:

  1. We evaluate the alignment/similarity between {\bf q}_i and all the keys {\bf k}_j:

    \begin{equation} s_{i,j} = {\bf q}_i^\top {\bf k}_j / \sqrt{d_k} \end{equation}

  2. The scores are then normalised across all keys using softmax to obtain the weights w_{i,j}:

    \begin{equation} w_{i,j} = \frac{\exp(s_{i,j})}{\sum_{l=1}^{L_k} \exp(s_{i,l})} \end{equation}

  3. We compute the output vector as the weighted average of the value vectors {\bf v}_j:

    \begin{equation} \text{output}_{i} = \sum_{j=1}^{L_k} w_{i,j} {\bf v}_{j} \end{equation}

10.2.2 The Attention Mechanism as a Fuzzy Dictionary Lookup

Now that the mechanics are understood, let us revisit the retrieval analogy. The attention mechanism can be understood as a differentiable, “fuzzy” version of a key-value lookup, such as one performed with a Python dictionary.

Consider a simple dictionary named capital that maps countries to their capital cities:

capital = {
    "France": "Paris",
    "UK": "London",
    "Germany": "Berlin"
}

In the language of attention, we can represent these pairs as keys ({\bf k}) and values ({\bf v}):

\begin{aligned} {\bf k}_1 &\equiv \text{`France'} & ; \quad & {\bf v}_1 &\equiv \ \text{`Paris'} \\ {\bf k}_2 &\equiv \text{`UK'} & ; \quad & {\bf v}_2 &\equiv \ \text{`London'} \\ {\bf k}_3 &\equiv \text{`Germany'} & ; \quad & {\bf v}_3 &\equiv \ \text{`Berlin'} \\ \end{aligned}

A standard dictionary performs a “hard” lookup. If we provide the query 'France', it finds an exact match with key {\bf k}_1 and returns the single corresponding value {\bf v}_1.

We can model this hard lookup using an attention-like process. Imagine a hypothetical alignment score function that returns +\infty if the query and key strings are identical, and -\infty if they differ. When we apply the softmax function to these scores, the weights become either 1 for a perfect match or 0 otherwise.

Let’s pose the query {\bf q}_1 \equiv \text{`France'}. The alignment scores with our keys ({\bf k}_1, {\bf k}_2, {\bf k}_3) would be (+\infty, -\infty, -\infty). Applying the softmax function to these scores yields the attention weights:

(w_{1,1}, w_{1,2}, w_{1,3}) = \text{softmax}(+\infty, -\infty, -\infty) = (1, 0, 0)

The output is then the weighted sum of the values:

\begin{aligned} \text{output}_1 &= \sum_{j=1}^{3} w_{1,j} {\bf v}_{j} \\ &= (1 \times {\bf v}_1) + (0 \times {\bf v}_2) + (0 \times {\bf v}_3) \\ &\equiv \text{`Paris'} \end{aligned}

This perfectly mimics the dictionary lookup.

The actual attention mechanism is a “soft” or “fuzzy” lookup. By using a similarity measure (such as the scaled dot product) between vector representations, a query for 'French Republic' might result in a high, but not absolute, similarity to 'France', yielding attention weights like (0.95, 0.03, 0.02). The resulting output would be a blend of the values, heavily weighted towards 'Paris'. This relaxes the requirement of having a single exact match.
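
To make the analogy concrete, below is a small NumPy sketch of both lookups, using hypothetical one-hot vectors as stand-ins for the values and large finite scores in place of \pm\infty:

import numpy as np

def softmax(x):
    x = x - x.max()                        # subtract the max for numerical stability
    return np.exp(x) / np.exp(x).sum()

values = np.eye(3)                         # toy one-hot stand-ins for 'Paris', 'London', 'Berlin'

# "hard" lookup: the query 'France' matches the first key exactly
scores_hard = np.array([1e3, -1e3, -1e3])  # large finite scores standing in for +/- infinity
print(softmax(scores_hard) @ values)       # [1. 0. 0.] -> 'Paris'

# "fuzzy" lookup: the query 'French Republic' is similar, but not identical, to 'France'
weights_soft = np.array([0.95, 0.03, 0.02])
print(weights_soft @ values)               # [0.95 0.03 0.02] -> a blend, mostly 'Paris'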

10.2.3 No Trainable Parameters

As we loop through the queries and keys, the number of similarity scores to compute is L_q \times L_k. Each score takes \mathcal{O}(d_k) operations, so the overall computational complexity is \mathcal{O}(L_q \times L_k \times d_k). This is very similar in complexity to a dense layer (except that there are no cross-channel weights).

Importantly, because we have a formula to compute the weights, the Attention mechanism itself does not have any trainable parameters. This becomes apparent when we write down the full mathematical formula in matrix form:

\begin{equation} \small \text{Attention}({\bf Q}, {\bf K}, {\bf V}) = \mathrm{softmax}\left(\frac{{\bf Q} {\bf K}^\top}{\sqrt{d_k}} \right) {\bf V} \end{equation}

where the \mathrm{softmax} function is applied row-wise.
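
Since the layer is just a formula, it can be implemented in a few lines of NumPy. The sketch below is a minimal, unbatched version of the matrix form above (the softmax is applied row-wise):

import numpy as np

def attention(Q, K, V):
    # Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # alignment scores, (L_q, L_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ V                                        # weighted averages, (L_q, d_v)

Q = np.random.randn(5, 4)    # 5 queries of dimension d_k = 4
K = np.random.randn(7, 4)    # 7 keys of the same dimension
V = np.random.randn(7, 3)    # 7 values of dimension d_v = 3
print(attention(Q, K, V).shape)    # (5, 3): one output vector per query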

10.2.4 Self-Attention

Self-Attention is a special case of the Attention mechanism where the queries, keys, and values are all derived from a single input tensor, {\bf X} = [{\bf x}_1, {\bf x}_2, \cdots, {\bf x}_L]^{\top} of size L \times d. This is achieved by using three separate linear transformations to project the input tensor into the query, key, and value spaces:

\begin{equation} {\bf q}_i = {\bf W}_Q^{\top} {\bf x}_i, \end{equation}

\begin{equation} {\bf k}_i = {\bf W}_K^{\top} {\bf x}_i, \end{equation}

\begin{equation} {\bf v}_i = {\bf W}_V^{\top} {\bf x}_i . \end{equation}

The Self-Attention output is therefore given by:

\begin{equation} \begin{split} \text{Self-Attention}({\bf X}, {\bf W}_Q, {\bf W}_K, {\bf W}_V) = \\ \text{Attention}({\bf X}{\bf W}_Q, {\bf X}{\bf W}_K, {\bf X}{\bf W}_V) \end{split} \end{equation}

If we substitute the definitions, we get the following all-in-one equation:

\begin{equation} \text{Self-Attention}({\bf X}) = \mathrm{softmax}\left(\frac{({\bf X} {\bf W}_Q)({\bf X} {\bf W}_K)^\top }{\sqrt{d_k}} \right) ({\bf X} {\bf W}_V) \end{equation}

In this formulation, the only trainable parameters are contained in the weight matrices {\bf W}_Q (size d \times d_q), {\bf W}_K (size d \times d_k), and {\bf W}_V (size d \times d_v). These are relatively small matrices, and crucially, they can operate on sequences of any length, since their dimensions do not depend on the sequence length L.

The code presented below is a Python/NumPy implementation showing how the attention vector for the first token/word in the sequence can be computed. The same computation would then need to be applied to all the other words in the sequence.


import numpy as np

def softmax(x):
    # exponentiate and normalise so that the weights sum to 1
    return np.exp(x) / np.exp(x).sum()

# encoder representations of four different words
word_1 = np.array([1, 0, 0]); word_2 = np.array([0, 1, 0]);
word_3 = np.array([1, 1, 0]); word_4 = np.array([0, 0, 1])

# initialisation of the weight matrices
# These would be learned during training
W_Q = np.random.randn(3, 2) # d=3, d_q=2
W_K = np.random.randn(3, 2) # d=3, d_k=2
W_V = np.random.randn(3, 2) # d=3, d_v=2

# generating the queries, keys and values
query_1 = word_1 @ W_Q; key_1 = word_1 @ W_K; value_1 = word_1 @ W_V
query_2 = word_2 @ W_Q; key_2 = word_2 @ W_K; value_2 = word_2 @ W_V
query_3 = word_3 @ W_Q; key_3 = word_3 @ W_K; value_3 = word_3 @ W_V
query_4 = word_4 @ W_Q; key_4 = word_4 @ W_K; value_4 = word_4 @ W_V

# scoring the first query vector against all key vectors
scores_1 = np.array([np.dot(query_1, key_1), np.dot(query_1, key_2),
                     np.dot(query_1, key_3), np.dot(query_1, key_4)])

# computing the weights by a softmax operation
weights_1 = softmax(scores_1 / np.sqrt(key_1.shape[0]))

# computing the first attention vector
attention_1 = (weights_1[0] * value_1 + weights_1[1] * value_2 +
               weights_1[2] * value_3 + weights_1[3] * value_4)
 
print(attention_1)
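
The same computation can also be carried out for all four words at once, which makes the connection with the all-in-one matrix equation above explicit (reusing the variables defined in the snippet above):

# stacking the four word vectors into a single L x d input matrix X
X = np.stack([word_1, word_2, word_3, word_4])        # shape (4, 3)

Q = X @ W_Q; K = X @ W_K; V = X @ W_V                 # queries, keys, values, each (4, 2)

scores = Q @ K.T / np.sqrt(K.shape[-1])               # (4, 4) matrix of alignment scores
weights = np.apply_along_axis(softmax, 1, scores)     # row-wise softmax
outputs = weights @ V                                 # (4, 2) attention outputs

print(outputs[0])   # same as attention_1 above (up to floating-point precision)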

10.2.5 Computational Complexity

Since each feature vector in the sequence is compared to all other feature vectors, the computational complexity is quadratic in the input sequence length, L. This is similar to a dense layer.

Method            Complexity
----------------  ---------------------------------------------------
Self-Attention    \mathcal{O}(L^2 \cdot d_k)
RNN/LSTM/GRU      \mathcal{O}(L \cdot d \cdot d_v)
Convolution       \mathcal{O}(L \cdot \text{kernel\_size} \cdot d \cdot d_v)
Dense Layer       \mathcal{O}(L^2 \cdot d \cdot d_v)

Note that we typically choose the key dimension, d_k, to be much smaller than the input dimension, d (e.g., d_k = d/8). This reduces the computational cost, but it remains quadratic in the sequence length, L. The idea is that each attention head only needs to look at one aspect of the relationship between words, for instance, the subject-verb relationship.

Like Dense Layers and Convolutions, Attention can be easily parallelised. We could also restrict the attention mechanism to a local neighbourhood to reduce the complexity from \mathcal{O}(L^2) to \mathcal{O}(L \cdot w), where w is the window size.

More interesting than the computational complexity, however, is the number of trainable parameters. In Self-Attention, this number does not depend on the sequence length, a significant advantage over Dense Layers (RNNs share this property, but, as discussed above, they cannot be parallelised).

Method            Number of Trainable Parameters
----------------  ---------------------------------------------------
Self-Attention    \mathcal{O}(d \cdot d_q + d \cdot d_k + d \cdot d_v)
RNN/LSTM/GRU      \mathcal{O}(d \cdot d_v + d_v^2)
Convolution       \mathcal{O}(\text{kernel\_size} \cdot d \cdot d_v)
Dense Layer       \mathcal{O}(L \cdot d \cdot d_v)
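
This can be checked directly in Keras: the sketch below builds a MultiHeadAttention layer (introduced in Section 10.2.7 below), calls it on hypothetical input sequences of two very different lengths, and verifies that the parameter count is unchanged:

import tensorflow as tf

layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)
_ = layer(query=tf.random.normal((1, 10, 16)), value=tf.random.normal((1, 10, 16)))
n_short = layer.count_params()
_ = layer(query=tf.random.normal((1, 500, 16)), value=tf.random.normal((1, 500, 16)))
print(n_short, layer.count_params())    # identical: the parameter count does not depend on L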

10.2.6 A Perfect Tool for Multi-Modal Processing

Attention is a versatile tool that allows for great flexibility in the design of the input tensors {\bf Q}, {\bf K}, and {\bf V}. For instance, if we have one tensor derived from text and another from audio, we can fuse them using Cross-Attention:

\begin{equation} {\bf V}_{\text{audio}/\text{text}} = \text{Attention}({\bf Q}_{\text{audio}}, {\bf K}_{\text{text}}, {\bf V}_{\text{text}}) \end{equation}

The sources do not need to be perfectly synchronised. That is, the text key and value vectors do not need to align perfectly with the audio query vectors (see exercise below). In fact, the sources do not even need to be of the same length (L_q \neq L_k). For these reasons, Attention is very well suited for combining multi-modal inputs.
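
As an illustration, below is a minimal Keras sketch of such a fusion, using hypothetical random tensors for the audio and text features; note that the two sequences have different lengths:

import tensorflow as tf

# hypothetical features: a batch of one recording with 50 audio frames and 12 text tokens
audio = tf.random.normal((1, 50, 64))    # queries come from the audio stream
text  = tf.random.normal((1, 12, 64))    # keys and values come from the text

fused = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)(
            query=audio, key=text, value=text)
print(fused.shape)    # (1, 50, 64): one fused vector per audio query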

Exercise

Show that the output of the Attention layer is the same if the entries of the keys and values tensors are permuted in the same way, e.g.:

\begin{equation} \begin{split} \text{Attention}( [{\bf q}_1,\dots,{{\bf q}_{L_q}} ], [{\bf k}_1,\dots,{{\bf k}_{L_k}} ], [{\bf v}_1,\dots,{{\bf v}_{L_k}} ]) = \\ \text{Attention}( [{\bf q}_1,\dots,{{\bf q}_{L_q}} ], [{\bf k}_{L_k}, {\bf k}_{{L_k}-1}, \dots,{{\bf k}_1} ], [{\bf v}_{L_k}, {\bf v}_{{L_k}-1}, \dots,{{\bf v}_1} ]) \end{split} \end{equation}
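
As a hint, the property is easy to check numerically (this is, of course, not a proof). The snippet below reuses the attention function and the Q, K, V arrays from the earlier NumPy sketch:

# quick numerical check, reusing attention(), Q, K and V from the earlier sketch
perm = np.random.permutation(K.shape[0])              # a random permutation of the L_k keys/values
print(np.allclose(attention(Q, K, V), attention(Q, K[perm], V[perm])))   # True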

10.2.7 The Multi-Head Attention Layer

You can think of an Attention layer as a replacement for a convolution layer. Just as you can chain multiple convolutional layers, you can also chain multiple Attention layers.

In Transformers, a set of \left({\bf W}_{Q}, {\bf W}_{K}, {\bf W}_{V}\right) matrices is called an attention head. A Multi-Head Attention layer is simply a layer that contains multiple attention heads. The outputs of these heads are concatenated and then linearly transformed back to the expected dimension.

The number of heads is a hyperparameter, analogous to the number of filters in a convolutional layer. Each head can learn to focus on different types of relationships between words. For example, one head might learn to capture syntactic dependencies, while another might focus on semantic similarity.

Below is an example in Keras of a self-attention layer with two heads:

x  = tf.keras.layers.MultiHeadAttention(
         num_heads=2, key_dim=2, value_dim=3)(
         query=x, key=x, value=x)

Here, we would define two sets of \left({\bf W}_{Q}, {\bf W}_{K}, {\bf W}_{V}\right) matrices, with d_k=2 and d_v=3 for each head.
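
For instance, applied to a hypothetical batch of 8 sequences of 10 tokens with 16-dimensional embeddings, the layer returns a tensor of the same shape: the per-head outputs (of dimension d_v = 3 each) are concatenated and projected back to the input dimension by a final linear transformation:

import tensorflow as tf

x = tf.random.normal((8, 10, 16))    # hypothetical batch: 8 sequences, 10 tokens, 16-d embeddings
y = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=2, value_dim=3)(
        query=x, key=x, value=x)
print(y.shape)    # (8, 10, 16): head outputs are concatenated and projected back to 16 dimensions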

10.2.8 Takeaways (Attention Mechanism)

  • RNNs do not parallelise well, and Convolutions assume fixed positional relationships, which is not ideal for text.

  • The Attention Mechanism resolves these issues by defining a formula to dynamically compute the weights between any two positions, i and j, based on the alignment (dot-product) between a query vector for i and a key vector for j.

  • With Self-Attention, linear transformation matrices are used to produce the query, key, and value vectors from a single input tensor.

  • The computational complexity of Attention is quadratic in the input sequence length (as with Dense Layers). The Attention mechanism itself has no trainable parameters, but Self-Attention requires learning the projection matrices {\bf W}_Q, {\bf W}_K, and {\bf W}_V.

  • Self-Attention and Cross-Attention are well suited for text processing, as the semantics of the words can take precedence over their absolute or relative positions.

  • Cross-Attention is a powerful tool for working with multiple modalities (e.g., audio, video, images, text), as it is agnostic to the positions of the keys and values and can thus handle potential synchronisation issues.

10.3 Transformers

In 2017, Vaswani et al. proposed the Transformer, a simple yet powerful network architecture based solely on attention layers. This architecture has fundamentally impacted not only text processing but the entire field of deep learning.

Attention Is All You Need

A. Vaswani et al. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.


The original publication has generated over 57,000 citations as of 2022 (for reference, a paper is considered highly successful if it has over 100 citations).

10.3.1 An Encoder-Decoder Architecture

The Transformer architecture, as described in the original paper, is an encoder-decoder model, as shown in Figure 10.5.

Figure 10.5: The Transformer architecture, as described in the original paper (with the encoder and decoder parts highlighted in magenta).

The first part of the network (highlighted in magenta) is an encoder, which is a sub-network that transforms the input sequence into a meaningful, compact tensor representation. Think of it as being analogous to the VGG network, which transforms an image into a compact 4096 \times 1 feature vector. As with VGG, the idea is that this pre-trained encoder can be reused for other tasks through transfer learning.

The Encoder itself is made of a stack of identical blocks. At the core of each of these blocks is a Multi-Head Attention layer, followed by a simple feed-forward network.

Below is an example of what an implementation of that encoder could look like.

import tensorflow as tf
from tensorflow.keras.layers import (MultiHeadAttention, Dropout,
                                     LayerNormalization, Conv1D)

def encoder_block(inputs, ff_dim=128, dropout=0.1):
    # Multi-head self-attention
    x = MultiHeadAttention(num_heads=2, key_dim=2)(
          query=inputs, key=inputs, value=inputs)
    x = Dropout(dropout)(x)
    # Residual connection and layer normalisation
    attn = LayerNormalization()(inputs + x)

    # Feed-forward layer (a simple 1x1 convolution)
    x = Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(attn)
    x = Dropout(dropout)(x)
    x = Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    # Residual connection and layer normalisation
    return LayerNormalization()(attn + x)

def encoder(x, n_blocks):
    for _ in range(n_blocks):
        x = encoder_block(x)
    return x
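
Below is a minimal usage sketch (assuming the imports above), applied to a hypothetical batch of variable-length sequences of 64-dimensional feature vectors:

inputs = tf.keras.Input(shape=(None, 64))    # variable-length sequence of 64-d feature vectors
outputs = encoder(inputs, n_blocks=2)
model = tf.keras.Model(inputs, outputs)
model.summary()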

The Decoder, also highlighted in magenta, is also made of a stack of blocks containing Multi-Head Attention layers. Its job is to take the encoder’s output and generate the target sequence.

10.3.2 Positional Encoding

Note the presence of a Positional Encoding layer at the input. As the Attention mechanism itself is permutation-invariant, it does not have any notion of word order. To remedy this, the Transformer architecture proposes to encode the position of each word as a vector and add it to the input embedding.

The positional encoding is a function \phi: i \mapsto \phi(i) that maps a position i to a vector. In the original Transformer, this vector is simply added to the word embedding, {\bf x}'_i = {\bf x}_i + \phi(i) (an alternative is to concatenate it: {\bf x}'_i = [{\bf x}_i; \phi(i)]).

Why do we need a special encoding for this? Why not simply use the index of the word as a feature, i.e., \phi(i) = i?

This is because the similarity measure needs to make sense. If we simply used the scalar index with a dot product, the similarity between positions i and j would be i \times j, which grows with the absolute positions rather than reflecting how close they are. Instead, we would like \phi(i)^\top \phi(j) to be large when i \approx j and small otherwise.

A function that has this property is the Gaussian kernel: \begin{equation} \phi(i)^\top \phi(j) \approx \exp( - \lambda (i-j)^2). \end{equation}

Such an embedding exists: it is the (infinite) Fourier series basis (the same idea as the RBF kernel used in SVMs). As we cannot afford the luxury of an infinite embedding, we need to truncate the series. This is what was proposed in the original Transformer paper. For a positional encoding of dimension d_{pos}, they propose:

\begin{equation} i \mapsto \phi(i) = \begin{bmatrix} \sin(\omega_1 i) \\ \cos(\omega_1 i) \\ \sin(\omega_2 i) \\ \cos(\omega_2 i) \\ \vdots \\ \sin(\omega_{d_{pos}/2} i) \\ \cos(\omega_{d_{pos}/2} i) \end{bmatrix} \quad \text{where } \omega_k = 1/10000^{2k/d_{pos}} \end{equation}
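
Below is a NumPy sketch of this encoding, following the formula above (exact indexing conventions vary slightly between implementations):

import numpy as np

def positional_encoding(length, d_pos):
    k = np.arange(1, d_pos // 2 + 1)              # frequency index k = 1, ..., d_pos/2
    omega = 1.0 / 10000 ** (2 * k / d_pos)        # omega_k = 1 / 10000^(2k/d_pos)
    i = np.arange(length)[:, None]                # positions 0, ..., length-1 as a column
    pe = np.zeros((length, d_pos))
    pe[:, 0::2] = np.sin(i * omega)               # even columns: sin(omega_k * i)
    pe[:, 1::2] = np.cos(i * omega)               # odd columns:  cos(omega_k * i)
    return pe

pe = positional_encoding(length=60, d_pos=8)
# nearby positions have a larger dot product than distant ones
print(pe[5] @ pe[6], pe[5] @ pe[40])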

The advantage of using a positional encoding, as opposed to hard-coding positional relationships as in a CNN, is that the position is treated as just another piece of information. It can be transformed, combined with other features, or even ignored by the network. It is up to the training process to learn how to best use this information.

10.3.3 Takeaways (Transformers)

There is obviously a lot more to know about Transformers, but we have covered the main ideas here.

  • The Transformer model is an encoder-decoder architecture based on blocks of Attention layers.

  • The positional information, which is lost in the attention mechanism, is re-introduced by adding a positional encoding to the input vectors.

  • Transformers benefit from the efficiency of the Attention Mechanism, requiring fewer parameters than RNNs for similar performance, and can be easily parallelised.

  • Transformers are the backbone of modern NLP networks such as ChatGPT. They are also the backbone of many models that handle multiple modalities (e.g., text, images, speech).