### 1 Introduction

Hello, friends. Welcome to the Yuelai Inn. The paper we are looking at today was published by Google in 2017 and is called "Attention Is All You Need" [1]. Of course, there is already plenty of analysis of this paper on the Internet, but good things are worth the wait, and here I will simply share my own understanding and applications of it. I will cover the paper in a series of seven articles: ① the idea and principle of the multi-head attention mechanism in the Transformer; ② the Transformer's positional encoding and decoding process; ③ the Transformer's network structure and the implementation of the self-attention mechanism; ④ the implementation of the Transformer; ⑤ a translation model based on the Transformer; ⑥ a text classification model based on the Transformer; ⑦ a couplet generation model based on the Transformer.

I hope that through this series of 7 articles, we can all gain a clearer understanding of the Transformer. Now, let's formally begin the interpretation of the paper. You can get a download link for the paper by replying to the official account.

### 2 Motivation

#### 2.1 Problems faced

Following our usual order for interpreting papers, let's first ask: why did the authors propose the Transformer model at the time? What problems needed solving? What were the shortcomings of the existing models?

In the abstract, the authors note that the dominant sequence models at the time were encoder-decoder models based on complex recurrent or convolutional neural networks, and that even the best-performing sequence models connected the encoder and decoder through an attention mechanism. Why do the authors keep mentioning these traditional encoder-decoder models? In the introduction, they explain that in the modeling process of the traditional encoder-decoder architecture, the computation at each time step depends on the output of the previous time step, and this inherent property prevents the traditional encoder-decoder model from computing in parallel, as shown in Figure 1.

This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

<center>

Figure 1. Encoding process of a recurrent neural network

</center>

The authors then note that although recent work has greatly improved the computational efficiency of traditional recurrent neural networks, the fundamental problem remains unsolved.

Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.

#### 2.2 Solutions

To solve this problem, the authors propose a new architecture in this paper: the Transformer. Its advantage is that it completely discards the traditional recurrent structure and instead computes hidden representations of the model's input and output using only an attention mechanism, the now-famous self-attention mechanism.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

On the whole, **the so-called self-attention mechanism directly computes an attention weight for each position during encoding through a few operations, and then produces a hidden vector representation of the whole sentence as a weighted sum.** The Transformer architecture is, in the end, an encoder-decoder model built on this self-attention mechanism.

### 3 Technical approach

Having introduced the background of the paper, let's first look at the self-attention mechanism itself, and then explore the overall network architecture.

#### 3.1 Self-Attention

First, we should understand that the so-called self-attention mechanism is what the paper calls "scaled dot-product attention." The authors describe the attention mechanism as a process that maps a query and a set of key-value pairs to an output, where the output vector is a weighted sum of the values, with the weights computed from the query and the keys.

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

However, to further understand the meaning of query, key, and value, we need to look at the Transformer's decoding process, which will be introduced later. For now, the structure of the self-attention mechanism is shown in Figure 2.

<center>

Figure 2. Structure of the self-attention mechanism

</center>

As can be seen from Figure 2, the core of the self-attention mechanism is to compute attention weights from Q and K, and then apply them to V to obtain the weighted output. Specifically, for inputs Q, K, and V, the output vector is computed as:

$$

\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V\;\;\;\;\;(1)

$$

Here Q, K, and V are three matrices whose (second) dimensions are $d_q$, $d_k$, and $d_v$ respectively (from the computation below you can see that in fact $d_q=d_k$). The division by $\sqrt{d_k}$ in formula $(1)$ is the Scale step shown in Figure 2.

The reason for this scaling step is that, through experiments, the authors found that for large values of $d_k$, the dot products $QK^T$ grow large in magnitude, which pushes the softmax into regions with extremely small gradients and hinders network training.

We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.
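To make formula $(1)$ concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function name, toy shapes, and random inputs are our own for illustration, not values from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Formula (1): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # raw similarities, shape (n_q, n_k)
    # numerically stable row-wise softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                  # weighted sum of the values

# toy example: 3 positions, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (3, 4): one output vector per query position
print(w.sum(axis=-1))   # each row of the weight matrix sums to 1
```

Note that without the $\sqrt{d_k}$ scaling, larger `d_k` would produce larger raw scores and a nearly one-hot softmax, which is exactly the small-gradient problem the authors describe.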

If you only look at the structure in Figure 2 and the computation in formula $(1)$, the meaning of the self-attention mechanism is obviously not so easy to grasp. For example, one of the most confusing questions for beginners is: where do Q, K, and V in Figure 2 come from? Let's look at a concrete example. Suppose the input sequence is "who am I" and has already been represented, in some way, as a matrix of shape $3\times 4$; then Q, K, and V can be obtained through the process shown in Figure 3 [2].

<center>

Figure 3. Computation of Q, K, and V

</center>

As Figure 3 shows, Q, K, and V are in fact obtained by multiplying the input X by three different weight matrices (this applies only to the self-attention encoding within the encoder's and decoder's own input parts; in the interaction between encoder and decoder, Q, K, and V come from elsewhere). You can think of Q, K, and V as three different linear transformations of the same input, representing three different states of it. Once Q, K, and V are computed, the weight vectors can be obtained, as shown in Figure 4.
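The projection step in Figure 3 can be sketched as follows; the $3\times 4$ input and the three projection matrices are randomly initialized here purely for illustration, not the actual numbers in the figure:

```python
import numpy as np

rng = np.random.default_rng(42)

# input "who am I": 3 tokens, each already embedded as a 4-dim vector
X = rng.normal(size=(3, 4))

d_model, d_k = 4, 4
# three different (learned) linear transformations of the same input
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # each has shape (3, 4)
print(Q.shape, K.shape, V.shape)
```

In a real model, `W_q`, `W_k`, and `W_v` are trainable parameters; the point is simply that Q, K, and V are three views of the same X.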

<center>

Figure 4. Computation of the attention weights (after the Scale and Softmax operations)

</center>

As shown in Figure 4, after computing the attention weight matrix through the above process, we cannot help asking: what exactly do these weight values represent? Take the first row of the weight matrix: 0.7 is the attention of "I" to "I"; 0.2 is the attention of "I" to "am"; 0.1 is the attention of "I" to "who". In other words, when encoding "I" in the sequence, the model should put 0.7 of its attention on "I", 0.2 on "am", and 0.1 on "who".

Similarly, the third row of the weight matrix means that when encoding "who" in the sequence, the model should put 0.2 of its attention on "I", 0.1 on "am", and 0.7 on "who". From this process, we can see that through this weight matrix, the model knows exactly how to distribute its attention across the different positions when encoding the vector at each position.

However, from the above results we can also see that **when the model encodes information at the current position, it tends to pay excessive attention to its own position** (understandable as that is) and may neglect other positions [2]. One remedy the authors adopt is the multi-head attention mechanism, which we will look at later.

It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.

After the weight matrix is computed through the process shown in Figure 4, it can be applied to V to obtain the final encoded output; the computation is shown in Figure 5.

<center>

Figure 5. Weighted sum and encoded output

</center>

Following the process shown in Figure 5, we obtain the final encoded output vectors. Of course, we can also view this process from another angle, as shown in Figure 6.

<center>

Figure 6. Computation of the encoded output

</center>

As Figure 6 shows, the final encoded vector for "am" is in fact a weighted sum of the three original vectors for "I", "am", and "who", which reflects exactly how attention is distributed across positions when encoding "am".

The whole process can also be summarized as shown in Figure 7.

<center>

Figure 7. Computation process of the self-attention mechanism

</center>

We can see that the self-attention mechanism does indeed solve the problem the authors raise at the start of the paper: the drawback that traditional sequence models must encode sequentially. With self-attention, encoded vectors containing attention information from all the different positions can be obtained with just a few matrix transformations of the original input.

This concludes the introduction to the core part of the self-attention mechanism, but many details remain. For example: how are Q, K, and V obtained when the encoder and decoder interact? What is the mask operation marked in Figure 2, and when is it used? These topics will be covered one by one later. Next, let's continue on to the multi-head attention mechanism.

#### 3.2 MultiHeadAttention

With the above introduction, we now have a fairly clear picture of the self-attention mechanism, but we also mentioned its defect: **when the model encodes information at the current position, it focuses excessively on its own position.** The authors therefore propose the multi-head attention mechanism to solve this problem. At the same time, multi-head attention also lets the output of the attention layer contain encoded representations from different subspaces, which enhances the model's expressive power.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

Having explained why we need the multi-head attention mechanism and the benefits of using it, let's now look at what it actually is.

<center>

Figure 8. Structure of the multi-head attention mechanism

</center>

As shown in Figure 8, the so-called multi-head attention mechanism simply runs several groups of self-attention on the original input sequence, then concatenates the results of each group and applies a linear transformation to obtain the final output. Specifically, the computation is:

$$

\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O\\

\;\;\;\;\;\;\;\text{where}\;\;\text{head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)

$$

where

$$

W^Q_i\in\mathbb{R}^{d_{model}\times d_k},\;\;W^K_i\in\mathbb{R}^{d_{model}\times d_k},\;\;W^V_i\in\mathbb{R}^{d_{model}\times d_v},\;\;W^O\in\mathbb{R}^{hd_v\times d_{model}}

$$

In the paper, the authors use $h=8$ parallel self-attention modules (8 heads) to build an attention layer, and for each head they set $d_k=d_v=d_{model}/h=64$. **From this we can see that the multi-head attention used in the paper in fact splits one large, high-dimensional single head into $h$ smaller heads.** The whole computation of the multi-head attention mechanism can therefore be represented by the process shown in Figure 9.

<center>

Figure 9. Computation process of the multi-head attention mechanism

</center>

As shown in Figure 9, from the input sequence X and $W^Q_1,W^K_1,W^V_1$ we obtain $Q_1,K_1,V_1$, and then, by formula $(1)$, the output $Z_1$ of a single self-attention module; similarly, from X and $W^Q_2,W^K_2,W^V_2$ we obtain the output $Z_2$ of another self-attention module. Finally, $Z_1$ and $Z_2$ are concatenated horizontally to form $Z$, and $Z$ is multiplied by $W^O$ to obtain the output of the whole multi-head attention layer. At the same time, from the computation in Figure 8, we can also see that $d_q=d_k=d_v$.
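The split-attend-concatenate process described above can be sketched in NumPy as follows; the per-head weight lists and random initialization are our own illustrative choices (in a real model all the W matrices are learned):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # formula (1): softmax(QK^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # W_q / W_k / W_v are lists of h per-head projections, each (d_model, d_k)
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    Z = np.concatenate(heads, axis=-1)   # concatenate horizontally: (n, h*d_v)
    return Z @ W_o                       # final linear map back to d_model

rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2
d_k = d_model // h                       # d_k = d_v = d_model / h, as in the paper
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))

X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)   # (3, 8): same shape as the input
```

Note how the output has the same shape as the input X, which is what allows attention layers to be stacked inside the Transformer.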

This concludes the introduction to the multi-head attention mechanism, the core of the whole Transformer.

### 4 Summary

In this article, we first introduced the motivation of the paper, including the problems faced by traditional network architectures and the countermeasures the authors propose; then we introduced the self-attention mechanism and its underlying principle; finally, we introduced the multi-head attention mechanism and the benefits of using it. The key thing to understand in this part is the principle and computation of the self-attention mechanism. In the next article, we will introduce the Transformer's positional encoding and decoding process in detail.

That's all for this article. Thank you for reading! If you found it helpful, **feel free to share it with a friend**! If you have any questions or suggestions, add the author's WeChat 'nulls8' or join the group to discuss. The green mountains do not change and the green water keeps flowing; see you next time at the inn!

### References

[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

[2] The Illustrated Transformer http://jalammar.github.io/ill…

[3] LANGUAGE TRANSLATION WITH TRANSFORMER https://pytorch.org/tutorials…

[4] The Annotated Transformer http://nlp.seas.harvard.edu/2…

[5] SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT https://pytorch.org/tutorials…