# Attention is all you need

## Background

### What is attention?

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
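This mapping can be sketched directly in NumPy. The sketch below is a minimal, illustrative implementation of scaled dot-product attention (the compatibility function used in the Transformer), assuming row-vector queries/keys/values stacked into matrices; the toy shapes are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of the values

# toy example: 2 queries, 3 key-value pairs, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one output vector per query
```

Each output row is a convex combination of the value rows, with weights given by the softmaxed query-key dot products.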

## Transformer Model Architecture

### Attention

#### Scaled Dot-Product and Multi-Head Attention

The queries, keys, and values first pass through a linear transformation and are then fed into scaled dot-product attention. Note that this is done h times — these are the so-called multiple heads, each pass computing one head — and each time the projection matrices W applied to Q, K, and V are different. The h scaled dot-product attention outputs are then concatenated, and one more linear transformation of that concatenation yields the multi-head attention result.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
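The steps above can be sketched as follows. This is a minimal NumPy illustration, not an optimized implementation; the parameter names `Wq`, `Wk`, `Wv`, `Wo` and the toy dimensions are my own choices for the example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, params, h):
    """Project Q/K/V h times with separate matrices, run scaled dot-product
    attention per head, concatenate, then apply one final projection."""
    heads = []
    for i in range(h):
        q = Q @ params["Wq"][i]                     # per-head linear projections
        k = K @ params["Wk"][i]
        v = V @ params["Wv"][i]
        d_k = q.shape[-1]
        attn = softmax(q @ k.T / np.sqrt(d_k))      # one scaled dot-product attention
        heads.append(attn @ v)
    concat = np.concatenate(heads, axis=-1)         # concatenate the h head outputs
    return concat @ params["Wo"]                    # final linear transformation

# toy setup: d_model = 8, h = 2 heads, per-head dimension d_k = d_v = 4
rng = np.random.default_rng(1)
d_model, h, d_k = 8, 2, 4
params = {
    "Wq": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "Wk": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "Wv": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "Wo": rng.normal(size=(h * d_k, d_model)),
}
X = rng.normal(size=(5, d_model))                   # 5 tokens of dimension d_model
out = multi_head_attention(X, X, X, params, h)      # self-attention: Q = K = V = X
print(out.shape)  # (5, 8)
```

Because each head has its own projections, the heads can attend to different representation subspaces before their outputs are merged.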

#### Applications of Attention in Transformer Model

1) In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the output of the encoder.

2) The encoder uses self-attention, where the keys, values, and queries all come from the output of the previous encoder layer.
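The two uses differ only in where Q, K, and V come from. A minimal sketch, reusing plain scaled dot-product attention and assuming made-up shapes (6 source tokens, 3 target tokens, dimension 4):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_k)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
enc_out = rng.normal(size=(6, 4))     # encoder output: 6 source tokens
dec_hidden = rng.normal(size=(3, 4))  # previous decoder layer: 3 target tokens

# 1) encoder-decoder attention: queries from the decoder, keys/values from the encoder
cross = attention(dec_hidden, enc_out, enc_out)

# 2) encoder self-attention: queries, keys, and values all from the same layer output
self_attn = attention(enc_out, enc_out, enc_out)

print(cross.shape, self_attn.shape)  # (3, 4) (6, 4)
```

In cross-attention, every target position can attend over all source positions, which is how the decoder reads the encoded input.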

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.
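The paper's choice for these encodings is sinusoids of geometrically increasing wavelengths: $PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$. A small NumPy sketch (the `seq_len` of 10 and zero placeholder embeddings are arbitrary illustration values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sines on even indices, cosines on odd."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
embeddings = np.zeros((10, 8))    # placeholder token embeddings
x = embeddings + pe               # same dimension d_model, so the two can be summed
print(pe.shape)  # (10, 8)
```

Because the encodings share the embedding dimension $d_{model}$, injecting position information is a simple elementwise addition at the bottom of each stack.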