Attention Is All You Need

paper link


What is attention?

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.


Computing attention involves three main steps. First, compute the similarity between the query and each key to obtain a weight; common similarity functions include the dot product, concatenation, and a small perceptron. Second, normalize these weights, typically with a softmax function. Finally, take the weighted sum of the weights and the corresponding values to obtain the final attention output.
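The three steps above can be sketched in NumPy for a single query (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def attention(query, keys, values):
    """Attention for one query over n key-value pairs.

    query: (d_k,)    keys: (n, d_k)    values: (n, d_v)
    Returns a (d_v,) weighted sum of the values.
    """
    # Step 1: dot-product similarity between the query and each key.
    scores = keys @ query                      # (n,)
    # Step 2: softmax normalization of the scores.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # (n,), sums to 1
    # Step 3: weighted sum of the values.
    return weights @ values                    # (d_v,)
```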

Transformer Model Architecture


Scaled Dot-Product Attention

The queries and keys are $d_{k}$-dimensional and the values are $d_{v}$-dimensional. The procedure is: first compute the dot products of the query with all keys, then divide each by $\sqrt{d_{k}}$, apply a softmax to obtain the weights on the values, and finally take the weighted sum to produce the output. Adding a Mask unit to Scaled Dot-Product Attention (used only in the decoder's Masked Multi-Head Attention) handles illegal information flow during training.


In practice, Q stacks a set of query vectors: each row of $Q \in \mathbb{R}^{m \times d_{k}}$ corresponds to one query. Multiplying Q by $K^{T} \in \mathbb{R}^{d_{k} \times n}$ and scaling yields an $m \times n$ matrix in which each row holds the attention scores of one query vector; a row-wise softmax then normalizes it. Multiplying this attention matrix by $V \in \mathbb{R}^{n \times d_{v}}$ gives an $m \times d_{v}$ attention output, where each row corresponds to one query output.
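A minimal NumPy sketch of this matrix form, with an optional boolean mask argument (names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> output (m, d_v).

    mask, if given, is an (m, n) boolean array; True marks positions
    that must not be attended to (their scores are set to -inf).
    """
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)           # (m, n) attention scores
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax over the n keys.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (m, d_v)
```

With a causal (upper-triangular) mask, each query row can only attend to keys at its own position or earlier.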


Multi-Head Attention

The queries, keys, and values are first passed through a linear transformation and then fed into scaled dot-product attention. This is done h times, once per head (hence "multi-head"), and the linear-projection parameters W for Q, K, and V are different for each head. The h scaled dot-product attention results are then concatenated, and one more linear transformation produces the multi-head attention output.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
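The steps above can be sketched in NumPy, passing the per-head projection matrices in explicitly (the function signature is an assumption for illustration; real implementations usually fuse the per-head projections into single matrices):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h):
    """Wq, Wk, Wv: lists of h per-head projection matrices, each
    (d_model, d_k) or (d_model, d_v); Wo: (h * d_v, d_model)."""
    heads = []
    for i in range(h):
        # Per-head linear projections of Q, K, V.
        q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]
        d_k = q.shape[-1]
        # Scaled dot-product attention for this head.
        w = softmax((q @ k.T) / np.sqrt(d_k))
        heads.append(w @ v)
    # Concatenate the h heads, then apply the final linear map.
    return np.concatenate(heads, axis=-1) @ Wo
```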

Applications of Attention in Transformer Model

1) In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the output of the encoder.

2) The encoder also uses self-attention layers, in which the keys, values, and queries all come from the same place: the output of the previous encoder layer.

3) The decoder likewise uses self-attention on its own. Here the keys, values, and queries come from the decoder's outputs at the current time step and earlier. To prevent leftward information flow, a mask is added inside the scaled dot-product attention to block the illegal connections in the softmax (it only takes effect during training).

Position-wise Feed-Forward Networks

The position-wise feed-forward network is implemented with two linear transformations and a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0,\; xW_{1} + b_{1})\,W_{2} + b_{2}$$
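Assuming the paper's formulation $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, a minimal sketch (names are illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to each
    position (row of x) independently and identically."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

Because the same weights act on every position separately, the output at one position depends only on the input at that position.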

Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.
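The paper uses fixed sinusoidal encodings, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, which can be sketched as:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model):
    even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Because the encodings share $d_{model}$ with the embeddings, they are simply added to the embedding matrix elementwise.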