Deep Residual Output Layers for Neural Language Generation


paper link
code link



Existing approaches to language generation mostly predict the next word with a log-linear classifier (softmax). The label weights (i.e., the rows of the softmax matrix $\mathbf{W}$) can be viewed as word vectors: the input encoder maps the context into a vector in the same embedding space, an inner product measures the similarity between the input vector and each label vector in this joint input-label space, and a softmax function turns the scores into probabilities. Although word embeddings can be used directly as the label vectors _(Tying word vectors and word classifiers: A loss framework for language modeling; Using the output embedding to improve language models)_, no parameters are shared across different words, which limits the model's ability to transfer. More recently, _Improving tied architectures for language modelling_ shares parameters across outputs through a bilinear mapping, while _Beyond weight tying: Learning joint input-output embeddings for neural machine translation_ uses a dual nonlinear mapping to strengthen the classifier.

This paper proposes a method for learning output label representations in the joint input-label space: a deep residual nonlinear mapping from word embeddings to the joint input-output space, which captures the structure of the output space effectively while avoiding overfitting. The structure of the input encoder and the inner-product softmax operation are left unchanged.


Neural Language Generation

$$p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{W}^{T} \mathbf{h}_{t}+\mathbf{b}\right)$$
where $\mathbf{W} \in \mathbb{R}^{d_h \times |\mathcal{V}|}$, and the class parameters $\mathbf{W}_{i}^{T}$ of the $i$-th label are independent of the class parameters $\mathbf{W}_{j}^{T}$ of the $j$-th label.
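As a quick illustration (a minimal NumPy sketch with toy dimensions, not the paper's code), the standard output layer scores every label by an inner product with the context vector and normalizes with a softmax:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary axis.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_h, vocab_size = 8, 50                  # toy hidden size and vocabulary size
W = rng.normal(size=(d_h, vocab_size))   # one independent column of parameters per label
b = np.zeros(vocab_size)
h_t = rng.normal(size=d_h)               # context vector produced by the input encoder

p = softmax(W.T @ h_t + b)               # p(y_t | y_1^{t-1}) over the vocabulary
```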

Weight Tying

$$p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{E} \mathbf{h}_{t}+\mathbf{b}\right)$$
where $\mathbf{E}\in \mathbb{R}^{|\mathcal{V}|\times d}$ is the word embedding matrix. This approach learns the structure of the output space implicitly.
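A sketch of weight tying under the same toy setup (assuming $d$ equals the encoder's hidden size so that $\mathbf{E}\mathbf{h}_t$ is defined; all names are illustrative): the same matrix serves both as the embedding lookup table and as the output projection:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab_size = 8, 50
E = rng.normal(size=(vocab_size, d))   # word embedding matrix, reused as output weights
b = np.zeros(vocab_size)

x_t = E[3]                  # input side: embedding lookup for word id 3
h_t = np.tanh(x_t)          # stand-in for the encoder's context representation
p = softmax(E @ h_t + b)    # output side: the logits reuse the same matrix E
```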

Bilinear Mapping

_Improving tied architectures for language modelling_ inserts a learned bilinear map $\mathbf{W}_{1}$ between the tied embeddings and the context, so its parameters are shared across all outputs:
$$p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{E} \mathbf{W}_{1} \mathbf{h}_{t}+\mathbf{b}\right)$$
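A corresponding sketch of the bilinear variant (toy NumPy again; $\mathbf{W}_1$ as a learned $d \times d_h$ map is an illustrative choice of shapes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_h, vocab_size = 6, 8, 50
E = rng.normal(size=(vocab_size, d))   # tied word embeddings
W1 = rng.normal(size=(d, d_h))         # bilinear map shared by all labels
b = np.zeros(vocab_size)
h_t = rng.normal(size=d_h)

p = softmax(E @ W1 @ h_t + b)          # every label's score goes through the shared W1
```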

Dual Nonlinear Mapping

_Beyond weight tying: Learning joint input-output embeddings for neural machine translation_ learns the structure of the outputs and of the context with two separate nonlinear mappings:
$$p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\sigma\left(\mathbf{E} \mathbf{U}+\mathbf{b}_{u}\right) \sigma\left(\mathbf{V} \mathbf{h}_{t}+\mathbf{b}_{v}\right)+\mathbf{b}\right)$$
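A sketch of such a dual nonlinear mapping (sigmoid chosen as the nonlinearity for illustration; the joint-space width `d_j` and all names are assumptions of this sketch):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, d_j, d_h, vocab_size = 6, 12, 8, 50   # d_j: joint input-output space width
E = rng.normal(size=(vocab_size, d))
U = rng.normal(size=(d, d_j))            # nonlinear map for the labels
V = rng.normal(size=(d_j, d_h))          # nonlinear map for the context
b = np.zeros(vocab_size)
h_t = rng.normal(size=d_h)

E_out = sigmoid(E @ U)                   # encodes output structure
h_out = sigmoid(V @ h_t)                 # encodes context structure
p = softmax(E_out @ h_out + b)           # similarity in the joint space
```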

Deep Residual Output Layers

Figure 1. General overview of the proposed architecture.

The proposed Deep Residual Output Layers are based on:
$$p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(g_{out}(\mathbf{E})\, g_{in}\left(\mathbf{h}_{t}\right)+\mathbf{b}\right)$$
Here $g_{in}(\cdot)$ takes the context representation $\mathbf{h}_{t}$ as input (in this paper the authors set $g_{in}(\cdot)=\mathcal{I}$, the identity), while $g_{out}(\cdot)$ takes all the label descriptions as input and encodes them into the label embedding $\mathbf{E}^{(k)}$, where $k$ is the number of layers.

Label Encoder Network: For natural language generation, the output labels are the words of the vocabulary, so this paper uses the word vectors directly as the input representations of the labels.

In general, there may be additional information about each label, such as dictionary entries, cross-lingual resources, or contextual information, in which case we can add an initial encoder for these descriptions which outputs a label embedding matrix.

To encode the structure of the output space, $g_{out}(\cdot)$ is defined as a $k$-layer network that takes the label embedding $\mathbf{E}$ (i.e., the word vectors) as input:
$$\mathbf{E}^{(k)}=f_{out}^{(k)}\left(\mathbf{E}^{(k-1)}\right)$$
$$f_{out}^{(i)}\left(\mathbf{E}^{(i-1)}\right)=\sigma\left(\mathbf{E}^{(i-1)} \mathbf{U}^{(i)}+\mathbf{b}_{u}^{(i)}\right)$$
where $\sigma$ is a nonlinear activation function.

Residual (shortcut) connections are added from both the previous layer's output and the original embedding matrix:
$$\mathbf{E}^{(k)}=f_{out}^{(k)}\left(\mathbf{E}^{(k-1)}\right)+\mathbf{E}^{(k-1)}+\mathbf{E}$$
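The $k$-layer residual label network above can be sketched as follows (minimal NumPy; ReLU as the activation, square layers, and small random weights are assumptions of this sketch, and $g_{in}$ is the identity as in the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab_size, k = 8, 50, 3
E0 = rng.normal(size=(vocab_size, d))                  # input label embeddings E
Us = [rng.normal(size=(d, d)) * 0.1 for _ in range(k)]
bs = [np.zeros(d) for _ in range(k)]

E_k = E0
for U, b_u in zip(Us, bs):
    # f_out^{(i)} plus residual connections to the previous layer and to E itself
    E_k = np.maximum(E_k @ U + b_u, 0.0) + E_k + E0

h_t = rng.normal(size=d)          # g_in(h_t) = h_t (identity)
p = softmax(E_k @ h_t)            # g_out(E) g_in(h_t)
```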
Figure 2. The proposed deep residual label network architecture for neural language generation. Straight lines represent the input to a function and curved lines represent shortcut or residual connections implying addition operations.

$$f_{out}^{\prime(i)}\left(\mathbf{E}^{(i-1)}\right)=\delta\left(f_{out}^{(i)}\left(\mathbf{E}^{(i-1)}\right)\right) \odot f_{out}^{(i)}\left(\mathbf{E}^{(i-1)}\right)$$
where $\delta$ is a gating function and $\odot$ denotes elementwise multiplication.


Language Modeling

More specifically, because low frequency words lack data to individually learn the complex structure of the output space, transfer of learned information from other words is crucial to improving performance, whereas this is not the case for higher frequency words. This analysis suggests that our model could also be useful for zero-resource scenarios, where labels need to be predicted without any training data, similarly to other joint input-output space models.

Neural Machine Translation


  • Inan, H., Khandelwal, K., and Socher, R. Tying word vectors and word classifiers: A loss framework for language modeling. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
  • Press, O. and Wolf, L. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 157–163, Valencia, Spain, April 2017. Association for Computational Linguistics.
  • Gulordava, K., Aina, L., and Boleda, G. How to represent a word and predict it, too: Improving tied architectures for language modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2936–2941, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
  • Pappas, N., Miculicich, L., and Henderson, J. Beyond weight tying: Learning joint input-output embeddings for neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 73–83. Association for Computational Linguistics, 2018.