# Deep Residual Output Layers for Neural Language Generation

## Background

### Neural Language Generation

The output at time step $t$ is computed as:
$$p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{W}^{T} \mathbf{h}_{t}+\mathbf{b}\right)$$
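
As a minimal NumPy sketch of this output layer, the snippet below normalizes $\mathbf{W}^{T}\mathbf{h}_{t}+\mathbf{b}$ with a softmax. The shapes (`W` as a `(d, |V|)` matrix, `h_t` as a `d`-vector), the toy sizes, and the function names are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def output_distribution(W, h_t, b):
    """p(y_t | y_1^{t-1}) for W of shape (d, |V|), h_t of shape (d,), b of shape (|V|,)."""
    return softmax(W.T @ h_t + b)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, vocab_size = 4, 6
W = rng.normal(size=(d, vocab_size))
b = np.zeros(vocab_size)
h_t = rng.normal(size=d)

p = output_distribution(W, h_t, b)  # one probability per vocabulary word
```

The $\propto$ in the formula corresponds to the normalization inside `softmax`, which divides by the sum over the whole vocabulary.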

### Weight Tying

$$p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{E} \mathbf{h}_{t}+\mathbf{b}\right)$$
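
A small sketch of what tying looks like in code, assuming $\mathbf{E}$ is the `(|V|, d)` input embedding matrix with one row per word: the same array serves both the input lookup and the output scoring. The helper names `embed` and `tied_output` are hypothetical, and using an embedding as a stand-in for the hidden state is only for the demonstration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, d = 6, 4
E = rng.normal(size=(vocab_size, d))  # single shared matrix, one row per word
b = np.zeros(vocab_size)

def embed(token_id):
    """Input side: look up the row of E for a token."""
    return E[token_id]

def tied_output(h_t):
    """Output side: score h_t against every row of the same E."""
    return softmax(E @ h_t + b)

h_t = embed(2)        # stand-in for a real hidden state (assumption for the demo)
p = tied_output(h_t)  # distribution over the vocabulary
```

Because no separate output matrix exists, the output layer adds only the bias `b` to the parameter count.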

### Bilinear Mapping

$$p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(\mathbf{E}_{1} \mathbf{W}_{1} \mathbf{h}_{t}+\mathbf{b}\right)$$
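
The bilinear form can be sketched as follows, assuming $\mathbf{E}_{1}$ has shape `(|V|, d_e)` and $\mathbf{W}_{1}$ has shape `(d_e, d_h)`; these shapes and all names are illustrative assumptions. The last lines check that with a square identity $\mathbf{W}_{1}$ the formula reduces to the weight-tied form $\mathbf{E}\mathbf{h}_{t}+\mathbf{b}$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def bilinear_output(E1, W1, h_t, b):
    """softmax(E1 W1 h_t + b): W1 maps the context into the embedding space first."""
    projected = W1 @ h_t  # shape (d_e,)
    return softmax(E1 @ projected + b)

rng = np.random.default_rng(0)
vocab_size, d_e, d_h = 6, 4, 3
E1 = rng.normal(size=(vocab_size, d_e))
W1 = rng.normal(size=(d_e, d_h))
b = np.zeros(vocab_size)
h_t = rng.normal(size=d_h)

p = bilinear_output(E1, W1, h_t, b)

# Special case: an identity W1 (which requires d_h == d_e) gives the tied form.
h_sq = rng.normal(size=d_e)
p_identity = bilinear_output(E1, np.eye(d_e), h_sq, b)
p_tied = softmax(E1 @ h_sq + b)
```

One consequence of the extra matrix is that the hidden size `d_h` no longer has to equal the embedding size `d_e`.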

### Dual Nonlinear Mapping

_Beyond weight tying: Learning joint input-output embeddings for neural machine translation_ proposes learning the structure of the output space and of the context separately, through two nonlinear mappings.

## Deep Residual Output Layers

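$$p\left(\mathbf{y}_{t} | \mathbf{y}_{1}^{t-1}\right) \propto \exp \left(g_{out}(\mathbf{E})\, g_{in}\left(\mathbf{h}_{t}\right)+\mathbf{b}\right)$$
$g_{in}(\cdot)$ takes the context representation $\mathbf{h}_{t}$ as input (in this paper the authors set $g_{in}(\cdot)=\mathcal{I}$, the identity), while $g_{out}(\cdot)$ takes the descriptions of all labels as input and encodes them into the label embeddings $\mathbf{E}^{(k)}$, where $k$ is the number of layers.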

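Label Encoder Network: for natural language generation, the output labels are the words in the vocabulary, so this paper uses the word embeddings directly as the input representation of the labels.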

In general, there may be additional information about each label, such as dictionary entries, cross-lingual resources, or contextual information, in which case we can add an initial encoder for these descriptions which outputs a label embedding matrix.

$$\mathbf{E}^{(k)}=f_{out}^{(k)}\left(\mathbf{E}^{(k-1)}\right)$$

$$f_{out}^{(i)}\left(\mathbf{E}^{(i-1)}\right)=\sigma\left(\mathbf{E}^{(i-1)} \mathbf{U}^{(i)}+\mathbf{b}_{u}^{(i)}\right)$$
$\sigma$ is a nonlinear activation function.

$$\mathbf{E}^{(k)}=f_{out}^{(k)}\left(\mathbf{E}^{(k-1)}\right)+\mathbf{E}^{(k-1)}+\mathbf{E}$$

$$f_{out}^{\prime(i)}\left(\mathbf{E}^{(i-1)}\right)=\delta\left(f_{out}^{(i)}\left(\mathbf{E}^{(i-1)}\right)\right) \odot f_{out}^{(i)}\left(\mathbf{E}^{(i-1)}\right)$$
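
The layer, residual, and masking equations above can be put together in a short NumPy sketch. Several choices here are assumptions rather than details from these notes: `tanh` stands in for $\sigma$, $\delta$ is treated as a sampled Bernoulli dropout mask, and the function names, toy sizes, and `keep_prob` parameter are hypothetical. It follows the equations literally and is not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def label_encoder(E, Us, bus, keep_prob=1.0, rng=None):
    """Apply k residual layers to label embeddings E of shape (|V|, d)."""
    E_prev = E  # E^{(0)} = E
    for U, b_u in zip(Us, bus):
        f = np.tanh(E_prev @ U + b_u)  # f_out^{(i)}; tanh is an assumed sigma
        if rng is not None and keep_prob < 1.0:
            # Assumed reading of delta: a Bernoulli mask applied element-wise.
            mask = rng.random(f.shape) < keep_prob
            f = mask * f
        # Residual connections to the previous layer and to the input E.
        E_prev = f + E_prev + E
    return E_prev

def drill_distribution(E, Us, bus, h_t, b):
    """softmax(g_out(E) g_in(h_t) + b) with g_in set to the identity."""
    E_k = label_encoder(E, Us, bus)
    return softmax(E_k @ h_t + b)

rng = np.random.default_rng(0)
vocab_size, d, k = 6, 4, 2
E = rng.normal(size=(vocab_size, d))
Us = [0.1 * rng.normal(size=(d, d)) for _ in range(k)]
bus = [np.zeros(d) for _ in range(k)]
h_t = rng.normal(size=d)
b = np.zeros(vocab_size)

p = drill_distribution(E, Us, bus, h_t, b)

# Sanity checks on the encoder itself.
E_no_layers = label_encoder(E, [], [])                        # k = 0 leaves E unchanged
E_zero_layer = label_encoder(E, [np.zeros((d, d))], [np.zeros(d)])  # f = tanh(0) = 0
```

With no layers the encoder returns $\mathbf{E}$ and the model falls back to the weight-tied form. With a single all-zero layer the residual sum gives $\mathbf{E}+\mathbf{E}$, which shows how the skip connections pass the original embeddings through regardless of what the layer learns.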

## Experiments

### Language Modeling

More specifically, because low frequency words lack data to individually learn the complex structure of the output space, transfer of learned information from other words is crucial to improving performance, whereas this is not the case for higher frequency words. This analysis suggests that our model could also be useful for zero-resource scenarios, where labels need to be predicted without any training data, similarly to other joint input-output space models.

### Neural Machine Translation

