Deep contextualized word representations

2018-10-07

Embedding, NLP

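The paper introduces ELMo (deep contextualized word representations), a model developed by AI2 that won the best paper award at NAACL 2018. In ELMo, embeddings are computed from the internal states of a two-layer bidirectional language model (biLM), which is where the name comes from: Embeddings from Language Models.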
paper link
code link

Introduction

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.

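Pre-trained word representations are a key component of many natural language understanding models. Learning high-quality representations is hard, however, because they must model both the characteristics of word use (e.g., syntax and semantics) and how these uses vary across contexts (e.g., polysemy). The deep contextualized word representation proposed in the paper addresses both challenges and substantially improves downstream NLP tasks.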

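Unlike previous approaches to learning contextualized word vectors, ELMo computes a linear combination of the outputs of all biLM layers rather than using only the top layer. Combining the states of different LSTM layers yields richer embeddings: the higher-level LSTM layers capture context-dependent aspects of word meaning (useful for word sense disambiguation), while the lower-level layers capture aspects of syntax (useful for part-of-speech tagging).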

Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.

ELMo: Embeddings from Language Models

Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section. They are computed on top of two-layer biLMs with character convolutions (Sec. 3.1), as a linear function of the internal network states (Sec. 3.2). This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale (Sec. 3.4) and easily incorporated into a wide range of existing neural NLP architectures (Sec. 3.3).

Bidirectional language models

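Forward language model: given a sequence of $n$ tokens $(t_{1}, t_{2}, \dots, t_{n})$, a forward language model computes the probability of the sequence by modeling the probability of each token given its history:
$$p(t_{1},t_{2},\dots,t_{n}) = \prod_{k=1}^{n}p(t_{k} \mid t_{1},t_{2},\dots,t_{k-1})$$
Recent state-of-the-art neural LMs ("Exploring the limits of language modeling", "On the state of the art of evaluation in neural language models", and "Regularizing and optimizing LSTM language models") first compute a context-independent token representation $x_{k}^{LM}$ (via token embeddings or a character-based CNN) and then pass it through $L$ layers of forward LSTMs. At each token position $k$, each LSTM layer outputs a context-dependent representation:
$$\overrightarrow{h}_{k,j}^{LM}, \quad \text{where } j = 1,\dots,L$$
The top-layer LSTM output $\overrightarrow{h}_{k,L}^{LM}$ is fed to a softmax layer to predict the next token $t_{k+1}$.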
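
To make the prediction step concrete, here is a minimal sketch of turning a top-layer hidden state into a next-token distribution. The vocabulary, the projection matrix `W`, the bias `b`, and the hidden state `h_top` are made-up toy values for illustration, not parameters from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(h_top, W, b):
    """Project the top-layer state onto the vocabulary and normalize."""
    logits = [sum(w * h for w, h in zip(row, h_top)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Toy setup (hypothetical): 3-word vocabulary, 2-dimensional hidden state.
vocab = ["cat", "dog", "sat"]
W = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
h_top = [1.0, -1.0]

probs = next_token_distribution(h_top, W, b)
predicted = vocab[probs.index(max(probs))]
```

Here the logits are `[2.0, -2.0, 0.0]`, so the first vocabulary entry gets the highest probability.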

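A backward language model is similar, except that it predicts each token from its future context:
$$p(t_{1},t_{2},\dots,t_{n}) = \prod_{k=1}^{n}p(t_{k} \mid t_{k+1},t_{k+2},\dots,t_{n})$$
A bidirectional language model (biLM) combines the two by jointly maximizing the log-likelihood of the forward and backward directions:
$$\sum_{k=1}^{n}\left(\log p(t_{k} \mid t_{1},\dots,t_{k-1};\Theta_{x},\overrightarrow{\Theta}_{LSTM},\Theta_{s}) + \log p(t_{k} \mid t_{k+1},\dots,t_{n};\Theta_{x},\overleftarrow{\Theta}_{LSTM},\Theta_{s})\right)$$
Here $\Theta_{x}$ and $\Theta_{s}$ are the parameters of the context-independent token representation and of the softmax layer (both shared between the forward and backward models), while $\overrightarrow{\Theta}_{LSTM}$ and $\overleftarrow{\Theta}_{LSTM}$ are the LSTM parameters of the forward and backward directions.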
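
As a small illustration of the joint objective, the sketch below sums the forward and backward log-probabilities over all positions. The per-token probabilities are hypothetical numbers standing in for what the two LSTM directions would output; nothing here is taken from the paper's implementation.

```python
import math

def bilm_log_likelihood(forward_probs, backward_probs):
    """Sum of forward and backward log-probabilities over all positions.

    forward_probs[k]  stands for p(t_k | t_1 .. t_{k-1})
    backward_probs[k] stands for p(t_k | t_{k+1} .. t_n)
    """
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both directions must cover the same tokens")
    return sum(math.log(pf) + math.log(pb)
               for pf, pb in zip(forward_probs, backward_probs))

# Hypothetical per-token probabilities for a 2-token sentence.
forward_probs = [0.5, 0.25]
backward_probs = [0.5, 0.5]
joint_ll = bilm_log_likelihood(forward_probs, backward_probs)
```

Training would adjust the shared and direction-specific parameters to push this quantity up; the sketch only evaluates it for fixed probabilities.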

ELMo

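ELMo is a task-specific combination of the layer representations of the biLM. For each token $t_{k}$, an $L$-layer biLM computes a set of $2L+1$ representations:
$$R_{k} = \{x_{k}^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1,\dots,L\} = \{h_{k,j}^{LM} \mid j = 0,\dots,L\}$$
where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer.

ELMo collapses all layers in $R_{k}$ into a single vector: $ELMo_{k} = E(R_{k};\Theta_{e})$. In the simplest case, ELMo just selects the top layer, $ELMo_{k} = E(R_{k};\Theta_{e}) = h_{k,L}^{LM}$, as in TagLM and CoVe.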

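More generally, we compute a task-specific weighted sum of all biLM layers:
$$ELMo_{k}^{task} = E(R_{k};\Theta^{task}) = \gamma^{task}\sum_{j=0}^{L}s_{j}^{task}h_{k,j}^{LM}$$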
$s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma ^{task}$ allows the task model to scale the entire ELMo vector. $\gamma ^{task}$ is of practical importance to aid the optimization process (see supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
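
The weighting described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: `layers` holds the $L+1$ layer vectors for one token, `raw_weights` are the unnormalized scalars that become $s^{task}$ after softmax, and `gamma` plays the role of $\gamma^{task}$. All numeric values are invented.

```python
import math

def softmax(ws):
    """Numerically stable softmax over a list of floats."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_vector(layers, raw_weights, gamma):
    """Return gamma * sum_j s_j * h_j, with s = softmax(raw_weights)."""
    if len(layers) != len(raw_weights):
        raise ValueError("need one weight per layer")
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layers))
            for d in range(dim)]

# Hypothetical token with L = 2 biLSTM layers plus the token layer (3 vectors).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw_weights = [0.0, 0.0, 0.0]   # equal raw weights -> uniform average
gamma = 2.0

elmo_k = elmo_vector(layers, raw_weights, gamma)
```

With equal raw weights the softmax is uniform, so the result is simply `gamma` times the average of the layers. In a real task model both `raw_weights` and `gamma` would be learned.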

Using biLMs for supervised NLP tasks

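For a target NLP task, we start from a pre-trained biLM (which can optionally be fine-tuned on the task's data; this fine-tuning needs no supervised labels). We then freeze the biLM weights, run the biLM to record the representations of every layer for each token, and let the supervised task model learn a linear combination of these representations from the task's training examples.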

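Concatenate the ELMo vector $ELMo_{k}^{task}$ with the ordinary token embedding $x_{k}$ and pass $[x_{k}; ELMo_{k}^{task}]$ into the task model as the new input $x_{k}$.
Concatenate the ELMo vector $ELMo_{k}^{task}$ with the hidden-layer output $h_{k}$ and use $[h_{k}; ELMo_{k}^{task}]$ as the new hidden state $h_{k}$; this gives further improvements on SNLI and SQuAD.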
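
Both options come down to a per-token concatenation. A minimal sketch, using a hypothetical helper name and toy dimensions:

```python
def concat_elmo(base_vectors, elmo_vectors):
    """Return [v_k ; ELMo_k] for every token position k."""
    if len(base_vectors) != len(elmo_vectors):
        raise ValueError("need one ELMo vector per token")
    return [list(v) + list(e) for v, e in zip(base_vectors, elmo_vectors)]

# Hypothetical 2-token sentence: 3-dim token embeddings, 2-dim ELMo vectors.
token_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
elmo_vectors = [[1.0, 2.0], [3.0, 4.0]]

enhanced_inputs = concat_elmo(token_embeddings, elmo_vectors)
```

Passing the hidden-layer outputs $h_{k}$ as `base_vectors` instead of the token embeddings gives the second variant. Either way, the layer that consumes the result has to accept the larger input dimension.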

Finally, we found it beneficial to add a moderate amount of dropout to ELMo (Srivastava et al., 2014) and in some cases to regularize the ELMo weights by adding $\lambda ||w||^2_2$ to the loss. This imposes an inductive bias on the ELMo weights to stay close to an average of all biLM layers.

Evaluation

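The authors run experiments on six NLP tasks and show that simply adding ELMo yields state-of-the-art results on each of them, across different model architectures and language understanding tasks.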

For details on each task, see: https://drive.google.com/open?id=1ZUlKKt9DMNHSF1UMtnUFb9fqZSXZPfhf

Analysis

Alternate layer weighting schemes

There are many alternative ways to combine the biLM layers in the layer-weighting formula from the ELMo section:

Previous work on contextual representations used only the last layer, whether it be from a biLM (Peters et al., 2017) or an MT encoder (CoVe; McCann et al., 2017).

The regularization coefficient $\lambda$ also has a large impact:

Large values such as $\lambda=1$ effectively reduce the weighting function to a simple average over the layers.

Smaller values (e.g., $\lambda=0.001$) allow the layer weights to vary.

Table 2: Development set performance for SQuAD, SNLI and SRL comparing using all layers of the biLM (with different choices of regularization strength $\lambda$) to just the top layer.
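
To see why a large $\lambda$ pushes the weighting toward a plain average, consider the toy sketch below. It assumes a quadratic stand-in for the task loss, `task_strength * ||w - w_star||^2`, which is purely hypothetical and not how the paper trains its models. Adding $\lambda ||w||^2_2$ to that stand-in has a closed-form minimizer that shrinks the raw weights toward zero, and softmax of near-zero weights is nearly uniform.

```python
import math

def softmax(ws):
    """Numerically stable softmax over a list of floats."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]

def regularized_weights(w_star, task_strength, lam):
    """Minimizer of task_strength * ||w - w_star||^2 + lam * ||w||^2."""
    shrink = task_strength / (task_strength + lam)
    return [shrink * w for w in w_star]

w_star = [0.0, 1.0, 3.0]   # hypothetical weights the task alone would prefer
task_strength = 0.01       # hypothetical curvature of the stand-in task loss

s_strong = softmax(regularized_weights(w_star, task_strength, lam=1.0))
s_weak = softmax(regularized_weights(w_star, task_strength, lam=0.001))
```

Under these assumed numbers, `s_strong` is close to one third per layer, while `s_weak` puts most of its mass on the last layer. The exact values depend entirely on the made-up `w_star` and `task_strength`.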

Where to include ELMo?

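Although all the architectures in the paper include ELMo only as input to the lowest layer, for some task-specific architectures, also including ELMo at the output of the network's hidden (biRNN) layer improves accuracy.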

One possible explanation for this result is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM's internal representations. In the SRL case, the task-specific context representations are likely more important than those from the biLM.

What information is captured by the biLM’s representations?

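Since the biLM improves performance, it must intuitively be capturing information that plain word embeddings lack. The paper illustrates this with an example: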

Using a word sense disambiguation (WSD) task and a POS tagging task, the authors show that different layers of the biLM capture different types of information.

Sample efficiency

Fewer epochs are needed to reach the best performance.
The improvement is more pronounced on small datasets.

Conclusion

We have introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and shown large improvements when applying ELMo to a broad range of NLP tasks. Through ablations and other controlled experiments, we have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance.

Helic He

NLP