# Deep contextualized word representations

## Introduction

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.

Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.

## ELMo: Embeddings from Language Models

Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section. They are computed on top of two-layer biLMs with character convolutions (Sec. 3.1), as a linear function of the internal network states (Sec. 3.2). This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale (Sec. 3.4) and easily incorporated into a wide range of existing neural NLP architectures (Sec. 3.3).

### Bidirectional language models

A forward language model computes the probability of a token sequence by modeling the probability of each token $t_{k}$ given its history $(t_{1},\ldots,t_{k-1})$:

$$p(t_{1},t_{2},\ldots,t_{n}) = \prod_{k=1}^{n}p(t_{k} \mid t_{1},t_{2},\ldots,t_{k-1})$$

At each position $k$, every LSTM layer of the forward language model outputs a context-dependent representation

$$\overrightarrow{h}_{k,j}^{LM}, \quad j = 1,\ldots,L$$

A backward language model runs over the sequence in reverse, predicting each token given its future context:

$$p(t_{1},t_{2},\ldots,t_{n}) = \prod_{k=1}^{n}p(t_{k} \mid t_{k+1},t_{k+2},\ldots,t_{n})$$

and analogously produces representations $\overleftarrow{h}_{k,j}^{LM}$ at each layer $j$.
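For reference, the biLM in the original paper is trained by jointly maximizing the log likelihood of the forward and backward directions, sharing the token representation parameters $\Theta_{x}$ and the softmax parameters $\Theta_{s}$ between the two directions:

$$\sum_{k=1}^{n}\left(\log p(t_{k} \mid t_{1},\ldots,t_{k-1};\Theta_{x},\overrightarrow{\Theta}_{LSTM},\Theta_{s}) + \log p(t_{k} \mid t_{k+1},\ldots,t_{n};\Theta_{x},\overleftarrow{\Theta}_{LSTM},\Theta_{s})\right)$$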

### ELMo

ELMo is a combination of the representations from all layers of the biLM. For a given token $t_{k}$, an $L$-layer biLM computes a set of $2L+1$ representations:

$$R_{k} = \{x_{k}^{LM},\: \overrightarrow{h}_{k,j}^{LM},\: \overleftarrow{h}_{k,j}^{LM} \mid j = 1,\ldots,L\} = \{h_{k,j}^{LM} \mid j = 0,\ldots,L\}$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses the set $R$ of biLM layer outputs into a single vector: $ELMo_{k} = E(R_{k};\Theta_{e})$. In the simplest case, ELMo just selects the top layer, $E(R_{k};\Theta_{e}) = h_{k,L}^{LM}$, as in the TagLM and CoVe models.

More generally, ELMo computes a task-specific weighting of all biLM layers:

$$ELMo_{k}^{task} = E(R_{k};\Theta^{task}) = \gamma^{task}\sum_{j=0}^{L}s_{j}^{task}\,h_{k,j}^{LM}$$

Here $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma^{task}$ is of practical importance to aid the optimization process (see supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
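As a concrete illustration, here is a minimal numpy sketch of this layer weighting; the function and argument names (`elmo_scalar_mix`, `s_task`, `gamma_task`) are illustrative and not taken from any released ELMo implementation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def elmo_scalar_mix(layer_reps, s_task, gamma_task):
    """Collapse L+1 biLM layer representations of one token into a single
    ELMo vector: gamma^task * sum_j softmax(s^task)_j * h_{k,j}^{LM}.

    layer_reps : list of (dim,) arrays, one per layer (token layer + L biLSTM layers)
    s_task     : (L+1,) array of unnormalized, task-specific layer weights
    gamma_task : task-specific scalar that rescales the whole ELMo vector
    """
    weights = softmax(s_task)                      # softmax-normalized layer weights
    mixed = sum(w * h for w, h in zip(weights, layer_reps))
    return gamma_task * mixed

# Toy usage: a 2-layer biLM (token layer + 2 biLSTM layers) with 4-dim representations.
reps = [np.random.randn(4) for _ in range(3)]
print(elmo_scalar_mix(reps, s_task=np.zeros(3), gamma_task=1.0))  # zero weights -> simple layer average
```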

### Using biLMs for supervised NLP tasks

1. Concatenate the ELMo vector $ELMo_{k}^{task}$ with the ordinary (context-independent) word embedding $x_{k}$ and feed $[x_{k}; ELMo_{k}^{task}]$ to the task RNN as the new input $x_{k}$.

2. Concatenate the ELMo vector $ELMo_{k}^{task}$ with the task RNN's hidden output vector $h_{k}$ and use $[h_{k}; ELMo_{k}^{task}]$ as the new hidden state $h_{k}$; this gives further gains on SNLI and SQuAD (see the sketch after this list).
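Both ways of injecting ELMo amount to a simple concatenation; the following numpy sketch only illustrates the shapes involved, with all dimensions and variable names chosen for the example rather than taken from the paper:

```python
import numpy as np

x_k = np.random.randn(100)       # ordinary context-independent word embedding for token k
h_k = np.random.randn(200)       # task RNN hidden state for token k
elmo_k = np.random.randn(1024)   # ELMo_k^{task}, computed from the frozen biLM

# (1) Enhance the input representation fed to the task RNN.
x_k_enhanced = np.concatenate([x_k, elmo_k])   # [x_k ; ELMo_k^{task}]

# (2) Enhance the task RNN's output representation (helped on SNLI and SQuAD).
h_k_enhanced = np.concatenate([h_k, elmo_k])   # [h_k ; ELMo_k^{task}]
```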

Finally, we found it beneficial to add a moderate amount of dropout to ELMo (Srivastava et al., 2014) and in some cases to regularize the ELMo weights by adding $\lambda ||w||^2_2$ to the loss. This imposes an inductive bias on the ELMo weights to stay close to an average of all biLM layers.
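The regularizer is just an $L_2$ penalty on the unnormalized layer weights added to the task loss; a minimal sketch, where `lam`, `w`, and `task_loss` are placeholder names:

```python
import numpy as np

lam = 0.001                          # regularization strength (lambda)
w = np.array([0.3, -0.1, 0.2])       # unnormalized layer weights s^task (pre-softmax)
task_loss = 1.7                      # value produced by the supervised objective

# Penalizing ||w||_2^2 pulls the raw weights toward zero, so the softmax-normalized
# weights move toward a uniform average over the biLM layers.
total_loss = task_loss + lam * np.sum(w ** 2)
```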

## Analysis

### Alternate layer weighting schemes

Previous work on contextual representations used only the last layer, whether it be from a biLM (Peters et al., 2017) or an MT encoder (CoVe; McCann et al., 2017). ELMo instead learns a weighting over all layers, and the regularization parameter $\lambda$ controls how far the learned weights can move from a uniform average:

• Large values such as $\lambda = 1$ effectively reduce the weighting function to a simple average over the layers.
• Smaller values (e.g., $\lambda = 0.001$) allow the layer weights to vary.

### Where to include ELMo?

Including ELMo at both the input and output of the task biRNN improves SNLI and SQuAD over including it at the input alone, while SRL performs best with ELMo at the input layer only. One possible explanation for this result is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM’s internal representations. In the SRL case, the task-specific context representations are likely more important than those from the biLM.

### Sample efficiency

• Models with ELMo reach their best performance in fewer training epochs.
• The improvement from ELMo is larger when the training set is small.

## Conclusion

We have introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and shown large improvements when applying ELMo to a broad range of NLP tasks. Through ablations and other controlled experiments, we have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance.