# BERT Pre-training of Deep Bidirectional Transformers for Language Understanding

## Introduction

There are two existing strategies for applying pre-trained language representations to downstream tasks:

• Feature-based approaches
• Fine-tuning approaches

The paper's main contributions:

• Demonstrates the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), who pre-train with a unidirectional language model, BERT uses a masked language model (MLM) to pre-train deep bidirectional representations. This also differs from Peters et al. (2018), who use a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• Shows that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems built on task-specific architectures.

• BERT advances the state of the art on 11 NLP tasks. The paper also reports ablation studies showing that the model's bidirectionality is its single most important new contribution.

### Feature-based Approaches

ELMo extracts context-sensitive features from a language model and feeds them into task-specific architectures, improving performance on many NLP tasks. Note that in feature-based approaches, the pre-trained representations are only added to the task-specific architecture as extra features. ELMo also uses a bidirectional LM, but its two directional language models are trained independently and their hidden states are simply concatenated at the end, which is a fundamental difference from the BERT model proposed in this paper.

### Fine-tuning Approaches

OpenAI GPT replaces the LSTM with a Transformer network as its language model in order to better capture long-range linguistic structure. During supervised fine-tuning on a specific task, the language modeling objective is kept as an auxiliary training objective. Experiments were run on 12 NLP tasks, with SOTA results on 9 of them.

OpenAI GPT uses the standard language modeling objective, predicting the current word from the preceding k words, but as the underlying architecture it uses the Transformer decoder proposed by the Google team in "Attention Is All You Need".

## BERT

### Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has recently become ubiquitous and this implementation is effectively identical to the original, the paper omits an exhaustive background description of the architecture. Below, L denotes the number of layers (Transformer blocks), H the hidden size, and A the number of self-attention heads.

• BERT-BASE: L=12, H=768, A=12, Total Parameters=110M

• BERT-LARGE: L=24, H=1024, A=16, Total Parameters=340M
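As a rough sanity check on these sizes, the sketch below estimates the parameter count of a BERT-style encoder from L and H (the head count A does not affect the total, because the heads partition the hidden size). The 30,522-token vocabulary and 512 maximum positions follow the paper; the handling of bias and layer-norm terms is approximate and this is not the official accounting.

```python
# Approximate parameter count for a BERT-style encoder (a sketch, not the
# official implementation). Vocabulary 30,522 and max position 512 follow
# the paper; bias and layer-norm terms are counted only approximately.
def bert_param_count(L, H, vocab=30522, max_pos=512, ffn_mult=4):
    emb = vocab * H + max_pos * H + 2 * H                    # token + position + segment embeddings
    attn = 4 * (H * H + H)                                   # Q, K, V and output projections
    ffn = H * (ffn_mult * H) + ffn_mult * H + (ffn_mult * H) * H + H  # two feed-forward layers
    norms = 2 * (2 * H)                                      # two layer norms per block
    pooler = H * H + H                                       # [CLS] pooler
    return emb + L * (attn + ffn + norms) + pooler

print(f"BERT-Base:  ~{bert_param_count(12, 768) / 1e6:.0f}M parameters")
print(f"BERT-Large: ~{bert_param_count(24, 1024) / 1e6:.0f}M parameters")
```

Run as-is, this prints roughly 109M and 335M parameters, in line with the reported 110M and 340M.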

### Input Representation

• We use WordPiece embeddings (Wu et al.,2016) with a 30,000 token vocabulary. We denote split word pieces with ##.

• We use learned positional embeddings with supported sequence lengths up to 512 tokens.

• The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state (i.e., output of Transformer) corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.

• Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a sentence B embedding to every token of the second sentence.

• For single-sentence inputs we only use the sentence A embeddings.
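As a concrete illustration of these conventions, here is a minimal sketch of the packing step. A whitespace split stands in for WordPiece tokenization, and the token strings and helper name are illustrative, not part of the released code.

```python
# A minimal sketch of how a sentence pair is packed into one BERT input
# sequence: [CLS] A [SEP] B [SEP] with segment ids 0 for A and 1 for B.
def pack_pair(tokens_a, tokens_b=None):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                  # sentence A embedding
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)     # sentence B embedding
    position_ids = list(range(len(tokens)))          # learned positions, up to 512
    return tokens, segment_ids, position_ids

tokens, segments, positions = pack_pair(
    "my dog is hairy".split(), "he likes play ##ing".split()
)
print(tokens)    # ['[CLS]', 'my', 'dog', 'is', 'hairy', '[SEP]', 'he', ..., '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```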

### Pre-training Tasks

#### Task #1: Masked LM

Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself” in a multi-layered context.

Masking 15% of the tokens has two downsides:

1. It creates a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning; the 80/10/10 replacement strategy below is meant to mitigate this.

2. Only 15% of the words in each batch are predicted, rather than the whole sentence, which makes pre-training converge more slowly. On this second point, the authors argue that although training is slower, the clear gains in downstream performance more than make up for it.

Rather than always replacing the chosen words with [MASK], the data generator will do the following:

• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy -> my dog is [MASK]

• 10% of the time: Replace the word with a random word, e.g., my dog is hairy -> my dog is apple

• 10% of the time: Keep the word unchanged, e.g., my dog is hairy -> my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.
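The replacement rule above can be sketched in a few lines of Python. The 15% selection rate follows the paper, while the toy vocabulary and word-level (rather than WordPiece-id) processing are simplifications for illustration.

```python
import random

# A minimal sketch of the 80/10/10 masking rule: choose ~15% of positions to
# predict, then replace with [MASK], a random word, or leave unchanged.
def mask_tokens(tokens, vocab, mask_rate=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:            # position selected for prediction
            labels[i] = tok                        # the model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"               # 80%: replace with the [MASK] token
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # 10%: replace with a random word
            # remaining 10%: keep the word unchanged
    return inputs, labels

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "milk", "store"]))
```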

#### Task #2: Next Sentence Prediction

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]

Label = NotNext

We choose the NotNext sentences completely at random, and the final pre-trained model achieves 97%-98% accuracy at this task. Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.
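For illustration, here is a minimal sketch of how such labeled pairs can be constructed; the two-document toy corpus and helper name are assumptions, not the paper's actual data pipeline.

```python
import random

# A minimal sketch of NSP example construction: 50% of the time B is the true
# next sentence (IsNext), otherwise a random sentence from another document.
corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]

def make_nsp_example(doc_idx, sent_idx):
    sent_a = corpus[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(corpus[doc_idx]):
        return sent_a, corpus[doc_idx][sent_idx + 1], "IsNext"
    other_doc = corpus[(doc_idx + 1) % len(corpus)]   # sample from a different document
    return sent_a, random.choice(other_doc), "NotNext"

print(make_nsp_example(0, 0))
```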

### Pre-training Procedure

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is <= 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces.

The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
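As a sketch of that combined objective (here in PyTorch, with random tensors standing in for the model's actual outputs; positions that were not masked carry the label -100 so they are ignored by the loss):

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the combined pre-training loss; shapes and label values
# are toy stand-ins for what the model and data pipeline would produce.
vocab_size, batch, seq_len = 30522, 2, 8
mlm_logits = torch.randn(batch, seq_len, vocab_size)   # masked LM predictions
mlm_labels = torch.full((batch, seq_len), -100)        # -100 = position not masked
mlm_labels[0, 3] = 1037                                 # one masked position to predict
nsp_logits = torch.randn(batch, 2)                      # IsNext / NotNext scores
nsp_labels = torch.tensor([0, 1])                       # 0 = IsNext, 1 = NotNext

masked_lm_loss = F.cross_entropy(
    mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100
)
next_sentence_loss = F.cross_entropy(nsp_logits, nsp_labels)
total_loss = masked_lm_loss + next_sentence_loss        # sum of the two mean losses
print(total_loss)
```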

### Fine-tuning Procedure

For span-level and token-level prediction tasks, the above procedure must be modified slightly in a task-specific manner. Details are given in the corresponding subsection of Section 4.

All of BERT's parameters and the parameters of the softmax classification layer are trained jointly during fine-tuning, maximizing the likelihood of the correct label. Most of the model's hyperparameters are kept the same as in pre-training; the exceptions are the batch size, learning rate, and number of training epochs.

The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Number of epochs: 3, 4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.
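A minimal sketch of that exhaustive search; `train_and_eval` is a hypothetical function that fine-tunes BERT with one configuration and returns its development-set score.

```python
import itertools

# Exhaustive search over the fine-tuning hyperparameter grid suggested above.
grid = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [3, 4],
}

def grid_search(train_and_eval):
    best_score, best_cfg = float("-inf"), None
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_and_eval(**cfg)      # fine-tune and evaluate on the dev set
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```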

### Comparison of BERT and OpenAI GPT

• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.