# Transfer Learning for Sequence Labeling Using Source Model and Target Data

## Introduction

• A source model $M_{s}$ is trained on the source data $D_{s}$.
• A transfer learning task TL is defined: transfer from the source data $D_{s}$ to the target data $D_{t}$. Note that $D_{t}$ contains some new entity classes in addition to those already present in $D_{s}$, but $D_{t}$ is far smaller than $D_{s}$, and $D_{s}$ may not be used directly to train $M_{t}$ during transfer.

• Given the source model $M_{s}$ trained on $D_{s}$ (in practice a Bi-LSTM+CRF), its parameters are used to initialize $M_{t}$; the output layer of $M_{t}$ is widened accordingly, and $M_{t}$ is then fine-tuned on the target data $D_{t}$.
• A neural adapter, implemented as a Bi-LSTM, connects $M_{s}$ and $M_{t}$: it takes the output of the final linear layer of $M_{s}$ (before the softmax) as input, and its output is fed into $M_{t}$ as an additional input. The adapter's main role is to resolve the inconsistency between the label sequences of $D_{s}$ and $D_{t}$.

The surface forms of a new category may already appear in $D_{S}$, but they are not annotated with that label, because at source-training time the category is not yet a concept to be recognized.

### Problem Formalization

In the initial phase, a sequence labeling model, $M_{S}$, is trained on a source dataset, $D_{S}$, which has E classes. Then, in the next phase, a new model, $M_{T}$, needs to be learned on target dataset, $D_{T}$, which contains new input examples and E + M classes, where M is the number of new classes. $D_{S}$ cannot be used for training $M_{T}$.
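The growth of the label space from E to E + M classes can be made concrete with a small sketch (the class names below are hypothetical, not taken from the paper):

```python
# A minimal sketch of how the tag set grows from E to E + M classes
# under BIO tagging. Each class contributes B- and I- tags, plus one
# shared null tag O, giving 2E + 1 tags in total.
def bio_tags(classes):
    """Build the BIO tag set: B-/I- per class, plus the null tag O."""
    tags = ["O"]
    for c in classes:
        tags += [f"B-{c}", f"I-{c}"]
    return tags

source_classes = ["PER", "LOC"]            # E = 2
target_classes = source_classes + ["ORG"]  # E + M = 3, with M = 1

print(len(bio_tags(source_classes)))  # 2*E + 1 = 5
print(len(bio_tags(target_classes)))  # 2*(E+M) + 1 = 7
```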

### Transfer Learning Approach

Training of a source model:

Parameter Transfer: Since new classes are added, the dimension of the final FC layer after the Bi-LSTM must change, as shown in Figure 1. Specifically, the FC layer maps the LSTM's hidden output vector h to a vector p of dimension $nE+1$, where n is a constant factor determined by the annotation scheme; for the BIO scheme (_B-NE_, _I-NE_), $n=2$. After adding M new classes, the FC output dimension grows to $n(E+M)+1$. The resized FC layer is initialized by sampling from $X \sim \mathcal{N}\left(\mu, \sigma^{2}\right)$, where $\mu, \sigma$ are the mean and standard deviation of the original FC weights; all other layers, whose sizes are unchanged, are initialized directly from the corresponding layers of $M_{S}$, as shown below.
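A minimal sketch of this initialization, using NumPy in place of an actual framework layer (the paper's exact implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def expand_output_layer(W_src, new_out_dim):
    """Resize the final FC layer for the enlarged tag set.
    W_src: (out, in) weight matrix of the source FC layer.
    Returns a (new_out_dim, in) matrix sampled from N(mu, sigma^2),
    where mu and sigma are the mean and std of the source weights."""
    mu, sigma = W_src.mean(), W_src.std()
    return rng.normal(mu, sigma, size=(new_out_dim, W_src.shape[1]))

# BIO scheme (n = 2), E = 4 source classes: n*E + 1 = 9 outputs.
# Adding M = 2 new classes grows the layer to n*(E+M) + 1 = 13 outputs.
W_src = rng.normal(0.0, 0.1, size=(9, 100))
W_tgt = expand_output_layer(W_src, 13)
print(W_tgt.shape)  # (13, 100)
```

All other layers keep their shapes, so their weights are simply copied from $M_{S}$ before fine-tuning.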

Training the target model:

### Transfer Learning using neural adapters

It should be noted that many word sequences corresponding to new NE categories may already appear in the source data, but they are annotated as null since their label is not yet part of the source annotation scheme. This is critical to address: otherwise the target model with transferred parameters would treat word sequences corresponding to a new NE category as the null category.

$$\overrightarrow{a}_{t}=\overrightarrow{\mathrm{A}}\left(p_{t}^{\mathrm{S}}, \overrightarrow{a}_{t-1}\right)$$

$$\overleftarrow{a}_{t}=\overleftarrow{\mathrm{A}}\left(p_{t}^{\mathrm{S}}, \overleftarrow{a}_{t+1}\right)$$

$$\boldsymbol{a}_{t}=\overrightarrow{a}_{t} \oplus \overleftarrow{a}_{t}$$

$$\boldsymbol{p}_{t}^{\mathrm{T}^{\prime}}=\boldsymbol{a}_{t} \oplus \boldsymbol{p}_{t}^{\mathrm{T}}$$

where $\overrightarrow{\mathrm{A}}$ and $\overleftarrow{\mathrm{A}}$ are the forward and backward passes of the adapter BLSTM over the source model's pre-softmax outputs $p_{t}^{\mathrm{S}}$, and $\oplus$ denotes element-wise summation.
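The adapter recurrences can be sketched as follows. This is a toy illustration, not the paper's implementation: the BLSTM cells are stubbed with simple decaying recurrences, and the source and target output dimensions are assumed equal for simplicity (in the actual model the adapter maps from dimension $nE+1$ to $n(E+M)+1$):

```python
import numpy as np

def run_adapter(p_src, p_tgt, step_fwd, step_bwd):
    """Sketch of the neural adapter: a forward and a backward recurrence
    over the source model's pre-softmax outputs p_t^S; their element-wise
    sum a_t is then summed with the target logits p_t^T. step_fwd and
    step_bwd stand in for the adapter BLSTM cells (any function
    (input, prev_state) -> state works for illustration)."""
    T, d = p_src.shape
    a_fwd = np.zeros((T, d))
    a_bwd = np.zeros((T, d))
    state = np.zeros(d)
    for t in range(T):                 # forward: a_t = A(p_t^S, a_{t-1})
        state = step_fwd(p_src[t], state)
        a_fwd[t] = state
    state = np.zeros(d)
    for t in reversed(range(T)):       # backward: a_t = A(p_t^S, a_{t+1})
        state = step_bwd(p_src[t], state)
        a_bwd[t] = state
    a = a_fwd + a_bwd                  # a_t = fwd ⊕ bwd (element-wise sum)
    return a + p_tgt                   # p_t^{T'} = a_t ⊕ p_t^T

# Toy usage: a decaying recurrence stands in for each BLSTM direction.
step = lambda x, h: 0.5 * h + x
p_src = np.ones((3, 4))
p_tgt = np.ones((3, 4))
p_out = run_adapter(p_src, p_tgt, step, step)
print(p_out.shape)  # (3, 4)
```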

The choice of a BLSTM as the adapter is motivated by the need to incorporate the contextual information of each feature in the sequence when detecting mentions of a new category, which the source model would otherwise incorrectly predict as the null label.

## Experiments

### Results on CoNLL and I-CAB datasets

• Parameter transfer leads to better results.
• Freezing the transferred parameters degrades results, especially for the new entity classes.