# An Efficient Approach to Encoding Context for Spoken Language Understanding

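SLU (spoken language understanding) is the foundation of task-oriented dialogue systems. This paper proposes an SLU model built on dialogue-history modeling: an RNN encodes the dialogue context, which then helps interpret the current utterance and can also be reused for DST (dialogue state tracking).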

## Introduction

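• At every turn, the natural-language utterances of the entire preceding history have to be processed.
• Dialogue context could potentially be obtained from the dialogue state tracker. Using a separate SLU-specific network instead of reusing the DST's context information doubles the computation.
• Memory networks encode the system's natural-language utterances and ignore the system dialogue acts. The two carry the same information, but dialogue acts are more structured and fall into fewer categories.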

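• The system dialogue acts are encoded directly, replacing the encoding of the system's natural-language utterances. This makes it possible to reuse the output of the DM (dialogue manager) to obtain context.

• A hierarchical RNN encodes the context, processing one dialogue turn per time step. This reduces computation without degrading performance.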

Our representation of dialogue context is similar to those used in dialogue state tracking models [17, 18, 19], thus enabling the sharing of context representation between SLU and DST.

## Approach

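The utterance encoder and the slot tagger are both bidirectional RNNs. In addition to the inputs described above, each also receives the context vector $o^{t}$ as an extra input; the details are given in the subsections below.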

### System Act Encoder

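The System Act Encoder encodes the system dialogue acts at turn t into a vector $a^{t}$, and the encoding does not depend on the order in which the acts appear. This differs from encoding the natural-language utterance, which implicitly carries order information.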

• An act with one slot (an act has at most one slot), with or without a slot value: request(time), negate(time='6pm')
• An act without a slot: greeting

_Note that the same dialogue act can appear in the dialogue with or without an associated slot (negate(time=‘6 pm’) versus negate)._

• $A_{sys}$: the set of all system acts
• $a^{t}_{slot}(s)$: binary vector of length $\left | A_{sys} \right |$ marking the acts that occur with slot s (slot values are not included)
• $a^{t}_{ns}$: binary vector of length $\left | A_{sys} \right |$ marking the acts that occur without a slot
• $e_{s}$: embedding of slot s
• $S^{t}$: the set of slots at turn t

The System Act Encoder is essentially a fully connected (feedforward) network.
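The order-invariant encoding described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's exact equations: the mean pooling over slots, the layer sizes, the random weights, the example acts and slots, and names such as `encode_system_acts` are all made up for this example.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(W, b, x):
    """Affine layer: W is a list of rows, b the bias vector, x the input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def encode_system_acts(a_slot, a_ns, slot_emb, W1, b1, W2, b2):
    """a_slot: dict slot -> binary act vector; a_ns: binary vector of slot-free acts."""
    hidden = []
    for s, acts in a_slot.items():
        # Layer 1 sees the per-slot act vector concatenated with the slot embedding e_s.
        hidden.append(relu(dense(W1, b1, acts + slot_emb[s])))
    # Assumed pooling: average over slots, so the result ignores the order of the acts.
    if hidden:
        pooled = [sum(col) / len(hidden) for col in zip(*hidden)]
    else:
        pooled = [0.0] * len(b1)
    # Layer 2 combines the pooled slot information with the slot-free acts.
    return relu(dense(W2, b2, pooled + a_ns))

rng = random.Random(0)
ACTS = ["request", "negate", "greeting"]               # hypothetical A_sys
slot_emb = {"time": [0.1, -0.2], "date": [0.3, 0.4]}   # hypothetical slot embeddings e_s
W1, b1 = rand_matrix(4, len(ACTS) + 2, rng), [0.0] * 4
W2, b2 = rand_matrix(4, 4 + len(ACTS), rng), [0.0] * 4

a_slot = {"time": [1, 0, 0], "date": [0, 1, 0]}        # request(time), negate(date)
a_ns = [0, 0, 1]                                       # greeting
a_t = encode_system_acts(a_slot, a_ns, slot_emb, W1, b1, W2, b2)
```

Because the per-slot vectors are averaged before the second layer, listing the slots in a different order yields the same $a^{t}$.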

### Utterance Encoder

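The Utterance Encoder produces a representation of the user's token sequence. Its input is the user's natural-language token sequence (with SOS and EOS tokens added at the start and end), and its output is the embedding corresponding to each token.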

• $x^{t}=\left \{ x^{t}_{m}\in R^{u_{d}}, \forall 0 \leq m< M^{t} \right \}$: token embeddings of the user input
• $M^{t}$: length of the user's input sequence at turn t
• $u^{t} \in R^{2d_{u}}$: representation of the whole user utterance
• $u_{o}^{t} =\left \{ u_{o,m}^{t}\in R^{2d_{u}}, \forall 0 \leq m< M^{t} \right \}$: representations of the individual input tokens

The Utterance Encoder is essentially a single-layer bidirectional GRU:
$$u^{t}, u_{o}^{t}=\mathrm{BRNN}_{\mathrm{GRU}}(x^{t}) \qquad (5)$$

### Dialogue Encoder

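The Dialogue Encoder is a unidirectional GRU RNN in which each time step corresponds to one dialogue turn; its purpose is to produce a context representation for every turn. The input is $a^{t} \oplus u^{t}$; combined with the previous turn's hidden state $s^{t-1}$, it yields the current turn's output $o^{t}$ and hidden state $s^{t}$ (for a GRU cell the two are identical). $o^{t}$ is the dialogue context representation at turn t.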
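The turn-level recurrence can be illustrated with a hand-written GRU cell. This is a sketch under assumptions: biases are omitted for brevity, the initial state is taken to be zeros, the weights are random, and the dimensions and input vectors are toy values chosen for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def gru_step(params, x, h_prev):
    """One GRU update; the returned state also serves as the output."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h_prev))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h_prev))]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(Wh, x), matvec(Uh, gated))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def encode_dialogue(params, turns, d):
    """turns: list of (a_t, u_t) pairs; returns the context vector o^t for every turn."""
    s = [0.0] * d                              # s^0, assumed to be all zeros
    outputs = []
    for a_t, u_t in turns:
        s = gru_step(params, a_t + u_t, s)     # the input is the concatenation of a^t and u^t
        outputs.append(s)                      # o^t equals s^t for a GRU cell
    return outputs

rng = random.Random(0)
d, in_dim = 3, 4                               # toy sizes: len(a_t) + len(u_t) == in_dim
params = [rand_matrix(d, in_dim, rng) if i % 2 == 0 else rand_matrix(d, d, rng)
          for i in range(6)]
turns = [([1.0, 0.0], [0.2, -0.1]), ([0.0, 1.0], [0.5, 0.3])]
contexts = encode_dialogue(params, turns, d)
```

Each turn costs a single `gru_step` call, regardless of how long the dialogue history already is.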

### Intent and Dialogue Act Classification

The user intent helps to identify the APIs/databases which the dialogue system should interact with. Intents are predicted at each turn so that a change of intent during the dialogue can be detected.

• $p_{i}^{t}$: probability distribution over intents, of length $\left|I \right|$
• $p_{a}^{t}(k)$: probability of presence of dialogue act k in turn t
• $I$: user intent set
• $A_{u}$: dialogue act set
• $W_{i}\in R^{d \times \left|I \right|}, W_{a}\in R^{d \times \left|A_{u} \right|}, len(o^{t})=d$

During inference, we predict $argmax(p_{i}^{t})$ as the intent label and all dialogue acts with probability greater than $t_{u}$ are associated with the utterance, where 0 < $t_{u}$ < 1.0 is a hyperparameter tuned using the validation set.
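The inference rule can be written out in a few lines. The sketch assumes that $p_{i}^{t}$ is a softmax and $p_{a}^{t}$ an element-wise sigmoid over linear projections of $o^{t}$ (consistent with the losses named in the training section); the logits, intent names, and act names below are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(intent_logits, act_logits, intents, acts, t_u=0.5):
    """Return the argmax intent and every act whose probability exceeds t_u."""
    p_i = softmax(intent_logits)
    p_a = [sigmoid(x) for x in act_logits]
    intent = intents[max(range(len(p_i)), key=p_i.__getitem__)]
    chosen = [a for a, p in zip(acts, p_a) if p > t_u]
    return intent, chosen

INTENTS = ["BUY_MOVIE_TICKETS", "RESERVE_RESTAURANT"]   # hypothetical intent set I
USER_ACTS = ["inform", "affirm", "negate"]              # hypothetical act set A_u
intent, user_acts = predict([2.0, -1.0], [1.5, -2.0, 0.3], INTENTS, USER_ACTS, t_u=0.5)
```

Exactly one intent is returned per turn, while zero or more dialogue acts can pass the threshold.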

### Slot Tagging

Slot tagging is the task of identifying the values for different slots present in the user utterance.

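The Slot Tagger is a Bi-LSTM. Its input is the token embeddings output by the Utterance Encoder, and it produces $s_{o}^{t}=\left \{ s_{o,m}^{t}\in R^{2d_{s}},0\leq m< M^{t} \right \}$, where $M^{t}$ is the length of the user's token sequence. For the m-th token, $s_{o,m}^{t}$ is passed through a softmax classifier to obtain a probability distribution over $2\left| S\right|+1$ labels, where S is the set of all slots. This count matches an IOB scheme with a B- and an I- label per slot plus a single O label.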

We use an LSTM cell instead of a GRU because it gave better results on the validation set.
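To make the $2\left| S\right|+1$ label count concrete, the following sketch builds a label set and reads slot values back from per-token tags. It assumes an IOB tagging scheme; the slot names, tokens, and helper names are hypothetical.

```python
def build_label_set(slots):
    """One B- and one I- label per slot, plus a single O label."""
    labels = ["O"]
    for s in slots:
        labels += ["B-" + s, "I-" + s]
    return labels

def extract_slot_values(tokens, tags):
    """Group B-/I- tagged tokens into (slot, value) pairs."""
    values, current_slot, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_slot)
        if starts_new or tag == "O":
            # Close the span that was open, if any.
            if current_slot is not None:
                values.append((current_slot, " ".join(current_tokens)))
            current_slot, current_tokens = (tag[2:], [token]) if starts_new else (None, [])
        else:
            current_tokens.append(token)   # I- tag continuing the open span
    if current_slot is not None:
        values.append((current_slot, " ".join(current_tokens)))
    return values

SLOTS = ["time", "date", "movie"]          # hypothetical slot set S
labels = build_label_set(SLOTS)
tokens = ["book", "tickets", "for", "6", "pm", "tomorrow"]
tags = ["O", "O", "O", "B-time", "I-time", "B-date"]
slot_values = extract_slot_values(tokens, tags)
```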

## Experiments

• the dialogue encoding vector $o^{t-1}$ encodes all turns prior to the current turn

• the system intent vector $a^{t}$ encodes the current turn system utterance

Positions A and C feed context vectors as additional inputs at each RNN step whereas positions B and D use the context vectors to initialize the hidden state of the two RNNs after a linear projection to the hidden state dimension.
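The two injection styles can be contrasted with a small sketch. It only illustrates the data flow: the vectors, the projection matrix `W`, and the function names are hypothetical, and the RNNs themselves are left out.

```python
def concat_context(token_vectors, context):
    """Positions A/C style: append the same context vector to every step input."""
    return [tok + context for tok in token_vectors]

def project_to_initial_state(context, W, b):
    """Positions B/D style: linear projection of the context to the hidden size."""
    return [sum(w * c for w, c in zip(row, context)) + bi for row, bi in zip(W, b)]

token_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # three toy token embeddings
context = [0.5, -1.0]                                  # toy context vector
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # projects dimension 2 to hidden size 3
b = [0.0, 0.0, 0.5]

step_inputs = concat_context(token_vectors, context)
initial_state = project_to_initial_state(context, W, b)
```

In the first style the context is repeated at every token, while in the second it enters the RNN only once through the initial state.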

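• $a^{t}$ only, no dialogue encoder: $a^{t}$ is fed as an additional input at one of the positions A-D, the dialogue encoder is removed, and $u^{t}$ is used in place of $o^{t}$ for intent and act classification. In this configuration, adding $a^{t}$ at position B gave the best results on the validation set; the test-set results are in row 7 of Table 1.

• $a^{t}$ only: $a^{t}$ is the input to the dialogue encoder and is also fed as an additional input at one of the positions A-D. Row 8 of Table 1 is the best model for this configuration, with $a^{t}$ added at position D.

• $o^{t-1}$ only: $a^{t}$ is the input to the dialogue encoder, and $o^{t-1}$ is added as an additional input at position C or D. Row 9 of Table 1 is the best model for this configuration, with $o^{t-1}$ added at position D.

• $a^{t}$ and $o^{t-1}$: $a^{t}$ is the input to the dialogue encoder, and $o^{t-1}$ and $a^{t}$ are each added independently as an additional input at position C or D, giving four combinations. Row 10 of Table 1 is the best model for this configuration, with $o^{t-1}$ added at position D and $a^{t}$ at position C.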

### Dataset

For instance, only 13% of the movie names in the validation and test sets are also present in the training set.

### Baselines

• NoContext: A two-layer stacked bidirectional RNN using GRU and LSTM cells respectively, and no context.

• PrevTurn: This is similar to the NoContext model, with a different bidirectional GRU layer encoding the previous system turn, and this encoding being input to the slot tagging layer of the encoder, i.e. position C in Figure 2.

• MemNet: This is the system from [11], using cosine attention. For this model, we report metrics with models trained with memory sizes of 6 and 20 turns. A memory size of 20, while making the model slower, enables it to use the entire dialogue history for most of the dialogues.

• SDEN: This is the system from [12] which uses a bidirectional GRU RNN for combining memory embeddings. We report metrics for models with memory sizes 6 and 20.

### Training and Evaluation

We use sigmoid cross entropy loss for dialogue act classification (since it is modeled as a multilabel binary classification problem) and softmax cross entropy loss for intent classification and slot tagging. During training, we minimize the sum of the three constituent losses using the ADAM optimizer [25] for 150k training steps with a batch size of 10 dialogues.
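The combined objective for a single turn can be sketched as below. The loss types follow the text; summing (rather than averaging) over acts and tokens, and all logits and targets in the example, are assumptions made for illustration.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target_index]

def sigmoid_cross_entropy(logits, targets):
    """Numerically stable form of -[y*log(p) + (1-y)*log(1-p)], summed over acts."""
    return sum(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
               for x, y in zip(logits, targets))

def turn_loss(intent_logits, intent_target, act_logits, act_targets,
              slot_logits, slot_targets):
    """Sum of the intent, dialogue act, and slot tagging losses for one turn."""
    intent_loss = softmax_cross_entropy(intent_logits, intent_target)
    act_loss = sigmoid_cross_entropy(act_logits, act_targets)
    slot_loss = sum(softmax_cross_entropy(l, t)
                    for l, t in zip(slot_logits, slot_targets))
    return intent_loss + act_loss + slot_loss

# Toy turn: 2 intents, 3 user acts, 2 tokens with 3 slot labels each, all logits zero.
loss = turn_loss([0.0, 0.0], 0,
                 [0.0, 0.0, 0.0], [1, 0, 1],
                 [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
```

With all-zero logits every prediction is uniform, so the loss reduces to $4\ln 2 + 2\ln 3$, which is a convenient sanity check.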

To improve model performance in the presence of out of vocabulary (OOV) tokens arising from entities not present in the training set, we randomly replace tokens corresponding to slot values in user utterance with a special OOV token with a value dropout probability that linearly increases during training.
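A minimal sketch of this value dropout follows. The text only states that the probability increases linearly during training, so the ramp starting at zero, the independent per-token replacement, the `<OOV>` token string, and the function names are assumptions.

```python
import random

def value_dropout_prob(step, total_steps, max_prob):
    """Assumed schedule: linear ramp from 0 at step 0 to max_prob at total_steps."""
    return max_prob * min(step, total_steps) / total_steps

def apply_value_dropout(tokens, tags, prob, rng, oov_token="<OOV>"):
    """Replace slot-value tokens (any tag other than 'O') with oov_token with probability prob."""
    return [oov_token if tag != "O" and rng.random() < prob else tok
            for tok, tag in zip(tokens, tags)]

rng = random.Random(0)
tokens = ["tickets", "for", "6", "pm"]
tags = ["O", "O", "B-time", "I-time"]
p = value_dropout_prob(step=75_000, total_steps=150_000, max_prob=0.4)
noisy = apply_value_dropout(tokens, tags, p, rng)
```

Tokens tagged `O` are never replaced, so only the entity values are hidden from the model.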

To find the best hyperparameter values, we perform grid search over the token embedding size {64, 128, 256}, learning rate [0.0001, 0.01], maximum value dropout probability [0.2, 0.5] and the intent prediction threshold {0.3, 0.4, 0.5}, for each model configuration listed in Section 3. The utterance encoder and slot tagger layer sizes are set equal to the token embedding dimension, and that of the dialogue encoder to half this dimension. In Table 1, we report intent accuracy, dialogue act F1 score, slot chunk F1 score [22] and frame accuracy on the test set for the best runs for each configuration in Section 3 based on frame accuracy on the combined validation set, to avoid overfitting.

A frame is considered correct if its predicted intent, slots and acts are all correct.
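Frame accuracy as defined here can be computed as follows. The dictionary layout of a frame (keys `intent`, `acts`, `slots`) and the example values are assumptions for illustration.

```python
def frame_correct(pred, gold):
    """A frame counts as correct only if intent, acts, and slots all match."""
    return (pred["intent"] == gold["intent"]
            and set(pred["acts"]) == set(gold["acts"])
            and pred["slots"] == gold["slots"])

def frame_accuracy(preds, golds):
    correct = sum(frame_correct(p, g) for p, g in zip(preds, golds))
    return correct / len(golds)

golds = [
    {"intent": "BUY_MOVIE_TICKETS", "acts": ["inform"], "slots": {"time": "6 pm"}},
    {"intent": "BUY_MOVIE_TICKETS", "acts": ["affirm"], "slots": {}},
]
preds = [
    {"intent": "BUY_MOVIE_TICKETS", "acts": ["inform"], "slots": {"time": "7 pm"}},  # wrong slot value
    {"intent": "BUY_MOVIE_TICKETS", "acts": ["affirm"], "slots": {}},
]
accuracy = frame_accuracy(preds, golds)
```

A single wrong slot value makes the whole frame incorrect, which is why frame accuracy is the strictest of the reported metrics.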

## Results and Discussion

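1. The proposed model achieves accuracy comparable to the MemNet and SDEN baselines, and all of them far outperform the model without context, which demonstrates the importance of context in SLU.
2. The other aspect of interest is computational efficiency: at every turn a memory network has to process many input utterances from the dialogue history, whereas the proposed model obtains the context representation with only one pass through a feedforward fully connected network and a single RNN step. SDEN is slower than the memory network because it additionally passes the memory network's output embeddings through an RNN.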

Empirically, MemNet-6 and MemNet-20 experiments took roughly 4x and 12x more

3. The proposed model generalizes better on the smaller dataset (Sim-M).

Two interesting experiments to compare are rows 2 and 7 i.e. “PrevTurn” and “$a^{t}$ only, No DE”; they both use context only from the previous system utterance/acts, discarding the remaining turns. Our system act encoder, comprising only a two-layer feedforward network, is in principle faster than the bidirectional GRU that “PrevTurn” uses to encode the system utterance. This notwithstanding, the similar performance of both models suggests that using system dialogue acts for context is a good alternative to using the corresponding system utterance.

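1. Table 1 also lists the best input positions for $a^{t}$ and $o^{t-1}$. Overall, using them to initialize the RNN cell state (B, D) works better than concatenating them to the input at each step (A, C). The authors suggest this may be because $a^{t}$ and $o^{t-1}$ are identical for every user token, which makes the concatenation redundant.
2. For slot tagging accuracy, using $o^{t-1}$ brings no improvement over using $a^{t}$. This suggests a strong correlation between the slots in the system acts and the slots mentioned in the user's reply: the user usually responds directly to the immediately preceding system act, and the reply depends little on earlier turns.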