This post covers a FAIR paper published at EMNLP 2018. It introduces a large-scale open-domain dialogue dataset extracted from Reddit, annotated with a large number of user personas, and its experiments show that these personas improve dialogue-system performance. Models pretrained on this dataset also transfer to other tasks (another FAIR paper, Wizard of Wikipedia: Knowledge-Powered Conversational Agents, uses a Transformer encoder pretrained on it).
Introduction
FAIR has proposed two persona-based dialogue datasets:
Personalizing Dialogue Agents: I have a dog, do you have pets too? (PERSONA-CHAT):
- crowdsourced, with only around 1,100 user personas, so the scale is small;
- hand-constructed, so it may diverge from real conversations.
Training Millions of Personalized Dialogue Agents:
- extracted from Reddit: 5 million personas and 700 million dialogue examples, a genuinely large-scale dialogue dataset;
- extracted with heuristic rules, so it is noisy: a user's persona is not necessarily related to the dialogue content, and may even contradict it;
- well suited for pretraining models.
Building a dataset of millions of persona-based dialogues
A persona-based dialogue example pairs a dialogue with the persona of the responder: the persona is a set of natural-language sentences describing the responder (the authors cap it at N sentences), and the goal is to predict the response given the context and the persona.
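A hypothetical illustration of that data format (the sentences below are invented for illustration, not taken from the dataset):

```python
# A hypothetical persona-based dialogue example (invented; the real
# dataset is extracted from Reddit threads).
example = {
    "persona": [  # up to N sentences describing the responder
        "I love sports.",
        "I work as a nurse.",
    ],
    "context": "What did you do this weekend?",      # preceding comment
    "response": "I went for a long run, as usual.",  # comment to predict
}
```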
We construct the persona of a user by gathering all the comments they wrote, splitting them into sentences, and selecting the sentences that satisfy the following rules:
- each sentence must contain between 4 and 20 words or punctuation marks
- it contains either the word _I_ or _my_
- at least one verb
- at least one noun, pronoun or adjective.
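A minimal sketch of these four rules, assuming spaCy's `en_core_web_sm` model for tokenization and POS tagging (the function name and tag sets are my own, not the paper's code):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def is_persona_sentence(sentence: str) -> bool:
    """Return True if a sentence passes the paper's four persona rules."""
    doc = nlp(sentence)
    # Rule 1: between 4 and 20 words or punctuation marks.
    if not 4 <= len(doc) <= 20:
        return False
    # Rule 2: contains the word "I" or "my".
    if not any(tok.lower_ in ("i", "my") for tok in doc):
        return False
    # Rule 3: at least one verb (counting auxiliaries like "am" is my choice).
    if not any(tok.pos_ in ("VERB", "AUX") for tok in doc):
        return False
    # Rule 4: at least one noun, pronoun or adjective.
    return any(tok.pos_ in ("NOUN", "PROPN", "PRON", "ADJ") for tok in doc)
```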
The authors consider four ways of building a user's persona:
- _rule_: from all of the user's sentences that pass the rules above, randomly select at most N as that user's persona.
- _rule+classifier_: first filter with the rules above, then score each remaining sentence with a classifier and, with a manually chosen threshold, keep the top-N scoring sentences as persona sentences (see the sketch after this list). The classifier is trained with persona sentences from PERSONA-CHAT as positives and randomly sampled Reddit comments as negatives.
- _random from user_: randomly sample sentences written by the same user (the responder), requiring only the length rule and ignoring the others, as that user's persona.
- _random from dataset_: randomly sample sentences from the whole dataset, possibly written by other users, as a control.
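A sketch of the four selection strategies; `scorer` stands in for the persona classifier, whose interface is assumed here:

```python
import random

def select_persona(candidates, method, N, scorer=None, threshold=0.5,
                   user_sentences=None, dataset_sentences=None):
    """Pick up to N persona sentences using one of the four strategies.
    candidates: the user's sentences that already pass the rules above.
    scorer: assumed classifier, maps a sentence to a score in [0, 1]."""
    if method == "rule":
        return random.sample(candidates, min(N, len(candidates)))
    if method == "rule+classifier":
        scored = sorted(((scorer(s), s) for s in candidates),
                        key=lambda pair: pair[0], reverse=True)
        return [s for score, s in scored if score >= threshold][:N]
    if method == "random from user":
        # only the length rule applies for this baseline
        pool = [s for s in user_sentences if 4 <= len(s.split()) <= 20]
        return random.sample(pool, min(N, len(pool)))
    if method == "random from dataset":
        return random.sample(dataset_sentences, min(N, len(dataset_sentences)))
    raise ValueError(f"unknown method: {method}")
```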
We take each pair of successive comments in a thread to form the context and response of an example.
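A small sketch of that pairing, assuming each thread is already an ordered list of comment strings (the helper is illustrative, not the paper's code):

```python
def thread_to_examples(thread):
    """Turn an ordered list of comments into (context, response) pairs:
    each comment is the context for the comment that follows it."""
    return [(thread[i], thread[i + 1]) for i in range(len(thread) - 1)]

# thread_to_examples(["a", "b", "c"]) -> [("a", "b"), ("b", "c")]
```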
End-to-end dialogue models
As in Zhang et al. (2018), we combine the encoded context and persona using a 1-hop memory network with a residual connection, using the context as query and the set of persona sentences as memory.
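A PyTorch sketch of that combination step, assuming the context and the persona sentences are already encoded as d-dimensional vectors (tensor shapes and names are mine, not the paper's):

```python
import torch

def combine_context_persona(context, persona):
    """1-hop memory read with a residual connection.
    context: (B, d) encoded contexts, used as the query.
    persona: (B, M, d) encoded persona sentences, used as the memory.
    returns: (B, d) persona-aware context representation."""
    # attention of each query over its M memory slots
    attn = torch.softmax(torch.einsum("bd,bmd->bm", context, persona), dim=-1)
    read = torch.einsum("bm,bmd->bd", attn, persona)  # weighted sum of memory
    return context + read                             # residual connection
```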
We use mini-batches of training examples and, for each example therein, all the responses of the other examples of the same batch are used as negative responses.
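This in-batch negative scheme amounts to a B-way classification over the batch: every context is scored against every response, and the diagonal entries are the positives. A sketch:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(context_emb, response_emb):
    """context_emb, response_emb: (B, d) encoded contexts and responses.
    Row i's positive is response i; the other B-1 responses are negatives."""
    scores = context_emb @ response_emb.t()  # (B, B) dot-product scores
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```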
The authors experiment with the following encoders:
- Bag-of-words: pass the word embeddings through a fully connected layer, average-pool over all positions, and divide by the square root of the sentence length to get the encoding (see the sketch after this list).
- LSTM: applies a 2-layer bidirectional LSTM; the last hidden state is used as the sentence encoding.
- Transformer encoder: the output representations are averaged across all positions in the sentence, yielding a fixed-size representation.
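A sketch of the bag-of-words encoder as described above (layer names and the absence of padding masks are simplifications of mine):

```python
import torch
import torch.nn as nn

class BowEncoder(nn.Module):
    """Project each word embedding, average over positions, then scale
    by 1/sqrt(length), as described above (padding masks omitted)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                 # tokens: (B, L) word ids
        h = self.proj(self.embed(tokens))      # (B, L, d)
        return h.mean(dim=1) / tokens.size(1) ** 0.5  # (B, d)
```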
Experiments
As summarized above, the experiments show that conditioning on the extracted user personas improves response prediction, and that models pretrained on this dataset transfer well to other dialogue tasks.
Conclusion
This paper is one of FAIR's series on personalized dialogue systems. It introduces a large-scale open-domain dialogue dataset extracted from Reddit with a large number of user personas, well suited for pretraining models for other dialogue tasks.