This post covers a FAIR paper published at EMNLP 2018. It introduces a large-scale open-domain dialogue dataset extracted from Reddit, annotated with a large number of user personas, and its experiments show that these personas improve dialogue-system performance. Models pretrained on this dataset also transfer to other tasks (another FAIR paper, Wizard of Wikipedia: Knowledge-Powered Conversational Agents, uses a Transformer encoder pretrained on it).
Introduction
FAIR has proposed two persona-based dialogue datasets:
Personalizing Dialogue Agents: I have a dog, do you have pets too? (PERSONA-CHAT):
- crowdsourced, with only around 1,100 user personas, so the scale is small;
- hand-constructed, so it may diverge from real conversations.
Training Millions of Personalized Dialogue Agents:
- extracted from Reddit: 5 million personas and 700 million dialogue examples, a genuinely large-scale dialogue dataset;
- extracted with heuristic rules, so it is noisy: a user's persona is not necessarily related to the dialogue content, and may even contradict it;
- well suited for pretraining models.
Building a dataset of millions of persona-based dialogues
A persona-based dialogue example pairs a dialogue with the persona of the responder: the persona is a set of natural-language sentences describing the responder (the authors cap it at N sentences), and the goal is to predict the response given the context and the persona.
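A hypothetical illustration of that data format (the sentences below are invented for illustration, not taken from the dataset):

```python
# A hypothetical persona-based dialogue example (invented; the real
# dataset is extracted from Reddit threads).
example = {
    "persona": [  # up to N sentences describing the responder
        "I love sports.",
        "I work as a nurse.",
    ],
    "context": "What did you do this weekend?",      # preceding comment
    "response": "I went for a long run, as usual.",  # comment to predict
}
```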
We construct the persona of a user by gathering all the comments they wrote, splitting them into sentences, and selecting the sentences that satisfy the following rules:
- each sentence must contain between 4 and 20 words or punctuation marks
- it contains either the word _I_ or _my_
- at least one verb
- at least one noun, pronoun or adjective.
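A minimal sketch of these four rules, assuming spaCy's `en_core_web_sm` model for tokenization and POS tagging (the function name and tag sets are my own, not the paper's code):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def is_persona_sentence(sentence: str) -> bool:
    """Return True if a sentence passes the paper's four persona rules."""
    doc = nlp(sentence)
    # Rule 1: between 4 and 20 words or punctuation marks.
    if not 4 <= len(doc) <= 20:
        return False
    # Rule 2: contains the word "I" or "my".
    if not any(tok.lower_ in ("i", "my") for tok in doc):
        return False
    # Rule 3: at least one verb (counting auxiliaries like "am" is my choice).
    if not any(tok.pos_ in ("VERB", "AUX") for tok in doc):
        return False
    # Rule 4: at least one noun, pronoun or adjective.
    return any(tok.pos_ in ("NOUN", "PROPN", "PRON", "ADJ") for tok in doc)
```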
The authors consider four ways of building a user's persona:
- _rule_: from all of the user's sentences that pass the rules above, randomly select at most N as that user's persona.
- _rule+classifier_: first filter with the rules above, then score each remaining sentence with a classifier and, with a manually chosen threshold, keep the top-N scoring sentences as persona sentences (see the sketch after this list). The classifier is trained with persona sentences from PERSONA-CHAT as positives and randomly sampled Reddit comments as negatives.
- _random from user_: randomly sample sentences written by the same user (the responder), requiring only the length rule and ignoring the others, as that user's persona.
- _random from dataset_: randomly sample sentences from the whole dataset, possibly written by other users, as a control.
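A sketch of the four selection strategies; `scorer` stands in for the persona classifier, whose interface is assumed here:

```python
import random

def select_persona(candidates, method, N, scorer=None, threshold=0.5,
                   user_sentences=None, dataset_sentences=None):
    """Pick up to N persona sentences using one of the four strategies.
    candidates: the user's sentences that already pass the rules above.
    scorer: assumed classifier, maps a sentence to a score in [0, 1]."""
    if method == "rule":
        return random.sample(candidates, min(N, len(candidates)))
    if method == "rule+classifier":
        scored = sorted(((scorer(s), s) for s in candidates),
                        key=lambda pair: pair[0], reverse=True)
        return [s for score, s in scored if score >= threshold][:N]
    if method == "random from user":
        # only the length rule applies for this baseline
        pool = [s for s in user_sentences if 4 <= len(s.split()) <= 20]
        return random.sample(pool, min(N, len(pool)))
    if method == "random from dataset":
        return random.sample(dataset_sentences, min(N, len(dataset_sentences)))
    raise ValueError(f"unknown method: {method}")
```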
We take each pair of successive comments in a thread to form the context and response of an example.
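A small sketch of that pairing, assuming each thread is already an ordered list of comment strings (the helper is illustrative, not the paper's code):

```python
def thread_to_examples(thread):
    """Turn an ordered list of comments into (context, response) pairs:
    each comment is the context for the comment that follows it."""
    return [(thread[i], thread[i + 1]) for i in range(len(thread) - 1)]

# thread_to_examples(["a", "b", "c"]) -> [("a", "b"), ("b", "c")]
```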
End-to-end dialogue models
As in Zhang et al. (2018), we combine the encoded context and persona using a 1-hop memory network with a residual connection, using the context as query and the set of persona sentences as memory.
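A PyTorch sketch of that combination step, assuming the context and the persona sentences are already encoded as d-dimensional vectors (tensor shapes and names are mine, not the paper's):

```python
import torch

def combine_context_persona(context, persona):
    """1-hop memory read with a residual connection.
    context: (B, d) encoded contexts, used as the query.
    persona: (B, M, d) encoded persona sentences, used as the memory.
    returns: (B, d) persona-aware context representation."""
    # attention of each query over its M memory slots
    attn = torch.softmax(torch.einsum("bd,bmd->bm", context, persona), dim=-1)
    read = torch.einsum("bm,bmd->bd", attn, persona)  # weighted sum of memory
    return context + read                             # residual connection
```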
We use mini-batches of training examples and, for each example therein, all the responses of the other examples of the same batch are used as negative responses.
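This in-batch negative scheme amounts to a B-way classification over the batch: every context is scored against every response, and the diagonal entries are the positives. A sketch:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(context_emb, response_emb):
    """context_emb, response_emb: (B, d) encoded contexts and responses.
    Row i's positive is response i; the other B-1 responses are negatives."""
    scores = context_emb @ response_emb.t()  # (B, B) dot-product scores
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```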
The authors experiment with the following encoders:
- Bag-of-words: pass the word embeddings through a fully connected layer, average-pool over all positions, and divide by the square root of the sentence length to get the encoding (see the sketch after this list).
- LSTM: applies a 2-layer bidirectional LSTM; the last hidden state is used as the sentence encoding.
- Transformer encoder: the output representations are averaged across all positions in the sentence, yielding a fixed-size representation.
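A sketch of the bag-of-words encoder as described above (layer names and the absence of padding masks are simplifications of mine):

```python
import torch
import torch.nn as nn

class BowEncoder(nn.Module):
    """Project each word embedding, average over positions, then scale
    by 1/sqrt(length), as described above (padding masks omitted)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                 # tokens: (B, L) word ids
        h = self.proj(self.embed(tokens))      # (B, L, d)
        return h.mean(dim=1) / tokens.size(1) ** 0.5  # (B, d)
```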
Experiments
As summarized above, the experiments show that conditioning on the extracted user personas improves response prediction, and that models pretrained on this dataset transfer well to other dialogue tasks.
Conclusion
This paper is one of FAIR's series on personalized dialogue systems. It introduces a large-scale open-domain dialogue dataset extracted from Reddit with a large number of user personas, well suited for pretraining models for other dialogue tasks.