This post is about a paper from FAIR on PERSONA-CHAT, a dataset for personalized chit-chat dialogue systems, published at ACL 2018. The paper's main contribution is a new dialogue dataset in which both speakers are given personas, together with a set of baseline models. The ConvAI2 competition at NIPS 2018 was run on this dataset.
Introduction
Current chit-chat systems suffer from three main problems:
- agents lack a consistent personality
- they lack an understanding of the long-term dialogue history
- they tend to produce generic responses
PERSONA-CHAT is a crowdsourced dataset in which each speaker chats with a partner while playing the part of a given profile. The collection process consists of three steps:
- Personas: we crowdsource a set of 1155 possible personas, each consisting of at least 5 profile sentences, setting aside 100 never seen before personas for validation, and 100 for test.
- Revised personas: to avoid modeling that takes advantage of trivial word overlap, we crowdsource additional rewritten sets of the same 1155 personas, with related sentences that are rephrases, generalizations or specializations, rendering the task much more challenging.
- Persona chat: we pair two Turkers and assign them each a random (original) persona from the pool, and ask them to chat. This resulted in a dataset of 162,064 utterances over 10,907 dialogs, 15,602 utterances (1000 dialogs) of which are set aside for validation, and 15,024 utterances (968 dialogs) for test.
Each persona consists of 5 natural-language profile sentences, which speakers can draw on over the course of the conversation (example personas are shown in the paper).
We asked the workers to make each sentence short, with a maximum of 15 words per sentence.
To keep models from simply exploiting trivial word overlap, a second group of crowdworkers rewrote the personas collected in the first step; the rewritten sentences may not reuse any word from the original (stopwords excepted).
After collecting the personas, two speakers are randomly assigned different personas and asked to chat, with the goal of getting to know each other.
In an early study we noticed the crowdworkers tending to talk about themselves (their own persona) too much, so we also added the instructions "both ask questions and answer questions of your chat partner" which seemed to help.
The speakers take turns, with each turn limited to 15 words; they are not allowed to copy their profile sentences directly, and each dialogue has a minimum of 6–8 turns (see Table 2).
For evaluating model performance, the paper focuses on predicting the next utterance given the dialogue history:
We consider this in four possible scenarios: conditioning on no persona, your own persona, their persona, or both. These scenarios can be tried using either the original personas, or the revised ones. We then evaluate the task using three metrics: (i) the log likelihood of the correct sequence, measured via perplexity, (ii) F1 score, and (iii) next utterance classification loss, following Lowe et al. (2015). The latter consists of choosing N random distractor responses from other dialogues (in our setting, N=19) and the model selecting the best response among them, resulting in a score of one if the model chooses the correct response, and zero otherwise (called hits@1 in the experiments).
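As a concrete illustration of metric (iii), the sketch below computes hits@1 for a generic scoring function: the gold response plus N random distractors from other dialogues are scored, and the metric counts how often the gold response is ranked first. This is only an assumed, minimal reimplementation for illustration, not the paper's evaluation code; `score_fn`, `hits_at_1`, and the toy corpus are hypothetical names.

```python
# Minimal sketch of next-utterance classification (hits@1), assuming a generic
# model interface score_fn(history, response) -> float (higher = better match).
# Illustrative only; not the evaluation code released with the paper.
import random

def hits_at_1(score_fn, examples, all_responses, n_distractors=19, seed=0):
    """examples: list of (dialogue_history, gold_response) pairs."""
    rng = random.Random(seed)
    hits = 0
    for history, gold in examples:
        # draw N distractor responses from the rest of the corpus
        distractors = rng.sample([r for r in all_responses if r != gold],
                                 n_distractors)
        candidates = [gold] + distractors
        best = max(candidates, key=lambda r: score_fn(history, r))
        hits += int(best == gold)
    return hits / len(examples)

# toy usage with a trivial word-overlap scorer and only 2 distractors
corpus = ["i love gardening", "i have two dogs", "my favorite band is metallica"]
examples = [("do you have any pets ?", "i have two dogs")]
overlap = lambda h, r: len(set(h.split()) & set(r.split()))
print(hits_at_1(overlap, examples, corpus, n_distractors=2))  # -> 1.0
```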
Models
The baseline models in the paper fall into two classes: ranking (retrieval) models and generative models.
- Baseline ranking models: IR and StarSpace
- Ranking Profile Memory Network: the profile sentences $p_{i}$ serve as the memory and the dialogue history as the query $q$. The query attends over the memory, $s_{i} = \mathrm{Softmax}(q \cdot p_{i})$, $q^{+} = q + \sum_{i} s_{i} p_{i}$, and the similarity between $q^{+}$ and each candidate response is then computed (a sketch of this attention step is given after the model list below).
- Key-Value Profile Memory Network: the first hop is the same as in the Ranking Profile Memory Network; after obtaining $q^{+}$, a second attention hop is performed in which the keys are dialogue histories and the values are the corresponding next utterances, yielding $q^{++}$, whose similarity with the candidate responses is then computed.
- Seq2Seq
- Generative Profile Memory Network
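The attention step of the Ranking Profile Memory Network mentioned above can be sketched as follows, assuming the profile sentences, the dialogue history, and the candidate responses have already been embedded as fixed-size vectors (e.g. sums of word embeddings). This is a simplified illustration, not the authors' implementation, and all function names are hypothetical.

```python
# Simplified sketch of the Ranking Profile Memory Network attention step:
# the dialogue-history embedding q attends over the profile sentences p_i,
# q^+ = q + sum_i s_i * p_i with s_i = softmax(q . p_i), and candidates are
# ranked by similarity with q^+. Not the authors' code.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend_over_profile(q, profile):
    """One memory hop. q: (d,) query; profile: (m, d) memory of profile sentences."""
    scores = softmax(profile @ q)   # attention weights s_i over profile sentences
    return q + scores @ profile     # q^+

def rank_candidates(q_plus, candidates):
    """Rank candidate response embeddings (n, d) by cosine similarity with q^+."""
    sims = candidates @ q_plus / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(q_plus) + 1e-8
    )
    return np.argsort(-sims)        # candidate indices, best first

# toy usage with random embeddings: 5 profile sentences, 20 candidates, d = 16
rng = np.random.default_rng(0)
q_plus = attend_over_profile(rng.normal(size=16), rng.normal(size=(5, 16)))
print(rank_candidates(q_plus, rng.normal(size=(20, 16)))[0])
```

The Key-Value variant would add a second hop of the same form, with dialogue histories as keys and their next utterances as values.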
Experiments
Automated metrics
Most models improve significantly when conditioning prediction on their own persona, at least for the original (non-revised) versions, which is an easier task than the revised ones which have no word overlap.
The results in Table 3 all condition on the speaker's own persona; the authors also ran comparison experiments to test conditioning on the partner's persona or on both personas:
We can also condition a model on the other speaker's persona, or both personas at once, the results of which are in Tables 5 and 6 in the Appendix. Using "Their persona" has less impact on this dataset. We believe this is because most speakers tend to focus on themselves when it comes to their interests. It would be interesting how often this is the case in other datasets. Certainly this is skewed by the particular instructions one could give to the crowdworkers. For example if we gave the instructions "try not to talk about yourself, but about the other's interests" likely these metrics would change.
_Details in the original paper_
Human Evaluation
Finding the balance between fluency, engagement, consistency, and a persistent persona remains a strong challenge for future research.
Two tasks could naturally be considered using PERSONA-CHAT: (1) next utterance prediction during dialogue, and (2) profile prediction given dialogue history.