Personalizing Dialogue Agents: I have a dog, do you have pets too?

This post covers a paper from FAIR on PERSONA-CHAT, a dataset for personalized chit-chat dialogue systems, published at ACL 2018. The paper introduces a new dialogue dataset in which both participants are assigned personas, and provides a set of baseline models. The ConvAI2 competition at NIPS 2018 was held on this dataset.

paper link
dataset link



The paper identifies three common weaknesses of chit-chat agents:

  1. agents lack a consistent personality
  2. agents lack an explicit memory of the long-term dialogue history
  3. agents tend to produce generic, non-specific responses


  1. Personas: we crowdsource a set of 1155 possible personas, each consisting of at least 5 profile sentences, setting aside 100 never seen before personas for validation, and 100 for test.
  2. Revised personas: to avoid modeling that takes advantage of trivial word overlap, we crowdsource additional rewritten sets of the same 1155 personas, with related sentences that are rephrases, generalizations or specializations, rendering the task much more challenging.
  3. Persona chat: we pair two Turkers and assign them each a random (original) persona from the pool, and ask them to chat. This resulted in a dataset of 162,064 utterances over 10,907 dialogs, 15,602 utterances (1000 dialogs) of which are set aside for validation, and 15,024 utterances (968 dialogs) for test.


We asked the workers to make each sentence short, with a maximum of 15 words per sentence.

Table 1: Example Personas (left) and their revised versions (right) from the PERSONA-CHAT dataset. The revised versions are designed to be characteristics that the same persona might have, which could be rephrases, generalizations or specializations.


Table 2: Example dialog from the PERSONA-CHAT dataset. Person 1 is given their own persona (top left) at the beginning of the chat, but does not know the persona of Person 2, and vice-versa. They have to get to know each other during the conversation.


In an early study we noticed the crowdworkers tending to talk about themselves (their own persona) too much, so we also added the instruction "both ask questions and answer questions of your chat partner," which seemed to help.

The two workers take turns, with each message capped at 15 words; they are not allowed to copy descriptions from their persona verbatim, and each dialogue must run for at least 6 to 8 turns (see Table 2).


We consider this in four possible scenarios: conditioning on no persona, your own persona, their persona, or both. These scenarios can be tried using either the original personas, or the revised ones. We then evaluate the task using three metrics: (i) the log likelihood of the correct sequence, measured via perplexity, (ii) F1 score, and (iii) next utterance classification loss, following Lowe et al. (2015). The latter consists of choosing N random distractor responses from other dialogues (in our setting, N=19) and the model selecting the best response among them, resulting in a score of one if the model chooses the correct response, and zero otherwise (called hits@1 in the experiments).
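The hits@1 metric is easy to state concretely. The sketch below is a toy illustration, not the paper's model: `overlap_score` is a stand-in scoring function I made up, and any trained ranking model's similarity score could take its place. The correct response is ranked against the distractors (the paper uses N = 19) and the metric is 1 only if it comes out on top:

```python
def hits_at_1(model_score, context, correct_response, distractors):
    """Return 1 if the model ranks the correct response above all distractors.

    model_score(context, response) -> float is an assumed scoring function;
    in the paper this would be a ranking model's similarity score.
    """
    candidates = [correct_response] + list(distractors)
    scores = [model_score(context, c) for c in candidates]
    # index 0 holds the correct response; hits@1 is 1 iff it scores highest
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return 1 if best == 0 else 0

# toy scorer (an assumption, not the paper's model): word overlap with context
def overlap_score(context, response):
    return len(set(context.split()) & set(response.split()))

context = "i have a dog , do you have pets too ?"
correct = "yes , i have two cats ."
distractors = ["what is your favorite color ?", "the weather is nice today"]
print(hits_at_1(overlap_score, context, correct, distractors))  # → 1
```

Averaging this over the test set gives the hits@1 numbers reported in the paper's tables (there with N = 19 distractors rather than two).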



  1. Baseline ranking models: IR and StarSpace
  2. Ranking Profile Memory Network: each profile sentence $p_{i}$ is treated as a memory slot, and the dialogue history serves as the query $q$; attention over the memories yields an attended query $q^{+}$.

    The similarity between $q^{+}$ and each candidate response is then computed.

  3. Key-Value Profile Memory Network: the first hop is identical to the Profile Memory Network above and produces $q^{+}$; a second attention hop then uses $q^{+}$ as the query, with dialogue histories as keys and their corresponding next utterances as values, yielding $q^{++}$, which is finally scored against the candidate responses by similarity.

  4. Seq2Seq
  5. Generative Profile Memory Network
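The attention hop shared by the two Profile Memory models (items 2 and 3 above) can be sketched in plain Python. The 2-d vectors here are toy embeddings of my own choosing; the actual model learns word embeddings and represents each sentence as their sum:

```python
import math

def attend(query, memories):
    """One attention hop: q+ = q + sum_i softmax(q . p_i) * p_i.

    query and memories are plain lists of floats standing in for learned
    sentence embeddings (the embedding training is omitted here).
    """
    sims = [sum(qd * md for qd, md in zip(query, m)) for m in memories]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]          # numerically stable softmax
    weights = [e / sum(exps) for e in exps]
    attended = [sum(w * m[d] for w, m in zip(weights, memories))
                for d in range(len(query))]
    return [qd + ad for qd, ad in zip(query, attended)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# rank candidate responses by similarity with the attended query q+
q = [1.0, 0.0]                       # dialogue-history embedding (toy)
profiles = [[0.0, 1.0], [1.0, 0.0]]  # two profile-sentence embeddings (toy)
q_plus = attend(q, profiles)
candidates = [[1.0, 0.0], [0.0, 1.0]]
scores = [cosine(q_plus, c) for c in candidates]
```

The Key-Value variant runs this same hop a second time with $q^{+}$ as the query, dialogue histories as keys, and their next utterances as values.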


Automated metrics

Table 3: Evaluation of dialog utterance prediction with various models in three settings: without conditioning on a persona, conditioned on the speaker's given persona ("Original Persona"), or a revised persona that does not have word overlap.

Most models improve significantly when conditioning prediction on their own persona, at least for the original (non-revised) versions, which is an easier task than the revised ones which have no word overlap.

The results in Table 3 all condition on the speaker's own persona. The authors also ran a comparison to evaluate models conditioned on the partner's persona, or on both personas:

Table 5: Evaluation of dialog utterance prediction with generative models in four settings: conditioned on the speaker's persona ("self persona"), the dialogue partner's persona ("their persona"), both or none. The personas are either the original source given to Turkers to condition the dialogue, or the revised personas that do not have word overlap. In the "no persona" setting, the models are equivalent, so we only report once.

Table 6: Evaluation of dialog utterance prediction with ranking models using hits@1 in four settings: conditioned on the speaker's persona ("self persona"), the dialogue partner's persona ("their persona"), both or none. The personas are either the original source given to Turkers to condition the dialogue, or the rewritten personas that do not have word overlap, explaining the poor performance of IR in that case.

We can also condition a model on the other speaker's persona, or both personas at once, the results of which are in Tables 5 and 6 in the Appendix. Using "Their persona" has less impact on this dataset. We believe this is because most speakers tend to focus on themselves when it comes to their interests. It would be interesting to know how often this is the case in other datasets. Certainly this is skewed by the particular instructions one could give to the crowdworkers. For example, if we gave the instructions "try not to talk about yourself, but about the other's interests," likely these metrics would change.

_Details in the original paper_

Human Evaluation

Table 4: Human Evaluation of various PERSONA-CHAT models, along with a comparison to human performance, and Twitter and OpenSubtitles based models (last 4 rows), standard deviation in parenthesis.

Finding the balance between fluency, engagement, consistency, and a persistent persona remains a strong challenge for future research.

Two tasks could naturally be considered using PERSONA-CHAT: (1) next utterance prediction during dialogue, and (2) profile prediction given dialogue history.