NLP

Wizard of Wikipedia: Knowledge-Powered Conversational Agents

This paper, published by FAIR at ICLR 2019, introduces an open-domain dialogue dataset grounded in Wikipedia background knowledge, together with two baseline models.

paper link
dataset & code link

Introduction

The Wizard of Wikipedia dataset targets open-domain dialogue: one speaker randomly picks an initial topic, both speakers converse on that basis, and the topic is allowed to drift as the conversation proceeds. The two speakers play different roles, a wizard and an apprentice.

  • wizard: the wizard's goal is to inform the apprentice about background knowledge related to the dialogue topic. Before composing a reply, the wizard is shown relevant Wikipedia passages, which are invisible to the apprentice. The wizard may not simply copy sentences from the Wikipedia text as replies; instead, the wizard must compose responses that integrate the knowledge in their own words.
  • apprentice: the apprentice's goal is to ask in-depth questions related to the dialogue topic, which distinguishes these conversations from ordinary chit-chat.

Conversation Flow. The flow of the conversation thus takes place as follows:

  1. Either the wizard or apprentice is picked to choose the topic and speak first. The other player receives the topic information, and the conversation begins.
  2. When the apprentice sends the wizard a message, the wizard is shown relevant knowledge (described below), and chooses a relevant sentence in order to construct a response, or else chooses the no sentence used option.
  3. The wizard responds to the apprentice, basing their response on their chosen sentence.
  4. The conversation repeats until one of the conversation partners ends the chat (after a minimum of 4 or 5 turns each, randomly chosen beforehand).

HUMAN ANNOTATION INTERFACE (FOR WIZARD)

Table 1: Dataset statistics of the Wizard of Wikipedia task.

Models

Generative Transformer Memory Network. An IR system provides knowledge candidates from Wikipedia. Dialogue Context and Knowledge are encoded using a shared encoder. In the Two-stage model, the dialogue and knowledge are re-encoded after knowledge selection.

The authors propose two baseline models, one retrieval-based and one generative. Both use the same Transformer encoder to obtain vector representations of the dialogue context and the knowledge, and then select knowledge via a memory network.

RETRIEVAL TRANSFORMER MEMORY NETWORK

A Transformer first encodes the knowledge sentences $m_{c_{1}}, \dots, m_{c_{K}}$ and the dialogue context $x$ into vector representations. The context then attends over the knowledge, yielding the combined representation $\mathrm{rep}_{\mathrm{LHS}}\left(m_{c_{1}}, \ldots, m_{c_{K}}, x\right)$. Another Transformer encodes each candidate response as $\mathrm{rep}_{\mathrm{RHS}}(r_{i})$, and the dot product $\mathrm{rep}_{\mathrm{LHS}}\left(m_{c_{1}}, \ldots, m_{c_{K}}, x\right) \cdot \mathrm{rep}_{\mathrm{RHS}}(r_{i})$ is used as the matching score; the highest-scoring candidate is returned as the response.
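
As a rough illustration of this scoring scheme, here is a minimal PyTorch sketch. The function and variable names are illustrative rather than taken from the released code, and the way the attended knowledge is combined with the context encoding is an assumption.

```python
import torch
import torch.nn.functional as F

def retrieval_scores(context_vec, knowledge_vecs, candidate_vecs):
    """Score candidate responses against a knowledge-attended context.

    context_vec:    (d,)   Transformer encoding of the dialogue context x
    knowledge_vecs: (K, d) encodings of knowledge sentences m_{c_1..c_K}
    candidate_vecs: (L, d) encodings of candidate responses r_1..r_L
    """
    # Attention of the context over the knowledge memory.
    attn = F.softmax(knowledge_vecs @ context_vec, dim=0)    # (K,)
    attended_knowledge = attn @ knowledge_vecs               # (d,)

    # rep_LHS: combine context and attended knowledge (a simple sum here,
    # standing in for the memory-network combination step).
    rep_lhs = context_vec + attended_knowledge               # (d,)

    # Dot product against rep_RHS(r_i) for every candidate.
    return candidate_vecs @ rep_lhs                          # (L,)
```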

The model is trained to minimize the cross-entropy loss, where the negative candidates for each example are the responses to the other examples in the batch (Henderson et al., 2017).
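
With in-batch negatives, each gold response is the positive for its own context and a negative for every other context in the batch, which reduces to a cross-entropy over a B x B score matrix. A small sketch under that assumption (names are illustrative):

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(lhs_reps, rhs_reps):
    """lhs_reps, rhs_reps: (B, d) encodings of contexts and their gold
    responses, aligned so row i of each belongs to the same example."""
    scores = lhs_reps @ rhs_reps.t()                         # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    # Diagonal entries are the gold pairs; all other entries act as negatives.
    return F.cross_entropy(scores, targets)
```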

GENERATIVE TRANSFORMER MEMORY NETWORK

The authors propose two variants: a Two-stage version and an End-to-end version.

  • End-to-end: similar to the retrieval model, the context's attention distribution over the knowledge is computed, the most probable knowledge sentence $m_{\mathrm{best}}$ is selected and concatenated with the context encoding, and a Transformer decoder generates the response from this representation. An auxiliary cross-entropy loss is added to encourage selecting the right knowledge: $\mathcal{L}=(1-\lambda) \mathcal{L}_{\mathrm{NLL}}+\lambda \mathcal{L}_{\mathrm{knowledge}}$.
  • Two-stage: here the model is split into two separate sub-tasks, knowledge selection and utterance prediction, which are trained independently. Knowledge selection is trained exactly as in the End-to-end model; once the knowledge $m_{\mathrm{best}}$ is chosen, a separate Transformer encodes the context together with the selected knowledge, and a Transformer decoder generates the response. The authors also introduce a knowledge dropout mechanism to reduce error propagation from knowledge selection. A rough sketch of the auxiliary loss and of knowledge dropout follows this list.
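
The sketch below illustrates the combined End-to-end objective and one plausible reading of knowledge dropout. The function names, the default values of `lam` and `p`, and the exact way the gold sentence is hidden are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def e2e_loss(nll_loss, knowledge_attn_logits, gold_knowledge_idx, lam=0.5):
    """L = (1 - lambda) * L_NLL + lambda * L_knowledge.

    nll_loss:              scalar token-level negative log-likelihood
    knowledge_attn_logits: (K,) attention logits over knowledge sentences
    gold_knowledge_idx:    index of the sentence the wizard actually used
    """
    l_knowledge = F.cross_entropy(
        knowledge_attn_logits.unsqueeze(0),
        torch.tensor([gold_knowledge_idx], device=knowledge_attn_logits.device),
    )
    return (1 - lam) * nll_loss + lam * l_knowledge

def knowledge_dropout(knowledge_vecs, gold_idx, p=0.5, training=True):
    """With probability p, hide the gold knowledge sentence during training
    so the generator does not over-rely on perfect knowledge selection."""
    if training and torch.rand(()).item() < p:
        keep = [i for i in range(knowledge_vecs.size(0)) if i != gold_idx]
        return knowledge_vecs[keep]
    return knowledge_vecs
```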

Experiments

KNOWLEDGE SELECTION TASK

Table 2: Test performance of various methods on the Knowledge Selection Task. The models must select the gold knowledge sentences chosen by humans, given the dialogue context.

FULL TASK: DIALOGUE WITH KNOWLEDGE

The authors evaluate under two conditions: Predicted Knowledge, where the model must pick the matching knowledge sentence from all of the knowledge it is given, and Gold Knowledge, where the model directly uses the knowledge sentence manually chosen by the wizard.

Table 3: Retrieval methods on the full Wizard task. Models must select relevant knowledge and retrieve a response from the training set as a dialogue response. Using knowledge always helps, and the Transformer Memory Network with pretraining performs best.

Table 4: Generative models on the full Wizard task. The Two-stage model performs best using predicted knowledge, while the End-to-end (E2E) model performs best with gold knowledge.

Table 5: Human experiments. Evaluations of the best generative and retrieval models on full dialogues with humans. Human ratings are reported as mean (stddev). Wiki F1 measures unigram overlap with the Wikipedia entry for the chosen topic, a measure of knowledge used in conversations.

Conclusion

The core contribution of this paper is an open-domain dialogue dataset grounded in Wikipedia background knowledge. Judging from the experimental results, current models still fall far short of human performance, which leaves plenty of room for further research.

There is much future work to be explored using our task and dataset. Some of these include:
(i) bridging the gap between the engagingness of retrieval responses versus the ability of generative models to work on new knowledge and topics.
(ii) learning to retrieve and reason simultaneously rather than using a separate IR component.
(iii) investigating the relationship between knowledge-grounded dialogue and existing QA tasks which also employ such IR systems. The aim is for those strands to come together to obtain an engaging and knowledgeable conversational agent.