Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning

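This paper is a follow-up from the team behind Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. It mainly addresses the original DDQ model's heavy dependence on the quality of the simulated dialogues generated by the world model: by introducing a discriminator that distinguishes real dialogues from simulated ones, it improves the robustness and effectiveness of DDQ.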
paper link
code link



Furthermore, to the best of our knowledge, there is no universally accepted metric for evaluating user simulators for dialogue purposes (ref: A survey on metrics for the evaluation of user simulations). Therefore, it remains controversial whether training a task-completion dialogue agent via simulated users is a valid and effective approach.

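Earlier work, Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning, proposed a new framework, DDQ: the environment is modeled from real dialogue data (the world model), and the RL agent then learns the dialogue policy by interacting with both real dialogues and dialogues generated by the world model. In the DDQ framework, real dialogue experience serves two purposes: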

  • directly improve the dialogue policy via RL;
  • improve the world model via supervised learning to make it behave more human-like.

The former is referred to as direct reinforcement learning, and the latter as world model learning. Accordingly, the policy model is trained on real experiences collected by interacting with real users (direct reinforcement learning) and on simulated experiences collected by interacting with the learned world model (planning, or indirect reinforcement learning).


Although at the early stages of dialogue training it is helpful to perform planning aggressively with large amounts of simulated experience regardless of its quality, in the late stages, when the dialogue agent has improved significantly, low-quality simulated experience often hurts performance badly.

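The Discriminative Deep Dyna-Q (D3Q) model proposed in this paper borrows the idea of GANs: it introduces a discriminator in the planning stage to distinguish simulated dialogues from real ones.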

Figure 1: Proposed D3Q for dialogue policy learning

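As shown in Figure 1, every simulated dialogue experience produced by the world model is assessed by the discriminator; only simulated dialogues that the discriminator cannot tell apart from real ones are considered high quality and used for planning.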

During the course of dialogue training, both the world model and discriminator are refined using the real experiences. So, the quality threshold held by the discriminator goes up with the world model and dialogue agent, especially in the late stage of training.


  1. The proposed Discriminative Deep Dyna-Q approach is capable of controlling the quality of simulated experiences generated by the world model in the planning phase, which enables effective and robust dialogue policy learning.
  2. The proposed model is verified by experiments including simulation, human evaluation, and domain-extension settings, where all results show better sample efficiency over the DDQ baselines.

Discriminative Deep Dyna-Q (D3Q)

Figure 2: Illustration of the proposed D3Q dialogue system framework

Starting with an initial dialogue policy and an initial world model (both trained on pre-collected human conversational data), D3Q training consists of four stages: (1) direct reinforcement learning: the agent interacts with real users, collects real experiences and improves the dialogue policy; (2) world model learning: the world model is learned and refined using real experience; (3) discriminator learning: the discriminator is learned and refined to differentiate simulated experience from real experience; and (4) controlled planning: the agent improves the dialogue policy using the high-quality simulated experience generated by the world model and the discriminator.
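The four stages can be pictured as a single loop. The sketch below is only an illustration under stated assumptions: every class and method name (`ToyAgent`, `interact`, `simulate`, `accepts`, and so on) is hypothetical, the toy components do no real learning, and the "quality" field is a made-up stand-in for whatever the discriminator would score. It is not the authors' implementation.

```python
import random


class ToyAgent:
    """Hypothetical policy: only counts how many times it was updated."""
    def __init__(self):
        self.updates = 0

    def update(self, batch):
        self.updates += 1


class ToyUser:
    """Hypothetical real user: always returns one 'real' experience."""
    def interact(self, agent):
        return {"source": "real", "quality": 1.0}


class ToyWorldModel:
    """Hypothetical world model: emits simulated experience of random quality."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def fit(self, real_buffer):
        pass  # a real world model would be refined by supervised learning here

    def simulate(self, agent):
        return {"source": "sim", "quality": self.rng.random()}


class ToyDiscriminator:
    """Hypothetical discriminator: accepts experience above a fixed quality."""
    def fit(self, real_buffer, simulated):
        pass  # a real discriminator would be trained on real vs. simulated here

    def accepts(self, exp):
        return exp["quality"] >= 0.5


def d3q_training_loop(agent, world_model, discriminator, user,
                      epochs=3, planning_steps=4):
    real_buffer, high_quality_buffer = [], []
    for _ in range(epochs):
        # (1) Direct reinforcement learning on real experience.
        real_buffer.append(user.interact(agent))
        agent.update(real_buffer)
        # (2) World model learning from real experience.
        world_model.fit(real_buffer)
        # (3) Discriminator learning: real vs. simulated experience.
        simulated = [world_model.simulate(agent) for _ in range(planning_steps)]
        discriminator.fit(real_buffer, simulated)
        # (4) Controlled planning: plan only on experience the discriminator keeps.
        high_quality_buffer.extend(e for e in simulated if discriminator.accepts(e))
        if high_quality_buffer:
            agent.update(high_quality_buffer)
    return real_buffer, high_quality_buffer


agent = ToyAgent()
real, high = d3q_training_loop(agent, ToyWorldModel(), ToyDiscriminator(), ToyUser())
```

The point of the arrangement is that the policy never sees simulated experience directly; everything passes through the discriminator first.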

Direct Reinforcement Learning

_DQN: details in Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning_

World Model Learning

_The world model is the same as in the original DDQ_

Discriminator Learning


Figure 3: The model architectures of the world model and the discriminator for controlled planning

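The discriminator D judges the quality of the dialogues generated by the world model. Its structure is shown on the right of Figure 3: an LSTM encodes the dialogue context into a representation vector, and an MLP then predicts a probability between 0 and 1 indicating how similar the simulated dialogue is to a real one. The objective function of D is:

$$\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right) + \log\left(1 - D\left(G(\cdot)^{(i)}\right)\right)\right]$$

where m is the batch size, $x$ is a real experience, and $G(\cdot)$ is an experience generated by the world model.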
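A discriminator that outputs a probability is typically trained with a GAN-style log-likelihood objective averaged over a minibatch of size m. The sketch below computes that quantity from discriminator outputs alone; the function name, the `eps` guard against `log(0)`, and the example scores are assumptions made for illustration, and the LSTM/MLP that would produce the scores is omitted.

```python
import math


def discriminator_objective(real_scores, sim_scores, eps=1e-12):
    """Mean over the batch of log D(real) + log(1 - D(simulated)).

    real_scores: D's outputs on m real experiences, each in [0, 1].
    sim_scores:  D's outputs on m simulated experiences, each in [0, 1].
    D is trained to maximize this value.
    """
    if len(real_scores) != len(sim_scores) or not real_scores:
        raise ValueError("expected two non-empty batches of equal size")
    m = len(real_scores)
    total = 0.0
    for d_real, d_sim in zip(real_scores, sim_scores):
        total += math.log(d_real + eps) + math.log(1.0 - d_sim + eps)
    return total / m


# Made-up scores: a discriminator that separates the two sources well
# versus one that outputs 0.5 for everything.
sharp = discriminator_objective([0.9, 0.8], [0.1, 0.2])
blind = discriminator_objective([0.5, 0.5], [0.5, 0.5])
```

A discriminator that cannot separate the two sources sits at `2 * log(0.5)`; the objective rises toward 0 as it becomes more accurate.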

Controlled Planning


  • $B^{u}$: stores real dialogue experience
  • $B^{s}$: stores all simulated dialogues
  • $B^{h}$: stores the simulated dialogues that D judges to be high quality
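The three buffers can be pictured as plain lists. In the sketch below, `b_u`, `b_s` and `b_h` stand in for $B^{u}$, $B^{s}$ and $B^{h}$; the function name, the `threshold` of 0.5 and the example scores are assumptions for illustration only, since the exact acceptance rule is not given here.

```python
def controlled_planning_step(simulated, score_fn, b_s, b_h, threshold=0.5):
    """Store every simulated experience in b_s; copy the ones D rates highly into b_h.

    score_fn maps an experience to D's probability that it is real.
    threshold is a hypothetical cut-off, not a value taken from the paper.
    """
    for exp in simulated:
        b_s.append(exp)
        if score_fn(exp) >= threshold:
            b_h.append(exp)
    return len(b_h)


b_u = ["real_1", "real_2"]   # B^u: real experience (placeholder items)
b_s, b_h = [], []            # B^s: all simulated; B^h: high-quality simulated

# Made-up discriminator outputs for three simulated experiences.
toy_scores = {"sim_a": 0.9, "sim_b": 0.1, "sim_c": 0.6}
kept = controlled_planning_step(list(toy_scores), toy_scores.get, b_s, b_h)
```

Planning would then sample from `b_h` rather than `b_s`, which is what keeps low-quality simulated experience away from the policy.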

Figure 4: The learning curves of DDQ(K) agents, where (K - 1) is the number of planning steps

Figure 4 shows the performance of DDQ agents with different planning steps without heuristics. It is observable that the performance is unstable, especially for larger planning steps, which indicates that the quality of simulated experience is becoming more pivotal as the number of planning steps increases.


Dataset & Baselines

_details in the original paper_
There are two dataset settings: full domain and domain extension.

Table 1: The data schema for full domain and domain extension settings

Simulation Evaluation

In this setting, the dialogue agents are optimized by interacting with the user simulators instead of with real users. In other words, the world model is trained to mimic user simulators. In spite of the discrepancy between simulators and real users, this setting gives us the flexibility to perform a detailed analysis of models without much cost, and to reproduce experimental results easily.

Figure 5: The learning curves of agents (DQN, DDQ, and D3Q) under the full domain setting

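Figure 5 shows how the different models perform in the full domain setting: D3Q (here the planning step is actually 4, the same as DDQ(5)) far outperforms DQN and DDQ(5), and converges about as fast as DQN(5).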

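Figure 4 shows that DDQ is sensitive to the quality of the simulated dialogues (that is, to the number of planning steps), whereas D3Q is considerably more robust; see Figure 6 (compare with Figure 4).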

Figure 6: The learning curves of D3Q(K) agents, where (K - 1) is the number of planning steps (K = 2, 3, 5, 10, 15)

Domain Extension

Figure 8: The learning curves of agents (DQN, DDQ, and D3Q) under the domain extension setting

The results summarized in Figure 8 show that D3Q significantly outperforms the baseline methods, demonstrating its robustness. Furthermore, D3Q shows remarkable learning efficiency while extending the domain, even outperforming DQN(5). A potential reason might be that the world model could improve exploration in such an unstable and noisy environment.

Human Evaluation

In the human evaluation experiments, real users interact with different models without knowing which agent is behind the system.

The user can abandon the task and terminate the dialogue at any time if she or he believes that the dialogue is unlikely to succeed, or simply because the dialogue drags on for too many turns. In such cases, the dialogue session is considered a failure.

Figure 9: The human evaluation results of DQN, DDQ(5), and D3Q in the full domain setting, the number of test dialogues indicated on each bar, and the p-values from a two-sided permutation test (difference in mean is significant with p < 0.05)
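For readers unfamiliar with the test named in the caption, the sketch below shows one common way to run a two-sided permutation test on the difference in mean success rate between two agents. The function name, the resample count, the seed and the outcome lists are all hypothetical; none of the numbers come from the paper.

```python
import random


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    a, b: lists of per-dialogue outcomes (1 = success, 0 = failure).
    Returns an estimated p-value, using the add-one correction.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign outcomes to the two groups
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Hypothetical outcomes, not data from the paper.
p_same = permutation_test([1, 0, 1, 0], [1, 0, 1, 0])
p_diff = permutation_test([1] * 20, [0] * 20)
```

Shuffling the pooled outcomes simulates the null hypothesis that the agent label has no effect on success, so a small p-value means the observed gap is rarely matched by chance relabeling.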


This paper proposes a new framework, Discriminative Deep Dyna-Q (D3Q), for task-completion dialogue policy learning. With a discriminator as judge, the proposed approach is capable of controlling the quality of simulated experience generated in the planning phase, which enables efficient and robust dialogue policy learning. Furthermore, D3Q can be viewed as a generic model-based RL approach that is easily extensible to other RL problems.