This paper is a follow-up by the team behind "Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning". It addresses the original DDQ model's heavy dependence on the quality of the simulated dialogues generated by the world model: by introducing a discriminator that distinguishes real dialogues from simulated ones, it improves the robustness and effectiveness of DDQ.
Furthermore, to the best of our knowledge, there is no universally accepted metric for evaluating user simulators for dialogue purposes (ref: "A Survey on Metrics for the Evaluation of User Simulations"). It therefore remains controversial whether training a task-completion dialogue agent via simulated users is a valid and effective approach.
Prior work, "Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning", proposed a new framework, DDQ: the environment is modeled from real dialogue data (the world model), and then both real dialogues and dialogues generated by the world model interact with the RL agent to learn the dialogue policy. In the DDQ framework, real dialogue experience plays two roles:
- directly improve the dialogue policy via RL;
- improve the world model via supervised learning to make it behave more human-like.
The former is referred to as direct reinforcement learning and the latter as world model learning. Correspondingly, the policy model is trained on real experiences collected by interacting with real users (direct reinforcement learning) and on simulated experiences collected by interacting with the learned world model (planning, or indirect reinforcement learning).
Although at the early stages of dialogue training it is helpful to perform planning aggressively with large amounts of simulated experience regardless of its quality, in the late stages, when the dialogue agent has significantly improved, low-quality simulated experience often hurts performance badly.
The Discriminative Deep Dyna-Q (D3Q) model proposed in this paper borrows the idea of GANs: it introduces a discriminator in the planning phase to distinguish simulated dialogues from real ones.
As shown in Figure 1, every simulated experience produced by the world model is judged by the discriminator; only simulated dialogues that the discriminator cannot distinguish from real ones are considered high quality and used for planning.
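This filtering step can be sketched as follows. Everything here is a hypothetical stand-in: in the paper the discriminator is an LSTM-based classifier over dialogue contexts, not a length heuristic.

```python
def filter_simulated_experiences(experiences, discriminator, threshold=0.5):
    """Keep only simulated experiences the discriminator scores as
    sufficiently 'real'; the rest are discarded before planning."""
    return [exp for exp in experiences if discriminator(exp) > threshold]

# Toy discriminator (assumption for illustration only): pretend that
# longer dialogues look more "real".
def toy_discriminator(exp):
    return min(1.0, len(exp) / 10)

simulated = [list(range(n)) for n in (2, 6, 12)]
high_quality = filter_simulated_experiences(simulated, toy_discriminator)
print(len(high_quality))  # → 2: only the two dialogues scored above 0.5 survive
```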
During the course of dialogue training, both the world model and the discriminator are refined using real experiences, so the quality threshold enforced by the discriminator rises together with the world model and the dialogue agent, especially in the late stage of training.
- The proposed Discriminative Deep Dyna-Q approach is capable of controlling the quality of the simulated experiences generated by the world model in the planning phase, which enables effective and robust dialogue policy learning.
- The proposed model is verified by experiments including simulation, human evaluation, and domain-extension settings, where all results show better sample efficiency over the DDQ baselines.
Starting with an initial dialogue policy and an initial world model (both trained on pre-collected human conversational data), D3Q training consists of four stages: (1) direct reinforcement learning: the agent interacts with real users, collects real experiences, and improves the dialogue policy; (2) world model learning: the world model is learned and refined using real experience; (3) discriminator learning: the discriminator is learned and refined to differentiate simulated experience from real experience; and (4) controlled planning: the agent improves the dialogue policy using the high-quality simulated experience generated by the world model and filtered by the discriminator.
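The four stages above can be sketched as one training epoch. All classes here are toy stand-ins invented for illustration (the real components are neural networks), but the control flow mirrors the D3Q loop:

```python
class ToyWorldModel:
    """Stand-in world model: 'simulates' dialogues by replaying real ones."""
    def __init__(self):
        self.data = []
    def fit(self, real_experiences):           # stage (2)
        self.data = list(real_experiences)
    def generate(self, n):
        return self.data[:n]

class ToyDiscriminator:
    """Stand-in discriminator: scores experience it saw as real with 1.0."""
    def __init__(self):
        self.real = set()
    def fit(self, real_experiences, simulated): # stage (3)
        self.real.update(real_experiences)
    def score(self, exp):
        return 1.0 if exp in self.real else 0.0

class ToyAgent:
    def __init__(self):
        self.updates = 0
    def interact(self, user_turns):
        return list(user_turns)
    def update(self, batch):
        self.updates += len(batch)

def d3q_epoch(agent, world_model, discriminator, real_buffer,
              user_turns, planning_steps=4, threshold=0.5):
    # (1) Direct reinforcement learning on real experience.
    real = agent.interact(user_turns)
    real_buffer.extend(real)
    agent.update(real)
    # (2) World model learning from accumulated real experience.
    world_model.fit(real_buffer)
    # (3) Discriminator learning: real vs. simulated experience.
    discriminator.fit(real_buffer, world_model.generate(len(real)))
    # (4) Controlled planning: only discriminator-approved experience.
    for _ in range(planning_steps):
        batch = world_model.generate(len(real))
        good = [e for e in batch if discriminator.score(e) > threshold]
        agent.update(good)

agent, wm, disc = ToyAgent(), ToyWorldModel(), ToyDiscriminator()
buf = []
d3q_epoch(agent, wm, disc, buf, user_turns=("hi", "book", "bye"))
print(agent.updates)  # → 15: 3 real + 4 planning steps × 3 approved
```

The key structural point is stage (4): simulated batches pass through the discriminator before they are allowed to update the policy, which is exactly what distinguishes D3Q's "controlled planning" from DDQ's unfiltered planning.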
_DQN: details in "Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning"_
_The world model is the same as in the original DDQ._
The discriminator D judges the quality of the dialogues generated by the world model. As shown on the right of Figure 3, an LSTM encodes the dialogue context into a representation vector, and an MLP then predicts a probability in [0, 1] indicating how similar the simulated dialogue is to a real one. The objective function of D is:
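The equation itself is missing here; given the GAN setup described above, it should be the standard discriminator objective (reconstructed from the description, so the notation may differ slightly from the paper):

$$\min_{D}\; -\,\mathbb{E}_{x\sim p_{\text{real}}}\big[\log D(x)\big] \;-\; \mathbb{E}_{\tilde{x}\sim G}\big[\log\big(1 - D(\tilde{x})\big)\big]$$

where $p_{\text{real}}$ is the distribution of real dialogue experiences, $G$ denotes the world model that generates simulated experiences, and $D(\cdot)$ is the probability that an experience is real.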
Figure 4 shows the performance of DDQ agents with different planning steps without heuristics. The performance is observably unstable, especially for larger planning steps, which indicates that the quality of simulated experience becomes more pivotal as the number of planning steps increases.
_details in the original paper_
There are two experimental settings: full domain and domain extension.
In this setting, the dialogue agents are optimized by interacting with user simulators instead of with real users. In other words, the world model is trained to mimic the user simulators. Despite the discrepancy between simulators and real users, this setting gives us the flexibility to perform a detailed analysis of the models at low cost and to reproduce experimental results easily.
Figure 5 shows the performance of different models in the full-domain setting: D3Q (here actually using planning step = 4, the same as DDQ(5)) far outperforms DQN and DDQ(5), and its convergence speed is comparable to that of DQN(5).
Figure 4 shows that DDQ is sensitive to the quality of the simulated dialogues (i.e., to the number of planning steps), whereas D3Q is much more robust; see Figure 6 (compare with Figure 4).
The results summarized in Figure 8 show that D3Q significantly outperforms the baseline methods, demonstrating its robustness. Furthermore, D3Q shows remarkable learning efficiency while extending the domain, even outperforming DQN(5). A potential reason is that the world model may improve exploration in such an unstable and noisy environment.
In the human evaluation experiments, real users interact with different models without knowing which agent is behind the system.
The user can abandon the task and terminate the dialogue at any time, either because she or he believes the dialogue is unlikely to succeed or simply because it drags on for too many turns. In such cases, the dialogue session is considered a failure.
This paper proposes a new framework, Discriminative Deep Dyna-Q (D3Q), for task-completion dialogue policy learning. With a discriminator acting as judge, the proposed approach is capable of controlling the quality of the simulated experience generated in the planning phase, which enables efficient and robust dialogue policy learning. Furthermore, D3Q can be viewed as a generic model-based RL approach that is easily extensible to other RL problems.