# Discriminative Deep Dyna-Q Robust Planning for Dialogue Policy Learning

## Introduction

To the best of our knowledge, there is no universally accepted metric for evaluating user simulators for dialogue purposes (see "A survey on metrics for the evaluation of user simulations"). Therefore, it remains controversial whether training a task-completion dialogue agent via simulated users is a valid and effective approach.

In Deep Dyna-Q (DDQ), real experience is used in two ways:

• directly improve the dialogue policy via RL;
• improve the world model via supervised learning to make it behave more human-like.

The former is referred to as direct reinforcement learning, and the latter as world model learning. The policy model is thus trained on two kinds of experience: real experiences collected by interacting with real users (direct reinforcement learning), and simulated experiences collected by interacting with the learned world model (planning, or indirect reinforcement learning).
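
As a rough sketch, the interleaving of these two update sources might look as follows; `agent`, `world_model`, and `real_user` are hypothetical interfaces used purely for illustration, not the authors' code.

```python
def ddq_iteration(agent, world_model, real_user, planning_steps):
    # Direct reinforcement learning: interact with a real user, update policy.
    real_episode = agent.run_episode(real_user)
    agent.update_policy([real_episode])

    # World model learning: supervised refinement on the real experience.
    world_model.fit([real_episode])

    # Planning (indirect RL): update the policy on simulated experience.
    for _ in range(planning_steps):
        simulated_episode = agent.run_episode(world_model)
        agent.update_policy([simulated_episode])
```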

Although in the early stages of dialogue training it is helpful to perform planning aggressively with large amounts of simulated experience regardless of its quality, in the later stages, once the dialogue agent has improved significantly, low-quality simulated experiences often hurt performance badly.

During the course of dialogue training, both the world model and the discriminator are refined using real experiences, so the quality threshold enforced by the discriminator rises together with the world model and the dialogue agent, especially in the late stage of training.
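
As a concrete illustration, below is a minimal PyTorch sketch of one discriminator refinement step, assuming the discriminator is a binary classifier that maps a batch of encoded dialogues to probabilities of being real; the module architecture and the dialogue encoding are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, optimizer, real_batch, simulated_batch):
    """One refinement step: push real experience toward label 1, simulated toward 0.

    `discriminator` is any torch.nn.Module that outputs probabilities in (0, 1);
    how dialogues are encoded into tensors is abstracted away here.
    """
    real_scores = discriminator(real_batch)
    sim_scores = discriminator(simulated_batch)
    loss = (F.binary_cross_entropy(real_scores, torch.ones_like(real_scores)) +
            F.binary_cross_entropy(sim_scores, torch.zeros_like(sim_scores)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```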

1. The proposed Discriminative Deep Dyna-Q approach controls the quality of the simulated experiences generated by the world model in the planning phase, which enables effective and robust dialogue policy learning.
2. The proposed model is verified by experiments in simulation, human evaluation, and domain-extension settings, where all results show better sample efficiency than the DDQ baselines.

## Discriminative Deep Dyna-Q (D3Q)

Starting with an initial dialogue policy and an initial world model (both trained on pre-collected human conversational data), D3Q training consists of four stages: (1) direct reinforcement learning: the agent interacts with real users, collects real experiences, and improves the dialogue policy; (2) world model learning: the world model is learned and refined using real experience; (3) discriminator learning: the discriminator is learned and refined to differentiate simulated experience from real experience; and (4) controlled planning: the agent improves the dialogue policy using the high-quality simulated experience generated by the world model and filtered by the discriminator.
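
Putting the four stages together, one D3Q iteration can be sketched as below, using the same hypothetical interfaces as the earlier sketch; the 0.5 acceptance threshold is an assumption for illustration, not a detail from the paper.

```python
def d3q_iteration(agent, world_model, discriminator, real_user, planning_steps):
    # (1) Direct reinforcement learning: collect real experience, update policy.
    real_episode = agent.run_episode(real_user)
    agent.update_policy([real_episode])

    # (2) World model learning: supervised refinement on real experience.
    world_model.fit([real_episode])

    # (3) Discriminator learning: distinguish real from simulated experience.
    simulated = [agent.run_episode(world_model) for _ in range(planning_steps)]
    discriminator.fit(real=[real_episode], simulated=simulated)

    # (4) Controlled planning: update the policy only on simulated episodes
    #     that the discriminator judges to be high quality.
    high_quality = [ep for ep in simulated if discriminator.prob_real(ep) > 0.5]
    agent.update_policy(high_quality)
```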

### World Model Learning

_The world model is the same as in the original DDQ._

### Controlled Planning

D3Q uses three experience replay buffers in total (a sketch of how they are filled follows the list):

• $B^{u}$: stores real dialogue experiences
• $B^{s}$: stores all simulated dialogue experiences
• $B^{h}$: stores the simulated dialogues that the discriminator D judges to be of high quality
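
A hedged sketch of how the three buffers might be routed during controlled planning follows; `prob_real` and the 0.5 threshold are illustrative assumptions, not taken from the paper.

```python
def controlled_planning_step(agent, world_model, discriminator, B_s, B_h):
    """Generate one simulated episode and route it to the replay buffers."""
    episode = agent.run_episode(world_model)  # simulated dialogue
    B_s.append(episode)                       # B^s keeps every simulated episode
    if discriminator.prob_real(episode) > 0.5:
        B_h.append(episode)                   # B^h keeps only high-quality ones
    # Policy updates during planning then sample minibatches from B^h, not B^s.
```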

Figure 4 shows the performance of DDQ agents with different numbers of planning steps, without heuristics. The performance is unstable, especially for larger numbers of planning steps, which indicates that the quality of simulated experience becomes more pivotal as the number of planning steps increases.

## Experiments

### Dataset & Baselines

_Details are given in the original paper._

### Simulation Evaluation

In this setting, the dialogue agents are optimized by interacting with user simulators instead of real users; in other words, the world model is trained to mimic the user simulators. Despite the discrepancy between simulators and real users, this setting gives us the flexibility to perform a detailed analysis of the models at low cost, and to reproduce experimental results easily.

Figure 5 shows the performance of the different models in the full-domain setting: D3Q (in fact with planning step = 4 here, matching DDQ(5)) far outperforms DQN and DDQ(5), and converges about as fast as DQN(5).

Figure 4 shows that DDQ is sensitive to the quality of the simulated experience (i.e., to the number of planning steps), whereas D3Q is considerably more robust; see Figure 6 (and compare it with Figure 4).

### Domain Extension

The results summarized in Figure 8 show that D3Q significantly outperforms the baseline methods, demonstrating its robustness. Furthermore, D3Q shows remarkable learning efficiency while extending the domain, even outperforming DQN(5). A potential reason is that the world model may improve exploration in such an unstable and noisy environment.

### Human Evaluation

In the human evaluation experiments, real users interact with different models without knowing which agent is behind the system.

The user can abandon the task and terminate the dialogue at any time if he or she believes that the dialogue is unlikely to succeed, or simply because it drags on for too many turns. In such cases, the dialogue session is counted as a failure.

## Conclusions

This paper proposes a new framework, Discriminative Deep Dyna-Q (D3Q), for task-completion dialogue policy learning. With a discriminator acting as a judge, the proposed approach can control the quality of the simulated experience generated in the planning phase, which enables efficient and robust dialogue policy learning. Furthermore, D3Q can be viewed as a generic model-based RL approach that is easily extensible to other RL problems.