# Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

## Introduction

Learning policies for task-completion dialogue is often formulated as a reinforcement learning (RL) problem.

Dhingra et al. (2017) demonstrated a significant discrepancy in the performance of a simulator-trained dialogue agent when evaluated with simulators versus with real users.

• Model-free RL: no model; learn the value function (and/or policy) from real experience.
• Model-based RL (using sample-based planning): learn a model from real experience; plan the value function (and/or policy) from simulated experience.
• Dyna: learn a model from real experience; learn and plan the value function (and/or policy) from both real and simulated experience.
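The Dyna recipe above can be sketched as a tabular toy example. The `Chain` environment and all names here are illustrative stand-ins, not anything from the paper:

```python
import random

class Chain:
    """Toy 4-state corridor: the agent starts at state 0 and must move
    right to reach the goal state 3 (made up for this sketch)."""
    actions = (0, 1)                          # 0 = left, 1 = right
    def reset(self):
        return 0
    def step(self, s, a):
        s2 = max(0, s - 1) if a == 0 else min(3, s + 1)
        done = s2 == 3
        return (1.0 if done else -0.1), s2, done

def dyna_q(env, n_episodes=200, n_planning=5, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: direct RL on real transitions, a learned
    (deterministic, tabular) model, and planning on model replays."""
    Q, model = {}, {}                         # Q[(s, a)]; model[(s, a)] = (r, s', done)
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(env.actions) if random.random() < eps
                 else max(env.actions, key=lambda b: q(s, b)))
            r, s2, done = env.step(s, a)
            # (1) direct reinforcement learning from real experience
            target = r + (0.0 if done else gamma * max(q(s2, b) for b in env.actions))
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            # (2) model learning: here just a transition table
            model[(s, a)] = (r, s2, done)
            # (3) planning: Q-updates on experience simulated by the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * max(q(ps2, b) for b in env.actions))
                Q[(ps, pa)] = q(ps, pa) + alpha * (ptarget - q(ps, pa))
            s = s2
    return Q
```

With `n_planning = 0` this degrades to plain model-free Q-learning; larger values squeeze more policy updates out of each real transition.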

• The world model is trained and refined via supervised learning so that it behaves more like real users;
• it is then used directly to train the dialogue policy with reinforcement learning.

Dialogue policy can be improved either using real experience directly (i.e., direct reinforcement learning) or via the world model indirectly (referred to as planning or indirect reinforcement learning). The interaction between world model learning, direct reinforcement learning and planning is illustrated in Figure 1(c), following the Dyna-Q framework (Sutton, 1990).

## Dialogue Policy Learning via Deep Dyna-Q (DDQ)

• an LSTM-based natural language understanding (NLU) module (Hakkani-Tür et al., 2016) for identifying user intents and extracting associated slots;
• a state tracker (Mrkšić et al., 2016) for tracking the dialogue states;
• a dialogue policy which selects the next action based on the current state;
• a model-based natural language generation (NLG) module for converting dialogue actions to natural language response (Wen et al., 2015);
• a world model for generating simulated user actions and simulated rewards.

Algorithm 1: Deep Dyna-Q for Dialogue Policy Learning

As illustrated in Figure 1(c), starting with an initial dialogue policy and an initial world model (both trained with pre-collected human conversational data), the training of the DDQ agent consists of three processes:

1. direct reinforcement learning, where the agent interacts with a real user, collects real experience and improves the dialogue policy;
2. world model learning, where the world model is learned and refined using real experience;
3. planning, where the agent improves the dialogue policy using simulated experience.
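The three processes can be sketched as one training epoch. The class and method names below are illustrative, not the paper's actual code:

```python
from collections import deque

class DDQTrainer:
    """Sketch of Algorithm 1 (hypothetical interfaces): each epoch runs
    direct RL with the real user, refines the world model on real
    experience, then performs K planning steps with simulated users."""
    def __init__(self, agent, user, world_model, K):
        self.agent, self.user, self.world_model, self.K = agent, user, world_model, K
        self.buffer_real = deque(maxlen=5000)   # D^u: real experience
        self.buffer_sim = deque(maxlen=5000)    # D^s: simulated experience

    def run_epoch(self):
        # (1) direct reinforcement learning with the real user
        self.buffer_real.extend(self.agent.run_dialogue(self.user))
        self.agent.update(self.buffer_real)
        # (2) world model learning from real experience
        self.world_model.fit(self.buffer_real)
        # (3) planning: K dialogues simulated by the world model
        for _ in range(self.K):
            self.buffer_sim.extend(self.agent.run_dialogue(self.world_model))
            self.agent.update(self.buffer_sim)
```

Note that the agent runs dialogues against the real user and the world model through the same interface; only the interaction partner changes between direct RL and planning.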

### Planning

1. At the start of each dialogue, a user goal $G=(C,R)$ is randomly sampled, where $C$ is a set of constraints (inform slots) and $R$ is the set of request slots.

For movie-ticket booking dialogues, constraints are typically the name and the date of the movie, the number of tickets to buy, etc. Requests can contain these slots as well as the location of the theater, its start time, etc.

2. The model assumes that dialogues are user-initiated, i.e., the user speaks first. The first user action is sampled from $(R, C)$ and simplified to an inform or request dia-act.

A request, such as request(theater;moviename=batman), consists of one request slot and multiple (≥ 1) constraint slots, uniformly sampled from R and C, respectively. An inform contains constraint slots only. The user action can also be converted to natural language via NLG, e.g., "which theater will show batman?"

3. At each dialogue turn, the world model takes the user state $s$ (entered once the agent has acted) and the last agent action $a$ as input (both one-hot encoded), and outputs the user response $a^{u}$, the reward $r$, and a boolean flag indicating whether the dialogue terminates. The world model is implemented as an MLP and trained with supervised learning; its structure is as follows:

_(s, a) denotes the concatenation of the two._
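The goal sampling and first-turn simplification above can be sketched as follows. The dia-act dictionary format and slot names are assumptions for illustration, not the paper's data format:

```python
import random

def first_user_action(goal, rng=random):
    """Sample the first (user-initiated) turn as a request or inform
    dia-act. `goal` = (C, R): C maps inform slots to values, R is the
    set of request slots. Illustrative sketch only."""
    C, R = goal
    if R and rng.random() < 0.5:
        # request: one request slot plus constraint slots sampled from C
        req = rng.choice(sorted(R))
        n = rng.randint(1, len(C))
        inform = dict(rng.sample(sorted(C.items()), n))
        return {"diaact": "request", "request_slots": [req], "inform_slots": inform}
    # inform: constraint slots only
    return {"diaact": "inform", "request_slots": [], "inform_slots": dict(C)}
```

A sampled request action here would correspond to something like request(theater; moviename=batman), which the NLG module then renders as natural language.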

### World Model Learning

In this process (lines 19-22 in Algorithm 1), $M(s,a;\theta_{M})$ is refined via minibatch SGD using real experience in the replay buffer $D^{u}$. As shown in Figure 3, $M(s,a;\theta_{M})$ is a multi-task neural network (Liu et al., 2015) that combines two classification tasks of simulating $a^{u}$ and $t$, respectively, and one regression task of simulating $r$. The lower layers are shared across all tasks, while the top layers are task-specific.
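The shared-trunk/multi-head structure can be illustrated with a toy pure-Python forward pass. The hidden size, the 4-way action head, and the random weights are all made up for the sketch:

```python
import math, random

def mlp_world_model(s, a, hidden=8, seed=0):
    """Toy forward pass of the multi-task world model M(s, a; θ_M):
    a shared hidden layer, then three task heads — user action a^u
    (softmax classification), reward r (regression), and termination t
    (sigmoid classification). Random weights, illustration only."""
    rng = random.Random(seed)
    x = s + a                                      # (s, a): concatenation
    def layer(inp, n_out):                         # dense layer with random weights
        W = [[rng.uniform(-1, 1) for _ in inp] for _ in range(n_out)]
        return [sum(w * v for w, v in zip(row, inp)) for row in W]
    h = [math.tanh(z) for z in layer(x, hidden)]   # shared lower layers
    logits = layer(h, 4)                           # task head: user action a^u
    m = max(logits)
    exp = [math.exp(z - m) for z in logits]
    a_u = [e / sum(exp) for e in exp]
    r = layer(h, 1)[0]                             # task head: reward r
    t = 1.0 / (1.0 + math.exp(-layer(h, 1)[0]))    # task head: termination t
    return a_u, r, t
```

Training would fit the shared layers jointly on all three losses (two cross-entropies and one squared error), which is the multi-task setup the section describes.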

## Experiments and Results

### Dataset

Raw conversational data in the movie-ticket booking scenario was collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 4, which consists of 11 dialogue acts and 16 slots. In total, the dataset contains 280 annotated dialogues, the average length of which is approximately 11 turns.

### Dialogue Agents for Comparison

• A DQN agent is learned by standard DQN, implemented with direct reinforcement learning only (lines 5-18 in Algorithm 1) in each epoch.

• The DDQ(K) agents are learned by DDQ of Algorithm 1, with an initial world model pretrained on human conversational data, as described in Section 3.1. K is the number of planning steps. We trained different versions of DDQ(K) with different K’s.

• The DDQ(K, rand-init $\theta_{M}$ ) agents are learned by the DDQ method with a randomly initialized world model.

• The DDQ(K, fixed $\theta_{M}$) agents are learned by DDQ with an initial world model pretrained on human conversational data. But the world model is not updated afterwards. That is, the world model learning part in Algorithm 1 (lines 19-22) is removed. The DDQ(K, fixed $\theta_{M}$) agents are evaluated in the simulation setting only.

• The DQN(K) agents are learned by DQN, but with K times more real experiences than the DQN agent. DQN(K) is evaluated in the simulation setting only. Its performance can be viewed as the upper bound of its DDQ(K) counterpart, assuming that the world model in DDQ(K) perfectly matches real users.

### Implementation Details

We found in our experiments that setting Z > 1 improves the performance of all agents, but does not change the conclusion of this study: DDQ consistently outperforms DQN by a statistically significant margin. Conceptually, the optimal value of Z used in planning differs from that in direct reinforcement learning, and should vary with the quality of the world model: the better the world model, the more aggressive the updates (and thus the larger the Z) that can be used in planning. We leave it to future work to investigate how to optimize Z for planning in DDQ.

### Simulated User Evaluation

Although the simulator-trained agents are sub-optimal when applied to real users due to the discrepancy between simulators and real users, the simulation setting allows us to perform a detailed analysis of DDQ without much cost and to reproduce the experimental results easily.

### User Simulator

A dialogue is considered successful only if:

1. a movie ticket is booked, and
2. all of the user's constraints are satisfied (the constraints are the inform slots in the user goal).

The reward function is defined as follows:

1. at every turn of the dialogue, the agent receives a reward of -1 (regardless of whether the dialogue eventually succeeds);
2. at the last turn, the agent receives a reward of $2 \times L$ for a successful dialogue and $-L$ for a failed one.

_L is the maximum number of turns in each dialogue, and is set to 40 in our experiments._
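This reward scheme is a one-liner; the function name is mine, but the values follow the definition above:

```python
def turn_reward(done, success, L=40):
    """Reward described above: -1 per turn to encourage short dialogues;
    at the final turn, 2*L for success and -L for failure (L = maximum
    number of turns, 40 in the paper's experiments)."""
    if not done:
        return -1
    return 2 * L if success else -L
```

The -1 per-turn penalty is what pushes the agent toward shorter dialogues, while the large terminal bonus (80 vs. -40 with L = 40) dominates the return and makes task success the primary objective.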

• inform slots contain a number of slot-value pairs which serve as constraints from the user.
• request slots contain a set of slots whose values the user does not know but wants to obtain from the agent during the conversation. ticket is a default slot that always appears in the request slots part of a user goal.

Slots are split into two groups. Some slots must appear in the user goal; we call these Required slots. In the movie-booking scenario, they include moviename, theater, starttime, date, and numberofpeople. The remaining slots are Optional slots, e.g., theater chain, video format, etc.

We generated the user goals from the labeled dataset mentioned in Section 3.1, using two mechanisms. One mechanism is to extract all the slots (known and unknown) from the first user turn (excluding the greeting turn) in each dialogue, since the first turn usually contains some or all of the required information from the user. The other mechanism is to extract all the slots (known and unknown) that first appear in any of the user turns, and then aggregate them into one user goal. We dump these user goals into a file as the user-goal database. Every time a dialogue runs, we randomly sample one user goal from this database.
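The two mechanisms can be sketched as below. The turn format (dicts with `diaact`, `inform_slots`, `request_slots`) is an assumption for illustration, not the dataset's actual schema:

```python
def extract_user_goals(dialogues):
    """Sketch of the two goal-extraction mechanisms described above.
    Each dialogue is assumed to be a list of user turns.
    Mechanism 1: slots from the first non-greeting user turn.
    Mechanism 2: slots that first appear in any user turn, aggregated."""
    goals = []
    for turns in dialogues:
        user_turns = [t for t in turns if t.get("diaact") != "greeting"]
        if not user_turns:
            continue
        # mechanism 1: the first (non-greeting) user turn
        first = user_turns[0]
        goals.append({"inform_slots": dict(first["inform_slots"]),
                      "request_slots": set(first["request_slots"])})
        # mechanism 2: aggregate first appearances across all user turns
        agg_inform, agg_request = {}, set()
        for t in user_turns:
            for k, v in t["inform_slots"].items():
                agg_inform.setdefault(k, v)   # keep the first appearance
            agg_request |= set(t["request_slots"])
        goals.append({"inform_slots": agg_inform, "request_slots": agg_request})
    return goals
```

The resulting goal list plays the role of the user-goal database: a simulated dialogue starts by sampling one entry from it.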

### Results

• success rate: the fraction of dialogues completed successfully
• average reward: the average reward received per dialogue
• average number of turns: the average number of turns per dialogue

Figure 4 shows the learning curves of DDQ agents with different planning steps K. All agents are initialized with the same experience replay buffer generated by a rule-based agent, so their performance is similar at the start of training and improves thereafter, especially for larger values of K.

Recall that the DDQ(K) agent with K = 0 is identical to the DQN agent, which does no planning but relies on direct reinforcement learning only. Without planning, the DQN agent took about 180 epochs (real dialogues) to reach a success rate of 50%, while DDQ(10) took only 50 epochs.

Intuitively, the optimal value of K needs to be determined by seeking the best trade-off between the quality of the world model and the amount of simulated experience that is useful for improving the dialogue agent. This is a non-trivial optimization problem because both the dialogue agent and the world model are updated constantly during training, so the optimal K needs to be adjusted accordingly. For example, we find in our experiments that at the early stages of training, it is fine to plan aggressively using large amounts of simulated experience even though it is of low quality; but in the late stages of training, where the dialogue agent has already improved significantly, low-quality simulated experience is likely to hurt performance. Thus, in our implementation of Algorithm 1, we use a heuristic to reduce the value of K in the late stages of training (e.g., after 150 epochs in Figure 4) to mitigate the negative impact of low-quality simulated experience. We leave it to future work how to optimize the planning step size during DDQ training in a principled way.
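One way to write such a heuristic is a simple epoch-based schedule. The specific values (20, 5, a 25-epoch decay interval) are hypothetical; only the 150-epoch decay start is taken from the text:

```python
def planning_steps(epoch, k_max=20, k_min=5, decay_start=150):
    """Hypothetical schedule for the heuristic above: plan aggressively
    early, then shrink K once the agent has improved (after ~150 epochs
    in Figure 4) to limit damage from low-quality simulated experience."""
    if epoch < decay_start:
        return k_max
    return max(k_min, k_max - (epoch - decay_start) // 25)
```

A principled alternative, as the section notes, would adapt K online, e.g., from a running estimate of world-model quality, rather than from the epoch count alone.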

Figure 5 shows that the quality of the world model has a significant impact on the agent's performance. The learning curve of DQN(10) indicates the best performance we can expect with a perfect world model. With a pre-trained world model, the performance of the DDQ agent improves more rapidly, although eventually, the DDQ and DDQ(rand-init $\theta_{M}$) agents reach the same success rate after many epochs. The world model learning process is crucial to both the efficiency of dialogue policy learning and the final performance of the agent. For example, in the early stages (before 60 epochs), the performances of DDQ and DDQ(fixed $\theta_{M}$) remain very close to each other, but DDQ reaches a success rate almost 10% better than DDQ(fixed $\theta_{M}$) after 400 epochs.

## Conclusion

We propose a new strategy for a task-completion dialogue agent to learn its policy by interacting
with real users. Compared to previous work, our agent learns in a much more efficient way, using only a small number of real user interactions, which amounts to an affordable cost in many nontrivial domains. Our strategy is based on the Deep Dyna-Q (DDQ) framework where planning is integrated into dialogue policy learning. The effectiveness of DDQ is validated by human-in-the-loop experiments, demonstrating that a dialogue agent can efficiently adapt its policy on the fly by interacting with real users via deep RL.

One interesting topic for future research is exploration in planning. We need to deal with the challenge of adapting the world model in a changing environment, as exemplified by the domain extension problem (Lipton et al., 2016). As pointed out by Sutton and Barto (1998), the general problem here is a particular manifestation of the conflict between exploration and exploitation. In a planning context, exploration means trying actions that may improve the world model, whereas exploitation means trying to behave in the optimal way given the current model. To this end, we want the agent to explore in the environment, but not so much that the performance would be greatly degraded.