Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

2018-10-19

NLP, Reinforcement Learning, Task Oriented Dialogue

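This paper proposes Deep Dyna-Q (DDQ), a new approach to learning dialogue policies through interaction with real users. Compared with previous work, it needs only a small amount of real dialogue data: a world model is trained to mimic the user, and by combining model-free and model-based learning the method learns dialogue policies efficiently.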

Introduction

Learning policies for task-completion dialogue is often formulated as a reinforcement learning (RL) problem.

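However, unlike simulation-based games such as Atari games (Mnih et al., 2015) and AlphaGo (Silver et al., 2016a, 2017), a task-oriented dialogue system needs real users to interact with the RL agent in order to update its dialogue policy, which is prohibitively expensive in practice.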

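One way to address this is to build a user simulator from real human conversational data, so that the RL agent can learn its dialogue policy by interacting with the simulator. The drawback is that a user simulator lacks the linguistic diversity of human conversation, and the agent's performance depends heavily on the simulator.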

Dhingra et al. (2017) demonstrated a significant discrepancy in a simulator-trained dialogue agent when evaluated with simulators and with real users.

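Moreover, there is currently no well-established metric for evaluating the quality of a user simulator, so whether an RL agent should be trained with a user simulator remains controversial.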

This paper proposes a new method based on the Dyna-Q framework, which adds a planning process to policy learning. First, a brief recap of the Dyna-Q idea (from David Silver):

Model-Free RL: no model; learn a value function (and/or policy) from real experience.
Model-Based RL (using sample-based planning): learn a model from real experience; plan a value function (and/or policy) from simulated experience.
Dyna: learn a model from real experience; learn and plan a value function (and/or policy) from real and simulated experience.
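To make the Dyna idea concrete, here is a minimal tabular Dyna-Q sketch. It is not the paper's DDQ (which uses neural networks and dialogue data); the toy chain environment `chain_step`, the hyperparameters, and all function names are assumptions chosen only for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=50, k=5,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q: direct RL, model learning, and k planning steps per real step."""
    rng = random.Random(seed)
    q = defaultdict(float)   # Q[(state, action)]
    model = {}               # learned model: (s, a) -> (r, s_next, done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)     # real experience
            update(s, a, r, s2, done)        # direct RL
            model[(s, a)] = (r, s2, done)    # model learning
            for _ in range(k):               # planning on simulated experience
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return q

def chain_step(s, a):
    """Hypothetical 5-state chain: action 1 moves right, 0 moves left; reward 1 at state 4."""
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return (1.0, s2, True) if s2 == 4 else (0.0, s2, False)

q = dyna_q(chain_step, start_state=0, actions=[0, 1])
```

Setting `k=0` reduces this to plain Q-learning, which mirrors how the DQN baseline relates to DDQ later in these notes.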

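Concretely, the paper introduces a world model that mimics real users. During dialogue policy learning, real dialogue data serves two purposes:

to train and refine the world model via supervised learning, so that it behaves more like real users;
to train the dialogue policy directly via reinforcement learning.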

Dialogue policy can be improved either using real experience directly (i.e., direct reinforcement learning) or via the world model indirectly (referred to as planning or indirect reinforcement learning). The interaction between world model learning, direct reinforcement learning and planning is illustrated in Figure 1(c), following the Dyna-Q framework (Sutton, 1990).

Dialogue Policy Learning via Deep Dyna-Q (DDQ)

The proposed DDQ agent consists of the following five modules:

an LSTM-based natural language understanding (NLU) module (Hakkani-Tür et al., 2016) for identifying user intents and extracting associated slots;
a state tracker (Mrkšić et al., 2016) for tracking the dialogue states;
a dialogue policy which selects the next action based on the current state;
a model-based natural language generation (NLG) module for converting dialogue actions to natural language response (Wen et al., 2015);
a world model for generating simulated user actions and simulated rewards.

Algorithm 1: Deep Dyna-Q for Dialogue Policy Learning

As illustrated in Figure 1(c), starting with an initial dialogue policy and an initial world model (both trained with pre-collected human conversational data), the training of the DDQ agent consists of three processes:

direct reinforcement learning, where the agent interacts with a real user, collects real experience and improves the dialogue policy;
world model learning, where the world model is learned and refined using real experience;
planning, where the agent improves the dialogue policy using simulated experience.

Direct Reinforcement Learning

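This process is a standard DQN algorithm, in which the value function $Q(s, a; \theta_{Q})$ is approximated by a multi-layer perceptron (MLP).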
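As a reminder of what "standard DQN" means here, the update minimizes a squared temporal-difference error over minibatches drawn from the replay buffer of real experience. The notation below follows common DQN convention and is an assumption rather than a copy of the paper's equation: $Q'$ is a target network with parameters $\theta_{Q'}$, $\gamma$ is the discount factor, and $D^{u}$ is the buffer of real experience.

```latex
\mathcal{L}(\theta_{Q}) = \mathbb{E}_{(s, a, r, s') \sim D^{u}}
  \left[ \big( y - Q(s, a; \theta_{Q}) \big)^{2} \right],
\qquad
y = r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'})
```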

Planning

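In the planning process, the world model is first used to generate simulated dialogues. K is the number of planning steps performed for each step of direct reinforcement learning. If the world model simulates users well enough, K can be increased to speed up training. DDQ uses two replay buffers: $D^{u}$ stores real dialogues and $D^{s}$ stores dialogues simulated by the world model. The same DQN algorithm is used in both cases; only the buffer it operates on differs between stages.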
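The interplay of the three processes and the two buffers can be sketched as one training epoch. This is a structural sketch only: the four callables are hypothetical placeholders for the dialogue runner, the DQN update, and the world-model update, and the batch size and buffer capacity are assumptions.

```python
import random
from collections import deque

def sample(buffer, batch_size):
    """Draw a minibatch (without replacement) from a replay buffer."""
    return random.sample(list(buffer), min(batch_size, len(buffer)))

def ddq_epoch(run_real_dialogue, run_simulated_dialogue, update_policy,
              update_world_model, real_buffer, sim_buffer, k, batch_size=16):
    """One DDQ-style epoch; every callable is a hypothetical stand-in."""
    # (1) Direct RL: collect real experience into D^u and update Q from D^u.
    real_buffer.extend(run_real_dialogue())
    update_policy(sample(real_buffer, batch_size))
    # (2) World model learning: refine M on real experience from D^u.
    update_world_model(sample(real_buffer, batch_size))
    # (3) Planning: K rounds of simulated experience into D^s, update Q from D^s.
    for _ in range(k):
        sim_buffer.extend(run_simulated_dialogue())
        update_policy(sample(sim_buffer, batch_size))

# Toy usage with stand-in callables that only record which update ran.
calls = []
transition = ("s", "a", -1, "s_next", False)
d_u, d_s = deque(maxlen=5000), deque(maxlen=5000)  # capacity is an assumption
ddq_epoch(lambda: [transition], lambda: [transition],
          lambda batch: calls.append("policy"),
          lambda batch: calls.append("world_model"),
          d_u, d_s, k=3)
```

With `k=3`, the policy is updated once from real data and three times from simulated data, which is where the sample-efficiency gain is meant to come from.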

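Simulated dialogues are generated as follows.

At the beginning of each dialogue, a user goal $G=(C,R)$ is sampled at random, where C is a set of constraints (inform slots) and R is a set of request slots.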

For movie-ticket booking dialogues, constraints are typically the name and the date of the movie, the number of tickets to buy, etc. Requests can contain these slots as well as the location of the theater, its start time, etc.

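The model assumes that every dialogue is initiated by the user, i.e., the user speaks first. The first user action is sampled from (R, C) and is simplified to either an inform or a request dialogue act.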
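A minimal sketch of how a first user action could be drawn from a goal: a request carries one request slot plus at least one constraint slot, and an inform carries constraint slots only. The goal layout, the function name `sample_first_user_action`, and the 50/50 choice between request and inform are assumptions for illustration.

```python
import random

def sample_first_user_action(goal, rng=random):
    """Sample an inform or request act from a user goal (hypothetical layout)."""
    constraints = goal["inform_slots"]
    # Uniformly pick at least one constraint slot from C.
    n = rng.randint(1, len(constraints))
    chosen = dict(rng.sample(sorted(constraints.items()), n))
    # Assumption: choose request vs. inform with equal probability.
    if goal["request_slots"] and rng.random() < 0.5:
        return {"act": "request",
                "request_slot": rng.choice(goal["request_slots"]),
                "inform_slots": chosen}
    return {"act": "inform", "request_slot": None, "inform_slots": chosen}

goal = {
    "inform_slots": {"moviename": "batman", "date": "tomorrow", "numberofpeople": "2"},
    "request_slots": ["theater", "starttime"],
}
action = sample_first_user_action(goal, random.Random(0))
```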

A request, such as request(theater;moviename=batman), consists of a request slot and multiple (≥ 1) constraint slots, uniformly sampled from R and C, respectively. An inform contains constraint slots only. The user action can also be converted to natural language via NLG, e.g., “which theater will show batman?”

In each dialogue turn, the world model takes as input the current user state s (the state entered once the agent has finished its turn) and the last agent action a, both one-hot encoded, and outputs a user response $a^{u}$, a reward r, and a binary variable t indicating whether the dialogue terminates. The world model $M(s,a;\theta_{M})$ is implemented as an MLP over the input (s, a) and is trained with supervised learning.

_(s, a) denotes the concatenation of s and a._
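The network equations are not reproduced in these notes. One way to write such a network, consistent with the description (a shared hidden layer, a regression output for r, and classification outputs for $a^{u}$ and t), is given below; the weight and bias names are illustrative notation.

```latex
\begin{aligned}
h     &= \tanh\big(W_{h}\,(s, a) + b_{h}\big) \\
r     &= W_{r}\, h + b_{r} \\
a^{u} &= \mathrm{softmax}\big(W_{a}\, h + b_{a}\big) \\
t     &= \mathrm{sigmoid}\big(W_{t}\, h + b_{t}\big)
\end{aligned}
```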

World Model Learning

In this process (lines 19-22 in Algorithm 1), $M(s,a;\theta_{M})$ is refined via minibatch SGD using real experience in the replay buffer $D^{u}$. As shown in Figure 3, $M(s,a;\theta_{M})$ is a multi-task neural network (Liu et al., 2015) that combines two classification tasks of simulating $a^{u}$ and t, respectively, and one regression task of simulating r. The lower layers are shared across all tasks, while the top layers are task-specific.
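The multi-task setup can be sketched in NumPy as a shared hidden layer with three heads and a summed loss. The layer sizes, the initialization, and the equal weighting of the three loss terms are assumptions; the source does not state how the task losses are combined.

```python
import numpy as np

def init_world_model(state_dim, n_agent_actions, n_user_actions, hidden=16, seed=0):
    """Randomly initialize a one-hidden-layer, three-head world model (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + n_agent_actions
    w = lambda *shape: rng.normal(0.0, 0.1, size=shape)
    return {"W_h": w(hidden, in_dim), "b_h": np.zeros(hidden),
            "W_a": w(n_user_actions, hidden), "b_a": np.zeros(n_user_actions),
            "W_r": w(1, hidden), "b_r": np.zeros(1),
            "W_t": w(1, hidden), "b_t": np.zeros(1)}

def forward(p, s, a_onehot):
    """Shared layer on the concatenation (s, a), then task-specific heads."""
    x = np.concatenate([s, a_onehot])
    h = np.tanh(p["W_h"] @ x + p["b_h"])
    logits = p["W_a"] @ h + p["b_a"]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                       # user-action distribution
    r = (p["W_r"] @ h + p["b_r"])[0]                           # predicted reward
    t = 1.0 / (1.0 + np.exp(-(p["W_t"] @ h + p["b_t"])[0]))    # P(dialogue terminates)
    return probs, r, t

def multitask_loss(p, s, a_onehot, user_action, reward, terminal):
    """Cross-entropy (a^u) + squared error (r) + binary cross-entropy (t), equally weighted."""
    probs, r, t = forward(p, s, a_onehot)
    ce = -np.log(probs[user_action])
    mse = (r - reward) ** 2
    bce = -(terminal * np.log(t) + (1.0 - terminal) * np.log(1.0 - t))
    return ce + mse + bce

params = init_world_model(state_dim=6, n_agent_actions=4, n_user_actions=5)
s = np.zeros(6); s[2] = 1.0
a = np.zeros(4); a[1] = 1.0
probs, r, t = forward(params, s, a)
loss = multitask_loss(params, s, a, user_action=3, reward=-1.0, terminal=0.0)
```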

Experiments and Results

Dataset

Raw conversational data in the movie-ticket booking scenario was collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 4, which consists of 11 dialogue acts and 16 slots. In total, the dataset contains 280 annotated dialogues, the average length of which is approximately 11 turns.

Dialogue Agents for Comparison

The authors implement the following agents for comparison:

A DQN agent is learned by standard DQN, implemented with direct reinforcement learning only (lines 5-18 in Algorithm 1) in each epoch.
The DDQ(K) agents are learned by DDQ of Algorithm 1, with an initial world model pretrained on human conversational data, as described in Section 3.1. K is the number of planning steps. We trained different versions of DDQ(K) with different K’s.
The DDQ(K, rand-init $\theta_{M}$ ) agents are learned by the DDQ method with a randomly initialized world model.
The DDQ(K, fixed $\theta_{M}$) agents are learned by DDQ with an initial world model pretrained on human conversational data. But the world model is not updated afterwards. That is, the world model learning part in Algorithm 1 (lines 19-22) is removed. The DDQ(K, fixed $\theta_{M}$) agents are evaluated in the simulation setting only.
The DQN(K) agents are learned by DQN, but with K times more real experiences than the DQN agent. DQN(K) is evaluated in the simulation setting only. Its performance can be viewed as the upper bound of its DDQ(K) counterpart, assuming that the world model in DDQ(K) perfectly matches real users.

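Implementation Details:
Both the agents $Q(s, a; \theta_{Q})$ and the world model $M(s,a;\theta_{M})$ are implemented as MLPs. The authors note that the value of Z (the number of minibatch update steps in each epoch) affects agent performance: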

We found in our experiments that setting Z > 1 improves the performance of all agents, but does not change the conclusion of this study: DDQ consistently outperforms DQN by a statistically significant margin. Conceptually, the optimal value of Z used in planning is different from that in direct reinforcement learning, and should vary according to the quality of the world model. The better the world model is, the more aggressive update (thus bigger Z) is being used in planning. We leave it to future work to investigate how to optimize Z for planning in DDQ.

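In the initialization phase before agent training (called warm start in the code), the authors do not fill $D^{u}$ directly from the real dialogue data. Instead, they build a rule-based agent from the real dialogues and let it interact with a rule-based simulator to generate the dialogues that fill $D^{u}$. The replay-buffer data used for reinforcement learning therefore still depends on rules.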

Simulated User Evaluation

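As noted above, because the world model's training data comes from the rule-based agent and the rule-based simulator, the world model is in effect only imitating the rule-based simulator, and the DQN-based agent is likewise interacting with a user simulator rather than with real users. As a consequence: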

Although the simulator-trained agents are sub-optimal when applied to real users due to the discrepancy between simulators and real users, the simulation setting allows us to perform a detailed analysis of DDQ without much cost and to reproduce the experimental results easily.

User Simulator

The authors use a publicly available user simulator. During training, the simulator generates a user response to the agent at each dialogue turn and a reward at the end of each dialogue.

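A dialogue is considered successful only if both of the following hold:

a movie ticket is booked successfully;
all of the user's constraints are satisfied (the constraints are the inform slots in the user goal).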

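Reward function:

in every turn while the dialogue is in progress, the agent receives a reward of -1, regardless of whether the dialogue eventually succeeds;
in the final turn, the agent receives a reward of $2 \times L$ for a successful dialogue and $-L$ for a failed one.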

_L is the maximum number of turns in each dialogue, and is set to 40 in our experiments._
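The reward scheme above is simple enough to write down directly. In this sketch the function names are hypothetical, and `dialogue_return` assumes the final turn receives only the terminal reward (not an additional -1), which is one reading of the description.

```python
MAX_TURNS = 40  # L in the text

def turn_reward(is_last_turn, success, max_turns=MAX_TURNS):
    """Per-turn reward: -1 during the dialogue, 2L or -L on the final turn."""
    if not is_last_turn:
        return -1
    return 2 * max_turns if success else -max_turns

def dialogue_return(n_turns, success, max_turns=MAX_TURNS):
    """Undiscounted sum of rewards over a dialogue of n_turns turns."""
    return -(n_turns - 1) + turn_reward(True, success, max_turns)

short_success = dialogue_return(11, success=True)
long_failure = dialogue_return(21, success=False)
```

Because every extra turn costs 1, the agent is pushed toward short successful dialogues, while the large terminal reward keeps success more important than brevity.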

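A task-oriented dialogue is driven by an implicit user goal. The agent does not know the user goal when the dialogue starts; its job is to help the user accomplish it. A user goal consists of the following parts: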

inform slots contain a number of slot-value pairs which serve as constraints from the user.
request slots contain a set of slots that user has no information about the values, but wants to get the values from the agent during the conversation. ticket is a default slot which always appears in the request slots part of user goal.

In addition, to simplify the problem, the authors impose some further restrictions:

slots are split into two groups. Some of slots must appear in the user goal, we called these elements as Required slots. In the movie-booking scenario, it includes moviename, theater, starttime, date, numberofpeople; the rest slots are Optional slots, for example, theater chain, video format etc.

The user goals are extracted from the real dialogue data:

We generated the user goals from the labeled dataset mentioned in Section 3.1, using two mechanisms. One mechanism is to extract all the slots (known and unknown) from the first user turns (excluding the greeting user turn) in the data, since usually the first turn contains some or all the required information from user. The other mechanism is to extract all the slots (known and unknown) that first appear in all the user turns, and then aggregate them into one user goal. We dump these user goals into a file as the user-goal database. Every time when running a dialogue, we randomly sample one user goal from this user goal database.

Results

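Evaluation metrics:

success rate: the fraction of dialogues that end successfully
average reward: the average reward received per dialogue
average number of turns: the average number of turns per dialogue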
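Given a set of finished test dialogues, the three metrics are plain averages. The record layout (`success`, `reward`, `turns`) and the two example dialogues are assumptions made for this sketch, not values from the paper.

```python
def evaluate(dialogues):
    """Compute success rate, average reward, and average turns over test dialogues."""
    n = len(dialogues)
    return {
        "success_rate": sum(d["success"] for d in dialogues) / n,
        "avg_reward": sum(d["reward"] for d in dialogues) / n,
        "avg_turns": sum(d["turns"] for d in dialogues) / n,
    }

# Two made-up dialogues, for illustration only.
results = evaluate([
    {"success": True, "reward": 70, "turns": 11},
    {"success": False, "reward": -60, "turns": 21},
])
```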

Table 1: Results of different agents at training epoch = {100, 200, 300}. Each number is averaged over 5 runs, each run tested on 2000 dialogues. Excluding DQN(5) and DQN(10) which serve as the upper bounds, any two groups of success rate (except three groups: at epoch 100, DDQ(5, rand-init $\theta_{M}$) and DDQ(10, fixed $\theta_{M}$), at epoch 200, DDQ(5, rand-init $\theta_{M}$) and DDQ(10, rand-init $\theta_{M}$), at epoch 300, DQN and DDQ(10, fixed $\theta_{M}$)) evaluated at the same epoch is statistically significant in mean with p < 0.01. (Success: success rate)

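Figure 4 shows how DDQ agents with different values of K differ. All agents start from the same replay buffer generated by the rule-based agent, so their performance is similar early in training; it then improves, especially for larger K.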

Recall that the DDQ(K) agent with K=0 is identical to the DQN agent, which does no planning but relies on direct reinforcement learning only. Without planning, the DQN agent took about 180 epochs (real dialogues) to reach the success rate of 50%, and DDQ(10) took only 50 epochs.

The paper briefly discusses the optimal choice of K:

Intuitively, the optimal value of K needs to be determined by seeking the best trade-off between the quality of the world model and the amount of simulated experience that is useful for improving the dialogue agent. This is a non-trivial optimization problem because both the dialogue agent and the world model are updated constantly during training and the optimal K needs to be adjusted accordingly. For example, we find in our experiments that at the early stages of training, it is fine to perform planning aggressively by using large amounts of simulated experience even though they are of low quality, but in the late stages of training where the dialogue agent has been significantly improved, low-quality simulated experience is likely to hurt the performance. Thus, in our implementation of Algorithm 1, we use a heuristic to reduce the value of K in the late stages of training (e.g., after 150 epochs in Figure 4) to mitigate the negative impact of low-quality simulated experience. We leave it to future work how to optimize the planning step size during DDQ training in a principled way.
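The quoted passage does not say what the heuristic is. Purely as an illustration of the idea, one possible schedule is a step function that drops K after a fixed epoch; the function name, the late-stage value `k_late`, and the exact switch rule are all assumptions and should not be read as the authors' implementation.

```python
def planning_steps(epoch, k_initial=10, k_late=1, switch_epoch=150):
    """Hypothetical schedule: plan aggressively early, conservatively late."""
    return k_initial if epoch < switch_epoch else k_late

schedule = [planning_steps(e) for e in (0, 149, 150, 300)]
```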

Figure 5 shows that the quality of the world model has a significant impact on the agent’s performance. The learning curve of DQN(10) indicates the best performance we can expect with a perfect world model. With a pre-trained world model, the performance of the DDQ agent improves more rapidly, although eventually, the DDQ and DDQ(rand-init $\theta_{M}$) agents reach the same success rate after many epochs. The world model learning process is crucial to both the efficiency of dialogue policy learning and the final performance of the agent. For example, in the early stages (before 60 epochs), the performances of DDQ and DDQ(fixed $\theta_{M}$) remain very close to each other, but DDQ reaches a success rate almost 10% better than DDQ(fixed $\theta_{M}$) after 400 epochs.

Conclusion

We propose a new strategy for a task-completion dialogue agent to learn its policy by interacting with real users. Compared to previous work, our agent learns in a much more efficient way, using only a small number of real user interactions, which amounts to an affordable cost in many nontrivial domains. Our strategy is based on the Deep Dyna-Q (DDQ) framework where planning is integrated into dialogue policy learning. The effectiveness of DDQ is validated by human-in-the-loop experiments, demonstrating that a dialogue agent can efficiently adapt its policy on the fly by interacting with real users via deep RL.

One interesting topic for future research is exploration in planning. We need to deal with the challenge of adapting the world model in a changing environment, as exemplified by the domain extension problem (Lipton et al., 2016). As pointed out by Sutton and Barto (1998), the general problem here is a particular manifestation of the conflict between exploration and exploitation. In a planning context, exploration means trying actions that may improve the world model, whereas exploitation means trying to behave in the optimal way given the current model. To this end, we want the agent to explore in the environment, but not so much that the performance would be greatly degraded.