Evaluating the Markov assumption in Markov Decision Processes for spoken dialogue management

Tim Paek · David Maxwell Chickering

Lang Res Eval (2006) 40:47–66
DOI 10.1007/s10579-006-9008-2

ORIGINAL PAPER

Published online: 15 November 2006
© Springer Science+Business Media B.V. 2006
Abstract The goal of dialogue management in a spoken dialogue system is to take actions based on observations and inferred beliefs. To ensure that the actions optimize the performance or robustness of the system, researchers have turned to reinforcement learning methods to learn policies for action selection. To derive an optimal policy from data, the dynamics of the system is often represented as a Markov Decision Process (MDP), which assumes that the state of the dialogue depends only on the previous state and action. In this article, we investigate whether constraining the state space by the Markov assumption, especially when the structure of the state space may be unknown, truly affords the highest reward. In simulation experiments conducted in the context of a dialogue system for interacting with a speech-enabled web browser, models under the Markov assumption did not perform as well as an alternative model which classifies the total reward with accumulating features. We discuss the implications of the study as well as its limitations.
Keywords Spoken dialogue · Dialogue management · Markov assumption
1 Introduction
The goal of dialogue management in a spoken dialogue system is to take actions based on observations and inferred beliefs. Dialogue management plays a crucial role in the overall performance of the system because speech recognition is often quite poor, due to noisy or unexpected input. With robust dialogue management, the system can still take actions that maintain the task at hand. Unfortunately, coming up with a suitable set of dialogue management strategies is no easy task. Traditional methods typically involve authoring and tuning complicated hand-crafted rules that
T. Paek (✉) · D. M. Chickering
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
e-mail: [email protected]

D. M. Chickering
e-mail: [email protected]
require considerable deployment time and cost. Statistical methods, on the other hand, hold the promise of robust performance from models that can be trained on data and optimized, so long as the data is representative of what the dialogue system can expect to encounter during deployment (Young, 2000).
Among the more popular statistical methods, researchers have turned to reinforcement learning methods because it is possible to derive a policy for action selection. Given that the dynamics of the system is represented as a Markov Decision Process (MDP), which assumes that the state of the dialogue depends only on the previous state and action, this policy is guaranteed to be optimal with respect to the data. The Markov assumption is made as a modelling choice for the data. Hence, an important topic of inquiry is whether this choice is appropriate and beneficial.
In this article, we explore the Markov assumption on both theoretical and empirical grounds. In particular, we investigate whether constraining the state space by the Markov assumption truly affords the highest reward, especially when the structure of the state space may be unknown, which is typically the case. This article is organized as follows. In Sect. 2, we discuss the modelling assumptions relevant to spoken dialogue and provide relevant background on reinforcement learning applied to spoken dialogue management. In Sect. 3, we challenge the modelling assumptions by proposing alternative models to the MDP that vary the temporal relations among features. All competing models generate dialogue management strategies for interacting with a speech-enabled web browser, and we explain in detail how we built these models from data. In Sect. 4, we evaluate the performance of all the models in simulation experiments and assess the best performing model. Finally, in Sect. 5, we conclude with a discussion of the implications and limitations of the experimental study.
2 Background
Before discussing the assumptions underlying the MDP, it is important to consider the basic units of dialogue modelling; that is, what basic units form a dialogue process. Because all dialogue systems respond to user utterances, perhaps the simplest way to model the dynamics of the interaction is to divide the temporal process by user utterances. In other words, a dialogue turn begins at the start of each new user utterance. While alternative ways of measuring dialogue progression exist, such as question–answer pairs or contributions (Clark, 1996), they typically require knowledge about the type of utterance or action that was produced; for example, that an utterance was an uptake. For simplicity, we take the user utterance as the most basic unit of dialogue progression.
Given an utterance then, the most basic features that a system can observe before taking an action are those that pertain to the utterance itself. As such, we consider that at every turn, a dialogue system can observe at least the features that can be known about the current utterance at hand. In a state-based representation, the features of the current utterance would constitute the state space, and all state space variables would be indexed by the time in which the utterance occurred. In principle, state space variables can be engineered to aggregate observations arbitrarily far back in time. We consider such features later. For now, suppose that only the most basic information, that is, features of the current utterance, can be observed. We now discuss modelling assumptions that can be made on top of this basis.
2.1 Assessing assumptions
The MDP framework relies on several assumptions, not all of which may be valid in the context of spoken dialogue. Supposing for now a basic decomposition of dialogue progression, or the system dynamics, into user utterances, where states can be indexed by those utterances, as discussed above, the most obvious assumption is the Markov assumption, which declares that the current state of the dialogue depends only on the previous state and action. One reason for making the Markov assumption is that it allows the Bellman equations (Eq. 4 below) to exploit the Optimality Principle, which states that whatever the initial state may be, all remaining decisions must be optimal with regard to the state following the first decision (Bellman, 1957). This allows the optimal policy (Eq. 5) to be solved efficiently using dynamic programming. Furthermore, by maintaining just local dependencies between the current state and previous state for potentially long sequences of actions, the system can benefit from having just those local parameters to estimate. However practical these reasons may be, whether or not a model constrained by the Markov assumption yields the highest reward as compared to models constrained by other assumptions is still an empirical question, one which we investigate later.
From a linguistic perspective, it seems counter-intuitive to believe, as the Optimality Principle implies, that an optimal policy based just on the previous turn (i.e., the features of the previous utterance) provides as good a policy as that based on the full history of interaction. After all, most linguists acknowledge that in a conversation, participants collaboratively build up shared knowledge about what has been said and mutually understood (Clark, 1996). This shared knowledge, or common ground, is cumulative in nature and underlies all future interactions.
A response to this criticism is to argue that if aspects of history are important for making future decisions, they could be incorporated with states that summarize what has been learned so far; that is, summary states that are not time-indexed but cumulative. However, this argument merely avoids the problem by adding additional assumptions, this time relating to what variables should be included in the state space. Most policy-guided dialogue systems specify the state space up front, delineating all state variables that are assumed to be relevant for receiving a reward. These variables are defined and restricted so as to not only facilitate the Markov assumption, but also expedite tractable inference. Unfortunately, in practice, most of the time dialogue designers do not know in advance what variables should be included in the state space. In the next section, we discuss what a dialogue designer could do in such a situation. For now, it is enough to say that if possible, we should like to build models that rely on as few assumptions as necessary.
Finally, another assumption underlying the MDP is that the probabilities of making state transitions or receiving specific rewards do not change over time; that is, they are stationary. For dialogue systems that provide services across a large population of users, the stationary assumption may indeed hold because individual differences are generalized. However, for dialogue systems that provide services to a limited number of users, it is not unreasonable to believe that people may change their preferences about how they want the system to behave around them over time. If unobservable states such as user frustration are included in the model, they may change over time as well. In such cases, it is incumbent upon the system to continually adapt its policy. In Chickering and Paek (2006), we discuss how a
dialogue system could adapt its policy in real-time to a particular user through online feedback.
2.2 MDP framework
Reinforcement learning addresses the problem of how an agent should act in dynamic environments so as to maximize a scalar reward signal (Sutton & Barto, 1998). This problem is manifest in spoken dialogue systems because the system must take sequential actions based on its observations, such as user utterances, and its beliefs. A central debate in the literature concerns the use of models. Model-free approaches do not explicitly represent the dynamics of the environment, but instead directly approximate a value function that measures the desirability of each environment state. These approaches offer near-optimal solutions that depend on systematic exploration of all actions in all states (Watkins & Dayan, 1992). On the other hand, model-based approaches explicitly represent a model of the dynamics of the environment to compute an estimate of the expected value of each action. With a model, the agent can reduce the number of steps to learn a policy by simulating the effects of its actions at various states (Sutton & Barto, 1998). Perhaps for this reason, and for the fact that it is possible to derive a policy that is guaranteed to be optimal with respect to the data, spoken dialogue researchers have by and large pursued model-based reinforcement learning methods (see e.g., Levin, Pieraccini, & Eckert, 1998; Singh, Litman, Kearns, & Walker, 2002; Williams, Poupart, & Young, 2005).
The framework underlying model-based reinforcement learning is that of the MDP, which can be characterized by a tuple (S, A, P, R) with:
A state space S with states s ∈ S. The state space may consist of features related to spoken utterances, and so forth. We discuss this further in the next section.
An action space A with actions a ∈ A. The action space comprises all system actions in dialogue management, such as confirming various slots, or engaging in a user-requested service.
Unknown state transition probabilities P : S × A × S → [0, 1], where P(S_{t+1} | S_t, A_t) gives the probability of a transition from a state s ∈ S and action a ∈ A at time slice t to another state s′ ∈ S in the next time slice. The distribution P defines the dynamics of the environment, and constitutes the formal basis for the Markov assumption.
A reward function R : S × A → ℝ, where R_t = R(S_t, A_t) assigns an immediate reward at time slice t for taking action a ∈ A in state s ∈ S. R plays a critical role in the policy that is learned for dialogue management, as we discuss further below.
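To make this tuple concrete, the sketch below shows one minimal way a finite-horizon MDP of this kind could be represented in code. The class and field names are illustrative assumptions, not part of the system described in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = str    # e.g., a discretized description of the observed utterance features
Action = str   # e.g., "DoTop", "Confirm", "Repeat", "Ignore", "Bail"

@dataclass
class FiniteHorizonMDP:
    states: List[State]                    # the state space S
    actions: List[Action]                  # the action space A
    horizon: int                           # number of decision steps (turns)
    # transitions[(s, a)] maps each next state s' to P(s' | s, a)
    transitions: Dict[Tuple[State, Action], Dict[State, float]] = field(default_factory=dict)
    # rewards[(s, a)] is the immediate reward R(s, a)
    rewards: Dict[Tuple[State, Action], float] = field(default_factory=dict)
```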
In order for a dialogue system to take actions according to the MDP, it is necessary to be able to derive a policy π : S → A mapping states to actions so that it maximizes some specified objective function for the long-term reward. How much of the future the system takes into account in making its decisions at any given moment depends upon the specified horizon for the objective function. Perhaps the simplest objective function is the total reward over a finite horizon, which specifies that at any given time t, the system should optimize its expected reward for the next h steps
E\left[ \sum_{q=t}^{t+h} R_q \right] \qquad (1)
Alternatively, the system can take the infinite long-term reward into account with future rewards geometrically discounted by a discount factor γ
E\left[ \sum_{q=t}^{\infty} \gamma^{q} R_q \right] \qquad (2)
where 0 ≤ γ < 1 for computability purposes. The discount factor sets the present value of future rewards; so, for example, if γ = 0, the system acts myopically and tries to maximize immediate rewards, but as γ approaches 1, the system becomes more farsighted (Sutton & Barto, 1998). Although it is theoretically possible for a spoken dialogue to continue ad infinitum, most systems are designed to avoid infinite regressions where, for example, the system engages in the same repair over and over (e.g., "I'm sorry, can you repeat that?"). In practice, most dialogues (and especially task-oriented dialogues) are finite horizon, given that oftentimes growing user frustration ultimately leads to the termination of the interaction.
In place of (1) and (2) above, the objective function can also be based on post-hoc measures such as usability scores (Singh et al., 2002; Walker, Passonneau, & Boland, 2001b), and construed to reflect whatever qualities a dialogue designer may want the system to possess, such as the ability to re-tool the system for future use (Walker et al., 2001a). In short, the assignment of the reward function reflects the desired behaviour of the system.
For the rest of this paper, we confine our discussion to the finite horizon MDP, where we assume for simplicity that all variables in the state space can be fully observed by the system. Although the infinite horizon MDP is more commonly used for spoken dialogue systems, we focus on the finite horizon for two reasons: first, because our domain task imposes a restriction on the length of the dialogue, as we discuss in Sect. 3.1, and second, because our representation of a finite horizon MDP as an influence diagram allows us to learn which state variables are associated with an immediate reward for each time step, as we discuss in Sect. 3. When state variables are included that are not fully observable, such as the user's intention in producing an utterance, the dialogue constitutes a Partially Observable MDP (POMDP) (see e.g., Paek & Horvitz, 2000; Roy, Pineau, & Thrun, 2000; Williams & Young, 2005; Zhang, Cai, Mao, & Guo, 2001). The POMDP also employs the Markov assumption.
Building on (1), an optimal policy can be learned through various algorithms that involve finding the optimal value function:
V^*_t(s_t) = \max_{\pi} E\left[ \sum_{q=t}^{t+h} R_q \right] \qquad (3)

where the optimal value of a state s is the expected reward for the next h steps, if the system starts in s at time t and executes the optimal policy π*. The optimal value function (3) is unique and can be defined recursively using the Bellman equations

V^*_t(s_t) = \max_{a_t}\left[ R_t + \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\, V^*_{t+1}(s_{t+1}) \right] \qquad (4)

where the value of a state s at time t is the expected immediate reward plus the expected value of the next state at time t + 1 using the best possible action.
The simultaneous equations engendered by (4) can be solved efficiently with dynamic programming. Given the optimal value function, the optimal policy is simply

\pi^*_t(s_t) = \arg\max_{a_t}\left[ R_t + \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t)\, V^*_{t+1}(s_{t+1}) \right] \qquad (5)
2.3 Influence diagram
The finite horizon MDP can be viewed as a special case of an influence diagram, a more general framework for graphical modelling that facilitates decision-theoretic optimization. An influence diagram is a directed acyclic graph composed of three types of nodes: chance nodes, decision nodes and value nodes. The nodes correspond to variables which can be constants, probabilities, decisions, or objectives. The influence diagram also contains a single utility node that is a deterministic function of all the value nodes. Connecting the nodes are two types of arcs: probabilistic arcs and informational arcs. Arcs pointing into chance or value nodes specify a probabilistic dependency between a child and its parents. Arcs pointing into a decision node are informational in that the parents of the decision node are assumed to be known or observed before a decision is made. Similar to the more commonly known Bayesian network model, which is essentially an influence diagram without decision and value nodes, detailed data about the variables are stored within the nodes, so the graph is compact and visually represents the relationship among variables (see Shachter, 1988 for examples). Although the traditional definition of an influence diagram (Howard & Matheson, 1981) permits only one value or utility node, our use of multiple value nodes is simply a way of factoring the utility function and has been used by other researchers (Lauritzen & Nilsson, 2001; Tatman & Shachter, 1990).
Fig. 1 An influence-diagram model of a finite-horizon MDP. Informational arcs have been left out

Figure 1 displays an influence diagram for a finite horizon MDP where all states s ∈ S have been mapped to chance nodes, all actions a ∈ A to decision nodes, and all r ∈ R to value nodes expressing the immediate reward for taking action a in state s at time t. As shown in the figure, the unknown state transition probabilities P are
visually represented in the arcs going into S_{t+1} from S_t and A_t in the top left-hand corner. One of the three reward functions R : S × A → ℝ is represented in the top right-hand corner with the arcs going from S_h and A_h to the value node R_h. Finally, a utility node at the bottom of the figure sums all the immediate rewards as in Eq. 3.
Technically, because the MDP is fully observable at any given time slice, informational arcs point into each decision node from the previous time slice, though we have left them out to reduce clutter. The influence diagram also contains a set of parameters Θ that characterize the conditional distributions of the non-decision nodes, defined as
P(S, R \mid A, \Theta) = \prod_{X \in S \cup R} P(X \mid Pa(X), \Theta_X) \qquad (6)
where Pa(X) denotes the set of parents for node X, and Θ_X denotes the subset of parameters in Θ that define the local probability distribution of X. The transition probabilities P for the MDP are clearly subsumed by (6) and reside in the node S_{t+1}, as shown in the figure.
We discuss influence diagrams for two reasons. First, influence diagrams allow us to understand what kinds of alternative models we could experiment with in competition with the MDP, because the transition probabilities P could easily depend on state variables other than just those in the previous time slice, as we demonstrate in the next section. And second, influence diagrams provide a framework in which to address what state variables should be included in S at all if a dialogue designer is unsure about what variables may be important for receiving an immediate reward.
3 Alternative models
In the previous section, we discussed how the Markov assumption can be tied together with the selection of state space. Unfortunately, dialogue designers who want to use reinforcement learning for dialogue management typically do not know in advance what variables are relevant for receiving a reward and how they are related to each other: that is, the structure of the state space is unknown. Rather than choosing variables so as to facilitate the Markov assumption, we propose a different approach: include all variables that might be predictive of reinforcement and let the data decide which ones to include. This can be done using techniques for learning the parameters and structure of a Bayesian network, extended to influence diagrams (Chickering & Paek, 2006; Heckerman, 1995).
To derive the structure of the graphical models we describe below, including the MDP, we learned influence diagrams employing decision trees to encode local conditional distributions, using a tool that performs Bayesian structure search (Chickering, 2002). As described in Chickering, Heckerman, and Meek (1997), a heuristic search method is conducted over feasible probabilistic dependency models guided by a Bayesian score that ranks all candidate models, where the score is an estimate of the likelihood of the data given the proposed dependency structure. For each variable, the method constructs a decision tree containing multinomial distributions at each leaf for discrete variables, and Gaussian distributions at each leaf for continuous variables. In learning the influence diagrams, we only specified constraints on the graphical structure of the state space, such as the Markov
assumption, for which we wanted to conduct experiments. In particular, we built competing models to the MDP that vary the nature of the temporal relations between features. We did not assume that the state space was known beforehand, and as such, we included all variables that we were able to log from interacting with a command-and-control, speech-enabled web browser. We now describe our data collection procedure and the models we built.
3.1 Data collection
The data from which we built all models was for spoken dialogue interaction with a speech-enabled web browser. As we describe in Chickering and Paek (2006), the data was generated using a simulation environment, where all possible system actions to a user command were systematically explored. The simulation proceeded as follows. First, we randomly selected a command from the command-and-control grammar for the browser (e.g., "go back", "go forward", "go to link x"). Using state-of-the-art text-to-speech (TTS) generation, we produced an utterance for the command, varying all controllable TTS parameters: in particular, the TTS engine (5 engines), pitch (6 levels), volume (6 levels), and speaking rate (6 levels). In Sect. 4, where we evaluate the models, we describe how we selected specific configurations of those parameters as test voices for the experiment. Because we were interested in building models that were robust to noise, we included empty commands and added various types of background noise to see if a model could learn to ignore spurious commands. The produced utterance was then recognized by a Microsoft Speech API (SAPI) recognition engine, whereupon we logged all possible SAPI events. These events, and functions of these events, constituted the feature set. Many of the features are listed in the appendix. The features roughly fell into the following three broad categories:
1. Within-utterance ASR (Automatic Speech Recognition) features: Features pertaining to a single utterance, such as the number of hypotheses in an n-best list of variable length, the mean of the confidence scores, etc.
2. Between-utterance ASR features: Features pertaining to matches across utterances, such as whether the current top command in the n-best list matched the previous top commands, etc.
3. Dialogue features: Features pertaining to the overall dialogue, such as the number of repairs so far, whether the system has engaged in a confirmation yet, etc.
It is important to note that both the between-utterance ASR features and dialogue features span multiple time slices, and as such, are not Markovian per se. By including these features, we decided to leave open the possibility that summary variables may very well be relevant for receiving a reward. In building the models, we let the learning algorithm decide whether to include these variables in its decision trees.
Once the produced utterance was recognized and events recorded, the simulation took a random action. For the first utterance, the simulation could either execute the most likely command in the n-best list (DoTop), confirm among the top three choices while giving the option that it may not be any of them (Confirm), ignore the utterance as spurious (Ignore), or ask for a repetition (Repeat). For the second
utterance, only DoTop, Confirm, and Repeat were possible, and for the third utterance, only DoTop and Bail, an action which makes an apology and terminates the dialogue, were possible. We did not allow the interaction to extend beyond the third utterance given the typically low tolerance users have in command-and-control settings for extended repairs.
If the simulation selected DoTop, Ignore, or Bail, the session finished and the simulation could now be rewarded. For taking the action DoTop, if the result of executing the most likely command matched the true command (i.e., the one sampled from the grammar), the simulation received a positive scalar reward (+100); otherwise, it received a negative reward (−100). For taking the action Ignore, if the true command was empty, the simulation was awarded +100; otherwise, it received a penalty of −100. For Bail, the simulation always received a penalty of −100. If either of the repair actions Confirm or Repeat was selected, a smaller penalty (−75) was incurred, and the simulation proceeded to produce a confirmation choice that matched the true command or a repetition of the previous utterance, respectively.
In noisy environments, the n-best list frequently failed to include the true command. In order to handle this type of situation, as stated previously, we included with the top three choices the option that it might not be any of them, so that even if the true command was absent from the n-best list, the system could still gracefully conclude that it did not hear the appropriate command. In such a situation, if the system correctly guessed "None of the above", it received the usual scalar reward (+100) plus the penalty (−75) for each of the one or more repairs it took to get there.
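The reward scheme just described can be summarized in a small function. This is our reconstruction from the text (the values +100, −100, and −75 are as stated); the function and argument names are hypothetical, and the "None of the above" confirmation outcome is omitted for brevity.

```python
def immediate_reward(action, executed_command, true_command):
    """Immediate reward for one system action in the browser simulation.
    executed_command is the top command the system would carry out (for DoTop);
    true_command is the command sampled from the grammar ("" for an empty/spurious one)."""
    if action in ("Confirm", "Repeat"):
        return -75.0                                     # repair actions incur a smaller penalty
    if action == "Bail":
        return -100.0                                    # apologizing and terminating is always penalized
    if action == "Ignore":
        return 100.0 if true_command == "" else -100.0   # correct only if the command was spurious
    if action == "DoTop":
        return 100.0 if executed_command == true_command else -100.0
    raise ValueError(f"unknown action: {action}")

# The total reward of a session is the sum of these immediate rewards, e.g.
# Repeat (-75) followed by a correct DoTop (+100) yields a total of +25.
```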
In summary, by sampling a command from the grammar and then responding to it with random actions until the dialogue finished and rewards were assigned, the simulation environment amassed data that could be used to learn which features and actions were most relevant for receiving a reward.
3.2 Types of models
In order to validate empirically whether constraining the state space by the Markov assumption, especially when the structure of the state space may be unknown, truly affords the highest reward, we built three types of alternative influence diagrams that vary in the extent of their temporal dependencies, using the simulation data.
Figure 2 displays all the models we learned. To make room for visual comparison, we have left out the utility node in the zero, first, and second order models. Highlighted in the figure is the MDP, which is the same as Fig. 1. The MDP has a first order Markov dependency; hence, S2 depends on S1, and S3 depends on S2. If S3 also depends on S1, which occurs two time slices in the past, then we have a second order Markov model, as shown in the top of the figure. In preparing the data for the second order model, we added between-utterance ASR features such as where the top command in the third slice occurred in the n-best lists of the second and first time slices. On the other hand, if no state St depends on the previous state St−1, then we have no Markov dependency. We call this a zero order model.
3.2.1 Total reward model
As an altogether different approach, we built a total reward model for each time slice where state features accumulate with each successive time slice, as shown in the
Fig. 2 Alternative models to the MDP that vary along the Markov scale. The utility node in the zero, first, and second order models has been left out, as have the informational arcs in the first and second order models
bottom of the figure, and the local system actions optimize the total utility function, as opposed to the immediate reward function. Unlike the Markov models, which use inference in order to estimate the policy, as we discuss later, the total reward model simply predicts the total reward given all observable features. This kind of supervised learning model had been applied previously in the call-routing domain to transfer calls so as to minimize caller time and operational expense (Paek & Horvitz, 2004).
When building a model to predict the total reward at a particular time step t, it is important to limit the training data to those instances that correspond to optimal behaviour at future time steps. To understand this point, recall that we generated the training data by randomly performing actions. If we used all of this training data to learn a model for the total reward at time t, this would give us the optimal action assuming that we were to act randomly after this point. To take care of this issue, we work backwards through time, building first the model for the last time step. Then, before learning the model for time step t, we remove from the training data all instances for which the (random) action at time step t + 1 does not match the optimal action according to the model for time step t + 1. As a result, we get a model for the total reward at each time step under the assumption that we act optimally after that point.
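A sketch of this backward filtering procedure is given below. The session record layout and the `learn_model` callback (standing in for the decision-tree learner) are illustrative assumptions; the point is only the order of operations: learn the last step first, then prune sessions whose random action at a later step disagrees with that step's learned model before learning the earlier step.

```python
def build_total_reward_models(sessions, n_steps, learn_model):
    """Build one total-reward model per time step, working backwards through time.
    Each session is assumed to be a dict with keys "features" (accumulated features per step),
    "actions" (the random action taken at each step), and "total_reward"."""
    models = [None] * n_steps
    surviving = list(sessions)
    for t in reversed(range(n_steps)):
        training = [s for s in surviving if len(s["actions"]) > t]   # sessions that reached step t
        models[t] = learn_model(
            features=[s["features"][t] for s in training],
            actions=[s["actions"][t] for s in training],
            total_rewards=[s["total_reward"] for s in training],
        )
        # Before learning the model for step t - 1, drop sessions whose (random) action at
        # step t does not match the optimal action according to the model just learned.
        surviving = [s for s in surviving
                     if len(s["actions"]) <= t
                     or s["actions"][t] == models[t].optimal_action(s["features"][t])]
    return models
```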
In effect, the total reward model falls under the rubric of a model-free approach in that it does not explicitly represent the dynamics of the environment, but rather directly predicts the long-term reward. Fundamentally, the total reward model attempts to classify the total reward, which, given the small state space for our domain, was discrete. In particular, there were only five possible total reward
outcomes, and as such, we could treat the total reward as a discrete variable. In other domains, where the number of distinct rewards possible may be large, we can learn a regression instead of a classification.
3.3 Model building
Building each model entailed piecewise construction of portions of the model by the time slice variables involved. For those models involving just the first time slice variables, we had over 20,000 simulation sessions as training data. For those models involving the first and second time slices, we had over 10,000 sessions. And finally, for those models involving all three time slices, we had over 5,000 sessions.
Table 1 summarizes the complexity of the models as a function of the number of splits in the decision trees that make up the local probability distributions, and the number of state space variables. For the first and second order Markov models we also built limited versions, where the state space variables were limited to those that exhibited probabilistic dependence on the immediate reward. Note that some state variables only depended on other state variables. This had the effect of reducing the complexity of each respective model by a factor of 6 for the number of splits, and 4 for the number of state variables. The simplest model by far was the total reward model, which had about 30 times fewer splits than any of the unlimited Markov models, and more than 4 times fewer variables than the MDP model. Although the total reward model had almost the same number of variables as the limited Markov models, the variables included were quite different. We discuss some of the differences later.
Figure 3 shows the limited MDP learned from the simulation data. Explanations for all of the state variables can be found in the appendix. The decision nodes contain only the actions that were available at a particular time slice in the simulation, as explained in Sect. 3.1. For the first time slice, no dependencies between state variables are necessary because all features of the first utterance are fully observed. Note that each time slice has a number of features related to the n-best list confidence scores. This is not surprising given that most hand-crafted dialogue management strategies utilize some kind of confidence threshold for taking actions (e.g., "Do the top recognized command if its confidence is greater than 95%"). The mean confidence score in the n-best list appears in every time slice, suggesting that it pays off to build features that aggregate the confidence scores. In fact, many of the features in the first and second time slices of the limited MDP are aggregate features of the n-best list, such as the sum of all confidence scores (Score Sum), the range of score values (Score Range), and whether all the commands in the list were the same even though the actual phrases or wording were different (Cmds All Same).
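To illustrate the kind of aggregate n-best-list features that keep appearing in the learned models, the sketch below computes a few of them (Score Count, Score Sum, Score Mean, Score Range, Cmds All Same) from a list of (command, confidence) hypotheses; the function name and dictionary keys are ours, not the system's.

```python
from statistics import mean

def nbest_aggregate_features(nbest):
    """Aggregate within-utterance features over an n-best list of (command, confidence) pairs."""
    scores = [score for _, score in nbest]
    commands = [cmd for cmd, _ in nbest]
    return {
        "score_count": len(scores),
        "score_sum": sum(scores),
        "score_mean": mean(scores) if scores else 0.0,
        "score_range": (max(scores) - min(scores)) if scores else 0.0,
        # True when every hypothesis maps to the same command, even if the wording differs.
        "cmds_all_same": len(set(commands)) == 1 if commands else False,
    }

# Example: a three-hypothesis n-best list whose top command is "go back".
features = nbest_aggregate_features([("BACK", 0.91), ("BACK", 0.62), ("GO_TO_LINK", 0.40)])
```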
Table 1 Complexity of the models learned as a function of the number of splits in the decision trees and the number of state space variables
MODEL                      No. splits    No. state space variables
0th order                  903           80
1st order                  1096          90
1st order (limited)        183           20
2nd order                  1133          90
2nd order (limited)        190           20
Total reward (combined)    28            16
Fig. 3 A finite horizon MDP learned from data, where the state space was constrained to include only those variables that directly predicted the immediate reward for each time slice, in addition to the Markov assumption. The utility node has been left out, as well as the informational arcs
The MDP in Fig. 3 also shows that all time slices include a state variable related to the particular command that was observed, such as the first or top command in the n-best list. This indicates that the model is domain-dependent; that is, the commands specific to this application make a difference in whether or not the system receives a reward. For example, if we look at the decision tree for the value node of the second time slice, we would see that if the top command in the n-best list is "hide numbers" (i.e., numbered hyperlinks) after the system has asked for a repeat in the first time slice, and the system decides in this current time slice to execute this top command, the probability that it would receive an immediate negative reward for failure would be 90%. All the models we learned from the data were domain-dependent in this way.
Some of the variables that were not included in the limited MDP, but were included in the unlimited version, were most of the between-utterance ASR features. The notable exception that does appear in the limited version is the variable Top Cmd Same As Previous which checks to see if the top command between the current time slice and previous time slice are the same. In general, most of the state variables included in the limited models were within-utterance ASR features.
3.4 Policy estimation
To derive an optimal policy from data involves a two-step process: first, learning an optimal model with respect to the data, and then, learning a policy with respect to the model. As discussed previously, we learned the models from the simulation data using Bayesian structure learning. In order to derive the optimal policy, the straightforward approach is to exploit the Markov assumption, at least for the MDP, and apply dynamic programming in the usual way. However, computing the policy for a state space that includes more than a handful of variables can be quite challenging, even for the limited model in Fig. 3. This has prompted researchers to investigate techniques for factoring the state space using graphical models, as we have in our models (Meuleau et al., 1998). An alternative approach for computing the policy, which we utilized, is to use forward sampling to approximate the dynamic programming solution for the MDP. Leveraging the learned graphical models as a
generative model for sampling, similar to Kearns, Mansour, and Ng (1999), we can approximate the expectation in (3) by sampling N random states from the next time slice so that the optimal value function becomes
V^*_t(s_t) = \max_{a_t}\left[ R_t + \frac{1}{N} \sum_{s_{t+1} \in S} V^*_{t+1}(s_{t+1}) \right] \qquad (7)
In particular, to determine the value function at time t, the algorithm works by, for each action a_t, setting the action node A_t = a_t and then repeatedly sampling the chance nodes for time step t + 1, recursively solving for the best action in time step t + 1. For each sample, we examine the total utility, and we choose the action a_t that maximizes the average total utility among the samples. Only a few hundred samples were needed for the algorithm to converge. This technique allowed us to compute the action with the greatest expected reward at any given time slice in the same way for every Markov model so that we could more easily compare and evaluate them.
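A minimal sketch of this sampling procedure, under our own naming, is shown below. It assumes helper functions `sample_next_state(s, a, t)` (drawing s_{t+1} from the learned model), `immediate_reward(s, a, t)`, and `actions_at(t)` (the actions available at a time slice); for longer horizons the number of samples would have to shrink at deeper slices to keep the recursion tractable, as in the sparse sampling of Kearns et al. (1999).

```python
def sampled_value(state, t, horizon, actions_at, immediate_reward, sample_next_state,
                  n_samples=200):
    """Approximate the optimal value of `state` at time slice t by forward sampling (Eq. 7).
    Returns (estimated value, best action)."""
    if t >= horizon:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions_at(t):                               # candidate actions at this time slice
        total_future = 0.0
        for _ in range(n_samples):
            s_next = sample_next_state(state, a, t)       # draw s_{t+1} from the learned model
            future_value, _ = sampled_value(s_next, t + 1, horizon, actions_at,
                                            immediate_reward, sample_next_state, n_samples)
            total_future += future_value
        value = immediate_reward(state, a, t) + total_future / n_samples
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```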
Note that for the total reward model, because the model directly predicts the total reward, we do not need to use sampling for inference as it contains its own policy for each time slice in the decision tree for the utility node.
4 Evaluation
Because it is often difficult to know in advance what variables in the state space are relevant to receiving a reward, we learned both the structure of the state space and the policy for all the models discussed in the previous section. The only factor that we varied was the set of constraints we put on the state space. In particular, we varied the Markov order from zero to two. We also either limited the model to just the state variables that predicted the immediate reward or not, and finally, we built three total reward models that each contain their own policy and combined them (we henceforth refer to the combination of the three models as simply the total reward model). All other factors being equal, the question remains as to whether or not constraining the state space by the Markov assumption truly affords the highest reward.
To answer this question, we used the same simulation environment that we described earlier for creating the training data to also generate test data. Instead of varying all controllable TTS parameters, we created 75 test voices by permuting the TTS engine (5 engines), pitch (3 levels), and speaking rate (3 levels). All 75 voices were screened for naturalness. In the simulation environment, for any utterance, different models could take different actions, and as a consequence, receive different responses. To maintain controlled experimental conditions, we generated utterances and features for the entire tree of all possible action sequences. That way we ensured that every model received the exact same features if they took the same action, which also meant that between-utterance features utilized the same previous utterance features. For each of the 75 voices, we simulated 1000 experimental trials, where a trial corresponded to the generated tree of all possible actions. Instead of varying the different kinds of background noise as in the training data, this time we fixed the background noise to the challenging condition of people chattering in a cafeteria.
We evaluated the models against the intuitive baseline of how any system would do if it followed the simple policy of always executing the top command in the n-best list. This baseline is meant to characterize the behaviour of a system that follows no dialogue management strategies. It simply commits to the top recognized result and never engages in repair. We call this baseline DoTop.
4.1 Results
Figure 4 displays the running average reward for all models compared to the baseline, and Fig. 5 the average total reward. Although seven lines should be represented, only four are visible because the first and second order Markov models, as well as their limited versions, performed exactly the same with respect to both their running average reward and their average total reward; that is, the 1st order, 1st order (limited), 2nd order, and 2nd order (limited) models were identical in performance. Hence, while the complexity of the models may have differed, they all took the same actions. This suggests that it is not worthwhile to constrain the state space beyond the first order Markov relation, nor is it worthwhile to include any more variables than those that directly predict the local immediate reward.
As far as the best performance is concerned, the total reward model outperformed all of its competitors. With respect to average total reward, the total reward model was significantly better than any of the identically performing Markov models (t(74) = 2.17, p = 0.03, two-tailed). It is important to consider that the total reward model achieved this result with just 28 splits in its decision trees, combined across the three time slices. The identically performing Markov models in turn achieved significantly more total reward on average than the baseline (t(74) = 2.79, p = 0.01, two-tailed). The fact that all of these models outperformed the baseline by a significant margin suggests that it is quite beneficial to engage in dialogue repair.
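The reported t(74) statistics are consistent with comparisons paired over the 75 test voices (74 degrees of freedom). A sketch of how such a comparison could be run with standard tools is shown below; the arrays are placeholders for per-voice average total rewards of two models, not the actual experimental values.

```python
import numpy as np
from scipy import stats

# Placeholder per-voice average total rewards (75 test voices each); not the paper's data.
total_reward_model = np.random.normal(loc=40, scale=20, size=75)
markov_models = np.random.normal(loc=30, scale=20, size=75)

# Paired two-tailed t-test across voices, yielding a statistic with 74 degrees of freedom.
t_stat, p_value = stats.ttest_rel(total_reward_model, markov_models)
print(f"t(74) = {t_stat:.2f}, p = {p_value:.3f}")
```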
Fig. 4 Running average reward over 75 test voices as a function of the experimental trial iteration. Each test voice went through 1000 experimental trials. The lines for the first order, first order (limited), second order, and second order (limited) models are overlaid because the models performed identically
Fig. 5 Average total reward for all learned models after 1000 experimental trials using 75 test voices. The error bars represent standard errors about the mean
Surprisingly, the zero order Markov model performed significantly worse than the baseline. The reason for this has to do with the two repair actions, Repeat and Confirm. Without knowing what action was previously selected, because dependencies were not permitted between time slices, the model assigned the same probability of receiving an immediate negative or positive reward to both Repeat and Confirm, thereby conflating the two. To get around this problem, separate models specifically for Repeat and Confirm would need to be learned, though it is doubtful that this would have changed the final result of the experiment.
4.2 Assessing the winner
Despite the relatively small complexity of the total reward model, it outperformed all other models. It is instructive to look at the state space variables that were included in the model for each time slice, as shown in Fig. 6. Although many of the state space variables for the total reward model also appear in the limited MDP
Fig. 6 The learned total reward model where state space variables accumulate over time and they all predict the total reward for the entire dialogue
displayed in Fig. 3, new variables come into play when predicting the total reward, as opposed to the immediate reward, such as the Number of Sound Ends in the first time slice, the Second Rule in the second time slice, and the Top Score in the third time slice. Furthermore, because features are accumulated, the state space for the total reward model in the third time slice contains variables from the previous time slices, such as the average confidence score for the n-best list in the second time slice, and, strangely, the confidence score of the third best result for the first time slice. Two summary features also appear in the third time slice: namely, whether the top commands between the first and second time slices are the same, and whether the top hypotheses between the second and third time slices are the same.
It is important to remember that the total reward model is model-free. It makes no assumptions about how features may be dependent on each other over time, but instead tries to predict the long-term value of actions. In fact, while model-free approaches such as Q-learning (Watkins & Dayan, 1992) still learn policies using a system of equations similar to (4), a policy for the total reward model is comparatively easy to learn: it amounts to learning a decision tree that classifies what the total reward is likely to be given all the features observed so far. Even if the total reward were continuous, learning a regression is relatively easy. Furthermore, during runtime, it requires no inference, unlike the Markov models.
5 Discussion
Dialogue designers considering the use of reinforcement learning methods to learn policies for action selection in dialogue management could benefit from the implications of this experimental study. If learning an MDP is the goal, and the designer is uncertain about what variables to include in the state space, our results suggest that an MDP that includes just those variables that are parents of the local immediate reward might perform just as well as a more complicated model with other variables that are not parents. In short, when we factored the state space in the experiment, only those variables relevant to receiving a reward mattered.
5.1 Implications for model selection
Assuming the designer is open to considering other types of models, the total reward model offers several advantages over the Markov models. First, the total reward model does not require inference; the policy is easily extracted from the decision tree for the utility node. Second, unlike the Markov models, the total reward model does not need to model the local rewards, which can be an advantage in cases where only a final total reward is obtained. Third, the total reward model may perform just as well as, if not better than, the Markov models. In our experiments, it indeed outperformed the Markov models, though we are not quick to generalize this result due to the limitations of the study. And finally, the total reward model makes fewer assumptions about the structure of the state space, which is quite appealing from a theoretical standpoint. It does not assume that an optimal policy based just on the previous turn is as good as a policy based on the full history of interaction. Because the full accumulation of knowledge is expressed in the model, it accords well with socio-linguistic sensibilities.
The downside of the total reward model is that for longer time horizons learning may be difficult because the number of state space variables can accumulate dramatically. Furthermore, there may be disadvantages to not having a model of the environment, though that is a topic of debate in the reinforcement learning community (Kaelbling, Littman, & Moore, 1996).
5.2 Limitations of the study
The above implications should be moderated by the limitations of the study. First and foremost, the domain in which we learned the models was quite small, with a relatively restricted command-and-control grammar and a small action space. We plan to extend the simulation and learning methods to larger domains, and it will be interesting to see if our results generalize. Second, although we assiduously delineated every feature we could think of for the state space, we may not have included the right set of features that could have allowed the MDP or any other model to outperform the winner. Good feature engineering has been shown to make a significant difference in other natural-language processing domains, so it is prudent to remain humble about the features we utilized. Finally, our results depend on the techniques we used for learning the structure of the state space. We are agnostic about how these results may have turned out with other model selection techniques.
6 Conclusion
Spoken dialogue systems that learn optimal policies for dialogue management have typically utilized the MDP framework. In so doing, they are committed to the Markov assumption as a modelling choice, and very often, to assumptions about the structure of the state space. In this paper, we explored whether this choice is appropriate and empirically beneficial. In particular, we investigated whether constraining the state space by the Markov assumption truly affords the highest reward, especially when the structure of the state space may be unknown, which is typically the case. We examined four models that vary in terms of the extent of their temporal dependencies and found that the total reward model, a model-free approach that predicts the total reward for each time slice with state space features accumulating over time, outperformed the MDP and all other models in terms of total and average reward. This model had the smallest complexity by far, made the fewest assumptions, and is relatively easy to compute. For finite horizons, the total reward model offers an attractive alternative to the MDP.
Appendix
The following list includes some of the features utilized for the models. The term "Cmd" in a feature name refers to the user command, and the term "Token" refers to the user utterance. For example, if the user says "go back", the Cmd will be BACK and the Token will be "go back".
A. Within-utterance ASR features (for each turn i)
1. {Top|Second|Third} Cmd (i): The command that occupies the {first|second|third} position in the top-n list.
2. {Top|Second|Third} token (i): Token that occupies the {first|second|third} position in the top-n list.
3. {Top|Second|Third} score (i): Confidence score that occupies the {first|second|third} position in the top-n list.
4. Number of false recognitions (i): Number of passes through SAPI's word lattice that fail to recognize a phrase.
5. Number of interference (i): Number of times SAPI raised an event indicating that recognition might have been compromised by a particular audio distortion type.
6. Most freq interference (i): Most common type of audio-distortion event type raised by SAPI.
7. Number of sound starts (i): Number of times any sound start point is detected.
8. Number of sound ends (i): Number of times any sound end point is detected.
9. Number of phrase starts (i): Number of times a phrase is detected from an audio stream.
10. Maximum redundant {Cmd|Token|Combined} (i): The cardinality of the most frequently occurring {command|token|both}.
11. Maximum number {Cmd|Token|Combined} Matches (i): The cardinality of distinct {command|token|both} that repeat.
12. Score count (i): Number of items in the n-best list.
13. Score sum (i): Sum of all the confidence scores.
14. Maximum score (i): Maximum confidence score.
15. Minimum score (i): Minimum confidence score.
16. Score range (i): Difference between the maximum and minimum confidence scores.
17. Score median (i): Median confidence score, if any.
18. Score mean (i): Arithmetic mean of the confidence scores.
19. Score geometric mean (i): Geometric mean of the confidence scores.
20. Greatest consecutive difference (i): Greatest difference between any two consecutive confidence scores, if there are two or more confidence scores.
21. Score variance (i): Variance of the confidence scores.
22. Score Stdev (i): Standard deviation of the confidence scores.
23. Score Stderr (i): Standard error of the confidence scores.
24. Score mode (i): Mode of the confidence scores.
25. Cmds all same (i): Whether the commands of the current n-best list are all the same, the top two are the same, or other.
B. Between-utterance ASR features (for i = 2 and i = 3)
1. Index of top Cmd in previous (i): The position of the current top command in the previous n-best list, if any.
2. Index of top Cmd in first slice (i): The position of the current top command in the first-turn n-best list, if any.
3. Top Cmd same as previous (i): Whether the current top command was the previous top command.
4. Index of top token in previous (i): The position of the current top token in the previous n-best list, if any.
5. Index of top token in first slice (i): The position of the current top token in the first-turn n-best list, if any.
6. Score more than previous (i): Whether the average confidence score is greater than the previous average confidence score.
7. Gap between top scores (i): Difference between the current top confidence score and the previous top confidence score.
8. Gap between top scores with first slice (i): Difference between the current top confidence score and the first-turn top confidence score.
C. Dialogue features
1. Turn: The current dialogue step.
2. Has Confirm: Whether or not a confirmation has been performed anytime in the previous turns.
3. Number of repairs so far (i): Number of repairs up to i.
4. Number of confirms so far (i): Number of confirmations up to i.
References
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Chickering, D. M. (2002). The WinMine Toolkit. Technical Report MSR-TR-2002-103, Microsoft Research, Redmond, WA.
Chickering, D. M., Heckerman, D., & Meek, C. (1997). A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of the thirteenth conference on uncertainty in artificial intelligence (pp. 80–89). Providence, RI: Morgan Kaufmann.
Chickering, D. M., & Paek, T. (2006). Personalizing influence diagrams: Applying online learning strategies to dialogue management. User Modeling and User-Adapted Interaction, to appear.
Clark, H. (1996). Using language. Cambridge: Cambridge University Press.
Heckerman, D. (1995). A Bayesian approach for learning causal networks. In S. Hanks & P. Besnard (Eds.), Proceedings of the eleventh conference on uncertainty in artificial intelligence (UAI) (pp. 285–295). Morgan Kaufmann.
Howard, R. A., & Matheson, J. (1981). Influence diagrams. In Readings on the principles and applications of decision analysis (Vol. II, pp. 721–762). Menlo Park, CA: Strategic Decisions Group.
Kaelbling, L. P., Littman, M. L., & Moore, A. P. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kearns, M. J., Mansour, Y., & Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov Decision Processes. In Proceedings of the international joint conference on artificial intelligence (IJCAI) (pp. 1324–1331).
Lauritzen, S. L., & Nilsson, D. (2001). Representing and solving decision problems with limited information. Management Science, 47(9), 1235–1251.
Levin, E., Pieraccini, R., & Eckert, W. (1998). Using Markov Decision Processes for learning dialogue strategies. IEEE Transactions on Speech and Audio Processing, 8, 11–23.
Meuleau, N., Hauskrecht, M., Kim, K.-E., Peshkin, L., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Solving very large weakly coupled Markov Decision Processes. In Proceedings of the fifteenth national conference on artificial intelligence and the tenth conference on innovative applications of artificial intelligence (AAAI/IAAI) (pp. 165–172).
Paek, T., & Horvitz, E. (2000). Conversation as action under uncertainty. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (UAI) (pp. 455–464).
Paek, T., & Horvitz, E. (2004). Optimizing automated call routing by integrating spoken dialog models with queuing models. In Proceedings of the human language technology conference/North American chapter of the association for computational linguistics annual meeting (HLT/NAACL) (pp. 41–48).
Roy, N., Pineau, J., & Thrun, S. (2000). Spoken dialogue management using probabilistic reasoning. In Proceedings of the conference of the association for computational linguistics (ACL) (pp. 93–100).
Shachter, R. D. (1988). Probabilistic inference and influence diagrams. Operations Research, 36(4), 589–604.
Singh, S., Litman, D., Kearns, M., & Walker, M. (2002). Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16, 105–133.
Sutton, R. & Barto, A. (1998). Reinforcement learning: An introduction. MIT Press.
Tatman, J. A., & Shachter, R. D. (1990). Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 20(2), 365–379.
Walker, M., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, K., Polifroni, B., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., & Whittaker, S. (2001a). DARPA Communicator dialog travel planning systems: The June 2000 data collection. In Proceedings of the European conference on speech communication and technology (Eurospeech) (pp. 1371–1374).
Walker, M., Passonneau, R., & Boland, J. (2001b). Quantitative and qualitative evaluation of DARPA Communicator spoken dialogue systems. In Proceedings of the conference of the association for computational linguistics (ACL) (pp. 515–522).
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 229–256.
Williams, J. D., Poupart, P., & Young, S. (2005). Factored partially observable Markov Decision Processes for dialogue management. In Proceedings of the 4th IJCAI workshop on knowledge and reasoning in practical dialogue systems (pp. 76–82).
Williams, J. D., & Young, S. (2005). Scaling up POMDPs for dialog management: The Summary POMDP method. In Proceedings of the IEEE workshop on automatic speech recognition and understanding (ASRU).
Young, S. (2000). Probabilistic methods in spoken dialogue systems. Philosophical Transactions of the Royal Society (Series A), 358(1769), 1389–1402.
Zhang, B., Cai, Q., Mao, J., & Guo, B. (2001). Planning and acting under uncertainty: A new model for spoken dialogue systems. In Proceedings of the seventeenth conference on uncertainty in artificial intelligence (pp. 572–579).