Lang Res Eval (2006) 40:25–46
DOI 10.1007/s10579-006-9007-3

ORIGINAL PAPER

Automatic induction of language model data for a spoken dialogue system

Chao Wang · Grace Chung · Stephanie Seneff

Published online: 8 November 2006
© Springer Science+Business Media B.V. 2006
Abstract In this paper, we address the issue of generating in-domain language model training data when little or no real user data are available. The two-stage approach taken begins with a data induction phase whereby linguistic constructs from out-of-domain sentences are harvested and integrated with artificially constructed in-domain phrases. After some syntactic and semantic filtering, a large corpus of synthetically assembled user utterances is induced. In the second stage, two sampling methods are explored to filter the synthetic corpus to achieve a desired probability distribution of the semantic content, both on the sentence level and on the class level. The first method utilizes user simulation technology, which obtains the probability model via an interplay between a probabilistic user model and the dialogue system. The second method synthesizes novel dialogue interactions from the raw data by modelling after a small set of dialogues produced by the developers during the course of system refinement. Evaluation is conducted on recognition performance in a restaurant information domain. We show that a partial match to usage-appropriate semantic content distribution can be achieved via user simulations. Furthermore, word error rate can be reduced when limited amounts of in-domain training data are augmented with synthetic data derived by our methods.
The research at MIT was supported by an industrial consortium supporting the MIT Oxygen Alliance. The research at CNRI was supported in part by SPAWAR SSC-SD. The content of this paper does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
C. Wang (✉) · S. Seneff
MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar Street, Cambridge, MA 02139, USA
e-mail: [email protected]

G. Chung
Corporation for National Research Initiatives, 1895 Preston White Drive, Suite 100, Reston, VA 22209, USA
e-mail: [email protected]

S. Seneff
e-mail: [email protected]
Keywords Language model · Spoken dialogue systems · User simulation · Example-based generation
1 Introduction
A mounting challenge in the building of any new spoken dialogue application is the collection of user data. Real user utterances are important for ensuring adequate coverage and countering sparse data problems, especially in the language modelling and natural language understanding components. To obtain an initial corpus, it is customary to conduct a Wizard-of-Oz data collection and/or solicit plausible inputs from potential users. This is usually followed by successive data collections, in parallel with iterative refinements on each dialogue system component. Such an approach tends to be costly, and more automated methods for obtaining data are critical for lowering barriers to deployment.
This paper presents a methodology for synthesizing language model training data tailored to a spoken dialogue query-based application. In our approach, we seek to build a corpus of training sentences which would realistically reflect those of user interactions with the dialogue system. Thus, the data must be similar in style to conversational speech, encompassing repairs and disfluencies, while also maximizing diversity and coverage in terms of syntactic constructions. Moreover, at the sentence level (e.g., different types of queries), and at the class level (e.g., within-class statistics), frequency distributions should closely approximate those of real user dialogues. We explore several realistic scenarios applicable at various phases of dialogue system development. During the initial stage of system development, there is typically no data from real users in the new domain. Our strategy is to use formal rules to generate user utterances, similar to Jurafsky et al. (1994) and Popovici and Baggia (1997), in which context-free grammars were used to generate user sentences. However, we emphasize that it is important to achieve appropriate frequency distributions corresponding to sentence and class level semantics, in addition to linguistic richness within sentences. We use user simulation technology to guide our sentence generation process (Chung, 2004), essentially shaping the distributions of the sentence corpus by database statistics and a probabilistic user model.
A second scenario is to exploit any existing out-of-domain real-user data of similar style, either collected from previously developed systems, or available from other resources such as the Linguistic Data Consortium. To this end, we have developed a two-stage approach in which a data induction stage first harvests a linguistically diverse corpus by transforming out-of-domain data (Chung, Seneff, & Wang, 2005), followed by a sampling stage to ensure proper frequency distributions (Wang, Chung, & Seneff, 2005). Our approach does not simply identify sentences in the secondary domain that are relevant to the new application, but exploits as much of the out-of-domain corpus as possible. Essentially, domain-specific portions of the sentences in the secondary domain are either substituted by artificially generated in-domain phrases, or translated into target domain phrases via formal rules. This process of sentence reconstruction yields a very large variety of patterns harvested from the previous domain, and relies critically on an intermediate step of extensive syntactic and semantic filtering to produce probable user sentences.
The set of transformed sentences aims to cover combinations of variations in both syntactic constructs and semantic content. However, it does not necessarily represent an appropriate distribution in terms of either syntax or semantics.
Hence, a second stage samples this over-generated corpus in order to form a final training set that better approximates the distributions of real application-specific dialogue interactions. Two different techniques have been implemented for sampling, termed user simulation and dialogue resynthesis, which can be applied individually or in tandem. The first technique does not require any in-domain data, and utilizes the same user simulation process as in the first scenario. The difference is that the user sentences are selected from the pool of transformed sentences, instead of being generated using formal rules. The second technique is applicable in the scenario where a small set of in-domain development data has been collected after an initial system is in place. In this method, new dialogues similar to the development data are synthesized, again by selecting sentences from the transformed out-of-domain corpus. Hence, we expand the linguistic richness of the development data while maintaining a similar dialogue content distribution. The sentence selection process in both sampling methods relies on an example-based generation (EBG) capability, leveraging previous work in example-based translation (Wang & Seneff, 2004).
The structure of the paper is as follows. Previous related work will be presented in Sect. 2. Section 3 outlines the overall approach of re-using out-of-domain sentences. The next two sections provide a detailed account of the component technologies. Section 4 covers the data induction phase. Three methods of obtaining in-domain data are introduced, followed by a description of the syntactic and semantic filtering steps. Strategies in modelling meta queries and spontaneous speech phenomena (filled pauses and non-speech events) are also discussed in this section. Section 5 covers the sampling phase. The EBG mechanism is first presented, followed by a description of how it is used in downsampling the over-generated raw corpus through user simulations or re-synthesis of development data (when available). Section 6 details recognition experiments in a restaurant information system, comparing performances corresponding to the scenarios discussed previously in this section. We end with conclusions in Sect. 7.
2 Related work
A recent trend in dialogue system development is a focus on minimizing the time and cost of developing a new dialogue system, particularly with respect to obtaining training data (Fabbrizio, Tur, & Hakkani-Tür, 2004; Feng, Bangalore, & Rahim, 2003; Fosler-Lussier & Kuo, 2001). But dialogue systems are better trained on large amounts of user data that properly characterize the user interactions (Bechet, Riccardi, & Hakkani-Tür, 2004). Generally, when very little training data are available, researchers have sought to obtain more by supplementing with alternative text sources such as the Web (Bulyko, Ostendorf, & Stolcke, 2003; Feng et al., 2003; Zhu & Rosenfeld, 2001). Some work has been directed towards selecting from an out-of-domain corpus based on some metric for relevance to the application domain (Bellagarda, 1998; Iyer & Ostendorf, 1999; Klakow, 2000). Alternatively, others have turned to language model adaptation, where the parameters of a smoothed language model trained from generic data are tuned based on in-domain observations (Bacchiani, Roark, & Saraclar, 2004; Bertoldi, Brugnara, Cettolo, Federico, & Giuliani, 2001; Rudnicky, 1995). Fabbrizio et al. (2004) address the bootstrapping of out-of-domain data by identifying classes of utterances that are either generic or re-usable in the new application. In the absence of any domain data, one common method is to run a (usually hand-coded) context-free grammar in generative mode (Fosler-Lussier & Kuo, 2001; Jurafsky et al., 1994; Popovici & Baggia, 1997).
Galescu, Ringger, and Allen (1998) propose combining such generated data with a language model whose back-off component is trained on out-of-domain data.
In contrast, our method assembles entirely new utterances by inserting artificially constructed in-domain phrases into templates from another unrelated domain. Furthermore, we believe that obtaining the appropriate frequency distribution of the semantic content by sampling through simulated dialogue interactions would produce higher quality data. Stochastically generated user simulations are increasingly being adopted to train dialogue systems (Levin, Pieraccini, & Eckert, 2000; Scheffler & Young, 2000), particularly for selecting and evaluating dialogue strategies (Araki & Doshita, 1996; Hone & Baber, 1995; Lin & Lee, 2001; López-Cózar, De la Torre, Segura, & Rubio, 2003). The method described here uses simulations as one method for pre-selecting training utterances to shape the training corpus statistics.
We use an EBG capability for sentence selection in the sampling phase. This idea is inspired by work done in the field of example-based translation, which typically requires a collection of pre-existing translation pairs and a retrieval mechanism to search the translation memory. Similarity can be based on parse trees (Sato, 1992), complete sentences (Veale & Way, 1997), or words and phrases (Brown, 1999; Levin et al., 2000). Our sentences are indexed with lean syntactic and semantic information, which is obtained automatically by exploiting existing parsing and generation capabilities developed for dialogue systems.
Our method also relates to the instance-based natural language generation work described in Varges and Mellish (2001). While both are for narrow domain applications, and both take semantic content as input, the approaches taken are very different. In Varges and Mellish (2001), examples (or instances) are used to re-rank and select candidates produced by a grammar-based generator, using cosine measure as the distance metric. Our method exploits the restrictiveness of the domain and the large candidate corpus. We directly generate the sentence by retrieving it from the example corpus, using formal rule-based generation for backup when the retrieval fails.
3 Approach
Figure 1 illustrates the multiple steps proposed in this paper. We begin with generating an initial seed corpus in our target domain; examples are given in a Boston restaurant information system. These domain data (13,000 sentences) were obtained by running the dialogue system in simulated user mode (Chung, 2004). The simulations utilized a stochastic user model that, based on the system reply frame, determined a user response, represented as a string of key-value (KV) pairs. From the KV representation, the system generated user utterances by way of formal generation rules (Baptist & Seneff, 2000). The technique of inducing data from first principles using formal generation will be outlined in Sect. 4.1.1.
Following the creation of a seed corpus, phrases extracted from these in-domain utterances, together with a previously collected flight reservation corpus of 31,000 utterances¹ (Seneff, 2002), undergo a transformation to yield synthetic sentences in the new domain.
¹ The transcripts of the flight domain speech data will be made available for research purposes. Check the authors' website at http://people.csail.mit.edu/wangc for updates.
Two specific methods for the transformation will be described: an automatic template generation and substitution method (Sect. 4.1.2), and a formal transformation method (Sect. 4.1.3). The resultant set of over-generated and artificially constructed sentences is successively reduced, first by selecting on legal syntactic parses, and then by filtering on semantic relationships. The subsequent steps then address the process of data sampling to refine the data distribution to better match the statistics expected in realistic user dialogue interactions. The resulting sampled data are then further enhanced with generic meta-level queries and with modelling of speech artefacts.
4 Domain data induction
In this section, we first describe three different methods for inducing synthetic corpora for a new domain. The first method involves formal generation from first principles, using a rule-based natural language generation system, and based on a simple semantic representation of the sentence contents.
Fig. 1 A schematic depicting successive steps towards the automatic induction of language model data with seed in-domain data from out-of-domain data. The seed data are synthetic, obtained exclusively through simulations
The other two methods involve transforming user queries directly from another source domain into queries appropriate for the new target domain: the first of these methods substitutes phrasal units from the target domain into utterances obtained from the source domain, whereas the second one utilizes formal generation rules to translate queries from one domain to the other. This section also describes how the data are filtered syntactically and semantically to remove ill-formed sentences, and how the data are augmented to cover meta queries and noise events.
4.1 Data generation
4.1.1 Formal generation method
The formal generation method works within a user simulation framework to generate in-domain sentences in the absence of any in-domain or out-of-domain data, as illustrated in Fig. 2. During simulation, the end-to-end dialogue system continuously operates in text mode with a user simulator, described in Chung (2004), and the formal rule-based generation system (Baptist & Seneff, 2000). The system response, encoded in a frame-based meaning representation known as the system reply frame, is used by the simulator to generate a next putative user query in the form of a KV string. The formal generation component converts the KV specification into a surface string using trivial generation rules crafted by hand. The generation system can support multiple patterns for a specific template, thus adding some variation to the generated surface strings. Table 1 shows an example simulated dialogue.
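The conversion from a KV specification to a surface string can be sketched minimally as follows; the rule table and clause names here are hypothetical placeholders, and the actual Genesis-II formalism (Baptist & Seneff, 2000) is considerably richer:

import random

# A minimal sketch of formal generation from a KV specification.
RULES = {
    # Multiple surface patterns per clause type add variation.
    "request_phone": [
        "what is the phone number of {restaurant}",
        "please tell me the telephone number of {restaurant}",
    ],
    "find_cuisine": [
        "show me {cuisine} restaurants in {city}",
        "are there any {cuisine} restaurants in {city}",
    ],
}

def generate(kv: dict) -> str:
    """Convert a KV specification into a surface string."""
    pattern = random.choice(RULES[kv["clause"]])
    slots = {k: v for k, v in kv.items() if k != "clause"}
    return pattern.format(**slots)

print(generate({"clause": "find_cuisine", "cuisine": "seafood", "city": "cambridge"}))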
A large corpus can be generated by running the simulation process over many thousands of dialogues. The corpus generated by the formal method typically contains very well-formed sentences; however, the linguistic richness is limited by the rules created by the developer. Data induced by the formal generation method will be useful as seed data for the transformation methods introduced in the next sections.
4.1.2 Template-based transformation method
The template-based transformation method aims to induce in-domain sentences from available out-of-domain corpora. Essentially, the objective of this method is to capitalize on the diverse syntactic constructions and spontaneous speech phenomena found in the out-of-domain data, replacing the domain-specific phrases with alternatives appropriate in the new application domain. This step will massively over-generate possible sentences, which will then be refined and pruned using various filtering methods.
Fig. 2 A schematic showing the process of generating synthetic data by the formal generation method from a KV representation using user simulation
Table 1 Example dialogue between a simulated user and the restaurant information dialogue system
User:   Okay help me please
System: I know about nearly one thousand restaurants in the Greater Boston area. I have information for over one hundred cities, predominantly Boston, Cambridge, Brookline, Somerville, and Newton
User:   Please give me the restaurant that serves seafood food
System: There are 41 Seafood restaurants. They are predominantly in the South End, North End, and Back Bay
User:   Find the cheapest restaurant
System: There are four inexpensive Seafood restaurants. Alamo Roast Beef and Seafood, Captain Quarters Seafood, Dolphin Seafood Too, and Grill Fish
User:   Well where is the second one?
System: Captain Quarters Seafood is located at 1871 Revere Beach Parkway in Everett
User:   Please show me the price range for Captain Quarters Seafood
System: The price range of Captain Quarters Seafood is inexpensive
User:   Can you please tell me the telephone number of Elbow Room in Brighton?
System: The phone number of the Elbow Room is 617-738-9990
In our experiments, we attempt to transform a large corpus of flight reservation sentences into the restaurant information domain. Each step of the transformation method is shown in Fig. 3. We take advantage of some seed target domain sentences obtained via the formal generation method described in Sect. 4.1.1. The seed restaurant sentences are parsed, and all the noun phrases (NPs) and prepositional phrases (PPs), in their various sequential orderings, are gathered under the non-terminal categories in which they occur. Similarly, the flight sentences are parsed, and the locations of NPs and PPs are replaced by non-terminal tags, yielding a set of templates. Some of the non-terminal categories in which NPs and PPs occur are: direct_object, subject, and predicate_adjective. By exhaustively substituting the phrases for each non-terminal category of the target domain into the templates, new artificial sentences incorporating the syntactic constructs of the flight domain and the semantic content of the restaurant domain are synthesized.
Fig. 3 A schematic showing steps towards generating synthetic data by substituting the NPs and PPs of one domain into the templates derived from a second domain
Figure 4 illustrates the process with an example. We reuse the same parsers developed for dialogue systems in our experiments. It is also possible to use shallow parsing to identify the NPs and PPs (Hammerton, Osborne, Armstrong, & Daelemans, 2002).
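The substitution step itself can be sketched as follows, assuming pre-extracted phrase lists and slot-tagged templates (the slot names and phrases are illustrative); the paper derives both from full parse trees rather than hand-listed inventories:

from itertools import product

# In-domain NPs/PPs harvested from seed restaurant sentences, grouped by
# the non-terminal category in which they occurred.
IN_DOMAIN_PHRASES = {
    "dir_object": ["the phone number", "the price range"],
    "subject": ["the nearest restaurant"],
}

# Out-of-domain (flight) sentences with NP/PP spans replaced by slot tags.
TEMPLATES = [
    "could you repeat <dir_object> please",
    "i mean i only want <dir_object>",
]

def synthesize(templates, phrases):
    """Exhaustively substitute in-domain phrases into each template slot."""
    for template in templates:
        slots = [tag for tag in phrases if f"<{tag}>" in template]
        if not slots:
            continue
        for combination in product(*(phrases[s] for s in slots)):
            sentence = template
            for slot, phrase in zip(slots, combination):
                sentence = sentence.replace(f"<{slot}>", phrase, 1)
            yield sentence

for sentence in synthesize(TEMPLATES, IN_DOMAIN_PHRASES):
    print(sentence)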
Our initial seed restaurant domain synthetic data yielded 6,800 examples of NPs and PPs, and the flight reservation domain yielded 1,000 unique sentence templates. Because of the vast number of combinations possible, we terminated the sentence generation procedure after randomly creating 450 k unique sentences. Some typical sentences from the original restaurant data and the example transformations are shown in Table 2. In comparison with the original artificial data, created from a rule-based method, the new synthetic data are richer, embodying more of the characteristic features of human-computer interactions found in real data.
Fig. 4 An example illustrating the transformation of a flight-domain sentence 'Could you repeat the departure time please?' into a restaurant-domain sentence 'Could you repeat the phone number please?' The phrase 'the departure time' is substituted by 'the phone number' in the dir_object slot of the template 'could you repeat <dir_object> please'
Table 2 Examples of how seed sentences in the target restaurant domain are transformed to richer synthetic sentences using templates of a source flight domain. NPs/PPs (italics) are shown (top box) in the original target domain data, (middle box) in the original source sentences from the flight domain, and (bottom box) slotted into the templates from the flight data
Seed sentences with embedded NPs/PPs:
1. Are there any Asian restaurants on Richmond street
2. Give me some information on Atasca
3. I would like cheap Mexican food
4. Give me the telephone number

Source sentences in flight domain:
1. Also list any flights from Atlanta to Boston on Thursday morning
2. Ok hi I'm looking for the cheapest flight from Atlanta to Boston
3. I mean i only want the arrival time
4. Say it again please

Newly synthesized sentences from templates:
1. Also list any Asian restaurants on Richmond street
2. Ok hi I'm looking for some information on Atasca
3. I mean i only want cheap Mexican food
4. Say the telephone number again please
In particular, this template-based approach is able to harvest domain-independent speech artefacts that are embedded within the domain-dependent queries. As a result, we found that the newly constructed data, compared with the seed data, encompass many more novel phrases that constitute repeats, repairs, greetings and corrections.
4.1.3 Formal transformation method
Another technique that is feasible for inducing sentences for a new application from a secondary domain is to develop formal generation rules which essentially perform a translation from one domain to another. The method we propose here reuses a machine translation capability for paraphrasing user queries from a second language back into English. The same set of generation rules that translates one language to another is now modified so that it replaces certain semantic concepts from the secondary (flight) domain with those of the new target domain (restaurant).
The language generation capability has some sophisticated mechanisms; for instance, it is possible to control generation from source and destination such that only one of them maps to in_city while the other maps to on_street or in_region. Thus we prevent an anomalous query that includes two references to cities. Any flight-domain predicates that are difficult to translate can simply be omitted from the generation rules. Some example query transformations through formal generation are shown in Table 3.
A disadvantage of the formal transformation method is that it requires manual expertise to develop the formal rules. However, this approach can generate novel sentence patterns that are unattainable by the template-based method. For example, the template-based transformation method ignores verb phrases, so that sentences containing restaurant-domain-specific verbs (e.g., 'eat') would be completely missing without the formal transformation method.
4.2 Syntactic and semantic filtering
While we have not quantitatively measured the similarity between the flight domain and the restaurant domain data, we do assume that the two applications, being quite different, do not share many common query types. Hence the methods described above are likely to generate many sentences that are not appropriate for the new application. For example, in the template-based method, we only replace NPs and PPs, thereby preserving the verb phrases of the flight domain. Thus extensive filtering is necessary to remove irrelevant or improbable sentences.
One obvious approach to filtering is based on syntactic constraints: we remove sentences that fail to produce full parse trees under the grammar for the new domain.
Table 3 Example transformations produced via formal generation rules translating flight domain utterances (F) into restaurant domain utterances (R)
F: What meals does the first flight serve?
R: What credit cards does the first restaurant offer?

F: Show me flights from Boston to Phoenix via Dallas
R: Show me restaurants in Chinatown in Boston that accept credit cards

F: I'd like to go to Denver
R: I would like to eat in Chinatown
As for removing unlikely semantic relationships, we have devised a method for filtering based on semantic constraints. The semantics of a sentence are encoded by using a parser to convert it into a hierarchical semantic frame. Each sentence maps to a clause type captured at the top level of the frame, and subsequent descendent sub-frames capture topic-predicate relationships. An example is shown in Table 4.
In the semantic filtering phase, the first training step is the compilation of all the topic-predicate relationships of the target domain, extracted from the semantic hierarchies. The second filtering step is the parsing and semantic frame construction of the new sentences, and deletion of those containing previously unrecorded topic-predicate relationships. The initial training step processes the original seed data using an algorithm that produces a single tree-like hierarchical frame, storing all observed vertical semantic relationships up to a predetermined depth. At three levels deep, all observed parent-child-grandchild relationships involving clause/topic/predicate sub-frames are documented in a reference frame. When the new synthetic sentences are parsed, the (parent-child-grandchild) sub-frame tuples from the semantics are compared with those in the reference frame. If a tuple has not been previously observed, the entire sentence is deleted from the training corpus.
Table 5 displays a portion of a reference frame, derived from the original seed corpus of seven thousand sentences. Although the trained reference frame is quite sparse in semantic relationships, this kind of filtering is a crude way to eliminate sentences with constructs from the flight domain that are not appropriate for the restaurant domain. Generally, novel subject-verb-object relationships that tend to be improbable or nonsensical are eliminated, whereas semantic relationships consistent with the seed data are preserved. This presumes that the seed training data have adequate coverage of the basic query types of the domain, although it does not necessitate semantic or syntactic diversity in the seed data.
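As a minimal sketch of the training and filtering steps, assuming semantic frames reduced to simple (name, children) nesting (the real frames carry clause/topic/predicate types and further annotations):

def frame_tuples(frame):
    """Collect all parent-child-grandchild name tuples from a nested frame."""
    name, children = frame
    for child in children:
        child_name, grandchildren = child
        for grandchild in grandchildren:
            yield (name, child_name, grandchild[0])
        yield from frame_tuples(child)

def build_reference(seed_frames):
    """Training step: record every tuple observed in the seed corpus."""
    reference = set()
    for frame in seed_frames:
        reference.update(frame_tuples(frame))
    return reference

def passes_filter(frame, reference):
    """Filtering step: reject a sentence if any of its tuples is unseen."""
    return all(t in reference for t in frame_tuples(frame))

# Toy example: the seed licenses "give -> phone_number", but nothing
# licenses the flight-derived "serve -> meal" relationship.
seed = [("request", [("give", [("phone_number", []), ("address", [])])])]
reference = build_reference(seed)
good = ("request", [("give", [("phone_number", [])])])
bad = ("request", [("serve", [("meal", [])])])
print(passes_filter(good, reference), passes_filter(bad, reference))  # True False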
Examples of filtered or rejected sentences are depicted in Table 6. Shown are the semantically malformed sentences that have been output from the template-based transformation method but have failed the semantic constraints imposed by the reference frame.
Table 4 Example semantic frame for the sentence 'describe any Asian restaurants on Richmond street'
{c request
  :pred {p describe
    :topic {q restaurant
      :pred {p pred_cuisine :topic asian }
      :pred {p on
        :topic {q street_name
          :name richmond
          :street_type street }}}}}
Table 5 A portion of the automatically derived reference frame (depth n = 3) that is used in semantic filtering
{c request
  :pred {p describe
    :topic {q restaurant .. }
    :topic {q pronoun .. } .. }
  :pred {p give
    :topic {q phone_number .. }
    :topic {q address .. } .. }
  :pred {p tell
    :pred {p indir .. }
    :topic {q price_range .. } ..}}
Shown are some relationships captured under the request clause
Table 6 Example synthetic sentences that fail the semantic filter
1. Is the phone number interested in a restaurant in Boston?
2. Do i read any restaurant?
3. Does the number get in Chinatown?
4. May i use any Chinese food?
5. What neighbourhood is the price range?
6. Does their street address make any Chinese food?
Originally, the sentences are induced by substituting restaurant NPs and PPs into flight domain templates
To counter sparsity in the seed data, the developer can enrich the filtered data by manually relaxing the semantic constraints. That is, in several iterative stages, some legitimate but novel semantic relationships are added to the reference frame so that more synthetic sentences would pass the semantic constraints. Ultimately, the 450 k synthetic sentences are reduced to 130 k sentences via syntactic and semantic filtering.
4.3 Further enhancements
4.3.1 Meta queries
There is a core component of all spoken dialogue systems that involves so-called meta queries, which include greetings (hello/good-bye), dialogue navigation requests such as start-over, scratch-that, and repeat, as well as help requests. There are a surprising number of different ways to say good-bye, for example, 'That's it for now, thank you very much!', and one would expect at least one good-bye query in each real-user dialogue. Rather than incorporating such activities into the simulated user model, we decided instead to simply harvest a set of 1,158 meta-query sentences from our previous data collection efforts in the flight and weather domains, and augment all of our simulated query corpora with these utterances.
4.3.2 Noise models
It is typically the case that so-called non-speech events are prevalent in spoken dialogue user queries. In our definition, these include the filled-pause words, such as 'um' and 'er', as well as laughter, coughs, and other forms of extraneous noise. We have developed a set of acoustic models that are specific to these kinds of sounds, and have painstakingly labelled them in our training corpora for the flight and weather domains. Careful modelling of these events, both acoustically and linguistically, can lead to significant improvements in speech recognition accuracy (Hazen, Hetherington, & Park, 2001).
Our parser removes such events from the text string before attempting to parse an utterance. As a consequence, they are omitted from the simulated restaurant-domain sentences that are transduced from a flight domain corpus, using either the formal transformation or the template-based method. Of course, they are also missing from the formally generated sentences in our original seed corpus. Hence, we sought a way to reintroduce them with an appropriate distribution over induced corpora.
123
36 Lang Res Eval (2006) 40:2546
Our approach was to develop a simple statistical model, described below, by examining a large corpus of manually transcribed flight domain queries we had previously collected from real users. Observing that non-speech events tend to be concentrated at the beginning and end of a sentence, we decided to compute unigram statistics for three positional specifications: beginning, middle, and end, where 'middle' means simply occurring anywhere in the utterance except the very beginning or the very end. These statistics were measured directly from the flight domain utterances. We also decided to collapse all such events into a single class, termed ⟨nonspeech event⟩, to simplify the modelling aspects. We then processed our generated corpus through a procedure which optionally inserted a ⟨nonspeech event⟩ at the beginning, exact middle, or end of an utterance, according to the computed unigram statistics for these three positions. Each inserted ⟨nonspeech event⟩, in turn, was instantiated as one of the following choices: ⟨um⟩, ⟨er⟩, ⟨laughter⟩, ⟨noise⟩, or ⟨cough⟩, according to the measured unigram statistics. There were also a certain number of utterances in the flight corpus that contained only a ⟨nonspeech event⟩, and we added a corresponding small percentage of such utterances to the synthetic corpus. Finally, a ⟨nonspeech event⟩ class was included in the class bigram to recapture the appropriate statistics from the enhanced corpus.
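The insertion procedure can be sketched as follows; the position and event probabilities shown are invented placeholders, whereas the paper estimates these unigram statistics from the transcribed flight-domain data:

import random

POSITION_PROB = {"beginning": 0.08, "middle": 0.03, "end": 0.05}
EVENTS = ["<um>", "<er>", "<laughter>", "<noise>", "<cough>"]
EVENT_WEIGHTS = [0.4, 0.3, 0.1, 0.15, 0.05]

def sample_event() -> str:
    """Instantiate a <nonspeech event> per the event unigram statistics."""
    return random.choices(EVENTS, weights=EVENT_WEIGHTS)[0]

def insert_nonspeech(utterance: str) -> str:
    """Optionally insert a non-speech event at the beginning, exact middle,
    or end of the utterance, per the positional unigram statistics."""
    words = utterance.split()
    for position, prob in POSITION_PROB.items():
        if random.random() < prob:
            if position == "beginning":
                words.insert(0, sample_event())
            elif position == "end":
                words.append(sample_event())
            else:
                words.insert(len(words) // 2, sample_event())
    return " ".join(words)

print(insert_nonspeech("what is the phone number of atasca"))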
5 Data sampling
Although the data generated as described above cover many variations in syntactic and semantic constructs, it is expected that the frequency distributions in the patterns will not reflect those found in real user data, because the data were not gathered in dialogue interaction. For any dialogue system, the proportions of query types, at the sentence level, will depend both on the functionality of the system and on user behaviour. Intuitively, the first sentence in the following example is more likely than the second one:
1. What is the telephone number?
2. Tell me Chinese restaurants on Massachusetts Avenue near Central Square in Cambridge that accept credit cards.
Moreover, the raw data do not encode appropriate within-class statistics, for instance, the lesser prior likelihood of querying for Burmese cuisine versus Chinese cuisine. To gather such statistics, the approach taken here is to reshape the training data by sampling from the raw corpus, utilizing dialogue-level information (Wang et al., 2005).
The sampling technology relies heavily on an example-based generation method for selecting semantically related sentences. There are two primary components: a collection of sentences indexed by lean syntactic and semantic information encoded as KV pairs, and a retrieval mechanism to select a candidate from the indexed corpus given a KV specification. Compiled from the raw synthetic data set, the indexed corpus is the pool of synthetic sentences from which we shall sub-sample. During retrieval, the selected candidate sentence can either be used directly, or further processed by substituting the values of certain keys. Two different configurations have been invoked for sampling, which we term user simulation and dialogue resynthesis. We will be applying both of these methods in our experiments.
In the following, we first describe the EBG component. We then describe the data sampling techniques, utilizing EBG for sentence selection.
5.1 Example-based generation
5.1.1 Generation of indexed corpus
The EBG begins with the construction of an indexed sentence corpus. Each candidate sentence is first parsed to yield a meaning representation called a semantic frame, which encodes the hierarchy of semantic and syntactic structure of the sentence. Then, a set of trivial generation rules is created to extract very lean semantic and syntactic information from the semantic frame as KV pairs, which can then be used as an index for that sentence. Figure 5 shows a typical group of such indexed sentences.
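A minimal sketch of the indexing step follows; kv_index() is a hypothetical stand-in for the parse-then-generate pipeline that reduces each sentence to its lean KV index:

from collections import defaultdict

def kv_index(sentence: str) -> tuple:
    """Map a sentence to a hashable KV index (toy keyword spotting here)."""
    kv = {}
    if "cheap" in sentence:
        kv["price_range"] = "cheap"
    if "chinese" in sentence:
        kv["cuisine"] = "chinese"
    return tuple(sorted(kv.items()))

def build_indexed_corpus(sentences):
    """Group sentences that share the same KV index, as in Fig. 5."""
    corpus = defaultdict(list)
    for sentence in sentences:
        corpus[kv_index(sentence)].append(sentence)
    return corpus

corpus = build_indexed_corpus([
    "cheap chinese restaurants please",
    "yes cheap chinese food",
])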
5.1.2 Retrieval mechanism
The basic function of the retrieval mechanism is to find a candidate sentence whose KV index matches the input KV specification. To allow certain flexibility in matching the KV pairs, keys are differentiated into several categories, depending on whether they are optional or obligatory, and whether they require matching on the key-only level or the KV level. These are specified in a header file in the indexed corpus, to allow a developer to flexibly modify the matching strategy. Each obligatory key in the input KV specification has to be accounted for in the matching process, while optional keys in the input can be ignored to avoid a matching failure (but will be preferred otherwise). If more than one group of sentences is retrieved, the selection pool includes all the groups.
We will illustrate the retrieval process with an example to highlight some of the distinctions among the different key types. Assume we want to retrieve from the indexed corpus a sentence similar to 'Do you know of any inexpensive French restaurants?' The parsing and generation systems will first produce the following KV pairs:
price_range: inexpensive
cuisine: french
clause: verify
Suppose the corpus contains only the example shown in Fig. 5, with price_range and cuisine as obligatory keys required to match on the key level, while clause is an optional key required to match on the KV level. If the system is configured to take the values of the retrieved sentence, the output could simply be 'cheap chinese restaurants please', or 'yes cheap chinese food'. If instead, the system is configured to substitute the values in the input KV, those two outputs would be 'inexpensive french restaurants please', and 'yes inexpensive french food', respectively. If the clause were specified as an obligatory key matching on the KV level, then the search would fail to generate any output. For an input such as 'french restaurants' (cuisine: french, clause: clarifier), the search would also fail because of the extra obligatory key, price_range, in the candidate's KV index.
{c eform
  :price_range "cheap"
  :cuisine "chinese"
  :clause "clarifier"
  :sentences
    ( "a cheap chinese restaurant"
      "a cheap restaurant that serves chinese food please"
      "cheap chinese restaurants please"
      "how about a cheap chinese restaurant"
      "yes cheap chinese food"
      ... ) }

Fig. 5 Example of a group of sentences with the same KV index
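To make the matching logic concrete, the following is a minimal sketch over the indexed-corpus layout of Fig. 5; the key-category sets are illustrative stand-ins for the header-file specification described above:

# Corpus entries pair a KV index with the group of sentences it indexes.
CORPUS = [
    ({"price_range": "cheap", "cuisine": "chinese", "clause": "clarifier"},
     ["cheap chinese restaurants please", "yes cheap chinese food"]),
]
OBLIGATORY_KEY_ONLY = {"price_range", "cuisine"}  # must match on key presence
OPTIONAL_KV = {"clause"}                          # ignorable, preferred if matched

def retrieve(spec, corpus):
    """Find the best-matching group of sentences for a KV specification."""
    best = None
    for index, sentences in corpus:
        # Obligatory keys must agree in both directions on the key level.
        if ({k for k in spec if k in OBLIGATORY_KEY_ONLY} !=
                {k for k in index if k in OBLIGATORY_KEY_ONLY}):
            continue
        # Optional keys are preferred to match on the KV level.
        score = sum(1 for k in OPTIONAL_KV if index.get(k) == spec.get(k))
        if best is None or score > best[0]:
            best = (score, index, sentences)
    return best

def ebg_generate(spec, corpus, substitute=True):
    """Return a surface string for the spec, or None on retrieval failure."""
    hit = retrieve(spec, corpus)
    if hit is None:
        return None  # the full system falls back to formal generation
    _, index, sentences = hit
    sentence = sentences[0]
    if substitute:  # replace retrieved values with those of the input spec
        for key, value in index.items():
            if key in spec and key not in OPTIONAL_KV:
                sentence = sentence.replace(value, spec[key])
    return sentence

spec = {"price_range": "inexpensive", "cuisine": "french", "clause": "verify"}
print(ebg_generate(spec, CORPUS))  # -> "inexpensive french restaurants please"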
5.2 Sampling methodology
We propose two sampling methods to adapt the distributions of the induced corpus. The first method is designed for the scenario in which there is no real in-domain data available for adaptation, which is typically the case before the system has actually been deployed. Our strategy then is to utilize user simulation to filter the raw data, with the goal of achieving a more refined distribution in the semantic content.
The second method assumes that there is a small amount of development data available, which can be hypothesized to represent typical user behaviour. Such utterances can be used as templates to induce other similar utterances, in order to expand the richness of the development corpus in a systematic way. The resulting data are able to extend the linguistic coverage of the development data, while maintaining a similar dialogue-level and sentence-level semantic content distribution.
5.2.1 Sampling via user simulation
The first method, depicted in Fig. 6, is conducted by running the dialogue system in simulation mode through thousands of text dialogues. The raw sentences are first preprocessed into an indexed corpus based on the syntactic and semantic information in each sentence, encoded as KV pairs. A small portion of such a corpus was illustrated in Fig. 5.
Fig. 6 The process of sampling raw data via user simulation. Note: KV = key-value
During simulation, given a response from the dialogue system, the user simulator will generate a query, in the form of KV pairs. The KV information is used to retrieve an appropriate template from the indexed corpus, with classes in the template substituted by values specified in the simulator's KV string. The resulting surface string is sent to the dialogue system to push the dialogue interaction forward. In the case of a retrieval failure, perhaps due to gaps in the raw data coverage, the formal generation method described in Sect. 4 can be invoked as a backup mechanism to provide a well-formed query.
A large collection of user queries can be harvested from repeated simulation runs, utilizing a probabilistic model of user behaviour. Their semantic content distribution is a result of the complex interactions of different aspects of the user model, as well as the strategies of the dialogue system. Prior probabilities of within-class distributions, estimated from frequency counts of database instances, will further influence the semantic content of the final training corpus.
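The closed loop can be sketched as follows, with toy stand-ins for the dialogue system and the stochastic user model (Chung, 2004); ebg_generate() and CORPUS are reused from the retrieval sketch above, and generate() from the formal-generation sketch in Sect. 4.1.1 serves as the backup:

import random

class StubSystem:
    def open(self):
        return {"prompt": "greeting"}           # initial system reply frame
    def handle(self, utterance):
        return {"prompt": "offer_refinement"}   # next system reply frame

class StubUserModel:
    def respond(self, reply_frame):
        # Choose the next user intent probabilistically, given the reply.
        return random.choice([
            {"price_range": "inexpensive", "cuisine": "french", "clause": "verify"},
            {"clause": "request_phone", "restaurant": "atasca"},
        ])

def run_simulation(system, user, n_turns=5):
    sampled, reply = [], system.open()
    for _ in range(n_turns):
        kv = user.respond(reply)                        # user intent as KV pairs
        utt = ebg_generate(kv, CORPUS) or generate(kv)  # EBG, formal backup
        sampled.append(utt)
        reply = system.handle(utt)                      # push the dialogue forward
    return sampled

print(run_simulation(StubSystem(), StubUserModel()))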
5.2.2 Dialogue resynthesis
If some set of development data exists, it becomes appealing to consider using it as a guide in sub-selecting from a large corpus of synthetic data. Figure 7 describes the process of transforming such data into new dialogues via example-based generation. Rather than running a closed-loop dialogue system, we simply drive the selection with the semantics of the development data. This technique enables the development data to act as a user model to generate similar but novel dialogues from the synthetic data. The expected training corpus would embed more realistic user behaviour, but at the same time, the harvested sentences will contain a richer variety of sentence constructs than those found in the small development set.
In this method, the development data are parsed utterance by utterance and transformed into a KV representation using the same techniques that were used to create the KV-indexed corpus. During retrieval, the keys in the retrieved sentence template can either be substituted with values drawn from the development utterance, or left unaltered from their original values in the synthetic corpus. This allows us to experiment with combining probability distributions from different data sources. Specifically, in the first mode, substituting attribute values from the development set into the synthetic sentences will result in a within-class distribution similar to that of the development data. In the second mode, on the other hand, preserving attribute values of the synthetic data will result in a within-class distribution sampled from the input synthetic data.
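A minimal sketch of the resynthesis loop follows, again reusing ebg_generate() and CORPUS from the retrieval sketch; parse_to_kv() is a hypothetical stand-in for the parse-and-generate KV extraction:

def parse_to_kv(utterance: str) -> dict:
    """Hypothetical stand-in mapping a development utterance to KV pairs."""
    kv = {"clause": "verify"}
    if "inexpensive" in utterance:
        kv["price_range"] = "inexpensive"
    if "french" in utterance:
        kv["cuisine"] = "french"
    return kv

def resynthesize(dev_utterances, corpus, substitute=True):
    """Select synthetic variants guided by development-data semantics.

    substitute=True inherits within-class values from the development data;
    substitute=False keeps the values found in the synthetic corpus.
    """
    new_corpus = []
    for utterance in dev_utterances:
        candidate = ebg_generate(parse_to_kv(utterance), corpus,
                                 substitute=substitute)
        if candidate is not None:
            new_corpus.append(candidate)
    return new_corpus

dev = ["do you know of any inexpensive french restaurants"]
print(resynthesize(dev, CORPUS))  # -> ["inexpensive french restaurants please"]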
Fig. 7 The process of synthesizing new dialogues by transforming development data. Note: KV = key-value
6 Experiments and results
In this section, we describe the results of several experiments that were conducted on a test set of 520 utterances, obtained through a data collection effort involving 72 telephone conversations between naive users and the system. Users were asked to interact with a Boston-based restaurant information system via telephone. No specific instructions were provided to the subjects other than a brief introduction to the basic system capability. We excluded recordings which did not contain any speech, but the evaluation data include utterances (5.4%) with out-of-vocabulary words as well as artefacts such as noise, laughter, etc. The data were transcribed manually to provide reference transcripts.
During the course of system development, we have also collected over 3,000 sentences from developers interacting with the system, either via a typed interface or in spoken mode. This set of developer/expert data is probably not representative of real data from naive users. Nevertheless, these data are of high quality both in terms of the syntactic constructs and the semantic content of the queries. They can thus serve both as a benchmark against which to reference our synthetic corpus performance and as templates from which to guide a sub-selection process.
We conducted a number of recognition experiments, as illustrated in Fig. 8. These experiments progress through increasingly sophisticated techniques for exploiting the simulation and developer data for language modelling, generally reflected in improvements in recognition results. Systems I and II correspond to the condition when only synthetic sentences are available for language model training. System I is trained on the raw data only, which is equivalent to the filtered synthetic sentences in Fig. 1. In System II, the data are obtained via the simulation process described in Fig. 6, either by selecting from the raw data pool, or by using formal rules to generate sentences from first principles.
System III is a benchmark system based only on the developer data. For System IV, the simulation data are used to generalize utterances drawn from the developer data, in an attempt to broaden their coverage of general language usage, while still maintaining a similar mixture of semantic contents. In other words, we use the developer data as a user model to generate similar but novel dialogues from the synthetic data, following the techniques of Fig. 7. Two runs were conducted, and the resulting data were combined with the developer data in training the language model. There are two ways to run the example-based generation module: (1) the sentence templates are retrieved from the example corpus, but the class values are inherited from the developer data; or (2) the entire sentence is retrieved without modification. Thus, the resynthesized dialogues have more or less inherited the within-class distribution of the developer data in the first mode, while the within-class distribution of the simulation data (reflecting database statistics) is sampled in the latter.
Fig. 8 Illustration of the automatic induction of data used in recognition experiments. Data sets induced at each stage are used to train the language models for Systems I–IV
The second mode is used here since it has been found to achieve a slightly better performance than the alternative (Wang et al., 2005).
The recognizer configuration was kept exactly the same for all experiments, except for the language model training data. We utilize the SUMMIT landmark-based recognition system (Glass, 2003). Word-class n-gram language models are created using the techniques described in Seneff, Wang, and Hazen (2003), where the vocabulary and word classes are automatically generated from the natural language grammar used for parsing. In the deployed system, the recognizer utilizes a dynamic class for the restaurant names, which is adjusted based on dialogue context (Chung, Seneff, Wang, & Hetherington, 2004). However, for the off-line experiments conducted here, the recognizer is configured to uniformly support all the known restaurant names in the database, under a static RESTAURANT_NAME word class. The vocabulary size is about 2,500 words, with 1,100 of these words being unique restaurant names. The acoustic models are trained on about 120,000 utterances previously collected from telephone conversations in the weather and flight information domains.
In the following two sections, we first discuss a series of experiments intended to assess the quality of a number of different sets of synthetic data, in the absence of any in-domain real user data (I and II). We then describe a set of experiments intended to enhance recognition performance of an initial system trained on developer data, by manipulating and/or augmenting the training utterances with additional similar data induced through our synthesis techniques (III and IV).
6.1 Synthetic data only
Table 7 reports word error rates (WERs) for a series of experiments assessing the effectiveness of various sets of synthetic data for speech recognition. These results were compared with a baseline system (F&F) which utilized a set of just 200 made-up sentences that were solicited from friends and family prior to system development. Friends were asked to propose queries that they would likely ask a hypothetical restaurant system. Such preliminary data collection efforts in the absence of a functioning dialogue system represent one rudimentary method for jump-starting data collection. It realized a rather high WER of 32.1%.
Table 7 Results of recognition experiments using synthetic data to train the recognizer language model
Configuration                     Num utts    WER (%)
Baseline (F&F)                         203       32.1
I: Raw data                        134,526       30.7
Sampling via user simulation:
II(1): Formal generation            13,352       22.8
II(2): Formal transformation         7,622       24.5
II(3): Template transformation      10,807       22.0
II(4): All (1+2+3)                  28,564       20.1
Note: WER = word error rate. F&F = utterances solicited from friends and family before the system existed. Raw data = automatically induced data using template-based transformation, prior to applying any sampling methods. See text for discussion
The raw set is a set of over 130,000 utterances that are induced by the template-based method using 30,000 flight domain sentences. These have been filtered on syntactic and semantic constraints but have not been downsampled via simulation. In spite of its large size, it only improves slightly over the baseline performance.
Systems II(1–4) are all trained on data devoid of any restaurant-specific real-user utterances, but all of them involve user simulation runs. The training utterances for System II(1) were generated from first principles, using only formal generation rules in simulated dialogues. Systems II(2) and II(3) both involve transformations from flight domain utterances. System II(2) uses formal generation for translating flight queries into restaurant-domain queries, coupled with user simulation, whereas System II(3) is trained on data that are sampled from the template-induced raw data set, as described above. Systems II(1–3) all yield substantial improvements over the results for the raw simulated data, with the template-based approach being the most effective. In particular, when the template-induced data are sampled, WER is reduced by 28.3% relative (from 30.7% to 22.0%). Evidently, as intended, sampling the raw training data through a strategy of user simulations has yielded higher quality training data, because the probability distributions, as estimated by the simulated user model, are much closer to those of real dialogue interactions.
When all three of these sets are combined into a single large set (System II(4)), the performance improves further, yielding a recognition error rate of just over 20%. It should be noted that, although the formal transformation method performs relatively poorly by itself, the WER increases to over 21% if data from only Systems II(1) and II(3) are included in the training corpus. The difference in WER is tested for statistical significance using the matched-pairs segment word error test (Gillick & Cox, 1989), and significance at a level of 0.03 is established. Hence, the formal transformation method seems to add some novel coverage beyond what the other two sets offer.
We have determined through separate experiments that including meta queries and noise models improves recognition performance. Hence, for all of these experiments except the F&F baseline, the training data were augmented with the meta queries harvested from the flight and weather data, and the synthetic data were manipulated to insert non-speech events. The next section will document the effects of adding meta queries and noise models to development data.
6.2 Augmenting developer data
Table 8 summarizes the results of several experiments involving the available corpus of nearly 3,500 utterances harvested from developer interactions with the system. By themselves, these utterances (typed plus spoken) yielded a WER that was 1% lower (19.1% versus 20.1%) than the best result achieved from data that are entirely artificially generated.
Table 8 Results of recognition experiments augmenting developer data to train the recognizer language model
Configuration                          Num utts    WER (%)
III(1): Dev only                          3,497       19.1
III(2): Dev + enhancements                4,753       18.0
IV: Dev + enhancements + sim (2x)         9,131       17.2
Note: WER = word error rate. Enhancements indicates the additional meta-level queries and the application of noise models. sim (2x) indicates simulated data via user simulation, followed by dialogue resynthesis. See text for discussion
A question worth asking, however, is whether these developer data can be improved upon through augmentation with variants derived, via dialogue resynthesis, from our corpus of simulated data. As can be seen from the table, each of these augmentations led to further reductions in the WER.
The best performing system (IV), at 17.2% WER, combines the developer data utterances with a synthetic data set. The synthetic data set is obtained by (1) induction via the template-based approach from flight utterances followed by syntactic/semantic filtering, (2) downsampling by user simulation, and finally (3) further downsampling guided by the semantic content of the developer utterances (dialogue resynthesis). Two consecutive runs of dialogue resynthesis are conducted, resulting in two additional utterances for each developer utterance. The overall relative improvement achieved by all of these augmentations, compared to the original Dev system, is 9.9%. This WER difference is significant at the level of P = .001. In other experiments we conducted, it was found that combining with synthetic data without applying the dialogue resynthesis technique did not outperform the system using the augmented developer data (Dev + enhancements).
These results suggest that real user data, even when derived from developer/expert users, can be valuable for training a dialogue system. Combining simulated data with developer data has enhanced performance even further. The simulated data clearly add coverage by capturing more novel queries in syntactic constructions and semantic content through the process of induction from flight utterances and simulated dialogue interactions. But this final simulated set also maintains a set of sentence-level statistics that directly approximates the interactions of the developer users. This seems to be better than using only the user model of the simulator.
A further examination of the development set shows that it covers many non-grammatical constructs that are plausible spoken inputs and cause parse failures. Also included are some out-of-domain queries and sentences with unknown words, found in 3.2% of the sentences. These are not modelled by the template-based induction method, because the method uses the same parser to derive the meaning representation, and filters out illegal parses, such that none of the induced sentences are intended to be out-of-domain or contain unknown words.
In one final experiment, we ascertain a possible lower bound on word error by training the language model on the transcriptions of the test set. This oracle condition achieved a 12.2% WER. We can deduce that further manipulations on the language model training data, whether in terms of quantity or quality, while holding the acoustic model constant, would be unlikely to outperform this system.
7 Summary and future work
This paper has described novel methods for inducing language modelling data for a new spoken dialogue system. The methodology we implemented involves a step of generatively inducing a large corpus of artificial sentences assembled via a process of parsing and reconstructing out-of-domain data. This is followed by syntactic and semantic filtering of illegal sentences. A final step is concerned with sampling the corpus based on either simulated dialogues or semantic information extracted from development data.
Our experiments have shown that reasonable performance can be obtained in the absence of real data, simply by using synthetic training data. We have demonstrated a method for assembling linguistically varied sentences for a new application by harvesting the sentence constructs found inside queries of a secondary, essentially unrelated, domain. In addition, we have also shown that a training corpus can be refined by incorporating statistics that estimate user interaction with the system. This can be achieved without user data via a user simulation strategy. On the other hand, collecting even some expert dialogue data, typed or spoken, can be beneficial, and the sentence-level distributions of expert users' interactions can be exploited to generate even better synthetic data.
While the procedures we have developed here appear complex, most of them are fully automatic once the appropriate scripts and control files are in place. Developing the parsing grammars for the source and target domains is relatively straightforward, since the rules are based on a core grammar capturing syntactic structure. The main task is to populate terminal nodes with appropriate vocabulary for nouns, verbs and adjectives. A set of simple semantic mapping rules is applied to automatically derive a semantic frame from the parse tree. The parent-child-grandchild relationships are derived directly from the semantic frames. The templates and filler phrases are created automatically from the parse trees as well. The formal generation method requires both manual effort and expertise. However, we believe that its potential has not yet been fully realized, since we have thus far devoted only a few person-days of effort to this task. With further rule manipulations, we could conceivably obtain a significantly larger pool of well-formed but novel queries to augment our training corpus.
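As an illustration of the mapping step (the rule format below is invented for exposition and is not our system's notation), a recursive walk can turn labelled parse nodes into a nested frame, from which the parent-child-grandchild relationships fall out directly:

```python
from dataclasses import dataclass, field

@dataclass
class ParseNode:
    label: str
    text: str = ""
    children: list = field(default_factory=list)

def frame_from_parse(node, mapping_rules):
    """Apply simple label-to-key mapping rules while walking the parse
    tree; nodes without a rule are transparent and pass their content
    upward, so the frame nesting mirrors the tree structure."""
    frame = {}
    for child in node.children:
        sub = frame_from_parse(child, mapping_rules)
        key = mapping_rules.get(child.label)  # e.g. "np_cuisine" -> "cuisine"
        if key is not None:
            frame[key] = sub if sub else child.text
        else:
            frame.update(sub)
    return frame
```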
In future research, we plan to embed our user simulator into deployed spoken dialogue systems, where its role would be to automatically generate, after each system response, several example sentences representative of productive user queries, to be displayed in a graphical user interface. These examples can serve as an intuitive help mechanism to guide users through the dialogue. We believe that such a device would greatly reduce the percentage of out-of-domain utterances spoken by real users.