Cybernetics and Systems Analysis, Vol. 44, No. 2, 2008
DIALOGUE AS A BASIS FOR CONSTRUCTION OF SPEECH SYSTEMS
R. V. Meshcheryakov and V. P. Bondarenko UDC 681.3
Tomsk State University of Control Systems and Radio Electronics, Tomsk, Russia, [email protected]. Translated from Kibernetika i Sistemnyi Analiz, No. 2, pp. 30–41, March–April 2008. Original article submitted September 7, 2005.
1060-0396/08/4402-0175 © 2008 Springer Science+Business Media, Inc.
The structure of speech dialogue systems is analyzed. It is shown that the creation of speech recognition and synthesis systems is connected with the solution of direct and inverse problems as applied to the organization of speech dialogue. Attention is paid to the multilevel hierarchical organization of speech dialogue systems in which the interaction between recognition and synthesis channels is realized through a common knowledge base. A generalized structure of data representation at different hierarchy levels is analyzed, and the hierarchy of qualities determining properties of a dialogue system is considered.
Keywords: dialogue, structure, hierarchy, quality.
INTRODUCTION
Speech systems such as speech synthesis and recognition systems, interactive systems, and machine-translation systems belong to the class of intelligent systems. They must, first, understand speech despite possible noise, poor articulation, deviations from normative syntax, elliptical constructions, etc., and, second, produce high-quality speech with high intelligibility and naturalness of the speech signal. A human integrally uses both linguistic knowledge (the phonetics, vocabulary, syntax, semantics, prosody, etc. of a language) and nonlinguistic knowledge (the object domain of a dialogue). The main psychoacoustic characteristic of speech is its intelligibility, i.e., the degree of correct perception of sounds, words, and meaning. Intelligibility is maximal in the perception of connected speech (phrase intelligibility) [1]. If a person perceives isolated words, the percentage of intelligibility is lower. It decreases still further in perceiving isolated phonetic speech elements such as syllables.
The majority of approaches are based on the sequential steps of word recognition and syntactic analysis. The experience of linguists shows that a person first comprehends the meaning of a phrase and only then proceeds to directly solve the problem of speech understanding, which is often considered as a procedure independent of the recognition stage. The use of such approaches leads to poor results. It becomes obvious from the theory of speech perception [1] that recognition and understanding are two closely related procedures integrated into a dialogue process.
CENTRAL TENETS OF A CONCEPTUAL MODEL OF SPEECH DIALOGUE SYSTEMS
External input and output data of a speech dialogue system consist of the semantic space of words and phrases of a given language and an object domain, a speech signal, and also the parameters of the speech production system (for the synthesis channel of the speech system). A speech system has the pragmatic, semantic, syntactic, phonetic, and physical hierarchical levels. Each level has its data set and rules providing information processing. Accordingly, to achieve the objectives of information processing at upper levels in perceiving speech, the problems of lower levels must be solved and vice versa. A similar situation arises in speech synthesis systems, since the formation and pronouncing of an utterance requires the solution of problems of upper levels, and also in other speech systems. These two processes are so interrelated that solving the direct problem requires the solution of the inverse problem.
Thus, an interactive system can be based on the multilevel hierarchical model from [2], which should provide
a complete description of a language as a hierarchy of closely integrated events, with the specification of the rules of information transformation at the corresponding levels;
allowance for metalinguistic knowledge or the object domain of dialogue, i.e., the prediction of some succession of events as a distinctive feature of speech perception.
As a result, the central tenets in constructing a model of an interactive system can be as follows:
the world models of the subjects carrying on a dialogue intersect, i.e., the world-oriented knowledge elements determined by a concrete object domain of dialogue are partially common to them [3];
subjects always predict (based on the prehistory of their dialogue) the speech reaction of their interlocutors with a definite probability.
Hence, an efficient procedure of speech perception and synthesis must include complete knowledge of the language being used, environmental parameters, and the prediction of reactions of interlocutors.
These three central tenets will make it possible to bound the domain of feasible decisions at each moment of dialogue practically at all hierarchy levels and to increase the reliability of speech perception and quality of speech synthesis.
CONCEPTUAL MODEL OF DIALOGUE
Based on the above requirements, we represent a simplified model of a dialogue system by the scheme that is presented in Fig. 1 and in which arrows specify the directions of information flows. Bottom-up flows characterize the perception (recognition) channel, and top-down flows characterize the synthesis channel. Each presented object is specified by its collection of elements of information on the language being used, transformation rules, and relations with other levels. In this scheme, direct interrelations between only two levels, namely, between a higher and a lower one, are reflected. In actual fact, the number of interrelations is larger. For example, at higher levels, of great significance is metalinguistic knowledge, in particular, information on the current object domain. We note that, at lower levels, this knowledge about a language loses its significance.
Let us consider the mutual influence between upper and lower levels. It is obvious that the meaning of a sentence imposes constraints on the structure of the words contained in the sentence and also on their syntactic relations. In turn, syntactic relations between words in a sentence determine the meaning of an utterance and its objective function. As is well known [4], letters in a text have various probabilities of occurrence, and the probability of occurrence of the next letter depends on the preceding letters. However, even in the case of combinations of three and more letters, the probability of occurrence of a letter decreases. Similar reasoning applies to words in sentences. Thus, the current prediction at the level of letters or words makes it possible to restrict decision domains during recognition.
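The restriction of a decision domain by prediction at the level of letters can be illustrated by a minimal sketch. Everything in it is an assumption made for illustration and is not taken from [4]: the toy corpus, the function names `bigram_model` and `restrict_candidates`, the use of letter pairs only, and the probability threshold.

```python
from collections import Counter, defaultdict


def bigram_model(text):
    """Estimate P(next letter | previous letter) from a sample text.

    Spaces and punctuation are dropped, so pairs also span word boundaries.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    counts = defaultdict(Counter)
    for prev, nxt in zip(letters, letters[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev, following in counts.items():
        total = sum(following.values())
        model[prev] = {c: n / total for c, n in following.items()}
    return model


def restrict_candidates(model, prev, candidates, min_prob=0.05):
    """Keep only the candidates that are sufficiently probable after `prev`."""
    probs = model.get(prev, {})
    return [c for c in candidates if probs.get(c, 0.0) >= min_prob]


corpus = "the theory of the speech system"  # hypothetical sample text
model = bigram_model(corpus)
# After "t", only letters actually observed after "t" remain admissible.
allowed = restrict_candidates(model, "t", ["h", "q", "e", "z"])
```

With this toy corpus, the recognizer would have to choose among two letters instead of four after "t"; the same scheme carries over to words if letters are replaced by word tokens.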
Of definite interest is the influence of the results of a prediction concerning the development of a dialogue at the level of a model of the world, for example, on the lexical base of the domain of feasible decisions at the level of phonetic words. As is well known, the 100–200 highest frequency words suffice for everyday dialogues. Of course, any other special object domain will demand the inclusion of its terms and concepts, i.e., new words, which leads to some extension of its lexical base. It turns out that the common lexical base increases only slightly. To verify this, based on the algorithms of morphological and syntactic analysis that are considered in [5], the lexical bases for the object domains of computer engineering and radio electronics were estimated. It turned out that the highest frequency words used in these domains increase their lexical bases by 30–50%. We note that if the current prediction of a dialogue is supported, then, even based on this lexical base, the domain of feasible decisions will be considerably bounded, which must lead to an increase in the recognition reliability at the level of words.
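The extension of a lexical base by the terms of an object domain can be estimated as in the following sketch. The word lists and the function name `lexical_base_growth` are hypothetical; the sketch only shows how a percentage of the kind quoted above can be computed and does not reproduce the analysis of [5].

```python
def lexical_base_growth(common_words, domain_words):
    """Return the extended lexical base and its growth in percent.

    Growth is measured as the number of domain words absent from the
    common base, relative to the size of the common base.
    """
    common = set(common_words)
    domain = set(domain_words)
    new_words = domain - common
    growth_percent = 100.0 * len(new_words) / len(common)
    return common | domain, growth_percent


# Hypothetical everyday vocabulary and domain vocabulary.
everyday = ["the", "is", "on", "and", "open", "close", "set", "time", "new", "file"]
radio = ["signal", "filter", "set", "noise", "gain", "time"]

extended, growth = lexical_base_growth(everyday, radio)
```

Here four of the six domain words are new, so the base of ten words grows by 40%, while the recognizer still works with a small, bounded vocabulary.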
The presented scheme of an interactive system allows one to preliminarily classify speech systems as follows:
speech perception systems that completely realize the channel of processing from a speech signal to a model of the world;
systems of complete speech synthesis that realize the channel from a model of the world to a speech signal;
word and phrase recognition systems;
systems of synthesis from printed texts, etc.
At the present time, word and phrase recognition systems and also systems of synthesis from printed texts are partially realized. Their comparative analysis is given in [6].
Fig. 1. A simplified model of a dialogue system. [Labels: Dialogue model; Knowledge base; Parameters of speech production; Environment.]
We note that the recognition reliability of systems with only a bounded number of levels, for example, with only the phonemic level, is low. In using information on syllabic recognition, the reliability increases. A similar situation takes place in speech synthesis, namely, when a letter is directly transformed into a sound, the quality and intelligibility of the corresponding speech signal will be low. With allowance made for the influence of higher levels, the synthesized speech quality increases.
Not only the levels of one channel but also the channels themselves are interrelated. For example, in speech synthesis based on the articulation model from [7], the synthesis must be constantly tuned depending on the varying characteristics of signal intelligibility. To this end, feedbacks are introduced that use the speech signal recognition channel, and the synthesis is corrected on their basis.
Thus, based on the aforesaid, we draw the conclusion that the existing feedbacks are determined by hierarchy levels in an information processing system and the degree of necessity of linguistic knowledge.
The need for feedbacks can be illustrated by the example of the speech production system of man. In delivering a speech, a person corrects it using the following feedbacks: first, through the air and his skeleton with the help of his acoustic system, second, taking into account the location of speech production organs, and, third, fixing the reaction of the interlocutor through his visual channel. According to the information obtained, the person corrects the speech being delivered.
STRUCTURE OF DIALOGUE LEVELS
It is assumed that a dialogue includes the interaction of at least two subsystems whose structures are similar in the general case and whose world models intersect. Each subsystem can be represented in the form

$$A_i = (A_i^s, A_i^b), \qquad (1)$$

where $A_i^s$ is the structure and $A_i^b$ is the behavior of the corresponding subsystem. In this case, such an interactive system is represented as the union of subsystems (1),

$$A = \bigcup_i A_i = \bigcup_i (A_i^s, A_i^b). \qquad (2)$$

During a dialogue, each subsystem strives to realize its objective $A_i^o$ and, hence, the structure and behavior of systems $A_i$ may be considered as noncoincident. The realization of a partial objective $A_i^o$ presumes the existence of the corresponding structure and behavior of a subsystem $A_i$ that can be (in some sense) represented in the form of an optimal exchange [8]. It is obvious that this requires the definition of, on the one hand, a quality functional and, on the other hand, a domain of feasible decisions. The most difficult problem is the determination of a quality functional for the entire dialogue, but it can be decomposed according to hierarchy levels as a result of representation in the form of a hierarchy of qualities; hence, a hierarchy of domains of feasible decisions must arise.
[Fig. 1, level labels: Text, Paragraph, Sentence, Word, Syllable, Letter; Sound paragraph, Utterance, Phonetic word, Diphone, Sound, Speech signal.]
Fig. 2. Interaction of subsystems during a dialogue. [Labels: EE, Similarity.]
The main objective is determined by the pragmatic component part of dialogue. At a higher hierarchy level, this part is determined by the global meaning of dialogue. At lower levels, the global meaning is represented by its component parts (concrete meanings).
In Fig. 2, two subsystems $A_1$ and $A_2$ are presented that participate in dialogue through their external environment (EE). The levels $K_{11}$ and $K_{21}$ are similar since they realize similar functions of message formation and transfer. The levels also are subsystems. The levels $K_{12}$ and $K_{22}$ that realize receiver functions are also similar. Thus, the functions of dialogue subjects must be similar. It might be supposed that, in forming messages, some elements of information are used through the perception channel. Thus, the message source takes into account the capabilities of the message receiver. In this case, it may be said that, in forming and transmitting a message, the source strives to achieve not only the objective of optimal coding but also a sufficient reliability level of recognition. To this end, information on the receiver capabilities that is available in its system is used; knowing the structure and behavior of the receiver's perception, the source can use some of its parameters.
From this one can suppose that, at each level, there are corresponding definitions and concepts, for example, models of a grammar, a phonetics, etc. By analogy with models of the world, these models must intersect for different subsystems participating in dialogue.
We consider that resources of a system, for example, its complexity, are bounded. In particular, in generating an utterance, the amount of air in the lungs of a person, operation speed of his articulation organs, etc. are bounded. This leads to the necessity to transmit a message with a sufficient degree of perception reliability with a constraint on resources, which allows one to speak of an optimal exchange in the sense of B. S. Fleishman [8].
In turn, the subsystems of generation and recognition should be considered as hierarchically constructed according to Fig. 1. Since each hierarchy level in an interactive system is considered to be independent, it is necessary to find the objective function of its origin and to determine the method of its optimization during dialogue. This is considered as the attainment of some system quality under given constraints.
In this approach, it is supposed that the hierarchy of qualities of a complicated system is taken into account. The systems being considered belong to the class of decision systems whose hierarchy must be as follows [8]:
O-quality characterizes a system in which a feedback is present, an automatic system, and a system possessing the properties of stability, accuracy, etc.;
R-quality characterizes a reliable system possessing the structural stability property;
I-quality characterizes a reliable system that possesses the properties of coding and decoding;
C-quality characterizes a controlled system.
The role of each quality varies with the hierarchy level of an interactive system. Whereas the O-, R-, and I-qualities play the leading role at the signal, sound, and diphone levels, the C-quality, in addition to these qualities, is of major importance at the levels of utterances and sound paragraphs.
A GENERALIZED STRUCTURE OF INFORMATION REPRESENTATION IN AN INTERACTIVE SYSTEM
A dialogue between two subsystems is realized with the help of elementary signals through an environment that introduces distortions and interferences into these signals. Constraints on the complexity of the entire system lead to a restriction of the complexity of levels. Hence, in a bounded time interval, a level can process a finite amount of information. To this end, it is required to use a hierarchical representation in which the amount of information required for the description of input and output signals in unit time decreases during the motion along the recognition channel and increases during the motion along the reverse channel (feedback) of synthesis. The property being considered, which is connected with the necessity of information representation in a complicated system with different degrees of abstraction, is inherent in all complicated hierarchical systems and is a consequence of constraints on the structural complexity of any systems (1). This property can be considered as a special M-quality of complicated hierarchical systems.
Any hierarchy level is characterized by the following pair of transformations that describe the direct and reverse channels of interaction in this system:

$$s: X \to Y, \qquad s^*: Y^* \times Z \to X^*, \qquad (3)$$

where $X$ and $Y$ are the input and output sets at the bottom of the system, i.e., for the perception channel, $Y^* \times Z$ and $X^*$ are the input and output at the top of the system, i.e., for the synthesis channel, and $Z$ is the set of transformations prescribed by higher levels.
The description sizes $m(X)$ and $m(Y)$ must satisfy the relation

$$m(X) > m(Y). \qquad (4)$$

This is possible if an equivalence relation, i.e., a partitioning into classes $X_Y \subset X$, is established over the set $X$.
Hence, the description of $Y$ and also, maybe, $X_Y$ must contain at least the following two components: the name of a class and an enumeration or the rule of generation of elements $x$ belonging to a given class $X_Y$. But if elements $x$ are generated in some way or other, then they can be stored on the corresponding hierarchy level in the form of some knowledge base (model) that makes it possible to establish a correspondence between the name of a class and its content. Then $Z$ will characterize admissible transformations of the subset $X_Y$ without leaving this class. If the set $Z$ is homeomorphic to the unit segment, then the mapping $s^*$ becomes a homotopy, which is equivalent to a partitioning of the input set into homotopic equivalence classes. Then, for

$$z \in [0, 1] \ \text{and} \ z = 0, \qquad (5)$$

the following condition must be true:

$$d(X, X^*) \le d_0, \qquad (6)$$

where $d$ is the error of the signal $X^*$ synthesized from the description of the set $X$ in the case of absence of its transformations and $d_0$ is a given admissible error. Mapping (3) is most obviously illustrated by concepts such as a phoneme and an allophone. Taking this into account, the description of an element of $Y$ should be supplemented with an additional component part describing its states and also a component part describing the position of $X_Y$ in the set $X$. This leads to the following form proposed in [9] for the description of an element (called a generator) of $Y$:

$$O(Y) = \{N_Y, P_Y, S_Y\}. \qquad (7)$$

Here, $N_Y$ is the name of a class, $P_Y$ are features, and $S_Y$ are relations of $X_Y$ with other equivalence classes of the set $X$. The representation of the input set in the form of mapping (3) satisfying conditions (4)–(6) at the level of signals does not raise doubts. In particular, estimate (6) can be obtained by computation of a distance. The problem is dramatically complicated at higher levels, where estimation procedure (6) becomes unobvious.
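At the level of signals, mapping (3) and conditions (4)–(6) can be illustrated by a deliberately simple stand-in, a uniform scalar quantizer. This choice, the step size, the sample values, and the function names `s` and `s_star` are assumptions made for illustration; the paper does not prescribe a particular mapping.

```python
import math


def s(x, step):
    """Direct channel s: X -> Y. Maps a sample to the index of its class,
    where class y is the interval [y * step, (y + 1) * step)."""
    return math.floor(x / step)


def s_star(y, step, z=0.0):
    """Reverse channel s*: Y* x Z -> X*. The parameter z in [0, 1] moves
    the synthesized value inside class y without leaving the class;
    z = 0 corresponds to the absence of transformations."""
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    return (y + 0.5 * z) * step


step = 0.25
d0 = step  # admissible error, chosen equal to the class width
samples = [0.02, 0.11, 0.37, 0.41, 0.49]

classes = [s(x, step) for x in samples]          # description Y
restored = [s_star(y, step) for y in classes]    # X* for z = 0
errors = [abs(x - x_star) for x, x_star in zip(samples, restored)]
```

Five distinct samples are described by two class names, which corresponds to relation (4), and every reconstruction error stays within `d0`, which corresponds to condition (6) for z = 0.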
In the considered statement, each level is characterized by some variable number of elements (7) that depends on a concrete speech message. As has been noted earlier, an element can be related with other elements of the hierarchy level being considered. In some cases, the structures of configurations of elements assume the regularity property [9], which allows one to qualitatively pass to a higher hierarchy level by assuming a regular structure of a lower level in the capacity of an element. A similar situation also takes place for the reverse channel, namely, an element is divided into several lower-level elements each of which has its name, features, and relations. However, key information can be of great importance at all the levels of an interactive system. An example of such information is the required pitch frequency of a speech signal at different levels that is preserved in the features of generator elements.
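The description of a generator (7) and the passage from a regular configuration to an element of a higher level can be sketched as a small data structure. The class name `Generator`, the function `promote`, the relation label `consists_of`, and the averaging of the pitch frequency are hypothetical choices made for this sketch and are not taken from [9].

```python
from dataclasses import dataclass, field


@dataclass
class Generator:
    """Element of a hierarchy level in the sense of (7)."""
    name: str                                       # N_Y: name of the class
    features: dict = field(default_factory=dict)    # P_Y: features
    relations: dict = field(default_factory=dict)   # S_Y: relations with other classes


def promote(configuration, name):
    """Treat a regular configuration of lower-level generators as a single
    generator of the next level. Key information (here the pitch frequency)
    is carried upward; averaging it is an assumption of this sketch."""
    pitches = [g.features["pitch_hz"] for g in configuration
               if "pitch_hz" in g.features]
    features = {"mean_pitch_hz": sum(pitches) / len(pitches)} if pitches else {}
    relations = {"consists_of": [g.name for g in configuration]}
    return Generator(name=name, features=features, relations=relations)


# Two sound-level generators with hypothetical pitch values.
sounds = [
    Generator("m", {"pitch_hz": 120.0}, {"followed_by": ["a"]}),
    Generator("a", {"pitch_hz": 130.0}, {"preceded_by": ["m"]}),
]
syllable = promote(sounds, "ma")
```

The reverse channel would perform the opposite operation, dividing `syllable` into the generators listed under `consists_of`, each with its own name, features, and relations.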
GENERATORS ON DIFFERENT HIERARCHY LEVELS OF AN INTERACTIVE SYSTEM
We assume that, in using M-quality, the number of objects that are simultaneously processed at each level is approximately the same [10]. However, objects have different natures and different properties. It is obvious that the information representation at the physical level differs from its description at the level of words. In particular, at the level of signals, information is represented in interval scales and, at the level of words, it is represented in naming and ordering scales.
In the approach considered in [9], several regular structures of the current level (more exactly, one class) with their relations and features are transformed into generators of the higher level or into regular structures of the lower level. In some cases, more than one hierarchy level of dialogue is required to obtain some other regular structure. An example is the influence of the level of sentences on the phonetic composition of a sound (its pronunciation at the beginning or at the end of a sentence). To pass from one level to another, it is insufficient to specify only the rules of definition of generators of each level. An algebra should be specified on the set of selected elements of each level.
To select elementary objects at the physical level (the level of sounds), a preliminary segmentation into quasihomogeneous sound signal fragments with their subsequent classification is most frequently used. We consider that the set of heterogeneous segments is defined as follows:

$$S = \{s_i\}, \quad i = 1, 2, \ldots, I,$$

where $I$ is finite. For each selected object, its cluster body $K_i$ can be determined. Using some proximity measure [11] in this space, it is possible and necessary to determine the image that contains the sought-for speech signal segment. It is obvious that, at this level, comparisons in terms of interval scales can be performed and a distance can be used as the proximity measure.
It is obvious that an equivalence relation $R = {\sim}$ can be defined on a set of segments $s_i$. Two segments $s_1$ and $s_2$ are equivalent if they belong to one cluster/class in the state space, i.e., if we have $s_1 \in K$ and $s_2 \in K$. Hence, all the segments located in one cluster belong to the same equivalence class $[K]_{\sim}$. It may be noted that, by definition, the equivalence relation is reflexive, transitive, and symmetric. At the same time, features $P_1$ and $P_2$ determine the distinction between the objects $s_1$ and $s_2$ that fall into one equivalence class. Thus, the equivalence relation makes it possible to map equivalence classes $m: S \to M_N$ onto a set of namings $M_N = \{m_i\}$, where $m_i$ is the name of the equivalence class $[K_i]_{\sim}$. The properties of generators can be specified that describe the characteristics of the state space, i.e., its features.
The next level of description of a speech signal is parametric; its input is the set of elements $M_N = \{m_i\}$. We introduce a mapping $r: M_N \to R_N$ defined on this set, where $r$ is a binary connectivity relation between elements from $M_N$. These sequences can form regular configurations [9] bounded by a phoneme, a diphone, or a phonetic word.
The duration of a string determines not only the necessity of taking into account the relation between two successive generators but also all the generators of a given configuration, i.e., the arity of generators increases from configurations of small durations to configurations of large durations. A configuration is constructed as an action over properties of generators. A function $p(m_1, m_2): M_N^2 \to [0..1]$ can be defined, where $p$ is the probability (frequency distribution) of the compatibility between objects $m_1$ and $m_2$; more precisely, the object $m_2$ must follow the object $m_1$. The value 0 corresponds to the incompatibility between objects. The value 1 corresponds to the obligatory compatibility between objects, i.e., the object $m_2$ always follows the object $m_1$.
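The chain described in this section, from segments to equivalence classes, class names, and the compatibility function p, can be sketched as follows. The two-dimensional feature vectors, the cluster centres, the squared Euclidean distance as the proximity measure, and the function names are all assumptions of this sketch; [11] may use a different measure.

```python
from collections import Counter, defaultdict


def nearest_cluster(segment, centroids):
    """Naming map m: S -> M_N. Returns the name of the cluster whose centre
    is closest to the segment (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(segment, centroids[name])),
    )


def compatibility(names):
    """p(m1, m2): relative frequency with which m2 directly follows m1.
    Pairs that never occur are absent, which corresponds to the value 0."""
    pair_counts = Counter(zip(names, names[1:]))
    first_counts = Counter(names[:-1])
    return {(m1, m2): n / first_counts[m1] for (m1, m2), n in pair_counts.items()}


centroids = {"a": (1.0, 0.0), "s": (0.0, 1.0)}  # hypothetical cluster centres K_i
segments = [(0.9, 0.1), (0.1, 0.8), (1.1, -0.1), (0.8, 0.2), (0.2, 1.2)]

names = [nearest_cluster(seg, centroids) for seg in segments]

# Equivalence classes: segments are equivalent if they fall into one cluster.
classes = defaultdict(list)
for seg, name in zip(segments, names):
    classes[name].append(seg)

p = compatibility(names)
```

In this toy sequence, "a" always follows "s" (value 1), while "s" never follows "s", so that pair is absent from `p` and is treated as incompatible.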
O-QUALITY OF INTERACTIVE SYSTEMS
In the general case, a speech system is considered as a multilevel hierarchical interactive system and is represented by a sequence of mappings

$$
\begin{array}{ccccccccccccc}
 & \xrightarrow{\varphi_n} \cdots \xrightarrow{\varphi_3} & F_2 & \xrightarrow{\varphi_2} & F_1 & \xrightarrow{\varphi_1} & S & \xrightarrow{\psi_1} & G_1 & \xrightarrow{\psi_2} & G_2 & \xrightarrow{\psi_3} \cdots \xrightarrow{\psi_n} & \\
M_1 & & \updownarrow & & \updownarrow & & \updownarrow & & \updownarrow & & \updownarrow & & M_2 \\
 & \xleftarrow{\psi_n^*} \cdots \xleftarrow{\psi_3^*} & G_2^* & \xleftarrow{\psi_2^*} & G_1^* & \xleftarrow{\psi_1^*} & S^* & \xleftarrow{\varphi_1^*} & F_1^* & \xleftarrow{\varphi_2^*} & F_2^* & \xleftarrow{\varphi_3^*} \cdots \xleftarrow{\varphi_n^*} &
\end{array}
\qquad (8)
$$

Here, $F_i$, $F_i^*$, $G_i$, and $G_i^*$ represent the description of utterances on different hierarchy levels in speech production and speech perception systems, $\varphi_i$, $\varphi_i^*$, $\psi_i$, and $\psi_i^*$ are mappings of descriptions, $S$ and $S^*$ are speech signals, and $M_1$ and $M_2$ are models of the world of the dialogue participants. The arrows between descriptions $F_i$ and $G_i$ characterize interrelations between them. Objectively, these arrows represent the corresponding transformation models.
There are various types of relations in a speech system, for example,

$$G_1^* \to F_1 \xrightarrow{\varphi_1} S \to S^* \xrightarrow{\psi_1^*} G_1^*; \qquad (9)$$

$$G_2^* \to F_2 \xrightarrow{\varphi_2} F_1 \xrightarrow{\varphi_1} S \to S^* \xrightarrow{\psi_1^*} G_1^* \xrightarrow{\psi_2^*} G_2^*. \qquad (10)$$

Moreover, biological feedbacks including dialogue participants are possible, for example,

$$G_2^* \cdots \xrightarrow{\varphi_1} S \xrightarrow{\psi_1} \cdots \xrightarrow{\psi_2} G_2 \to F_2^* \xrightarrow{\varphi_2^*} \cdots \xrightarrow{\varphi_1^*} S^* \xrightarrow{\psi_1^*} \cdots \to G_2^*. \qquad (11)$$

The above structures are determined by the structure of a natural language. In particular, relations of the form (5) really exist [7]. Moreover, they are closed at a very low level, i.e., at the level of forming sonant sounds of speech [12]. Therefore, for lower hierarchy levels of an interactive system, the control stability and accuracy should be provided. Relation (6) can be defined as the condition of reversibility of the sequence of transformations (8) during perceiving and synthesizing a speech signal and, maybe, a speech.
R- AND I-QUALITIES OF AN INTERACTIVE SYSTEM
These two qualities provide the reliability of transmitted information. It is obvious that, for participants in a dialogue, the corresponding models described in terms of generators (7) must coincide in many respects.
At present, the level of reliability of technical systems, i.e., their R-quality, is very high. Therefore, to ensure the reliability of transmitted information, it is necessary to provide a high degree of noise immunity, i.e., the I-quality of an interactive system. The provision of I-quality is connected, first, with the choice of a form of the transmitted signals that provides the least probability of an error during reception and, second, with the choice of an antinoise coding. The solution of the first part of the problem is determined by distinctive features of the acoustic perception system and, apparently, any speech signal is tuned with allowance made for these distinctive features, which determine the speech code structure in many respects [7].
The solution of the second part of the problem is connected, in the classical understanding of antinoise coding [13], with the choice of a code base and a method of redundant coding and decoding. Up to a definite hierarchy level, this problem can be considered as the problem of the choice of a set of generators [9] or, more exactly, their names and relations. Formally, the problem is reduced to the choice of a code base and a finite algebra. The ambiguity of solution of this problem is determined by the fact that there is a set of different languages with different phoneme collections, sound pitches, etc., but they have much in common at the level of speech signals.
An analysis of mappings (3) and description (7) shows that the set of transformations $Z$ and features $P_Y$ must be in correspondence with each other and, perhaps, can be related by a single-valued or a biunique mapping. In this case, the features $P_Y$ can also play the dominant role in forming regular configurations. This allows one to suppose that, in forming mapping (3), noise immunity requirements or, more precisely, those of the reliability of decision-making in partitioning the set $X$ into homotopic classes are essential. Apparently, there must be some relationship between the number of classes and the size of description of features $m(P_Y)$. In particular, partitionings of sounds according to their position and formation method are well known. Let us consider the following three groups of classification of sounds:
group I includes vowels, sonants, noise sonants, and unvoiced noise consonants;
group II includes vowels, sonants, noise occlusive sonants, unvoiced occlusive consonants, fricative sonants, and unvoiced fricative consonants;
group III includes vowels, occlusive sonants, fricative sonants, voiced vibrant consonants, noise fricative sonants, unvoiced fricative consonants, explosive sonants, unvoiced explosive consonants, and affricates.
To conduct an experiment, after manual text transcription, a sound was replaced by the class corresponding to it. Then the inverse problem was solved, namely, recurrent sequences of combinations corresponding to definite words were determined from the obtained record. In this case, not only a correct definition of a word was taken into account but also the determination of a single-root word. The relationships obtained for isolated-word and continuous speech are presented in Table 1 (the relationships for single-root words are given in brackets).

TABLE 1. Relationships between the size of description of a feature and the total text size for the groups, %

Speech                  I         II        III
Isolated-word speech    49 (66)   54 (72)   69 (92)
Continuous speech       18 (30)   27 (47)   44 (74)

Thus, it follows from the obtained results that the reliability of the determination of a configuration and, accordingly, generators of the higher level increases with increasing the number of generators that characterize feature descriptions.
Fig. 3. Projection of objectives of dialogue participants. [Labels: Pragmatic level, Semantic level.]
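The class-substitution experiment described above can be imitated on a toy scale. The lexicon, the two sound classifications, and the function name `unique_word_share` are hypothetical and much coarser than groups I–III; the sketch only shows the procedure of replacing sounds by class names and then checking which words can still be identified from the class sequence.

```python
from collections import defaultdict


def unique_word_share(lexicon, sound_class):
    """Percentage of words whose sequence of sound classes is shared with
    no other word of the lexicon, i.e., words recoverable from classes alone."""
    by_pattern = defaultdict(list)
    for word, sounds in lexicon.items():
        pattern = tuple(sound_class[sound] for sound in sounds)
        by_pattern[pattern].append(word)
    unique = sum(1 for words in by_pattern.values() if len(words) == 1)
    return 100.0 * unique / len(lexicon)


# Hypothetical transcribed lexicon.
lexicon = {
    "sap": ["s", "a", "p"],
    "tap": ["t", "a", "p"],
    "map": ["m", "a", "p"],
    "mass": ["m", "a", "s"],
}

# Coarse partition: vowel (V), sonant (S), noise consonant (N).
coarse = {"a": "V", "m": "S", "s": "N", "t": "N", "p": "N"}
# Finer partition: noise consonants split into fricative (F) and occlusive (O).
fine = {"a": "V", "m": "S", "s": "F", "t": "O", "p": "O"}

coarse_share = unique_word_share(lexicon, coarse)
fine_share = unique_word_share(lexicon, fine)
```

With the coarse partition no word of this lexicon is identified uniquely, whereas the finer partition separates all four, which mirrors the growth from group I to group III in Table 1.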
C-QUALITY OF AN INTERACTIVE SYSTEM
The controllability of an interactive system (Fig. 3) can be provided by the execution of objective functions at the pragmatics and semantics levels. The dialogue objective [14] is to inform dialogue participants, to transmit knowledge, and to achieve consensus.
It is obvious that the dialogue objective is defined at the pragmatic level that is projected onto the level of semantics in the form of domains (see Fig. 3). Since, during a dialogue, the objective is reached by an isomorphic identification of sequences of generators of the lower level in terms of generators of the higher level, to each point of the pragmatic level will correspond a domain of the semantic level. Here, a sufficiently complicated problem arises that consists of the definition of an equivalence on a set of utterances at the semantic level. The essence is that these equivalence relations do not necessarily coincide for different dialogue participants. Such a noncoincidence can also exist at lower levels of dialogue, for example, at the levels of syntax and morphology. This takes place in the case when words and terms are differently interpreted by dialogue participants (see Fig. 3).
Thus, the dialogue objective is reached by an isomorphic identification of configurations of generators of lower levels in terms of generators of higher levels. In this case, the trajectories of advancement of dialogue (or transformation of configurations) are controlled under the condition of presence of an objective function.
Fig. 4. The trajectory of advancement of dialogue between participants (panels a and b). [Labels in each panel: Participant 1, Participant 2, Dialogue domain.]
Fig. 5. Transition graph in the case of a stationary pragmatic component.
It should also be noted that, during a dialogue, its participants pursue their objectives. At the same time, their objectives do not necessarily coincide. On the other hand, dialogue participants have different concepts that do not necessarily intersect. Let us consider the variant in which both participants possess similar knowledge and their dialogue is directed toward the achievement of a unified objective. In Fig. 4a, trends are depicted that fix the trajectory of advancement of the dialogue of interlocutors in the semantic dialogue space. Apparently, the trajectories of the participants in the dialogue are gradually shifted to a domain that is common for them.
Nevertheless, in some cases, interlocutors can believe that they are speaking about the same thing while, in actual fact, their object domains are different (Fig. 4b).
Passing to the level of objectives, we can see various situations. For example, interlocutors do not necessarily strive for mutual understanding or their dialogue objectives can be different. In this case, the trends in the semantic space of their dialogue can be in one domain and an objective can be reached.
A realization of transitions in the case of a stationary pragmatic component, for example, in carrying out an examination, is presented in Fig. 5. Here, participant 1 generates messages Z1, Z3, and Z5 and participant 2 generates answers Z4, Z6, and Z8. In states 7 and 8, both participants pass to the common pragmatics domain for which states 7 and 8 are equivalent.
CONCLUSIONS
The performed analysis of a conceptual model of an interactive system allows one to draw the conclusion that the construction of a high-quality speech perception system is impossible without the creation of a synthesis system and vice versa. At the same time, it is necessary to take into account phenomena such as the prediction of a succession of events not only at each hierarchy level (strings of words, phonemes, letters, words, etc.) but also between levels. In particular, knowledge of the object domain of a dialogue allows one to restrict the lexical base of decision-making at the level of words. The conceptual model considered allows one to purposefully formulate requirements on speech recognition and synthesis at different hierarchy levels of an interactive system.
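The restriction of the lexical base by knowledge of the object domain can be sketched as follows. This is only an illustration under stated assumptions: the word hypotheses, their scores, and the domain lexicon are hypothetical, and the renormalization rule is one simple choice rather than a method prescribed by the model.

```python
def restrict_hypotheses(hypotheses, domain_lexicon):
    """Keep only word hypotheses found in the domain lexicon and renormalize."""
    kept = {w: s for w, s in hypotheses.items() if w in domain_lexicon}
    total = sum(kept.values())
    if total == 0:
        # No hypothesis belongs to the domain: fall back to the unrestricted set.
        return dict(hypotheses)
    return {w: s / total for w, s in kept.items()}


# Hypothetical word-level scores produced by a recognizer for one segment.
hypotheses = {"flight": 0.3, "fight": 0.5, "light": 0.2}
# Hypothetical lexicon of the object domain of the dialogue.
travel_lexicon = {"flight", "light", "ticket"}

unrestricted_best = max(hypotheses, key=hypotheses.get)
restricted = restrict_hypotheses(hypotheses, travel_lexicon)
restricted_best = max(restricted, key=restricted.get)
```

Without the domain restriction the acoustically strongest hypothesis ("fight") wins. With the restriction it is excluded, and the decision at the level of words changes to "flight".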
The analysis of the conceptual model shows that there is an intimate connection between the speech perception and speech production systems, namely, a common knowledgebase and close interaction between them during dialogue. In using a common knowledgebase, it is assumed that, in recognizing a speech signal at the level of sounds, a parametric description must be formed that corresponds to models of motions of the articulation organs of the speech production system. In other words, the description of sounds in a speech flow must be close in many respects to the description of the state of articulators since otherwise the feedback during recognition and speech perception can turn out to be inefficient.
It should be noted that the M-quality of a system poses a very topical problem connected with the choice of the structure of generators for different hierarchy levels of an interactive system. In fact, the solution of this problem must determine the sizes of descriptions of a name, features, and relations between generators [7], i.e., must define an equivalence relation on the set of elements of different hierarchy levels. An increase in the size of the description of a name stipulates a decrease in the feature description size, but this can lead, first, to a decrease in the recognition reliability and, second, to the destruction of the structure of relations, i.e., to the destruction of the regularity in sequences of generators. A definite basis for the solution of this problem consists of well-known data on the phonetics, morphology, and grammar of other languages.
The introduction of the concept of the M-quality of hierarchical systems allows one to use mathematical methods of searching for an optimum since quality estimation criteria are also hierarchical. In this case, not only is the range of values decreased but the opportunity also arises to predict the initial point of searching for an optimal value.
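The two effects mentioned above, a reduced range of values and a predicted initial point, can be illustrated with a coarse-to-fine search. The sketch below is a generic illustration, not the optimization procedure of the paper: the two quality criteria, the grid search, and the `shrink` parameter are all hypothetical assumptions.

```python
def grid_argmax(f, lo, hi, steps):
    """Return the grid point in [lo, hi] at which f is largest."""
    points = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(points, key=f)


def hierarchical_search(criteria, lo, hi, steps=10, shrink=0.25):
    """Optimize level by level.

    The optimum found at one level predicts the starting region of the next,
    and the search interval is narrowed around it.
    """
    x = None
    for f in criteria:
        x = grid_argmax(f, lo, hi, steps)
        half = (hi - lo) * shrink / 2
        lo, hi = max(lo, x - half), min(hi, x + half)
    return x


# Hypothetical quality criteria for an upper (coarse) and a lower (fine) level.
def coarse(x):
    return -(x - 3.0) ** 2


def fine(x):
    return -(x - 3.2) ** 2


x_best = hierarchical_search([coarse, fine], 0.0, 10.0)
```

Here the coarse criterion locates the optimum near 3.0 on the interval [0, 10], after which the fine criterion is evaluated only on [1.75, 4.25], a quarter of the original range.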
REFERENCES
1. Yu. S. Bykov, Theory of Speech Intelligibility in Communication Lines [in Russian], Oboronizdat, Moscow (1954).
2. M. Mesarovic, D. Macko, and Y. Takahara, Theory of Hierarchical Multilevel Systems [Russian translation], Mir, Moscow (1973).
3. D. A. Pospelov (ed.), Artificial Intelligence [in Russian], Vol. 2, Models and Methods: A Handbook, Radio i Svyaz, Moscow (1990).
4. C. Shannon, Works on the Theory of Information and Cybernetics [Russian translation], Izd. Inostr. Lit., Moscow (1963).
5. G. G. Belonogov and A. P. Novoselov, Automation of Processes of Information Accumulation, Retrieval, and Generalization [in Russian], Nauka, Moscow (1979).
6. V. P. Bondarenko, V. P. Kotsubinsky, and R. V. Meshcheryakov, Hierarchical structures of speech recognition and synthesis, in: Intelligent Computer-Aided Design, Control, and Training Systems, Izd. NII AEM, Tomsk (2000), pp. 115-125.
7. V. N. Sorokin, Theory of Speech Production [in Russian], Radio i Svyaz, Moscow (1985).
8. B. S. Fleishman, Elements of the Theory of Potential Efficiency of Complex Systems [in Russian], Sov. Radio, Moscow (1971).
9. U. Grenander, Lectures in Pattern Theory: Pattern Synthesis [Russian translation], Mir, Moscow (1979).
10. R. Klatzky, Human Memory: Structures and Processes [Russian translation], Mir, Moscow (1978).
11. J. Tou and R. Gonzalez, Pattern Recognition Principles [Russian translation], Mir, Moscow (1978).
12. G. V. Reklaitis, A. Ravindran, and K. M. Ragsdell, Engineering Optimization [Russian translation], Vol. 1, Mir, Moscow (1986).
13. R. Gallager, Information Theory and Reliable Communication [Russian translation], Sov. Radio, Moscow (1974).
14. V. P. Bondarenko, V. P. Kotsubinsky, and R. V. Meshcheryakov, Speech signal synthesis from printed texts, in: V. P. Tarasenko (ed.), Automatic and Automated Control of Complex Systems, Izd. Tomsk. Univ., Tomsk (1998), pp. 204-217.
Springer Science+Business Media, Inc. 2008