1. Introduction
Speech synthesis, also known as text-to-speech (TTS) technology, converts text into audible speech. It has become one of the most common modes of human-computer interaction, gradually replacing traditional interaction methods and making human-computer interaction more convenient and efficient. With the continuous development of speech synthesis technology, multilingual speech synthesis has attracted increasing research interest, since it can synthesize different languages within a unified speech synthesis system [1–3].
China has many ethnic minorities, and many of them have their own languages and scripts. Tibetan is one of these minority languages. It can be divided into three major dialects: the Ü-Tsang, Amdo, and Kham dialects, which are mainly spoken in Tibet, Qinghai, Sichuan, Gansu, and Yunnan. All dialects share the Tibetan script as their written form, but their pronunciations differ, so it is difficult for speakers of different dialects to communicate with each other. There have been several studies on speech synthesis for the Lhasa-Ü-Tsang dialect [4–12]. The end-to-end method [12] has advantages in training over the statistical parameter method, and its synthesis quality is better. Research on speech synthesis for the Amdo dialect is scarce; only the work [13] applied statistical parametric speech synthesis (SPSS) based on the hidden Markov model (HMM) to the Tibetan Amdo dialect.
For multilingual speech synthesis, existing research mainly uses the unit-selection concatenative synthesis technique, SPSS based on HMM, and deep learning. The unit-selection concatenative technique mainly involves selecting a unit scale, constructing a corpus, and designing algorithms for unit selection and splicing. This method relies on a large-scale corpus [14, 15]; in addition, the synthesis quality is unstable, and the joins between spliced units may be discontinuous. SPSS usually requires a complex text front-end to extract various linguistic features from raw text, a duration model, an acoustic model that learns the mapping from linguistic features to acoustic features, and a complex signal-processing-based vocoder to reconstruct the waveform from the predicted acoustic features. The work [16] proposes a framework for estimating HMMs on data containing both multiple speakers and multiple languages, aiming to transfer a voice from one language to others. The works [2, 17, 18] propose methods to realize HMM-based cross-lingual SPSS using speaker adaptive training. Among deep learning-based approaches, the work [19] realizes deep neural network- (DNN-) based Mandarin-Tibetan bilingual speech synthesis, and the experimental results show that the synthesized Tibetan speech is better than that of HMM-based Mandarin-Tibetan cross-lingual speech synthesis. The work [20] trains acoustic models with DNN, hybrid long short-term memory (LSTM), and hybrid bidirectional long short-term memory (BLSTM) networks and implements deep learning-based Mandarin-Tibetan cross-lingual speech synthesis under a unified framework; the experiments demonstrate that the hybrid BLSTM-based cross-lingual framework outperforms the Tibetan monolingual framework. Several studies further show that multilingual speech synthesis with end-to-end methods achieves good performance. The work [21] presents an end-to-end multilingual speech synthesis model that uses a Unicode "byte" input representation to train a model which outputs the corresponding audio in English, Spanish, or Mandarin. The work [22] proposes a multispeaker, multilingual TTS synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages.
Traditional methods require extensive expert knowledge for phoneme analysis and for tone and prosody labelling, which is time-consuming and costly; moreover, their modules are usually trained separately, which leads to error accumulation across modules [23]. In contrast, an end-to-end speech synthesis system can automatically learn the alignment and mapping from linguistic features to acoustic features, and it can be trained on <text, audio> pairs without a complex language-dependent text front-end. Inspired by the above works, this paper proposes an end-to-end method for speech synthesis of the Lhasa-Ü-Tsang and Amdo pastoral dialects, using a single sequence-to-sequence (seq2seq) architecture with an attention mechanism as the shared feature prediction network for multiple Tibetan dialects and introducing two dialect-specific WaveNet networks to generate the time-domain waveforms.
This work shares some similarities with the works [12, 24], all of which use the WaveNet model. However, in our work and [12], WaveNet generates waveform samples from predicted Mel spectrograms for speech synthesis, whereas in the speech recognition work [24], WaveNet generates text sequences from MFCC features as input. The work [12] achieved speech synthesis for Tibetan Lhasa-Ü-Tsang using an end-to-end model. In this paper, we extend the model of [12] to implement multidialect speech synthesis.
Our contributions can be summarized as follows. (1) We propose an end-to-end Tibetan multidialect speech synthesis model, which unifies all modules into one model and synthesizes different Tibetan dialects within a single speech synthesis system. (2) Joint learning is used to train the shared feature prediction network on the relevant features of multidialect speech data, which helps to improve the speech synthesis performance for each dialect. (3) We use the Wylie transliteration scheme to convert Tibetan text into the corresponding Latin letters, which serve as the training units of the model. This effectively reduces the size of the training corpus, reduces the workload of front-end text processing, and improves modelling efficiency.
The rest of this paper is organized as follows. Section 2 introduces the end-to-end Tibetan multidialect speech synthesis model. The experiments are presented in detail in Section 3 and the results are discussed as well. Finally, we describe our conclusions in Section 4.
2. Model Architecture
The end-to-end speech synthesis model is mainly composed of two parts: a seq2seq feature prediction network with an attention mechanism, and two dialect-specific WaveNet vocoders conditioned on Mel spectrograms. The model synthesizes speech in two stages, from text to an intermediate representation and from the intermediate representation to the speech waveform. The encoder and decoder implement the conversion from text to the intermediate representation, and the WaveNet vocoders restore the intermediate representation to waveform samples. Figure 1 shows the end-to-end Tibetan multidialect speech synthesis model.
[figure omitted; refer to PDF]
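To make the two-stage flow concrete, the following is a minimal Python sketch of how text is routed through the shared feature prediction network and then through a dialect-specific WaveNet vocoder. `FeaturePredictionNet`, `WaveNetVocoder`, the dialect keys, and the letter-ID input are hypothetical stand-ins for the trained models, not the paper's released code.

```python
# Minimal sketch of the two-stage synthesis flow, assuming Python with NumPy.
# The classes are placeholders for the trained networks described in this section.
import numpy as np

class FeaturePredictionNet:
    """Stand-in for the shared seq2seq network: maps a letter sequence to a Mel spectrogram."""
    def predict_mel(self, letter_ids):
        # Placeholder: a real model would run the attention-based encoder-decoder here.
        return np.zeros((len(letter_ids) * 5, 80), dtype=np.float32)

class WaveNetVocoder:
    """Stand-in for one dialect-specific WaveNet: maps a Mel spectrogram to waveform samples."""
    def generate(self, mel):
        # Placeholder: a real vocoder would sample the waveform autoregressively.
        return np.zeros(mel.shape[0] * 256, dtype=np.float32)

shared_net = FeaturePredictionNet()
vocoders = {"lhasa": WaveNetVocoder(), "amdo": WaveNetVocoder()}

def synthesize(letter_ids, dialect):
    mel = shared_net.predict_mel(letter_ids)   # text -> intermediate representation
    return vocoders[dialect].generate(mel)     # intermediate representation -> waveform

waveform = synthesize(letter_ids=[3, 17, 24, 9], dialect="amdo")
```

The point of the sketch is the routing: one shared front-end produces the intermediate Mel representation for any dialect, and only the final vocoder is dialect-specific.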
Each Tibetan syllable has a root, which is the central consonant of the syllable. A vowel mark can be added above or below the root to indicate different vowels. Sometimes there is a superscript above the root, one or two subscripts below it, and a prescript in front of it, indicating that the initial of the syllable is a compound consonant; the components of a compound consonant are joined in the order prescript, superscript, root, and subscript. Sometimes one or two postscripts follow the root, meaning that the syllable has one or two consonant endings. The structure of Tibetan syllables is shown in Figure 3.
[figure omitted; refer to PDF]
Due to the complexity of Tibetan spelling, there are as many as 20450 possible Tibetan syllables. If single Tibetan syllables were used as the basic unit of speech synthesis, a large amount of speech data would need to be trained, and the corpus construction workload would be huge. Existing Tibetan speech synthesis systems [10, 13] use the initials and vowels of Tibetan characters as the input of the model, which requires extensive knowledge of Tibetan linguistics and heavy front-end text processing. In this paper, we adopt the Wylie transliteration scheme, which uses only the 26 basic Latin letters, without any additional letters or symbols, to convert Tibetan text into the corresponding Latin letters. This effectively reduces the size of the training corpus, reduces the workload of front-end text processing, and improves modelling efficiency. Figure 4 shows the result of applying the Wylie transliteration scheme to the Tibetan sentence in Figure 2.
[figure omitted; refer to PDF]
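As an illustration of the idea only (not the full Wylie scheme), the sketch below transliterates a few Tibetan consonant letters into Latin strings with a hand-written lookup table. The table covers just a handful of characters; a complete converter must also handle vowel signs, stacked (subjoined) consonants, and Tibetan punctuation.

```python
# Illustrative sketch of Wylie-style transliteration for a few Tibetan letters (Python 3).
# The mapping is deliberately tiny and is an assumption for demonstration purposes.
WYLIE_TABLE = {
    "\u0F40": "ka",   # TIBETAN LETTER KA
    "\u0F41": "kha",  # TIBETAN LETTER KHA
    "\u0F42": "ga",   # TIBETAN LETTER GA
    "\u0F44": "nga",  # TIBETAN LETTER NGA
    "\u0F0B": " ",    # TSHEG (syllable delimiter) -> space
}

def to_wylie(text):
    """Transliterate characters found in the table; pass other characters through unchanged."""
    return "".join(WYLIE_TABLE.get(ch, ch) for ch in text)

print(to_wylie("\u0F40\u0F0B\u0F41\u0F0B\u0F42"))  # -> "ka kha ga"
```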
In the past, traditional acoustic and linguistic features were used as the input of the WaveNet model for speech synthesis. In this paper, we choose a low-level acoustic representation, the Mel spectrogram, as the input for WaveNet training. The Mel spectrogram emphasizes low-frequency detail, which is important for the clarity of speech. Moreover, unlike waveform samples, each frame of the Mel spectrogram is phase-invariant, so it is easier to train with a squared error loss. We train separate WaveNet vocoders for the Lhasa-Ü-Tsang dialect and the Amdo pastoral dialect, so that each dialect can be synthesized with its corresponding vocoder.
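For reference, a Mel spectrogram of this kind can be computed from a waveform with librosa; the frame, hop, and Mel-bin settings below are common TTS defaults and placeholders rather than the exact values used in this work.

```python
# Sketch: extracting an 80-bin log-Mel spectrogram with librosa.
# n_fft, hop_length, and n_mels are typical TTS settings, not necessarily those of this paper.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder file; 16 kHz matches the corpus
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression for training stability
print(log_mel.shape)  # (80, number_of_frames)
```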
2.4. Training Process
The training process consists of two steps: first, training the shared feature prediction network; second, training a dialect-specific WaveNet vocoder for the Lhasa-Ü-Tsang and Amdo pastoral dialects, respectively, on the outputs generated by the network trained in step 1.
We trained the shared feature prediction network on the Lhasa-Ü-Tsang and Amdo pastoral dialect datasets. The network was trained on a single GPU with a batch size of 8 using the teacher-forcing method, in which the decoder input is the ground-truth output rather than the predicted output. An Adam optimizer was used with
Then, the predicted outputs of the shared feature prediction network were aligned with the ground truth, and we trained the WaveNet vocoders for the Lhasa-Ü-Tsang and Amdo pastoral dialects, respectively, on these aligned predicted outputs. Because the predicted data were generated in teacher-forcing mode, each spectrogram frame is exactly aligned with the corresponding waveform samples. In the process of training the WaveNet network, we used an Adam optimizer with
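The sketch below illustrates the teacher-forcing idea in isolation with a toy PyTorch decoder: at every step the decoder is conditioned on the ground-truth previous Mel frame rather than on its own prediction. The module, sizes, and learning rate are illustrative assumptions, not the exact configuration of the feature prediction network.

```python
# Minimal teacher-forcing training step for an autoregressive Mel-frame decoder (PyTorch).
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy autoregressive decoder: predicts the next Mel frame from the previous one."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, prev_frames):
        out, _ = self.rnn(prev_frames)
        return self.proj(out)

decoder = TinyDecoder()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)  # learning rate is an assumption
mel_target = torch.randn(8, 120, 80)   # batch of 8 utterances, 120 frames, 80 Mel bins (dummy data)

# Teacher forcing: condition on the ground-truth previous frame, not the model's own output.
go_frame = torch.zeros(8, 1, 80)
decoder_input = torch.cat([go_frame, mel_target[:, :-1, :]], dim=1)
prediction = decoder(decoder_input)
loss = nn.functional.mse_loss(prediction, mel_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the decoder always sees ground-truth history during training, the predicted spectrogram frames line up one-to-one with the target frames, which is what makes the predicted outputs usable for training the WaveNet vocoders in step 2.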
3. Results and Analysis
3.1. Experimental Data
The training data consist of Lhasa-Ü-Tsang dialect and Amdo pastoral dialect recordings. The Lhasa-Ü-Tsang dialect speech data comprise about 1.43 hours with 2000 text sentences, and the Amdo pastoral dialect speech data comprise about 2.68 hours with 2671 text sentences. All speech files are converted to a 16 kHz sampling rate with 16-bit quantization.
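As a sketch of the corresponding preprocessing, the snippet below resamples a recording to 16 kHz and writes it as 16-bit PCM; librosa and soundfile are assumed to be available, and the file names are placeholders.

```python
# Sketch: converting a recording to a 16 kHz sampling rate with 16-bit quantization.
import librosa
import soundfile as sf

y, sr = librosa.load("raw_recording.wav", sr=16000)              # resample to 16 kHz on load
sf.write("converted_recording.wav", y, 16000, subtype="PCM_16")  # write 16-bit PCM output
```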
3.2. Experimental Evaluation
To ensure the reliability of the results, we evaluate them with both objective and subjective experiments.
In the objective experiment, the root mean square error (RMSE) between the time-domain sequences of the synthesized speech and the reference speech is calculated to measure their difference. The smaller the RMSE, the closer the synthesized speech is to the reference and the better the synthesis quality. The RMSE is computed as RMSE = sqrt((1/N) Σ_{t=1}^{N} (x_t − y_t)^2), where x_t and y_t denote the t-th samples of the reference and synthesized waveforms, respectively, and N is the number of samples.
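A minimal computation of this metric, assuming NumPy and two time-domain sequences truncated to a common length, is sketched below.

```python
# Sketch: RMSE between a reference waveform and a synthesized waveform.
import numpy as np

def rmse(reference, synthesized):
    reference = np.asarray(reference, dtype=np.float64)
    synthesized = np.asarray(synthesized, dtype=np.float64)
    n = min(len(reference), len(synthesized))   # truncate to the common length
    diff = reference[:n] - synthesized[:n]
    return np.sqrt(np.mean(diff ** 2))

print(rmse([0.1, 0.2, 0.3], [0.12, 0.18, 0.29]))  # small value -> waveforms are close
```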
For the Lhasa-Ü-Tsang and Amdo pastoral dialects, we randomly select 10 text sentences, synthesize them with the end-to-end Tibetan multidialect speech synthesis model, and calculate the average RMSE to evaluate how close the synthesized speech of each dialect is to its reference speech. To evaluate the performance of the model, we compare it with an end-to-end Tibetan Lhasa-Ü-Tsang dialect speech synthesis model and an end-to-end Tibetan Amdo pastoral dialect speech synthesis model; these two single-dialect models were used to synthesize the same 10 text sentences, and the average RMSE was calculated. The results are shown in Table 1. For the Lhasa-Ü-Tsang dialect, the RMSE of the multidialect model is 0.2126, which is lower than that of the Lhasa-Ü-Tsang-only model (0.2223). For the Amdo pastoral dialect, the RMSE of the multidialect model is 0.1223, which is lower than that of the Amdo-only model (0.1253). This means that the speech synthesized by our model for both dialects is closer to the reference speech. The results show that our method learns effective feature representations for both the Lhasa-Ü-Tsang and Amdo pastoral dialects through the shared feature prediction network, thereby improving multidialect synthesis performance over the single-dialect models. In addition, the synthesis quality for the Amdo pastoral dialect is better than that for the Lhasa-Ü-Tsang dialect because the Amdo pastoral dialect has a larger data scale.
Table 1
Objective evaluation of the results.
Tibetan dialect | The RMSE of end-to-end Tibetan multidialect speech synthesis model | The RMSE of end-to-end Tibetan Lhasa-Ü-Tsang dialect speech synthesis model | The RMSE of end-to-end Tibetan Amdo pastoral dialect speech synthesis model |
Lhasa-Ü-Tsang dialect | 0.2126 | 0.2223 | — |
Amdo pastoral dialect | 0.1223 | — | 0.1253 |
Figures 6 and 7 show the Mel spectrograms predicted by the feature prediction network and the target Mel spectrograms for the Lhasa-Ü-Tsang dialect and the Amdo pastoral dialect, respectively. It can be seen from the figures that the predicted Mel spectrograms of both dialects are similar to their target Mel spectrograms.
[figure omitted; refer to PDF]
In the subjective experiment, the absolute category rating (ACR) method was used to evaluate the synthesized speech of the Lhasa-Ü-Tsang and Amdo pastoral dialects mentioned above. Twenty-five listeners were recruited; after listening to the synthesized speech, they scored it with the original speech as a reference, according to the grading standard in Table 2. After collecting the scores from all listeners, the mean opinion score (MOS) of the synthesized speech was calculated, and Table 3 shows the results. The MOS values of the synthesized speech for the Lhasa-Ü-Tsang and Amdo pastoral dialects are 3.95 and 4.18, respectively, which indicates that the synthesized speech has good clarity and naturalness.
Table 2
Grading standards of ACR.
Grading value | Estimated quality |
5 | Very good |
4 | Good |
3 | Medium |
2 | Bad |
1 | Very bad |
Table 3
The MOS of the synthesized speech for each Tibetan dialect.
Tibetan dialect | MOS |
Lhasa-Ü-Tsang dialect | 3.95 |
Amdo pastoral dialect | 4.18 |
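The MOS is simply the mean of the listeners' ACR scores for each dialect; a minimal computation, with made-up placeholder scores rather than the study's raw data, is shown below.

```python
# Sketch: MOS as the mean of individual listeners' ACR scores (scores are placeholders).
lhasa_scores = [4, 4, 3, 5, 4]              # hypothetical ACR scores from individual listeners
mos = sum(lhasa_scores) / len(lhasa_scores)
print(round(mos, 2))                        # e.g., 4.0
```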
3.3. Comparative Experiment
To verify the performance of the end-to-end Tibetan multidialect speech synthesis system, we compared it with the "linear predictive amplitude spectrum + Griffin–Lim" and "Mel spectrogram + Griffin–Lim" speech synthesis systems. The results of the comparison experiment are shown in Table 4. For both the Lhasa-Ü-Tsang and Amdo pastoral dialects, the MOS of the speech synthesized by the "Mel spectrogram + Griffin–Lim" system is higher than that of the "linear predictive amplitude spectrum + Griffin–Lim" system. This shows that the Mel spectrogram is a more effective predicted feature than the linear predictive amplitude spectrum and yields higher-quality speech. The "Mel spectrogram + WaveNet" system outperforms the "Mel spectrogram + Griffin–Lim" system with a higher MOS, which means that WaveNet recovers speech phase information better and generates higher-quality synthesized speech than the Griffin–Lim algorithm.
Table 4
The MOS comparison of speech synthesized by different models.
Model | MOS of Lhasa-Ü-Tsang dialect | MOS of Amdo pastoral dialect |
Linear predictive amplitude spectrum + Griffin–Lim | 3.30 | 3.52 |
Mel spectrogram + Griffin–Lim | 3.55 | 3.70 |
Mel spectrogram + WaveNet | 3.95 | 4.18 |
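For reproducing the Griffin–Lim baselines in this comparison, librosa provides an implementation of the algorithm; the snippet below recovers a waveform from a linear magnitude spectrogram, with the STFT settings and iteration count as assumptions rather than the compared systems' exact configuration.

```python
# Sketch: Griffin-Lim phase reconstruction from a magnitude spectrogram with librosa.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)                        # placeholder input file
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))                # linear magnitude spectrogram
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256, n_fft=1024)   # iterative phase estimation
```

Griffin–Lim only estimates phase iteratively from the magnitude spectrogram, whereas the WaveNet vocoder models the waveform directly, which is consistent with the higher MOS of the "Mel spectrogram + WaveNet" system in Table 4.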
4. Conclusion
This paper builds an end-to-end Tibetan multidialect speech synthesis model, consisting of a seq2seq feature prediction network, which maps character vectors to Mel spectrograms, and dialect-specific WaveNet vocoders for the Lhasa-Ü-Tsang and Amdo pastoral dialects, respectively, which synthesize the Mel spectrograms into time-domain waveforms. Our model uses the dialect-specific WaveNet vocoders to synthesize the corresponding Tibetan dialects. In the experiments, the Wylie transliteration scheme is used to convert Tibetan characters into Latin letters, which effectively reduces the number of synthesis primitives and the scale of the training data. Both objective and subjective experimental results show that the synthesized speech of the Lhasa-Ü-Tsang and Amdo pastoral dialects has high quality.
Acknowledgments
This study was supported by the National Natural Science Foundation of China under grant no. 61976236.
[1] Y. J. Wu, Y. Nankaku, K. Tokuda, "State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis," Proceedings of Interspeech, 10th Annual Conference of the International Speech Communication Association.
[2] H. Liu, Research on HMM-Based Cross-Lingual Speech Synthesis, 2011.
[3] R. Sproat, Multilingual Text-To-Speech Synthesis: The Bell Labs Approach, 1998.
[4] Z. M. Cairang, Research on Tibetan Speech Synthesis Technology Based on Mixed Primitives, 2016.
[5] L. Gao, Z. H. Yu, W. S. Zheng, "Research on HMM-based Tibetan Lhasa speech synthesis technology," Journal of Northwest University for Nationalities, vol. 32 no. 2, pp. 30-35, 2011.
[6] J. X. Zhang, Research on Tibetan Lhasa Speech Synthesis Based on HMM, 2014.
[7] S. P. Xu, Research on Speech Quality Evaluation for Tibetan Statistical Parametric Speech Synthesis, 2015.
[8] X. J. Kong, Research on Methods of Text Analysis for Tibetan Statistical Parametric Speech Synthesis, 2017.
[9] Y. Zhou, D. C. Zhao, "Research on HMM-based Tibetan speech synthesis," Computer Applications and Software, vol. 32 no. 5, pp. 171-174, 2015.
[10] G. C. Du, Z. M. Cairang, Z. J. Nan, "Tibetan speech synthesis based on neural network," Journal of Chinese Information Processing, vol. 33 no. 2, pp. 75-80, 2019.
[11] L. S. Luo, G. Y. Li, C. W. Gong, H. L. Ding, "End-to-end speech synthesis for Tibetan Lhasa dialect," Journal of Physics: Conference Series, vol. 1187 no. 5,DOI: 10.1088/1742-6596/1187/5/052061, 2019.
[12] Y. Zhao, P. Hu, X. Xu, L. Wu, X. Li, "Lhasa-Tibetan speech synthesis using end-to-end model," IEEE Access, vol. 7, pp. 140305-140311, DOI: 10.1109/ACCESS.2019.2940125, 2019.
[13] L. Su, Research on the Speech Synthesis of Tibetan Amdo Dialect Based on HMM, 2018.
[14] S. Quazza, L. Donetti, L. Moisa, P. L. Salza, "Actor: a multilingual unit-selection speech synthesis system," Proceedings of the 4th ISCA Workshop on Speech Synthesis, Perth, Australia.
[15] F. Deprez, J. Odijk, J. D. Moortel, "Introduction to multilingual corpus-based concatenative speech synthesis," Proceedings of Interspeech, 8th Annual Conference of the International Speech Communication Association, pp. 2129-2132.
[16] H. Zen, N. Braunschweiler, S. Buchholz, "Statistical parametric speech synthesis based on speaker and language factorization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20 no. 6, pp. 1713-1724, DOI: 10.1109/TASL.2012.2187195, 2012.
[17] H. Y. Wang, Research on Statistical Parametric Mandarin-Tibetan Cross-Lingual Speech Synthesis, 2015.
[18] L. Z. Guo, Research on Mandarin-Xingtai Dialect Cross-Lingual Speech Synthesis, 2016.
[19] P. W. Wu, Research on Mandarin-Tibetan Cross-Lingual Speech Synthesis, 2018.
[20] W. Zhang, H. Yang, X. Bu, L. Wang, "Deep learning for Mandarin-Tibetan cross-lingual speech synthesis," IEEE Access, vol. 7, pp. 167884-167894, DOI: 10.1109/ACCESS.2019.2954342, 2019.
[21] B. Li, Y. Zhang, T. Sainath, Y. H. Wu, W. Chan, "Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes," Proceedings of the ICASSP, DOI: 10.1109/ICASSP.2019.8682674.
[22] Y. Zhang, R. J. Weiss, H. Zen, "Learning to speak fluently in a foreign language: multilingual speech synthesis and cross-language voice cloning," 2019. https://arxiv.org/abs/1907.04448
[23] Z. Y. Qiu, D. Qu, L. H. Zhang, "End-to-end speech synthesis based on WaveNet," Journal of Computer Applications, vol. 39 no. 5, pp. 1325-1329, DOI: 10.11772/j.issn.1001-9081.2018102131, 2019.
[24] Y. Zhao, J. Yue, X. Xu, L. Wu, X. Li, "End-to-end-based Tibetan multitask speech recognition," IEEE Access, vol. 7, pp. 162519-162529, DOI: 10.1109/ACCESS.2019.2952406, 2019.
[25] R. Skerry-Ryan, E. Battenberg, Y. Xiao, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," Proceedings of the International Conference on Machine Learning (ICML).
[26] Y. Wang, D. Stanton, Y. Zhang, "Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis," Proceedings of the International Conference on Machine Learning (ICML).
[27] A. van den Oord, S. Dieleman, H. Zen, "WaveNet: a generative model for raw audio," 2016. https://arxiv.org/abs/1609.03499
Copyright © 2021 Xiaona Xu et al. This is an open access article distributed under the Creative Commons Attribution License (the "License"), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
Research on Tibetan speech synthesis has mainly focused on single dialects, and there is a lack of research on Tibetan multidialect speech synthesis. This paper presents an end-to-end Tibetan multidialect speech synthesis model that realizes a single speech synthesis system able to synthesize different Tibetan dialects. Firstly, the Wylie transliteration scheme is used to convert Tibetan text into the corresponding Latin letters, which effectively reduces the size of the training corpus and the workload of front-end text processing. Secondly, a shared feature prediction network with a recurrent sequence-to-sequence structure is built, which maps the Latin transliteration vectors of Tibetan characters to Mel spectrograms and learns the relevant features of multidialect speech data. Thirdly, two dialect-specific WaveNet vocoders are combined with the feature prediction network to synthesize the Mel spectrograms of the Lhasa-Ü-Tsang and Amdo pastoral dialects into time-domain waveforms, respectively. The model avoids relying on extensive Tibetan dialect expertise for time-consuming tasks such as phonetic analysis and phonological annotation, and it can directly synthesize Lhasa-Ü-Tsang and Amdo pastoral speech from existing text annotations. The experimental results show that the speech of the Lhasa-Ü-Tsang and Amdo pastoral dialects synthesized by the proposed method has better clarity and naturalness than that of the Tibetan monolingual model.