RETRACTED ARTICLE: Transfer learning through perturbation-based in-domain spectrogram augmentation for adult speech recognition

Abstract

The development of numerous frameworks and pedagogical practices has significantly improved the performance of deep learning-based speech recognition systems in recent years. Developing automatic speech recognition (ASR) for indigenous languages is highly complex because of the wide range of auditory and linguistic components and the lack of speech and text data, which significantly affects the ASR system's performance. The main purpose of this research is to use in-domain data augmentation methods to address data scarcity, resulting in increased neural network consistency. It further details how to create synthetic datasets through pooled augmentation methodologies, primarily spectrogram augmentation, in conjunction with transfer learning techniques. First, the richness of the signal is improved by deforming the time and/or frequency axis: time-warping deforms the signal's envelope, whereas frequency-warping alters its spectral content. Second, the raw signal is examined using audio-level speech perturbation methods such as speed and vocal tract length perturbation. These methods are shown to be effective in addressing data scarcity at a low implementation cost. Nevertheless, because multiple versions of a single input are fed into the network during training, these methods effectively increase the dataset size and are likely to result in overfitting. Consequently, an effort has been made to address overfitting by integrating two-level augmentation procedures, pooling prosody- and spectrogram-modified speech signals with the original signals using transfer learning techniques.
Finally, the adult ASR system was evaluated using a deep neural network (DNN) with concatenated feature analysis employing Mel-frequency cepstral coefficients (MFCC), pitch features, and Vocal Tract Length Normalization (VTLN) on pooled Punjabi datasets, yielding a relative improvement of 41.16 percent over the baseline system.
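
The audio-level speed perturbation and pooling steps mentioned in the abstract can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation. The function names speed_perturb and pool_with_perturbed, the factors 0.9 and 1.1, the linear-interpolation resampling, and the synthetic 220 Hz tone are all assumptions made for illustration; the abstract does not state the perturbation factors or the toolkit used.

```python
import numpy as np


def speed_perturb(signal, factor):
    """Resample a 1-D waveform so that it plays `factor` times faster.

    Hypothetical helper: uses plain linear interpolation, which changes
    both duration and pitch, as simple speed perturbation does.
    """
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal that each output sample is read from.
    src_positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(signal)), signal)


def pool_with_perturbed(utterances, factors=(0.9, 1.1)):
    """Return the original utterances plus one perturbed copy per factor.

    The factors are assumed values, not taken from the article.
    """
    pooled = list(utterances)
    for utt in utterances:
        for factor in factors:
            pooled.append(speed_perturb(utt, factor))
    return pooled


# Toy input: one second of a 220 Hz tone sampled at 16 kHz (synthetic data).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 220 * t)

pooled = pool_with_perturbed([tone])
```

Here pooled holds the original tone followed by a slowed-down copy (factor 0.9, more samples) and a sped-up copy (factor 1.1, fewer samples), so the training pool contains three versions of the one input.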

Details

Title

RETRACTED ARTICLE: Transfer learning through perturbation-based in-domain spectrogram augmentation for adult speech recognition

Author

Kadyan, Virender¹; Bawa, Puneet²

¹ University of Petroleum & Energy Studies (UPES), Speech and Language Research Centre, School of Computer Science, Dehradun, India (GRID:grid.444415.4) (ISNI:0000 0004 1759 0860)
² Chitkara University Institute of Engineering & Technology, Chitkara University, Centre of Excellence for Speech and Multimodal Laboratory, Rajpura, India (GRID:grid.428245.d) (ISNI:0000 0004 1765 3753)

Pages

21015-21033

Publication year

2022

Publication date

Dec 2022

Publisher

Springer Nature B.V.

ISSN

09410643

e-ISSN

14333058

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1007/s00521-022-07579-6

ProQuest document ID

2732893870

© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
