1. Introduction
Neural networks are the state of the art in many classification tasks, including video and audio segmentation. Determining a speaker's identity from an audio clip of their speech is a classification problem of this kind. However, one of the most critical trade-offs in neural network training is the quantity of data needed to produce acceptable results. Gender reflects an average physiological difference between speakers and may increase an identification system's accuracy, because males and females differ in emotional expression and voice production. Incorporating gender information in the development and testing stages makes the data more reliable: the network gains an additional cue for identifying the task-specific voice qualities of the two genders [1]. Gender classification from the audio signal is also crucial for various applications, including targeted responses and advertising by voice assistants, population statistics via age-group analysis, and automated profiling of an individual from speech data to support criminal investigations. Furthermore, for a model with a gender-specific search space, even a small quantity of data contributes significantly to audio systems such as automatic speech recognition, speaker identification, and content-based multimedia indexing.
A collection of features is used to determine the gender of a voice. The Mel-scaled power spectrogram (Mel), Mel-frequency cepstral coefficients (MFCCs), power spectrogram chroma (Chroma), spectral contrast (Contrast), and tonal centroid (Tonnetz) are among the most frequently used features for speech gender detection. Machine learning (ML) approaches are then used to build a high-quality system for distinguishing voice gender from the extracted attributes. Each classification approach generates a collection of hypothesis models and selects the best one; this model labels an unknown voice by taking its audio attributes and classifying the voice gender.
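As a concrete illustration, the sketch below shows one common way to assemble such a feature vector with librosa; the 40 MFCCs, the 22.05 kHz sampling rate, and the per-clip time averaging are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np
import librosa

def extract_voice_features(path, sr=22050):
    """Return one fixed-length vector per clip (MFCC, Mel, Chroma, Contrast,
    Tonnetz), averaged over time frames."""
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))

    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)

    # Concatenate into a single feature vector for a classical ML classifier.
    return np.concatenate([mfcc, mel, chroma, contrast, tonnetz])
```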
Feature extraction has progressed to the point that many researchers now treat it as feature engineering, aiming to develop robust feature vectors that accurately characterize the structures relevant to the task at hand. The primary goal of feature engineering is to create features that cluster patterns of the same class together in the feature space while keeping them as far as possible from other classes. However, autonomous representation learning has attracted increasing interest as deep learning methodologies have become easier to study and more widely available. With deep learning, the classification scheme is constructed so that the encoder learns the optimal attributes for describing patterns during the training phase. Moreover, specialized deep architectures such as the convolutional neural network (CNN) often represent the input as an image; the CNN is an architecture designed for, among other things, image classification. This has prompted researchers working with CNNs to devise ways of transforming an audio signal into a time-frequency image. Hence, another approach to the gender categorization problem arose: using audio spectrograms as the inputs to the system. Spectrograms are visual representations of audio; they are comprehensive and precise representations of speech and have been used extensively in audio classification applications [2–4]. Deep neural networks (DNNs) trained on extracted features are very effective at extracting information from data and have been applied successfully to tasks such as speech recognition [3, 5] and image recognition [5–7]. CNNs can effectively leverage the invariances inherent in spectrograms through their convolutional and pooling operations [8].
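A minimal sketch of this audio-to-image conversion is shown below, assuming an STFT-based log-power spectrogram rendered with librosa and matplotlib; the window size, hop length, and output image settings are assumptions, not the values used in this study.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# "clip.wav" is a placeholder file name.
y, sr = librosa.load("clip.wav", sr=None)

# Short-time Fourier transform -> power spectrogram -> log (dB) scale.
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256)) ** 2
S_db = librosa.power_to_db(S, ref=np.max)

# Save the time-frequency image that a CNN would take as input.
plt.figure(figsize=(3, 3))
librosa.display.specshow(S_db, sr=sr, hop_length=256, x_axis="time", y_axis="log")
plt.axis("off")
plt.savefig("clip_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close()
```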
The purpose of this article is to compare deep learning techniques with typical machine learning models trained on handcrafted features, to determine whether deeply learned attributes are adequate for gender categorization tasks. According to our experimental data, deep neural networks (DNNs) and convolutional neural networks (CNNs) are the most accurate classifiers and feature extractors for speech gender detection. For audio categorization, [9] demonstrated the performance of prominent CNN architectures such as AlexNet, VGG, Inception, and ResNet; their method decomposed the audio time series with a short-time Fourier transform to generate a spectrogram that was fed to the CNN. The issue with many of these models is that they are huge and have many learnable parameters, while we concentrated on a database with only a few thousand samples; it seems improbable that such vast networks could be trained from scratch on our sparse data. Transfer learning is one way to circumvent this: a pretrained network is used, the weights of most layers are frozen, and only the last few layers are retrained on our audio data. This is one of the directions we took in this work. The performance achieved by fine-tuning various pretrained CNNs (ResNet34 and ResNet50 trained on ImageNet) proved optimal for our gender audio categorization problem.
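The transfer-learning recipe described above can be sketched as follows; the choice of which layers to unfreeze, the learning rate, and the use of torchvision are illustrative assumptions, since the exact fine-tuning hyperparameters are not listed here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet50.
model = models.resnet50(pretrained=True)

# Freeze all pretrained weights ...
for p in model.parameters():
    p.requires_grad = False

# ... then unfreeze only the last residual stage so it can adapt to spectrograms.
for p in model.layer4.parameters():
    p.requires_grad = True

# Replace the 1000-class ImageNet head with a 2-class (male/female) head.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the unfrozen parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```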
The following are the significant contributions made to the community:
(1) Presented a study of the impact of a set of handcrafted voice features as candidate inputs for gender classification algorithms
(2) Contributed to our understanding of the extent to which the choice of voice signal features aids the development of machine learning models
(3) Compared the performance of various classical machine learning models and a DNN trained on handcrafted voice features
(4) Studied the performance of fine-tuning pretrained ResNet34 and ResNet50 models on audio spectrograms, to determine whether spectrograms provide sufficient detail for accurate gender audio classification
(5) Presented research on the effectiveness of the speech classifiers on various corpus datasets
The rest of the paper is organized as follows: Section 2 reviews previous work in the domain; Section 3 details the methodology, covering the dataset, preprocessing techniques, feature extraction, model architectures, and evaluation metrics; Sections 4 and 5 present the results and their analysis; Section 6 provides the discussion; and Section 7 concludes, followed by the list of references.
2. Literature Review
Numerous studies have been undertaken to determine the effectiveness of voice classifiers in increasing the precision of deployed applications. [10] recognized the speaker's gender on the TIDIGITS database with an accuracy of 98.65% using a two-level classifier (pitch frequency and a GMM classifier). [11] analyzed voices from the IViE corpus using four classifiers, GMM, multilayer perceptron (MLP), vector quantization (VQ), and learning vector quantization (LVQ), and reported an accuracy of 96.4%. [12] fused the acoustic voice scores produced by five distinct approaches into a single score level; on the aGender dataset, this achieved an 81.7% success rate for the gender category. [13] developed a method to recognize speakers centered on a fusion score of seven subsystems employing MFCC, PLP, and prosodic feature vectors with three distinct classifiers: GMM, SVM, and GMM-SV-based SVM. The classification success rate for gender identification was 90.4% on the aGender dataset. [14] applied two classifiers to a private dataset to determine voice gender, SVM and decision tree (DT), using the MFCC feature; the overall accuracy of gender categorization using MFCC-SVM and MFCC-DT was 93.16% and 91.45%, respectively. [15] developed a method to enhance the MFCC features and then modify the weighting between the DNN layers. These enhanced MFCC attributes were assessed using DNN and i-vector classifiers, which achieved overall accuracy rates of 58.98% and 56.13%, respectively. [16] examined two classification approaches (DNN and SVM) for robust sound classification using single and combined feature vectors; the findings indicated that the DNN strategy outperformed the others because of its robustness and low sensitivity to noise.
[17] predicted age and gender using deep neural networks. Lately, raw-waveform processing has become a trend in the speech field, and it has been shown that using raw audio waveforms improves voice recognition effectiveness [18]. Voice activity detection [19] and speaker verification [20] have also shown considerable performance gains when raw-audio-based neural networks are used. [19] attempts to decipher and explain how CNNs categorize audio information: CNNs are trained on both spectrogram and raw audio inputs, and a layer-wise relevance propagation (LRP) technique is employed to examine how the systems select features and make decisions. They highlighted the distinct patches of the input signal that connect most strongly to each output label. The paper's findings show that spectrogram inputs result in greater accuracy than raw audio data.
Some works have used gender recognition as an auxiliary factor. [1] used speech recordings to build an emotion recognition model; the suggested approach includes an R-CNN and a gender-information block and improves accuracy by 5.6%, 7.3%, and 1.5% in Mandarin, English, and German, respectively, compared with the existing highest-accuracy algorithms. [21] also illustrates the importance of gender and linguistic variables in vocal expression classification. They found that higher-energy emotions such as anger, joy, and surprise were easier to discern in male voices speaking a harsher-sounding language such as German, whereas disgust and fear were easier to discern in female voices in every language. They also found that signal amplitude and energy are critical when analyzing emotion across gender and language.
3. Materials and Methods
3.1. Dataset
Common Voice is a corpus of speech data [20] read by users of the Common Voice website (http://voice.mozilla.org/), based on text from various public-domain sources, including user-submitted blog posts, old books, movies, and other publicly available speech corpora. Its main goal is to support the development and testing of automatic speech recognition (ASR) software. The data collection contains 864,448 MP3 audio files. A .tsv file listing the filename, sentence, accent, age, gender, locale, upvotes, and downvotes was also included in the dataset; the audio clips are referenced by filename in this .tsv file. The .tsv file was filtered by deleting rows with missing gender, age, or accent attributes and keeping only rows whose "downvotes" column equals 0. Furthermore, the gender labels were limited to "male" and "female." As a result, the filtered dataset contains 394,818 rows with gender and age labels, drawn from diverse geographic areas, each with its own linguistic dialects and accents.
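A minimal sketch of this filtering step with pandas is given below; the file name and the exact column names are assumptions, since Common Voice releases have used slightly different metadata schemas.

```python
import pandas as pd

# "validated.tsv" is a placeholder; column names follow one Common Voice release.
df = pd.read_csv("validated.tsv", sep="\t")

df = df.dropna(subset=["gender", "age", "accent"])  # drop rows missing any of these
df = df[df["down_votes"] == 0]                      # keep only clips with zero downvotes
df = df[df["gender"].isin(["male", "female"])]      # restrict gender labels

# Keep only the columns needed for the classification task.
df[["path", "gender"]].to_csv("filtered_labels.tsv", sep="\t", index=False)
```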
We used only a subset of this filtered dataset for the current work, keeping the number of audio files in the female and male categories nearly equal to avoid bias (Figure 1). We reduced the .tsv file to two columns for further processing, filename and gender. The resulting audio collection contained 6,995 male and 5,662 female audio files. All audio files were converted to the WAV format for frequency-spectrum analysis [22]. Figures 2 and 3 show the time-domain representation of the female and male audio waveforms, respectively. The chromagrams of the male and female audio categories are depicted in Figures 4 and 5, which show how the energies of the twelve pitch classes differ between the gender groups.
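The MP3-to-WAV conversion and the chromagram views of Figures 4 and 5 could be produced along these lines; pydub and librosa are assumptions here, as the tools used for this step are not named, and the file names are placeholders.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
from pydub import AudioSegment

# Convert one clip from MP3 to WAV.
AudioSegment.from_mp3("common_voice_sample.mp3").export("sample.wav", format="wav")

# Compute and plot the 12-pitch-class chromagram of the converted clip.
y, sr = librosa.load("sample.wav", sr=None)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

librosa.display.specshow(chroma, sr=sr, x_axis="time", y_axis="chroma")
plt.title("Chromagram (12 pitch classes)")
plt.tight_layout()
plt.show()
```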
The study of the features suggests that even the Mel spectrogram images contain discriminating information for gender classification. Although the ResNet50 model is more complex and took more training time, it shows an impressive increase in accuracy compared with the other models trained on handcrafted features. Overall, the architectures with more convolutional layers obtained better results than all of the handcrafted approaches on this dataset.
5.3. Performance on Other Datasets
Based on the results of the aforementioned experiments, it is clear that ResNet50 is the most appropriate model for gender categorization on the Mozilla dataset. Although a model trained this way should perform well on test samples drawn from the same parent dataset used for training, it should also perform well on additional datasets gathered in a variety of contexts. For this reason, we conducted a performance study on two separate datasets, the Saarbruecken Voice Database (SVD) [31] and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [32], in order to better understand the generalization potential of the ResNet50 model we constructed. The findings of the experiment are depicted in Table 6 and Figure 13.
Table 6
Comparison of the performance of the developed ResNet50 on the SVD and RAVDESS datasets.
Dataset | Accuracy (%) | F1-score (%) | Precision (%) | Recall (%)
SVD | 72.74 | 77.79 | 95.48 | 65.63
RAVDESS | 91.5 | 90.89 | 97.91 | 84.81
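The metrics in Table 6 follow their standard definitions; the sketch below shows how they can be computed with scikit-learn, assuming binary string labels and treating "female" as the positive class (an assumption, since the table does not state which class the precision and recall refer to).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred, pos_label="female"):
    """Return the four metrics of Table 6 as percentages."""
    return {
        "accuracy":  100 * accuracy_score(y_true, y_pred),
        "f1":        100 * f1_score(y_true, y_pred, pos_label=pos_label),
        "precision": 100 * precision_score(y_true, y_pred, pos_label=pos_label),
        "recall":    100 * recall_score(y_true, y_pred, pos_label=pos_label),
    }
```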
6. Discussion
Paralinguistic analysis, which includes tasks such as age and gender detection, is a rapidly growing research topic. Speech is a time-varying signal generated by the vocal tract, and the shape of the vocal tract differs significantly between males and females; thus, automatic gender identification from voice has a broad range of applications. An interactive voice response (IVR) system may use the age and gender of the speaker to route the speaker to an appropriate consultant [33] or to play background music suited to the speaker's gender/age group [34]. The variance in pitch across genders is relatively large, making it difficult to categorize the voice into different classes (e.g., emotions and pathologies) [1]. Hence, when used in conjunction with speaker identification, precise gender classification considerably lowers uncertainty in such classification: by narrowing the search space to a class of a particular gender, the classification algorithm becomes more accurate and the classes' mutual influence is reduced. Compared with gender-neutral systems, automatic speech recognition (ASR) systems using gender-specific models achieve better recognition rates [35].
We chose the Common Voice corpus for multilingual speech research for a variety of reasons, not least its immense size. We expected that results from this dataset would be applicable to a wide variety of real-world applications, as it comprises a large number of languages and the amount of data per language varies from extremely small to relatively large [20].
The first set of classification studies concerns the classification of the two gender audio classes using extracted features. Following [36], we used five frequency-domain features to compare the performance of classical machine learning models and the newly designed DNN model. The findings indicate that, among these models, the proposed DNN achieved the highest accuracy (Table 3) and that the model converges to an optimal weight value for the gender classification problem (Figure 9).
In the second set of experiments, we trained CNNs on voice spectrograms, since the human ear similarly perceives sounds in terms of varying frequencies across time [37], and a two-dimensional representation of the speech signal is an appropriate input for CNN models in speech analysis [38]. This experiment aimed to determine whether the deep learning approach is appropriate for gender classification: if the CNN identifies the gender classes effectively, we can bypass the bottleneck of manual feature extraction. According to Table 3, the pretrained ResNet50 exhibits the highest accuracy.
When the standard speech features are compared with the deep features extracted in this study, the latter perform nearly 2% better at gender recognition; the deeply learned features are therefore more appropriate for gender representation than the manually extracted ones. Additionally, recognition accuracy is higher when the number of layers is larger (Table 3). However, as the number of layers increases, the amount of computation grows and the duration of classification increases [40]. As a result, we need a trade-off between accuracy and model size. Because the primary objective of this project is accuracy, ResNet50 clearly outperforms the other custom and traditional models; however, if we were training models for, or deploying them on, resource-constrained platforms with some tolerance for accuracy loss, the model choice might be different.
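To make the accuracy-versus-size trade-off concrete, the snippet below counts the learnable parameters of the standard torchvision ResNet34 and ResNet50 definitions (not the fine-tuned models of this work).

```python
from torchvision import models

for name, ctor in [("resnet34", models.resnet34), ("resnet50", models.resnet50)]:
    m = ctor(pretrained=False)  # architecture only; no weights needed for counting
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```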
6.1. Comparative Analysis
There have been numerous attempts to classify gender on the Mozilla Common Voice dataset. [41] transformed the audio samples into 20 statistical features, trained several machine learning models, and found that CatBoost outperformed all other predictive models with a test accuracy of 96.4%. [42] used a neural network to test the effectiveness of MFCC and Mel spectrogram features and found that, using a combination of MFCC and Mel spectrograms, the proposed model achieved 94.32% accuracy. Following up on that work, the same authors [43] proposed a 1-D convolutional neural network (CNN) model to recognize gender using several features extracted from speech signals, such as MFCC, Mel spectrogram, and Chroma; by merging the MFCC, Mel, and Chroma feature sets, the 1-D CNN model achieved a higher accuracy of 97.8%. However, our proposed model surpasses these earlier investigations (Table 7).
Table 7
Comparison of the performance of the proposed model with other works.
Work | Accuracy (%) | Recall (%) | Precision (%) | F1-score (%)
[42] | 94.32 | | |
[41] | 96.4 | 96.4 | 96.4 | 96.4
[43] | 97.8 | | |
Ours | 98.57 | 99.02 | 98.47 | 98.74
7. Conclusions
Recognizing gender from the human voice is regarded as a difficult task and is important for various applications. For this study, we used Mozilla's "Common Voice" database, an open-source, multilanguage collection of voices with information about the speaker's gender. This article examined gender detection from voice using various machine learning methods. The study found that fine-tuning simple pretrained ImageNet models on audio spectrograms results in state-of-the-art performance on the Mozilla dataset, as well as acceptable performance on the SVD and RAVDESS datasets. We observe that the pretrained models retain much of their prior knowledge during fine-tuning, particularly in the early layers; only the network's intermediate layers are significantly altered to adapt the model to the audio categorization task. Additionally, we found that CNN models that learn deep features from the energy distributions in spectrograms outperform handcrafted feature extraction methods. The gender discrimination results are also rather good, with an accuracy of 98.57%, which is close to the best documented in the literature.
Acknowledgments
The authors extend their appreciation to the Researchers Supporting Project number (RSP-2021/314), King Saud University, Riyadh, Saudi Arabia.
[1] T.-W. Sun, "End-to-end speech emotion recognition with gender information," IEEE Access, vol. 8, pp. 152423-152438, DOI: 10.1109/ACCESS.2020.3017462, 2020.
[2] P. Rao, "Audio signal processing," Speech, Audio, Image and Biomedical Signal Processing Using Neural Networks, pp. 169-189, 2008.
[3] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," 2014. https://arxiv.org/abs/1412.5567
[4] Y. Han, J. Kim, K. Lee, "Deep convolutional neural networks for predominant instrument recognition in polyphonic music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25 no. 1, pp. 208-221, 2017.
[5] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, "Deep speech 2: end-to-end speech recognition in English and Mandarin," International Conference on Machine Learning, pp. 173-182.
[6] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. https://arxiv.org/abs/1409.1556
[7] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, vol. 6, pp. 770-778, DOI: 10.1109/CVPR.2016.90, 2016.
[8] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, X. Serra, "Timbre analysis of music audio signals with convolutional neural networks," 2017 25th European Signal Processing Conference (EUSIPCO), pp. 2744-2748, DOI: 10.23919/EUSIPCO.2017.8081710.
[9] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, "CNN architectures for large-scale audio classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131-135, DOI: 10.1109/ICASSP.2017.7952132.
[10] Y. Hu, D. Wu, A. Nucci, "Pitch-based gender identification with two-stage classification," Security and Communication Networks, vol. 5, DOI: 10.1002/sec.308, 2012.
[11] R. Djemili, H. Bourouba, M. C. A. Korba, "A speech signal based gender identification system using four classifiers," 2012 International Conference on Multimedia Computing and Systems, pp. 184-187, DOI: 10.1109/ICMCS.2012.6320122.
[12] M. Li, K. J. Han, S. Narayanan, "Automatic speaker age and gender recognition using acoustic and prosodic level information fusion," Computer Speech & Language, vol. 27 no. 1, pp. 151-167, DOI: 10.1016/j.csl.2012.01.008, 2013.
[13] E. Yücesoy, V. V. Nabiyev, "A new approach with score-level fusion for the classification of a speaker age and gender," Computers and Electrical Engineering, vol. 53, pp. 29-39, DOI: 10.1016/j.compeleceng.2016.06.002, 2016.
[14] M.-W. Lee, K.-C. Kwak, "Performance comparison of gender and age group recognition for human-robot interaction," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 3 no. 12, DOI: 10.14569/IJACSA.2012.031234, 2012.
[15] Z. Qawaqneh, A. A. Mallouh, B. D. Barkana, "Deep neural network framework and transformed MFCCs for speaker's age and gender classification," Knowledge-Based Systems, vol. 115, DOI: 10.1016/j.knosys.2016.10.008, 2017.
[16] R. V. Sharan, T. J. Moir, "Robust acoustic event classification using deep neural networks," Information Sciences, vol. 396, pp. 24-32, DOI: 10.1016/j.ins.2017.02.013, 2017.
[17] E. Ramdinmawii, V. K. Mittal, "Gender identification from speech signal by examining the speech production characteristics," 2016 International Conference on Signal Processing and Communication (ICSC), pp. 244-249, DOI: 10.1109/ICSPCom.2016.7980584.
[18] D. Palaz, M. Magimai-Doss, R. Collobert, "End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition," Speech Communication, vol. 108, pp. 15-32, DOI: 10.1016/j.specom.2019.01.004, 2019.
[19] S. Becker, M. Ackermann, S. Lapuschkin, K.-R. Müller, W. Samek, "Interpreting and explaining deep neural networks for classification of audio signals," 2018. https://arxiv.org/abs/1807.03418
[20] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, G. Weber, "Common Voice: a massively-multilingual speech corpus," 2019. https://arxiv.org/abs/1912.06670
[21] Z. Dair, R. Donovan, R. O'Reilly, "Linguistic and gender variation in speech emotion recognition using spectral features," 2021. https://arxiv.org/abs/2112.09596
[22] S. A. Fulop, Speech Spectrum Analysis, DOI: 10.1007/978-3-642-17478-0, 2011.
[23] D. Gabor, "Theory of communication. Part 1: the analysis of information," The journal of the Institution of Electrical Engineers. Radio and communication engineering, vol. 93 no. 26, pp. 429-441, DOI: 10.1049/ji-3-2.1946.0074, 1946.
[24] K.-I. Kanatani, Group-Theoretical Methods in Image Understanding, vol. 20, 2012.
[25] S. S. Stevens, J. Volkmann, "The relation of pitch to frequency: a revised scale," The American Journal of Psychology, vol. 53 no. 3, pp. 329-353, DOI: 10.2307/1417526, 1940.
[26] T. Qiao, S. Zhang, Z. Zhang, S. Cao, S. Xu, "Sub-spectrogram segmentation for environmental sound classification via convolutional recurrent neural network and score level fusion," 2019 IEEE International Workshop on Signal Processing Systems (SiPS), pp. 318-323, DOI: 10.1109/SiPS47522.2019.9020418.
[27] S. Davis, P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, vol. 28 no. 4, pp. 357-366, DOI: 10.1109/TASSP.1980.1163420, 1980.
[28] W. H. Abdulla, N. K. Kasabov, D.-N. Zealand, "Improving speech recognition performance through gender separation," Changes, vol. 9, 2001.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15 no. 1, pp. 1929-1958, 2014.
[30] P. Harar, J. B. Alonso-Hernandezy, J. Mekyska, Z. Galaz, R. Burget, Z. Smekal, "Voice pathology detection using deep learning: a preliminary study," 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI), DOI: 10.1109/IWOBI.2017.7985525, 2017.
[31] W. J. Barry, M. Pützer, Saarbrücken Voice Database, 2022. http://www.stimmdatenbank.coli.uni-saarland.de/
[32] S. R. Livingstone, F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PloS one, vol. 13 no. 5, article e0196391, 2018.
[33] R. Zazo, P. S. Nidadavolu, N. Chen, J. Gonzalez-Rodriguez, N. Dehak, "Age estimation in short speech utterances based on LSTM recurrent neural networks," IEEE Access, vol. 6, pp. 22524-22530, DOI: 10.1109/ACCESS.2018.2816163, 2018.
[34] D. Mahmoodi, H. Marvi, M. Taghizadeh, A. Soleimani, F. Razzazi, M. Mahmoodi, "Age estimation based on speech features and support vector machine," pp. 60-64.
[35] M. Abdollahi, E. Valavi, H. A. Noubari, "Voice-based gender identification via multiresolution frame classification of spectro-temporal maps," 2009 International Joint Conference on Neural Networks, DOI: 10.1109/IJCNN.2009.5178984.
[36] R. S. Alkhawaldeh, "DGR: gender recognition of human speech using one-dimensional conventional neural network," Scientific Programming, vol. 2019, DOI: 10.1155/2019/7213717, 2019.
[37] W. Hou, Y. Dong, B. Zhuang, L. Yang, J. Shi, T. Shinozaki, "Large-scale end-to-end multilingual speech recognition and language identification with multi-task learning," Babel, vol. 37 no. 4k, 2020.
[38] A. Tursunov, J. Y. Choeh, S. Kwon, "Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms," Sensors, vol. 21 no. 17, DOI: 10.3390/s21175892, 2021.
[39] K. Palanisamy, D. Singhania, A. Yao, "Rethinking CNN models for audio classification," 2020. https://arxiv.org/abs/2007.11154
[40] F. Li, M. Liu, Y. Zhao, L. Kong, L. Dong, X. Liu, M. Hui, "Feature extraction and classification of heart sound using 1D convolutional neural networks," EURASIP Journal on Advances in Signal Processing, vol. 2019, 2019.
[41] S. R. Zaman, D. Sadekeen, M. A. Alfaz, R. Shahriyar, "One source to detect them all: gender, age, and emotion detection from voice," 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 338-343, DOI: 10.1109/COMPSAC51774.2021.00055.
[42] K. Chachadi, S. R. Nirmala, "Voice-based gender recognition using neural network," Information and Communication Technology for Competitive Strategies (ICTCS 2022), pp. 741-749, 2022.
[43] K. Chachadi, S. R. Nirmala, "Gender recognition from speech signal using 1-D CNN," Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, pp. 349-360.
Copyright © 2022 Abeer Ali Alnuaim et al. This work is licensed under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).
Abstract
Several speaker recognition algorithms fail to achieve the best results because of the wildly varying datasets and feature sets used for classification. Gender information helps reduce this effort, since categorizing the classes by gender can lessen the impact of gender variability on the retrieved features. This study attempted to construct an accurate classification model for language-independent gender identification using the Common Voice dataset (Mozilla). Most previous studies manually extract characteristics and feed them into a machine learning model for categorization. Deep neural networks (DNNs) were the most effective strategy in our research; nonetheless, the main goal was to exploit the wealth of information contained in voice data without requiring significant manual intervention. We trained the deep learning network to select the essential information from speech spectrograms for the classification layer performing gender detection. The pretrained ResNet50, fine-tuned on the gender data, achieved an accuracy of 98.57%, better than the traditional ML approaches and the previous works reported on the same dataset. Furthermore, the model performs well on additional datasets, demonstrating the approach's generalization capacity.
Details
1 Department of Computer Science and Engineering, College of Applied Studies and Community Services, King Saud University, P.O. Box 22459, Riyadh 11495, Saudi Arabia
2 College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia
3 Department of Commerce and Management, Seshadripuram College, Seshadripuram, Bengaluru 20, India
4 Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia
5 Department of Computer Science and Informatics, School of Engineering and Computer Science, Oakland University, Rochester Hills, MI, 318 Meadow Brook Rd, Rochester, MI 48309, USA
6 Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur, 522502 Andhra Pradesh, India
7 Gedu College of Business Studies, Royal University of Bhutan, Bhutan