Abstract
In a bilingual and linguistically diverse country like India, where a significant portion of the population is fluent in multiple languages, the conventional bilingual Transformer architecture struggles to accurately transcribe conversations that switch seamlessly between languages. In this paper, we propose a multilingual automatic speech recognition (ASR) system that understands intra-sentential code-switched terms and transcribes human speech into grammatically correct written text in English or any other language. The method therefore handles the conversion of Tanglish into Tamil or English well, and it is built on generative AI: a generative pre-trained transformer (GPT) model, which learns to predict the subsequent word in a sequence during pre-training and thereby acquires an understanding of language structure and semantics. A long short-term memory (LSTM) network plays a crucial role in the speech-to-text stage by capturing temporal dependencies, maintaining context, and producing accurate transcriptions from audio input. We experimented on 50 Tamil–English code-switched, agriculture-related utterances and found that the GPT-based system achieves an 84.37% relative accuracy rate for short sentences and a 73.98% relative accuracy rate for lengthy sentences in bilingual ASR.
Introduction
A bilingual automatic speech recognition (ASR) system powered by Generative Pre-Trained Transformer (GPT) offers groundbreaking capabilities in interpreting and transcribing speech across multiple languages seamlessly. By integrating GPT's robust natural language understanding with ASR technology, this system transcends linguistic barriers, enabling fluid communication and comprehension in diverse linguistic contexts [7]. At its core, GPT utilises advanced machine learning algorithms to comprehend and generate human-like text across various languages. Leveraging this, a bilingual ASR system harnesses the power of GPT to accurately transcribe spoken language into text, supporting multilingual environments effortlessly. This integration not only enhances the accuracy and adaptability of ASR but also expands its reach to serve linguistically diverse populations [6]. A pre-trained GPT model can be fine-tuned on bilingual speech data. This involves training the model on speech data with corresponding text transcripts in both languages. The model learns to recognise speech patterns and translate them into the appropriate text, even if it has not seen those specific words or phrases before. In practical terms, the bilingual ASR system operates by first converting speech signals into text using ASR techniques tailored to each language. Then, GPT processes this text, understanding its semantic nuances and translating it into the desired target language if necessary.
The main contributions of the paper are as follows:
The system leverages Generative Pre-Trained Transformer (GPT) models, which allow it to transcribe code-switched sentences with grammatical accuracy.
The paper experiments on Tamil–English bilingual agricultural data. The system shows promising results, with an 84.37% accuracy rate for short sentences and a 73.98% accuracy rate for longer sentences in bilingual ASR tasks.
The proposed method demonstrates that combining GPT models with LSTM enhances the performance of ASR systems in environments where code mixing and intra-sentential switching are frequent, thus improving transcription quality across mixed languages.
Related works
Al-Barhamtoshy et al. [1] aim to translate text with neural networks on an English–Arabic dataset. The suggested system uses an encoder–decoder model for this kind of machine translation. Representations of the source-language vectors are generated by a first sequence network, such as a CNN, and the target-language translation is produced by a second sequence network, such as an RNN, acting as the decoder.
Deng et al. [2] examined how multilingual automatic speech recognition (ASR) is affected by an end-to-end neural network's output representation, using machine learning and natural language processing techniques. The goal of this work is to build a system that can effectively transcribe speech in two languages from start to finish, with an emphasis on managing word units at the byte level for better performance in a variety of linguistic contexts. The authors investigate several representations, such as byte-level byte-pair encoding (BPE) and character-level representations.
Abushariah et al. [3] explored bilingual automatic speech recognition (ASR), a subset of ASR concerned with recognising multilingual speech. Handling the lexical and acoustic differences between languages makes it a difficult undertaking. There are two primary categories of bilingual ASR systems: code switching and code mixing. Switching between two or more languages within a single utterance is known as "code switching"; code mixing, on the other hand, is the practice of using terms or expressions from one language in the context of another.
Wei et al. [4] proposed a novel pre-training approach for direct speech-to-speech translation (S2ST) models that uses both speech and bilingual text data. The work investigates a training strategy that improves the performance of direct speech-to-speech translation systems by jointly pre-training on bilingual text and speech data. The suggested technique, named Speech2S, consists of three components: a speech encoder, a unit encoder, and a unit decoder [4]. The input speech signal is encoded into a sequence of hidden states by the speech encoder; these hidden states are then mapped to a sequence of discrete units by the unit encoder. Finally, the unit decoder produces the target speech signal from the unit sequence.
Yu et al. [5] addressed code-switching speech recognition, which concerns the blending of two or more languages in a single utterance. They employed a text-to-speech (TTS) method in which generated text is spoken by a multilingual text-to-speech system. Another strategy is cross-modality learning (CML), where text-only data are fed into the T-T model by linking the speech and text latent spaces together. The TTS-converted data are added to the paired speech–text training set, and a mean squared error (MSE) loss is one technique used to achieve this.
Du et al. [6] noted that the limited availability of code-switching data motivates the use of a code-switching end-to-end automatic speech recognition (ASR) model. The work investigates data augmentation techniques for code-switched speech in the context of end-to-end speech recognition systems. The method uses a Mandarin–English dictionary to transform monolingual text into code-switched text, and a text-to-speech (TTS) system then converts the generated text into speech.
Nedjah et al. [7] present a strategy for automatic speech recognition (ASR) of Portuguese phonemes using an ensemble of neural networks, titled "Automatic speech recognition of Portuguese phonemes using neural networks ensemble". The suggested system comprises two phases: pre-processing and classification. In the pre-processing phase, the speech signal is processed to extract pertinent features, such as the Mel-frequency cepstral coefficients (MFCCs), which characterise the signal's spectral properties. In the classification phase, the extracted features are classified by an ensemble of neural networks.
Gao et al. [8] proposed the SP-AED approach, which includes acoustic pre-training for the encoder, linguistic pre-training for the decoder, and adaptive combination fine-tuning for the whole system. Unpaired text data are used by the SP-AED approach for decoder pre-training: by substituting random noise for the actual acoustic representations, the decoder is pre-trained as a noise-conditioned language model. The wav2vec2.0 pre-training approach is used for the encoder pre-training; wav2vec2.0 is a mask-based technique in which the input acoustic features are masked and the encoder is pre-trained to reconstruct the input. Once all pre-training is complete, an adaptive combination fine-tuning approach is used to combine and fine-tune the encoder and decoder.
Yu et al. [9] proposed non-autoregressive ASR models that use a pre-trained language model (PLM) to predict every word in the utterance at the same time, processing the complete speech signal at once. Although this can be substantially quicker than autoregressive decoding, it can also be more difficult to teach the PLM to make precise predictions in the absence of word-order knowledge. The authors therefore suggest a two-stage training procedure for their non-autoregressive ASR model. The first stage pre-trains a PLM on a sizeable corpus of Chinese text data. In the second stage, a smaller corpus of Chinese speech data is used to improve the PLM's ability to predict the right word order, which is achieved with a non-autoregressive loss function.
Yang et al. [10] proposed "mutual information maximisation", the idea that learning entails maximising the mutual information between pertinent model components. Maximising mutual information, which quantifies the degree of dependence between two variables, is frequently used as an objective to increase the efficacy and efficiency of learning algorithms. The aim is to achieve more meaningful and efficient clustering in a deep generative framework. A drawback is sensitivity to hyperparameters such as the number of latent variables and the learning rate. The process of learning representations with deep generative models is referred to as "deep generative clustering".
Nazif Aydın and G. Ayhan Erden discuss the efficiency and structure of the Transformer-based GPT-3 model, one of the NLP technologies, reviewing the literature in terms of quantity and quality. In addition, the performance parameters of the model were verified through applications built on the beta version of the GPT-3 model [11].
Jun Liu et al. developed a computer-aided Japanese grammar database using NLP and machine learning to assist JLPT learners. It automatically detects sentence patterns, overcoming challenges in selecting examples. Experiments show the approach's effectiveness. The database enhances Japanese learning tools and improves language education through technology [12].
Sheng Li et al. introduced an end-to-end Chinese–Japanese ASR system using shared, smaller sub-character units. Decomposing Chinese characters and kanji enables unified representations, improving performance over monolingual models. This reduces the number of modelling units and proves effective for similar languages, enhancing bilingual speech recognition [13].
Lee Moa et al. introduced a non-autoregressive, fully parallel deep convolutional speech synthesis framework. Their approach uses a time-varying metatemplate (TVMT) to generate spectral features without autoregression, enhanced by interconnected decoders with iterative attention. This design accelerates synthesis while improving speech quality compared to traditional autoregressive models [14].
Rishabh Jain et al. proposed a multi-speaker TTS fine-tuning workflow using transfer learning based on a cleaned 19-hour child speech dataset. Evaluations showed high subjective scores: an MOS of 3.95 for intelligibility, 3.98 for naturalness, and 3.96 for consistency, alongside strong objective correlations via MOSNet. Speaker similarity was confirmed through cosine similarity, and an ASR model assessed WER differences. The final model can synthesise child-like speech from just 5-second reference clips [15].
Jingbei Li et al. proposed a joint multiscale cross-lingual speaking style transfer framework that models bidirectional style transfer at both the utterance and word levels using encoder–decoder and shared attention mechanisms. The system extracts and predicts global and local styles to synthesise speech with a multiscale-enhanced FastSpeech. Experiments show the approach outperforms baselines, achieving better objective and subjective results [16].
Ye-Qian Du et al. proposed a complementary joint training (CJT) method enhanced with label masking and parallel layers for re-training, termed CJT++. It was further extended to scenarios with no paired data via iterative CJT for seed ASR training. Experiments on Libri-Light confirmed the effectiveness of joint training and the second-round strategies, outperforming recent models, especially in low-resource settings [17].
Aarati H. Patil et al. developed a deep-learning-based translation system that employs multiple models to perform text-to-text translation, converting input text into the target language [18].
Chang Liu et al. proposed a multilingual speech synthesis approach that uses learned phonetic representations, unsupervised (UPR) via wav2vec 2.0 and supervised (SPR) from LI-ASR, to eliminate pronunciation dictionaries. An acoustic model combines UPRs and SPRs to generate mel-spectrograms, with components pre-trained on non-target languages to enhance performance. Experiments on six languages show it outperforms traditional methods, with pre-training further improving results [19].
Dheeraj K.N. et al. proposed a Python-based multilingual speech recognition and translation system using pygame, gTTS, the SpeechRecognition library, and a translation model, all accessed via a Streamlit web interface. Users select languages, with auto-detection available, and receive real-time recognised and translated speech feedback. The system continuously listens, recognises, translates, and vocalises speech, providing an intuitive and interactive experience. It handles errors gracefully, showcasing practical multilingual communication capabilities [20].
Seidu Agbor Abdul Rauf and Adebayo F. Adekoya reviewed research works on household appliance anomaly detection using machine learning, classifying techniques into ML-based, statistical, and physical approaches. The review analysed parameters such as algorithms, data sources, and metrics, finding that 81.2% of studies used a single data source, with ARIMA being the most common regression model. RMSE (35%) was the most used error metric, and 46% of detections relied on weather and energy data. The review highlights challenges and future research directions, both globally and locally [21].
Shereen A. Hussein et al. proposed a model that detects and classifies facial behaviours to assess mental health using Haar features from the FER+ dataset. A VGG network classifies faces as normal or abnormal and further predicts specific disorders such as depression or anxiety. This enables timely support, with the system achieving 95% accuracy in its predictions [22].
Odeyinka Abiola et al. use TextBlob and VADER to analyse emotional responses to COVID-19 in Nigeria based on historical tweets, highlighting its social, environmental, and economic impacts. The work provides valuable insights for researchers and students in data science, machine learning, and deep learning, and advances understanding of public sentiment during the pandemic [23].
Existing system
The existing systems relevant to this problem focus on code switching (CS) in speech, which is widespread in multilingual nations such as India and gives rise to hybrid languages like Tanglish (Tamil–English) and Hinglish (Hindi–English). The core problem is that little data are available for Indic CS speech recognition. The existing system discussed here uses neural network-based approaches to build a machine translation system that translates between two languages. After non-alphanumeric text is cleaned and removed using linguistic preprocessing, bilingual dictionaries are employed for the machine translation model. Encoder and decoder models are therefore involved: the first sequence network (a CNN) generates representations of the source-language vectors, and the second sequence network (an RNN) serves as the decoder that generates the translated text in the target language. The encoder encodes the input word sequence; at the end of the input stream, it passes its hidden states to the decoder, which generates the corresponding translated sequence using the highest probability. Such a bilingual ASR system may not work correctly all the time: accessing and annotating large amounts of diverse code-switched data remains a significant obstacle. Moreover, CNNs are primarily designed for spatial data and are less effective at capturing the temporal dependencies in speech that are crucial for accurate transcription in bilingual ASR, while training RNNs for bilingual ASR may suffer from vanishing or exploding gradients, particularly in deep architectures, which can hinder convergence and degrade performance. As a result, the accuracy of the existing model is low. Figure 1 shows the architecture of the neural encoder and decoder.
[See PDF for image]
Fig. 1
Neural encoder/decoder architecture
Proposed system
The proposed method is to develop a bilingual automatic speech recognition system that converts human speech into written text in English or any other language without grammatical errors and understands all intra-sentential words; in this way the system effectively converts Tanglish into English or Tamil. It is built from two models, an LSTM and a GPT model. The LSTM processes the input step by step and captures the nuances of spoken language, including dependencies between phonemes and words. The GPT model leverages its strength in predicting the next likely word in a sequence by analysing the recognised speech and probable word continuations, and it sources its training data from a vast and diverse multilingual internet corpus. At its core, the system utilises automatic speech recognition (ASR) techniques to transcribe spoken input into text, leveraging state-of-the-art deep neural network architectures trained on large-scale corpora to achieve high accuracy and robustness. The transcribed text is then passed to the GPT model, which generates the response text. Using voice commands, users may speak Tanglish sentences into the interface and receive Tanglish text back. To construct the proposed system, large-scale datasets of spoken Tanglish recordings and responses are used for data collection, annotation, and model training. To maximise performance and generalisation, pre-trained LSTM and GPT models are fine-tuned on domain-specific data during this training procedure. Thorough testing and validation are used to analyse the system's performance, combining quantitative metrics that gauge translation correctness, fluency, and responsiveness with qualitative evaluations. Through iterative refinement and optimisation, the proposed bilingual ASR system aims to reach state-of-the-art performance and to serve the varied communication needs of Tanglish speakers in a variety of situations, including agriculture. Ultimately, an effective implementation of the proposed system has the potential to promote greater diversity, accessibility, and comprehension among Tanglish speakers.
Algorithm steps
Bilingual automatic speech recognition (ASR)
The system receives Tanglish speech as input through voice from the user.
The raw audio signal captured from the user's voice undergoes signal processing, which may include steps such as Mel-frequency cepstral coefficient (MFCC) extraction, to enhance its quality and prepare it for further analysis.
The processed audio signal is converted into text using automatic speech recognition (ASR) technology. This converts the spoken words into a textual representation that can be understood and processed by the system.
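As a concrete illustration of this MFCC-based front end, the sketch below uses the librosa library; the file name tanglish_query.wav, the 16 kHz sampling rate, and the mean/variance normalisation step are illustrative assumptions rather than details taken from the paper.

```python
# Minimal MFCC front-end sketch (assumes librosa is installed and that a
# hypothetical 16 kHz mono recording "tanglish_query.wav" exists).
import librosa
import numpy as np

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Load an audio file and return its MFCC matrix of shape (n_mfcc, frames)."""
    y, sr = librosa.load(path, sr=sr)                       # resample to 16 kHz mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # spectral features
    # Per-coefficient mean/variance normalisation, a common ASR pre-processing step
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
    return mfcc

features = extract_mfcc("tanglish_query.wav")
print(features.shape)   # e.g. (13, number_of_frames)
```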
Long short-term memory network
The LSTM remembers past conversations.
It breaks down the user's current question word by word. It looks for connections between the current question and past conversations. These connections might be based on keywords, topics, or even emotions expressed earlier.
By understanding these connections, the LSTM builds a picture of the overall conversation flow and what the user might be referring to in their current question.
An LSTM can be used in the encoder part. It can analyse user input, capturing long-term dependencies within the conversation history to understand the context of the current query.
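The sketch below shows what such an LSTM context encoder could look like in PyTorch; the vocabulary size and layer dimensions are illustrative assumptions, not values reported in the paper.

```python
# Sketch of an LSTM encoder over the transcribed conversation (PyTorch assumed;
# vocabulary size and hidden dimensions are illustrative placeholders).
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size=8000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) indices of the transcribed query words
        x = self.embed(token_ids)
        outputs, (h_n, c_n) = self.lstm(x)
        # h_n summarises the conversation context for the response generator
        return outputs, h_n

encoder = LSTMEncoder()
dummy = torch.randint(0, 8000, (1, 12))   # one 12-token utterance
outputs, context = encoder(dummy)
print(outputs.shape, context.shape)       # (1, 12, 512) and (1, 1, 512)
```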
Generative pre-trained transformer model
The converted text serves as input to the GPT (Generative Pre-trained Transformer) model. The text undergoes preprocessing, including tokenisation, positional encoding, and feeding it through the transformer encoder layers of the GPT model.
The input text is tokenised into individual tokens (words or subwords) and converted into dense vector representations known as embeddings. Each token is assigned a unique numerical identifier.
In addition to token embeddings, positional encodings are added to each token to provide information about its position in the sequence.
The token embeddings, along with their positional encodings, are passed through multiple layers of Transformer encoder blocks. These blocks process the input tokens and capture their contextual information and dependencies.
The GPT model generates text autoregressively, predicting one token at a time based on the previous tokens and the input context.
Based on the context extracted by the LSTM, a GPT can be used in the decoder part. It can then generate creative and coherent responses that are relevant to the conversation flow.
Once the entire sequence of tokens has been generated, the tokens are decoded into human-readable text using the model's vocabulary.
The decoded text represents the response or output generated by the model based on the input text provided during inference.
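To illustrate this tokenise, encode, generate, and decode loop, the sketch below uses the public GPT-2 checkpoint from the Hugging Face transformers library as a stand-in, since the paper's fine-tuned Tanglish model is not publicly released; the prompt is one of the queries from Table 1.

```python
# Illustrative autoregressive generation with a GPT-style model (GPT-2 used
# here only as a publicly available stand-in for the fine-tuned model).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Indoor plants ah epdi maintain pannanum?"           # transcribed query
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # tokenisation step

# Predict one token at a time, conditioned on the prompt and previous tokens
output_ids = model.generate(input_ids, max_new_tokens=40, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)

# Decode the generated token ids back into human-readable text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```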
User interface design
The user begins by creating an account, providing necessary information such as a username and password.
After creating the account, the user logs in using their credentials (username and password).
Then the user can speak in Tanglish to ask agriculture-related questions, and the system will display the response to the user's questions in Tanglish text.
Data collection and model training
Gather and annotate textual and spoken Tanglish recording datasets.
Train deep learning models on the carefully chosen datasets to maximise efficiency and generalisation potential.
Improve system performance by fine-tuning pre-trained models using domain-specific data.
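A hedged sketch of this domain-specific fine-tuning step is given below, again with GPT-2 as a stand-in; the file tanglish_agri.txt, the hyperparameters, and the use of the (deprecated but still available) TextDataset helper are illustrative assumptions, not the paper's actual training setup.

```python
# Illustrative causal-LM fine-tuning on a small domain text file
# (file name, epochs and batch size are placeholders, not the paper's values).
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

train_ds = TextDataset(tokenizer=tokenizer, file_path="tanglish_agri.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(output_dir="gpt2-tanglish", num_train_epochs=3,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, data_collator=collator, train_dataset=train_ds).train()
```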
Evaluation and validation
Assess system performance by use of thorough testing and validation procedures.
To guarantee translation correctness, fluency, and cultural appropriateness, conduct qualitative evaluations.
To evaluate translation accuracy, fluency, and responsiveness, compute quantitative metrics.
Refinement and optimisation
Refine and optimise the system iteratively in response to evaluation input.
Incorporate user input to improve efficacy and usability.
Update and enhance the system often to satisfy the various communication requirements of those who speak Tanglish.
System architecture
Figure 2 shows the architecture of the proposed system.
[See PDF for image]
Fig. 2
Proposed system’s architecture
Methodology
Module 1: speech to text
The purpose of the speech-to-text conversion module is to convert spoken words into written language. The audio input is first received by the module, usually from a microphone or recorded files. Pre-processing is then applied to improve the quality of the audio signal. After that, feature extraction methods such as Mel-frequency cepstral coefficients are used to represent the audio in an analysis-ready form. Transformers and other machine learning models translate these features into text by using patterns learned from large amounts of training data. Post-processing improves the result by fixing mistakes and ensuring grammatical accuracy. Upon completion, the spoken words are rendered as text, which may be incorporated into various applications such as voice assistants and transcription services.
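As a minimal, concrete version of this module, the sketch below uses the open-source SpeechRecognition package with Google's web recogniser as a stand-in front end; the file query.wav and the ta-IN language code are illustrative assumptions, and the paper's own acoustic model is not part of this sketch.

```python
# Stand-in speech-to-text call for Module 1 (SpeechRecognition package assumed;
# "query.wav" is a hypothetical recording of a Tanglish question).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("query.wav") as source:
    audio = recognizer.record(source)        # read the whole file

try:
    # Tamil language code; code-switched English words are usually still returned
    text = recognizer.recognize_google(audio, language="ta-IN")
    print("Transcribed:", text)
except sr.UnknownValueError:
    print("Speech was not intelligible")
```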
Module 2: text analysis for context understanding
An LSTM can be used in the encoder part. It can analyse user input, capturing long-term dependencies within the conversation history to understand the context of the current query. The Generative Pre-Trained Transformer (GPT) model receives the transcribed text as input. Preprocessing of the text includes tokenisation, positional encoding, and feeding it through the GPT model's Transformer encoder layers. The input text is tokenised, breaking it up into individual words or subwords, and then transformed into embeddings, i.e. dense vector representations; every token is assigned a unique numerical identifier. To provide information about each token's position in the sequence, positional encodings are added to the token embeddings. The token embeddings, together with their positional encodings, are passed through several Transformer encoder blocks, which process the input tokens and capture their dependencies and contextual information. The GPT model then generates text autoregressively, predicting one token at a time from the input context and the previous tokens. In the decoder section, a GPT can be employed based on the context that the LSTM has extracted, allowing it to produce imaginative and cohesive replies that are pertinent to the conversation flow. After the whole sequence of tokens has been generated, the tokens are decoded into human-readable text using the model's vocabulary.
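The positional encoding referred to above follows the standard sinusoidal Transformer formulation; the short NumPy sketch below reproduces it, with the sequence length and embedding size chosen purely for illustration.

```python
# Standard sinusoidal positional encoding: even dimensions use sine,
# odd dimensions use cosine (dimensions here are illustrative).
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                 # token positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                   # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dims
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dims
    return pe

print(positional_encoding(seq_len=10, d_model=16).shape)   # (10, 16)
```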
Module 3: response conveyance
The final stage of the text analysis pipeline is the Response Conveyance module, where responses retrieved from the internet using a transformer model are formatted and made comprehensible before being displayed to the user. This module is essential for guaranteeing that data obtained from the internet are correct and presented in an approachable way. The user is then supplied with the revised response via a web console. In situations where the initial input was voice-based, this might entail playing the text aloud to the user or displaying it immediately on the interface. By taking into account a variety of preferences and accessibility requirements, this multimodal approach guarantees a flexible and inclusive user experience.
Results and discussion
Loading and preparation of code-switched data
To train the proposed language model, we built a dataset of agricultural questions, gathered from the internet, that would be helpful to home gardeners. Our study team therefore began gathering CS data in Tamil and English. Due to resource limitations, we were only able to collect 50 CS questions pertaining to agriculture and their accompanying responses. We have finished recording the Tamil–English code-switched (CS) voice corpus; each participant read around five questions. Since readers of Tanglish favour this style, the CS text keeps English terms. We gathered CS text from various popular websites, news reports, and similar sources. Figure 3 shows the Mel spectrogram of a signal.
[See PDF for image]
Fig. 3
Mel spectrogram of a signal
The above figure shows the Mel-frequency representation used for feature extraction on the dataset.
Evaluation result
The life cycle of the proposed GPT-based system consists of (1) a training phase, (2) a testing phase, and (3) an evaluation phase. We used the bilingual dataset and split it into 90% for training and 10% for testing to measure performance and loss improvement during training. Table 1 shows sample inputs and outputs of the system.
Table 1. Evaluation of training accuracy
| Tanglish transcribed text | Tanglish response text |
|---|---|
| Indoor plants ah epdi maintain pannanum? | Indoor plants vera level-a maintain pannanum nu ninaikathinga. Just light source, water and manure mattum kudutha podhum |
| Solam entha sand la grow aagum? | Solam clay loam illana loamy sand la grow aagum |
| Ethana types of urea iruku? | Prilled and granules iruku. Athu ammonia and carbon dioxide pola reaction ah produces pannum |
| Indian household la valarkura maadri indoor plants na ethu best? | Money plant oru nalla option. En na athuku sunlight um vendam and water um occasional ah potukalaam |
| Jade plant valartha nallathu nadakkuma? | Yes. Athoda leaves la coins shape la irukurathu naala, atha namba veetula valartha nalla fortune kudukum nu makkal believe panranga |
| Chedi oda leaves la yellow ah maarura reason enna? | Mannoda moisture apram proper sunlight illana leaves colour change aagum |
We can see that the output results are accurate and understandable. For example, "Indoor plants ah epdi maintain pannanum?" was correctly answered with "Indoor plants vera level-a maintain pannanum nu ninaikathinga. Just light source, water and manure mattum kudutha podhum".
Word error rate (WER) is a common metric used to evaluate the performance of automatic speech recognition (ASR) systems. It measures the difference between the reference transcript (ground truth) and the hypothesis transcript generated by the ASR system.
$$\mathrm{WER} = \frac{S + D + I}{N} \tag{1}$$
where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of words in the reference transcript. Figure 4 shows a reference transcript and the corresponding hypothesis transcript. From these, the WER is calculated for each utterance, and the final score is the average of all WER values.
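For reference, a minimal implementation of Eq. (1) via word-level edit distance is shown below; the example strings are illustrative and are not taken from the corpus.

```python
# Word error rate via dynamic-programming edit distance (Eq. 1).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])    # substitution (0 if match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("ethana types of urea iruku", "ethana types of urea iruhu"))   # 0.2
```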
[See PDF for image]
Fig. 4
Reference transcript and the hypothesis transcript
Confusion matrix
Confusion matrices are used to assess the correctness of machine learning and deep learning models. Therefore, a confusion matrix is used to assess the correctness of the proposed system.
TP (True Positive): Number of data points correctly predicted as class X.
FP (False Positive): Number of data points incorrectly predicted as class X.
TN (True Negative): Number of data points correctly predicted as not-X.
FN (False Negative): Number of data points incorrectly predicted as not-X.
From Table 1, taking one transcribed example and assuming the ASR system classified the source language (Ethana) and target language (Prilled/granules) correctly, the confusion matrix for the given bilingual Tanglish data is shown in Fig. 5.
[See PDF for image]
Fig. 5
Confusion matrix for the Tanglish code-switched data
The source language ("Ethana types urea iruku?") was classified correctly (Ethana), resulting in a count of 1 in the top-left cell. "Iruku" might be mispronounced as "iruhu" in the source sentence. Since the target language options are "Prilled" and "granules", and the ASR did not classify it as either (or any other language), we leave those cells at 0. There are no other errors (both languages classified correctly). Table 2 shows the performance metrics of the proposed work.
Table 2. Performance metrics
| Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| 84.37 | 84.28 | 83.12 | 82.23 |
The following formulas are used to calculate accuracy, precision, and recall:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
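These quantities can be computed directly with scikit-learn, as sketched below; the label arrays are illustrative placeholders rather than the paper's actual evaluation data.

```python
# Confusion matrix and the metrics of Eqs. (2)-(4) plus F1 with scikit-learn
# (y_true / y_pred below are illustrative placeholders).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 0, 1, 0, 1, 1, 0]   # 1 = utterance handled correctly, 0 = not
y_pred = [1, 1, 0, 1, 1, 1, 0, 0]

print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```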
The F1-score reported in Table 2 is the harmonic mean of precision and recall.
Table 3 shows a comparison that outlines the key features and performance metrics of the proposed multilingual ASR system, which uses a Generative Pre-Trained Transformer and LSTM, against other conventional techniques.
Table 3. Comparison table of proposed system with other existing models
| Features | Proposed system | Conventional bilingual Transformer | Monolingual ASR systems | Hybrid ASR systems (HMM–DNN) |
|---|---|---|---|---|
| Language support | Multilingual (English, Tamil) | Bilingual | Single language | Potentially multilingual but complex |
| Handling code-switching | Effective | Limited | Not applicable | Limited |
| Generative AI utilisation | Yes (GPT for pre-training) | No | No | No |
| Algorithm used | LSTM for temporal dependencies | Transformer | Traditional models (e.g. HMM) | HMM combined with DNN |
| Training data | Tanglish agricultural data | Bilingual data | Monolingual data | Monolingual or limited bilingual data |
| Accuracy (short sentences) | 84.37% | 70–80% | 80–85% | 75–85% |
| Accuracy (long sentences) | 73.98% | 65–75% | 80–85% | 70–80% |
| Context understanding | Strong (due to LSTM) | Moderate | Weak | Moderate |
The graph in Fig. 6 shows the accuracy obtained on the bilingual code-switched data received as system responses.
[See PDF for image]
Fig. 6
Result% for Tanglish data
Figures 7 and 8 show the user interface: the user creates an account by providing a username and password and is then taken to the login page, where they log in to the system with those credentials.
[See PDF for image]
Fig. 7
Account creation
[See PDF for image]
Fig. 8
User login page
Research on bilingual automated recognition systems (BARS) for code-switched data is exploring the use of generative pre-trained transformer models, and the results are promising. Researchers train a powerful AI model on a large dataset of conversations containing code switching; this fine-tuned model is then integrated into the BARS. Discussions focus on how well the system handles different language combinations and the frequency of code switching within a conversation. The ultimate goal is to see whether this approach surpasses traditional BARS methods, improving speech recognition accuracy in our increasingly multilingual world.
Conclusion
In conclusion, enhancing bilingual ASR with a generative pre-trained transformer model on code-switched speech provides a real advancement for Tanglish speakers. The proposed system leverages the Generative Pre-trained Transformer (GPT) paradigm to address the challenges posed by multilingual conversations, particularly code switching. The system systematically integrates essential components, including data preparation, multilingual conversation fine-tuning, unsupervised pre-training, and targeted language discrimination techniques. It leverages the flexibility and contextual knowledge of the GPT model to handle a wide range of language inputs and provide coherent, context-sensitive responses. The inclusion of modules for integration, testing, deployment, and evaluation demonstrates how useful the recommended method is in real-world scenarios; the deployment module ensures seamless integration wherever multilingual conversation processing is essential to systems or applications. However, further research is necessary to address remaining challenges such as data scarcity, noise robustness, and multilingual pre-training. As research in this area continues, we can expect even more advanced models and techniques to emerge, paving the way for seamless and accurate communication in multilingual settings. We believe future research should focus on data augmentation techniques, robust model architectures, and multilingual pre-training approaches to address these challenges and unlock the full potential of this technology for real-world applications. The verdict on our proposed system is simple: by embracing the complexity of code switching and leveraging the power of these models, we can break down language barriers and promote cross-cultural understanding. As we move forward, continued research and collaboration are crucial to refine these technologies and foster a future where seamless communication transcends language boundaries.
Future work
Further develop techniques to improve the recognition accuracy of code-switched speech, including better modelling of language alternation patterns, handling of intra-sentential and inter-sentential code switching, and adapting to variations in code-switching behaviour across different speakers and contexts. Develop methods for personalised ASR that can adapt to individual speakers' speech patterns, accents, and vocabulary preferences. Design user-friendly interfaces and accessibility features to make the ASR system more usable and accessible to a diverse range of users. Research adaptive language models that dynamically adjust their parameters based on a user's speech patterns and vocabulary usage over time, leveraging reinforcement learning or online learning techniques. Explore techniques for privacy-preserving ASR, such as federated learning or differential privacy, to protect sensitive user data while still providing personalised speech recognition capabilities. Integrate emotion recognition capabilities into ASR systems to enhance user experience and enable applications such as sentiment analysis or personalised feedback based on emotional cues in speech.
Acknowledgements
We would like to thank the anonymous referees for their helpful guidance, which has improved the quality of this paper. We would also like to express our gratitude and sincere thanks to our guide for the valuable support and guidance in completing this paper. This work was supported by the Research and Development (R&D) experts at our college, Sri Manakula Vinayagar Engineering College, and by our mentor, an expert in deep learning in our department.
Author contributions
Dr. Puspita Dash helped in conceptualisation; Sruthi Babu contributed to methodology; Sruthi Babu and Logeswari Singaravel validated the study; Sruthi Babu and Devadarshini Balasubramanian helped in writing; Sruthi Babu, Logeswari Singaravel, and Devadarshini Balasubramanian contributed to writing—review and editing. All authors have read and agreed to the published version of the manuscript.
Funding
Not applicable.
Availability of data and materials
Not applicable.
Declarations
Ethics approval and consent to participate
Not applicable.
Informed consent
Not applicable.
Competing interests
The authors declared no conflict of interest.
Abbreviations
AI: Artificial intelligence
RNN: Recurrent neural networks
ASR: Automatic speech recognition
BARS: Bilingual automated recognition systems
S2ST: Speech-to-speech translation
TTS: Text-to-speech
CML: Cross-modality learning
MSE: Mean squared error
MFCC: Mel-frequency cepstral coefficients
BPE: Byte-pair encoding
SP-AED: Self-supervised pre-training for attention-based encoder–decoder
GPT: Generative pre-trained transformer
CS: Code switch
WER: Word error rate
TP: True positive
FP: False positive
TN: True negative
FN: False negative
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Al-Barhamtoshy HM, Metwalli ASQ (2023) Neural networks for bilingual machine translation model (IEEE Xplore 16 January 2023)
2. Deng L, Hsiao R, Ghoshal A (2022) Bilingual end-to-end ASR with Byte-level subwords (IEEE Xplore 27 April 2022)
3. Abushariah AAM, Ting H-N, Mustafa MBP, Khairuddin ASM, Tan T-P (2022) Bilingual automatic speech recognition: a review, taxonomy and open challenges (IEEE 01 November 2022)
4. Wei K, Zhou L, Zhang Z, Chen L, Liu S, He L, Li J, Wei F (2023) Joint pre-training with speech and bilingual text for direct speech to speech translation ( IEEE Xplore 05 May 2023)
5. Yu Y, Hu Y, Qian Y, Jin M, Liu L, Liu S, Shi Y, Qian Y, Lin, Zeng M (2023) Code-switching text generation and injection in Mandarin–English ASR (IEEE Xplore 05 May 2023)
6. Du C, Li H, Lu Y, Wang L, Qian Y (2021) Data augmentation for end-to end code-switching speech recognition (IEEE Xplore 25 March 2021)
7. Nedjah N, Bonilla AD, de Macedo Mourelle L (2023) Automatic speech recognition of Portuguese phonemes using neural networks ensemble (Elsevier 1 November 2023)
8. Gao C, Cheng G, Li T, Zhang P, Yan Y (2022) Self-supervised pre-training for attention-based encoder-decoder ASR model (ACM 04 May 2022)
9. Yu F-H, Chen K-Y, Lu K-H (2022) Non-autoregressive ASR modeling using pre-trained language models for Chinese speech recognition (ACM 11 April 2022)
10. Yang X, Yan J, Cheng Y, Zhang Y (2022) Learning deep generative clustering via mutual information maximization (IEEE Xplore Issue 4 January 2022)
11. Aydın N (2022) A research on the new generation artificial intelligence technology generative pretraining transformer 3 (IEEE Xplore 29 December 2022)
12. Liu J, Fang Y, Yu Z, Wu T (2022) Design and construction of a knowledge database for learning Japanese grammar using natural language processing and machine learning techniques (IEEE Xplore Issue 19 September 2022)
13. Li S, Li J, Liu Q, Gong Z (2023) An End-to-end Chinese and Japanese bilingual speech recognition systems with shared character decomposition (Springer 14 April 2023)
14. Lee M, Lee J, Chang J-H (2022) Non-autoregressive fully parallel deep convolutional neural speech synthesis (IEEE Xplore Issue 8 March 2022)
15. Jain R, Yiwere MY, Bigioi D, Corcoran P, Cucu H (2022) A text-to-speech pipeline, evaluation methodology, and initial fine-tuning results for child speech synthesis (IEEE Xplore 28 April 2022)
16. Li J, Li S, Chen P, Zhang L, Meng Y, Wu Z, Meng H (2023) Joint multiscale cross-lingual speaking style transfer with bidirectional attention mechanism for automatic dubbing (IEEE Xplore 10 November 2023)
17. Du Y-Q, Zhang J, Fang X, Wu M-H, Yang Z-W (2023) A semi-supervised complementary joint training approach for low-resource speech recognition (IEEE Xplore 11 September 2023)
18. Patil AH, Patil SS, Patil SM, Nagarhalli TP (2022) Real time machine translation system between Indian languages (IEEE Xplore 24 May 2022)
19. Liu C, Ling Z-H, Chen L-H (2023) Pronunciation dictionary-free multilingual speech synthesis using learned phonetic representations (IEEE Xplore 8 September 2023)
20. Dheeraj KN et al (2024) Multilingual speech transcription and translation system. Int J Adv Res Sci Commun Technol 4
21. Rauf SAA, Adekoya AF (2023) Systematic literature review of the techniques for household electrical appliance anomaly detections and knowledge extractions. J Electr Syst Inf Technol 10:22
22. Hussein SA et al (2023) Automated detection of human mental disorder. J Electr Syst Inf Technol 10:9
© The Author(s) 2025. This article is published under the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).