Abstract
Text sentiment analysis is a way of extracting data and transforming it into meaningful sentiment information. In this research study, we extract Urdu text data related to medicine and convert it into a useful format that can be used to create an application. Electronic media quickly provides a large amount of information in any language, but it is unstructured and raw, making even easily available data difficult to understand. Urdu is among the most widely spoken languages in Asian countries, where a majority prefer it. The sole distinction between the Urdu and Hindi languages is their writing script; the Roman scripts of both languages are comparable. Because the application to be built requires only medical-related datasets retrieved from external sources, i.e., websites, newspapers, blogs, and other physical resources, pre-processing, feature engineering, and other appropriate techniques are applied to the Urdu dataset to extract clean data that can readily be used to train multiple machine learning models.
Introduction
Text sentiment analysis plays a pivotal role in extracting meaningful insights from data, particularly in the domain of healthcare. In this research study, we delve into the realm of Urdu text sentiment analysis, focusing specifically on medicine. We aim to harness the vast reservoir of Urdu-language data available in electronic media and transform it into a format conducive to developing a medical assistant application.
The proliferation of electronic media has exponentially increased the accessibility of information in various languages, including Urdu. However, much of this data exists in an unstructured and raw format, posing significant challenges to comprehension and analysis. In healthcare, where accurate interpretation of textual data is paramount, the need for effective sentiment analysis tools becomes apparent.
Urdu is one of the most widely spoken languages in Asian countries, where a considerable portion of the population prefers it as their primary means of communication. Despite its linguistic similarity to Hindi, Urdu possesses distinct cultural nuances and expressions that necessitate specialized analysis tools. Furthermore, the prevalence of Urdu in medical literature, patient feedback, and public health discourse underscores the importance of developing robust sentiment analysis techniques tailored to this language.
We employ a multifaceted approach encompassing pre-processing, feature engineering, and machine learning methodologies to address the complexities inherent in Urdu text sentiment analysis. Through meticulous pre-processing techniques such as tokenization, stemming, and stop-word removal, we refine the raw data, ensuring its suitability for subsequent analysis.
Feature engineering is crucial in extracting relevant features from the Urdu text corpus, enabling effective sentiment classification. Techniques such as bag-of-words, n-grams, and word embeddings facilitate the extraction of semantic meaning and contextual information from the textual data.
When sourcing our dataset, we prioritize medical-related content from reputable sources such as websites, medical journals, and patient forums. By focusing on data sources rich in medical terminology and domain-specific knowledge, we ensure the dataset’s relevance and reliability for training our sentiment analysis model.
The application of these techniques is particularly pertinent in developing a medical assistant chatbot, which relies on accurate sentiment analysis to comprehend and respond to user queries effectively. By leveraging machine learning models trained on meticulously curated Urdu text data, we aim to enhance the chatbot's ability to interpret and address medical inquiries precisely and with nuance.
Urdu is the national language of Pakistan and has a worldwide presence. It is also an official language in six Indian states, where it is widely spoken. There are approximately 12 million native speakers in Pakistan, and estimates of the worldwide Urdu-speaking population across the Indo-Pakistani region and beyond range from over 100 million to more than 300 million. Urdu is the official language of Pakistan, one of India's officially recognized languages, and the 21st most spoken language in the world (The Urdu Language n.d.).
Nowadays, user-generated media such as discussion forums, blogs, online reviews, and other web content are of tremendous importance in opinion mining. Researchers are attempting to assess the power of social media in data science. Data from social networking sites is extremely unstructured text, and data scientists are now working to provide a plethora of data services and tools to organize and evaluate the hidden information in social data. This requirement fuels current academic research into language processing and sentiment analysis (Haenlein and Kaplan 2010). The practice of studying review sites on the internet to discover the overall opinion or feeling about a product is known as sentiment analysis of reviews. Reviews are examples of so-called user-generated content, which is gaining popularity and serving as a valuable resource for marketing teams, sociologists, psychologists, and others interested in thoughts, views, public mood, and general or personal attitudes (Tang et al. 2009). Because of the large variety and volume of social media data, it is difficult for individuals or corporations to track the newest trends and summarize the state of general opinion about items. This creates the need for automated, real-time opinion extraction and mining. Determining the sentiment of an opinion is a difficult challenge owing to its inherent subjectivity: it reflects what individuals believe.
Sentiment analysis is viewed as a classification job since it categorizes the orientation of a text as positive or negative. In addition to lexicon-based and linguistic methodologies, machine learning is a commonly utilized approach to sentiment categorization (Thelwall et al. 2011). Users want to find information quickly and efficiently in enormous data volumes. As a result, text classification is critical and has several applications, such as identifying text genre, filtering news based on user interest, and determining whether an email is spam. There has been rapid growth in Urdu text documents such as journals, web databases, and online news articles. Much work remains to be done on the Urdu language in this field because it is rich in context. Most research is done in English, although other languages, i.e., Chinese, Arabic, and Indonesian, are now being used as a medium of research. However, the increasing size of the Urdu corpus is making matters difficult for researchers, and current implementations are limited to simple text, leaving a significant gap when handling large datasets. Text classification is widely used in applications such as text filtering, document organization, news story classification, and targeted web content search, but these are language-specific systems mostly intended for English, with little work done for the Urdu language. Constructing classification systems for Urdu texts is therefore a difficult undertaking due to the language's morphological complexity and the shortage of resources such as automatic tools for tokenization, feature selection, and stemming. This is the reason most algorithms that produce accurate results on English text tend to achieve lower accuracy in Urdu text analysis (Ali and Ijaz 2009; Javed et al. 2021).
The major and predominant focus of this study is to create a public framework to generate a textual summary of Urdu-language text. The proposed system's goal is to emphasize the importance of health in life, reach out to people, and encourage them to follow health-maintenance measures by making the chatbot available to all. Chatbots and healthcare have a history of working effectively together: a chatbot generates an excellent human-like conversational environment for interaction between the user and the technology. This system allows users to chat about their health, which is a terrific approach to maintaining a healthy lifestyle. To that end, appropriate Urdu literature is gathered and made available for instructional purposes.
Literature review
A systematic method for analyzing emotional polarity using the VSM model and SVM/ELM classifiers is presented in the paper (Zhang and Zheng 2016). Although successful, the study’s primary emphasis is on Chinese text, which may limit its applicability to other languages. The significance of preprocessing in sentiment analysis pipelines is emphasized by the focus on feature selection, segmentation, and cleansing.
This research offered a collection of machine learning approaches with semantic analysis for classifying sentences and product evaluations based on Twitter data (Table 1). The main goal is to assess a huge number of reviews using a labeled Twitter dataset. The naive Bayes approach combined with the unigram model produces better results than maximum entropy and SVM, and better results than naive Bayes used alone. When WordNet-based semantic analysis is added on top of this approach, accuracy increases from 88.2% to 89.9%. The training dataset may be expanded to enhance the feature-vector-based sentence detection process, as well as WordNet for review summarization; this could provide better visualization of the material, which would benefit consumers (Pang et al. 2002).
Table 1. Related work on medical assistant chatbot Urdu text sentiment analysis
| No | Research title | Proposed work | Reference |
|---|---|---|---|
| 1 | SACPC: A framework based on probabilistic linguistic terms for short text sentiment analysis | This study proposed Word2PLTS, a novel text representation model, and then described the SACPC short text sentiment analysis framework, which employs SVMs. The system extracts word sentiment information from SentiWordNet (SWN) and a word frequency distribution dictionary | (Song et al. 2020) |
| 2 | Image–text sentiment analysis via deep multimodal attentive fusion | This study discussed how a hybrid fusion framework can exploit visual and semantic attention mechanisms for image–text sentiment analysis. A conclusion is reached by applying a late fusion strategy to the outputs of the three models | (Huang et al. 2019) |
| 3 | Text sentiment analysis based on long short-term memory | This research proposed an enhanced RNN language model using LSTM, which effectively covers all historical sequence information and outperforms the traditional RNN | (Li and Qian 2016) |
| 4 | Research on text sentiment analysis based on CNNs and SVM | Noting that sentiment analysis is a popular topic in natural language processing research, this study combines the benefits of CNNs and SVM to build the described text sentiment analysis model | (Chen and Zhang 2018) |
| 5 | A fuzzy convolutional neural network for text sentiment analysis | This study examined the advantages of deep learning, fuzzy modeling, and neural networks for text sentiment classification and produced a hybrid deep learning-based fuzzy-neural model, the fuzzy convolutional neural network (FCNN), which integrates fuzzy logic into CNNs. Compared to conventional algorithms such as CNN, FCNN may generate more plausible features that yield greater classification accuracy on emotional data | (Nguyen et al. 2018) |
| 6 | Sentiment Analysis System for Roman Urdu | This research proposes a Roman Urdu sentiment analysis system using three separate features (unigram, bigram, and uni-bigram) and five classifiers, namely NB, DT, SVM, LR, and KNN. The best-performing configurations were selected for further discussion and in-depth analysis | (Mehmood et al. 2018) |
| 7 | Lexicon-based sentiment analysis for the Urdu language | This research introduced a new paradigm for sentiment analysis of Urdu comments. The lexicon-based architecture operates by assigning polarity to Urdu sentence tokens. To test the efficiency of the proposed framework, experiments were run on a dataset of 124 Urdu comments from different Urdu websites; the architecture achieves 66% overall efficiency | (Rehman and Bajwa 2016) |
| 8 | Sentiment analysis of code-mixed Roman Urdu-English social media text using deep learning approaches | This paper used multilingual mBERT and XLM-RoBERTa models to analyze sentiment in code-mixed Roman Urdu and English social media content. According to the findings, XLM-RoBERTa provides greater accuracy and F1 score for informal, low-resource languages such as the code-mixed Roman Urdu and English text found on social media | (Younas et al. 2020) |
| 9 | Sentiment analysis of Roman Urdu on E-commerce reviews using machine learning | This paper built a database of Roman Urdu comments on numerous items. Experiments reveal that Roman Urdu sentiment analysis is a difficult problem: well-known models including LSTM, RNN, CNN, RCNN, and GRU achieved poor accuracy. The authors further improved their prior SVM model by integrating lexically normalized Roman Urdu terms, achieving an accuracy of 0.68 | (Chandio et al. 2022) |
| 10 | Exploring Twitter news biases using Urdu-based sentiment lexicon | This research examined news tweets from prominent Pakistani news outlets. The primary contribution is a mechanism for categorizing the text of Urdu news tweets as positive, negative, or neutral. A lexicon-based technique was used to create a comprehensive sentiment lexicon in Urdu, and a method for domain knowledge-based viewpoint analysis was also proposed | (Amjad et al. 2017) |
| 11 | Sentiment analysis of user reviews about hotels in Roman Urdu | In this research, Roman Urdu reviews from various hotel websites were crawled and recorded in an opinion database. The original corpus was used to create the testing and training datasets. The SVM method again produced improved results | (Nazir et al. 2020) |
| 12 | Roman-Urdu-Parl: Roman-Urdu and Urdu Parallel Corpus for Urdu language understanding | This paper presented Roman-Urdu-Parl, a large-scale Roman-Urdu and Urdu parallel corpus comprising 6.37 million sentences and 0.186 billion words. The authors argue that a large-scale dataset for a low-resource language like Urdu will accelerate research, open avenues for additional study in this field, and help push Urdu language research to the next level | (Alam and Hussain 2022) |
| 13 | Urdu sentiment analysis with deep learning methods | This research applied multiple machine and deep learning models and, after numerous tests based on two text representations (n-gram features and pre-trained word embeddings), achieved the greatest F1 score of 82.05% using LR with a mixture of features. The SVM classifier is the second-best performer for this task, with a better average performance than all other classifiers | (Khan et al. 2021) |
| 14 | Document-level text classification using single-layer multisize filters convolutional neural network | This research highlights how challenging it is to categorize lengthy Urdu texts compared to shorter ones: longer Urdu documents have a larger vocabulary, more noise, and redundant information, among other issues. Researchers have historically neglected Urdu because of its complex morphology, unique characteristics, and dearth of language resources; as a result, text classification in Urdu requires more effort and time than in other languages | (Akhter et al. 2020) |
| 15 | Neural responding machine | This research notes that raw datasets contain noise that impairs classifier performance and increases learning time, so they must be pre-processed before training a model. Pre-processing operations such as tokenizing text, removing non-language characters, removing stop words, and stemming are performed on raw data. Eliminating stop words is standard practice in Urdu text preparation; rare words, however, are not removed | (Shang et al. 2015) |
| 16 | Extractive text summarization models for Urdu language | This research elaborates on text summarization for the Urdu language. According to the authors, summarizing Urdu is a difficult undertaking due to the language's complexity in word type and morphology. Compared to similar languages like Arabic and Persian, there has been little peer-reviewed study of Urdu text summarization, and most Urdu NLP resources are not open source | (Nawaz et al. 2020) |
| 17 | Sequence to sequence model performance for education chatbot | This study investigates the Seq2Seq model for natural answer generation. In a conventional natural question-answering system, the algorithm is first trained on a dataset of question-and-answer pairs; the Seq2Seq model was observed to be effective at responding to such queries | (Palasundram et al. 2019) |
| 18 | A systematic study of Urdu language processing its tools and techniques: a review | This study provides a comprehensive review of Roman Urdu and Urdu, including previous works in this field, and discusses the grammatical structure of the Urdu language, pre-processing techniques, software tools, and databases used for Roman Urdu and Urdu | (Lal et al. n.d.) |
| 19 | Text sentiment analysis from students' comments using various machine learning and deep learning techniques: a comparative analysis | This paper proposes a text sentiment analysis model to compare the performance and accuracy of various machine learning and deep learning approaches. The deep learning techniques incorporated are recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU), which are compared with state-of-the-art machine learning algorithms | (Chhajro et al. 2023) |
The choice of a deep learning sequence-to-sequence model for our research stems from its proven efficacy in handling complex natural language processing tasks, particularly in sentiment analysis and language translation domains. Numerous studies have demonstrated the superiority of deep learning models over traditional machine learning approaches when dealing with unstructured text data.
Previous research findings
Extensive prior research has showcased the effectiveness of deep learning architectures, such as recurrent neural networks (RNNs) and their variants, like long short-term memory (LSTM) networks, in capturing sequential dependencies and extracting meaningful representations from textual data. Studies have shown remarkable performance of sequence-to-sequence models in tasks such as language translation, text summarization, and sentiment analysis.
Handling sequence data
The inherent sequential nature of language necessitates a model capable of understanding and generating text coherently. Unlike traditional machine learning approaches that treat input data as independent and identically distributed (i.i.d.), deep learning sequence-to-sequence models excel at capturing temporal dependencies and contextual information in sequential data. This capability is particularly advantageous in sentiment analysis tasks where the order of words and phrases significantly impacts the overall sentiment conveyed.
Flexibility and adaptability
Deep learning models offer high flexibility and adaptability, allowing them to learn complex patterns and representations directly from data without relying heavily on handcrafted features. This attribute is especially beneficial in scenarios where the underlying structure of the data is intricate and complex to capture using conventional feature engineering techniques.
State-of-the-art performance
Recent advancements in deep learning, coupled with the availability of large-scale datasets and computational resources, have propelled deep learning models to achieve state-of-the-art performance across various natural language processing tasks. By leveraging the expressive power of deep neural networks, our proposed sequence-to-sequence model aims to surpass existing benchmarks in Urdu text sentiment analysis, thereby advancing the frontier of sentiment analysis research in underrepresented languages.
In light of the compelling evidence from previous literature and the unique requirements of our research objectives, adopting a deep learning sequence-to-sequence model emerges as a natural choice. Through rigorous experimentation and validation, we seek to demonstrate the efficacy and robustness of our proposed model in accurately capturing and analyzing sentiment in Urdu text data.
The field of sentiment analysis, particularly in natural language processing (NLP), has witnessed significant advancements driven by machine learning and deep learning techniques. This section presents a comprehensive overview of existing research findings, emphasizing critical analysis and synthesis to identify gaps and opportunities for further investigation.
Explaining research findings and incorporating them into research initiatives
Emotional polarity analysis using machine learning
The paper presents a systematic method for analyzing emotional polarity using the VSM model and SVM/ELM classifiers (Zhang and Zheng 2016). Although successful, the study’s primary emphasis is on Chinese text, which may limit its applicability to other languages. The focus on feature selection, segmentation, and cleansing emphasizes the significance of preprocessing in sentiment analysis pipelines.
Semantic analysis for sentiment identification
Research efforts, exemplified by Pang et al. (2002), have focused on semantic analysis for identifying sentiments expressed in social media data, particularly on platforms like Twitter. Machine learning approaches, including the naive Bayes classifier, have demonstrated improved performance when combined with semantic analysis techniques such as WordNet. These studies underscore the significance of incorporating domain-specific knowledge to enhance sentiment analysis accuracy. This research (Rehman and Bajwa 2016) illuminates the difficulties associated with sentiment analysis in Urdu language processing by introducing a lexicon-based architecture for polarity assignment. Even though the model’s efficiency was 66%, more research into contextual subtleties could improve its performance.
Significant gaps in the literature
The study discussed in Zhang and Zheng (2016) primarily concerns the emotional polarity analysis of Chinese text. While it provides valuable insights into classification methodologies, the generalizability of its findings to other languages or text types remains to be clarified. While the SACPC framework discussed in Song et al. (2020) shows promising results in sentiment analysis, there is still room for further research into methods to improve classification performance. Li and Qian (2016) suggest improved RNN models for detecting the emotional characteristics of text, but questions remain regarding their performance across various text types and domains. Addressing these gaps through additional empirical and theoretical research could help advance sentiment analysis, especially in challenging contexts like low-resource languages and multimodal data analysis.
Text representation models for sentiment analysis
As discussed in Song et al. (2020), recent advancements introduce novel text representation models like Word2PLTS and sentiment analysis frameworks like SACPC, leveraging support vector machines (SVMs) for classification. These models emphasize extracting word sentiment information and aggregating text representations to improve sentiment analysis performance, particularly on short text data.
Multimodal fusion for sentiment analysis
Hybrid fusion frameworks integrate visual and textual modalities for sentiment analysis, such as the deep multimodal attentive fusion (DMAF) strategy proposed in Huang et al. (2019). By leveraging attention mechanisms and fusion-based models, these approaches demonstrate superior performance compared to traditional unimodal methods, particularly in analyzing sentiment in multimodal data sources.
Deep learning models for sentiment classification
Deep learning models, including recurrent neural networks (RNNs) with long short-term memory (LSTM) cells and convolutional neural networks (CNNs), have shown promising results in sentiment classification tasks, as highlighted in Li and Qian (2016) and Chen and Zhang (2018). Attention mechanisms and hybrid architectures, such as the fuzzy convolutional neural network (FCNN), improve classification accuracy by capturing nuanced features in emotional data.
Language-specific sentiment analysis
Studies focusing on language-specific sentiment analysis, such as the Roman Urdu Sentiment Analysis System proposed in Mehmood et al. (2018), emphasize the importance of feature selection and classifier optimization for accurate sentiment classification. Lexicon-based approaches and machine learning classifiers are utilized to analyze sentiment in languages with unique morphological characteristics, such as Urdu.
Challenges and opportunities in sentiment analysis
Despite advancements, challenges persist in sentiment analysis, particularly in low-resource languages like Urdu, as discussed in Akhter et al. (2020) and Nawaz et al. (2020). Limited access to annotated datasets, lack of NLP tools, and morphological complexity pose significant research hurdles. However, initiatives such as creating large-scale parallel corpora, as described in Alam and Hussain (2022), offer avenues for addressing these challenges and advancing research in language-specific sentiment analysis.
Methodology
The primary focus of this study is to produce a public framework to generate a textual summary of Urdu-language text. The objective of the proposed system is to show the importance of health in life, reach out to people, and encourage them to follow measures to maintain health by making the chatbot available to all. Chatbots and healthcare have a history of working well together: a chatbot creates an excellent human-like conversational environment for interaction between the user and the system. In this system, the user talks about their health, which is a great way for users to regulate a healthy lifestyle. For this purpose, related text in Urdu is obtained and made usable for training. This experiment evaluates whether the Seq2Seq model can also be implemented for Urdu, given the language's diversity and complexity. The model takes language as input and reads it in the form of numbers; this is accomplished through an encoding and decoding process. The encoder maps inputs into fixed-length vectors, and the decoder takes a fixed-length vector and converts it into a variable-length sentence using a beam search decoder. Figure 1 illustrates the process used in this paper.
Fig. 1 [Images not available. See PDF.]
Flowchart of the process
Data extraction
Since most human knowledge is not recorded in conversational datasets, developing entirely data-driven conversation models is a significant problem. Despite the enormous growth in such databases, Urdu still has far fewer records than other languages (Ghazvininejad et al. 2018). Medical Urdu datasets have been compiled from a variety of sources: data is extracted from news-related websites and through direct consultation with doctors, and their published articles, blogs, and posts on websites containing medical data are used appropriately. A large dataset is usually required to improve the classification accuracy of a statistical system, so many text words are gathered from online Urdu news sources (Sordoni et al. 2015).
Data pre-processing
Probabilistic analyzers almost always require the input dataset to be pre-processed in a specific way. Because this pre-processing is language-specific, general text mining algorithms must be adapted before they can be applied to an Urdu dataset.
Importing dataset
The whole corpus is curated manually from multiple resources. The Urdu dataset is extracted from different websites using extraction techniques, and the plain text is converted into a CSV file (Mahmood et al. 2020).
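A minimal sketch of this import step is shown below; the file name urdu_medical_corpus.csv and the text column are illustrative assumptions, not details from the study.

```python
# Sketch: load the curated Urdu corpus from a CSV file (names assumed).
import pandas as pd

df = pd.read_csv("urdu_medical_corpus.csv", encoding="utf-8")
texts = df["text"].dropna().tolist()  # raw Urdu documents for pre-processing
print(f"Loaded {len(texts)} documents")
```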
Punctuation removal
Converting all characters to lowercase, deleting punctuation marks, and removing stop words and typos make it possible to remove unwanted sections of the data, or noise. Eliminating noise is helpful when performing text analysis on data such as comments or tweets.
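As a sketch, noise removal might look like the following; the punctuation set mixes Urdu marks (۔ ، ؟ ؛) with ASCII punctuation and is illustrative, since the exact set used in the study is not given.

```python
import re

# Illustrative punctuation set: Urdu marks plus ASCII punctuation (assumed).
URDU_PUNCT = "۔،؟؛!\"#$%&'()*+,-./:;<=>?@[]^_`{|}~"

def remove_noise(text: str) -> str:
    text = text.lower()                       # affects embedded Latin script only
    text = re.sub(f"[{re.escape(URDU_PUNCT)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(remove_noise("یہ دوا مفید ہے۔"))  # -> یہ دوا مفید ہے
```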
Stop words removal
Stop words are linguistic function words that carry no relevance in the context of text analysis. They are removed from the vocabulary to reduce its size, using a list of the most frequently used words known as the stop word list. The list of stop words to be removed is shown in the original figure [image not available; see PDF].
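A sketch of this filtering step follows; the handful of stop words listed here (common Urdu function words) are stand-ins for the study's full list, which is not reproduced.

```python
# Illustrative Urdu stop-word list (assumed; the study's full list is larger).
URDU_STOP_WORDS = {"کا", "کی", "کے", "ہے", "اور", "سے", "میں", "یہ", "وہ"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in URDU_STOP_WORDS]

tokens = "یہ دوا بخار کے لیے مفید ہے".split()
print(remove_stop_words(tokens))  # ['دوا', 'بخار', 'لیے', 'مفید']
```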
Tokenization
Words are generated from the corpus by splitting on white space and punctuation marks. The lexicon also includes several words written without white space or punctuation marks, as well as some non-Urdu terms. To address this challenge, each word is checked against the tokenization vocabulary; if found, it becomes a token, otherwise it is removed.
Sentence tokenization
The technique of breaking text into individual sentences is known as sentence tokenization.
Word tokenization
We apply this approach to break a statement down into words. The result of word tokenization can be converted into a DataFrame for better text interpretation in machine learning applications. It can also serve as input for additional text-cleaning procedures such as punctuation and numeric character removal.
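The two tokenization steps can be sketched as follows, assuming the Urdu full stop "۔" as the sentence delimiter and whitespace as the word delimiter; production corpora would need more robust handling.

```python
import pandas as pd

def sentence_tokenize(text: str):
    # Urdu sentences conventionally end with the full stop "۔".
    return [s.strip() for s in text.split("۔") if s.strip()]

def word_tokenize(sentence: str):
    return sentence.split()  # simple whitespace split

doc = "مریض کو بخار ہے۔ اسے آرام کی ضرورت ہے۔"
rows = [{"sentence": s, "tokens": word_tokenize(s)} for s in sentence_tokenize(doc)]
print(pd.DataFrame(rows))  # tokens arranged as a DataFrame, as described above
```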
One hot encoding
It is called one-hot encoding because only one bit is "hot," or TRUE, at any time. One-hot encoding makes the representation of categorical data more expressive, which matters because many machine learning algorithms cannot work with categorical data directly. A visual representation accompanies the original article [image not available; see PDF].
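Since the original visual is unavailable, a minimal sketch over a toy Urdu vocabulary illustrates the idea:

```python
# One-hot encoding: exactly one position is "hot" (1) per word.
vocab = ["دوا", "بخار", "علاج"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("بخار"))  # [0, 1, 0]
```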
Count vectorizer
It is used to transform a given text into a vector based on the frequency (count) of each word occurring in the entire text. This is helpful when we have multiple such texts and wish to convert each word in each text into a vector.
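A sketch using scikit-learn's CountVectorizer on two toy Urdu sentences; a plain whitespace tokenizer is passed in because the default token pattern drops short tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["یہ دوا مفید ہے", "یہ دوا مفید نہیں ہے"]
vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None)
X = vectorizer.fit_transform(docs)           # document-term count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
```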
N-gram model
This model estimates probabilities over a dataset by counting how often word sequences occur, and it is useful in many applications, e.g., machine translation, speech recognition, and predictive text input. Shannon initially mentioned N-grams in the context of communication theory in 1948. N-grams are text sequences with a defined window size of N that are used to find important information in a corpus. In one study, a word-based N-gram was applied to a 37-million-word corpus from the poetry and prose domains (Pennington et al. 2014; Papineni et al. 2002). In this research paper, the N-gram model is used to calculate the probability distribution over word sequences.
Bigram
In the Bigram Language Model, we find bigrams, i.e., pairs of consecutive words occurring together in the corpus (the entire collection of words/sentences) (Ali and Ijaz 2009).
Trigram
In the Trigram Language Model, we find trigrams, i.e., sequences of three consecutive words occurring together in the corpus (the entire collection of words/sentences).
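A short sketch of bigram and trigram counting with plain Python; dividing these counts gives the conditional probabilities the N-gram model estimates.

```python
from collections import Counter

tokens = "یہ دوا بخار کے لیے مفید ہے".split()

def ngrams(seq, n):
    # Sliding window of size n over the token sequence.
    return list(zip(*(seq[i:] for i in range(n))))

bigrams = Counter(ngrams(tokens, 2))
trigrams = Counter(ngrams(tokens, 3))
# Bigram model: P(w2 | w1) ≈ count(w1, w2) / count(w1)
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```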
Tools and techniques
Keras
Keras is a high-level deep learning library developed in Python; it is lightweight and easy to learn and implement. It allows programmers to concentrate on the main ideas of deep learning, such as neural network layers. It requires TensorFlow as its backend and can be used in applications. Through its layer and model APIs, operating on TensorFlow tensors, it offers high-level building blocks for neural networks.
Tensorflow
TensorFlow was originally created to handle the huge numerical computations that underpin deep learning, and it has turned out to be extremely helpful for advancing the field. It receives data as multidimensional arrays and operates using data flow graphs with nodes and edges. Because the execution mechanism takes the form of graphs, TensorFlow offers a simple way to distribute execution across GPUs and clusters of computers.
Word embedding
Word embedding was introduced because computers could not comprehend text even with deep learning methods; computers can only read numbers as input, not words. To comprehend and create text, NLP-enabled computers must be able to recognize word syntax and a vast array of linguistic subtleties. Since computers only understand numbers, NLP uses a method known as word embedding to bridge this comprehension gap.
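As a sketch, a trainable Keras embedding layer maps word indices to dense vectors; the vocabulary size and dimensions below are illustrative, not the study's settings.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, embed_dim = 5000, 64      # illustrative values
embedding = Embedding(input_dim=vocab_size, output_dim=embed_dim)

word_ids = np.array([[12, 47, 3]])    # one sentence encoded as word indices
vectors = embedding(word_ids)         # dense vectors, shape (1, 3, 64)
print(vectors.shape)
```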
TF-IDF—term frequency-inverse document frequency
TF-IDF is a statistical method, widely used in machine learning (ML), for determining the significance of words in a text, where the text may consist of one or more corpora. It combines two measures: term frequency (TF) and inverse document frequency (IDF). The TF score is determined by the frequency of a word within a document, i.e., the number of times the word appears in it.
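A sketch with scikit-learn's TfidfVectorizer on the same toy sentences; words common to all documents receive lower weights than document-specific ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["یہ دوا مفید ہے", "یہ دوا مفید نہیں ہے"]
tfidf = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
X = tfidf.fit_transform(docs)        # TF-IDF-weighted document vectors
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))
```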
Word2Vec—capturing semantic information
This technique is used to solve advanced NLP problems. It iterates over a large text dataset to learn the dependencies and associations among words. It measures similarity with the cosine metric: if the angle between two word vectors is 90°, the words are completely independent, and as the angle approaches 0° (cosine similarity of 1), the words overlap in meaning. It assigns similar representations to similar words.
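A minimal gensim sketch on a toy Urdu corpus; the corpus, hyperparameters, and query word are illustrative assumptions.

```python
from gensim.models import Word2Vec

sentences = [
    "یہ دوا بخار کے لیے مفید ہے".split(),
    "بخار کا علاج آرام اور دوا ہے".split(),
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.most_similar("دوا", topn=3))  # cosine-nearest neighbours
```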
GloVe—global vectors for word representation
All unsupervised approaches for learning word representations primarily rely on statistics of word occurrences in a corpus, but despite the increasing use of these techniques, it is still unclear how meaning is derived from these statistics and how the resulting word vectors might represent that meaning (Chen and Zhang 2018). GloVe is an advanced approach that extends Word2Vec by capturing global contextual information through a global word–word co-occurrence matrix. It analyzes the whole corpus and generates a large matrix capturing the co-occurrence of terms within the corpus.
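The co-occurrence matrix at the heart of GloVe can be sketched as follows (the weighted least-squares fitting step that turns counts into vectors is omitted); the toy corpus and window size are assumptions.

```python
from collections import defaultdict

sentences = [["یہ", "دوا", "مفید", "ہے"], ["دوا", "کا", "اثر", "مفید", "ہے"]]
window = 2
cooc = defaultdict(float)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[(w, sent[j])] += 1.0 / abs(i - j)  # GloVe's 1/distance weighting
print(cooc[("دوا", "مفید")])
```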
BERT—bidirectional encoder representations from transformers
BERT belongs to the transformer subclass of NLP language models. It is a large, pre-trained, deeply bidirectional, encoder-based transformer model available in two variants. BERT uses an attention mechanism to create word embeddings, producing high-quality contextualized (context-aware) word embeddings. It is a more sophisticated method than any of those previously mentioned.
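A sketch of obtaining contextualized embeddings from a multilingual BERT checkpoint via Hugging Face transformers; bert-base-multilingual-cased (which covers Urdu) is one reasonable choice, not necessarily what a given system would use.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-multilingual-cased"   # multilingual checkpoint covering Urdu
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("یہ دوا مفید ہے", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768): one vector per token
```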
Sequence to sequence
Sequence-to-sequence is a special class of recurrent neural network architecture normally used to solve complex language problems such as machine translation, text summarization, and question answering in chatbots (Fig. 2).
Fig. 2 [Images not available. See PDF.]
Graphical representation of sequence-to-sequence model
Proposed deep learning model for chatbot
Model architecture: encoder-decoder framework
The proposed model leverages an encoder-decoder framework, a fundamental architecture in sequence-to-sequence learning. The encoder processes the input sequence (Urdu text) and generates a fixed-dimensional representation, capturing the contextual information embedded within the text. This representation is then passed to the decoder, which generates the output sequence (response) based on the encoded information and the context provided by previous decoder outputs. This architecture enables the chatbot to understand and respond to user queries in a coherent and contextually relevant manner.
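A minimal Keras sketch of this encoder-decoder framework is given below. All sizes are illustrative; the functional API is used here because the two-input encoder-decoder cannot be expressed as a single Sequential stack.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, embed_dim, units = 5000, 64, 128   # illustrative sizes

# Encoder: compresses the input sequence into fixed-size state vectors.
enc_in = Input(shape=(None,), dtype="int32")
enc_emb = Embedding(vocab_size, embed_dim)(enc_in)
_, h, c = LSTM(units, return_state=True)(enc_emb)

# Decoder: generates the response conditioned on the encoder states.
dec_in = Input(shape=(None,), dtype="int32")
dec_emb = Embedding(vocab_size, embed_dim)(dec_in)
dec_out = LSTM(units, return_sequences=True)(dec_emb, initial_state=[h, c])
logits = Dense(vocab_size, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```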
Training process
Training an Urdu dataset for natural language processing tasks poses unique challenges due to the scarcity of annotated data and linguistic complexities. Our approach addresses these challenges through a systematic training process outlined as follows:
Data collection and pre-processing
A small corpus of Urdu text data is collected from various sources, including online forums, medical websites, and social media platforms, to build the training dataset.
The collected data undergoes pre-processing using a tokenizer class, which removes unnecessary punctuation and redundant words and converts the text into a space-separated word format. This pre-processing step ensures that the data is clean and ready for model training.
Intent file and user interaction
An intent file is created to encapsulate human knowledge and user interactions, facilitating the chatbot’s understanding of user intents and tasks.
Python libraries are employed to parse and interpret the intent file, enabling the chatbot to comprehend user queries and respond accordingly.
Model training with deep learning and Keras
The preprocessed corpus is used to train the chatbot model using deep learning techniques implemented with the Keras framework.
A sequential model class in Keras is utilized to define the neural network architecture, comprising layers of encoder and decoder units.
The architecture, depicted in Fig. 2, illustrates the neural network configuration, including the embedding layer, recurrent layers (such as LSTM or GRU), and output layer.
During training, the model learns to map input sequences to output sequences, optimizing for accuracy and coherence in responses; a training sketch follows below.
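Continuing the architecture sketch above, training proceeds with teacher forcing; the random arrays below stand in for the real padded index sequences derived from the tokenized corpus, and the decoder target is the decoder input shifted by one position.

```python
import numpy as np

# Placeholder data standing in for the real tokenized, padded corpus.
encoder_input = np.random.randint(1, 5000, size=(32, 12))
decoder_input = np.random.randint(1, 5000, size=(32, 14))
decoder_target = np.roll(decoder_input, -1, axis=1)  # next-token targets

# `model` is the encoder-decoder defined in the earlier sketch.
model.fit([encoder_input, decoder_input], decoder_target,
          batch_size=8, epochs=2, validation_split=0.1)
```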
Model evaluation and user interaction
Upon completion of the training process, the chatbot model is evaluated using various performance metrics, including accuracy, fluency, and contextual relevance.
The chatbot is deployed for user interaction, allowing users to engage in conversations and assess the accuracy and effectiveness of the chatbot's responses in real-world scenarios.
Demonstration of chatbot performance
The deployment of the trained chatbot demonstrates its ability to accurately understand user queries and provide contextually relevant responses. Through user interactions, the efficacy of the chatbot in addressing medical inquiries and maintaining coherent conversations is evaluated, showcasing the practical utility and accuracy of the proposed model.
The vital step is training the model using the Sequential model class of Keras, in which the neural network architecture is defined. The architecture is shown in Fig. 3.
Fig. 3 [Images not available. See PDF.]
Graphical representation of architecture
After the process is completed, the chatbot is finally ready for user interaction. This will demonstrate how accurate the response is (Figs. 4 and 5).
Fig. 4 [Images not available. See PDF.]
Graphical representation of output obtained
Fig. 5 [Images not available. See PDF.]
Encoder and decoder architecture
Result and discussion
This research discusses the technological issues involved in constructing a medical chatbot. Professionals may use this evaluation to identify the most prevalent development strategies, narrow their alternatives, and choose the most suitable strategy for their proposed system.
The model performs well in contexts where inputs and outputs differ in size and characteristics. The sequence-to-sequence approach proved to be the core solution for such hurdles, and its utilization is the best fit for this application. Question-answer pairs can be matched in multiple ways, as the referenced example demonstrates [image not available; see PDF].
The available information indicates that the majority of existing chatbots converse with users in English. Efforts to design and construct chatbots in other languages such as Urdu, Chinese, and Arabic, which are spoken by a significant fraction of the world's population, need further development.
Conclusion and future work
This article presents the process undertaken to construct an Urdu text classifier using a sequence-to-sequence model. From the first stage of development, the data is tokenized, and pre-processing techniques such as stop-word removal and stemming are then applied to the tokenized data using various algorithms. We expect that the trained models will also perform well on the various types and sequences of sentences found in the Urdu dataset, and that this study will help produce creative solutions with greater accuracy. The classification of Urdu text has much room for improvement; for example, lemmatization could be used instead of stemming to improve text categorization results. As a research gap remains, considerable work and effort are still required.
Based on the research discussed, we offer some specific suggestions for addressing these challenges in future work.
Future research could concentrate on collecting more extensive and diverse datasets in Urdu, particularly in fields such as healthcare, where specialized terminology and context-specific language are common. This expansion would improve the robustness and generalization capability of natural language processing models for Urdu.
Researchers can investigate advanced training techniques, such as transfer learning and domain adaptation, to effectively use pre-trained language models. Fine-tuning models on domain-specific Urdu datasets can significantly improve the performance of natural language processing tasks such as text summarization and chatbot development.
Future research could look into integrating multimodal information, such as textual and visual data.
Declarations
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Akhter MP, Jiangbin Z, Naqvi IR, Abdelmajeed M, Mehmood A, Sadiq MT (2020) Document-level text classification using single-layer multisize filters convolutional neural network. IEEE Access 8:42689–42707. https://doi.org/10.1109/ACCESS.2020.2976744
Alam M, Hussain SU (2022) Roman-Urdu-Parl: Roman-Urdu and Urdu parallel corpus for Urdu language understanding. Trans Asian Low-Resour Lang Inf Process 21
Ali AR, Ijaz M (2009) Urdu text classification. In: Proceedings of the 6th International Conference on Frontiers of Information Technology (FIT '09). https://doi.org/10.1145/1838002.1838025
Amjad K, Ishtiaq M, Firdous S, Mehmood MA (2017) Exploring twitter news biases using Urdu-based sentiment lexicon. In: 2017 International Conference on Open Source Systems & Technologies (ICOSST). IEEE, pp 48–53
Chandio B, Shaikh A, Bakhtyar M, Alrizq M, Baber J, Sulaiman A, Rajab A, Noor W (2022) Sentiment analysis of Roman Urdu on e-commerce reviews using machine learning. CMES-Comput Model Eng Sci 131:1263–1287
Chen Y, Zhang Z (2018) Research on text sentiment analysis based on CNNs and SVM. In: 2018 13th IEEE Conference on Industrial Electronics and Applications (ICIEA). IEEE, pp 2731–2734
Chhajro MA et al (2023) Text sentiment analysis from students' comments using various machine learning and deep learning techniques: a comparative analysis. Int J Data Anal Tech Strateg 15
Ghazvininejad M et al (2018) A knowledge-grounded neural conversation model. In: 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), pp 5110–5117
Haenlein M, Kaplan AM (2010) An empirical analysis of attitudinal and behavioral reactions toward the abandonment of unprofitable customer relationships. J Relatsh Mark 9
Huang F, Zhang X, Zhao Z, Xu J, Li Z (2019) Image–text sentiment analysis via deep multimodal attentive fusion. Knowl-Based Syst 167:26–37. https://doi.org/10.1016/j.knosys.2019.01.019
Javed TA, Shahzad W, Arshad U (2021) Hierarchical text classification of Urdu news using deep neural network, ACM Trans. Asian Low-Resource Lang Inf Process 37(4) [Online]. https://www.researchgate.net/publication/353066845_Hierarchical_Text_Classification_of_Urdu_News_using_Deep_Neural_Network. Accessed 1/10/22
Khan L, Amjad A, Ashraf N, Chang H-T, Gelbukh A (2021) Urdu sentiment analysis with deep learning methods. IEEE Access 9:97803–97812. https://doi.org/10.1109/ACCESS.2021.3093078
Lal M, Kumar K, Wagan AA, Laghari AA, Khuhro MA, Saeed U, Umrani A, Chahjro MA (n.d.) A systematic study of Urdu language processing its tools and techniques: a review
Li D, Qian J (2016) Text sentiment analysis based on long short-term memory. In: 2016 First IEEE International Conference on Computer Communication and the Internet (ICCCI). IEEE, pp 471–475
Mahmood Z et al (2020) Deep sentiments in Roman Urdu text using recurrent convolutional neural network model. Inf Process Manag 57
Mehmood K, Essam D, Shafi K (2018) Sentiment analysis system for Roman Urdu. In: Science and Information Conference. Springer, Cham, pp 29–42
Nawaz A, Bakhtyar M, Baber J, Ullah I, Noor W, Basit A (2020) Extractive text summarization models for Urdu language. Inf Process Manag 57
Nazir MK, Ahmad M, Ahmad H, Abdul Qayum M, Shahid M, Habib MA (2020) Sentiment analysis of user reviews about hotel in Roman Urdu. In: 2020 14th International Conference on Open Source Systems and Technologies (ICOSST). IEEE, pp 1–5
Nguyen T-L, Kavuri S, Lee M (2018) A fuzzy convolutional neural network for text sentiment analysis. J Intell Fuzzy Syst 35
Palasundram K, Sharef NM, Nasharuddin NA, Kasmiran KA, Azman A (2019) Sequence to sequence model performance for education chatbot. Int J Emerg Technol Learn 14
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics, pp 79–86. https://doi.org/10.3115/1118693.1118704
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
Rehman ZU, Bajwa IS (2016) Lexicon-based sentiment analysis for Urdu language. In: 2016 Sixth International Conference on Innovative Computing Technology (INTECH). IEEE, pp 497–501
Shang L, Lu Z, Li H (2015) Neural responding machine for short-text conversation. In: Proceedings of ACL-IJCNLP 2015, pp 1577–1586
Song C, Wang X-K, Cheng P-F, Wang J-Q, Li L (2020) SACPC: a framework based on probabilistic linguistic terms for short text sentiment analysis. Knowl-Based Syst 194:105572. https://doi.org/10.1016/j.knosys.2020.105572
Sordoni A, Galley M, Auli M, Brockett C et al (2015) A neural network approach to context-sensitive generation of conversational responses. In: Proceedings of NAACL-HLT 2015, pp 196–205
Tang H, Tan S, Cheng X (2009) A survey on sentiment detection of reviews. Expert Syst Appl 36
The Urdu Language (n.d.) [Online]. https://www.researchgate.net/publication/353066845_Hierarchical_Text_Classification_of_Urdu_News_using_Deep_Neural_Network. Accessed 1/10/22
Thelwall M, Buckley K, Paltoglou G (2011) Sentiment in Twitter events. J Am Soc Inf Sci Technol 62
Younas A, Nasim R, Ali S, Wang G, Qi F (2020) Sentiment analysis of code-mixed roman Urdu-english social media text using deep learning approaches. In: 2020 IEEE 23rd International Conference on Computational Science and Engineering (CSE). IEEE, pp 66–71
Zhang X, Zheng X (2016) Comparison of text sentiment analysis based on machine learning. In: 15th International Symposium on Parallel and Distributed Computing
Copyright Springer Nature B.V. Dec 2024