The World Health Organisation (WHO) has identified infectious diseases, particularly COVID-19, tuberculosis, malaria, and measles, as significant global health challenges in the past 5 years. The COVID-19 pandemic exposed critical limitations in traditional disease tracking systems, such as the lack of integrated data visualization, co-monitoring, and real-time analytics, leading to delayed and often ineffective public health responses. In this context, Big Data Analytics (BDA) offers significant potential for improving infectious disease mitigation through predictive modelling, mapping, tracking, and real-time monitoring. This study systematically reviews the role of BDA in monitoring and predicting epidemic and pandemic infections using the PRISMA methodology and quality appraisal techniques to provide comprehensive insights into its healthcare applications. From an initial pool of 846 articles from Scopus, PubMed, Science Direct, IEEE, ProQuest, and Springer, 30 high-quality studies were selected for in-depth analysis. The review identifies four key predictive models—epidemiological, time series, machine learning, and deep learning—and seven analytical techniques, including SIR, SEIR, regression analysis, random forest, support vector machines, auto-regressive methods, and deep learning. BDA supports infectious disease control by processing diverse healthcare data and leveraging technologies like IoT and social media to enhance diagnosis, clinical decision-making, and surveillance. However, a key limitation is predictive models’ limited reliability and generalizability in real-world settings, mainly due to low-quality, noisy, and incomplete data. For instance, during early COVID-19 phases, inconsistent case reporting hindered accurate forecasting and timely response efforts.
Introduction
Between 2019 and 2023, the global burden of infectious diseases remained high, with COVID-19 leading in total cases (over 776 million) and deaths (over 7 million) (John [24]), while tuberculosis continued to claim more than a million lives annually [68, 69]. Malaria showed a troubling rise, reaching 263 million cases and nearly 600,000 deaths in 2023 alone, largely driven by climate-related factors and resistance issues [68, 69]. Measles and dengue also surged due to lapses in vaccination and changing weather patterns, respectively [45]. These statistics indicate that infectious disease outbreaks have remained a persistent global issue in recent years. Therefore, timely detection and accurate diagnosis are essential for effective prevention and control.
The modern IT field has witnessed a revolution in “Big Data”. Through this approach, organizations gather, analyse, and retrieve information from an extensive range of data sources. The volume, velocity, and variety of this information exceed the processing capabilities of conventional software systems, which is why it is called “big data” [70]. Big data analytics (BDA) has caught the attention of scholars and professionals over the last decade. Recent studies have shown that, in several sectors, BDA plays a significant role in ensuring organisational success [57].
Advanced technologies such as network sensors, social media platforms, and the Internet of Things (IoT) have increased data generation, opening new opportunities and challenges for individuals, companies, and society at large. These massive data sets are rich in knowledge and information, and they can greatly improve many facets of society, including commerce, healthcare, and medicine [70].
By applying sophisticated algorithms and analytical tools to big data, organizations can discover patterns, correlations, and trends [15] in huge data sets, enabling them to make informed decisions and innovate through a data-driven approach. In healthcare, research has examined the contribution of big data to clinical treatments and interventions, to supporting the decision making of patients and medical staff, to improving diagnostic accuracy, and to managing the various procedures in the healthcare system [14], all for the purpose of enhancing health care.
During the recent COVID-19 pandemic, a huge volume of data was produced from diverse sources [6]. Extracting key information from extensive COVID-19 datasets is difficult without a structured and systematic approach. Therefore, this study reviews the role of Big Data Analytics (BDA) in monitoring epidemic and pandemic infections, with particular emphasis on its potential for infectious disease prediction. It aims to provide insights into BDA’s role in mitigating pandemic outbreaks by addressing the following research questions.
What is the current implementation of big data for infectious disease risk mitigation?
How has big data been used in healthcare, specifically in mitigating infectious disease risk?
Therefore, a systematic literature review (SLR) is conducted, as it helps to structure and organise a comprehensive literature review, using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. This approach ensures a structured and transparent method for identifying, evaluating, and synthesizing relevant research studies, ultimately providing a strong foundation for understanding the current state of knowledge in the field.
This paper is organised into four sections. The first section offers a conceptual setting by explaining why the issue is crucial and relevant. Next, the methodology section outlines the process of collecting and choosing relevant studies. The results and discussion section critically evaluates previous research, identifies essential topics, methods, and gaps in the current pool of data, and explains the research gaps and discussion points. Lastly, the conclusion section summarises the whole study and provides recommendations for future work.
Significance of study
In the scope of infectious disease monitoring and prevention, the integration of Big Data Analytics plays an important role in revolutionizing our approach towards public health. Firstly, it serves as a powerful tool to enhance our comprehension of the potential impact of infectious diseases through the comprehensive analysis of vast and diverse datasets [3]. By using analytical methods, patterns and trends can be identified to help predict when herd immunity might be at risk [2]. This leads to the second point, where the practical insights derived from Big Data Analytics contribute to the development of predictive models for forecasting. These models become invaluable in anticipating and preparing for outbreaks, enabling timely intervention and resource allocation.
Furthermore, Big Data Analytics offers strategies for real-time monitoring, ushering in a new era of dynamic disease surveillance [35, 72]. Surveillance systems such as those described in [29, 49] suggest integrating real-time data collection from multiple data streams to enable rapid response. Through continuous data streams, anomalies and deviations from the norm can be promptly identified, facilitating rapid response mechanisms to contain and mitigate the spread of infectious diseases.
In addition, the implications of Big Data Analytics go beyond the technical area, reaching policymakers and healthcare practitioners. Recommendations derived from data-driven insights can guide decision-makers in formulating effective strategies for risk mitigation [21, 22]. Policymakers can utilise this information to allocate resources efficiently, implement targeted interventions, and formulate evidence-based policies [22]. Healthcare practitioners, on the other hand, can benefit from the actionable insights to optimize patient care and resource utilization [35].
Pandemic risk mitigation
Mitigating risks is a crucial aspect of pandemic management, and different fields of authority need to work as a whole to ensure optimum preventive measures. Those efforts include vaccination campaigns, research and innovation, travel restriction and border control, surveillance, and early detection [25]. In the aspect of surveillance and early detection, there are three types of strategies: active surveillance, syndromic surveillance and laboratory surveillance [9].
Active surveillance involves continuous monitoring of disease patterns, outbreaks, and unusual events. Health authorities collect data from various sources, including hospitals, clinics, laboratories, and public health agencies. By analyzing this data, they can identify patterns that indicate a potential outbreak or pandemic. Early detection allows swift intervention and containment measures [26].
Syndromic surveillance focuses on tracking symptoms and patients’ behaviours in real-time. Instead of waiting for confirmed diagnoses, health systems collect data on patient symptoms. This can include fever, cough, shortness of breath, and other relevant signs. Syndromic surveillance complements laboratory-based approaches and provides a broader picture of disease activity [54].
Laboratory surveillance involves testing and analysing samples to detect novel pathogens. Laboratories receive samples (such as blood, swabs, or tissue) from patients with suspected infections. These samples then undergo various diagnostic tests (e.g., PCR, serology) to identify specific pathogens. Medical professionals then analyze the results of these tests to determine the appropriate treatment plan for the patient [39].
In addition to these traditional classifications of surveillance systems, it is also important to consider an alternative classification based on data source. This classification categorizes surveillance systems according to their reliance on sources such as clinical records, laboratory results, pharmacy sales, internet-based sources (e.g., search queries, social media), and environmental data [42].
In summary, these surveillance methods work together as a warning system, allowing public health authorities to respond swiftly and mitigate the impact of infectious diseases. This study will examine the use of big data in active surveillance.
Methodology
To give readers a synopsis of the contemporary perspective on the usage of big data, this study adheres to the Systematic Literature Review (SLR) method outlined by [46]. An SLR is a research process that examines the facts and conclusions of researchers in relation to predetermined questions [46]; its systematic approach ensures that the selected literature is identified accurately, thus giving a sound basis for this study [61]. This SLR begins by using the PICo approach to formulate the research questions: “P” for Problem or Population, “I” for Interest, and “Co” for Context [38]. Subsequently, the plan was made, and the document search was executed in three methodical stages: identification, screening, and eligibility. The research questions and the article selection procedure are described in the following subsections.
Finally, data extraction and analysis were two processes through which the chosen articles were handled. The main research question served as the guide for the data extraction method, and the extracted data was then analysed using thematic synthesis. To make sure the review methodology achieved the review’s goal, the authors, when appropriate, complied with the recommendations made in the review by considering substitutes. A full view of the methodology process is visualized in Fig. 1 and further explanation of the process is detailed in Sect. "Identification" and "Screening".
[See PDF for image]
Fig. 1
Methodology flowchart for SLR using PRISMA method from the total source of 846 related articles
Identification
Research questions formulation
Research questions were constructed by using the mnemonic of PICo. According to this guide, the authors decide the population involved is ‘healthcare’, the research is interested in ‘big data’ and ‘infectious diseases’ in the context of ‘risk mitigation’. Hence, the following research questions are established:
What are the current models in big data for infectious disease risk mitigation?
How has big data been used in healthcare, specifically in mitigating infectious disease risk?
Database collection
Search query
The record search was conducted over 3 weeks in January 2024. Two search approaches were used to make the search comprehensive. The first is a block-by-block approach, in which each search concept is combined into one block of words: “infectious disease predictive”, “big data infectious disease”, “big data epidemic”, “infectious disease mitigation”, and “infectious disease prevention”. The second is a single-line approach, in which all search terms are combined into one query using Boolean operators: “big data infectious disease” OR “big data epidemic” OR “infectious disease predictive” OR “infectious disease mitigation” OR “infectious disease prevention”.
The terms were adapted to the unique indexing and search functionalities of each database, ensuring alignment with their specific controlled vocabularies and syntax. The queries were composed through a careful choice of synonyms and alternative terminology, together with appropriate use of Boolean operators, to cover all aspects of the research topic within the selected databases.
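As an illustration of the single-line approach, the following minimal Python sketch assembles the Boolean query from the five search phrases listed above. The helper name `build_single_line_query` is hypothetical, and the exact quoting and operator syntax accepted by each database may differ from what is shown here.

```python
# Search phrases taken from the text; the helper name is hypothetical.
SEARCH_TERMS = [
    "big data infectious disease",
    "big data epidemic",
    "infectious disease predictive",
    "infectious disease mitigation",
    "infectious disease prevention",
]


def build_single_line_query(terms):
    """Quote each phrase and join the phrases with the Boolean OR operator."""
    return " OR ".join(f'"{term}"' for term in terms)


query = build_single_line_query(SEARCH_TERMS)
print(query)
```

Keeping the phrases in one list makes it straightforward to reuse the same terms for the block-by-block searches, where each phrase is submitted on its own.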
Database sources
Libraries were selected based on factors including user-friendliness, document accessibility, range of indexed materials (such as journal articles, conference proceedings, books, and book chapters), the search engines’ capability to handle complex textual queries using logical operators, the precision of searches in adhering to the specified query without alterations, and the consistency in yielding reproducible outcomes. Additionally, the automatic filters provided by the databases were used to restrict the publication date to no earlier than 2019. The specified timeframe ensures the review captures the most recent and relevant studies. Selected databases are detailed in Table 1.
Table 1. Summary of databases repositories used for resources collection
Database | Details | Link |
|---|---|---|
Science Direct | Has wide scope in scientific, technical and medical research articles. Has peer-reviewed material | https://www.sciencedirect.com/ |
IEEE Xplore (IEEE) | A database that focuses on electronics, computer science, and electrical engineering | https://ieeexplore.ieee.org/Xplore/home.jsp |
Springer | Has comprehensive collection of academic resources, including books, journals, and conference proceedings | https://link.springer.com/ |
ProQuest | Provides access to diverse range of academic content, including dissertations, newspapers and scholarly journals | https://www.proquest.com/ |
Scopus | A comprehensive abstract and citation database of peer-reviewed literature, encompassing scientific journals, books, and conference proceedings across a wide range of disciplines | https://www.scopus.com/ |
PubMed | A free search engine accessing primarily the Medline database of references and abstracts on life sciences and biomedical topics | https://pubmed.ncbi.nlm.nih.gov/ |
Filtering, either through a database’s automatic filtering features or through the authors’ manual effort, helps narrow the research scope and eliminate articles that are irrelevant to the study. After a large number of records had been obtained, reference management software (Mendeley) was used to automate duplicate detection by comparing titles, author names, publication years, and DOIs. Despite the effectiveness of Mendeley’s duplicate detection, a manual review was conducted using the same criteria to identify any missed duplicates. Next, any records in a language other than English were excluded to avoid potential translation errors and ensure consistency in data analysis.
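The duplicate check described above can be sketched in a few lines of Python. This is not Mendeley's actual algorithm, which is not described here; it is a simplified illustration that assumes each record is a dictionary with hypothetical `title`, `authors`, `year`, and `doi` fields, and that uses the DOI when present and the normalized title, first author, and year otherwise.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def record_key(record):
    """Prefer the DOI as the key; otherwise fall back to title, first author, and year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("meta", normalize(record["title"]), normalize(record["authors"][0]), record["year"])


def deduplicate(records):
    """Keep the first occurrence of each key, preserving the input order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records for illustration only.
records = [
    {"title": "Big Data for Epidemics", "authors": ["Doe, J."], "year": 2021, "doi": "10.1000/xyz1"},
    {"title": "Big data for epidemics.", "authors": ["Doe, J."], "year": 2021, "doi": "10.1000/XYZ1"},
    {"title": "Forecasting Outbreaks", "authors": ["Roe, A."], "year": 2020, "doi": None},
    {"title": "forecasting  outbreaks", "authors": ["Roe, A."], "year": 2020, "doi": ""},
]
unique_records = deduplicate(records)
print(len(unique_records))
```

A key-based check of this kind will miss a pair in which only one copy carries a DOI, which is one reason a manual pass after the automated step remains useful.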
In addition, only scholarly articles from reputable sources were included in the screening process for this literature review. Specifically, only journals ranked in quartiles Q1 to Q3 of the SCImago Journal Rank (SJR) were taken into consideration. The ranking system provides a reference for evaluating the position and significance of academic journals across research fields. A lower quartile number (Q1 being the highest) indicates a journal with a higher ranking and more extensive scholarly influence.
Screening
Screening criteria
Screening started by excluding records whose titles were not related to the main area of the study, for example because they were review articles, did not centre on infectious diseases, or focused on subjects unrelated to computer science, such as medical science or bioinformatics. This includes titles that emphasize specific medical terms or clearly mention unrelated topics. Where a title was ambiguous or broad, the record was further evaluated based on its abstract to ensure relevance and quality. The area of research was carefully defined to match the subject matter of the study: only articles that discuss topics related to computer science, artificial intelligence, machine learning, and medical technology were kept, and everything else was excluded.
Eligibility and full-text screening
In order to ensure the integrity of our full-text screening process, we conducted a quality appraisal to assess the remaining records. This appraisal involved evaluating the methodological consistency, relevance, and overall quality of each study to ensure that only high-quality, pertinent research was included in the final analysis.
Quality appraisal
Following the framework outlined by [46], after selecting primary studies (PSs), it is necessary to evaluate the quality of the research presented and conduct a quantitative comparison using Quality Assessments (QAs). Hence, for our Systematic Literature Review (SLR), we established six specific QAs based on the guide from [1].
QA1. Is the purpose of the study clearly stated?
QA2. Is the interest and the usefulness of the work clearly presented?
QA3. Is the study methodology clearly established?
QA4. Are the concepts of the approach clearly defined?
QA5. Is the work compared and measured with other similar work?
QA6. Are the limitations of the work clearly mentioned?
The scoring procedure used to evaluate each QA was Yes (✓) = 1, Unclear (U) = 0.5 or No (X) = 0. The article will be considered of high quality and included in the SLR if the overall score of the article exceeds 3.0 or 50%.
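The scoring rule above can be expressed as a short Python sketch. The answer encoding (`Y`, `U`, `N`) and the function name `appraise` are assumptions made for illustration; the point values and the threshold of 3.0 follow the text.

```python
# Point values as stated in the text: Yes = 1, Unclear = 0.5, No = 0.
POINTS = {"Y": 1.0, "U": 0.5, "N": 0.0}
THRESHOLD = 3.0  # a study is included only when its total exceeds this value


def appraise(answers):
    """Return (total score, inclusion decision) for six QA answers."""
    if len(answers) != 6:
        raise ValueError("expected one answer per quality assessment (QA1-QA6)")
    total = sum(POINTS[a] for a in answers)
    return total, total > THRESHOLD


# Hypothetical answer sheets, encoded as Y / U / N for QA1..QA6.
print(appraise(["Y", "Y", "Y", "N", "U", "Y"]))  # (4.5, True)
print(appraise(["Y", "N", "N", "U", "N", "Y"]))  # (2.5, False)
```

Because the rule uses "exceeds", a study scoring exactly 3.0 would not be included under this reading.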
After this thorough screening, 30 articles remained for further review and analysis.
Results and discussion
Bibliometric analysis
Bibliometric analysis is a method used to evaluate the academic literature within a specific field. This study applies bibliometric techniques to gain insights into the development and impact of research on BDA in infectious disease prevention and risk mitigation. The articles were searched using the keywords chosen in Sect. "Database collection": “big data infectious disease” OR “big data epidemic” OR “infectious disease predictive” OR “infectious disease mitigation” OR “infectious disease prevention”.
Based on Fig. 2, while there were few publications in 2018, the number of publications increased significantly in 2020. The COVID-19 pandemic has caused a surge in research on infectious diseases, with many scientists exploring how big data can be used to manage such crises. Despite a slight decline in 2023, the number of publications remained consistently high, suggesting that the scientific community is still focused on refining and expanding upon the findings from the pandemic’s peak years.
[See PDF for image]
Fig. 2
Publications related to the utilization of big data in infectious disease prevention in the past 6 years which is extracted from the conducted SLR using PRISMA methodology
Based on the document analysis, Fig. 3 is a comprehensive categorization of how big data is utilized in infectious disease management and surveillance.
[See PDF for image]
Fig. 3
Number of studies regarding utilization of Big Data scopes from 2019 to 2024
Figure 3 shows the distribution of studies regarding the utilization of BDA across different categories. This chart indicates that the majority of studies focus on predictive analytics and disease forecasting, highlighting the importance of using BDA to anticipate and manage disease outbreaks. Population health management and real-time surveillance are also significant areas of research, while data integration and system performance have fewer studies.
Findings discussion
This subsection analyses the selected 30 articles, highlighting key insights, noting limitations, and examining the techniques used by models to manage and leverage Big Data Analytics (BDA). The review found four types of models: epidemiological models, time series analysis, machine learning models, and deep learning models. Seven distinct techniques are useful in analysing data for forecasting purposes: Susceptible-Infectious-Removed (SIR), Susceptible-Exposed-Infectious-Removed (SEIR), Regression Analysis, Random Forest, Support Vector Machine (SVM), Auto-regressive, and Deep Learning. Each model has its strengths and applications, depending on the nature of the data and the specific goals of the forecast.
Epidemiological model
SIR Model (Susceptible-Infectious-Removed): The SIR model is a mathematical model for describing the transmission of infectious diseases [44]. It was first published by William O. Kermack and Anderson G. McKendrick in 1927 [19]. It categorizes people into three groups: susceptible, infected, and removed. Susceptible individuals are those who are healthy yet vulnerable to infection upon exposure to the virus. Infected individuals are those currently carrying and capable of transmitting the disease. The removed group consists of those who have either recovered from the disease or died from it, with the assumption that they can no longer participate in the disease’s spread. The model is described by a set of differential equations [11] in which S represents the number of susceptible individuals, I represents the number of infected individuals, and R represents the number of removed (recovered or deceased) individuals. The parameters beta (β) and gamma (γ) are constants that represent the rate of transmission and the rate of recovery, respectively.
The constant population constraint is (S + I + R = N), where (N) is the total population size.
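For reference, a commonly used textbook form of the SIR equations is dS/dt = -βSI/N, dI/dt = βSI/N - γI, and dR/dt = γI. This form is an assumption here, since the exact notation of [11] may differ (for example, some formulations omit the division by N). The minimal Python sketch below integrates these equations with forward Euler steps; the function name and all parameter values are illustrative and are not fitted to any dataset.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler steps of size dt (in days)."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt  # flow from S to I
        new_removals = gamma * i * dt           # flow from I to R
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((step * dt, s, i, r))
    return history


# Illustrative parameters only (not fitted to any dataset).
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(trajectory, key=lambda row: row[2])
print(f"peak of {peak_infected:.0f} infected around day {peak_time:.0f}")
```

Each step moves individuals between compartments without creating or removing anyone, so the constraint S + I + R = N holds throughout the simulation up to floating-point rounding.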
According to Gopagoni and Lakshmi [18], as shown in Fig. 4, there are many variations of the SIR model, including ones that account for birth and death rates and for intermediate states in disease spread. However, because it focuses on the early stages of COVID-19 expansion and on the short term, the model assumes that people develop immunity. It acknowledges that in the long term, immunity may wane, potentially allowing COVID-19 to become seasonal like the flu, but it does not account for transitions from recovered individuals back to susceptible or infected states.
[See PDF for image]
Fig. 4
Classic SIR Model (left) Classic SEIR Model (right) [18]
However, [52] discovered that the synergistic effects of population migration factors and variations in disease transmission rates are not considered by the dynamic models that are currently in use. Hence, they introduced the M-SIR (Migration-Susceptible-Infectious-Recovered) model, an adaptation of the traditional SIR model incorporating population migration to understand disease spread dynamics better. It employs LightGBM, a machine learning method, to accurately track infection and recovery rates, considering the effects of cross-regional movement and control measures on epidemic trends [52]. Using COVID-19 data, the model demonstrates improved prediction accuracy for infected and recovered cases in Beijing and Shanghai compared to the standard SIR model.
SEIR Model (Susceptible-Exposed-Infectious-Removed): Considering the incubation period characteristic of some infectious diseases, susceptible individuals might not exhibit symptoms right after infection. This model expands upon the SIR model by including an ‘Exposed’ category. This group consists of individuals who have been exposed to the infection but are not yet symptomatic, representing a phase where they could either recover without developing further symptoms or progress to becoming infectious.
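The extra compartment can be illustrated with a minimal Python sketch of the standard SEIR structure, in which exposed individuals become infectious at an incubation rate, here called sigma (a symbol introduced for this example, not taken from the text). The sketch assumes the simplest case, where every exposed individual eventually becomes infectious, and all parameter values are illustrative only.

```python
def seir_step(state, beta, sigma, gamma, dt):
    """Advance (S, E, I, R) by one forward-Euler step of size dt."""
    s, e, i, r = state
    n = s + e + i + r
    exposed = beta * s * i / n * dt   # S -> E: new exposures
    infectious = sigma * e * dt       # E -> I: end of incubation
    removed = gamma * i * dt          # I -> R: recovery or death
    return (s - exposed, e + exposed - infectious, i + infectious - removed, r + removed)


def simulate_seir(state, beta, sigma, gamma, days, dt=0.1):
    """Return the list of states from day 0 to the final day."""
    states = [state]
    for _ in range(int(round(days / dt))):
        state = seir_step(state, beta, sigma, gamma, dt)
        states.append(state)
    return states


# Illustrative values only: sigma = 1/5 corresponds to a mean incubation of 5 days.
states = simulate_seir((9990.0, 0.0, 10.0, 0.0), beta=0.3, sigma=0.2, gamma=0.1, days=200)
s_end, e_end, i_end, r_end = states[-1]
print(f"final removed: {r_end:.0f} of {sum(states[0]):.0f}")
```

Compared with SIR, the incubation stage delays the growth of the infectious group, which is why the choice of sigma matters when fitting the model to diseases with a long incubation period.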
In 2020, Tan et al. introduced the SEIR-DSGE model. The model closely matches statistical data and is considered well suited to predicting the spread of the novel coronavirus using the SEIR framework across different countries. Because of variables such as policy and healthcare resources, parameter adjustments are needed every 2 months to maintain precision. [16] presents a SEIRS (Susceptible-Exposed-Infectious-Recovered-Susceptible) model to simulate and forecast the COVID-19 epidemic trend in China. The model accurately fits the actual numbers and peak timing of the epidemic in Hubei province and predicts a decline in cases by early June. However, inconsistencies exist between the model’s simulations and the actual situation in other Chinese regions, attributed to the model’s exclusion of human intervention strategies.
In 2021, Liu X. et al. introduced SEAIRD (susceptible-exposed-asymptomatic-infected-recovered-death). In the model, individuals who have recovered may once again become susceptible to the infection, indicating that recovery does not confer permanent immunity (Table 2).
Table 2. Result of quality appraisal for 30 chosen articles that have passed screening criteria
No. | Study | QA1 | QA2 | QA3 | QA4 | QA5 | QA6 | Number of criteria fulfilled/6 | Inclusion in the review |
|---|---|---|---|---|---|---|---|---|---|
1. | [18] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
2. | [52] | ✓ | U | ✓ | ✓ | ✓ | ✓ | 5.5 | ✓ |
3. | [65, 66] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
4. | [73] | ✓ | ✓ | ✓ | X | U | ✓ | 4.5 | ✓ |
5. | [16] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
6. | [32] | ✓ | ✓ | U | ✓ | ✓ | ✓ | 5.5 | ✓ |
7. | [36, 37] | ✓ | ✓ | ✓ | ✓ | X | ✓ | 5 | ✓ |
8. | [28] | ✓ | X | ✓ | ✓ | ✓ | ✓ | 5 | ✓ |
9. | [13] | ✓ | ✓ | ✓ | ✓ | ✓ | X | 5 | ✓ |
10. | [43] | ✓ | X | ✓ | ✓ | ✓ | ✓ | 4 | ✓ |
11. | [5] | ✓ | X | ✓ | ✓ | ✓ | ✓ | 4 | ✓ |
12. | [47] | ✓ | U | ✓ | ✓ | ✓ | ✓ | 4.5 | ✓ |
13. | [53] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 5 | ✓ |
14. | [10] | X | ✓ | X | ✓ | U | ✓ | 3.5 | ✓ |
15. | [17] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
16. | [23] | ✓ | U | ✓ | X | ✓ | ✓ | 4.5 | ✓ |
17. | [58] | ✓ | ✓ | ✓ | ✓ | ✓ | U | 5.5 | ✓ |
18. | [7] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
19. | [50] | ✓ | ✓ | X | ✓ | ✓ | ✓ | 5 | ✓ |
20. | [20] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
21. | [30] | ✓ | X | ✓ | ✓ | ✓ | ✓ | 4 | ✓ |
22. | [65, 66] | ✓ | ✓ | U | ✓ | ✓ | ✓ | 5.5 | ✓ |
23. | [41] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
24. | [31, 33] | ✓ | ✓ | ✓ | ✓ | X | ✓ | 5 | ✓ |
25. | [34] | ✓ | ✓ | ✓ | ✓ | ✓ | X | 5 | ✓ |
26. | [3] | ✓ | ✓ | ✓ | ✓ | X | ✓ | 5 | ✓ |
27. | [55] | ✓ | U | ✓ | ✓ | ✓ | ✓ | 5.5 | ✓ |
28. | [67] | ✓ | X | ✓ | X | ✓ | ✓ | 4 | ✓ |
29. | [12] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
30. | [29] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 6 | ✓ |
Table 3 presents two primary epidemiological models used for COVID-19 prediction. Both models show distinct characteristics and limitations in their application to pandemic modelling.
Table 3. Summary of Studies using Epidemiological Model
Technique | Author | Strengths/Limitations |
|---|---|---|
SIR | [18], [52], [65, 66], [73] | 1. The data for all studies is geographically limited to China, Different public health responses in countries created variations in infection rates and effectiveness 2. While short-term predictions can be made, the long-term effects of the virus and immunity are still unclear 3. The model does not account for various environmental and social factors that may influence the pandemic’s spread |
SEIR | [32] | 1. The model does not account for all potential external factors influencing disease spread, such as regional lockdown measures or behavioural changes in the population during the outbreak |
SEIR | [36, 37] | 1. The adaptive SEAIRD model lacks synchronization in time dynamics, potentially affecting its accuracy in larger geographic contexts |
SEIR | [40] | 2. They note that there are multiple unknown factors in global epidemic dynamics that can introduce errors in their model, suggesting that while their predictions may be more accurate, they are not reliable |
SEIR | [53] | 1. The model does not account for all potential external factors |
SEIR | [63] | 1. While the model performs well in predictions, its accuracy diminishes after 2 months due to rapid changes in the epidemic context and intervention strategies |
SEIR | [71] | 1. The outcomes heavily depend on the accurate calibration of model parameters (like contact rates and recovery rates), which can vary significantly across different contexts and over time 2. The study primarily focuses on a few countries (China, America, India, and Brazil), which may not generalize to other regions with different socio-economic and health infrastructures |
SEIR | [16] | 1. The model does not include things like lockdowns or social distancing hence it works well for Hubei but is not as accurate for other places where these measures changed the outbreak’s spread |
SEIR | [31, 33] | 1. The model works well for Kashgar but might not apply elsewhere, ignores new COVID variants, and assumes testing/isolation policies work perfectly which is not always realistic |
Each SEIR-based model has been modified with additional features to improve accuracy. Most authors agree that the SEIR model performs best. SEIR demonstrates superior adaptability compared to SIR, as evidenced by successful modifications and enhanced accuracy [73]. A critical weakness appears in both models’ heavy reliance on Chinese data, potentially limiting their global applicability.
While SEIR shows better overall performance, it faces challenges with the synchronization of time dynamics, suggesting a complexity-accuracy trade-off [36, 37, 40, 53].
Both models struggle with external factor incorporation, particularly regarding social interventions and environmental variables, indicating a significant gap in real-world application capability [53, 65, 66, 73] (Lu et al., 202).
Machine learning models
Logistic Regression: Logistic Regression is a popular binary classification tool known for its simplicity and interpretability, widely applied in both industrial and medical fields [62]. It uses maximum likelihood estimation to determine distribution parameters, making it valuable for disease diagnosis [62]. However, its inability to address non-linear problems limits its effectiveness in predicting epidemics. The algorithm’s simplicity often results in lower accuracy rates, making it less suitable for modelling the complex dynamics of epidemic spread [10].
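To make the maximum likelihood step concrete, the following Python sketch fits a logistic regression by batch gradient ascent on the Bernoulli log-likelihood. It is a teaching sketch under stated assumptions, not the procedure of any reviewed study: the data are invented, and the learning rate and number of epochs are arbitrary choices.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Maximize the Bernoulli log-likelihood by batch gradient ascent.

    xs: list of feature lists, ys: list of 0/1 labels.
    Returns the weights, with the intercept as the first entry.
    """
    weights = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(xs, ys):
            row = [1.0] + list(features)
            error = label - sigmoid(sum(w * v for w, v in zip(weights, row)))
            for j, v in enumerate(row):
                gradient[j] += error * v
        weights = [w + lr * g / len(xs) for w, g in zip(weights, gradient)]
    return weights


def predict_proba(weights, features):
    """Return the estimated probability that the label is 1."""
    row = [1.0] + list(features)
    return sigmoid(sum(w * v for w, v in zip(weights, row)))


# Hypothetical toy data: one feature (e.g., a symptom score), label 1 = positive case.
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
weights = fit_logistic(xs, ys)
print(predict_proba(weights, [1.0]), predict_proba(weights, [4.0]))
```

The fitted decision boundary is a single threshold on a weighted sum of the features, which illustrates why the method struggles with the non-linear dynamics mentioned above.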
Decision Trees: A decision tree is a versatile supervised machine learning algorithm used for classification and regression tasks [27]. It creates a tree-like structure where each internal node represents a feature, and branches represent decision rules based on feature values. Starting from the root node, decisions are made at each internal node until reaching a leaf node, which provides the final prediction.
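The traversal from root to leaf can be sketched as follows. The tree is hand-written and entirely hypothetical: the feature names, thresholds, and labels are invented for illustration and carry no clinical meaning.

```python
# A hypothetical hand-written tree; feature names and thresholds are illustrative only.
TREE = {
    "feature": "fever_celsius",
    "threshold": 38.0,
    "left": {"label": "low risk"},   # taken when fever_celsius <= 38.0
    "right": {                       # taken when fever_celsius > 38.0
        "feature": "cough_days",
        "threshold": 3,
        "left": {"label": "monitor"},
        "right": {"label": "test for infection"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, applying one decision rule per internal node."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(predict(TREE, {"fever_celsius": 38.6, "cough_days": 5}))  # test for infection
print(predict(TREE, {"fever_celsius": 37.2, "cough_days": 5}))  # low risk
```

In practice the features and thresholds are learned from training data rather than written by hand; the sketch only shows how a prediction is read off once the tree exists.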
[17] propose an intelligent model to detect illnesses, combining learning techniques. Infectious diseases remain a major global cause of death. The suggested model involves predictive analysis based on symptoms, using an infectious disease dataset from Kaggle. Pre-processing includes mean, median, and mode data imputation. Correlation-based feature selection is applied, and disease analysis employs a fuzzy model-based Random Forest Algorithm (RFA). Comparing results with existing models (such as Support Vector Machine and Naive Bayes), the suggested model achieves up to 97.61% accuracy.
Random Forests: Random Forest builds upon the concept of a decision tree, which uses data attributes to make predictions through a tree-like structure [8]. However, a single decision tree can easily overfit, meaning it might not perform well on new, unseen data. Random Forest addresses this by creating a ‘forest’ of multiple decision trees, each trained on random subsets of data and features. It then aggregates the predictions from all these trees to make a final decision, significantly reducing the risk of overfitting and improving the model’s accuracy on large and complex datasets [8].
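The two ingredients described above, training each tree on a random subset of data and features and then aggregating the votes, can be sketched as follows. To keep the example short, each "tree" is simplified to a one-split decision stump, which is an assumption of this sketch rather than how Random Forest is normally implemented; the toy data set is also made up.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_indices):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]           # bootstrap sample
        features = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Made-up data: two well-separated groups described by two features.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
```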
In [23], COVID-19 prediction algorithms based on artificial intelligence were compared in terms of characteristic constraints and prediction accuracy. Notably, random forest outperformed logistic regression and the support vector machine in epidemic prediction, and the proposed medical platform showed promising results when using random forest. The authors also suggested using blockchain-based unmanned aerial vehicles for contactless transmission in the COVID-19 environment. Future development aims to support more applications in the epidemic context through prediction analysis.
Random Forest demonstrates remarkable accuracy [17], suggesting its superior capability in handling complex epidemiological data. The integration of AI and IoT in random forest applications represents a significant advancement in prediction methodology [23]. The success of hybrid approaches, such as combining Random Forest with Fuzzy Models [17], suggests a promising direction for future model development (Table 4).
Table 4. Summary of Studies using Machine Learning Models
| Technique | Author | Strength/Limitations |
|---|---|---|
| Regression Analysis | [10] | 1. Additional sociopolitical variables, such as government measures and public cooperation, need to be incorporated, and model generalizability must be improved for cross-national epidemic prediction |
| Random Forest | [17] | 1. Introduced a hybrid approach combining the Random Forest Algorithm with a Fuzzy Model for disease prediction 2. The Random Forest algorithm consistently outperformed both Naïve Bayes and SVM, with an accuracy of 97.61% |
| | [23] | 1. Developed a novel prediction method combining AI and IoT for COVID-19 detection 2. Suggested that the random forest algorithm achieved better prediction accuracy than logistic regression and SVM, as reflected in its less noisy graph |
Time series analysis
AutoRegressive Integrated Moving Average (ARIMA): The ARIMA model, which combines autoregressive, differencing, and moving average components, is a forecasting technique for time series data. It hinges on three key parameters: the autoregressive order (p), the number of differencing steps required to make the series stationary (d), and the moving average order (q) [30]. Stationarity is crucial for ARIMA's application and is assessed through visual checks and statistical tests. Determining the model's order involves balancing information-criterion minimization against prediction error, and the final model is validated by analysing the autocorrelation of the residuals. In the context of infectious diseases, ARIMA can be applied to predict future case counts based on historical data.
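As an illustration of how the p and d parameters interact, the sketch below implements the simplest non-trivial case, ARIMA(1, 1, 0): the series is differenced once (d = 1), an AR(1) model is fitted to the differences by least squares (p = 1), and forecasts are integrated back to the original scale. The moving average term is omitted (q = 0), the estimation method is a simplification of what statistical packages do, and the case counts are invented for the example.

```python
def difference(series):
    """First-order differencing (d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares estimate of c and phi in x_t = c + phi * x_{t-1} + noise."""
    x, y = values[:-1], values[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    var_x = sum((a - mean_x) ** 2 for a in x)
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / var_x
    return mean_y - phi * mean_x, phi


def forecast_arima_110(series, steps):
    """Forecast by predicting future differences and adding them back up."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        level += last_diff
        forecasts.append(level)
    return forecasts


# Invented cumulative case counts whose growth is slowing down.
cases = [100, 110, 117, 122.5, 127.25, 131.625]
next_two = forecast_arima_110(cases, steps=2)
```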
Support Vector Machine (SVM): SVM is a supervised machine learning algorithm for classification and regression tasks [60]. It works by finding the hyperplane that best separates different classes in the feature space. The goal is to maximize the margin between the closest points of the classes, which are known as support vectors. SVM is versatile, capable of handling linear and non-linear separations through the use of kernel functions, making it effective for a wide range of applications, from image classification to bioinformatics [60].
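The kernel functions mentioned above replace the plain dot product with a similarity measure that allows non-linear separation. A minimal sketch of three commonly used kernels follows; the default values for `degree`, `coef0`, and `gamma` are illustrative assumptions, not settings reported by any reviewed study.

```python
import math


def linear_kernel(x, z):
    """Plain dot product: corresponds to a linear decision boundary."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: (x . z + coef0) ** degree."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


x, z = [1.0, 2.0], [3.0, 4.0]
similarities = (linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, x))
```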
Based on Table 5, the ARIMA model outperformed the random forest model, achieving lower values on all three reported error metrics (MASE, SMAPE, and RMSE) [58]. However, the model does not account for important variables such as face mask-wearing measures and social distancing restrictions [58]. Information about SVM is limited; nevertheless, social and technological barriers in data collection and model implementation should be considered to avoid limitations such as those reported in [60].
Table 5. Studies about Time Series Models
| Technique | Author | Strength/Limitations |
|---|---|---|
| Auto-Regressive | [58] | 1. The Autoregressive Integrated Moving Average (ARIMA) model achieved values of 0.14 for MASE, 9.97 for SMAPE, and 22,316.57 for RMSE 2. These results were better than those of the random forest model, which achieved 0.51, 41.77, and 71,579.50 respectively 3. The model does not consider additional variables such as face mask-wearing measures, social distancing restrictions, and geographic characteristics |
| | [36, 37] | 1. Non-Linear Auto-Regressive (NAR) outperformed other models (Logistic Regression, ARIMA, and SEIR) in predicting both confirmed cases and deaths, with a maximum error of 3.6% and a minimum error of −0.3% 2. The model showed particularly strong performance in predicting death cases, with prediction errors ranging from −0.4% to 1.9% for the tested countries 3. The study was limited to analysing data from only three countries (United States, India, and Brazil) for the first 10 months of 2020 |
| | [7] | 1. The study uses six univariate and two multivariate methods simultaneously 2. Univariate models performed better than expected, often surpassing multivariate methods 3. The dependency on past data creates challenges in producing reliable predictions for unique disease outbreaks |
| | [30] | 1. Developed an ARIMA-GRNN combination model that enables dynamic and continuous prediction. The combination model achieved the lowest error rates, with MSE of 0.0016, MAE of 0.0334, and MAPE of 9.6914% 2. The sample size of the training set is insufficient, and all models show decreased accuracy in the last 3 months compared to earlier months |
| | [41] | 1. Proposes an infectious disease prediction model based on ARIMA that considers multiple characteristics, including changing trends, periodicity, and sudden occurrence patterns 2. The model achieved an average absolute value error of 4.166% with accuracy as high as 95% 3. The model is primarily suitable for short-term time series data prediction |
| | [31, 33] | 1. Developed a hybrid Transformer-GCN model with dynamic positional encoding for predicting COVID-19 cases, deaths, and vaccination rates across US states 2. The model achieved 20%, 16.6%, and 50% lower MAPE values in predicting confirmations, deaths, and vaccinations respectively compared to the Time Series Transformer (TST) model |
| | [47] | 1. Uses only past time series data without accounting for factors like new COVID-19 variants, vaccination rates, or government interventions 2. Relies on ARIMA and Prophet models that assume linearity and seasonality |
| SVM | [60] | 1. Developed a framework that uses a hybrid of three SVM kernels (linear, polynomial, and radial basis function (RBF)) with data from CDC, NOAA, and Google Trends 2. Web search data can be influenced by media coverage and language barriers, people may incorrectly self-diagnose when searching online, and limited internet access in some areas can affect data reliability |
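For readers comparing the error metrics reported in Table 5, the sketch below shows one common way to compute RMSE, SMAPE, and MASE. Several variants of SMAPE and MASE exist, so the exact definitions used in the reviewed studies may differ; the definitions here (SMAPE with the mean of absolute values in the denominator, MASE scaled by the in-sample one-step naive forecast) and the sample numbers are assumptions of this example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    terms = [2 * abs(p - a) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)]
    return 100 * sum(terms) / len(terms)


def mase(actual, predicted, training):
    """Mean absolute scaled error relative to a naive forecast on the training data."""
    naive_mae = sum(abs(b - a) for a, b in zip(training, training[1:])) / (len(training) - 1)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return mae / naive_mae


# Invented numbers for illustration.
training = [80, 90, 100]
actual = [100, 200]
predicted = [110, 190]
scores = (rmse(actual, predicted), smape(actual, predicted), mase(actual, predicted, training))
```

A MASE below 1 means the model beats the naive "repeat the last value" forecast on average, which is why it is convenient for comparing models across series of different scales.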
Deep learning models
Neural Networks: Deep learning models, like neural networks, can be used to analyse complex relationships within large datasets [4]. They are particularly effective when there are intricate patterns in the data.
[65, 66] focused on using deep learning models for COVID-19 forecasting. They combined multiple recurrent neural network (RNN)-based models and considered various data sources. They also proposed clustering-based training to address the sparsity of training data.
[34] argue that artificial intelligence methods alone cannot predict how infectious diseases change over time, and therefore developed a COVID-19 prediction model that combines deep learning technology (LSTM) with a mathematical infectious disease model (SIR). This combination improved single-day predictions by 50%, and the model adapts well to short- and medium-term forecasts.
Meanwhile, [4] utilize RNN (Recurrent Neural Networks) and LSTM (Long Short-Term Memory) deep-learning models. These networks were specifically chosen for their ability to analyse time series data and accurately predict future trends. Notably, both models demonstrated significant success in forecasting temporal data compared to traditional methods. Firstly, RNNs were employed for processing time series and sequential data. They are also valuable for modelling sequence data [4]. Derived from feedforward networks, RNNs exhibit behaviour similar to the human brain. In simpler terms, RNNs produce predictive results for sequential data that other algorithms cannot achieve. Secondly, LSTMs incorporate a sophisticated gated memory unit designed to address the vanishing gradient problem encountered in simple RNNs [4].
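The gated memory unit mentioned above can be illustrated with a single time step of an LSTM cell. The sketch uses the textbook formulation with sigmoid gates and tanh activations and a hidden size of one so that it fits in plain Python; it is not the architecture of [4], and the parameter layout (`params` mapping each gate to an input weight, a recurrent weight, and a bias) is an assumption made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (hidden size 1)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)        # how much of the old cell state to keep
    input_ = gate("input", sigmoid)         # how much new information to write
    output = gate("output", sigmoid)        # how much of the cell state to expose
    candidate = gate("candidate", math.tanh)

    # The additive update of the cell state is what eases the vanishing gradient problem.
    c_new = forget * c_prev + input_ * candidate
    h_new = output * math.tanh(c_new)
    return h_new, c_new


# With all weights at zero every gate is 0.5 and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```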
Table 6 outlines various deep learning applications, showcasing the evolution and advancement of AI-based approaches in COVID-19 prediction. Deep learning models demonstrate exceptional accuracy levels:
[4]: The 98.53% accuracy achieved in COVID-19 case prediction indicates high reliability
[34]: The 50% improvement in single-day predictions shows significant advancement over traditional methods
[4, 50, 67]: Low error rates demonstrate consistent performance.
Table 6. Summary of Studies using Deep Learning Model
| Technique | Author | Strength/Limitations |
|---|---|---|
| Deep learning | [34] | 1. Proposes a COVID-19 prediction model, SIRVD, using deep learning 2. Results show that the model improves single-day predictions by 50% and adapts well to short- and medium-term forecasting |
| | [4] | 1. Proposed deep learning prediction models that used the Rectified Linear Unit (ReLU) activation function in LSTMs, along with tanh and sigmoid activation functions in RNN models 2. The LSTM prediction model achieved 98.53% accuracy in predicting the number of confirmed COVID-19 cases and resultant deaths |
| | [67] | 1. Employed an LSTM (Long Short-Term Memory) prediction model 2. The LSTM model achieved the lowest error rates compared to other models, with RMSE of 6.903, MAE of 5.742, and MASE of 0.246 |
| | [59] | 1. Predicts COVID-19 confirmed cases from April to October 2022 using AdaBoost-Bi-LSTM 2. The model achieved improved predictive power, with an increased performance of 17.41% over the simple GRU/LSTM model and of 15.62% over the Bi-LSTM model |
| | [50] | 1. This study provides evidence of the predictive performance of LSTM using evaluation metrics such as RMSE, MSE, MAE, and R2. The findings support the potential of LSTM models to inform timely and effective disease control strategies 2. The complex and nonlinear structure of LSTM limits its practical use in critical fields like public health |
| | [20] | 1. Developed an XGBoost-LSTM mixed framework that predicts infectious disease spread across multiple cities and regions 2. The framework needs further improvement in mining traffic big data characteristics related to infectious disease spread |
| | [65, 66] | 1. The primary challenge is the lack of sufficient training data for deep learning methods 2. The surveillance data is extremely noisy due to rapidly evolving epidemics |
The integration of multiple techniques, as in [4, 20, 34, 59], suggests an evolving understanding of complex pandemic patterns and indicates significant progress in model architecture.
However, these complex models are difficult to deploy in real-world settings, and insufficient data reduces their accuracy. Addressing this requires better methods for collecting and processing data so that the models can work effectively in public health.
Applied BDA in pandemic forecasting
Given the urgent need for forecasting tools, many organizations have integrated BDA into forecasting, albeit with limited resources and some technological flaws, as summarized in Table 3. Mobile devices and social media platforms also generate data that can be used to monitor infected patients, for example by providing their location (Kraemer et al. 2018). People who have had contact with an infected person can then be notified by text message so that the necessary safety measures are taken, such as testing to determine whether they have been infected. Such an initiative was carried out by the government of South Korea during the COVID-19 pandemic, which started in 2020 (Harvard Business Review 2020). However, the large volume of data generated by different devices requires processing infrastructure capable of handling it. [12] used the COVID-19 Forecast Hub as a platform for Big Data Analytics to derive underlying relationships in data sources and understand how that information can help achieve public health goals during epidemics. The COVID-19 Forecast Hub serves as an authoritative resource for real-time forecasts related to COVID-19 in the United States. Created by numerous leading infectious disease modelling teams worldwide, the hub collaborates closely with the US Centers for Disease Control and Prevention (CDC) (COVID-19 Forecast Hub, accessed April 2024). Every week, modelling teams contribute their predictions to the hub's data repository, and the hub then consolidates these individual submissions into a unified ensemble forecast [12].
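The consolidation of individual submissions into an ensemble can be sketched as follows. The aggregation rule used here, taking the median across models at each quantile level, is an assumption chosen for illustration; the text above does not specify which rule the hub applies, and the model names and numbers are invented.

```python
from statistics import median


def median_ensemble(submissions):
    """Combine per-model quantile forecasts by taking the median at each quantile level.

    submissions maps a model name to a dict of {quantile_level: forecast_value}.
    Only quantile levels provided by every model are combined.
    """
    levels = sorted(set.intersection(*(set(f) for f in submissions.values())))
    return {q: median(f[q] for f in submissions.values()) for q in levels}


# Hypothetical one-week-ahead forecasts of weekly deaths from three teams.
submissions = {
    "model_a": {0.25: 90, 0.5: 100, 0.75: 120},
    "model_b": {0.25: 80, 0.5: 110, 0.75: 130},
    "model_c": {0.25: 100, 0.5: 105, 0.75: 150},
}
ensemble = median_ensemble(submissions)
```

A median is less sensitive than a mean to a single team submitting an extreme forecast, which is one reason it is a natural choice for a sketch like this.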
Another interesting public information system is Reproduction.live. Reproduction Live is a web-based platform designed to estimate the reproductive number for COVID-19 across various geographical regions [56]. The reproductive number, denoted as R, quantifies the contagiousness of a disease. Specifically, it represents the average number of secondary infections caused by a single infected individual [48].
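The definition of R given above can be turned directly into a small calculation. The sketch below counts secondary infections per case from contact-tracing style records; it only illustrates the definition and is not how Reproduction.live estimates R. The record format and the names in it are hypothetical.

```python
from collections import Counter


def mean_secondary_cases(infector_of, completed_cases):
    """Average number of secondary infections per case in completed_cases.

    infector_of maps each infected person to the person who infected them.
    completed_cases lists cases whose onward transmission has been fully observed.
    """
    secondary = Counter(infector_of.values())
    return sum(secondary[case] for case in completed_cases) / len(completed_cases)


# Hypothetical records: A infected B and C, B infected D, C infected nobody.
infector_of = {"B": "A", "C": "A", "D": "B"}
r_estimate = mean_secondary_cases(infector_of, ["A", "B", "C"])
```

Here the estimate is exactly 1, the threshold at which case numbers neither grow nor shrink on average.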
During the recent COVID-19 pandemic, [64] presented a system that renders a visualization of the regional geographical map of the Czech Republic in 10 s and can exchange data with other medical assistance systems. In addition to data integration, its geographic mapping functionality provides near real-time pandemic/epidemic tracking, outbreak propagation monitoring, and visualization of risk data. The system could be applied to other territories and integrate more information, making it an essential tool given the global nature of the pandemic.
Before the Big Data era, the results of health science were not easily disseminated, and the time to raise public awareness about the emergence of new diseases could be very long. Now, with Big Data and the emergence of Big Data Analytics, this reality can be changed, as scientific data can be sent directly to mobile devices, through messages delivered almost instantly, for example. In this way, scientific reports can be widely disseminated, and the impact of health science can be extended (Huang et al. 2015). From the results obtained, it is observed that the data and techniques used in previous outbreaks, epidemics, and pandemics mentioned in the extant literature, can be applied in the current context of the COVID-19 pandemic, as they are generic technologies and applicable to different scenarios. Thus, the good practices reported can serve as a guide in the current pandemic scenario.
Research gaps
Several critical research gaps and technical challenges persist in the field. Complex models frequently encounter substantial difficulties in real-world applications, as outlined in Sect. "Findings discussion", where data availability and quality limitations undermine predictive accuracy. Moreover, many studies struggle with insufficient training datasets and noisy surveillance data, exacerbated by the dynamic nature of epidemics. These findings highlight the pressing need for enhanced data collection and processing methodologies. A key challenge remains data integration, particularly in harmonising diverse data types into cohesive estimates while accounting for the inherent variability and biases within each data stream.
Addressing these challenges is crucial for leveraging Big Data Analytics in proactive infectious disease prevention and risk mitigation for COVID-19. Predictive models such as SEIR frameworks can enhance outbreak forecasting, enabling more effective policy responses. Real-time monitoring platforms facilitate continuous surveillance, supporting timely intervention strategies. Additionally, the integration of advanced machine learning and deep learning algorithms strengthens predictive accuracy and operational efficiency, enabling robust decision-making for pandemic management (Table 7).
Table 7. Current COVID-19 forecasting web-based systems
| Title | References | Technique | Scope |
|---|---|---|---|
| COVID-19 Forecast Hub | [12] | Weighted Interval Score (WIS) | Deaths, hospitalisations, number of cases |
| Prediction of Epidemic Disease Dynamics on the Infection Risk Using Machine Learning Algorithms | [51] | Multivariate Logistic Regression | Infection risk |
| Reproduction.live | [48] | R-Score | Risk value, daily reported cases |
| High Performance Computing Cluster (HPCC) | [64] | SIR | Infection Rate |
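The Weighted Interval Score listed in Table 7 evaluates a probabilistic forecast given as a median plus a set of central prediction intervals. The sketch below follows the commonly published definition of the score, in which each interval is penalised for its width and for missing the observation; the function names, the input layout, and the example numbers are assumptions of this sketch.

```python
def interval_score(lower, upper, observed, alpha):
    """Score for a central (1 - alpha) prediction interval; lower is better."""
    score = upper - lower  # wide intervals are penalised
    if observed < lower:
        score += (2 / alpha) * (lower - observed)
    elif observed > upper:
        score += (2 / alpha) * (observed - upper)
    return score


def weighted_interval_score(median, intervals, observed):
    """WIS for a forecast given as a median plus central intervals.

    intervals maps alpha to (lower, upper) for the central (1 - alpha) interval.
    """
    total = 0.5 * abs(observed - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2) * interval_score(lower, upper, observed, alpha)
    return total / (len(intervals) + 0.5)


# Invented forecast: median 100, 50% interval (90, 110), 90% interval (70, 130).
wis = weighted_interval_score(
    median=100,
    intervals={0.5: (90, 110), 0.1: (70, 130)},
    observed=120,
)
```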
Findings highlights
This review demonstrates that Big Data Analytics plays a pivotal role in pandemic management through several key mechanisms:
Predictive Capabilities: The study identified a range of effective predictive models, with Susceptible-Exposed-Infectious-Removed (SEIR)-based frameworks exhibiting high adaptability and accuracy in forecasting infection trends. For instance, SEIR models were widely employed during the COVID-19 pandemic to project outbreak trajectories across regions, aiding in strategic policy decisions.
Real-time Monitoring: The adoption of BDA platforms such as the COVID-19 Forecast Hub enabled real-time surveillance by aggregating and visualising data on case counts, testing rates, and hospitalisations. These platforms supported timely decision-making by public health authorities through continuous updates and scenario modelling.
Integration of Technologies: The review further revealed successful integration of advanced technologies, particularly machine learning algorithms like Random Forest and deep learning architectures such as Long Short-Term Memory (LSTM) networks. For example, Random Forest models were used to predict COVID-19 severity based on patient comorbidities and demographics, while LSTM models demonstrated strong performance in time series forecasting of daily case numbers.
Table 8 summarises the key mechanisms highlighted in pandemic management.
Table 8. Summary of Key Mechanisms Utilising BDA
| Mechanism | Description | Examples |
|---|---|---|
| Predictive Capabilities | Effective models for forecasting infection trends | SEIR-based models were used during COVID-19 to project outbreak trajectories due to their high accuracy in forecasting and aiding strategic policy decisions |
| Real-time Monitoring | Platforms for real-time surveillance of pandemic data | BDA platforms, such as the COVID-19 Forecast Hub, aggregate case counts, testing rates, and hospitalisations for decision-making |
| Integration of Technologies | Advanced machine learning and deep learning technologies for predictions and analysis | Random Forest predicted COVID-19 severity; LSTM excelled in forecasting daily case numbers |
These findings highlight strategies for addressing the COVID-19 pandemic. Big Data’s transformative potential in healthcare surpasses current understanding by governments, organizations, and corporations. Emerging technologies for collecting, storing, analysing, and sharing cloud-based data are poised to make vast amounts of information accessible to millions.
Conclusion
This study systematically examines the pivotal role of Big Data Analytics (BDA) in monitoring and predicting epidemic and pandemic infections, employing the PRISMA methodology and quality appraisal techniques to deliver comprehensive insights into its healthcare applications. From an initial pool of 846 articles across multiple renowned databases, 30 high-quality studies were selected for in-depth analysis. The review highlights four key predictive models (epidemiological, time series, machine learning, and deep learning) and seven analytical techniques, including SIR, SEIR, regression analysis, random forest, support vector machines, auto-regressive methods, and deep learning architectures.
BDA has demonstrated immense potential in infectious disease control by processing diverse healthcare data and integrating technologies such as IoT and social media to enhance diagnosis, clinical decision-making, and surveillance. However, challenges persist, particularly concerning predictive models’ limited reliability and generalizability in real-world scenarios due to low-quality, noisy, and incomplete data. Early phases of the COVID-19 pandemic exemplify this, where inconsistent case reporting hindered accurate forecasting and timely responses.
Effective treatment strategies are essential as infectious disease outbreaks increase, especially during the COVID-19 pandemic. BDA emerges as a transformative tool in epidemic and pandemic management, extending beyond traditional data analysis to revolutionise healthcare responses. Utilising historical datasets allows for precise predictions of disease spread, hospitalisation rates, and resource allocation. Integrated frameworks like SEIR, ARIMA, and deep learning models have enhanced prediction accuracy. Furthermore, BDA facilitates efficient contact tracing by analysing movement patterns, social interactions, and exposure risks through data collected from public information platforms and wearable devices. Platforms like the COVID-19 Forecast Hub offer real-time dashboards to monitor critical metrics, support proactive decision-making, and enhance public health responses. This study emphasises the crucial role of BDA in reshaping pandemic management. It outlines potential avenues for future research to address current limitations and fully harness the capabilities of global health systems.
Future recommendations
This study highlights several areas for future research to enhance the effectiveness of Big Data Analytics (BDA) in infectious disease mitigation. Data quality, availability, and integration challenges continue to affect the accuracy and generalizability of predictive models. To address these issues, future research should prioritise integrating diverse data sources, particularly hospital records and social media streams, with traditional surveillance data to improve model robustness across varied geographical contexts.
Incorporating hospital and social media data offers promising directions for methodological advancement. For instance, machine learning techniques such as Long Short-Term Memory (LSTM) and transformer-based models can be utilized for real-time trend detection in unstructured text. In addition, anomaly detection approaches, including autoencoders, may effectively capture deviations in hospital admission patterns. Structured clinical data can benefit from ensemble models such as XGBoost, which are well-suited for the robust classification of outbreak signals. Furthermore, federated learning presents a privacy-preserving solution for collaborative modelling across multiple healthcare institutions, ensuring secure data sharing without compromising patient confidentiality.
Another critical area for future exploration is the development of more advanced frameworks for integrating traditional and non-traditional data streams to enhance early detection capabilities. Additionally, establishing standardised ethical review mechanisms is essential to guide the responsible use of data in outbreak prediction and monitoring systems. Advancements in these areas, particularly in data integration, predictive modelling, and ethical governance, will strengthen global capacities for pandemic preparedness and real-time response.
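The core aggregation step of federated learning can be sketched briefly. In the widely used federated averaging scheme, each institution trains locally and shares only its model parameters, which a coordinator combines in proportion to local sample counts. The sketch below shows that combination step only; the hospital sizes and parameter vectors are invented, and a real deployment would add secure communication and repeated training rounds.

```python
def federated_average(client_updates):
    """Weighted average of client parameter vectors.

    client_updates is a list of (n_samples, weights) tuples; patient-level
    records never leave the client, only the fitted weights are shared.
    """
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [sum(n * w[j] for n, w in client_updates) / total for j in range(dim)]


# Two hypothetical hospitals with 100 and 300 local records.
global_weights = federated_average([
    (100, [1.0, 2.0]),
    (300, [3.0, 6.0]),
])
```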
Acknowledgements
The authors would like to thank Universiti Sains Islam Malaysia (USIM) and the Ministry of Higher Education for the support and facilities provided.
Author contributions
Nurun Nuha: study design, data collection and analysis, and writing the main manuscript. Sakinah Ali and Murtadha Arif: supervision of the student's progress with the assigned research activities, analysis of related matters, and manuscript review. Azni Haslizan: assistance in evaluating the results and manuscript review. Ilfita: contributions related to the medical field and disease, and manuscript review.
Funding
This research was funded by the Ministry of Higher Education (MOHE) Malaysia under the Fundamental Research Grant Scheme (FRGS/1/2023/ICT03/USIM/02/3).
Data availability
No datasets were generated or analysed during the current study.
Declarations
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Abouzahra, A; Sabraoui, A; Afdel, K. Model composition in model driven engineering: a systematic literature review. Inf Softw Technol; 2020; 125, [DOI: https://dx.doi.org/10.1016/J.INFSOF.2020.106316] 106316.
2. Ahmed, I; Ahmad, M; Jeon, G; Piccialli, F. A framework for pandemic prediction using big data analytics. Big Data Res; 2021; [DOI: https://dx.doi.org/10.1016/j.bdr.2021.100190]
3. AlMeslamani, AZ; Sobrino, I; de la Fuente, J. Machine learning in infectious diseases: potential applications and limitations. Ann Med; 2024; [DOI: https://dx.doi.org/10.1080/07853890.2024.2362869]
4. Alassafi, MO; Jarrah, M; Alotaibi, R. Time series predicting of COVID-19 based on deep learning. Neurocomputing; 2022; 468, pp. 335-344. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.10.035]
5. Alqaissi, EY; Alotaibi, FS; Ramzan, MS. Modern machine-learning predictive models for diagnosing infectious diseases. Comput Math Methods Med; 2022; [DOI: https://dx.doi.org/10.1155/2022/6902321]
6. Alsunaidi, SJ; Almuhaideb, AM; Ibrahim, NM; Shaikh, FS; Alqudaihi, KS; Alhaidari, FA; Khan, IU; Aslam, N; Alshahrani, MS. Applications of big data analytics to control covid-19 pandemic. Sensors; 2021; [DOI: https://dx.doi.org/10.3390/s21072282]
7. Assad, DBN; Cara, J; Ortega-Mier, M. Comparing Short-Term Univariate and Multivariate Time-Series Forecasting Models in Infectious Disease Outbreak. Bull Math Biol; 2023; 4526036 [DOI: https://dx.doi.org/10.1007/s11538-022-01112-5]
8. Attanasi ED, Coburn TC. Random Forest. In Daya Sagar BS, Cheng Q, McKinley J, Agterberg F (Eds.), Encyclopedia of Mathematical Geosciences. Springer International Publishing. 2020. pp. 1–4. https://doi.org/10.1007/978-3-030-26050-7_265-1
9. Ayouni, I; Maatoug, J; Dhouib, W; Zammit, N; Fredj, SB; Ghammam, R; Ghannem, H. Effective public health measures to mitigate the spread of COVID-19: a systematic review. BMC Public Health; 2021; [DOI: https://dx.doi.org/10.1186/s12889-021-11111-1]
10. Bai Y. Epidemic Case Prediction of COVID-19: using Regression and Deep based Models. Proceedings-2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence, MLBDBI 2020. 2020. pp. 40–45. https://doi.org/10.1109/MLBDBI51377.2020.00015
11. Chasnov JR. Mathematical biology. 2024. https://LibreTexts.org
12. Cramer, EY; Huang, Y; Wang, Y; Ray, EL; Cornell, M; Bracher, J; Brennen, A; Rivadeneira, AJC; Gerding, A; House, K; Jayawardena, D; Kanji, AH; Khandelwal, A; Le, K; Mody, V; Mody, V; Niemi, J; Stark, A; Shah, A; Reich, NG. The United States COVID-19 Forecast Hub dataset. Sci Data; 2022; [DOI: https://dx.doi.org/10.1038/s41597-022-01517-w]
13. Fan, X-R; Zuo, J; He, W-T; Liu, W. Stacking based prediction of COVID-19 Pandemic by integrating infectious disease dynamics model and traditional machine learning. ACM Int Conf Proc Ser; 2022; [DOI: https://dx.doi.org/10.1145/3561801.3561805]
14. Furstenau, LB; Leivas, P; Sott, MK; Dohan, MS; López-Robles, JR; Cobo, MJ; Bragazzi, NL; Choo, KKR. Big data in healthcare: conceptual network structure, key challenges and opportunities. Dig Commun Netw.; 2023; 9,
15. Gandomi, A; Haider, M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manage; 2015; 35,
16. Ge J, Zhang L, Chen Z, Chen G, Peng J. Simulation analysis of epidemic trend for Covid-19 based on SEIRS model. Proceedings of 2020 IEEE 19th International Conference on Cognitive Informatics and Cognitive Computing, ICCI*CC 2020. 2020. pp. 158–161. https://doi.org/10.1109/ICCICC50026.2020.09450226
17. Geeitha S, Karthikeyan P, Aravinth S, Nachiappan PL, Jagadeesh S. Contagious Disease Prediction using Random Forest Algorithm Interpolated with Fuzzy Model. 2nd International Conference on Sustainable Computing and Data Communication Systems, ICSCDS 2023-Proceedings 2023. pp. 674–679. https://doi.org/10.1109/ICSCDS56580.2023.10104962
18. Gopagoni, D; Lakshmi, PV. Susceptible, infectious and recovered (SIR model) predictive model to understand the key factors of COVID-19 transmission. Int J Adv Comput Sci Appl; 2020; 11,
19. Gounane, S; Barkouch, Y; Atlas, A; Bendahmane, M; Karami, F; Meskine, D. An adaptive social distancing SIR model for COVID-19 disease spreading and forecasting. Epidemiol Methods; 2021; [DOI: https://dx.doi.org/10.1515/em-2020-0044]
20. Guo, K; Shen, C; Zhou, X; Ren, S; Hu, M; Shen, M; Chen, X; Guo, H. Traffic Data-Empowered XGBoost-LSTM Framework for Infectious Disease Prediction. IEEE Trans Intell Transp Syst; 2024; 25,
21. Jang, Y; Lee, H; Park, H. Surveillance system for infectious disease prevention and management: direction of korea’s infectious disease surveillance system. J Korean Med Sci; 2025; 40,
22. Jiao, Z; Ji, H; Yan, J; Qi, X. Application of big data and artificial intelligence in epidemic surveillance and containment. Intell Med; 2023; 3,
23. Jing, S; Qian, Q; She, H; Shan, T; Lu, S; Guo, Y; Liu, Y. A novel prediction method based on artificial intelligence and internet of things for detecting coronavirus disease (COVID-19). Secur Commun Netw; 2021; [DOI: https://dx.doi.org/10.1155/2021/7812223]
24. Elflein J. Coronavirus (COVID-19) cases by country worldwide 2023 | Statista. 2023. https://www.statista.com/statistics/1043366/novel-coronavirus-2019ncov-cases-worldwide-by-country/
25. Khair, NKM; Lee, KE; Mokhtar, M. Community-based monitoring in the new normal: a strategy for tackling the covid-19 pandemic in malaysia. Int J Environ Res Public Health; 2021; [DOI: https://dx.doi.org/10.3390/ijerph18136712]
26. Kim, D-Y; Shinde, SK; Lone, S; Palem, RR; Ghodake, GS; Aleya, L; Gu, W. Personalized medicine review COVID-19 pandemic: public health risk assessment and risk mitigation strategies. J Pers Med; 2021; 11, 1243.
27. Kotsiantis, SB. Decision trees: a recent overview. Artif Intell Rev; 2013; 39,
28. Lee MK, Paik JH, Na IS. Outbreak Prediction of Hepatitis A in Korea based on Statistical Analysis and LSTM Network. 2020 International Conference on Artificial Intelligence in Information and Communication, ICAIIC 2020. 2020. pp. 379–381. https://doi.org/10.1109/ICAIIC48513.2020.9065082
29. Lee, M; Kim, JW; Jang, B. DOVE: an infectious disease outbreak statistics visualization system. IEEE Access; 2018; 6, pp. 47206-47216. [DOI: https://dx.doi.org/10.1109/ACCESS.2018.2867030]
30. Li, K. Disease Prediction Model based on Neural Network ARIMA Algorithm. Int J Adv Comput Sci Appl; 2022; 13,
31. Li, Y; Wang, Y; Ma, K. Integrating Transformer and GCN for COVID-19 Forecasting. Sustainability (Switzerland); 2022; [DOI: https://dx.doi.org/10.3390/su141610393]
32. Li Z, Feng F. Propagation Prediction of COVID-19 Based on Immune Differential Evolution Algorithm and Dynamic DSEIR Model. 2021 IEEE 3rd International Conference on Communications, Information System and Computer Engineering, CISCE 2021. 2021. pp. 578–584. https://doi.org/10.1109/CISCE52179.2021.9445997
33. Li Z, Xia B, Ma L. Prediction and evaluation of the SARS-CoV-2 epidemic using an improved SEIR model. Proceedings-2022 Global Conference on Robotics, Artificial Intelligence and Information Technology, GCRAIT 2022. 2022. pp. 160–168. https://doi.org/10.1109/GCRAIT55928.2022.00042
34. Liao, Z; Lan, P; Fan, X; Kelly, B; Innes, A; Liao, Z. SIRVD-DL: A COVID-19 deep learning prediction model based on time-dependent SIRVD. Comput Biol Med; 2021; [DOI: https://dx.doi.org/10.1016/j.compbiomed.2021.104868]
35. Lipsitch, M; Bassett, MT; Brownstein, JS; Elliott, P; Eyre, D; Grabowski, MK; Hay, JA; Johansson, MA; Kissler, SM; Larremore, DB; Layden, JE; Lessler, J; Lynfield, R; MacCannell, D; Madoff, LC; Metcalf, CJE; Meyers, LA; Ofori, SK; Quinn, C; Grad, YH. Infectious disease surveillance needs for the United States: lessons from Covid-19. Front Public Health; 2024; [DOI: https://dx.doi.org/10.3389/fpubh.2024.1408193]
36. Liu, X-X; Fong, SJ; Dey, N; Crespo, RG; Herrera-Viedma, E. A new SEAIRD pandemic prediction model with clinical and epidemiological data analysis on COVID-19 outbreak. Appl Intell; 2021; 51,
37. Liu Z, Zuo J, Lv R, Liu S, Wang W. Coronavirus Epidemic (COVID-19) Prediction and Trend Analysis Based on Time Series. 2021 IEEE International Conference on Artificial Intelligence and Industrial Design, AIID 2021. 2021. pp. 35–38. https://doi.org/10.1109/AIID51893.2021.9456463
38. Lockwood, C; Munn, Z; Porritt, K. Qualitative research synthesis: methodological guidance for systematic reviewers utilizing meta-aggregation. Int J Evid Based Healthc; 2015; 13,
39. Loh, TP; Horvath, AR; Wang, CB; Koch, D; Lippi, G; Mancini, N; Ferrari, M; Hawkins, R; Sethi, S; Adeli, K. Laboratory practices to mitigate biohazard risks during the COVID-19 outbreak: an IFCC global survey. Clin Chem Laborat Med; 2020; 58,
40. Lu T, Huo S, Huang L. Prediction and Analysis of COVID-19 Epidemic Based on Improved GEP Algorithm to Optimize SEIR Mode. In 2022 4th International Conference on Frontiers Technology of Information and Computer, ICFTIC 2022. 2022. pp. 675–680. https://doi.org/10.1109/ICFTIC57696.2022.10075276
41. Luo Z, Zhang Y, Yin C, Yang M, Li J. Application of ARIMA Model in Infectious Disease Prediction. In Proceedings-2023 5th International Conference on Decision Science and Management, ICDSM 2023. 2023. pp. 3–6. https://doi.org/10.1109/ICDSM59373.2023.00012
42. Meckawy, R; Stuckler, D; Mehta, A; Al-Ahdal, T; Doebbeling, BN. Effectiveness of early warning systems in the detection of infectious diseases outbreaks: a systematic review. BMC Public Health; 2022; [DOI: https://dx.doi.org/10.1186/s12889-022-14625-4]
43. Mehrab, Z; Adiga, A; Marathe, MV; Venkatramanan, S; Swarup, S. Evaluating the utility of high-resolution proximity metrics in predicting the spread of COVID-19. ACM Trans Spat Algor Syst; 2022; [DOI: https://dx.doi.org/10.1145/3531006]
44. Milgroom, MG. Epidemiology and SIR Models. Biol Infect Dis From Mol Ecosyst; 2023; [DOI: https://dx.doi.org/10.1007/978-3-031-38941-2_16]
45. Minta, AA; Ferrari, M; Antoni, S; Portnoy, A; Sbarra, A; Lambert, B; Hatcher, C; Hsu, CH; Ho, LL; Steulet, C; Gacic-Dobo, M; Rota, PA; Mulders, MN; Bose, AS; Caro, WP; O’Connor, P; Crowcroft, NS. Progress toward measles elimination—worldwide 2000–2022. MMWR Morb Mort Weekly Report; 2023; 72,
46. Mohamed Shaffril, HA; Samsuddin, SF; Abu Samah, A. The ABC of systematic literature review: the basic methodological guidance for beginners. Qual Quant; 2021; 55,
47. Mohan, S; Solanki, AK; Taluja, HK; Singh, A. Predicting the impact of the third wave of COVID-19 in India using hybrid statistical machine learning models: a time series forecasting and sentiment analysis approach. Comput Biol Med; 2022; [DOI: https://dx.doi.org/10.1016/j.compbiomed.2022.105354]
48. Murrell H, Murrell D. Estimating Rt from Covid-19 data, using SIR models. 2020. https://mediahack.co.za/datastories/
49. Osaghae, EN; Okokpujie, K; Ndujiuba, C; Okesola, O; Okokpujie, IP. Epidemic alert system: a web-based grassroots model. Int J Electr Comput Eng; 2018; 8,
50. Osama OM, Alakkari K, Abotaleb M, El-Kenawy E-SM. Forecasting Global Monkeypox Infections Using LSTM: a non-stationary time series analysis. In ICEEM 2023-3rd IEEE international conference on electronic engineering. 2023. https://doi.org/10.1109/ICEEM58740.2023.10319532
51. Palaniappan, S; V, R; David, B; S, PN. Prediction of epidemic disease dynamics on the infection risk using machine learning algorithms. SN Comput Sci; 2022; [DOI: https://dx.doi.org/10.1007/s42979-021-00902-3]
52. Qiu Y, Wang Q, Wu Y, Wang G, Zhang W, Wang S, Zhang L, Pu H. Prediction and Analysis of Infectious Diseases Based on M-SIR Model. In 2022 5th International Conference on Pattern Recognition and Artificial Intelligence, PRAI 2022. 2022. pp. 225–234. https://doi.org/10.1109/PRAI55851.2022.9904071
53. Radwan AM, Mortaja W, Alzaq H. Modeling of COVID-19 outbreak in Gaza Strip using SEIR model. In ICCoSITE 2023-International Conference on Computer Science, Information Technology and Engineering: Digital Transformation Strategy in Facing the VUCA and TUNA Era. 2023. pp. 998–1001. https://doi.org/10.1109/ICCoSITE57641.2023.10127789
54. Ramchand R, Ahluwalia SC, Avriette M, Cecchine G, Cooper M, Foran C, Hicks D, Lander N, Lee SD. Syndromic Surveillance 2.0: emerging global surveillance strategies for infectious disease epidemics. 2023. www.rand.org/about/research-integrity.
55. Rauf, HT; Gao, J; Almadhor, A; Arif, M; Nafis, MT. Enhanced bat algorithm for COVID-19 short-term forecasting using optimized LSTM. Soft Comput; 2021; 25,
56. Reproduction Live | Estimating R for Covid-19. 2024. Retrieved January 18, 2024, from https://reproduction.live/
57. Shabbir, MQ; Gardezi, SBW. Application of big data analytics and organizational performance: the mediating role of knowledge management practices. J Big Data; 2020; [DOI: https://dx.doi.org/10.1186/s40537-020-00317-6]
58. Shi Y, Wu K, Zhang M. COVID-19 Pandemic Trend Prediction in America Using ARIMA Model. In Proceedings-2022 International Conference on Big Data, Information and Computer Network, BDICN 2022. 2022. pp. 72–79. https://doi.org/10.1109/BDICN55575.2022.00022
59. Shin, D-R; Chae, G; Park, M. A Study on the Prediction of COVID-19 Confirmed Cases Using Deep Learning and AdaBoost-Bi-LSTM Model. Int J Reliab Qual Saf Eng; 2023; [DOI: https://dx.doi.org/10.1142/S0218539323500316]
60. Souza, J; Leung, CK; Cuzzocrea, A. An innovative big data predictive analytics framework over hybrid big data sources with an application for disease analytics. Adv Intell Syst Comput; 2020; 1151, pp. 669-680. [DOI: https://dx.doi.org/10.1007/978-3-030-44041-1_59]
61. Stapic Z, García E, García-Cabot A, Strahonja V, Stapić Z, García López E, García A, Luis C, Ortega M. Performing systematic literature review in software engineering. 2014. https://www.researchgate.net/publication/267770597
62. Starbuck C. Logistic Regression. In Starbuck C (Ed.) The Fundamentals of people analytics: with applications in R. Springer International Publishing. pp. 223–238. 2023. https://doi.org/10.1007/978-3-031-28674-2_12
63. Tan W, Bian R, Yang W, Hou Y. Analysis of 2019-nCoV epidemic situation based on modified SEIR model and DSGE algorithm. In Proceedings-2020 5th International Conference on Information Science, Computer Technology and Transportation, ISCTT 2020. 2020. pp. 369–376. https://doi.org/10.1109/ISCTT51595.2020.00070
64. Villanustre, F; Chala, A; Dev, R; Xu, L; LexisNexis, JS; Furht, B; Khoshgoftaar, T. Modeling and tracking Covid-19 cases using Big Data analytics on HPCC system platform. J Big Data; 2021; [DOI: https://dx.doi.org/10.1186/s40537-021-00423-z]
65. Wang L, Adiga A, Venkatramanan S, Chen J, Lewis B, Marathe M. Examining Deep Learning Models with Multiple Data Sources for COVID-19 Forecasting. In Proceedings-2020 IEEE International Conference on Big Data, Big Data 2020. 2020. pp. 3846–3855. https://doi.org/10.1109/BigData50022.2020.9377904
66. Wang R, Hu G, Jiang C, Lu H, Zhang Y. Data Analytics for the COVID-19 Epidemic. In Proceedings-2020 IEEE 44th Annual Computers, Software, and Applications Conference, COMPSAC 2020. 2020. pp. 1261–1266. https://doi.org/10.1109/COMPSAC48688.2020.00-83
67. Wang, W; Cai, J; Xu, J; Wang, Y; Zou, Y. Prediction of the COVID-19 infectivity and the sustainable impact on public health under deep learning algorithm. Soft Comput; 2023; 27,
68. World Health Organization. Global tuberculosis report 2021: TB deaths and incidence. Global Tuberculosis Report. 2021. pp. 13–14.
69. World Health Organization. WHO Malaria World Report. World Malaria Report 2021. 2021. https://www.who.int/teams/global-malaria-programme/reports/world-malaria-report-2021
70. Yu, S; Liu, M; Dou, W; Liu, X; Zhou, S. Networking for big data: a survey. IEEE Commun Surv Tutor; 2017; 19,
71. Yu, Y; Zhou, Y; Meng, X; Li, W; Xu, Y; Hu, M; Zhang, J. Evaluation and Prediction of COVID-19 Prevention and Control Strategy Based on the SEIR-AQ Infectious Disease Model. Wirel Commun Mob Comput; 2021; [DOI: https://dx.doi.org/10.1155/2021/1981388]
72. Zhang, Q. Data science approaches to infectious disease surveillance. Philosoph Trans R Soc A Math Phys Eng Sci; 2022; [DOI: https://dx.doi.org/10.1098/rsta.2021.0115]
73. Zhou, T; Zhang, Y. COVID-19 spread prediction model. ACM Int Conf Proc Ser; 2021; [DOI: https://dx.doi.org/10.1145/3448734.3450846]
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).