Malaria remains a major global health challenge, particularly in Brazil’s Legal Amazon region, where environmental and socioeconomic conditions favor disease transmission. Traditional control measures have shown limited effectiveness, emphasizing the need for better predictive approaches to support timely and targeted public health interventions. This study evaluates the performance of six computational models, namely Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), Support Vector Regression (SVR), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Autoregressive Integrated Moving Average (ARIMA), for forecasting weekly malaria cases across multiple states in the Legal Amazon. The results demonstrate that the RF model consistently outperformed the other models, achieving the lowest Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) values in most cases, such as in cluster 02 of the state of Acre, with RMSE of 0.00203 and MAE of 0.00133. The integration of K-means clustering further improved the models’ predictive accuracy by accounting for spatial heterogeneity and capturing localized transmission dynamics. This hybrid modeling approach, combining machine learning models with spatial clustering, offers a promising tool for enhancing malaria surveillance and guiding more effective public health strategies, especially for malaria control efforts in high-risk regions.
Introduction
Malaria is a curable disease caused by parasites of the Plasmodium genus and transmitted by infected female mosquitoes of the Anopheles genus. Malaria disproportionately affects vulnerable populations, including children under five years of age, pregnant women, displaced persons, and indigenous communities, who often face significant barriers to accessing timely diagnosis and treatment [1]. Efforts to combat malaria, including vector control, vaccination programs, and therapeutic interventions, have progressed over the years; however, malaria remains a major global public health challenge due to its high morbidity and mortality rates, particularly in resource-limited settings [2].
In 2023 alone, an estimated 263 million cases of malaria and 597,000 deaths were reported worldwide [3]; additionally, malaria is responsible for over 52 million disability-adjusted life years (DALYs), a composite measure that combines the years of life lost due to premature mortality (YLL) and the years lived with disability (YLD). This substantial DALY burden highlights not only the high mortality rate associated with malaria but also the significant long-term morbidity that affects survivors. The measure underscores the need for effective interventions to mitigate the disease’s impact on public health and quality of life [4,5,6].
In Brazil, the Amazon region bears the highest malaria burden, with 139,884 cases reported in 2023, accounting for approximately 99.98% of the total cases in the country. Malaria transmission is predominantly concentrated in indigenous territories (40.0%), rural areas (33.4%), and mining sites (14.6%), which are often characterized by inadequate housing, poor sanitation, and limited access to healthcare services [7, 8].
Given these challenges, computational modeling has emerged as a promising tool for strengthening surveillance systems and guiding public health responses. Machine learning (ML) and deep learning (DL) models have demonstrated the capacity to process large epidemiological datasets and extract temporal patterns that may not be easily identifiable through traditional statistical approaches [9, 10].
Several recent studies have explored predictive modeling for malaria using various computational approaches. Barboza et al. [11] applied LSTM and GRU networks to malaria incidence data in the state of Amazonas, highlighting the capacity of deep learning to adapt to different levels of temporal variability. Wang et al. [12] proposed an ensemble framework that combines ARIMA with neural networks and Gradient Boosting Regression Trees, achieving superior performance through model stacking. Thomas et al. [13] used Generalized Additive Models (GAMs) and mixed-effects models in Togo, emphasizing the importance of accounting for spatial and climatic variability in disease forecasting. Similarly, Singh et al. [14] utilized ARIMA and Holt’s exponential smoothing to project malaria trends in India through 2030, reinforcing the applicability of traditional models for long-term forecasting.
Machine learning techniques have also shown strong results in outbreak classification contexts. Khan et al. [15] compared multiple algorithms—including XGBoost, Random Forest, and SVM—for outbreak prediction in The Gambia, reporting high accuracy, particularly for ensemble-based approaches. In Senegal, Ileperuma et al. [16] employed Random Forest models informed by satellite-derived climate data, demonstrating the potential of integrating environmental variables for localized prediction. However, while these models are promising, they often depend on external data sources not consistently available across different public health contexts.
Other works have focused on improving deep learning architectures. Naroum et al. [17] found that LSTM outperformed classical ML models for malaria prediction in Cameroon, particularly when no exogenous variables were used. Kamana et al. [18] proposed a LSTM-Seq2Seq model to forecast malaria reemergence in China, successfully capturing long-term dependencies in time series. In Brazil, Laporta et al. [19] used ARIMA-based forecasts to evaluate national progress toward malaria elimination targets, warning that more sophisticated predictive tools may be necessary. Finally, Santangelo et al. [20] conducted a systematic review and confirmed that Random Forest, LSTM, and SVM are among the most robust and commonly used models in infectious disease forecasting.
Despite these advances, there remains a gap in the integration of temporal prediction models with spatial segmentation strategies, particularly within the Brazilian context. Many studies either aggregate data at the national or regional level, or do not account for the local heterogeneity that characterizes malaria transmission across Amazonian municipalities. Furthermore, the reliance on external or complex data inputs can limit real-world applicability in endemic areas with constrained infrastructure.
Our study focuses exclusively on historical malaria notification data, available in Brazil’s official epidemiological surveillance system. By using only routinely collected data, the proposed approach ensures practical applicability in real-world public health scenarios. Our methodology accounts for the heterogeneity of Brazil’s Legal Amazon by using K-means clustering, based on statistical similarity, to aggregate municipalities with similar seasonal characteristics, and by fitting machine learning models to each epidemiological cluster. This strategy reduces intra-cluster variability and improves the predictive accuracy of the models, increasing early warning capacity and supporting more effective and targeted malaria control interventions.
To achieve this, we evaluate the performance of six computational models: Random Forest (RF), Support Vector Regression (SVR), eXtreme Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Autoregressive Integrated Moving Average (ARIMA). These models are applied across different states and clusters within the Legal Amazon. In addition, we assess the impact of spatial segmentation using K-means clustering, hypothesizing that it enhances predictive performance by accounting for localized transmission patterns. This hybrid framework aims to contribute to the design of more precise and data-driven surveillance strategies, particularly in support of national initiatives such as the Elimina Malária Brasil plan.
Materials and methods
Dataset
The data for this study were obtained from the Malaria Epidemiological Surveillance Information System (SIVEP-Malaria), a database maintained by the Brazilian Ministry of Health. SIVEP-Malaria is the official system for the mandatory reporting of malaria cases in the Amazon Region and provides comprehensive data on malaria cases, treatments, and related variables, enabling systematic monitoring and analysis across Brazil [21].
Access to this system is restricted and requires authentication via a user-specific login and password. The data used in this study were obtained following approval from the Research Ethics Committee of FMT-HVD under approval number CAAE.
Although SIVEP-Malaria encompasses all federal units of Brazil, in 2023 over 99.98% of all malaria cases were concentrated in the Legal Amazon [7], which comprises nine states: Acre, Amapá, Amazonas, Maranhão, Mato Grosso, Pará, Rondônia, Roraima, and Tocantins. However, the malaria burden is not evenly distributed across these states. The northwest region, particularly states such as Amazonas, Roraima, and Pará, experiences the highest incidence, whereas Maranhão and Tocantins report significantly fewer cases [7, 8].
This study focuses on the location where malaria cases were officially notified, regardless of whether they were autochthonous or imported. Although the states of Maranhão and Tocantins have a lower number of reported cases compared to other states in the Legal Amazon, they were included in the analysis to provide a comprehensive regional assessment. Each state was analyzed independently, and municipalities were grouped into distinct clusters based on notification patterns. This approach minimizes potential biases in the predictive models and ensures that localized transmission dynamics are accurately captured.
The dataset used in this study includes all confirmed malaria cases reported between 2003 and 2022, aggregated at a weekly granularity. The choice of this extended time frame is essential for training artificial intelligence models, as these models rely on historical data to recognize patterns and learn long-term trends. By incorporating a broad temporal window, the models can better capture seasonal variations, cyclical outbreaks, and long-term shifts in malaria transmission, improving their predictive accuracy.
Figure 1 illustrates the spatial distribution of malaria risk in the Legal Amazon, based on the Annual Parasite Index (API) for the year 2022, an epidemiological indicator that estimates the risk of contracting malaria in a given population. The use of the API allows a more accurate representation of the intensity of disease transmission, considering both case counts and population size. The municipalities identified as being at highest risk were Jacareacanga (Pará), Japurá (Amazonas), Alto Alegre (Roraima), Barcelos (Amazonas), and São Gabriel da Cachoeira (Amazonas). These findings highlight critical areas where malaria transmission remains intense and persistent, underscoring the need for predictive models capable of supporting targeted control strategies in high-risk and often underserved regions.
[IMAGE OMITTED: SEE PDF]
Although the epidemiological landscape has evolved over the past two decades, recent surveillance data confirm that malaria transmission remains concentrated in the same high-burden areas [7, 8]. This validates the use of long-term historical data to enhance predictive modeling while ensuring its relevance for current and future malaria control strategies.
Data preprocessing and feature engineering
Figure 2 provides an overview of the entire data preprocessing workflow designed to create a new dataset that can be utilized in computational models for time-series forecasting.
[IMAGE OMITTED: SEE PDF]
The initial step involved extracting raw data from the SIVEP-Malaria system, which contains malaria notification records, geographic identifiers, and laboratory confirmed test results. Subsequently, the annual datasets were consolidated into a single comprehensive database to streamline data analysis and manipulation.
Following data integration, invalid, duplicate, and negative result entries were identified and removed to prevent redundancy and inconsistencies. In Brazil, the Ministry of Health recommends verifying parasite clearance five days after the initiation of treatment, and these follow-up test results are recorded in the SIVEP-Malaria system. However, in this study, such follow-up entries were excluded to avoid data duplication and minimize potential bias related to recurrence or treatment monitoring. Since the analysis focuses specifically on the Amazon region, records from municipalities outside the Legal Amazon were also excluded, enabling a more geographically targeted and epidemiologically relevant assessment.
After data cleaning, feature selection was performed, keeping only variables relevant to predicting malaria cases. Non-epidemiological attributes, such as administrative metadata unrelated to disease occurrence, were discarded. Date values were standardized by converting them to a DateTime format and grouping weekly, ensuring temporal consistency between records. The variables selected from the SIVEP-Malaria system for this study included: Date of notification, Municipality code of notification and Number of confirmed cases.
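The weekly grouping step can be illustrated with a minimal sketch using only the Python standard library. It assumes that each record represents one confirmed case and that weeks start on Monday; the study's own week convention is not stated here, and the municipality codes `MUN_A` and `MUN_B` are hypothetical placeholders.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def weekly_counts(records):
    """Aggregate (notification_date, municipality_code) records into weekly counts.

    Each record is assumed to represent one confirmed case.
    """
    counts = Counter()
    for iso_date, municipality in records:
        day = date.fromisoformat(iso_date)  # standardize the date value
        counts[(municipality, week_start(day))] += 1
    return dict(counts)


# Hypothetical records: (date of notification, municipality code)
records = [
    ("2022-01-03", "MUN_A"),
    ("2022-01-05", "MUN_A"),
    ("2022-01-10", "MUN_A"),
    ("2022-01-04", "MUN_B"),
]
weekly = weekly_counts(records)
```

Keying the result by municipality and week start keeps one row per municipality-week, which is the shape a weekly forecasting model would consume.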
Subsequent to the initial preprocessing, the dataset was integrated with demographic data provided by the Brazilian Institute of Geography and Statistics (IBGE). This integration step is crucial to enhance the epidemiological analysis, as it allows the calculation of notification rates adjusted for population size. The integration was conducted through a merging operation using standardized municipality codes provided by IBGE, ensuring that each malaria record was accurately linked to its corresponding demographic data.
After merging the datasets, an additional attribute representing the Notification Rate was generated. This rate is used by the K-means clustering algorithm to group municipalities within each state. The Notification Rate is defined as the ratio between confirmed malaria cases and the local population, multiplied by 1000, as presented in the following equation:
$$Notification\;Rate=\left(\frac{Number\;of\;confirmed\;malaria\;cases}{Total\;population\;of\;the\;city}\right)\times 1000$$
This normalization enables fair comparisons of malaria incidence across different regions by accounting for population size disparities. The final output consists of a processed dataset covering the period from 2003 to 2022. All codes used in data preprocessing are available at: https://github.com/dotlab-brazil/Malaria-AmazoniaLegal.
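As a rough illustration of the merge and rate calculation, the sketch below joins case counts and population figures on a municipality code and computes the rate per 1,000 inhabitants. The codes, counts, and populations are hypothetical values chosen for the example, and the dictionary-based join is only a stand-in for whatever merge operation the published code performs.

```python
def notification_rate(confirmed_cases: int, population: int) -> float:
    """Confirmed malaria cases per 1,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000 * confirmed_cases / population


# Hypothetical inputs keyed by municipality code
cases = {"MUN_A": 12, "MUN_B": 3}
population = {"MUN_A": 48000, "MUN_B": 15000, "MUN_C": 9000}

# Inner join on the municipality code: only codes present in both sources are kept
rates = {
    code: notification_rate(cases[code], population[code])
    for code in cases.keys() & population.keys()
}
```

With these numbers, `MUN_A` has a rate of 0.25 and `MUN_B` a rate of 0.2 cases per 1,000 inhabitants, so the municipality with more cases is not automatically the one with the much higher rate.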
The datasets used in this study were obtained from the SIVEP-Malaria system, maintained by the Brazilian Ministry of Health. These data have been fully anonymized and contain aggregated records of confirmed malaria cases reported across municipalities within Brazil’s Legal Amazon region from 2003 to 2022.
The dataset includes the following variables: date of notification, municipality of notification, laboratory test result, and notification count (i.e., the number of confirmed malaria cases per day, by municipality). The processed dataset that supports the findings of this research is publicly available on the Mendeley Data Repository at: https://data.mendeley.com/datasets/9n6b97fsbd/2.
Statistical analysis
K-means clustering
Cluster analysis is a technique used to group samples in a dataset based on shared characteristics [22]. The K-means algorithm is one of the most recognized methods for data clustering, employing unsupervised classification to partition data into a predefined number of clusters, denoted as k. This algorithm operates by evaluating elements based on the Euclidean distance from the cluster centroids [23].
The K-means algorithm begins by randomly selecting initial centroids, which serve as central points for each cluster. Elements are then assigned to clusters based on their proximity to the centroids. This process iterates until convergence. Initially, each element is assigned to the nearest centroid, as described by the following equation:
$${S}_{i}^{(t)}=\left\{{x}_{p}:{\parallel {x}_{p}-{\mu }_{i}^{(t)}\parallel }^{2}\le {\parallel {x}_{p}-{\mu }_{j}^{(t)}\parallel }^{2}\forall j,1\le j\le k\right\}$$
(1)
where:
* S_i^(t) is the set of elements assigned to cluster i at iteration t.
* i indicates the cluster number.
* x_p is an element (observation) being assigned to its nearest centroid.
* µ_i^(t) is the centroid of cluster i at iteration t.
* j indexes the clusters whose centroids are compared with that of cluster i.
* (t) refers to the iteration of the algorithm.
* k is the number of clusters.
After the assignment step, the algorithm updates the centroids by calculating the average of the observations in each cluster.
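The two alternating steps can be sketched in plain Python as follows. This is an illustrative implementation of the standard K-means iteration, not the study's code; the function names, the random initialization with a fixed seed, and the example rates are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Alternate assignment and update steps until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical notification rates for six municipalities (one feature each)
rates = [(0.1,), (0.2,), (0.3,), (5.0,), (5.2,), (5.4,)]
centroids, clusters = kmeans(rates, k=2)
```

On this toy input the low-rate and high-rate municipalities end up in separate clusters of three, with centroids near 0.2 and 5.2.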
Clustering on the standardized Notification Rate allows the identification of groups of municipalities with similar epidemiological patterns, regardless of their geographic location: the K-means algorithm groups municipalities in the same state based on statistical similarity in the behavior of malaria cases rather than geographic proximity. This approach allows for more data-driven segmentation, which may ultimately support more precise public health interventions.
For each state, the optimal number of clusters was determined by the elbow method using the Within-Cluster Sum of Squares (WCSS) as the evaluation metric. This method identifies the point beyond which adding further clusters yields only marginal reductions in WCSS, capturing significant variability while avoiding overfitting due to excessive clustering [24].
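A small sketch of the elbow procedure is given below: WCSS is computed for increasing values of k, and the elbow is the value after which the curve flattens. The one-dimensional K-means, its deterministic initialization at evenly spaced sorted values, and the example rates are all assumptions made for a compact, reproducible illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """K-means on scalars, initialized at evenly spaced sorted values (an assumption)."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-Cluster Sum of Squares: squared distances of members to their centroid."""
    return sum((v - m) ** 2 for m, c in zip(centroids, clusters) for v in c)


# Hypothetical notification rates forming three visibly distinct groups
rates = [0.1, 0.2, 0.3, 5.0, 5.2, 5.4, 9.8, 10.0, 10.2]
curve = {k: wcss(*kmeans_1d(rates, k)) for k in range(1, 6)}
```

For this toy input the WCSS drops sharply up to k = 3 and changes very little afterwards, so k = 3 would be read off as the elbow. In practice the elbow is usually identified by inspecting a plot of this curve.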
Forecasting models
Time series forecasting involves predicting future values based on previously observed data points, taking into account temporal dependencies and patterns such as trends and seasonality [25]. In epidemiology, time series models can be used to anticipate disease outbreaks, allowing public health authorities to allocate resources more effectively and implement timely interventions.
To assess the performance of different approaches for predicting weekly malaria cases, this study evaluates six computational models: LSTM, GRU, SVR, RF, XGBoost, and ARIMA. The models were strategically selected to compare both traditional forecasting approaches and more recent techniques. LSTM and GRU are modern architectures based on Recurrent Neural Networks (RNNs) designed to capture long-term temporal dependencies in time series data [26, 27]. SVR, RF, and XGBoost are traditional machine learning models known for their robustness [28,29,30,31,32,33,34]. Finally, ARIMA is a widely used statistical model for time series forecasting and serves as a reliable baseline for comparison with more complex approaches [35].
Deep learning models: long short-term memory and gated recurrent units
LSTM is a Recurrent Neural Network (RNN) architecture designed to capture long-term dependencies in time series data. LSTMs address the vanishing gradient problem found in traditional RNNs by introducing memory cells regulated by three gates: the input gate, forget gate, and output gate [36]. This mechanism allows LSTMs to retain and selectively forget information over extended periods, making them particularly effective for time series forecasting tasks that require modeling complex temporal dependencies based solely on historical data. In the case of malaria incidence prediction, the disease exhibits intrinsic temporal patterns influenced by seasonality, cyclic behavior, and historical trends, even when external factors are not explicitly included in the model [11, 37]. LSTM networks are well-suited to capture these patterns because they are designed to handle non-linear relationships and long-term dependencies in sequential data [12]. Unlike traditional time series models, such as ARIMA, which assume linearity and require stationarity, LSTMs can model complex and dynamic fluctuations without the need for prior data transformation [38]. This capability makes LSTM a robust choice for predicting malaria cases from historical notification data, enabling the identification of trends and future outbreaks with higher accuracy.
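For reference, the three gates described above are commonly written as follows. This is the standard textbook formulation rather than a specification of the architecture used in this study; the weight matrices W and U, the biases b, the logistic sigmoid σ, and the element-wise product ⊙ are notation introduced here for illustration, with x_t the input, h_t the hidden state, and c_t the memory cell at time t.

$$f_t=\sigma\left(W_f x_t+U_f h_{t-1}+b_f\right)$$
$$i_t=\sigma\left(W_i x_t+U_i h_{t-1}+b_i\right)$$
$$o_t=\sigma\left(W_o x_t+U_o h_{t-1}+b_o\right)$$
$$\tilde{c}_t=\tanh\left(W_c x_t+U_c h_{t-1}+b_c\right)$$
$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t$$
$$h_t=o_t\odot\tanh\left(c_t\right)$$

The forget gate f_t scales the previous cell state, the input gate i_t scales the candidate update, and the output gate o_t controls how much of the cell state is exposed as the hidden state.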
GRUs are a simplified variant of LSTMs, designed to reduce computational complexity while maintaining similar performance. GRUs merge the input and forget gates into a single update gate, making them more efficient in processing without sacrificing accuracy [27]. GRUs have been used successfully in infectious disease prediction, including malaria [27, 39, 40].
Like LSTMs, GRUs are capable of handling long-term dependencies but with fewer parameters, making them suitable for tasks with large datasets or limited computational resources.
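A common formulation of the GRU makes the reduced gate count explicit. The notation below (weights W and U, biases b, sigmoid σ, element-wise product ⊙) is introduced here for illustration and follows one widely used convention; some references swap the roles of z_t and 1 − z_t in the last equation.

$$z_t=\sigma\left(W_z x_t+U_z h_{t-1}+b_z\right)$$
$$r_t=\sigma\left(W_r x_t+U_r h_{t-1}+b_r\right)$$
$$\tilde{h}_t=\tanh\left(W_h x_t+U_h\left(r_t\odot h_{t-1}\right)+b_h\right)$$
$$h_t=\left(1-z_t\right)\odot h_{t-1}+z_t\odot\tilde{h}_t$$

Here the update gate z_t alone decides how much of the previous state is kept and how much of the candidate state is written, and there is no separate memory cell, which is where the parameter savings relative to the LSTM come from.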
Machine learning models: support vector regression, random forest, and eXtreme gradient boosting
SVR is a supervised machine learning technique that models non-linear relationships in time series data. SVR extends the Support Vector Machine (SVM) classification method by estimating a real-valued function, which is useful for continuous predictions, such as forecasting the number of malaria cases [41]. SVR’s ability to handle high-dimensional data and incorporate external variables makes it a valuable tool for complex forecasting tasks [28, 29].
SVR works by constructing a hyperplane that minimizes prediction error and model complexity, relying on a small subset of the training data, known as support vectors [30].
RF is an ensemble learning algorithm that constructs multiple decision trees using random subsets of data and features, combining their predictions for a final output [42]. This approach is robust against overfitting and is well-suited for high-dimensional datasets. In malaria prediction, RF has been used to analyze various factors influencing disease spread and forecast future incidence [31, 32].
XGBoost is a scalable and efficient implementation of gradient boosting algorithms. It is widely recognized for its superior performance in structured data and has been used extensively in time series forecasting and classification problems [34]. XGBoost operates by iteratively adding weak learners (typically decision trees) to minimize the residual errors of the previous models [33].
One of XGBoost’s key advantages is its ability to handle missing data and optimize model complexity using regularization techniques [34]. The algorithm’s computational efficiency and scalability make it attractive for large datasets, allowing rapid iteration and fine-tuning [33].
Autoregressive Integrated Moving Average (ARIMA)
ARIMA is a widely used time series analysis and forecasting technique applicable in fields such as economics, finance, healthcare, and meteorology. The model uses past values of a time series to predict future values by fitting a mathematical model to the data. ARIMA has been employed to forecast various phenomena, including stock prices and disease outbreaks [35].
The ARIMA model consists of three components: autoregression (AR), differencing (I), and moving average (MA). The AR component models the current value of the time series based on its past values. The I component promotes stationarity by differencing the time series, which removes trends and stabilizes its mean. The MA component models the errors as a linear combination of past errors [35].
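The differencing step behind the I component can be shown with a short sketch. The function name and the example series are hypothetical; the point is only that each round of differencing replaces the series with the change between consecutive values.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_(t-1)."""
    for _ in range(d):
        series = [current - previous for previous, current in zip(series, series[1:])]
    return series


# Hypothetical weekly case counts with an accelerating upward trend
weekly_cases = [10, 12, 15, 19, 24]
first = difference(weekly_cases)        # week-to-week changes
second = difference(weekly_cases, d=2)  # changes of the changes
```

Here the first difference still trends upward (2, 3, 4, 5), while the second difference is constant (1, 1, 1), which illustrates why the order of differencing d is chosen according to how much trend the series contains.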
Evaluation metrics
The models were trained and tested using temporal and geospatial subsets of the dataset. Each model’s performance was evaluated based on the following metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) [43].
RMSE measures the square root of the average of the squared differences between actual and predicted values, as defined in the following Eq. 1:
$$RMSE=\sqrt{\frac1T\sum\nolimits_{t=1}^T(y_t-{\widehat y}_t)^2}$$
(1)
where y_t is the actual value, ŷ_t is the value predicted by the model, and T is the number of samples over which the errors are computed [44].
Conversely, the MAE provides a straightforward calculation of the average absolute differences between actual and predicted values, as expressed in the following equation:
$$MAE=\frac1n\sum\nolimits_{i=1}^n\left|y_i-{\hat y}_i\right|$$
(2)
MAE is generally less sensitive to outliers than RMSE, as it does not square the errors. This characteristic makes it useful for providing a direct measure of the magnitude of prediction errors without excessively penalizing larger discrepancies.
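The difference in outlier sensitivity can be checked with a small worked example. The two functions below implement the RMSE and MAE definitions directly; the two sets of predictions are hypothetical and constructed so that the metrics disagree about which one is better.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [10.0, 10.0, 10.0, 10.0]
steady = [12.0, 12.0, 12.0, 12.0]  # always off by 2
spiky = [10.0, 10.0, 10.0, 16.0]   # exact three times, off by 6 once

steady_scores = (rmse(actual, steady), mae(actual, steady))
spiky_scores = (rmse(actual, spiky), mae(actual, spiky))
```

The steady predictions give RMSE = 2 and MAE = 2, while the spiky predictions give RMSE = 3 and MAE = 1.5. A single large error is enough for the two metrics to rank the predictions differently, which is why both are reported.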
When assessing time series forecasting models, it is important to compare RMSE and MAE within the context of the specific problem. Ideally, both metrics should be minimized to achieve accurate and reliable predictions. These metrics offer insights into model accuracy, enabling an assessment of how closely predicted values align with actual values and providing an understanding of the average magnitude of errors. This information is vital for selecting the most appropriate computational model to ensure dependable and precise predictions.
By comparing these metrics across models, the study aims to identify the most effective predictive approach for malaria cases. This evaluation also considers how geospatial clustering using K-means impacts model performance by accounting for regional transmission patterns.
Models’ configuration
To find the best configuration for each model, a holdout validation method was used, where 80% of the historical data was allocated for training and 20% for testing. Each technique was evaluated in 30 iterations to ensure statistically robust results, with the average RMSE and MAE across iterations used as performance indicators.
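A minimal sketch of this evaluation scheme is shown below. It assumes the 80/20 holdout is chronological (earlier weeks for training, later weeks for testing), which the text does not state explicitly, and it uses a placeholder scoring function in place of actually training a model. All names are hypothetical.

```python
import random
from statistics import mean


def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into an earlier training part and a later test part."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def average_over_runs(run_once, runs=30):
    """Call run_once(seed) once per run and return the mean of the returned scores."""
    return mean([run_once(seed) for seed in range(runs)])


weekly_series = list(range(100))  # placeholder for 100 weekly values
train, test = chronological_split(weekly_series)


def placeholder_run(seed):
    # Stand-in for "train a model with this seed and return its test RMSE".
    rng = random.Random(seed)
    return 1.0 + rng.uniform(-0.1, 0.1)


mean_score = average_over_runs(placeholder_run)
```

Averaging over seeded runs matters mainly for models with random components, such as RF or neural networks, where a single run may be unrepresentative.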
Computational models require the configuration of multiple hyperparameters, which are crucial for their performance. However, manual tuning of these parameters is often impractical due to the vast search space. To optimize hyperparameters for each model, we employed two strategies: Grid Search and Optuna.
Grid Search is an exhaustive search technique that systematically trains and evaluates models across all possible combinations of hyperparameters within a predefined search space. The combination that delivers the best performance is selected as the optimal configuration [45, 46].
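The exhaustive search can be sketched in a few lines. The grid below uses illustrative hyperparameter names and values, not those of Table 1, and `toy_validation_error` is a hypothetical stand-in for training a model and measuring its validation error.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid and return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative search space (hypothetical values)
grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}


def toy_validation_error(params):
    # Placeholder objective with its minimum at n_estimators=100, max_depth=5
    return abs(params["n_estimators"] - 100) / 100 + abs(params["max_depth"] - 5) / 5


best, score = grid_search(grid, toy_validation_error)
```

The number of evaluations is the product of the list lengths (nine here), which is why grid search becomes expensive as more hyperparameters or values are added.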
Optuna was specifically used to optimize the deep learning models (LSTM and GRU). It is an open-source framework designed for hyperparameter optimization that allows users to define the search space dynamically. Optuna searches for the hyperparameter values that minimize or maximize an objective function. To avoid spending computation on poor configurations, Optuna also supports pruning, which halts unpromising trials early in the process [47].
Table 1 summarizes the hyperparameters used during the optimization process for all models. This dual approach to hyperparameter tuning ensures that each model achieves its best possible performance while maintaining generalizability and robustness.
[IMAGE OMITTED: SEE PDF]
Results
Epidemiological clustering
To enhance the accuracy of epidemic forecasting and facilitate a more localized public health response, we employed a clustering approach to group municipalities with similar epidemiological trends. This method allows for a more granular analysis, reducing variability within each group and improving the precision of the predictive models.
The clustering process followed two main steps for each of the nine states analyzed. First, we applied the elbow method to determine the optimal number of clusters. This technique evaluates the trade-off between cluster compactness and the number of clusters, selecting a value of K that balances the gain in explanatory power with model simplicity. Second, the K-means algorithm was used to group municipalities into clusters based on their historical malaria notification rates, ensuring that regions within the same state with comparable transmission dynamics were analyzed together.
The elbow method was systematically applied to all nine states of the Brazilian Legal Amazon. As an illustrative example, Fig. 3 shows the elbow plot for the state of Amazonas, where the optimal number of clusters was determined to be five. This result reflects the complex dynamics of malaria transmission in the state, which justified a finer stratification.
[IMAGE OMITTED: SEE PDF]
In the remaining states, the elbow method indicated three or four optimal clusters, as summarized in Table 2. Specifically, the states of Acre, Amapá, Pará, and Roraima were divided into three clusters, while Maranhão, Mato Grosso, Rondônia, and Tocantins were segmented into four clusters. Figure 4 presents the geographic division of the clusters by state. A detailed list of municipalities assigned to each cluster is provided in Appendix A.
[IMAGE OMITTED: SEE PDF]
[IMAGE OMITTED: SEE PDF]
It is worth noting that, during the clustering process, some clusters consisted of only a single municipality. The K-means algorithm groups observations based on statistical similarity in temporal incidence patterns, rather than geographic proximity or numeric balance. In cases where a municipality displayed a highly distinct epidemiological profile—either due to exceptionally high case numbers, unique seasonality, or divergent trends over time—it naturally formed a cluster on its own. Maintaining these isolated clusters was essential to preserve the integrity of the temporal signal and avoid diluting distinctive patterns within larger, heterogeneous groups. Furthermore, no clusters were considered statistically insignificant, as all showed relevant internal cohesion and contributed meaningfully to the predictive modeling.
Forecasting models
The models were assessed to identify those with the highest predictive accuracy, supporting timely and data-driven public health interventions. The analysis includes comparisons at both the state and cluster levels, highlighting the benefits of incorporating spatial segmentation into the predictive process. Table 3 presents the performance of the models for state-level predictions, while Table 4 shows the results for clustering through K-means.
[IMAGE OMITTED: SEE PDF]
[IMAGE OMITTED: SEE PDF]
The results indicate that the RF model consistently outperformed other models across most states. Specifically, RF achieved the lowest RMSE and MAE values in eight out of nine states analyzed. For instance, in the state of Amazonas (see Fig. 5), RF obtained an RMSE of 0.00570 and an MAE of 0.00120, outperforming all other models. Similarly, in Maranhão, RF achieved an RMSE of 0.00089 and an MAE of 0.00039, further demonstrating its superior predictive capability.
[IMAGE OMITTED: SEE PDF]
The only exception was observed in the state of Roraima (see Fig. 6), where the SVR model showed the best performance with an RMSE of 0.00374 and an MAE of 0.00320. This suggests that while RF is highly effective in most regions, SVR can outperform it in areas with distinct data characteristics.
[IMAGE OMITTED: SEE PDF]
The integration of K-means clustering further enhanced the predictive capabilities of the models by accounting for spatial heterogeneity in malaria transmission. The clustered analysis, presented in Table 4, demonstrated that the RF model continued to perform best in most clusters. For instance, in the states of Acre, Maranhão, Pará, Rondônia, and Tocantins, RF was the most effective model across all clusters.
Nevertheless, in certain situations, the SVR model exhibited superior performance. For example, in Cluster 02 of Amapá (see Fig. 7) and Cluster 03 of Mato Grosso, SVR outperformed other models in both RMSE and MAE metrics. This improved performance can be attributed to the specific characteristics of these clusters. Unlike other regions where malaria incidence follows distinct seasonal patterns, these clusters demonstrated more stable and linear temporal trends, which are better suited for SVR’s kernel-based regression approach. The SVR model excels at capturing linear relationships within the data while maintaining robustness in small datasets, making it particularly effective in regions with lower case variability and gradual fluctuations over time.
[IMAGE OMITTED: SEE PDF]
In Cluster 04 of Mato Grosso, SVR had the lowest RMSE (0.00768), while RF obtained the lowest MAE (0.00631). This discrepancy follows from what each metric measures. RMSE squares the errors before averaging, so it penalizes large errors more severely and is more sensitive to outliers; the lower RMSE of SVR therefore suggests that it avoided large individual deviations. MAE averages the absolute errors without amplifying large deviations; the lower MAE of RF suggests that its typical error was smaller, even though occasional larger misses raised its RMSE.
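The way RMSE and MAE can rank two models differently can be illustrated with a minimal sketch. The series and the two sets of predictions below are hypothetical values chosen for illustration, not data or model outputs from this study.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: residuals are squared before averaging."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error: residual magnitudes are averaged without squaring."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical normalized weekly incidence for one cluster (illustrative only).
observed = [0.010, 0.012, 0.011, 0.013, 0.012, 0.014]

# Hypothetical model A: a small error of the same size in every week.
model_a = [value + 0.004 for value in observed]

# Hypothetical model B: exact in five weeks, one large miss in the last week.
model_b = observed[:5] + [observed[5] + 0.012]

for name, predicted in (("A", model_a), ("B", model_b)):
    print(f"model {name}: RMSE={rmse(observed, predicted):.5f} "
          f"MAE={mae(observed, predicted):.5f}")
```

In this toy case, model A has the lower RMSE because it never makes a large error, while model B has the lower MAE because most of its predictions are exact. Reporting both metrics, as done in Table 4, exposes this kind of trade-off.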
It is important to note that in the state of Tocantins, Clusters 01, 03, and 04 did not present RMSE and MAE values. This occurred because no confirmed malaria cases were reported in the municipalities within these clusters during the evaluation period. Consequently, there were no cases for the models to predict in these clusters, and the resulting perfect scores reflect the absence of cases rather than predictive skill.
The results from the clustered analysis indicate that incorporating spatial segmentation allows models to better capture the heterogeneity of malaria transmission dynamics, reducing prediction errors and enhancing model precision.
For example, in the state of Amazonas, the RF model achieved an RMSE of 0.00570 and an MAE of 0.00120 when predicting malaria cases at the state level, already demonstrating excellent predictive performance. However, when the data were segmented into five distinct clusters within Amazonas, the RF model achieved even lower RMSE and MAE values in several clusters. Specifically, in Cluster 01 of Amazonas, the RF model reached an RMSE of 0.00010 and an MAE of 0.00006, indicating a substantial improvement in prediction accuracy.
These results highlight how spatial clustering enhances the model’s ability to identify localized transmission patterns, improving sensitivity to regional epidemiological variations that may be masked when analyzing the state as a single unit. Consequently, combining machine learning models with spatial clustering techniques offers a way to refine disease prediction and support more targeted and efficient public health interventions.
The overall results of the predictive models highlight the robustness and effectiveness of the RF model for forecasting malaria cases in the Brazilian Legal Amazon, particularly when combined with spatial clustering techniques. Additionally, the SVR model demonstrated superior predictive performance in specific contexts, particularly in clusters characterized by lower variability and smaller populations, where simpler, near-linear relationships tend to dominate.
Discussion
Epidemiological interpretation
The implementation of spatial clustering, using the K-means algorithm, enabled the identification of municipalities and regions with similar epidemiological patterns. For this purpose, the municipality of case notification was adopted as the spatial reference for clustering. This choice is consistent with the operational logic of the health system, where surveillance activities and intervention logistics are typically organized based on where cases are officially reported. While the place of infection provides important epidemiological context, the place of notification is often more relevant for informing public health logistics and operational planning. This is particularly true for interventions such as the distribution of medications, diagnostic tests, and the coordination of local control measures. In practice, the location of infection and the location of notification are frequently the same, minimizing potential discrepancies. However, from a health systems perspective, prioritizing the notification site aligns more directly with how resources are deployed and how responses are implemented at the ground level. Therefore, we maintain that the notification location offers a more practical basis for public health action.
This methodological approach not only improved model accuracy but also enhanced the granularity of the analysis, providing customized forecasts aligned with local transmission dynamics. These findings reinforce the potential of combining machine learning techniques with spatial clustering to support precision public health strategies, contributing to more timely and efficient interventions.
For example, the state of Amazonas is the largest in Brazil, covering over 1,559,255 square kilometers and encompassing 62 municipalities, with an estimated population of 4.2 million inhabitants, mostly concentrated in urban centers such as Manaus, but with significant rural and indigenous populations spread across vast, hard-to-reach areas [48]. Given its geographic and demographic complexity, statewide predictive models may fail to capture micro-regional differences in transmission patterns. In this study, the clustering of municipalities based on statistical similarity of malaria incidence time series significantly improved predictive performance compared to a single aggregated model for the entire state.
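The idea of grouping municipalities by the similarity of their incidence time series can be sketched as follows. This is a minimal pure-Python version of K-means (Lloyd's algorithm), not the implementation used in this study; the municipality names, the six-week series, the choice of k = 3, and the deterministic farthest-point initialization are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length series."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(series, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    # Assumption: start from the first series, then repeatedly add the
    # series that lies farthest from the centroids chosen so far.
    centroids = [list(series[0])]
    while len(centroids) < k:
        farthest = max(series, key=lambda s: min(euclidean(s, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each series joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: euclidean(s, centroids[j]))
                      for s in series]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(series, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels, centroids


def wcss(series, labels, centroids):
    """Within-Cluster Sum of Squares, the quantity inspected in the elbow method."""
    return sum(euclidean(s, centroids[lab]) ** 2 for s, lab in zip(series, labels))


# Hypothetical weekly incidence series per municipality (illustrative only).
municipal_series = {
    "muni_a": [0.00, 0.00, 0.01, 0.00, 0.00, 0.00],  # near-zero transmission
    "muni_b": [0.00, 0.01, 0.00, 0.00, 0.01, 0.00],
    "muni_c": [0.20, 0.35, 0.50, 0.45, 0.30, 0.15],  # seasonal peak
    "muni_d": [0.25, 0.40, 0.55, 0.40, 0.25, 0.10],
    "muni_e": [0.80, 0.82, 0.79, 0.81, 0.80, 0.83],  # persistently high
    "muni_f": [0.78, 0.80, 0.81, 0.79, 0.82, 0.80],
}

names = list(municipal_series)
data = [municipal_series[name] for name in names]
labels, centroids = kmeans(data, k=3)

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
print("WCSS:", wcss(data, labels, centroids))
```

Each resulting group could then be given its own forecasting model, which is the step that separates the clustered analysis from a single statewide model. In practice the number of clusters would be selected from the data, for example by comparing WCSS across candidate values of k.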
This is particularly relevant for the operationalization of the Elimina Malária Brasil Plan, which emphasizes targeted and differentiated strategies based on epidemiological risk classification [49]. By segmenting areas into epidemiologically homogeneous clusters, health authorities can adopt proactive interventions to prevent possible outbreaks, such as intensifying indoor residual spraying or increasing the distribution of insecticide-treated nets in high-risk areas [7, 49].
Furthermore, spatial clustering and predictive modeling present themselves as a tool for more efficient allocation of limited health resources. The identification of clusters with increasing case trends allows the early mobilization of Rapid Response Teams, aligned with the Diagnosis, Treatment, Investigation and Response (DTI-R) strategy advocated by the Brazilian Ministry of Health [49]. This approach can prevent the escalation of outbreaks, particularly in special areas like indigenous lands, rural settlements, and mining zones, where malaria transmission remains persistent and often under-reported [7].
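One simple way to operationalize the identification of clusters with increasing case trends is to fit a straight line to each cluster's recent forecast and flag those whose slope exceeds a threshold. The sketch below is an assumption about how such a rule might look, not a procedure from this study or from the DTI-R strategy; the cluster names, forecast values, and the `min_slope` threshold are hypothetical.

```python
def weekly_slope(values):
    """Ordinary least-squares slope of values against week index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def flag_rising_clusters(forecasts, min_slope):
    """Return the names of clusters whose forecast slope exceeds min_slope."""
    return sorted(name for name, values in forecasts.items()
                  if weekly_slope(values) > min_slope)


# Hypothetical four-week forecasts of normalized incidence (illustrative only).
forecasts = {
    "cluster_01": [0.010, 0.011, 0.010, 0.009],  # roughly flat
    "cluster_02": [0.010, 0.014, 0.019, 0.025],  # rising
    "cluster_03": [0.030, 0.024, 0.018, 0.012],  # falling
}

# Hypothetical threshold: flag increases above 0.001 per week.
rising = flag_rising_clusters(forecasts, min_slope=0.001)
print(rising)
```

Any threshold of this kind would need to be calibrated against local epidemiological knowledge before it is used to trigger a response.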
Additionally, this methodological approach supports epidemiological surveillance systems, enabling the transition from passive to active and predictive surveillance — a recommendation of the WHO Global Technical Strategy for Malaria (GTS) 2016–2030 [50, 51]. By providing accurate forecasts of malaria incidence, these models empower local decision-makers to design integrated control strategies, contributing to Brazil’s goal of eliminating malaria by 2035 [49].
Public health implications
The findings of this study provide relevant contributions to the ongoing efforts for malaria control and elimination in Brazil, particularly in the Brazilian Legal Amazon, where more than 99% of the country’s malaria cases are concentrated [7]. The proposed methodological framework, which integrates ML models with spatial clustering techniques, demonstrates potential for augmenting public health surveillance systems within the Unified Health System (SUS). By utilizing algorithms such as Random Forest and K-means clustering, the approach facilitates a refined analysis of epidemiological patterns across states. This enables greater early detection capacity for malaria outbreaks and supports the design of targeted interventions.
The integration of predictive modeling and spatial analysis enables the identification of areas with similar epidemiological characteristics, supporting the strategic stratification advocated by the Elimina Malária Brasil: Plano Nacional de Eliminação da Malária [49]. This plan highlights the need for targeted interventions according to epidemiological risk levels, with special attention to vulnerable territories such as indigenous lands, illegal mining zones, and cross-border regions [49]. The ability to identify clusters with higher predicted incidence rates allows health authorities to better implement the DTI-R strategy (Diagnóstico, Tratamento, Investigação e Resposta), enabling rapid and localized responses to minimize transmission risks [49].
Moreover, the application of these computational models within the SUS facilitates more precise geographic targeting of interventions. Accurate forecasts of malaria incidence support health managers in anticipating demands for diagnostic supplies, antimalarial treatments, and vector control actions, such as the distribution of insecticide-treated nets (ITNs) and implementation of indoor residual spraying (IRS) [7]. This is particularly relevant in remote and hard-to-reach areas of the Amazon region, where logistical challenges often delay timely interventions.
The predictive capability demonstrated by the Random Forest and SVR models, when combined with spatial clustering, aligns with WHO recommendations on surveillance in malaria control and elimination programs [51]. The WHO GTS 2016–2030 highlights the importance of transforming malaria surveillance into a core intervention and leveraging data for evidence-based decision-making [50, 51]. In this sense, predictive modeling can support the transition from passive to proactive surveillance systems, enhancing the capacity for early warning and rapid response.
In addition, the use of spatial clustering and predictive models can inform cross-border coordination efforts, particularly in regions such as Roraima and Amazonas, where malaria transmission is influenced by migratory flows from neighboring countries like Venezuela [19]. Strengthening regional cooperation and data sharing is crucial to address cross-border malaria and contribute to the broader goal of malaria elimination in the Americas.
Despite the potential benefits, the practical implementation of these models within the SUS requires overcoming several challenges. Establishing the technical infrastructure necessary for real-time data analysis and processing is a key consideration. Furthermore, the effective use of predictive tools depends on comprehensive training programs for healthcare professionals, ensuring the appropriate interpretation and application of the generated information [52].
The integration of predictive modeling and spatial analysis enhances the capacity of malaria surveillance systems to support timely and effective public health interventions. These tools contribute to Brazil’s efforts to meet the malaria elimination targets set for 2035 and align with global strategies for malaria control and elimination [49, 51].
Conclusion
This study evaluated the use of computational models to forecast weekly malaria incidence across states of the Brazilian Legal Amazon, aiming to support decision-making with accurate and timely predictions. By incorporating historical epidemiological data and spatial segmentation through K-means clustering, the proposed framework adapts to local transmission dynamics and enhances predictive performance.
Six computational models were evaluated for forecasting malaria incidence, with RF consistently outperforming the others across most states. The integration of K-means clustering allowed the models to capture spatial heterogeneity, enhancing predictive accuracy by adapting to localized transmission dynamics. In particular, RF exhibited exceptional performance in various clusters, although the SVR model showed superior results in specific contexts, such as Roraima.
An important observation emerged in the state of Tocantins, where Clusters 01, 03, and 04 presented RMSE and MAE values of zero due to the absence of confirmed malaria cases during the evaluation period. This finding emphasizes the importance of dynamic modeling approaches capable of adapting to regions with very low or null disease prevalence.
The methodological approach of using statistical clustering to segment each state based on temporal similarity proved to be a viable and scalable alternative for public health decision-making. This strategy enables the identification of high-risk areas and supports the development of more targeted and effective interventions. By combining machine learning techniques with spatial segmentation, public health authorities can optimize resource allocation and implement proactive measures to mitigate malaria transmission in vulnerable populations.
In summary, the proposed framework offers a practical and scalable solution for strengthening malaria surveillance in endemic regions. Its ability to operate using only routinely collected notification data ensures its feasibility for integration into existing systems, particularly within the Brazilian Unified Health System (SUS). As Brazil advances toward its malaria elimination goals, predictive tools such as the one proposed in this study can serve as essential components in the design of timely, localized, and evidence-based public health actions.
Data availability
The data that support the findings of this study are available from FMT-HVD, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of Vanderson Sampaio.
Notes
1. http://sivepmalaria.saude.gov.br/sivepmalaria/
2. https://www.ibge.gov.br/
Abbreviations
API: Annual Parasite Index
ARIMA: Autoregressive Integrated Moving Average
DALYs: Disability-Adjusted Life Years
DL: Deep Learning
DTI-R: Diagnóstico, Tratamento, Investigação e Resposta
GAMs: Generalized Additive Models
GRU: Gated Recurrent Units
GTS: Global Technical Strategy (for Malaria)
IBGE: Instituto Brasileiro de Geografia e Estatística
IRS: Indoor Residual Spraying
ITNs: Insecticide-Treated Nets
LSTM: Long Short-Term Memory
MAE: Mean Absolute Error
ML: Machine Learning
RF: Random Forest
RMSE: Root Mean Squared Error
RNNs: Recurrent Neural Networks
SIVEP-Malária: Sistema de Informação de Vigilância Epidemiológica da Malária
SUS: Sistema Único de Saúde
SVM: Support Vector Machine
SVR: Support Vector Regression
WHO: World Health Organization
XGBoost: eXtreme Gradient Boosting
YLD: Years Lived with Disability
YLL: Years of Life Lost
WCSS: Within-Cluster Sum of Squares
References
1. Ministério da Saúde SDVES. Panorama epidemiológico da malária em 2021: buscando o caminho para a eliminação da malária no Brasil (Ministério da Saúde, Brazil, 2022). URL https://www.gov.br/saude/pt-br/centrais-de-conteudo/publicacoes/boletins/epidemiologicos/edicoes/2022/boletim-epidemiologico-vol-53-no17.pdf.
2. World Health Organization. World malaria report 2022. 2022. URL https://www.who.int/publications/i/item/9789240064898. Accessed 4 Feb 2024.
3. World Health Organization. World malaria report 2024. 2024. https://www.who.int/publications/malaria-report-2024. Accessed May 2025.
4. Pattanayak SK, Pakhtigian EL, Litzow EL. Through the looking glass: environmental health economics in low and middle income countries. 2018.
5. Ashley EA, Phyo AP, Woodrow CJ. Malaria. Lancet. 2018;391:1608–21.
6. Tapajós R, et al. Malaria impact on cognitive function of children in a peri-urban community in the Brazilian Amazon. Malar J. 2019;18:1–12.
7. Ministério da Saúde SDVESEA. Boletim epidemiológico: Caracterização da malária em áreas especiais da região amazônica. Tech. Rep. 14, Ministério da Saúde, Brasil. 2024. URL https://www.gov.br/saude/pt-br/assuntos/boletins-epidemiologicos.
8. Ministério da Saúde SDVESEA. Boletim epidemiológico: Dia da malária nas Américas – um panorama da malária no Brasil em 2022 e no primeiro semestre de 2023. Tech. Rep. 1, Ministério da Saúde, Brasil. 2024. URL https://www.gov.br/saude/pt-br/assuntos/boletins-epidemiologicos.
9. Mundo TN. A agenda 2030 para o desenvolvimento sustentável. Recuperado Em. 2016;15:24.
10. World Health Organization. Malaria eradication: benefits, future scenarios and feasibility. 2019.
11. Barboza MFX, et al. Prediction of malaria using deep learning models: a case study on city clusters in the state of Amazonas, Brazil, from 2003 to 2018. Rev Sociedade Bras Med Trop. 2022;55:e0420.
12. Wang M, Wang H, et al. A novel model for malaria prediction based on ensemble algorithms. PLoS ONE. 2019;14:e0226910.
13. Thomas A, et al. Exploring malaria prediction models in Togo: a time series forecasting by health district and target group. BMJ Open. 2024;14:e066547.
14. Singh MP, et al. Time series analysis of malaria cases to assess the impact of various interventions over the last three decades and forecasting malaria in India towards the 2030 elimination goals. Malar J. 2024;23:50.
15. Khan O, et al. Predicting malaria outbreak in the Gambia using machine learning techniques. PLoS ONE. 2024;19:e0299386.
16. Ileperuma K, et al. Predicting malaria prevalence with machine learning models using satellite-based climate information: technical report. Tech. Rep., International Water Management Institute. 2023. URL https://www.cgiar.org/initiative/climate-resilience/.
17. Naroum E, et al. Comparative analysis of deep learning and machine learning techniques for forecasting new malaria cases in Cameroon’s Adamaoua region. Intell-Based Med. 2025;11:100220.
18. Kamana E, et al. Predicting the impact of climate change on the re-emergence of malaria cases in China using LSTMSeq2Seq deep learning model. BMJ Open. 2022;12:e053922.
19. Laporta GZ, et al. Reaching the malaria elimination goal in Brazil: a spatial analysis and time-series study. Infect Dis Poverty. 2022;11:39.
20. Santangelo OE, et al. Machine learning and prediction of infectious diseases: a systematic review. Mach Learn Knowl Extraction. 2023;5:175–98.
21. Ministério da Saúde do Brasil. SIVEP-Malaria: Sistema de Informação de Vigilância Epidemiológica da Malária. 2025. http://sivepmalaria.saude.gov.br/sivepmalaria/. Accessed Apr 2025.
22. Kopec D. Classic computer science problems in Python. Shelter Island: Manning Publications Co; 2019.
23. Arora P, Varshney S, et al. Analysis of k-means and k-medoids algorithm for big data. Procedia Comput Sci. 2016;78:507–12.
24. Cui M, et al. Introduction to the k-means clustering algorithm based on the elbow method. Account Auditing Finance. 2020;1:5–8.
25. Lim B, Zohren S. Time-series forecasting with deep learning: a survey. Phil Trans R Soc A. 2021;379:20200209.
26. Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019;31:1235–70.
27. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. 2014. arXiv preprint arXiv:1412.3555.
28. Speiser JL, Miller ME, Tooze J, Ip E. A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst Appl. 2019;134:93–101.
29. Biau G, Scornet E. A random forest guided tour. TEST. 2016;25:197–227.
30. Zhang F, O’Donnell LJ. Support vector regression. 2020.
31. Zacarias OP, Boström H. Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. 2013.
32. Chen H, Wu L, Chen J, Lu W, Ding J. A comparative study of automated legal text classification using random forests and deep learning. Inf Process Manage. 2022;59:102798.
33. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. 2016.
34. Lv CX, An SY, Qiao BJ, Wu W. Time series analysis of hemorrhagic fever with renal syndrome in mainland China by using an XGBoost forecasting model. BMC Infect Dis. 2021;21:1–13.
35. Neily N, Ammar BB, Kammoun HM. Prediction of COVID-19 active cases using polynomial regression and ARIMA models. 2021.
36. Staudemeyer RC, Morris ER. Understanding LSTM–a tutorial into long short-term memory recurrent neural networks. 2019. arXiv preprint arXiv:1909.09586.
37. Santosh T, Ramesh D, Reddy D. LSTM based prediction of malaria abundances using big data. Comput Biol Med. 2020;124:103859.
38. Adeyeye J, Nkemnole E. Predicting malaria incident using hybrid SARIMA-LSTM model. Int J Math Sci Optim: Theory Appl. 2023;9:123–37.
39. Cho K, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. 2014. arXiv preprint arXiv:1406.1078.
40. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv preprint arXiv:1409.0473.
41. Sharifzadeh M, Sikinioti-Lock A, Shah N. Machine-learning methods for integrated renewable power generation: a comparative study of artificial neural networks, support vector regression, and Gaussian process regression. Renew Sustain Energy Rev. 2019;108:513–38.
42. Chi CM, Vossler P, Fan Y, Lv J. Asymptotic properties of high-dimensional random forests. Ann Stat. 2022;50:3415–38.
43. Chai T, Draxler RR. Root mean square error (RMSE) or mean absolute error (MAE)?–arguments against avoiding RMSE in the literature. Geosci Model Dev. 2014;7:1247–50.
44. Willmott CJ, Matsuura K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res. 2005;30:79–82.
45. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13:281.
46. Wu J, et al. Hyperparameter optimization for machine learning models based on Bayesian optimization. J Electron Sci Technol. 2019;17:26–40.
47. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: a next-generation hyperparameter optimization framework. 2019.
48. Instituto Brasileiro de Geografia e Estatística (IBGE). Area territorial - Amazonas 2022. 2022. https://www.ibge.gov.br/geociencias/organizacao-do-territorio/estrutura-territorial/15761-areas-dos-municipios.html?t=acesso-ao-produto&c=13. Accessed 12 Jan 2025.
49. Ministério da Saúde SDVES. Elimina Malária Brasil: Plano nacional de eliminação da malária. Tech. Rep., Ministério da Saúde, Brasil. 2022. URL https://www.gov.br/saude/pt-br/assuntos/saude-de-a-a-z/m/malaria.
50. World Health Organization. Global technical strategy for malaria 2016–2030. United Kingdom: World Health Organization; 2015.
51. World Health Organization. World Malaria Report 2024: addressing inequity in the global malaria response (World Health Organization, Geneva, Switzerland, 2024). URL https://www.who.int/publications/i/item/9789240104440.
52. World Health Organization. Ethics and governance of artificial intelligence for health: WHO guidance. 2021. Geneva: WHO. Available at: https://www.who.int/publications/.
© 2025. This work is licensed under http://creativecommons.org/licenses/by-nc-nd/4.0/.