Groundwater samples were collected from 45 wells across three regions in the United Arab Emirates (UAE): Jabel Hafeet, Fujairah, and Ras Al Khaimah. Their major-ion chemistry and radon activity were analyzed to characterize hydrogeochemical facies. Machine learning (ML) techniques were employed to impute missing chloride and sulfate concentrations. The most accurate models were a tuned Random Forest and Extra-trees. The resultant complete dataset was subjected to unsupervised K-means clustering. Mapping the clusters using the GeoZ library revealed distinct spatial patterns related to different geological settings. Most Fujairah and Ras Al Khaimah samples clustered together, indicating aquifer similarity, while the Jabel Hafeet samples clustered separately. Several Jabel Hafeet surface water samples were clear outliers. Within the clusters, radon exhibited variation related to groundwater source and could be a useful environmental tracer. The study demonstrates that machine learning can be used to extract meaningful information from incomplete geoscience data. The major findings were the hydrogeochemical similarity between the Fujairah and Ras Al Khaimah aquifers and their difference from the Hafeet aquifer, the identification of the Jabel Hafeet surface water samples, and the utility of radon in environmental tracing. This research provides valuable insights into major UAE aquifers and the ability of artificial intelligence to boost the value of imperfect datasets.
Highlights
Machine learning can predict missing hydrogeochemical data, enhancing groundwater insights.
Radon is a valuable tracer for tracking groundwater sources in arid regions.
Fujairah and Ras Al Khaimah aquifers exhibit similar water chemistry profiles.
Introduction
Groundwater is a vital resource in arid regions, such as the United Arab Emirates (UAE), and proper characterization of major aquifers in such regions is crucial for informed groundwater management. Hydrogeochemical analysis provides insights into groundwater systems by elucidating geological controls, flow patterns, and quality issues. However, missing data is a common challenge that hampers robust aquifer characterization. This study employs machine learning (ML) techniques to extract meaningful information from an incomplete hydrogeochemical dataset collected across three major, hydrogeologically and geographically distinct regions in the UAE: Jabel Hafeet, Fujairah, and Ras Al Khaimah. These regions were selected by considering the structure of the hydrogeological system in the UAE. The groundwater pathway in the UAE starts at the Hajar mountains and proceeds to the northwestern coasts [1, 2]. In the UAE, aquifer systems are primarily controlled by lithology and not tectonics; moreover, recharge fronts are governed by geography rather than by the tectonics of the Hajar. Hence, the three regions considered can capture most of the groundwater variations in the UAE. Hydrogeological studies in the UAE have indicated the existence of several groundwater aquifers and have outlined their presumed geographical distributions [3, 4–5].
The study aimed to profile the hydrogeochemical facies by measuring major-ion chemistry and radon activity. ML models were developed to impute missing chloride and sulfate concentrations. The complete dataset was then subjected to unsupervised clustering to reveal spatial patterns related to the different aquifers.
In recent years, ML techniques have been used to evaluate and predict water quality. For example, artificial neural networks, support vector machines, and random forests models have been applied to impute missing water quality data [6, 7–8]. Additionally, self-organizing maps have been effectively used for missing data imputation in incomplete hydrogeochemical datasets [9]. ML approaches have also shown promise for classification and pattern recognition in large and complex water quality datasets [10, 11–12]. Overall, machine learning provides a useful set of tools for maximizing information extraction from imperfect water quality data, enabling more effective modeling and management.
Furthermore, unsupervised ML techniques have been used to extract insights from complex groundwater hydrogeochemical datasets. Methods such as hierarchical clustering, K-means clustering, and self-organizing maps have been utilized to identify hydrogeochemical facies and spatial patterns based on major-ion chemistry [13, 14–15]. Clustering approaches can reveal distinct groundwater groups related to flow paths, recharge sources, aquifer lithology, and geochemical processes. Coupled with explainable frameworks such as SHAP, they provide a data-driven technique for elucidating controls on groundwater chemistry and delineating hydrostratigraphy. Clustering has also been coupled with techniques like discriminant analysis to further evaluate controlling factors on groundwater quality [16]. Overall, unsupervised clustering presents a robust approach for leveraging hydrogeochemical data to better characterize complex aquifer systems.
This work makes several contributions. The ML method provides a valuable blueprint for maximizing value from imperfect geoscience datasets. The key novel aspect of this work is the integration of ML for data imputation with unsupervised clustering to achieve a deep hydrogeochemical characterization from sparse or incomplete sampling. Radon has been known as an excellent environmental tracer for groundwater sources and flow paths in arid environments [17]. The research findings corroborate the effectiveness of radon as an environmental tracer for fingerprinting groundwater systems. Additionally, the regional view of groundwater chemistry advances the hydrogeological understanding of these major UAE aquifers. Proper characterization of major aquifers is crucial for developing informed groundwater management practices that ensure the long-term sustainability of this vital resource across arid regions like the UAE.
Geological settings and datasets
In this work, 45 groundwater samples were collected over three different regions of the UAE. We assume that the collected samples will belong to at least three aquifers based on the aquifer distribution maps, well distributions, and proximity of the wells in relation to the aquifers. The regions (Jabel Hafeet Mountain, Fujairah, and Ras Al Khaimah) cover the principal aquifers of the northern and eastern regions of the UAE. We collected 10 water samples from the Jabel Hafeet region, 15 from the Fujairah region, and 20 from the Ras Al Khaimah region (Fig. 1).
[See PDF for image]
Fig. 1
Location map of the groundwater sampling wells marked by red dots. The northern group are samples collected in Ras Al Khaimah, while the middle ones are samples collected in Fujairah, and lastly the southern group were samples collected in Hafeet region (modified after the British Geological Survey and UAE Ministry of Energy 2006)
Geologic setting
Jabel Hafeet
The first region, Jabel Hafeet, is the area surrounding Jabel Hafeet Mountain, which is located southeast of Al Ain and approximately 30 km west of the front of the Oman Mountains. The Al Jawwa plain separates the region from these mountains. Jabel Hafeet Mountain is an elongated fold structure trending NNW-SSE. It is a double plunging anticline with an axis that plunges NNW at its northern end in the UAE and SSE at its southern end in Oman [18]. The anticline has a steeply dipping eastern limb and a gently dipping western limb. Its eroded core and limbs expose a succession of mostly carbonate deposits ranging from the Early Eocene to the Miocene [19]. The gravel plain around Jabel Hafeet Mountain slopes gently, and a sand plain is formed by down-wash material drained by wadis from the eastern mountains. The continuation of the wadi channels to the north is characterized by a series of sabkha patches developed after floods due to increased groundwater levels. The western section is dominated by dune fields, while most of the land is covered by Quaternary sediments. The principal aquifer in this region belongs to Paleogene to Neogene (younger than 40 million years) carbonate rocks and is composed of nodular and partly dolomitic limestones with interlayers of marls, anhydrites, and some shale [20].
Ras Al Khaimah
The Ras Al Khaimah region is located in the northern part of the UAE. Most groundwater samples were collected from the Wadi Al Bih area, which covers 483 km2. Wadi Al Bih is a massive southwest-flowing wadi complex that includes portions of the Ruus al Jibal. In Wadi Al Bih, the Ruus al Jibal group is distinguished by Permian to Early Triassic dolomites and dolomitized limestones [21], and notable faults run across the dolomite from north to south [22]. The sediments were deposited as the Bih Formation, Hagil Formation, and Ghail Formation on Arabia’s continental edge. These formations comprise well-bedded dolomite, ranging in color from light to dark gray, and contain numerous joints and fissures. The Wadi Al Bih basin is external, well-developed, and mostly controlled by geologic structures. The Wadi Al Bih Aquifer in Ras Al Khaimah is the main source of groundwater for different uses and is considered a sustainable aquifer [23]. The aquifer system near the Wadi Al Bih Dam can be classified into two subunits: gravel deposits and the weathered, karstified limestone beneath them. Both geologic layers have favorable properties for storing and transmitting groundwater, indicating good potential for groundwater resources [24].
Fujairah
The Fujairah Emirate can be found on the eastern coast of the UAE, covering an area of approximately 1580 km2. Approximately 1200 km2 of this area comprises mountains and rocky heights, which act as a natural museum highlighting numerous forms and kinds impacted by tectonic factors like folding, cracking, sliding, and geological overlap. The terrain in this Emirate is diverse, consisting of various geological features, including mountain ranges, plateaus, valleys, oases, and sandy beaches. The uplifting and geological faults have influenced the distribution of the Hawasina Nappes, a solid formation dating from the Late Permian to the Cretaceous period, and the Semail Ophiolite, which originates from the Late Cretaceous period [25]. The area’s main geology is dominated by Oman–UAE ophiolite complex rocks, with a minor portion in the center-north underlain by medium- to high-grade metamorphic rocks of the Bani Hamid Group. The ophiolitic rocks comprise an older magmatic suite of mantle harzburgite and dunite, which is overlain by a characteristic spreading-ridge crustal section of layered gabbro, extending from “low-level” to “high-level” gabbro [26]. The Semail Ophiolite aquifer has low productivity, except in areas rich in secondary structures like joints, fractures, folds, and faults [23]. The aquifer mainly comprises medium-grained gabbro and fine to medium-grained diorites [27].
Dataset and sampling process
The dataset used in this research was created from 45 groundwater samples collected across three regions and analyzed for several parameters, including temperature, pH, total dissolved solids (TDS), Radon-222 (222Rn) activity, major cation concentrations (K+, Na+, Ca2+, and Mg2+), and major anion concentrations (HCO3−, Cl−, NO3−, and SO42−). Data quality was verified through charge balance error (CBE) calculations, with values ranging from −1.7% to 4.97%, falling within the acceptable limit of ±5% (Table S1). The WTW-COND-3301 instrument was used to measure temperature, pH, and TDS, with an error not exceeding 5% and an accuracy of ± one digit. The 222Rn activity was measured in situ using Rad7 (electronic portable radon detector from Durridge Co., USA). The average radon concentration over the four cycles was recorded in Bq/m3 and then converted into Bq L−1 for comparison with the World Health Organization (WHO) standards. The analytical accuracy of the measurement was calculated using three executive measurements and estimated at 5%. The cation concentrations were analyzed using inductively coupled plasma atomic emission spectrometry (ICP-AES). Anion concentrations were measured using high-performance liquid chromatography (HPLC).
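The charge balance check described above can be illustrated with a minimal sketch. It assumes concentrations are reported in mg/L and uses the conventional definition CBE = 100 × (Σcations − Σanions) / (Σcations + Σanions), with sums taken in meq/L; the source does not state which formula variant was applied. The function name, the dictionary layout, and the example concentrations are hypothetical and are not taken from Table S1.

```python
# Equivalent weights in g/eq (molar mass divided by the absolute ionic charge).
EQUIVALENT_WEIGHT = {
    "Na": 22.990, "K": 39.098, "Ca": 40.078 / 2, "Mg": 24.305 / 2,
    "Cl": 35.453, "SO4": 96.06 / 2, "HCO3": 61.017, "NO3": 62.004,
}
CATIONS = ("Na", "K", "Ca", "Mg")
ANIONS = ("Cl", "SO4", "HCO3", "NO3")


def charge_balance_error(sample_mg_per_l):
    """Return the charge balance error (%) of one sample given in mg/L.

    Ions absent from the dictionary are treated as zero (an assumption).
    """
    meq = {
        ion: sample_mg_per_l.get(ion, 0.0) / weight
        for ion, weight in EQUIVALENT_WEIGHT.items()
    }
    cation_sum = sum(meq[ion] for ion in CATIONS)
    anion_sum = sum(meq[ion] for ion in ANIONS)
    return 100.0 * (cation_sum - anion_sum) / (cation_sum + anion_sum)


# Hypothetical sample, for illustration only.
example = {"Na": 120.0, "K": 8.0, "Ca": 60.0, "Mg": 25.0,
           "Cl": 190.0, "SO4": 110.0, "HCO3": 150.0, "NO3": 10.0}
cbe = charge_balance_error(example)
print(f"CBE = {cbe:.2f}% (within +/-5%: {abs(cbe) <= 5.0})")
```

A sample whose absolute CBE exceeds 5% would be flagged for re-analysis under the acceptance limit quoted above.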
Methodology
The research methodology can be divided into two parts. The first part comprised an imputation strategy and best model creation to predict the missing values. Next, we used a clustering algorithm and selected the most important parameters. A flowchart representing the methodology utilized in this research is presented in Fig. 2.
[See PDF for image]
Fig. 2
Flowchart detailing the two main processes of this study methodology (imputation and clustering)
Imputation
When dealing with missing data, ML practitioners often predict the missing values using various approaches, whereas statisticians often discard rows containing missing values as the simplest option [28]. Balancing both approaches requires choosing when to impute data and when to discard it. The dataset contains two types of missing data. The first type concerns the samples collected in the Ras Al Khaimah region, for which four parameters (Mg2+, Ca2+, HCO3−, and NO3−) were not measured. As these samples were the most numerous, we had to choose between discarding the four parameters (columns) and discarding the Ras Al Khaimah samples (rows) when building the model. We selected the approach that sacrificed the least data: discarding the Ras Al Khaimah samples meant losing 178 readings (35% of our data), whereas discarding the four parameters meant losing 100 readings (20% of the data). We therefore kept the Ras Al Khaimah samples and discarded the four parameters. The data loss options are illustrated along with the raw data in the supplementary table (Table S1). The red cells in the supplementary table represent the missing cells. The yellow cells represent the data lost if the four elements (Mg2+, Ca2+, HCO3−, NO3−) are discarded. The green cells represent the data lost if the Ras Al Khaimah samples are discarded to rectify the missing cells.
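The row-versus-column trade-off described above amounts to counting how many measured cells each option throws away. The sketch below shows that bookkeeping on a toy table; the helper names and the toy values are hypothetical, and the counts it prints are not the 178 and 100 readings reported for the real dataset.

```python
import numpy as np
import pandas as pd


def cells_lost_by_dropping_rows(measurements, row_mask):
    """Count measured (non-missing) cells discarded if the masked rows are dropped."""
    return int(measurements.loc[row_mask].notna().sum().sum())


def cells_lost_by_dropping_columns(measurements, columns):
    """Count measured (non-missing) cells discarded if the given columns are dropped."""
    return int(measurements[columns].notna().sum().sum())


# Toy table: region "B" has no Ca or Mg measurements (values are made up).
toy = pd.DataFrame({
    "Na": [1.0, 2.0, 3.0, 4.0],
    "K": [0.1, 0.2, 0.3, 0.4],
    "Ca": [5.0, np.nan, np.nan, np.nan],
    "Mg": [6.0, np.nan, np.nan, np.nan],
})
region = pd.Series(["A", "B", "B", "B"])

rows_lost = cells_lost_by_dropping_rows(toy, region == "B")
cols_lost = cells_lost_by_dropping_columns(toy, ["Ca", "Mg"])
print(f"drop region B rows: {rows_lost} cells lost; drop Ca/Mg columns: {cols_lost} cells lost")
```

In this toy case, as in the study, dropping the incomplete columns discards fewer measured values than dropping the incomplete rows.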
The second type of missing data covers minor gaps in the dataset. Only one sample, R-KH18 from the Ras Al Khaimah region, had such a gap: it lacked readings for Cl− and SO42−. Our imputation strategy was therefore to develop a model that can predict these two elements with the highest possible accuracy, not only for the R-KH18 sample but also for any future samples.
When imputing missing values using advanced algorithms, the main inputs for predicting the missing values are usually the other existing parameters. We have 12 parameters, of which we excluded six as predictors. Four of them were discussed earlier (Mg2+, Ca2+, HCO3−, and NO3−), while the other two (Cl− and SO42−) are the prediction targets and are used to evaluate the model’s performance. Therefore, the six parameters used to predict the two missing elements (Cl− and SO42−) are temperature, pH, TDS, Radon-222 (222Rn) activity, and the major cation concentrations (K+ and Na+).
We imputed the missing values in R-KH18 by testing several models to find the one best able to predict the values of our dataset. Specifically, we tested K-nearest neighbor, Bayesian ridge, support vector machines, Extra-trees, and Random Forests. The models were implemented with the scikit-learn Python library [29]. We also tuned the models' hyperparameters to improve their accuracy. Owing to the relatively small dataset size (45 samples) and the importance of the prediction accuracy, we used Leave-One-Out Cross-Validation (LOOCV) to measure each model’s performance [30]. This method leaves one sample out of the training data, uses the trained model to predict that sample's value, and repeats the process for every sample, counting any discrepancy between a sample and its prediction as error. We used root–mean–square error (RMSE), the coefficient of determination (R2), and visual comparisons to summarize each model’s accuracy.
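The LOOCV evaluation described above might look like the following minimal sketch with scikit-learn. It is an illustration under stated assumptions, not the study's code: the feature matrix and target are synthetic stand-ins, `loocv_scores` is a hypothetical helper, and the forest uses fewer trees than the tuned model to keep the example fast. Because each left-out fold holds a single sample, R2 cannot be computed per fold, so the sketch pools all left-out predictions before scoring.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loocv_scores(model, X, y):
    """Return (RMSE, R2) computed on pooled leave-one-out predictions."""
    predictions = cross_val_predict(model, X, y, cv=LeaveOneOut())
    rmse = float(np.sqrt(mean_squared_error(y, predictions)))
    return rmse, float(r2_score(y, predictions))


# Synthetic stand-in: 30 samples, 6 predictors (e.g. temperature, pH, TDS,
# 222Rn, K+, Na+), with a target that depends mostly on the third column.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
y = 3.0 * X[:, 2] + rng.normal(scale=0.5, size=30)

model = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0)
rmse, r2 = loocv_scores(model, X, y)
print(f"LOOCV RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```

The same helper can be called with any scikit-learn regressor, which makes it straightforward to compare candidate models on equal terms.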
Clustering
The dataset includes 12 features for each sample. In clustering approaches, it is imperative to identify redundant features and features that need to be scaled. Setting aside the sample ID column and the four elements (Mg2+, Ca2+, HCO3−, and NO3−) removed during the imputation phase leaves easting, northing, pH, temperature, TDS, Na+, K+, Cl−, SO42−, 222Rn, and elevation for consideration. Prior to the clustering analysis, we conducted a preliminary statistical analysis to evaluate the variability of each parameter in the dataset. The Coefficient of Variation (CV), calculated as the ratio of standard deviation to mean expressed as a percentage, was used to assess the relative variability of each parameter. The CV values ranged from 7.54% to 124.72%, indicating substantial variation across all measured parameters. The lowest relative variation was exhibited by pH (CV = 7.54%), followed by temperature (CV = 10.25%). Chemical parameters showed higher variability, with Cl− (CV = 81.61%), K+ (CV = 81.87%), 222Rn (CV = 83.85%), and SO42− (CV = 84.32%) displaying similar levels of variation. The highest variability was observed in TDS (CV = 103.72%) and Na+ (CV = 105.07%). Given that all parameters demonstrated significant variation (CV > 5%), no variables were removed based on low variance, as each parameter potentially carried meaningful information for the clustering analysis.
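As a small illustration of the CV screening described above, the sketch below computes the coefficient of variation per column with pandas. The source does not say whether the sample or population standard deviation was used; the sample form (`ddof=1`) is assumed here, and the column values are made up.

```python
import pandas as pd


def coefficient_of_variation(table):
    """Return CV (%) per column: sample standard deviation / mean * 100."""
    return table.std(ddof=1) / table.mean() * 100.0


# Made-up values, for illustration only.
toy = pd.DataFrame({
    "pH": [7.0, 7.5, 8.0],
    "TDS": [100.0, 200.0, 300.0],
})
cv = coefficient_of_variation(toy)
print(cv.round(2))

# Columns whose CV falls below a chosen threshold would be candidates for removal.
low_variance = cv[cv < 5.0].index.tolist()
print("low-variance columns:", low_variance)
```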
To further understand the relationships between variables, we analyzed their pairwise correlations (Table 1). The correlation matrix revealed strong positive correlations (> 0.9) among TDS, Na+, K+, and Cl−, suggesting these parameters are strongly interrelated. Temperature showed weak to moderate negative correlations with most parameters, while pH demonstrated moderate positive correlations with most chemical parameters. Both SO42− and 222Rn exhibited weak to moderate correlations with all parameters (correlation coefficients mostly < 0.5), indicating their potential independence in characterizing the groundwater samples.
Table 1. Correlation Matrix between the analyzed variables
[See PDF for image]
As this section involves determining the relation between the wells based on their hydrogeochemical characteristics, we removed the easting, northing, and elevation columns, while keeping the temperature column tentatively. We utilized the SHAP library to explain the clustering algorithm output and determine the effect each element had on the output [31]. However, as SHAP does not support clustering algorithms by default, a surrogate model was created using the Kernel SHAP method based on the input and output of the clustering algorithm; subsequently, SHAP would determine the relations of each element with the model output.
We observed that the temperature variable considerably influenced our model’s outcomes, superseding the effects of other chemical interactions. Initially, the temperature was used in the clustering algorithm, anticipating its potential contribution to the analysis based on the statistical analysis. However, numerous experimental trials revealed that its inclusion adversely affected the algorithm’s output. The influence of the “Temperature” variable on the clustering algorithm’s results was disproportionate, leading to skewed output and reducing the algorithm effectiveness. As such, it became apparent that the “Temperature” parameter was a detrimental factor to our model, necessitating its exclusion for ensuring accuracy and reliability of our results.
For the clustering algorithm, we chose K-means because it is a well-known and established algorithm. The features of the dataset clearly need scaling, as some feature values lie in the tens while others exceed tens of thousands. Such a disparity between features can substantially affect K-means clustering [32]. Thus, each column was standardized independently to zero mean and unit variance, and the transformed dataset was used as input for the K-means clustering algorithm. The last important aspect of clustering is the number of clusters. Based on the sampling regions, we made the a priori assumption of three clusters, each representing the distinct characteristics that arise from the rock-water interaction between the groundwater and the rocks in the Hafeet, Fujairah, and Ras Al Khaimah regions. To verify this assumption, we used the silhouette score to determine the best number of clusters based on the homogeneity of the data.
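The standardization, K-means, and silhouette steps described above can be sketched as follows with scikit-learn. This is a minimal illustration rather than the study's implementation: the data are two synthetic groups with deliberately mismatched column scales, `silhouette_by_k` is a hypothetical helper, and `n_init` and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(X, k_values, random_state=0):
    """Standardize each column, run K-means for each k, return {k: silhouette}."""
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_scaled)
        scores[k] = float(silhouette_score(X_scaled, labels))
    return scores


# Synthetic stand-in: 45 samples, 7 features, two well-separated groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(20, 7))
group_b = rng.normal(loc=10.0, scale=1.0, size=(25, 7))
column_scales = np.array([1.0, 1e4, 1e3, 10.0, 1e3, 1e3, 1e2])  # mismatched units
X_demo = np.vstack([group_a, group_b]) * column_scales

scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k = {k}: silhouette = {score:.3f}")
print("highest silhouette at k =", best_k)
```

The silhouette score only ranks partitions by compactness and separation; as in the study, the final choice of k can still be informed by domain knowledge about the sampled regions.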
Results and discussion
We simplified the imputation models’ codes for quick and easy implementation. We also used several existing algorithms. Table 2 shows the models’ accuracies. A detailed table (Table S2) is included in the supplementary materials.
Table 2. Experiments that scored the highest accuracy (R2) with the lowest RMSE
Experiment ID | Estimator | Hyper Parameters (1) | Hyper Parameters (2) | Chloride (Cl) RMSE | Chloride (Cl) R2 | Sulfate (SO4) RMSE | Sulfate (SO4) R2 |
|---|---|---|---|---|---|---|---|
4 | KNN Imputer | Number of Neighbors = 5 | Uniform | 794.590 | 0.800 | 467.510 | 0.100 |
11 | KNN Imputer | Number of Neighbors = 3 | Distance | 827.360 | 0.780 | 463.400 | 0.110 |
18 | KNN Imputer | Number of Neighbors = 10 | Distance | 799.240 | 0.803 | 425.570 | 0.258 |
19 | Bayesian Ridge | - | - | 723.530 | 0.830 | 441.560 | 0.200 |
28 | Random Forest | n_estimators = 200 | max_depth = 6 | 597.710 | 0.890 | 457.900 | 0.140 |
29 | SVR | - | - | 657.430 | 0.860 | 474.880 | 0.075 |
30 | Gradient Boosting | max_iter = 100 | learning_rate = 0.1 | 1017.280 | 0.680 | 436.530 | 0.210 |
32 | ExtraTrees | max_depth = 5, max_features = "sqrt", min_samples_leaf = 1 | min_samples_split = 5, n_estimators = 20, random_state = 0 | - | - | 420.434 | 0.275 |
First, different models were needed to produce the best results for each missing element, meaning a different model was selected for each one. As the RMSE is a summarized quantification of the error between the observed and predicted values, it is difficult to grasp the qualitative aspects of each model’s performance. As a result, we created Fig. 3 from some of the models tested in Table 2 to visually illustrate the model’s prediction when compared with the observed data.
[See PDF for image]
Fig. 3
Selected examples of models and their accuracy in predicting the data. The solid lines represent the observed while the dotted lines represent the predicted. The blue line represents Chloride while the red line represents the Sulfate values
According to the no-free-lunch theorem, no model works best for every problem [33]. Therefore, variations in model performance may arise from each model’s underlying mechanisms, assumptions, complexity, randomness, or feature importance. Despite our best efforts in tuning the various models, the sulfate (SO42−) models never exceeded an R2 of 0.30. This likely indicates that sulfate is largely independent of the other chemical parameters measured in the collected samples. Despite its weak performance, the selected model is still the best option for imputing the missing sulfate value based on the available data. Discarding the sample entirely would forfeit valuable data, and using the mean to impute the missing value yielded RMSE = 493.86 and R2 = 0.0026, which is considerably worse than our models. Lastly, despite the weak performance, the model still captured the trends and fluctuations in the data (Fig. 4). Visualization conveys the prediction fidelity better than summary metrics [34, 35].
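A mean-imputation baseline of the kind mentioned above can be scored with the same metrics as the fitted models. The source does not specify how the baseline figures were computed; the sketch below assumes a leave-one-out protocol and uses scikit-learn's `DummyRegressor`, with a hypothetical helper name and made-up target values.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def mean_baseline_scores(y):
    """Return (RMSE, R2) when each sample is predicted by the mean of the others."""
    y = np.asarray(y, dtype=float)
    X_placeholder = np.zeros((len(y), 1))  # DummyRegressor ignores the features
    baseline = DummyRegressor(strategy="mean")
    predictions = cross_val_predict(baseline, X_placeholder, y, cv=LeaveOneOut())
    rmse = float(np.sqrt(mean_squared_error(y, predictions)))
    return rmse, float(r2_score(y, predictions))


# Made-up sulfate-like values, for illustration only.
toy_sulfate = [120.0, 340.0, 95.0, 610.0, 280.0, 450.0]
baseline_rmse, baseline_r2 = mean_baseline_scores(toy_sulfate)
print(f"mean baseline: RMSE = {baseline_rmse:.2f}, R2 = {baseline_r2:.3f}")
```

Any candidate model whose RMSE is lower and R2 higher than this baseline under the same protocol is adding information beyond the sample mean.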
[See PDF for image]
Fig. 4
The best predictions of chloride and sulfate were achieved using Random Forest and Extra-trees, respectively
Based on the previous findings, we decided to employ a random forest algorithm fine-tuned with particular hyperparameters to predict the missing chloride (Cl−). Conversely, we used the Extra-trees algorithm to estimate the values for sulfate (SO42−). The final computational model (Fig. 4) illustrates the application of these algorithms. The predicted values for chloride and sulfate were approximately 118.800316 and 564.28237, respectively. Filling the gaps with the predicted data allows us to utilize the ML clustering approach to investigate the hydrogeochemical dataset.
As discussed in the methodology section, the temperature parameter was initially considered among the input parameters for the clustering process; however, temperature is a physical parameter, not a chemical one, so we considered removing it [14]. Moreover, as shown in Fig. 5a, it had the highest impact on the model output, overshadowing the chemical relations. Figure 5b shows the effect each element had on the output after the temperature was removed from the inputs; the resulting attributions appear more reasonable. As a result, the temperature was removed from the inputs to the model, and the final input for the clustering algorithm became the seven remaining columns (pH, TDS, Na+, K+, Cl−, SO42−, and 222Rn). According to the findings, SO42−, pH, and 222Rn had the highest effects on the clustering output (Fig. 5b).
[See PDF for image]
Fig. 5
SHAP model output values. (a) demonstrates the impact of all considered groundwater parameters, while (b) demonstrates the impact of the same parameters with temperature excluded
After imputing the missing values and removing the redundant parameters, we used the silhouette score to determine the best number of clusters. Our a priori assumption was three clusters, each representing the characteristics of the Hafeet, Fujairah, and Ras Al Khaimah regions. The results of the silhouette score are shown in Fig. 6. The score suggests that two clusters are the optimal partition of the samples based on the available data. The score dips considerably for larger cluster numbers, suggesting strong heterogeneity between data points.
[See PDF for image]
Fig. 6
A diagram showing the relationship between the silhouette score and number of clusters. Note that only the clusters achieving the highest Silhouette Score are shown in the table which were used to create the spatial clustering maps in Fig. 7
We created several spatial maps of the clusters using the GeoZ library [36] and used them to check the spatial distribution of the clusters and corroborate the silhouette scores. Although the silhouette score indicates a two-cluster partition as ideal, we mapped the four highest-scoring cluster counts to check the variability of the distribution, verify our assumptions, and gain insight into the homogeneity of the three regions. The map compilation is presented in Fig. 7.
[See PDF for image]
Fig. 7
Comparison map illustrating the different spatial distributions of the clustering algorithm based on the number of clusters. The blue points indicate the samples’ location, while the regions indicate the clusters. Each cluster is highlighted by a region of the same color
We noticed some interesting events based on the spatial distribution of the clusters for the four maps. First, the samples collected in the Fujairah region were always classified as one cluster, never subdivided into smaller clusters, even when we increased the cluster numbers to five. This indicates that the readings collected in that area are very homogenous, which can be interpreted as a single homogeneous aquifer being the source for all the samples.
Another notable finding is that in the two-cluster distribution (Fig. 7a), the Fujairah and Ras Al Khaimah regions formed one cluster, while the Hafeet region formed a different cluster. This indicates a closer resemblance between the hydrogeochemistry of the Fujairah and Ras Al Khaimah regions, while the Hafeet region differs from them. However, this distribution was excluded because we knew that the Hafeet region contained at least two different clusters, as discussed in more detail later. When asked to create three clusters (Fig. 7b), the algorithm produced one cluster for each area. This finding agrees with our initial assumption that each region has its own cluster based on its unique traits and hydrogeochemical characteristics. However, some of the wells in the Ras Al Khaimah region (R-KH08, R-KH14, R-KH15, R-KH17, and R-KH19) were classified into the same cluster as the Fujairah wells, which indicates that they are atypical of their region and closely resemble the Fujairah wells (Fig. 8). The Hafeet region maintained its cluster shape with no change in the wells’ distribution.
[See PDF for image]
Fig. 8
Diagram showing the profiles of the average values for the three regions (Hafeet, Ras Al Khaimah, Fujairah). See Fig. 1 for the region’s location. The solid lines represent the average for the three clusters option, while the dotted lines represent the extra clusters (formed mostly by abnormal or outlier samples) in the four and five cluster options. Drawn on logarithmic scale to account for the scale difference in parameters values
When tasked with producing four clusters, the model subdivided the Hafeet region into two clusters while keeping the rest unchanged (Fig. 7c). The new cluster contained two samples with values well above the regional average (Fig. 8). The model classified these two samples, SP5 and SP6, as a separate group (Hafeet Outliers). These outlier samples had the highest readings for pH, TDS, Na+, K+, Cl−, and SO42− but substantially lower 222Rn readings than the rest of the samples (Fig. 8). Interestingly, the Hafeet outlier group had normal HCO3− and NO3− values, but it had extremely high Ca2+ readings (twice the average) and the highest Mg2+ readings. Its temperature was also the lowest in the region by a wide margin.
In the five-cluster distribution, the model divided Hafeet further into three clusters (Fig. 7d). The two outliers from the four-cluster distribution kept their own cluster, and another set of samples was separated from the remaining Hafeet samples to form a new cluster, Hafeet Outliers (2). However, this new cluster shows only minor differences from the rest of the Hafeet samples: as the profiles in Fig. 8 show, the Hafeet Outliers (2) profile follows the Hafeet profile closely. This result indicates that the model becomes over-sensitive in its division of clusters. As such, we selected four clusters as the optimal distribution of our samples (Fig. 9).
[See PDF for image]
Fig. 9
Clustering Map illustrating the spatial distribution of the K-means using four clusters
Based on this selection, we can assume that the samples collected in the Fujairah and Ras Al Khaimah regions are considerably similar in their hydrogeochemical characteristics but notably different from those collected in the Hafeet region. The earlier mentioned assumption that Hafeet region contained at least two clusters stems from the fact that aside from the collected groundwater samples, we collected two surface water samples. The samples were collected as a test of the algorithm efficiency and distinction capabilities. They were collected from the surface water (Green Mubazzarah Lake) instead of the groundwater used for collecting the remaining samples in the study area. As such, under normal circumstances the Hafeet region would contain at least two clusters.
Regarding the Ras Al Khaimah samples that were clustered with the Fujairah samples, five of the twenty samples collected in Ras Al Khaimah were assigned to the Fujairah cluster. Upon investigation, the similarity in the groundwater hydrogeochemical response appeared to be mainly due to water–rock interactions with the geological formations hosting the wells. Although this is inconclusive and requires further investigation, we expect that the similar hydrogeochemical signatures are due to water interaction with the Hili formation. The Hili formation was deposited in the late Quaternary, where most of the country's aquifers are located. The formation is found in both regions, which can affect the wells' behavior and thus produce relatively similar responses in both areas. The improved hydrogeological understanding provided by these findings can help prepare for future studies that expand on the scope of this article and can support evidence-based policies for the sustainable management and utilization of groundwater resources across the UAE.
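The overlap between sampling region and assigned cluster can be summarized as a simple cross-tabulation. The sketch below is illustrative only: the helper `cluster_share_by_region` and the cluster names are hypothetical, and only the Ras Al Khaimah counts stated above (five of twenty samples) are used.

```python
from collections import Counter, defaultdict


def cluster_share_by_region(regions, clusters):
    """Return, for each region, the fraction of its samples assigned to each cluster."""
    counts = defaultdict(Counter)
    for region, cluster in zip(regions, clusters):
        counts[region][cluster] += 1
    return {
        region: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for region, ctr in counts.items()
    }


# Twenty Ras Al Khaimah samples, five of which fall in the Fujairah cluster.
# The cluster names are placeholders, not labels used in the study.
regions = ["Ras Al Khaimah"] * 20
clusters = ["Ras Al Khaimah cluster"] * 15 + ["Fujairah cluster"] * 5

shares = cluster_share_by_region(regions, clusters)
print(shares["Ras Al Khaimah"]["Fujairah cluster"])  # 5 / 20 = 0.25
```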
Conclusions
Groundwater samples were collected from 45 wells in three areas with varying aquifer lithologies: Jabel Hafeet Mountain, Fujairah, and Ras Al Khaimah. The physicochemical parameters, concentrations of major cations and anions (K+, Na+, Ca2+, Mg2+, HCO3−, Cl−, and SO42−), and 222Rn activity were analyzed. Imputing missing values is crucial in ML analysis. However, different models produced varying results depending on the hydrogeochemical parameter being predicted. Therefore, testing several models and tuning their hyperparameters are required to find the best-fitting model. The ML results classified the samples into four categories: three covering the previously discussed regions and a fourth accommodating the outlier samples collected in the Jabel Hafeet region. The clustering also indicated the homogeneity of the Fujairah samples, suggesting that all samples from the region belong to one aquifer. There is a close relation between the Fujairah and Ras Al Khaimah regions, considering that 25% of the Ras Al Khaimah samples were grouped with the Fujairah samples based on their hydrogeochemical parameters. This can be interpreted as a groundwater response to geologic formations shared by both areas, such as the Hili formation. One of the best environmental indicators in this study was radon, which differed considerably among the clusters, making it a useful environmental tracer for identifying groundwater sources. Overall, the integrated techniques provide valuable insights into UAE aquifer systems that can inform sustainable groundwater management in the face of increasing exploitation and climate pressures.
While these findings provide valuable insights, we acknowledge that the current dataset, with 45 samples and two missing values, is limited in size and may not fully demonstrate the potential of the proposed imputation methods. The small sample size restricts our ability to draw broad conclusions about the comparative performance of different imputation techniques. However, one of the major purposes of this study was to establish a comprehensive workflow for handling missing data in hydrogeochemical datasets, which can be scaled up and applied to larger datasets in future research. The methods presented here serve as a proof of concept and a blueprint for more extensive studies.
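The model-based imputation step of such a workflow can be sketched as follows: train a regressor on the samples where the target ion is known, then predict it where it is missing. This is a hedged illustration using scikit-learn's `RandomForestRegressor` and `ExtraTreesRegressor` on synthetic data; the data, the hyperparameters, and the helper `impute_with_model` are assumptions and do not reproduce the study's tuned models.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic stand-in (NOT the study's data): 45 samples, three predictor
# parameters, and one target ion with two missing values.
rng = np.random.default_rng(42)
n_samples = 45
X = rng.uniform(0.0, 10.0, size=(n_samples, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n_samples)
y[[3, 17]] = np.nan  # mark two target values as missing


def impute_with_model(model, X, y):
    """Train on rows with a known target and fill the rows where it is NaN."""
    missing = np.isnan(y)
    model.fit(X[~missing], y[~missing])
    filled = y.copy()
    filled[missing] = model.predict(X[missing])
    return filled


# Hyperparameters here are illustrative defaults, not tuned values.
models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=200, random_state=0),
}
imputed = {name: impute_with_model(m, X, y) for name, m in models.items()}
for name, values in imputed.items():
    print(name, values[[3, 17]])
```

With only two missing values, the imputed numbers cannot say much about which model is better; on a larger dataset, one option would be to hide known values and score each model on how well it recovers them.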
Recommendations
This study has revealed the potential for ML techniques, particularly Random Forest and Extra-trees, to address the challenge of missing data in hydrogeochemical datasets. Future research could explore other advanced ML or statistical methodologies for missing-data imputation in the UAE. Although the three regions broadly represent UAE hydrogeology, the methodologies used could be extended to other regions in the UAE or to countries with similar hydrogeological contexts to evaluate their effectiveness in a broader context. The results represent a snapshot analysis based on a single round of groundwater sampling. Multiple rounds could provide better insights, especially regarding the relation between the aquifers. Additionally, because only a few values were missing, the performance of the imputation approach could not be clearly assessed. We therefore recommend using larger hydrogeochemical datasets to observe the effects of the imputation method. Lastly, the research found that certain samples from Ras Al Khaimah were classified with the Fujairah samples, suggesting a close resemblance in their hydrogeochemical characteristics. Further investigation is needed to confirm this finding and its implications for groundwater management in these regions.
Acknowledgements
The authors thank Dr. Saber Hussein for his valuable assistance in the fieldwork and laboratory measurements, as well as Engr. Omer Elbashir for providing help with laboratory analysis. The authors also extend their thanks to Dr. Osman Abdelghany and Prof. Abdel-Rahman Fowler for their help in dating the geologic formations.
Author contributions
Conceptualization, Dalal Alshamsi; Data curation, Khalid ElHaj, Dalal Alshamsi, Balqees Alblooshi, Fatima Haile, Shamma AlRashdi, and Basant Elabyad; Formal analysis, Dalal Alshamsi, Fatima Haile, and Shamma AlRashdi; Funding acquisition, Dalal Alshamsi; Investigation, Khalid ElHaj, and Balqees Alblooshi; Methodology, Khalid ElHaj and Dalal Alshamsi; Project administration, Dalal Alshamsi; Resources, Dalal Alshamsi; Software, Khalid ElHaj; Supervision, Dalal Alshamsi; Validation, Khalid ElHaj, Dalal Alshamsi, Fatima Haile, Shamma AlRashdi, and Basant Elabyad; Visualization, Khalid ElHaj; Writing—original draft, Khalid ElHaj and Balqees Alblooshi; Writing—review & editing, Khalid ElHaj and Dalal Alshamsi. All authors have read and agreed to the published version of the manuscript.
Funding
This study was jointly funded by the Research Affairs Office, UAE University (Fund No. 31S445), and the National Water and Energy Center (Fund No. G00004039).
Availability of data and materials
The open-source code and the dataset are available on GitHub and can be accessed using the following link: https://github.com/Ne-oL/MLHydrogeochemisty.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent to Publish
Not applicable.
Competing interests
The authors declare no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Gómez-Alday, JJ et al. A multi-isotopic evaluation of groundwater in a rapidly developing area and implications for water management in hyper-arid regions. Sci Total Environ; 2022; [DOI: https://dx.doi.org/10.1016/j.scitotenv.2021.150245]
2. Ministry of Environment & Water. Hydroatlas United Arab Emirates. 2014.
3. El Mahamoudi A, Sherif M. Geomorphological and geological setting of selected Wadis in the Northern Emirates. Fifth Annu. U.A.E. Univ. Res. Conf., pp. 23–31, 2004. https://doi.org/10.13140/2.1.4542.3685.
4. Rizk, ZS; Alsharhan, AS. Water resources in the United Arab Emirates. Dev Water Sci; 2003; 50, pp. 245-264.
5. Halcrow M, Nouh IP. An overview for the water resources of the United Arab Emirates. 1st Tech. Meet. Muslim Water Res. Coop., pp. 27–37, 2008.
6. Dickson, BL; Giblin, AM. An evaluation of methods for imputation of missing trace element data in groundwaters. Geochem Explor Environ Anal; 2007; 7,
7. Najah-Ahmed, A et al. Machine learning methods for better water quality prediction. J Hydrol; 2019; 578, 124084. [DOI: https://dx.doi.org/10.1016/j.jhydrol.2019.124084]
8. Singha, S; Pasupuleti, S; Singha, SS; Singh, R; Kumar, S. Prediction of groundwater quality using efficient machine learning technique. Chemosphere; 2021; 276, 130265. [DOI: https://dx.doi.org/10.1016/j.chemosphere.2021.130265]
9. Folguera, L; Zupan, J; Cicerone, D; Magallanes, JF. Self-organizing maps for imputation of missing data in incomplete data matrices. Chemom Intell Lab Syst; 2015; 143, pp. 146-151. [DOI: https://dx.doi.org/10.1016/j.chemolab.2015.03.002]
10. Haggerty, R; Sun, J; Yu, H; Li, Y. Application of machine learning in groundwater quality modelling—a comprehensive review. Water Res; 2023; 233, 119745. [DOI: https://dx.doi.org/10.1016/j.watres.2023.119745]
11. Rodríguez, R et al. Water-quality data imputation with a high percentage of missing values: a machine learning approach. Sustainability; 2021; 13,
12. Masood, A; Niazkar, M; Zakwan, M; Piraei, R. A machine learning-based framework for water quality index estimation in the southern Bug river. Water; 2023; 15,
13. Samal, P; Mohanty, AK; Khaoash, S; Mishra, P. Hydrogeochemical characteristics and spatial analysis of groundwater quality in a semi-arid region of Western Odisha, India. Arab J Geosci; 2023; 16,
14. Yang, J et al. Using cluster analysis for understanding spatial and temporal patterns and controlling factors of groundwater geochemistry in a regional aquifer. J Hydrol; 2020; 583, 124594. [DOI: https://dx.doi.org/10.1016/j.jhydrol.2020.124594]
15. Cloutier, V; Lefebvre, R; Therrien, R; Savard, MM. Multivariate statistical analysis of geochemical data as indicative of the hydrogeochemical evolution of groundwater in a sedimentary rock aquifer system. J Hydrol; 2008; 353,
16. Mahlknecht, J; Steinich, B; Navarro de León, I. Groundwater chemistry and mass transfers in the Independence aquifer, central Mexico, by using multivariate statistics and mass-balance models. Environ Geol; 2004; 45,
17. Alshamsi D, et al. Environmental assessment of radon-222 in groundwater of the United Arab Emirates. In International Conference on Engineering Geophysics, Al Ain, United Arab Emirates, pp. 512–515, 2017. https://doi.org/10.1190/iceg2017-093.
18. Zaineldeen, U; Fowler, AR. Structural style and fault kinematics of the Lower Eocene Rus Formation at Jabal Hafit area, Al Ain, United Arab Emirates (UAE). Arab J Geosci; 2014; 7,
19. Boukhary, M; Abdelghany, O; Bahr, S. Nummulites alsharhani n.sp. (Late Lutetian) from Jabal Hafit and Al Faiyah: Western side of the Northern Oman Mountains, United Arab Emirates. Rev Paleobiol; 2002; 21,
20. El-Saiy, AK; Jordan, BR. Diagenetic aspects of tertiary carbonates west of the Northern Oman Mountains, United Arab Emirates. J Asian Earth Sci; 2007; 31,
21. Ali, MY; Aidarbayev, S; Searle, MP; Watts, AB. Subsidence history and seismic stratigraphy of the Western Musandam Peninsula, Oman-United Arab Emirates Mountains. Tectonics; 2018; 37,
22. Ebraheem, AM; Sherif, MM; Al-Mulla, MM; Akram, SF; Shetty, AV. A geoelectrical and hydrogeological study for the assessment of groundwater resources in Wadi Al Bih, UAE. Environ Earth Sci; 2012; 67,
23. Alsharhan, AS; Rizk, ZE. Water resources and integrated management of the United Arab Emirates; 2020; Cham, Springer International Publishing:
24. Sefelnasr, A et al. Enhancement of groundwater recharge from Wadi Al Bih Dam, UAE. Water; 2022; 14,
25. Searle, MP; Ali, MY. Structural and tectonic evolution of the Jabal Sumeini - Al Ain - Buraimi region, northern Oman and eastern United Arab Emirates. GeoArabia; 2009; 14,
26. Searle, MP; Waters, DJ; Garber, JM; Rioux, M; Cherry, AG; Ambrose, TK. Structure and metamorphism beneath the obducting Oman ophiolite: Evidence from the Bani Hamid granulites, northern Oman mountains. Geosphere; 2015; 11,
27. EWES. Wadi Ham dam and groundwater recharge facilities, p. Volume I Design, 1981.
28. Enders CK. Applied missing data analysis. 2010.
29. Pedregosa, F et al. Scikit-learn: machine learning in Python. J Mach Learn Res; 2012; 12, pp. 2825-2830.
30. James G, Witten D, Hastie T, Tibshirani R, et al. An introduction to statistical learning, vol. 112. Springer, 2013.
31. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems 30. Curran Associates, Inc., 2017. pp. 4765–4774.
32. Ozsahin DU, Taiwo Mustapha M, Mubarak AS, Said Ameen Z, Uzun B. Impact of feature scaling on machine learning models for the diagnosis of diabetes. In Proceedings—2022 international conference on artificial intelligence in everything, AIE 2022, 2022. pp. 87–94, https://doi.org/10.1109/AIE57029.2022.00024.
33. Wolpert, DH; Macready, WG. No free lunch theorems for optimization. IEEE Trans Evol Comput; 1997; 1,
34. Anscombe, FJ. Graphs in statistical analysis. Am Stat; 1973; 27,
35. Chambers JM. Graphical methods for data analysis. 2018.
36. ElHaj, K; Alshamsi, D; Aldahan, A. GeoZ: a region-based visualization of clustering algorithms. J Geovisual Spat Anal; 2023; 7,
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.