1 Introduction
Air quality (AQ) is a fundamental aspect of environmental health, addressing the composition and purity of atmospheric gases in terms of fine particulate matter (PM), nitrogen oxides (such as NO, NO2, and total NOx), sulfur dioxide (SO2), total volatile organic compounds (TVOC), and ground-level ozone (O3), referred to simply as O3 from now on.
AQ has a direct impact on both human health and the environment (). According to the World Health Organization (WHO) (), 99 % of the world's population breathes air that exceeds the recommended safe limit values of the air quality guidelines (AQG) (). These guidelines specify recommended levels for these pollutants for both short-term and long-term exposure. They are regularly reviewed and updated to incorporate the latest scientific evidence on the health effects of air pollution. This helps governments and authorities establish and implement policies to protect human health from the adverse effects of air pollution.
Among these pollutants, we focus on O3, a highly oxidizing and very reactive gaseous pollutant that is harmful at high levels. Notice that with regard to O3, the AQG target is a concentration of 100 µg/m³ for the daily maximum 8 h average. Continued exposure to levels above those recommended by the AQG may lead to respiratory irritation, lung inflammation, aggravation of respiratory diseases such as asthma or bronchitis, and cell damage, and may have associated effects on the cardiovascular system. Those at the highest risk include children, older adults, people with respiratory or heart conditions, and individuals who spend significant time outdoors ().
This gas is very important to monitor because it is a secondary pollutant: it is generated in cities by complex photochemical reactions when primary pollutants from the combustion of fossil fuels (such as NOx, VOCs, and CO) react with sunlight (). Its concentration therefore indicates the activity of other air pollutants, and it plays a crucial role in the AQ monitoring systems of smart cities that help citizens improve their quality of life. It is worth mentioning that increasing the spatial sampling resolution of this pollutant is being recommended, ideally to at least one sample per 100 , according to Annex III-B of the European Parliament Directive (). Low-cost sensors (LCSs) are becoming increasingly important and offer an interesting alternative, but they do not match the accuracy of regulated equipment () due to limitations in their sensing technology, lack of frequent calibration, sensitivity to environmental factors, cross-sensitivity issues, use of less durable materials, and the absence of rigorous certification processes. While regulated equipment uses advanced technologies and is subject to strict standards of accuracy and reliability, LCSs are designed to offer basic monitoring at a low price, which involves sacrifices in accuracy and durability. In this context, the challenge is to estimate the regulated measurements from these LCSs with a reduced error ().
Artificial intelligence (AI) techniques are valuable for environmental research due to their capacity to process large datasets and identify patterns that enhance system explainability and clarify the behavior of these AQ parameters ().
In this paper, we show that machine learning (ML) models, particularly ensemble models, can correct the raw readings from LCSs by incorporating additional environmental information, such as temperature (Temp), relative humidity (RH), and other pollutants, as well as metadata to account for sensor aging effects and to improve the models based on road traffic patterns. With these models, we are able to use these sensors to extend the resolution of AQ monitoring networks at low cost, assuming only a small error. This is our main objective. We propose and compare different techniques, reducing the estimation error by up to 94.05 % in a low-concentration scenario (mean value of 55.72 µg/m³). In particular, using the gradient boosting (GB) algorithm, we achieved a mean absolute error (MAE) of 4.022 µg/m³ and a mean relative error (MRE) of 7.21 %, outperforming related works while using sensors approximately 10 times less expensive. We also carry out the calibration process using random forest (RF), adaptive boosting (ADA), and decision tree (DT) models. To train and test these models, we use two datasets from the city of Valencia (Spain), gathered at two locations with the same characteristics (close to the ring road but separated by 4.1 km), covering 165 and 239 days.
2 Related work
Regarding AQ LCSs, due to the increasing market demand, a wide variety are available to measure different pollutants, gases, and particles. These sensors are available in different price ranges and are more affordable compared with standardized measuring stations.
Since AQ considers different pollutants and each sensor measures only one, we analyze sensor modules that embed several of these LCSs. A list of these sensor modules with a cost estimate is given in Table 1. The selection criteria for these modules are determined by related work, choosing those modules that have been considered in studies similar to the one proposed here. We must stress that these modules have different costs due to their quality, order quantity, country, etc., which we classify as Low (less than USD 10), Mid-Low (USD 100–200), Mid-High (USD 600–1000), and High (more than USD 4000). A larger selection and comparison of these LCS modules is given in and .
Table 1
AQ sensor modules with cost estimate: Low (less than USD 10), Mid-Low (USD 100–200), Mid-High (USD 600–1000), and High (more than USD 4000).
| Module | Sensors | Price range |
|---|---|---|
| SDS011 () | Temp, RH, PM, PA | Low |
| DL-LP8P () | Temp, RH, , PA | Low |
| MiCS-6814 () | , , , , | Low |
| ZPHS01B () | Temp, RH, , , , , , | Mid-Low |
| Sensit RAMP () | , , , , , | High |
| AirSensEUR () | , , , , , | Mid-High |
Note that LCSs are designed for basic monitoring at a low cost, which compromises accuracy and durability. This list contains several types of LCSs. Optical sensors, such as those in the SDS011 () and DL-LP8P (), measure the amount of light absorbed by a given gas. Metal-oxide sensors, such as (), measure the change in electrical conductivity of a semiconductor due to the presence of certain gases. Sensors of this type are usually the cheapest and are particularly susceptible to cross-sensitivities. Electrochemical sensors have higher selectivity and are good for measuring specific gases, but they are more expensive; Sensit RAMP () and AirSensEUR () use this type of sensor. Finally, the ZPHS01B module () integrates optical, metal-oxide, and electrochemical sensors. It is a Mid-Low price module with the best price/sensor ratio.
Since one of the keys to improving the accuracy of these LCSs is the use of marginal information (such as Temp and RH as well as other AQ pollutants) exploited through AI techniques (), as mentioned before, it is necessary to use multi-gas modules embedding as many AQ LCSs as possible.
Thus, among the different low-cost alternatives, and taking into account the number of sensors and the price/sensor ratio, the ZPHS01B () is the AQ sensor module that best meets the needs and objectives of this study at the time of writing, since it embeds nine different sensors: Temp (°C); RH (%); CO2, CO, O3, and NO2, which are measured in parts per million (ppm); formaldehyde (CH2O), which is measured in mg/m³; PM, which is measured in µg/m³; and TVOC, which is measured using four levels according to its concentration (0 – very low, 1 – low, 2 – intermediate, and 3 – high). Table 2 summarizes all this information. Notice that the O3 sensor used in this module is the electrochemical ZE27-O3 (), which measures in the range 0–10 ppm with a resolution of 0.01 ppm; it operates with an accuracy of ±0.1 ppm when the concentration is at most 1 ppm and ±20 % when the concentration is above 1 ppm. Also, notice that the PM readings in this module are given for PM2.5 (fine particles with a diameter of 2.5 µm or less), and PM1.0 and PM10 are estimated from the PM2.5 readings.
Table 2 AQ information from the ZPHS01B module and units.
| Parameter | Unit | Range of measurement |
|---|---|---|
| Temp | °C | −20 to 65 |
| RH | % | 0 to 100 |
| PM2.5 | µg/m³ | 0 to 1000 |
| TVOC | levels | 0 to 3 |
| CH2O | mg/m³ | 0 to 6.25 |
| CO2 | ppm | 0 to 5000 |
| CO | ppm | 0 to 500 |
| O3 | ppm | 0 to 10 |
| NO2 | ppm | 0.1 to 10 |
There are several research works and projects based on this ZPHS01B module. One shows the implementation of a device for outdoor AQ evaluation, using this module directly, without calibration, to map AQ pollutants in a metropolitan area. Another uses this module in an AQ monitoring network where different neural networks have been trained to forecast pollutant concentrations, with an estimation error of 7.2 % on average and a calibration process that is performed daily but not specified. A third uses this module for indoor AQ monitoring and for calculating an AQ index. Another briefly explains the use of a neural network to determine (classify) types of air: with or without pollution. Finally, one shows a prototype to measure ground-to-stratosphere AQ using this module on a drone; however, the variability among the individual sensors is high, stressing that the calibration process is complex, and it has not been done.
Regarding LCS performance analysis, a 2-week assessment of various LCS models was conducted in Aveiro (Portugal). Specifically for O3, the best performance against a reference station was achieved by the MiCS-OZ-47 and Alphasense B4 electrochemical sensors, which obtained coefficient of determination (R²) values (and MAE in parts per billion, ppb) of 0.77 (7.66) and 0.70 (2.4), respectively.
The calibration process of these LCSs is a challenge, as mentioned before, and ML and deep learning (DL) models can be used for it. In one study, a low-cost multi-parameter AQ system measuring several pollutants along with Temp and RH is proposed, using and evaluating various calibration algorithms. For O3, the algorithms are ranked from best to worst fit as follows: RF, k-nearest neighbors (KNN), back propagation (BP), genetic algorithm back propagation (GABP), and multiple linear regression (MLR), with R² values (MAE, in µg/m³) of 0.98 (2.88), 0.87 (7.33), 0.83 (11.14), 0.83 (10.90), and 0.74 (13.46), respectively. With a mean O3 concentration of approximately 70 µg/m³, as shown in their Fig. 12, the RF model achieves an MRE of 4.11 %. In another work, based on O3 and NO2 metal-oxide sensors along with Temp and RH, the authors analyzed different calibration options using uni-variate/multi-variate, linear/nonlinear, and parametric/non-parametric approaches with algorithms such as linear regression (LR), nonlinear regression (NLR), support vector machines (SVMs), RF, and GB. They concluded that multiple random forest (MRF) achieved the highest accuracy during Phase I (pre-deployment), with an R² of 0.98 and MAE (MRE) of 4.31 µg/m³ (5.74 %), considering a mean O3 of 75 µg/m³ during their deployments, depicted in their Figs. 3 and 7. However, in Phase II (field validation), conducted at a different location, the performance worsened, with MAE (MRE) of 22.22 µg/m³ (29.62 %) for MRF and 12.96 µg/m³ (17.28 %) for MLR; in this case, MLR provided better results. The authors conclude that MLR may be a more suitable solution for representing physical models beyond the Phase I calibration dataset, demonstrating better transferability across diverse spatial and temporal settings, and they highlight that parametric models such as MLR have a defined equation with only a few parameters, making them easier to adjust for potential changes over time. In a third work, the authors propose a category-based (piecewise) calibration approach using ML, which builds separate regression models for different pollutant concentration levels. This proposal is tested on data from two Chinese cities, Fuzhou and Lanzhou, with good and bad AQ and mean concentrations of 69.545 and 49.781 µg/m³, respectively, for 11 months (48 weeks). The best metrics are achieved by the extreme GB and light GB machine algorithms (outperforming linear regression and RF), with MAE (µg/m³) of 10.75 and 10.98, respectively, in Lanzhou and 13.83 and 14.98 in Fuzhou, with an MRE greater than 19.88 %. In a fourth work, calibration models (using 16 weeks of data) are shown to improve sensor performance, highlighting that the RF approach is more robust since it accounts for pollutant cross-sensitivities; using specific LCSs (the RAMP system), they achieve an MRE of 15 % for O3. In another study, the calibration of an aerosol sensor for PM is carried out by comparing simple linear regression models with GB using the PPD42 PM sensor (). The study concludes that gradient boosting performed better and significantly improved the performance of the sensors, reaching an R² of up to 0.76. Further work shows that neural networks (NNs) generally outperform linear models in quantifying several gases in ambient air using gas sensors integrated into U-Pod AQ monitors; the authors also highlight that NNs capture the complex nonlinear interactions among multiple gas sensors, considering factors such as Temp, RH, and atmospheric chemistry. Finally, dynamic NNs have been used for calibration, achieving models with R² (MAE, in ppb) of 0.69 (7.45) and an MRE of 42 %.
In this context, when using AI techniques for environmental research, it is important to follow the recommendations and good practices derived from a review of more than 148 highly cited research papers. That review highlights that data preprocessing, analysis, and interpretability steps, such as feature importance analysis (FIA), principal component analysis (PCA), and feature selection (FS), are often overlooked as part of the exploratory data analysis (EDA). A good example of the use of these good practices is shown in . In addition, the review notes that the process of optimizing algorithms through the selection of their hyperparameters (hyperparameter optimization, HPO) is neglected in most of the environmental research studies considered. For instance, in one related work, better results are obtained with GB, but HPO, which could allow further improvement of the results, is not performed on the model. Other works take into account some aspects of data analysis focused on the optimization of the problem, but they do not carry out an HPO. One work carries out a simple form of HPO based on raw tests of different architectures, modifying hyperparameters such as the number of hidden layers of the model, tapped delay length, and feedback delay line length, and concludes that a dynamic approach to these parameters improves the results with respect to a static approach that keeps their values fixed.
Regarding the selection of parameters, some works do not perform an analysis using techniques such as the aforementioned FIA and FS but rather a sensitivity analysis using different meteorological variables (such as Temp and RH), determining that this is useful information for GB. In another, quantifying the importance of the model variables is mentioned as a means to understand which information is useful, concluding that for RF it is very helpful to add information beyond the AQ measurements, such as Temp and RH. Other works do not include a specific analysis of the relative importance of the different variables or features. However, a good example of FS is depicted in , where it is shown that identifying the environmental factors affecting LCSs is crucial for improving data quality using data fusion and ML; these factors are then incorporated into the development of the calibration model.
In conclusion, in order to increase the resolution of city-scale AQ monitoring according to the recommendations mentioned before, it is necessary to perform a calibration process for these LCSs. In this scenario, we focus on calibration using ensemble ML techniques to minimize inaccuracies and nonlinearities, comparing four different models and considering different environmental variables as well as metadata, mainly to account for sensor aging effects. For this purpose, it is necessary to carry out a thorough data treatment with good-practice criteria (), including HPO, FIA, and/or FS, which are usually overlooked. In a scenario with low O3 concentration, we achieve interesting results compared with related works, as shown in Sect. 3.4.
3 Building the datasets and using machine learning algorithms
In this section, we explain the process of gathering AQ monitoring information from a prototyped low-cost internet of things (IoT) node based on the ZPHS01B AQ module, how it is deployed, and how the datasets are built to apply ML techniques for calibration purposes. For this, we generate two datasets in the city of Valencia (Spain) at two different locations with the same characteristics (close to the ring road but separated by 4.1 km), covering periods of 165 and 239 days.
3.1 Building the datasets
To calibrate the O3 sensor from the ZPHS01B module, we require a dataset (named dataset-1) to train the various ML models. For this purpose, we use as reference values the O3 concentration readings from the standardized AQ station of the Valencian AQ Monitoring Network (VAQMN) at Bulevar Sur (Valencia, Spain), managed by the Generalitat Valenciana (GVA), with latitude and longitude 39.450389 and −0.396324, respectively, as shown in Fig. 1. In this picture, the IoT node is shown placed 4 m above ground level in accordance with Directive 2008/50/EC. These reference values are given in µg/m³, periodically averaged every 10 min. The AQ station data is retrieved from . Dataset-1 covers 165 days, from 8 June to 20 November 2023. The ZPHS01B module's readings are taken at a rate of 10 samples per minute, i.e., one sample every 6 s. Notice that, as a first approach, creating a dataset that mixes different locations is not recommended, as it could alter the environmental conditions and interfere with the training process. That is the reason we generate two different and independent datasets.
Figure 1
Detail of the standardized AQ monitoring station and the AQ node with a ZPHS01B module located at Bulevar Sur (Valencia, Spain).
[Figure omitted. See PDF]
Table 3 presents the structure and main statistics of dataset-1. The units used for O3 concentration by the standardized and regulated station are µg/m³, while the ZPHS01B module uses ppm. Both units are typically used in formal and academic contexts, but we need to standardize them. The conversion for O3 at standard conditions is: concentration (µg/m³) = molecular weight (48 g/mol) × concentration (ppb) / 24.45; that is, 1 ppb is 1.96 µg/m³ ().
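As a minimal sketch of this conversion (the function and constant names are ours, not from the original code; only the constants come from the formula above), it can be applied directly to the module's ppm readings:

```python
# ppm -> ug/m3 conversion for O3 at standard conditions (25 degC, 1 atm).
O3_MOLAR_MASS = 48.0    # g/mol
MOLAR_VOLUME = 24.45    # L/mol at standard conditions

def o3_ppm_to_ugm3(ppm: float) -> float:
    ppb = ppm * 1000.0                           # 1 ppm = 1000 ppb
    return ppb * O3_MOLAR_MASS / MOLAR_VOLUME    # 1 ppb ~= 1.96 ug/m3

print(o3_ppm_to_ugm3(0.05))  # 0.05 ppm -> ~98.2 ug/m3
```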
Table 3 Summary of the main statistics of dataset-1: minimum (Min.), maximum (Max.), mean (Mean), standard deviation (SD), median absolute deviation (MAD), percentage of samples taking different values (Diff.), and high correlation (High corr.).
| | Temp | RH | | | | | | TVOC | O3 (LCS) | O3 (ref.) |
|---|---|---|---|---|---|---|---|---|---|---|
| | [°C] | [%] | [] | [] | [] | [] | [] | [levels] | [µg/m³] | [µg/m³] |
| Min | 5.24 | 62.29 | 21.25 | 693.43 | 0.78 | 0 | 0.005 | 0 | 39.57 | 8.71 |
| Max | 42.26 | 118 | 83.69 | 1792.50 | 18.81 | 0.75 | 1.21 | 2.95 | 255.76 | 97.85 |
| Mean | 20.60 | 91.31 | 49.99 | 780.33 | 15.27 | 0.34 | 0.021 | 0.024 | 114.39 | 55.72 |
| SD | 5.70 | 18.12 | 18.14 | 57.16 | 5.65 | 0.28 | 0.02 | 0.13 | 67.11 | 24.83 |
| MAD | 3.92 | 16.37 | 13.31 | 24.53 | 0.59 | 0 | 0.001 | 0 | 51.40 | 16.21 |
| Diff. | 99.1 % | 81.9 % | 87.9 % | 97.5 % | 50.6 % | 0.2 % | 81.2 % | 5.8 % | 75.0 % | 30.3 % |
| High corr. | yes | yes | yes | yes | no | yes | no | no | yes | yes |
Figure 2 shows the IoT node (and its housing), which keeps the ZPHS01B module within a PVC pipe with a small fan at the top to ensure air circulation. The head of this node houses the microcontroller, which sends the data via LTE-M communications. Further detail about the features of this node is given in .
Figure 2
(1) AQ IoT node; (2) deployment detail; (3) hardware detail.
[Figure omitted. See PDF]
In addition, in order to test the models proposed in this paper and their generalization in Sect. 3.4, we have used another dataset (named dataset-2) built with two different AQ IoT nodes (Node 1 and Node 2) at the standardized AQ monitoring station called Moli del Sol (Valencia, Spain), with latitude and longitude 39.48113875 and −0.40855865. This station is 4.1 km away from the previous one. Its data is retrieved from . This dataset covers 239 days, from 31 May 2024 to 25 January 2025. From now on, we will refer to dataset-1 as the dataset, except in Sect. 3.4, where we generalize the models with dataset-2.
3.2 Analyzing the dataset
The initial data collection (dataset-1) is based on samples taken at a 6 s frequency. From this collection, three datasets have been created by averaging the data over different monitoring intervals: 10 min, 30 min, and 1 h, with 23 496, 7843, and 3922 samples, respectively. The lowest interval, 10 min, is given by the standardized AQ station, while 30 min and 1 h are common time bases for AQ parameters. Although they are not large datasets, they are sufficient, as shown in , due to the relationship (ratio) between sample size and feature size, with four features in total, as seen next. This ratio is called the sample-size to feature-size ratio (SFR), with an SFR higher than 500 recommended. More detail is given in Sect. 3.4.
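The averaging step can be reproduced with a pandas resampling, sketched below under the assumption that the raw 6 s readings are indexed by timestamp (the file and column names are hypothetical):

```python
import pandas as pd

# Load the raw 6 s samples (hypothetical file/column names).
raw = pd.read_csv("dataset1_raw.csv", parse_dates=["date"], index_col="date")

# Average into the three monitoring intervals used in this study.
datasets = {rule: raw.resample(rule).mean() for rule in ("10min", "30min", "1h")}

# Sample-size to feature-size ratio (SFR) with the four final features;
# values above 500 are recommended.
for rule, df in datasets.items():
    print(rule, len(df), len(df) / 4)
```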
Initially, the datasets were cleaned of invalid data. Notice that the readings of the standardized AQ station contained 275 not-a-number (NaN) results during this period, which, in our case, were replaced using quadratic interpolation, since this method experimentally gave better results and kept the interpolation closer to the ozone signal. This step of preparing the dataset, also known as missing data management (MDM), is recommended according to .
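A sketch of this MDM step, assuming the reference readings live in a hypothetical o3_ref column (pandas delegates quadratic interpolation to SciPy, so scipy must be installed):

```python
import pandas as pd

# Replace NaN readings from the reference station by quadratic interpolation.
ref = pd.read_csv("reference_10min.csv", parse_dates=["date"], index_col="date")
print("NaN count:", ref["o3_ref"].isna().sum())   # 275 in our period
ref["o3_ref"] = ref["o3_ref"].interpolate(method="quadratic")
```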
Table 3 shows a summary of the main statistics of dataset-1. For each parameter, the table shows the minimum value (Min.), maximum value (Max.), mean value of all entries (Mean), standard deviation (SD), and median absolute deviation (MAD), as well as the percentage of samples taking different values (Diff.) and whether the parameter has a high correlation (High corr.) with the others.
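These statistics, including the two less common ones (MAD and Diff.), can be computed as in the following sketch, where df is assumed to hold one column per sensor parameter (file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("dataset1_10min.csv", parse_dates=["date"], index_col="date")

stats = pd.DataFrame({
    "Min": df.min(),
    "Max": df.max(),
    "Mean": df.mean(),
    "SD": df.std(),
    # median absolute deviation: median of |x - median(x)|
    "MAD": (df - df.median()).abs().median(),
    # percentage of samples taking distinct values
    "Diff.": 100.0 * df.nunique() / len(df),
})
print(stats.round(2))
```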
From these results, it is worth mentioning that four of the sensors in the ZPHS01B module do not seem to be working properly: three of them are almost always stuck at values close to zero, appearing not to respond at normal concentrations and showing very low variability, while the remaining one appears saturated. Thus, in practice, the number of features used from Table 2 is five; that is, from the initial nine (the reference is not included), we remove these four malfunctioning sensors. Also, the RH sensor has a positive offset, as we can see from its maximum value of 118 %.
Figure 3 shows the O3 readings in µg/m³ from the LCS and from the regulated station (reference) for 1 week. It can be seen that there is an offset in the LCS readings with respect to those of the reference. It is also clear that the LCS captures the trends, which is useful information for the ML models. A further analysis of these sensor readings in the frequency domain is shown in Appendix A, where a repeated daily pattern is clearly observed, as expected given how O3 is generated from other pollutants produced by road traffic and complex photochemical reactions, as discussed in Sect. 1.
Figure 3
O3 readings in µg/m³ from the LCS and the reference for 1 week.
[Figure omitted. See PDF]
3.3 Feature importance analysis and selection
FIA and FS play crucial roles in ML models, especially in environmental research, by helping to preserve essential features (variables), reduce noise, and enhance model efficiency, which is particularly relevant when dealing with a small set of samples or a large number of variables ().
Table 4 shows the normalized output of the FIA, obtained using the scikit-learn library (), for the parameters complementary to O3 for each ML model used. In order to determine the most useful parameters for the models, a threshold is established at 0.08, that is, 8 % of importance; the parameters above it are shown in bold. Notice that the set of parameters with the highest importance is the same for all models: Temp, RH, and two further parameters.
Table 4
FIA of ozone's complementary parameters for RF, GB, ADA, and DT, with the selected ones (contribution higher than 0.08) in bold.
| Model | Temp | RH | TVOC | ||||||
|---|---|---|---|---|---|---|---|---|---|
| RF | 0.128 | 0.103 | 0.069 | 0.222 | 0.078 | 0.269 | 0.002 | 0.003 | 0.064 |
| GB | 0.107 | 0.105 | 0.052 | 0.211 | 0.057 | 0.253 | 0.001 | 0.001 | 0.068 |
| ADA | 0.119 | 0.097 | 0.064 | 0.246 | 0.067 | 0.287 | 0.001 | 0.001 | 0.066 |
| DT | 0.115 | 0.088 | 0.070 | 0.232 | 0.061 | 0.276 | 0.001 | 0.002 | 0.061 |
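A minimal sketch of this FIA step with scikit-learn, using the impurity-based feature_importances_ attribute and the 0.08 threshold described above (the data loading and column names are assumptions, not the original code):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("dataset1_10min.csv", parse_dates=["date"], index_col="date")
X = df.drop(columns="o3_ref")   # complementary parameters (hypothetical names)
y = df["o3_ref"]                # reference O3

model = RandomForestRegressor(random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep only features contributing at least 8 % of the total importance.
selected = importances[importances >= 0.08].index.tolist()
print(importances.sort_values(ascending=False).round(3))
print("selected:", selected)
```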
With regard to FS, Fig. 4 shows the correlation matrix for these variables. There is a high correlation among all the PM readings because all of them are calculated directly from PM2.5 (). Also, from this analysis, we stress that Temp and RH are the best correlated with the rest of the variables, followed by the LCS O3, the reference O3, and two other variables, which show a lower correlation. All this information is very valuable for training the ML models.
Figure 4
LCS readings and reference correlation matrix.
[Figure omitted. See PDF]
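The correlation analysis itself is one call in pandas; the sketch below plots a Pearson correlation matrix like the one in Fig. 4 (column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset1_10min.csv", parse_dates=["date"], index_col="date")
corr = df.corr(method="pearson")   # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=90)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
plt.show()
```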
3.4 Applying machine learning algorithms
As mentioned before, in environmental research the use of ML algorithms, in particular ensemble models, has increased significantly compared with DL (). Some of the most popular ensemble algorithms are RF and GB-related models (). Furthermore, based on our experience, we recognize that in AQ monitoring scenarios using LCSs such as the ZPHS01B module, datasets are often limited and constrained, which hinders the use of DL techniques, as they tend to overfit.
This paper evaluates three ensemble ML algorithms, RF, GB, and ADA, implemented in the scikit-learn () library (in the ensemble submodule), which offers efficient solutions for time series regression problems such as this one. These methods can handle nonlinear relationships and adapt to changing patterns over time. In addition, the DT model, from the tree submodule of scikit-learn, is also evaluated, since it is a common base for this type of ensemble algorithm.
To optimize these models as indicated in , there are different techniques and tools for carrying out the HPO, with GridSearch being the most commonly used method to obtain a good configuration for these algorithms. GridSearch in scikit-learn is a hyperparameter tuning technique that exhaustively searches through a user-defined hyperparameter space to find the optimal combination for an ML model; hyperparameters are model-specific configuration settings that are external to the training process. The method systematically evaluates the model's performance across all user-defined hyperparameter combinations using cross-validation, aiming to identify the configuration that maximizes estimation accuracy or minimizes a specified loss function (a minimal sketch of this search is given after Table 5). We choose this method because of its higher flexibility compared with other tools, such as RandomSearch (), which follows a more random approach.
Next, we discuss the different supervised ML algorithms used and the selection of the different hyperparameters, taking into account the best results in terms of the coefficient of determination (R²), root mean square error (RMSE), and MAE.
3.4.1 Random forest (RF)
RF is an ensemble algorithm that relies on constructing multiple DTs during training. Each tree is trained on a random subset of the dataset, and the final predictions are obtained by averaging the individual predictions of all of them. This “forest” approach helps to mitigate overfitting and improves the model's generalization. Furthermore, introducing randomness in the selection of features and samples during tree construction contributes to a more robust and accurate model for regression tasks. Table 5 shows the hyperparameters evaluated, with the best option in bold. The number of estimators refers to the number of trees in the forest, while the maximum depth refers to the maximum depth of each tree. The maximum features hyperparameter sets the upper limit on the number of features considered when splitting a node into two child nodes during tree construction. Note that since the number of estimators does not have a significant role in this use case, we use the default value of 100.
Table 5
RF hyperparameters evaluated on GridSearch, showing in bold the combination that gives the best results in terms of R², RMSE, and MAE.
| No. of estimators | Max. depth | Max. features |
|---|---|---|
| 50, 100, 250, 500, 900 | 2, 5, 7, none | sqrt, log2, 0.1, 0.3, 0.5, 1.0 |
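As a sketch, the search over the Table 5 grid looks as follows in scikit-learn; the data stub is synthetic, and the best combination found on our dataset is not hard-coded here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features/target described in Sect. 3.2.
rng = np.random.default_rng(0)
X, y = rng.random((500, 4)), rng.random(500)

param_grid = {                                   # grid from Table 5
    "n_estimators": [50, 100, 250, 500, 900],
    "max_depth": [2, 5, 7, None],
    "max_features": ["sqrt", "log2", 0.1, 0.3, 0.5, 1.0],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_absolute_error", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```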
3.4.2 Gradient boosting (GB)
GB is an ensemble algorithm based on the iterative construction of weak DTs, which are sequentially aggregated to enhance the predictive capability of the model. In each iteration, it focuses on correcting the residual errors of the existing model by fitting a new DT to capture the deficiencies of the current model. The weighting of individual trees is determined by a learning rate, and the final output of the model is the weighted sum of the predictions from all these trees. This gradual building process and the ability to handle nonlinear relationships in the data make GB effective for regression tasks. Table 6 shows the hyperparameters evaluated, with the best option in bold. In addition to the previous hyperparameters, the loss hyperparameter refers to the loss function to be optimized, while the learning rate shrinks the contribution of each tree by the given factor. The subsample hyperparameter represents the fraction of samples used to fit the individual base learners; if it is less than 1.0, the result is stochastic gradient boosting (SGB).
Table 6
GB hyperparameters evaluated on GridSearch, showing in bold the combination that gives the best results in terms of R², RMSE, and MAE.
| No. of estimators | Max. depth | Max. features | Learning rate | Subsample | Loss |
|---|---|---|---|---|---|
| 50, 100, 250, 500, 900 | 2, 5, 7, none | sqrt, log2, 0.1, 0.3, 0.5, 1.0 | 0.01, 0.05, 0.1, 0.3 | 0.5, 0.8, 1.0 | squared err., absolute err., huber |
3.4.3 Adaptive boosting (ADA)
ADA is an ensemble algorithm whose primary goal is to improve predictive accuracy by combining multiple weak regression models. During training, ADA assigns weights to data instances, giving more emphasis to observations that were poorly predicted in previous iterations. Its construction involves the sequential aggregation of regression models, each fitted to correct the errors of the existing combined model. The final model is a weighted combination of the individual predictions of the base models. ADA is particularly effective in enhancing generalization capability and reducing overfitting in regression tasks. Table 7 shows the hyperparameters evaluated, with the best option in bold. In this model, a key point is to run the optimization process on the estimator hyperparameter, which by default is an instance of type DecisionTreeRegressor initialized with a maximum depth of 3; if the value of this hyperparameter is not modified, the model is largely constrained (see the sketch after the table). Also, notice that since the number of estimators does not have a significant role in this use case, we use the default value of 50 estimators. The other hyperparameters have the same meaning as in the previous models.
Table 7
ADA hyperparameters evaluated on GridSearch, showing in bold the combination that gives the best results in terms of R², RMSE, and MAE.
| No. of estimators | Learning rate | Loss |
|---|---|---|
| 50, 100, 250, 500, 900 | 0.01, 0.05, 0.1, 0.3 | linear, square, exponential |
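The sketch below shows how the base estimator can be widened in recent scikit-learn versions (the estimator keyword; older versions call it base_estimator). The depth value here is illustrative, not the tuned optimum:

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Without widening the base estimator (default: DecisionTreeRegressor with
# max_depth=3), ADA stays strongly constrained.
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=7),
    n_estimators=50,       # default; not influential in this use case
    learning_rate=0.1,
    loss="linear",         # alternatives: "square", "exponential"
    random_state=0,
)
```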
3.4.4 Decision tree (DT)
DT is an algorithm that recursively partitions the dataset based on features, aiming to create a hierarchical structure of decision nodes to make predictions. Table 8 shows the hyperparameters evaluated, with the best option in bold. The splitter hyperparameter indicates the strategy used to perform the split at each node.
Table 8
DT hyperparameters evaluated on GridSearch, showing in bold the combination that gives the best results in terms of R², RMSE, and MAE.
| Max. depth | Max. features | Splitter |
|---|---|---|
| 2, 5, 7, none | sqrt, log2, 0.1, 0.3, 0.5, 1.0 | best, random |
We evaluated the performance metrics of these ML models under different configurations (in terms of R², RMSE and MAE in µg/m³, mean absolute percentage error (MAPE), and execution time in seconds), with the optimized hyperparameters that achieve the highest R² and the lowest errors. We also used the three datasets given by the different monitoring intervals, 10 min, 30 min, and 1 h, as described in Sect. 3.2, and tested different training–test split percentages: 60 %–40 %, 70 %–30 %, 80 %–20 %, and 90 %–10 %. Note that when we split the dataset for training and testing, both sets remain independent and isolated. However, during the training process itself, the dataset is further divided into two parts: one for training and the other for validation. By default, we allocate 80 % of the data for training and 20 % for validation, and the training and validation subsets are recombined across the different iterations. From all these combinations, we achieved the best results in terms of these performance metrics with a 90 %–10 % training–test ratio and a monitoring interval of 10 min, as shown in Table 9. Also, following the analysis carried out in Sect. 3.3, we proceed in this section with the feature set that provides the best results, based on [date, O3, Temp, RH]. Notice that “date” is included as metadata to account for the aging effect and to improve the models following the traffic pattern. We see that the fewer the features, the better the results, i.e., the higher the SFR; therefore, other dimensionality reduction techniques are not required, and adding less significant features would only impoverish the dataset. Notice that the performance metrics shown in Table 9 are the weighted average of each metric over 100 different iterations, changing the content of the training and test sets in each one to obtain results with the minimum possible bias.
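The evaluation protocol can be sketched as follows: repeated random splits whose metrics are averaged (the data stub is synthetic; in the paper, the 100 iterations run over the real 10 min dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((1000, 4)), rng.random(1000)   # stand-in for [date, O3, Temp, RH] -> reference

scores = {"MAE": [], "RMSE": [], "R2": []}
for seed in range(100):                          # change the split content each iteration
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=seed)
    pred = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr).predict(X_te)
    scores["MAE"].append(mean_absolute_error(y_te, pred))
    scores["RMSE"].append(mean_squared_error(y_te, pred) ** 0.5)
    scores["R2"].append(r2_score(y_te, pred))

print({name: round(float(np.mean(vals)), 3) for name, vals in scores.items()})
```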
Table 9
Performance metrics for the HPO models with 90 %–10 % and 80 %–20 % (training–testing) ratios.
| Model | GB | | RF | | ADA | | DT | |
|---|---|---|---|---|---|---|---|---|
| Ratio | 90/10 | 80/20 | 90/10 | 80/20 | 90/10 | 80/20 | 90/10 | 80/20 |
| R² | 0.938 | 0.936 | 0.927 | 0.924 | 0.922 | 0.920 | 0.878 | 0.863 |
| RMSE | 6.492 | 6.664 | 7.093 | 7.253 | 7.289 | 7.416 | 9.149 | 9.735 |
| MAE | 4.022 | 4.221 | 4.185 | 4.415 | 3.642 | 3.833 | 4.684 | 5.104 |
| MAPE | 0.194 | 0.206 | 0.208 | 0.228 | 0.160 | 0.175 | 0.206 | 0.226 |
| Time [s] | 66.937 | 61.054 | 18.316 | 16.618 | 7.805 | 7.078 | 0.212 | 0.194 |
During the training process, we can observe the convergence of the performance metrics, which provides information about overfitting, considering both the training and validation datasets. Appendix B includes this information, analyzing both R² and RMSE across iterations; each model uses its own reference hyperparameter for convergence. In particular, in Fig. B1, we observe the fit of the model in terms of R², with a better fit for training than for validation, as expected, and similarly for RMSE in Fig. B2. It should be noted that the convergence process on the training set does not reach a perfect fit in any case, which initially justifies and supports the conclusion that there is no overfitting in the models. Moreover, the R² and RMSE scores achieved for both training and validation are better than the values shown for testing in Table 9, because the testing dataset does not participate in the training process.
It is worth mentioning that the improvement achieved by HPO is greater for the GB and ADA models than for RF and DT, which are already well optimized with their default values. In particular, for the optimized GB and ADA models, R² improves by 42 % and 182 %, respectively, while RMSE is reduced by 57 % and 66 %. However, the execution time required for training is affected by HPO, increasing to 66.937 and 7.805 s for GB and ADA, respectively, as shown in Table 9. We highlight that RF and DT are already well optimized, and their execution times remain unchanged between the default and optimized versions.
Figure 5 shows the calibration achieved by the different algorithms, for both the default and the HPO models, vs. the reference.
Figure 5
Ozone calibration done with the default and optimized models with the 90 %–10 % (training–testing) ratio dataset.
[Figure omitted. See PDF]
However, it is common to use an 80 %–20 % training–testing ratio (). For this ratio, Table 9 also shows the results with the models optimized by HPO, where the GB model is again the best, as with the previous ratio.
A summary of these metrics (R² and errors) for the GB model, with different monitoring intervals and different training–testing split percentages, is shown in Fig. 6. We can see that increasing the training percentage tends to improve the accuracy of the model (R² getting closer to 1) and to slightly reduce the errors, but it increases the training time, as could be expected. Similar behaviors are exhibited by the other models, particularly ADA. Regarding overfitting, Fig. 6 shows that the error difference between using 90 % and 60 % of the data for training (the maximum and minimum percentages, respectively) is approximately 2 % in the worst-case scenario (the 1 h dataset). This suggests that overfitting is not significant in the proposed model, as mentioned before regarding the convergence process shown in Appendix B.
Figure 6
Ozone estimation analysis for GB default and optimized models with different % training datasets and monitoring intervals.
[Figure omitted. See PDF]
In terms of generalization, as mentioned in Sect. 3.1, we evaluated the same proposed models on dataset-2 under the same conditions, with a 90 %–10 % (training–testing) ratio. In Fig. 7, we summarize the performance metrics given by the best model, based on GB, for dataset-1 and for Node 1 and Node 2 of dataset-2. In particular, if we focus on MAE, we see that Node 2 performs slightly better than Node 1 in dataset-2, likely due to the manufacturing variations associated with their low cost; the results from dataset-1 lie between these two, validating the generalized behavior. In terms of RMSE, the results from dataset-2 with both nodes are slightly better, since that dataset is larger. In all these cases, R² is higher than 0.938.
Figure 7
Performance metrics comparison using the optimized GB algorithm with a 90 %–10 % (training–testing) ratio over dataset-1 and dataset-2 (Node 1 and Node 2).
[Figure omitted. See PDF]
In Table 10, we show the percentage improvement achieved by the different ML models in the calibration process with respect to the raw LCS readings of the module, highlighting the better performance of the GB model compared with the other models. Notice that with this GB model, the initial MAE of 67.59 µg/m³ from the raw readings is reduced to 4.022 µg/m³, that is, an improvement of 94.05 %, as depicted in the table.
Table 10 Improvement (in %) of the calibration with respect to the raw readings for the different optimized models.
| | GB [%] | RF [%] | ADA [%] | DT [%] |
|---|---|---|---|---|
| R² | 258.13 | 256.27 | 256.1 | 246.27 |
| RMSE | 93.05 | 92.43 | 92.29 | 89.85 |
| MAE | 94.05 | 93.82 | 94.59 | 92.79 |
| MAPE | 62.75 | 58.8 | 68.35 | 59.12 |
Finally, in Table 11, we compare our models for O3 calibration of LCSs against related works with a similar approach, indicating the location, the platform (and sensors used), R², and MRE, along with additional comments on the models used and the dataset duration. First, we must stress that the starting point of the selected papers is slightly different from ours, since these studies have used more reliable and expensive LCSs, approximately 10 times more expensive than the ZPHS01B module. Moreover, since ML-based algorithms show the best results, as discussed in Sect. 2, we have focused exclusively on them, evaluating up to four different models, whereas other studies have considered only one or two. Our model, in particular GB with four features (including “date” as metadata), as shown in Sect. 3.4, achieves an MRE of 7.21 % (given by the MAE of 4.022 µg/m³ with the 90 %–10 % dataset, Table 9, and the mean value of 55.72 µg/m³, Table 3). Also, not all of these works follow and discuss a structured EDA with FIA, FS, and HPO. In particular, compared with the first two works, which report slightly better results, we observe higher O3 values there, with mean values above 70 µg/m³, while in our case the levels are lower (55.72 µg/m³), and a complete EDA is not always present. It is important to note that these sensors perform worse at low concentrations than at high ones due to their sensitivity limitations and the weakness of the signals generated, as well as interference from other pollutants. Finally, in the second of these works, although the authors use a complete EDA, they only use two sensors (O3 and NO2) apart from Temp and RH, and the aging effect is considered a posteriori, while in our case this information is included by adding the date to our models, which also detects other patterns derived from road traffic.
Table 11 Comparison with similar related works.

| Study | Location | Platform, sensor | R² | MRE [%] | Comment |
|---|---|---|---|---|---|
| | Zhengzhou (China) | by Hanwei Electronics Corp, B4 Alphasense | 0.93 | 4.11 | 52-week dataset with RF and HPO |
| | Florence, Montale (Italy) | AirQino LC, MiCS-2714, MiCS-2614 | 0.98 with MRF | I: MRF 5.74, II: MLR 17.28, MRF 29.62 | 61-week dataset with MRF and MLR, using complete EDA |
| | Lanzhou (China) | Sailhero instrument | – | 19.88 | 48-week dataset, category-based calibration (piecewise) with extreme GB and FS |
| | Pittsburgh (USA) | RAMP, Alphasense Ox-B431 | 0.86 | 15 | 16-week dataset with RF |
| | Cambridge (UK) | SnaQ, Alphasense B4 electrochemical | 0.69 | 42 | 5-week dataset using dynamic NN with a kind of HPO |
| Our model | Valencia (Spain) | ZPHS01B, Winsen ZE27 | 0.93 | 7.21 | 57-week dataset using GB with FIA, FS, and HPO |
4 Conclusions
This paper focuses on ground-level ozone (O3), as it serves as an indicator of other pollution levels in urban areas, using LCS nodes based on the ZPHS01B module. These nodes will enable an increase in the spatial sampling of AQ monitoring in cities, following the interests of the AQG () and in line with the future plans of the related directives, ideally at least one sample per 100 , according to Annex III-B of the European Parliament Directive ().
Given the low accuracy and nonlinearities of these nodes' sensors, we employed ML models (particularly DT and the ensemble algorithms GB, RF, and ADA) after a thorough data analysis, considering additional environmental information and including metadata to account for the aging effect and to detect other patterns derived from road traffic. We reduced the estimation error by approximately 94.05 %, and by more than 89 % with the other models. In particular, using the GB algorithm, we achieved an MAE of 4.022 µg/m³ and an MRE of 7.21 %, outperforming related works while using a module approximately 10 times less expensive.
Initially, we used a dataset spanning 165 days (with low O3 concentrations and a mean value of 55.72 µg/m³) with different monitoring intervals, obtaining the best results with a 10 min monitoring interval, as could be expected. With longer monitoring intervals (30 min or 1 h), we start losing detail, smoothing the dataset and overlooking behaviors that help the ML process reduce the estimation error. For the training process, we applied several techniques (FIA and FS) in order to select the most relevant features, applying HPO to the different models, with different percentages for training and testing.
We also verified that, for the ZPHS01B module and O3 calibration, the 165 days of dataset-1 provide sufficient information to generalize the proposed models, comparing them against dataset-2 with its 239 days. This aligns with the SFR values recommended by . Thus, given the features and characteristics of this module, the original dataset (165 days) contains enough information to generalize the behavior of the sensor and its response.
As future work, we plan to expand the dataset and include complementary parameters, such as wind speed or additional metadata variables, to increase the accuracy of these models. In addition, we will focus on the design of new calibration and forecasting algorithms for the different sensors embedded in the low-cost ZPHS01B module in order to improve AQ monitoring resolution.
Appendix A Spectral analysis for low-cost O3 readings from the ZPHS01B module
To characterize the O3 measurements, we carry out a discrete Fourier transform (DFT) analysis to see the changing patterns. The DFT is a mathematical technique that transforms a discrete signal from the time domain to the frequency domain. Figure A1 shows the peaks obtained from the O3 signal. There are two main peaks and their harmonics. The first peak appears at a frequency of 0.00025 h⁻¹, which corresponds to a period of 4000 h (5.56 months), that is, the total duration of the dataset. The second peak reveals a relevant frequency component at 0.04182251 h⁻¹, which represents a period of 23.91 h (approximately 1 d). Thus, there is an O3 pattern that repeats every day, as could be expected in a city, given how O3 is generated from the road traffic of combustion engines, as discussed in Sect. 1.
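A sketch of this analysis with NumPy's FFT routines (the file name is hypothetical; the mean is removed to suppress the DC bin):

```python
import numpy as np

# Amplitude spectrum of the O3 LCS series sampled every 10 min.
x = np.loadtxt("o3_lcs_10min.txt")
dt = 10 / 60.0                                   # sampling interval in hours
spectrum = np.abs(np.fft.rfft(x - x.mean()))
freqs = np.fft.rfftfreq(len(x), d=dt)            # frequencies in h^-1

# The main peaks can then be read off the spectrum; for our data they sit
# near 0.00025 h^-1 (dataset length) and 0.0418 h^-1 (~23.9 h, daily cycle).
peak = freqs[1:][np.argmax(spectrum[1:])]
print(f"strongest component: {peak:.5f} h^-1, period: {1 / peak:.2f} h")
```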
Figure A1
DFT of readings from LCS.
[Figure omitted. See PDF]
Appendix B Model convergence results
In this appendix, we plot R² and RMSE across iterations during the training process, for the training and validation datasets. Each model uses its own reference hyperparameter for convergence. In Fig. B1, we observe the fit of the model in terms of R², with a better fit for training than for validation, as expected, and similarly for RMSE in Fig. B2. It should be noted that the convergence process on the training set does not reach a perfect fit in any case, which initially justifies and supports the conclusion that there is no overfitting in the models.
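For GB, for example, such convergence curves can be obtained with staged_predict, which scores the model after each boosting iteration (sketch with a synthetic data stub):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((1000, 4)), rng.random(1000)   # synthetic stand-in
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# R2 after each boosting iteration, for the training and validation curves.
r2_train = [r2_score(y_tr, p) for p in gb.staged_predict(X_tr)]
r2_val = [r2_score(y_val, p) for p in gb.staged_predict(X_val)]
print(r2_train[-1], r2_val[-1])
```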
Figure B1
Convergence of R² across different iterations for training and validation for the models with their main reference hyperparameter.
[Figure omitted. See PDF]
Figure B2
Convergence of RMSE across different iterations for training and validation for the models with their main reference hyperparameter.
[Figure omitted. See PDF]
Data availability
Dataset-1 and dataset-2 (Node 1 and Node 2) are available in the Zenodo public repository at https://doi.org/10.5281/zenodo.17018565 and https://doi.org/10.5281/zenodo.17044034, respectively.
Author contributions
GMF and SFC contributed equally to the air quality gathering process and the calibration techniques, as well as to coding and manuscript writing. EMA prepared the dataset. JJPS prepared the hardware infrastructure. SFC and JSG managed the funding and external collaborations.
Competing interests
The contact author has declared that none of the authors has any competing interests.
Disclaimer
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
Acknowledgements
We are grateful to the Generalitat Valenciana and its AQ monitoring network, in particular to Rafael Orts Bargues from the Atmospheric Protection Service.
Financial support
This paper is partially funded by the grant PID2021-126823OB-I00 MCIN funded by MCIN/AEI/10.13039/501100011033 and by the European Union Next-Generation EU/PRTR; by the Generalitat Valenciana with grant references CIAICO/2022/179, CIACIF/2023/416, and CIAEST/2022/64; and by the Spanish Ministry of Education in the call for Senior Professors and Researchers to stay in foreign centers for the grant with reference PRX23/00589.
Review statement
This paper was edited by Reem Hannun and reviewed by Ali Kourtiche and one anonymous referee.
© 2025. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.