1 Introduction
China has faced severe environmental problems during the last 2 decades, especially air pollution (An et al., 2019; Chan and Yao, 2008; Z. Li et al., 2017; Q. Zhang et al., 2019; Wei et al., 2021a). The sources of air pollution are numerous, coming from both natural changes (e.g., forest fires, biomass burning) and human activities (e.g., industrial production, transportation) (Huang et al., 2014; Sun et al., 2004; Wei et al., 2019a, b, 2021b). Particulate matter with a diameter of less than 2.5 () has a greater impact on the atmospheric environment and climate change than other air pollutants (e.g., , nitrogen dioxide, , and sulfur dioxide, ) (Jacob and Winner, 2009; Z. Li et al., 2017, 2019; Ramanathan and Feng, 2009). Moreover, they can cause great harm to human health due to their smaller particle size (Delfino et al., 2005; Kampa and Castanas, 2008; Kim et al., 2015; Lelieveld et al., 2015). China has established and operates multiple ground-based observation networks to monitor air pollution in real time across mainland China, including information about pollution.
For near-surface concentrations, the networks provide high-quality measurements every hour (even every few minutes) but with non-uniform coverage. In recent years, an increased effort has been made in estimating with products generated from multiple instruments on sun-synchronous satellites, e.g., the Multi-angle Imaging SpectroRadiometer (MISR) (Liu et al., 2005; van Donkelaar et al., 2006), the Moderate Resolution Imaging Spectroradiometer (MODIS) (Liu et al., 2007; Ma et al., 2014; Wei et al., 2019a, 2020, 2021a), and the Visible Infrared Imaging Radiometer Suite (VIIRS) (Wei et al., 2021c; Wu et al., 2016; Yao et al., 2019). However, due to their low revisit cycles (one or two overpasses per day), they are unable to monitor the diurnal variation in pollution. Currently, most available datasets are at low temporal resolutions that cannot meet the requirements of air pollution real-time monitoring (Lennartson et al., 2018). For example, knowing when heavy pollution might occur during the day, people may adjust their time outdoors doing activities accordingly. Following the launch of the Himawari-8 Advanced Himawari Imager (Himawari-8/AHI) on 7 October 2014 (Bessho et al., 2016; Letu et al., 2020), near-surface concentrations in the Eastern Hemisphere can now be estimated and used to examine their diurnal cycle.
Wang et al. (2017) used the linear mixed-effect (LME) model, and Sun et al. (2019) applied the geographically weighted regression (GWR) and support vector regression (SVR) models to estimate hourly concentrations in the Beijing–Tianjin–Hebei (BTH) region from the Himawari-8 aerosol optical depth (AOD) product. T. Zhang et al. (2019) developed an improved LME model, and Xue et al. (2020) proposed an improved geographically and temporally weighted regression (IGTWR) model to derive hourly maps based on the Himawari-8 AOD product over central and eastern China. In addition to traditional statistical regression models, several artificial intelligence models, including the random forest (RF), the gradient boosting decision tree (GBDT), the eXtreme Gradient Boosting (XGBoost), and the deep neural network (DNN), have been recently successfully adopted to obtain ground-level concentrations in local regions and in the whole of China (Chen et al., 2019; Gui et al., 2020; Liu et al., 2019; Sun et al., 2019; Zhang et al., 2020). Nevertheless, due to their poor data-mining ability, traditional statistical regression methods usually suffer from large uncertainties. While artificial intelligence methods can achieve high accuracies, they are often highly demanding on computational power and are thus often slow. Therefore, spatiotemporal variations in have often been neglected in the models developed in previous studies (Chen et al., 2019; Liu et al., 2019; Sun et al., 2019; Wang et al., 2017; T. Zhang et al., 2019), resulting in relatively low accuracies.
Focusing on the above issues, we have developed a new, highly efficient, and precise method for improving ground-level estimates by incorporating spatial and temporal information into the tree-based Light Gradient Boosting Machine (LightGBM) model. This new model is called the space-time LightGBM (STLG) model, and it has been used to generate a high-quality, high-temporal-resolution (hourly) dataset over eastern China (at a spatial resolution of 5 ) from the Himawari-8/AHI hourly AOD product. Section 2 provides details about the data used and introduces the development of the STLG model. Section 3 validates the hourly estimates and shows the diurnal variations across China. Comparisons with results from traditional models and from previous studies are also presented. Section 4 summarizes the study.
2 Materials and methods
2.1 Data sources
2.1.1
and AOD data
hourly measurements from 1583 monitoring stations across China for the year 2018 were collected (Fig. 1 in Wei et al., 2020). The latest Himawari-8 version 2 hourly 5 AODs at 500 across mainland China for that year were also collected. This AOD product is synthesized from level 2 10 AODs generated by a newly developed Lambertian-surface-assumed aerosol retrieval algorithm (Letu et al., 2020; Yoshida et al., 2018). Himawari-8 AOD retrievals have been preliminarily evaluated against in situ AOD retrievals provided by the Aerosol Robotic Network (Giles et al., 2019) and the Sun–Sky Radiometer Observation Network (Li et al., 2018), showing that they are consistent ( 0.75), with a root-mean-square error (RMSE) and mean absolute error (MAE) of 0.39 and 0.21, respectively (Wei et al., 2019c). Here, only low-uncertainty AOD retrievals (500 ) were selected for estimating concentrations.
2.1.2 Meteorological conditionscan be significantly affected by meteorological conditions (Su et al., 2018). However, most currently available reanalysis meteorological products have low temporal resolutions ( 3–6 h). Recently (14 June 2018), the fifth-generation European Centre for Medium-range Weather Forecasts (ECMWF) global atmospheric reanalysis (ERA5) at a horizontal resolution of 0.25 0.25 has been released, as well as the land version (12 July 2019) at a horizontal resolution of 0.1 0.1, both at an hourly timescale (1979 to the present). Here, we use seven ERA5 hourly meteorological parameters, i.e., the 2 temperature (TEM), total evaporation (ET), relative humidity (RH), 10 u- and v-components of wind, surface pressure (SP), and boundary-layer height (BLH).
2.1.3 Human influences
Human activity is a key factor affecting pollution. The global annual LandScan™ product at a 1 spatial resolution for the year 2018 was selected to obtain the population distribution (POP) (Dobson et al., 2000). Monthly anthropogenic source emission data from the Multi-resolution Emission Inventory for China (MEIC) (M. Li et al., 2017; Zheng et al., 2018) were also employed. This dataset is generated from agricultural, industrial, power, residential, and transportation information obtained at more than 700 anthropogenic sources, including a total of 10 atmospheric pollutants and greenhouse gases. Here, four main precursors were selected, i.e., ammonia (), nitrogen oxides (), , and volatile organic compounds (VOCs), and direct emissions to PM.
2.1.4 Ancillary data
Two additional ancillary datasets, namely, the MODIS monthly normalized difference vegetation index (NDVI) at a horizontal resolution of 0.05 0.05 and the Shuttle Radar Topography Mission (SRTM) 90 digital elevation model (DEM) products, were selected to characterize land cover, its change, and topographical conditions in China. All selected variables (Table 1) with potential impacts on concentrations were resampled to the same spatial resolution as the Himawari-8 aerosol product, namely, 0.05 0.05.
Table 1
Summary of datasets and sources used in this study. CNEMC is the China National Environmental Monitoring Center.
Dataset | Variable | Content | Unit | Spatial resolution | Temporal resolution | Data source |
---|---|---|---|---|---|---|
In situ | Hourly | CNEMC | ||||
AOD | AOD | Himawari-8 AOD | – | 5 5 | Hourly | Himawari-8 |
Meteorology | ET | Total evaporation | mm | 0.1 0.1 | Hourly | ERA5 |
SP | Surface pressure | hPa | 0.1 0.1 | |||
TEM | 2 temperature | K | 0.1 0.1 | |||
WU | 10 u-component of wind | 0.1 0.1 | ||||
WV | 10 v-component of wind | 0.1 0.1 | ||||
BLH | Boundary-layer height | m | 0.25 0.25 | |||
RH | Relative humidity | % | 0.25 0.25 | |||
Emissions | Ammonia | 0.25 0.25 | Monthly | MEIC | ||
Nitrogen oxides | ||||||
Sulfur dioxide | ||||||
VOC | Volatile organic compounds | |||||
PM | PM, coarse | |||||
Land cover | NDVI | NDVI | – | 0.05 0.05 | Monthly | MOD13C2 |
Topography | DEM | Surface elevation | m | 90 90 | – | SRTM |
Population | POP | Ambient population | – | 1 1 | Yearly | LandScan™ |
2.2.1 LightGBM model
The LightGBM model, a newly developed tree-based machine-learning approach, was introduced in 2017 (Ke et al., 2017). Using the gradient boosting framework to construct the decision tree, this approach can tackle both regression and classification tasks and as such can be expanded for PM applications. It can also tackle the main challenge faced in traditional machine-learning approaches, namely, computational complexities, which are very time-consuming. LightGBM is a fast, distributed, and highly efficient method that reduces the number of data samples () and features (). The LightGBM model includes three main steps when constructing the decision tree.
-
Histogram-based algorithm. Continuous features are first converted to different bins which are used to construct feature index histograms without the need to sort during training. It goes through all the data bins to find the best split point from the feature histograms, which can significantly reduce the computation cost of the split gain. The overall complexity is ( .
-
Gradient-based one-side sampling. Data samples are first sorted in descending order according to their absolute gradients, and the top % of them are selected as a subset sample with large gradients. The % samples are then randomly chosen from the remaining data as a subset sample with small gradients. The sampled data with small gradients are multiplied by a weight coefficient . Consequently, a new classifier is learned and established using the above-sampled data until convergence.
-
Exclusive feature bundling. A graph with weighted edges is first constructed, and each weight corresponds to the total number of conflicts between two features. The features are then sorted in descending order according to the degree of each feature (the greater the degree, the greater the conflict with other points). Last, each feature is checked in the sorted sequence, and it is assigned to a combination with small conflicts or a new combination is created.
In addition to the main technologies mentioned above, there are other features of the optimization, such as the leaf-wise tree growth strategy with depth restriction (Shi, 2007), histogram difference acceleration, sequential access gradient, and the support of category feature and parallel learning. These advanced methodologies make it possible to reach a high accuracy and efficiency (Ke et al., 2017).
2.2.2 Model development
It is well known that air pollution has spatiotemporal heterogeneity leading to large differences in concentrations in both time and space. Such characteristics have always been ignored in most traditional statistical regression and artificial intelligence methods. Studies have shown that including spatiotemporal information has led to improved estimates using remote sensing techniques (Z. Li et al., 2017; Wei et al., 2019a, 2020). Therefore, we have introduced a new approach to integrating spatiotemporal information into the LightGBM model. The new model developed here is called the STLG model. The spatial feature is represented by the geographical distances of one pixel to other points in the circumscribed rectangle of the study region (Baez-Villanueva et al., 2020; Behrens et al., 2018). The distance is calculated using the haversine method (Eq. 1) to reflect the spherical distance between two points in the sphere space (Wei et al., 2021a). The temporal feature is represented by the day of the year (DOY), which is used to distinguish each data record on different days of the year during the model training.
1 where and represent the latitude and longitude of a point on the sphere, respectively, and denotes Earth's mean radius ( 6371 ). Figure 1 illustrates the flowchart of the new STLG model.
Figure 1
Schematics of the space-time LightGBM (STLG) model developed in this study (upper panel) and the framework of the original LightGBM model (bottom panel).
[Figure omitted. See PDF]
In addition to Himawari-8 AODs, other auxiliary variables were considered and employed to improve –AOD relationships. However, to avoid redundant information, we first calculated the normalized importance (%) of each feature to the estimation during the model training (Fig. 2). It represents the total gains of splits that use the feature during the decision-tree construction but not the physical contribution. AOD is found to be the most important feature, accounting for about 17 %. All meteorological factors have an important impact on the estimation, especially BLH, RH, and TEM (importance 8 %), followed by two surface-related variables (i.e., NDVI and DEM) and POP. The influence of aerosol precursors and emissions (i.e., , , , PM, and VOC) on the estimation cannot be ignored (importance 2 %). Therefore, all 16 selected variables are included to establish the final model in this study.
Figure 2
Sorted normalized importance (%) of each feature in the estimation during the model construction.
[Figure omitted. See PDF]
Here, two independent 10-fold cross-validation methods (10-CV) (Rodriguez et al., 2010) based on all the data samples (i.e., out-of-sample) and monitoring stations (i.e., out-of-station) were selected to validate the model performance and the spatial prediction ability, respectively.
3 Results and discussion3.1 Model fitting and validation
3.1.1 Spatial-scale performance
The STLG model can largely minimize overfitting, showing a strong data-mining ability (Fig. 3) which can more accurately establish the relationships between hourly observations and influential variables (i.e., coefficient of determination, 0.97–0.98, RMSE 4.18–7.31 ). Figure 4 illustrates the out-of-sample evaluation results of estimated hourly values over China from 08:00 to 17:00 LT in 2018. The STLG model is highly accurate in estimating hourly concentrations, with high sample-based CV- values ranging from 0.81 to 0.85, strong slopes of 0.81–0.84, and small -intercepts of 5.52–7.84 . The uncertainties are overall small, with RMSEs (MAEs) ranging from 11.24 (6.82) to 15.56 (9.79) . However, the STLG performs slightly differently with small differences in main evaluation indicators throughout the day. The main reason being that the number of training samples is reduced during sunrise (Fig. 4a and b) and sunset (Fig. 4i and j) in optical remote sensing, affecting the model training. Air pollution also has clear diurnal variations at different pollution levels due to the different intensities of human activities and natural conditions. In general, our model is stable and robust, with an equal out-of-sample CV- of 0.85 and an equal regression slope of 0.81 at most hours during the day in China (Fig. 4c–h).
Figure 3
Density scatterplots of model-fitted estimates () at (a) 08:00 LT, (b) 09:00 LT, (c) 10:00 LT, (d) 11:00 LT, (e) 12:00 LT, (f) 13:00 LT, (g) 14:00 LT, (h) 15:00 LT, (i) 16:00 LT, and (j) 17:00 LT in 2018 in China. Dashed lines denote lines, and solid lines denote best-fit lines from linear regression.
[Figure omitted. See PDF]
Figure 4
Density scatterplots of out-of-sample cross-validation results of estimates () at (a) 08:00 LT, (b) 09:00 LT, (c) 10:00 LT, (d) 11:00 LT, (e) 12:00 LT, (f) 13:00 LT, (g) 14:00 LT, (h) 15:00 LT, (i) 16:00 LT, and (j) 17:00 LT in 2018 in China. Dashed lines denote lines, and solid lines denote best-fit lines from linear regression.
[Figure omitted. See PDF]
Figure 5
Density scatterplots of out-of-station cross-validation results of estimates () at (a) 08:00 LT, (b) 09:00 LT, (c) 10:00 LT, (d) 11:00 LT, (e) 12:00 LT, (f) 13:00 LT, (g) 14:00 LT, (h) 15:00 LT, (i) 16:00 LT, and (j) 17:00 LT in 2018 in China. Dashed and solid lines denote and best-fit lines from linear regression, respectively.
[Figure omitted. See PDF]
Furthermore, out-of-station CV- values range from 0.76 to 0.81, and RMSE (MAE) values range from 12.49 (7.85) to 17.61 (11.33) (Fig. 5), indicating that our model has a strong spatial prediction ability and can predict values well in those areas without surface observations.. The station-based accuracy is also slightly decreased with reference to the sample-based accuracy, further illustrating the robustness of our model. However, two cross-validation results (e.g., slopes 0.78–0.84) indicate that hourly concentrations are overall underestimated (Figs. 4–5), a common issue in fine-particle remote sensing (Wei et al., 2020). This can be explained by the large aerosol retrieval uncertainty, as well as the small number of data samples under highly polluted conditions (Wei et al., 2019c, d).
The regional performance of the STLG model for hourly estimates (Fig. 6) was also evaluated. Hourly estimates (number of data samples, 1 151 595) are highly consistent with ground measurements, with a high sample-based CV- of 0.87 and a strong regression slope of 0.86, showing small estimation uncertainties (i.e., RMSE 12.77 , MAE 8.12 ) over eastern China. The STLG model performs well (e.g., CV- 0.88, slope 0.87) in two typical urban agglomerations of public concern in China, i.e., the Beijing–Tianjin–Hebei (BTH) (Fig. 6b) and Yangtze River Delta (YRD) (Fig. 6c) regions. By contrast, our model performs relatively poorly in the Pearl River Delta (PRD) region (Fig. 6d) possibly due to the significant reduction in the number of data samples caused by frequent, long-term cloud cover in southern China. Note that there are some differences in the uncertainty of hourly estimates mainly because of varying levels of air pollution. The pollution level in the BTH region is about 3 times higher than that in the PRD region.
Figure 6
Density scatterplots of out-of-sample cross-validation results of hourly estimates () in 2018 for (a) eastern China, (b) the Beijing–Tianjin–Hebei (BTH) region, (c) the Yangtze River Delta (YRD), and (d) the Pearl River Delta (PRD) in China.
[Figure omitted. See PDF]
Figure 7
Individual-site-scale validation of hourly estimates () in 2018 in China in terms of (a) the sample size (N), (b) CV-, (c) RMSE, and (d) MAE.
[Figure omitted. See PDF]
Figure 7 shows the accuracy of the STLG model at each monitoring station across China. At the individual site scale, the number of data samples gradually decreases from northern China to southern China mainly due to increasing cloud contamination with a site average of 997 data samples in China. Except for several scattered monitoring stations in western China, the STLG model has a high performance and adaptability and can estimate well hourly concentrations at most monitoring stations (e.g., average CV- 0.78, RMSE 12.21 , and MAE 8.17 ). In general, approximately 76 %, 79 %, and 82 % of monitoring stations show high accuracy, with out-of-sample CV- values 0.7, RMSE values 15 , and MAE values 10 in hourly estimates, especially for those located in central and northern China.
3.1.2 Temporal-scale performanceWe first quantified the time series of the bias in hourly estimates during the day in China (Fig. 8). There is a slight temporal dependence in that the bias increases gradually with increasing standard deviation, reaching a maximum around 11:00 LT and subsequently decreasing. This seems to be closely related to the diurnal variation in concentrations. The estimates are less affected by the time-dependent bias in the Himawari-8 AOD product (Wei et al., 2019c) because machine learning is not sensitive to the systematic bias of aerosol retrievals (Wei et al., 2021c). Nevertheless, our model is generally robust and can accurately estimate concentrations with small mean (median) biases of 0.05–0.08 (0.63–0.99) during different hours throughout the day.
Figure 8
Boxplots of the temporal dependence of the bias in hourly estimates () in 2018 in China. In each box, the red dot represents the mean bias, and the middle, lower, and upper horizontal blue lines represent the median bias, 25th percentile, and 75th percentile, respectively.
[Figure omitted. See PDF]
Figure 9
Time series of Himawari-8-derived (blue bars) and ground-based (orange bars) diurnal variations () in (a) China, (b) the Beijing–Tianjin–Hebei (BTH) region, (c) the Yangtze River Delta (YRD), and (d) the Pearl River Delta (PRD).
[Figure omitted. See PDF]
We also compared Himawari-8-derived and ground-based diurnal variations from all available monitoring stations in China and three typical urban clusters (Fig. 9). Hourly concentrations observed by satellite are highly consistent with ground-based measurements, with a small difference within 0.10, 0.11, 0.13, and 0.11 in China and in each region. Moreover, the same diurnal variations in pollution are seen during the day; i.e., they reach their maximum values at 10:00 or 11:00 LT and are lower at sunrise and sunset. These results illustrate that the diurnal variations derived from Himawari-8 are reasonable compared to ground-based measurements.
Figure 10
Time series of out-of-sample cross validation of hourly estimates () in terms of (a) the sample size (, red) and CV- (blue) and (b) RMSE (red) and MAE (blue) in 2018 in China.
[Figure omitted. See PDF]
We investigated the time series of the daily performance of the STLG model in estimating hourly concentrations in China. The number of data samples varies on a daily basis, with an average of 3975 and with more than 83 % of all days having more than 2000 (Fig. 10). The large gap in the number of data samples is mainly caused by different degrees of cloud contamination in the satellite aerosol products for different days. The STLG model captures well the hourly values on most days, with an average out-of-sample of 0.73 and average RMSE and MAE values of 13.06 and 8.53 , respectively. In general, hourly estimates are more reliable on approximately 79 % (CV- 0.7), 70 % (RMSE 15 ), and 74 % (MAE 10 ) of the days in the year. The model performance also varies greatly at the seasonal level, with average CV- values of 0.82, 0.71, 0.87, and 0.86 and average RMSE values of 14.55, 9.63, 11.83, and 17.57 in spring, summer, autumn, and winter, respectively (Fig. 11). In general, the overall uncertainty of estimates increases at the beginning and at the end of the year likely due to the harsher environmental conditions (e.g., low humidity and less precipitation) and more intense human activities (e.g., coal heating and straw burning) in winter and spring.
Figure 11
Density scatterplots of out-of-sample cross-validation results of hourly estimates () for (a) spring, (b) summer, (c) autumn, and (d) winter of 2018 in China. Dashed and solid lines denote and best-fit lines from linear regression, respectively.
[Figure omitted. See PDF]
We have evaluated temporally synthesized data from the hourly data samples at each monitoring station for the year 2018 (Fig. 12). Daily mean estimates are highly correlated to those calculated from surface observations ( 0.91), and the average RMSE (MAE) value is 10.11 (6.39) . This suggests that the STLG model can capture daily variations more accurately. Note that daily synthetic data derived from geostationary satellites have a higher temporal frequency than data derived from sun-synchronous satellites. In general, synthetic values also have high accuracies and low estimation uncertainties (e.g., 0.98, RMSE 1.6–3.3 , MAE 1.1–2.3 ) from monthly to annual scales, allowing for a better description of spatiotemporal distributions and variations in pollution across China.
Figure 12
Density scatterplots of out-of-sample cross-validation results of (a) daily, (b) monthly, (c) seasonal, and (d) annual mean estimates () in 2018 across China.
[Figure omitted. See PDF]
3.2 Spatiotemporal characteristics3.2.1 Diurnal variations
Figure 13 shows Himawari-8-derived hourly mean near-surface concentrations from 08:00 to 17:00 LT in 2018 across mainland China. They do not cover western Xinjiang and Tibet due to the limitation of satellite scanning. pollution varies diurnally across China, being at an overall low level at sunrise ( 29.94 10.91 ). With the increase in human activities, air pollution becomes more severe over time, reaching a peak at around 10:00–11:00 LT in China ( 36 13 ). These high levels of pollution can last several hours. As the day progresses, human activities subside, and atmospheric fine particles settle on surfaces. concentrations thus decrease towards sunset in most areas in China ( 23.21 9.73 ). In general, air pollution in the morning (i.e., 08:00–12:00 LT) is much more severe than in the afternoon (i.e., 13:00–17:00 LT) in China, with morning concentrations about 1.3 times higher than afternoon levels. This is related to the influence of varying BLHs (Z. Li et al., 2017; Su et al., 2018).
Figure 13
Himawari-8-derived hourly mean maps (5 ) for different times of the day: (a) 08:00 LT, (b) 09:00 LT, (c) 10:00 LT, (d) 11:00 LT, (e) 12:00 LT, (f) 13:00 LT, (g) 14:00 LT, (h) 15:00 LT, (i) 16:00 LT, (j) 17:00 LT, (k) morning (08:00–12:00 LT), and (l) afternoon (13:00–17:00 LT) in 2018 across China.
[Figure omitted. See PDF]
Table 2Hourly mean concentrations () in 2018 in China, eastern China (ECHN), the Beijing–Tianjin–Hebei (BTH) region, the Yangtze River Delta (YRD), and the Pearl River Delta (PRD).
Time | China | ECHN | BTH | YRD | PRD |
---|---|---|---|---|---|
08:00 | 29.94 10.91 | 31.97 11.55 | 42.46 12.97 | 38.60 10.57 | 29.34 5.01 |
09:00 | 33.37 12.59 | 36.29 13.52 | 47.32 15.04 | 43.55 11.27 | 34.81 5.46 |
10:00 | 35.67 13.53 | 38.56 14.05 | 49.31 15.03 | 44.72 11.17 | 35.48 5.47 |
11:00 | 35.63 13.05 | 38.72 13.53 | 49.10 13.77 | 44.27 10.55 | 36.36 5.76 |
12:00 | 31.23 11.74 | 35.10 12.47 | 42.38 12.86 | 41.37 9.77 | 34.56 5.72 |
13:00 | 28.45 11.40 | 32.23 11.73 | 37.70 11.55 | 39.36 9.22 | 33.33 5.48 |
14:00 | 26.36 11.18 | 30.14 11.09 | 34.32 11.81 | 37.31 8.59 | 32.05 5.50 |
15:00 | 24.25 10.06 | 28.67 10.21 | 31.95 11.26 | 36.77 8.13 | 30.34 5.43 |
16:00 | 23.63 9.26 | 27.38 9.15 | 29.82 10.13 | 32.84 6.30 | 29.49 5.97 |
17:00 | 23.21 9.73 | 26.63 8.93 | 28.88 10.16 | 27.59 4.39 | 31.56 6.17 |
Morning | 33.29 11.59 | 36.15 12.41 | 46.12 13.29 | 42.50 10.22 | 34.52 4.63 |
Afternoon | 25.11 9.78 | 29.01 9.70 | 32.53 10.53 | 34.76 6.66 | 31.42 4.85 |
Table 2 summarizes the diurnal variations in eastern China and three typical urban agglomerations. pollution levels in eastern China are generally higher than the national level at each hour of the day due to the dense human population and intensive human activities. In the BTH region, pollution varies greatly, with hourly concentrations ranging from 28.88 10.16 (10:00 LT) to 49.31 15.03 (16:00 LT) and with differences exceeding 20 . pollution remained at a high level ( 42 ) before 12:00 LT and dropped to a lower level ( 29 ) after 16:00 LT. This is closely related to people's daily activities and the production and life cycle of during the day, as well as the change in boundary mixing as a function of the day (Lennartson et al., 2018; Wang and Christopher, 2003). Similar patterns and pollution levels are seen in the YRD region. In general, the PRD region is less polluted in the morning but more severely polluted in the afternoon than the BTH region. Compared with the BTH and PRD regions, pollution in the PRD region is much lower and shows a smaller diurnal difference, with hourly values ranging from 29.49 5.97 (11:00 LT) to 36.36 5.76 (08:00 LT). Better natural conditions and fewer pollutant emissions mainly explain this (Su et al., 2018).
In general, our satellite-derived diurnal variations in pollution agree well with ground-based observations at both national and regional levels but with generally lower concentrations (Fig. 9). The reason is that the monitoring stations are unevenly distributed and vary greatly in the number of stations at the regional scale. Also, most sites are distributed in urban areas, leading to inevitable overestimations due to urban–rural differences. However, satellite remote sensing can cope with this deficiency by generating spatially continuous maps, providing more accurate information about the distribution of and variations in pollution.
3.2.2 Seasonal and annual variationsSeasonal maps are synthesized from daily maps from 2018 across China according to our previous approach (Wei et al., 2019a). Our results illustrate that pollution varies greatly on a seasonal scale (Fig. 14). Pollution levels are generally low and show similar spatial patterns in summer ( 22.86 7.05 ) and autumn ( 23.76 10.97 ) across China (Table 3). By contrast, it is much more severe in spring ( 32.84 11.49 ) and winter ( 39.04 16.32 ) across China, especially in the BTH and YRD regions in winter. The main reasons are the frequent sandstorms and the long-distance transmission of sand and dust in spring and the burning of coal and fossil fuels for heating in winter leading to more pollutant emissions in northern China.
Figure 14
Himawari-8-derived seasonal mean maps (5 ) for (a) spring, (b) summer, (c) autumn, and (d) winter of 2018 across China.
[Figure omitted. See PDF]
Figure 15
Himawari-8-derived annual mean map (5 ) for the year 2018 across China. The lower-left-inserted density scatterplot represents out-of-sample cross-validation results for all hourly estimates in China.
[Figure omitted. See PDF]
Table 3Annual and seasonal mean concentrations () in 2018 in China, eastern China (ECHN), the Beijing–Tianjin–Hebei (BTH) region, the Yangtze River Delta (YRD), and the Pearl River Delta (PRD).
Time | China | ECHN | BTH | YRD | PRD |
---|---|---|---|---|---|
Spring | 32.84 11.49 | 34.93 10.95 | 45.75 12.96 | 40.35 9.55 | 33.97 4.50 |
Summer | 22.86 7.05 | 24.16 6.29 | 29.99 7.46 | 26.16 4.58 | 23.56 3.18 |
Autumn | 23.76 10.97 | 28.64 11.60 | 35.98 11.20 | 35.97 7.80 | 29.54 4.43 |
Winter | 39.04 16.32 | 48.34 17.47 | 48.36 18.92 | 57.41 16.88 | 43.92 8.56 |
Annual | 28.99 10.31 | 32.56 10.78 | 39.32 11.74 | 38.64 8.27 | 32.98 4.53 |
pollution also shows significant spatial heterogeneities across China (Fig. 15), with an annual mean concentration of 28.99 10.31 in 2018 (Table 3). High pollution levels are always observed in the Hebei, Shandong, Jiangsu, Anhui, Henan, Hubei, and Sichuan provinces. Interactions between intensive human activities, adverse stagnant weather (e.g., low BLHs and low winds), and special terrain (e.g., basin) can increase anthropogenic aerosols (Chen et al., 2008; Wang et al., 2018). By contrast, pollution is relatively light in the northeast (e.g., Heilongjiang and Jilin provinces), the southwest (e.g., Tibet and Yunnan provinces), and the eastern coastal areas of China (e.g., Zhejiang and Fujian provinces). These provinces are sparsely populated or experience meteorological conditions favorable for dispersing pollution (Su et al., 2018).
3.3 Discussion3.3.1 Comparison with traditional models
We first compared results from the STLG model with results from five widely used statistical regression models employed for estimating in China using the same input dataset (Table 4). The multivariate linear regression (MLR) model performs the worst due to the complex nonlinear –AOD relationship. The GWR model performs better because it takes into account the spatial characteristics of pollution. The generalized additive model (GAM) and the LME model show overall improved performances with decreasing estimation uncertainties because of their nonlinear characteristics and stronger data regression abilities. The two-stage model outperforms the GAM and maximum likelihood estimation (MLE) models with higher CV- values and smaller estimation uncertainties by combining the advantages of the GWR and LME models. Our model performs better than all of the traditional statistical regression models considered mainly due to its stronger data-mining ability.
Table 4
Comparison of the model performances of widely used models and the STLG model in estimating from Himawari-8 data at 14:00 LT in 2018 in China ( 162 840).
Model | Out-of-sample validation | Out-of-station validation | ||||
---|---|---|---|---|---|---|
CV- | RMSE | MAE | CV- | RMSE | MAE | |
MLR | 0.19 | 24.17 | 22.89 | 0.19 | 24.19 | 22.91 |
GWR | 0.39 | 21.96 | 20.74 | 0.37 | 22.42 | 21.02 |
GAM | 0.39 | 19.09 | 18.64 | 0.36 | 19.77 | 18.89 |
LME | 0.50 | 18.91 | 17.34 | 0.48 | 19.06 | 17.95 |
Two-stage | 0.58 | 17.60 | 15.71 | 0.54 | 17.99 | 16.01 |
STLG | 0.85 | 13.09 | 8.11 | 0.81 | 14.63 | 9.29 |
Comparison of the model performances of different tree-based machine-learning models and the STLG model using the same input data. Data are from 14:00 LT in 2018 in China ( 162 840). TSpeed and TMemory refer to the speed and memory consumption during the model training.
Model | Out-of-sample validation | Out-of-station validation | TSpeed | TMemory | ||||
---|---|---|---|---|---|---|---|---|
RMSE | MAE | RMSE | MAE | (s) | (GB) | |||
DT | 0.52 | 25.53 | 14.80 | 0.48 | 27.03 | 15.57 | 6 | 0.58 |
GBDT | 0.65 | 20.03 | 13.17 | 0.61 | 21.20 | 14.10 | 94 | 0.59 |
XGBoost | 0.73 | 17.94 | 10.78 | 0.68 | 19.59 | 11.93 | 456 | 0.69 |
RF | 0.72 | 17.86 | 11.33 | 0.69 | 18.80 | 11.95 | 165 | 2.59 |
ERT | 0.74 | 17.12 | 10.87 | 0.72 | 18.01 | 11.49 | 54 | 3.69 |
LightGBM | 0.78 | 15.79 | 9.84 | 0.73 | 17.59 | 11.21 | 34 | 0.60 |
STDT | 0.65 | 21.09 | 12.33 | 0.63 | 22.00 | 12.85 | 8 | 0.60 |
STGB | 0.75 | 16.82 | 10.93 | 0.73 | 17.61 | 11.54 | 503 | 0.61 |
STXB | 0.82 | 14.73 | 8.76 | 0.78 | 15.92 | 9.62 | 456 | 0.68 |
STRF | 0.81 | 14.62 | 9.17 | 0.79 | 15.44 | 9.69 | 219 | 2.75 |
STET | 0.82 | 14.42 | 8.95 | 0.80 | 15.30 | 9.55 | 77 | 4.25 |
STLG | 0.85 | 13.09 | 8.11 | 0.81 | 14.63 | 9.29 | 46 | 0.60 |
The first six rows of Table 5 show the accuracies and efficiencies of six tree-based machine-learning models when estimating in China using the same input dataset. The decision tree (DT; Quinlan, 1986) is a traditional, frequently used, supervised learning classification method. Although the training speed is the fastest and the memory consumption is the least, it has the worst performance because of the simple single classifier. The model performances of ensemble-learning approaches, i.e., GBDT (Friedman, 2001), RF (Breiman, 2001), extremely randomized trees (ERTs; Geurts et al., 2006), and XGBoost (Chen and Guestrin, 2016), can be significantly improved by combining several weak classifiers into a strong classifier. Among them, the ERT model yields a higher estimation accuracy and a stronger spatial prediction ability than other ensemble-learning models. The LightGBM model (Ke et al., 2017) performs the best with the highest accuracy and smallest uncertainty among all tree-based machine-learning approaches considered.
The model efficiency differs among these models due to the large differences in the algorithm design frameworks. These tree-based machine-learning models can be divided into two categories. The DT, RF, and ERT models fall into the “bagging” category, which synthesizes multiple independent and unrelated weak classifiers into a strong classifier. It allows for work in parallel, which can save much time but may need more computer memory. The GBDT, XGBoost, and LightGBM models fall into the “boosting” category, which synthesizes multiple interdependent and related weak classifiers into a strong classifier. They can only work in serial, which may take much time but not too much memory. In general, the STGB model is the most time-consuming, while the STET model is the most memory-consuming. By contrast, the LightGBM model runs very fast and consumes very little computer memory, benefiting from a series of algorithm optimizations (Ke et al., 2017).
After considering spatiotemporal variations, all the newly defined space-time DT, GBDT, XGBoost, RF, ERT, and LightGBM models (i.e., STDT, STGB, STXB, STRF, STET, and STLG) show significant improvements in both overall estimation accuracy and spatial prediction ability in estimating hourly concentrations with reference to their original models. This further illustrates the importance of including spatiotemporal information when constructing –AOD relationships. More importantly, the training speed of these models did not decrease much, and the memory consumption did not increase much either. In general, the STLG model shows the best performance with a high efficiency (i.e., training speed 46 , memory usage 0.60 ) among all the space-time tree-based machine-learning models. Therefore, our new STLG model is highly valuable for accurate and fast air pollution monitoring, in particular for our future study extended to the global scale.
3.3.2 Comparison with related studiesWe compared Himawari-8-based hourly estimates at regional and national scales in China with previous related studies (Table 6). Local hourly concentrations retrieved from our national-scale model are more accurate than those derived from the models developed separately in local areas, e.g., the LME model (Wang et al., 2017), the GWR, SVR, RF, and DNN models in the BTH region (Sun et al., 2019) and the two-stage RF and DNN models in the YRD region (Fan et al., 2020; Tang et al., 2019). Our model also outperforms most of the statistical regression models and machine-learning models focused on the entirety of China, e.g., the I-LME, IGTWR, RF, AdaBoost, XGBoost, and their stacked models in China (Chen et al., 2019; Liu et al., 2019; Xue et al., 2020; T. Zhang et al., 2019). This is due to the stronger data-mining ability, considering key spatial and temporal information about air pollution (ignored in previous studies), which introduces more comprehensive factors that affect pollution (e.g., emission inventories).
Table 6
Comparison of model performances from previous studies in estimating hourly concentrations in China.
Model | Model validation | Region | Reference | ||
---|---|---|---|---|---|
RMSE | MAE | ||||
LME | 0.86 | 24.5 | 14.2 | BTH | Wang et al. (2017) |
LME | 0.63 | 29.0 | 18.1 | BTH | Sun et al. (2019) |
GWR | 0.76 | 23.3 | 16.7 | Sun et al. (2019) | |
SVR | 0.77 | 21.5 | 12.3 | Sun et al. (2019) | |
RF | 0.82 | 20.3 | 12.1 | Sun et al. (2019) | |
DNN | 0.84 | 19.9 | 11.9 | Sun et al. (2019) | |
two-stage RF | 0.86 | 12.4 | – | YRD | Tang et al. (2019) |
DNN | 0.86 | 14.3 | – | YRD | Fan et al. (2020) |
RF | 0.82 | 19.6 | 12.2 | China | Chen et al. (2019) |
AdaBoost | 0.84 | 18.3 | 10.7 | Chen et al. (2019) | |
XGBoost | 0.84 | 18.1 | 11.4 | Chen et al. (2019) | |
Stacked model | 0.85 | 17.3 | 10.5 | Chen et al. (2019) | |
RF | 0.86 | 17.3 | 10.3 | China | Liu et al. (2019) |
I-LME | 0.84 | – | – | BTH | T. Zhang et al. (2019) |
0.80 | – | – | YRD | ||
0.74 | – | – | PRD | ||
0.82 | – | – | China | ||
IGTWR | 0.78 | 21.1 | – | China | Xue et al. (2020) |
has a great impact on the atmospheric environment and is also used as a key indicator in environmental health studies. It varies diurnally, affected by both natural and human factors. Previous studies have been based on data from sun-synchronous satellites which can monitor air pollution at coarse temporal scales (i.e., daily), while high-temporal-resolution and accurate information on is needed. In this study, the Himawari-8/AHI hourly AOD product is employed to address this issue. Moreover, considering the large volume of input data and the large errors in estimation using traditional methods, an efficient and accurate space-time Light Gradient Boosting Machine (i.e., STLG) model has been developed. It utilizes meteorological, human, land use, and topographical parameters and is implemented at 5 resolution and hourly timescale to generate information over China. The hourly estimates are evaluated against surface observations, and spatiotemporal variations are also investigated.
The STLG model predicts hourly values accurately, with high out-of-sample (out-of-station) CV- values of 0.81–0.85 ( 0.76–0.81) and low RMSE values of 11.24–15.56 ( 12.49–17.61) throughout the day. The model can also produce daily (e.g., 0.91, RMSE 10.11 ), monthly, seasonal, and annual mean values (e.g., 0.98, RMSE 1.6–3.3 ). varies diurnally in most areas of mainland China, where concentrations reach a maximum at 10:00 LT and are generally low at sunrise and sunset on a given day. also varies greatly on a seasonal basis, in which winter and summer experience the highest and lowest air pollution levels, respectively. Comparison results suggest that the proposed model is more accurate than traditional statistical regression models, other tree-based machine-learning models, and various models developed in previous studies. Overall, the STLG model is more efficient, having faster training speed and less memory consumption. These results illustrate that this algorithm can be useful for real-time monitoring of pollution in China.
Data availability
measurements are available at
Author contributions
JiW designed the research and wrote the initial draft of this manuscript. ZL, RTP, JuW, and LS reviewed and edited the paper. RL and WX helped to process the data. MC copy-edited the article. All authors made substantial contributions to this work.
Competing interests
The authors declare that they have no conflict of interest.
Special issue statement
This article is part of the special issue “Satellite and ground-based remote sensing of aerosol optical, physical, and chemical properties over China”. It is not associated with a conference.
Acknowledgements
We would like to thank Qiang Zhang at Tsinghua University for providing MEIC pollution emission data for China. Jun Wang's participation is made possible via the in-kind support from the University of Iowa.
Financial support
This research has been supported by the National Natural Science Foundation of China (grant no. 42030606) and the National Key Research and Development Program of China (grant no. 2017YFC1501702).
Review statement
This paper was edited by Toshihiko Takemura and reviewed by two anonymous referees.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2021. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Fine particulate matter with a diameter of less than 2.5
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details




1 State Key Laboratory of Remote Sensing Science, College of Global Change and Earth System Science, Beijing Normal University, Beijing, China; Earth System Science Interdisciplinary Center, Department of Atmospheric and Oceanic Science, University of Maryland, College Park, MD, USA; Department of Chemical and Biochemical Engineering, Iowa Technology Institute, Center for Global and Regional Environmental Research, The University of Iowa, Iowa City, IA, USA
2 Earth System Science Interdisciplinary Center, Department of Atmospheric and Oceanic Science, University of Maryland, College Park, MD, USA
3 Department of Chemical and Biochemical Engineering, Iowa Technology Institute, Center for Global and Regional Environmental Research, The University of Iowa, Iowa City, IA, USA
4 College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao, China
5 State Key Laboratory of Remote Sensing Science, College of Global Change and Earth System Science, Beijing Normal University, Beijing, China