Diabetes mellitus is a common chronic disease caused by metabolic disorders. It has become one of the most frequent and widespread diseases in both developed and developing countries, and it is associated with the epidemiology of obesity (Jia, Xue, Yin, et al., 2019) and with acute complications such as diabetic ketoacidosis, diabetic coma, heart disease, and stroke (Dales et al., 2012). Besides its impact on diabetes sufferers' health, diabetes imposes a large financial burden on individuals and a significant economic cost on a nation's health system, making it one of the foremost public health challenges of the 21st century (Rong et al., 2016). According to the International Diabetes Federation (Domingueti et al., 2016), there are 425 million people with diabetes, accounting for 9% of the adult population worldwide, two thirds of whom are of working age. This figure is projected to reach 592 million globally by 2035, with 62.6 million in China, the country with the second‐highest number of people diagnosed with diabetes (Xu et al., 2013).
Diabetes studies have shown that age, sex, educational level, regional economic development, and medical facilities level are all influencing indicators in the risk of diabetes (Brown et al., 2004; Hipp & Chalise, 2015; Maier et al., 2013). More recently, interdisciplinary collaboration, particularly with geography and sociology, and greater data accessibility offer diabetes studies a mass of meteorological and geo‐environmental data on potential influencing indicators related to diabetes prevalence. Geographic information system (GIS) and remote sensing methods (Jia et al., 2017) provide more abundant and effective geo‐environmental data, such as the built environment (Feng et al., 2010), the accessibility of local food (Salois, 2012), urbanization degree (Cherubini et al., 1999; Zhou, Astell‐Burt, Yin, et al., 2015), and physical activities (Hu et al., 1999). For example, the food environment, such as the density of restaurants, retail food stores, and supermarkets, can play an important role in the prevalence of diabetes through a population's diet (Jia, Xue, Cheng, & Wang, 2019; Salois, 2012). Pilot GIS studies have identified significant relationships between air pollution, including PM2.5, SO2, and NO2 concentrations obtained from remote sensing images, and diabetes prevalence (Dales et al., 2012; Eze et al., 2015; Thiering & Heinrich, 2015). Land use types and land use mix have been related to diabetes and public health more generally (Christian et al., 2011; Su et al., 2016). Drawing on sociology, sociodemographic indicators, such as low socioeconomic status (Walker et al., 2011), family income (Dinca‐Panaitescu et al., 2011), housing inequality (Wan & Su, 2016), education level (Zhou, Astell‐Burt, Bi, et al., 2015), and immigrant status (Siordia et al., 2012), are potential diabetes influencing indicators. 
Diabetes has been associated with a population's socioeconomic inequities, frequently conceptualized as social deprivation (Connolly et al., 2000; Maier et al., 2013; Tompkins et al., 2010). Some diet and nutrition indicators, like fruits, vegetables, and cereals intake, have also been linked to diabetes (Ezzati & Riboli, 2013).
Understanding the association between these influencing indicators and the prevalence of diabetes can control and prevent the risk of diabetes‐related illnesses, informing treatment regimens and guiding effective policies. But understanding these associations has faced two challenges. First, selected influencing indicators have been based mainly on subjective judgment or advice from expert consultants. Health problems are subject to complex interactions among different influencing indicators that are not equally important for diabetes prevalence in different areas. If the number of modeled influencing indicators does not correspond to the real factors, model accuracy will be negatively impacted (Frank & Friedman, 1993). One solution is to use data‐driven methods to extract significant influencing indicators automatically. With unprecedented volumes of data now available, machine learning methods, such as ridge regression and lasso regression, have been applied widely in high dimensional feature selection (Reichstein et al., 2019). Surprisingly, the applications of these new data‐driven methods as ways of identifying influencing indicators of diabetes prevalence have been largely ignored in diabetes research.
The second challenge is that public health results tend to be spatially clustered, and obvious regional differences exist in epidemiological disease prevalence (Chalkias et al., 2013). Traditional statistical modeling methods, such as ordinary least squares (OLS) regression (Green et al., 2003), logistic multilevel binomial regression (Jia, Xue, Cheng, & Wang, 2019; Maier et al., 2013), robust regression (Salois, 2012), and generalized linear models with natural cubic splines (Dales et al., 2012) have not addressed spatial issues. These conventional statistical methods are often based on the assumption that the samples are independent of each other, which contradicts data with spatial autocorrelation, where two measurements taken from geographically close locations are often more similar than measurements from a distant location. Thus, it is necessary to add spatial effects to traditional statistical models (LeSage & Pace, 2009). With the development of GIS and spatial analysis, new spatial techniques and methods have been proposed to explain the spatial association between diabetes and various influencing indicators (Shi & Wang, 2015). For example, spatial clustering patterns of diabetes prevalence have been detected and quantified (Tompkins et al., 2010), and spatial effects have been added to statistical models such as spatial regression models (Sridharan et al., 2007; Wan & Su, 2016; Weng et al., 2017) and geographical weighted regression models (Hipp & Chalise, 2015; Siordia et al., 2012).
New to diabetes research, this paper proposes and develops a step‐by‐step framework that combines a machine learning model and a spatial regression model. Using medical insurance data from Shandong province, China, we apply this framework to identify automatically the probable indicators that could influence diabetes prevalence significantly, and then, establish the relationship between these selected influencing indicators and diabetes prevalence considering the spatial autocorrelation effects. Our framework provides public health officials and urban planners with a new perspective to inform the implementation of improved treatment and policies to attenuate diabetes diseases.
As shown in Figure 1, Shandong is an advanced industrial province on China's east coast, in the lower reaches of the Yellow River, between 34°25′ and 38°23′ north latitude and between 114°36′ and 122°43′ east longitude. With a population exceeding 100 million, Shandong had a gross domestic product (GDP) of RMB7.27 trillion (US$1.08 trillion) in 2017, the third highest in China. In this largely industrial and fast‐growing province, the total industrial added value was RMB2.87 trillion (US$425.07 billion) in 2017, an increase of 6.6% over the previous year, and the added value of agriculture was RMB280.2 billion (US$41.5 billion), an increase of 4.6% over the previous year. Shandong province has a warm temperate monsoon climate with four seasons. Its average annual temperature is 13°C, and its annual rainfall of 550–950 mm is concentrated in the summer.
The type 2 diabetes mellitus data were obtained from a management database of patients' medical insurance in 12 Shandong cities in 2017, comprising 89 counties. The data involved inpatient and outpatient records with patient ID, sex, age, residential address at the county level, diagnosis results from the International Classification of Diseases (ICD) codes, hospital name, medical expenses, and medical insurance type. The medical insurance types consist of Urban Employed Basic Medical Insurance (UEBMI) and Integration of Urban and Rural Medical Insurance (IURMI), forming the basic medical insurance systems. Since there have been significantly different individual characteristics between these two types (Huang et al., 2019), it is worth exploring the patterns of diabetes by subgroups. China's basic medical insurance plans cover 95% of the population, thus the information from the medical insurance database can account for almost all the patients during the study period in each city. Based on the diagnosis results (ICD‐10: E11–E14) recorded in the insurance database, diabetes mellitus patients were identified, and their inpatient and outpatient visit information were extracted including non‐insulin‐dependent diabetes mellitus, diabetic complications, and diabetic comorbidities. Due to the lack of patients' accurate addresses, the annual diabetes prevalence was calculated at the county level.
We collected 46 influencing indicators at the county level, categorized into four different domains: economic, sociodemographic, education, and geographical environment. For parsimony and because of space limitations, we list only the important influencing indicators retained after the indicator extraction described in Section 2.4.2: (1) economic indicators included per capita total export and per capita GDP, which were collected from the 2017 Shandong statistical yearbook; (2) sociodemographic indicators included per capita retail sales of social consumer goods, per capita grain production, per capita fruit production, and per capita meat production, which were also obtained from the 2017 statistical yearbook; (3) educational indicators included per capita teacher amounts in general secondary school and per capita teacher amounts in primary school from the statistical yearbook; (4) geographical environmental indicators were derived and calculated with GIS tools at the county scale. A spatial database was created in which GIS layers of spatial environmental data were recorded in ArcGIS version 10.2 (ESRI Inc.) software. Specifically, the average elevation for each county was calculated from a digital elevation model. Based on the Kriging interpolation and zonal statistic methods, the annual average sunshine hours were obtained from meteorological stations in the China Meteorological Data Network, and the annual average PM2.5 concentrations were calculated from air quality monitoring stations in Shandong Province. The proportions of land use types were calculated by applying supervised classification in the ENVI software (Exelis Visual Information Solutions Company) to remote sensing images downloaded from Google Earth. Road network density in each county was calculated from the road length of the main roads (for any form of motor transport) and secondary roads (supplementing a main road at moderate or slow speeds), which were crawled from Amap (
The flowchart of our framework is shown in Figure 2. The procedure includes three steps. First, Chi‐square tests were used to analyze the statistical significance of stratification differences, and the autocorrelation index was applied to detect spatial patterns of diabetes. If a spatial clustering pattern exists in the prevalence of diabetes, it is necessary to use a spatial regression model. Second, before analyzing the relationship between the indicators and diabetes prevalence, we used binary linear regression and lasso regression to extract the most significant influencing indicators of diabetes without collinearity. Lastly, we applied the spatial regression model to analyze how the spatial prevalence of diabetes is associated with significant diabetes correlates and used variance decomposition to isolate the relative effect of influencing indicators on diabetes prevalence.
Diabetes data were divided into subgroups according to inpatient or outpatient record types and medical insurance types. After filtering and removing duplicate visiting records based on the patient ID, the number of diabetes patients in each county was calculated. A Chi‐square test was used to analyze the significance of the difference in these subgroups at the county level, which was done by SPSS 22.0 software (SPSS Inc.). If the difference was significant, these subgroups were chosen to proceed to the next step. Five subgroups were formed: Group 1 Inpatients; Group 2 Outpatients; Group 3 UEBMI Patients; Group 4 IURMI Patients; and Group 5 All Patients in our database. To normalize disease values, the diabetes prevalence rates of the five subgroups in each county were expressed as (Coggon et al., 1997)

$$P_i = \frac{n_i}{N_i} \times 1000 \quad (1)$$

where $P_i$ is the prevalence rate for diabetes in county $i$, expressed per 1,000 people; $n_i$ is the number of patients who suffered from diabetes in county $i$ within a specific period of time; and $N_i$ is the total population of county $i$, collected from the 2017 statistical yearbook. The spatial distributions of diabetes prevalence were visualized by a series of thematic maps, which were classified into six classes using the “natural breaks” method by seeking to minimize each class's average deviation from the class mean, while maximizing each class's deviation from the means of the other classes (Faka et al., 2017).
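The deduplication and normalization steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' SPSS workflow: the record layout `(patient_id, county)`, the function names, and the sample counts are all hypothetical, and the rate is assumed to be expressed per 1,000 residents, as in the Figure 3 caption.

```python
from collections import defaultdict


def count_unique_patients(records):
    """Count distinct patient IDs per county from (patient_id, county) visit records."""
    seen = defaultdict(set)
    for patient_id, county in records:
        seen[county].add(patient_id)  # repeated visits by one patient collapse to one entry
    return {county: len(ids) for county, ids in seen.items()}


def prevalence_per_1000(patients, population):
    """Patients per 1,000 residents (the per-1,000 scaling is an assumption)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000.0 * patients / population


# Hypothetical visit records and county populations, for illustration only.
records = [("p1", "A"), ("p1", "A"), ("p2", "A"), ("p3", "B")]
population = {"A": 1000, "B": 500}

patients = count_unique_patients(records)
rates = {c: prevalence_per_1000(patients[c], population[c]) for c in patients}
```

Deduplicating before dividing matters because one patient may generate many inpatient or outpatient records in a year.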
To detect global spatial clustering patterns and quantify the degree of spatial autocorrelation of diabetes prevalence, Moran's I index was applied (Moran, 1950). Moran's I index ranges between −1 and +1, where the value close to +1 indicates a strong spatial correlation of the diabetes prevalence rate, while −1 shows spatial dispersal. To find the location of significantly similar or dissimilar clustering, local spatial autocorrelation is quantified by calculating the Local Moran's I index (Anselin, 1995). Positive values indicate spatial clustering of similar values, and negative values show the clustering of dissimilar values. The outcomes of Local Moran's I include five possible categories to identify the existence of pockets or clusters: High‐High, Low‐Low, High‐Low, Low‐High, and Not Significant. Analyses were done with GeoDa software, version 1.12 using queen contiguity weights, and the parameter of the order contiguity was set to 1.
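As a rough illustration of the two statistics, the sketch below computes global and Local Moran's I with NumPy on a small regular grid. It is not the GeoDa implementation: the grid stands in for real county polygons, the function names are hypothetical, the weights are assumed to be row-standardized, and the permutation-based significance tests that GeoDa reports are omitted.

```python
import numpy as np


def queen_weights(n_rows, n_cols):
    """Row-standardized first-order queen contiguity weights for a regular grid."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Queen contiguity: shared edges or corners, excluding the cell itself.
                    if (dr or dc) and 0 <= rr < n_rows and 0 <= cc < n_cols:
                        w[i, rr * n_cols + cc] = 1.0
    return w / w.sum(axis=1, keepdims=True)


def morans_i(values, w):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z the centered values."""
    values = np.asarray(values, dtype=float)
    z = values - values.mean()
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)


def local_morans_i(values, w):
    """Local Moran's I for each unit: z_i * (W z)_i / m2, with m2 = z'z / n."""
    values = np.asarray(values, dtype=float)
    z = values - values.mean()
    m2 = (z @ z) / len(z)
    return z * (w @ z) / m2


# Hypothetical 4 x 4 grid with a west-to-east gradient in prevalence.
w = queen_weights(4, 4)
grid_values = np.tile(np.arange(4.0), 4)
global_i = morans_i(grid_values, w)
local_i = local_morans_i(grid_values, w)
```

With row-standardized weights, the mean of the local statistics equals the global statistic, which is a convenient sanity check on any implementation.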
In Step 2, two analyses were conducted to identify the most essential influencing indicators without multicollinearity. First, binary linear regression was performed between each indicator and diabetes prevalence, and indicators that were statistically significant (p < 0.05) were selected. Second, to eliminate the multicollinearity problem, lasso regression was applied to these selected indicators. Lasso regression extracts prognostic signatures from large databases in a way that is driven entirely by the data. It uses the sum of the absolute values of the model coefficients as a penalty term to shrink the coefficients, some of them exactly to zero, achieving variable selection and parameter estimation simultaneously (Hayes et al., 2015; Tibshirani, 1996). By trading off bias against variance, lasso regression overcomes the shortcomings of traditional indicator selection methods, such as stepwise multiple regression, and effectively maintains the interpretability of the selected variables (Guo et al., 2015; Mueller‐Using et al., 2016). In our study, 10‐fold cross‐validation was used to tune the regularization parameter. The influencing indicators with non‐zero coefficients in the sparsest model were then carried forward to Step 3. Analyses were done with R software, version 3.3.2.
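The lasso selection step can be sketched as follows. This is a NumPy illustration under stated assumptions, not the R code used in the study: the coordinate-descent solver, the function names, the synthetic data, and the penalty grid are all hypothetical, and "the sparsest model" is interpreted here as the one-standard-error rule (the largest penalty whose cross-validated error is within one standard error of the minimum).

```python
import numpy as np


def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/2n)||y - Xb||^2 + alpha * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardized and y is centered (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]  # remove indicator j's current contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
            resid -= X[:, j] * beta[j]  # add the updated contribution back
    return beta


def cv_lasso(X, y, alphas, k=10, seed=0):
    """Pick the penalty by k-fold CV using the one-standard-error rule."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mse = np.zeros((len(alphas), k))
    for f, test_idx in enumerate(folds):
        train = np.setdiff1d(np.arange(len(y)), test_idx)
        for a, alpha in enumerate(alphas):
            beta = lasso_cd(X[train], y[train], alpha)
            mse[a, f] = np.mean((y[test_idx] - X[test_idx] @ beta) ** 2)
    mean = mse.mean(axis=1)
    se = mse.std(axis=1, ddof=1) / np.sqrt(k)
    best = mean.argmin()
    within_one_se = np.flatnonzero(mean <= mean[best] + se[best])
    return max(alphas[i] for i in within_one_se)


# Synthetic stand-in: 89 counties, 8 candidate indicators, 2 truly relevant.
rng = np.random.default_rng(42)
X = rng.normal(size=(89, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=89)
y = y - y.mean()

alphas = np.logspace(-3, 0.5, 15)
alpha = cv_lasso(X, y, alphas)
beta = lasso_cd(X, y, alpha)
selected = np.flatnonzero(beta)  # indices of indicators kept for the next step
```

The soft-thresholding update is what sets some coefficients exactly to zero, which is why lasso performs selection while ridge regression only shrinks.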
Spatial autocorrelation is common in public health studies (Chalkias et al., 2013; Wan & Su, 2016), and it violates the independence assumption of errors in OLS. Thus, in Step 3, we applied spatial regression modeling to analyze the associations between diabetes prevalence and the significant influencing indicators in different subgroups, incorporating the spatial autocorrelation dependency. There are two commonly used spatial regression models: the Spatial Lag regression in Equation 2 and the Spatial Error regression in Equation 3 (Anselin, 2013).

$$Y = \rho W_1 Y + X\beta + \varepsilon \quad (2)$$

$$Y = X\beta + u, \qquad u = \lambda W_2 u + \varepsilon \quad (3)$$

where $X$ is the matrix of influencing indicators; $Y$ is the diabetes prevalence rate for each county; $\beta$ is the vector of coefficients for each indicator; $\varepsilon$ is the error term; $W_1$ is the spatial weight matrix for the dependent variable and $W_2$ is the spatial weight matrix for the error term; $u$ is the spatially autocorrelated disturbance; $\rho$ is the spatial autoregressive coefficient; and $\rho$ and $\lambda$ are scalar parameters.
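To make the difference between the two specifications concrete, the sketch below simulates data from each, assuming the standard forms Y = ρWY + Xβ + ε (Spatial Lag) and Y = Xβ + u with u = λWu + ε (Spatial Error). It only generates data and does not estimate the models, which in the study was done in GeoDa. The ring-shaped weight matrix, the function names, and all parameter values are hypothetical, and a single W is used for both models.

```python
import numpy as np


def ring_weights(n):
    """Row-standardized weights where each unit neighbors the two adjacent units on a ring."""
    w = np.zeros((n, n))
    for i in range(n):
        w[i, (i - 1) % n] = 0.5
        w[i, (i + 1) % n] = 0.5
    return w


def simulate_spatial_lag(X, beta, rho, w, eps):
    """Solve Y = rho * W Y + X beta + eps for Y: the outcome itself spills over to neighbors."""
    return np.linalg.solve(np.eye(len(w)) - rho * w, X @ beta + eps)


def simulate_spatial_error(X, beta, lam, w, eps):
    """Y = X beta + u with u = lam * W u + eps: only the unexplained part is spatially correlated."""
    u = np.linalg.solve(np.eye(len(w)) - lam * w, eps)
    return X @ beta + u


# Hypothetical setup: 30 spatial units, 2 indicators.
rng = np.random.default_rng(1)
n = 30
w = ring_weights(n)
X = rng.normal(size=(n, 2))
beta = np.array([1.5, -0.8])
eps = rng.normal(scale=0.3, size=n)
rho, lam = 0.4, 0.6

y_lag = simulate_spatial_lag(X, beta, rho, w, eps)
y_err = simulate_spatial_error(X, beta, lam, w, eps)
```

In the lag model a change in one county's indicators propagates to its neighbors' outcomes, whereas in the error model the coefficients keep their usual interpretation and only the disturbances are spatially structured.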
Robust Lagrange Multiplier (LM) tests were used to determine the type of spatial regression by judging which LM value in the regression models was more significant, as illustrated in LeSage and Pace (2009). All the indicators were normalized before modeling. Spatial regression modeling was implemented in the GeoDa software.
To compare the relative importance of essential influencing indicators on the diabetes prevalence rate, we employed the variance decomposition (VD) method (Anderson & Cribble, 1998; Su et al., 2014). VD can decompose the variance of the dependent variable into shares and compare the relative effect of different explanatory variables by calculating individual or joint effects (Heikkinen et al., 2005). We classified the significant influencing indicators into the four categories described above. Next, the total explained variance was calculated and decomposed into several sections: (1) the individual effects of the four categories of influencing indicators; (2) the joint effect of two categories of influencing indicators; (3) the joint effect of three categories of influencing indicators; and (4) the joint effect of four categories of influencing indicators.
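The idea behind the decomposition can be sketched for the simplest case of two categories. This is an illustration under assumptions rather than the procedure used in the study: it relies on ordinary least squares R² instead of the spatial model, the function names and synthetic data are hypothetical, and the four-category case extends the same subtraction logic to all subsets of categories.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an OLS fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def partition_two(Xa, Xb, y):
    """Split the explained variance into the individual effects of A and B and their joint effect."""
    r_a = r_squared(Xa, y)
    r_b = r_squared(Xb, y)
    r_ab = r_squared(np.column_stack([Xa, Xb]), y)
    return {
        "individual_a": r_ab - r_b,  # what A explains beyond B
        "individual_b": r_ab - r_a,  # what B explains beyond A
        "joint": r_a + r_b - r_ab,   # variance the two categories explain in common
        "total": r_ab,
    }


# Synthetic stand-in: two correlated categories of indicators for 89 counties.
rng = np.random.default_rng(0)
n = 89
eco = rng.normal(size=(n, 2))
geo = 0.6 * eco[:, :1] + rng.normal(size=(n, 3))
y = eco[:, 0] + 0.5 * geo[:, 1] + rng.normal(scale=0.5, size=n)

parts = partition_two(eco, geo, y)
shares = {k: 100 * v / parts["total"] for k, v in parts.items() if k != "total"}
```

The three shares sum to 100% of the explained variance. A joint share can be negative when one category suppresses another, so it is better read as overlap than as an interaction effect.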
The number of diabetes patients was 387,954, 371,617, 356,044, and 403,527 in the four subgroups (inpatient, outpatient, UEBMI, and IURMI), respectively, and 759,571 in the All Patients group. The Chi‐square test results indicated that different subgroups displayed significant differences at the county level, both between inpatients and outpatients (Chi‐square = 34882.868, p < 0.001) and between patients using UEBMI and IURMI (Chi‐square = 114966.648, p < 0.001). Thus, it is necessary to implement the grouping process in subsequent analysis.
For the five groups, Figure 3 shows the diabetes prevalence rates in each county in the left subfigures, together with their global Moran's I index values. Diabetes prevalence per 1,000 people in 12 Shandong cities at the county level ranged from 0.97% to 49.43% for all patients (Group 5) and was significantly clustered (Moran's I = 0.328, p < 0.01). The spatial distribution patterns in Figure 3 were broadly similar for the different groups. Generally, the coastal north‐eastern counties presented a higher prevalence of diabetes, while those in the western part, especially the northwest region, exhibited lower diabetes prevalence. Some differences in spatial distributions were also found, with UEBMI patients (Group 3) displaying lower diabetes prevalence than IURMI patients (Group 4).
Figure 3. The spatial patterns of diabetes prevalence rates per 1,000 people in 12 Shandong cities at the county level based on different subgroups with the global Moran's I index and LISA maps.
Heterogeneity of diabetes prevalence was observed among the counties from the global Moran's I and local autocorrelation results. Counties in the north‐eastern region have High‐High values presenting spatial aggregation phenomena (red color), while counties in the western region have Low‐Low prevalence clustering patterns (blue color). These results confirm that a spatial regression method should be applied to analyze the relationship between diabetes prevalence and influencing indicators.
Before applying the spatial regression model, we extracted 29 influencing indicators that were statistically significant (p < 0.05) after binary linear regression, and then selected 17 influencing indicators after lasso regression, which became final determinants of diabetes (Table 1). Note that different subgroups may include different significant indicators.
Table 1. Associations Between Diabetes Prevalence and Influencing Indicators at the County Scale in 2017 in 12 Shandong Cities, China
Domain | Significant influencing indicators | Group 1 ‐ Inpatients | Group 2 ‐ Outpatients | Group 3 ‐ UEBMI | Group 4 ‐ IURMI | Group 5 ‐ All |
ECO | Per capita total export | – | 13.736*** | 7.948* | – | – |
ECO | Per capita GDP | 0.992 | – | 3.921*** | – | – |
SOC | Per capita retail sales of social consumer goods | 5.912** | 6.830*** | – | 15.916*** | 14.356*** |
SOC | Per capita grain production | −1.347 | −2.796 | – | −6.492*** | −9.486** |
SOC | Per capita fruit production | 8.500*** | 7.052** | 10.972*** | – | 17.773*** |
SOC | Per capita meat production | – | – | 2.032 | – | – |
EDU | Per capita teacher amounts in general secondary school | 3.609* | – | – | – | 7.167 |
GEO | Average elevation | 1.619 | – | – | – | −3.874 |
GEO | Proportion of building land | 0.931 | – | 2.618 | – | – |
GEO | Proportion of green space | – | −5.786*** | −0.763 | −2.095 | −3.914 |
GEO | Proportion of blue space | −1.849 | −5.595*** | −4.486** | – | −8.882*** |
GEO | Proportion of bare soil land | – | 0.805 | – | 3.182 | −0.583 |
GEO | Annual average sunshine hours | – | – | 1.173 | – | – |
GEO | Annual average PM2.5 | −0.120 | 1.938 | 1.290 | – | 1.382 |
GEO | Road density | – | – | −6.000** | – | – |
GEO | The accessibility of hospital | – | – | −1.568 | – | – |
GEO | The accessibility of clinic | – | – | – | – | −2.739 |
Constant | | 3.207* | 7.559*** | 4.340 | 4.972** | 14.062*** |
R² | | 0.539 | 0.612*** | 0.535 | 0.610*** | 0.629*** |
Abbreviations: ECO, economic level; SOC, sociodemographic level; EDU, education level; GEO, geographical environment level.
*p < 0.1, **p < 0.05, ***p < 0.01.
Based on these selected indicators, the results of the robust LM test in spatial regression indicated that all LM Error values in five groups were more significant (p < 0.01) than LM Lag (p > 0.01). Thus, the Spatial Error model with the queen weight matrix, incorporating the average neighboring influences in the geographical space, was chosen as the final model.
Table 1 displays the associations between the significant influencing indicators and diabetes prevalence based on the Spatial Error model. The two economic indicators, per capita total export and per capita GDP, were positively associated with diabetes prevalence wherever they were selected, although per capita GDP was insignificant in Group 1 Inpatients. Sales of social consumer goods played an essential role in shaping public health, with per capita retail sales of social consumer goods having a significant positive influence on diabetes prevalence across all groups except UEBMI. Unexpectedly, per capita grain production presented a negative correlation with diabetes prevalence (Group 4 IURMI and Group 5 All Patients), although it was insignificant in some groups (Group 1 Inpatients and Group 2 Outpatients). Per capita fruit production showed a significant positive relationship with diabetes prevalence in every group except Group 4 IURMI, where it was not selected. The education indicator had a positive relationship with diabetes prevalence that was significant (p < 0.1) in Group 1 Inpatients but insignificant in Group 5 All Patients, which was contrary to our expectations. All the land use type indicators were extracted. The proportions of green space (e.g., vegetation and woodland) and blue space (e.g., water and wetlands) presented negative relationships with diabetes prevalence wherever they were selected; these were significant for green space in Group 2 Outpatients and for blue space in Groups 2, 3, and 5. PM2.5 concentration had positive coefficients in Groups 2, 3, and 5, but none was significant. High accessibility of medical facilities presented a negative correlation in Group 3 UEBMI and Group 5 All Patients, though neither was significant.
Contributions of these significant influencing indicators to the total variations of diabetes prevalence in each group are displayed in Table 2. For Group 3 and Group 4, the individual effects of the geographic environment were stronger than those of the other categories, which indicated that UEBMI and IURMI patients were more influenced by these environmental indicators. The educational factors contributed less to the total variations individually. In addition, the joint effects between sociodemographic and geographical environment indicators, as well as those among economic, sociodemographic, and geographical environment indicators, accounted for a relatively high proportion of the total explained variances. In Group 1 Inpatients and Group 5 All Patients, the joint influences among all four categories of factors were also relatively strong. Such results suggested that, in most cases, the joint effects of influencing indicators explained more of the variation in diabetes prevalence than their individual effects.
Table 2. The Individual and Joint Effects of Different Categories of Influencing Indicators in Terms of Their Contributions to the Total Variations on Diabetes Prevalence (%)
Domain | Group 1 ‐ Inpatients | Group 2 ‐ Outpatients | Group 3 ‐ UEBMI | Group 4 ‐ IURMI | Group 5 ‐ All |
ECO | 9.92 | 22.33 | 2.81 | – | 13.59 |
SOC | 10.69 | 5.12 | 3.09 | 4.29 | 6.41 |
EDU | 1.15 | – | – | – | – |
GEO | – | 5.12 | 44.94 | 61.90 | 1.55 |
ECO & SOC | 1.15 | 0.93 | – | 3.81 | 3.88 |
ECO & EDU | 17.37 | – | – | – | 13.79 |
ECO & GEO | – | 8.37 | 23.60 | 0.48 | – |
SOC & GEO | 18.89 | 20.47 | 15.73 | 13.33 | 21.17 |
EDU & GEO | 2.48 | – | – | – | 1.17 |
ECO & SOC & GEO | – | 37.67 | 9.83 | 16.19 | 20.19 |
ECO & EDU & GEO | 1.72 | – | – | – | 2.14 |
SOC & EDU & GEO | 18.70 | – | – | – | – |
ECO & SOC & EDU & GEO | 17.94 | – | – | – | 16.12 |
Total | 100 | 100 | 100 | 100 | 100 |
Note. Bold numbers denote the top three largest proportions.
Abbreviations: ECO, economic level; SOC, sociodemographic level; EDU, education level; GEO, geographical environment level.
This paper proposed a framework to extract the influencing indicators automatically and to evaluate the associations between the most essential influencing indicators and diabetes prevalence, with data‐driven and spatial methods new to diabetes research. We found that these influencing indicators and estimated coefficients varied across different groups, which revealed complex associations and potential mechanisms between diverse indicators and diabetes prevalence though they did not specify causal relationships.
In Figure 3, we observed spatial clustering patterns of diabetes prevalence among the counties in Shandong Province. The degree of spatial autocorrelation of diabetes prevalence was significant (Moran's I = 0.328, p < 0.01). This result was consistent with previous diabetes studies (Green et al., 2003; Hipp & Chalise, 2015; Siordia et al., 2012; Tompkins et al., 2010). Although diabetes is a health threat all over the world, its prevalence and distribution in various areas are heterogeneous. We also found several clusters of high diabetes prevalence in the coastal north‐eastern counties in both the Inpatients and Outpatients groups. These areas are far from the city center and are socially and ethnically diverse with high socioeconomic deprivation, which may cause this hotspot phenomenon. The specific reasons need further analysis. There were also different diabetes prevalence and spatial distribution patterns in different subgroups such as UEBMI and IURMI. UEBMI patients had lower diabetes prevalence than IURMI patients. This may be attributable to differences in average individual characteristics between the groups (Huang et al., 2019). IURMI patients mainly consisted of rural residents, urban retirees, the unemployed, students, and children, with lower education levels and poorer physical conditions, which may increase the prevalence of diabetes.
We found that several economic indicators had a positive correlation with diabetes prevalence. With rapid urbanization and the province's economic transition, risks of diseases have increased, negatively influencing public health and lifestyle (Gong et al., 2012). Previous studies have shown that diabetes prevalence has been greatly affected by economic factors, and socioeconomic inequalities will result in higher risks to diabetes (Fano et al., 2012; Grintsova et al., 2014). From our results, socioeconomic factors, including per capita retail sales of social consumer goods, per capita total export and per capita GDP, were significantly positively associated with the diabetes prevalence (Couchoud et al., 2011; Li et al., 2018). A previous study on a regional scale also suggested that rapid economic growth and transition may pose a potential higher risk of diabetes prevalence compared to low GDP regions (Tang et al., 2019). Possible explanations for these findings are that the urban areas with high GDP attract poor rural migratory workers, who have poorer health, lack health education and are less health‐conscious. Moreover, people in high economic development areas tend to consume high‐calorie foods, which may lead to obesity, one of the main causes of diabetes (Kastorini & Panagiotakos, 2009; Tang et al., 2019).
We also found an association between food production and the prevalence of diabetes. Interestingly, our results indicated that fruit production had a significant positive relationship with diabetes prevalence, while grain production was negatively associated with it. This observation seems to agree with a global study based on country‐level data demonstrating that fruit intake was significantly and positively associated with the prevalence of diabetes, whereas cereal intake had a negative association (Li et al., 2020). Although food production cannot fully represent residents' food intake, we speculate that there are two potential pathways through which food production may affect diabetes prevalence. First, grain production may represent the availability of an adequate food supply for farmers. Some researchers have identified inadequate food supply, as in so‐called food deserts, as a potential risk factor for diabetes (Maier et al., 2013; Seligman et al., 2009), which may bring higher diabetes prevalence. Second, excessive fructose intake (>50 g/d) may be one of the underlying causes of type 2 diabetes (Johnson et al., 2007, 2009), though fructose may not necessarily cause health problems (DiNicolantonio et al., 2015). Where fruit products cannot be completely sold or exported, we surmise that these fruits may be processed and consumed by local people, and the fructose generated in this process may have a potential impact on health. It would be interesting to relate the level of industrialization to this indicator in the future.
Higher education levels generally lead to a better understanding of healthcare instructions, such as glycemic control (Schillinger et al., 2002). However, our results indicated that the educational indicator was positively correlated with the diabetes prevalence though only one group was significant. One explanation might be that this indicator did not fully evaluate the education level of each county, especially urban counties with high levels of migrant workers. Some studies have also argued that the association between education level and diabetes prevalence has been uncertain and insignificant (Couchoud et al., 2011; Zhou, Astell‐Burt, Bi, et al., 2015), which is consistent with our empirical results.
Many selected geographic environment indicators were observed affecting the associations with diabetes prevalence. Our results supported the general hypothesis that pleasant and comfortable environments positively influenced people's physical and mental health (Corman et al., 2016; Wan & Su, 2016). We found that the proportion of green space and blue space both had significantly negative relationships with diabetes prevalence, consistent with previous research (Faka et al., 2017; Li et al., 2020; Mackenbach et al., 2014). This is also supported by recent urban planning research linking land use types to people's lifestyles and health (Gascon et al., 2016; Su et al., 2016). They can be explained as follows: first, green space can purify the air environment (Alcock et al., 2015) and blue space is regarded as an important factor in absorbing pollutants and harmful substances (Völker & Kistemann, 2015), which may slow down the risk of diabetes prevalence; second, areas with a high proportion of green and recreational spaces motivate people to enhance the physical activity, social interaction, and emotional connections (Faka et al., 2017; Wall et al., 2012), which moderate diabetes prevalence and provide plenty of public health benefits. In addition, exposure to PM2.5 pollution has been regarded as a significant factor influencing the prevalence of diabetes (Coogan et al., 2016; Eze et al., 2017; Puett et al., 2011). We also found that weakly positive associations between PM2.5 and diabetes prevalence, which was similar to several epidemiologic studies demonstrating air pollution increased the prevalence of diabetes and acute diabetic complications explained by subclinical inflammation (Dales et al., 2012; Krämer et al., 2010).
We observed that the individual effects of economic factors on diabetes prevalence were significant, while other factors were more subject to joint effects, which was consistent with previous studies on the influence of social deprivation on public health (Wan & Su, 2016). This suggests that diabetes results from the interactions and combinations of multiple influencing indicators from different domains.
Some studies have suggested that cold temperatures could lead to elevated glycosylated hemoglobin levels and acute complications of diabetes (Hou et al., 2017; Huang et al., 2019). Also, the proportion of secondary and tertiary industry output has often been used in related diabetes research (Couchoud et al., 2011). However, after executing the binary linear regression and lasso regression, these indicators were insignificant and were not included in the final spatial regression model.
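How a lasso penalty drops weak indicators, as happened to the temperature and industry output indicators above, can be illustrated with a minimal sketch. It is not the implementation used in this study: the data are synthetic, the penalty `alpha` is an arbitrary illustrative value, and the coordinate descent routine `lasso_cd` is a hypothetical helper written only for this example.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + alpha*||b||_1 by coordinate descent.

    Assumes the columns of X are standardised and y is centred.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of indicator j added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta

# Synthetic stand-in for county-level indicators (assumption, not study data):
# only columns 0 and 2 truly drive the outcome.
rng = np.random.default_rng(0)
n_units, n_indicators = 300, 6
X = rng.standard_normal((n_units, n_indicators))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.5 * rng.standard_normal(n_units)
y = y - y.mean()

beta = lasso_cd(X, y, alpha=0.3)
selected = [j for j in range(n_indicators) if beta[j] != 0.0]
print("retained indicator columns:", selected)
```

Indicators whose coefficients are shrunk exactly to zero are excluded from later modeling. In practice the penalty would normally be chosen by cross‐validation rather than fixed by hand.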
Our methodological framework combined a data‐driven method with a spatial regression model to improve performance in estimating the association of indicators with diabetes prevalence. Two suggestions can be proposed for future directions of spatial disease modeling. First, it is necessary to perform factor pre‐extraction, as was done in our study; second, if spatial autocorrelation exists, spatial components need to be considered by applying spatial analysis. The former guarantees the parsimony and adaptability of the modeling, while the latter improves its accuracy. This framework can be easily extended to other regions or to other diseases that, like diabetes, are associated with many uncertain indicators and drivers. The selected influencing indicators offer useful insights for local health policy makers and medical practitioners, who can use them to propose tailored advice and decision‐making support for public health issues from the perspective of urban planning and economics. The influencing factors also help government policy makers to plan and regulate the external environment.
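The second suggestion, checking for spatial autocorrelation before choosing a spatial model, can be sketched with global Moran's I and a permutation test. This is a minimal illustration under stated assumptions, not the procedure used in this study: the spatial units form a hypothetical 4 × 4 grid with binary rook contiguity weights rather than real county adjacency, and the helper names `rook_weights`, `morans_i`, and `permutation_p` are invented for the example.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Binary rook-contiguity weight matrix for a regular grid (illustrative)."""
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:          # neighbour to the right
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < n_rows:          # neighbour below
                W[i, i + n_cols] = W[i + n_cols, i] = 1.0
    return W

def morans_i(x, W):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), z = deviations from mean."""
    z = np.asarray(x, dtype=float) - np.mean(x)
    return (len(z) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_p(x, W, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive autocorrelation via random relabelling."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, W)
    sims = np.array([morans_i(rng.permutation(x), W) for _ in range(n_perm)])
    return (np.sum(sims >= observed) + 1) / (n_perm + 1)

W = rook_weights(4, 4)

# A smooth gradient: neighbouring units have similar values (clustered pattern).
gradient = np.add.outer(np.arange(4), np.arange(4)).ravel().astype(float)
# A checkerboard: every neighbour has the opposite value (dispersed pattern).
checkerboard = np.array([1.0 if (r + c) % 2 == 0 else -1.0
                         for r in range(4) for c in range(4)])

i_gradient = morans_i(gradient, W)
p_gradient = permutation_p(gradient, W)
i_checker = morans_i(checkerboard, W)
print("gradient:", i_gradient, "pseudo p:", p_gradient)
print("checkerboard:", i_checker)
```

A significantly positive Moran's I, as for the gradient pattern, indicates clustering and would motivate a spatial specification such as a spatial error model. A value near zero would suggest that an ordinary regression may suffice.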
This study had some limitations. First, our analysis was based on counties as basic spatial units, which were identified according to administrative divisions. Importantly, in the health geography field, different spatial units, such as Thiessen polygons (Openshaw, 1984) or zones created with certain characteristics (Li et al., 2019), may generate different results and statistical bias, the so‐called modifiable areal unit problem. Second, we did not consider the time series of indicators or patients' long‐term exposure to risk factors. In this paper, we mainly focused on the differences among counties. Lastly, our framework was exploratory and did not specify the causal mechanisms. To improve our framework, the next step would involve identifying the causal paths and mediator variables by combining structural equation modeling (SEM) or mediating effect tests (Wan & Su, 2016).
In this study, we proposed a framework to measure the association between diabetes prevalence and influencing indicators with spatial effects. The significant influencing indicators were identified automatically using a data‐driven method new to diabetes research. We not only detected the spatial patterns of diabetes prevalence in five different groups (inpatient, outpatient, UEBMI, IURMI, and All), but also isolated the individual and joint effects of influencing indicators on diabetes prevalence. We also performed a comprehensive exploration of the influence of economic, sociodemographic, education, and geographic environment indicators on diabetes prevalence. Finally, we provided detailed methodological improvements to help public health departments manage diabetes and build healthy environments. This framework can be extended to other regions or other diseases to explore corresponding relationships between diseases and influencing indicators.
The authors would like to thank the Shandong Medical Insurance Research Association for providing funding support under Award Number SK170078.
The authors declare no conflicts of interest relevant to this study.
The data, including the diabetes prevalence data for each county and the elevation data, and the code for crawling the POIs of medical facilities and obtaining remote sensing images that support the findings of this study are available at
© 2021. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
To control and prevent the risk of diabetes, diabetes studies have identified the need to better understand and evaluate the associations between influencing indicators and the prevalence of diabetes. One constraint has been that influencing indicators have been selected mainly based on subjective judgment and tested using traditional statistical modeling methods. We proposed a framework new to diabetes studies that uses data‐driven and spatial methods to identify the most influential determinants of diabetes automatically and to estimate their relationships with diabetes prevalence. We used data from diabetes mellitus patients' health insurance records in Shandong province, China, and collected influencing indicators of diabetes prevalence at the county level in the sociodemographic, economic, education, and geographical environment domains. Our autocorrelation results showed that the diabetes prevalence in 12 Shandong cities was significantly clustered (Moran's I = 0.328, p < 0.01). In total, 17 significant influencing indicators were selected by executing binary linear regressions and lasso regressions. The spatial error regressions for different subgroups retained different diabetes indicators. Some indicators, such as per capita fruit production, were significantly and positively associated with diabetes prevalence, while others, such as the proportion of green space, were negatively associated. Diabetes prevalence was mainly subject to the joint effects of influencing indicators. This framework can help public health officials implement improved treatment and policies to reduce the burden of diabetes.
Details

1 School of Resource and Environmental Sciences, Wuhan University, Wuhan, China
2 Research Center of Health Economics and Management, Dong Fureng Institute of Economic and Social Development, Wuhan University, Beijing, China
3 Top Education Institute, Sydney, NSW, Australia; Newcastle Business School, University of Newcastle, Newcastle, NSW, Australia; School of Management and School of Economics, Tianjin Normal University, Tianjin, China
4 School of Public Health, Center for Health Economics Experiment and Public Policy, Shandong University, Key Laboratory of Health Economics and Policy Research, NHFPC (Shandong University), Jinan, China