Introduction
The traditional property valuation method is based on the visual and comparative characteristics that an appraiser can gather, where variables such as the year of construction, total surface area, number of bathrooms and number of bedrooms, among others, are analyzed. Then, a detailed report is created that indicates the estimated value of the property based on the appraiser’s expertise [1]. However, due to the diversity of appraiser approaches, the results of each valuation show significant differences in the property value estimates, which implies subjectivity and arbitrariness [2]. Two approaches to automated valuation models (AVMs) have been developed to achieve more objective appraisals and to improve information about the key variables that affect home prices: Hedonic pricing models (HPM) and machine learning (ML) algorithms.
Hedonic models are good tools for identifying the variables that have the most significant effect on the price of homes. However, these methods are not precise enough to estimate the value of properties due to their econometric problems [3]. On the other hand, models based on machine learning algorithms have proven to be superior in estimating real estate prices because they can capture the complex non-linear relationships between attributes and house prices [4].
Nevertheless, these models generally lack interpretability as they are black boxes, making it difficult to quantify the importance of each characteristic. This trade-off has generated a broad debate on which method to use since it is desirable to have a model that is both accurate and interpretable. Consequently, multiple methods have been developed that allow the decisions made by machine learning models to be interpreted [5–7]. Also, these methods can reveal the existing interactions between attributes, which are mostly complex and non-linear. Still, in the context of housing price prediction, few studies have used these tools to explain the importance of specific features, which this research attempts to address.
Through the years, multiple variables related to the neighborhood, location, and environmental quality have been studied, which, together with the structural attributes, determine the property price. However, only in recent years have variables based on the appearance of the homes been considered, which generally must be obtained from interior and exterior photos of the residences. These important variables can now be assessed due to the development of new techniques for interpreting the visual information contained in images.
The contributions of this paper are twofold. First, we build a machine learning AVM for the real estate market of the Santiago Metropolitan Region of Chile. Using this framework, we evaluate the influence and impact of including structural, location, neighboring environment, and image-based features. In this regard, we show (i) complex non-linear interactions between the explanatory variables and the property price, providing support for the use of machine learning-based AVMs, and (ii) that the use of image-based attributes significantly increases the model’s performance, highlighting the importance of including visual information in AVMs. Second, we compare the results obtained through the use of the machine learning model with those obtained using more traditional approaches, namely, a hedonic price model with spatial adjustments. To the best of our knowledge, this is the first paper to compare an interpretable machine learning (IML) approach with a hedonic price model that incorporates spatial autocorrelation for studying real estate prices. We find, as previous literature does, that machine learning techniques are vastly superior to traditional approaches when attempting to predict prices. However, we also find that trying to interpret machine learning outcomes using IML techniques might lead to some incorrect results. Indeed, we find that the IML insights seem to align more with the simpler OLS model, which does not correct for spatial autocorrelation, than with the more robust SAR model. Thus, since not correcting for spatial autocorrelation leads to biased estimates [8], we believe that this result calls for a careful interpretation of machine learning models’ outcomes, especially when they are used for inference purposes.
The rest of this paper is structured as follows: in Sect 1.1 we review the relevant literature; Sect 1.2 describes the study area and the dataset; the methods and algorithms used are described in Sect 1.3 and Sect 1.4; the experimental results are shown in Sect 2; and the conclusions are discussed in Sect 3.
[Figure omitted. See PDF.]
1. Materials and methods
1.1. Literature review
The hedonic price model is based on the theory proposed by [9] and later extended by [10], who postulate that goods are a final product made up of a set of features that add value to it. In the context of the real estate market, houses are composed of and described by a group of n attributes $z = (z_1, z_2, \ldots, z_n)$, where $z_i$ measures the amount of the i-th characteristic of a property. Properties have a quoted market price $p(z) = p(z_1, z_2, \ldots, z_n)$. The property price and the implicit price of the features are determined by the consumers’ preferences and the producers’ costs.
An example of what happens with demand can be seen on the left side of Fig 1. Here $\theta(z; u, y)$ represents the consumer’s willingness to pay for the different values of property attributes ($z$), given a utility index ($u$) and income ($y$). In this case, $p(z)$ is defined as the minimum price the consumer is willing to pay for a property with characteristics $z$. The consumer’s utility is maximized when $\theta(z^*; u^*, y) = p(z^*)$ and $\theta_{z_i}(z^*; u^*, y) = p_i(z^*)$ (subscripts denote partial derivatives with respect to $z_i$), where $z^*$ and $u^*$ are optimal values obtained when $p(z)$ and $\theta(z; u^*, y)$ are tangent. On the other hand, an example of the producer’s response to this offer can be seen on the right side of Fig 1. In this case, $\phi(z; \pi, \beta)$ denotes the producer’s or seller’s willingness to accept an offer on a property with characteristics $z$, given a constant profit ($\pi$) and a specific production cost ($\beta$). Thus, $p(z)$ denotes the maximum price that can be obtained in the market for a property with attributes $z$. Finally, symmetric with demand, the producer’s profit is maximized when $\phi(z^*; \pi^*, \beta) = p(z^*)$ and $\phi_{z_i}(z^*; \pi^*, \beta) = p_i(z^*)$, where the optimal values $z^*$ and $\pi^*$ are obtained when $p(z)$ and $\phi(z; \pi^*, \beta)$ are tangent. It is expected that, given a property with a set of characteristics $z$, the willingness to pay by a consumer ($\theta$) is different from the willingness to accept by a producer ($\phi$). However, they can become equivalent through negotiation, resulting in the curves of the functions $\phi$ and $\theta$ touching at a common point where the gradient of both is the gradient of the market price function $p(z)$ at that point. Consequently, the shadow price of each attribute $z_i$ is given by $p_i(z) = \partial p(z) / \partial z_i$.
The hedonic approach has been applied in real estate market research in multiple cities in the world to construct AVMs that provide information on the characteristics that affect property prices. A variety of variables have been studied, and the classification made by [11] was used to categorize them. In this classification system, the attributes are divided into three categories, the first of which represents structural variables such as the number of bathrooms, bedrooms, total and habitable surface area, and the type of building, which in general provide the base of the models since they are the traditional property characteristics. The second category refers to characteristics related to the property’s neighborhood, which generally describes the demographics of the sector. The third category is linked to the property’s location and the distances to specific services or points that may affect its price.
Several studies have highlighted the importance of complementing the structural attributes of homes with location and neighborhood variables since they can help calculate the added value in the sector and allow differentiation between homes and their prices based on patterns that are not always easy to identify. Some variables that have been studied are distances and travel times to essential services such as subway stations, the central district, malls, parks [12–14], neighborhood population density, the average income of the sector, and the building density of the area [1,15,16], among others. Also, other types of relevant variables are related to the environmental quality, such as air and water pollution, noise level [13], and the Green View Index [17], among others; however, these attributes are outside the scope of this investigation. Another branch of the literature shows the importance of making specific corrections to improve the hedonic models, such as that carried out by [18], which highlights the importance of using prices in logarithmic form to mitigate the heteroscedasticity commonly present in hedonic estimates. Additionally, the study by [11] highlights using models that consider the spatial dependence present in the price of houses, demonstrating empirically that a Spatial Autoregressive Model (SAR) and a Geographically Weighted Regression (GWR) perform better than a traditional Ordinary Least Squares (OLS) model. Thus, the previous studies show that the hedonic approach is widely used to determine the variables that explain the price of properties. Nevertheless, because the data from the real estate markets shows non-linear patterns and the inherent econometric problems such as model specification, heteroscedasticity, and independent variable interactions [19], this methodology does not perform well when creating a model focused on valuing properties with the least possible error. 
Also, there are issues of endogeneity due to the omission of location, neighborhood, and environmental variables given their difficulty to access and the limited data available [3].
The rapid growth of the real estate market has produced broad interest in creating AVMs to value properties objectively. The difficulty of creating accurate hedonic models has led to the use of machine learning algorithms due to their ability to capture the non-linear relationship between property prices and attributes. Several studies have empirically compared the performance of hedonic models against machine learning algorithms. In particular, [4] performed a systematic review of studies that compared the accuracy of both approaches, with 57 of the reviewed articles determining that machine learning models are more accurate, compared to 13 studies that concluded that the hedonic approach was superior. It is not possible to determine which machine learning algorithm is better since multiple factors can affect their results, such as the quality of the data, the choice of hyperparameters, the treatment of missing data, and outliers, among others [2]. Still, many studies have empirically demonstrated the superiority of algorithms based on decision trees to automatically evaluate properties [17,20,21]. Due to the above, in this study, we applied algorithms based on decision trees.
Despite the intense development that AVMs have undergone in the industry, they have continued to ignore the interior and exterior appearance of the properties, which are critical factors in home prices. Consequently, various investigations have focused on including visual information in the models, and they have empirically demonstrated that using attributes extracted from images can reduce the errors in property appraisals. Different techniques have been used, such as Speeded-Up Robust Features (SURF) in [22] and Convolutional Neural Networks (CNN) in [23], where the latter achieved a median error rate of 5.6% in the estimation of property values, a better result than that obtained by Zillow’s AVM (Zestimate) in 2016. These results led Zillow to launch a new version of its AVM in June 2019, incorporating photo analysis to understand home quality and attractiveness, which reduced the median error rate in estimating the price of the properties to less than 2% [24]. Thus, the image variables belong to a new category of characteristics that determine the price of homes, and which have been enhanced by the development of techniques that allow information to be extracted from property photographs. A complete study on the power of visual information in predicting the price of homes was conducted by [25], where multiple techniques were applied to obtain the largest possible number of attributes from the images, resulting in a substantial improvement in the accuracy of home price estimates when using the best combination of image attributes.
Most studies agree that machine learning algorithms outperform traditional models but lack transparency, unlike the hedonic approach, which can, for example, infer the relative importance of explanatory variables [4]. To address this, research in Explainable Artificial Intelligence (XAI) or IML proposes methods to infer variable importance. Techniques like Permutation Importance [5], LIME [6], and SHAP [7] each have advantages and disadvantages, with SHAP offering a strong theoretical grounding in game theory. Few studies have applied these methods in real estate, with [17] and [21] being two notable examples. This study closely relates to [21] but uses hedonic price models with spatial adjustments. We believe this is a key issue, since spatial autocorrelation causes the estimated coefficients and standard errors, as well as most statistical significance tests, to be biased [26]. Thus, the interpretation and comparison with other methods might also be biased when the spatial correlation is not incorporated.
In summary, this paper uses decision tree-based algorithms to build AVMs for Santiago, Chile. Then, we apply the IML techniques to explain the underlying mechanisms of the constructed model. Finally, we compare the IML results with the importance assigned by hedonic price models considering spatial dependence in property values. To the best of our knowledge, this is the first article comparing an IML approach and a hedonic price model with spatial autocorrelation to study property prices.
1.2. Data description
1.2.1. Study area.
This study used information regarding the properties for sale in the Santiago Metropolitan Region (RM) of Chile. This area is the second smallest region in Chile, with a surface of 15,403.2 square kilometers. However, it is the most populated region according to the 2017 Census, with a population of 7,112,808 inhabitants, leading to a density of 461.77 inhabitants per square kilometer.
This area contains 52 municipalities. Fig 2(a) depicts the population of each municipality: the one with the largest population is Puente Alto, followed by Maipú and Santiago. In contrast, the municipalities with the fewest inhabitants are Alhué, San Pedro, and María Pinto. In addition, according to the housing record of the 2017 Census, the Metropolitan Region has a total of 2,378,442 dwellings, of which 2,286,103 are urban and 92,339 are rural. This suggests that there can be a large difference in housing supply between municipalities, as some have a higher percentage of rural population. Finally, housing prices are measured in UF, an inflation-indexed unit of account calculated and published by the Central Bank of Chile (https://si3.bcentral.cl/estadisticas/Principal1/metodologias/EC/IND_DIA/ficha_tecnica_UF_EN.pdf).
[Figure omitted. See PDF.]
(a) Population, (b) number of properties for sale, and (c) average price per square meter. Basemap data ©OpenStreetMap and ©Carto contributors.
1.2.2. Dataset.
The dataset used in this article included 98,630 properties for sale in the Santiago Metropolitan Region of Chile. The data was provided by the company Coderhub SpA, which collected the property information in August 2020 from three prominent real estate websites in Chile. We performed missing data processing and data deduplication. In addition, we identified outliers by applying the Extended Isolation Forest algorithm [27] (based on the algorithm developed by [28]) to remove the properties that had atypical prices according to their attributes. We then applied filters to select the properties with the most common prices and characteristics, choosing homes with prices between 1,000 and 27,400 UF, 1-5 bathrooms, 1-6 bedrooms, apartments with a total surface area between 15 and 400 m², and houses with a total surface area between 30 and 2,500 m², and keeping only the properties that had at least one photo. The final number of properties used was 52,039, distributed in the area as seen in Fig 2(b), where the municipalities of Las Condes and Santiago had the highest property supply and Curacaví and Calera de Tango the lowest, which is probably related to the difference in the level of urbanization of each municipality. Likewise, Fig 2(c) shows that the municipalities of Providencia and Vitacura have the highest price per square meter.
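The range filters described above can be sketched as a simple predicate. This is a minimal illustration, not the pipeline actually used: the `Listing` record and its field names are hypothetical, and the bounds are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    # Hypothetical record layout; the field names are assumptions for this sketch.
    price_uf: float
    bathrooms: int
    bedrooms: int
    building_type: str  # "apartment" or "house"
    total_surface_m2: float
    n_photos: int


# Total surface bounds (in square meters) per building type, as stated in the text.
SURFACE_BOUNDS = {"apartment": (15, 400), "house": (30, 2500)}


def passes_filters(p: Listing) -> bool:
    """Return True when a listing falls inside the ranges described in the text."""
    bounds = SURFACE_BOUNDS.get(p.building_type)
    if bounds is None:
        return False  # unknown building type: excluded in this sketch
    low, high = bounds
    return (
        1000 <= p.price_uf <= 27400
        and 1 <= p.bathrooms <= 5
        and 1 <= p.bedrooms <= 6
        and low <= p.total_surface_m2 <= high
        and p.n_photos >= 1
    )
```

Applying `passes_filters` to every deduplicated, outlier-free listing would yield the filtered sample; the outlier step itself (Extended Isolation Forest) is not reproduced here.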
Initially, this dataset contained seven variables that mainly described the structure and location of each dwelling (bathrooms, bedrooms, latitude, longitude, building type, seller type, and total surface area), which we call the “Base Variables” (BV). Nevertheless, to add more information on each property, we constructed variables that belonged to the four relevant categories considered in this article:
1. Structural variables (SV): We identified five structural features: the number of parking spaces, the number of storage units, and three binary indicators for whether the property has a terrace, a swimming pool, or is a new construction home. Many of the structural characteristics of the homes were obtained directly from the information provided by the real estate websites. However, some essential attributes were not always easy to extract and were only mentioned in the property descriptions. We used different regular expression techniques to extract this information, focusing on determining if these characteristics were mentioned in the description. If so, we looked for numbers around the mention (in its written or numerical form).
2. Neighborhood variables (NV): We computed six variables related to educational accessibility in the property neighborhood. These variables include the number of primary and secondary schools within a radius of 1, 2, and 3 kilometers, and the average educational quality level of schools within a radius of 2 kilometers, measured at three different grade levels. The number of nearby schools helps determine the accessibility of households to education services and is also related to other essential neighborhood characteristics such as population density and street congestion.
We used the "Sistema de Medición de la Calidad de la Educación" (SIMCE) results to measure the educational quality level of nearby schools. SIMCE is an annual standardized measurement applied to all schools in the country. We included the average SIMCE score for three different grade levels: 4th and 8th grades of primary education, and 2nd grade of secondary education.
3. Location variables (LV): We considered 11 location variables, namely, the property municipality’s quality of life, and the distance to 10 types of services. We quantified the quality of life of each municipality based on the “Urban Quality of Life Index” (ICVU), which is an analysis at municipal scale developed by [29] that provides a reference on the provision of public and private goods and services with six dimensions (Labor Conditions; Business Environment; Sociocultural Conditions; Connectivity and Mobility; Health and Environment; Housing and Environment). Still, this index only considers 42 of the 52 municipalities of the Metropolitan Region. Therefore, for the municipalities with missing values, we averaged the index of the surrounding municipalities. On the other hand, several studies have empirically shown that using distances to specific services and places improves the performance of AVMs [11,20]. Using these variables helps determine the added value that each property has, differentiating properties within the same municipality. We calculated the minimum geodesic distance each property has to 10 services or places: Subways, parks, hospitals or health clinics, police stations, schools, universities, malls, penitentiaries, industries, and dumps.
4. Image variables (IV): We considered 16 binary variables, indicating whether the publication includes at least one image of each one of 16 features (balcony, bathroom, bedroom, dining room, exterior, floor plans, game room, gym, interior, kitchen, laundry room, living room, parking lot, patio, street, and swimming pool). We chose these features since they are the ones that appear most frequently on real estate websites. To construct these variables, we developed a 16-class image classification model, based on the DenseNet201 CNN architecture [30]. This model achieved 92% accuracy with the test data. The images used to train the classifier model were a mixture of the SUN397 [31] and Places [32] databases, and images provided by the company Coderhub SpA. Additionally, we obtained the floor plan images from the database CubiCasa5k [33]. The number of images used per class for training, validation, and testing can be seen in S1 Table.
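The regular-expression extraction described for the structural variables can be sketched as follows. This is only an illustration under stated assumptions: the keywords and number words are hypothetical placeholders (the expressions actually used are not given in the text), the sketch only looks for a number immediately before the mention, and a mention without a number is assumed to mean a single unit.

```python
import re

# Hypothetical vocabulary: the number words are placeholders for this sketch.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
NUMBER_PATTERN = r"(\d+|" + "|".join(NUMBER_WORDS) + r")"


def count_feature(description: str, keyword: str) -> int:
    """Return how many units of `keyword` a description mentions (0 if absent)."""
    text = description.lower()
    # Step 1: is the characteristic mentioned at all?
    if not re.search(rf"\b{keyword}", text):
        return 0
    # Step 2: look for a number (digits or a number word) right before the
    # mention, allowing at most one intermediate word (e.g. "two covered parking").
    match = re.search(rf"\b{NUMBER_PATTERN}\s+(?:\w+\s+){{0,1}}?{keyword}", text)
    if match is None:
        return 1  # mentioned without a count: assume a single unit
    token = match.group(1)
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


description = "Bright flat with 2 parking spaces and one storage unit"
parking_spaces = count_feature(description, "parking")
storage_units = count_feature(description, "storage")
has_terrace = count_feature(description, "terrace") > 0
```

A heuristic of this kind will misread some descriptions, so in practice the patterns would need to be validated against a manually labeled sample.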
To sum up, the available variables are described in Table 1, where ln(price) is the dependent variable to be estimated. There are a total of 45 predictor variables, of which seven are the base variables of the model (BV), five are structural variables obtained from the property description (SV), 11 are location variables (LV), six are neighborhood variables (NV), and 16 are image variables (IV).
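The distance-based neighborhood and location variables (number of schools within a radius, minimum distance to a type of service) can be sketched as below. As an assumption of this sketch, the geodesic distance is approximated with the haversine formula on a spherical Earth; the text does not state which geodesic computation was used, and the coordinates shown are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_distance_km(prop, places):
    """Minimum distance from a property to any place of a given type."""
    return min(haversine_km(*prop, *place) for place in places)


def count_within_km(prop, places, radius_km):
    """Number of places (e.g. schools) within `radius_km` of the property."""
    return sum(haversine_km(*prop, *place) <= radius_km for place in places)


# Made-up coordinates, for illustration only.
property_location = (-33.45, -70.66)
schools = [(-33.46, -70.66), (-33.50, -70.66)]
nearest_school_km = min_distance_km(property_location, schools)
schools_within_2km = count_within_km(property_location, schools, 2.0)
```

For tens of thousands of properties, a spatial index would be preferable to this brute-force scan, but the computed quantities are the same.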
[Figure omitted. See PDF.]
1.3. Machine learning algorithm
We used the LightGBM algorithm [34] to create the AVMs because, in various studies, it has presented better performance than other decision tree-based algorithms such as Random Forest and XGBoost in the prediction of housing prices [25,35]. The most significant difference between LightGBM and other algorithms lies in the way that the decision trees are built: LightGBM grows the trees leaf-wise (vertically), at each step splitting the leaf that most reduces the loss function into two new leaves, rather than level-wise (horizontally). Furthermore, the authors of this algorithm proposed a method called GOSS (Gradient-based One-Side Sampling), which retains the data instances with the largest gradients in absolute value (the ones that contribute the most to the information gain) and randomly samples from the data instances with smaller gradients. This allows the information gain to be estimated accurately with far fewer data points. Still, the boosting method used in this paper is not GOSS, but DART (Dropouts meet Multiple Additive Regression Trees) [36], which mitigates overfitting by applying dropout to the ensemble, that is, by randomly dropping some of the existing trees when fitting each new tree. In general, this method helps improve algorithm performance but slows down the training process.
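To make the GOSS idea concrete, the sampling step can be sketched in plain Python. This is an illustrative sketch of the sampling rule only, not LightGBM's implementation, and it is not the boosting method used in this paper (which is DART). The parameter names `top_rate` and `other_rate` and their default values are assumptions of the sketch.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by a GOSS-style sampling rule."""
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]    # always kept
    rest = order[n_top:]   # small-gradient instances, randomly subsampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Sampled small-gradient instances are up-weighted so that the estimate of
    # the information gain stays approximately unbiased.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights
```

With ten instances and the default rates, the two instances with the largest absolute gradients are kept with weight 1, and one of the remaining eight is sampled with weight (1 - 0.2) / 0.1.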
1.3.1. Interpretable machine learning.
Machine learning algorithms are known as “black boxes” because once they are trained, it is difficult to interpret and understand the criteria they use and the decisions they make to provide a prediction. In this article, we used the SHAP (SHapley Additive exPlanations) algorithm, which is an interpretation method proposed by [7] that allows the prediction of a machine learning model for a specific instance or observation to be explained by calculating the contribution that each variable makes to the prediction. To do this, SHAP estimates the Shapley values [37], which for a variable j can be described mathematically through Eq 1.
$$\phi_j(x) = \sum_{S \subseteq \{1, \ldots, p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!} \left( v(S \cup \{j\}) - v(S) \right) \tag{1}$$
In Eq 1, S is a subset of the variables used in the model, x is the vector of values of the variables of an instance to be explained, p is the total number of variables, and v(S) is the prediction obtained by using the values of the variables of subset S. Therefore, the Shapley value is the average marginal contribution of a feature value across all possible combinations of variables. SHAP uses different methods to estimate the Shapley value depending on the nature of the model used. For tree-based models, one of the most common approaches involves the use of TreeSHAP, proposed by [38], which has the advantage of being less computationally expensive than other methods by directly using the information with which the machine learning model was trained.
Finally, SHAP also allows the global importance to be calculated by adding the absolute SHAP values of all observations of each attribute, as shown in Eq 2.
$$I_j = \sum_{i=1}^{n} \left| \phi_j^{(i)} \right| \tag{2}$$
In Eq 2, $\phi_j^{(i)}$ is the SHAP value of the variable j in the data instance i, n is the number of data instances, and $I_j$ is the global importance of the variable j.
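The definitions in Eq 1 and Eq 2 can be illustrated with a brute-force computation. The sketch below enumerates every subset of variables, so it is only feasible for a handful of variables and is not the TreeSHAP algorithm; the value function `toy_value` is a made-up example, not the fitted model.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, p):
    """Exact Shapley values for p variables; value_fn maps a frozenset S to v(S)."""
    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! from Eq 1.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value_fn(s | {j}) - value_fn(s))
    return phi


def global_importance(shap_matrix):
    """Eq 2: sum of absolute SHAP values of each variable over all instances."""
    n_vars = len(shap_matrix[0])
    return [sum(abs(row[j]) for row in shap_matrix) for j in range(n_vars)]


def toy_value(s):
    # Made-up value function with an interaction between variables 0 and 2.
    return 8.8 + 0.3 * (0 in s) - 0.1 * (1 in s) + 0.2 * (0 in s and 2 in s)


phi = shapley_values(toy_value, 3)
```

In this toy case the interaction term is split equally between variables 0 and 2, and the contributions add up to the difference between the value of the full set and that of the empty set, which is the additivity property that waterfall plots rely on.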
1.4. Hedonic price model with spatial adjustments
The hedonic price model proposed by [10] has been one of the most widely used to explain the variables that determine the prices of homes. In this context and based on Ordinary Least Squares (OLS), the formulation follows Eq 3.
$$\ln(P_i) = \beta_0 + \sum_{m=1}^{M} \beta_m X_{im} + \varepsilon_i \tag{3}$$
In Eq 3, $\ln(P_i)$ is the logarithm of the price of the dwelling i, X is a matrix of M variables that represent the attributes of the home i, $\beta_0$ is a constant to be estimated, $\beta_m$ is the shadow price of the characteristic $X_m$, and $\varepsilon_i$ is the error. However, this formulation does not consider the existing spatial autocorrelation in housing prices, which is an expected effect in real estate data empirically reported in various studies [11,39,40]. In general, spatial autocorrelation in an econometric model leads to the residuals being autocorrelated, violating the assumption of independence of the residuals and causing the estimated coefficients and standard errors, as well as most statistical significance tests, to be biased [26]. Therefore, it is necessary to include spatial proximity terms in the definition of the hedonic model, which will correct the dependence of the residuals and change the estimates of the model parameters.
The most widely used method to define the spatial proximity between the spatial units (which in this case are the dwellings) is the concept of “spatial weight matrix” which by convention is denoted with a W and corresponds to a square matrix of n x n, where n is the number of spatial units. This matrix reflects the neighbors of each spatial unit, where the weights wij are valued according to pre-established criteria that define the spatial relationships between the locations i and j. There are many criteria and rules to determine the spatial relationship, and in general, they are grouped into two methodologies: contiguity and distance. The contiguity methodology is more appropriate for geographic data expressed as polygons, while distance is suitable for point data [41]. As the locations of the houses can be treated as point data (since we have the latitude and longitude of each residence), we used a distance method based on the k-nearest neighbors algorithm (KNN). This method defines the set of neighbors according to the k closest locations, where k is a value that must be specified manually. The wij values of the matrix are determined by the following conditions:
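$$w_{ij} = \begin{cases} 1 & \text{if } j \in N_k(i) \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
where $N_k(i)$ denotes the set of the k nearest neighbors of location i.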
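A KNN-based spatial weight matrix can be sketched as follows, assuming binary weights (1 when j is one of the k nearest neighbors of i, 0 otherwise) and Euclidean distance on already projected coordinates. Both choices are assumptions of this sketch; the coordinates are made up.

```python
import math


def knn_weight_matrix(coords, k):
    """Binary n x n matrix with w[i][j] = 1 if j is among the k nearest neighbors of i."""
    n = len(coords)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances from i to every other location, nearest first.
        dists = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            w[i][j] = 1
    return w


# Four made-up locations on a line.
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
W = knn_weight_matrix(coords, k=1)
```

Note that a KNN matrix is generally not symmetric: here the third location has the second as its nearest neighbor, but not the other way around.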
In this paper, we used the Spatial Autoregressive Model (SAR) or Spatial Lag Model (SLM), which includes the “spatial lag” term, an operator that captures the behavior of a variable in the immediate environment of each location. This concept is mathematically described by Eq 5, and is included in the SAR model formulation as shown in Eq 6.
$$[W \ln(P)]_i = \sum_{j=1}^{n} w_{ij} \ln(P_j) \tag{5}$$
$$\ln(P_i) = \beta_0 + \rho \sum_{j=1}^{n} w_{ij} \ln(P_j) + \sum_{m=1}^{M} \beta_m X_{im} + \varepsilon_i \tag{6}$$
In Eq 6, ρ is the autoregressive spatial parameter to be estimated that determines the level of spatial dependence. If ρ > 0 there is a positive spatial dependence, if ρ < 0 there is a negative spatial dependence, while if ρ = 0 the model reduces to the traditional OLS model.
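The spatial lag term and its role in the SAR specification can be illustrated with a few lines of Python. The sketch evaluates the systematic part of the model for given parameter values; it does not estimate ρ or the coefficients, which requires maximum likelihood or instrumental-variable methods. Row-standardizing the weights is a common convention assumed here rather than something stated in the text, and all numbers are made up.

```python
def row_standardize(w):
    """Scale each row of the weight matrix so that it sums to one."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def spatial_lag(w, y):
    """Spatial lag: for each location, the weighted sum of its neighbors' values."""
    return [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in w]


def sar_systematic_part(w, y, x, rho, beta0, beta):
    """beta0 + rho * (W y)_i + sum_m beta_m * x_im for every location i."""
    lag = spatial_lag(w, y)
    return [
        beta0 + rho * lag_i + sum(b * x_im for b, x_im in zip(beta, x_i))
        for lag_i, x_i in zip(lag, x)
    ]


# Made-up example: three locations on a line, log prices y, one attribute x.
W = row_standardize([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
y = [8.0, 9.0, 10.0]
x = [[1.0], [2.0], [3.0]]
ols_part = sar_systematic_part(W, y, x, rho=0.0, beta0=1.0, beta=[0.5])
sar_part = sar_systematic_part(W, y, x, rho=0.1, beta0=1.0, beta=[0.5])
```

Setting `rho=0.0` removes the neighbors' influence and leaves the OLS linear predictor, which mirrors the special case described above.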
1.5. Metrics
Multiple metrics have been used in the literature to evaluate the performance of appraisal models. In general, more than one metric is analyzed so that the results of AVMs are measured with different criteria. We used the metrics shown in Table 2, where $y_i$ and $\hat{y}_i$ are the published price and the estimated value of the property i, respectively. It should be noted that, before evaluating the performance of the models, an exponential transformation was applied to the predictions to keep the unit of measurement in UF and thus preserve the interpretability of the metrics.
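The back-transformation step can be sketched as follows. Because Table 2 is not reproduced here, the two metrics used in the sketch (mean and median absolute percentage error) are assumptions chosen for illustration and are not necessarily the ones in Table 2; the prices are made up.

```python
import math
from statistics import mean, median


def to_uf(log_predictions):
    """Undo the ln(price) transform so that errors are expressed in UF."""
    return [math.exp(v) for v in log_predictions]


def absolute_percentage_errors(y_true, y_pred):
    """Per-property absolute percentage error, in percent."""
    return [100.0 * abs(y - y_hat) / y for y, y_hat in zip(y_true, y_pred)]


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return mean(absolute_percentage_errors(y_true, y_pred))


def mdape(y_true, y_pred):
    """Median absolute percentage error, in percent."""
    return median(absolute_percentage_errors(y_true, y_pred))


# Made-up published prices (UF) and model outputs on the log scale.
published = [1000.0, 2000.0, 4000.0]
log_predictions = [math.log(1100.0), math.log(1800.0), math.log(4000.0)]
predicted = to_uf(log_predictions)
```

Computing the same metrics directly on the log scale would give numbers that are harder to read in monetary terms, which is the motivation for the transformation.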
[Figure omitted. See PDF.]
2. Results
2.1. Performance of AVMs
To evaluate the increase in performance due to the addition of different variables, we train multiple AVMs using the variables presented in Table 1. First, we train a model using only the BV. Then, we sequentially add the rest of the variables. The idea is to add as many variables of each type as possible until there is no significant improvement in the model performance, discarding the characteristics that do not improve at least 3 of the 5 performance metrics presented in Table 2, compared to the previous model in the sequence.
For each model’s performance evaluation, we used 5-fold cross-validation, implying that five iterations were made, and in each one, 80% of the dataset was used as training and the remaining 20% as validation.
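The evaluation protocol (5-fold cross-validation plus the rule of keeping a variable only if it improves at least 3 of the 5 metrics) can be sketched as below. The metric names in the example are hypothetical, since Table 2 is not reproduced here, and the model fitting itself is omitted.

```python
import random


def kfold_indices(n, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def improves_enough(previous, candidate, higher_is_better=(), min_wins=3):
    """True if the candidate model beats the previous one on at least `min_wins` metrics."""
    wins = 0
    for name, old_value in previous.items():
        if name in higher_is_better:
            better = candidate[name] > old_value
        else:
            better = candidate[name] < old_value
        wins += int(better)
    return wins >= min_wins


# Hypothetical cross-validated metrics before and after adding a variable.
previous = {"MAE": 500.0, "RMSE": 900.0, "MAPE": 12.0, "MdAPE": 9.0, "R2": 0.90}
candidate = {"MAE": 480.0, "RMSE": 910.0, "MAPE": 11.5, "MdAPE": 9.1, "R2": 0.91}
keep_variable = improves_enough(previous, candidate, higher_is_better={"R2"})
```

With five folds, each iteration trains on 80% of the data and validates on the remaining 20%, and every observation is used for validation exactly once.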
Table 3 summarizes the validation results of each model of the sequence. We only present the variables that increased the previous model’s performance in at least 3 of the 5 performance metrics presented in Table 2 (see S3 Table for more details). The column Variables describes the variables used to train each model. In this case, some similar variables were grouped to ease the interpretation. For example, the group “amenities” contains the variables parking_lots, swimpool, terrace and storage_units, while the group “distances” contains the distances to services that improved the performance of the model.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Table 4 shows that the AVM that presents the best performance contains the seven base variables, five structural variables obtained from the property description, 10 location variables, two neighborhood variables, and 10 image variables linked to the photos of property features (bathroom, bedroom, living room, dining room, interior, exterior, balcony, laundry room, floor plans and swimming pool), for a total of 34 predictor variables.
As Table 3 shows, there was a significant impact on the performance metrics when the variables obtained from the description were used, since they determine the extra value of properties with parking lots, terraces, swimming pools, and storage units, which are amenities that undoubtedly affect the price of a property. Also, using the variable new_house helped to differentiate the used properties from the new ones, which is a relevant aspect when selling a property. Another set of variables that also significantly reduced the error when estimating the value of the properties was the distances to services, since they provided details related to the added value of the property and the externalities of unwanted services, and they also allowed properties within the same municipality to be differentiated. Finally, the image variables reduced the error even when the model already incorporated a wide range of information about the dwelling and its surroundings. This can be explained by the fact that these variables provided details that were not contained in the property description, such as the presence of balconies and laundry rooms. These variables also allowed prices to be differentiated when the properties did not include critical photos such as those of bathrooms, living rooms, dining rooms, and bedrooms, which may be related to the omission of information given the quality and visual appeal of these settings.
Regarding the variables that were not used in the final model, the simce_8b and simce_2m attributes were discarded because while they improved the model’s performance, they had a lower impact than simce_4b. The same occurred with the variables related to the number of nearby schools. The variables calculated within a one and two kilometer radius did not enter the final model because the variable with a three-kilometer radius achieved a greater improvement in performance. On the other hand, the image variables that did not provide relevant information to the model were the kitchen, patio, game room, gym, street, and parking lot features, which failed to improve the model’s performance in at least three metrics.
2.1.1. Relative importance of variables.
To use the TreeSHAP algorithm, a model was fitted with simple validation. Therefore, the dataset was randomly divided, using 80% to fit the model and the remaining 20% to evaluate out-of-sample performance. The performance of this AVM on the validation data is shown in Table 5.
[Figure omitted. See PDF.]
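The simple (hold-out) validation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names, the seed, and the toy prices are assumptions, while MAPE and MdAPE are two of the metrics reported in the paper.

```python
import random
from statistics import mean, median


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, test); seed and fraction are illustrative."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def absolute_percentage_errors(actual, predicted):
    """Per-observation absolute percentage error, as a fraction."""
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * mean(absolute_percentage_errors(actual, predicted))


def mdape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return 100 * median(absolute_percentage_errors(actual, predicted))


# Hypothetical data: 100 row identifiers and three toy prices in UF.
train_rows, test_rows = holdout_split(list(range(100)))
toy_actual = [1000.0, 2000.0, 4000.0]
toy_predicted = [1100.0, 1900.0, 4000.0]
toy_mape = mape(toy_actual, toy_predicted)
toy_mdape = mdape(toy_actual, toy_predicted)
```

Because MdAPE uses the median, it is less sensitive than MAPE to a few badly mispriced properties, which is one reason to report both.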
An important limitation is that the contributions of the variables (SHAP values) are on a logarithmic scale since the model’s dependent variable was ln(price).
First, to shed light on the factors influencing the price of properties, we analyzed the attributes that most affected the price of two random houses located in municipalities with totally different qualities: Las Condes and Puente Alto. The waterfall plots in Fig 3 display the variables affecting property prices, with increases shown in red and decreases in blue. These variables drive dwelling prices in both directions from a base value (the expected value of the price prediction) to the final estimate made by the model.
[Figure omitted. See PDF.]
(a) Puente Alto and (b) Las Condes.
The base value was 6,633 UF (8.8 on a logarithmic scale). The estimated price of the house in Las Condes was approximately 8,102 UF (9 on a logarithmic scale). In comparison, the estimated price of the property in Puente Alto was 4,124 UF (8.325 on a logarithmic scale). In both properties, the three attributes that most affected the property’s value were the quality of the municipality, the total surface area, and the latitude. The characteristics that increased the price of the first property were the total surface area of 233 m2, the quality of the municipality with an index of 74.54, its location further north, and the average SIMCE score for elementary schools of 287.269. In contrast, the characteristics that reduced the price were the existence of a dump 15 km away, the closest park being more than 2 km away, and only having two bathrooms in the house. On the other hand, the main attributes that increased the price of the second home were the total surface area of 190 m2, the three bathrooms in the house, and the closest park being less than 100 meters away. In contrast, the variables that decreased the price were the quality of the municipality with an index of 48.6, its location further south, the average SIMCE score for elementary schools of 267.121, the absence of a terrace, and that the nearest university was more than 4 km away.
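Because the SHAP values are additive on the ln(price) scale, they act as multiplicative factors on the price in UF. The sketch below illustrates this conversion. The base value of 8.8 is taken from the text; the individual contributions and feature names are hypothetical and were chosen only so that the prediction lands at 9.0 on the logarithmic scale.

```python
import math


def log_shap_to_price(base_log, contributions):
    """Turn additive log-scale SHAP contributions into a price and per-feature multipliers."""
    log_prediction = base_log + sum(contributions.values())
    # exp(a + b) = exp(a) * exp(b): each contribution scales the price by exp(value).
    multipliers = {name: math.exp(value) for name, value in contributions.items()}
    return math.exp(log_prediction), multipliers


base_log = 8.8  # base value on the logarithmic scale, as reported in the text

# Hypothetical SHAP contributions (log scale) for one property.
contributions = {
    "total_surface": 0.15,
    "municipality_quality": 0.10,
    "n_bathrooms": -0.05,
}

price_uf, multipliers = log_shap_to_price(base_log, contributions)

# Multiplying the base price by every feature multiplier recovers the prediction.
reconstructed = math.exp(base_log)
for factor in multipliers.values():
    reconstructed *= factor
```

Under this reading, a contribution of 0.15 raises the price by a factor of about exp(0.15), that is, roughly 16%, rather than by a fixed amount in UF.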
[Figure omitted. See PDF.]
Then, to take advantage of the AVM and to shed light on more general relationships between property prices and explanatory features, Fig 4 shows a beeswarm plot. First, to construct this plot, we compute the importance of each feature, following . Then, the 20 attributes with the highest contribution to property prices in terms of SHAP values are shown, sorted by decreasing importance. The most important feature is the total surface area of the dwelling, followed by the quality of the municipality measured with the ICVU index, the average SIMCE score for elementary schools, the number of bathrooms, and the latitude and longitude components of the property’s location. In addition, eight of the nine distances to services are among the 20 most relevant features, so it can be inferred that these variables help calculate the added value of each property. Further, the swimming pool and terrace also have a significant impact on the price of residences. Finally, no image-related variable is among the 20 most important predictors.
Second, in Fig 4, each point represents a SHAP value (x-axis) of a variable (y-axis) for a data instance. The blue and red colors represent small and large values, respectively. Points with equal SHAP values are stacked vertically, which gives a first approximation of the distribution of SHAP values per variable. From this plot, it follows, for instance, that the larger the total surface area and the higher the ICVU of the municipality, the higher the property price. Additionally, the location of the residence strongly influences its price. Indeed, residences located further north (measured by latitude) and further east (measured by longitude) have higher prices. Although this conclusion is in line with that of the ICVU quality index (municipalities in the north and east are precisely those with the highest ICVU values), we believe that the addition of geographical coordinates might help explain the heterogeneity of residential prices within the same municipality. Then, as expected, the amenities (universities, shopping malls, hospitals, and parks) in the proximity of the neighborhood also increase property prices, although the size of the effect is smaller than that of location or surface area.
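The ranking behind a beeswarm plot can be sketched as below. The sketch assumes that global importance is the mean absolute SHAP value per feature, which is the usual convention for SHAP summary plots; whether this matches the paper's own importance formula exactly is an assumption. The feature names and SHAP values are hypothetical.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP| across observations, largest first."""
    n_rows = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP matrix: one row per property, one column per feature.
features = ["total_surface", "municipality_quality", "photo_dining_room"]
shap_rows = [
    [0.30, -0.20, 0.01],
    [-0.25, 0.15, -0.02],
    [0.35, -0.10, 0.00],
]

ranking = mean_abs_shap(shap_rows, features)
top_two = [name for name, _ in ranking[:2]]
```

Taking absolute values matters: a feature whose contributions are large but of mixed sign would otherwise average out to near zero and look unimportant.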
Then, to delve further into the impact of features on property prices, Fig 5 presents five dependence plots. In these plots, each point represents a SHAP value (y-axis) of a feature (x-axis) for a data instance, while its color represents the value of a second feature. In other words, these plots allow us to study non-linear effects between a single variable (x-axis) and property prices and, at the same time, the interaction effect between two explanatory variables (given by the x-axis and the color). Note that, since our models consider 34 variables, it is possible to construct 34 · 33 = 1,122 two-variable dependence plots. Thus, for the sake of brevity, we present only a small subset of them. The first four subfigures in Fig 5 show combinations of the four most important variables, as identified in Fig 4. We present only four plots because other pairs of variables lead to effects similar to those already shown in Fig 4. Finally, the last subfigure in Fig 5 relates to a well-studied topic in the literature: the effect of the quantity and quality of schools on property prices. We chose to present this plot because we believe it provides interesting insights and allows us to contrast our results with those previously reported.
[Figure omitted. See PDF.]
(a) Quality of the municipality against average SIMCE score. (b) Average SIMCE score against the quality of the municipality. (c) Total surface area against quality of the municipality. (d) Number of bathrooms against the quality of the municipality. (e) Number of nearby schools against average SIMCE score.
A positive correlation is visible in Fig 5(a), where the residences located in municipalities with better quality have higher prices. Positive SHAP values are obtained almost exclusively in municipalities with ICVUs greater than 70, while some properties within municipalities with ICVUs between 65 and 70 also increase their price. Dwellings in municipalities with an index lower than 60 have lower prices. On the other hand, the average SIMCE score of elementary schools is generally higher in municipalities with ICVUs greater than 65. This trend can be better seen in Fig 5(b), which shows a positive correlation between the average SIMCE score of elementary schools and the SHAP values. The SIMCE score variable stops contributing to the price at approximately 280, which coincides with the mean of the variable (see S1 Table). This implies that the average SIMCE values only affect property prices when they deviate from the mean value of the properties. Moreover, the municipalities with the highest quality have the highest average SIMCE scores, which coincides with the Metropolitan Region’s educational gap at the communal level.
In the dependence plot of Fig 5(c), a positive logarithmic interaction is observed between the total surface area and SHAP values, where the residences with large (small) total surfaces increase (decrease) their prices. Furthermore, as the total surface increases, the municipalities with lower quality obtain higher SHAP values. This may be because, in general, the higher-quality municipalities are linked to residences with more square meters than dwellings in lower-quality municipalities. Thus, a house with a large surface area located in a low-quality municipality can significantly increase the price because it is less common.
Fig 5(d) shows that SHAP values grow with the number of bathrooms: the prices of residences with one or two bathrooms are negatively affected, whereas positive SHAP values are obtained from three bathrooms onward, with a larger effect on the price of residences in lower-quality municipalities. This trend stabilizes in properties with five bathrooms.
Lastly, from Fig 5(e), it follows that, as the number of nearby schools increases, property prices fall. Previous research has found opposing evidence regarding this effect: on the one hand, schools can be seen as an amenity, and thus their proximity might reduce transportation costs, increasing housing prices [42]. On the other hand, schools may create negative externalities (noise, congestion, etc.), thus decreasing property prices [43]. In our case, the results suggest that the latter effect dominates the former, thus indicating a net nuisance effect of schools. However, note that Fig 5(e) shows that, in general, the quality of schools in the proximity (measured by the SIMCE score) is also relevant. Indeed, for most ranges of the “nearby schools” variable, the proximity of lower-quality schools is less desirable than that of higher-quality schools.
Then, Fig 6 shows dependence plots for some distance variables. Note that these plots allow us to study non-linear relationships between predictions and features. An apparent non-linear effect can be seen at most distances, reinforcing the importance of using machine learning models to capture the complex interactions between house prices and distances. These non-linear effects have already been reported in other studies such as [14], where there is evidence that access to a subway station (Fig 6(a)) is incorporated in a non-linear way in house prices, generating an increase in the value of residences due to the lower transportation costs linked to the proximity to a subway station. Regarding the distance to a park (Fig 6(b)) and to a hospital or health clinic (Fig 6(c)), these services seem to add value to some properties when they are close. Still, other properties decrease in value, possibly due to the negative externalities that these services may present. On the other hand, the dependence plots in Fig 6(d)–6(f) show upward and downward SHAP value trends, where proximity to universities and malls increases the prices of most properties, although some properties’ prices are reduced. According to [12], in the case of malls, this decrease is reasonable because medium and large shopping centers can eventually cause problems with congestion and noise. Finally, the value of residences increases the further they are from a police station, which could be attributed to a lower level of crime and to negative externalities inherent to this service.
[Figure omitted. See PDF.]
(a) Subway, (b) park, (c) hospital or health clinic, (d) university, (e) police and (f) mall.
[Figure omitted. See PDF.]
(a) Terrace, (b) swimming pool, (c) real estate project, (d) photo of dining room, (e) photo of living room and (f) photo of floor plans.
In addition, terraces and swimming pools increase property prices (Fig 7(a) and 7(b)). Also, new houses are more expensive than used ones (Fig 7(c)). Fig 7(d) and 7(e) show that houses with photos of dining rooms in the listing have a higher value, while houses with photos of living rooms do not experience a significant price increase. Finally, properties with floor plan photos (Fig 7(f)) usually have higher prices, likely because these properties often correspond to new homes.
2.2. Comparison with econometric spatial model
Before analyzing the results of this section, it is important to note that the hedonic price model with spatial adjustments was created for comparative purposes, seeking to determine how an econometric model behaves when using the same variables used by the machine learning algorithm.
To do so, we first checked whether there was spatial autocorrelation in the price of residences, using Moran’s I. This test formally determines whether the observed value of a variable at one location is independent of the values of that variable at neighboring locations. Moran’s I can take values between -1 and 1, where values close to 1 indicate positive spatial autocorrelation, values close to -1 indicate negative spatial autocorrelation, and values close to 0 indicate a random spatial pattern. The formula for this test is shown in Eq (7).
$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}} \tag{7}$$
In Eq (7), $n$ is the number of observations, $x_i$ represents the observed variable at location $i$, $\bar{x}$ is the average of the observed variable, and $W_{ij}$ is the spatial weights matrix. We used a spatial weights matrix built from the kNN algorithm (see ) with k equal to 700, so the 700 nearest neighbors were assigned to each property.
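The computation of Moran’s I with kNN-based weights can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weights are binary (1 for each of the k nearest neighbors, 0 otherwise), distances are Euclidean on toy planar coordinates, and k = 1 is used instead of the paper's 700 so that the example stays small.

```python
import math


def knn_weights(coords, k):
    """Binary kNN spatial weights: W[i][j] = 1 if j is among the k nearest neighbors of i."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other points by distance; ties are broken by index.
        by_distance = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        for _, j in by_distance[:k]:
            weights[i][j] = 1.0
    return weights


def morans_i(values, weights):
    """Moran's I for a list of values and a dense spatial weights matrix."""
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Toy data: six points on a line (hypothetical coordinates and values).
coords = [(float(i), 0.0) for i in range(6)]
w = knn_weights(coords, k=1)

clustered = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]  # neighbors always differ

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
```

With kNN weights every row has the same sum k, so row-standardizing the matrix would leave the statistic unchanged. The dense matrix used here is only practical for small samples; with thousands of properties and k = 700, a sparse representation would be the natural choice.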
[Figure omitted. See PDF.]
As observed in Table 6, the p-value obtained was lower than 0.05, providing evidence to reject the null hypothesis. Therefore, the logarithm of price is not spatially independent. In this case, the statistic I takes the value of 0.7107, which indicates that there is positive spatial autocorrelation in the logarithm of the property prices. These results support the idea that spatial dependence should be considered when constructing a hedonic pricing model. Next, an OLS model with robust standard errors and a Spatial Autoregressive Model (SAR) were built. For the SAR model (see ), we used a spatial weights matrix constructed from the kNN algorithm that included the 700 closest neighbors to each property. The two aforementioned models used the same data as the final machine learning model and the same variables shown in Table 4, excluding the latitude and longitude because they were used to construct the spatial weights matrix. To see how these models handled the spatial autocorrelation in property prices, we calculated the Moran’s I for the residuals of each model.
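For readers unfamiliar with the notation, a spatial autoregressive (spatial lag) model is commonly written as shown below. This is the standard textbook form, given here as an assumption about the specification rather than a reproduction of the authors' equation: $y$ denotes the vector of ln(price), $X$ the attribute matrix, $W$ the spatial weights matrix, $\rho$ the spatial autoregressive parameter, $\varepsilon$ the error term, and $I_n$ the identity matrix.

```latex
y = \rho W y + X\beta + \varepsilon
\quad\Longleftrightarrow\quad
y = (I_n - \rho W)^{-1}\,(X\beta + \varepsilon)
```

Setting $\rho = 0$ recovers the OLS specification, so the estimated value of $\rho$ indicates how strongly each price depends on the prices of its neighbors.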
[Figure omitted. See PDF.]
As expected, Table 7 shows that the SAR model handles the spatial dependence better than the traditional OLS model by reporting a Moran’s I close to 0. The non-standardized coefficients of the OLS and SAR models are presented in Table 8.
[Figure omitted. See PDF.]
From Table 8, it can be seen that the number of bathrooms, the number of bedrooms, and the quality of the municipality had overestimated coefficients in the OLS model, which was corrected in the SAR model. On the other hand, some distance variables drastically changed their behavior, such as the distance to a park, which changed from having a statistically significant negative coefficient to a statistically significant positive one. In contrast, the distance to an industry ceased to have a statistically significant impact in the SAR model. In addition, the distance to a hospital and the distance to a dump changed from positive to negative parameters. The opposite occurred with the distance to a mall, whose coefficient changed sign from negative to positive. Regarding the neighborhood variables, the average SIMCE score of elementary schools had a smaller positive impact on the price in the SAR model. In contrast, the number of nearby schools within a 3-kilometer radius went from having a negative coefficient to a positive one, maintaining a statistically significant impact on the price. Note that this effect is the opposite of the one obtained with the IML approach. In other words, the insight obtained from the IML approach regarding the effect of schools only coincides with that of the OLS model, i.e., the model with no spatial corrections. Moreover, note that the spatial autoregressive parameter (ρ) had a value of 0.4271, confirming a positive spatial dependence in the price of residences, thus suggesting a biased result from the IML approach. Additionally, although the OLS and SAR models were fitted primarily for qualitative comparisons with the AVM approach, we report their performance results in Table 9. As expected, both models show significantly worse prediction performance than the AVMs (Table 3), a result that is consistent with previous literature [21,44].
[Figure omitted. See PDF.]
Finally, we compared the relative importance that the three models assigned to the variables. To do this, we standardized the attributes of the OLS and SAR models, so their unit of measurement was the number of standard deviations. The standardization formula, or Z-score, is shown in Eq (8).
$$z = \frac{x - \mu}{\sigma} \tag{8}$$
In Eq (8), $x$ is the variable’s value, $\mu$ is the mean, and $\sigma$ is the standard deviation. Performing this transformation allows the coefficients of the models to be standardized, so the absolute value of the coefficients can be analyzed to see which characteristics have a more significant effect on the price of homes. On the other hand, the relative importance of the LightGBM model variables is assigned by the SHAP algorithm. For comparative purposes, we excluded the latitude and longitude variables from the ranking since they were used in the spatial weights matrix in the SAR model.
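The standardization and ranking step can be sketched as follows. This is a minimal illustration: the surface values, the coefficient values, and the use of the population standard deviation are all assumptions made for the example, not figures from Table 8 or Table 10.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a variable to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def rank_by_abs_coefficient(coefficients):
    """Order variable names by the absolute value of their standardized coefficient."""
    return sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)


# Hypothetical total surface areas in square meters.
surface = [80.0, 120.0, 190.0, 233.0]
surface_z = zscore(surface)

# Hypothetical standardized coefficients from a linear model.
standardized_coefs = {
    "total_surface": 0.42,
    "n_nearby_schools": -0.08,
    "municipality_quality": 0.30,
}
coef_ranking = rank_by_abs_coefficient(standardized_coefs)
```

Ranking by absolute value treats a strongly negative coefficient as just as important as a strongly positive one, which mirrors how SHAP-based importance ignores the sign of the contributions.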
[Figure omitted. See PDF.]
Table 10 shows a close relationship between the importance assigned to the variables in each model, even though the methods and approaches were different. In the five most important variables of each model, four attributes stand out: total surface area, quality of the municipality, average SIMCE score of elementary schools, and the number of bathrooms. Although the rankings assigned for these attributes are not the same, the three models agree that these variables are the ones that have the most significant impact on the price. Due to the spatial correction made by the SAR model, the quality of the municipality decreased its effect on the price, causing the number of bathrooms to be the most important variable. A similar change occurred with the average SIMCE score of elementary schools, which reduced its impact on property values, relegating it to fifth place and moving the number of bathrooms to fourth place in the ranking. On the other hand, the distance variables lost importance in econometric models because their interactions with the prices are not linear, which is a characteristic that the LightGBM model can capture.
Our results are mostly consistent with those previously reported for other cities globally. For instance, total surface [45], neighborhood quality [46], and the number of bathrooms [21,47] have been reported as key attributes determining the price of a property. Regarding the effect of schools on property values, the literature provides opposing evidence. For instance, contributions using data from the US and Canada report that, after controlling for other effects, school quality has only a small impact on property values [48,49]. In contrast, evidence from China shows that school quality has a significant effect on property prices [50–52]. Thus, the size of this effect seems to be very location-dependent. Our results contribute to this debate in two ways. First, we find that, in the case of Santiago, Chile, the effect of school quality on property prices is sizable, as for every fitted model, this variable is one of the five most important features (Table 10). Second, comparing the coefficients from Table 8, our results also show that a simple OLS approach might highly overestimate this effect. Additionally, since the LightGBM model presents the highest importance for this variable, it might also be the case that the IML approach overestimates the relevance of school quality.
Furthermore, one of the biggest differences in outcomes between the OLS and SAR models pertains to the effect of nearby schools. On the one hand, the SAR model shows that this variable is one of the most relevant for predicting property prices (Table 10) and has a positive impact (Table 8). On the other hand, the OLS model shows that this feature only produces a moderate and negative impact. In this regard, as previously discussed, the literature also provides conflicting results. For example, some papers find that the presence of schools reduces property prices due to negative externalities such as noise or congestion [43], or because their presence is associated with less personal zones [53]. Conversely, some papers find that accessibility to schools is capitalized into property values, as they can be seen as an amenity [54,55]. We believe our results contribute to this debate by providing evidence that, according to the SAR model, which corrects for a key component (spatial autocorrelation), schools provide a net positive effect on property prices in Santiago, Chile. Moreover, our results show that not controlling for spatial autocorrelation is not innocuous: some interpretations change entirely. Once again, the IML approach, which shows a net negative effect of nearby schools, seems to align more with the simpler OLS model than with the more robust SAR model. This result calls for a careful interpretation of machine learning models’ outcomes, especially when they are used for inference purposes.
3. Conclusion
This study can be divided into three stages. First, the impact of a series of variables on the performance of an automated valuation model (AVM) built with the LightGBM algorithm was evaluated. The attributes studied included structural variables, location, neighboring environment, and images. The use of this last category sets a precedent in this line of research since it is the first study carried out in Chile that uses visual information from images of properties to refine automatic appraisal algorithms. The second stage consisted of evaluating the relative importance of the variables and determining the effects that the studied attributes had on the price obtained from the SHAP algorithm. Finally, in the third stage, the results obtained by applying SHAP were compared with more traditional approaches, for which two econometric models were adjusted: an OLS model and a SAR model that considered the spatial dependence present in the property prices.
The results show that most of the variables studied improved the performance measures of the model: the MAPE decreased from 11.59% to 10.07%, the MdAPE from 7.87% to 6.84%, the R2 increased from 0.9412 to 0.9516, and the mean absolute error (MAE), measured in UF, decreased from 906.66 to 807.53. Here, the image variables related to property features were shown to improve the performance metrics, even though the model already included a wide range of characteristics related to amenities and distances to services. Specifically, incorporating image variables resulted in a 1.2% decrease in MAE, a 1.7% decrease in MAPE, and a 1.3% reduction in RMSE. This result highlights the importance of using visual information in automated valuation models. From the dependence plots provided by the SHAP algorithm, it can be seen that the quality of the municipality, the average SIMCE score, and the number of bathrooms maintain a positive relationship with the price, while the number of nearby schools has a negative non-linear interaction with residence values, possibly due to negative externalities such as vehicular and pedestrian congestion. Most variables related to distances to services have non-linear interactions with price, where close access to subways, parks, and hospitals adds value to homes, although their impact on price varies with distance. Regarding the image variables, the properties for sale that present photos of features such as the dining room and the living room generally have a higher price, as do those that show images of the floor plans. Then, comparing the fitted models, we find that the LightGBM model and the OLS and SAR econometric models all ranked the total surface area, the quality of the municipality, the average SIMCE score, and the number of bathrooms among the five attributes that most affect the price of residences.
Although the main insights remain mostly the same for every model, one of the findings of our research is that the IML approach shares most of its insights with the OLS model but differs from the SAR model on some interpretations (notably, the effect of school quality and quantity). Thus, in the context of our research, the AVM approach provides evident advantages over traditional approaches when trying to predict future prices, primarily because it can include complex non-linear interactions. However, its outputs should be treated with caution when used for inference.
Finally, we acknowledge some limitations of our research. For instance, the dataset we used was specific to the Santiago Metropolitan Region, which may not reflect the market dynamics of other regions or countries. Data bias is another critical limitation. The data used in this study may have inherent biases due to the sources of data collection. For example, properties listed online might be skewed towards higher-end markets or urban areas, leading to an overrepresentation of these segments. Additionally, although we included some features related to surrounding environmental and social characteristics, recent literature has shown that perceptions about the urban physical environment are also relevant, albeit difficult to capture accurately [56]. Consequently, future research lines include the use of statistical sampling methodologies to control for bias from online listing data (see, e.g., [57] for a recent contribution on the topic), and the direct use of state-of-the-art image-based machine learning techniques, such as CNNs, to incorporate the effects of geographical and neighborhood attributes using, for example, Google Street View images [58,59].
Supporting information
S1 Table.
https://doi.org/10.1371/journal.pone.0318701.s001
S2 Table.
https://doi.org/10.1371/journal.pone.0318701.s002
S3 Table.
https://doi.org/10.1371/journal.pone.0318701.s003
Acknowledgments
This study was carried out in collaboration with Coderhub SpA (https://www.coderhub.cl/), which we thank for providing us with the data and expertise. We thank Alberto Guerra and Francisco Morales for their very helpful comments and discussions. Any remaining errors are our own.
References
1. Adair A, Berry J, McGreal S. Valuation of residential property: analysis of participant behaviour. J Property Valuat Invest. 1996;14(1):20–35.
2. Yalpir S. Forecasting residential real estate values with AHP method and integrated GIS. In: International Scientific Conference of People, Buildings and Environment. 2014. p. 15–7.
3. Sheppard S. Hedonic analysis of housing markets. Handb Region Urban Econ. 1999;3:1595–1635.
4. Valier A. Who performs better? AVMs vs hedonic models. J Property Invest Financ. 2020;38(3):213–25.
5. Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously. J Mach Learn Res. 2019;20(177):1–81.
6. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 1135–44.
7. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. p. 4768–4777.
8. Bhattacharjee A, Castro E, Marques J. Spatial interactions in hedonic pricing models: the urban housing market of Aveiro, Portugal. Spatial Econ Anal. 2012;7(1):133–67.
9. Lancaster KJ. A new approach to consumer theory. J Polit Econ. 1966;74(2):132–57.
10. Rosen S. Hedonic prices and implicit markets: product differentiation in pure competition. J Polit Econ. 1974;82(1):34–55.
11. Espinoza E, Balaguer J. Estimating the effects of urban location on social housing: a spatial hedonic approach. 2019.
12. Quiroga BF. Precios hedonicos para valoracion de atributos de viviendas sociales en la region metropolitana de Santiago. 2005.
13. Hui ECM, Chau CK, Pun L, Law MY. Measuring the neighboring and environmental effects on residential property value: using spatial weighting matrix. Build Environ. 2007;42(6):2333–43.
14. Sagner A. Determinantes del precio de viviendas en Chile. Documentos de Trabajo (Banco Central de Chile). 2009;(549):1.
15. Figueroa E, Lever G. Determinantes del precio de la vivienda en Santiago: Una estimacion hedonica. Estudios de Economía. 1992;19(1):67–84.
16. Cebula RJ. The hedonic pricing model applied to the housing market of the city of Savannah and its Savannah historic landmark district. Rev Region Stud. 2009;39(1):9–22.
17. Chen L, Yao X, Liu Y, Zhu Y, Chen W, Zhao X, Chi T. Measuring impacts of urban environmental elements on housing prices based on multisource data—a case study of Shanghai, China. ISPRS Int J Geo-Inf. 2020;9(2):106.
18. Idrovo B, Lennon J. Índice de precios de viviendas nuevas para el Gran Santiago. Documento de trabajo. 2011;65.
19. Limsombunchai V. House price prediction: hedonic price model vs. artificial neural network. New Zealand Agricultural and Resource Economics Society Conference. 2004;25–26.
20. Masías VH, Valle MA, Crespo F, Crespo R, Vargas A, Laengle S. Property valuation using machine learning algorithms: a study in a Metropolitan-Area of Chile. In: Selection at the AMSE Conferences. 2016. p. 97–105.
21. Rico-Juan JR, de La Paz PT. Machine learning with explainability or spatial hedonics tools? An analysis of the asking prices in the housing market in Alicante, Spain. Exp Syst Appl. 2021;171:114590.
22. Ahmed EH, Moustafa M. House price estimation from visual and textual features. In: International Conference on Neural Computation Theory and Applications. SCITEPRESS; 2016.
23. Poursaeed O, Matera T, Belongie S. Vision-based real estate price estimation. Mach Vision Appl. 2018;29(4):667–76.
24. Humphries S. Introducing a new and improved Zestimate algorithm. 2019. Available from: https://www.zillow.com/tech/introducing-a-new-and-improved-zestimate-algorithm/
25. Kostic Z, Jevremovic A. What image features boost housing market predictions? IEEE Trans Multim. 2020;22(7):1904–16.
26. Anselin L. Spatial econometrics: methods and models. Vol. 4. Springer; 1988.
27. Hariri S, Kind MC, Brunner RJ. Extended isolation forest. IEEE Trans Knowl Data Eng. 2021;33(4):1479–89.
28. Liu FT, Ting KM, Zhou ZH. Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining. IEEE; 2008.
29. Orellana A, Bannen P, Fuentes L, Gilabert H, Pape K. Informe final indicador calidad de vida urbana (ICVU). Santiago: Nucleo de Estudios Metropolitanos, Instituto de Estudios Urbanos, Universidad Catolica de Chile; 2012.
30. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 4700–4708.
31. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A. Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE; 2010.
32. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A. Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell. 2017 Jun;40(6):1452–64.
33. Kalervo A, Ylioinas J, Haikio M, Karhu A, Kannala J. Cubicasa5k: a dataset and an improved multi-task model for floorplan image analysis. Scandinavian conference on image analysis. Springer; 2019. p. 28–40.
34. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
35. Revend W. Predicting House Prices on the Countryside using Boosted Decision Trees. 2020.
36. Vinayak RK, Gilad-Bachrach R. Dart: Dropouts meet multiple additive regression trees. Artificial intelligence and statistics. PMLR; 2015. p. 489–97.
37. Shapley LS. A value for n-person games. Contribut Theory Games. 1953;(28):307–17.
38. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2(1):56–67.
39. Wilhelmsson M. Spatial models in real estate economics. Housing Theory Soc. 2002;19(2):92–101.
40. Osland L. An application of spatial econometrics in relation to hedonic house price modeling. J Real Estate Res. 2010;32(3):289–320.
41. Anselin L, Rey SJ. Modern spatial econometrics in practice: a guide to GeoDa, GeoDaSpace and PySAL. GeoDa Press LLC; 2014.
42. Bonilla-Mejía L, Lopez E, McMillen D. House prices and school choice: evidence from Chicago’s magnet schools’ proximity lottery. J Region Sci. 2020;60(1):33–55.
43. Sah V, Conroy SJ, Narwold A. Estimating school proximity effects on housing prices: the importance of robust spatial controls in hedonic estimations. J Real Estate Financ Econ. 2016;53:50–76.
44. Zaki J, Nayyar A, Dalal S, Ali ZH. House price prediction using hedonic pricing model and machine learning techniques. Concurr Comput: Pract Exp. 2022;34(27):e7342.
* View Article
* Google Scholar
45. 45. Wang L, He S, Su S, Li Y, Hu L, Li G. Urban neighborhood socioeconomic status (SES) inference: a machine learning approach based on semantic and sentimental analysis of online housing advertisements. Habitat Int. 2022;124:102572.
* View Article
* Google Scholar
46. 46. Soltani A, Heydari M, Aghaei F, Pettit CJ. Housing price prediction incorporating spatio-temporal dependency into machine learning algorithms. Cities. 2022;131:103941.
* View Article
* Google Scholar
47. 47. Gao Q, Shi V, Pettit C, Han H. Property valuation using machine learning algorithms on statistical areas in Greater Sydney, Australia. Land Use Policy. 2022;123:106409.
* View Article
* Google Scholar
48. 48. Ries J, Somerville T. School quality and residential property values: evidence from Vancouver rezoning. Rev Econ Statist 2010;92(4):928–44.
* View Article
* Google Scholar
49. 49. Dhar P, Ross SL. School district quality and property values: Examining differences along school district boundaries. J Urban Econ 2012;71(1):18–25.
* View Article
* Google Scholar
50. 50. Wen H, Zhang Y, Zhang L. Do educational facilities affect housing price? An empirical study in Hangzhou, China. Habitat Int. 2014;42:155–63.
* View Article
* Google Scholar
51. 51. Wen H, Xiao Y, Hui ECM. Quantile effect of educational facilities on housing price: Do homebuyers of higher-priced housing pay more for educational resources? Cities. 2019;90:100–12.
* View Article
* Google Scholar
52. 52. Peng Y, Tian C, Wen H. How does school district adjustment affect housing prices: An empirical investigation from Hangzhou, China. China Econ Rev. 2021;69:101683.
* View Article
* Google Scholar
53. 53. Clark DE, Herrin WE. The impact of public school attributes on home sale prices in California. Growth Change 2000;31(3):385–407.
* View Article
* Google Scholar
54. 54. Wen H, Xiao Y, Hui ECM, Zhang L. Education quality, accessibility, and housing price: does spatial heterogeneity exist in education capitalization? Habitat Int. 2018;78:68–82.
55. 55. Merrall J, Higgins CD, Paez A. What’s a school worth to a neighborhood? A spatial hedonic analysis of property prices in the context of accommodation reviews in Ontario. Geograph Anal. 2024;56(2):217–43.
* View Article
* Google Scholar
56. 56. Chen M, Liu Y, Arribas-Bel D, Singleton A. Assessing the value of user-generated images of urban surroundings for house price estimation. Landsc Urban Plan. 2022;226:104486.
* View Article
* Google Scholar
57. 57. Lopez Ochoa E. Housing price indices from online listing data: addressing the spatial bias with sampling weights. Environ Plan B: Urban Analyt City Sci. 2023;50(4):1039–56.
* View Article
* Google Scholar
58. 58. Zhao C, Ogawa Y, Chen S, Oki T, Sekimoto Y. Quantitative land price analysis via computer vision from street view images. Eng Appl Artif Intell. 2023;123:106294.
* View Article
* Google Scholar
59. 59. Lee H, Han H, Pettit C, Gao Q, Shi V. Machine learning approach to residential valuation: a convolutional neural network model for geographic variation. Annals Region Sci 2024;72(2):579–99.
* View Article
* Google Scholar
Citation: Tapia J, Chavez-Garzon N, Pezoa R, Suarez-Aldunate P, Pilleux M (2025) Comparing automated valuation models for real estate assessment in the Santiago Metropolitan Region: A study on machine learning algorithms and hedonic pricing with spatial adjustments. PLoS ONE 20(3): e0318701. https://doi.org/10.1371/journal.pone.0318701
About the Authors:
Jocelyn Tapia
Contributed equally to this work with: Jocelyn Tapia, Nicolas Chavez-Garzon, Raúl Pezoa
Roles: Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing
Affiliation: Department of Business Engineering, Universidad Técnica Federico Santa María, Santiago, Chile
ORCID: https://orcid.org/0000-0003-4961-5893
Nicolas Chavez-Garzon
Contributed equally to this work with: Jocelyn Tapia, Nicolas Chavez-Garzon, Raúl Pezoa
Roles: Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing
Affiliation: Department of Business Engineering, Universidad Técnica Federico Santa María, Santiago, Chile
Raúl Pezoa
Contributed equally to this work with: Jocelyn Tapia, Nicolas Chavez-Garzon, Raúl Pezoa
Roles: Conceptualization, Methodology, Writing – original draft
Affiliation: School of Industrial Engineering, Faculty of Engineering and Sciences, Universidad Diego Portales, Santiago, Chile
Paulina Suarez-Aldunate
Roles: Conceptualization, Data curation, Methodology, Software
PS-A and MP also contributed equally to this work.
Affiliation: Coderhub SpA, Santiago, Chile
Mauricio Pilleux
Roles: Data curation, Methodology, Software
Affiliation: Coderhub SpA, Santiago, Chile
1. Adair A, Berry J, McGreal S. Valuation of residential property: analysis of participant behaviour. J Property Valuat Invest. 1996;14(1):20–35.
2. Yalpir S. Forecasting residential real estate values with AHP method and integrated GIS. In: International Scientific Conference of People, Buildings and Environment. 2014. p. 15–7.
3. Sheppard S. Hedonic analysis of housing markets. Handb Region Urban Econ. 1999;3:1595–1635.
4. Valier A. Who performs better? AVMs vs hedonic models. J Property Invest Financ. 2020;38(3):213–25.
5. Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously. J Mach Learn Res. 2019;20(177):1–81.
6. Ribeiro MT, Singh S, Guestrin C. “Why should i trust you?’’ Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 1135–44.
7. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. p. 4768–4777.
8. Bhattacharjee, Arnab and Castro, Eduardo and Marques, Joao Spatial interactions in hedonic pricing models: the urban housing market of Aveiro, Portugal Spatial Econ Anal. 2012;7(1):133–67.
9. Lancaster KJ. A new approach to consumer theory. J Polit Econ. 1966;74(2):132–57.
10. Rosen S. Hedonic prices and implicit markets: product differentiation in pure competition. J Politic Econ 1974;82(1):34–55.
11. Espinoza E, Balaguer J. Estimating the effects of urban location on social housing: a spatial hedonic approach. 2019.
12. Quiroga BF. Precios hedonicos para valoracion de atributos de viviendas sociales en la region metropolitana de Santiago. 2005.
13. Hui ECM, Chau CK, Pun L, Law MY. Measuring the neighboring and environmental effects on residential property value: using spatial weighting matrix. Build Environ. 2007;42(6):2333–43.
14. Sagner A. Determinantes del precio de viviendas en Chile. Documentos de Trabajo (Banco Central de Chile). 2009;(549):1.
15. Figueroa E, Lever G. Determinantes del precio de la vivienda en Santiago: Una estimacion hedonica. Estudios de Economıa. 1992;19(1):67–84.
16. Cebula RJ. The hedonic pricing model applied to the housing market of the city of Savannah and its Savannah historic landmark district. Rev Region Stud. 2009;39(1):9–22.
17. Chen L, Yao X, Liu Y, Zhu Y, Chen W, Zhao X, Chi Tianhe. Measuring impacts of urban environmental elements on housing prices based on multisource data—a case study of shanghai, china. ISPRS Int J Geo-Inf 2020;9(2):106.
18. Idrovo B, Lennon J. ´Indice de precios de viviendas nuevas para el Gran Santiago. Documento de trabajo. 2011;65.
19. Limsombunchai V. House price prediction: hedonic price model vs. artificial neural network. New Zealand Agricultural and Resource Economics Society Conference. 2004;25–26.
20. Mas´ıas VH, Valle MA, Crespo F, Crespo R, Vargas A, Laengle S. Property valuation using machine learning algorithms: a study in a Metropolitan-Area of Chile. In: Selection at the AMSE Conferences. 2016. p. 97–105.
21. Rico-Juan JR, de La Paz PT. Machine learning with explainability or spatial hedonics tools? An analysis of the asking prices in the housing market in Alicante, Spain. Exp Syst Appl. 2021;171:114590.
22. Ahmed EH, Moustafa M. House price estimation from visual and textual features. In: International Conference on Neural Computation Theory and Applications. SCITEPRESS; 2016. .
23. Poursaeed O, Matera T, Belongie S. Vision-based real estate price estimation. Mach Vision Appl 2018;29(4):667–76.
24. Humphries S. Introducing a new and improved Zestimate algorithm. 2019. Available from: https://www.zillow.com/tech/introducing-a-new-and-improved-zestimate-algorithm/
25. Kostic Z, Jevremovic A. What image features boost housing market predictions? IEEE Trans Multim. 2020;22(7):1904–16
26. Anselin L. Spatial econometrics: methods and models. Vol. 4. Springer; 1988.
27. Hariri S, Kind MC, Brunner RJ. Extended isolation forest. IEEE Trans Knowl Data Eng 2021;33(4):1479–89.
28. Liu FT, Ting KM, Zhou ZH. Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining. IEEE; 2008. .
29. Orellana A, Bannen P, Fuentes L, Gilabert H, Pape K. Informe final indicador calidad de vida urbana (ICVU). Santiago: Nucleo de Estudios Metropolitanos, Instituto de Estudios Urbanos, Universidad Catolica de Chile; 2012.
30. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 4700–4708.
31. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A. Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE; 2010. .
32. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A. Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell. 2017 Jun;40(6):1452–64.
33. Kalervo A, Ylioinas J, Haikio M, Karhu A, Kannala J. Cubicasa5k: a dataset and an improved multi-task model for floorplan image analysis. Scandinavian conference on image analysis. Springer; 2019. p. 28–40.
34. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
35. Revend W. Predicting House Prices on the Countryside using Boosted Decision Trees. 2020.
36. Vinayak RK, Gilad-Bachrach R. Dart: Dropouts meet multiple additive regression trees. Artificial intelligence and statistics. PMLR; 2015. p. 489–97.
37. Shapley LS. A value for n-person games. Contribut Theory Games. 1953;(28):307–17.
38. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020;2(1):56–67.
39. Wilhelmsson M. Spatial models in real estate economics. Housing Theory Soc. 2002;19(2):92–101.
40. Osland L. An application of spatial econometrics in relation to hedonic house price modeling. J Real Estate Res 2010;32(3):289–320.
41. Anselin L, Rey SJ. Modern spatial econometrics in practice: a guide to GeoDa, GeoDaSpace and PySAL. GeoDa Press LLC 2014.
42. Bonilla-Mejıa L, Lopez E, McMillen D. House prices and school choice: evidence from Chicago’s magnet schools’ proximity lottery. J Region Sci. 2020;60(1):33–55.
43. Sah V, Conroy SJ, Narwold A. Estimating school proximity effects on housing prices: the importance of robust spatial controls in hedonic estimations. J Real Estate Financ Econ. 2016;53:50–76.
44. Zaki J, Nayyar A, Dalal S, Ali ZH. House price prediction using hedonic pricing model and machine learning techniques. Concurr Comput: Pract Exp. 2022:34(27);e7342.
45. Wang L, He S, Su S, Li Y, Hu L, Li G. Urban neighborhood socioeconomic status (SES) inference: a machine learning approach based on semantic and sentimental analysis of online housing advertisements. Habitat Int. 2022;124:102572.
46. Soltani A, Heydari M, Aghaei F, Pettit CJ. Housing price prediction incorporating spatio-temporal dependency into machine learning algorithms. Cities. 2022;131:103941.
47. Gao Q, Shi V, Pettit C, Han H. Property valuation using machine learning algorithms on statistical areas in Greater Sydney, Australia. Land Use Policy. 2022;123:106409.
48. Ries J, Somerville T. School quality and residential property values: evidence from Vancouver rezoning. Rev Econ Statist 2010;92(4):928–44.
49. Dhar P, Ross SL. School district quality and property values: Examining differences along school district boundaries. J Urban Econ 2012;71(1):18–25.
50. Wen H, Zhang Y, Zhang L. Do educational facilities affect housing price? An empirical study in Hangzhou, China. Habitat Int. 2014;42:155–63.
51. Wen H, Xiao Y, Hui ECM. Quantile effect of educational facilities on housing price: Do homebuyers of higher-priced housing pay more for educational resources? Cities. 2019;90:100–12.
52. Peng Y, Tian C, Wen H. How does school district adjustment affect housing prices: An empirical investigation from Hangzhou, China. China Econ Rev. 2021;69:101683.
53. Clark DE, Herrin WE. The impact of public school attributes on home sale prices in California. Growth Change 2000;31(3):385–407.
54. Wen H, Xiao Y, Hui ECM, Zhang L. Education quality, accessibility, and housing price: does spatial heterogeneity exist in education capitalization? Habitat Int. 2018;78:68–82.
55. Merrall J, Higgins CD, Paez A. What’s a school worth to a neighborhood? A spatial hedonic analysis of property prices in the context of accommodation reviews in Ontario. Geograph Anal. 2024;56(2):217–43.
56. Chen M, Liu Y, Arribas-Bel D, Singleton A. Assessing the value of user-generated images of urban surroundings for house price estimation. Landsc Urban Plan. 2022;226:104486.
57. Lopez Ochoa E. Housing price indices from online listing data: addressing the spatial bias with sampling weights. Environ Plan B: Urban Analyt City Sci. 2023;50(4):1039–56.
58. Zhao C, Ogawa Y, Chen S, Oki T, Sekimoto Y. Quantitative land price analysis via computer vision from street view images. Eng Appl Artif Intell. 2023;123:106294.
59. Lee H, Han H, Pettit C, Gao Q, Shi V. Machine learning approach to residential valuation: a convolutional neural network model for geographic variation. Annals Region Sci 2024;72(2):579–99.
© 2025 Tapia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
This study compares the precision and interpretability of two automated valuation models for evaluating the real estate market in the Santiago Metropolitan Region of Chile: machine learning algorithms, specifically LightGBM, and hedonic prices with spatial adjustments (SAR). Traditional residence attributes, such as housing amenities and proximity to services, were considered alongside visual information extracted from images using convolutional neural networks (CNNs). The research evaluates the influence of each model characteristic on performance metrics and identifies the relative importance of attributes using the SHapley Additive exPlanations (SHAP) algorithm. The results demonstrate the positive impact of image-based variables on performance metrics, showing that the introduction of visual information can considerably reduce error margins when estimating housing prices. In addition, the SHAP algorithm reveals complex non-linear interactions between price and crucial variables such as total surface area and neighborhood attributes, highlighting the importance of using methods that can capture these effects. Likewise, both the LightGBM and SAR models indicate that the variables with the most significant impact on property values are total surface area, municipality quality index, average academic level of nearby schools, and the number of bathrooms.
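To make the attribution idea behind SHAP concrete, the following is a minimal sketch and not the study's pipeline. It computes exact Shapley values by brute-force enumeration for a hypothetical three-feature price function; the function `toy_price`, the feature names, the coefficients, and the baseline property are all assumptions chosen for illustration. Features outside a coalition are set to their baseline values, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def toy_price(features):
    """Hypothetical price function (an assumption, not the paper's model).

    Surface area interacts with a neighborhood quality index, so the
    contribution of area depends on the neighborhood.
    """
    area = features["area_m2"]
    quality = features["quality_index"]
    baths = features["bathrooms"]
    return 20.0 * area + 0.5 * area * quality + 150.0 * baths


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the instance value; the rest
        # are held at the baseline value.
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


# Hypothetical property and reference (baseline) property.
instance = {"area_m2": 100.0, "quality_index": 60.0, "bathrooms": 2.0}
baseline = {"area_m2": 60.0, "quality_index": 40.0, "bathrooms": 1.0}

phi = shapley_values(toy_price, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.1f}")
```

By construction, the attributions sum to the difference between the prediction for the property and the prediction for the baseline, and the interaction term is split between surface area and the quality index. Enumeration grows exponentially with the number of features, so it is only practical for toy cases like this one.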