Keywords: Energy measurement; Artificial intelligence; Statistics; Probabilistic computing; Data science; Data aggregation; Algorithms; Energy efficiency
Abstract: This paper examines outlier detection techniques tailored for unique datasets related to residential energy consumption. Building upon the current state of research [1], we introduce the Grubbs and Z-score methods and investigate a range of outlier detection strategies encompassing statistical, probabilistic, and machine learning algorithms. The findings underscore the importance of outlier detection in the Romanian residential energy sector.
(ProQuest: ... denotes formulae omitted.)
1. INTRODUCTION
Outlier detection in energy data plays a pivotal role in ensuring the accuracy and reliability of energy management systems [2]. By identifying and addressing anomalies, utilities and energy managers can gain a clearer understanding of consumption patterns, optimize energy distribution, and prevent potential system failures [3]. Moreover, detecting outliers helps eliminate data errors, enables more precise forecasting [4], and enhances the overall efficiency of energy systems. In essence, it serves as a foundational step in refining energy data analysis and driving informed decision-making in the energy sector [5]. The current study extends the article "Applied data cleaning methods in outlier detection for residential consumer" [1] and examines outlier detection methodologies tailored to specialized datasets of residential energy consumption. It covers a broad range of outlier detection techniques in data cleaning, from statistical and probabilistic methods to advanced machine learning algorithms. Notably, the Z-score [6] and Grubbs [7] methods are introduced and integrated into this expanded investigation, enhancing the analytical depth.
2. CONTEXT
Understanding the nature of a dataset is essential when applying algorithms or mathematical methods [8], especially when distinguishing energy consumption patterns across residential, industrial, and public buildings. Energy consumption characteristics differ significantly among residential homes, public infrastructures, and industrial facilities due to diverse usage patterns, energy needs, and operational demands [9]. Residential energy use is shaped by factors like home size, design, occupancy, construction materials, lifestyle, and appliance usage. Typically, residential energy patterns show spikes during morning and evening, reflecting daily routines, and dip during the night. Seasonal changes, such as increased heating or cooling needs during harsh weather, also play a role. Moreover, individual behaviors, household income, and occupants' education levels further influence these patterns [10]. Public structures, like schools, hospitals, and government offices, have energy patterns distinct from residential settings. Given their high occupancy and continuous operations, their energy use remains relatively consistent throughout the day, with higher consumption on weekdays than weekends. Unlike residences, seasonal fluctuations in energy use in public buildings are less pronounced, mainly due to their climate control systems [11].
Industrial sectors, including factories and warehouses, have unique energy consumption patterns driven by their production activities, machinery operations, and specific requirements. The energy demand in these settings is often high and consistent, influenced by machinery and equipment operations. However, energy use can vary based on production timelines, work shifts, and the nature of the industry. For instance, chemical industries might have different energy needs compared to textile ones. Industries linked to agriculture might see seasonal shifts in energy use, reflecting harvest times or changing product demands [12].
The importance of outlier detection in residential energy consumption arises from the diverse and ever-changing monthly energy use patterns. In contrast to public or industrial settings, monthly household energy consumption is influenced by numerous, primarily static, technical factors. These include home features, resident lifestyles, and energy use behaviors, which are often affected by seasonal changes [9].
In Romania, the energy sector is regulated by European directives mandating the installation of smart meters in residences by Distribution System Operators (DSOs) [13]. However, the rollout of this initiative has been delayed, with DSOs typically reading meters manually every 3-6 months. While new regulations advocate for monthly readings, a disconnect exists between governmental intentions and the present capabilities of DSOs.
A significant shift occurred in the Romanian energy market in 2021 when it embraced free-market principles [14]. This change has spurred discussions on introducing various services and concepts, like demand response, differential tariffs, and local energy communities. Technically, these services necessitate precise monthly energy consumption data across all sectors, including households. Yet, due to current metering constraints, such data remains elusive.
For instance, the idea behind energy communities revolves around consolidating numerous households within similar geographic regions into a single organizational structure. This encourages them to actively engage in the energy market, promoting bi-directional energy exchanges (prosumers). In this context, precise monthly energy consumption predictions are crucial for securing smart energy contracts and negotiating favorable future [15].
A significant hurdle in Romania is obtaining accurate residential energy consumption data, primarily sourced from physical or digital energy bills. Given that current Romanian regulations require DSOs to check meters every 3-6 months, utility companies often resort to estimations. This approach frequently leads to substantial variations and anomalies in monthly billing, complicating the task of algorithms aiming to identify and rectify such patterns [16]. Consequently, due to these metering challenges, billed energy often doesn't mirror actual consumption, highlighting the need for a precise data adjustment method.
Such inconsistencies often manifest as outliers or anomalous data points, which significantly deviate from typical energy consumption trends. Identifying and addressing these outliers in the residential sector is pivotal for ensuring data accuracy in energy analysis, billing, and forecasting. This sets the stage for a robust data processing approach tailored [17] for the evolving needs of the Romanian energy market.
3. PREVIOUS RESULTS
The findings from previous research [1] indicated that the IQR method was the most effective, accurately identifying 69.45% of outliers. Both the MAD and MOVMAD methods displayed commendable results, detecting outliers at rates of 62.89% and 48.10%, respectively. In contrast, the LOF method's performance was less satisfactory, pinpointing just 23.83% of outliers.
Generally, the outlier detection techniques discussed in this study necessitate a substantial volume of data for precise outcomes. Despite the inherent challenges with residential energy consumption datasets, which typically comprise 12 data points annually, the researchers successfully adapted most of the algorithms. However, the parameters used for testing DBSCAN didn't match the outlier profiles, suggesting a need for further refinement of this method.
These insights underscore the effective implementation of the discussed methodology on datasets pertaining to the residential energy domain. Proper outlier identification is pivotal for data quality enhancement, deeper insights into energy consumption trends, and facilitating informed decisions in the residential arena. The findings advocate for the IQR method, succeeded by MAD and MOVMAD, as potential techniques for outlier detection in domestic energy data. Still, there's a call for more research to further hone these detection methods for superior precision.
4. METHOD USED
Given that statistical models have demonstrated greater efficiency with smaller data volumes in detecting outliers from the tested energy consumption data, the Z-score [18] and Grubbs [19] methods were selected for the analysis in this study. Both techniques are well established in statistical analysis and outlier detection. By integrating these methods, the research aims to address the existing gaps and improve the accuracy of data cleaning, ensuring more reliable and robust results in energy consumption analysis.
The Z-score [8], often referred to as the standard score, measures how many standard deviations a data point is from the mean of a set of data. It's a useful metric in statistics to identify outliers, as it quantifies the extent to which a particular observation deviates from the norm.
Mathematically, the Z-score for an individual data point x is calculated as:
$$z = \frac{x - \mu}{\sigma} \qquad (1)$$
where:
* $x$ is the individual data point.
* $\mu$ is the mean of the dataset.
* $\sigma$ is the standard deviation of the dataset.
A Z-score [8] of 0 indicates that the data point's score is identical to the mean score. A Z-score of 1.0 indicates a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean. In many contexts, a Z-score above 2.0 or below -2.0 is considered an outlier, but this threshold can vary based on the specific application or field of study.
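The Z-score rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The function names, the sample consumption values, and the choice of the population standard deviation are assumptions made for illustration; the threshold of 2.0 is the one mentioned in the text.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the Z-score of each value relative to the whole series."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant series has no spread, so no point deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def z_score_outliers(values, threshold=2.0):
    """Return the indices whose absolute Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Hypothetical monthly consumption in kWh for one household (illustrative only).
monthly_kwh = [210, 195, 205, 190, 185, 200, 215, 205, 195, 640, 200, 210]
print(z_score_outliers(monthly_kwh))
```

With only 12 monthly values, a single extreme reading inflates both the mean and the standard deviation, which is one reason the threshold may need tuning for such short series.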
Grubbs' test [19] is a statistical test used to detect outliers in a univariate data set that follows an approximately normal distribution. The test works by comparing the absolute deviation of a suspected outlier from the sample mean to the sample standard deviation. If this ratio is sufficiently large, the data point can be considered an outlier.
Mathematically, the Grubbs' statistic $G$ for a given observation $x_i$ is calculated as:
$$G = \frac{|x_i - \bar{x}|}{s} \qquad (2)$$
where:
* $x_i$ is the individual data point being tested.
* $\bar{x}$ is the sample mean.
* s is the sample standard deviation.
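The Grubbs statistic described above, the absolute deviation from the sample mean divided by the sample standard deviation, can be computed as in the following sketch. The function names and sample data are hypothetical. The source does not state which critical value or significance level was used, so the caller supplies it here; the value 2.412 is the figure commonly tabulated for n = 12 at a two-sided 5% level and should be checked against a reference table before use.

```python
from statistics import mean, stdev


def grubbs_statistics(values):
    """Return G_i = |x_i - mean| / s for every observation."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        return [0.0 for _ in values]
    return [abs(x - x_bar) / s for x in values]


def grubbs_candidate(values, critical_value):
    """Return (index, G) of the most extreme point if G exceeds the
    supplied critical value, otherwise None."""
    g = grubbs_statistics(values)
    idx = max(range(len(g)), key=g.__getitem__)
    return (idx, g[idx]) if g[idx] > critical_value else None


# Hypothetical monthly consumption in kWh for one household (illustrative only).
monthly_kwh = [210, 195, 205, 190, 185, 200, 215, 205, 195, 640, 200, 210]

# Assumed critical value for n = 12, two-sided, 5% significance (verify in a table).
G_CRIT_N12 = 2.412

result = grubbs_candidate(monthly_kwh, G_CRIT_N12)
print(result)
```

The test as sketched flags at most one point per pass. Detecting several outliers would require removing the flagged point and repeating the test with the critical value for the reduced sample size.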
4. RESULTS
In this study, data was collected from 100 volunteering households. This data was then anonymized, cataloged, and subsequently analyzed. Energy consumption information was sourced from energy bills. However, it became evident that the recorded data didn't truly mirror the genuine energy usage patterns of the households. This discrepancy arose from the data processing system that relied on "estimations" and "energy meter readings". In Table 1 each data point represented the monthly energy consumption over the course of a year.
The proposed methods for detecting outliers, IQR (Interquartile Range), LOF (Local Outlier Factor), MAD (Median Absolute Deviation), and MOVMAD (Moving Median Absolute Deviation), underwent a rigorous validation process. This validation encompassed the analysis of energy consumption data derived from both public buildings [5] and residential buildings [1], reflecting their versatility and applicability across different contexts.
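For orientation, the two best-performing earlier methods can be sketched in a few lines of Python. The source does not give the parameters used, so the 1.5 x IQR fences, the inclusive quartile method, the 1.4826 scaling constant, and the threshold of 3 scaled MADs are conventional choices assumed here, and the data are hypothetical.

```python
from statistics import median, quantiles


def iqr_outliers(values, k=1.5):
    """Flag indices lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, x in enumerate(values) if x < low or x > high]


def mad_outliers(values, threshold=3.0):
    """Flag indices more than `threshold` scaled MADs from the median."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    scaled_mad = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scaled_mad == 0:
        return []
    return [i for i, x in enumerate(values)
            if abs(x - med) / scaled_mad > threshold]


# Hypothetical monthly consumption in kWh for one household (illustrative only).
monthly_kwh = [210, 195, 205, 190, 185, 200, 215, 205, 195, 640, 200, 210]
print(iqr_outliers(monthly_kwh), mad_outliers(monthly_kwh))
```

Both rules rely on the median and quartiles rather than the mean, so a single extreme bill does not shift the reference level the way it does for mean-based statistics.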
To gain a comprehensive understanding of these outlier detection methods, energy consumption data were collected and integrated into a unified database, as detailed in Tables 2 and 3. Subsequently, a comparative analysis was conducted, comparing these techniques against the Z-score and Grubbs methods.
Prior to conducting the algorithm tests, any instances of missing data and irregularities were visually identified, allowing for an initial evaluation of the algorithms' accuracy. The abnormal data within the consumption profile was highlighted under the "HUMAN I" category and was specifically emphasized in Tables 2 and 3 for User 5 and User 36. Likewise, the absence of a data point for User 5 and the visually identified outliers for User 36 were brought to attention in Figure 1 and Figure 2.
In order to evaluate the accuracy of the outlier detection methods, a scoring system was introduced. The scoring method was designed to avoid binary outcomes where points are awarded solely for 100% accuracy. The algorithms were classified as "T" (True) if they correctly identified the visually identified outlier and as "F" (False) if they either misclassified the outlier or highlighted a different one. This approach enables a more precise assessment by considering the granularity of points assigned, thereby reducing the presence of black and white scenarios.
If a method successfully identified all outliers, it was assigned a score of 1, as in the previous study [1]. However, if the method erroneously classified non-outliers as outliers or failed to detect genuine outliers, the final score was calculated using the following formula:
... (3)
Considering the results obtained from both Table 2 and Table 3, we can observe the following scores for the tested algorithms:
* IQR consistently performs exceptionally well in both evaluations, earning a perfect score of 1 in both instances.
* MAD demonstrates strong and consistent performance, achieving a score of 0.75 in Table 2 and a perfect score of 1 in Table 3.
* MOVMAD maintains a moderate level of accuracy, with scores of 0.33 in Table 2 and 0.66 in Table 3.
* LOF and Z-score both score 0.75 in Table 2; in Table 3, LOF scores 0.5 and Z-score scores 1.
* Grubbs shows variability in performance, with a score of 0.6 in Table 2 and a score of 0 in Table 3, suggesting room for improvement.
Applying the scoring method to the analyzed data from 100 users, as presented in Table 5, IQR emerges as the leading performer, with a score of 49.31 and an accuracy rate of 69.45%. MAD follows with a score of 44.65 (62.89%), and Z-score comes third with 43.14 (60.76%). MOVMAD scores 34.15, an accuracy rate of 48.10%. LOF and Grubbs exhibit the lowest scores and accuracy rates, with Grubbs at 19.15 (26.97%) and LOF at 16.92 (23.83%). In summary, the IQR and MAD methods are the most accurate at identifying outliers for this dataset, followed closely by Z-score and then by MOVMAD. Thus, for this particular dataset and evaluation, the IQR and MAD methods appear to offer the most reliable options for outlier detection.
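The relationship between the reported scores and accuracy rates can be checked with a short calculation. The source does not state how a score is converted into a percentage. The sketch below only observes that every reported pair is consistent with dividing the score by a common denominator of about 71; this is an inference from the published figures, not a stated part of the method, and the variable and function names are hypothetical.

```python
# (score, accuracy in %) pairs as reported in the text for the 100-user dataset.
reported = {
    "IQR": (49.31, 69.45),
    "MAD": (44.65, 62.89),
    "Z-score": (43.14, 60.76),
    "MOVMAD": (34.15, 48.10),
    "Grubbs": (19.15, 26.97),
    "LOF": (16.92, 23.83),
}


def implied_denominator(score, accuracy_pct):
    """Denominator d such that score / d equals the reported accuracy."""
    return score / (accuracy_pct / 100.0)


# Rank the methods by reported accuracy, highest first.
ranking = sorted(reported, key=lambda name: reported[name][1], reverse=True)

for name in ranking:
    score, pct = reported[name]
    d = implied_denominator(score, pct)
    print(f"{name:8s} score={score:6.2f}  accuracy={pct:5.2f}%  implied denominator={d:.2f}")
```

Sorting by accuracy also makes the ordering explicit: IQR, MAD, Z-score, MOVMAD, Grubbs, LOF.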
5. CONCLUSION
Comparing the Z-score and Grubbs methods with the other outlier detection techniques yields several observations. The IQR and MAD methods achieve the highest accuracy rates. The Z-score method falls in the middle of the ranking: it does not reach the accuracy of IQR or MAD, but it performs significantly better than LOF and Grubbs and shows potential for identifying outliers in residential energy consumption data. The Grubbs method, by contrast, ranks near the bottom on this dataset. In scenarios where a balanced approach is essential, Z-score and Grubbs may be preferred choices. There is room for improvement in both methods, and further optimization could enhance their accuracy and reliability. In conclusion, the evaluation of the Z-score and Grubbs methods in this study suggests their potential as valuable tools for outlier detection. However, to further validate their effectiveness and robustness, future research should test these methods with larger volumes of data. The scalability and adaptability of Z-score and Grubbs to more extensive datasets warrant exploration to ensure their reliability in real-world applications.
ACKNOWLEDGEMENT
This paper was financially supported by the Project "Network of excellence in applied research and innovation for doctoral and postdoctoral programs" / InoHubDoc, a project co-funded by the European Social Fund, financing agreement no. POCU/993/6/13/153437.
REFERENCES
[1] Jurj, Dacian I., Alexandru G. Berciu, Alexandru Muresan, Mircea Lancranjan, Levente Czumbil, Dan D. Micu, Andrei Bende, and Bogdan A. Mitrache. "Applied data cleaning methods in outlier detection for residential consumer." In 2023 10th International Conference on Modern Power Systems (MPS), pp. 01-04. IEEE, 2023.
[2] D. I. Jurj, D. D. Micu, L. Czumbil, A. G. Berciu, M. Lancrajan and D. M. Bărar, "Analysis of Data Cleaning Techniques for Electrical Energy Consumption of a Public Building," 2020 55th International Universities Power Engineering Conference (UPEC), Turin, Italy, 2020, pp. 1-6, doi: 10.1109/UPEC49904.2020.9209781.
[3] D. Jurj, A. Polycarpou, L. Czumbil, A. Berciu, M. Lancranjan, D. Barar and D. Micu, "Extended Analysis of Data Cleaning for Electrical Energy Consumption Data of Public Buildings," in 12th Mediterranean Conference on Power Generation, Transmission, Distribution and Energy Conversion (MEDPOWER), Online, 2020.
[4] D. Jurj, L. Czumbil, B. Bârgăuan, A. Ceclan, A. Polycarpou and D. Micu, "Custom Outlier Detection for Electrical Energy Consumption Data Applied in Case of Demand Response in Block of Buildings," Sensors, no. 21, p. 2946, 2021.
[5] Zhang, J., Zhang, H., Ding, S., & Zhang, X. (2021). Power Consumption Predicting and Anomaly Detection Based on Transformer and K-Means. Frontiers in Energy Research, 9, 779587.
[6] V. Aggarwal, V. Gupta, P. Singh, K. Sharma and N. Sharma, "Detection of Spatial Outlier by Using Improved Z-Score Test," 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 2019, pp. 788-790, doi: 10.1109/ICOEI.2019.8862582.
[7] X. Guangcheng, C. Wenli, L. Xingzhi, Z. Ke, Z. Bo and S. Hongliang, "Research and Application of Verification Error Data Processing of Electricity Meter Based on Grubbs Criterion," 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), Xiangtan, China, 2019, pp. 13-17, doi: 10.1109/ICSGEA.2019.00012.
[8] S. K. Aggarwal, L. M. Saini and A. Kumar, "Electricity price forecasting in deregulated markets: A review and evaluation," Electrical Power and Energy Systems, no. 31, pp. 13-22, 2009.
[9] L. Perez-Lombard, J. Ortiz and C. Pout, "A review on buildings energy consumption information," Energy and Buildings, no. 40, pp. 394-398, 2008.
[10] Nikolaos Zografakis, Angeliki N. Menegaki, Konstantinos P. Tsagarakis, Effective education for energy efficiency, Energy Policy, Volume 36, Issue 8, 2008.
[11] Yang Wang, Jens Kuckelkorn, Fu-Yun Zhao, Di Liu, Alexander Kirschbaum, Jun-Liang Zhang, Evaluation on classroom thermal comfort and energy performance of passive school building by optimizing HVAC control systems, Building and Environment, Volume 89, 2015.
[12] "Manufacturing Energy Consumption Survey 2018," Administration, U.S. Energy Information, Feb. 2021.
[13] "Smart Metering deployment in the European Union," [Online]. Available: https://ses.jrc.ec.europa.eu/smart-metering-deployment-european-union.
[14] "Art 20 Energia electrica | Lege 123/2012," [Online]. Available: https://lege5.ro/Gratuit/gmzdenjwga/art-20-energia-electrica-lege-1232012?dp=gyytmmbzge3dm.
[15] S. K. Aggarwal, L. M. Saini and A. Kumar, "Electricity price forecasting in deregulated markets: A review and evaluation," Electrical Power and Energy Systems, no. 31, pp. 13-22, 2009.
[16] Chanda, S.S., Banerjee, D.N. Omission and commission errors underlying AI failures. AI & Soc (2022).
[17] A. G. Berciu, D. Jurj, L. Czumbil, D. D. Micu and E. H. Dulf, "Energy Pulse - the Efficient Solution for Monitoring Electricity Consumption from Decentralized Data Sets," 2021 9th International Conference on Modern Power Systems (MPS), Cluj-Napoca, Romania, 2021, pp. 1-6, doi: 10.1109/MPS52805.2021.9492626.
[18] Vikas Khare, Cheshta Khare, Savita Nema, Prashant Baredar, Chapter 2 - Data visualization and descriptive statistics of solar energy system, Editor(s): Vikas Khare, Cheshta Khare, Savita Nema, Prashant Baredar, Decision Science and Operations Management of Solar Energy Systems, Academic Press, 2023.
[19] K. K. L. B. Adikaram, M. A. Hussein, M. Effenberger, T. Becker, "Data Transformation Technique to Improve the Outlier Detection Power of Grubbs' Test for Data Expected to Follow Linear Relation", Journal of Applied Mathematics. vol. 2015, Article ID 708948, 9 pages, 2015. https://doi.org/10.1155/2015/708948
© 2023. This work is published under https://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.