Abstract
Data quality monitoring plays a critical role in various real-world engineering system inspection problems. Anomalous or invalid inspection data commonly exist due to computer/human recording errors, sensor faults, etc. Thus, an efficient tool to detect data anomalies is critically needed. However, detecting such anomalies is challenging due to high dimensionality, an unknown underlying distribution, insufficient sample size, and a high level of noise. To address these challenges, an effective approach that learns the underlying distribution of normal data and establishes anomaly detection rules was developed. In this approach, a Generative Adversarial Network (GAN) is employed to identify the underlying distribution of normal data and filter out noise. After the trained GAN is used to generate points from the learned distribution, a k-nearest neighbor-based approach is used to define the anomaly detection rules. In the proposed approach, the normal records are used to train the GAN and establish the control rule. Specifically, after training the GAN using the normal records, the pairwise distances over all the GAN-generated data points are calculated, and the k-nearest neighbors of every data point are determined accordingly. Then, the average distance from each data point to its k-nearest neighbors is calculated as the statistic that indicates data quality and is used to establish a control chart. When a new record comes in, its similarity to the GAN-generated distribution can be evaluated against the established control chart to determine whether the new record is anomalous.
Keywords
Data quality monitoring, generative adversarial network (GAN), high dimensional data, k-nearest neighbor, out-of-distribution detection
(ProQuest: ... denotes formulae omitted.)
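The k-nearest neighbor statistic and control rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: a random Gaussian sample stands in for the GAN-generated points (no GAN is trained here), the distance is Euclidean, k is set to 5, and the control limit is taken as the empirical 99th percentile of the baseline statistic. The names knn_mean_distance and is_anomalous are hypothetical.

```python
import numpy as np

def knn_mean_distance(points, reference, k, exclude_self=False):
    """Mean Euclidean distance from each row of `points` to its k nearest rows of `reference`.

    Set exclude_self=True only when `points` and `reference` are the same array,
    so that each point's zero distance to itself is ignored.
    """
    diff = points[:, None, :] - reference[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # shape: (n_points, n_reference)
    if exclude_self:
        np.fill_diagonal(dist, np.inf)
    nearest = np.sort(dist, axis=1)[:, :k]        # k smallest distances per row
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)

# ASSUMPTION: stand-in for GAN-generated points; a trained generator would supply these.
generated = rng.normal(size=(500, 10))

k = 5  # ASSUMPTION: illustrative choice of k

# Baseline statistic: average distance from each generated point to its k nearest neighbors.
baseline = knn_mean_distance(generated, generated, k, exclude_self=True)

# ASSUMPTION: upper control limit set at the empirical 99th percentile of the baseline.
control_limit = np.quantile(baseline, 0.99)

def is_anomalous(record):
    """Flag a new record whose k-NN statistic exceeds the control limit."""
    stat = knn_mean_distance(record[None, :], generated, k)[0]
    return bool(stat > control_limit)

normal_like = np.zeros(10)        # lies in the densest region of the stand-in sample
far_away = np.full(10, 8.0)       # lies far outside the stand-in sample

print(is_anomalous(normal_like), is_anomalous(far_away))
```

In this sketch, a record near the bulk of the reference sample yields a small statistic and passes, while a record far from every generated point exceeds the limit and is flagged. The percentile used for the limit controls the false alarm rate on in-distribution records and would need to be chosen for the application at hand.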
1. Introduction
Data-driven methods have been extensively leveraged for quality improvement in engineering systems, such as infrastructure systems [1], manufacturing systems [2], and health systems [3]. Advanced sensing technology provides a massive amount of high-dimensional data, from which the system quality can be evaluated and monitored [4]. Despite the development of cutting-edge high-dimensional data analytics methods, data quality is also a key factor influencing the performance of quality control, yet it has not been well investigated.
Anomalous or invalid monitoring data commonly exist due to computer/human recording errors, sensor faults, cyberattacks, etc., which might raise false alarms or cause missed detections regarding the system quality and even lead to significant economic loss [5]. Thus, there is a pressing need for a data...




