Abstract
Modern web applications often consist of hundreds of services distributed across different servers or tiers. On one hand, this architecture provides abstraction and modularity for software development and reuse. On the other hand, it makes the behavior of the system difficult to predict, as each tier has its own functionality, configuration, and demands for computing resources. Anomaly detection therefore becomes an important aspect of the management and operation of multi-tier web systems. To track their operation and aid in the analysis of their behavior, web systems expose numerous metrics in all tiers. However, collecting and analyzing every available metric degrades system performance, imposing non-negligible communication, storage, and processing overhead. Another concern is the nature of the workload of these systems, which may fluctuate widely over time. One approach to supporting anomaly detection in web systems is to use stable correlations among monitoring metrics. This approach, called correlation-based monitoring, requires neither a deep understanding of the system internals or metric semantics nor the existence of fault data. In addition, since only the metrics involved in stable correlations are periodically collected, the monitoring overhead is reduced. Stable correlations also have the desirable property of holding for long periods of time before becoming invalid due to workload fluctuations. The challenge, however, is to identify the stable correlations. In this work, we address this challenge by proposing three novel strategies based on partial correlation, a statistical tool commonly employed to summarize the relevant information of complex systems. We evaluate our strategies using traces obtained from an e-commerce web transaction benchmark deployed in our testbed. Results show that our best strategy allows the construction of a monitoring network with fewer metrics than a state-of-the-art solution while achieving greater fault coverage. They also show that the correlations are reasonably stable, and the models can be applied for sufficiently long periods of time (at least 50 times the training time) before they become invalid.
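The abstract does not describe how the partial correlations are obtained; the sketch below is only a minimal illustration (not the authors' strategies) of estimating pairwise partial correlations among monitoring metrics over a training window, using the standard precision-matrix identity. The metric names, synthetic data, and the 0.7 stability threshold are assumptions made for the example.

# Minimal sketch: pairwise partial correlations among monitoring metrics.
# Metric names, data, and the threshold below are illustrative assumptions.
import numpy as np

def partial_correlations(samples: np.ndarray) -> np.ndarray:
    """samples: (n_observations, n_metrics) window of collected metric values.
    Returns the partial correlation between every pair of metrics,
    controlling for all remaining metrics, via the inverse correlation matrix."""
    corr = np.corrcoef(samples, rowvar=False)      # plain Pearson correlations
    precision = np.linalg.pinv(corr)               # pseudo-inverse tolerates collinearity
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)            # pcorr_ij = -P_ij / sqrt(P_ii * P_jj)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Toy usage: three hypothetical metrics sampled over a training window.
rng = np.random.default_rng(0)
requests = rng.normal(100, 10, 500)
cpu = 0.8 * requests + rng.normal(0, 2, 500)       # driven by the request rate
disk_io = 0.5 * cpu + rng.normal(0, 2, 500)        # driven indirectly, through CPU
window = np.column_stack([requests, cpu, disk_io])

pc = partial_correlations(window)
# Metric pairs whose partial correlation stays above a threshold across
# training windows would be candidates for "stable correlations" to monitor.
stable_pairs = [(i, j) for i in range(pc.shape[0]) for j in range(i + 1, pc.shape[1])
                if abs(pc[i, j]) > 0.7]
print(stable_pairs)

In this toy example, the requests-disk_io pair shows a high plain correlation but a weak partial correlation once CPU is controlled for, which is the kind of distinction partial correlation is used to capture.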
Index Terms
Correlation analysis; Software industry; Workloads; Time; Modularity; Software; Storage; Semantics; Architecture; Reuse; Interpersonal communication; Applied behavior analysis; Electronic commerce; Applications programs; Complex systems; Monitoring; Correlation; Workload; Software reuse; Anomalies; Software development
1 Instituto de Informática, Universidade Federal de Goiás (UFG), Goiânia, GO, Brazil
2 Departamento de Informática, Universidade Federal do Paraná (UFPR), Curitiba, PR, Brazil