
Abstract

In modern systems management, collecting and analyzing large volumes of log data is critical. Regular expressions (regex) are the industry norm for extracting information from these logs. However, neither database management systems (DBMSs) nor log analysis systems incorporate regex evaluation into their query optimization. Brute-force regex evaluation is computationally expensive, especially as log data grows into the petabyte range while hardware performance remains flat. This creates a critical bottleneck: traditional regex engines, designed for correctness and generality, cannot sustain the required throughput, while traditional full-text indexing solutions impose prohibitive storage overhead.

This dissertation presents a workload-aware approach to efficient query processing in resource-constrained settings. We demonstrate that by exploiting the specific statistical characteristics of log analysis workloads, we can design specialized query engines and lightweight indexing structures that deliver order-of-magnitude performance gains with far less space overhead.

First, we address the problem of optimizing regex-based log processing under computational and strict space constraints by introducing BLARE, a framework that speeds up regex matching without requiring extra storage. BLARE treats regex evaluation as a query-planning problem. It decomposes complex regex queries into patterns and literals and uses multi-armed bandits to learn an effective splitting strategy from a small sample. This adaptive strategy runs up to 168 times faster than standard libraries such as RE2 and PCRE2.
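The idea of choosing an evaluation strategy from a small sample can be illustrated with a minimal sketch. It is not BLARE's implementation: the regex, its hand-picked literals, the two strategies, and the epsilon-greedy bandit are all assumptions chosen for illustration, and the abstract does not specify which bandit algorithm or splitting rules BLARE uses.

```python
import random
import re
import time

# Hypothetical query: a regex and the literals it must contain (split by hand).
PATTERN = re.compile(r"ERROR \d+ timeout after \d+ms")
LITERALS = ["ERROR ", " timeout after "]

def match_direct(line):
    """Arm 0: run the full regex on every line."""
    return PATTERN.search(line) is not None

def match_prefilter(line):
    """Arm 1: cheap substring checks first, regex only on survivors."""
    return all(lit in line for lit in LITERALS) and PATTERN.search(line) is not None

ARMS = [match_direct, match_prefilter]

def choose_arm(sample, rounds=200, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over ARMS; the cost of a pull is its elapsed time."""
    rng = random.Random(seed)
    counts = [0] * len(ARMS)
    mean_cost = [0.0] * len(ARMS)
    for _ in range(rounds):
        if 0 in counts:
            arm = counts.index(0)            # try every arm at least once
        elif rng.random() < epsilon:
            arm = rng.randrange(len(ARMS))   # explore
        else:
            arm = min(range(len(ARMS)), key=lambda a: mean_cost[a])  # exploit
        line = rng.choice(sample)
        start = time.perf_counter()
        ARMS[arm](line)
        cost = time.perf_counter() - start
        counts[arm] += 1
        mean_cost[arm] += (cost - mean_cost[arm]) / counts[arm]  # running mean
    return min(range(len(ARMS)), key=lambda a: mean_cost[a])

def count_matches(lines, sample_size=50):
    """Pick a strategy on a small sample, then apply it to all lines."""
    arm = choose_arm(lines[:sample_size])
    return sum(1 for line in lines if ARMS[arm](line))
```

Both arms return the same answer for every line, because any match of the pattern necessarily contains both literals; the bandit only decides which arm is cheaper on the sampled data.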

Second, we address the computational and memory constraints of regex evaluation by carefully examining different n-gram indexing methods, including their performance and overheads. We show that theoretically optimal selection algorithms incur prohibitive construction costs that negate their benefits. We demonstrate that different methods suit different types of workloads and that simple frequency-based heuristics yield a practical, robust solution that scales better as data size grows.
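As a rough illustration of what a frequency-based selection heuristic can look like, the sketch below counts in how many lines each character n-gram occurs and keeps only n-grams that are selective enough to be useful as filters. The function names, the selectivity threshold, and the budget are hypothetical; the abstract does not state which heuristics were evaluated.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return the set of distinct character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def select_ngrams(lines, n=3, max_selectivity=0.5, budget=100):
    """Keep up to `budget` n-grams that occur in at most
    max_selectivity * len(lines) lines, rarest first (hypothetical rule)."""
    doc_freq = Counter()
    for line in lines:
        doc_freq.update(ngrams(line, n))  # each line counts an n-gram once
    limit = max_selectivity * len(lines)
    candidates = [(g, c) for g, c in doc_freq.items() if c <= limit]
    candidates.sort(key=lambda gc: (gc[1], gc[0]))  # by frequency, then text
    return [g for g, _ in candidates[:budget]]
```

A single pass over the data and one sort is all this needs, which is the kind of low construction cost that makes such heuristics attractive compared with optimal selection.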

Finally, building on these insights, we introduce REI, a lightweight bit-vector indexing framework. By indexing query n-grams rather than the data, REI improves filtering performance while reducing index construction overhead by several orders of magnitude. These contributions show that exploiting workload characteristics enables high-performance log analytics when hardware provisioning is constrained.
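A bit-vector index keyed on query n-grams can be sketched as follows. This is only an illustration under stated assumptions, not REI's design: the function names are hypothetical, the n-grams are supplied by hand rather than extracted from a query workload, and a Python integer stands in for each bit vector.

```python
import re

def build_index(lines, query_ngrams):
    """Map each query n-gram to an int used as a bit vector over line ids."""
    index = {g: 0 for g in query_ngrams}
    for i, line in enumerate(lines):
        for g in index:
            if g in line:
                index[g] |= 1 << i  # set bit i: line i contains n-gram g
    return index

def candidates(index, literal, n, num_lines):
    """AND together the bit vectors of the literal's indexed n-grams."""
    bits = (1 << num_lines) - 1  # start with every line as a candidate
    for i in range(len(literal) - n + 1):
        g = literal[i:i + n]
        if g in index:
            bits &= index[g]
    return [i for i in range(num_lines) if (bits >> i) & 1]

def search(lines, index, pattern, literal, n=3):
    """Filter with the index, then confirm survivors with the real regex."""
    rx = re.compile(pattern)
    return [i for i in candidates(index, literal, n, len(lines))
            if rx.search(lines[i])]
```

The index only ever holds entries for n-grams that appear in queries, so its size is bounded by the workload rather than by the vocabulary of the data, and the regex runs only on the lines that survive the bitwise filter.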

Details

Title
Workload-Aware Optimization for High-Throughput Log Analytics
Number of pages
164
Publication year
2025
Degree date
2025
School code
0262
Source
DAI-A 87/6(E), Dissertation Abstracts International
ISBN
9798270242589
Committee member
Koutris, Paris; Huang, Tsung-Wei
University/institution
The University of Wisconsin - Madison
Department
Computer Sciences
University location
United States -- Wisconsin
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32402404
ProQuest document ID
3285443214
Document URL
https://www.proquest.com/dissertations-theses/workload-aware-optimization-high-throughput-log/docview/3285443214/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic