Modern cyberattack investigations face significant challenges due to the exponential growth in both the scale and sophistication of attacks. Logs serve as the primary source for attack investigations, but as attacks become increasingly complex and prolonged, traditional log-based investigation struggles with high storage costs, slow query performance, and resource-intensive, low-precision analysis. This dissertation examines how artificial intelligence (AI) can improve attack investigation systems for large-scale advanced threat analysis by optimizing log storage, management, and analysis within existing investigation pipelines.
First, we introduce ELISE, a storage-efficient logging system built on a novel lossless data compression technique that uses deep neural networks (DNNs) to learn an optimal character encoding. ELISE naturally supports all types of log-based attack investigation tasks while achieving two to three times better compression than the existing state-of-the-art methods Gzip and DeepZip, which points to AI-based reduction of log storage costs as a promising direction for future research.
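The general principle behind model-based lossless compression can be illustrated with a minimal sketch. It is not ELISE's implementation: a simple adaptive order-1 character-count model stands in for the DNN, and the function names and sample log line are hypothetical. The sketch only computes the ideal code length (the sum of -log2 p over the characters) that an entropy coder could approach, and it assumes the alphabet is known to both encoder and decoder. Because the decoder can rebuild the same counts from the characters it has already decoded, the scheme remains lossless.

```python
import math
from collections import defaultdict


def ideal_bits_uniform(text: str) -> float:
    """Ideal cost in bits when every distinct character is equally likely."""
    if not text:
        return 0.0
    return len(text) * math.log2(len(set(text)))


def ideal_bits_adaptive(text: str) -> float:
    """Ideal cost in bits under an adaptive order-1 model (illustrative stand-in for a DNN).

    Each character is predicted from the previous one using Laplace-smoothed
    counts that are updated only after the character has been coded, so a
    decoder can reproduce the same probabilities.
    """
    alphabet_size = len(set(text))  # assumption: alphabet shared with the decoder
    counts = defaultdict(lambda: defaultdict(int))  # counts[prev][ch]
    totals = defaultdict(int)                       # totals[prev]
    bits = 0.0
    prev = ""
    for ch in text:
        # Probability the model assigns to the character that actually occurs.
        p = (counts[prev][ch] + 1) / (totals[prev] + alphabet_size)
        bits += -math.log2(p)
        # Update the model after coding, mirroring what the decoder can do.
        counts[prev][ch] += 1
        totals[prev] += 1
        prev = ch
    return bits


if __name__ == "__main__":
    sample = "GET /index.html 200\n" * 50  # hypothetical, highly repetitive log
    print(f"uniform model : {ideal_bits_uniform(sample):.0f} bits")
    print(f"adaptive model: {ideal_bits_adaptive(sample):.0f} bits")
```

On repetitive log text, the adaptive model assigns high probability to the characters that actually occur, so its ideal code length falls well below that of the uniform model. A learned predictor aims to exploit the same effect with a stronger model.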
Second, we present LEONARD, the first DNN-based provenance graph storage system designed for efficient management of log-derived provenance data. Provenance graphs abstract log data to capture the causal dependencies that are crucial for attack investigation, but they impose significant storage and query overhead. LEONARD converts these graphs into numerical vectors and stores them using DNNs. Compared with widely used databases, LEONARD reduces space overhead by up to 25.90 times and accelerates up to 99.6% of query executions.
Third, we develop AIRTAG, an unsupervised learning-based attack investigation system that operates directly on log text, eliminating the need for manual work such as labeling and for computationally expensive graph processing. AIRTAG can recover the attack chain while achieving superior efficiency and effectiveness: it is 2.5 times faster, with 9.0% fewer false positives and 16.5% more true positives, than traditional provenance-based approaches.
Together, these contributions form a comprehensive AI-powered framework that addresses the challenges of large-scale threat investigation in existing attack analysis systems. By significantly reducing storage costs and computational overhead while also improving effectiveness, our research offers security teams better solutions for advanced threat investigation.