Query engines enable users to execute queries quickly and gather results, supporting data retrieval across multiple data sources without needing custom code. The exponential growth of data volumes places increasing demands on modern databases, requiring higher performance, scalability, and efficient real-time query processing. These demands motivated the creation of alternative Database Management System (DBMS) architectures. Unlike traditional systems optimized for quick read-and-write operations on small datasets for transactional workloads, other architectures prioritize statistical insights.
Columnar query engines have become a prominent architecture for analytical processing, as they efficiently store and handle large datasets and optimize analytics extraction. These engines leverage columnar storage formats to improve query performance, particularly for data scans and aggregations.
Single Instruction, Multiple Data (SIMD) instructions allow CPUs to execute the same operation simultaneously across multiple data elements organized in vectors, significantly reducing execution time. This technique is particularly beneficial for column-oriented databases due to their inherent memory locality.
Indexes provide an additional method for enhancing database performance. Traditional indexing techniques like B-trees are optimized for relational DBMS to accelerate row-level retrievals. In contrast, columnar systems focus on large-scale scans and aggregations, where conventional indexes are less effective. Recent research, however, has refined indexing techniques to be more compatible with Online Analytical Processing (OLAP) queries and analytical workloads.
This dissertation investigates how combining indexing techniques with columnar databases and vectorization improves performance in real-time analytics and query systems. It addresses limitations in existing approaches by integrating index structures, such as bitmap and tree-based indexes, with optimizations tailored for real-time analytics performance.
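As a minimal illustration of why bitmap indexes pair naturally with vectorized execution, the sketch below builds one bitmap per distinct column value and answers a conjunctive filter with a single bitwise AND. It is not the implementation evaluated in this dissertation: the table contents and the helper names `build_bitmap_index` and `matching_rows` are hypothetical, and a Python integer stands in for the fixed-width machine words or SIMD registers a real engine would use.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap back into an ascending list of row ids."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical two-column table, stored column by column.
region = ["EU", "US", "EU", "ASIA", "US", "EU"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# WHERE region = 'EU' AND status = 'open' reduces to one bitwise AND,
# which tests every row in the bitmap at once.
hits = region_index["EU"] & status_index["open"]
result_rows = matching_rows(hits)
print(result_rows)
```

Because the filter is evaluated on packed bits rather than row by row, the same operation maps directly onto wide registers, which is the property that vectorized index designs aim to exploit.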
A systematic evaluation methodology is employed to validate the proposed solution using industry-standard benchmarks, including TPC-H and TPC-DS. These benchmarks measure query latency, I/O operations, and resource utilization. Experiments cover multiple configurations, including tests with unindexed data, to isolate and demonstrate the contributions of the proposed techniques. Performance metrics such as CPU and memory usage are analyzed to identify bottlenecks and opportunities for further optimization.
The results confirm that integrating vectorized indexing techniques can improve query performance by reducing latency, depending on the use case. However, the research also examines inherent trade-offs, including increased data structure size, additional write overhead, and hardware usage. These findings validate the proposed approach and underscore its potential to address the challenges of modern analytical workloads.
Overall, the findings suggest that SIMD-optimized indexes improve performance in OLAP workloads, while their integration into columnar query engines warrants further research.