As heterogeneous parallel architectures grow increasingly complex, achieving high performance and effectively teaching parallel programming have become more challenging. Benchmark suites are powerful tools for illustrating and evaluating optimization techniques in practical, performance-critical scenarios. However, widely used parallel benchmark suites such as SPEC OMP and Rodinia are primarily designed for performance assessment and lack structured support for optimization exploration or ease of use. To address these limitations, this dissertation presents two novel benchmark suites, NeoRodinia and CUDAMicroBench, that extend beyond traditional benchmarking by enabling more accessible and structured performance experimentation.
NeoRodinia features a structured three-level parallelization model (P1, P2, P3) spanning CPU worksharing, GPU offloading, SIMD, and tasking. It provides standardized execution workflows, automated performance evaluation scripts, and visualization tools. Additionally, NeoRodinia integrates AI-assisted analysis, using large language models (LLMs) to offer optimization recommendations and debugging insights. CUDAMicroBench is a modular microbenchmark suite targeting key GPU optimization challenges such as memory hierarchy usage, warp divergence, and concurrent kernel execution. The dissertation also presents a focused performance study of shuffle-based reduction kernels, analyzing instruction variants across GPU generations and demonstrating the suite’s capability for low-level architectural evaluation.
In addition to these benchmark-based contributions, this dissertation advances parallel programming education by introducing the Interactive OpenMP Programming book. Through deliberate prompt engineering strategies, the book leverages large language models (ChatGPT-4, Gemini Pro 1.5, and Claude 3) to enhance the quality, relevance, and pedagogical value of the generated content. Delivered in a Jupyter-based environment, it enables real-time experimentation with OpenMP constructs, promoting hands-on learning and deeper understanding. Surveys of both OpenMP educators and learners confirm that the book aligns with instructional principles in parallel programming and functions effectively as a learning resource.
Together, these contributions form a cohesive infrastructure for performance analysis, optimization strategy exploration, and interactive learning in modern parallel computing. By integrating benchmark design, LLM-driven analysis, and accessible tooling, this work provides a scalable resource for researchers, instructors, and practitioners navigating today’s increasingly heterogeneous high-performance computing landscape.