This dissertation explores the problem of analyzing and optimizing data-dependent programs from both theoretical and practical perspectives. The performance of these programs depends in complex ways on the distribution of the input data, and they arise in many contexts, e.g., databases, sparse tensor programming, and graph analytics. By definition, these programs cannot be optimized by considering the code alone, so an optimizer for them must take information about the data distribution into account. This dissertation presents two new theoretical approaches for analyzing data-dependent programs by bounding the size of their intermediate results: the degree sequence bound and partition constraints. It then describes two practical systems for producing these bounds: SafeBound and COLOR. Lastly, it presents Galley, a state-of-the-art optimizer for sparse tensor programs that demonstrates the value of data-aware optimization.