Content area
The twenty-first century is a data-driven era where human activities and behavior, physical phenomena, scientific discoveries, technology advancements, and almost everything that happens in the world resulting in massive generation, collection, and utilization of data. Connectivity in data is a crucial property. A straightforward example is the World Wide Web, where every webpage is connected to other web pages through hyperlinks, providing a form of directed connectivity. Combinatorial data refers to combinations of data items based on certain connectivity rules. Other forms of combinatorial data include social networks as simplicial complexes, molecules as graphs, sets of binary classifiers as hypergraphs, and radius graphs from a metric space, which realizes connectivity for topological data analysis. This Ph.D. dissertation focuses on computation and learning on combinatorial data.
Persistence theory asks the question of: "What are the future dependencies of a vector in a time-indexed representation space?". A canonical example is given by persistent homology, which asks this question for a sequence of homology vector spaces. Homology is a measurement of "partial disconnectedness." For example, it can measure an obstruction to traveling through a 2d surface from one point to another. Since persistence is computed over a causal ordering, computing persistent homology appears as an inherently sequential process which would make computation on GPU difficult. We find that much of the algebraic computation only depends on the neighboring connectivity of the data and not on any algebraic operations, allowing for a CPU-GPU hybrid approach.
When given samples from a metric space, higher order connectivity can be realized on samples through a Vietoris-Rips complex. The persistence can be computed efficiently by exploiting the heavy hitter case of "generalized nearest neighbors." This can be computed massively in parallel on GPU with an efficient transfer of data structures between GPU and CPU. We have successfully designed and implemented algorithms and their implementations on GPU in the domain of persistent homology, achieving high performance by two opensource software, namely HYPHA and Ripser++.
The multiset of pairs of time end points when a homology generator belongs to a homology vector space is called a persistence diagram. Since any pair of times which are equal belong to the persistence diagram, when embedded in the plane, there is a diagonal with infinitely many points. To compute the earth mover's distance (EMD) between persistence diagrams, also called the Wasserstein distance between persistence diagrams, we have reduced the geometry of the plane to a graph embedded onto the surface of a finite height three dimensional pointed cone and compute an optimization problem on a sparse graph. This will not introduce any distortion in computing the EMD from the diagonal, which can be as large as 2 when only taking finitely many points from the diagonal. Indirectly, our approach gives a near linear time algorithm for a (1 +E) -approximation of the EMD between persistence diagrams. This is an optimal lower bound if a (1 + E) approximate EMD cannot be solved in time O(n1 + o(1) - δ) for any δ> 0 Besides algorithms and analysis, we have developed an open source software called PDopt Flow based on our algorithm.