Content area

Abstract

A streaming sketch is a data structure that can approximately answer queries of a data stream, whose size is much smaller than the stream length or the domain size. This thesis studies a wide range of topics on streaming sketching including approximate counting, cardinality estimation, the tractability of queries, the universality of sketches, non-mergeable sketches, as well as sampling. My research goes in two directions.

Mathematical theory For this direction, the simplicity and generality of the theory is the primary goal, and the produced algorithms can be existential or more like mathematical prototypes. Notable progress includes the following.

I explicitly connect Levy processes to streaming algorithms, revealing a deep connection between infinite divisibility and scale-invariance. This connection sheds new light on the important tractability problem and also pins down the hash pattern one needs to G-sample over incremental streams with exactly correct sampling probabilities.

I propose a new universal sketching scheme by decomposing general streaming queries into easy-to-estimate “harmonic queries.” The decomposition can be obtained from the Levy-Khintchine representation theorem or the Fourier transform.

I propose a formal measurement of the goodness of a cardinality sketch, which sets up a tension between the Fisher information number and Shannon's entropy. I also generalize cardinality sketches to work over any (finite) commutative monoid, using the harmonic decomposition technique together with the semigroup decomposition of commutative monoids.

Practical algorithms For this direction, the simplicity and practicality of the algorithm is the primary goal, and the associated analysis is usually complicated and computation-heavy. Small-scale experiments are often included to confirm the analysis.

I analyze various classes of sketches and estimators for the cardinality query. Many recipes are constructed for practitioners to pick and construct cardinality sketches and estimators in different contexts with different performance emphases.

I study the sequential behaviors of cardinality sketches and construct simple fraud detectors.

I study two variants of the Morris counter, one tracks a multi-dimensional count vector and the other tracks a count that can be both incremented and decremented.

Fortunately, some of the algorithms I construct turn out to possess simplicity in both theory and practice. Notable examples are the F1/2-sampler, stable-HyperLogLog, and the d-dimensional approximate counter.

Details

1010268
Title
Streaming Sketching: Mathematical Theory and Practical Algorithms
Number of pages
358
Publication year
2025
Degree date
2025
School code
0127
Source
DAI-B 86/11(E), Dissertation Abstracts International
ISBN
9798314875285
Advisor
Committee member
Rudelson, Mark; Bansal, Nikhil; Dereziński, Michał; Woodruff, David
University/institution
University of Michigan
Department
Computer Science & Engineering
University location
United States -- Michigan
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32092649
ProQuest document ID
3203185130
Document URL
https://www.proquest.com/dissertations-theses/streaming-sketching-mathematical-theory-practical/docview/3203185130/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic