Content area

Abstract

Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. This work studies computing expected multiplicities of query results over probabilistic databases under bag semantics which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical as well as a systems perspective.

The problem of computing the marginal probability of a tuple in the result of a query over set probabilistic databases (PDBs) can be reduced to calculating the probability of the lineage formula of the result, a Boolean formula over random variables representing the existence of tuples in the database’s possible worlds. The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world. This work studies the problem of calculating the expectation of such polynomials (a tuple’s expected multiplicity) exactly and approximately. For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple’s lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs). This work further examines the fine-grained complexity of this problem for c-TIDBs, i.e., probabilistic databases where tuples are independent probabilistic events and the multiplicity of each tuple is bound by a constant c. Unfortunately, our results imply that computing expected multiplicities for c-TIDBs introduces super-linear overhead over the corresponding deterministic query evaluation algorithms (under certain complexity hardness conjectures). Using a reduction from the problem of counting k-matchings, this work demonstrates that calculating the expectation is #W[1]-hard when the polynomial is compressed, for example through factorization. Such factorized representations are exploited by modern join algorithms (e.g., worst-case optimal joins), and so our results imply that computing probabilities for Bag-PDB based on the results produced by such algorithms introduces super-linear overhead. Next, this work then develops a sampling algorithm that computes a (1 ± ϵ)-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra (RA+ ) query over c-TIDBs and for a non-trivial subclass of block-independent databases. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless have significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FASTPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.

By removing Bag-PDB’s reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.

Details

1010268
Title
Theoretical Bounds and Systems Development for Exact/Approximate Multiplicities in Bag PDBs
Number of pages
90
Publication year
2025
Degree date
2025
School code
0656
Source
DAI-B 87/3(E), Dissertation Abstracts International
ISBN
9798293833276
Committee member
Rudra, Atri; Zola, Jaroslaw; Zhao, Zhuoyue
University/institution
State University of New York at Buffalo
Department
Computer Science and Engineering
University location
United States -- New York
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32169839
ProQuest document ID
3250297233
Document URL
https://www.proquest.com/dissertations-theses/theoretical-bounds-systems-development-exact/docview/3250297233/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic