Extended SQL with machine learning (ML) predicates, commonly referred to as SQL+ML, integrates ML capabilities into traditional SQL processing in databases. To process SQL+ML queries, some methods move data from database (DB) systems to ML systems. Such methods are not only costly, because two copies of the data must be maintained, but also pose security risks due to data movement. Fortunately, in-database SQL+ML processing addresses these limitations. However, conventional DB optimizers treat ML predicates as user-defined functions (UDFs) and cannot optimize them with the query rewriter or cost models. To boost the efficiency of in-database SQL+ML processing, this paper proposes to generate SQL predicates from ML predicates and add them into SQL+ML queries, which can prune a significant number of irrelevant tuples and thus improve performance. Optimizing SQL+ML queries presents three challenges: (C1) how to generate valid SQL predicates, (C2) how to select high-quality SQL predicates, and (C3) how to optimize the query using SQL predicates. To address these challenges, we propose
Introduction
Many customers have both SQL analytics and machine learning (ML) requirements on top of their data in databases [20, 47]. Existing methods usually use two independent systems to meet this requirement, i.e., using a DBMS to support SQL analytics, exporting the data to another ML platform, and using the ML platform to support ML predicates [23, 43]. This solution is expensive and insecure, because it maintains two copies of the data, transfers the data between the two systems [46], and cannot effectively share the hardware environments between the two systems [4].
To alleviate these limitations, the concept of in-database machine learning has been introduced [12, 35]. It extends SQL capabilities to support ML operators [5, 37] as SQL+ML. SQL+ML encompasses the ML predicate, which applies an inference function built on a trained ML model (e.g., linear regression
The primary objective of in-database ML systems is to optimize SQL+ML queries. However, processing an ML predicate incurs significant costs due to its computationally intensive operations, which involve joining multiple tables, inferring a value for each joined tuple, and returning the tuples that satisfy the ML predicate [19, 51]. A natural idea to accelerate in-database SQL+ML processing is to prune irrelevant tuples before performing joins.
Fig. 1 Example of an SQL+ML query. In this query, an ML predicate (bold red lines) utilizes a trained ML model to infer and filter tuples
To this end, we propose deriving “valid” SQL predicates, expressed solely over columns and values, from ML predicates involving ML models. Notably, a “valid” SQL predicate is a necessary condition for an ML predicate; that is, if a tuple satisfies the ML predicate, it must also satisfy the SQL predicate.1 These derived SQL predicates can be optimized by DB query optimizers as well. Collaboratively optimizing ML predicates and other DB operators can prune irrelevant tuples early using SQL predicates, and consequently enhance the efficiency of SQL+ML processing.
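To make this concrete, the sketch below contrasts a hypothetical SQL+ML query with its rewritten form; the table, column, model, and function names (movies, ratings, lr_model, predict) and the derived constant are illustrative assumptions, not the paper's actual syntax or the query of Fig. 1.

```python
# Hypothetical SQL+ML query: a linear model over two joined tables; the ML
# predicate keeps tuples whose inferred value is below 4.0.
original = """
SELECT *
FROM movies m JOIN ratings r ON m.id = r.movie_id
WHERE predict(lr_model, m.budget, r.votes) < 4.0;      -- ML predicate
"""

# After rewriting: with weights w = [w0, w1, w2] (w1, w2 > 0) and the catalog
# statistic min(r.votes), a valid atomic predicate on m.budget is
#   m.budget < (4.0 - w0 - w2 * min(r.votes)) / w1,
# which prunes tuples at the scan, before the join.
rewritten = """
SELECT *
FROM movies m JOIN ratings r ON m.id = r.movie_id
WHERE m.budget < 42.7                                  -- derived SQL predicate
  AND predict(lr_model, m.budget, r.votes) < 4.0;      -- original ML predicate
"""
```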
In this paper, we study how to optimize an SQL+ML query through employing valid SQL predicates within DB optimizers.
Challenge 1: How to generate valid SQL predicates that achieve the equivalence guarantee? The generated SQL predicate should be valid and tight: it must not prune any tuple that satisfies the ML predicate (no false negatives), and it should minimize the non-qualifying tuples that pass it (false positives).
Contribution 1: Inference rewrite for SQL predicate generation. By combining the model’s weight and structure with conditions in ML predicate, we derive valid SQL predicates. Inference rewrite incorporates these SQL predicates into the query tree as a logical optimization step in DB optimizer.
Challenge 2: How to choose high-quality SQL predicates? The number of valid SQL predicates is large when the ML predicate involves many columns across numerous tables. However, executing multiple SQL predicates incurs execution overhead and pruning redundancy. Therefore, it becomes crucial to judiciously select a high-quality subset of valid SQL predicates.
Fig. 2 A running example of SQL+ML query optimization in
Contribution 2: Progressive inference for SQL predicate selection. We compute the pruning redundancy and overhead of different SQL predicates, and select a subset with the maximum pruning performance. Aiming to minimize the overhead, the selected SQL predicates infer tuples progressively.
Challenge 3: Where to place the SQL predicates in the plan tree? Each SQL predicate can be positioned at different nodes in the plan tree to affect the subsequent operators’ costs, while the relative positions of SQL predicates influence each other. Therefore, it is crucial to identify the cost-optimal positions for them.
Contribution 3: Cost-optimal inference for SQL predicates placement. We propose computing the cost of placing multiple SQL predicates at different positions, and then design a dynamic programming algorithm that finds the cost-optimal positions.
We summarize the main contributions of this paper.
(1) To the best of our knowledge, this is the first work focused on optimizing ML predicates within the DB optimizer.
(2) We propose a logical optimization technique, inference rewrite, that generates SQL predicates while preserving equivalence.
(3) We introduce the progressive inference to select high-quality SQL predicates, aiming to improve query performance.
(4) We devise a cost-optimal inference technique to optimize the physical positions for SQL predicates.
(5) We have implemented
Notably, the “State of Data Science and Machine Learning” report [22] indicates that the most commonly used ML algorithm is “linear or logistic regression” (80.3%), followed closely by “decision tree” (74.1%); both are popular on relational datasets. Our techniques can also be adapted to linear/non-linear SVMs and random forests with slight modification, and can be seamlessly integrated into most modern DB query optimizers.
System framework
In this section, we first define SQL+ML and then present a framework for optimizing SQL+ML queries.
SQL+ML queries. We extend SQL to support ML predicates, where each ML predicate can be modelled as “
Key idea. If an SQL+ML query contains an ML predicate, traditional databases typically first compute the intermediate table for the SQL+ML query by omitting the ML predicate, then infer values for these tuples based on the model in the ML predicate, and finally prune the tuples that do not satisfy the ML predicate. However, this approach incurs high computational costs, especially when the intermediate table comprises a large number of tuples after multiple joins. Therefore, it is crucial to prune irrelevant tuples either before or during the join operators, especially when the intermediate table is substantial but only a small proportion of tuples satisfy the ML predicate. To achieve this goal, we design an SQL+ML optimizer ARchiTecture (
We classify SQL predicates into atomic predicates over a single table (e.g.,
Atomic predicates. (1) Generation: Each ML predicate involves multiple tables, and we generate a corresponding atomic predicate for each table. (2) Selection: The cost of executing atomic predicates is almost negligible compared to the cost of table scan and join operators. Each atomic predicate is highly likely to prune many tuples in the corresponding base table, so
Overall, the generation, selection, and placement of atomic predicates constitute the inference rewrite module as shown in Figs. 2 and 3. Since
Fig. 3 The framework of
Composite predicates. (1) Generation: Composite predicates are generated by combining multiple tables and their corresponding model weights. They provide tighter necessary conditions than atomic predicates and thus can further prune irrelevant tuples. The generation function, which extends the atomic predicate generation function, will be discussed in Sect. 3.3. (2) Selection: Unfortunately, the total number of candidate composite predicates reaches up to $2^D - D - 1$, where D represents the number of columns in the ML predicate, resulting in significant time and space costs when generating all of them. Hence, it becomes crucial to select effective composite predicates that possess high pruning ability. We propose a tree-based method to evaluate the pruning redundancy between different composite predicates and select high-quality composite predicates to maximize pruning performance in Sect. 4, which corresponds to the progressive inference module in Figs. 2 and 3. (3) Placement: Different placements lead to varying total costs. Firstly, the computational cost is affected by the positions of other composite predicates due to progressive inference. Secondly, the pruning benefits change based on positions. Considering both factors,
System workflow. The framework is shown in Fig. 3, and Fig. 2 illustrates a running example. A user-given SQL+ML query is parsed into a query tree ❶ as the input of
Optimizing SQL+ML directly within the DB optimizer is notably more effective. Since ML predicates and DB operators are intricately intertwined, they should not be optimized separately. DB optimizers can optimize ML predicates using statistical data and cost estimation models.
Different from existing frameworks.
While both PP [34] and
Orion [27] showed that, in each iteration of the ML model training process, the gradient computation over all tuples can be accelerated with factorized joins [8, 33]. Different from Orion,
SQL predicate generation
In this section, we take linear regression, logistic regression, and decision trees as examples to discuss how to generate SQL predicates.
Preliminaries
Linear regression inference. A linear regression model f has a model weight array $\mathbf{w} = [w_0, w_1, \ldots, w_D]$. Given a tuple $\mathbf{t} = [t_1, t_2, \ldots, t_D]$, the inferred numeric value of model f is defined as:
$$f(\mathbf{t}) = w_0 + \sum_{i=1}^{D} w_i \cdot t_i \quad (1)$$
We abbreviate this linear regression model as f.
Range ML predicate. Given a numeric value v as the upper bound of the inferred value with the inference function f, the range ML predicate can be written in an SQL+ML query as “$f(\mathbf{t}) < v$”.
Logistic regression inference. A logistic regression model g maps the weighted combination of a tuple's columns to a probability:
$$g(\mathbf{t}) = \frac{1}{1 + e^{-(w_0 + \sum_{i=1}^{D} w_i \cdot t_i)}} \quad (2)$$
Similarly, we abbreviate this logistic regression model as g.
Equality ML predicate. Given a category label c, the equality ML predicate can be written in an SQL+ML query as “$g(\mathbf{t}) = c$”.
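As a minimal sketch, the two inference functions of Eqs. (1) and (2) can be written as follows (the weights and tuple values are hypothetical):

```python
import math

def linear_infer(w, t):
    """Eq. (1): f(t) = w0 + sum_i w_i * t_i."""
    return w[0] + sum(wi * ti for wi, ti in zip(w[1:], t))

def logistic_infer(w, t):
    """Eq. (2): g(t) = 1 / (1 + e^{-f(t)})."""
    return 1.0 / (1.0 + math.exp(-linear_infer(w, t)))

# A range ML predicate "f(t) < v" evaluates per tuple as:
w, t, v = [0.5, 2.0, -1.0], [3.0, 1.5], 6.0
assert linear_infer(w, t) < v          # 0.5 + 2.0*3.0 - 1.0*1.5 = 5.0 < 6.0
```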
Decision tree inference. A decision tree model h is typically structured as a tree. Each internal node tests a feature condition (a one-column predicate) on the tuple $\mathbf{t}$; the traversal turns left if the condition is satisfied and turns right otherwise. Each leaf node represents an inferred value. The inferred value of model h for tuple $\mathbf{t}$, denoted $h(\mathbf{t})$, is obtained by traversing the tree from the root to a leaf. The inferred value is one of the unique decision labels, each of which can be reached through various paths in the tree.
Similarly, the equality ML predicate for a decision tree can be written as “$h(\mathbf{t}) = c$”.
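A minimal traversal sketch of decision tree inference follows; the left-branch convention (turn left when t[col] <= thresh holds) is an assumption for illustration:

```python
class Node:
    def __init__(self, col=None, thresh=None, left=None, right=None, label=None):
        self.col, self.thresh = col, thresh     # internal node: test t[col] <= thresh
        self.left, self.right = left, right
        self.label = label                      # leaf node: inferred decision label

def tree_infer(node, t):
    """h(t): traverse from the root to a leaf, turning left if the condition holds."""
    while node.label is None:
        node = node.left if t[node.col] <= node.thresh else node.right
    return node.label

# Equality ML predicate "h(t) = c" on a one-feature tree:
root = Node(col=0, thresh=5.0, left=Node(label="A"), right=Node(label="B"))
assert tree_infer(root, [3.0]) == "A"
```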
SQL predicate generation. Given an SQL+ML query Q, SQL predicate generation aims to generate a set of valid SQL predicates such that adding them into Q keeps it equivalent to the original query.
Atomic predicate generation
An ML predicate is a condition that combines columns from multiple tables, whereas an atomic predicate, which involves only one column from a single table, represents a necessary but not sufficient (less restrictive) condition. We first discuss linear models and then move on to tree models. We assume each table maintains only one relevant column; the extension to multiple columns is straightforward.
Linear models. To derive the atomic necessary conditions, it is necessary to bound the contribution of the columns in other tables with additional conditions. Thus, we propose incorporating the min/max ranges of all columns from the DBMS catalogs to create a combined inequality. DBMS catalogs generally store up-to-date statistics for column values. Theorem 1 serves as an example of the derivation.
Theorem 1
(Atomic Necessary Condition for Linear Models) When an SQL+ML query includes a range ML predicate as "
For an equality ML predicate
The model weights can be either positive ($w_i > 0$) or negative ($w_i < 0$), leading to different derivations in Theorem 1. However, for the sake of simplicity in further discussions, we treat each negative weight as logically positive via the transformation $w_i \cdot t_i = (-w_i) \cdot (-t_i)$: the column is negated so that its weight becomes positive, and its min/max bounds are swapped accordingly (e.g., $\min(-t_i) = -\max(t_i)$). By doing so, we only need to consider situations where all weights are positive. Subsequently, we present the atomic necessary conditions for the ML predicates.
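To make the derivation concrete, here is a sketch of a Theorem-1-style atomic bound for a range ML predicate f(t) < v, under the all-positive-weight convention just described (the notation and helper names are ours, not the paper's):

```python
def atomic_upper_bounds(w, mins, v):
    """For f(t) = w0 + sum_i w_i * t_i < v with all w_i > 0, lower-bound every
    other term by w_j * min(t_j) (from the DBMS catalog) and solve for t_i,
    yielding one atomic necessary condition t_i < b_i per column."""
    D = len(mins)
    bounds = []
    for i in range(D):
        rest = sum(w[j + 1] * mins[j] for j in range(D) if j != i)
        bounds.append((v - w[0] - rest) / w[i + 1])
    return bounds  # each b_i yields the atomic predicate "t_i < b_i"

# e.g., w = [0.5, 2.0, 1.0], mins = [0.0, 1.0], v = 6.0:
#   t_0 < (6.0 - 0.5 - 1.0) / 2.0 = 2.25 and t_1 < (6.0 - 0.5 - 0.0) / 1.0 = 5.5
assert atomic_upper_bounds([0.5, 2.0, 1.0], [0.0, 1.0], 6.0) == [2.25, 5.5]
```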
Decision tree. Each branch along a path in the decision tree represents a condition. Within the conditions of a path, a column may appear multiple times or not at all. By intersecting multiple occurrences and using the unrestricted range $(-\infty, +\infty)$ for omitted columns, we transform the path conditions into D distinct necessary conditions.
Theorem 2
(Atomic Necessary Condition for Decision Tree) When an SQL+ML query includes an ML predicate as “
Atomic necessary conditions. A single SQL+ML query can include multiple ML predicates. For example, “
Take a linear regression model in which all weights are positive as an example. The necessary conditions are computed using the minimum column values to determine the upper bounds. Subsequently, if these derived upper bounds are used instead of the original maximum values to compute the necessary conditions again, the lower bounds of the column values become tighter. Hence, we propose Algorithm 1 to leverage this effect.
Furthermore, the initialization in Algorithm 1 can use existing predicates in SQL+ML query. For example, the lower bound value of column
If a column appears in two different decision tree models, the two generated sets of atomic necessary conditions can also reinforce each other. Taking “
Algorithm 1 Atomic Necessary Conditions for Linear Regression
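A minimal sketch of the tightening loop that Algorithm 1 performs, under the assumption of a two-sided range v_low < f(t) < v_high, all-positive weights, and a fixed number of refinement rounds (the real algorithm's stopping rule may differ):

```python
def tighten_bounds(w, lows, highs, v_low, v_high, rounds=3):
    """Alternately re-derive each column's upper bound from the current lower
    bounds and its lower bound from the current upper bounds; every refined
    bound makes the next derivation tighter."""
    D = len(lows)
    for _ in range(rounds):
        for i in range(D):
            rest_min = sum(w[j + 1] * lows[j] for j in range(D) if j != i)
            rest_max = sum(w[j + 1] * highs[j] for j in range(D) if j != i)
            highs[i] = min(highs[i], (v_high - w[0] - rest_min) / w[i + 1])
            lows[i] = max(lows[i], (v_low - w[0] - rest_max) / w[i + 1])
    return lows, highs
```

As described above, the initial lows/highs can come either from the DBMS catalogs or from existing SQL predicates in the query.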
Atomic predicate generation. From an ML predicate and the column ranges, a set of atomic necessary conditions can be derived. In this set, each atomic necessary condition includes one column, a comparison operator, and a value bound, such as $t_i < b_i$, which can be used to generate one atomic predicate as “
Composite predicate generation
Similarly, before generating composite predicates, it is crucial to first derive the composite necessary conditions for ML predicates.
Theorem 3
(Composite Necessary Condition for Linear Models) When an SQL+ML includes a range ML predicate as “
When the subset contains only one column, the composite necessary condition degenerates to the atomic necessary condition in Theorem 1.
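A sketch of the corresponding composite bound for linear models, parameterized by the column subset S (again under the all-positive-weight convention; names are ours):

```python
def composite_upper_bound(w, mins, v, subset):
    """Keep the columns in `subset` symbolic and lower-bound the excluded terms
    by their catalog minimums, yielding  sum_{i in S} w_i * t_i < b_S."""
    D = len(mins)
    rest = sum(w[j + 1] * mins[j] for j in range(D) if j not in subset)
    return v - w[0] - rest   # the bound b_S of the composite predicate
```

For a singleton subset this reduces to the atomic bound above (after dividing by w_i), while larger subsets exclude fewer minimum terms and therefore give tighter conditions.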
Theorem 4
(Composite Necessary Condition for Decision Tree) When an SQL+ML includes an ML predicate as “
For decision trees, the generated atomic and composite predicates can be optimized further. Columns omitted in specific paths can be excluded, as they do not contribute to pruning, and repeated conditions across paths need not be evaluated multiple times. For larger decision trees, the generation time may increase; feasibility is maintained through optimization strategies such as consolidating conditions on the same feature across paths and merging paths with identical labels during the preprocessing phase. Then, during real-time optimization, we can quickly check whether all paths leading to the label share conditions on a particular feature before deriving the relevant predicates, which accelerates the process. Bitwise operations and modern compilers facilitate these optimizations. However, not all features in a decision tree can generate useful predicates: only features that appear in all paths leading to the query label condition can produce meaningful predicates. Also, the predicate on a column or a column set needs to be the disjunction of the conditions along all relevant paths, which makes the predicate filter less selective. For more complex models like random forests with thousands of trees, the generation time could become large, but its overhead is still negligible compared with the execution time. We leave optimizing the overhead for complex models as future work.
Complex transformation. When a column undergoes a transformation such as a square, logarithmic, or exponential function, we can derive the additional conditions of the transformed column by leveraging the min/max values of the original column. The necessary conditions can then be further derived by exploiting the range of the transformed feature, as in simpler models. When generating atomic predicates for them, e.g., $\phi(t_i) < b$, if the inverse function $\phi^{-1}$ is straightforward to compute, the atomic predicate can be transformed to $t_i < \phi^{-1}(b)$ (for a monotonically increasing $\phi$). If the inverse function cannot be explicitly derived, we can retain expressions like $\phi(t_i) < b$. This retention does not impose a significant computational overhead because $\phi(t_i)$ needs to be calculated even without our approach.
Supporting categorical values. Typically, categorical features are transformed into numerical ones using techniques like one-hot encoding. Our method’s principles and pipeline apply directly to these encoded features, and the characteristics of one-hot encoding further tighten the generated SQL predicates.
On the one hand, since these encoded features originally come from the same column, they can be treated as multiple numerical columns and combined to form one composite predicate. Therefore, the generation process is similar to that for numerical columns. For instance, consider a dataset with two features: “Color” (a categorical feature) and “Size” (a numerical feature). After applying one-hot encoding to the “Color” feature, it is split into three binary features: “Color_Red”, “Color_Green”, and “Color_Blue”, with corresponding weights of 2, 3, and 1, respectively. In our method, using min(Size) and max(Size) will derive a necessary condition about “Color”, e.g., “2*Color_Red + 3*Color_Green + 1*Color_Blue < 1.5”, which can be simplified into an atomic predicate “Color = Blue”.
On the other hand, by considering the nature of one-hot encoding, we can generate tighter additional conditions than the plain min/max ranges. In the same instance, when deriving an atomic predicate for “Size”, the additional conditions are “1 ≤ 2*Color_Red + 3*Color_Green + 1*Color_Blue ≤ 3”, corresponding to the min and max weight values of “Color” (since exactly one indicator is 1). These additional conditions are tighter than treating each color feature independently, where the conditions might range from min(Color) = 0 to max(Color) = 6 = 2+3+1.
Inference rewrite
To generate SQL predicates from a given ML predicate in the SQL+ML query, the inference rewrite module initially derives a set of atomic necessary conditions from Theorem 1 (or Theorem 2) and refines them further through Algorithm 1. Each atomic necessary condition is then used to generate a corresponding atomic predicate. Finally, the inference rewrite module rewrites the query tree by appending the generated atomic predicates to the scan nodes.
For some queries, generating predicates could take longer than actually executing the model. We can use the cost model to predict the execution cost: if the execution cost is high, we apply our method to optimize the query; otherwise, we execute the query directly. Note that as the data size grows, the execution time increases proportionally, while the generation time remains a one-time cost during the query optimization phase. This means that the relative impact of the generation overhead decreases with larger datasets, underscoring the efficiency of our method in big-data scenarios.
Progressive inference for composite predicates
Fig. 4 Example of the four steps for composite predicate selection. Step 1 ① adds all candidate predicates to the root. Step 2 ② pushes down each predicate into the per-node predicate sets. Step 3 ③ selects the strongest predicates from each set based on Lemma 1 and removes the others. Step 4 ④ reconstructs the selected predicates using Principle 3 for progressive computation, where E(P) denotes the expression of the reused result
An ML predicate may contain multiple tables, e.g., D columns from D tables, and any subset of these tables can generate a valid composite predicate.4 Thus, the total number of valid SQL predicates grows exponentially, specifically $2^D - 1$, which includes D atomic predicates and $2^D - D - 1$ composite predicates.5 It is rather expensive to utilize all composite predicates blindly because of (1) their exponential count, leading to increased execution overhead, and (2) pruning redundancy across different composite predicates, which results in redundant computational costs. Consequently, selecting which composite predicates to retain becomes crucial in minimizing computation redundancy.
Composite predicate correlation
Composite predicate selection must adhere to three principles to minimize computation redundancy.
Principle 1. A composite predicate must be executed after joining the corresponding tables. The execution of a composite predicate must occur after joining all tables involved in the predicate. For example, a composite predicate “
Principle 2. A sub-predicate must appear before its super-predicate. Given two generated predicates $P_1$ and $P_2$ from the same ML predicate, we denote $P_1 \prec P_2$ ($P_1$ is a sub-predicate of $P_2$ and $P_2$ is a super-predicate of $P_1$) if each column in $P_1$ appears in $P_2$. For example, the composite predicate “
Lemma 1
Given two predicates $P_1 \prec P_2$ derived from Theorems 1 and 3 for linear models or Theorems 2 and 4 for decision trees of the SQL+ML query Q, any tuple pruned by $P_1$ must be pruned by $P_2$.
In this way, considering $P_1 \prec P_2$,6 if $P_1$ appears at the same or a higher node in the query plan tree than $P_2$, then $P_1$ is redundant, as $P_1$ cannot prune more tuples than $P_2$. So the principle is that predicate $P_1$ must appear before (lower in the query plan tree than) $P_2$ if $P_1 \prec P_2$.
Principle 3. A super-predicate can reuse the result of its sub-predicate. Considering two predicates $P_1 \prec P_2$, if $P_1$ has already been computed in a former (deeper) join operator, we do not need to compute $P_2$ from scratch. Instead, we can compute $P_2$ progressively by reusing the computed results of $P_1$. For example, computing “
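A two-table sketch of this reuse (names and weights hypothetical): the sub-predicate on T1 emits its partial expression E(P1) through the target list, and the super-predicate after the join only adds the new column's term:

```python
w = [0.5, 2.0, 1.0]                 # hypothetical weights [w0, w1, w2]

def eval_p1(t1, b1):
    """Sub-predicate P1 on table T1; E(P1) is appended to T1's target list."""
    e_p1 = w[1] * t1
    return e_p1 < b1, e_p1

def eval_p12(e_p1, t2, b12):
    """Super-predicate P12 after T1 joins T2; reuses E(P1), adds only w2 * t2."""
    return e_p1 + w[2] * t2 < b12
```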
Composite predicate selection
Composite predicate selection. Composite predicate selection aims to select high-quality composite predicates by removing the redundant predicates and satisfying the three principles. We develop a heuristic method to address the composite predicate selection problem, comprising four primary steps and Fig. 4 shows a running example.
Step 1: Generate the set of all valid composite predicates and insert them into the predicate list of the root node in the query plan tree.
Step 2: Push down each composite predicate as low as possible, adhering to Principle 1, with the objective of pruning the maximum number of intermediate tuples.
Step 3: At each node, among predicates with sub/super-predicate relationships, keep the super-predicates and remove the sub-predicates based on Principle 2, thereby allowing for the most effective tuple pruning.
Step 4: Progressively infer to minimize computation redundancy among the remaining predicates based on Principle 3. By adding the expression of the sub-predicate into the target list,7 the super-predicate can reuse the previous result to infer each tuple progressively.
Algorithm analysis. Composite predicate selection selects a bounded number of composite predicates, where D denotes the number of tables in the ML predicate. It aims to prune as many tuples as possible on the query plan tree while minimizing redundant computations among the generated composite predicates. Despite the number of candidate composite predicates being exponential in D, Algorithm 2 merges generation and selection to optimize the process.
The input of the recursive function CPS is a node of the plan tree and an initially empty set, i.e., CPS(Root, S = ∅). As the function executes, this set is populated with the selected composite predicates. Lines 4-5 generate the composite predicates with maximum pruning ability, and Lines 8-11 progressively infer ML predicates by passing up the intermediate results. The optimized algorithm yields a computational complexity linear in N, where N represents the total number of operator nodes in the query plan tree.
Algorithm 2 Composite Predicate Selection CPS(Node, S)
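A compact sketch of the CPS recursion on a binary plan tree (the dictionary-based nodes and the one-predicate-per-column-subset policy are simplifying assumptions of ours):

```python
def cps(node, ml_cols, selected):
    """Recurse bottom-up; at each node keep only the strongest predicate, i.e.,
    the one covering all ML-predicate columns available so far (Lemma 1 /
    Principle 2), whose partial expression is then reused above (Principle 3)."""
    if node["kind"] == "scan":
        avail = set(node["columns"])
    else:
        avail = (cps(node["left"], ml_cols, selected)
                 | cps(node["right"], ml_cols, selected))
    covered = frozenset(avail & ml_cols)
    if covered and covered not in {p for _, p in selected}:
        selected.append((node["name"], covered))
    return avail

# Hypothetical plan (T1 join T2) join T3 with ML columns {a, b, c}:
plan = {"kind": "join", "name": "J2",
        "left": {"kind": "join", "name": "J1",
                 "left": {"kind": "scan", "name": "T1", "columns": ["a"]},
                 "right": {"kind": "scan", "name": "T2", "columns": ["b"]}},
        "right": {"kind": "scan", "name": "T3", "columns": ["c"]}}
selected = []
cps(plan, {"a", "b", "c"}, selected)
# selected now holds the atomic predicates at the scans, the composite {a, b}
# at J1, and the full composite {a, b, c} at J2
```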
Cost-optimal inference for composite predicates placement
For each composite predicate, there can be multiple valid positions in the query plan tree, from the lowest position (the join node that first combines all tables in the composite predicate) to the highest position (just below the lowest position of its super-predicate). As an example, the predicate node in the left tree in Fig. 5 has two potential positions. Placing a predicate at different positions leads to different costs: the computation cost of executing the predicate, and the number of pruned tuples, which saves cost for the subsequent (higher) operators in the query plan tree. Thus it is important to place each predicate at an appropriate position in the query plan tree in order to minimize the total execution cost.
Modern optimizers typically push predicates as low as possible in the execution tree. This approach assumes that the computation cost of SQL predicates is low enough to be ignored, so pushing down introduces no overhead, and pruning even a little data brings benefits. This assumption is valid in many traditional scenarios. However, in our method, the additional SQL predicates are generated from the ML predicate, which introduces a unique challenge. Firstly, the computation costs of these predicates can be significant, particularly when they involve complex operations such as multiplication, exponentiation, or logarithms. Secondly, these predicates have overlapping or inclusive relationships in terms of the data they filter. Therefore, the placements of these related SQL predicates impact each other's benefits.
To address these challenges, we develop a novel cost model and placement strategy that optimize the positions of these additional SQL predicates based on their sub-super relations, relative calculation costs and position-aware benefits. While pushing predicates as low as possible in the execution plan tree is a common strategy, this approach could result in minimal data filtering but substantial computational overhead in some cases. In such cases, our cost-optimal placement strategy can pull up the predicates higher in the execution plan tree to effectively avoid unnecessary calculation overhead.
Cost model
We first focus on the cost of computing the composite predicates themselves, and then on the other operators affected by them.
The cost of composite predicate. The cost of a predicate is determined by both (1) the computational cost for each tuple and (2) the total number of tuples to evaluate.
Firstly, in terms of the computational cost of a predicate for each tuple, a direct method is to evaluate the expression of the predicate itself. However, if some sub-predicate lies below the predicate in the plan, progressive inference computes the predicate incrementally. Considering $P_1 \prec P_2$ from Sect. 4.1, the calculation of predicate $P_2$ can be computed incrementally from the reused expression $E(P_1)$ when $P_1$ is the closest predicate to $P_2$.8
Definition 1
(Closest predicate) $P_1$ is the closest predicate to $P_2$ if and only if (1) $P_1$ and $P_2$ are both derived from Theorems 1-4, and (2) $P_1$ is placed closest to $P_2$ in the plan tree, meaning no other generated predicate exists between them.
Secondly, in terms of the number of input tuples for a predicate, when a predicate P is located at a position (operator) O, the input rows of P are equal to the output rows from O. We denote the size of the output rows from operator O as $rows(O, P_1)$, where $P_1$ is the closest predicate to P.9 The output rows from operator O are affected by the presence of $P_1$, but they are independent of the location of $P_1$ due to the previous insights, as well as independent of the presence and location of other predicates. However, computing this rows function requires an estimation of the selectivity of the predicates. We propose to obtain the row count in three steps:
Without any generated predicates in the plan tree, the size of the output rows of O is denoted as $rows(O, \emptyset)$ and can be estimated by the DB modules;
The selectivity of the generated predicate $P_1$ is denoted as $sel(P_1)$, which can be estimated based on the histograms of the relevant columns or more advanced methods [16];
Under the assumption of independence between operators, with $P_1$ inserted below O, $rows(O, P_1) = sel(P_1) \times rows(O, \emptyset)$.
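These three steps amount to the following one-liner (a sketch; the selectivity itself still comes from histograms or learned estimators):

```python
def rows(base_rows, sel_p1=None):
    """rows(O, P1): estimated output cardinality of operator O when the closest
    generated predicate P1 is inserted below it."""
    if sel_p1 is None:               # Step 1: rows(O, ∅) from the DB estimator
        return base_rows
    return sel_p1 * base_rows        # Steps 2-3: scale by sel(P1), independence assumed

# e.g., a join estimated at 1M rows shrinks to 50K under sel(P1) = 0.05
assert rows(1_000_000, 0.05) == 50_000.0
```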
As a result, our method effectively manages the risks associated with less precise cardinality estimation (CE). Specifically, for atomic predicates with reliable CE, we push them down prior to determining the join order. This ensures that the chosen join order is an improvement over the original plan without predicates. For composite predicates, where CE may be less accurate, we postpone the pushdown decision until after the join order is established. This helps avoid suboptimal join orders due to inaccurate CE. Even in worst-case scenarios, our optimization aims to avoid any significant overall degradation compared to the original plan without predicates. We acknowledge that accurately estimating the cardinality of composite predicates remains a challenge. As CE techniques improve, especially for composite predicates, we plan to incorporate the enhanced CE results directly into our join order selection process to further improve query performance.
Fig. 5 The left tree shows an example of inserting a composite predicate. T nodes represent scan nodes and O nodes represent other operators; the inserted predicate includes columns from two tables and is placed on an operator node. The right tree demonstrates how the cost accumulates from three components: the first is the subtree cost rooted at the closest sub-predicate, the second is the sum of the costs of the operators in between, and the third is the cost of the predicate itself on the top
Composite predicate placement
The optimization objective of composite predicate placement. We aim to place the composite predicates in the query plan tree to minimize the total execution cost. The above analysis leads to the conclusion that the cost of both the original operator and the inserted composite predicate depends on the closest composite predicate inserted under them. Therefore, we define a cost function for the (sub)plan tree with the inserted composite predicate serving as the root, and offer a recursive formula to compute the cost beginning from the subtree led by the closest predicate.
Cost Optimization in a Bottom-Up Manner. To reuse the computed results of a subtree, we propose a bottom-up manner to minimize the cost. Specifically, we denote the cost function as $Cost(P, O)$, which represents the cost of the whole sub-plan tree rooted at the generated predicate P when P is placed on the operator O. Therefore, our goal is to first minimize $Cost(P^*, O)$, where $P^*$ includes all columns involved in the ML predicate and O is one of the possible positions of $P^*$. Based on these minimum values, the total execution cost of the whole plan tree can be calculated easily, and the set of composite predicate placements corresponding to the smallest total cost is the optimal output.
For two closest composite predicates $P_1 \prec P_2$, where $P_1$ is placed on $O_1$ and $P_2$ is placed on $O_2$, the total cost of the plan tree rooted at $P_2$ is the sum of (1) the sub-plan tree cost rooted at $P_1$, (2) the cost of the operators between $O_1$ and $O_2$, and (3) the cost of $P_2$ itself:
$$Cost(P_2, O_2) = Cost(P_1, O_1) + \sum_{O \in path(O_1, O_2)} cost(O) + cost(P_2)$$
where $path(O_1, O_2)$ represents the traversal of the operators between $O_1$ and $O_2$.10
The right tree in Fig. 5 gives an example: when $P_1$ is placed on $O_1$ and $P_2$ is placed on $O_2$, $Cost(P_2, O_2)$ accumulates the three components above.
Fig. 6 The left matrix represents one placement, with two predicates placed on operators and a third absent; the three arrows correspond to the accumulated cost components in the right tree of Fig. 5. The right matrix shows the possible transitions into a state, of which the left matrix is one
This illustrates the overlapping cost calculations within the subtrees of the whole query plan tree. Based on this, we can construct the optimal substructure for the overall cost of the optimized tree. We define the function $M(P_2, O_2)$ as the minimum $Cost(P_2, O_2)$ over the placements of $P_1$, where $P_1$ is the closest predicate to $P_2$. The transition function is:
$$M(P_2, O_2) = \min_{P_1, O_1} \Big\{ M(P_1, O_1) + \sum_{O \in path(O_1, O_2)} cost(O) \Big\} + cost(P_2) \quad (3)$$
where $O_1$ ranges from the lowest legal position of $P_1$ adhering to Principle 1 up to $O_2$. Specifically, we use $P_1 = \emptyset$ to denote the scenario where no composite predicate exists below $P_2$ during initialization.
We iterate through both the closest predicate and its position. This allows us to determine the minimum costs for all possible scenarios, from which we then return the smallest value. The right matrix in Fig. 6 gives an example of the transition using the state transition function: there are three choices for m (the closest predicate) and two choices for y (its position) if $P_2$ involves columns from three tables. The transition is then determined by taking the minimum.
Composite predicate placement. In Algorithm 3, we employ dynamic programming to optimize the placement of the composite predicates. It first initializes the storage and the boundary values (Lines 1-4). The core transition phase follows the transition function (Lines 5-18). Backtracking in Line 17, using the choices recorded in Line 13, yields the optimal positions for the composite predicates.
Algorithm 3 Composite Predicate Placement
Complexity analysis. Although Algorithm 3 exhibits a polynomial computational complexity, methodologies for its reduction are feasible. By pre-computing prefix sums of the operator costs, the value of any $\sum_{O \in path(O_1, O_2)} cost(O)$ can be computed in constant time. Furthermore, by utilizing additional storage space to eliminate the position variable y from the inner loop, the complexity can be reduced further.
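A simplified sketch of this dynamic program on a left-deep spine follows. It assumes predicates are pre-ordered by the sub/super relation, treats operator costs as fixed (the full cost model also re-scales them via rows(O, P)), and uses the prefix-sum trick from the complexity analysis:

```python
from itertools import accumulate

def place_predicates(op_costs, pred_costs, lowest_pos):
    """op_costs[o]: cost of spine operator O_o; pred_costs[p]: cost of predicate
    p; lowest_pos[p]: its lowest legal position (Principle 1), nondecreasing in
    p. M[p][o] is the minimum subtree cost with predicate p placed on O_o."""
    prefix = [0.0] + list(accumulate(op_costs))   # prefix[b] - prefix[a]: path cost
    n_ops, n_preds = len(op_costs), len(pred_costs)
    M = [[float("inf")] * n_ops for _ in range(n_preds)]
    for p in range(n_preds):
        for o in range(lowest_pos[p], n_ops):
            if p == 0:                            # P1 = ∅: no predicate below
                below = prefix[o + 1]
            else:                                 # Eq. (3): min over O_1 <= O_2
                below = min(M[p - 1][o1] + (prefix[o + 1] - prefix[o1 + 1])
                            for o1 in range(lowest_pos[p - 1], o + 1))
            M[p][o] = below + pred_costs[p]
    return min(M[-1])    # backtracking over the argmins recovers the positions
```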
Experimental results
This section presents an experimental study to evaluate our system
Table 1. Statistics of datasets, model information, and query set characteristics for 5 benchmarks
Dataset | Data size | #rel | Distribution | ML model | Inferred value | #feature | Workload | #query | #table |
|---|---|---|---|---|---|---|---|---|---|
IMDB | 5.1 GB | 21 | Real-world | Linear | Rating | 4 | JOB | 113 | 4-17 |
TPC-H | 0.1-30 GB | 8 | Synthetic | Linear | Price | 5 | TPC-H | 22 | 6-12 |
SSB | 10 GB | 5 | Synthetic | Logistic | Chosen or not | 4 | SSB | 13 | 5-6 |
Flight | 0.5 GB | 4 | Real-world | Logistic | Is delay | 5 | Delay | 10 | 4 |
IMDB | 5.1 GB | 21 | Real-world | Decision tree | Gross level | 6 | JOB-DT | 113 | 6-19 |
Experimental environment.
System integration. Database administrators (DBAs) can activate the proposed three ML modules individually by adding new configuration parameters to the DBMS, which is as simple as enabling nested-loop joins with enable_nestloop. The model and feature metadata are maintained in two system tables:
sys_model[model_name, model_category, weights/paths]
sys_feature[attribute_name, table_name, min, max]
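As a hedged illustration, a trained linear model and its feature statistics might populate these two tables as follows (all names and values are hypothetical):

```python
# Hypothetical catalog rows following the sys_model / sys_feature layout above.
sys_model = [
    # model_name,  model_category, weights/paths
    ("rating_lr",  "linear",       [0.5, 2.0, -1.0]),
]
sys_feature = [
    # attribute_name, table_name, min, max
    ("budget",        "movies",   0.0, 300.0),
    ("votes",         "ratings",  1.0, 2.0e6),
]
```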
SQL+ML benchmarks and baselines
To facilitate a fair comparison of the SQL+ML query processing efficiency of
Benchmarks
Datasets. From Table 1, the benchmarks include two real-world datasets: IMDB [18] and Flight [21]. Additionally, two synthetic datasets were generated from TPC-H [40] and SSB [11]. The IMDB dataset is widely used in both DB and ML fields. The Flight dataset is derived from airport and flight information. The TPC-H and SSB datasets are generated using the data generation procedures in the corresponding benchmark kits, and they have the same structure as the original datasets. In particular, the TPC-H dataset scales from 100MB to 30GB, allowing us to explore the scalability of
ML model training. The experiments are based on linear models and decision trees; the number of features/columns in each ML predicate is shown in Table 1 as #feature.
Fig. 7 Two SQL+ML example cases and a UDF example case. UDF is identical to the inlined inequality, due to the pull-up strategy of PostgreSQL. Model parameters are embedded in the function, but
Workload. Each SQL+ML query template was adapted from the corresponding original SQL query template in the JOB, TPC-H, and SSB benchmarks. The adaptations involved two parts: appending the relational tables with features to the FROM clause, and adding an ML predicate over those features to the WHERE clause.
Metric. To measure efficiency, we use the execution time obtained from the “Explain Analyze” tool in PostgreSQL. We executed each SQL+ML query three times and report the shortest execution time to mitigate the impact of cache memory. The generation and optimization overhead is included in the reported end-to-end times. The additional planning time added by
Baselines
PostgreSQL 12 + MADlib. MADlib [1] is a popular in-DB ML library that provides a user-friendly interface for inference functions with well-trained models stored in DB. When using MADlib to process SQL+ML queries, we modify each query by adding the model table to the "FROM" clause and replacing the inference function in ML predicates with the MADlib function.
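For reference, a query modified for this baseline might look as follows; the table and column names are hypothetical, and while madlib.linregr_predict is MADlib's linear-regression prediction function, the exact argument layout shown here is an assumption to be checked against the MADlib documentation:

```python
madlib_query = """
SELECT *
FROM movies m JOIN ratings r ON m.id = r.movie_id,
     lr_model mdl                                  -- model table added to FROM
WHERE madlib.linregr_predict(mdl.coef,
                             ARRAY[m.budget, r.votes]) < 4.0;
"""
```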
PostgreSQL14 + UDF. Inference functions can be created with UDFs, aligning with the database design philosophy. Additionally, the utility of UDFs can only be minimally optimized by the DB optimizer, such as through pushdown. Unlike MADlib, creating UDFs is an offline process that does not impose the requirement for users to design SQL+ML query templates.
PostgreSQL 14 + ConnectorX + Scikit-Learn. Scikit-Learn [29] is a popular ML library in Python used for traditional ML algorithms. ConnectorX [46] is a state-of-the-art data transformation channel that facilitates the transformation of relational data from DBMS to ML systems. Hence, we integrate these platforms with PostgreSQL 14.4 to implement the DB+ML strategy. First, PostgreSQL retrieves tuples using joins and selections, and then utilizes ConnectorX to fetch the tuples. ConnectorX enables us to handle result tuples without exporting them to external storage or reloading them. Instead, it manages the result tuples in memory, and controls the data transport to Scikit-Learn for ML predicates processing. However, this baseline has limited processing capability. If ML predicate is not the final operator in the query plan, the inferred results need to be transformed back into relational data and loaded into the DBMS. Unfortunately, ConnectorX is currently unable to handle this for the DB+ML strategy.
Table 2. End-to-end comparison of
Workload | DB+ML | MADlib | UDF | |
|---|---|---|---|---|
JOB(113) | 19634.8 s | 3642.4 s | 1384.0 s | 309.7 s |
TPCH(22) | — | 4782.3 s | 2591.6 s | 1521.7 s |
SSB(13) | — | 2800.6 s | 1619.7 s | 665.9 s |
Flight(10) | 466.6 s | 228.1 s | 185.6 s | 67.6 s |
DT(113) | — | 705.1 s | 1140.0 s | 144.3 s |
Fig. 8 Query-by-query speed-up of
End-to-end comparison
To compare the efficiency of
The experimental results are presented in Table 2. When comparing the total execution time horizontally,
Despite ConnectorX reducing the overhead of transferring data from a relational data model to a dataframe model, the overall latency still remains relatively high. This set of experiments demonstrates that extending the entire ML pipeline and using an additional processing flow to complete the ML predicates is disadvantageous. It is preferable to avoid exporting data from the DBMS and instead integrate the data processing pipeline inside DBMS.
By installing the MADlib library in Postgres for ML functions, SQL+ML query processing can be performed inside DBMS, eliminating the need for data copying. However, MADlib still has its own limitations. Firstly, in order to perform the inference, the DBMS needs to access the integrated interface of MADlib and uses the returned value from the function, which introduces additional overhead. Furthermore, potential type conversions may occur during this process, resulting in additional execution time. Therefore, when using UDF to declare inference in SQL+ML queries, the execution is faster compared to using MADlib.
By moving the ML process from outside the database to inside the database, we gain the opportunity to utilize specially designed algorithms for optimizing computation. In
Fig. 9 Query-by-query speed-up of
Query-by-query comparison
To assess the effectiveness of
We executed queries in JOB and TPC-H separately with MADlib and
Out of the 113 evaluated queries in JOB, the fastest query demonstrated an impressive speed-up of over , and more than half of the queries achieved a speed-up of over . In the JOB-DT set, one query achieved a speed-up, and 23 queries exhibited speed-ups exceeding . In the TPC-H experiment, among the 22 SQL+ML queries, the performance of 21 queries improved; notably, 8 queries achieved a speed-up greater than , and 3 queries experienced a speed-up exceeding . Queries 25b and 32b in JOB experienced a slight slowdown, but the impact was negligible. This can be attributed to the influence of various factors on the PostgreSQL query optimizer when generating query plans. As a result of inference rewrite, the added predicates are pushed down to the scan operators of the query plan tree by the DB optimizer. This substantially changes the overall query plan, and the DB optimizer's cardinality estimation, due to its relative inaccuracy, might then produce a plan with a slower execution speed.
The majority of queries experienced improved performance after applying
Even on uniformly generated data and SQL+ML queries adapted from the TPC-H query set,
From an end-to-end and query-wise experimental standpoint,
Fig. 10 Effects of ML predicate selectivity
Selectivity of ML predicate
The optimization efficacy of
The results are presented in Fig. 10. The horizontal axis represents different ranges of ML predicate selectivity. As the selectivity of the ML predicate decreases, the execution time of the MADlib group exhibits minor fluctuations within a narrow range. This behavior arises because the inference inequality is consistently positioned at the top of the query plan tree, so changes in the ML predicate do not significantly impact the main portion of query execution. Conversely, the UDF strategy shows a slight decline in execution time as the ML predicates are pushed down, since fewer tuples are processed by the operators above when the ML predicate selectivity decreases. The execution time of
As selectivity increases, the optimization effect gradually diminishes, and the performance approaches that of the baseline. However, our method does not introduce additional overhead at higher selectivities, such as above 20%. When selectivity is high, most of the data satisfies the ML predicate conditions. This leaves little opportunity for data pruning before the join, resulting in a smaller optimization benefit. The third technique proposed in this paper, the cost-optimal placement strategy, addresses this challenge by dynamically adjusting the positions of atomic and composite predicates. Based on the cost model, if an additional SQL predicate has little pruning effect but introduces overhead, it is moved up in the execution plan tree until it merges with its super-predicate, avoiding redundant calculations. For instance, when selectivity reaches 100%, all generated SQL predicates disappear, since they are moved up and merged into the original ML predicate. The final execution plan tree generated by our method is then identical to the original plan tree, ensuring no additional overhead.
Table 3. Execution time (ms) of some JOB queries to show the performance of atomic predicates and composite predicates
JOB query | 7c | 13b | 14c | 28b | 30c |
|---|---|---|---|---|---|
UDF | 2067.0 | 805.9 | 749.7 | 7060.0 | 1283.5 |
Atomic preds | 797.2 | 512.9 | 630.0 | 2049.8 | 823.5 |
Atomic+comp | 238.6 | 389.7 | 318.2 | 934.2 | 297.7 |
Fig. 11 Ablation study of
Fig. 12 Scalability on TPC-H. Queries are ordered by the speed-up
Ablation study
To examine the individual impact of the two kinds of predicates and the three proposed optimization techniques on query efficiency, we conducted ablation experiments on JOB. We first evaluate the importance of both atomic predicates and composite predicates, with comparison results for several queries listed in Table 3. Then, to evaluate the three proposed techniques, we split the experiments into two groups depending on whether inference rewrite was used. Each group includes three scenarios: using neither progressive inference nor cost-optimal inference, using progressive inference without cost-optimal inference, and using both. We report the total running time of the SQL+ML query set on JOB. The results are presented in Fig. 11.
The findings from Table 3 and Fig. 11 reveal that the atomic predicates employed by inference rewrite play a significant role in enhancing overall query performance. However, the performance improvement from the composite predicates of progressive inference and cost-optimal inference becomes less pronounced once the atomic predicates from inference rewrite are applied. This can be attributed to the fact that the tuples pruned by atomic predicates and composite predicates overlap, as Fig. 11 indicates. Meanwhile, as shown in Table 3, composite predicates remain vital, as they further reduce the execution time significantly. It is noteworthy that inference rewrite can change the plan tree structure through its impact on the query optimizer; these changes can have either positive or negative effects on query performance. In contrast, progressive inference and cost-optimal inference do not influence the structure.
Overall, our ablation experiments demonstrate that atomic predicates with inference rewrite bring noticeable performance improvements, and the composite predicates with progressive inference and cost-optimal inference can further enhance performance.
Scalability
To investigate the trends of benefits when I/O cost becomes the bottleneck factor, we varied the sizes of the dataset and executed the SQL+ML workloads of TPC-H. The results are depicted in Fig. 12, where the orange line represents the baseline running the UDF strategy, and the blue line represents
Our findings indicate that approximately two-thirds of the queries exhibited a significant performance improvement, while the remaining one-third did not demonstrate much improvement.
As the data set size increased,
In summary,
Related works
In-DB ML inference. Inference (or prediction) is becoming an official statement of extended SQL in many DBMS products, such as Redshift [5], BigQuery ML [13], SQL Server ML [36], MADlib [9, 12, 15, 45, 50] on PostgreSQL, and Spark MLlib [35, 42]. Recently, openGauss [32, 37, 52] also designed a PREDICT BY statement to support in-DB ML inference. However, in-DB optimization for inference predicates has not been widely studied.
Inference on ML platforms. The bottleneck comes from ML inference [19] when processing SQL+ML. Clipper [10], Willump [25, 26], and PRETZEL [30] provide inference optimization methods for ML systems, but none of them works for DBMS queries with inference predicates. Raven [23] designs an intermediate representation for prediction queries, then separates operators into a DB part and an ML part with extensive cross-optimizations [38]: the DB parts are compiled into a new SQL query processed by the DBMS, and the ML parts are executed by ML systems. For unstructured data like images and videos, PP [34], CORE [49], and FiGO [6] optimize inferred labels in video DBMSs specifically. Among them, PP generates simpler predicates from models as proxy models to rewrite queries. A recent work, ConnectorX [46], accelerates data loading from DBMSs to ML systems; however, how to transfer inferred results back to the DBMS to support complex applications is omitted. Thus, existing DB+ML strategies suffer from data copying and low performance, and their optimization techniques cannot be adapted to in-DB inference.
In-DB ML training. In-DB training acceleration has been studied [14, 48]. Lara [28], LaraDB [17], LevelHeaded [3], and Froid [41] explored how to co-optimize linear algebra and relational algebra for model training. FAQ-AI [24] represents model training tasks as additive inequalities in SQL, then rewrites the complex input query into multiple simple sub-queries. However, these techniques cannot be used for inference [7]: during training, complete computations are performed iteratively for each tuple, while inference computes only once and can terminate early based on query conditions.
Conclusion
We proposed
Acknowledgements
This paper was supported by National Key R&D Program of China (2023YFB4503600), National Science and Technology Major Project of China (CJGJZD20230724093403007), NSF of China (61925205, 62232009, 62102215), Zhongguancun Lab, Huawei, and Beijing National Research Center for Information Science and Technology (BNRist). Guoliang Li is the corresponding author.
1We will discuss how to generate such SQL predicates in Sects. 3.2 and 3.3.
2There is no need to consider $w_i = 0$ for inference since that means the corresponding column is useless.
3Supporting disjunctions is similar but more verbose, and we leave it as a future work.
4In scenarios where multiple columns are in the same table, we can introduce a new column as a composite of them to logically simplify the situation. When applied across multiple columns within the same table, these composite predicates can still be efficiently pushed down to the scan operator, enhancing data filtering before joins.
5We represent the composite predicate “
6One or more columns are in $P_2$ but not in $P_1$.
7In the query plan tree, each node has a target list to represent its output columns.
8The notation of is consistent with DB optimizer’s operator cost estimation.
9Note that if we only consider left-deep tree, there is only one for ; if we consider bushy tree, there may be multiple closest predicates. Without loss of generality, we consider the left-deep tree for ease of representation.
10For right leaf nodes, static costs like sequential scans are ignored, while linear changes, as in index scans, are added to join node costs.
11We use PostgreSQL 12.2 instead of 14.4 because MADlib supports PostgreSQL up to 12.
References
1. MADlib: MADlib 1.18.0 documentation (2023). https://madlib.apache.org/docs/v1.18.0/index.html
2. PostgreSQL: PostgreSQL 14.4 release notes (2023). https://www.postgresql.org/docs/release/14.4/
3. Aberger, C.R., Lamb, A., Olukotun, K., Ré, C.: Levelheaded: A unified engine for business intelligence and linear algebra querying. In: 34th IEEE international conference on data engineering, ICDE 2018, Paris, France, April 16-19 pp. 449–460. IEEE Computer Society (2018). https://doi.org/10.1109/ICDE.2018.00048
4. Agrawal, A., Chatterjee, R., Curino, C., Floratou, A., Godwal, N., Interlandi, M., Jindal, A., Karanasos, K., Krishnan, S., Kroth, B., Leeka, J., Park, K., Patel, H., Poppe, O., Psallidas, F., Ramakrishnan, R., Roy, A., Saur, K., Sen, R., Weimer, M., Wright, T., Zhu, Y.: Cloudy with high chance of DBMS: a 10-year prediction for enterprise-grade ML. In: 10th conference on innovative data systems research, CIDR 2020, Amsterdam, The Netherlands, January 12-15 Online Proceedings. www.cidrdb.org (2020). http://cidrdb.org/cidr2020/papers/p8-agrawal-cidr20.pdf
5. Amazon.com: Redshift ml (2023). https://aws.amazon.com/redshift/features/redshift-ml/
6. Cao, J., Sarkar, K., Hadidi, R., Arulraj, J., Kim, H.: Figo: Fine-grained query optimization in video analytics. In: Ives, Z.G., Bonifati, A., Abbadi, A.E., (eds.) SIGMOD ’22: International conference on management of data, Philadelphia, PA, USA, June 12 - 17 pp. 559–572. ACM (2022). https://doi.org/10.1145/3514221.3517857
7. Chai, C; Wang, J; Luo, Y; Niu, Z; Li, G. Data management for machine learning: a survey. IEEE Trans. Knowl. Data Eng.; 2023; 35,
8. Chen, L; Kumar, A; Naughton, JF; Patel, JM. Towards linear algebra over normalized data. Proc. VLDB Endow.; 2017; 10,
9. Cohen, J; Dolan, B; Dunlap, M; Hellerstein, JM; Welton, C. MAD skills: new analysis practices for big data. Proc. VLDB Endow.; 2009; 2,
10. Crankshaw, D., Wang, X., Zhou, G., Franklin, M.J., Gonzalez, J.E., Stoica, I.: Clipper: A low-latency online prediction serving system. In: Akella, A., Howell, J., (eds.) 14th USENIX symposium on networked systems design and implementation, NSDI 2017, Boston, MA, USA, March 27-29 pp. 613–627. USENIX Association (2017). https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw
11. eyalroz: Ssb kit (2023). https://github.com/eyalroz/ssb-dbgen
12. Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-rdbms analytics. In: Candan, K.S., Chen, Y., Snodgrass, R.T., Gravano, L., Fuxman, A., (eds.) Proceedings of the ACM SIGMOD international conference on management of data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24 pp. 325–336. ACM (2012). https://doi.org/10.1145/2213836.2213874
13. Google: Big query ml (2023). https://cloud.google.com/bigquery/docs/inference-overview
14. Guo, Y; Zhang, Z; Jiang, J; Wu, W; Zhang, C; Cui, B; Li, J. Model averaging in distributed machine learning: a case study with apache spark. VLDB J.; 2021; 30,
15. Hellerstein, JM; Ré, C; Schoppmann, F; Wang, DZ; Fratkin, E; Gorajek, A; Ng, KS; Welton, C; Feng, X; Li, K; Kumar, A. The madlib analytics library or MAD skills, the SQL. Proc. VLDB Endow.; 2012; 5,
16. Hu, X., Liu, Y., Xiu, H., Agarwal, P.K., Panigrahi, D., Roy, S., Yang, J.: Selectivity functions of range queries are learnable. In: Ives, Z.G., Bonifati, A., Abbadi, A.E., (eds.) SIGMOD ’22: international conference on management of data, Philadelphia, PA, USA, June 12 - 17 pp. 959–972. ACM (2022). https://doi.org/10.1145/3514221.3517896
17. Hutchison, D., Howe, B., Suciu, D.: Laradb: A minimalist kernel for linear and relational algebra computation. In: Afrati, F.N., Sroka, J., (eds.) Proceedings of the 4th ACM SIGMOD workshop on algorithms and systems for mapreduce and beyond, BeyondMR@SIGMOD 2017, Chicago, IL, USA, May 19 pp. 2:1–2:10. ACM (2017). https://doi.org/10.1145/3070607.3070608
18. IMDB: The csv files of imdb dataset (2023). http://homepages.cwi.nl/~boncz/job/imdb.tgz
19. Jassy, A.: Aws re:invent 2018 keynote (2023). https://www.youtube.com/watch?v=ZOIkOnW640A&t=5316s
20. Jr., J.F.R., Pépin, J.L., Goeuriot, L., Amer-Yahia, S.: An extensive investigation of machine learning techniques for sleep apnea screening. In: d’Aquin, M., Dietze, S., Hauff, C., Curry, E., Cudré-Mauroux, P., (eds.) CIKM ’20: The 29th ACM international conference on information and knowledge management, virtual event, Ireland, October 19-23 pp. 2709–2716. ACM (2020). https://doi.org/10.1145/3340531.3412686
21. Kaggle: Flight dataset (2023). https://www.kaggle.com/usdot/flight-delays
22. Kaggle: State of data science and machine learning 2021 (2023). https://www.kaggle.com/kaggle-survey-2021
23. Karanasos, K., Interlandi, M., Psallidas, F., Sen, R., Park, K., Popivanov, I., Xin, D., Nakandala, S., Krishnan, S., Weimer, M., Yu, Y., Ramakrishnan, R., Curino, C.: extending relational query processing with ML inference. In: 10th conference on innovative data systems research, CIDR 2020, Amsterdam, The Netherlands, January 12-15 Online Proceedings. www.cidrdb.org (2020). http://cidrdb.org/cidr2020/papers/p24-karanasos-cidr20.pdf
24. Khamis, MA; Curtin, RR; Moseley, B; Ngo, HQ; Nguyen, X; Olteanu, D; Schleich, M. Functional aggregate queries with additive inequalities. ACM Trans. Database Syst.; 2020; 45,
25. Kraft, P; Kang, D; Narayanan, D; Palkar, S; Bailis, P; Zaharia, M. A demonstration of willump: a statistically-aware end-to-end optimizer for machine learning inference. Proc. VLDB Endow.; 2020; 13,
26. Kraft, P., Kang, D., Narayanan, D., Palkar, S., Bailis, P., Zaharia, M.: Willump: A statistically-aware end-to-end optimizer for machine learning inference. In: Dhillon, I.S., Papailiopoulos, D.S., Sze, V., (eds.) Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4. mlsys.org (2020). https://proceedings.mlsys.org/book/297.pdf
27. Kumar, A., Naughton, J.F., Patel, J.M.: Learning generalized linear models over normalized data. In: Sellis, T.K., Davidson, S.B., Ives, Z.G., (eds.) Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4 pp. 1969–1984. ACM (2015). https://doi.org/10.1145/2723372.2723713
28. Kunft, A; Katsifodimos, A; Schelter, S; Breß, S; Rabl, T; Markl, V. An intermediate representation for optimizing machine learning pipelines. Proc. VLDB Endow.; 2019; 12,
29. scikit learn: scikit-learn 1.2.2 (2023). https://scikit-learn.org/1.2/
30. Lee, Y., Scolari, A., Chun, B., Santambrogio, M.D., Weimer, M., Interlandi, M.: PRETZEL: opening the black box of machine learning prediction serving systems. In: Arpaci-Dusseau, A.C., Voelker, G., (eds.) 13th USENIX symposium on operating systems design and implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10 pp. 611–626. USENIX Association (2018). https://www.usenix.org/conference/osdi18/presentation/lee
31. Leis, V; Gubichev, A; Mirchev, A; Boncz, PA; Kemper, A; Neumann, T. How good are query optimizers, really?. Proc. VLDB Endow.; 2015; 9,
32. Li, G; Zhou, X; Sun, J; Yu, X; Han, Y; Jin, L; Li, W; Wang, T; Li, S. opengauss: an autonomous database system. Proc. VLDB Endow.; 2021; 14,
33. Li, S., Chen, L., Kumar, A.: Enabling and optimizing non-linear feature interactions in factorized linear algebra. In: Boncz, P.A., Manegold, S., Ailamaki, A., Deshpande, A., Kraska, T., (eds.) Proceedings of the 2019 international conference on management of data, SIGMOD Conference, Amsterdam, The Netherlands, June 30 - July 5 pp. 1571–1588. ACM (2019). https://doi.org/10.1145/3299869.3319878
34. Lu, Y., Chowdhery, A., Kandula, S., Chaudhuri, S.: Accelerating machine learning inference with probabilistic predicates. In: Das, G., Jermaine, C.M., Bernstein, P.A., (eds.) Proceedings of the 2018 international conference on management of data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15 pp. 1493–1508. ACM (2018). https://doi.org/10.1145/3183713.3183751
35. Meng, X; Bradley, JK; Yavuz, B; Sparks, ER; Venkataraman, S; Liu, D; Freeman, J; Tsai, DB; Amde, M; Owen, S; Xin, D; Xin, R; Franklin, MJ; Zadeh, R; Zaharia, M; Talwalkar, A. Mllib Machine learning in apache spark. J. Mach. Learn. Res.; 2016; 17,
36. Microsoft: Sql server ml (2023). https://learn.microsoft.com/en-us/sql/t-sql/queries/predict-transact-sql?view=sql-server-ver16
37. openGauss: Native db4ai engine (2023). https://docs.opengauss.org/en/docs/5.0.0/docs/AIFeatureGuide/native-db4ai-engine.html
38. Park, K., Saur, K., Banda, D., Sen, R., Interlandi, M., Karanasos, K.: End-to-end optimization of machine learning prediction queries. In: Ives, Z.G., Bonifati, A., Abbadi, A.E., (eds.) SIGMOD ’22: international conference on management of data, Philadelphia, PA, USA, June 12 – 17 pp. 587–601. ACM (2022). https://doi.org/10.1145/3514221.3526141
39. Rahn, G.: Job kit (2023). https://github.com/gregrahn/join-order-benchmark
40. Rahn, G.: Tpc-h kit (2023). https://github.com/gregrahn/tpch-kit
41. Ramachandra, K; Park, K; Emani, KV; Halverson, A; Galindo-Legaria, CA; Cunningham, C. Froid: optimization of imperative programs in a relational database. Proc. VLDB Endow.; 2017; 11,
42. Spark, A.: Mllib (2023). https://spark.apache.org/docs/latest/ml-pipeline.html#estimators
43. SQLFlow: Sqlflow: extends sql to support ai (2023). https://sql-machine-learning.github.io
44. TPC: Tpc-h (2023). https://www.tpc.org/tpch/default5.asp
45. Wang, D.Z., Franklin, M.J., Garofalakis, M.N., Hellerstein, J.M., Wick, M.L.: Hybrid in-database inference for declarative information extraction. In: Sellis, T.K., Miller, R.J., Kementsietsidis, A., Velegrakis, Y., (eds.) In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12–16 pp. 517–528. ACM (2011). https://doi.org/10.1145/1989323.1989378
46. Wang, X; Wu, W; Wu, J; Chen, Y; Zrymiak, N; Qu, C; Flokas, L; Chow, G; Wang, J; Wang, T; Wu, E; Zhou, Q. Connectorx: accelerating data loading from databases to dataframes. Proc. VLDB Endow.; 2022; 15,
47. Wu, Y; Lentz, M; Zhuo, D; Lu, Y. Serving and optimizing machine learning workflows on heterogeneous infrastructures. Proc. VLDB Endow.; 2022; 16,
48. Xu, L., Qiu, S., Yuan, B., Jiang, J., Renggli, C., Gan, S., Kara, K., Li, G., Liu, J., Wu, W., Ye, J., Zhang, C.: In-database machine learning with corgipile: stochastic gradient descent without full data shuffle. In: Ives, Z.G., Bonifati, A., Abbadi, A.E., (eds.) SIGMOD ’22: international conference on management of data, Philadelphia, PA, USA, June 12 - -17, pp. 1286–1300. ACM (2022). https://doi.org/10.1145/3514221.3526150
49. Yang, Z., Wang, Z., Huang, Y., Lu, Y., Li, C., Wang, X.S.: Optimizing machine learning inference queries with correlative proxy models. Proc. VLDB Endow. 15(10), 2032–2044 (2022)
50. Zhang, Y., Kumar, A., McQuillan, F., Jayaram, N., Kak, N., Khanna, E., Kislal, O., Valdano, D.: Tech report of distributed deep learning on data systems: a comparative analysis of approaches. Tech. rep., Pivotal (now VMware) (2021). https://adalabucsd.github.io/papers/TR_2021_Cerebro-DS.pdf
51. Zhou, X; Chai, C; Li, G; Sun, J. Database meets artificial intelligence: a survey. IEEE Trans. Knowl. Data Eng.; 2022; 34,
52. Zhou, X; Jin, L; Sun, J; Zhao, X; Yu, X; Li, S; Wang, T; Li, K; Liu, L. Dbmind: a self-driving platform in opengauss. Proc. VLDB Endow.; 2021; 14,