
Abstract

Bayesian approaches to decision trees (DTs) using Markov Chain Monte Carlo (MCMC) samplers have recently demonstrated state-of-the-art accuracy when training DTs to solve classification problems. Despite the competitive classification accuracy, MCMC requires a potentially long runtime to converge. A widely used approach to reducing an algorithm's runtime is to employ modern multi-core computer architectures, either with shared memory (SM) or distributed memory (DM), and use parallel computing to accelerate the algorithm. However, the inherently sequential nature of MCMC makes it unsuitable for parallel implementation unless accuracy is sacrificed. This issue is particularly evident in DM architectures, which normally provide access to larger numbers of cores than SM. Sequential Monte Carlo (SMC) samplers are a parallel alternative to MCMC that do not trade off accuracy for parallelism. However, the performance of SMC samplers in the context of DTs is underexplored, and their parallelization is complicated by the challenges in parallelizing the bottleneck, namely redistribution, especially on variable-size data types such as DTs. In this work, we study the problem of parallelizing SMC in the context of DTs on both SM and DM. On both memory architectures, we show that the proposed parallelization strategies achieve asymptotically optimal $O(\log_2 N)$ time complexity. Numerical results are presented for a 32-core SM machine and a 256-core DM cluster. For both computer architectures, the experimental results show that our approach has comparable or better accuracy than MCMC but runs up to 51 times faster on SM and 640 times faster on DM. In this paper, we share the GitHub link to the source code.


1. Introduction

1.1. Motivation

Decision tree (DT) models are widely used algorithms when addressing classification problems in machine learning (ML). Their versatility spans numerous fields, including medicine [1], biology [2], chemistry [3], and engineering [4]. Although real-world applications often provide access to large datasets, this typically imposes significant challenges in terms of both accuracy and runtime performance.

In recent years, Markov Chain Monte Carlo (MCMC) methods have gained popularity as a Bayesian approach to DTs [5,6,7,8], often outperforming traditional classification algorithms such as decision forests in terms of accuracy [9,10]. Despite achieving state-of-the-art accuracy, MCMC is computationally demanding. This challenge is typically mitigated by leveraging parallel computing on modern multi-core computer architectures, such as shared memory (SM) and distributed memory (DM) systems. However, the inherently sequential nature of MCMC makes it difficult to efficiently address modern classification problems while maintaining both accuracy and speed. Consequently, a Bayesian alternative to MCMC for DTs that is both parallelizable and accurate would be highly beneficial in solving ML classification tasks.

1.2. Related Work

Bayesian DTs were first introduced to solve ML supervised learning problems in [11,12] and improved in [13,14]. The key idea is to generate N samples randomly (where each sample is a DT) in order to estimate the possible problem outcomes. This approach is widely used [15,16] due to its good performance in terms of classification accuracy, but it is implemented sequentially, i.e., without any use of parallel computing.

Each new sample in an MCMC chain is generated as a function of the previous one, which is why this approach is difficult to perform in parallel. There exist several attempts to parallelize a single MCMC chain. This approach was first discussed in [17], and more recent attempts can be found in [18,19]. Examples can also be found in the context of DTs [18,20,21]. These studies have confirmed that, even on DTs, parallel approaches for a single MCMC chain are strongly problem-specific, which limits their applicability, and the results often show little to no speed-up improvements as the number of cores increases. Another widely used parallelization strategy is to run multiple chains in parallel. This approach was first discussed in [22], and more recent implementations can be found in [23,24]. There are numerous examples of this method in the literature in the context of DTs [25,26]. However, this parallelization strategy trades off the accuracy in order to achieve good scaling and runtime performance. This is particularly pertinent in the context of modern DM architectures, which often provide access to a larger number of computing resources than SM.

Sequential Monte Carlo (SMC) samplers [27] represent an alternative class of Monte Carlo (MC) methods that can be applied in the same scenarios as MCMC. Their core idea involves combining importance sampling (IS) with resampling to create a population of N random samples. A key benefit of SMC is its ability to deliver competitive accuracy while being potentially parallelizable, as the samples are generated independently. However, effectively parallelizing SMC is challenging due to difficulties in parallelizing the runtime bottleneck, i.e., the redistribution step (a sub-task of resampling). While early attempts to parallelize redistribution did not show substantial speed-up [28,29], more recent work on DM [30,31] and on SM [32,33,34] has shown promising results. These parallelization methods are specifically designed for scenarios where all samples are of a fixed size. However, for DTs and other abstract data types (ADTs) like additive structures [35], samples can have variable sizes, making the techniques in [30,31,32,33,34] less directly applicable. In [36], a parallel SMC sampler for DTs on SM was introduced, but only the IS step was parallelized, leaving the resampling step sequential due to the complexities of redistributing variable-size samples in parallel. Consequently, this approach exhibited limited scalability. Another approach, found in [37], presents a sequential implementation of an SMC sampler for DTs. However, as shown in [38], this method achieves significantly lower accuracy compared to MCMC.

1.3. Contributions

In this paper, we build upon our previous work on parallel SMC samplers for DTs on SM architectures [39] and extend it as follows.

  • We further explore the same problem on DM architectures. More precisely, by using a different approach than the one presented in [39], this work proves that, on DM (as well as on SM), it is possible to design a parallelization algorithm for the bottleneck, i.e., the redistribution step, which works for any variable-size samples, including DTs, and achieves asymptotically optimal O(log2N) time complexity. To the best of our knowledge, this is also the first time that a conventional SMC sampler (i.e., based on IS and resampling) has been applied in the context of DTs and parallelized on a DM.

  • We then extend the numerical results with respect to those presented in [39]. More precisely, besides the results on SM for the three datasets considered in [39], we first add new numerical results on SM for a fourth dataset; we then present new results on a DM of 256 cores for the same four datasets. These additional results demonstrate that our proposed approach achieves roughly the same accuracy as MCMC but runs up to 51 times faster on SM and 640 times faster on DM, given the same problem size, or achieves significantly better accuracy for the same runtime.

To summarize, this work demonstrates that, independently of the type of memory architecture, it is always possible to parallelize SMC in the context of DTs with an optimal runtime and outperform alternative Bayesian approaches to DTs based on MCMC.

The remainder of this paper is organized as follows: to provide context, Section 2 gives brief details of SM and DM architectures, highlighting the pros and cons of each architecture. Section 3 briefly describes the topology of a DT. Section 4 describes SMC and MCMC, with a view to emphasizing their implementation details in the context of DTs. Section 5 outlines the algorithmic details of our approaches on SM and DM and analyzes the time complexity in both cases. Section 6 reports the numerical results on both SM and DM. Section 7 draws the conclusions and gives suggestions for future work.

2. Parallel Computer Architectures

SM and DM architectures are fundamentally different. In this section, we briefly highlight their advantages and disadvantages, as well as the differences between the two architectures.

In SM architectures, we have a single memory, which all the P computational resources, i.e., the cores, share and can access simultaneously. This is exemplified in Figure 1a. The main advantage of this architecture with respect to DM is that information can be shared between the cores simply by writing data to and reading data from shared memory locations, whose memory addresses need to be known by the cores. The main disadvantages are related to performance, for two reasons. First of all, the scalability is not ideal: as P increases, so does the memory contention when accessing the same memory locations. Addressing this requires the use of software synchronization points, e.g., barriers. The second disadvantage is that modern SM architectures do not provide large numbers of cores in comparison with DM architectures. In this work, we use OpenMP, one of the most commonly used parallel programming models for SM. OpenMP employs the #pragma omp parallel for directive to specify which for loops to run in parallel. This directive can be enriched with extra clauses, e.g., num_threads, to specify the number of threads to run in parallel, or reduction, to specify a parallel reduction operation occurring in the same for loop. Further details are omitted for brevity.

DM architectures are inherently different from SM architectures and consist of P machines interconnected by a common communication network. Each machine has one processor (or core—the terms in this paper are used interchangeably to make the terminology of distributed and shared memory architectures as consistent as possible) and a private memory. Therefore, each core can only directly access its own private memory, and the sharing of information between the cores happens through message exchange across the common communication network. Figure 1b illustrates this architecture. DM provides two advantages with respect to SM architectures. The first is that memory contention is not an issue for DM, as each core only has access to its private memory, meaning that the scalability is often better in memory-intensive applications. As previously mentioned, the second advantage is that DM architectures typically have more cores, which means that, for any given application, the maximum achievable speed-up on DM is typically higher than the same on SM. On the other hand, the two disadvantages are that the communication between the cores is naturally more intensive than on SM, and DM architectures are also more expensive to use and maintain, due to the more complex infrastructure required. In this work, we use the Message Passing Interface (MPI), arguably the most commonly used parallel programming model for DM. MPI provides routines, such as MPI_Send, MPI_Recv, or MPI_Sendrecv, to send and/or receive explicit messages (of a user-specified size) between a core i and a core j. Further details are omitted for brevity.
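To make the DM message-passing model above concrete, the following is a minimal mpi4py sketch (the library named later in Section 6) in which each core owns a private chunk of samples and explicitly exchanges it with a neighbour; the ring-shaped communication pattern is purely illustrative and is not the rotational scheme used later in Section 5.

```python
# Minimal mpi4py sketch of explicit message passing on DM. Run with, e.g.,
# `mpiexec -n 4 python dm_example.py`. The neighbour pattern is illustrative only.
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()
rank = comm.Get_rank()

# Each core holds its own private chunk of (placeholder) samples.
local_samples = [f"tree-{rank}-{i}" for i in range(2)]

# Counterpart of MPI_Sendrecv: send our chunk to the next core and
# receive the chunk of the previous core in a single call.
received = comm.sendrecv(local_samples, dest=(rank + 1) % P, source=(rank - 1) % P)
print(f"core {rank} received {received} from core {(rank - 1) % P}")
```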

We also note that modern supercomputers are actually hybrid memory architectures, consisting of several DM machines, each having several SM cores. In the analysis of DM in Section 6, we employ MPI across cores in different physical machines and across cores within the same physical machine, with a view to maximizing the number of DM cores. We acknowledge that, in such hardware, one could use hybrid programming models MPI+X, such as MPI+OpenMP. Exploring the performance of such an environment is, however, beyond the scope of this study, but it is certainly an interesting direction for future development.

3. Decision Trees

This section provides an overview of the structure of a decision tree (DT), its components, and its application in classification tasks. For a more detailed description, the reader is encouraged to read [40].

Consider a dataset containing l records, represented as $Y \in \mathbb{Z}^l$, alongside a corresponding feature matrix, $x \in \mathbb{R}^{l \times r}$. A DT model is trained to predict the classification of a record $Y_j$ based on the j-th row of the feature matrix $x$.

For a given tree T, let d(T) denote its depth, D(T) represent the set of m non-leaf nodes, and L(T) denote the set of leaf nodes. The tree T is characterized by a set of features k associated with the non-leaf nodes and a vector of thresholds c. A DT classifies data by traversing the tree. Starting at the root node, the model determines which child node to move to based on the features of the datum and the parameters of the node. This process is repeated at each non-leaf node until a leaf node is reached. At the leaf node, a classification result is produced. Figure 2 illustrates an example of a DT containing five non-leaf nodes and six leaf nodes.
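As an illustration of this traversal, the sketch below classifies one record with a hypothetical node representation (one feature index $k_j$ and threshold $c_j$ per non-leaf node, one class label per leaf); it is not the data structure used in the paper's code.

```python
# Hypothetical DT node: non-leaf nodes carry (feature, threshold), leaves carry a label.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.label = label

def classify(root, x_row):
    node = root
    while node.label is None:                       # repeat until a leaf node is reached
        if x_row[node.feature] <= node.threshold:   # compare the datum's feature with c_j
            node = node.left
        else:
            node = node.right
    return node.label

# A DT with one non-leaf node (k_0 = 1, c_0 = 0.5) and two leaves:
tree = Node(feature=1, threshold=0.5, left=Node(label=0), right=Node(label=1))
print(classify(tree, [0.2, 0.7]))                   # -> 1
```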

4. Bayesian Decision Trees

MC methods are widely used to estimate the state of a statistical model, which is defined by a posterior distribution (or simply the posterior). The posterior represents the probability density function of the state of the model given the observed data. The main idea is to generate a set of random samples that approximate the posterior, enabling the computation of various estimates. However, sampling directly from posterior distributions is often challenging. To address this, MCMC and SMC are two of the most commonly used Bayesian frameworks for the generation of samples when direct sampling is infeasible.

For Bayesian approaches to DTs, the goal is to estimate the true tree, T, along with its associated set of parameters, θ, based on the given data, Y, and its features, x, in a classification task. Specifically, we aim to sample from the posterior distribution $\pi(T, \theta \mid Y, x)$, which describes the probability of T and θ conditioned on Y and x.

This section outlines how MCMC and SMC can be applied to generate a population of DTs in this Bayesian context. For further details, refer to [11,12,27].

4.1. Markov Chain Monte Carlo

MCMC methods perform $N_T$ iterations, for $k = 0, 1, 2, \dots, N_T - 1$, and each iteration produces a sample, $t_k$, from the posterior distribution, $\pi(T, \theta \mid Y, x)$. Each sample includes a potential value for both T and θ. The initial sample, $t_0$, is drawn from an arbitrary initial proposal distribution (also called an initial proposal here, for brevity), $q_0(\cdot)$, which is chosen to be easy to sample from. After this, for every new iteration, a new sample is proposed by sampling from another proposal distribution (also called a proposal here, for brevity), $q(\cdot \mid t_{k-1})$, as follows:

(1) $t' \sim q(\cdot \mid t_{k-1}),$

where the proposal distribution is also arbitrary and is typically chosen to be convenient to sample from.

The newly proposed sample $t'$ is then either accepted or rejected based on the acceptance probability given by

(2) $a(t', t_{k-1}) = \min\left(1, \dfrac{\pi(t' \mid Y, x)}{\pi(t_{k-1} \mid Y, x)} \dfrac{q(t_{k-1} \mid t')}{q(t' \mid t_{k-1})}\right).$

In other words, the new sample, $t_k$, will be

(3) $t_k = \begin{cases} t' & \text{if } u \le a(t', t_{k-1}), \\ t_{k-1} & \text{if } u > a(t', t_{k-1}), \end{cases}$

where $u \sim \text{Uniform}[0,1]$.

After $N_T - 1$ iterations, we have a single chain of $N_T$ samples from which we can estimate T and θ. The performance in terms of classification accuracy is evaluated on a test set, and, for each tree, we compute a prediction; then, the overall accuracy is generated by either computing the mean of the predicted accuracy for each sample or by using majority voting. In this work, we use the mean. However, before doing so, an initial subset of the $N_T$ samples, typically $N_b = 0.5 N_T$, is discarded; this initial part of the process is called burn-in. This is performed to improve the quality of the estimation, since the first samples are likely to be generated before the algorithm has converged. Therefore, we expect this approach to provide competitive overall accuracy at the price of a potentially long runtime, as the samples are generated sequentially. In Section 6, we refer to this method (which is also illustrated in Algorithm 1) as single-chain MCMC.

Algorithm 1 Single-Chain MCMC
Input: $N_T$, $N_b$. Output: $t$
1: $t_0 \sim q_0(\cdot)$
2: for $k \leftarrow 1$; $k < N_T$; $k \leftarrow k + 1$ do
3:   $t' \sim q(\cdot \mid t_{k-1})$
4:   $a(t', t_{k-1}) \leftarrow \min\left(1, \frac{\pi(t' \mid Y, x)}{\pi(t_{k-1} \mid Y, x)} \frac{q(t_{k-1} \mid t')}{q(t' \mid t_{k-1})}\right)$
5:   $u \sim \text{Uniform}[0,1]$
6:   if $u \le a(t', t_{k-1})$ then
7:     $t_k \leftarrow t'$
8:   else
9:     $t_k \leftarrow t_{k-1}$
10:  end if
11: end for
12: Burn-in: discard the first $N_b$ samples in $t$
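For illustration, the accept/reject decision in lines 4-7 of Algorithm 1 is typically evaluated in log space to avoid numerical underflow; the sketch below shows this step only, with the log-densities left as placeholders for the DT posterior and proposal.

```python
import math
import random

# Sketch of the Metropolis-Hastings accept/reject step (Equations (2)-(3)) in log space.
def mh_accept(log_post_new, log_post_old, log_q_old_given_new, log_q_new_given_old):
    log_a = min(0.0, (log_post_new - log_post_old)
                + (log_q_old_given_new - log_q_new_given_old))
    u = random.random()                 # u ~ Uniform[0, 1]
    return math.log(u) <= log_a         # True: accept t', False: keep t_{k-1}
```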

This approach is inherently sequential, as each sample is generated as a function of the previous one. One could try to parallelize either (1) or (2), as performed in [20], for example. However, this approach is strongly problem-dependent and often results in poor scalability. Another alternative is to run N chains in parallel, each with K samples [21,26], such that the total number of samples is $N_T = N \times K$. In Section 6, we refer to this method (which is briefly summarized in Algorithm 2) as multi-chain MCMC. Since the chains are generated in an embarrassingly parallel fashion, the execution time of Algorithm 2 is $O(\frac{N}{P})$ when using P cores and varying N. The $N_T$ generated DTs can then be used to compute the overall accuracy, in the same way as for single-chain MCMC, i.e., by either computing the mean or the majority voting of the individual predictions generated by the DTs. As for single-chain MCMC, in this work, we use the mean for multi-chain MCMC as well. Therefore, for the same problem size, $N_T = N \times K$, we expect this approach to provide a competitive runtime at the price of potentially significantly lower overall accuracy compared with single-chain MCMC, because it consists of N embarrassingly parallel chains of $K \ll N_T$ samples each.

Algorithm 2 Multi-Chain MCMC
Input: $N$, $K$. Output: $t$
1: $N_b \leftarrow \frac{K}{2}$
2: for $i \leftarrow 0$; $i < N$; $i \leftarrow i + 1$ do in parallel
3:   $t^i_{0:K-N_b-1} \leftarrow$ Single-Chain MCMC($K$, $N_b$)
4: end for
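Because the N chains are independent, Algorithm 2 maps directly onto a process pool; the sketch below shows this embarrassingly parallel structure, with `single_chain_mcmc` standing in for Algorithm 1.

```python
from multiprocessing import Pool

# Placeholder for Algorithm 1: runs one chain of K samples with its own seed
# and returns the post-burn-in trees.
def single_chain_mcmc(seed, K=10):
    return [f"tree from chain {seed}"]          # illustrative output only

if __name__ == "__main__":
    N, P = 1024, 32                             # N chains spread over P worker processes
    with Pool(processes=P) as pool:
        chains = pool.map(single_chain_mcmc, range(N))
    all_trees = [t for chain in chains for t in chain]
    print(len(all_trees))
```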

4.2. Sequential Monte Carlo

The general idea behind SMC samplers is to perform sampling and resampling in sequence for K iterations, for $k = 0, 1, \dots, K - 1$, each of which will generate and update N independent, weighted samples, $t_k^i$, for $i = 0, 1, 2, \dots, N - 1$. The samples are first initialized from the initial proposal and weighted as follows:

(4a) $t_0^i \sim q_0(\cdot),$

(4b) $w_0^i = \dfrac{\pi(t_0^i \mid Y, x)}{q_0(t_0^i)}.$

Then, during each SMC iteration, for $k = 1, 2, \dots, K - 1$, the samples are updated from the proposal and weighted as follows:

(5a) $t_k^i \sim q(\cdot \mid t_{k-1}^i),$

(5b) $w_k^i = w_{k-1}^i \dfrac{\pi(t_k^i \mid Y, x)}{\pi(t_{k-1}^i \mid Y, x)} \dfrac{q(t_{k-1}^i \mid t_k^i)}{q(t_k^i \mid t_{k-1}^i)}.$

We note that Equation (5) is often referred to as the IS step.

The weights, $w_k^i$, for $i = 0, 1, \dots, N - 1$, are then normalized such that $\sum_i \tilde{w}_k^i = 1$, i.e., 100% of the total probability space. The normalized weights are then computed as follows:

(6) $\tilde{w}_k^i = \dfrac{w_k^i}{\sum_{j=0}^{N-1} w_k^j}, \quad i = 0, 1, 2, \dots, N - 1.$

Since the samples are generated from $q(\cdot \mid \cdot)$, and not directly from the posterior distribution of interest, this sampling strategy may suffer from a numerical error, called degeneracy, which could make most of the weights equal to 0, leading to poor estimations of the state. This problem is commonly tackled by using resampling when the effective sample size (ESS) of the weights,

(7) $N_{\text{eff}} = \dfrac{1}{\sum_{i=0}^{N-1} \left(\tilde{w}_k^i\right)^2},$

falls below a user-defined threshold, usually set at $\frac{N}{2}$. Resampling overwrites the samples with low weights with copies of the samples with high weights. Several variants of resampling can be found in the literature [41]. In this work, we use multinomial resampling, which performs three steps.

In the first step, we compute $ncopies \in \mathbb{N}_0^N$, where $\mathbb{N}_0$ is the set of non-negative integers. The i-th element, $ncopies_i$, indicates how many copies of the i-th sample, $t_k^i$, are necessary to keep. To do this, multinomial resampling first computes $cdf \in \mathbb{R}^N$, the cumulative sum (or prefix sum; the terms are used interchangeably) of $N\tilde{w}$, as follows:

(8) $cdf_i = N \sum_{j=0}^{i-1} \tilde{w}_k^j, \quad i = 0, 1, 2, \dots, N - 1.$

Then, each term $ncopies_i$ is computed by

(9) $ncopies_i = \left\lceil cdf_i + N \tilde{w}_k^i - u \right\rceil - \left\lceil cdf_i - u \right\rceil,$

where $u \sim \text{Uniform}[0,1]$, and $\lceil \cdot \rceil$ is the ceiling operator. From (8) and (9), it is relatively straightforward to infer that

(10a) $0 \le ncopies_i \le N,$

(10b) $\sum_{i=0}^{N-1} ncopies_i = N.$
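A compact NumPy sketch of Equations (6), (8), and (9) follows; the assertion checks property (10b). The weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
w = rng.random(N)                                   # unnormalized weights (placeholders)
w_tilde = w / w.sum()                               # Equation (6)

# Equation (8): cdf_i = N * sum_{j < i} w~_j  (exclusive prefix sum, scaled by N)
cdf = N * np.concatenate(([0.0], np.cumsum(w_tilde)[:-1]))

u = rng.uniform()                                   # u ~ Uniform[0, 1]
ncopies = (np.ceil(cdf + N * w_tilde - u) - np.ceil(cdf - u)).astype(int)   # Equation (9)

assert ncopies.sum() == N                           # Equation (10b)
print(ncopies)
```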

The second step is redistribution, which creates a new population of samples by duplicating each sample, tki, a total of ncopiesi times, meaning that those samples for which ncopiesi=0 will be deleted. Algorithm 3 illustrates a possible implementation of sequential redistribution (S-R).

Algorithm 3 Sequential Redistribution (S-R)
Input: $t$, $ncopies$, $N$. Output: $t_{new}$
1: $i \leftarrow 0$
2: for $j \leftarrow 0$; $j < N$; $j \leftarrow j + 1$ do
3:   for $copy \leftarrow 0$; $copy < ncopies_j$; $copy \leftarrow copy + 1$ do
4:     $t_{new,i} \leftarrow t_j$
5:     $i \leftarrow i + 1$
6:   end for
7: end for
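For reference, on a single core S-R is equivalent to NumPy's repeat; the explicit double loop in Algorithm 3 is spelled out because it is what each core later runs on its own chunk.

```python
import numpy as np

t = np.array(["A", "B", "C", "D"])       # stand-ins for four samples
ncopies = np.array([2, 0, 1, 1])
t_new = np.repeat(t, ncopies)            # ['A' 'A' 'C' 'D'], i.e., the output of Algorithm 3
print(t_new)
```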

In the third and final step of resampling, each weight, wi, is reset to 1/N.

After resampling, a new iteration starts from (5), and, after $K - 1$ iterations, the final samples are used to obtain estimates on a test set. More precisely, for each DT, we compute the classification accuracy, and then one could either use the mean or majority voting to calculate the overall accuracy. As described in Section 4.1 for MCMC, in this work, we compute the mean. Algorithm 4 summarizes the steps above. In line with the notation used for single-chain MCMC and multi-chain MCMC, we say that SMC has a total problem size equal to $N_T = N \times K$, since each of the N samples is updated K times. For the same problem size, we expect our experimental results to show that SMC achieves comparable accuracy to single-chain MCMC with a faster runtime due to its parallel nature. In the same settings, we expect our experimental results to show that SMC achieves significantly higher accuracy than multi-chain MCMC at the price of a longer runtime.

Algorithm 4 Sequential Monte Carlo Sampler
Input: $K$, $N$. Output: $t_k$
1: $t_0^i \sim q_0(\cdot)$, $i = 0, 1, \dots, N - 1$
2: $w_0^i \leftarrow \frac{1}{N}$, $i = 0, 1, \dots, N - 1$
3: for $k \leftarrow 1$; $k < K$; $k \leftarrow k + 1$ do
4:   $t_k^i \sim q(\cdot \mid t_{k-1}^i)$, $i = 0, 1, 2, \dots, N - 1$
5:   $w_k^i \leftarrow w_{k-1}^i \frac{\pi(t_k^i \mid Y, x)}{\pi(t_{k-1}^i \mid Y, x)} \frac{q(t_{k-1}^i \mid t_k^i)}{q(t_k^i \mid t_{k-1}^i)}$, $i = 0, 1, 2, \dots, N - 1$
6:   $\tilde{w}_k^i \leftarrow \frac{w_k^i}{\sum_{j=0}^{N-1} w_k^j}$, $i = 0, 1, 2, \dots, N - 1$
7:   $N_{\text{eff}} \leftarrow \frac{1}{\sum_{i=0}^{N-1} (\tilde{w}_k^i)^2}$
8:   if $N_{\text{eff}} < \frac{N}{2}$ then
9:     $cdf_i \leftarrow N \sum_{j=0}^{i-1} \tilde{w}_k^j$, $i = 0, 1, \dots, N - 1$
10:    $u \sim \text{Uniform}[0,1]$
11:    $ncopies_i \leftarrow \lceil cdf_i + N \tilde{w}_k^i - u \rceil - \lceil cdf_i - u \rceil$, $i = 0, 1, \dots, N - 1$
12:    $t_k \leftarrow$ Redistribution($t_k$, $ncopies$, $N$)
13:    $w_k^i \leftarrow \frac{1}{N}$, $i = 0, 1, \dots, N - 1$
14:  end if
15: end for

4.3. Posterior and Proposals for Bayesian Decision Trees

In this section, we briefly describe how to compute the posterior, the proposal, and the initial proposal in the specific case of DTs.

4.3.1. Posterior Distribution

The posterior is proportional (up to a normalization constant) to the product of the likelihood and the prior. In other words,

(11) $\pi(T, \theta \mid Y, x) \propto p(Y \mid T, \theta, x)\, p(\theta, T) = p(Y \mid T, \theta, x)\, p(\theta \mid T)\, p(T),$

where $p(Y \mid T, \theta, x)$ is the likelihood and $p(\theta, T)$ is the prior. More precisely, for each individual term in (11), we use the following expressions:

(12) $p(Y \mid T, \theta, x) = \prod_{j=0}^{l-1} p(Y_j \mid x_j, T, \theta),$

(13) $p(\theta \mid T) = \prod_{j=0}^{m-1} p(k_j \mid T)\, p(c_j \mid k_j, T),$

(14) $p(T) = \dfrac{\lambda^m}{(e^{\lambda} - 1)\, m!}.$

Using the Poisson distribution for the p(T) term is common practice in the literature [12,15,16]. As we can see, (14) only requires the tuning of the λ hyperparameter. Another valid expression can be found in [42,43]. This alternative, however, requires the tuning of two hyperparameters. Therefore, we employ (14) for simplicity.
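For concreteness, the prior term in Equation (14) can be evaluated in log space as follows; the value of λ below is purely illustrative.

```python
import math

# log of Equation (14): log p(T) = m*log(lambda) - log(e^lambda - 1) - log(m!)
def log_tree_prior(m, lam=1.0):
    return m * math.log(lam) - math.log(math.expm1(lam)) - math.lgamma(m + 1)

print(log_tree_prior(m=5, lam=1.0))      # log-prior of a tree with m = 5 non-leaf nodes
```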

4.3.2. Proposal Distribution

The proposal distribution, q(·|·), defines a set of four possible operations that either modify the structure of the DT and/or update its parameters. These operations are as follows.

  • Grow. This operation increases the dimensionality of the DT by randomly selecting a leaf node and attaching two child nodes to it.

  • Prune. This operation reduces the dimensionality of the DT by selecting a non-leaf node at random and removing its child nodes.

  • Change. This operation randomly selects a non-leaf node, j, and updates its feature, kj, and threshold, cj, without altering the tree’s dimensionality.

  • Swap. This operation randomly selects two non-leaf nodes, i and j, and exchanges their features, ki and kj, as well as their thresholds, ci and cj, keeping the tree’s dimensionality unchanged.

The probabilities of applying these operations, p(Grow), p(Prune), p(Change), and p(Swap), are user-specified and must sum to 1. A common choice is to assign equal probabilities of 25% to each operation. Details on the computation of q(·|·), used in (2) and (5b), can be found in [20] and are omitted here for brevity.
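Drawing one of the four operations with user-specified probabilities is straightforward; the sketch below uses the common 25% choice, with the tree moves themselves left as placeholders.

```python
import random

OPS = ["grow", "prune", "change", "swap"]
PROBS = [0.25, 0.25, 0.25, 0.25]          # p(Grow), p(Prune), p(Change), p(Swap); must sum to 1

def sample_operation(rng=random):
    # Select which move the proposal q(.|.) applies to the current tree.
    return rng.choices(OPS, weights=PROBS, k=1)[0]

print(sample_operation())
```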

4.3.3. Initial Proposal

The initial proposal distribution, q0(·), is defined by sampling from the prior distribution of the DT, p(T).

5. Theoretical Results

5.1. Algorithmic Details

In this section, we describe the parallelization on both SM and DM of all tasks in the SMC sampler, including our approach to parallelizing the redistribution step.

It is relatively straightforward to infer that both steps to initialize and update the samples and the weights, i.e., Equations (4) and (5), are embarrassingly parallel. On SM, as well as on DM, these steps can be parallelized by equally dividing the iteration space of the related for loops, for $i = 0, 1, \dots, N - 1$, across the P cores, such that each core, $id = 0, 1, \dots, P - 1$, works on a chunk of $n = \frac{N}{P}$ samples and weights with index $i = id \cdot n, id \cdot n + 1, id \cdot n + 2, \dots, (id+1) \cdot n - 1$. Therefore, these steps achieve $O(\frac{N}{P})$ time complexity. The same can be said about (9).

Equation (6) requires the computation of a vector sum. On both SM and DM, this operation is parallelizable by using reduction, which is well known to achieve $O(\frac{N}{P} + \log_2 P)$ time complexity.

The prefix sum in Equation (8) can also be performed in $O(\frac{N}{P} + \log_2 P)$ by using prefix reduction [44,45]. Once again, this is true for both SM and DM.

Thus far, it is clear that all steps in Algorithm 4, except Algorithm 3, achieve, at most, $O(\log_2 N)$ time complexity for $P \to N$. We note that the parallel overhead term, $\log_2 P$, will, in practice, limit the speed-up to less than P, but it becomes less important compared with $\frac{N}{P}$ as N increases. Algorithm 3 takes O(N) steps on a single core, as with all the other tasks in the SMC sampler when P=1. This is because the total number of samples to be copied in the redistribution step is equal to N, as Equation (10b) emphasizes. However, this algorithm is impossible to parallelize using embarrassingly parallel approaches. The main reason is that the workload associated with duplicating each sample, $t_i$, a total of $ncopies_i$ times is inherently unbalanced, as Equation (10a) holds. In several cases, such as when sampling DTs, the samples have variable sizes, which could potentially render an embarrassingly parallel attempt to parallelize S-R even more unbalanced. Here, we describe two static load-balancing approaches, one for SM and one for DM, that work for variable-size samples and scale as $O(\frac{N}{P} + \log_2 N)$. In both cases, the key idea is to divide the workload evenly before having the cores run Algorithm 3 independently. We do this while also guaranteeing that the computation (and communication, in the case of DM) required to balance the workload costs (roughly) the same for all cores. A fully balanced distribution across the cores of the workload to redistribute N variable-size samples is defined, on both SM and DM, as follows.

Definition 1

(Workload Balanced). Let t be a list of N samples where each sample, $t_i$, has size $M_i$ for $i = 0, 1, \dots, N - 1$. Let $ncopies \in \mathbb{N}_0^N$ be an array that satisfies Equation (10), such that each i-th element, $ncopies_i$, defines how many times the i-th sample in t, $t_i$, needs to be redistributed. On parallel architectures (either SM or DM) of P cores, each owning the n elements in t and ncopies with index $i = id \cdot n, id \cdot n + 1, id \cdot n + 2, \dots, (id+1) \cdot n - 1$, the workload is fully balanced across the cores if and only if the following condition is met:

(15) $\sum_{j = id\frac{N}{P}}^{(id+1)\frac{N}{P} - 1} ncopies_j\, M_j = \text{const}, \quad id = 0, 1, 2, \dots, P - 1.$

However, since the algorithmic details of the parallel redistribution step on SM and DM differ for the most part, we describe them in the two following sections.

5.1.1. Parallel Shared Memory Redistribution

Our SM approach performs, in sequence, the following six steps, each running P cores in parallel.

Step 1—Max. In this step, the cores compute in parallel

(16) $\bar{M} = \max_{0 \le i \le N-1} M_i,$

i.e., the size of the largest sample.

Since the following steps equalize the sample sizes, the balancing condition (15) reduces to requiring that each core creates exactly $\frac{N}{P}$ copies:

(17) $\sum_{j = id\frac{N}{P}}^{(id+1)\frac{N}{P} - 1} ncopies_j = \frac{N}{P}, \quad id = 0, 1, 2, \dots, P - 1.$

Step 2—Pad. Each thread, $id = 0, 1, \dots, P - 1$, extends each sample by appending $\bar{M} - M_i$ dimensions, each with a value that is known to be impossible. In the specific case of DTs, we can use negative integers such as −1, but we suggest that Not-a-Numbers (NaNs) would be a generic value for any variable-size ADT. In other words, by performing Step 1 and Step 2, we ensure that the copy and paste costs per sample in Algorithm 3 are equal for all samples.

Step 3—Prefix Sum. In this step, the cores first compute in parallel $csum \in \mathbb{N}_0^N$, the cumulative sum of ncopies, such that

(18) $csum_i = \sum_{j=0}^{i} ncopies_j, \quad i = 0, 1, \dots, N - 1.$

The integer $csum_i$ represents the number of copies to be created up to and including the index i. Alternatively, because the samples now have equal sizes, one could consider $csum_i$ as the total workload (up to a constant time term equal to $\bar{M}$) required to sequentially redistribute the samples from the index 0 to the index i.

Step 4—Binary Search. It is now possible to perfectly divide the total workload to redistribute N samples between P cores if each core, $id = 0, 1, \dots, P - 1$, searches for an index, called a pivot, $pvt$, which is the first index that satisfies the following Boolean expression:

(19) $csum_{pvt} \ge id \times n.$

Since csum is inherently monotonically increasing, it is also sorted by definition. Therefore, each core can independently search for its pivot by using a binary search (BS).

Step 5—Copy. After Step 4, each core can freely and independently redistribute $\frac{N}{P}$ samples starting from its pivot by using Algorithm 3. However, more than one core may happen to share the same pivot. Therefore, before using S-R, each core must first determine how many copies of $t_{pvt}$ it is allowed to create. This is always computed as

(20) $\min\left(csum_{pvt} - id \times n, \; n\right).$

Indeed, since the workload is divided according to csum and Equation (19), if two or more cores share the same pivot, only the core with the highest id may create fewer than $n = \frac{N}{P}$ copies of $t_{pvt}$. We note that, since this algorithm is designed to run on SM, in this step, it is strongly recommended to use a temporary array to which the samples are temporarily copied, such that the parallel cores do not risk overwriting sensitive information.

Step 6—Restore. Each core, $id = 0, 1, \dots, P - 1$, loops over the samples $t_i$, for all $i = id \cdot n, id \cdot n + 1, id \cdot n + 2, \dots, (id+1) \cdot n - 1$, and removes up to $\bar{M} - 1$ dimensions from each sample, i.e., those dimensions that were encoded with an impossible value during Pad, such as −1 s for DTs or NaNs for any ADT.

These steps are summarized in Algorithm 5, and Figure 3a illustrates a practical example for N=8 samples and P=4 cores.

Algorithm 5 Parallel SM Redistribution for Variable-Size Samples
Input: $t$, $ncopies$, $M$, $N$, $P$, $n = \frac{N}{P}$. Output: $t$
1: $\bar{M} \leftarrow$ Max($M$, $P$), spawns and runs P cores
2: $t \leftarrow$ Pad($t$, $\bar{M}$, $P$), spawns and runs P cores
3: $csum \leftarrow$ Prefix Sum($ncopies$, $P$), spawns and runs P cores
4: Spawn P cores with $id = 0, 1, \dots, P - 1$ {
5:   $pvt \leftarrow$ Binary Search($csum$, $ncopies$, $n$)
6:   $cp \leftarrow \min(csum_{pvt} - id \times n, \; n)$
7:   $t_{temp,\, id \times n : id \times n + cp - 1} \leftarrow t_{pvt}$
8:   $t_{temp} \leftarrow$ S-R($t_{pvt+1:N-1}$, $ncopies_{pvt+1:N-1}$, $n - cp$)
9: }
10: $t \leftarrow$ Restore($t_{temp}$, $\bar{M}$, $P$), spawns and runs P cores
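The following single-process sketch mimics Steps 3-5 for each simulated core, assuming the inclusive cumulative sum of Equation (18); it is an illustration of the pivot idea (here via np.searchsorted) rather than the paper's OpenMP implementation, and Pad/Restore are omitted.

```python
import numpy as np

def chunk_redistribute(t, ncopies, csum, core_id, n):
    out = []
    pvt = int(np.searchsorted(csum, core_id * n))   # first index with csum[pvt] >= id*n  (19)
    cp = min(int(csum[pvt]) - core_id * n, n)       # copies of t[pvt] this core may create (20)
    out.extend([t[pvt]] * cp)
    j = pvt + 1
    while len(out) < n:                             # sequential redistribution (Algorithm 3)
        out.extend([t[j]] * min(int(ncopies[j]), n - len(out)))
        j += 1
    return out

t = ["A", "B", "C", "D", "E", "F", "G", "H"]        # stand-ins for N = 8 trees
ncopies = np.array([3, 0, 2, 0, 0, 3, 0, 0])
csum = np.cumsum(ncopies)                           # inclusive prefix sum: [3 3 5 5 5 8 8 8]
N, P = len(t), 4
n = N // P
print([chunk_redistribute(t, ncopies, csum, i, n) for i in range(P)])
# [['A', 'A'], ['A', 'C'], ['C', 'F'], ['F', 'F']]  == np.repeat(t, ncopies) split into chunks
```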

5.1.2. Parallel Distributed Memory Redistribution

Our SM approach in Section 5.1.1 resizes the samples to ensure that the cost of copying and pasting a sample in Algorithm 3 is constant. For obvious reasons, this is useful on DM as well. However, more crucially, on DM, there is an extra communication cost to exchange elements of t and ncopies between the (private) memories of different cores. Therefore, our DM approach performs the following six steps, where the early steps are similar to those in Algorithm 5.

Step 1—Max. Once again, the cores compute Equation (16) in parallel to obtain $\bar{M}$.

Step 2—Pad. We again resize the samples according to $\bar{M}$ to ensure that all samples have the same size. By doing this, not only do we ensure that the copy and paste cost per sample in Algorithm 3 is the same for all samples, but we also guarantee that the messages required to divide the workload have the same size.

Step 3—Rotational Nearly Sort. In order to divide the workload, we start by reorganizing the elements in t and ncopies with a view to separating the samples to be redistributed (i.e., those for which $ncopies_i > 0$) from the samples that need to be discarded (i.e., those for which $ncopies_i = 0$). In other words, we wish to shift the elements in t and their corresponding elements in ncopies to lower memory locations (i.e., lower values of the index i) until ncopies has the following shape:

(21) $ncopies = \left[ fl_1, fl_2, \dots, fl_m, 0, 0, \dots, 0 \right],$

where $fl_i > 0$ for $i = 1, \dots, m$. This is achieved by performing the following steps.
  1. We use the prefix sum to compute $shifts \in \mathbb{N}_0^N$, an array whose i-th element counts how many zero-valued elements of ncopies are found before the i-th memory location. In other words,

    (22) $shifts_i = \sum_{j=0}^{i-1} \left( ncopies_j = 0 \right), \quad i = 0, 1, 2, \dots, N - 1,$

    where $(ncopies_j = 0)$ is a Boolean expression that returns 1 or 0 depending on whether the expression is true or false.

  2. We move the elements in t, ncopies, and shifts by using rotational shifts. More precisely, each $t_i$, $ncopies_i$, and $shifts_i$ is rotated up to $\log_2 P$ times to the left by a progressively higher power-of-2 number of memory locations, starting from 1, if and only if the least significant bit (LSB) of $shifts_i$ is 1. After each rotation, the bits in $shifts_i$ are shifted to the right. This means that some samples are sent to the memory of cores with lower id.

    We note that, in order to keep the cores equally busy, each core always sends arrays of $\frac{N}{P}$ elements of t, ncopies, and shifts, regardless of the values in ncopies, but, for obvious reasons, the core that receives the message will have to save only the elements that have $ncopies_i > 0$.

Step 4—Rotational Split. In contrast to Rotational Nearly Sort, this step moves the elements of t and their respective elements in ncopies to higher memory locations (i.e., higher values of the index, i) until (17) is satisfied and ncopies has the following shape:

(23) $ncopies = \left[ fl_1, 0, \dots, 0, fl_2, 0, \dots, 0, \dots, fl_m, 0, \dots, 0 \right],$

where each $ncopies_i > 0$ (the $fl$ values in (23)) is followed by $ncopies_i - 1$ zeros. This is achieved by performing the following steps.
  1. We compute $csum \in \mathbb{N}_0^N$, the prefix sum of ncopies, such that each $csum_i$ is computed as in (18). We do this because $csum_{i-1} = csum_i - ncopies_i$ can be intuitively viewed as the memory location that the first copy of each $t_i$ has to rotate to in order to achieve (23). Therefore, for every $ncopies_i > 0$, each of the copies of $t_i$ will have to rotate by at least

    (24) $min\_shifts_i = csum_i - ncopies_i - i$

    memory locations and, at most, by

    (25) $max\_shifts_i = min\_shifts_i + ncopies_i - 1 = csum_i - i - 1$

    memory locations.

  2. We rotate the elements in t, ncopies, and csum to the right by using rotational splits and shifts. More precisely, we perform $\log_2 P$ iterations, each splitting and rotating the elements of t, ncopies, and csum by progressively lower power-of-2 numbers of memory locations, starting from $\frac{N}{2}$. This means that some samples are sent to the memory of cores with higher id. At each iteration, the splitting and shifting strategy depends on the most significant bits (MSBs) of $min\_shifts_i$ and $max\_shifts_i$, as follows.

    • $ncopies_i$, $csum_i$, and $t_i$ must not split and shift if the MSBs of $min\_shifts_i$ and $max\_shifts_i$ are 0, for obvious reasons.

    • All copies of $t_i$ must split and shift to the right if both MSBs of $min\_shifts_i$ and $max\_shifts_i$ are 1. Both $ncopies_i$ and $csum_i$ will move consequently. This means that, after the splitting and rotation, we have to set $ncopies_i = 0$ and $csum_i = 0$.

    • If only the MSB of $max\_shifts_i$ is 1, some copies of $t_i$ must not move, and the remaining ones must split and shift, i.e., those copies whose required shift is at least the power-of-2 number of positions to rotate by at the given iteration. Therefore, the values of $ncopies_i$ and $csum_i$ need to be reduced consequently by the number of copies of $t_i$ that are being rotated to the right.

    • At the end of each rotation, as described above, each core updates $min\_shifts$ and $max\_shifts$ according to (24) and (25).

    As in Step 3—Rotational Nearly Sort, in order to keep the cores equally busy, each core always sends arrays of $\frac{N}{P}$ elements of t, ncopies, and csum, regardless of the values in ncopies, but, for obvious reasons, the core that receives the message will have to save only the elements that have $ncopies_i > 0$.

Step 5—S-R. After performing Rotational Split, all cores have $\frac{N}{P}$ samples to redistribute. Therefore, each core can perform Algorithm 3 independently.

Step 6—Restore. After redistributing, we use the same restore step as in Algorithm 5 to remove up to $\bar{M} - 1$ dimensions from each sample, i.e., those dimensions that were encoded with an impossible value during Pad, such as −1 s for DTs or NaNs for any ADT.

Algorithm 6 summarizes these steps, and Figure 3b illustrates a numerical example for N=8 samples and P=4 cores.

Algorithm 6 Parallel DM Redistribution for Variable-Size Samples
Input: $t$, $ncopies$, $M$, $N$, $P$, $n = \frac{N}{P}$. Output: $t$
1: $\bar{M} \leftarrow$ Max($M$, $P$)
2: $t \leftarrow$ Pad($t$, $\bar{M}$, $P$)
3: if $P > 1$ then
4:   $t, ncopies \leftarrow$ Rotational Nearly Sort($t$, $ncopies$, $P$)
5:   $t, ncopies \leftarrow$ Rotational Split($t$, $ncopies$, $P$)
6: end if
7: $t \leftarrow$ S-R($t$, $ncopies$, $n$), see Algorithm 3
8: $t \leftarrow$ Restore($t$, $\bar{M}$, $P$)
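As a sanity check of what Steps 3 and 4 aim to achieve, the short sequential sketch below constructs the layouts of Equations (21) and (23) for a small example and verifies condition (17); the actual rotational message exchanges over MPI are omitted, so this is only an illustration of the target data layout, not of Algorithm 6 itself.

```python
import numpy as np

ncopies = np.array([0, 3, 0, 1, 0, 0, 4, 0])        # sum(ncopies) == N == 8
N = ncopies.size
P, n = 4, 2                                          # 4 hypothetical cores, n = N/P

# Target of Step 3 (Rotational Nearly Sort): non-zero entries packed to the left, Equation (21).
nz = ncopies[ncopies > 0]
nearly_sorted = np.concatenate([nz, np.zeros(N - nz.size, dtype=int)])
print(nearly_sorted)                                 # [3 1 4 0 0 0 0 0]

# Target of Step 4 (Rotational Split): copy counts spread (and split across block
# boundaries) so that every block of n entries sums to exactly n, Equation (17).
split_ncopies = np.zeros(N, dtype=int)
split_samples = np.full(N, -1)                       # which original sample each slot holds
pos = 0
for i, c in enumerate(ncopies):
    while c > 0:
        take = min(c, n - pos % n)                   # fill the rest of the current block
        split_samples[pos], split_ncopies[pos] = i, take
        pos += take
        c -= take
print(split_ncopies)                                 # [2 0 1 1 2 0 2 0] -> shape of Equation (23)
print(split_ncopies.reshape(P, n).sum(axis=1))       # [2 2 2 2] -> Equation (17) holds per core
```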

5.2. Time Complexity Analysis

In this section, we first analyze the time complexity of the approaches described in Section 5.1.1 and Section 5.1.2, and then we analyze the time complexity of a resampling step that utilizes either of them.

Lemma 1.

Let t be an array of N lists, where each list, $t_i$, for $i = 0, 1, \dots, N - 1$, has a variable size, $M_i$. Let $ncopies \in \mathbb{N}_0^N$ be an array of non-negative integers for which (10) holds. On an SM architecture running P parallel cores, Algorithm 5 redistributes t according to ncopies in $O(\frac{N}{P} + \log_2 N)$ steps with an $O(\bar{M})$ constant time factor for each sample.

Proof.

To prove Lemma 1, we start by analyzing the time complexity of each of the steps in Algorithm 5 individually.

The computation of $\bar{M}$ in (16) can be parallelized by using parallel reduction, as the vector max operator is reducible, just like the vector sum. Therefore, (16) scales as $O(\frac{N}{P} + \log_2 P)$ with an O(1) constant time factor.

Step 2 and Step 6 are embarrassingly parallel, as they loop independently over the samples and extend or reduce the dimensionality of each sample by up to $\bar{M} - 1$. Therefore, these two steps scale as $O(\bar{M}\frac{N}{P})$ or, more simply, as $O(\frac{N}{P})$ with a constant time factor per sample of $O(\bar{M})$.

In Step 3, the parallel cores compute the cumulative sum of ncopies, which, as stated above, achieves $O(\frac{N}{P} + \log_2 P)$ time complexity with an O(1) constant time factor.

In Step 4, the cores perform one independent BS each, which is well known to take $O(\log_2 N)$ comparisons. In addition, each core also computes (19) once, which takes O(1).

The copying procedure described in Step 5 requires each core to independently copy $\frac{N}{P}$ samples, each of which has $\bar{M}$ dimensions. Therefore, this step, as with Step 2 and Step 6, scales as $O(\bar{M}\frac{N}{P})$ or, more simply, as $O(\frac{N}{P})$ with an $O(\bar{M})$ constant time factor per sample.

Hence, by summing up the individual time complexity terms described above, we can conclude that Algorithm 5 runs in $O(\frac{N}{P} + \log_2 N)$ with an $O(\bar{M})$ computational cost on each sample. □

Lemma 2.

Let t be a list of N samples where each sample, $t_i$, has size $M_i$, and let $ncopies \in \mathbb{N}_0^N$ be an array that complies with (10). On a DM architecture of P DM cores, each identified with a rank $id = 0, 1, \dots, P - 1$ and owning the $n = \frac{N}{P}$ elements of t and ncopies with integer indexes $n \times id \le i \le n \times id + n - 1$, Algorithm 6 copies $t_i$ $ncopies_i$ times, for all $i = 0, 1, \dots, N - 1$, in $O(\log_2 N)$ time complexity with an $O(\bar{M})$ constant time factor for each sample.

Proof.

This analysis follows the same logic as the proof of Lemma 1. In other words, we once again analyze each step of Algorithm 6 individually.

Max, Pad, and Restore are steps in common between Algorithm 5 and Algorithm 6, and, as Lemma 1 proves, they achieve $O(\log_2 N)$, O(1), and O(1) time complexities, respectively, with Pad and Restore also incurring a constant time factor of $O(\bar{M})$.

Step 3 and Step 4 both require one prefix sum at the beginning, followed by a loop of $\log_2 P$ iterations, where each DM core iteratively sends, receives, and updates $\frac{N}{P}$ samples and elements of ncopies. Therefore, these steps both achieve $O(\frac{N}{P} + \frac{N}{P}\log_2 P)$ time complexity with a constant time of $O(\bar{M})$.

Algorithm 3 in Step 5 takes $O(\frac{N}{P})$ iterations with a constant time of $O(\bar{M})$. This is because, after Step 4, ncopies complies with (17).

Hence, Algorithm 6 scales as $O(\frac{N}{P} + \frac{N}{P}\log_2 P)$ with an $O(\bar{M})$ constant time, which tends to $O(\log_2 N)$ as $P \to N$. □

Theorem 1.

Let t be a list of N samples where each sample, $t_i$, has size $M_i$, and let $\tilde{w} \in \mathbb{R}^N$ be an array of normalized weights. On a parallel architecture (either SM or DM) of P cores, each owning $\frac{N}{P}$ elements of t and $\tilde{w}$, a resampling algorithm performing Equations (8) and (9), then either Algorithm 5 (on SM) or Algorithm 6 (on DM), and then resetting each element of $\tilde{w}$ to $\frac{1}{N}$, achieves asymptotically optimal time complexity equal to $O(\log_2 N)$.

Proof.

As described in this section, Equation (9) and resetting the weights to $\frac{1}{N}$ are embarrassingly parallel tasks, while Equation (8) and both Algorithms 5 and 6 scale as $O(\log_2 N)$ for $P \to N$. This straightforwardly proves Theorem 1 by contradiction because, even if it were possible to run Algorithm 5 or Algorithm 6 in less than $O(\log_2 N)$ (e.g., O(1)), a resampling algorithm performing the aforementioned steps would still achieve $O(\log_2 N)$ time complexity, since the prefix sum in Equation (8) is optimally bound to $O(\log_2 N)$, as already proven in [46]. □

Remark 1.

We note that Theorem 1 straightforwardly means that a parallel (either SM or DM) SMC sampler for DTs running Equations (5)–(7), resampling with either Algorithm 5 (on SM) or Algorithm 6 (on DM), and resetting the weights is also asymptotically optimal. However, we do not claim that the constant time per sample is also optimal, as it is bound to $O(\bar{M})$ in both Algorithms 5 and 6. Indeed, this paper only considers static load-balancing parallel solutions, but, given the variable-size nature of ADTs such as DTs, other solutions based on dynamic load balancing could improve the constant time.

Remark 2.

We note that, in the theoretical analysis of this section, we omit the overhead due to data movement for simplicity reasons. The cost of data movement on SM could be roughly represented as $O(t_{SM}\frac{N}{P})$, where $t_{SM}$ is the average cost of copying a sample from memory address i to memory address j, while, on DM, it could be roughly represented as $O(t_{DM}\frac{N}{P}\log_2 P)$, where $t_{DM}$ is the average cost of messaging a sample from core i to core j. The accurate quantification of $t_{SM}$ and $t_{DM}$ is difficult because it is strongly dependent on the hardware (e.g., NUMA or UMA architectures) and is beyond the scope of this manuscript, but it is a very interesting point of study for future work.

6. Numerical Results

This section presents two experiments that compare SMC with single-chain MCMC and multi-chain MCMC using four publicly available datasets of increasing size, commonly utilized in ML classification studies (https://archive.ics.uci.edu/dataset/45/heart+disease, https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, https://www.kaggle.com/datasets/hurshd0/abalone-uci, https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success, accessed on 21 November 2024). The first dataset, Heart Disease (Heart), includes 303 records and 75 features. The second dataset, Pima Indians Diabetes (Diabetes), consists of 768 records and 9 features, designed to predict whether a patient has diabetes, with two possible outcome classes. The third dataset, Abalone, contains 4177 records and 9 features, aimed at classifying the ages of abalones into 29 possible outcome classes. The fourth dataset, Predict Students' Dropout and Academic Success (Students), comprises 4424 records and 37 features and focuses on classifying students' enrollment status into three possible outcomes: enroll, drop out, or graduate. In both experiments, 70% of the data were used for training, while the remaining 30% were reserved for testing.

For each experiment, we include results for both SM and DM architectures, but we purposely avoid direct comparisons of the same methods (e.g., SMC) across different memory architectures, given the substantial differences between SM and DM. The results on DM are obtained on a cluster of 320 cores, equally distributed across 8 machines, each mounting 384 GB of memory and two Xeon Gold 6138 CPUs, which provide 40 cores per machine. For obvious reasons, the results on SM are obtained using only one of these 8 machines. We note that, independently of the system architecture, we always employ power-of-2 numbers of cores, as all algorithms described in Section 5 use the divide-and-conquer paradigm. Therefore, we are bound to use up to P=256 cores for the results on DM and up to P=32 cores for the results on SM. As stated in Section 2, the SM code is parallelized using OpenMP and interfaces with Python through Cython, and the DM code is also developed in Python and is parallelized by using MPI through the mpi4py library. We share our code via GitHub (https://github.com/DiscreteVariablesTaskForce/SMCDecisionTrees, accessed on 21 November 2024).

6.1. Experiment 1—Performance for Fixed Problem Size

This first experiment aims to compare the performance of SMC, single-chain MCMC, and multi-chain MCMC in terms of runtime and accuracy for a fixed problem size. The experimental setup is as follows. First, the number of samples in the SMC sampler, N, must be an integer multiple of P, for practical reasons. Since P is chosen as a power of two, it is also convenient to use power-of-two values for N. Specifically, we use N = {256, 512, 1024} samples. The SMC sampler is run for K=10 iterations across all datasets, a value determined empirically to achieve competitive classification accuracy. To ensure that all methods operate on the same problem size, we configure single-chain MCMC to draw $N \times K = N_T$ samples in total and multi-chain MCMC to generate N chains, each with K samples. Consequently, all methods are evaluated using the same total problem size, $N_T = N \times K$.

6.1.1. Results on Shared Memory

The findings for this experiment are presented in Table 1 and Figure 4a, Figure 5a, Figure 6a and Figure 7a. Table 1 shows the classification accuracy across all datasets for the largest problem size evaluated in this study, i.e., $N_T$ = 10,240. As observed, the SMC sampler achieves classification accuracy comparable to that of single-chain MCMC, which generates an extended sequence of samples. However, while single-chain MCMC is well known for being an inherently sequential approach, the SMC sampler also demonstrates excellent scalability, measured by the runtime speed-up, i.e., the ratio between the runtime for P=1 core and the runtime for P cores. Figure 5a shows that SMC on SM achieves up to a 16-fold speed-up (for Students), which improves when the problem size increases, as the computation becomes more intensive. This also happens when the size of the dataset is larger. We perceive this to happen because, when the size of the dataset increases, the DTs tend to grow (due to the increased number of possible features and/or thresholds), especially after initializing them by sampling from the prior (which may assign several nodes to each DT from the outset). This consequently increases the computation required for both IS and redistribution, while the data movement only increases in redistribution, since the IS step is embarrassingly parallel. We also point out that the achieved speed-up is not equal to P (i.e., the ideal speed-up), both because of the parallel overhead terms, $O(\log_2 P)$, and the cost of data movement. We also note that SMC for P=1 is faster than single-chain MCMC. While this may seem counterintuitive, the reason is that SMC generates N samples in K iterations plus an initialization step. This means that each DT contains at most K extra nodes after the initialization step. On the other hand, single-chain MCMC produces a single, extended chain of $N \times K$ DTs, potentially containing up to $N \times K$ nodes. By leveraging up to P=32 parallel threads, the SMC sampler achieves similar accuracy to single-chain MCMC but is significantly faster, up to eight times in Heart, up to 13 times in Diabetes, up to 42 times in Abalone, and up to 51 times in Students. These speed-up results are illustrated in Figure 7a, which depicts the speed-up factor (relative to the number of SM threads, P) of SMC compared to single-chain MCMC for a problem size of $N_T$ = 10,240. For this problem size, both methods achieve comparable classification accuracy.

Although single-chain MCMC cannot be parallelized, multi-chain MCMC can run multiple chains in an embarrassingly parallel manner. As shown in Figure 6, multi-chain MCMC scales effectively with P and outperforms SMC in terms of runtime, by up to a factor of two (in Abalone and Students), up to a factor of four (in Diabetes), and up to a factor of six (in Heart), as indicated in Figure 4. This behavior is expected for two reasons. First, Algorithm 2 is embarrassingly parallel, while the SMC sampler has a computational complexity of $O(\log_2 N)$. Second, the constant per-sample time in multi-chain MCMC is determined solely by Equations (1)–(3). By contrast, updating each sample in the SMC sampler involves additional calculations, including proposing and weighting the new sample, as well as the operations in Equations (6), (8), and (9) and Algorithm 5, with a constant time of $O(\bar{M})$ (see Lemma 1). Despite its scalability and faster runtime, multi-chain MCMC yields lower accuracy compared to SMC and single-chain MCMC for the same fixed problem size (see Table 1). SMC achieves superior accuracy because its N samples are not only weighted but are also iteratively updated and corrected K times through sampling and resampling. In contrast, single-chain MCMC generates a single, extended chain of unweighted samples that are likely to have converged. On the other hand, multi-chain MCMC produces N independent short chains, each with $K \ll N_T$ unweighted samples, which are less likely to have converged.

6.1.2. Results on Distributed Memory

Regarding this experiment, the reader is referred to Figure 4b, Figure 5b, Figure 6b and Figure 7b for the runtimes and speed-ups. For the accuracy, the reader is again referred to Table 1, as the accuracy performance of SMC, single-chain MCMC, and multi-chain MCMC only depends on the selected values of N, K, and $N_T$ and not on the type of architecture.

On a qualitative level, the results on DM resemble the ones presented in Section 6.1.1; the differences are seen in the quantitative results. Details follow. SMC scales well with respect to itself for P=1 core; more precisely, for P=256, SMC achieves up to a 200-fold speed-up for the largest considered dataset (Students), as Figure 5b illustrates. This means that SMC is once again able to provide roughly the same classification accuracy as single-chain MCMC (for the same problem size) but is significantly faster, as shown in Figure 7b. More precisely, the SMC sampler is able to provide the same accuracy as single-chain MCMC but is faster by up to a factor of 197 in Heart, by up to a factor of 292 in Diabetes, by up to a factor of 588 in Abalone, and by up to a factor of 640 in Students.

As we also observe on SM, multi-chain MCMC scales well and is faster than SMC on DM as well, as shown in Figure 4b and Figure 6b. The reason is once again that multi-chain MCMC is an embarrassingly parallel algorithm with a faster computation time per sample than SMC, which is bound to an optimal $O(\log_2 N)$ time complexity on DM as well, as proven in Lemma 2 and Theorem 1.

Considering the trade-off between the runtime and accuracy inherent to multi-chain MCMC on both SM and DM computer architectures, one might question whether increasing K to match the elapsed time of SMC would allow multi-chain MCMC to achieve superior accuracy. This question is addressed in the following section.

6.2. Experiment 2—Accuracy for Fixed Elapsed Time

Building on the previous section, this experiment aims to compare the classification accuracy of SMC and multi-chain MCMC when both methods are run for the same elapsed time. To achieve this, we first identify the best runtimes for SMC and multi-chain MCMC at N=1024 and then incrementally increase K in multi-chain MCMC until the runtimes of the two methods are approximately equivalent.

6.2.1. Results on Shared Memory

Table 2 presents the results of this experiment conducted on SM across all datasets.

As observed, we had to increase K in multi-chain MCMC by a factor of 4.2 for Heart, 3.3 for Diabetes, 1.8 for Abalone, and 1.5 for Students compared to the previous experiment (where K=10). This adjustment led to improved accuracy across all datasets relative to the results in Table 1. Specifically, the accuracy increased from 67.42% to 71.89% for Heart, from 66.76% to 69.49% for Diabetes, from 20.27% to 20.62% for Abalone, and from 55.53% to 60.32% for Students. Nonetheless, despite these improvements, the accuracy achieved by multi-chain MCMC remains substantially lower than the accuracy achieved by SMC.

6.2.2. Results on Distributed Memory

The results for this experiment on DM are provided in Table 3 for all datasets.

At both the qualitative and quantitative levels, the results in Table 3 for DM are similar to the ones presented in Table 2 for SM. Indeed, in order to run for (roughly) the same elapsed time as SMC, multi-chain MCMC requires K=30 samples per chain on Heart, K=20 samples per chain on Diabetes, K=12 samples per chain on Abalone, and K=14 samples per chain on Students. This has again translated into improved classification accuracy with respect to the values reported in Table 1, but multi-chain MCMC once again achieves significantly lower classification accuracy than SMC. We note that the classification accuracies for SMC are identical to the ones in Table 2, as the values of N and K for SMC are unchanged.

In other words, the results in this section indicate that a parallel SMC sampler on both SM and DM architectures is able to compensate for its larger constant time factor compared with parallel multi-chain MCMC by achieving better accuracy per time unit, therefore providing a more favorable runtime vs. accuracy compromise.

7. Conclusions

In this study, we have proposed two approaches to parallelizing an SMC sampler for Bayesian DTs: one approach is specific to SM architectures; the other is specific to DM architectures. Given the inherent differences between SM and DM architectures, the parallel solutions that we describe are substantially different, especially in the parallelization of the bottleneck, i.e., redistribution. However, in both cases, we have proven that it is possible to achieve optimal $O(\log_2 N)$ time complexity. By taking advantage of multi-core architectures, our approach provides up to a 16-fold speed-up on a 32-core SM machine and up to a 200-fold speed-up on a 256-core DM cluster. On both architectures, the proposed methods significantly improve on the performance of alternative approaches to Bayesian DTs, such as MCMC. More precisely, on four exemplar datasets for classification, our approach could provide roughly the same classification accuracy but up to 51 times faster on SM and up to 640 times faster on DM for the same problem size. Alternatively, on both memory architectures, we could provide substantially higher classification accuracy in the same execution time.

While the results are promising, there remains considerable room for improvement in both the accuracy and execution time. For instance, the accuracy could be improved by adopting a more effective proposal distribution for ADTs, such as HINTS [47]. However, refining the proposal distribution often comes at the cost of an increased runtime. This trade-off can be addressed through several strategies. Firstly, this work has independently examined SM and DM approaches for CPUs. However, high-performance computing (HPC) commonly relies on hybrid MPI+X programming models, which increase the parallelism by combining distributed and shared memory, maximizing the computational capabilities of modern supercomputers. Another option is to take advantage of GPUs, which often outperform CPUs in highly parallelizable tasks like the sampling step of the SMC sampler. Additionally, alternative parallelization techniques for tasks in the SMC sampler could be explored. In this work, we used static load-balancing methods, but dynamic load-balancing approaches might be more suitable for the variable-size nature of DTs, potentially improving both the runtime and scalability.

    Author Contributions

    Conceptualization, E.D. and A.V.; formal analysis, E.D., A.V., A.M.P., P.G.S. and S.M.; software, E.D., A.V. and A.M.P.; investigation, validation, visualization, and writing—original draft, E.D. and A.V.; review and editing, E.D., A.V., A.M.P., P.G.S. and S.M.; supervision, P.G.S. and S.M.; project administration and funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

    Data Availability Statement

    No new data were created or analyzed in this study. Data sharing is not applicable to this article.

    Conflicts of Interest

    The authors declare no conflicts of interest.

    Footnotes

    Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

    Figures and Tables
    Figure 1. Memory architectures.
    Figure 2. An example of a DT with [formula omitted], [formula omitted], and six leaf nodes with a quantitative response of two classes.
    Figure 3. Parallel redistribution—examples on SM and DM for [formula omitted] and [formula omitted]. Each tree node is encoded with a letter and a number for brevity. The [formula omitted]s represent fake tree nodes.
    Figure 4. Runtimes for MCMC and SMC vs. total sample size [formula omitted].
    Figure 5. Speed-ups for SMC for [formula omitted] iterations and an increasing number of samples N.
    Figure 6. Speed-ups for multi-chain MCMC for [formula omitted] samples per chain and an increasing number of chains N.
    Figure 7. Speed-up gain of SMC vs. single-chain MCMC for (approximately) the same accuracy.

    Table 1. Classification accuracy for the same problem size, NT = 10,240 samples, generated by each method as follows: N = 1024 samples updated K = 10 times for SMC; one chain of NT = 10,240 samples for single-chain MCMC; N = 1024 chains of K = 10 samples each for multi-chain MCMC.

    Method             P     Heart     Diabetes   Abalone   Students
    SMC                Any   77.44%    73.27%     22.48%    71.48%
    Single-Chain MCMC  1     77.01%    73.78%     22.53%    71.64%
    Multi-Chain MCMC   Any   67.42%    66.76%     20.27%    55.53%

    Table 2. Accuracy per time unit on SM. The * is used to emphasize that the runtime is rounded, as running two different methods for exactly the same runtime is challenging.

    Dataset    Method            P    N      K    Time        Accuracy
    Heart      SMC               32   2^10   10   7.6 [s] *   77.44%
    Heart      Multi-Chain MCMC  32   2^10   42   7.6 [s] *   71.89%
    Diabetes   SMC               32   2^10   10   13 [s] *    73.27%
    Diabetes   Multi-Chain MCMC  32   2^10   33   13 [s] *    69.49%
    Abalone    SMC               32   2^10   10   47 [s] *    22.48%
    Abalone    Multi-Chain MCMC  32   2^10   18   47 [s] *    20.62%
    Students   SMC               32   2^10   10   77 [s] *    71.48%
    Students   Multi-Chain MCMC  32   2^10   15   77 [s] *    60.32%

    Table 3. Accuracy per time unit on DM. The * is used to emphasize that the runtime is rounded, as running two different methods for exactly the same runtime is challenging.

    Dataset    Method            P     N      K    Time         Accuracy
    Heart      SMC               256   2^10   10   0.36 [s] *   77.44%
    Heart      Multi-Chain MCMC  256   2^10   30   0.36 [s] *   70.01%
    Diabetes   SMC               256   2^10   10   1 [s] *      73.27%
    Diabetes   Multi-Chain MCMC  256   2^10   20   1 [s] *      67.61%
    Abalone    SMC               256   2^10   10   4 [s] *      22.48%
    Abalone    Multi-Chain MCMC  256   2^10   12   4 [s] *      20.28%
    Students   SMC               256   2^10   10   6 [s] *      71.48%
    Students   Multi-Chain MCMC  256   2^10   15   6 [s] *      60.32%

    References

    1. Azar, A.T.; El-Metwally, S.M. Decision tree classifiers for automated medical diagnosis. Neural Comput. Appl.; 2013; 23, pp. 2387-2403. [DOI: https://dx.doi.org/10.1007/s00521-012-1196-7]

    2. Sankari, E.S.; Manimegalai, D. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J. Theor. Biol.; 2017; 435, pp. 208-217. [DOI: https://dx.doi.org/10.1016/j.jtbi.2017.09.018] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28941868]

    3. Gajewicz, A.; Puzyn, T.; Odziomek, K.; Urbaszek, P.; Haase, A.; Riebeling, C.; Luch, A.; Irfan, M.A.; Landsiedel, R.; van der Zande, M. et al. Decision tree models to classify nanomaterials according to the DF4nanoGrouping scheme. Nanotoxicology; 2018; 12, pp. 1-17. [DOI: https://dx.doi.org/10.1080/17435390.2017.1415388] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29251527]

    4. Kazemi, P.; Ghisi, A.; Mariani, S. Classification of the Structural Behavior of Tall Buildings with a Diagrid Structure: A Machine Learning-Based Approach. Algorithms; 2022; 15, pp. 349-372. [DOI: https://dx.doi.org/10.3390/a15100349]

    5. Schetinin, V.; Jakaite, L.; Krzanowski, W. Bayesian averaging over Decision Tree models for trauma severity scoring. Artif. Intell. Med.; 2018; 84, pp. 139-145. [DOI: https://dx.doi.org/10.1016/j.artmed.2017.12.003]

    6. Waldmann, P. Genome-wide prediction using Bayesian additive regression trees. Genet. Sel. Evol.; 2016; 48, 42. [DOI: https://dx.doi.org/10.1186/s12711-016-0219-8]

    7. Hernández, B.; Pennington, S.R.; Parnell, A.C. Bayesian methods for proteomic biomarker development. EuPA Open Proteom.; 2015; 9, pp. 54-64. [DOI: https://dx.doi.org/10.1016/j.euprot.2015.08.001]

    8. Drousiotis, E.; Phillips, A.M.; Spirakis, P.G.; Maskell, S. Bayesian Decision Trees Inspired from Evolutionary Algorithms. Learning and Intelligent Optimization; Sellmann, M.; Tierney, K. Springer: Cham, Switzerland, 2023; pp. 318-331.

    9. Bez, J.L.; Boito, F.Z.; Nou, R.; Miranda, A.; Cortes, T.; Navaux, P.O.A. Detecting I/O Access Patterns of HPC Workloads at Runtime. Proceedings of the 2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD); Campo Grande, Brazil, 15–18 October 2019; pp. 80-87. [DOI: https://dx.doi.org/10.1109/SBAC-PAD.2019.00025]

    10. Linero, A.R. Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection. J. Am. Stat. Assoc.; 2018; 113, pp. 626-636. [DOI: https://dx.doi.org/10.1080/01621459.2016.1264957]

    11. Chipman, H.A.; George, E.I.; McCulloch, R.E. Bayesian CART model search. J. Am. Stat. Assoc.; 1998; 93, pp. 935-948. [DOI: https://dx.doi.org/10.1080/01621459.1998.10473750]

    12. Denison, D.G.; Mallick, B.K.; Smith, A.F. A bayesian cart algorithm. Biometrika; 1998; 85, pp. 363-377. [DOI: https://dx.doi.org/10.1093/biomet/85.2.363]

    13. Schetinin, V.; Fieldsend, J.E.; Partridge, D.; Krzanowski, W.J.; Everson, R.M.; Bailey, T.C.; Hernandez, A. The Bayesian decision tree technique with a sweeping strategy. arXiv; 2005; arXiv: cs/0504042

    14. Denison, D.G.; Holmes, C.C.; Mallick, B.K.; Smith, A.F. Bayesian Methods for Nonlinear Classification and Regression; John Wiley & Sons: Hoboken, NJ, USA, 2002; Volume 386.

    15. Hu, W.; O’Leary, R.A.; Mengersen, K.; Low Choy, S. Bayesian Classification and Regression Trees for Predicting Incidence of Cryptosporidiosis. PLoS ONE; 2011; 6, e23903. [DOI: https://dx.doi.org/10.1371/journal.pone.0023903] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21909377]

    16. Malehi, A.S.; Jahangiri, M. Classic and Bayesian Tree-Based Methods. Enhanced Expert Systems; Vizureanu, P. IntechOpen: Rijeka, Croatia, 2019; Chapter 3 pp. 27-51. [DOI: https://dx.doi.org/10.5772/intechopen.83380]

    17. Neal, R.M. Slice sampling. Ann. Stat.; 2003; 31, pp. 705-767. [DOI: https://dx.doi.org/10.1214/aos/1056562461]

    18. Byrd, J.M.; Jarvis, S.A.; Bhalerao, A.H. Reducing the run-time of MCMC programs by multithreading on SMP architectures. Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing; Miami, FL, USA, 14–18 April 2008; pp. 1-8. [DOI: https://dx.doi.org/10.1109/IPDPS.2008.4536354]

    19. Ye, J.; Wallace, A.M.; Al Zain, A.; Thompson, J. Parallel Bayesian inference of range and reflectance from LaDAR profiles. J. Parallel Distrib. Comput.; 2013; 73, pp. 383-399. [DOI: https://dx.doi.org/10.1016/j.jpdc.2012.12.003]

    20. Drousiotis, E.; Spirakis, P. Single MCMC chain parallelisation on decision trees. Ann. Math. Artif. Intell.; 2023; pp. 1-14. [DOI: https://dx.doi.org/10.1007/s10472-023-09876-9]

    21. Mohammadi, R.; Pratola, M.; Kaptein, M. Continuous-Time Birth-Death MCMC for Bayesian Regression Tree Models. J. Mach. Learn. Res.; 2020; 21, pp. 1-26.

    22. Gilks, W.R.; Richardson, S.; Spiegelhalter, D.J. Introducing Markov chain Monte Carlo. Markov Chain Monte Carlo in Practice; CRC Press: Boca Raton, FL, USA, 1995; Volume 1.

    23. Seelinger, L.; Reinarz, A.; Rannabauer, L.; Bader, M.; Bastian, P.; Scheichl, R. High performance uncertainty quantification with parallelized multilevel Markov chain Monte Carlo. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; St. Louis, MO, USA, 14–19 November 2021; pp. 1-15.

    24. Goudie, R.; Turner, R.M.; De Angelis, D.; Thomas, A. Massively parallel MCMC for Bayesian hierarchical models. arXiv; 2017; arXiv: 1712.05907

    25. Pratola, M.T.; Chipman, H.A.; Gattiker, J.R.; Higdon, D.M.; McCulloch, R.; Rust, W.N. Parallel Bayesian Additive Regression Trees. J. Comput. Graph. Stat.; 2014; 23, pp. 830-852. [DOI: https://dx.doi.org/10.1080/10618600.2013.841584]

    26. Francom, D.; Sansó, B.; Kupresanin, A.; Johannesson, G. Sensitivity Analysis and Emulation for Functional Data Using Bayesian Adaptive Splines. Stat. Sin.; 2018; 28, pp. 791-816. [DOI: https://dx.doi.org/10.5705/ss.202016.0130]

    27. Moral, P.D.; Doucet, A.; Jasra, A. Sequential Monte Carlo Samplers. J. R. Stat. Soc. Ser. B (Stat. Methodol.); 2006; 68, pp. 411-436. [DOI: https://dx.doi.org/10.1111/j.1467-9868.2006.00553.x]

    28. Bolic, M.; Djuric, P.M.; Hong, S. Resampling Algorithms and Architectures for Distributed Particle Filters. IEEE Trans. Signal Process.; 2005; 53, pp. 2442-2450. [DOI: https://dx.doi.org/10.1109/TSP.2005.849185]

    29. Thiyagalingam, J.; Kekempanos, L.; Maskell, S. MapReduce Particle Filtering with Exact Resampling and Deterministic Runtime. EURASIP J. Adv. Signal Process.; 2017; 2017, pp. 71-93. [DOI: https://dx.doi.org/10.1186/s13634-017-0505-9] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32010202]

    30. Varsi, A.; Maskell, S.; Spirakis, P.G. An O (log2N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures. Algorithms; 2021; 14, pp. 342-362. [DOI: https://dx.doi.org/10.3390/a14120342]

    31. Rosato, C.; Varsi, A.; Murphy, J.; Maskell, S. An O(log2 N) SMC2 Algorithm on Distributed Memory with an Approx. Optimal L-Kernel. Proceedings of the 2023 IEEE Symposium Sensor Data Fusion and International Conference on Multisensor Fusion and Integration (SDF-MFI); Bonn, Germany, 27–29 November 2023; pp. 1-8. [DOI: https://dx.doi.org/10.1109/SDF-MFI59545.2023.10361452]

    32. Lopez, F.; Zhang, L.; Beaman, J.; Mok, A. Implementation of a Particle Filter on a GPU for Nonlinear Estimation in a Manufacturing Remelting Process. Proceedings of the 2014 IEEE/ASME International Conference on Advanced Intelligent Mechatronics; Besacon, France, 8–11 July 2014; pp. 340-345. [DOI: https://dx.doi.org/10.1109/AIM.2014.6878102]

    33. Lopez, F.; Zhang, L.; Mok, A.; Beaman, J. Particle Filtering on GPU Architectures for Manufacturing Applications. Comput. Ind.; 2015; 71, pp. 116-127. [DOI: https://dx.doi.org/10.1016/j.compind.2015.03.013]

    34. Murray, L.M.; Lee, A.; Jacob, P.E. Parallel Resampling in the Particle Filter. J. Comput. Graph. Stat.; 2016; 25, pp. 789-805. [DOI: https://dx.doi.org/10.1080/10618600.2015.1062015]

    35. Gardner, J.; Guo, C.; Weinberger, K.; Garnett, R.; Grosse, R. Discovering and exploiting additive structure for Bayesian optimization. Proceedings of the Artificial Intelligence and Statistics, PMLR; Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1311-1319.

    36. Drousiotis, E.; Spirakis, P.G.; Maskell, S. Parallel Approaches to Accelerate Bayesian Decision Trees. arXiv; 2023; arXiv: 2301.09090

    37. Lakshminarayanan, B.; Roy, D.M.; Teh, Y.W. Top-down particle filtering for Bayesian decision trees. Proceedings of the 30th International Conference on Machine Learning; Atlanta, GA, USA, 17–19 June 2013; Dasgupta, S.; McAllester, D. Proceedings of Machine Learning Research Volume 28, pp. 280-288.

    38. Quadrianto, N.; Ghahramani, Z. A Very Simple Safe-Bayesian Random Forest. IEEE Trans. Pattern Anal. Mach. Intell.; 2015; 37, pp. 1297-1303. [DOI: https://dx.doi.org/10.1109/TPAMI.2014.2362751]

    39. Drousiotis, E.; Varsi, A.; Spirakis, P.G.; Maskell, S. A Shared Memory SMC Sampler for Decision Trees. Proceedings of the 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD); Porto Alegre, Brazil, 17–20 October 2023; pp. 209-218. [DOI: https://dx.doi.org/10.1109/SBAC-PAD59825.2023.00030]

    40. Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2.

    41. Li, T.; Bolic, M.; Djuric, P.M. Resampling Methods for Particle Filtering: Classification, implementation, and strategies. IEEE Signal Process. Mag.; 2015; 32, pp. 70-86. [DOI: https://dx.doi.org/10.1109/MSP.2014.2330626]

    42. Pratola, M.T. Efficient Metropolis–Hastings Proposal Mechanisms for Bayesian Regression Tree Models. Bayesian Anal.; 2016; 11, pp. 885-911. [DOI: https://dx.doi.org/10.1214/16-BA999]

    43. Wu, Y.; Tjelmeland, H.; West, M. Bayesian CART: Prior Specification and Posterior Simulation. J. Comput. Graph. Stat.; 2007; 16, pp. 44-66. [DOI: https://dx.doi.org/10.1198/106186007X180426]

    44. Diéguez, A.P.; Amor, M.; Doallo, R. Efficient Scan Operator Methods on a GPU. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing; Paris, France, 22–24 October 2014; pp. 190-197. [DOI: https://dx.doi.org/10.1109/SBAC-PAD.2014.23]

    45. Kozakai, S.; Fujimoto, N.; Wada, K. Efficient GPU-Implementation for Integer Sorting Based on Histogram and Prefix-Sums. Proceedings of the 50th International Conference on Parallel Processing; New York, NY, USA, 9–12 August 2021; ICPP ’21 pp. 1-11. [DOI: https://dx.doi.org/10.1145/3472456.3472486]

    46. Santos, E.E. Optimal and Efficient Algorithms for Summing and Prefix Summing on Parallel Machines. J. Parallel Distrib. Comput.; 2002; 62, pp. 517-543. [DOI: https://dx.doi.org/10.1006/jpdc.2000.1698]

    47. Strens, M. Efficient Hierarchical MCMC for Policy Search. Proceedings of the Twenty-First International Conference on Machine Learning; New York, NY, USA, 4–8 July 2004; ICML ’04 pp. 97-104. [DOI: https://dx.doi.org/10.1145/1015330.1015381]

    © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).