Jingjing Wang 1,2 and Chen Lin 1,2
Academic Editor: J. Alfredo Hernandez
1 School of Information Science and Technology, Xiamen University, Xiamen 361005, China
2 Shenzhen Research Institute of Xiamen University, Shenzhen 518058, China
Received 28 September 2014; Revised 24 February 2015; Accepted 2 March 2015; Published 30 April 2015
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
A fundamental problem in data mining is to detect similar items. Finding similar pairs of instances is an essential component in mining numerous types of data, including document clustering [1, 2], plagiarism detection [3], image search [4], and recommender systems [5].
Identifying pairs of similar instances is also called similarity joins [6]. Given a set of data instances, a similarity threshold $\epsilon$, and a join attribute $a$, the goal of similarity joins is to find all pairs of instances $(x, y)$ whose similarity on the join attribute is larger than the threshold $\epsilon$ (i.e., $sim(x_a, y_a) > \epsilon$). There are various similarity measurements, including cosine similarity [7], edit distance [6, 8, 9], hamming distance [7, 10], dimension root similarity [2], and EDU-based similarity for elementary discourse units [11]. In this work we focus on Jaccard similarity, which has proven successful for high dimensional, sparse feature sets [6]. The Jaccard similarity of two feature sets $x$ and $y$ is defined as $J(x, y) = |x \cap y| / |x \cup y|$. As an example, we illustrate the naive computation for similarity joins based on Jaccard similarity in Table 1. Suppose there are 5 instances, namely, $A$, $B$, $C$, $D$, and $E$, and the join attribute consists of features $\{a, b, c, d, e, f\}$. The Jaccard similarity of $(A, B)$ is $1/4$, $(A, C)$ is $2/3$, $(B, C)$ is $1/5$, $(B, E)$, $(C, D)$, and $(C, E)$ are each $1/4$, $(D, E)$ is $1/3$, and the similarities for the remaining pairs $(A, D)$, $(A, E)$, and $(B, D)$ are all $0$. Given the similarity threshold $\epsilon = 0.5$, it is directly concluded that instances $A$ and $C$ are similar.
Table 1: An illustrative example of similarity joins based on Jaccard similarity. 0/1 indicates absence/presence of features in each instance.
Instance | a | b | c | d | e | f
A | 0 | 1 | 0 | 0 | 1 | 0 |
B | 1 | 0 | 0 | 0 | 1 | 1 |
C | 0 | 1 | 0 | 1 | 1 | 0 |
D | 0 | 0 | 1 | 1 | 0 | 0 |
E | 0 | 0 | 0 | 1 | 0 | 1 |
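To make the naive computation concrete, the following minimal Python sketch (names and structure are ours, purely illustrative) joins the five instances of Table 1 and reports only the pair $(A, C)$ for $\epsilon = 0.5$:

```python
# A minimal sketch of the naive similarity join over the instances in Table 1.
from itertools import combinations

instances = {
    "A": {"b", "e"},
    "B": {"a", "e", "f"},
    "C": {"b", "d", "e"},
    "D": {"c", "d"},
    "E": {"d", "f"},
}

def jaccard(x, y):
    """Jaccard similarity |x ∩ y| / |x ∪ y| of two feature sets."""
    return len(x & y) / len(x | y)

threshold = 0.5
for (u, fu), (v, fv) in combinations(instances.items(), 2):
    s = jaccard(fu, fv)
    if s > threshold:
        print(u, v, round(s, 2))   # prints: A C 0.67
```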
A naive algorithm, which finds similar pairs by computing similarities for all instance pairs, is clearly impracticable on a large collection of instances with high dimensional features. To improve the efficiency and scalability of similarity joins, previous research efforts generally fall into two categories. On one hand, parallel algorithms are adopted on clusters of machines. Most of them are implemented using the MapReduce framework, including a 3-stage MapReduce approach for an end-to-end set-similarity join algorithm [12], fast computation of inner products for large scale news articles [13], and a new ant colony optimization algorithm parallelized using MapReduce [14] to select features in a high dimensional space. Others exploit the parallelism and high data throughput of the GPU, for example, the LSS algorithm [15]. On the other hand, algorithmic design can be improved to reduce the time and storage cost of similarity computation for high dimensional feature spaces. One type of such approaches uses dimension reduction technologies, including Principal Component Analysis and neural networks [16]. Another type is to hash and filter, so that the high dimensional feature space can be replaced by smaller representative signatures. The most popular hashing methods include minhashing [17], minwise hashing [18], and Locality Sensitive Hashing (LSH) [10]. The core idea of hashing is to map similar pairs to similar signatures of several hundred dimensions, each element of which is the result of hashing, thus mitigating the problem of high dimensionality. Hashing can also be a means for data clustering, because it enables similar features with vast dimensions to be hashed into the same buckets and thus partitions features into groups [19]. Filtering methods, including length filter [7], prefix filter [20], and suffix filter [21], are then frequently utilized to eliminate dissimilar pairs while possibly similar pairs remain, so that fewer similarity computations are needed. In particular, the banding technique [22], a specialized form of Locality Sensitive Hashing that maps every band of signatures to an array of buckets so that the probability of collision is much higher for instances close to each other, is the most efficient filtering method.
Although previous works have demonstrated the importance and feasibility of hashing and filtering approaches, one critical issue remains underestimated: hashing and filtering approaches produce approximate results. The similarities of selected pairs are not guaranteed to be larger than the predefined threshold. Meanwhile, discarded pairs are not guaranteed to be dissimilar; some of them may have similarities larger than the predefined threshold. The former case is called false positive, while the latter one is called false negative. A moderate number of false positives and false negatives is acceptable in many applications, but the tolerance to false positives and false negatives may differ. In most application scenarios such as clustering and information retrieval, reducing false positives is emphasized to increase efficiency and precision. In applications such as recommendation and bioinformatics systems [23-25], a small number of false negatives is more important.
In this paper, we address the problem of tailoring the number of false positives and false negatives for different applications. To the best of our knowledge, this is the first detailed analysis of this problem in the literature. False positives and false negatives are caused by the scheme of pruning candidate pairs whose signatures map into disjoint bucket arrays. Intuitively, similar signatures are likely to have highly analogous bands, and analogous bands will be mapped into identical bucket arrays. Inspired by this intuition, we propose a new banding technique called Personalized Locality Sensitive Hashing (PLSH), in which only pairs whose signature bands map to at least $k$ identical buckets are selected as candidates. We also explore the probability guarantees that the new banding technique provides for three cases, namely, false negatives, false positives, and the sum of both. According to these probabilities, we derive upper and lower bounds for the numbers of false positives and false negatives and accordingly show how to personalize the parameters involved in banding and hashing to fulfill different application demands.
The contributions of this paper are threefold:
(i) We improve the traditional banding technique with a new banding scheme with a flexible threshold $k$, which reduces the number of false positives and improves efficiency.
(ii) We derive the numbers and lower/upper bounds of false negatives and false positives for the new banding technique and show how to balance between them.
(iii) We implement the new banding technique using the parallel framework MapReduce.
The rest of the paper is structured as follows. In Section 2, the backgrounds of minhashing and banding technique are presented. In Section 3, we introduce Personalized Locality Sensitive Hashing (PLSH). The implementation of PLSH using MapReduce is shown in Section 4. In Section 5, we present and analyze the experimental results. We survey the related works in Section 6. Finally, the conclusion is given in Section 7.
2. Background
In this section, we briefly introduce the minhashing algorithm and the consequent banding algorithm, which are the fundamental blocks of Locality Sensitive Hashing (LSH). The intuition of minhashing is to generate low dimensional signatures to represent high dimensional features. The intuition of banding is to filter candidates which are not likely to be similar pairs.
2.1. MinHashing
For large scale data sets, the feature space is usually high dimensional and very sparse; that is, only a tiny portion of features appear in a single instance. In order to reduce the memory used to store such sparse vectors, we use a signature, an integer vector consisting of up to several hundred elements, to represent an instance. To generate a signature, we first randomly change the order of features. In other words, the permutation defines a hash function $\pi$ that shuffles the features. Each element of the signature is a minhash value [17], which is the position of the first nonzero feature in the permuted feature vector. For example, the original feature order in Table 1 is $(a, b, c, d, e, f)$; suppose the permuted feature order is $(b, a, c, d, f, e)$; then the feature vectors for $A$, $B$, $C$, $D$, and $E$ become $(1\ 0\ 0\ 0\ 0\ 1)$, $(0\ 1\ 0\ 0\ 1\ 1)$, $(1\ 0\ 0\ 1\ 0\ 1)$, $(0\ 0\ 1\ 1\ 0\ 0)$, and $(0\ 0\ 0\ 1\ 1\ 0)$, as illustrated in Table 2. Thus the minhash values for $A$, $B$, $C$, $D$, and $E$ are 1, 2, 1, 3, and 4, respectively.
Table 2: An illustrative example of permutation of feature vectors. 0/1 indicates absence/presence of features in each instance.
Instance | b | a | c | d | f | e
A | 1 | 0 | 0 | 0 | 0 | 1 |
B | 0 | 1 | 0 | 0 | 1 | 1 |
C | 1 | 0 | 0 | 1 | 0 | 1 |
D | 0 | 0 | 1 | 1 | 0 | 0 |
E | 0 | 0 | 0 | 1 | 1 | 0 |
We can choose $n$ independent permutations $\pi_1, \pi_2, \ldots, \pi_n$. Suppose the minhash value of an instance $x$ for a certain permutation $\pi_i$ is denoted by $h_{\pi_i}(x)$; then the signature of $x$, denoted by $sig(x)$, is
$$sig(x) = (h_{\pi_1}(x), h_{\pi_2}(x), \ldots, h_{\pi_n}(x)).$$
The approximate similarity between two instances based on their signatures is defined as the fraction of positions at which the two signatures hold identical values. For example, given $sig(x) = (1, 2, 1, 3)$ and $sig(y) = (1, 2, 3, 3)$ with $n = 4$, the approximate Jaccard similarity is $3/4$.
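As a concrete illustration of minhashing and this approximate similarity, the following Python sketch builds signatures from explicit random permutations. It is a minimal sketch for exposition (the function names and the use of random.sample are our assumptions, not the paper's implementation); Section 4.2 describes how the paper simulates permutations with hash functions instead.

```python
# Illustrative minhash signatures over the feature alphabet of Table 1.
import random

def minhash_signature(features, permutations):
    """One minhash value per permutation: the 1-based position of the first
    feature of the instance in the permuted feature order."""
    return [min(perm.index(f) + 1 for f in features) for perm in permutations]

def approx_jaccard(sig_x, sig_y):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)

feature_order = list("abcdef")
random.seed(0)
perms = [random.sample(feature_order, k=6) for _ in range(100)]  # n = 100

sig_a = minhash_signature({"b", "e"}, perms)       # instance A of Table 1
sig_c = minhash_signature({"b", "d", "e"}, perms)  # instance C of Table 1
print(approx_jaccard(sig_a, sig_c))  # close to the true Jaccard similarity 2/3
```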
2.2. Banding
Given the large set of signatures generated in Section 2.1, it is still too costly to compare the similarities of all signature pairs. A banding technique is therefore applied to filter dissimilar pairs.
The banding technique divides each signature into $b$ bands, where each band consists of $r$ elements. For each band of every signature, the banding technique maps the vector of $r$ elements to a bucket array.
As shown in Figure 1, the $i$th band of each signature maps to bucket array $i$. Intuitively, if a pair of signatures share at least one common bucket in some bucket array, then the pair is likely to be similar; for example, signature 1 and signature 2 in Figure 1 share common buckets. Such a pair with a common bucket is considered to be a candidate pair and needs to be verified in the banding technique.
Figure 1: An illustrative example of banding technique.
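A compact sketch of the banding step (the bucket-array size and the use of Python's built-in hash are assumptions for illustration):

```python
# Each band of r consecutive signature elements is hashed into its own
# bucket array; a pair is an LSH candidate if any band lands in the same bucket.
def bands_to_buckets(signature, r, num_buckets=10007):
    """Return a (bucket array id, bucket id) pair for each band."""
    assert len(signature) % r == 0
    buckets = []
    for i in range(0, len(signature), r):
        band = tuple(signature[i:i + r])
        buckets.append((i // r, hash(band) % num_buckets))
    return buckets

def is_candidate(sig_x, sig_y, r):
    """Original LSH banding: at least one common bucket."""
    return any(bx == by for bx, by in zip(bands_to_buckets(sig_x, r),
                                          bands_to_buckets(sig_y, r)))
```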
3. Personalized LSH
3.1. New Banding Technique
The candidates generated by LSH are not guaranteed to be similar pairs. Chances are that a pair of signatures is projected to identical bucket arrays even if the Jaccard similarity between the pair of instances is not larger than the given threshold. In the meantime, a pair of instances can be filtered out from the candidates because their corresponding signatures are projected into disjoint bucket arrays, even if their Jaccard similarity is larger than the given threshold. The former case is called false positive, while the latter one is called false negative. Massive false positives deteriorate the computational efficiency of LSH, since every candidate must be verified, while a large number of false negatives lead to inaccurate results, since truly similar pairs are lost. To enhance the precision and efficiency of the algorithm, we present a new banding scheme to filter more dissimilar instance pairs. Intuitively, if two instances are highly alike, many bands of the two corresponding signatures are likely to be mapped to identical buckets. For example, in Figure 1, there are at least 3 bands (i.e., the 1st, the 5th, and the $b$th bands) of signature 1 and signature 2 which map to the same buckets (i.e., in the corresponding bucket arrays 1, 5, and $b$).
Therefore, we change the banding scheme as follows. For any pair of instances, if the two corresponding signatures do not map into at least $k$ identical buckets, the pair is filtered out. Otherwise, it is considered to be a candidate pair, and the exact Jaccard similarity is computed and verified. For the signatures shown in Figure 1, given $k = 3$, any pair whose signatures share fewer than 3 identical buckets is filtered.
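The PLSH filter then differs from the sketch above only in requiring at least $k$ matching buckets; again, the helper name and bucket hash are illustrative assumptions:

```python
def band_buckets(signature, r, num_buckets=10007):
    """(bucket array id, bucket id) for each band of r signature elements."""
    return [(i // r, hash(tuple(signature[i:i + r])) % num_buckets)
            for i in range(0, len(signature), r)]

def is_plsh_candidate(sig_x, sig_y, r, k):
    """PLSH keeps a pair only if at least k bands map to identical buckets."""
    shared = sum(bx == by for bx, by in zip(band_buckets(sig_x, r),
                                            band_buckets(sig_y, r)))
    return shared >= k

# k = 1 recovers the original LSH banding; larger k prunes more aggressively.
print(is_plsh_candidate([1, 2, 11, 3, 4, 23], [2, 2, 13, 3, 4, 23], r=3, k=1))  # True
print(is_plsh_candidate([1, 2, 11, 3, 4, 23], [2, 2, 13, 3, 4, 23], r=3, k=2))  # False, barring a hash collision
```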
3.2. Number of False Positives
A candidate pair $(x, y)$ is a false positive if $J(x, y) < \epsilon$ and the signatures of $x$ and $y$ map into at least $k$ identical buckets. Since the efficiency of LSH mainly depends on the number of false positives, and most real applications demand high precision, we first derive the possible number of false positives generated by the new banding technique.
Lemma 1.
The upper bound of the number of false positives generated by the new banding technique equals that of the original LSH, and the lower bound approximates 0.
Proof.
According to the law of large numbers, the probability that the minhash values of two feature vectors (e.g., $x$ and $y$) are equal under a random permutation $\pi$ is very close to the frequency with which identical values are observed at the same position in two long signatures of the corresponding feature vectors. That is,
$$P(h_\pi(x) = h_\pi(y)) \approx \frac{|\{i : sig(x)_i = sig(y)_i\}|}{n}, \quad (2)$$
where $n$ is the length of the signatures $sig(x)$ and $sig(y)$ and $i$ is a position in the signatures, $1 \le i \le n$.
Also, the probability that a random permutation of two feature vectors produces the same minhash value equals the Jaccard similarity of those instances [17]. That is,
$$P(h_\pi(x) = h_\pi(y)) = J(x, y). \quad (3)$$
Based on the above two equations, the probability that two instances with Jaccard similarity $s$ are considered to be a candidate pair by the new banding technique, denoted by $P(s, k)$, is
$$P(s, k) = \sum_{i=k}^{b} \binom{b}{i} (s^r)^i (1 - s^r)^{b-i},$$
where $s$ is the Jaccard similarity of the two instances, $r$ is the length of each band, and $b$ is the number of bands. We can prove that the derivative of $P(s, k)$ with respect to $s$ is greater than 0, which means that $P(s, k)$ is a monotonically increasing function of $s$.
The number of false positives, denoted by $FP(k)$, is
$$FP(k) = \int_{0}^{\epsilon} N(s) P(s, k) \, ds,$$
where $N(s)$ denotes the total number of pairs whose Jaccard similarity is $s$ in the instance set. Given an instance set, $N(s)$ is a constant, and $\epsilon$ is the given similarity threshold.
The value of $FP(k)$ depends on the similarity distribution of the given instance set. The upper bound of $FP(k)$ equals that of the original LSH, $FP(1)$. Without knowledge of the similarity distribution of the data set, the lower bound of false positives cannot be directly derived. Hence, we introduce a threshold $\theta_1$ to ensure
$$\frac{FP(k)}{FP(1)} \le \theta_1,$$
where $\theta_1$ is close to zero for increasing $k$. If $k$ is $b$, the lower bound of false positives approximates 0, which indicates that the candidates generated by the proposed new banding technique are almost all truly similar pairs.
To understand the zero lower bound with $k = b$, suppose two signatures with $n$ elements each have all $b$ bands mapped to the same buckets. At least $b \times r = n$ elements in the two signatures are identical, because each band includes $r$ elements. According to (2) and (3), the approximate similarity between the two corresponding instances is then $n/n = 1$. Hence, the similarity for each such pair of signatures is greater than the threshold $\epsilon$, and no false positives exist.
The introduction of $\theta_1$ also enables us to personalize the number of false positives, that is, to vary the range of $k$ for different $\theta_1$. The range of $k$ for a desired $\theta_1$ can be solved numerically. For example, given fixed $b$, $r$, and $\epsilon$, Figure 2 shows the trend of the number of false positives for $1 \le k \le b$: the minimum is achieved when $k = b$, and for a desired $\theta_1$ the satisfying range of $k$ is the interval over which the curve stays below $\theta_1 \cdot FP(1)$.
Figure 2: An illustrative example of the number of false positives for various $k$.
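The following sketch evaluates the candidate probability $P(s, k)$ numerically and scans $k$ for the false-positive ratio $\theta_1$; the parameter values and the assumed uniform similarity distribution $N(s)$ are illustrative, since the true $N(s)$ is data dependent:

```python
# Numeric sketch of P(s, k) and theta_1 = FP(k)/FP(1) for choosing k.
from math import comb

def p_candidate(s, k, b, r):
    """Probability that at least k of b bands agree for a pair with Jaccard
    similarity s; each band of r minhash values agrees with probability s**r."""
    p_band = s ** r
    return sum(comb(b, i) * p_band**i * (1 - p_band)**(b - i)
               for i in range(k, b + 1))

def fp_ratio(k, b, r, eps, steps=1000):
    """theta_1 = FP(k)/FP(1) under an assumed uniform N(s) over [0, eps)."""
    grid = [(i + 0.5) * eps / steps for i in range(steps)]
    fp = lambda kk: sum(p_candidate(s, kk, b, r) for s in grid)
    return fp(k) / fp(1)

# Example scan: the ratio shrinks toward 0 as k grows (b, r, eps are examples).
b, r, eps = 20, 5, 0.7
for k in range(1, 6):
    print(k, round(fp_ratio(k, b, r, eps), 4))
```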
3.3. Number of False Negatives
False negatives are truly similar pairs mapped to disjoint bucket arrays. We also derive the upper and lower bounds of the number of false negatives generated by the proposed new banding technique.
Lemma 2.
The upper bound of the number of false negatives generated by the new banding technique is $FN(b)$, which is proportional to the number of truly similar pairs; the lower bound is close to that of the original LSH.
Proof.
Similar to Section 3.2, the number of false negatives, denoted by $FN(k)$, is
$$FN(k) = \int_{\epsilon}^{1} N(s) (1 - P(s, k)) \, ds.$$
$FN(k)$ is a monotonically increasing function of $k$. Its lower bound is achieved when $k = 1$, which is the original LSH. The upper bound of $FN(k)$ is obtained when $k = b$, the total number of bands. Hence, the upper bound of $FN(k)$ is proportional to the number of similar instance pairs:
$$FN(b) = \int_{\epsilon}^{1} N(s) (1 - (s^r)^b) \, ds \le \int_{\epsilon}^{1} N(s) \, ds.$$
For a desired number of false negatives, we consider the ratio between $FN(k)$ and $FN(1)$:
$$\frac{FN(k)}{FN(1)} \le \theta_2,$$
where $\theta_2$ is a threshold which is always greater than 1 for $k > 1$. By solving this inequality numerically, the range of $k$ for a desired $\theta_2$ is obtained. For example, given fixed $b$, $r$, and $\epsilon$, Figure 3 shows the trend of the ratio for varying $k$; the satisfying range of $k$ is the interval over which the curve stays below the desired $\theta_2$.
Figure 3: An illustrative example of the number of false negatives for various $k$.
3.4. Balance False Positives and False Negatives
In some application scenarios, we want to balance false positives and false negatives. Here we analyze a special case where we desire a bounded aggregated number of false positives and false negatives. We use $E(k)$ to denote the sum of false positives and false negatives, which is defined as
$$E(k) = FP(k) + FN(k).$$
The lower bound of $E(k)$ depends on the similarity distribution of the given data set. However, since in most data sets dissimilar pairs far outnumber truly similar pairs, $E(k)$ is less than $E(1)$ when $k$ is appropriately chosen.
Inspired by Sections 3.2 and 3.3, we can also use a threshold $\theta_3$, with $E(k)/E(1) \le \theta_3$, to obtain the desired degree of overall precision. As shown in Figure 4, the ratio $E(k)/E(1)$ for $1 \le k \le b$ on a uniformly distributed data set first decreases as the value of $k$ increases, reaches its minimum, and then increases as $k$ becomes larger. If the new banding technique is required to have higher precision than the traditional banding technique in terms of the aggregated number of false negatives and false positives (i.e., $\theta_3 < 1$), an intermediate value of $k$ is appropriate.
Figure 4: An illustrative example of balancing false positives and false negatives.
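A companion sketch for Sections 3.3 and 3.4 computes $\theta_2 = FN(k)/FN(1)$ and $\theta_3 = E(k)/E(1)$ under the same assumed uniform similarity distribution; the parameter values are again illustrative:

```python
from math import comb

def p_candidate(s, k, b, r):  # as in the sketch of Section 3.2
    p = s ** r
    return sum(comb(b, i) * p**i * (1 - p)**(b - i) for i in range(k, b + 1))

def error_ratios(k, b=20, r=5, eps=0.7, steps=1000):
    """theta_2 = FN(k)/FN(1) and theta_3 = E(k)/E(1) under an assumed
    uniform similarity distribution N(s) = const."""
    w_lo, w_hi = eps / steps, (1 - eps) / steps          # integration weights
    lo = [(i + 0.5) * eps / steps for i in range(steps)]              # s < eps
    hi = [eps + (i + 0.5) * (1 - eps) / steps for i in range(steps)]  # s >= eps
    fp = lambda kk: w_lo * sum(p_candidate(s, kk, b, r) for s in lo)
    fn = lambda kk: w_hi * sum(1 - p_candidate(s, kk, b, r) for s in hi)
    return fn(k) / fn(1), (fp(k) + fn(k)) / (fp(1) + fn(1))

for k in range(1, 6):
    theta2, theta3 = error_ratios(k)
    print(k, round(theta2, 2), round(theta3, 3))
```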
4. MapReduce Implementation of PLSH
In this section, we first introduce the MapReduce framework. Then we present the details of implementing Personalized LSH with MapReduce, including minhashing, banding, and verification.
4.1. MapReduce
MapReduce [26] is a framework for running parallelized algorithms on large scale data sets using a cluster of computers. MapReduce allows for distributed processing of data, which is partitioned and stored in a distributed file system (HDFS). Data is stored in the form of $\langle key, value \rangle$ pairs to facilitate computation.
As illustrated in Figure 5, the MapReduce data flow consists of two key phases: the map phase and the reduce phase. In the map phase, each computing node works on its local input data and processes the input $\langle key, value \rangle$ pairs into a list of intermediate $\langle key, value \rangle$ pairs in a different domain. The pairs generated in the map phase are hash-partitioned and sorted by key, and then sent across the computing cluster in a shuffle phase. In the reduce phase, pairs with the same key are passed to the same reduce task, where user-provided functions are applied on each key to produce the desired output.
Figure 5: An illustrative example of MapReduce data flow, with 2 reduce tasks.
In similarity joins, to generate $\langle key, value \rangle$ pairs, we first segment the join attributes of each instance into tokens. Each token is denoted by a unique integer id; in the following steps, the token id is used to represent each feature.
4.2. MinHashing
We use one MapReduce job to implement minhashing. Before the map function is called, $\langle$instance id, join attribute$\rangle$ pairs are loaded. In the map task, each instance is represented by the set of token ids present in this instance.
In the reduce task, for each instance, the reducer produces a signature of length $n$. As described in Section 2, the minhashing function requires random permutations of features, but it is not feasible to permute massive features explicitly. Instead, a random hash function is adopted to simulate this task. Suppose the total number of features is $m$ and the features are numbered by the integer set $\{0, 1, \ldots, m-1\}$; we choose the hash function
$$h(x) = (ax + c) \bmod m,$$
where $a$ and $m$ must be relatively prime and $\bmod$ obtains the remainder of $ax + c$ divided by $m$. This function maps a number $x$ to another number $h(x)$ with no collision. Hence the result list $(h(0), h(1), \ldots, h(m-1))$ is a permutation of the original features.
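The following sketch illustrates the simulated permutation described above; the specific values of $a$ and $c$ and the helper names are assumed for illustration.

```python
# Simulating a random permutation with h(x) = (a*x + c) mod m, gcd(a, m) = 1.
from math import gcd

def hash_perm(m, a, c):
    assert gcd(a, m) == 1, "a and m must be relatively prime for a bijection"
    return lambda x: (a * x + c) % m

def minhash(token_ids, h):
    """Minhash value of an instance: minimum permuted position of its tokens."""
    return min(h(t) for t in token_ids)

m = 6                       # total number of features (example value)
h = hash_perm(m, a=5, c=2)  # one simulated permutation of {0, ..., 5}
print(sorted(h(x) for x in range(m)))  # [0, 1, 2, 3, 4, 5]: a true permutation
print(minhash({1, 4}, h))              # 1: minhash of an instance with tokens 1, 4
```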
Furthermore, since $n$ independent permutations are required to produce a signature for each instance, we prove that this family of hash functions yields more than $n$ different permutations.
Lemma 3.
Given $m$ features and a desired signature length $n$, the hash function $h(x) = (ax + c) \bmod m$, where $0 \le c < m$ and $a$ is relatively prime to $m$, produces more than $n$ different permutations.
Proof.
Assume a permutation $(h(0), h(1), \ldots, h(m-1))$ is generated by the hash function $h(x) = (ax + c) \bmod m$ with parameters $a$ and $c$; then $h(0) = c$, $h(1) = (a + c) \bmod m$, and so on. Permutations with different $c$ differ at position 0, and permutations with the same $c$ but different $a$ differ at position 1; hence distinct pairs $(a, c)$ produce distinct permutations. Euler's totient function $\varphi(m)$ is an arithmetic function that counts the number of totatives of the integer $m$, which is exactly the number of valid choices of $a$. Therefore, there are $m\varphi(m)$ pairs $(a, c)$, which produce $m\varphi(m)$ different permutations. Since $m\varphi(m) \gg n$ for the large $m$ typical of high dimensional data, the hash function $h(x)$ produces more than $n$ different permutations.
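A quick numeric check of Lemma 3 for a small $m$ (the value $m = 12$ and the brute-force enumeration are ours, purely for illustration):

```python
# Counting distinct permutations produced by h(x) = (a*x + c) mod m
# with gcd(a, m) = 1: the count equals m * phi(m).
from math import gcd

m = 12
perms = {tuple((a * x + c) % m for x in range(m))
         for a in range(1, m) if gcd(a, m) == 1
         for c in range(m)}
phi = sum(1 for a in range(1, m + 1) if gcd(a, m) == 1)
print(len(perms), m * phi)  # both 48 for m = 12, far more than a typical n
```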
4.3. Banding
The banding technique filters dissimilar pairs. As shown in Figure 6, we implement the banding procedure in two MapReduce phases.
Figure 6: An illustrative example of data flow in banding stage.
In the first phase, the signatures are input to the map function. The map function divides each signature into $b$ bands, each band consisting of $r$ elements, and then maps each band to a bucket. The outputs are in the form of $\langle$(bucket array id, bucket id), instance id$\rangle$; in other words, the bucket array id and the bucket id are combined as the key, and the instance id is assigned to the corresponding value. For example, as shown in Figure 6, suppose the signature of instance 1 is $(1\ 2\ 11\ 3\ 4\ 23\ \ldots)$, the signature of instance 2 is $(2\ 2\ 13\ 3\ 4\ 23\ \ldots)$, and $r = 3$; then instance 1 is divided into at least 2 bands, $(1\ 2\ 11)$ and $(3\ 4\ 23)$. Suppose these two bands are mapped to bucket 5 in bucket array 1 and bucket 54 in bucket array 2; then the outputs of the map function for instance 1 include $\langle(1, 5), 1\rangle$ and $\langle(2, 54), 1\rangle$. Analogously, supposing the first band of instance 2 falls into the same bucket, $\langle(1, 5), 2\rangle$ and $\langle(2, 54), 2\rangle$ are part of the map outputs for instance 2.
In the reduce task, all instances with the same bucket array id and bucket id are assigned to the same reduce task. An output in the form of $\langle(id_1, id_2), 1\rangle$ is produced for every pair of instances in the same bucket, where the fixed value 1 represents one occurrence of the pair $(id_1, id_2)$. For instance, $\langle(1, 5), 1\rangle$ and $\langle(1, 5), 2\rangle$ are among the aforementioned map outputs for instances 1 and 2; since they share the same bucket array id 1 and bucket id 5, the reduce task produces the pair $\langle(1, 2), 1\rangle$. That is to say, instance 1 and instance 2 are likely to be a candidate pair because their first bands are both mapped to the 5th bucket.
In the second phase, the map task outputs what was produced in the first phase. To minimize the network traffic between the map and reduce functions, we use a combine function to aggregate the outputs generated by the map function into partial local counts on each computing node. Subsequently, the reduce function computes the total counts for each instance pair. Outputs of the reduce function are in the form of $\langle(id_1, id_2), count\rangle$ pairs, where $count$ is the global frequency for the instance pair. Personalized LSH eliminates those pairs of instances whose $count$ is less than the given threshold $k$. As shown in Figure 6, suppose $k = 2$; the count of the instance pair $(1, 2)$ is not less than $k$, since both of its bands map to common buckets, so $(1, 2)$ is a candidate pair.
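The two banding phases can be sketched in pure Python as follows; the real implementation runs as Hadoop jobs, and the function names, bucket hashing, and in-memory grouping here are illustrative assumptions.

```python
# Pure-Python simulation of the two banding phases.
from collections import defaultdict

def banding_phase1(signatures, r, num_buckets=10007):
    """Map: ((bucket array, bucket id), instance). Reduce: emit each pair of
    instances sharing a bucket, with count 1."""
    groups = defaultdict(list)
    for inst, sig in signatures.items():
        for i in range(0, len(sig), r):
            band = tuple(sig[i:i + r])
            groups[(i // r, hash(band) % num_buckets)].append(inst)
    for insts in groups.values():
        insts.sort()
        for a in range(len(insts)):
            for b in range(a + 1, len(insts)):
                yield (insts[a], insts[b]), 1

def banding_phase2(pairs, k):
    """Reduce: sum per-pair counts; keep pairs with at least k shared buckets."""
    counts = defaultdict(int)
    for pair, one in pairs:
        counts[pair] += one
    return [pair for pair, c in counts.items() if c >= k]

sigs = {1: [1, 2, 11, 3, 4, 23], 2: [2, 2, 13, 3, 4, 23]}
print(banding_phase2(banding_phase1(sigs, r=3), k=1))  # [(1, 2)]
```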
4.4. Verification
Candidate pairs generated in Section 4.3 need to be checked in the verification stage. For each candidate instance pair $(id_1, id_2)$, the signature similarity is computed. Because of the massive number of instances, this is not a trivial task.
In order to reduce the storage cost for each reduce task, the set of signatures for all instances is partitioned into small files according to instance ids. In this way, each reduce task holds only two small partitions, where the first partition contains the signature of the first instance in each pair $(id_1, id_2)$ and the second partition contains the signature of the second instance. All pairs of instance ids contained in the same partitions have to be assigned to the same reduce task. Hence, the map task calculates the reduce task id for each pair according to its instance ids and produces an output $\langle$reduce task id, $(id_1, id_2)\rangle$. In the reduce task, the signature similarity for each pair of instances is computed. Finally, the reduce task outputs the pairs whose similarities are greater than the given threshold $\epsilon$. The outputs are in the form of $\langle(id_1, id_2), sim\rangle$, where $sim$ is the approximate Jaccard similarity of $id_1$ and $id_2$. As shown in Figure 7, suppose the given threshold is $\epsilon = 0.7$; the similarity between signature 2 and signature 5 is 0.73, which is greater than $\epsilon$, so the output is $\langle(2, 5), 0.73\rangle$.
Figure 7: An illustrative example of the data flow in verification stage.
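A minimal sketch of the verification logic (names are assumptions; the actual job partitions signatures across reduce tasks as described above):

```python
# Compute signature similarity for each candidate pair and keep those
# whose similarity exceeds the threshold.
def verify(candidates, signatures, eps):
    results = []
    for x, y in candidates:
        sx, sy = signatures[x], signatures[y]
        sim = sum(a == b for a, b in zip(sx, sy)) / len(sx)
        if sim > eps:
            results.append(((x, y), round(sim, 2)))
    return results

# e.g., verify([(2, 5)], signatures, eps=0.7) would emit ((2, 5), 0.73)
# if signatures 2 and 5 agree on 73% of their positions.
```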
5. Experiment Evaluation
In this section, we design a series of experiments to analyze the performance of the proposed PLSH algorithms. We want to study the following research questions.
(i) The efficiency of PLSH: is PLSH fast compared with other similarity join algorithms? Can it scale to large data sets? How do different parameter values affect the efficiency of PLSH?
(ii) The effectiveness of PLSH: is PLSH accurate? Does it generate few false positives and false negatives?
(iii) The personalization of PLSH: how should we set the parameters of PLSH to generate the desired numbers of false positives and false negatives? Is the tailored PLSH more appropriate for different applications?
The experiments are conducted on a 6-node cluster. Each node has one Intel i7-3820 3.6 GHz processor with four cores, 32 GB of RAM, and a 100 GB hard disk. On each node, we install Ubuntu 12.04 64-bit server edition, Java 1.6 with a 64-bit server JVM, and Hadoop 1.0; Apache Hadoop is an open source implementation of MapReduce. We run 2 reduce tasks in parallel on each node.
We use the DBLP and CITESEERX datasets and increase the number of instances when needed. The original DBLP dataset has approximately 1.2 M instances, while the original CITESEERX dataset has about 1.3 M instances. As shown in Figure 8, when the number of instances in these two data sets is increased, CITESEERX occupies more storage than DBLP.
Figure 8: Dataset sizes of DBLP and CITESEERX.
In our experiments, we tokenize the data by word. The concatenation of the paper title and the list of authors is the join attribute (i.e., paper title and the list of authors are two attributes of each instance). The default Jaccard similarity threshold is $\epsilon = 0.7$. The default length of the hash signature $n$ is 100, and each signature is divided into bands of $r$ elements each.
5.1. Efficiency of PLSH
The ground truth is the set of truly similar pairs generated by fuzzyjoin [12], a 3-stage MapReduce implementation of an end-to-end set-similarity join algorithm that selects instance pairs with Jaccard similarities greater than the given threshold.
We first compare the efficiency of our method PLSH with fuzzyjoin on the DBLP and CITESEERX data sets. The CPU times of the different algorithms are shown in Figure 9. We can conclude from Figure 9 that (1) PLSH is generally faster than fuzzyjoin; when the number of instances is 7.5 M in DBLP and CITESEERX, the time cost of fuzzyjoin is nearly twice that of PLSH; (2) when the data size increases, the efficiency improvement is more significant, suggesting that PLSH is more scalable to large data sets; and (3) fuzzyjoin takes roughly equivalent CPU time on DBLP and CITESEERX of similar size, while PLSH works faster on DBLP than on CITESEERX, suggesting that PLSH is more affected by the similarity distribution in a data set.
Figure 9: CPU time for various data sizes.
Next we analyze the effect of the number of reduce tasks on algorithm efficiency. Because the number of reduce tasks in the verification step differs from the previous steps, we record the CPU time of the first three stages (minhashing, banding1, and banding2; note that the banding procedure is completed in two MapReduce phases) and of the verification step separately. In the three stages, we vary the number of reduce tasks from 2 to 12 with a step size of 2. The data sets are DBLP datasets containing 0.6 M, 1.2 M, 2.5 M, 5 M, and 7.5 M instances, respectively.
From Figure 10 we have the following observations. (1) In general, the time cost of PLSH is reduced with more reduce tasks. This verifies our assumption that the parallel mechanism improves algorithm efficiency. (2) When there are fewer than 2.5 M instances, CPU time decreases slowly with an increasing number of reduce tasks. This is because the start-up time of MapReduce cannot be shortened even though the number of tasks increases; when the CPU time exceeds 2000 seconds, the start-up time is only a small fraction of the total. Therefore, when the number of instances is relatively large, the improvement is more significant, which suggests that the parallel mechanism of PLSH is more suitable for large scale data sets.
Figure 10: CPU time of stages previous to verification for different data sizes with various reduce task number.
We further study the speedup with respect to the number of reduce tasks. We set the time cost with 2 reduce tasks as the baseline and plot the ratio between the running time and the baseline for various numbers of reduce tasks. From Figure 11, we observe that the speedup is more significant when the size of the dataset increases. But none of the curves is a straight line, which suggests that the speedup is not linear in the number of reduce tasks. The limited speedup is due to two main reasons. (1) There are 5 jobs to find the similar instances, and the start-up time of each job cannot be shortened. (2) As the dataset grows, more data is sent through the network and more data is merged and reduced; thus the traffic cost cannot be reduced.
Figure 11: Relative CPU time for different data sizes with various reduce task number.
The number of reduce tasks is fixed in the verification stage. However, in this stage, the parameter $k$ of the proposed PLSH has a significant effect on the running time. We analyze the performance of PLSH with various values of $k$. From Figure 12, we have the following observations. (1) In general, as the value of $k$ increases, the time cost of the verification stage decreases. This suggests that, with a larger $k$, the proposed PLSH prunes more dissimilar pairs and enhances efficiency. (2) When the data set is small, that is, fewer than 2.5 M instances, the enhancement is not obvious. The underlying reason is that there are fewer candidates in a small data set. (3) The speedup is significant on large data sets (more than 5 M instances), which suggests the scalability of PLSH.
Figure 12: Verification time with various $k$.
5.2. Effectiveness Evaluation
In this subsection, we study the effectiveness of PLSH and how the length of the signature, the similarity threshold, and the parameter $k$ affect it.
We first define four types of instances. False positives (FP) are dissimilar pairs (similarity lower than the threshold) that are selected as candidates by PLSH. True positives (TP) are similar pairs (similarity higher than the threshold) that are selected as candidates by PLSH. False negatives (FN) are similar pairs (similarity higher than the threshold) that are pruned by PLSH. True Negatives (TN) are dissimilar pairs (similarity lower than the threshold) that are pruned by PLSH.
The numbers of FP with various signature lengths and $k$, with similarity threshold $\epsilon = 0.7$, are shown in Figure 13. We have the following observations. (1) For a fixed signature length, the number of false positives significantly decreases with larger $k$. (2) For larger $k$, prolonging the signatures results in fewer false positives. (3) For small $k$ and shorter signatures (fewer than 400 elements), the number of false positives fluctuates. These three observations indicate that the proposed PLSH can achieve high precision with larger $k$ and longer signatures.
Figure 13: The number of false positives with various signature lengths and $k$.
The numbers of FP with various similarity thresholds and $k$, with a fixed signature length, are shown in Figure 14. We have the following observations. (1) For a fixed similarity threshold, the number of false positives significantly decreases with larger $k$. (2) For a fixed $k$, the number of false positives significantly decreases with a larger similarity threshold, because with a larger similarity threshold there are fewer qualified candidates.
Figure 14: The number of false positives with various similarity thresholds and $k$.
The numbers of FN with various signature lengths and $k$, with similarity threshold $\epsilon = 0.7$, are shown in Figure 15. We have the following observations. (1) For a fixed signature length, the number of false negatives generated by the original LSH ($k = 1$) is smaller than that of PLSH. (2) For the original LSH, shorter signatures generate more false negatives. (3) In terms of false negatives, PLSH is not very sensitive to the length of the signature.
Figure 15: The number of false negatives with various signature lengths and $k$.
The numbers of FN with various similarity thresholds and $k$, with a fixed signature length, are shown in Figure 16. Although the number of false negatives for PLSH is larger than that of the original LSH, we observe that for large similarity thresholds the difference is minute. In most applications we search for very similar pairs; thus PLSH will still perform well in most scenarios.
Figure 16: The number of false negatives with various similarity thresholds and $k$.
Since PLSH generally produces more false negatives than LSH, we further use precision and specificity to evaluate its performance. Precision is defined as $TP/(TP + FP)$, and specificity is defined as $TN/(TN + FP)$. In this subsection, we analyze the effect of $k$, $n$, and $\epsilon$ on precision and specificity. We vary $k$ from 1 to 5 with a step size of 1, $n$ from 200 to 500 with a step size of 100, and $\epsilon$ from 0.7 to 0.9 with a step size of 0.1.
From Figures 17 to 20, we have the following observations. (1) The PLSH method performs better than LSH in terms of specificity and precision, which demonstrates the potential of PLSH in many data mining and bioinformatics applications. (2) Generally, longer signatures outperform shorter signatures.
Figure 17: Specificity with various $n$ and $k$.
Figure 18: Specificity with various $\epsilon$ and $k$.
Figure 19: Precision with various $n$ and $k$.
Figure 20: Precision with various $\epsilon$ and $k$.
5.3. Personalized LSH
As proved in Section 3, an important characteristic of our proposed PLSH is that it is capable of tailoring the numbers of false positives and false negatives for different applications. In this subsection, we present a pilot study on the tailoring capability of PLSH. We first numerically derive the appropriate $k$ for different degrees of desired precision. As shown in Table 3, the required precision is measured in terms of the ratio of false positives versus conventional LSH ($\theta_1$), the ratio of false negatives versus conventional LSH ($\theta_2$), and the ratio of total errors (the sum of false positives and false negatives) versus conventional LSH ($\theta_3$). For example, we can see that if we want to generate less than half of the false positives of conventional LSH ($\theta_1 \le 0.5$), we should set $k \ge 3$.
Table 3: Recommendation performance with different [figure omitted; refer to PDF] .
$k$ | $\theta_1$ | $\theta_2$ | $\theta_3$ | P@5 DBLP | P@5 CITESEERX
1 | 1 | 7 | 1 | 0.71 | 0.82 |
2 | 0.9 | 20 | 0.9 | 0.83 | 0.90 |
3 | 0.5 | 42 | 0.5 | 0.91 | 0.99 |
4 | 0.2 | 71 | 0.3 | 0.99 | 0.93 |
5 | 0.1 | 103 | 0.4 | 0.97 | 0.84 |
We then use different settings of $k$ to generate collaborator recommendations on the DBLP and CITESEERX data sets. We keep the authors who have published more than 25 papers. We use a modified collaborative filtering method [25] to generate recommendations. For each paper $p$ published by author $u$, PLSH is utilized to find the set of similar publications with similarity greater than $\epsilon$. We then gather the set of authors of these similar publications. In recommendation systems, these similar authors are treated as nearest neighbors. Each nearest neighbor is assigned a score, which is the accumulated similarity between the neighbor's publications and the publications of author $u$. For each author, the top 5 nearest neighbors with the largest accumulated similarities are returned as recommended collaborators; see the sketch below. We manually annotate the results. The evaluation metric is precision at the top 5 results (P@5). The average P@5 results are shown in Table 3. We have the following conclusions. (1) When the number of false positives is reduced, we generate more precise nearest neighbors, and the recommendation performance is boosted. (2) When the number of false positives is too small, then due to the sparsity problem, the collaborative filtering framework is not able to generate enough nearest neighbors, and the recommendation performance deteriorates. (3) A recommendation system achieves its best performance with a fine-tuned parameter $k$, which optimizes the trade-off between false positives and false negatives.
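The neighbor-scoring step can be sketched as follows, assuming hypothetical dictionaries author_papers, similar_pairs (the PLSH output), and paper_authors; this is an illustration of the modified collaborative filtering, not the authors' exact code.

```python
# Accumulate similarities of co-authors of similar publications and
# return the top-5 neighbors as recommended collaborators.
from collections import defaultdict

def recommend(author_papers, similar_pairs, paper_authors, author, top_n=5):
    """similar_pairs: {(paper, paper'): similarity} produced by PLSH."""
    scores = defaultdict(float)
    for p in author_papers[author]:
        for (a, b), sim in similar_pairs.items():
            if p in (a, b):
                other = b if a == p else a
                for coauthor in paper_authors[other]:
                    if coauthor != author:
                        scores[coauthor] += sim   # accumulate similarity
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```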
6. Related Work
There are fruitful research works on the similarity joins problem. Commonly adopted similarity measurements include cosine similarity [7], edit distance [8], Jaccard similarity [12], hamming distance [10], and synonym based similarity [27].
The methodologies used in similarity joins can be categorized into two classes. The first class adopts dimension reduction technologies [16], so that the similarity computation is conducted in a lower dimensional space. The second class utilizes filtering approaches to avoid unnecessary similarity computation, including length filter [7], prefix filter [7, 20], suffix filter [21], and positional filter [8]. To take advantage of both categories, a series of hashing methods (such as minhashing [17], minwise hashing [18], and Locality Sensitive Hashing (LSH) [10, 19]) combine dimension reduction and filtering by first using a signature scheme [28] to project the original feature vectors to low dimensional signatures and then filtering unqualified signatures.
Recently, parallel algorithms have been adopted on clusters of machines to improve the efficiency of similarity joins for large scale and high dimensional data sets. We have seen an emerging trend of research efforts in this line, including a 3-stage MapReduce approach for an end-to-end set-similarity join algorithm [12], fast computation of inner products for large scale news articles [13], a new ant colony optimization algorithm parallelized using MapReduce [14] to select features in a high dimensional space, and the LSS algorithm exploiting the high data throughput of GPUs [15].
In the era of big data, approximate computation is also a viable solution when an approximate result obtained at much lower computational cost is close to the exact solution. Approximate computation can be implemented not only through hardware modules [29, 30] but also through approximate algorithms [31, 32]. Unlike exact similarity joins [25], approximate similarity joins and top-$k$ similarity joins [33] are better adapted to different application domains [34].
7. Conclusion
We study the problem of similarity joins based on Jaccard similarity. We propose a new banding technique which can reduce the number of false positives, the number of false negatives, or the sum of both, as required by the application. Our experiments show that the new method is both efficient and effective.
Acknowledgments
Chen Lin is partially supported by China Natural Science Foundation under Grant nos. NSFC61102136 and NSFC61472335, Base Research Project of Shenzhen Bureau of Science, Technology, and Information (JCYJ20120618155655087), CCF-Tencent Open Research Fund under Grant no. CCF-Tencent20130101, and Baidu Open Research under Grant no. Z153283.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
[1] C. Carpineto, S. Osinski, G. Romano, D. Weiss, "A survey of web clustering engines," ACM Computing Surveys , vol. 41, no. 3, article 17, 2009.
[2] R. Saraçoglu, K. Tütüncü, N. Allahverdi, "A fuzzy clustering approach for finding similar documents using a novel similarity measure," Expert Systems with Applications , vol. 33, no. 3, pp. 600-605, 2007.
[3] T. C. Hoad, J. Zobel, "Methods for identifying versioned and plagiarized documents," Journal of the American Society for Information Science and Technology , vol. 54, no. 3, pp. 203-215, 2003.
[4] B. Kulis, K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV '09), pp. 2130-2137, IEEE, Kyoto, Japan, October 2009.
[5] L. Li, L. Zheng, F. Yang, T. Li, "Modeling and broadening temporal user interest in personalized news recommendation," Expert Systems with Applications , vol. 41, no. 7, pp. 3168-3177, 2014.
[6] J. Wang, J. Feng, G. Li, "Efficient trie-based string similarity joins with edit-distance constraints," Proceedings of the VLDB Endowment , vol. 3, no. 1-2, pp. 1219-1230, 2010.
[7] R. J. Bayardo, Y. Ma, R. Srikant, "Scaling up all pairs similarity search," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 131-140, ACM, May 2007.
[8] C. Xiao, W. Wang, X. Lin, "Ed-join: An efficient algorithm for similarity joins with edit distance constraints," in Proceedings of the VLDB Endowment, vol. 1, pp. 933-944, 2008.
[9] G. Li, D. Deng, J. Wang, J. Feng, "Pass-join: a partition-based method for similarity joins," Proceedings of the VLDB Endowment , vol. 5, no. 3, pp. 253-264, 2011.
[10] A. Gionis, P. Indyk, R. Motwani, "Similarity search in high dimensions via hashing," in Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), pp. 518-529, Edinburgh, Scotland, September 1999.
[11] N. X. Bach, N. L. Minh, A. Shimazu, "Exploiting discourse information to identify paraphrases," Expert Systems with Applications , vol. 41, no. 6, pp. 2832-2841, 2014.
[12] R. Vernica, M. J. Carey, C. Li, "Efficient parallel set-similarity joins using MapReduce," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), pp. 495-506, ACM, June 2010.
[13] T. Elsayed, J. Lin, D. W. Oard, "Pairwise document similarity in large collections with mapreduce," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 265-268, Association for Computational Linguistics, Stroudsburg, Pa, USA, June 2008.
[14] M. J. Meena, K. R. Chandran, A. Karthik, A. V. Samuel, "An enhanced ACO algorithm to select features for text categorization and its parallelization," Expert Systems with Applications , vol. 39, no. 5, pp. 5861-5871, 2012.
[15] M. D. Lieberman, J. Sankaranarayanan, H. Samet, "A fast similarity join algorithm using graphics processing units," in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE '08), pp. 1111-1120, Cancun, Mexico, April 2008.
[16] M. Ture, I. Kurt, Z. Akturk, "Comparison of dimension reduction methods using patient satisfaction data," Expert Systems with Applications , vol. 32, no. 2, pp. 422-426, 2007.
[17] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, "Min-wise independent permutations," Journal of Computer and System Sciences , vol. 60, no. 3, pp. 630-659, 2000.
[18] P. Li, C. König, "b-Bit minwise hashing," in Proceedings of the 19th International World Wide Web Conference (WWW '10), pp. 671-680, ACM, April 2010.
[19] C. Chen, S.-J. Horng, C.-P. Huang, "Locality sensitive hashing for sampling-based algorithms in association rule mining," Expert Systems with Applications , vol. 38, no. 10, pp. 12388-12397, 2011.
[20] S. Chaudhuri, V. Ganti, R. Kaushik, "A primitive operator for similarity joins in data cleaning," in Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), pp. 5, IEEE, April 2006.
[21] C. Xiao, W. Wang, X. Lin, J. X. Yu, G. Wang, "Efficient similarity joins for near-duplicate detection," ACM Transactions on Database Systems , vol. 36, no. 3, article 15, 2011.
[22] A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011.
[23] J. P. Lucas, S. Segrera, M. N. Moreno, "Making use of associative classifiers in order to alleviate typical drawbacks in recommender systems," Expert Systems with Applications , vol. 39, no. 1, pp. 1273-1283, 2012.
[24] J. P. Lucas, N. Luz, M. N. Moreno, R. Anacleto, A. A. Figueiredo, C. Martins, "A hybrid recommendation approach for a tourism system," Expert Systems with Applications , vol. 40, no. 9, pp. 3532-3550, 2013.
[25] J. Bobadilla, F. Ortega, A. Hernando, A. Gutierrez, "Recommender systems survey," Knowledge-Based Systems , vol. 46, pp. 109-132, 2013.
[26] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 2012.
[27] A. Arasu, S. Chaudhuri, R. Kaushik, "Transformation-based framework for record matching," in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE '08), pp. 40-49, Cancun, Mexico, April 2008.
[28] A. Arasu, V. Ganti, R. Kaushik, "Efficient exact set-similarity joins," in Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 918-929, VLDB Endowment, September 2006.
[29] W.-T. J. Chan, A. B. Kahng, S. Kang, R. Kumar, J. Sartori, "Statistical analysis and modeling for error composition in approximate computation circuits," in Proceedings of the IEEE 31st International Conference on Computer Design (ICCD '13), pp. 47-53, IEEE, Asheville, NC, USA, October 2013.
[30] J. Huang, J. Lach, G. Robins, "A methodology for energy-quality tradeoff using imprecise hardware," in Proceedings of the 49th Annual Design Automation Conference (DAC '12), pp. 504-509, ACM, New York, NY, USA, June 2012.
[31] M. W. Mahoney, "Approximate computation and implicit regularization for very large-scale data analysis," in Proceedings of the 31st Symposium on Principles of Database Systems(PODS '12), pp. 143-154, ACM, New York, NY, USA, 2012.
[32] J. Strassburg, V. Alexandrov, "On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations," in Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '13), pp. 6:1-6:8, ACM, New York, NY, USA, November 2013.
[33] C. Xiao, W. Wang, X. Lin, H. Shang, "Top-k set similarity joins," in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE '09), pp. 916-927, IEEE, April 2009.
[34] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin, "An efficient filter for approximate membership checking," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '08), pp. 805-817, ACM, June 2008.
Copyright © 2015 Jingjing Wang and Chen Lin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Locality Sensitive Hashing (LSH) has been proposed as an efficient technique for similarity joins for high dimensional data. The efficiency and approximation rate of LSH depend on the number of generated false positive instances and false negative instances. In many domains, reducing the number of false positives is crucial. Furthermore, in some application scenarios, balancing false positives and false negatives is favored. To address these problems, in this paper we propose Personalized Locality Sensitive Hashing (PLSH), where a new banding scheme is embedded to tailor the number of false positives, false negatives, and the sum of both. PLSH is implemented in parallel using MapReduce framework to deal with similarity joins on large scale data. Experimental studies on real and simulated data verify the efficiency and effectiveness of our proposed PLSH technique, compared with state-of-the-art methods.