1. Introduction
The term “Big Data” [1–10] has become a buzzword and is broadly used in both research and industry. Big Data is commonly characterized by the 4 V's, namely Volume, Variety, Veracity and Velocity, and the notion continues to evolve with time. It is closely related to data integration, which aims at combining data of various forms from different sources to provide a consolidated view. Data integration [3,11] can be achieved by using string similarity join [2,3], which retrieves all similar pairs of strings from two given collections of strings. The similarity of two strings is measured with a similarity function, and two families of similarity functions are used for this purpose: character-based similarity functions [12–23] and set-based similarity functions [14–18].
The commonly used set-based similarity functions are defined over the token sets of the two strings:

Dice(r, s) = 2 |r ∩ s| / (|r| + |s|)

where Dice(r, s) refers to the Dice similarity that exists between strings r and s (with r and s read as their token sets).

Jac(r, s) = |r ∩ s| / |r ∪ s|

where Jac(r, s) refers to the Jaccard similarity that exists between strings r and s.

Cos(r, s) = |r ∩ s| / √(|r| · |s|)

where Cos(r, s) refers to the Cosine similarity that exists between strings r and s.

For example, given r = “I am cool person” and s = “I am cool”, the token sets share three tokens with |r| = 4 and |s| = 3, so Dice(r, s) = 6/7 ≈ 0.86, Jac(r, s) = 3/4 = 0.75 and Cos(r, s) = 3/√12 ≈ 0.87.
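These values can be checked with a short sketch (plain Python; the token-set helpers below are ours, introduced only for illustration):

```python
import math

def tokens(s):
    """Whitespace-tokenize a string and return its set of tokens."""
    return set(s.split())

def dice(r, s):
    r, s = tokens(r), tokens(s)
    return 2 * len(r & s) / (len(r) + len(s))

def jaccard(r, s):
    r, s = tokens(r), tokens(s)
    return len(r & s) / len(r | s)

def cosine(r, s):
    r, s = tokens(r), tokens(s)
    return len(r & s) / math.sqrt(len(r) * len(s))

r, s = "I am cool person", "I am cool"
print(dice(r, s), jaccard(r, s), cosine(r, s))  # 0.857..., 0.75, 0.866...
```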
String similarity joins have many real-world applications, e.g., entity resolution, copy detection, document clustering [19], plagiarism detection and data integration [23]. Traditional string similarity join algorithms are constrained by memory and are therefore of limited use when dealing with a large amount of data.
Similarity join is the problem of discovering all pairs of records from given sets whose similarity scores exceed a predefined similarity threshold under a given similarity function. It arises in many applications that must deal with a progressively immense amount of information, so scaling up similarity joins keeps becoming more important. Performing similarity joins on an enormous amount of data presents two principal difficulties. First, the data can no longer fit into the memory of one machine, which calls for workload partitioning. Given the pairwise-comparison nature of the problem, partitioning the data so as to guarantee load balancing while limiting communication cost and redundancy is difficult, and the difficulty is further aggravated by the need to deal with datasets that have skewed distributions and high dimensionality. Second, since the number of comparisons grows quadratically with the data size, methods that require comparing all pairs of records do not scale well. Thus, another challenge is to filter candidate pairs without calculating the actual similarity. Designing filters that support a broad class of useful similarity functions is therefore of great practical significance [23].
Data skewness [23,24–31] occurs when the computational load is unbalanced among map tasks or among reduce tasks. Skewness can lead to significantly longer job execution times, lower cluster throughput, and higher computation time and cost. The performance of existing algorithms degrades in the presence of data skewness, so it is essential to devise methods that overcome this problem. Existing techniques cause data skewness during partitioning, which creates data imbalance and data duplication, and they require a high computation cost to handle vast amounts of data. In this paper, character-based similarity functions are used to build a scalable approach for processing big data.
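The character-based functions used in this paper are built on the edit (Levenshtein) distance. As a concrete reference, the sketch below is a plain dynamic-programming implementation of that distance; it is not the optimized, threshold-aware verification routine used later by Pass-Join.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit-cost insert/delete/substitute)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))            # distances for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1,      # insert b[j-1]
                         prev[j - 1] + cost)  # substitute (or match)
        prev = cur
    return prev[n]

# Two strings form a join result if edit_distance(r, s) <= tau for the chosen threshold tau.
print(edit_distance("kitten", "sitting"))  # 3
```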
The MapReduce [5] framework proposed by Google is used to process large-scale data in a distributed manner, and there has been valuable research on analytical join query processing in MapReduce for extensive datasets. However, applications that must handle an immense amount of data commonly face three fundamental difficulties.
First, because the data collection is enormous, the data must be partitioned and processed appropriately; workload-aware partitioning procedures are therefore required, and they must balance not only the input data but also the output of each machine. Second, a sophisticated filtering strategy is needed, since the number of comparisons grows dramatically as the dataset size increases. Third, a principal issue arises when processing a join query in MapReduce over multiple datasets.
This paper presents a hybrid approach that combines Pass-Join (see Section 2) with the MapReduce framework and the concept of the MapReduce Frequency Adaptive (MRFA) join (see Section 2) to handle the basic record join. The approach has been designed to handle data skewness in character-based string similarity join.
As explained, Pass-Join is itself a robust algorithm that “propose[s] a high-quality partition-based signature scheme for the edit distance function” [14–19], and it has proven to be an effective and efficient method for generating candidate pairs from the given dataset. This paper presents a MapReduce formulation of Pass-Join that incorporates MRFA [24] concepts for the string similarity join, named MRFA-SSJ. Hence, the proposed work combines the advantages of Pass-Join [14] and MRFA [24].
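To make the partition-based signature idea concrete, the following sketch (our simplification, in Python) shows the even segmentation at the heart of Pass-Join: a string of length l is split into τ + 1 disjoint segments, so by the pigeonhole principle any string within edit distance τ of it must contain at least one of those segments verbatim. The full multi-match-aware substring selection of Pass-Join is not reproduced here.

```python
def partition_segments(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length.
    Shorter segments come first when len(s) is not evenly divisible."""
    k = tau + 1
    short_len, n_long = divmod(len(s), k)
    n_short = k - n_long
    segments, pos = [], 0
    for i in range(k):
        seg_len = short_len if i < n_short else short_len + 1
        segments.append((pos, s[pos:pos + seg_len]))  # (start position, segment text)
        pos += seg_len
    return segments

# e.g. an illustrative threshold tau = 5 splits a 30-character string into six
# segments of length 5, starting at positions 0, 5, 10, 15, 20 and 25.
```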
The proposed technique can be applied in various fields to obtain better estimation and computation, and it supports better decision making because it distributes the data evenly. Areas in which it can be applied include optimization algorithms, IoT in educational environments, the quadratic assignment problem, decision-based methods and neutrosophic problems [32–40].
The rest of the paper is organized as follows: Section 2 describes the preliminaries which are related to the proposed work. Section 3 is devoted to the literature review of the existing techniques. Formulation of the proposed work has been discussed in Section 4. Experimental results are described in Section 5 and the conclusion has been discussed in Section 6.
2. Material and methods
This section briefly describes the important concepts that are used repeatedly in this paper. The proposed work focuses on the effects of data skewness in join operations [19]; to implement it, one must be familiar with the concepts of Hadoop and MapReduce. These concepts are briefly described as follows:
For records, the mapper output is in the form of <key, value> pairs, for example <1, R_tag 1 a & a mobility sales & service>.
Consider the RID pair “Pair_tag 1 52,763 7”. For this RID pair, the output is as shown in Table 5:
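The essential behaviour of this map step can be sketched as follows (a minimal Python illustration with names of our own choosing): each RID pair is emitted twice, once under each of its two record IDs, while record tuples are emitted once under their own RID, so that pairs and records meet at the reducers.

```python
def stage3_map(line):
    """Simplified sketch of the map phase that prepares the record join (cf. Table 5).
    RID-pair lines are emitted under both RIDs; record lines under their single RID."""
    fields = line.split()
    if fields[0] == "Pair_tag":
        _, rid1, rid2, _dist = fields
        return [(rid1, line), (rid2, line)]
    # otherwise a tagged record, e.g. "R_tag 1 a & a mobility sales & service"
    return [(fields[1], line)]

print(stage3_map("Pair_tag 1 52,763 7"))
# [('1', 'Pair_tag 1 52,763 7'), ('52,763', 'Pair_tag 1 52,763 7')]
```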
Once the RID pairs have been generated, the records are joined together; at the same time, skewness, which can hamper performance, must be avoided, since one record may be paired with many records. For this MapReduce phase, algorithm 3.2 is used. Using the concepts from MRFA, the output of stage 2 is distributed on the basis of frequency: the total number of RID pairs for each record is computed, and if this frequency is greater than the user-defined threshold frequency, the entire set of RID pairs is divided into buckets, which are distributed to different reducers, and the record is replicated across all those reducers for the join operation. Then, for the RID pairs, the record is set as the value while the RID pair acts as the key. This is one half of the result, computed by the MapReduce phase in algorithm 3.3. To combine the two halves, all <key, value> pairs are grouped according to their keys (here the key is the RID pair), as explained in algorithm 3.4.
Two <key, value> pairs sharing the same key are thus the two halves of one result; their two values (i.e. records) are combined to give the final result.
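A rough sketch of the frequency-adaptive decision described above is given below; the hashing scheme, threshold and bucket arithmetic are illustrative choices of ours, not the exact routing used by MRFA.

```python
def assign_reducers(rid, pair_count, freq_threshold, num_reducers):
    """Decide which reducers receive the RID pairs of one record (MRFA-style sketch).
    Below the frequency threshold one reducer suffices; above it, the pairs are split
    into buckets over several reducers and the record tuple is replicated to each."""
    home = hash(rid) % num_reducers
    if pair_count <= freq_threshold:
        return [home]
    num_buckets = min(-(-pair_count // freq_threshold), num_reducers)  # ceiling division
    return [(home + b) % num_reducers for b in range(num_buckets)]

# e.g. a record participating in 10,000 RID pairs with a threshold of 1,000 and
# 8 reducers gets its pairs spread over all 8 reducers, and its tuple is copied to each.
print(assign_reducers("52,763", 10_000, 1_000, 8))
```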
Stage 3 is dedicated to the most crucial part of the proposed technique: unlike the existing algorithms in the literature, it is not sensitive to data skewness. The analysis has been carried out by following Zipf's law; the detailed explanation and comparative study are given in Section 5 (please refer to the experimental analysis).
4.3.5 Illustration of the proposed approach
The algorithm begins with Stage 1, which computes how the data from relation R will be distributed. It outputs the minimum and maximum lengths of the strings received by each reducer. For instance, if the shortest string received by a reducer has length 20 and the longest has length 500, the output is the reducer number followed by these minimum and maximum string lengths (using algorithms 1.1 and 1.2).
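A minimal sketch of this first stage is shown below (Python; the hash-based routing in the map step is only a stand-in, since the paper's own partitioning rule is not reproduced here):

```python
def stage1_map(string_from_r, num_reducers):
    """Emit (reducer number, string length) for a string of relation R.
    The modulo routing is an illustrative stand-in for the actual partitioning."""
    return len(string_from_r) % num_reducers, len(string_from_r)

def stage1_reduce(reducer_id, lengths):
    """Output the reducer number with the minimum and maximum string lengths it received."""
    return reducer_id, min(lengths), max(lengths)

# A reducer that received strings of lengths 20..500 reports exactly that range:
print(stage1_reduce(3, [20, 37, 180, 500]))  # (3, 20, 500)
```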
In the second stage, the map phase processes strings from S as follows. Consider the string “aaacookcountyconsolidation.com” from relation S; its length is 30. Given the edit distance threshold, the string is split into segments of near-equal length; in this example each segment has length 5, so there are six segments starting at positions 0, 5, 10, 15, 20 and 25.
The reduce phase (algorithm 2.2) is responsible for generating substrings of the strings from relation R for each segment of the strings from S. As an example, consider the string “aaacoman” from relation R. For each segment of the string from S, the list of substrings of “aaacoman” that could match that segment is computed. For the 1st segment, with start position 0, the list of substrings generated is {'aaaco'}. Similarly, for the 2nd segment, with start position 5, the list of substrings is {'oman', 'man', 'an'}. In this example, no substrings are generated for the remaining segments.
Once the list of substrings for each segment has been generated, the next step is to check whether the segment matches any of the substrings in its list. In this example, the 1st segment matches the substring generated for it, so the two strings are concluded to be a candidate pair. When a pair is identified as a candidate, it is sent for verification; as explained, the extension-based verification and sharing-computation algorithms are used to verify the pair. If the pair passes the edit distance threshold, the record IDs of both strings are output. In our example, the pair, although a candidate, fails to satisfy the edit distance threshold (the two lengths, 30 and 8, already differ by far more than the threshold).
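A much-simplified version of this candidate test is sketched below. The position window (segment start ± τ) is a simplification of Pass-Join's multi-match-aware substring selection, and the threshold τ = 5 is inferred from the length-5 segments of the example; the substring lists it produces are therefore larger than the ones quoted above, but the candidate decision for the example pair is the same.

```python
def candidate_substrings(r, seg_start, seg_len, tau):
    """Substrings of r that could align with a segment starting at seg_start.
    Simplified window: start positions within tau of seg_start, segment-length slices
    (clipped at the end of r)."""
    window = range(max(0, seg_start - tau), min(len(r), seg_start + tau) + 1)
    return {r[p:p + seg_len] for p in window if r[p:p + seg_len]}

def is_candidate(r, segments, tau):
    """r forms a candidate pair with the segmented string if any segment occurs
    among the substrings collected for its position."""
    return any(seg in candidate_substrings(r, pos, len(seg), tau) for pos, seg in segments)

# Six length-5 segments of "aaacookcountyconsolidation.com" (as in the illustration):
segments = [(0, "aaaco"), (5, "okcou"), (10, "ntyco"),
            (15, "nsoli"), (20, "datio"), (25, "n.com")]
print(is_candidate("aaacoman", segments, tau=5))  # True: the 1st segment matches "aaaco"
```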
Once the RID pairs have been generated, the records are joined; at the same time, skewness, which can hamper performance, must be handled, because one record may be paired with many records. For this MapReduce phase, algorithm 3.2 is used. Using the concepts from MRFA, the output of stage 2 is distributed on the basis of frequency: the total number of RID pairs for each record is computed, and if this frequency is greater than the user-defined threshold frequency, the entire set of RID pairs is divided into buckets that are distributed to different reducers, and the record is replicated across all those reducers for the join operation. Then, for the RID pairs, the record is set as the value while the RID pair acts as the key. This is one half of the result, computed by the MapReduce phase in algorithm 3.3. To combine the two halves, all <key, value> pairs are grouped according to their keys (here the key is the RID pair), as explained in algorithm 3.4. Two <key, value> pairs sharing the same key are thus the two halves of one result, and their records are combined to produce the final outcome. Following this illustration, the experimental analysis (Section 5) details the setup for the proposed approach, with graphical results and a comparative study of the compared techniques.
5. Experimental analysis
In this section, we analyze the performance of the proposed method. We compare our work with the best existing methods, namely Improved Repartition Join [52], Standard Repartition Join [52] and the MRFA Join algorithm [24]. We used a 15-node cluster, where every node has an Intel Xeon E5520 processor running at 3 GHz with four cores, 12 GB of RAM and five 200 GB hard disks. We also set up an extra node to run the master, which manages the Hadoop jobs as well as HDFS.
We installed Hadoop 2.6.0; every node runs Ubuntu 9.04 (64-bit, server edition) with Java 1.6 and a 64-bit server JVM. We made some changes to the default Hadoop configuration to maximize parallelism and minimize running time: the default HDFS block size is 64 MB, but we set it to 128 MB, and we allocated 1 GB of virtual memory to each node and 1 GB of virtual memory to each map/reduce task. Each node runs eight map tasks and eight reduce tasks in parallel. We also set the replication factor to 1 and disabled speculative task execution.
We followed Zipf's law to study the effect of data skewness on the proposed method during the join operation. To examine the impact of skewness, it was introduced artificially into the join operations by following the Zipf distribution [53,54]. The Zipf factor varies from 0 to 1, where 0 stands for a uniform data distribution with zero skewness and 1 stands for highly skewed data. In this experiment, we fixed the input size at 10 GB of data for the right table and 1 GB of data for the left table, so the size of the final result varies from 5 GB to 40 GB of records.
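The skew injection can be reproduced with a simple Zipf sampler. Note that the exponent here runs from 0 (uniform) to 1 (highly skewed), so numpy.random.zipf, which requires an exponent strictly greater than 1, is not suitable; the sketch below (our own, with illustrative parameters) samples join keys from an explicit Zipf weight table instead.

```python
import random

def zipf_keys(num_keys, exponent, sample_size, seed=42):
    """Sample join-key ranks 1..num_keys following a Zipf distribution.
    exponent = 0 gives a uniform distribution (no skew); exponent = 1 gives heavy skew."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    rng = random.Random(seed)
    return rng.choices(range(1, num_keys + 1), weights=weights, k=sample_size)

# With 1,000 distinct keys and skew factor 1.0, the most frequent key alone accounts
# for roughly 13% of all join keys, which concentrates work on a few reducers.
sample = zipf_keys(1000, 1.0, 100_000)
print(sample.count(1) / len(sample))
```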
During the experiments we observed that, for low to medium skewness, the proposed algorithm outperforms the existing algorithms against which it was compared. This is confirmed by a series of experiments performed on the Query log and DBLP datasets and, for more realistic results, on a real dataset of IP addresses and cookies. The results show that the proposed method is better than the other methods available in the literature for handling data skewness in big data with the Basic Record Join method. Further, the proposed method is compared with the existing techniques, namely Improved Repartition Join [52], Standard Repartition Join [52] and MRFA Join [24], with respect to join processing time and reduce shuffle time in Figures 5 and 6 respectively.
When skewness is high, none of the existing algorithms performs well, whereas our proposed algorithm does. For skew factors ranging from 0.6 to 1, the jobs for Improved Repartition Join and Standard Repartition Join fail due to lack of memory, which also limits their scalability; both are skew-sensitive algorithms. MRFA sustains the entire range of skew factors but does not ensure scalability. Compared with these, the proposed method takes far less time to process the jobs under high data skewness; hence the proposed method is skew-insensitive for character-based string similarity join. A comparison of the different techniques is given in Table 6.
6. Conclusion
In this paper, a new approach has been presented for handling data skewness in character-based string similarity join using the MapReduce framework, and we have proposed an algorithm, MR-Pass Join, for handling this task. Our experiments show that the candidate-pair generation method produces neither false positives nor false negatives, and that the proposed algorithm ensures scalability with reasonable query processing time for string similarity database joins. The proposed approach, MRFA-SSJ, has been compared with the best-known solutions, namely Improved Repartition Join, Standard Repartition Join and MRFA Join. Our results are found to be better, and the analysis has been carried out under the Zipf law with respect to reduce shuffle time and join processing time.
The proposed algorithm is highly skew-insensitive and scalable, at the cost of high memory usage. It survives up to the highest value of the skew factor, 1, whereas in our experiments the existing algorithms survive only up to a skew factor of 0.5 under the Zipf distribution. As per our analysis, the existing algorithms cannot give good results because they are skew sensitive.
This analysis was carried out on a 15-node cluster, where every node was given 1 GB of virtual memory and five 200 GB hard disks. The HDFS block size in Hadoop was increased from 64 MB to 128 MB for better analysis. We also fixed the input size at 10 GB for the right table and 1 GB for the left table; the resulting output varies from 5 GB to 40 GB.
It has been observed that the proposed algorithm can handle highly skewed data whereas the other algorithms cannot; hence it is a skew-insensitive approach. The proposed algorithm is also scalable and achieves better load balancing, and it outperforms the existing methods, which fail to handle data skewness in similarity joins.
Figure 1: Working of Stage 1.
Figure 2: Execution of work of Stage 2.
Figure 3: Workflow of Stage 3.
Figure 4: Basic record join used with MRFA-SSJ to combat skewness in string similarity join.
Figure 5: Job processing time with respect to the skew effect in the data and the time taken by each technique to process it completely.
Figure 6: Shuffle time with respect to the skewness in the data.
Table 1: Example of sample input.
| Tag | RID | Key (the 1st string of the entire record – used for joining) | Value (the rest of the record) |
|---|---|---|---|
| R_tag | 1 | A | & a mobility sales & service |
| R_tag | 2 | Aa | 20meetings 20in 20charlotte |
| R_tag | 3 | Aaa | and grand canyon discounts |
| R_tag | 4 | aaacookcountyconsolidation.com | Null |
| R_tag | 5 | Aaacoman | Null |
| S_tag | 1 | a & a mobility sales & service | & a mobility sales & service |
| S_tag | 2 | aa 20meetings 20in 20charlotte | 20meetings 20in 20charlotte |
| S_tag | 3 | aaa and grand canyon discounts | and grand canyon discounts |
| S_tag | 4 | aaacookcountyconsolidation.com | Null |
Table 2: Input to the mapper.
| Tag | RID | Key (the 1st string of the entire record – used for joining) | Value (the rest of the record) |
|---|---|---|---|
| R_tag | 1 | A | & a mobility sales & service |
| R_tag | 2 | Aa | 20meetings 20in 20charlotte |
| R_tag | 3 | Aaa | and grand canyon discounts |
| R_tag | 4 | aaacookcountyconsolidation.com | Null |
| R_tag | 5 | Aaacoman | Null |
Table 3: Record from relation S.
| Tag | RID | Key | Value |
|---|---|---|---|
| S_tag | 4 | aaacookcountyconsolidation.com | Null |
Table 4: RID pairs generated from stage 2.
| Tag | RID 1 | RID 2 | Edit distance |
|---|---|---|---|
| Pair_tag | 1 | 52,763 | 7 |
| Pair_tag | 1 | 1 | 0 |
| Pair_tag | 230,150 | 146,949 | 11 |
Table 5: Output of RID pairs.
| Key | Value |
|---|---|
| 1 | Pair_tag 1 52,763 7 |
| 52,763 | Pair_tag 1 52,763 7 |
Table 6: Comparison of different techniques.
| Characteristics | MRSim Join | V-Smart Join | Mass Join | Improved Repartition Join | Standard Repartition Join | MRFA | Proposed (MRFA-SSJ) |
|---|---|---|---|---|---|---|---|
| Data Skew | Insensitive | Insensitive | Insensitive | Sensitive | Sensitive | Less Sensitive | Insensitive |
| Scalability | Less | Moderate | Less | Less | Less | Below Moderate | High |
| Memory Requirement | Lack of Memory | Moderate Amount of Memory | Lack of Memory | Lack of Memory | Lack of Memory | Moderate Amount of Memory | High Usage of Memory |
© Kanak Meena, Devendra K. Tayal, Oscar Castillo and Amita Jain. This work is published under the Creative Commons Attribution Non-Commercial 3.0 licence: https://creativecommons.org/licenses/by-nc/3.0/legalcode.
Abstract
The scalability of similarity joins is threatened by the unexpected data characteristic of data skewness, a pervasive problem in scientific data. Skewness leads to an uneven distribution of attributes, which can cause a severe load-imbalance problem, and the skew is compounded when database join operations are applied to such datasets. All the algorithms developed to date for database joins are highly skew sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. To the best of our knowledge, no existing work handles data skewness in character-based string similarity join, although work on set-based string similarity joins exists. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases dedicated to specific tasks. The first stage finds the lengths of the strings in the dataset. For valid candidate-pair generation, the MR-Pass Join framework is proposed in the second stage. In the third stage, which is further divided into four MapReduce phases, MRFA concepts are incorporated into the string similarity join, named “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join). Hence, MRFA-SSJ is proposed to handle skewness in the string similarity join. The experiments have been carried out on three datasets, namely DBLP, Query log and a real dataset of IP addresses and cookies, using the Hadoop framework. The proposed algorithm has been compared with three known algorithms, and it has been observed that all of them fail when data is highly skewed, whereas the proposed method handles highly skewed data without any problem. A 15-node cluster has been used in the experiments, and the Zipf distribution law has been followed for the analysis of the skew factor. A comparison between the existing and the proposed techniques is also given: the existing techniques survive up to a Zipf factor of 0.5, whereas the proposed algorithm survives up to a Zipf factor of 1. Hence, the proposed algorithm is skew-insensitive and ensures scalability with a reasonable query processing time for string similarity database joins, while also ensuring an even distribution of attributes.





