Quasi-Identifier Recognition Algorithm for Privacy Preservation of Cloud Data Based on Risk Reidentification

This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

1. Introduction

In the modern information age, many companies rely on external providers to process and store their data or to obtain services such as data mining. Practically unlimited computational resources, reduced costs, freedom from maintenance burdens, and no need to acquire proficiency in specialized services have all encouraged this shift. However, security and privacy concerns still hinder the adoption of the features offered by the cloud [1]. Numerous studies have shown that attackers often obtain information from third-party services or third-party clouds [2]. For example, one of the security breaches of October 2014 affected Dropbox: the attackers stole 700 user passwords to obtain the cash value of its Bitcoins (BTC). In 2015, the information of more than 4 million users, including names, dates of birth, addresses, e-mail addresses, phone numbers, and other sensitive data, was leaked through the TalkTalk service provider in the UK. In 2016, Time Warner, one of the largest cable television companies in the United States, announced that about 32 million user passwords and e-mail addresses had been stolen by an attacker. In 2017, more than 200 million pieces of user data, including names, phone numbers, e-mail addresses, home addresses, and other data, were disclosed through the API of the McDelivery company in India [2, 3]. A recent security violation at Google showed that any server administrator who has access to secret information can easily misuse it. Worse, the administrator of an honest-but-curious server can violate privacy without being discovered [4].

Three kinds of disclosure can cause privacy leakage: identity disclosure, attribute disclosure, and membership disclosure [5]. In attribute disclosure and identity disclosure, the intruder establishes that the tuple of the target individual is present in the released dataset and aims to acquire private or sensitive data about that individual from it [6]. The serious issues that lead to identity disclosure are quasi-identifier (QID) value linking and the attacker’s background knowledge. QIDs are dataset attributes that do not distinguish an individual when each is considered separately but that can identify individuals uniquely when several of them are combined [7]. For example, the attributes date of birth, gender, and ZIP code, taken together, can reidentify individuals, as stated in [8]. Reidentifying individuals by linking their QIDs is known as a linking attack. Therefore, the careless publication of QIDs leads to privacy leakage [9].
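The way QIDs identify individuals only in combination can be illustrated with a minimal sketch. The records below are invented toy values, and `unique_fraction` is a hypothetical helper introduced here for illustration; it is not part of the proposed algorithm.

```python
from collections import Counter

# Toy records with invented values: (date_of_birth, gender, zip_code).
records = [
    ("1980-01-01", "F", "10001"),
    ("1980-01-01", "M", "10001"),
    ("1980-01-01", "F", "10002"),
    ("1975-06-15", "M", "10002"),
    ("1975-06-15", "F", "10001"),
    ("1975-06-15", "M", "10001"),
]

def unique_fraction(rows, columns):
    """Fraction of rows whose values on `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# Each attribute alone singles out nobody in this toy table...
single = [unique_fraction(records, [c]) for c in range(3)]
# ...but the three attributes together single out every record.
combined = unique_fraction(records, [0, 1, 2])
print(single, combined)
```

In this toy table no single attribute isolates a record, while the combination of all three isolates every record, which is the situation a linking attack exploits.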

One popular practice for avoiding privacy leakage is anonymization. Anonymization can be performed through several types of transformation: removing values, changing the structure, replacing values according to a taxonomy, and combining values. Anonymization-based methods use one operation or a combination of operations to achieve an optimum level of concealment [10]. A commonly used privacy criterion for anonymization is $k$-anonymity, introduced by Sweeney [8]. The $k$-anonymization model aims to make every record in the released dataset indistinguishable from at least ($k - 1$) other records [1, 11], and it can therefore be used to prevent linking attacks. An effective way of determining the real QIDs is the primary issue for privacy-preserving methods that are based on $k$-anonymity or other anonymization models and that seek to prevent QID linking. Most current methods, however, neglect this issue or determine QIDs manually, which reduces the validity of the anonymization method and harms the usefulness of the anonymized data [9]. This study aims to overcome the identity disclosure that results from QID linking, and thereby reduce privacy leakage, by proposing a QID recognition (QIR) algorithm based on the reidentification risk rate. The proposed algorithm comprises two main stages: (1) attribute classification (or QID recognition) and (2) QID dimension identification. The algorithm uses the reidentification risk rate of all attributes and the dimension of the QIDs to determine the proper QIDs and their suitable dimension. Figure 1 shows the cause-effect diagram of privacy leakage. The dark boxes in Figure 1 mark the causes of privacy leakage addressed by the proposed QIR algorithm. As Figure 1 shows, properly identifying the QID attributes is essential to overcoming identity disclosure and thus to reducing the privacy leakage that results from QID linking.
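The $k$-anonymity criterion mentioned in this section can be sketched as a simple check. This is a minimal illustration, assuming the table is a list of tuples and that the QID columns are already known; the function name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, qid_columns, k):
    """True if every combination of QID values is shared by at least k rows."""
    groups = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return all(size >= k for size in groups.values())

# Hypothetical generalized table: (age_range, zip_prefix, diagnosis).
table = [
    ("30-39", "100**", "flu"),
    ("30-39", "100**", "asthma"),
    ("40-49", "100**", "flu"),
    ("40-49", "100**", "diabetes"),
]

# Each (age_range, zip_prefix) group holds exactly two rows.
print(is_k_anonymous(table, [0, 1], 2))
print(is_k_anonymous(table, [0, 1], 3))
```

The result depends entirely on which columns are passed as `qid_columns`, which is why choosing the real QIDs matters before anonymization.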
This paper consists of five sections. Section 2 describes the state of the art of privacy-preserving data mining (PPDM) over the cloud and presents current methods and algorithms that address the accurate identification of QIDs to avoid identity disclosure. Section 3 gives a detailed description of the proposed algorithm. Section 4 presents the experimental evaluation, discussion, and comparison with related work. Section 5 concludes this work.

[figure omitted; refer to PDF]

[figures omitted; refer to PDF]

For the bank dataset, we set $α = 30$ and $β = 0$; Table 1 shows the resulting attribute classification. For the adult dataset, we set $α = 0.2$ and $β = 0.01$; Table 2 shows the resulting classification. Because the “balance” attribute has a risk of 52.04%, which is large compared with the other attributes, it is excluded from Figure 3 to highlight the differences between the attributes that have relatively small risk values.

Table 1

Classification of the bank dataset.

Classification	Threshold value $α = 30, β = 0$	Attributes
SAs	$Rrisk > α$	Balance
QIDs	$β \leq Rrisk < α$	Age, job, education, and marital status
NSs	$Rrisk < β$	Default, housing, and loan

Table 2

Classification of the adult dataset.

Classification	Threshold value $α = 0.2, β = 0.01$	Attributes
SAs	$Rrisk > α$	Capital gain, capital loss
QIDs	$β \leq Rrisk < α$	Hours-per-week, work-class, age, native-country, education, education-num, occupation, marital-status, relationship, and race
NSs	$Rrisk < β$	Sex, income
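The threshold rules listed in Tables 1 and 2 can be sketched as a small function. This is an illustration only: the function name, the attribute names `attr_a` to `attr_c`, and their risk values are hypothetical, and the case $Rrisk = α$, which the rules as stated do not cover, is reported separately rather than guessed.

```python
def classify_attribute(risk, alpha, beta):
    """Classify one attribute from its reidentification risk rate.

    Follows the rules in the tables: SA if risk > alpha,
    QID if beta <= risk < alpha, NS if risk < beta.
    """
    if risk > alpha:
        return "SA"
    if beta <= risk < alpha:
        return "QID"
    if risk < beta:
        return "NS"
    # risk == alpha is not covered by the rules as stated in the tables.
    return "unclassified"

# Hypothetical risk rates, classified with the adult thresholds (alpha=0.2, beta=0.01).
hypothetical_risks = {"attr_a": 0.35, "attr_b": 0.05, "attr_c": 0.001}
labels = {
    name: classify_attribute(risk, 0.2, 0.01)
    for name, risk in hypothetical_risks.items()
}
print(labels)

# The bank "balance" attribute has a reported risk of 52.04 with alpha=30, beta=0.
print(classify_attribute(52.04, 30, 0))
```

With the bank thresholds, the reported risk of the “balance” attribute falls above $α$, which matches its placement among the SAs in Table 1.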

After the risk rate of each attribute in the dataset is calculated, the attribute is classified according to the selected thresholds $α$ and $β$, as explained in the QID recognition stage. Tables 1 and 2 show the classification results for the bank dataset and the adult dataset, respectively. After the classification stage, the QID dimension that achieves the optimum case must be determined. The number of QIDs, and hence the maximum QID dimension (QidD), is four ($QidD = 4$) for the bank dataset and 10 ($QidD = 10$) for the adult dataset. For each dataset, the initial QID dimension is set to one ($QidD = 1$) and used as input to the proposed QID dimension identification algorithm (as explained in Algorithm 2). Identification of the QID dimension starts from this initial value of QidD, which is incremented until it reaches the maximum QID dimension. It also starts with a sample size equal to 10% of the dataset (the sample size is changeable) and with $k$-anonymity of $k = 5$, and $k$ is incremented up to $k = 25$ for each QidD value. The privacy gain (PG) and the nonuniform entropy (NUE) are then calculated for each sample and each new QidD until QidD reaches four for the bank dataset and 10 for the adult dataset.

Finally, the proposed algorithm returns the QidD that achieves the optimum case as the best dimension to be used in the anonymization process. Table 3 shows the results of finding the best QidD for the adult dataset.

Table 3

Experimental results for selecting the best QidD in the adult dataset.

QidD	$k = 5$		$k = 15$		$k = 25$
QidD	PG %	NUE %	PG %	NUE %	PG %	NUE %
1	33.8	30.41	38.35	21.05	38.35	21.05
2	55.34	44.65	76.86	23.13	76.86	23.13
3	77.94	22.05	83.17	16.82	83.17	16.82
4	79.53	20.46	84.39	15.6	84.39	15.6
5	83.62	16.37	87.48	12.51	87.48	12.51
6	85.91	14.08	89.56	10.43	89.56	10.43
7	86.65	13.34	86.65	13.34	89.69	10.3
8	90.51	9.48	90.51	9.48	90.51	9.48
9	92.68	7.31	92.68	7.31	92.68	7.31
10	91.59	8.4	91.59	8.4	91.59	8.4

According to Table 3, $QidD = 2$ is the optimum case, as it increases both the privacy gain and the NUE. The privacy level also generally increases as QidD increases, with the privacy gain reaching 91.59% when QidD is 10. On the other hand, NUE, and accordingly the data utility, decreases as QidD increases. Figures 4(a)–4(c) show the selection of the best QidD for the adult dataset by the proposed QIR algorithm at $k$-anonymity values of 5, 15, and 25, respectively. In the adult dataset, the QID attributes selected by the proposed algorithm are work-class and hours-per-week (HPW). These two attributes have the highest reidentification risk; thus, they must be involved in the anonymization process (see Figure 5).
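One way to read the trade-off in Table 3 is sketched below. The selection rule used here, picking the QidD whose weaker metric, $\min(PG, NUE)$, is largest, is an assumption made for illustration and is not the paper's Algorithm 2; the function name `best_qid_dimension` is likewise hypothetical. The data values are copied from Table 3.

```python
# (PG %, NUE %) for QidD = 1..10, copied from Table 3 (adult dataset), keyed by k.
TABLE_3 = {
    5: [(33.8, 30.41), (55.34, 44.65), (77.94, 22.05), (79.53, 20.46),
        (83.62, 16.37), (85.91, 14.08), (86.65, 13.34), (90.51, 9.48),
        (92.68, 7.31), (91.59, 8.4)],
    15: [(38.35, 21.05), (76.86, 23.13), (83.17, 16.82), (84.39, 15.6),
         (87.48, 12.51), (89.56, 10.43), (86.65, 13.34), (90.51, 9.48),
         (92.68, 7.31), (91.59, 8.4)],
    25: [(38.35, 21.05), (76.86, 23.13), (83.17, 16.82), (84.39, 15.6),
         (87.48, 12.51), (89.56, 10.43), (89.69, 10.3), (90.51, 9.48),
         (92.68, 7.31), (91.59, 8.4)],
}

def best_qid_dimension(rows):
    """Return the 1-based QidD whose weaker metric, min(PG, NUE), is largest.

    Assumed balance rule for illustration; not the paper's Algorithm 2.
    """
    scores = [min(pg, nue) for pg, nue in rows]
    return scores.index(max(scores)) + 1

best = {k: best_qid_dimension(rows) for k, rows in TABLE_3.items()}
print(best)  # {5: 2, 15: 2, 25: 2}
```

Under this assumed rule the adult data yields $QidD = 2$ for all three $k$ values, in line with the choice reported for Table 3. The rule is not claimed to reproduce the selection for other datasets.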

[figure omitted; refer to PDF]

For the bank dataset, Table 4 and Figures 6(a)–6(c) make it clear that the proposed algorithm achieves the optimum case when $QidD = 1$, as this dimension gives high privacy for several values of $k$. Table 4 also shows that, for $QidD = 1$, NUE drops from 45.28% at $k = 5$ to 17.27% at $k = 15$ and $k = 25$. It is also noticeable that, in the bank dataset, privacy decreases as QidD increases, which is to be expected given the level of privacy provided.

Table 4

Experimental results for selecting the best QidD in the bank dataset.

QidD	QID	$k = 5$		$k = 15$		$k = 25$
QidD	QID	PG %	NUE %	PG %	NUE %	PG %	NUE %
1	Age	23.89	45.28	36.12	17.27	36.12	17.27
2	Age, job	21.83	36.65	21.83	36.65	21.83	36.65
3	Age, job, marital status	15.83	40.35	16.67	37.18	17.94	32.37
4	Age, job, marital status, education	14.88	35.93	14.88	35.93	16.43	29.26

[figures omitted; refer to PDF]

4.3. Performance Benchmark and Discussion

To evaluate the proposed QIR algorithm, we compare it, under $k$-anonymity, against a recent similar work, the SQI algorithm [37]. The comparison is made in terms of privacy gain (PG) and nonuniform entropy (NUE), using multiple $k$ values and different sample sizes of the adult dataset. As Figures 7 and 8 show, QIR provides more privacy than SQI, with an average improvement exceeding 23%. SQI outperformed QIR in data utility, as represented by NUE, at $k = 26$, 29, and 35, but with a privacy rate of 9.57%; this is considered a deficiency because QIR provided higher data utility together with much higher privacy at $k = 4$, 6, 10, 17, and 20.

[figure omitted; refer to PDF]

Figures 9 and 10 show that, at 10% of the dataset and $k = 10$, the privacy achieved by the proposed QIR algorithm is more than double that achieved by the SQI algorithm, with a slight increase in data utility; that is, QIR outperforms SQI in preserving both privacy and data utility. With a data size of 20% and $k = 20$, the NUE obtained by SQI and QIR is 30.27% and 31.66%, respectively, while the privacy given by SQI is 20.52% and that given by QIR is 51.82%, more than twice the SQI value. Similar results were obtained at $k = 20$ with data sizes of 30% and 90%. In most cases, as the data size increases, the privacy decreases and the data utility increases.

[figure omitted; refer to PDF]

Overall, for the whole adult dataset, the experiments at $k = 10$ and $k = 20$ show that the average privacy provided by SQI is 10.17% with 48.62% data utility, while the average privacy provided by the proposed QIR is 46.49% with 41.04% data utility. Across all the $k$ values tested on the whole adult dataset, the average privacy provided by SQI is 7.51% with 54.13% data utility, while that achieved by QIR is 30.67% with 55.46% data utility; hence, QIR is the preferable choice for identifying the real QIDs.
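As a quick arithmetic check on these averages, the ratios and differences below are derived here from the reported figures and are not stated in the source. The first line covers the experiments at $k = 10$ and $k = 20$ and the second covers all $k$ values; each line gives the QIR-to-SQI privacy ratio followed by the QIR-minus-SQI data utility difference in percentage points.

$$\frac{46.49}{10.17} \approx 4.57, \qquad 41.04 - 48.62 = -7.58$$

$$\frac{30.67}{7.51} \approx 4.08, \qquad 55.46 - 54.13 = 1.33$$

On these figures QIR gives roughly four times the average privacy of SQI in both settings, at a utility cost of about 7.6 points in the first setting and a utility gain of about 1.3 points in the second.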

5. Conclusions

Accurate identification of QIDs is important for the success and validity of privacy-preserving methods for outsourced data that seek to avoid the privacy leakage caused by QID linking. This paper aims to classify dataset attributes before the anonymization process and to determine the proper QIDs that should be involved in it. A new algorithm is proposed that calculates the reidentification risk of the dataset attributes and classifies them into SAs, QIDs, and NSs according to prespecified thresholds. In addition to classifying attributes, the algorithm determines the actual dimension of QIDs required in the anonymization process by weighing the privacy provided against the loss of data quality. The experimental results indicate that the proposed algorithm offers a better balance of privacy against data utility than related work. Although the proposed algorithm is suitable for use with any method or privacy model concerned with QID attributes, this paper relies on the $k$-anonymity model.

Acknowledgments

The authors would like to acknowledge the Taif University Researchers Supporting Project (number TURSP-2020/292), Taif University, Taif, Saudi Arabia.

References

[1] J. Domingo-Ferrer, O. Farràs, J. Ribes-González, D. Sánchez, "Privacy-preserving cloud computing on sensitive data: a survey of methods, products, and challenges," Computer Communications, vol. 140–141, pp. 38-60, DOI: 10.1016/j.comcom.2019.04.011, 2019.

[2] S. Aldeen Yousra, S. Mazleena, "A new heuristic anonymization technique for privacy preserved datasets publication on cloud computing," Journal of Physics: Conference Series, vol. 1003, DOI: 10.1088/1742-6596/1003/1/012030, 2018.

[3] C. Bradford, "7 most infamous cloud security breaches - StorageCraft," storagecraft, 2020. https://blog.storagecraft.com/7-infamous-cloud-security-breaches/

[4] B. Chen, P. Cheung, P. Cheung, Y. Kwok, "Cypherdb: a novel architecture for outsourcing secure database processing," IEEE Transactions on Cloud Computing, vol. 6 no. 2, pp. 372-386, DOI: 10.1109/tcc.2015.2511730, 2018.

[5] B. C. M. Fung, K. Wang, R. Chen, P. S. Yu, "Privacy-preserving data publishing," ACM Computing Surveys, vol. 42 no. 4, DOI: 10.1145/1749603.1749605, 2010.

[6] S. A. Abdelhameed, S. M. Moussa, M. E. Khalifa, "Privacy-preserving tabular data publishing: a comprehensive evaluation from web to cloud," Computers & Security, vol. 72, pp. 74-95, 2018.

[7] A. Bampoulidis, I. Markopoulos, M. Lupu, "PrioPrivacy: a local recoding K-anonymity tool for prioritised quasi-identifiers," IEEE/WIC/ACM International Conference on Web Intelligence - Companion Volume, pp. 314-317.

[8] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10 no. 5, pp. 571-588, DOI: 10.1142/s021848850200165x, 2002.

[9] Y. Yan, W. Wang, X. Hao, L. Zhang, "Finding quasi-identifiers for k-anonymity model by the set of cut-vertex," Engineering Letters, vol. 26 no. 1, 2018.

[10] G. Kaur, S. Agrawal, "Differential privacy framework," Impact of Quasi-identifiers on Anonymization, vol. 46, 2019.

[11] D. Wei, K. Natesan Ramamurthy, K. R. Varshney, "Distribution-preserving k-anonymity," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 11 no. 6, pp. 253-270, DOI: 10.1002/sam.11374, 2018.

[12] P. R. Bhaladhare, D. C. Jinwala, "Novel approaches for privacy preserving data mining in k-anonymity model," Journal of Information Science and Engineering, vol. 32 no. 1, pp. 63-78, 2016.

[13] M. S. Simi, K. S. Nayaki, M. S. Elayidom, "An extensive study on data anonymization algorithms based on K-anonymity," IOP Conference Series: Materials Science and Engineering, vol. 225, DOI: 10.1088/1757-899x/225/1/012279, 2017.

[14] H. Kaur, N. Kumar, S. Batra, "ClaMPP: a cloud-based multi-party privacy preserving classification scheme for distributed applications," The Journal of Supercomputing, vol. 75 no. 6, pp. 3046-3075, DOI: 10.1007/s11227-018-2691-0, 2019.

[15] G. G. Dagher, B. C. M. Fung, N. Mohammed, J. Clark, "SecDM: privacy-preserving data outsourcing framework with differential privacy," Knowledge and Information Systems, vol. 62 no. 5, pp. 1923-1960, DOI: 10.1007/s10115-019-01405-7, 2020.

[16] A. F. Westin, "Privacy and freedom," American Sociological Review, vol. 33 no. 1, DOI: 10.2307/2092293, 1968.

[17] M. Templ, Statistical disclosure control for microdata, 2017.

[18] W. Mahanan, W. A. Chaovalitwongse, J. Natwichai, "Data anonymization: a novel optimal k-anonymity algorithm for identical generalization hierarchy data in IoT," Service Oriented Computing and Applications, vol. 14 no. 2, pp. 89-100, DOI: 10.1007/s11761-020-00287-w, 2020.

[19] S. Mayil, M. Vanitha, C. Science, J. J. College, T. St, "A survey on privacy preserving data mining techniques for clinical decision support system," International Research Journal of Engineering and Technology, vol. 5 no. 5, pp. 6054-6056, 2016.

[20] N. Uttarwar, M. A. Pradhan, "K-NN data classification technique using semantic search on encrypted relational data base," 2016 International Conference on Computing Communication Control and automation (ICCUBEA).

[21] K. El Makkaoui, A. Beni-Hssane, A. Ezzati, A. El-Ansari, "Fast Cloud-RSA scheme for promoting data confidentiality in the cloud computing," Procedia Computer Science, vol. 113, pp. 33-40, DOI: 10.1016/j.procs.2017.08.282, 2017.

[22] W. Wang, L. Chen, Q. Zhang, "Outsourcing high-dimensional healthcare data to cloud with personalized privacy preservation," Computer Networks, vol. 88, pp. 136-148, DOI: 10.1016/j.comnet.2015.06.014, 2015.

[23] K. El Makkaoui, A. Beni-Hssane, A. Ezzati, "Speedy Cloud-RSA homomorphic scheme for preserving data confidentiality in cloud computing," Journal of Ambient Intelligence and Humanized Computing, vol. 10 no. 12, pp. 4629-4640, DOI: 10.1007/s12652-018-0844-x, 2019.

[24] D. Chandravathi, P. V. Lakshmi, "Privacy preserving using extended Euclidean algorithm applied to RSA-homomorphic encryption technique," International Journal of Innovative Technology and Exploring Engineering, vol. 8 no. 10, pp. 3175-3179, DOI: 10.35940/ijitee.j1236.0881019, 2019.

[25] P. Shyja Rose, J. Visumathi, H. Haripriya, "Research paper on privacy preservation by data anonymization in public cloud for hospital management on big data," International Journal of Control Theory and Applications, 2016.

[26] Y. A. A. S. Aldeen, M. Salleh, "Privacy preserving data utility mining architecture," Smart Cities Cybersecurity and Privacy, pp. 253-268, 2019.

[27] Y. A. A. S. Aldeen, M. Salleh, "Techniques for privacy preserving data publication in the cloud for smart city applications," Smart Cities Cybersecurity and Privacy, pp. 129-145, 2019.

[28] Y. A. A. S. Aldeen, M. Salleh, "A hybrid K-anonymity data relocation technique for privacy preserved data mining in cloud computing," Journal of Internet Computing and Services, vol. 17 no. 5, pp. 51-58, 2016.

[29] H. Lee, S. Kim, J. W. Kim, Y. D. Chung, "Utility-preserving anonymization for health data publishing," BMC Medical Informatics and Decision Making, vol. 17 no. 1, 2017.

[30] Y. A. A. S. Aldeen, M. Salleh, Y. Aljeroudi, "An innovative privacy preserving technique for incremental datasets on cloud computing," Journal of Biomedical Informatics, vol. 62, pp. 107-116, 2016.

[31] S. R. P. Reddy, K. V. S. V. N. Raju, V. V. Kumari, "Personalized privacy preserving incremental data dissemination through optimal generalization," International Journal of Engineering & Technology, vol. 7 no. 2.20, DOI: 10.14419/ijet.v7i2.20.13296, 2018.

[32] R. V. Sudhakar, T. C. M. Rao, "Security aware index based quasi–identifier approach for privacy preservation of data sets for cloud applications," Cluster Computing, 2020.

[33] S. A. Onashoga, B. A. Bamiro, A. T. Akinwale, J. A. Oguntuase, "KC-Slice: a dynamic privacy-preserving data publishing technique for multisensitive attributes," Information Security Journal: A Global Perspective, vol. 26 no. 3, pp. 121-135, DOI: 10.1080/19393555.2017.1319522, 2017.

[34] R. Wang, Y. Zhu, T.-S. Chen, C.-C. Chang, "Privacy-preserving algorithms for multiple sensitive attributes satisfying t-closeness," Journal of Computer Science and Technology, vol. 33 no. 6, pp. 1231-1242, DOI: 10.1007/s11390-018-1884-6, 2018.

[35] S. Srijayanthi, T. Sethukarasi, A. Thilagavathy, "Efficient anonymization algorithm for multiple sensitive attributes," International Journal of Innovative Technology and Exploring Engineering, vol. 9 no. 1, pp. 4961-4963, DOI: 10.35940/ijitee.a4486.119119, 2019.

[36] L. Huang, J. Song, Q. Lu, X. Liu, C. Zhang, "Hypergraph-based solution for selecting quasi-identifier," International Journal of Digital Content Technology and its Applications, vol. 6 no. 20, pp. 597-606, DOI: 10.4156/jdcta.vol6.issue20.65, 2012.

[37] A. M. Omer, M. M. Bin Mohamad, "Simple and effective method for selecting quasi-identifier," Journal of Theoretical and Applied Information Technology, vol. 89 no. 2, pp. 512-517, 2016.

[38] Y. J. Lee, K. H. Lee, "Re-identification of medical records by optimum quasi-identifiers," 2017 19th international conference on advanced communication technology (ICACT), pp. 428-435.

[39] K. S. Wong, N. A. Tu, D. M. Bui, S. Y. Ooi, M. H. Kim, "Privacy-preserving collaborative data anonymization with sensitive quasi-identifiers," 2019 12th CMI Conference on Cybersecurity and Privacy (CMI).

[40] Y. Sei, H. Okumura, T. Takenouchi, A. Ohsuga, "Anonymization of sensitive quasi-identifiers for l-diversity and t-closeness," IEEE Transactions on Dependable and Secure Computing, vol. 16 no. 4, pp. 580-593, DOI: 10.1109/tdsc.2017.2698472, 2019.

[41] N. Victor, D. Lopez, "Privacy preserving sensitive data publishing using (k, n, m) anonymity approach," Journal of communications software and systems, vol. 16 no. 1, pp. 46-56, DOI: 10.24138/jcomss.v16i1.825, 2020.

[42] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, "L-diversity: privacy beyond k-anonymity," 22nd International Conference on Data Engineering (ICDE'06), pp. 24-24, DOI: 10.1109/ICDE.2006.1.

[43] N. Li, T. Li, S. Venkatasubramanian, "t-Closeness: privacy beyond k-anonymity and l-diversity," 2007 IEEE 23rd International Conference on Data Engineering, pp. 106-115, DOI: 10.1109/ICDE.2007.367856.

[44] H. Y. Tran, J. Hu, "Privacy-preserving big data analytics a comprehensive survey," Journal of Parallel and Distributed Computing, vol. 134, pp. 207-218, DOI: 10.1016/j.jpdc.2019.08.007, 2019.

[45] K. Patel, G. B. Jethava, "Privacy preserving techniques for big data: a survey," 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 194-199.

[46] E. E. Brown, "Improving privacy preserving methods to enhance data mining for correlation research," SoutheastCon 2017.

[47] X. Jiang, A. D. Sarwate, L. Ohno-Machado, "Privacy technology to support data sharing for comparative effectiveness research: a systematic review," Medical Care, vol. 51, pp. S58-S65, 2013.

[48] K. Benitez, B. Malin, "Evaluating re-identification risks with respect to the HIPAA privacy rule," Journal of the American Medical Informatics Association, vol. 17 no. 2, pp. 169-177, DOI: 10.1136/jamia.2009.000026, 2010.

[49] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, J. Chen, "Combining top-down and bottom-up: scalable sub-tree anonymization over big data using MapReduce on cloud," 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp. 501-508.

[50] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, J. Chen, "A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud," Journal of Computer and System Sciences, vol. 80 no. 5, pp. 1008-1020, DOI: 10.1016/j.jcss.2014.02.007, 2014.

[51] F. Prasser, R. Bild, K. A. Kuhn, "A generic method for assessing the quality of de-identified health data," Studies in Health Technology and Informatics, vol. 228, pp. 312-316, 2016.

[52] S. Moro, P. Cortez, P. Rita, UCI Machine Learning Repository: Bank Marketing Data Set, 2014. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

[53] R. Kohavi, B. Becker, Adult Census Income | Kaggle, 2016. https://www.kaggle.com/uciml/adult-census-income

[54] F. Prasser, K. A. Kuhn, J. Eicher, "Flexible data anonymization using ARX—current status and challenges ahead," Software: Practice and Experience, vol. 50 no. 7, pp. 1277-1304, DOI: 10.1002/spe.2812, 2020.

Copyright © 2021 Huda O. Mansour et al.

Abstract

Cloud computing plays an essential role as a destination for outsourcing data for mining operations or other data processing, especially for data owners who lack the resources or experience to execute data mining techniques themselves. However, the privacy of outsourced data is a serious concern. Most data owners use anonymization-based techniques to prevent identity and attribute disclosure, and thus avoid privacy leakage, before outsourcing data for mining over the cloud. In addition, data collection and dissemination in a resource-limited network such as a sensor cloud require efficient methods to reduce privacy leakage. The main cause of identity disclosure is quasi-identifier (QID) linking, but most research on anonymization methods ignores the identification of proper QIDs. This reduces the validity of the anonymization methods used and may thus lead to a failure of the anonymization process. This paper introduces a new quasi-identifier recognition algorithm that reduces the identity disclosure resulting from QID linking. The proposed algorithm comprises two main stages: (1) attribute classification (or QID recognition) and (2) QID dimension identification. The algorithm uses the reidentification risk rate of all attributes and the dimension of the QIDs to determine the proper QIDs and their suitable dimension. The proposed algorithm was tested on a real dataset. The results demonstrate that the proposed algorithm significantly reduces privacy leakage and maintains data utility compared with recent related algorithms.

Details

Title

Quasi-Identifier Recognition Algorithm for Privacy Preservation of Cloud Data Based on Risk Reidentification

Author

Mansour, Huda O¹; Siraj, Maheyzah M²; Ghaleb, Fuad A³; Saeed, Faisal⁴; Alkhammash, Eman H⁵; Maarof, Mohd A³

¹ Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia (UTM), Johor 81310, Malaysia; Department of Computer Science, Faculty of Computer Science and Information Technology, University of Kassala, Kassala 31111, Sudan
² Department of Computer Science, Faculty of Computer Science and Information Technology, University of Kassala, Kassala 31111, Sudan
³ Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia (UTM), Johor 81310, Malaysia
⁴ College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
⁵ Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia

Editor

Ihsan Ali

Publication year

2021

Publication date

2021

Publisher

John Wiley & Sons, Inc.

e-ISSN

15308677

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2021/7154705

ProQuest document ID

2569267329
