This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
1. Introduction
In the modern information age, many companies use external data services for processing, storage, or services such as data mining. Unlimited computational resources, reduced costs, freedom from maintenance, and no need to acquire proficiency in specialized services have all encouraged this shift. However, security and privacy concerns still hinder the use of the features offered by the cloud [1]. Numerous studies have shown that attackers often expose information held by third-party services or third-party clouds [2]. For example, Dropbox suffered a security breach in October 2014. The attackers stole 700 user passwords in order to obtain the cash value of Bitcoins (BTC). In 2015, the information of more than 4 million users, including names, dates of birth, addresses, e-mail addresses, phone numbers, and other sensitive data, was leaked through the TalkTalk service provider in the UK. In 2016, Time Warner, one of the largest cable television companies in the United States, announced that about 32 million user passwords and e-mail addresses had been stolen by an attacker. In 2017, more than 200 million user records containing names, phone numbers, e-mail addresses, home addresses, and other data were disclosed through the API of the McDelivery company in India [2, 3]. A recent security violation at Google showed that any server administrator who has access to secret information can easily misuse it. Worse, the administrator of an honest-but-curious server can violate privacy without being detected [4].
Three kinds of disclosure can cause privacy leakage: identity disclosure, attribute disclosure, and membership disclosure [5]. In attribute disclosure and identity disclosure, the intruder knows that the tuple of the target individual is in the released dataset and aims to acquire private or sensitive data about that individual from it [6]. The main issues that lead to identity disclosure are quasi-identifier (QID) value linking and the attacker's background knowledge. QIDs are dataset attributes that do not distinguish an individual when each is considered separately but can identify individuals distinctively when several are combined [7]. For example, the attributes date of birth, gender, and ZIP code, taken together, can reidentify individuals, as stated in [8]. Reidentifying individuals by linking their QIDs is called a linking attack. Therefore, the careless publication of QIDs leads to privacy leakage [9].
One of the popular practices to avoid privacy leakage is anonymization. Anonymization can be performed via several types of transformations: removing values, changing the structure, replacing values according to a taxonomy, and combining values. Anonymization-based methods use one operation or a combination of operations to achieve an optimum level of concealment [10]. A commonly utilized privacy criterion of anonymization is
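The effect of combining QIDs can be illustrated with a minimal sketch. The records below are hypothetical toy data, and `unique_fraction` is an illustrative helper rather than part of any cited method; it measures the share of records that are singled out by a chosen set of attributes.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, gender, zip_code)
records = [
    ("1980-01-01", "F", "02138"),
    ("1980-01-01", "M", "02138"),
    ("1980-01-01", "F", "02139"),
    ("1975-06-15", "F", "02138"),
    ("1975-06-15", "M", "02139"),
    ("1975-06-15", "M", "02138"),
]

def unique_fraction(rows, columns):
    """Share of rows whose values on `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

# Each attribute alone singles out nobody in this toy dataset...
single = [unique_fraction(records, [i]) for i in range(3)]
# ...but the three attributes combined single out every record.
combined = unique_fraction(records, [0, 1, 2])

print(single, combined)
```

In this toy case no single attribute isolates a record, while the combination isolates all of them, which is the situation a linking attack exploits.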
[figure omitted; refer to PDF]
[figures omitted; refer to PDF]
For the bank dataset, we identify
Table 1
Classification of the bank dataset.
| Classification | Threshold value | Attributes |
| SAs | | Balance |
| QIDs | | Age, job, education, and marital status |
| NSs | | Default, housing, and loan |
Table 2
Classification of the adult dataset.
| Classification | Threshold value | Attributes |
| SAs | | Capital gain, capital loss |
| QIDs | | Hours-per-week, work-class, age, native-country, education, education-num, occupation, marital-status, relationship, and race |
| NSs | | Sex, income |
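The threshold-based classification summarized in Tables 1 and 2 can be sketched roughly as follows. This is only an illustration: the risk proxy (`distinct_ratio`, the share of distinct values in a column), the threshold values `upper` and `lower`, and the rule that higher risk maps to SA are all assumptions made for this sketch and are not taken from the paper's definitions. The toy columns are hypothetical.

```python
from typing import Dict, Sequence

def distinct_ratio(values: Sequence) -> float:
    """Assumed risk proxy: number of distinct values divided by record count."""
    return len(set(values)) / len(values) if values else 0.0

def classify_attributes(
    table: Dict[str, Sequence],
    upper: float = 0.5,   # assumed threshold separating SAs from QIDs
    lower: float = 0.25,  # assumed threshold separating QIDs from NSs
) -> Dict[str, str]:
    """Label each attribute as SA, QID, or NS from its assumed risk rate."""
    labels = {}
    for name, column in table.items():
        risk = distinct_ratio(column)
        if risk >= upper:
            labels[name] = "SA"
        elif risk >= lower:
            labels[name] = "QID"
        else:
            labels[name] = "NS"
    return labels

# Hypothetical toy columns with 10 records each
toy = {
    "balance": [120, 4500, 87, 990, 15, 3021, 760, 52, 1810, 233],
    "age": [30, 30, 41, 41, 41, 52, 52, 30, 60, 60],
    "housing": ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "yes"],
}
labels = classify_attributes(toy)
print(labels)
```

With these assumed thresholds, the toy `balance` column (all values distinct) is labeled SA, `age` (4 distinct values in 10 records) is labeled QID, and the binary `housing` column is labeled NS.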
After calculating the risk rate of each attribute in the dataset, the attribute is classified according to the selected threshold
Finally, the proposed algorithm returns the QidD that achieves the optimum case as the best dimension to be used in the anonymization process. Table 3 shows the results of finding the best QidD for the adult dataset.
Table 3
Experimental results for selecting the best QidD in the adult dataset.
| QID value | | | | | | |
| | PG % | NUE % | PG % | NUE % | PG % | NUE % |
| 1 | 33.8 | 30.41 | 38.35 | 21.05 | 38.35 | 21.05 |
| 2 | 55.34 | 44.65 | 76.86 | 23.13 | 76.86 | 23.13 |
| 3 | 77.94 | 22.05 | 83.17 | 16.82 | 83.17 | 16.82 |
| 4 | 79.53 | 20.46 | 84.39 | 15.6 | 84.39 | 15.6 |
| 5 | 83.62 | 16.37 | 87.48 | 12.51 | 87.48 | 12.51 |
| 6 | 85.91 | 14.08 | 89.56 | 10.43 | 89.56 | 10.43 |
| 7 | 86.65 | 13.34 | 86.65 | 13.34 | 89.69 | 10.3 |
| 8 | 90.51 | 9.48 | 90.51 | 9.48 | 90.51 | 9.48 |
| 9 | 92.68 | 7.31 | 92.68 | 7.31 | 92.68 | 7.31 |
| 10 | 91.59 | 8.4 | 91.59 | 8.4 | 91.59 | 8.4 |
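One way to read a table like this is to score each QID dimension by its trade-off between privacy gain (PG) and information loss (NUE). The sketch below uses the first PG/NUE column pair of Table 3 and a simple difference score, PG minus NUE. That scoring rule and the function name `best_dimension` are assumptions made for illustration; the selection criterion actually used by the proposed algorithm may differ.

```python
# (QID dimension, PG %, NUE %) from the first PG/NUE column pair of Table 3
rows = [
    (1, 33.8, 30.41),
    (2, 55.34, 44.65),
    (3, 77.94, 22.05),
    (4, 79.53, 20.46),
    (5, 83.62, 16.37),
    (6, 85.91, 14.08),
    (7, 86.65, 13.34),
    (8, 90.51, 9.48),
    (9, 92.68, 7.31),
    (10, 91.59, 8.4),
]

def best_dimension(table):
    """Return the dimension maximizing PG - NUE (an assumed scoring rule)."""
    return max(table, key=lambda row: row[1] - row[2])[0]

best = best_dimension(rows)
print(best)
```

Under this assumed score, the dimension with the largest gap between privacy gain and information loss is selected; other weightings of PG against NUE would be equally easy to substitute.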
According to Table 3, we observed that
[figure omitted; refer to PDF]
To determine the best QidD in the bank dataset, track Table 4 and Figures 6(a)–6(c); it is clear that when
Table 4
Experimental results for selecting the best QidD in the bank dataset.
| QidD | QID | | | | | | |
| | | PG % | NUE % | PG % | NUE % | PG % | NUE % |
| 1 | Age | 23.89 | 45.28 | 36.12 | 17.27 | 36.12 | 17.27 |
| 2 | Age, job | 21.83 | 36.65 | 21.83 | 36.65 | 21.83 | 36.65 |
| 3 | Age, job, marital status | 15.83 | 40.35 | 16.67 | 37.18 | 17.94 | 32.37 |
| 4 | Age, job, marital status, education | 14.88 | 35.93 | 14.88 | 35.93 | 16.43 | 29.26 |
[figures omitted; refer to PDF]
4.3. Performance Benchmark and Discussion
To evaluate the proposed QIR algorithm, we compare it based on
[figure omitted; refer to PDF]
In Figures 9 and 10, it can be observed that at 10% of the dataset and
[figure omitted; refer to PDF]
Generally, for the whole adult data, results of the experiments at
5. Conclusions
Accurate identification of QIDs is important for the success and validity of privacy-preserving methods for outsourced data that seek to avoid privacy leakage caused by QID linking. This paper aims to classify dataset attributes before the anonymization process and to determine the proper QIDs that should be involved in the anonymization operation. A new algorithm is proposed that calculates the reidentification risk of dataset attributes and classifies them into SAs, QIDs, and NSs based on prespecified thresholds. In addition to attribute classification, the algorithm determines the actual dimension of QIDs required in the anonymization process, depending on the amount of privacy provided versus the loss of data quality. The experimental results indicated that the proposed identification algorithm performs better in terms of privacy provided against data utility when compared with other works. Although the proposed algorithm is suitable for use with any method or privacy model concerned with QID attributes, in this paper, we have relied on the
Acknowledgments
The authors would like to acknowledge the Taif University Researchers Supporting Project (number TURSP-2020/292), Taif University, Taif, Saudi Arabia.
[1] J. Domingo-Ferrer, O. Farràs, J. Ribes-González, D. Sánchez, "Privacy-preserving cloud computing on sensitive data: a survey of methods, products, and challenges," Computer Communications, vol. 140–141, pp. 38-60, DOI: 10.1016/j.comcom.2019.04.011, 2019.
[2] S. Aldeen Yousra, S. Mazleena, "A new heuristic anonymization technique for privacy preserved datasets publication on cloud computing," Journal of Physics: Conference Series, vol. 1003, DOI: 10.1088/1742-6596/1003/1/012030, 2018.
[3] C. Bradford, "7 most infamous cloud security breaches - StorageCraft," storagecraft, 2020. https://blog.storagecraft.com/7-infamous-cloud-security-breaches/
[4] B. Chen, P. Cheung, P. Cheung, Y. Kwok, "Cypherdb: a novel architecture for outsourcing secure database processing," IEEE Transactions on Cloud Computing, vol. 6 no. 2, pp. 372-386, DOI: 10.1109/tcc.2015.2511730, 2018.
[5] B. C. M. Fung, K. Wang, R. Chen, P. S. Yu, "Privacy-preserving data publishing," ACM Computing Surveys, vol. 42 no. 4, DOI: 10.1145/1749603.1749605, 2010.
[6] S. A. Abdelhameed, S. M. Moussa, M. E. Khalifa, "Privacy-preserving tabular data publishing: a comprehensive evaluation from web to cloud," Computers & Security, vol. 72, pp. 74-95, 2018.
[7] A. Bampoulidis, I. Markopoulos, M. Lupu, "PrioPrivacy: a local recoding K-anonymity tool for prioritised quasi-identifiers," IEEE/WIC/ACM International Conference on Web Intelligence - Companion Volume, pp. 314-317.
[8] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10 no. 5, pp. 571-588, DOI: 10.1142/s021848850200165x, 2002.
[9] Y. Yan, W. Wang, X. Hao, L. Zhang, "Finding quasi-identifiers for k-anonymity model by the set of cut-vertex," Engineering Letters, vol. 26 no. 1, 2018.
[10] G. Kaur, S. Agrawal, "Differential privacy framework," Impact of Quasi-identifiers on Anonymization, vol. 46, 2019.
[11] D. Wei, K. Natesan Ramamurthy, K. R. Varshney, "Distribution-preserving k-anonymity," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 11 no. 6, pp. 253-270, DOI: 10.1002/sam.11374, 2018.
[12] P. R. Bhaladhare, D. C. Jinwala, "Novel approaches for privacy preserving data mining in k-anonymity model," Journal of Information Science and Engineering, vol. 32 no. 1, pp. 63-78, 2016.
[13] M. S. Simi, K. S. Nayaki, M. S. Elayidom, "An extensive study on data anonymization algorithms based on K-anonymity," IOP Conference Series: Materials Science and Engineering, vol. 225, DOI: 10.1088/1757-899x/225/1/012279, 2017.
[14] H. Kaur, N. Kumar, S. Batra, "ClaMPP: a cloud-based multi-party privacy preserving classification scheme for distributed applications," The Journal of Supercomputing, vol. 75 no. 6, pp. 3046-3075, DOI: 10.1007/s11227-018-2691-0, 2019.
[15] G. G. Dagher, B. C. M. Fung, N. Mohammed, J. Clark, "SecDM: privacy-preserving data outsourcing framework with differential privacy," Knowledge and Information Systems, vol. 62 no. 5, pp. 1923-1960, DOI: 10.1007/s10115-019-01405-7, 2020.
[16] A. F. Westin, "Privacy and freedom," American Sociological Review, vol. 33 no. 1, DOI: 10.2307/2092293, 1968.
[17] M. Templ, Statistical disclosure control for microdata, 2017.
[18] W. Mahanan, W. A. Chaovalitwongse, J. Natwichai, "Data anonymization: a novel optimal k-anonymity algorithm for identical generalization hierarchy data in IoT," Service Oriented Computing and Applications, vol. 14 no. 2, pp. 89-100, DOI: 10.1007/s11761-020-00287-w, 2020.
[19] S. Mayil, M. Vanitha, C. Science, J. J. College, T. St, "A survey on privacy preserving data mining techniques for clinical decision support system," International Research Journal of Engineering and Technology, vol. 5 no. 5, pp. 6054-6056, 2016.
[20] N. Uttarwar, M. A. Pradhan, "K-NN data classification technique using semantic search on encrypted relational data base," 2016 International Conference on Computing Communication Control and automation (ICCUBEA).
[21] K. El Makkaoui, A. Beni-Hssane, A. Ezzati, A. El-Ansari, "Fast Cloud-RSA scheme for promoting data confidentiality in the cloud computing," Procedia Computer Science, vol. 113, pp. 33-40, DOI: 10.1016/j.procs.2017.08.282, 2017.
[22] W. Wang, L. Chen, Q. Zhang, "Outsourcing high-dimensional healthcare data to cloud with personalized privacy preservation," Computer Networks, vol. 88, pp. 136-148, DOI: 10.1016/j.comnet.2015.06.014, 2015.
[23] K. El Makkaoui, A. Beni-Hssane, A. Ezzati, "Speedy Cloud-RSA homomorphic scheme for preserving data confidentiality in cloud computing," Journal of Ambient Intelligence and Humanized Computing, vol. 10 no. 12, pp. 4629-4640, DOI: 10.1007/s12652-018-0844-x, 2019.
[24] D. Chandravathi, P. V. Lakshmi, "Privacy preserving using extended Euclidean algorithm applied to RSA-homomorphic encryption technique," International Journal of Innovative Technology and Exploring Engineering, vol. 8 no. 10, pp. 3175-3179, DOI: 10.35940/ijitee.j1236.0881019, 2019.
[25] P. Shyja Rose, J. Visumathi, H. Haripriya, "Research paper on privacy preservation by data anonymization in public cloud for hospital management on big data," International Journal of Control Theory and Applications, 2016.
[26] Y. A. A. S. Aldeen, M. Salleh, "Privacy preserving data utility mining architecture," Smart Cities Cybersecurity and Privacy, pp. 253-268, 2019.
[27] Y. A. A. S. Aldeen, M. Salleh, "Techniques for privacy preserving data publication in the cloud for smart city applications," Smart Cities Cybersecurity and Privacy, pp. 129-145, 2019.
[28] Y. A. A. S. Aldeen, M. Salleh, "A hybrid K-anonymity data relocation technique for privacy preserved data mining in cloud computing," Journal of Internet Computing and Services, vol. 17 no. 5, pp. 51-58, 2016.
[29] H. Lee, S. Kim, J. W. Kim, Y. D. Chung, "Utility-preserving anonymization for health data publishing," BMC Medical Informatics and Decision Making, vol. 17 no. 1, 2017.
[30] Y. A. A. S. Aldeen, M. Salleh, Y. Aljeroudi, "An innovative privacy preserving technique for incremental datasets on cloud computing," Journal of Biomedical Informatics, vol. 62, pp. 107-116, 2016.
[31] S. R. P. Reddy, K. V. S. V. N. Raju, V. V. Kumari, "Personalized privacy preserving incremental data dissemination through optimal generalization," International Journal of Engineering & Technology, vol. 7 no. 2.20, DOI: 10.14419/ijet.v7i2.20.13296, 2018.
[32] R. V. Sudhakar, T. C. M. Rao, "Security aware index based quasi–identifier approach for privacy preservation of data sets for cloud applications," Cluster Computing, 2020.
[33] S. A. Onashoga, B. A. Bamiro, A. T. Akinwale, J. A. Oguntuase, "KC-Slice: a dynamic privacy-preserving data publishing technique for multisensitive attributes," Information Security Journal: A Global Perspective, vol. 26 no. 3, pp. 121-135, DOI: 10.1080/19393555.2017.1319522, 2017.
[34] R. Wang, Y. Zhu, T.-S. Chen, C.-C. Chang, "Privacy-preserving algorithms for multiple sensitive attributes satisfying t-closeness," Journal of Computer Science and Technology, vol. 33 no. 6, pp. 1231-1242, DOI: 10.1007/s11390-018-1884-6, 2018.
[35] S. Srijayanthi, T. Sethukarasi, A. Thilagavathy, "Efficient anonymization algorithm for multiple sensitive attributes," International Journal of Innovative Technology and Exploring Engineering, vol. 9 no. 1, pp. 4961-4963, DOI: 10.35940/ijitee.a4486.119119, 2019.
[36] L. Huang, J. Song, Q. Lu, X. Liu, C. Zhang, "Hypergraph-based solution for selecting quasi-identifier," International Journal of Digital Content Technology and its Applications, vol. 6 no. 20, pp. 597-606, DOI: 10.4156/jdcta.vol6.issue20.65, 2012.
[37] A. M. Omer, M. M. Bin Mohamad, "Simple and effective method for selecting quasi-identifier," Journal of Theoretical and Applied Information Technology, vol. 89 no. 2, pp. 512-517, 2016.
[38] Y. J. Lee, K. H. Lee, "Re-identification of medical records by optimum quasi-identifiers," 2017 19th international conference on advanced communication technology (ICACT), pp. 428-435.
[39] K. S. Wong, N. A. Tu, D. M. Bui, S. Y. Ooi, M. H. Kim, "Privacy-preserving collaborative data anonymization with sensitive quasi-identifiers," 2019 12th CMI Conference on Cybersecurity and Privacy (CMI).
[40] Y. Sei, H. Okumura, T. Takenouchi, A. Ohsuga, "Anonymization of sensitive quasi-identifiers for l-diversity and t-closeness," IEEE Transactions on Dependable and Secure Computing, vol. 16 no. 4, pp. 580-593, DOI: 10.1109/tdsc.2017.2698472, 2019.
[41] N. Victor, D. Lopez, "Privacy preserving sensitive data publishing using (k, n, m) anonymity approach," Journal of communications software and systems, vol. 16 no. 1, pp. 46-56, DOI: 10.24138/jcomss.v16i1.825, 2020.
[42] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, "L-diversity: privacy beyond k-anonymity," 22nd International Conference on Data Engineering (ICDE'06), pp. 24-24, DOI: 10.1109/ICDE.2006.1.
[43] N. Li, T. Li, S. Venkatasubramanian, "t-Closeness: privacy beyond k-anonymity and l-diversity," 2007 IEEE 23rd International Conference on Data Engineering, pp. 106-115, DOI: 10.1109/ICDE.2007.367856.
[44] H. Y. Tran, J. Hu, "Privacy-preserving big data analytics a comprehensive survey," Journal of Parallel and Distributed Computing, vol. 134, pp. 207-218, DOI: 10.1016/j.jpdc.2019.08.007, 2019.
[45] K. Patel, G. B. Jethava, "Privacy preserving techniques for big data: a survey," 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 194-199.
[46] E. E. Brown, "Improving privacy preserving methods to enhance data mining for correlation research," SoutheastCon 2017.
[47] X. Jiang, A. D. Sarwate, L. Ohno-Machado, "Privacy technology to support data sharing for comparative effectiveness research: a systematic review," Medical Care, vol. 51, pp. S58-S65, 2013.
[48] K. Benitez, B. Malin, "Evaluating re-identification risks with respect to the HIPAA privacy rule," Journal of the American Medical Informatics Association, vol. 17 no. 2, pp. 169-177, DOI: 10.1136/jamia.2009.000026, 2010.
[49] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, J. Chen, "Combining top-down and bottom-up: scalable sub-tree anonymization over big data using MapReduce on cloud," 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp. 501-508.
[50] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, J. Chen, "A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud," Journal of Computer and System Sciences, vol. 80 no. 5, pp. 1008-1020, DOI: 10.1016/j.jcss.2014.02.007, 2014.
[51] F. Prasser, R. Bild, K. A. Kuhn, "A generic method for assessing the quality of de-identified health data," Studies in Health Technology and Informatics, vol. 228, pp. 312-316, 2016.
[52] S. Moro, P. Cortez, P. Rita, UCI Machine Learning Repository: Bank Marketing Data Set, 2014. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
[53] R. Kohavi, B. Becker, Adult Census Income | Kaggle, 2016. https://www.kaggle.com/uciml/adult-census-income
[54] F. Prasser, K. A. Kuhn, J. Eicher, "Flexible data anonymization using ARX—current status and challenges ahead," Software: Practice and Experience, vol. 50 no. 7, pp. 1277-1304, DOI: 10.1002/spe.2812, 2020.
Copyright © 2021 Huda O. Mansour et al.
Abstract
Cloud computing plays an essential role as a platform for outsourcing data for mining operations or other data processing, especially for data owners who do not have sufficient resources or experience to execute data mining techniques. However, the privacy of outsourced data is a serious concern. Most data owners use anonymization-based techniques to prevent identity and attribute disclosure, and thus avoid privacy leakage, before outsourcing data for mining over the cloud. In addition, data collection and dissemination in a resource-limited network such as a sensor cloud require efficient methods to reduce privacy leakage. The main issue that causes identity disclosure is quasi-identifier (QID) linking, but most research on anonymization methods ignores the identification of proper QIDs. This reduces the validity of the anonymization methods used and may thus lead to a failure of the anonymization process. This paper introduces a new quasi-identifier recognition algorithm that reduces the identity disclosure resulting from QID linking. The proposed algorithm comprises two main stages: (1) attribute classification (or QID recognition) and (2) QID dimension identification. The algorithm works on the reidentification risk rate of all attributes and on the dimension of QIDs, determining the proper QIDs and their suitable dimensions. The proposed algorithm was tested on a real dataset. The results demonstrated that the proposed algorithm significantly reduces privacy leakage and maintains data utility compared to recent related algorithms.
Details
Siraj, Maheyzah M 2
Ghaleb, Fuad A 3
Saeed, Faisal 4
Alkhammash, Eman H 5
Maarof, Mohd A 3
1 Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia (UTM), Johor 81310, Malaysia; Department of Computer Science, Faculty of Computer Science and Information Technology, University of Kassala, Kassala 31111, Sudan
2 Department of Computer Science, Faculty of Computer Science and Information Technology, University of Kassala, Kassala 31111, Sudan
3 Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia (UTM), Johor 81310, Malaysia
4 College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
5 Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia




