Abstract
Data collected for providing recommendations can be partitioned among different parties. Offering predictions based on such distributed data is popular due to mutual advantages. It is almost impossible to present trustworthy referrals with decent accuracy from split data only; meaningful outcomes can be drawn from adequate data. Companies with distributed data might therefore want to collaborate to produce accurate and dependable recommendations for their customers. However, they hesitate to work together or refuse to collaborate because of privacy, financial, and legal concerns. If privacy-preserving measures are provided, such data holders might decide to collaborate for better predictions. In this study, we investigate how to provide predictions based on vertically distributed data (VDD) among multiple parties without deeply jeopardizing their confidentiality. Users are first grouped into various clusters off-line using self-organizing map clustering while protecting the online vendors' privacy. Recommendations are then produced from the partitioned data using a nearest neighbour prediction algorithm while privacy is preserved. We analyse our privacy-preserving scheme in terms of confidentiality and supplementary costs. Our analysis shows that our method offers recommendations without greatly exposing data holders' privacy and causes negligible superfluous costs due to privacy concerns. To evaluate the scheme in terms of accuracy, we perform real-data-based experiments. Our experimental results demonstrate that the scheme is still able to provide truthful predictions.
1. Introduction
Online companies help customers purchase or sell various kinds of products over the Internet. With the escalating popularity of the Internet, e-commerce has been receiving increasing attention. Owing to the attractiveness and profitability of business over the Internet, online vendors provide several services, such as offering recommendations to their customers to satisfy their needs. Meziane and Kasiran (2008) present the development of information extraction systems in e-commerce. Collaborative filtering (CF) techniques are widely utilized to estimate predictions using other users' preferences about products (Ahn, 2008).
CF systems collect users' preferences about various items and offer referrals to them based on these data (Grcar, 2004). When a customer, referred to as the active user (a), requests a prediction for a target item (q), the system first determines the similarities between a and each user u in the database. The most similar users are then used to form a's neighbourhood. Finally, her neighbours' data are utilized to estimate a recommendation for a on q, referred to as p_aq.
Customers may use different vendors for their different needs because the Internet provides people with various sources for their daily activities. A person may purchase a book from BarnesAndNoble.com and a movie from Amazon.com. Moreover, she might buy a CD from CDNow.com, and so on. Thus, preferences might be vertically distributed among multiple companies, even competing ones. In other words, different retailers might hold ratings of different items for the same users. Companies end up with vertically distributed data (VDD) because of viable business models, specialized vendors, and the existence of a great number of vendors.
If customers' ratings are vertically partitioned among multiple parties, the companies might face the coverage problem. In other words, they might not be able to provide predictions for many items; they can offer referrals for a limited number of items only. Furthermore, the more ratings the parties own, the more accurate the outcomes they produce. User-user similarities are estimated based on commonly rated items. As the number of commonly rated products increases, the dependability and correctness of the similarity weights improve. Therefore, recommendations based on scarce data might be inaccurate and frustrate customers. When data are distributed, the parties can face the above-mentioned problems. E-commerce sites can produce more truthful and reliable predictions if they decide to collaborate (Kaleli and Polat, 2007a). Mutual advantages push online vendors to collaborate; through partnership, they can overcome the preceding challenges. Although it is advantageous for online companies to produce recommendations on integrated data, they might not want to work together because of privacy, financial, and legal concerns. Collected data are considered companies' confidential data. Moreover, they are regarded as valuable assets. Finally, revealing users' data or transferring them might cause legal problems (Oliveira et al, 2004). According to reports published by the OECD (OECD, 1998, 2000, 2005) and Bertino et al (2006), exposing customers' private data is a very serious issue, and companies are obliged to protect such data.
Owing to mutual advantages, privacy-preserving data mining applications are very attractive (Clifton et al, 2004; Bhowmick et al, 2006). In order to solve life-threatening problems like efficient disease control and effective public safety, it is inevitable to amalgamate data (Clifton et al, 2004; Bhowmick et al, 2006). As stated in Clifton et al (2004), firefighting departments in Illinois regularly ask for sample instructions and training materials from fellow firefighting departments to provide the most up-to-date and effective community defence. However, fellow departments hesitate to provide such regulations and materials because they do not want to be held responsible if their programmes are considered inadequate. If their identity is protected, they can share their material.
Besides having enough data and achieving privacy, producing predictions efficiently is critical. Thus, researchers have introduced various approaches to improve the online performance of CF services. In other words, to improve scalability, researchers propose to use dimension reduction and clustering techniques. Goldberg et al (2001) utilize principal component analysis in their scheme and produce recommendations in constant time, O(1). Sarwar et al (2000) use singular value decomposition (SVD) to reduce dimensions. Clustering is also utilized to enhance efficiency. Kim et al (2002) propose a method that reduces the search space by using the k-means clustering algorithm. Xue et al (2005) utilize clustering for data smoothing and neighbourhood selection. Roh et al (2003) propose to use self-organizing map (SOM) clustering for improving the efficiency of the k-nn CF algorithm. SOM is an unsupervised competitive learning method used for clustering and data visualization. It works well for dividing input data into the closest clusters. According to the study in Roh et al (2003), the SOM clustering approach alleviates the online computational complexity and improves the scalability of the recommendation process.
In this study, we propose a method for providing predictions on VDD among multiple parties without greatly jeopardizing their privacy. We assume that n users' preferences for m items are vertically partitioned among z parties. Users are grouped into various clusters using SOM clustering off-line. After determining a's cluster, the users in that cluster are considered the k users most similar to a. As off-line costs are not critical for the success of CF, our scheme performs some computations off-line. We analyse the scheme in terms of privacy and performance, and perform real-data-based experiments for accuracy analysis. As explained before, on the one hand, companies holding VDD might want to collaborate to overcome the problems caused by scarce data; on the other hand, they do not want to work together because of privacy concerns. Our scheme makes it possible for such data holders to provide predictions on integrated data without disclosing their confidential data to each other. Using our method, the parties can overcome coverage and accuracy problems through partnership. Additionally, as they do not reveal their private data to each other, they do not face privacy, financial, or legal issues.
2. Related work
With the increasing popularity of distributed data-based computations, several works have been proposed to perform different data mining functionalities on partitioned data while preserving data holders' privacy. Such studies help data owners collaborate when they own inadequate data and need to combine their split data for better services. Kantarcioglu and Vaidya (2003) study how to learn a distributed naïve Bayesian classifier (NBC) securely using secure summation and logarithm on horizontally distributed data (HDD). Vaidya et al (2008) propose a method producing decision trees using the ID3 algorithm over VDD with privacy. Lin et al (2005) propose a technique for clustering distributed data. The authors utilize EM mixture modelling to perform clustering, and address the task of clustering with privacy where the distributed data are horizontally partitioned. Liu et al (2006) explore the use of multiplicative random projection matrices for privacy-preserving distributed data mining. Their technique can be applied to different types of privacy-preserving data mining applications over partitioned data. Kaya et al (2007) propose a privacy-preserving distributed clustering protocol for HDD. The authors utilize an efficient additive homomorphic secret sharing scheme, yielding a more efficient model for many data mining applications.
Privacy-preserving collaborative filtering (PPCF) on distributed data is an attractive issue for researchers. As data owners need to collaborate with other data owners to overcome the challenges caused by having insufficient data, various PPCF schemes on distributed data have been proposed. As peer-to-peer (P2P)-based CF systems remove the control of a central server, the authors in Canny (2002), Berkovsky et al (2005), and Kaleli and Polat (2010) propose P2P-based private recommendation schemes. Canny (2002) proposes to utilize homomorphic encryption to produce P2P system-based private referrals; in his scheme, users in a P2P network can produce private recommendations. Berkovsky et al (2005) present a solution that enables users to obfuscate their data and get private predictions. Kaleli and Polat (2010) introduce a private P2P recommendation system based on NBC; they utilize randomized response techniques to disguise users' personal data. Besides P2P network solutions, researchers present solutions for producing private recommendations on data distributed between two parties. Polat and Du (2008) present a scheme to produce binary ratings-based top-N recommendations on data distributed between two parties. Kaleli and Polat (2007a) introduce a method that enables data owners to combine their split data; the authors propose privacy-preserving schemes to produce NBC-based recommendations on partitioned data. Lathia et al (2007) propose a new private method to measure similarity, which is calculated over distributed data by estimating the number of concordant, discordant, and tied pairs of ratings between two users. Yakut and Polat (2010) investigate providing SVD-based referrals on partitioned data while protecting data holders' privacy.
Without privacy protection, customers usually refuse to provide data at all or decide to give false data. As CF systems' success depends mainly on the amount of data collected and its quality, researchers have proposed several solutions for producing recommendations while protecting users' privacy (Polat and Du, 2003; Polat and Du, 2005; Zhang et al, 2006; Parameswaran and Blough, 2007; Kaleli and Polat, 2007b; Yakut and Polat, 2007). The authors propose to use various data disguising and perturbation techniques to mask private user data so that CF services can be achieved without jeopardizing data owners' privacy. They show that it is still possible to produce accurate recommendations while protecting privacy.
Our work has some differences from the aforementioned works. As our work aims to overcome data holders' problem of having insufficient data, we propose a solution for the collaboration of data owners while producing recommendations without jeopardizing their privacy. P2P-based solutions cannot be applied to central server-based CF applications, which are commonly used by online buyers. The proposed solutions for data partitioned between two parties have the same motivation as our work; on the other hand, they assume the collaboration of two data owners only, while our solution makes it possible for multiple parties to collaborate. Finally, because SOM clustering is conducted off-line and a's neighbourhood is formed off-line, our scheme does not suffer from scalability problems. Also, note that distributed data introduce additional workloads, which might cause online efficiency problems. However, in our scheme, two major steps of a traditional CF scheme, similarity weight estimation and neighbourhood formation, are performed off-line. Off-line costs are not critical to the overall performance compared with online costs. The collaborating parties perform the last step, recommendation estimation, online in a parallel manner. As a result, our work makes it possible for online vendors to provide CF services efficiently on VDD among multiple data owners without deeply exposing their privacy.
3. Background
In this section, we briefly explain SOM clustering and SOM-based CF. SOM is a type of artificial neural network introduced by Kohonen (1995). It reduces dimensions to one or two by producing a map that shows the similarities of the data, grouping similar objects together. Suppose that D_{n×m} is the input data to be clustered, t is the number of neurons in the Kohonen layer, and the W_j are the initial weights, chosen randomly among the objects in D for j = 1, 2, ..., t. SOM clustering works as follows (Kohonen, 1995):
(i) Determine the values of the initial constants, η_0, σ_0, τ_1, and τ_2, where η_0 = 0.1, σ_0 is the radius of the lattice, τ_1 = 1000/log σ_0, and τ_2 = 1000. These constant values follow Haykin (1999).
(ii) Find the winning Kohonen-layer neuron. To do so, a random object o_{1×m} is selected from D and the winning Kohonen neuron (KN_i) is determined as the neuron whose weight vector has the minimum Euclidean distance to o, using Equation (1):
i = \arg\min_{j} \lVert o - W_j \rVert, \quad j = 1, 2, \ldots, t. \qquad (1)
(iii) Update the W_j of all neurons using Equation (2):
W_j(s+1) = W_j(s) + \eta(s)\, h_{ji}(s)\, (o - W_j(s)), \qquad (2)
where s shows the iteration, η(s) is the learning rate, and h_{ji}(s) is the neighbourhood function. η(s) and h_{ji}(s) are computed using Equations (3) and (4):
\eta(s) = \eta_0 \exp(-s/\tau_2), \qquad (3)
h_{ji}(s) = \exp\!\left(-\frac{d_{ji}^2}{2\sigma^2(s)}\right), \qquad (4)
where d_{ji} is the lattice distance between neuron j and the winning neuron i, and σ(s) = σ_0 exp(-s/τ_1).
(iv) Repeat from step (ii) until there is no noticeable change in the feature map. A minimal sketch of this procedure is given below.
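The following is an illustrative Python/numpy sketch of the SOM procedure above; the function name, the one-dimensional lattice, and the default epoch count are our own assumptions rather than part of the original scheme.

```python
import numpy as np

def som_cluster(D, t, epochs=1000, eta0=0.1, rng=None):
    """Illustrative SOM sketch: D is an n x m data matrix, t is the number of
    Kohonen-layer neurons (clusters). Returns the neuron weights and each
    object's winning neuron. A one-dimensional lattice is assumed."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = D.shape
    W = D[rng.choice(n, size=t, replace=False)].astype(float)  # initial weights drawn from D
    sigma0 = max(t / 2.0, 1.5)              # assumed lattice radius
    tau1 = epochs / np.log(sigma0)          # time constants in the spirit of Haykin (1999)
    tau2 = float(epochs)
    for s in range(epochs):
        o = D[rng.integers(n)]                                  # random input object
        i = int(np.argmin(np.linalg.norm(W - o, axis=1)))       # winning neuron KN_i, Eq. (1)
        eta = eta0 * np.exp(-s / tau2)                          # learning rate eta(s), Eq. (3)
        sigma = sigma0 * np.exp(-s / tau1)                      # shrinking neighbourhood radius
        d_lat = np.abs(np.arange(t) - i)                        # lattice distance to the winner
        h = np.exp(-(d_lat ** 2) / (2 * sigma ** 2))            # neighbourhood function, Eq. (4)
        W += eta * h[:, None] * (o - W)                         # weight update, Eq. (2)
    labels = np.argmin(np.linalg.norm(D[:, None, :] - W[None], axis=2), axis=1)
    return W, labels
```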
Mangiameli et al (1996) evaluate the clustering capability of SOM and state that the SOM network is superior to hierarchical clustering algorithms. Mingoti and Lima (2006) compare SOM with other clustering algorithms, such as fuzzy c-means and k-means, according to their clustering capabilities on different data structures. Their results show that there are situations in which SOM performs well, even though it is not always the best. Roh et al (2003) apply SOM clustering to CF, and their empirical results demonstrate that the SOM-based CF design supplies higher-quality referrals than other comparative models. Hence, in this study, we also use SOM clustering to group users off-line to achieve the dual goal of better accuracy and performance.
As shown in Figure 1, in a CF algorithm based on k-nn, a sends her ratings vector (A) and the target item (q), for which she is looking for a prediction, to a central server or CF system. First, the k most similar users are determined as the k nearest neighbours of a. Then, the server uses a k-nn CF algorithm (using the neighbours' data only) to estimate a prediction for a on item q, called p_aq. It finally sends the prediction to a.
Owing to the importance of online performance, various approaches can be used to enhance it. SOM clustering is among such methods; it is used to cluster users off-line for faster CF. In a SOM-based CF scheme, as shown in Figure 2, users are clustered off-line, where x shows the number of clusters. To obtain p_aq, a sends A and q to the server, which first computes the distances between A and each cluster centre using the Euclidean distance measure. It then assigns a to the closest cluster. The users in that cluster are considered a's neighbours. After estimating p_aq, the server finally sends it to a.
The server estimates p_aq as follows (Herlocker et al, 1999):
p_{aq} = \bar{v}_a + \frac{\sum_{u \in S} w_{au} (v_{uq} - \bar{v}_u)}{\sum_{u \in S} |w_{au}|} \qquad (5)
where \bar{v}_a and \bar{v}_u are a's and user u's mean votes, respectively, S is the set of users in a's cluster who rated q, v_{uq} is u's rating on q, and w_{au} is the similarity between a and u, which can be computed using the adjusted cosine measure as follows (Sarwar et al, 2001):
w_{au} = \frac{\sum_{j \in J} (v_{aj} - \bar{v}_j)(v_{uj} - \bar{v}_j)}{\lVert X_a \rVert \, \lVert X_u \rVert} \qquad (6)
where the numerator is the scalar dot product of the mean-centred rating vectors over J, J is the set of items commonly rated by both a and u, \bar{v}_j is item j's mean rating, and \lVert X_a \rVert and \lVert X_u \rVert are the corresponding rating vector lengths (see Equation (9)).
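As an illustration only, the following Python sketch computes p_aq from Equations (5) and (6) as reconstructed above; the function name, the nan-based encoding of unrated cells, and the handling of empty neighbourhoods are our own assumptions.

```python
import numpy as np

def predict_knn(a_ratings, q, R, cluster_members):
    """Sketch of Equations (5)-(6): R is an n x m ratings matrix with np.nan
    for unrated cells, a_ratings is the active user's vector, q the target
    item index, and cluster_members the row indices of a's cluster."""
    item_means = np.nanmean(R, axis=0)        # \bar{v}_j (assumes every item has a rating)
    v_a = np.nanmean(a_ratings)               # active user's mean vote \bar{v}_a
    num = den = 0.0
    for u in cluster_members:
        if np.isnan(R[u, q]):
            continue                                          # u did not rate q, so u is not in S
        both = ~np.isnan(a_ratings) & ~np.isnan(R[u])         # commonly rated items J
        if not both.any():
            continue
        da = a_ratings[both] - item_means[both]               # mean-centred ratings of a
        du = R[u, both] - item_means[both]                    # mean-centred ratings of u
        norm = np.linalg.norm(da) * np.linalg.norm(du)
        w_au = float(da @ du) / norm if norm > 0 else 0.0     # adjusted cosine, Eq. (6)
        num += w_au * (R[u, q] - np.nanmean(R[u]))            # numerator of Eq. (5)
        den += abs(w_au)                                      # denominator of Eq. (5)
    return v_a + num / den if den > 0 else v_a                # p_aq
```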
Homomorphic encryption can be used to achieve privacy. Suppose that ξ is an encryption function, e is a public key, and v_1 and v_2 are private data values that we want to hide. Homomorphic encryption allows an addition or a multiplication operation to be conducted on the encrypted data without decrypting them, as follows: ξ_e(v_1) × ξ_e(v_2) = ξ_e(v_1 + v_2) and ξ_e(v_2)^{v_1} = ξ_e(v_1 × v_2). We use the efficient homomorphic encryption scheme proposed by Paillier (1999).
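For illustration, the toy Paillier implementation below demonstrates these two homomorphic properties; the tiny primes, the fixed random factors, and the class name are our own choices for readability and are far too small to be secure in practice.

```python
from math import gcd

class TinyPaillier:
    """Toy Paillier cryptosystem (insecure parameters) illustrating the
    additive homomorphic properties used in the paper:
    E(v1) * E(v2) mod n^2 = E(v1 + v2) and E(v2)^v1 mod n^2 = E(v1 * v2)."""
    def __init__(self, p=293, q=433):
        self.n = p * q
        self.n2 = self.n * self.n
        self.lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lambda = lcm(p-1, q-1)
        self.g = self.n + 1                                # standard choice of g
        self.mu = pow(self.lam, -1, self.n)                # mu = lambda^{-1} mod n (Python >= 3.8)

    def encrypt(self, m, r=17):
        # c = g^m * r^n mod n^2 (r should be random and coprime to n in practice)
        return (pow(self.g, m, self.n2) * pow(r, self.n, self.n2)) % self.n2

    def decrypt(self, c):
        u = pow(c, self.lam, self.n2)
        return ((u - 1) // self.n) * self.mu % self.n      # L(u) * mu mod n

pk = TinyPaillier()
c1, c2 = pk.encrypt(12, r=23), pk.encrypt(30, r=29)
assert pk.decrypt(c1 * c2 % pk.n2) == 42       # addition under encryption
assert pk.decrypt(pow(c2, 5, pk.n2)) == 150    # scalar multiplication via ciphertext exponentiation
```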
4. VDD-based recommendations with privacy
In the privacy-preserving data mining literature, it is difficult to define privacy succinctly. In this study, we define privacy as follows: the parties should not be able to learn the true rating values and the rated items held by each other while providing SOM-based recommendations based on their distributed data. As a sends her data to the company holding q, called the prime company (PC), the PC is responsible for protecting a's data as well. To provide predictions on VDD without deeply violating data owners' privacy, we propose the protocols explained in the following.
Protocol 1 SOMP : In the SOM-based recommendation process, the first step is clustering users using SOM clustering. Thus, vendors first need to cluster data using SOM clustering off-line. The protocol, called private SOM clustering on multi-party vertically distributed data (SOMP), enables them to cluster distributed data without exposing their privacy as follows:
(1) The parties decide the number of clusters (x) and determine the initial constant values, as explained in Section 3.
(2) One of the parties then uniformly randomly chooses x users whose data form the initial weight (W_j) values and one user as o from among the users in its database, and informs the others. Owing to VDD, each party knows only the values of the initial weights W_j and of o that it holds, and does not know the values the other parties hold.
(3) To determine KN_i, the parties need to estimate the distances between o and each W_j using the Euclidean distance measure. As data are vertically partitioned, each company i = 1, 2, ..., z can compute the partial sums Σ_j (o_{ij} - W_{ij})² using the data items it holds, for all W_j. They then exchange such aggregate values. After calculating the sums of these aggregates, they take their square roots and obtain the distances, and each party then determines KN_i. Note that the distance between any two vectors X and Y whose elements are vertically distributed among z companies can be estimated using the Euclidean distance measure as follows (see also the sketch following this protocol):
d(X, Y) = \sqrt{\sum_{i=1}^{z} \sum_{j=1}^{n_i} (x_{ij} - y_{ij})^2},
where n_i is the number of vector elements held by company i. As seen from the above equation, owing to VDD, the parties compute partial sums and exchange them.
(4) Using Equation (2), each party updates the corresponding parts of the W_j values without sharing any information with the other parties, owing to VDD.
(5) They repeat steps (3) and (4) until there is no noticeable change in the feature map. Each party ends up with the corresponding parts of the cluster centres and with each user's cluster assignment.
(6) The parties finally exchange these distributed centre parts with each other so that each company now has the whole cluster centres.
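A minimal sketch of the distance computation in step (3), assuming three hypothetical parties and plaintext partial sums; the variable names and partition sizes are illustrative only.

```python
import numpy as np

def partial_sq_dist(local_o, local_w):
    """Party i computes sum_j (o_ij - w_ij)^2 over the items it holds."""
    return float(np.sum((local_o - local_w) ** 2))

# Hypothetical example: a user vector and a weight vector split over 3 parties.
rng = np.random.default_rng(0)
parts_o = [rng.integers(1, 6, size=s).astype(float) for s in (4, 3, 5)]
parts_w = [rng.integers(1, 6, size=s).astype(float) for s in (4, 3, 5)]

partials = [partial_sq_dist(o_i, w_i) for o_i, w_i in zip(parts_o, parts_w)]
distance = np.sqrt(sum(partials))    # each party only reveals its partial sum

# Sanity check against the centralized computation on the concatenated vectors.
assert np.isclose(distance,
                  np.linalg.norm(np.concatenate(parts_o) - np.concatenate(parts_w)))
```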
Protocol 2 data perturbation protocol (DPP): Data holders need to compute item and user mean votes and vector lengths (VLs). During such computations, the parties might derive confidential data about each other. In order to prevent them from gaining private data, the vendors perturb their data. On the one hand, such data masking should hide confidential data; on the other hand, it should still allow generating correct outcomes. The DPP can be used by each party i to mask its data set (D_i), containing users' preferences about various items, as follows:
Uniformly randomly chooses a random value, θ_i, over the range (0, 100].
Selects a random value, β_i, over the range (0, θ_i], uniformly randomly.
Uniformly randomly picks β_i percent of the unrated cells.
Finally, fills the chosen cells with fake ratings and obtains D'_i.
Data collected for CF purposes are usually sparse. Using methods that handle missing values (the average, the most probable value, and so on), the parties fill some of the empty cells of their databases. The DPP uses fake ratings as part of the data disguising process. The parties can use either non-personalized votes (user, item, or overall mean votes) or personalized ratings estimated using a CF algorithm. We propose to use personalized votes to fill blank cells because personalized ratings are more likely to represent users' preferences than non-personalized ratings; such ratings can be computed using the algorithm proposed by Herlocker et al (1999). Each company can determine personalized or even non-personalized ratings using the available ratings it holds, without the help of other companies. Thus, each vendor can estimate them off-line without revealing any information to others, and the vendors can mask their data using the DPP off-line.
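The sketch below illustrates the DPP masking steps under our own simplifying assumptions: ratings matrices use nan for unrated cells, and a non-personalized item-mean filler stands in for the personalized estimates the paper recommends.

```python
import numpy as np

def dpp_mask(D, fill_value_fn, rng=None):
    """Sketch of the DPP: draw theta over (0, 100], beta over (0, theta], then
    fill beta percent of the unrated (nan) cells with estimated ratings."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0, 100)                  # theta_i (values of exactly 0 are negligible)
    beta = rng.uniform(0, theta)                 # beta_i
    D_masked = D.copy()
    unrated = np.argwhere(np.isnan(D))
    k = int(len(unrated) * beta / 100.0)         # beta percent of the empty cells
    for u, j in unrated[rng.choice(len(unrated), size=k, replace=False)]:
        D_masked[u, j] = fill_value_fn(D, u, j)  # fake (ideally personalized) vote
    return D_masked, theta, beta

# A simple non-personalized filler (item mean) for illustration only; the paper
# recommends personalized estimates computed with Herlocker et al's algorithm.
item_mean_filler = lambda D, u, j: np.nanmean(D[:, j])
```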
Protocol 3 user mean ratings protocol (UMRP): As seen from Equations (5) and (6), the parties need to compute item and user mean ratings to normalize votes. As with personalized ratings estimation, each party can find the item mean votes of the items it holds without the help of other parties because it knows the required data. In other words, owing to VDD, each party has all the ratings for any item it holds. Thus, each company can determine the non-personalized item mean ratings for each item it holds off-line without disclosing data to other parties. However, the parties need to collaborate with each other to estimate user mean ratings because of the nature of the data partition. For a user u, her average rating can be computed as follows:
\bar{v}_u = \frac{\sum_{j \in J} v_{uj}}{|J|} \qquad (7)
where J shows the set of items u has rated, v_{uj} represents u's rating for item j, and |J| shows the number of ratings u has. As seen from Equation (7), \bar{v}_u can be calculated in a distributive manner using distributive measures like sum and count. As data are vertically partitioned among z parties, Equation (7) can be written as follows:
\bar{v}_u = \frac{\sum_{i=1}^{z} \sum_{j \in J_i} v_{uj}}{\sum_{i=1}^{z} |J_i|} \qquad (8)
where, for all i = 1, 2, ..., z, J_i shows the set of user u's ratings held by party i. As seen from Equation (8), in order to find the \bar{v}_u values for n users, the data owners need to exchange partial sums, \sum_{j \in J_i} v_{uj}, and the |J_i| values. The vendors can estimate the user mean votes off-line, without deeply jeopardizing their privacy, using the following protocol, called the UMRP (a sketch follows the protocol):
They first perturb their data sets using the DPP, as elaborated previously.
They then compute the interim aggregate results, the sums \sum_{j \in J_i} v_{uj} and the counts |J_i|, for all n users.
Next, they exchange such partial aggregates with the other parties.
After receiving the required data from the other parties, they finally compute the mean votes for all n users using Equation (8).
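A minimal plaintext sketch of the UMRP aggregation, assuming two hypothetical parties and nan-encoded unrated cells; in the actual protocol the partitions would first be masked with the DPP.

```python
import numpy as np

def local_sum_count(D_i):
    """Party i's interim aggregates on its (masked) partition D_i:
    per-user rating sum and per-user rating count."""
    return np.nansum(D_i, axis=1), np.sum(~np.isnan(D_i), axis=1)

def combine_user_means(partials):
    """Combine the exchanged (sum, count) pairs into user mean votes, Eq. (8)."""
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    return np.divide(sums, counts, out=np.zeros_like(sums, dtype=float),
                     where=counts > 0)

# Hypothetical 4-user data vertically split between two parties.
D1 = np.array([[5, np.nan], [3, 4], [np.nan, 2], [1, np.nan]], dtype=float)
D2 = np.array([[np.nan, 4, 5], [2, np.nan, np.nan], [np.nan, 3, np.nan], [4, 4, 4]], dtype=float)
v_bar = combine_user_means([local_sum_count(D1), local_sum_count(D2)])
```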
Protocol 4 vector lengths protocol (VLP): As seen from Equation (6), the parties need the users' rating vector lengths (VLs) in order to estimate similarity weights. For any user u, her ratings VL, \lVert X_u \rVert, can be computed as follows:
\lVert X_u \rVert = \sqrt{\sum_{j \in J} (v_{uj} - \bar{v}_j)^2} \qquad (9)
As seen from Equation (9), the VL can be calculated in a distributive manner using the distributive measure sum. As data are vertically distributed among z parties and the ratings are normalized, Equation (9) can be written as follows:
\lVert X_u \rVert = \sqrt{\sum_{i=1}^{z} \sum_{j \in J_i} (v_{uj} - \bar{v}_j)^2} \qquad (10)
As seen from Equation (10), each party i needs to compute the interim aggregate sums, \sum_{j \in J_i} (v_{uj} - \bar{v}_j)^2, for all n users. The companies can estimate the VLs off-line, while preserving their privacy, using the VLP as follows:
They first mask their data sets using the DPP, as described before.
They then calculate the item mean ratings and normalize their ratings by subtracting the item mean votes.
Next, they compute the interim aggregate results, the \sum_{j \in J_i} (v_{uj} - \bar{v}_j)^2 values, for all n users.
After that, they exchange them with the other parties.
After receiving the necessary data from the others, they finally compute the VLs for all n users using Equation (10).
The protocols described so far are used during the off-line phase. In other words, data holders cluster their distributed data, mask their data, estimate personalized and non-personalized ratings, and compute VLs off-line using our proposed protocols. Note that the PC collaboratively estimates p_aq based on a's neighbours' data using Equation (5), which can be written as follows:
p_{aq} = \bar{v}_a + \frac{\sum_{u \in S} w_{au} \, v'_{uq}}{\sum_{u \in S} w_{au}}, \qquad (11)
where v'_{uq} = v_{uq} - \bar{v}_u denotes u's mean-centred rating on q.
Similarly, Equation (6) can be written as follows, because the vendors know the item mean ratings and the VLs and can therefore normalize their ratings:
w_{au} = \sum_{j \in J} v''_{aj} \, v''_{uj} \qquad (12)
where v''_{aj} and v''_{uj} represent the normalized ratings of users a and u, respectively. Notice that the ratings can be normalized by first subtracting the corresponding item mean vote and then dividing the result by the user's ratings VL. p_aq in Equation (11) can then be written as follows:
p_{aq} = \bar{v}_a + \frac{\sum_{j} v''_{aj} \sum_{u \in S} v'_{uq} \, v''_{uj}}{\sum_{j} v''_{aj} \sum_{u \in S} v''_{uj}} \qquad (13)
As data are vertically distributed among z parties, Equation (13) can be written as follows:
p_{aq} = \bar{v}_a + \frac{\sum_{i=1}^{z} \sum_{j \in M_i} v''_{aj} \, PN_{qj}}{\sum_{i=1}^{z} \sum_{j \in M_i} v''_{aj} \, PD_{qj}}, \qquad (14)
where M_i is the set of items held by party i, PN_{qj} = \sum_{u \in S} v'_{uq} v''_{uj}, and PD_{qj} = \sum_{u \in S} v''_{uj}.
As seen from Equation (14), the parties can compute the PN_{qj} and PD_{qj} values for all target items off-line in order to enhance online performance. This is like constructing a model off-line to improve online efficiency. Notice that q can be any of the m items. Also note that the items are distributed among the z parties because of the vertical partitioning. We propose the following protocols, called the PN protocol (PNP) and the PD protocol (PDP), which are utilized off-line to construct the model.
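The plaintext check below illustrates why the PN/PD model of Equation (14), as reconstructed above, reproduces the direct double sum; the toy data, the single-party setting, and all variable names are our own assumptions, and the encryption and permutation layers added by the protocols below are deliberately omitted.

```python
import numpy as np

# Toy normalized data: V2[u, j] stands for v''_{uj}, V1[u, q] for v'_{uq},
# S for the users in a's cluster who rated the target item q.
rng = np.random.default_rng(1)
n, m, q = 6, 5, 2
V2 = rng.normal(size=(n, m))
V1 = rng.normal(size=(n, m))
S = [0, 2, 3, 5]
a2 = rng.normal(size=m)                           # active user's v''_{aj} values

# Off-line model: per-item column sums over S (Equation (14)).
PN = np.array([sum(V1[u, q] * V2[u, j] for u in S) for j in range(m)])
PD = np.array([sum(V2[u, j] for u in S) for j in range(m)])

# Online phase: only inner products with a's normalized vector remain;
# the offset below is added to \bar{v}_a to obtain p_aq.
offset_model = (a2 @ PN) / (a2 @ PD)

# Equivalent direct form: sum_u w_au * v'_{uq} / sum_u w_au with w_au = a2 . V2[u].
offset_direct = (sum((a2 @ V2[u]) * V1[u, q] for u in S)
                 / sum(a2 @ V2[u] for u in S))
assert np.isclose(offset_model, offset_direct)
```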
Protocol 5 PNP : PN values can be estimated for all target items off-line as follows:
For each party i = 1, 2, ..., z, the following steps are performed, assuming the i-th party is the PC:
For each target item q = 1, 2, ..., m_i (m_i is the number of items held by the PC):
The PC and the other parties mask their data sets using the DPP, as explained before.
The PC then encrypts each value in the column vector of q using a homomorphic encryption scheme (E ) with its public key (KPC ).
It then sends such encrypted values (EKPC (v 'uq )) to other parties.
For all items j = 1, 2, ..., m_i held by each collaborating party i, each collaborating vendor computes the E_{K_PC}(v'_{uq} × v''_{uj}) values from the received E_{K_PC}(v'_{uq}) values and its own v''_{uj} values, using the homomorphic encryption property.
Each party then permutes EKPC (v 'uq × v ''uj ) values for each item they hold using a row permutation function (Fir ).
Then, the collaborating vendors permute the items they hold using a column permutation function (Fic ).
They then send the encrypted and the permuted (both row and column) results to the PC.
As the PC holds the private key corresponding to K_PC, it can decrypt the encrypted results.
For each item j, the PC finds the column sums (the PN values) from the permuted values. Note that the order does not matter for finding these sums (see the sketch following this protocol).
The PC then sends the permuted PN values to the corresponding parties.
As each collaborating vendor knows its column permutation function, it reorders the received PN values and obtains the numerator part of the model.
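The following sketch shows, for one hypothetical collaborating party, why the row and column permutations of the PNP do not disturb the column sums the PC needs; the products are kept in the clear purely for illustration, whereas the protocol exchanges them under the PC's Paillier key.

```python
import numpy as np

rng = np.random.default_rng(7)

# Conceptual PNP sketch for one collaborating party holding m_i items: each
# cell stands for an encrypted product E(v'_{uq} * v''_{uj}).
n_users, m_i = 5, 4
products = rng.normal(size=(n_users, m_i))

row_perm = [rng.permutation(n_users) for _ in range(m_i)]   # F_r: per-item row shuffle
col_perm = rng.permutation(m_i)                             # F_c: item (column) shuffle
shuffled = np.column_stack([products[row_perm[j], j] for j in range(m_i)])[:, col_perm]

pn_permuted = shuffled.sum(axis=0)     # PC: column sums; the order does not matter
inverse_col = np.argsort(col_perm)     # the party restores the item order afterwards
pn = pn_permuted[inverse_col]
assert np.allclose(pn, products.sum(axis=0))
```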
Protocol 6 PDP : After estimating PN values, the parties should compute PD values as well. They can compute PD values for all target items as follows:
For each vendor i = 1, 2, ..., z, the following is done based on the masked data:
For each target item q =1, 2, ..., m ,
Calculate the normalized ratings using the item mean votes and VLs determined previously and obtain the v''_{uj} values for all j = 1, 2, ..., m and u = 1, 2, ..., n. Note that v''_{uj} = (v_{uj} - \bar{v}_j)/\lVert X_u \rVert.
Determine only those rows containing ratings for q (the set S). Such users' data are used to generate the prediction for q. The parties know which rows have ratings for q because of the PNP; however, they do not know the actual rating values because of encryption, nor the truly rated items because of the filled cells.
Find the column sums, the PD_{qj} = \sum_{u \in S} v''_{uj} values, for j = 1, 2, ..., m.
Protocol 7 private recommendation protocol (PRP): After the parties construct the model (estimating the PN and PD values off-line), they can start providing referrals online. After the PC receives A and q from a, it first assigns a to the closest cluster by estimating the distances between A and each cluster centre. As all parties know the cluster centres, the PC can easily compute such distances. The PC then initiates the final recommendation estimation process and, after collaborating with the other parties, returns p_aq to a. The proposed protocol, called the PRP, can be described as follows:
The PC disguises a 's ratings vector using the DPP like data owners do for masking their databases.
It normalizes her data and obtains the v''_{aj} values, where v''_{aj} = (v_{aj} - \bar{v}_j)/\lVert X_a \rVert.
After encrypting each v ''aj value using E and KPC by using the homomorphic encryption scheme, it sends the corresponding values to the collaborating companies.
Each collaborating party similarly inserts item average normalized ratings into uniformly randomly selected empty cells of the received normalized data. In other words, they similarly perturb the received data.
They then encrypt such default values using E and KPC by utilizing the homomorphic encryption.
Each collaborating party i computes the encrypted aggregates E_{K_PC}(\sum_{j} v''_{aj} PN_{qj}) and E_{K_PC}(\sum_{j} v''_{aj} PD_{qj}) over the items j in J_a that it holds, where J_a is the set of a's ratings including the fake ones.
After computing such encrypted aggregate results for the numerator and the denominator, each collaborating vendor i permutes them using a permutation function (F_i) and sends them to the PC.
The PC decrypts them and finds the overall sums for the numerator and the denominator. It finally estimates p_aq and sends it to a (a simplified sketch of this online flow is given below).
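A simplified plaintext sketch of the PRP's online flow, assuming three hypothetical collaborating parties that already hold their PN/PD model slices; encryption, permutation, and the fake-rating insertions are omitted, and all names and toy values are our own.

```python
import numpy as np

def party_partials(a_norm_i, PN_i, PD_i):
    """Collaborating party i's online work: aggregate the active user's
    normalized ratings (for the items it holds) against its PN/PD model
    slices. In the real PRP these aggregates are returned encrypted under
    the PC's key and permuted."""
    return float(a_norm_i @ PN_i), float(a_norm_i @ PD_i)

def pc_finish(v_a_mean, partials):
    """The PC decrypts the received partials, sums them, and forms p_aq."""
    num = sum(p for p, _ in partials)
    den = sum(d for _, d in partials)
    return v_a_mean if den == 0 else v_a_mean + num / den

# Hypothetical model slices for three parties holding 2, 3, and 2 items.
rng = np.random.default_rng(3)
slices = [(rng.normal(size=k), rng.normal(size=k), rng.normal(size=k)) for k in (2, 3, 2)]
partials = [party_partials(a_i, PN_i, PD_i) for a_i, PN_i, PD_i in slices]
p_aq = pc_finish(v_a_mean=3.4, partials=partials)
```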
5. Privacy and supplementary cost analysis
In this section, we analyse our proposed scheme in terms of privacy and additional costs like storage, communication, and computation. We first explain the security of each protocol. Note that each party tries to hide its true ratings and rated items from the others. Thus, throughout the proposed privacy-preserving scheme, the actual votes and the rated items are considered private, whereas the cluster centres, the user mean ratings, and the VLs estimated by utilizing the UMRP and the VLP, respectively, are considered public. The DPP and the PDP are secure because data owners do not exchange any data during these protocols; each party can conduct them itself without the help of others, so it is impossible to derive information about each other's data through them. However, the parties exchange data while performing the other protocols.
The SOMP is secure for the following reasons. First, as explained previously, the parties determine the initial constant values without exchanging any data they hold. Second, each party i computes Σ_j (o_{ij} - w_{ij})² based on the corresponding parts of any user's data (object o) and the weight vectors it holds in order to estimate the distance between any user vector and each weight vector; they then exchange such aggregate values. As the j, o_{ij}, and w_{ij} values are known only by party i, the parties cannot derive any useful information from such aggregate sums. Third, after every iteration, each data holder updates the corresponding part of the W_j it holds without collaboration, utilizing Equation (2); therefore, the other parties cannot obtain any useful information during the updating step. Finally, at the end of the clustering, the parties reveal the W_{ij} parts, updated s times, to each other to obtain the entire W_j vectors. As the W_{ij} parts are updated using Equation (2) based on each party's own data, the data owners cannot obtain any useful information about each other's data even if some vendors collude.
In the UMRP, the parties exchange sum and count aggregates. As each party disguises its data set using the DPP, the parties cannot learn true rating values from the sum values; such aggregates are the sums of the true ratings and the inserted personalized votes. Even if the number of filled cells is guessed, the collaborating parties cannot estimate their sum because the personalized ratings are known only by the party that generates them off-line. However, any party can try to guess the filled cells or the rated items held by a collaborating party based on the count value received from that party. The probability of guessing the correct β_i is 1 out of 100. After guessing it, the party can estimate the number of filled cells or rated items (m_{ri}). Finally, the probability of guessing which items are rated is 1 out of C^{m_i}_{m_{ri}}, where C^{f}_{g} represents the number of ways of picking g unordered outcomes from f possibilities. Therefore, for any company, the probability of guessing the rated items based on the received count is 1 out of 100 × C^{m_i}_{m_{ri}}.
The VLP includes utilizing the DPP and working on aggregates. The parties exchange aggregated sums estimated from the filled vectors. The collaborating companies cannot learn true ratings or the rated items from the received aggregates because such values are sums of the squares of normalized votes. Moreover, the item means used for normalization are known only by the parties who own those items. Thus, the VLP is secure.
In the PNP, privacy is achieved through the DPP, encryption, and permutation. Each party utilizes homomorphic encryption to hide the true values of its private data. As homomorphic encryption has been shown to be secure (Paillier, 1999), the parties cannot derive information from encrypted values. Although the PC inserts fake ratings into q's votes, the collaborating parties can try to guess the number of filled cells and the rated items of q held by the PC. The probability of guessing the correct β_i is 1 out of 100; the number of filled cells of q, m_{qi}, can then be estimated. Finally, the probability of guessing which items are rated is 1 out of (formula omitted), where q_i shows the number of ratings of q including the fake ones. The PC can try to guess the rated items held by each collaborating party after receiving the encrypted data. For the PC, the probability of guessing the rated items of one item held by a collaborating company can be explained as follows. The probability of guessing the correct β_i is 1 out of 100; the PC can then estimate the number of filled cells of that item, m_{ri}. The probability of guessing the correct position of the item vector is 1 out of m_i!. Similarly, the probability of guessing the correct position of the values in a column vector is 1 out of (formula omitted)!, where q_i shows the number of encrypted values that the PC sends and m_q shows the number of values in the received column vector, because such values are computed between commonly rated items only. Finally, the probability of guessing which items are rated is 1 out of (formula omitted). The probability of guessing the true values of the normalized ratings can be estimated similarly. However, as only the results on commonly rated items are exchanged, the PC cannot learn other values.
The security of the PRP depends on inserting default values into a's ratings vector, encryption, and permutation. The PN and PD values are parts of a model distributed among multiple parties and are considered private. The parties, including the PC, could use privacy attacks to derive data during the PRP; to overcome such attacks, fake values are introduced. First of all, the PC disguises a's data to prevent the collaborating parties from acting as an a in multiple scenarios to derive its data. The basic idea of this type of attack, called the colluded collaborating companies (3C) attack, is to change only one rating of the items held by the PC in each prediction process while acting as an a; after multiple interactions, the collaborating parties could learn the PN and PD values held by the PC. Second, the collaborating parties also mask a's ratings vector for similar reasons. This time the PC acts maliciously to learn the PN and PD values held by a victim company: the PC changes only one rating of the items held by the target or victim party in each prediction process while acting as an a, and after multiple interactions the PC could learn such values held by the victim. Finally, in a different type of 3C attack, the victim can be one of the collaborating parties other than the PC; similarly, the colluding parties could learn the victim's data. In each recommendation process, or PRP conducted online, the parties mask a's data differently so that they can defend themselves against the aforementioned attacks. The collaborating parties cannot learn the PC's data encrypted using K_PC, because the decryption key is known by the PC only. The PC can compute the number of fake values inserted by a collaborating party i, referred to as m_{fi}, and it knows the number of items held by that party, referred to as m_i. Thus, the probability of guessing which item cells are filled is 1 out of (formula omitted). However, owing to permutation, the PC first needs to determine the correct positions. The probability of guessing such positions is 1 out of m_{ai}!, where m_{ai} shows the number of data values, including the fake ones, in the corresponding part of a's ratings vector held by the corresponding party.
Supplementary costs due to privacy are important for efficiency. Unlike online costs, off-line costs are not that critical for the overall success of CF schemes. We analyse the extra storage, communication, and computation costs. Our proposed scheme introduces additional storage costs as follows. Storage costs for saving cluster centres are in the order of O(mx) and O(zmx) in the original and our proposed schemes, respectively; thus, the storage needed to save cluster centres increases by a factor of z because each company needs to save the cluster centres. Owing to the non-personalized ratings used in the UMRP, additional storage costs are in the order of O(x). Similarly, storage costs increase by x times as a result of the vector lengths estimated using the VLP. The parties estimate a model off-line by utilizing the PNP and the PDP. Owing to storing the PN values estimated using the PNP, extra storage costs are in the order of O(m²); similarly, additional costs are in the order of O(m²) due to storing the PD values.
Owing to privacy concerns, our scheme introduces extra communication costs. In the SOMP, the number of communications the parties need to make is in the order of O(ez²), where e shows the number of epochs. In the UMRP and the VLP, the number of communications is in the order of O(z²), where the parties exchange interim aggregate results. The additional communication costs incurred during the PNP are in the order of O(zm). The PDP does not cause any communication costs. Notice that, as all protocols except the PRP are performed off-line, the supplementary communication costs due to privacy concerns are not critical for online performance. However, the PRP is conducted online, and the number of communications made during the online phase is imperative. During the PRP, there are 2(z - 1) communications made online. Although the number of extra communications made online is in the order of O(z), they are made simultaneously between the PC and each collaborating company.
Our proposed scheme introduces extra computations. However, not all of these computations are performed online and directly affect the performance of the recommendation process. Also note that, as the parties conduct computations simultaneously, the execution time for the total computations in all protocols is determined by the party that owns the maximum number of items (m_max). As all protocols except the PRP are performed off-line, their computation costs are not that critical for the recommendation process. The SOMP does not cause any extra computation cost. During the DPP, the computation costs for choosing random values and selecting some of the unrated item cells are negligible; however, the additional computation costs due to estimating personalized ratings are in the order of O(m²n). Owing to the UMRP and the VLP, there are no supplementary computation costs. In the PNP, extra computation costs are inevitable due to homomorphic encryption. The numbers of encryptions and decryptions performed in the PNP are n̄m(m + 1) and n̄m², respectively, where n̄ shows the average number of ratings for each item, including the inserted false ones. Unlike the PNP, the PDP causes no supplementary computation costs. The PRP is performed online and causes extra computation costs because of homomorphic encryption. The numbers of encryptions and decryptions conducted online during the PRP are 3m̄ and 2m̄, respectively, where m̄ shows the number of ratings, including the inserted ones, in a's rating vector. Although supplementary online computation costs are inevitable, such computations are again performed in parallel by the collaborating parties. To determine the running times of cryptographic algorithms, benchmarks for the Crypto++ toolkit from http://www.cryptopp.com/ can be used (Canny, 2002).
6. Experimental evaluations
To show how accurate our proposed scheme-based predictions are, we conducted several experiments using real data sets. We used two well-known data sets collected for CF purposes, called MovieLens Million (MLM) and EachMovie (EM). MLM was collected by the GroupLens research team (http://www.cs.umn.edu/research/GroupLens) at the University of Minnesota and contains discrete ratings, ranging from 1 to 5, from 6040 users for about 3900 items. EM (Billsus and Pazzani, 1998) contains the ratings of 72,916 users for 1628 movies; user ratings were recorded on a numeric 6-point scale ranging from 0 to 1.
To measure the quality of the referrals, we used the mean absolute error (MAE) and the normalized mean absolute error (NMAE), which are the most popular statistical accuracy measures and are widely used in CF systems. To compute the NMAE, we normalized the MAEs by dividing them by the difference between the maximum and the minimum ratings. The lower the MAE and the NMAE, the more accurate the outcomes. They measure how close the predictions produced with privacy concerns are to the true ratings. We applied t-tests to measure the statistical significance of the results.
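For reference, a small helper computing the MAE and NMAE exactly as described above; the function name and the example values are illustrative only.

```python
import numpy as np

def mae_nmae(true_ratings, predictions, r_max, r_min):
    """MAE and NMAE as used in the experiments: the NMAE divides the MAE
    by the rating range (r_max - r_min)."""
    errors = np.abs(np.asarray(true_ratings, dtype=float)
                    - np.asarray(predictions, dtype=float))
    mae = float(errors.mean())
    return mae, mae / (r_max - r_min)

# e.g. for EachMovie's 0-1 scale:
mae, nmae = mae_nmae([0.8, 0.4], [0.6, 0.5], r_max=1.0, r_min=0.0)
```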
Before performing the experiments, we first determined the users who rated at least 50 items. We then uniformly randomly divided these users into two disjoint sets, a training set and a test set. For each experiment, we uniformly randomly selected different numbers of users for training and testing based on the experiment's requirements. For each test user, we uniformly randomly chose five rated items; after withholding their true ratings, we replaced their entries with null and tried to provide recommendations for them using the training users' data. We assumed that data are distributed among z companies, where z might be 1, 3, 5, or 7. Thus, any data owner i has n training users with about m/z items.
To show how accuracy changes because of collaboration among z parties, we conducted experiments using both data sets. We used 500 users for testing and 1000 users for training, and we set m at 1600 for both data sets. To determine the best k neighbours, we clustered the training data into x clusters using SOM. As shown by Roh et al (2003), clustering data into three clusters gives the best results; thus, we set x at 3 in our experiments. After clustering the training data, for each test user we generated predictions for the five randomly chosen rated items based on split data only and on integrated data. We varied z from 1 to 7. We display the NMAEs for both data sets in Figure 3 after computing the overall averages.
As seen from Figure 3, the quality of the recommendations improves with the collaboration of the parties. In other words, if data owners provide predictions on their integrated data, they offer more accurate recommendations. When data are distributed among multiple parties, the amount of data held by each party is not enough for precise referrals. However, as seen from our results, data owners are able to offer better recommendations if they collaborate with each other. With increasing z values, or a decreasing amount of ratings held by each party, accuracy diminishes. For example, when users' ratings are distributed among seven parties for the EM data set, the NMAE is about 0.2187 for each individual party, whereas it is about 0.1928 when they collaborate; in other words, owing to collaboration, accuracy improves by about 10%. For the MLM data set, if data are distributed among seven parties, the NMAE is about 0.2157; however, accuracy improves to 0.1938 if the parties collaborate. For both data sets, accuracy improves as a result of providing referrals on combined data. Thus, integrating split data definitely enhances accuracy. To show whether the improvements due to collaboration are statistically significant, we conducted t-tests. The values of t are 3.29 and 2.95 for EM and MLM, respectively, where z is 7. These values are statistically significant at the 99% confidence level for both data sets.
We investigated the effects of randomly inserting fake ratings into the training data. Note that the parties can disguise their data using the DPP. We again used 500 users for testing and 1000 users for training for both data sets. We varied the θ values from 0 to 100 and used personalized votes estimated from the available data by each party as fake ratings. We ran the experiments 100 times. After calculating the overall averages for both data sets, we present the outcomes for varying θ values in Figure 4.
The results in Figure 4 show that accuracy diminishes with an increasing number of fake ratings for both data sets. As MLM is very sparse and has more items than EM, increasing θ adds more fake ratings; as a result, the accuracy losses caused by the inserted votes are greater for MLM than for EM at larger θ values. However, according to our results, it is still possible to produce accurate recommendations when data owners hide their ratings by inserting fake ratings according to the θ values. Note that, as privacy and accuracy are conflicting goals, the parties can determine the optimum θ values according to the privacy and accuracy levels they want. Thus, we chose 30 as the θ value for masking the training data.
We also performed experiments to show the effects of adding fake ratings to the test users' rating vectors. We used the same methodology as in the previous experiment. We varied the θ values from 0 to 100 while using 500 users for testing and 1000 users for training. We again clustered the training users into three clusters. We ran the trials 100 times for both data sets. We show the overall averages of the NMAE values for varying θ values in Figure 5.
According to the results shown in Figure 5, accuracy diminishes with an increasing number of fake ratings inserted into the test users' data, as expected. With increasing θ values, randomness increases, which makes accuracy worse. However, the accuracy losses are small for smaller θ values (those less than or equal to 30), which helps the parties provide accurate recommendations. As previously mentioned, we can determine the optimum θ value for masking test data. In the following experiments, we set θ at 30 to mask the test data.
After showing the effects of the θ values used to mask the training and test data separately, we ran trials to show their joint effects while varying the number of users (n). We set both θ values at 30 and performed experiments using 500 users for testing while varying n from 250 to 1000 for both data sets. We estimated predictions for five randomly selected rated items for each test user. After running our trials 100 times, we computed the overall averages. We present the NMAEs for both data sets in Figure 6 and the MAEs for MLM in Figure 7.
The NMAE and MAE values in Figures 6 and 7 show that our proposed solution produces accurate predictions while preserving privacy as n increases. According to our results, for the EM data set, if data are partitioned among seven parties, the accuracy of the recommendations is about 0.2187 without privacy concerns when n is 1000. However, when data owners collaborate and produce predictions using our scheme, accuracy improves to 0.1989, where n is 1000 and the θ values are 30. The results are similar for MLM: when data are distributed among seven parties and the parties collaborate without jeopardizing their privacy, the MAE is about 0.7882, whereas it worsens to 0.8004 if they do not collaborate. As shown previously, accuracy improves because of collaboration and decreases because of privacy concerns. However, for larger n values (n is 1000), the accuracy gains due to collaboration outweigh the accuracy losses due to privacy concerns. To show whether the improvements due to collaboration with privacy concerns are statistically significant, we conducted t-tests. The t value is about 2.72 for EM when n is 1000 and m is 1600, where data are distributed among seven parties; similarly, the t value is 1.98 for MLM when m is 3900. These values are statistically significant at the 99 and 95% confidence levels for EM and MLM, respectively. Thus, although accuracy becomes worse because of privacy concerns, the accuracy gains from distributed CF with privacy concerns are still significant. Our scheme can produce accurate recommendations without deeply jeopardizing data owners' privacy.
7. Conclusions and future work
In this study, we proposed a privacy-preserving scheme to offer accurate recommendations using a SOM-based CF algorithm on VDD among multiple parties. Our scheme helps data owners produce referrals based on their combined data without greatly jeopardizing their privacy. We presented several protocols, or building blocks, to achieve privacy. We showed that our protocols are secure and that our scheme is able to offer predictions while preserving data owners' privacy. We demonstrated that the supplementary costs caused by our scheme are negligible and that the companies are still able to produce referrals efficiently. We performed various experiments to show how accuracy improves because of collaboration. Our results demonstrate that the quality of the predictions significantly improves because of collaboration. As accuracy diminishes because of privacy concerns, we conducted trials to investigate how privacy concerns affect our outcomes. As shown by our experimental results, the companies are able to offer recommendations with decent accuracy when they mask their data to achieve privacy. To sum up, the parties holding VDD can use our scheme to produce accurate predictions efficiently without deeply jeopardizing their privacy.
With an increasing number of collaborating parties, owing to the online computations and communications, even though they are performed simultaneously, it becomes increasingly difficult to utilize our scheme. Compared with privacy-preserving measures like randomization, using encryption is a challenge; although using encryption is not easy, cryptographic methods achieve better confidentiality. Like other distributed systems, data losses and/or bottlenecks are also issues that should be considered while using our scheme.
In addition to SOM clustering, there are other clustering methods. We will investigate whether other clustering methods can be applied while providing distributed data-based CF with privacy. We will also study how to provide recommendations on hybrid or arbitrarily distributed data among multiple parties while preserving their privacy. We finally plan to explore how to improve overall performance while disclosing some aggregate data that do not violate data owners' privacy.
[1] Anadolu University, Eskisehir, Turkey
Correspondence : H Polat, Computer Engineering Department, Anadolu University, 2 Eylul Kampusu, Eskisehir 26470, Turkey. E-mail: [email protected]
Acknowledgements
This work is supported by Grant 108E221 from TUBITAK.
Ahn HJ (2008). A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Inform Sci 178 (1): 37-51.
Berkovsky S, Eytani Y, Kuflik T and Ricci F (2005). Privacy-enhanced collaborative filtering. In: Workshop on Privacy-Enhanced Personalization, at the International Conference on User Modeling , Edinburgh, UK, pp 75-83.
Bertino E, Khan LR, Sandhu R and Thuraisingham B (2006). Secure knowledge management: Confidentiality, trust, and privacy. IEEE T Syst Man Cy A 36 (3): 429-438.
Bhowmick SS, Gruenwald L, Iwaihara M and Chatvichienchai S (2006). PRIVATE-IYE: A framework for privacy preserving data integration. In: The 22nd International Conference on Data Engineering Workshops , Washington, DC, p. 91
Billsus D and Pazzani MJ (1998). Learning collaborative information filters. In: Proceedings of the Fifteenth International Conference on Machine Learning . Madison, WI, Morgan Kaufmann Publishers Inc: Los Altos, CA, pp 46-54.
Canny J (2002). Collaborative filtering with privacy. In: Proceedings of the 2002 IEEE Symposium on Security and Privacy . IEEE Computer Society, Oakland, CA, pp 45-57.
Clifton C, Kantarcioglu M, Doan A, Schadow G, Vaidya J, Elmagarmid A and Suciu D (2004). Privacy-preserving data integration and sharing. In: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery . Paris, France, pp 19-26.
Goldberg K, Roeder T, Gupta D and Perkins C (2001). Eigentaste: A constant time collaborative filtering algorithm. Inform Retrieval 4 (2): 133-151.
Grcar M (2004). User profiling: Collaborative filtering. In: 7th International Multiconference Information Society IS 2004 , Slovenia, pp 75-78.
Haykin S (1999). Neural Networks: A Comprehensive Foundation , 2nd edn. Prentice Hall: Englewood Cliffs, NJ.
Herlocker JL, Konstan JA, Borchers A and Riedl J (1999). An algorithmic framework for performing collaborative filtering. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval . Berkeley, CA, pp 230-237.
Kaleli C and Polat H (2007a). Providing Naive Bayesian classifier-based private recommendations on partitioned data. In: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases . Warsaw, Poland, pp 515-522.
Kaleli C and Polat H (2007b). Providing private recommendations using Naïve Bayesian classifier. In: Wegrzyn-Wolska K and Szczepaniak P (eds). Advances in Intelligent Web Mastering , Vol. 43. Springer: Berlin/Heidelberg, pp 168-173.
Kaleli C and Polat H (2010). P2P collaborative filtering with privacy. Turk J Electr Eng Comput Sci 8 (1): 101-116.
Kantarcioglu M and Vaidya J (2003). Privacy-preserving naïve Bayes classifier for horizontally partitioned data. In: The IEEE ICDM Workshop on PPDM , Melbourne, FL, pp 3-9.
Kaya SV, Pedersen TB, Savas E and Saygin Y (2007). Efficient privacy preserving distributed clustering based on secret sharing. In: Proceedings of the 2007 International Conference on Emerging Technologies in Knowledge Discovery and Data Mining . Nanjing, China, pp 280-291.
Kim T-H, Ryu Y-S, Park S-I and Yang S-B (2002). An improved recommendation algorithm in collaborative filtering. In: Proceedings of the Third International Conference on E-Commerce and Web Technologies . Aix-en-Provence, France. Springer-Verlag, Berlin, Germany, pp 254-261.
Kohonen T (1995). Self-Organizing Map . Springer: Berlin, Heidelberg, New York.
Lathia N, Hailes S and Capra L (2007). Private distributed collaborative filtering using estimated concordance measures. In: Proceedings of the 2007 ACM Conference on Recommender Systems . Minneapolis, MN, pp 1-8.
Lin X, Clifton C and Zhu M (2005). Privacy-preserving clustering with distributed EM mixture modeling. Knowl Inform Syst 8 (1): 68-81.
Liu K, Kargupta H and Ryan J (2006). Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE T Knowl Data Eng 18 (1): 92-106.
Mangiameli P, Chen SK and West D (1996). A comparison of SOM neural network and hierarchical clustering methods. Eur J Opl Res 93 (2): 402-417.
Meziane F and Kasiran MK (2008). Evaluating trust in electronic commerce: A study based on the information provided on merchants' websites. J Opl Res Soc 59 : 464-472.
Mingoti SA and Lima JO (2006). Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms. Eur J Opl Res 174 (3): 1742-1759.
OECD (1998). Privacy online--OECD guidance on policy and practice.
OECD (2000). Guidelines for consumer protection in the context of electronic commerce.
OECD (2005). Guidelines on the protection of privacy and transborder flows of personal data.
Oliveira S, Oliveira SRM and Zaïane OR (2004). Toward standardization in privacy-preserving data mining. In: 3rd Workshop on Data Mining Standards (DM-SSP 2004), in Conjunction with KDD 2004 , Seattle, WA, pp 7-17.
Paillier P (1999). Public-key cryptosystems based on composite degree residuosity classes. In: Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques . Prague, Czech Republic, pp 223-238.
Parameswaran R and Blough DM (2007). Privacy preserving collaborative filtering using data obfuscation. In: Proceedings of the 2007 IEEE International Conference on Granular Computing . IEEE Computer Society: Silicon Valley, CA, pp 380-386.
Polat H and Du W (2003). Privacy-preserving collaborative filtering using randomized perturbation techniques. In: Proceedings of the Third IEEE International Conference on Data Mining . Melbourne, FL, IEEE Computer Society, Washington, DC, pp 625-628.
Polat H and Du W (2005). Privacy-preserving collaborative filtering. Int J Electroni Comm 9 (4): 9-35.
Polat H and Du W (2008). Privacy-preserving top-N recommendation on distributed data. J Am Soc Inform Sci Technol 59 (7): 1093-1108.
Roh TH, Oh KJ and Han I (2003). The collaborative filtering recommendation based on SOM cluster-indexing CBR. Expert Syst Appl 25 (3): 413-423.
Sarwar B, Karypis G, Konstan J and Riedl J (2000). Application of dimensionality reduction in recommender system a case study. In: WebKDD-2000 , Boston, USA.
Sarwar B, Karypis G, Konstan J and Reidl J (2001). Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international Conference on World Wide Web . Hong Kong, pp 285-295.
Vaidya J, Clifton C, Kantarcioglu M and Patterson AS (2008). Privacy-preserving decision trees over vertically partitioned data. ACM T Knowl Discov Data 2 (3): 1-27.
Xue G-R, Lin C, Yang Q, Xi W, Zeng H-J, Yu Y and Chen Z (2005). Scalable collaborative filtering using cluster-based smoothing. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval . Salvador, Brazil, pp 114-121.
Yakut I and Polat H (2007). Privacy-preserving eigentaste-based collaborative filtering. In: Proceedings of the Security 2nd International Conference on Advances in Information and Computer Security . Nara, Japan, pp 169-184.
Yakut I and Polat H (2010). Privacy-preserving SVD-based collaborative filtering on partitioned data. Int J Inform Technol Decis Making 9 : 473-502.
Zhang S, Ford J and Makedon F (2006). A privacy-preserving collaborative filtering scheme with two-way communication. In: Proceedings of the 7th ACM Conference on Electronic Commerce . Ann Arbor, MI, pp 316-323.
© Operational Research Society 2012