Graph Embedding-Based Sensitive Link Protection

This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

1. Introduction

To build a highly automated, informative, and intelligent system, the Internet of Things (IoT) integrates numerous communication, computing, and sensing devices, ranging from smartphones to vehicles [1], forming an organic collection of intelligent terminal devices and users. In IoT, widely distributed terminal devices establish reliable wireless links through advanced wireless communication and network technology, forming distributed multidomain networks [2]. Networks are ubiquitous in the real world, for example, communication networks, social networks, biological networks, and transportation networks, and they can be represented by graphs containing nodes and edges. Similarly, the networks in IoT can be regarded as graphs with terminal devices as nodes and communication links as edges. Although attractive and convenient, IoT also brings a significant challenge, namely, the risk of privacy disclosure [3]. As a new paradigm of big data platform, IoT deploys smart city applications to monitor, analyze, and respond to large volumes of physical data in a timely manner. The data in IoT, collected in a distributed manner, are strongly correlated with users’ sensitive status. However, some information platforms inadvertently disclose private information while trading the data, most likely the graphs in IoT. Furthermore, malicious attackers may spy on entity privacy, analyze network traffic, and track users’ behavior by stealing the complete network graphs, which invades entity privacy and threatens the security of the IoT system. At present, the study of privacy in IoT mainly focuses on the privacy of data, identity, and location [4], while graph privacy is rarely mentioned, especially the privacy of the communication links between terminal nodes in graphs, i.e., sensitive links. In fact, the disclosure of sensitive links brings many security threats to the IoT system. 
For example, some sensitive links involve personal privacy, such as the doctor-patient relationship in smart healthcare, one of the typical application scenarios of IoT, and the user trajectories that data requesters may expose when accessing IoT. In addition, in a man-in-the-middle (MITM) attack, hackers try to intercept private data; control devices in smart homes, smart industries, and smart healthcare; or destroy the communication links in the IoT system, resulting in privacy disclosure, device failure, and even system collapse, which seriously threaten personal privacy, business activities, and industrial operations. Hence, it is imperative to remove private information from the graphs in advance. The most straightforward way to hide the sensitive links is to delete them from the graphs directly. Unfortunately, sensitive links may be predicted from released data through data mining techniques, even if they have been deleted [5]. As an essential task in data mining, link prediction has attracted growing interest in recent years, and more and more link prediction methods and applications have been proposed. Link prediction can predict the relationship between nodes by mapping the graph information to a continuous vector space. While being widely applied in network analysis, link prediction can also be used as an inference attack to infer the sensitive links in graphs. Therefore, the data publisher should apply privacy processing to the published data to defend against link prediction attacks while retaining the necessary data utility. In recent years, the privacy disclosure caused by link prediction attacks has attracted researchers’ attention, and many studies on antilink prediction have emerged. 
To defend against link prediction based on similarity and deep learning methods, most antilink prediction methods adopt various link disturbances, e.g., random link disturbance, heuristic link disturbance, and evolutionary link disturbance, at the expense of part of the data utility [6–13]. Besides, these methods focus only on the graph structure information and fail to consider the unstructured information in graphs, such as node attributes. The node attributes may include the performance, identity, and type of devices, which deepen the association strength between nodes and make the attacker’s prediction more accurate. As mentioned above, protecting sensitive links against link prediction attacks is an urgent problem. Notably, Li et al. [14] proposed an adversarial privacy graph embedding (APGE) method to conceal users’ sensitive attributes from inference attacks, which inspired our work. In this paper, we intend to fill this gap by developing a graph embedding-based sensitive link protection method named SLPGE. Our basic idea is to use a graph embedding model that combines the Variational Graph Autoencoder (VGAE) and the Adversarially Regularized Variational Graph Autoencoder (ARVGA) to encode graph data into an embedding matrix before publishing the data. Concretely, we utilize adversarial training assisted by two schemes to eliminate private information in the embedding matrix. Then, to balance the tradeoff between privacy and utility, we design the loss functions in SLPGE to retain the utility of graph structure and node labels. The main contributions of this paper are summarized below:

(i) This article focuses on the privacy protection of sensitive links in IoT and proposes a sensitive link protection method (SLPGE) to conceal sensitive links from link prediction attacks.

(ii) Experiments on two public datasets with node attributes show that SLPGE reduces the prediction accuracy of attack models for sensitive links by up to 30.05% and 15.03% relative to the original data.

(iii) Our method achieves a tradeoff between privacy and utility. Unlike previous methods, it does not apply link disturbance directly to the original graph to remove private information, which reduces the loss of utility.

The rest of the paper is organized as follows. The related work and preliminaries are reviewed in Sections 2 and 3, respectively. The system models and problem formulation are presented in Section 4. The details of our SLPGE are described in Section 5. The simulation and results are shown in Section 6. Finally, we give the conclusions and future work in Section 7.

2. Related Work

The emergence of various IoT platforms not only facilitates people’s lives but also generates a huge volume of data carrying personal information. These data can be modeled as graph-structured data, and attackers can then easily expose the private information hidden in the graphs via link prediction. In this section, we briefly introduce the relevant work on graph privacy protection, link prediction, and antilink prediction.

2.1. Graph Privacy Protection

The main methods of graph privacy protection include anonymization, random disturbance, and clustering. Since Sweeney [15] introduced anonymization into graph structure data, different anonymization variants for graphs have been derived. Ying and Wu [16] disrupted the graph structure by randomly deleting and adding $k$ edges. Li et al. [17] first performed spectral clustering according to the distance between nodes and then anonymized the subgraphs. For graphs with node labels, Yuan et al. [18] proposed a node attribute label $k$-anonymity protection method, which ensures that the labels of at least $k$ nodes are the same. Chester and Srivastava [19] proposed an attribute probability distribution anonymity method to make the probability distribution of the label carried by each node in the attribute sets of its neighbors as close as possible to the global label probability distribution. The random graph modification technology proposed by Hay et al. [20] is the simplest technology to prevent node reidentification and edge exposure. Mittal et al. [9] proposed a link perturbation based on the random walk (LPRW), which improved the privacy and utility of data compared with Hay’s method. Among edge clustering methods, Liu et al. [21] proposed privacy protection methods for sensitive edge weights in weighted graphs, adopting Gaussian noise disturbance and greedy disturbance. Zheleva and Getoor [22] mainly considered the privacy of graphs with multiple types of edges and one type of node. Their main idea is to divide the original graph into subgraphs via spectral clustering and then randomly modify the links in the subgraphs and add new links between the subgraphs.

Low data availability and high computational complexity are the common problems of these methods, and their privacy will continue to decrease as inference attacks intensify.

2.2. Link Prediction

Link prediction aims to predict missing facts according to existing entities and has found wide application in social, biological, and communication networks. Because of its powerful inference ability, link prediction has also been maliciously used to spy on the privacy of entities in the networks. Among the many link prediction methods, classification models such as support vector machine (SVM) [23], multilayer perceptron (MLP) [24], and $k$-nearest neighbor (KNN) [25] regard link prediction as a binary classification problem, in which the connected node pairs and unconnected node pairs are regarded as positive samples and negative samples, respectively.

2.3. Antilink Prediction

At present, most antilink prediction methods for graph structure data disturb the graph structure by strategically adding some new links and deleting part of the nonsensitive links, so as to reduce the prediction ability of various link prediction methods and protect the privacy of sensitive links. Liu and Terzi [6] proposed to achieve $k$-degree anonymization through edge addition or deletion strategies. Rousseau et al. [7] proposed two approaches that preserve the coreness of a graph while anonymizing it through various edge modification operations. Fard and Wang [8] and Mittal et al. [9] proposed two structure-aware randomization perturbation methods, based on local perturbation and random walk, that consider the structural proximity of nodes. Zhou et al. [10] regarded the links between the end nodes of a sensitive link and their common nodes as the candidate links to be deleted and expressed the attack on local similarity as an optimization problem to determine which links to delete. Chen et al. [11] proposed an iterative gradient attack (IGA) method based on integral gradient information in Graph Autoencoder (GAE). The gradients obtained by maximizing the loss of sensitive links represent the influence of other links on sensitive links. During $k$ iterations, the $n$ links with the largest gradients are modified. Yu et al. [12] combated resource allocation (RA) indicator link prediction via random, heuristic, and evolutionary link disturbance. Among these three methods, random link disturbance adds and changes links without any strategy, heuristic link disturbance reduces the link prediction ranking of node pairs in the test set, and evolutionary link disturbance selects the links to be added and deleted according to a fitness function. Waniek et al. [13] chose to delete or add the most influential links to hide sensitive links by reducing or creating the closed triangles containing sensitive links.

The methods mentioned above can be used in IoT systems to avoid the leakage of sensitive links in data transactions. However, two shortcomings are present in the above methods: the first is that the utility of the graph will be lost due to link disturbance, and the second is that they lack the consideration of the impact of node attributes on link prediction.

3. Preliminaries

As a kind of non-Euclidean data, a graph is difficult to process directly with traditional data analysis methods or deep learning models such as Convolutional Neural Network (CNN) [26] and Recurrent Neural Network (RNN) [27] due to the high computational and space overhead. Graph embedding, also called network representation learning, aims to map graph data, usually a high-dimensional dense matrix, to low-dimensional dense vectors. With graph embedding, deep learning models can be applied directly to graph analysis tasks in a more flexible and richer way. Graph Neural Network (GNN) is the representative deep learning approach to graph embedding. By modeling the nodes and communication links in the networks, GNN can be applied to solve the privacy disclosure problem in IoT. Because of its advantages in extracting features from non-Euclidean data, our SLPGE is built on several GNN models. In this section, the GNN models involved in SLPGE, namely, Graph Convolutional Network (GCN), VGAE, and ARVGA, are briefly introduced. For the sake of clarity, the frequently used notations and their meanings are listed in Table 1.

Table 1

Summary of notations.

Notations	Meanings
$G$	The undirected original graph
$V$	The set of nodes in $G$
$|V|$	The number of nodes
$E$	The set of edges in $G$
$|E|$	The number of edges
$v_{i}$	The $i^{th}$ node
$e_{i j}$	The edge between $v_{i}$ and $v_{j}$
$X$	The node feature matrix of $V$
$F$	The number of node attributes
$A$	The adjacency matrix of $G$
$A_{p}$	The adjacency matrix of privacy graph
$A_{t}$	The adjacency matrix of training graph
$\hat{A}$	The reconstructed adjacency matrix of $G$
${\hat{A}}_{p}$	The reconstructed adjacency matrix of privacy graph
$A_{i j}$	The link state between $v_{i}$ and $v_{j}$ in $A$
${\hat{A}}_{i j}$	The link state between $v_{i}$ and $v_{j}$ in $\hat{A}$
$L$	The number of categories for node labels
$\hat{y}$	The node label matrix predicted by the softmax classifier, where each row contains the predicted values of the $L$ categories
$Z_{p}$	The privacy embedding of privacy graph
$Z_{f}$	The link protection graph embedding
$Z$	The higher dimensional graph embedding concatenated by $Z_{f}$ and $Z_{p}$
$m$	The maximum number of edges added for each sensitive link
$E_{sl}$	The sensitive links in $G$
$E_{nsl}$	Part of nonsensitive links in $G$
$E_{know}$	The links which are known to the attack models
$L_{link}$	The reconstruction loss
$L_{label}$	The node classification loss
$L_{g}$	The distribution loss of the generator
$L_{G}$	The total loss of the generator
$L_{D}$	The distribution loss of the discriminator
$Acc_{sl}$	The classification accuracy of the attack models for sensitive links
$Acc_{nsl}$	The classification accuracy of the attack models for nonsensitive links
$Acc_{recon}$	The link reconstruction accuracy of $Z_{f}$
$Rec_{recon}$	The link reconstruction recall of $Z_{f}$
$Acc_{node}$	The node classification accuracy of $Z_{f}$

3.1. Graph Convolutional Network

In 2013, Bruna et al. [28] first proposed neural networks on graphs and gave two constructions, one based on a hierarchical clustering of the domain and the other on the spectrum of the graph Laplacian. As a typical GNN model, GCN [29] is a scalable approach for semisupervised learning on graph data, which uses the spectrum of the graph Laplacian to achieve convolution on graphs. After each convolution of GCN, the features of a node are the weighted sum of the previous features of the node and its neighbor nodes, so the nodes aggregate features from farther away as the layers deepen. Hence, the strength of GCN is that it naturally incorporates local graph structure and node features. Suppose the adjacency matrix $A \in ℝ^{N \times N}$ represents the connection relationship between $N$ nodes; then the layer-wise propagation rule of GCN is as follows: $\begin{matrix} (1) & H_{l + 1} = σ\left( {\tilde{D}}^{- 1 / 2} \tilde{A} {\tilde{D}}^{- 1 / 2} H_{l} W_{l} \right), \end{matrix}$ where $H_{l}$ is the feature matrix of the $l^{th}$ layer, $W_{l}$ is the trainable weight matrix, and $σ(\cdot)$ is an activation function. ${\tilde{D}}^{- 1 / 2} \tilde{A} {\tilde{D}}^{- 1 / 2}$ is the normalization of $\tilde{A}$, where $\tilde{A} = A + I_{N}$, $I_{N} \in ℝ^{N \times N}$ is the identity matrix, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and ${\tilde{D}}_{i i} = \sum_{j} {\tilde{A}}_{i j}$. The degree of a node is the number of first-order neighbors connected to the node. Equation (1) can be abbreviated as $H_{l + 1} = f(H_{l}, A)$, since $A$ is the input of each layer.
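As an illustration of Equation (1), the following minimal NumPy sketch computes the normalized adjacency matrix and applies one propagation step. The toy graph, the feature values, the random weights, and the choice of ReLU as $σ(\cdot)$ are assumptions made only for this example.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a symmetric 0/1 adjacency matrix."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops: A~ = A + I_N
    deg = A_tilde.sum(axis=1)             # D~_ii = sum_j A~_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A_norm, H, W, activation=relu):
    """One propagation step H_{l+1} = sigma(A_norm @ H_l @ W_l)."""
    return activation(A_norm @ H @ W)

# Toy path graph 0-1-2 with two features per node (illustrative values).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
rng = np.random.default_rng(0)
W0 = rng.normal(size=(2, 4))              # hypothetical weights, 2 -> 4 features

A_norm = normalize_adjacency(A)
H1 = gcn_layer(A_norm, H0, W0)
```

Each row of `H1` mixes the features of a node with those of its first-order neighbors, weighted by the normalized adjacency entries.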

3.2. Variational Graph Autoencoders

Soon after GCN was proposed, Kipf and Welling [30] introduced VGAE to expand its capability; VGAE adopts GCN as an encoder to generate graph embeddings for different graph tasks, not limited to node classification. VGAE is an unsupervised learning framework derived from Variational Autoencoders (VAE) [31], which obtains the graph embedding through an encoder-decoder structure. VGAE consists of a two-layer GCN encoder and a simple inner-product decoder. The two-layer GCN can be defined as follows: $\begin{matrix} (2) & GCN(X, A) = f(H_{1}, A) = σ\left( \bar{A} f(H_{0}, A) W_{1} \right) = \bar{A}\, ReLU\left( \bar{A} H_{0} W_{0} \right) W_{1}, \end{matrix}$ where $\bar{A} = {\tilde{D}}^{- 1 / 2} \tilde{A} {\tilde{D}}^{- 1 / 2}$ is the symmetrically normalized adjacency matrix and $ReLU(\cdot) = \max(0, \cdot)$ is the activation function of the first layer. $σ(\cdot)$ of the second layer is determined according to the specific task. The encoder of VGAE aims to learn the mean $μ$ and the standard deviation $σ$ of a multidimensional Gaussian distribution from which the graph embedding $Z$ is sampled. The process is briefly described below: $\begin{matrix} (3) & μ = GCN_{μ}(X, A), \\ & \log σ = GCN_{σ}(X, A), \\ & Z = μ + ε \times σ, \end{matrix}$ where $X \in ℝ^{N \times F}$ replaces $H_{0}$ in Equation (2) as the node feature matrix of the first layer and $GCN_{μ}(X, A)$ and $GCN_{σ}(X, A)$ share the first-layer parameters $W_{0}$. $Z \sim N(μ, σ^{2})$ is the graph embedding matrix and $ε \sim N(0, 1)$ is the noise sampled from the standard Gaussian distribution. The inner product is used as the decoder in VGAE, and the formula is as follows: $\begin{matrix} (4) & \hat{A} = σ\left( Z \cdot Z^{T} \right), \end{matrix}$ where $σ(\cdot) = 1 / (1 + \exp(- \cdot))$ is the sigmoid function. $\hat{A}$ is the reconstructed adjacency matrix, and ${\hat{A}}_{i j}$ can be regarded as the product of independent event probabilities of the $i^{th}$ node and the $j^{th}$ node. When ${\hat{A}}_{i j}$ is greater than the threshold 0.5, a link is considered to exist between the $i^{th}$ node and the $j^{th}$ node.
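The sampling step of Equation (3) and the inner-product decoder of Equation (4) can be sketched as follows. This is only an illustration: the arrays `mu` and `log_sigma` are hand-picked stand-ins for the outputs of $GCN_{μ}$ and $GCN_{σ}$, not values produced by a trained encoder.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reparameterize(mu, log_sigma, rng):
    """Sample Z = mu + eps * sigma with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * np.exp(log_sigma)

def decode(Z, threshold=0.5):
    """Inner-product decoder: A_hat = sigmoid(Z Z^T), thresholded into links."""
    A_hat = sigmoid(Z @ Z.T)
    return A_hat, (A_hat > threshold).astype(int)

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for three nodes in a 2-dimensional space.
mu = np.array([[2.0, 0.0],
               [2.0, 0.0],
               [-2.0, 0.0]])
log_sigma = np.full_like(mu, -20.0)   # tiny standard deviation, so Z is close to mu

Z = reparameterize(mu, log_sigma, rng)
A_hat, links = decode(Z)
```

Nodes 0 and 1 have aligned embeddings, so their inner product is positive and the decoder predicts a link; node 2 points the opposite way and is predicted as unconnected to both. The diagonal of `A_hat` only reflects each node's own norm and is usually ignored.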

VGAE has two optimization objectives: one is to make $\hat{A}$ and $A$ as similar as possible; the other is to make the distribution of $Z$ as close to the standard Gaussian distribution as possible. Since binary cross-entropy (BCE) can determine the proximity between the actual output and the expected output and Kullback-Leibler (KL) divergence can measure the difference between two distributions, the loss function of VGAE composed of BCE and KL divergence can be expressed as $\begin{matrix} (5) & loss = E_{q(Z ∣ X, A)}\left[ \log p(A ∣ Z) \right] - KL\left[ q(Z ∣ X, A) \,\|\, p(Z) \right] . \end{matrix}$

Here, the first term is handled by minimizing the reconstruction loss through the cross-entropy function, and the second by minimizing the KL divergence. $p(A ∣ Z) = σ\left( Z \cdot Z^{T} \right)$, $q(Z ∣ X, A) = \prod_{i = 1}^{N} q(z_{i} ∣ X, A) = \prod_{i = 1}^{N} N\left( z_{i} ∣ μ_{i}, diag(σ_{i}^{2}) \right)$ is the distribution actually produced by the encoder, and $p(Z) = \prod_{i} p(z_{i}) = \prod_{i} N(z_{i} ∣ 0, I)$ is a Gaussian prior. $KL\left[ q(\cdot) \,\|\, p(\cdot) \right]$ is the KL divergence between $q(\cdot)$ and $p(\cdot)$. We expect $q(Z ∣ X, A)$ to be as close to $p(Z)$ as possible.

More specifically, $E_{q(Z ∣ X, A)}\left[ \log p(A ∣ Z) \right]$ in Equation (5) can be abbreviated as $loss_{link}$ below: $\begin{matrix} (6) & loss_{link} = - \frac{1}{|V|^{2}} \sum_{i \in V} \sum_{j \in V} \left[ p_{1} A_{i j} \log {\hat{A}}_{i j} + \left( 1 - A_{i j} \right) \log \left( 1 - {\hat{A}}_{i j} \right) \right], \end{matrix}$ where $A_{i j}$ represents the value, 0 or 1, of an element in $A$, ${\hat{A}}_{i j}$ represents the probability value of the corresponding element in $\hat{A}$, and $p_{1}$ is the ratio of the number of zeros to the number of ones in $A$, which compensates for the imbalance between positive and negative samples. $KL\left[ q(Z ∣ X, A) \,\|\, p(Z) \right]$ in Equation (5) can be abbreviated as $loss_{dist}$ below: $\begin{matrix} (7) & loss_{dist} = - \frac{1}{2} \left( 1 + \log σ^{2} - μ^{2} - σ^{2} \right) . \end{matrix}$
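The two loss terms of Equations (6) and (7) can be written down directly in NumPy. The sketch below is illustrative only: the small constant `eps` inside the logarithms and the way the KL term is reduced (summed over embedding dimensions, averaged over nodes) are assumptions of this example, since the text does not state the reduction.

```python
import numpy as np

def loss_link(A, A_hat, eps=1e-10):
    """Weighted binary cross-entropy of Eq. (6); p1 = (#zeros) / (#ones) in A."""
    n = A.shape[0]
    p1 = (A == 0).sum() / (A == 1).sum()
    terms = p1 * A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps)
    return -terms.sum() / n ** 2

def loss_dist(mu, log_sigma):
    """KL term of Eq. (7), elementwise -1/2 (1 + log sigma^2 - mu^2 - sigma^2).

    Assumption: summed over embedding dimensions and averaged over nodes.
    """
    sigma2 = np.exp(2.0 * log_sigma)
    kl = -0.5 * (1.0 + 2.0 * log_sigma - mu ** 2 - sigma2)
    return kl.sum(axis=1).mean()

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
A_hat_good = np.array([[0.01, 0.99],
                       [0.99, 0.01]])   # close to A
A_hat_bad = 1.0 - A_hat_good            # inverted prediction

good = loss_link(A, A_hat_good)
bad = loss_link(A, A_hat_bad)
kl_at_prior = loss_dist(np.zeros((3, 2)), np.zeros((3, 2)))
```

A reconstruction close to $A$ gives a small `loss_link`, the inverted one gives a large value, and the KL term vanishes when $μ = 0$ and $σ = 1$, i.e., when the encoder output already matches the prior.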

3.3. Adversarially Regularized Variational Graph Autoencoder

To force the graph embedding learned by VGAE to fit the prior distribution better, Pan et al. [32] proposed ARVGA by combining VGAE and Generative Adversarial Network (GAN). GAN was first proposed by Goodfellow et al. [33] in 2014 to serve as a generative model bridging supervised learning and unsupervised learning. Most recently, exploiting GAN to work out elegant solutions to severe privacy and security problems has become increasingly popular in both academia and industry due to its game theoretic optimization strategy [34]. Typically, GAN consists of a generator $G$ and a discriminator $D$; in a nutshell, its purpose is to pass off fake samples as genuine ones. During the iterative training, $G$ is trained to generate fake samples that convince $D$ that they come from a prior data distribution, while $D$ discriminates whether an input sample comes from the prior data distribution or from $G$. In ARVGA, we take VGAE as $G$ and a two-layer fully connected network as $D$, whose output layer has only one dimension with a sigmoid function. The equation for training the encoder model with the discriminator can be written as follows: $\begin{matrix} (8) & \min_{G} \max_{D} E_{x \sim p_{data}(x)}\left[ \log D(x) \right] + E_{z \sim p_{z}(z)}\left[ \log \left( 1 - D(G(z)) \right) \right] . \end{matrix}$

Here, $x \sim p_{data}(x)$ is the real sample, $z \sim p_{z}(z)$ is the original data, $G(z)$ is the fake sample, and $D(\cdot)$ is the probability that the sample is real. $G$ aims to minimize the objective, while $D$ aims to maximize it. Through the game between $G$ and $D$, ARVGA can enforce the graph embedding to match the prior distribution and produce a robust representation.

4. Model and Problem Formulation

In this article, our work is based on the following assumptions in the graph of IoT: The connections between devices are bidirectional. There are $L$ types of devices in the graph, and each device has its own attribute information such as internal storage, bandwidth, and hard disk. Sensitive links are the links that need to be hidden, while nonsensitive links are those which can be made public. The links whose end nodes have a larger total degree are defined as sensitive links. The nodes with larger degrees usually have more influence in the graph, so the links between these nodes are also more meaningful. Moreover, we take SVM and MLP as attack models to test the performance of our method, and part of nonsensitive links and nonexistent links in the graph are known to the attack models.
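The degree-based definition of sensitive links can be illustrated with a short sketch. The text does not state how many links are treated as sensitive, so the parameter `num_sensitive` and the toy edge list below are hypothetical.

```python
def select_sensitive_links(edges, num_sensitive):
    """Rank links by the total degree of their end nodes and split them.

    num_sensitive is a hypothetical parameter: the number of top-ranked
    links to treat as sensitive.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    ranked = sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)
    return ranked[:num_sensitive], ranked[num_sensitive:]

# Toy undirected graph: nodes 0 and 1 both have degree 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4), (3, 5)]
sensitive, nonsensitive = select_sensitive_links(edges, num_sensitive=1)
```

In this toy graph the link between nodes 0 and 1 has the largest total end-node degree (3 + 3), so it is the one selected as sensitive.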

4.1. Network Model

We express one of the graphs of IoT as an undirected graph $G = (V, E, X)$. $V = \{ v_{1}, v_{2}, \dots, v_{N} \}$ is the set of $N$ terminal nodes, with $N = |V|$. $E$ contains the edges $e_{i j}$, each representing the communication link between $v_{i}$ and $v_{j}$ ($1 \leq i, j \leq N$), including sensitive links and nonsensitive links. $\bar{E}$ is the set of nonexistent links and $E \cup \bar{E} = E_{N^{2}}$, where $E_{N^{2}}$ contains the $N \times N$ edges that can be formed between $N$ nodes. Node attributes are summarized in a feature matrix $X \in ℝ^{N \times F}$, whose $i^{th}$ row represents the attributes of $v_{i}$, and $F$ is the number of attributes. $A \in ℝ^{N \times N}$ is the adjacency matrix, where $A_{i j} = 1$ if $e_{i j} \in E$; otherwise, $A_{i j} = 0$. $E_{sl} \subset E$ is the set of sensitive links, $E_{nsl} \subset E$ is the set of nonsensitive links, and $E_{sl} \cap E_{nsl} = \emptyset$.

4.2. Attack Model

Both SVM and MLP have strong classification abilities for nonlinear problems with different structures.

SVM is a classification model based on the structural risk minimization criterion in machine learning. For nonlinear classification problems, SVM adopts a nonlinear function $ϕ(x)$ to map the samples from the input space to a high-dimensional feature space where the samples are linearly separable and constructs an optimal classification hyperplane to categorize new samples utilizing labeled training data. Given the training set $T = \{ (x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{k}, y_{k}) \}$ with $x_{i} \in ℝ^{N}$, SVM can transform the classification problem into a convex quadratic optimization problem as follows: $\begin{matrix} (9) & \begin{cases} \min_{α} \frac{1}{2} \sum_{i = 1}^{k} \sum_{j = 1}^{k} α_{i} α_{j} y_{i} y_{j} \left( ϕ(x_{i}) \cdot ϕ(x_{j}) \right) - \sum_{i = 1}^{k} α_{i} \\ s.t. \ \sum_{i = 1}^{k} α_{i} y_{i} = 0, \\ 0 \leq α_{i} \leq C, \ i = 1, 2, \dots, k, \end{cases} \end{matrix}$ where $α_{i}$ is a Lagrange multiplier and $C$ is the penalty factor. Since the computation of $ϕ(x_{i}) \cdot ϕ(x_{j})$ grows sharply in the high-dimensional space, SVM introduces the kernel function $K(x_{i}, x) = ϕ(x_{i}) \cdot ϕ(x)$ to avoid the problem. The kernel function we choose is the Gaussian kernel: $\begin{matrix} (10) & K(x_{i}, x) = \exp \left( - \frac{\| x_{i} - x \|^{2}}{2 σ^{2}} \right), \end{matrix}$ where $σ^{2}$ is the variance. In this case, the classification decision function is as follows: $\begin{matrix} (11) & f(x) = sign \left( \sum_{i = 1}^{k} α_{i} y_{i} \exp \left( - \frac{\| x_{i} - x \|^{2}}{2 σ^{2}} \right) + b \right), \end{matrix}$ where $b$ is the bias constant.
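To make Equations (10) and (11) concrete, the sketch below evaluates the Gaussian-kernel decision function for a hand-made model. The support vectors, the multipliers `alpha`, the bias `b`, and the label convention $y_{i} \in \{-1, +1\}$ are assumptions of this example; in practice $α$ and $b$ come from solving the optimization problem in Equation (9).

```python
import numpy as np

def gaussian_kernel(xi, x, sigma):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_x, support_y, alpha, b, sigma):
    """f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    s = sum(a * y * gaussian_kernel(xi, x, sigma)
            for a, y, xi in zip(alpha, support_y, support_x))
    return int(np.sign(s + b))

# Hand-picked toy model (not the result of training).
support_x = np.array([[0.0, 0.0],
                      [2.0, 2.0]])
support_y = np.array([-1, 1])
alpha = np.array([1.0, 1.0])      # satisfies sum_i alpha_i y_i = 0
b = 0.0
sigma = 1.0

pred_pos = svm_decision(np.array([1.9, 2.1]), support_x, support_y, alpha, b, sigma)
pred_neg = svm_decision(np.array([0.1, -0.1]), support_x, support_y, alpha, b, sigma)
```

A query point is assigned the label of the support vector whose kernel response dominates, which here is simply the nearer one.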

MLP is a fully connected artificial neural network, consisting of an input layer, a hidden layer, and an output layer. MLP adjusts the parameters in the hidden layer units through the supervised back propagation (BP) algorithm and gradient descent algorithm to reduce the error between the actual output and the expected output. The forward propagation mechanism of MLP is expressed as below: $\begin{matrix} (12) & H^{l + 1} = σ\left( W^{l} H^{l} + b^{l} \right), \end{matrix}$ where $H^{l}$ is the input matrix, $W^{l}$ is the weight matrix, $b^{l}$ is the bias, and $H^{l + 1}$ is the output of the hidden layer. Thus, the decision function of MLP with only one hidden layer can be expressed as follows: $\begin{matrix} (13) & f(x) = σ\left( W^{1} σ\left( W^{0} x + b^{0} \right) + b^{1} \right), \end{matrix}$ where $x$ is the input and $σ(\cdot)$ is an activation function.
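Equation (13) corresponds to the following forward pass. The layer sizes, the random weights, and the use of the sigmoid for both activations are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W0, b0, W1, b1, activation=sigmoid):
    """f(x) = sigma(W1 sigma(W0 x + b0) + b1) for a single hidden layer."""
    hidden = activation(W0 @ x + b0)
    return activation(W1 @ hidden + b1)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # hypothetical 4-dimensional input
W0, b0 = rng.normal(size=(3, 4)), np.zeros(3) # input -> 3 hidden units
W1, b1 = rng.normal(size=(1, 3)), np.zeros(1) # hidden -> 1 output unit

score = mlp_forward(x, W0, b0, W1, b1)
```

With a sigmoid output unit, `score` lies in (0, 1) and can be read as the predicted probability that the input edge vector belongs to class 1.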

4.3. Problem Formulation

Given a graph $G$, our model compresses it into a graph embedding $Emb$, whose $i^{th}$ row is the vector $emb_{i}$ of $v_{i}$. The vector of $e_{i j}$ can be expressed as $emb_{i j} = (emb_{i}, emb_{j})$. Suppose a link set $E_{know}$ containing $k$ nonsensitive links (class 1) and $k$ nonexistent links (class 0) in $G$ has been exposed to attackers. Then, the embedding matrices of $E_{sl}$ and $E_{know}$ are $Emb_{E_{sl}}$ and $Emb_{E_{know}}$, where each row represents an edge embedding vector.

During data transactions, attackers will collect or steal $Emb$ by any means to infer sensitive links through link prediction. Our goal is to achieve a balance between privacy protection and data utility. To this end, we use a “minmax” strategy to maximize the distance between the predicted label $label_{pred}$ of sensitive links and their real label $label_{real}$ while minimizing the distance between $label_{pred}$ of nonsensitive links and their $label_{real}$. The mathematical description is as follows: $\begin{matrix} (14) & Training : Clf\_fit\left( Emb_{E_{know}}, label_{E_{know}} \right), \\ & Prediction : label_{pred} = Clf\_predict\left( Emb_{E_{sl}} \right), \\ & Objective : \min_{E_{nsl}} \max_{E_{sl}} \left\| label_{pred} - label_{real} \right\|^{2}, \end{matrix}$ where $Clf\_fit(x_{train}, y_{train})$ means fitting the classifier model with the training data and $Clf\_predict(x_{test})$ means predicting the labels of $x_{test}$. $label_{E_{know}}$ is the label set of $E_{know}$, where $k$ ones represent nonsensitive links and $k$ zeros represent nonexistent links. We expect to obtain a graph embedding that serves this purpose.
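The attacker's data preparation described above can be sketched as follows. The sketch assumes that the edge vector $(emb_{i}, emb_{j})$ denotes the concatenation of the two node vectors; the embedding values and the link lists are made up for illustration, and the classifier itself is left out.

```python
import numpy as np

def edge_embeddings(Emb, links):
    """Stack one edge vector per link, assumed to be [emb_i, emb_j] concatenated."""
    return np.array([np.concatenate([Emb[i], Emb[j]]) for i, j in links])

# Hypothetical published embedding: 4 nodes, 3 dimensions each.
Emb = np.arange(12, dtype=float).reshape(4, 3)

known_nonsensitive = [(0, 1), (1, 2)]   # class 1, exposed to the attacker
known_nonexistent = [(0, 3), (2, 3)]    # class 0, exposed to the attacker
sensitive_links = [(0, 2)]              # hidden links the attacker wants to infer

k = len(known_nonsensitive)
X_train = edge_embeddings(Emb, known_nonsensitive + known_nonexistent)
y_train = np.array([1] * k + [0] * k)
X_test = edge_embeddings(Emb, sensitive_links)
```

`X_train` and `y_train` play the role of the arguments of $Clf\_fit$, and `X_test` is what would be passed to $Clf\_predict$; a protective embedding is one for which the predictions on `X_test` are driven towards class 0.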

5. Algorithm

The SLPGE framework consists of two parts. In this section, we will introduce the framework of SLPGE in Subsections 5.1 and 5.2, and the evaluation indicators are described in Subsection 5.3.

5.1. Generate the Privacy Embedding $Z_{p}$

Part 1 generates a privacy embedding $Z_{p}$. In order to put more privacy information into $Z_{p}$, we first change the structure of the original graph $G$ to strengthen the connection between the end nodes of sensitive links. Before inputting the graph data into the model, we preprocess $G$ through Algorithms 1 and 2, which correspond to Figures 1 and 2. Algorithm 1 rests on the idea that two nodes with more common neighbors have a closer relationship. As shown in Figure 1, $e_{01}$ is a sensitive link, $\{ v_{2}, v_{4} \}$ and $\{ v_{3}, v_{5} \}$ are the neighbor sets of $v_{0}$ and $v_{1}$, respectively, and $e_{23}$ exists; in this case, we add the links $e_{03}$ and $e_{12}$ to make the relationship between $v_{0}$ and $v_{1}$ closer. Algorithm 2 rests on the idea that the main information in the graph concentrates on sensitive links once other irrelevant nodes and links are removed. As shown in Figure 2, we keep only the sensitive links and their adjacent links to retain the information about the sensitive links to the greatest extent. The two privacy graph adjacency matrices $A_{p}$ obtained by Algorithms 1 and 2 are each used as the input of the encoder, yielding two versions of $Z_{p}$.

[figure(s) omitted; refer to PDF]

Because VGAE is more robust and suitable for small graphs, we adopt VGAE to obtain $Z_{p}$ in this part, as shown in Figure 3. As discussed in Section 3, the encoder generates $Z_{p}$ according to Equations (2) and (3). Then, we get the reconstructed adjacency matrix ${\hat{A}}_{p} = sigmoid\left( Z_{p} \cdot {Z_{p}}^{T} \right)$. The reconstruction loss $L_{link}$ is the same as Equation (5), except that $A$ and $\hat{A}$ are replaced by $A_{p}$ and ${\hat{A}}_{p}$.

Algorithm 1: Generate privacy graph by adding edges.

Input: $G = (V, E, X)$: the original graph

$A$: the adjacency matrix of $G$

$E_{sl}$: the set of sensitive links

$N_{i}$: the neighbor nodes set of $v_{i}$

$m$: the maximum number of edges added for each sensitive link

Output: $A_{p}$: the adjacency matrix of privacy graph

for $sl_{i j} \in E_{sl}$ do

  find the neighbor nodes sets $N_{i}$ and $N_{j}$

  for $n_{i} \in N_{i}, n_{j} \in N_{j}$ do

    if $A_{n_{i} n_{j}} = 1$ and $A_{n_{i} v_{j}} = 0$, $A_{n_{j} v_{i}} = 0$ then

      if the number of edges added for $sl_{i j}$ $\leq m$ then

        $A_{n_{i} v_{j}} = 1$, $A_{n_{j} v_{i}} = 1$

      end if

    end if

  end for

end for

return the modified $A$ as $A_{p}$
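A runnable reading of Algorithm 1 is sketched below in plain Python. Two points are interpretations rather than statements of the algorithm: each missing cross edge ($n_{i}$ to $v_{j}$, $n_{j}$ to $v_{i}$) is added independently, and $m$ is treated as a hard cap on the number of edges added per sensitive link. The helper `adjacency_from_edges` and the toy graph, which mirrors the example described for Figure 1, are introduced only for this illustration.

```python
import copy

def adjacency_from_edges(num_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix as a list of lists."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A

def add_privacy_edges(A, sensitive_links, m):
    """Add up to m cross edges around each sensitive link (i, j)."""
    A_p = copy.deepcopy(A)          # leave the caller's matrix untouched
    n = len(A_p)
    for i, j in sensitive_links:
        N_i = [u for u in range(n) if A_p[i][u] == 1 and u != j]
        N_j = [u for u in range(n) if A_p[j][u] == 1 and u != i]
        added = 0
        for n_i in N_i:
            for n_j in N_j:
                if A_p[n_i][n_j] != 1:      # the two neighbors must be linked
                    continue
                # Assumption: each missing cross edge is added on its own,
                # as long as the per-link budget m is not exhausted.
                for a, b in ((n_i, j), (n_j, i)):
                    if A_p[a][b] == 0 and added < m:
                        A_p[a][b] = A_p[b][a] = 1
                        added += 1
    return A_p

# Toy graph: e_01 is sensitive, N_0 = {2, 4}, N_1 = {3, 5}, and e_23 exists.
edges = [(0, 1), (0, 2), (0, 4), (1, 3), (1, 5), (2, 3)]
A = adjacency_from_edges(6, edges)
A_p = add_privacy_edges(A, sensitive_links=[(0, 1)], m=2)
```

On this toy graph the sketch adds exactly $e_{12}$ and $e_{03}$, matching the example given for Figure 1.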

Algorithm 2: Generate privacy graph by deleting edges.

Input: $G = (V, E, X)$: the original graph

$A$: the adjacency matrix of $G$

$E_{sl}$: the set of sensitive links

$V_{sl}$: the end nodes set of $E_{sl}$

$A_{p}$: the empty matrix with the same shape as $A$

$N$: the neighbor nodes set

Output: $A_{p}$: the adjacency matrix of privacy graph

for $v \in V_{sl}$ do

  find the neighbor nodes set $N$ of $v$

  for $n \in N$ do

    ${A_{p}}_{v n} = 1$, ${A_{p}}_{n v} = 1$

  end for

end for

return $A_{p}$
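Algorithm 2 translates almost line by line into Python. The sketch below uses plain lists; the helper `adjacency_from_edges` and the toy graph are assumptions made for illustration.

```python
def adjacency_from_edges(num_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix as a list of lists."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A

def keep_sensitive_neighbourhood(A, sensitive_links):
    """Keep only the links incident to end nodes of sensitive links."""
    n = len(A)
    A_p = [[0] * n for _ in range(n)]                       # empty matrix, same shape as A
    end_nodes = {v for link in sensitive_links for v in link}   # V_sl
    for v in end_nodes:
        for u in range(n):
            if A[v][u] == 1:                                # u is a neighbor of v
                A_p[v][u] = A_p[u][v] = 1
    return A_p

# Toy graph with sensitive link e_01.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
A = adjacency_from_edges(5, edges)
A_p = keep_sensitive_neighbourhood(A, sensitive_links=[(0, 1)])
```

Here the sensitive link and its adjacent links $e_{02}$ and $e_{13}$ survive, while $e_{23}$ and $e_{34}$, which touch neither end node, are dropped.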

[figure(s) omitted; refer to PDF]

For node classification, a softmax classifier follows the encoder to predict the labels of the nodes. The node classification loss function $L_{label}$ is as follows: $\begin{matrix} (15) & L_{label} = - \sum_{i = 1}^{|V|} \sum_{l = 1}^{L} y_{i l} \ln {\hat{y}}_{i l}, \end{matrix}$ where $y_{i l}$ represents the real label of $v_{i}$ in category $l$ with a value of 0 or 1, while ${\hat{y}}_{i l}$ is the value we predict in $\hat{y}$ and ${\hat{y}}_{i l} = softmax\left( {Z_{p}}_{i l} \right) = \frac{1}{Z} \exp \left( {Z_{p}}_{i l} \right)$ with $Z = \sum_{l = 1}^{L} e^{{Z_{p}}_{i l}}$. Therefore, the total loss of Part 1 is as follows: $\begin{matrix} (16) & L_{1} = L_{link} + L_{label} . \end{matrix}$

By backpropagating $L_{1}$, we train the encoder to generate a $Z_{p}$ that contains privacy information and conforms to a prior distribution.

5.2. Generate the Link Protection Graph Embedding $Z_{f}$

Part 2 generates a graph embedding $Z_{f}$ that can protect sensitive links. To reduce the most intuitive privacy information, we remove the sensitive links from $A$ to obtain $A_{t}$, the adjacency matrix of the training graph, as shown in Figure 4. The model in this part is based on ARVGA, as shown in Figure 5. The inputs of the encoder are $A_{t}$ and $X$. The encoder output $Z_{f}$ is the input of the discriminator and the softmax classifier. Unlike Part 1, $Z_{f}$ and $Z_{p}$ are combined by adding or concatenating to form a matrix $Z$ (higher dimensional in the case of concatenation) as the input of the inner-product decoder. The node classification loss function $L_{label}$ is the same as Equation (15). To distinguish the two $Z_{p}$'s obtained in Part 1, in Section 6.2 we use $SLPGE^{+}$ to denote that Algorithm 1 is used for the generation of $Z_{p}$ and $SLPGE^{-}$ to denote that Algorithm 2 is used.

[figure(s) omitted; refer to PDF]

Since the adversarial training between the encoder and the discriminator forces $Z_{f}$ to match a prior distribution, the KL divergence in Equation (5) is omitted. Here, we use the Mean Squared Error (MSE) as the reconstruction loss $L_{link}$, and the reconstruction target is changed to $A$: $\begin{matrix} (17) & L_{link} = \frac{1}{|V|} \sum_{i \in V} \sum_{j \in V} \left( A_{ij} - {\hat{A}}_{ij} \right)^{2} . \end{matrix}$

In the discriminator, we take $Z_{f}$ as the fake samples and samples from a Gaussian distribution as the real samples. Both are input to the discriminator, a two-layer fully connected network, to obtain the two estimated values $d_{fake}$ and $d_{real}$, respectively. $L_{g}$ and $L_{D}$ are the distribution losses of the generator and the discriminator, both calculated by binary cross-entropy (BCE): $\begin{matrix} (18) & L_{g} = - \log d_{fake}, \\ (19) & L_{D} = - \log d_{real} - \log \left( 1 - d_{fake} \right) . \end{matrix}$

Therefore, the total loss of the generator can be written as follows: $\begin{matrix} (20) & L_{G} = L_{link} + L_{label} + L_{g} . \end{matrix}$
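The loss terms of Part 2 can be written out as a small sketch. The function names are hypothetical, the discriminator outputs are taken as single probabilities in $(0, 1)$ rather than batches, and the squared-error reconstruction term is written as a nonnegative quantity, $\frac{1}{|V|} \sum_{i, j} (A_{ij} - {\hat{A}}_{ij})^{2}$, so that minimizing it pulls $\hat{A}$ toward $A$.

```python
import math


def mse_link_loss(A, A_hat):
    """Squared reconstruction error summed over all pairs, divided by |V|."""
    n = len(A)
    return sum((A[i][j] - A_hat[i][j]) ** 2 for i in range(n) for j in range(n)) / n


def generator_dist_loss(d_fake):
    """L_g = -log(d_fake): small when the discriminator is fooled."""
    return -math.log(d_fake)


def discriminator_loss(d_real, d_fake):
    """L_D = -log(d_real) - log(1 - d_fake)."""
    return -math.log(d_real) - math.log(1.0 - d_fake)


def generator_total_loss(l_link, l_label, l_g):
    """L_G = L_link + L_label + L_g, as in Equation (20)."""
    return l_link + l_label + l_g


A = [[0, 1], [1, 0]]
A_hat = [[0.5, 0.5], [0.5, 0.5]]
l_link = mse_link_loss(A, A_hat)  # four errors of 0.25, divided by |V| = 2
```

An undecided discriminator that outputs 0.5 for both inputs yields $L_{D} = 2 \ln 2$ and $L_{g} = \ln 2$.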

Through the adversarial training, the discriminator learns how to distinguish between the real samples and the fake samples, while the generator learns to generate a better $Z_{f}$ to confuse the discriminator. In general, the training process of obtaining $Z_{f}$ can be summarized as Algorithm 3.

Algorithm 3: Generate link protection graph embedding.

Input: $G = (V, E, X)$ : the original graph

$A_{t}$ : the adjacency matrix of training graph

$Z_{p}$ : the privacy embedding

Output: $Z_{f}$ : the link protection graph embedding

1: for each epoch do

2:   Generate the adjacency matrix $A_{t}$ of the training graph

3:   Input $A_{t}$ and $X$ to the encoder to generate $Z_{f}$

4:   Input $Z_{f}$ to the softmax classifier

5:   Add or concatenate $Z_{f}$ and $Z_{p}$ to form $Z$

6:   Input $Z$ to the inner-product decoder

7:   Input $Z_{f}$ and the Gaussian samples to the discriminator

8:   Update the generator by minimizing $L_{G}$

9:   Update the discriminator by minimizing $L_{D}$

10: end for

11: return $Z_{f}$

5.3. Evaluation Indicators

This subsection will introduce the quantitative indicators of privacy and utility.

5.3.1. Privacy

Our chief target is to reduce the prediction accuracy of the attack models for sensitive links. The embedding vector of a sensitive or nonsensitive link is input to an attack model; a predicted value of 1 means that the link is judged to exist, and 0 means that it is not. Privacy is measured by the prediction accuracy $Acc_{sl}$ of the attack models for sensitive links: $\begin{matrix} (21) & Acc_{sl} = \frac{N_{sl}}{E_{sl}} \times 100\%, \end{matrix}$ where $N_{sl}$ is the number of sensitive links predicted to exist and $E_{sl}$ is the total number of sensitive links. The lower $Acc_{sl}$ is, the better $Z_{f}$ protects the sensitive links.
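As a small illustration of Equation (21), the following sketch computes the share of links that an attack model predicts to exist. The function name `link_accuracy` and the example predictions are assumptions; the same helper applies to nonsensitive links.

```python
def link_accuracy(predictions):
    """Percentage of links predicted to exist (prediction == 1)."""
    predicted_to_exist = sum(1 for p in predictions if p == 1)
    return 100.0 * predicted_to_exist / len(predictions)


# Hypothetical attack-model outputs for four sensitive links.
acc_sl = link_accuracy([1, 1, 0, 1])
```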

5.3.2. Utility

Utility includes three parts: the prediction accuracy of the attack models for nonsensitive links, the accuracy and recall of the reconstructed graph, and the accuracy of node classification. Taking the existing links as positive samples and the nonexistent links as negative samples, the quantitative expression of utility is as follows: $\begin{matrix} (22) & Acc_{nsl} = \frac{N_{nsl}}{E_{nsl}} \times 100\%, \end{matrix}$ where $Acc_{nsl}$ is the prediction accuracy of nonsensitive links, $N_{nsl}$ is the number of nonsensitive links predicted to exist, and $E_{nsl}$ is the total number of nonsensitive links: $\begin{matrix} (23) & Acc_{recon} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%, \\ (24) & Rec_{recon} = \frac{TP}{TP + FN} \times 100\%, \end{matrix}$ where $Acc_{recon}$ is the ratio of existing and nonexistent links that are reconstructed correctly and $Rec_{recon}$ is the ratio of existing links that are reconstructed. $TP$ and $FP$ are the numbers of positive and negative samples that are reconstructed as links, and $FN$ and $TN$ are the numbers of positive and negative samples that are not: $\begin{matrix} (25) & Acc_{node} = \frac{N_{node}}{V} \times 100\%, \end{matrix}$ where $Acc_{node}$ is the ratio of the nodes classified correctly to the total number of nodes, $N_{node}$ is the number of nodes classified correctly, and $V$ is the number of nodes. Our goal is to protect privacy while preserving utility, that is, to reduce $Acc_{sl}$ while keeping $Acc_{nsl}$, $Acc_{recon}$, $Rec_{recon}$, and $Acc_{node}$ high.
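The reconstruction indicators in Equations (23) and (24) can be sketched as follows. The threshold of 0.5 on the decoder output and the restriction to unordered node pairs $i < j$ (undirected graph, no self-loops) are assumptions of this sketch, as is the function name `reconstruction_metrics`.

```python
def reconstruction_metrics(A, A_hat, threshold=0.5):
    """Return (Acc_recon, Rec_recon) in percent over unordered node pairs."""
    tp = fp = fn = tn = 0
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):
            predicted = A_hat[i][j] >= threshold  # assumption: 0.5 cut-off
            if A[i][j] == 1 and predicted:
                tp += 1
            elif A[i][j] == 1:
                fn += 1
            elif predicted:
                fp += 1
            else:
                tn += 1
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    rec = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return acc, rec


# Toy graph with edges 0-1 and 1-2; the decoder misses edge 1-2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_hat = [[0.0, 0.9, 0.2], [0.9, 0.0, 0.3], [0.2, 0.3, 0.0]]
acc_recon, rec_recon = reconstruction_metrics(A, A_hat)
```

In the toy case one of the two existing links is recovered and the nonexistent link is correctly left out, giving a recall of 50% and an accuracy of two thirds.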

6. Simulation

In this section, we will evaluate the performance of SLPGE on two public datasets, $Cora$ [35] and $Yale$ [14].

6.1. Experiment Setting

(1) Datasets. $Cora$ is a citation network composed of 7 categories of machine learning papers. $Cora$ includes 2708 papers as $V$ and 5278 citation relationships between papers as $E$. The 1433 unique words that appear in the papers serve as the attributes of $V$. $Yale$ is a social network including 8578 people and 188 attributes. The class year attribute divides the nodes into 7 categories. A portion of the links and labels of each dataset is used as the training set.

(2) Training. The experimental parameters are shown in Table 2. The initial node features are 1433-dimensional for $Cora$ and 188-dimensional for $Yale$. $Z_{f}$ and $Z_{p}$ are both 8-dim in $Cora$ and 7-dim in $Yale$. As shown in Figure 5, SLPGE has two splicing modes for $Z_{f}$ and $Z_{p}$: “concatenate (cat)” and “add,” where “cat” means stacking $Z_{f}$ and $Z_{p}$ in the horizontal direction (i.e., column order) and “add” means that the elements in $Z_{f}$ and $Z_{p}$ are added correspondingly. For $Cora$ and $Yale$, respectively, $Z$ is 16-dim and 14-dim when using “cat” and 8-dim and 7-dim when using “add.” The embeddings of the two end nodes in $Z_{f}$ are concatenated to form an edge embedding, so the dimension of an edge embedding is twice that of a node embedding.
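The two splicing modes and the edge embedding can be illustrated with a short sketch that uses the 8-dimensional $Cora$ setting. The helper names `combine` and `edge_embedding` and the placeholder values are assumptions for illustration.

```python
def combine(Z_f, Z_p, mode):
    """Combine two node-embedding matrices row by row."""
    if mode == "cat":
        # Horizontal stacking: each node's row of Z_p is appended to its row of Z_f.
        return [row_f + row_p for row_f, row_p in zip(Z_f, Z_p)]
    if mode == "add":
        # Element-wise sum: the dimension stays the same.
        return [[a + b for a, b in zip(row_f, row_p)] for row_f, row_p in zip(Z_f, Z_p)]
    raise ValueError("mode must be 'cat' or 'add'")


def edge_embedding(Z_f, i, j):
    """Edge embedding of link (i, j): the two node embeddings concatenated."""
    return Z_f[i] + Z_f[j]


# Three nodes with 8-dimensional embeddings (placeholder values).
Z_f = [[0.1] * 8 for _ in range(3)]
Z_p = [[0.2] * 8 for _ in range(3)]
Z_cat = combine(Z_f, Z_p, "cat")
Z_add = combine(Z_f, Z_p, "add")
e_01 = edge_embedding(Z_f, 0, 1)
```

With 8-dimensional inputs, `Z_cat` has 16 columns, `Z_add` has 8, and an edge embedding has 16 entries.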

Table 2

Details of experiment.

Parameters	$Cora$	$Yale$
$V$	2708	8578
$E$	5278	405450
$F$	1433	188
$L$	7	7
$E_{sl}$	100	200
$E_{nsl}$	100	200
$E_{know}$	400	400
$m$	10	15
$Z_{p}$	8-dim	7-dim
$Z_{f}$	8-dim	7-dim
$Z_{add}$	8-dim	7-dim
$Z_{cat}$	16-dim	14-dim

Besides, we take the original graph $G$ and the training graph $G_{t}$, in which the sensitive links are deleted, as the inputs of VGAE to compare with SLPGE. We also use t-SNE (TSNE) to visualize $Z_{f}$ in two dimensions to observe the node classification result; nodes with the same label are shown in the same color. t-SNE is a nonlinear dimensionality-reduction method that maps high-dimensional features to a 2-dimensional or 3-dimensional space for visualization, often after an initial PCA step, so that the feature distribution can be observed.

(3) Attack. 100 and 200 edges with larger node degrees in the training sets are selected as the sensitive links of $Cora$ and $Yale$ , respectively, and $m$ is 10 and 15 in Algorithm 1. We randomly select 200 nonsensitive links and 200 nonexistent links to form $E_{know}$ which has been exposed to the attackers. Moreover, the edge embeddings of $E_{know}$ will be used as the training set to train the attack models. The edge embeddings of the same number of sensitive and nonsensitive links are the input of the attack models. We train each model four times, and the attack models make 10 predictions after each training. Finally, the averages of the 40 prediction results are taken as the prediction accuracy of sensitive and nonsensitive links.
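The construction of the exposed set $E_{know}$ can be sketched as follows. The sketch assumes an undirected edge list and uniform random sampling; the function name `sample_known_links`, the fixed seed, and the toy ring graph are hypothetical and are not taken from the experiments.

```python
import random


def sample_known_links(edges, sensitive_links, num_nodes, k, seed=0):
    """Pick k nonsensitive existing links (label 1) and k nonexistent links (label 0)."""
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    sensitive = {tuple(sorted(e)) for e in sensitive_links}
    positives = rng.sample(sorted(edge_set - sensitive), k)
    negatives = set()
    while len(negatives) < k:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        pair = (min(u, v), max(u, v))
        if pair not in edge_set:
            negatives.add(pair)
    return [(e, 1) for e in positives] + [(e, 0) for e in sorted(negatives)]


# Toy ring of 10 nodes with one sensitive link.
edges = [(i, (i + 1) % 10) for i in range(10)]
known = sample_known_links(edges, [(0, 1)], num_nodes=10, k=3)
```

The labeled pairs returned here would then be turned into edge embeddings and used to train the attack models.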

6.2. Result Analysis

We carried out our experiments under four models: $V G A E$ , $V G A E_{t}$ , $S L P G E^{+}$ , and $S L P G E^{-}$ . $V G A E$ means the input is the original graph without any modification. $V G A E_{t}$ means the input is the training graph in which the sensitive links are deleted. Our SLPGE is divided into two types: $S L P G E^{+}$ and $S L P G E^{-}$ where $Z_{p}$ comes from Algorithms 1 and 2, respectively.

Figures 6 and 7 visualize the node classification of SVM under different models for $Cora$ and $Yale$, respectively. In each subgraph, the points of the same color form a cluster that represents one class. A larger distance between different clusters means higher accuracy. The corresponding numerical results are listed in Tables 3 and 4. The decline degrees of the five indicators of $VGAE_{t}$, $SLPGE^{+}$, and $SLPGE^{-}$ compared with $VGAE$ are shown in Tables 5 and 6; the decline degree is calculated by $(B - A) / B \times 100\%$, where $B$ represents $VGAE$ and $A$ represents the others.
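As a quick check of how the decline degree relates Tables 3 and 4 to Tables 5 and 6, the sketch below recomputes two reported entries. Tables 5 and 6 list positive values when an indicator falls, which corresponds to $(B - A) / B \times 100$; the function name `decline_degree` is hypothetical.

```python
def decline_degree(baseline, value):
    """Relative drop of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# SVM Acc_sl on Cora: VGAE 87.5 vs. VGAE_t 75.2 (Table 3).
cora_svm_sl = decline_degree(87.5, 75.2)
# Acc_recon on Yale: VGAE 72.1 vs. VGAE_t 73.1 (Table 4); an increase gives a negative value.
yale_recon = decline_degree(72.1, 73.1)
```

The first value is about 14.06, matching the 14.05 reported in Table 5 up to rounding, and the second rounds to the -1.39 reported in Table 6.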

[figure(s) omitted; refer to PDF]

Table 3

The results on $Cora$ .

		SVM		MLP		Reconstruction
Model	Splicing mode	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{recon}$	$Re c_{recon}$	$Ac c_{node}$
$VGAE$	—	$87.5$	$85.0$	$84.1$	$85.0$	$88.6$	$90.7$	$83.5$
$VGA E_{t}$	—	$75.2$	$83.0$	$80.4$	$83.5$	$84.7$	$89.7$	$79.9$
$SLPG E^{-}$	cat	$66.5$	$80.1$	$65.8$	$79.1$	$85.6$	$86.8$	$76.4$
$SLPG E^{-}$	add	$61.6$	$84.6$	$61.3$	$80.5$	$83.5$	$85.4$	$74.1$
$SLPG E^{+}$	cat	$68.0$	$82.0$	$61.8$	$79.8$	$86.6$	$87.3$	$74.5$
$SLPG E^{+}$	add	$61.2$	$82.3$	$60.6$	$84.4$	$84.4$	$87.2$	$77.2$

Table 4

The results on $Yale$ .

		SVM		MLP		Reconstruction
Model	Splicing mode	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{recon}$	$Re c_{recon}$	$Ac c_{node}$
$VGAE$	—	$84.5$	$84.5$	$76.5$	$77.0$	$72.1$	$84.5$	$81.0$
$VGA E_{t}$	—	$81.0$	$83.0$	$74.0$	$73.5$	$73.1$	$85.7$	$77.6$
$SLPG E^{-}$	cat	$75.0$	$81.2$	$65.0$	$71.0$	$67.3$	$79.7$	$80.1$
$SLPG E^{-}$	add	$74.5$	$84.5$	$65.2$	$71.7$	$65.6$	$71.4$	$76.5$
$SLPG E^{+}$	cat	$76.5$	$83.6$	$65.8$	$68.1$	$68.9$	$75.8$	$75.6$
$SLPG E^{+}$	add	$73.7$	$82.0$	$67.5$	$71.5$	$68.8$	$75.1$	$75.5$

Table 5

The decline degree of five indicators compared with VGAE on $Cora$ .

		SVM		MLP		Reconstruction
Model	Splicing mode	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{recon}$	$Re c_{recon}$	$Ac c_{node}$
$VGA E_{t}$	—	$14.05$	$2.35$	$4.40$	$1.76$	$4.40$	$1.10$	$4.31$
$SLPG E^{-}$	cat	$24.00$	$5.76$	$21.76$	$6.94$	$3.39$	$4.30$	$8.50$
$SLPG E^{-}$	add	$29.60$	$0.47$	$27.11$	$5.29$	$5.76$	$5.84$	$11.26$
$SLPG E^{+}$	cat	$22.28$	$3.52$	$26.52$	$6.12$	$2.26$	$3.75$	$10.78$
$SLPG E^{+}$	add	$30.05$	$3.17$	$27.94$	$0.71$	$4.74$	$3.86$	$7.54$

Table 6

The decline degree of five indicators compared with VGAE on $Yale$ .

		SVM		MLP		Reconstruction
Model	Splicing mode	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{sl}$	$Ac c_{nsl}$	$Ac c_{recon}$	$Re c_{recon}$	$Ac c_{node}$
$VGA E_{t}$	—	4.14	1.78	3.27	4.55	-1.39	-1.42	4.20
$SLPG E^{-}$	cat	11.24	3.91	15.03	7.79	6.66	5.68	1.11
$SLPG E^{-}$	add	11.83	0.00	14.77	6.88	9.02	15.50	5.56
$SLPG E^{+}$	cat	9.47	1.07	13.99	11.56	4.44	10.30	6.67
$SLPG E^{+}$	add	12.78	2.96	11.76	7.14	4.58	11.12	6.79

6.2.1. Privacy

Tables 3 and 4 compare $Acc_{sl}$ across the four models: $Acc_{sl}$ of $VGAE_{t}$, $SLPGE^{+}$, and $SLPGE^{-}$ all decrease to varying degrees relative to $VGAE$, but those of $SLPGE^{+}$ and $SLPGE^{-}$ decrease more. For $Cora$ in particular, $SLPGE^{+}$ reduces $Acc_{sl}$ by 30.05% at most and 22.28% at least on the basis of $VGAE$, while $VGAE_{t}$ reduces $Acc_{sl}$ by 14.05% at most. For $Yale$, SLPGE reduces $Acc_{sl}$ by 15.03% at most and 9.47% at least on the basis of $VGAE$, while $VGAE_{t}$ reduces $Acc_{sl}$ by 4.14% at most. SLPGE thus improves privacy significantly compared with $VGAE_{t}$.

Although the privacy of SLPGE is 1.3 to 3.6 times higher than that of $VGAE_{t}$ on $Yale$, the protection effect of sensitive links on $Yale$ is not as good as that on $Cora$, which results from the fact that the node attributes of $Yale$ are more closely related to the links. This also signifies that similar attributes make the privacy information between nodes more difficult to remove. In general, these comparisons confirm that our SLPGE has better performance on sensitive link protection.

6.2.2. Utility

The loss of partial utility is the necessary cost of privacy protection. Taking $VGAE$ as the baseline, we can see that the utility indicators of the variant models have decreased, reflecting a partial sacrifice of data utility. $Acc_{nsl}$, $Acc_{recon}$, $Rec_{recon}$, and $Acc_{node}$ of SLPGE and $VGAE_{t}$ generally decrease, but the decline ranges are generally lower than that of $Acc_{sl}$. From Tables 5 and 6, it can be seen that $Acc_{nsl}$ of SLPGE decreases by 6.94% and 11.56% at most on the basis of $VGAE$ for $Cora$ and $Yale$, but the two corresponding $Acc_{sl}$ decrease more, reaching 21.76% and 13.99%. The maximum decline ranges of $Acc_{recon}$, $Rec_{recon}$, and $Acc_{node}$ of SLPGE on the two datasets are 5.76% and 9.02%, 5.84% and 15.50%, and 11.26% and 6.79%, respectively, which are mostly lower than the decline ranges of $Acc_{sl}$. Tables 5 and 6 reflect the tradeoff between privacy and utility.

6.2.3. Models

The data in the four tables show that $SLPGE^{+}$ and $SLPGE^{-}$ are very close in privacy and utility, which indicates that both Algorithms 1 and 2 are feasible. For the two splicing modes, the privacy and utility of mode “add” are better than those of mode “cat.” The analysis of this result is as follows.

The distributions of $Z_{f}$ and $Z_{p}$ both approach $N(0, 1)$ (the standard normal distribution), and the weight of privacy information in $Z_{p}$ is large and fixed. When $Z$, obtained by adding $Z_{f}$ and $Z_{p}$, is decoded to fit the link labels of the original graph, the MSE loss function forces $Z_{f}$ to reduce its weight of privacy information, so that more privacy information is squeezed out of $Z_{f}$. Therefore, the combination of MSE and mode “add” is better.

Overall, our SLPGE reduces the prediction accuracy of sensitive links to varying degrees, from which we can conclude that our model is effective. While protecting the privacy of sensitive links, some utility will be sacrificed, which may be structure information or attribute information. From the result analysis, it can be confirmed that SLPGE can retain most of the utility. In practical application, part of the structure of the model can be adjusted to meet different task requirements.

7. Conclusion

The problems of individual privacy under the interconnection of all things are ubiquitous. The research on link protection against link prediction in IoT is of great significance for entity privacy. Through the simulation on the datasets, the feasibility of our SLPGE is preliminarily verified. However, multifaceted challenges remain in the research on link protection. Our datasets are just static graphs, in which the nodes belong to different categories at the same level, and the edges only represent citation and social relationships. In heterogeneous scenarios, nodes can be of different levels, edges between the nodes may have diverse meanings, and the weights of the edges are no longer all equal to one. The weight of an edge reflects the degree of communication between its nodes.

Furthermore, in dynamic graphs, the entry and exit of nodes affect the graph structure and the privacy information of sensitive links in real time, and the attackers can collect more information for inference attacks. The greatest challenge is that research on resisting graph disturbance and enhancing the robustness of link prediction continues to emerge, which increases the difficulty of sensitive link protection. Therefore, we will focus on sensitive link protection in weighted graphs and dynamic graphs in our follow-up research.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant 2019JBZ001, in part by the Beijing Natural Science Foundation under Grant 4202054, and in part by the National Natural Science Foundation of China under Grant 61871023 and Grant 61931001.

Appendix

In Section 5.2, we use two splicing modes, “concatenate (cat)” and “add,” to combine $Z_{f}$ and $Z_{p}$ in SLPGE. “cat” means stacking $Z_{f}$ and $Z_{p}$ in the horizontal direction, and “add” means that the elements in $Z_{f}$ and $Z_{p}$ are added correspondingly. Here, we show how to get $Z_{cat}$ and $Z_{add}$ and explain their meanings. We assume that both $Z_{f}$ and $Z_{p}$ are $3 \times 2$ matrices, $\begin{matrix} (A.1) & Z_{f} = \begin{bmatrix} a_{1} & a_{2} \\ b_{1} & b_{2} \\ c_{1} & c_{2} \end{bmatrix}, \\ & Z_{p} = \begin{bmatrix} a_{3} & a_{4} \\ b_{3} & b_{4} \\ c_{3} & c_{4} \end{bmatrix}, \end{matrix}$ then $\begin{matrix} (A.2) & Z_{cat} = \begin{bmatrix} Z_{f} & Z_{p} \end{bmatrix} = \begin{bmatrix} a_{1} & a_{2} & a_{3} & a_{4} \\ b_{1} & b_{2} & b_{3} & b_{4} \\ c_{1} & c_{2} & c_{3} & c_{4} \end{bmatrix}, \\ & Z_{add} = Z_{f} + Z_{p} = \begin{bmatrix} a_{1} + a_{3} & a_{2} + a_{4} \\ b_{1} + b_{3} & b_{2} + b_{4} \\ c_{1} + c_{3} & c_{2} + c_{4} \end{bmatrix} . \end{matrix}$

In Part 2, the reconstructed adjacency matrix $\hat{A}$ is obtained by the inner product of $Z$ ($Z_{cat}$ or $Z_{add}$), i.e., ${\hat{A}}_{cat} = Z_{cat} Z_{cat}^{T}$ and ${\hat{A}}_{add} = Z_{add} Z_{add}^{T}$, whose detailed calculations are shown in Equation (A.3).

We can see that ${\hat{A}}_{add} = {\hat{A}}_{cat} + B$, where $B$ is a matrix of cross-multiplying terms. Since $Z_{p}$ is fixed, $a_{3}, a_{4}, b_{3}, b_{4}, c_{3}$, and $c_{4}$ are fixed, and the loss function forces $Z_{f}$ to adjust constantly so that $\hat{A}$ is close to $A$. Because ${\hat{A}}_{add}$ has more cross-multiplying terms, ${\hat{A}}_{add}$ may exert greater pressure on $Z_{f}$, i.e., on $a_{1}, a_{2}, b_{1}, b_{2}, c_{1}$, and $c_{2}$. Based on the above analysis, we chose these two modes to get $Z$: $\begin{matrix} (A.3) & {\hat{A}}_{cat} = \begin{bmatrix} a_{1} & a_{2} & a_{3} & a_{4} \\ b_{1} & b_{2} & b_{3} & b_{4} \\ c_{1} & c_{2} & c_{3} & c_{4} \end{bmatrix} \times \begin{bmatrix} a_{1} & b_{1} & c_{1} \\ a_{2} & b_{2} & c_{2} \\ a_{3} & b_{3} & c_{3} \\ a_{4} & b_{4} & c_{4} \end{bmatrix} = \begin{bmatrix} \sum_{i = 1}^{4} a_{i}^{2} & \sum_{i = 1}^{4} a_{i} b_{i} & \sum_{i = 1}^{4} a_{i} c_{i} \\ \sum_{i = 1}^{4} b_{i} a_{i} & \sum_{i = 1}^{4} b_{i}^{2} & \sum_{i = 1}^{4} b_{i} c_{i} \\ \sum_{i = 1}^{4} c_{i} a_{i} & \sum_{i = 1}^{4} c_{i} b_{i} & \sum_{i = 1}^{4} c_{i}^{2} \end{bmatrix}, \\ & {\hat{A}}_{add} = \begin{bmatrix} a_{1} + a_{3} & a_{2} + a_{4} \\ b_{1} + b_{3} & b_{2} + b_{4} \\ c_{1} + c_{3} & c_{2} + c_{4} \end{bmatrix} \times \begin{bmatrix} a_{1} + a_{3} & b_{1} + b_{3} & c_{1} + c_{3} \\ a_{2} + a_{4} & b_{2} + b_{4} & c_{2} + c_{4} \end{bmatrix} = \begin{bmatrix} \sum_{i = 1}^{4} a_{i}^{2} & \sum_{i = 1}^{4} a_{i} b_{i} & \sum_{i = 1}^{4} a_{i} c_{i} \\ \sum_{i = 1}^{4} b_{i} a_{i} & \sum_{i = 1}^{4} b_{i}^{2} & \sum_{i = 1}^{4} b_{i} c_{i} \\ \sum_{i = 1}^{4} c_{i} a_{i} & \sum_{i = 1}^{4} c_{i} b_{i} & \sum_{i = 1}^{4} c_{i}^{2} \end{bmatrix} + \begin{bmatrix} 2 a_{1} a_{3} + 2 a_{2} a_{4} & a_{1} b_{3} + a_{3} b_{1} + a_{2} b_{4} + a_{4} b_{2} & a_{1} c_{3} + a_{3} c_{1} + a_{2} c_{4} + a_{4} c_{2} \\ b_{1} a_{3} + b_{3} a_{1} + b_{2} a_{4} + b_{4} a_{2} & 2 b_{1} b_{3} + 2 b_{2} b_{4} & b_{1} c_{3} + b_{3} c_{1} + b_{2} c_{4} + b_{4} c_{2} \\ c_{1} a_{3} + c_{3} a_{1} + c_{2} a_{4} + c_{4} a_{2} & c_{1} b_{3} + c_{3} b_{1} + c_{2} b_{4} + c_{4} b_{2} & 2 c_{1} c_{3} + 2 c_{2} c_{4} \end{bmatrix} . \end{matrix}$
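The relation between the two decoders can be checked numerically. Expanding the products gives $Z_{cat} Z_{cat}^{T} = Z_{f} Z_{f}^{T} + Z_{p} Z_{p}^{T}$ and $Z_{add} Z_{add}^{T} = Z_{f} Z_{f}^{T} + Z_{p} Z_{p}^{T} + Z_{f} Z_{p}^{T} + Z_{p} Z_{f}^{T}$, so the cross-term matrix is $B = Z_{f} Z_{p}^{T} + Z_{p} Z_{f}^{T}$. The sketch below verifies this on arbitrary $3 \times 2$ values; the helper name `matmul_t` and the numbers are assumptions for illustration.

```python
def matmul_t(X, Y):
    """Return X @ Y^T for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]


# Arbitrary 3 x 2 embeddings (hypothetical values).
Z_f = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Z_p = [[0.5, -1.0], [1.5, 0.0], [-2.0, 1.0]]

Z_cat = [row_f + row_p for row_f, row_p in zip(Z_f, Z_p)]
Z_add = [[a + b for a, b in zip(row_f, row_p)] for row_f, row_p in zip(Z_f, Z_p)]

A_cat = matmul_t(Z_cat, Z_cat)
A_add = matmul_t(Z_add, Z_add)

# Cross-multiplying terms: B = Z_f Z_p^T + Z_p Z_f^T.
FP = matmul_t(Z_f, Z_p)
PF = matmul_t(Z_p, Z_f)
B = [[FP[i][j] + PF[i][j] for j in range(3)] for i in range(3)]

max_gap = max(
    abs(A_add[i][j] - (A_cat[i][j] + B[i][j])) for i in range(3) for j in range(3)
)
```

The gap is zero up to floating-point error, and $B$ is the only place where $Z_{f}$ and the fixed $Z_{p}$ interact.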

References

[1] Z. Cai, X. Zheng, "A private and efficient mechanism for data uploading in smart cyber-physical systems," IEEE Transactions on Network Science & Engineering, vol. 7 no. 2, pp. 766-775, 2018.

[2] H. Yang, Y. Liang, J. Yuan, Q. Yao, J. Zhang, "Distributed blockchain-based trusted multi-domain collaboration for mobile edge computing in 5G and beyond," IEEE Transactions on Industrial Informatics, vol. 16 no. 11, DOI: 10.1109/TII.2020.2964563, 2020.

[3] Z. Cai, Z. He, "Trading private range counting over big IoT data," 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).

[4] X. Zheng, Z. Cai, "Privacy-preserved data sharing towards multiple parties in industrial iots," IEEE Journal on Selected Areas in Communications, vol. 38 no. 5, pp. 968-979, DOI: 10.1109/JSAC.2020.2980802, 2020.

[5] Z. Cai, Z. He, X. Guan, Y. Li, "Collective data-sanitization for preventing sensitive information inference attacks in social networks," IEEE Transactions on Dependable & Secure Computing, vol. 15 no. 4, pp. 577-590, DOI: 10.1109/TDSC.2016.2613521, 2016.

[6] K. Liu, E. Terzi, "Towards identity anonymization on graphs," ACM SIGMOD International Conference on Management of Data.

[7] F. O. Rousseau, J. Casas Roma, M. Vazirgiannis, "Community-preserving anonymization of graphs," Knowledge & Information Systems, vol. 54 no. 2, pp. 315-343, DOI: 10.1007/s10115-017-1064-y, 2018.

[8] A. Milani Fard, K. Wang, "Neighborhood randomization for link privacy in social network analysis," World Wide Web internet & Web Information Systems, vol. 18 no. 1, DOI: 10.1007/s11280-013-0240-6, 2015.

[9] P. Mittal, C. Papamanthou, D. Song, "Preserving link privacy in social network based systems," Computer Science, 2012. https://arxiv.org/abs/1208.6189

[10] K. Zhou, T. P. Michalak, T. Rahwan, M. Waniek, Y. Vorobeychik, "Attacking similarity-based link prediction in social networks," 2018. https://arxiv.org/abs/1809.08368

[11] J. Chen, Z. Shi, Y. Wu, X. Xu, H. Zheng, "Link prediction adversarial attack," 2018. https://arxiv.org/abs/1810.01110

[12] S. Yu, M. Zhao, C. Fu, H. Huang, X. Shu, Q. Xuan, G. Chen, "Target defense against link-prediction-based attacks via evolutionary perturbations," IEEE Transactions on Knowledge and Data Engineering, vol. 33 no. 2, DOI: 10.1109/TKDE.2019.2933833, 2019.

[13] M. Waniek, K. Zhou, Y. Vorobeychik, E. Moro, T. Rahwan, "How to hide one's relationships from link prediction algorithms," Scientific Reports, vol. 9 no. 1, DOI: 10.1038/s41598-019-48583-6, 2019.

[14] K. Li, G. Luo, Y. Ye, W. Li, S. Ji, Z. Cai, "Adversarial privacy preserving graph embedding against inference attack," IEEE Internet of Things Journal, vol. 8 no. 8, pp. 6904-6915, DOI: 10.1109/JIOT.2020.3036583, 2021.

[15] L. Sweeney, "k-anonymity: a model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10 no. 5, pp. 557-570, DOI: 10.1142/S0218488502001648, 2002.

[16] X. Ying, X. Wu, "Randomizing social networks: a spectrum preserving approach," Proceedings of the SIAM International Conference on Data Mining, SDM 2008.

[17] Y. Li, H. Shen, C. Lang, H. Dong, "Practical anonymity models on protecting private weighted graphs," Neurocomputing, vol. 218, pp. 359-370, DOI: 10.1016/j.neucom.2016.08.084, 2016.

[18] M. Yuan, C. Lei, P. S. Yu, T. Yu, "Protecting sensitive labels in social network data anonymization," IEEE Transactions on Knowledge & Data Engineering, vol. 25 no. 3, pp. 633-647, DOI: 10.1109/TKDE.2011.259, 2013.

[19] S. Chester, G. Srivastava, "Social network privacy for attribute disclosure attacks," International Conference on Advances in Social Networks Analysis & Mining.

[20] M. Hay, G. Miklau, D. Jensen, P. Weis, S. Srivastava, "Anonymizing social networks," 34th International Conference on Very Large Data Bases (VLDB).

[21] L. Liu, J. Wang, J. Liu, J. Zhang, "Privacy preservation in social networks with sensitive edge weights," SIAM International Conference on Data Mining.

[22] E. Zheleva, L. Getoor, "Preserving the privacy of sensitive relationships in graph data," International Journal of Computer Trends & Technology, vol. 17 no. 1, 2014.

[23] M. A. Hearst, S. T. Dumais, E. Osman, J. Platt, B. Scholkopf, "Support vector machines," IEEE Intelligent Systems, vol. 13 no. 4, pp. 18-28, DOI: 10.1109/5254.708428, 1998.

[24] H. L. Jensen, "Using neural networks for credit scoring," Managerial Finance, vol. 18 no. 6, pp. 15-26, DOI: 10.1108/eb013696, 1992.

[25] S. Yang, H. Jian, Z. Ding, H. Zha, C. L. Giles, "IKNN: informative k-nearest neighbor pattern classification," European Conference on Knowledge Discovery in Databases: PKDD.

[26] J. Gu, Z. Wang, J. Kuen, L. Ma, G. Wang, "Recent advances in convolutional neural networks," Pattern Recognition, vol. 77, pp. 354-377, DOI: 10.1016/j.patcog.2017.10.013, 2018.

[27] H. Salehinejad, S. Sankar, J. Barfett, E. Colak, S. Valaee, "Recent advances in recurrent neural networks," 2017. https://arxiv.org/abs/1801.01078

[28] J. Bruna, W. Zaremba, A. Szlam, Y. Lecun, "Spectral networks and locally connected networks on graphs," 2013. https://arxiv.org/abs/1312.6203

[29] T. N. Kipf, M. Welling, "Semi-supervised classification with graph convolutional networks," 2016. https://arxiv.org/abs/1609.02907

[30] T. N. Kipf, M. Welling, "Variational graph auto-encoders," 2016. https://arxiv.org/abs/1611.07308

[31] D. J. Im, S. Ahn, R. Memisevic, Y. Bengio, "Auto-encoding variational Bayes," https://arxiv.org/abs/1312.6114

[32] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, C. Zhang, "Adversarially regularized graph autoencoder for graph embedding," Twenty-Seventh International Joint Conference on Artificial Intelligence IJCAI-18. https://arxiv.org/abs/1802.04407

[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, 2014.

[34] Z. Cai, Z. Xiong, H. Xu, P. Wang, Y. Pan, "Generative adversarial networks," ACM Computing Surveys, vol. 54 no. 6, DOI: 10.1145/3459992, 2021.

[35] S. Prithviraj, N. Galileo, B. Mustafa, G. Lise, G. Brian, Collective Classification in Network Data, 2008.


Copyright © 2022 Yanfei Lu et al. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

In the Internet of Things (IoT), massive interconnected intelligent terminal devices constitute diverse networks. Link prediction can serve as a powerful inference attack to speculate the sensitive links in the networks, posing a security threat to entity privacy in IoT. Most antilink prediction methods reduce the prediction ability of link prediction models through link disturbance to hide sensitive links but fail to consider the impact of node attributes on link prediction. This paper proposes a sensitive link protection method based on graph embedding (SLPGE) to combat link prediction attacks. This method is aimed at compressing network topology data into an embedding matrix and lessening private information by combining Variational Graph Autoencoder (VGAE) and Adversarially Regularized Variational Graph Autoencoder (ARVGA). Based on our experiment on two datasets, SLPGE reduces the prediction accuracy of two attack models for sensitive links by up to 30.05% and 15.03% compared to the original data, and the corresponding utility sees a drop of 7.54% and 7.79% at most, which verifies the feasibility of SLPGE—achieving the tradeoff between privacy protection and data utility effectively.

Details

Title

Graph Embedding-Based Sensitive Link Protection in IoT Systems

Author

Lu, Yanfei¹; Deng, Zhilin¹; Gao, Qinghe¹; Tao, Jing¹

¹ School of Electronics and Information Engineering, Beijing Jiaotong University, Beijing, China

Editor

Chunqiang Hu

Publication year

2022

Publication date

2022

Publisher

John Wiley & Sons, Inc.

e-ISSN

15308677

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2022/2432351

ProQuest document ID

2660748456
