Abstract
Variational Graph Autoencoder (VGAE) is a widely explored model for learning the distribution of graph data. Currently, the approximate posterior distribution in VGAE-based methods is overly restrictive, leading to a significant gap between the variational lower bound and the log-likelihood of graph data. This limitation reduces the expressive power of these VGAE-based models. To address this issue, this paper proposes the Importance Weighted Variational Graph Autoencoder (IWVGAE) and provides a theoretical justification. This method makes the posterior distribution more flexible through Monte Carlo sampling and assigns importance weights to the likelihood gradients during backpropagation. In this way, IWVGAE achieves a more flexible optimization objective, enabling the learning of richer latent representations for graph data. It not only achieves a theoretically tighter variational lower bound but also makes graph density estimation more accurate. Extensive experimental results on seven classic graph datasets show that as the number of samples from the approximate posterior distribution increases, (1) the variational lower bound continuously improves, validating the proposed theory, and (2) the performance on downstream tasks significantly improves, demonstrating more effective learning and representation of graph data.
Introduction
Graphs are crucial data structures representing features of nodes and relationships between nodes, and have recently received considerable attention. Graphs not only provide a powerful way to represent information but also play an essential role in various fields such as recommendation systems [27], drug analysis [44], bioinformatics [52], and anomaly detection [4]. Additionally, many Euclidean data types, such as images and text, can be effectively learned by structuring them as graph data [3, 28].
However, due to the unique data structure of graphs, many Euclidean data generation models [1, 8, 16, 30] are difficult to apply directly to graph data. One major issue is how to reduce the dimensionality of graph data while preserving valuable information, similar to the approach used with Euclidean data. The introduction of classical graph representation learning methods [14], such as Graph Convolutional Network (GCN) [17], Graph Attention Network (GAT) [46], and Approximate Personalized Propagation of Neural Predictions (APPNP) [6], has successfully addressed this issue. These methods transform graph features and discrete edge information into continuous feature representations, and apply them to various graph tasks, such as link prediction [19, 35], graph clustering [24], and node classification [2]. Inspired by the success of these graph representation learning methods, various graph models extend Euclidean methods to graph data, including graph generation models [5, 21, 50], graph contrastive learning [13, 23, 25, 47], and graph transformers [31, 37, 48].
Variational Graph Autoencoder (VGAE) [18] is a typical graph autoencoder for learning the distribution of graph data. The core idea of VGAE is to construct a variational lower bound by introducing an approximate variational posterior. The final variational lower bound comprises the reconstruction term of the adjacency matrix and the Kullback–Leibler (KL) divergence between the prior distribution and the approximate posterior distribution. It enables the model to learn the edge information of graphs and the joint distribution of node features and edge features. However, in VGAE, the overly restrictive representation of the approximate posterior distribution results in a significant gap between the variational lower bound and the log-likelihood of graph data. This deficiency significantly limits the model’s expressive power and generative performance. Inspired by the work of Burda et al. [1], we propose the Importance Weighted Variational Graph Autoencoder (IWVGAE) to address this issue. IWVGAE introduces Monte Carlo sampling to make the approximate posterior distribution more flexible. It then assigns importance weights to the likelihood gradients during backpropagation. In this way, the proposed IWVGAE can better optimize the model and achieve a theoretically tighter variational lower bound, fully leveraging the capabilities of neural networks and learning richer latent representations of graph data.
Our contributions are outlined as follows:
We utilize Monte Carlo methods for sampling from the approximate posterior distribution in VGAE and propose a new theory regarding the variational lower bound.
We propose a novel graph autoencoder, namely Importance Weighted Variational Graph Autoencoder (IWVGAE), emphasizing the crucial impact of more flexible approximate posterior distributions on the variational lower bound.
Extensive experiments are performed on typical graph datasets, and the results demonstrate the validity of the proposed theory and the superiority of the proposed IWVGAE over other advanced methods.
Related work
In recent years, learning graph data distribution through variational methods has become popular in Graph Neural Networks (GNN). VGAE leverages the graph’s node feature matrix and adjacency matrix to generate low-dimensional latent variables, followed by reconstructing the adjacency matrix. Optimization is achieved through a variational lower bound. Based on the idea of VGAE, a variety of VGAE variants have also emerged.
[See PDF for image]
Fig. 1
The architecture of the proposed IWVGAE, illustrated with six Monte Carlo sampling counts. In the encoder, the graph feature matrix and adjacency matrix are fed into GCNs to obtain the mean and standard deviation of the approximate posterior distribution. Then, six latent variables are sampled from the distribution formed by these mean and standard deviation values. In the decoder, these latent variables are reconstructed into six different adjacency matrices and six different feature matrices, respectively
Early works like ARVGA [36] and RWR-VGAE [10] adopted the data distribution of latent representations to enhance graph embeddings with the aid of adversarial training and random walk regularization, respectively. DGVAE [20] proposed to utilize a Dirichlet distribution instead of a Gaussian distribution for enhancing interpretability. R-VGAE [34] employed a sampling operator and a correction operator to address feature randomness and drift. FT-VGAE [32], based on geometric transformation, proposed a three-stage training framework to address feature drift. VACA [39] incorporated causal inference into VGAE, enabling accurate approximation of interventional and counterfactual distributions in diverse Structural Causal Models (SCM). MoVGAE [53] addresses the limitation of GCNs in capturing high-order structural information by incorporating both first-order and high-order neighborhoods for improved graph representation learning. Su et al. [43] tackle the limitation of standard Gaussian priors in modeling uncertainty by introducing a diffusion-based prior to enhance the expressiveness of the variational framework.
Additionally, various VGAE-based methods are often combined with downstream tasks, yielding excellent performance. GMM-VGAE [12] integrated a Gaussian Mixture Model (GMM) into VGAE, leading to an end-to-end clustering network that enhanced semi-supervised learning performance. GC-VGE [7] achieved graph clustering learning through the joint optimization of the variational lower bound and clustering distribution alignment loss. BELBO-VGAE [33] introduced an explicit clustering objective function into the variational lower bound, resulting in a tight variational lower bound. Zhou et al. [56] and Hou et al. [9] employed VGAE-based models to enhance the interpretability in multimodal recommendation systems, and improved the generalization ability in social network generation, respectively.
Although the above-mentioned methods achieved promising performance, they rarely considered the decisive impact of a more flexible approximate posterior distribution on the variational lower bound and the optimization objective. To address this issue, this paper proposes a new method, IWVGAE, which makes the approximate posterior distribution more flexible through Monte Carlo sampling, as described below.
Proposed method
In this section, we provide the details of the proposed IWVGAE. We describe the theoretical derivation and then explain the corresponding network architecture. The framework of IWVGAE is depicted in Fig. 1, and the pseudocode for IWVGAE is listed in Algorithm 1.
Notation and problem definition
An undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is introduced, which consists of a set $\mathcal{V}$ of $N$ nodes and a set $\mathcal{E}$ of edges connecting these nodes. Each node has a feature vector of dimension $M$, and the feature matrix is represented as $X \in \mathbb{R}^{N \times M}$. The connectivity between nodes is represented by an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $A_{ij} = 1$ if there is an edge connecting node $i$ and node $j$, and $A_{ij} = 0$ otherwise.
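As a concrete illustration of this notation, the short sketch below builds a feature matrix and a symmetric adjacency matrix for a toy undirected graph; the node count, feature dimension, and edge list are arbitrary examples rather than values from the paper.

```python
import torch

# Toy undirected graph: N = 4 nodes, M = 3 features per node (arbitrary example values).
N, M = 4, 3
X = torch.randn(N, M)                      # feature matrix X in R^{N x M}

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # hypothetical edge list
A = torch.zeros(N, N)
for i, j in edges:
    A[i, j] = 1.0                          # A_ij = 1 if an edge connects node i and node j
    A[j, i] = 1.0                          # symmetric, since the graph is undirected

print(X.shape, A)
```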
Motivation
VGAE learns latent representations of graph data by introducing latent variables and continually optimizing the variational lower bound so that it approximates the log-likelihood of graph data as closely as possible. In VGAE, an approximate posterior distribution $q(Z \mid X, A)$ is introduced into the log-likelihood of graph data. Then, according to Bayes’ Theorem, this process can be expressed as:
$$\log p(X, A) = \mathbb{E}_{q(Z \mid X, A)}\big[\log p(X, A)\big] = \mathbb{E}_{q(Z \mid X, A)}\left[\log \frac{p(X, A \mid Z)\, p(Z)}{p(Z \mid X, A)}\right] \tag{1}$$
The approximate posterior distribution is reintroduced in Eq. (1), and the result is then divided into three parts:

$$\log p(X, A) = \mathbb{E}_{q(Z \mid X, A)}\big[\log p(X, A \mid Z)\big] - \mathrm{KL}\big(q(Z \mid X, A)\,\|\,p(Z)\big) + \mathrm{KL}\big(q(Z \mid X, A)\,\|\,p(Z \mid X, A)\big) \tag{2}$$
The first term is the reconstruction term that evaluates the generation performance. The second term denotes the KL divergence between the approximate posterior distribution and the prior distribution. The third term denotes the KL divergence between the approximate posterior distribution and the true posterior distribution. Since the true posterior distribution is unattainable, VGAE adopts the first two terms as the optimization objective. These two terms together constitute the variational lower bound. The third term denotes the gap between the variational lower bound and the true log-likelihood. Moreover, since the third term is a KL divergence, it is non-negative, ensuring that the true log-likelihood is greater than or equal to the variational lower bound. Ultimately, the optimizable lower bound is defined as:
$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(Z \mid X, A)}\big[\log p(X, A \mid Z)\big] - \mathrm{KL}\big(q(Z \mid X, A)\,\|\,p(Z)\big) \tag{3}$$
In VGAE, a crucial premise for obtaining a tight lower bound is that the introduced approximate posterior distribution must approach the true posterior distribution as closely as possible. However, in VGAE, the approximate posterior distribution is restricted to a rigid, overly simple form. This may cause the approximate posterior distribution to violate the modeling assumptions, leading to a looser lower bound and consequently limiting the model’s performance.
Importance weighted
The data log-likelihood is defined as:
$$\log p(X, A) = \log \mathbb{E}_{q(Z \mid X, A)}\left[\frac{p(X, A, Z)}{q(Z \mid X, A)}\right] \tag{4}$$
To address the limitation of VGAE, we adopt the Monte Carlo method to make the approximate posterior distribution more flexible. Subsequently, a new variational lower bound is obtained by Jensen’s Inequality:

$$\mathcal{L}_k = \mathbb{E}_{Z_1, \dots, Z_k \sim q(Z \mid X, A)}\left[\log \frac{1}{k} \sum_{c=1}^{k} \frac{p(X, A, Z_c)}{q(Z_c \mid X, A)}\right] \tag{5}$$
where $k$ denotes the number of Monte Carlo samples, and $Z_1, \dots, Z_k$ are independently and identically distributed samples drawn from the approximate posterior distribution.
Theorem 1
For any sampling number $k$, the variational lower bound $\mathcal{L}_{k+1}$ obtained with a larger number of samples is greater than or equal to $\mathcal{L}_k$ obtained with a smaller number of samples, and the variational lower bound is always less than or equal to the log-likelihood of graph data:
$$\log p(X, A) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k \tag{6}$$
Theorem 2
As k approaches infinity, meaning that an infinite number of samples is taken, the lower bound will converge to the log-likelihood of graph data:
$$\lim_{k \to \infty} \mathcal{L}_k = \log p(X, A) \tag{7}$$
For detailed proofs of this theory, refer to Appendix B.
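To build intuition for Theorems 1 and 2, the self-contained sketch below estimates the k-sample bound on a toy one-dimensional latent-variable model (a Gaussian prior and likelihood with a deliberately mismatched Gaussian posterior, all chosen purely for illustration and not taken from the paper); the estimated bound should increase with k and approach the exact log-likelihood.

```python
import torch

torch.manual_seed(0)

# Toy model (illustrative only): p(z) = N(0,1), p(x|z) = N(z,1), hence p(x) = N(0,2).
# A deliberately crude approximate posterior q(z|x) = N(0,1) leaves a gap to close.
x = torch.tensor(1.5)
log_px_exact = torch.distributions.Normal(0.0, 2.0 ** 0.5).log_prob(x)

prior = torch.distributions.Normal(0.0, 1.0)
q = torch.distributions.Normal(0.0, 1.0)

def bound(k, n_rep=20000):
    z = q.sample((n_rep, k))                          # k posterior samples per replicate
    log_w = prior.log_prob(z) + torch.distributions.Normal(z, 1.0).log_prob(x) - q.log_prob(z)
    # L_k = E[ log (1/k) sum_c w_c ], estimated by averaging over the replicates
    return (torch.logsumexp(log_w, dim=1) - torch.log(torch.tensor(float(k)))).mean()

for k in [1, 5, 25, 125]:
    print(f"k={k:4d}  L_k ~ {bound(k).item():.4f}   (exact log p(x) = {log_px_exact.item():.4f})")
```

With k = 1 this reduces to the ordinary ELBO; larger k should print values that grow toward the exact log-likelihood, mirroring Theorems 1 and 2.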
After obtaining a theoretically tighter variational lower bound, we proceed to analyze it during backpropagation.
First, the gradient of the variational lower bound given by Eq. (5) is computed, which is a crucial step for optimizing the lower bound:
$$\nabla_\theta \mathcal{L}_k = \nabla_\theta\, \mathbb{E}_{\epsilon_1, \dots, \epsilon_k}\left[\log \frac{1}{k} \sum_{c=1}^{k} w_c\right] = \mathbb{E}_{\epsilon_1, \dots, \epsilon_k}\left[\nabla_\theta \log \frac{1}{k} \sum_{c=1}^{k} w_c\right] \tag{8}$$
where $w_c$ is a substitution for $\frac{p(X, A, Z_c)}{q(Z_c \mid X, A)}$, $\epsilon_c$ is sampled from a multivariate standard Gaussian distribution, and $\theta$ denotes the parameter vector of the neural network. Then, we have:

$$\nabla_\theta \log \frac{1}{k} \sum_{c=1}^{k} w_c = \frac{\sum_{c=1}^{k} \nabla_\theta w_c}{\sum_{j=1}^{k} w_j} \tag{9a}$$
$$\nabla_\theta \log \frac{1}{k} \sum_{c=1}^{k} w_c = \sum_{c=1}^{k} \frac{w_c}{\sum_{j=1}^{k} w_j}\, \nabla_\theta \log w_c \tag{9b}$$
where Eq. (9a) is derived through the differentiation of the logarithmic function, and Eq. (9b) is obtained by substituting $\nabla_\theta w_c = w_c \nabla_\theta \log w_c$. Finally, $\frac{w_c}{\sum_{j=1}^{k} w_j}$ is denoted as $\tilde{w}_c$, and the final gradient of the variational lower bound is obtained as follows:

$$\nabla_\theta \mathcal{L}_k = \mathbb{E}_{\epsilon_1, \dots, \epsilon_k}\left[\sum_{c=1}^{k} \tilde{w}_c\, \nabla_\theta \log w_c\right] \tag{10}$$
Similarly, the final gradient for VGAE is obtained as:

$$\nabla_\theta \mathcal{L}_1 = \mathbb{E}_{\epsilon}\big[\nabla_\theta \log w\big] \tag{11}$$
Comparing the gradients of these two methods, the gradient of the proposed method includes the normalized importance weight $\tilde{w}_c = \frac{w_c}{\sum_{j=1}^{k} w_j}$. When only one sample is taken ($k = 1$), the proposed method degenerates into the corresponding VGAE. For $k > 1$, multiple log-likelihood gradients are assigned weights between 0 and 1 during backpropagation. This approach enables differentiated optimization across different samples, with samples from high-probability regions receiving higher weights, thus emphasizing their importance in the optimization process.
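In practice, Eq. (10) does not need to be implemented by hand: optimizing the k-sample bound of Eq. (5) written with a log-sum-exp already produces the self-normalized weights in the backward pass. The short sketch below checks this numerically with autograd on arbitrary toy log-weights; all tensor values and shapes are illustrative assumptions, not taken from the paper.

```python
import torch

torch.manual_seed(0)
k = 6
theta = torch.randn(k, requires_grad=True)        # stand-in parameters: one "log-weight" per sample

log_w = theta                                     # pretend log w_c depends directly on theta
# k-sample objective: log (1/k) * sum_c exp(log w_c), as in Eq. (5)
L_k = torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(k)))
L_k.backward()

# The gradient produced by autograd equals the normalized importance weights w_c / sum_j w_j,
# matching the weighting of the per-sample gradients in Eq. (10).
w_tilde = torch.softmax(log_w.detach(), dim=0)
print(theta.grad)
print(w_tilde)   # identical up to floating-point error
```

In a real model, log w_c would be the sum of the reconstruction, prior, and approximate-posterior log-density terms of sample c (as detailed in the objective optimization section below), and the same log-sum-exp form applies.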
Network architecture
In this section, we present the network architecture corresponding to the theoretical derivation, including the encoder, decoder, and objective optimization.
Encoder
In the encoder, we explain the process of transforming graph node feature matrices and adjacency matrices into low-dimensional feature representations through graph embedding.
To better obtain and compute multiple latent variables, which are sampled from the approximate posterior distributions, we assume that the latent variables follow a Gaussian distribution. With this assumption, the encoder of each sample c is represented as:
$$q\big(Z^{(c)} \mid X, A\big) = \prod_{i=1}^{N} q\big(z_i^{(c)} \mid X, A\big) \tag{12}$$
$$q\big(z_i^{(c)} \mid X, A\big) = \mathcal{N}\big(z_i^{(c)} \,\big|\, \mu_i, \operatorname{diag}(\sigma_i^2)\big) \tag{13}$$
where $\mu_i$ and $\sigma_i$ represent the mean and standard deviation of the latent distribution of node $i$, respectively. Next, spectral-based GCNs [17] with non-shared parameters are employed to parameterize the mean and standard deviation, facilitating effective feature extraction and information propagation:
$$\mu = \mathrm{GCN}_\mu\big(X, \tilde{A}\big) \tag{14}$$
$$\sigma = \exp\big(\mathrm{GCN}_\sigma\big(X, \tilde{A}\big)\big) \tag{15}$$
where $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ denotes the normalized adjacency matrix, $D$ denotes the degree matrix, $\exp(\cdot)$ denotes the exponential function, and $\mathrm{GCN}(\cdot)$ denotes the GCN layer. When the number of GCN layers is 1, it is defined as $\mathrm{GCN}(X, \tilde{A}) = \tilde{A} X W^{(0)}$, and when the number of GCN layers is 2, it is defined as $\mathrm{GCN}(X, \tilde{A}) = \tilde{A}\, \tau\big(\tilde{A} X W^{(0)}\big) W^{(1)}$, where $W^{(l)}$ denotes the learnable parameter matrix of the $l$-th GCN layer, and $\tau(\cdot)$ represents the nonlinear activation function. However, since the sampling operation cannot be backpropagated, the reparameterization trick [16] is employed:
$$Z^{(c)} = \mu + \sigma \odot \epsilon^{(c)} \tag{16}$$
where $\epsilon^{(c)} \sim \mathcal{N}(0, I)$ and $\odot$ denotes element-wise multiplication. In this case, the multiple approximate posterior distributions are represented as:
$$q\big(Z^{(1)}, \dots, Z^{(k)} \mid X, A\big) = \prod_{c=1}^{k} q\big(Z^{(c)} \mid X, A\big) \tag{17}$$
Ultimately, the encoder transforms graph data into multiple continuous low-dimensional latent variables, facilitating effective representation learning and feature extraction of graph data. These low-dimensional latent variables can be utilized for various downstream graph tasks.
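A minimal sketch of such an encoder, assuming a two-layer GCN over the normalized adjacency matrix and k reparameterized samples as described above, is given below; the layer sizes, activation choice, class and variable names, and the self-loop convention are our own illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One spectral GCN layer: A_norm @ X @ W, with an optional nonlinearity."""
    def __init__(self, in_dim, out_dim, act=None):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.act = act

    def forward(self, X, A_norm):
        H = A_norm @ self.W(X)
        return self.act(H) if self.act is not None else H

class IWVGAEEncoder(nn.Module):
    """Maps (X, A) to k reparameterized latent sample matrices Z^(1..k)."""
    def __init__(self, in_dim, hid_dim, lat_dim):
        super().__init__()
        self.shared = GCNLayer(in_dim, hid_dim, act=torch.relu)
        self.gcn_mu = GCNLayer(hid_dim, lat_dim)          # non-shared output layers
        self.gcn_logsigma = GCNLayer(hid_dim, lat_dim)

    def forward(self, X, A_norm, k):
        H = self.shared(X, A_norm)
        mu = self.gcn_mu(H, A_norm)                       # Eq. (14)
        sigma = torch.exp(self.gcn_logsigma(H, A_norm))   # Eq. (15), via the exponential
        eps = torch.randn(k, *mu.shape)                   # k standard-Gaussian noise draws
        Z = mu.unsqueeze(0) + sigma.unsqueeze(0) * eps    # Eq. (16), reparameterization trick
        return Z, mu, sigma

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}; adding self-loops is a common choice."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
```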
Decoder
After obtaining multiple latent variables through graph embedding and sampling, these latent variables can be applied to reconstruct the graph data, including the adjacency matrix and the feature matrix, and to obtain the reconstruction term. The decoder of each sample c is defined as:
$$p\big(\hat{A}^{(c)} \mid Z^{(c)}\big) = \prod_{i=1}^{N} \prod_{j=1}^{N} p\big(\hat{A}_{ij}^{(c)} \mid z_i^{(c)}, z_j^{(c)}\big) \tag{18}$$
$$p\big(\hat{X}^{(c)} \mid Z^{(c)}\big) = \prod_{i=1}^{N} p\big(\hat{x}_i^{(c)} \mid z_i^{(c)}\big) \tag{19}$$
where $\hat{A}_{ij}^{(c)}$ denotes the element at the $i$-th row and $j$-th column of the adjacency matrix, and $x_i$ and $z_i^{(c)}$ represent the feature vector and the embedding vector of the $i$-th node, respectively. The reconstruction of each edge and each node feature is defined as:

$$p\big(\hat{A}_{ij}^{(c)} = 1 \mid z_i^{(c)}, z_j^{(c)}\big) = \mathrm{sigmoid}\big(z_i^{(c)\top} z_j^{(c)}\big) \tag{20}$$
$$\hat{x}_i^{(c)} = \mathrm{MLP}\big(z_i^{(c)}\big) \tag{21}$$
where $\mathrm{sigmoid}(\cdot)$ denotes the sigmoid non-linear activation function, MLP denotes a multilayer perceptron, and linear denotes a linear activation function. The adjacency matrix decoder (Eq. 20) reconstructs the discrete adjacency matrix by computing the inner product between latent representations, followed by a nonlinear activation function. In contrast, the feature matrix decoder (Eq. 21) consists of a single-layer MLP, which is composed of a linear layer followed by an activation function. This linear layer maps each latent representation to an $M$-dimensional feature space. The choice of activation function depends on whether the original feature values are discrete or continuous.
Ultimately, the multiple reconstructed adjacency matrices and feature matrices from various latent variables are represented as:
22
23
24
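A minimal sketch of the two decoders described above, operating on the k sampled latent matrices produced by the encoder sketch, could look as follows; the tensor shapes, class name, and the discrete-versus-continuous switch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IWVGAEDecoder(nn.Module):
    """Inner-product decoder for edges plus a single-layer MLP decoder for node features."""
    def __init__(self, lat_dim, feat_dim, discrete_features=True):
        super().__init__()
        self.feat_layer = nn.Linear(lat_dim, feat_dim)   # maps each latent vector to M dimensions
        self.discrete_features = discrete_features

    def forward(self, Z):
        # Z: (k, N, lat_dim) -- one latent matrix per Monte Carlo sample.
        A_logits = Z @ Z.transpose(1, 2)                 # pairwise inner products, Eq. (20) before the sigmoid
        A_hat = torch.sigmoid(A_logits)                  # edge probabilities, one N x N matrix per sample
        X_out = self.feat_layer(Z)                       # linear layer of the feature decoder, Eq. (21)
        # Activation depends on whether the original features are discrete or continuous.
        X_hat = torch.sigmoid(X_out) if self.discrete_features else X_out
        return A_hat, X_hat
```

For training with a binary cross-entropy loss it is often numerically preferable to keep the raw logits and apply the sigmoid inside the loss, but the probability form above matches the equations as written.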
Objective optimization
The variational lower bound of the proposed IWVGAE does not follow the traditional KL divergence definition when the sampling count $k > 1$. Therefore, unlike VGAE, in IWVGAE, the gradient of the variational lower bound is obtained by calculating each part separately:
$$\log w_c = \log p\big(X, A \mid Z^{(c)}\big) + \log p\big(Z^{(c)}\big) - \log q\big(Z^{(c)} \mid X, A\big) \tag{25}$$
The first term is the reconstruction term, which measures the difference between the original data and the reconstructed data. The reconstruction term of each sample c is calculated by:

$$\log p\big(X, A \mid Z^{(c)}\big) = \log p\big(A \mid Z^{(c)}\big) + \log p\big(X \mid Z^{(c)}\big) \tag{26}$$
$$\log p\big(A \mid Z^{(c)}\big) = -\,\mathrm{BCE}\big(\hat{A}^{(c)}, A\big) \tag{27}$$
$$\log p\big(X \mid Z^{(c)}\big) = -\,\mathrm{MSE}\big(\hat{X}^{(c)}, X\big) \tag{28}$$
where BCE denotes the binary cross-entropy loss and MSE denotes the mean squared error loss. The second term is the prior distribution of each sample c, which follows a standard multivariate Gaussian distribution. The third term is the approximate posterior distribution of each sample c, which follows a multivariate Gaussian distribution. Thus, the multivariate Gaussian density function is calculated by:
$$\mathcal{N}\big(z \mid \mu, \Sigma\big) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(z - \mu)^{\top} \Sigma^{-1} (z - \mu)\Big) \tag{29}$$

$$p\big(Z^{(c)}\big) = \prod_{i=1}^{N} \mathcal{N}\big(z_i^{(c)} \,\big|\, 0, I\big) \tag{30}$$

$$q\big(Z^{(c)} \mid X, A\big) = \prod_{i=1}^{N} \mathcal{N}\big(z_i^{(c)} \,\big|\, \mu_i, \operatorname{diag}(\sigma_i^2)\big) \tag{31}$$
where $d$ denotes the vector dimension of the latent variables, and $\Sigma$ is the covariance matrix.

[See PDF for image]
Algorithm 1
The pseudocode of IWVGAE
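Since Algorithm 1 is only available as an image, the following is a minimal sketch of how one evaluation of the importance-weighted objective could be assembled from the encoder and decoder sketches above, assuming Gaussian prior/posterior terms and BCE/MSE reconstruction terms as in Eqs. (25)–(31); it is an illustration under those assumptions, not the authors' reference implementation.

```python
import math
import torch
import torch.nn.functional as F

def iwvgae_loss(X, A, A_norm, encoder, decoder, k):
    """Negative k-sample importance-weighted bound (Eq. 5), computed with a stable log-sum-exp."""
    Z, mu, sigma = encoder(X, A_norm, k)             # Z: (k, N, d) latent samples
    A_hat, X_hat = decoder(Z)                        # (k, N, N) edge probabilities, (k, N, M) features

    # Reconstruction terms per sample c (Eqs. 26-28): BCE over edges, squared error over features.
    log_pA = -F.binary_cross_entropy(A_hat, A.expand_as(A_hat), reduction="none").sum(dim=(1, 2))
    log_pX = -((X_hat - X) ** 2).sum(dim=(1, 2))

    # Gaussian prior and approximate posterior log-densities of each sample (Eqs. 29-31).
    prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(sigma))
    post = torch.distributions.Normal(mu, sigma)
    log_pZ = prior.log_prob(Z).sum(dim=(1, 2))
    log_qZ = post.log_prob(Z).sum(dim=(1, 2))

    # log w_c = log p(X, A | Z_c) + log p(Z_c) - log q(Z_c | X, A)   (Eq. 25)
    log_w = log_pA + log_pX + log_pZ - log_qZ        # shape (k,)

    # L_k = log (1/k) sum_c w_c; autograd then applies the normalized weights of Eq. (10).
    L_k = torch.logsumexp(log_w, dim=0) - math.log(k)
    return -L_k                                      # minimize the negative bound
```

A typical training step would then call `iwvgae_loss(...).backward()` followed by an Adam update, and setting k = 1 recovers a VGAE-style objective.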
Complexity analysis
In this section, we analyze the space and time complexity of the theoretically derived variational lower bound for the proposed IWVGAE model. Due to the theoretical requirement of preserving the global graph structure, the entire graph must be processed during each training iteration. Assume a graph with N nodes, each having M-dimensional features. Each node is encoded into a latent representation of dimension d, and k Monte Carlo samples are drawn per node.
Space Complexity. During training, the model generates k Monte Carlo samples of d-dimensional latent variables for each of the N nodes, resulting in a total memory requirement of $O(Nkd)$. This term dominates the overall space complexity, as these sampled latent representations must be retained for gradient backpropagation when optimizing the variational objective.
Time Complexity. The time complexity of the loss function in IWVGAE is primarily determined by four components. First, for each of the N nodes, the model samples k latent representations of dimension d, resulting in a cost of $O(Nkd)$. Second, the adjacency matrix reconstruction employs a dot-product decoder to compute pairwise similarity scores between all node pairs for each sample, yielding a complexity of $O(kN^2d)$. Third, the feature matrix is reconstructed based on the sampled latent variables, which incurs a computational cost of $O(kNdM)$. Finally, the computation of the prior and approximate posterior distributions involves element-wise operations over the latent variables, contributing an additional $O(Nkd)$. Overall, the time complexity is dominated by the two most expensive components: $O(kN^2d + kNdM)$.
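As a rough back-of-the-envelope aid, the helper below plugs concrete values into these two dominant terms; the example sizes are arbitrary (not dataset statistics from the paper) and only meant to show that the edge-decoder term grows quadratically in N and linearly in k.

```python
def dominant_cost(N, M, d, k):
    """Operation-count estimates for the two dominant terms of the IWVGAE loss."""
    edge_decoder = k * N * N * d      # pairwise inner products for k reconstructed adjacency matrices
    feat_decoder = k * N * d * M      # linear map from d to M dimensions for k feature matrices
    return edge_decoder, feat_decoder

# Arbitrary illustrative sizes.
for k in (1, 10, 100):
    e, f = dominant_cost(N=3000, M=1500, d=16, k=k)
    print(f"k={k:3d}: edge-decoder ~ {e:.2e} ops, feature-decoder ~ {f:.2e} ops")
```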
In summary, although IWVGAE improves the optimization of the network by providing a tighter variational bound and enhanced generative performance, this advantage comes at the cost of increased time and space complexity.
Experiments
Benchmark datasets
To evaluate the performance of the proposed IWVGAE, we conduct experiments on seven commonly used graph datasets, including three citation networks (Cora, Citeseer, and ACM) [24, 51], three air-traffic networks (BAT, EAT, and UAT) [38], and the Amazon Photo network (AMAP) [40]. Brief descriptions of these datasets are provided in Table 6 of Appendix A.
Baseline models and evaluation metrics
To validate the effectiveness of the proposed IWVGAE, three downstream tasks, namely link prediction, graph clustering, and visualization, are conducted in the experiments, as described below.
Link prediction
Link prediction aims to determine whether a connection exists between nodes in an incomplete graph. The typical evaluation metrics include
Area Under the ROC Curve (AUC) and Average Precision (AP) [19, 58]. The comparison methods include GAE-based methods (GAE [18], ARGE [36], RWR-GAE [10], DGAE [20], 72-DGAE [49], and PS2 [45]), VGAE-based methods (VGAE [18], ARVGE [36], RWR-VGAE [10], and DGVAE [20]), and state-of-the-art link prediction methods (SIEG [42] and PULL [15]).
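AUC and AP are typically computed from held-out positive edges and an equal number of sampled negative (non-existent) edges; a minimal sketch using scikit-learn, with made-up example scores, is shown below.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: 1 for held-out real edges, 0 for sampled negative edges.
# y_score: predicted edge probabilities, e.g. sigmoid(z_i . z_j) from the decoder.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.91, 0.78, 0.55, 0.40, 0.62, 0.08]     # illustrative values only

print("AUC:", roc_auc_score(y_true, y_score))
print("AP :", average_precision_score(y_true, y_score))
```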
Graph clustering
Graph clustering aims to classify nodes based on their similarity. The evaluation metrics used are Clustering Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and F-Score (F1) [22, 54]. The comparison methods include VGAE-based methods (VGAE [18], ARVGE [36], DGVAE [20], FT-VGAE [32], and DyFSS [57]) and graph contrastive learning methods (GDCL [55], NCLA [41], and E2Neg [11]).
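NMI and ARI are available directly in scikit-learn, while clustering accuracy is conventionally computed with a Hungarian matching between predicted clusters and ground-truth labels; a minimal sketch of that convention (with made-up labels) is given below.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best cluster-to-label matching via the Hungarian algorithm (the usual ACC definition)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_cls = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((n_cls, n_cls), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                      # rows: predicted clusters, cols: true labels
    row, col = linear_sum_assignment(count.max() - count)   # maximize matched counts
    return count[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                   # illustrative cluster assignments
print(clustering_accuracy(y_true, y_pred))
print(normalized_mutual_info_score(y_true, y_pred), adjusted_rand_score(y_true, y_pred))
```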
Visualization
Visualization aims to intuitively demonstrate the effectiveness of learning latent representations. We adopt two visualization methods for the learned latent representations: (1) scatter plots generated by t-SNE [29], and (2) heatmaps of Node Similarity Matrices [26]. These methods are used to assess the quality of the learned latent representations. We compare the proposed IWVGAE with several classic VGAE-based methods (VGAE [18], DGVAE [20] and FT-VGAE [32]) (Table 1).
Table 1. Performance comparisons on five datasets for graph clustering
Dataset | Metric | GDCL | NCLA | E2Neg | VGAE | ARVGE | DGVAE | GMM-VGAE | FT-VGAE | DyFSS | IWVGAE(1) | IWVGAE(M-S) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
(IJCAI’21) | (AAAI’23) | (AAAI’25) | (NIPS’16) | (IJCAI’18) | (NIPS’20) | (AAAI’20) | (IJCAI’22) | (AAAI’24) | (Ours) | (Ours) | ||
Cora | ACC | 70.83 ± 0.47 | 51.09 ± 1.25 | 49.65 ± 2.02 | 65.81 ± 2.03 | 67.79 ± 1.37 | 66.61 ± 3.84 | 68.45 ± 3.43 | 73.48 ± 0.37 | 69.36 ± 2.94 | ||
NMI | 31.80 ± 0.78 | 30.95 ± 2.01 | 47.71 ± 2.53 | 49.97 ± 1.86 | 48.22 ± 2.43 | 49.48 ± 2.20 | 55.41 ± 0.50 | 53.83 ± 1.57 | 54.98 ± 1.18 | |||
ARI | 48.05 ± 0.72 | 36.66 ± 1.65 | 19.76 ± 2.68 | 41.36 ± 3.16 | 44.10 ± 2.18 | 41.98 ± 4.56 | 45.20 ± 3.35 | 53.28 ± 0.74 | 47.14 ± 3.27 | |||
F1 | 52.88 ± 0.97 | 51.12 ± 1.12 | 41.43 ± 3.68 | 64.51 ± 2.42 | 64.85 ± 1.93 | 65.78 ± 3.49 | 66.44 ± 4.73 | 71.26 ± 0.32 | 64.48 ± 3.27 | |||
Citeseer | ACC | 66.39 ± 0.65 | 59.23 ± 2.32 | 68.12 ± 1.24 | 51.75 ± 1.75 | 54.81 ± 3.23 | 49.72 ± 2.65 | 54.35 ± 4.98 | 58.13 ± 0.98 | 66.96 ± 0.87 | ||
NMI | 39.52 ± 0.38 | 36.68 ± 0.89 | 41.83 ± 1.09 | 24.89 ± 1.68 | 29.31 ± 2.05 | 24.95 ± 2.25 | 26.93 ± 4.08 | 31.26 ± 0.86 | 39.73 ± 0.78 | |||
ARI | 41.07 ± 0.96 | 33.37 ± 0.53 | 41.38 ± 1.90 | 22.45 ± 1.84 | 25.50 ± 3.60 | 20.90 ± 3.20 | 25.59 ± 4.71 | 29.80 ± 0.79 | 41.32 ± 0.76 | |||
F1 | 61.12 ± 0.70 | 52.67 ± 0.64 | 59.65 ± 0.98 | 49.63 ± 1.67 | 50.89 ± 4.06 | 47.61 ± 2.76 | 49.08 ± 3.98 | 54.70 ± 1.25 | 60.68 ± 0.29 | |||
Bat | ACC | 45.42 ± 0.54 | 47.48 ± 0.64 | 50.84 ± 2.05 | 62.67 ± 2.91 | 67.48 ± 0.74 | 64.35 ± 2.60 | 64.73 ± 1.29 | – | 66.03 ± 1.58 | ||
NMI | 31.70 ± 0.42 | 24.36 ± 1.54 | 22.42 ± 3.29 | 34.78 ± 4.11 | 39.58 ± 3.95 | 42.63 ± 2.39 | 36.60 ± 1.86 | – | 39.22 ± 2.79 | |||
ARI | 19.33 ± 0.57 | 24.14 ± 0.98 | 19.51 ± 2.62 | 29.43 ± 4.20 | 35.97 ± 0.94 | 34.78 ± 3.70 | 33.14 ± 1.94 | – | 35.14 ± 3.73 | |||
F1 | 39.94 ± 0.57 | 42.25 ± 0.34 | 44.68 ± 4.98 | 61.84 ± 2.88 | 67.53 ± 0.77 | 62.81 ± 2.45 | 63.33 ± 1.66 | – | 64.96 ± 1.56 | |||
Eat | ACC | 33.46 ± 0.18 | 36.06 ± 1.24 | 46.09 ± 0.59 | 50.86 ± 0.63 | 48.27 ± 1.24 | 49.87 ± 0.96 | 53.11 ± 1.74 | – | 50.23 ± 0.68 | ||
NMI | 13.22 ± 0.33 | 21.46 ± 1.80 | 21.10 ± 2.09 | 21.26 ± 1.08 | 21.82 ± 0.72 | 25.18 ± 1.37 | – | 20.89 ± 1.47 | 23.07 ± 1.60 | |||
ARI | 04.31 ± 0.29 | 18.90 ± 1.45 | 15.10 ± 0.72 | 18.82 ± 1.39 | 15.07 ± 0.62 | 18.51 ± 1.31 | – | 15.60 ± 1.10 | 17.13 ± 1.23 | |||
F1 | 25.02 ± 0.21 | 31.25 ± 0.96 | 34.67 ± 3.52 | 51.48 ± 0.70 | 48.06 ± 2.38 | 50.70 ± 0.92 | 52.78 ± 2.29 | – | 50.55 ± 1.10 | |||
Uat | ACC | 48.70 ± 0.06 | 45.38 ± 1.15 | 46.31 ± 1.53 | 48.47 ± 0.40 | 48.04 ± 0.76 | 48.76 ± 1.60 | 49.55 ± 0.82 | 51.41 ± 0.50 | 49.55 ± 0.46 | ||
NMI | 25.10 ± 0.01 | 24.49 ± 0.57 | 17.36 ± 1.40 | 22.78 ± 2.58 | 24.25 ± 1.17 | 22.95 ± 1.43 | 21.65 ± 0.65 | 20.56 ± 1.13 | 22.07 ± 0.45 | |||
ARI | 21.76 ± 0.01 | 21.34 ± 0.78 | 13.86 ± 2.29 | 16.60 ± 1.01 | 16.53 ± 1.42 | 16.42 ± 1.20 | 10.84 ± 1.85 | 19.18 ± 0.66 | 17.81 ± 0.31 | |||
F1 | 45.69 ± 0.08 | 30.56 ± 0.25 | 42.11 ± 1.65 | 49.07 ± 0.65 | 48.24 ± 1.11 | 49.43 ± 1.20 | 49.34 ± 0.76 | 51.49 ± 0.57 | 50.59 ± 0.53 | |||
ACM | ACC | – | – | 81.51 ± 4.36 | 83.60 ± 2.26 | 44.47 ± 1.94 | 79.08 ± 6.32 | 82.40 ± 0.30 | 79.26 ± 1.05 | 72.13 ± 7.14 | 87.28 ± 0.46 | |
NMI | – | – | 51.13 ± 7.56 | 50.95 ± 3.83 | 6.68 ± 1.47 | 46.31 ± 7.98 | 49.53 ± 0.57 | 49.13 ± 1.40 | 46.39 ± 6.44 | |||
ARI | – | – | 54.33 ± 8.30 | 57.26 ± 4.74 | 4.85 ± 1.02 | 49.14 ± 11.55 | 54.94 ± 0.65 | 49.51 ± 1.92 | 45.17 ± 8.34 | |||
F1 | – | – | 81.39 ± 4.50 | 83.60 ± 2.27 | 43.91 ± 2.29 | 79.17 ± 6.27 | 82.21 ± 0.31 | 79.41 ± 1.07 | 71.55 ± 7.99 | |||
AMAP | ACC | 43.75 ± 0.78 | 67.18 ± 0.75 | 50.03 ± 1.90 | 71.16 ± 2.14 | 44.53 ± 3.56 | 21.09 ± 3.68 | 28.27 ± 0.21 | – | 68.64 ± 0.17 | ||
NMI | 37.32 ± 0.28 | 63.63 ± 1.07 | 36.25 ± 1.80 | 59.29 ± 2.10 | 35.57 ± 5.23 | 03.26 ± 1.20 | 04.36 ± 0.19 | – | 43.16 ± 0.27 | |||
ARI | 21.57 ± 0.51 | 46.30 ± 1.59 | 20.57 ± 2.47 | 48.29 ± 3.64 | 22.11 ± 3.88 | 01.48 ± 0.86 | 01.16 ± 0.22 | – | 44.63 ± 0.25 | |||
F1 | 38.37 ± 0.29 | 47.90 ± 1.94 | 69.53 ± 2.36 | 38.51 ± 5.32 | 14.80 ± 2.67 | 14.53 ± 0.17 | – | 63.91 ± 0.19 | 71.30 ± 1.45 |
The reported results were obtained by calculating the mean and standard deviation of 10 test-runs. Bold indicates the best performance, and underlining indicates the second-best performance. IWVGAE(1) denotes the results obtained with a single sampling count. IWVGAE(M-S) denotes the results corresponding to the maximum sampling count allowed by each dataset. “–” denotes that no corresponding public result is available or that the provided code failed to produce the result
Experiment setup
For link prediction, the graph data is split into fixed-size training, validation, and test sets with a ratio of 0.85:0.05:0.1. For graph clustering, the entire dataset is used. For both tasks, we divide the experiments into groups by Monte Carlo sampling count, set to 1, 5, 10, 20, 30, 50, 75, and 100. To further assess the effectiveness of achieving a tighter variational lower bound, we extend the number of Monte Carlo samples to 100,000 and 10,000 for the BAT and EAT datasets, respectively. In each group, different sampling counts use the same data initialization. The final metrics are obtained by calculating the mean and standard deviation of 10 test-runs. Three types of results are reported: log-likelihood values, link prediction results, and clustering results (Table 2). In accordance with the proposed theoretical considerations, training is performed on the full graph to preserve its structural integrity. Consequently, the batch size is set to the number of nodes N. The constructed network is optimized using the Adam optimizer. All experiments are conducted on an AMD 3960X 24-core CPU and an NVIDIA RTX A6000 48GB GPU. Detailed hyperparameter settings can be found in Table 7 in Appendix A (Table 3).
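For the link-prediction split, a plain sketch of the 0.85:0.05:0.1 edge partition is shown below; it is written without any specific graph library, so the function and variable names are our own, and in practice negative (non-existent) edges are also sampled for the validation and test sets so that AUC and AP can be computed.

```python
import torch

def split_edges(edge_index, train_frac=0.85, val_frac=0.05, seed=0):
    """Randomly partition undirected edges (given once per pair, shape (2, E)) into train/val/test."""
    E = edge_index.size(1)
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(E, generator=g)
    n_train = int(train_frac * E)
    n_val = int(val_frac * E)
    train = edge_index[:, perm[:n_train]]
    val = edge_index[:, perm[n_train:n_train + n_val]]
    test = edge_index[:, perm[n_train + n_val:]]          # remaining ~10% of edges
    return train, val, test

# Example with a hypothetical edge list of 10 edges.
edge_index = torch.randint(0, 20, (2, 10))
train_e, val_e, test_e = split_edges(edge_index)
print(train_e.size(1), val_e.size(1), test_e.size(1))
```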
Table 2. Performance comparisons of graph clustering across various sampling counts
Dataset | Metric | 1 | 5 | 10 | 20 | 30 | 50 | 75 | 100 |
|---|---|---|---|---|---|---|---|---|---|
Cora | ACC | 75.36 ± 1.09 | 76.38 ± 0.79 | 76.75 ± 0.45 | 76.99 ± 0.55 | 77.24 ± 0.73 | 77.24 ± 0.53 | ||
NMI | 54.98 ± 1.18 | 55.70 ± 0.76 | 55.77 ± 0.73 | 55.98 ± 0.87 | 56.43 ± 1.12 | 56.24 ± 0.82 | |||
ARI | 53.40 ± 1.88 | 55.19 ± 0.73 | 55.26 ± 0.73 | 55.69 ± 0.81 | 56.25 ± 0.93 | 56.13 ± 1.72 | |||
F1 | 73.88 ± 1.12 | 74.60 ± 0.87 | 75.12 ± 0.39 | 75.31 ± 0.62 | 75.59 ± 0.87 | 75.41 ± 0.40 | |||
Citeseer | ACC | 66.96 ± 0.87 | 68.70 ± 1.13 | 68.82 ± 1.13 | 69.76 ± 0.42 | 69.61 ± 0.51 | 69.84 ± 0.50 | ||
NMI | 39.73 ± 0.78 | 40.69 ± 1.24 | 40.87 ± 1.06 | 41.76 ± 0.60 | 41.54 ± 0.62 | 41.76 ± 0.74 | |||
ARI | 41.32 ± 0.76 | 42.52 ± 1.77 | 42.98 ± 1.36 | 44.17 ± 0.69 | 43.95 ± 0.86 | 44.25 ± 0.88 | |||
F1 | 60.04 ± 1.28 | 60.59 ± 1.42 | 60.43 ± 0.47 | 60.22 ± 0.45 | 60.46 ± 0.49 | 59.31 ± 0.20 | |||
BAT | ACC | 66.03 ± 1.58 | 67.33 ± 2.12 | 68.40 ± 1.66 | 69.01 ± 1.31 | 69.01 ± 1.94 | 70.15 ± 1.63 | ||
NMI | 39.22 ± 2.79 | 38.80 ± 2.63 | 40.63 ± 3.75 | 41.48 ± 3.15 | 40.71 ± 3.53 | 40.87 ± 2.38 | |||
ARI | 35.14 ± 3.73 | 35.06 ± 2.48 | 36.82 ± 4.06 | 38.19 ± 2.38 | 38.45 ± 2.62 | 38.22 ± 2.42 | |||
F1 | 64.96 ± 1.56 | 66.89 ± 2.68 | 67.62 ± 1.48 | 68.07 ± 1.86 | 68.14 ± 1.93 | 69.60 ± 2.17 | |||
EAT | ACC | 50.23 ± 0.68 | 51.33 ± 0.59 | 51.73 ± 0.92 | 52.43 ± 0.48 | 52.68 ± 0.51 | 52.81 ± 0.60 | ||
NMI | 20.89 ± 1.47 | 21.34 ± 1.76 | 21.66 ± 1.36 | 22.15 ± 1.53 | 22.49 ± 1.83 | 22.79 ± 1.44 | |||
ARI | 15.60 ± 1.10 | 16.07 ± 1.48 | 16.36 ± 0.64 | 16.72 ± 1.03 | 17.15 ± 0.85 | 17.13 ± 1.23 | |||
F1 | 50.55 ± 1.10 | 51.67 ± 0.74 | 52.22 ± 0.95 | 52.89 ± 0.44 | 53.21 ± 0.47 | 53.16 ± 0.78 | |||
UAT | ACC | 56.57 ± 0.52 | 57.11 ± 0.29 | 57.03 ± 0.64 | 56.96 ± 0.66 | 57.47 ± 0.97 | 57.46 ± 0.74 | ||
NMI | 25.14 ± 0.98 | 26.11 ± 0.91 | 26.28 ± 0.71 | 25.73 ± 1.21 | 26.51 ± 1.23 | 26.56 ± 0.54 | |||
ARI | 23.76 ± 0.99 | 24.86 ± 0.82 | 24.56 ± 0.60 | 24.25 ± 1.34 | 24.60 ± 1.90 | 24.92 ± 0.87 | |||
F1 | 55.36 ± 0.49 | 55.63 ± 0.31 | 55.71 ± 0.79 | 55.51 ± 0.93 | 56.31 ± 0.74 | 55.98 ± 1.03 | |||
ACM | ACC | 87.28 ± 0.46 | 87.38 ± 0.42 | 87.57 ± 0.58 | 87.59 ± 0.44 | 87.55 ± 0.44 | 87.67 ± 0.44 | ||
NMI | 58.78 ± 1.22 | 59.08 ± 1.04 | 59.68 ± 1.39 | 59.65 ± 1.10 | 59.54 ± 1.20 | 59.91 ± 1.05 | |||
ARI | 65.63 ± 1.14 | 65.88 ± 1.03 | 66.42 ± 1.49 | 66.46 ± 1.09 | 66.29 ± 1.10 | 66.64 ± 1.11 | |||
F1 | 87.30 ± 0.45 | 87.40 ± 0.41 | 87.57 ± 0.56 | 87.60 ± 0.43 | 87.56 ± 0.43 | 87.68 ± 0.42 | |||
AMAP | ACC | 77.67 ± 1.22 | 78.21 ± 0.66 | 78.18 ± 0.55 | OOM | OOM | OOM | ||
NMI | 67.33 ± 0.96 | 67.49 ± 1.27 | 67.41 ± 0.93 | OOM | OOM | OOM | |||
ARI | 59.51 ± 1.41 | 60.05 ± 1.17 | 59.78 ± 1.05 | OOM | OOM | OOM | |||
F1 | 71.30 ± 1.45 | 71.91 ± 0.79 | 71.87 ± 0.55 | OOM | OOM | OOM |
Bold indicates the best performance, underlining indicates the second-best performance, and OOM indicates that an out-of-memory error occurred during execution
Table 3. Performance comparisons on two datasets for link prediction
Model | Cora | Citeseer | ||
|---|---|---|---|---|
AUC | AP | AUC | AP | |
GAE (NIPS’16) | 91.28 ± 0.89 | 92.52 ± 0.69 | 90.01 ± 1.01 | 90.66 ± 1.18 |
VGAE (NIPS’16) | 90.96 ± 0.48 | 92.16 ± 0.51 | 90.46 ± 0.67 | 92.10 ± 0.65 |
ARGE (IJCAI’19) | 92.10 ± 1.18 | 93.25 ± 0.93 | 90.43 ± 0.73 | 92.04 ± 0.83 |
ARVGE (IJCAI’19) | 91.90 ± 0.69 | 92.67 ± 0.73 | 89.77 ± 0.62 | 91.05 ± 0.67 |
RWR-GAE(arxiv’19) | 91.85 ± 0.63 | 92.95 ± 0.48 | 90.57 ± 1.06 | 91.33 ± 1.12 |
RWR-VGAE(arxiv’19) | 92.05 ± 0.46 | 93.07 ± 0.45 | 91.58 ± 0.53 | 92.76 ± 0.46 |
DGAE(NIPS’20) | 91.21 ± 1.04 | 92.48 ± 0.79 | 88.53 ± 2.35 | 88.77 ± 2.56 |
DGVAE(NIPS’20) | 91.21 ± 1.06 | 92.67 ± 0.73 | 90.44 ± 1.22 | 92.07 ± 1.12 |
72-DGAE(IJCAI’22) | 93.53 ± 0.85 | 94.45 ± 0.78 | 94.17 ± 0.69 | |
SEAL-PS2(WSDM’23) | 92.31 ± 0.31 | 93.53 ± 0.22 | 90.29 ± 0.36 | 92.42 ± 0.20 |
GAE-PS2(WSDM’23) | 92.25 ± 0.71 | 93.60 ± 0.25 | 92.16 ± 0.19 | 93.07 ± 0.06 |
GraphSage-PS2(WSDM’23) | 93.35 ± 0.25 | 93.46 ± 1.20 | 93.62 ± 1.10 | |
SIEG(AAAI’24) | – | 92.10 ± 0.44 | – | |
PULL(AAAI’25) | 82.69 ± 0.07 | 85.40 ± 0.04 | 82.21 ± 0.13 | 84.74 ± 0.08 |
IWVGAE(1)(Ours) | 92.62 ± 0.78 | 93.90 ± 0.68 | 93.53 ± 0.49 | 94.33 ± 0.27 |
IWVGAE(5)(Ours) | 93.29 ± 0.65 | 94.37 ± 0.64 | 93.80 ± 0.68 | 94.54 ± 0.51 |
IWVGAE(30)(Ours) | 93.45 ± 0.72 | 94.53 ± 0.57 | 93.90 ± 0.57 | 94.61 ± 0.50 |
IWVGAE(50)(Ours) | 93.55 ± 0.68 | 94.85 ± 0.46 | ||
IWVGAE(100)(Ours) | 93.58 ± 0.72 | |||
Bold indicates the best performance, and underlining indicates the second-best performance. IWVGAE(k) denotes the results with k sampling counts. “–” indicates that no corresponding public result is available or that the provided code failed to produce the result
[See PDF for image]
Fig. 2
The variation curve in the variational lower bound (i.e., the log-likelihood) with increasing sampling counts on five datasets: a, b correspond to link prediction, and the others correspond to graph clustering. The x-axis represents the number of samples, and the y-axis denotes the log-likelihood
Comparisons of different sampling counts
In this section, we validate the effectiveness of the proposed theory in three aspects: (1) the variational lower bound, (2) density estimation accuracy, and (3) data representation capability. We report the results on each of these aspects.
The variational lower bound
According to Theorem 1, the variational lower bound should increase with the number of samples. Two distinct training methods are adopted to fully validate the applicability of this principle under two different tasks: link prediction and graph clustering. In the former, training is conducted with partial data masking. In the latter, the entire dataset is used for training. The variation curve of the variational lower bound is shown in Fig. 2. For all datasets in both tasks, it is observed from Fig. 2 that as the number of samples increases, the variational lower bound continuously increases, thereby validating the proposed Theorem 1. More specifically, as the number of samples increases, the magnitude of the rise in the variational lower bound progressively decreases. This phenomenon is particularly evident in the experiments conducted on the BAT and EAT datasets. We evaluated the changes in the log-likelihood for sample sizes ranging from 1 to 100,000 on BAT, and up to 10,000 on EAT. The numerical results are presented in Table 4, and the corresponding trends are illustrated in Fig. 2g and h. Within the first 100 samples, the log-likelihood improves significantly as the number of samples increases. However, as the number of samples increases substantially—up to 100,000 for BAT and 10,000 for EAT—the rate of improvement gradually diminishes, exhibiting a clear trend toward convergence. Therefore, according to Theorem 2, it is inferred that the variational lower bound is expected to converge to the true data distribution as the number of samples approaches infinity. These results demonstrate the effectiveness and stability of the proposed theory.
Table 4. The log-likelihood varies with the number of sampling counts on the BAT and EAT datasets
Sample | BAT | EAT |
|---|---|---|
1 | 0.90053 | 0.87908 |
5 | 0.87336 | 0.86861 |
10 | 0.86580 | 0.86407 |
20 | 0.86089 | 0.85315 |
30 | 0.85446 | 0.85296 |
50 | 0.84246 | 0.85281 |
75 | 0.83682 | 0.85271 |
100 | 0.83230 | 0.85257 |
1000 | 0.79925 | 0.84223 |
10,000 | 0.78828 | 0.83641 |
50,000 | 0.76577 | OOM |
100,000 | 0.76322 | OOM |
Performance
The accuracy of graph density estimation is reflected in the generation performance on link prediction, and the data representation capability is demonstrated through the clustering performance of the latent variables. Figure 3 presents performance comparisons of link prediction and graph clustering with different sampling counts. The detailed results for different sampling counts are provided in Tables 5 and 2.
[See PDF for image]
Fig. 3
Performance comparisons of link prediction and graph clustering with different sampling counts: a, b correspond to link prediction, while the others correspond to graph clustering. In a, b, the x-axis denotes the number of samples, and the y-axis denotes the evaluation metric. In c–g, the x-axis (e.g., ACC, NMI) denotes the evaluation metrics, while the y-axis (e.g., 1, 5, 10) denotes the number of samples. To more clearly illustrate the magnitude of variation in graph clustering performance, the differences between the obtained metric values and their respective minimums are visualized along the z-axis
Table 5. Performance comparisons of link prediction across various sampling counts
Dataset | Metric | 1 | 5 | 10 | 20 | 30 | 50 | 75 | 100 |
|---|---|---|---|---|---|---|---|---|---|
Cora | AUC | 92.62 ± 0.78 | 93.29 ± 0.65 | 93.26 ± 0.76 | 93.33 ± 0.70 | 93.45 ± 0.72 | 93.55 ± 0.68 | 93.61 ± 0.66 | |
AP | 93.90 ± 0.68 | 94.37 ± 0.64 | 94.37 ± 0.57 | 94.46 ± 0.55 | 94.53 ± 0.57 | 94.53 ± 0.63 | 94.60 ± 0.54 | ||
Citeseer | AUC | 93.53 ± 0.49 | 93.80 ± 0.68 | 93.78 ± 0.38 | 93.92 ± 0.38 | 93.90 ± 0.57 | 94.07 ± 0.38 | 94.40 ± 0.22 | |
AP | 94.33 ± 0.27 | 94.54 ± 0.51 | 94.59 ± 0.37 | 94.70 ± 0.29 | 94.61 ± 0.50 | 94.82 ± 0.38 | 95.06 ± 0.25 |
Bold indicates the best performance, and underlining indicates the second-best performance
The results in Fig. 3 show that, overall, the metrics for both tasks increase significantly with increasing sampling counts. This indicates that when the approximate posterior distribution becomes more flexible, a tighter variational lower bound can be obtained. In this case, the model’s density estimation accuracy and data representation improve substantially.
However, the overall magnitude of improvement tends to diminish as the sampling count continues to rise. Taking the Cora clustering results in Table 2 as an example, when the sampling count increases from 1 to 30, the clustering performance improves notably over the single-sample values of 75.36 (ACC), 54.98 (NMI), 53.40 (ARI), and 73.88 (F1). In contrast, when the sampling count further increases from 30 to 100, the improvement over the 30-sample values of 77.28 (ACC), 56.52 (NMI), 56.25 (ARI), and 75.59 (F1) becomes more marginal. Therefore, it is important to select the sampling count reasonably to balance model performance and computational efficiency.
Furthermore, as shown in Fig. 3 and Table 2, although the overall performance of both link prediction and graph clustering improves with increasing sampling counts, slight fluctuations are also observed when the sampling count becomes large. A potential reason for this phenomenon lies in the increased uncertainty during model optimization. Specifically, as more samples are incorporated, the objective function becomes more complex, which may lead to greater instability during training. As a result, the learned latent representations may carry this uncertainty, further affecting the performance on downstream tasks. Therefore, performance fluctuations under varying sampling counts are considered reasonable within this context.
Comparisons with other graph methods
Performance
To validate the superiority of the proposed IWVGAE in density estimation and data representation capability, we compare our method with some representative graph methods on link prediction and graph clustering tasks, as listed in Tables 1 and 3. From the results in Tables 1 and 3, we can make the following observations:
It is noted that VGAE-based methods can be regarded as the variant of IWVGAE with a sampling count of $k = 1$. When $k = 1$, the posterior distribution is constrained, leading to suboptimal performance for IWVGAE. By contrast, when $k > 1$, the importance weighting yields a more flexible posterior distribution and the highest performance in most cases. In particular, on the Cora dataset, comparing the results for $k = 1$ and $k = 100$, both the link prediction metrics (AUC and AP) and the clustering metrics (ACC, NMI, ARI, and F1) improve consistently. Consequently, IWVGAE (100) with $k = 100$ achieves the best performance on most metrics among all sampling counts.
Compared with other methods, such as VGAE-based methods, GAE-based methods, and graph contrastive learning methods, IWVGAE (100) exhibits its superiority. Specifically, with a higher sampling count, our method benefits from richer and more flexible latent representations and optimization objectives, enabling the model to have a better ability to capture the distribution of graph data.
Compared with state-of-the-art self-supervised graph learning methods, our proposed method consistently outperforms the contrastive learning-based method E2Neg [11] and the VGAE-based method DyFSS [57] on the graph clustering task. For the link prediction task, it also achieves better performance than SIEG [42] and PULL [15]. Fundamentally, this is because our method is able to better capture the distribution of graph data, thereby enhancing the model’s representational capacity.
Visualization
We utilize t-SNE and node similarity matrices visualizations to enable a more intuitive comparison of the models’ capability in representing graph features.
The specific results are shown in Figs. 4 and 5. From the t-SNE visualizations, it can be observed that methods such as FT-VGAE [32] and VGAE [18] fail to effectively distinguish nodes of different classes. In contrast, our proposed method demonstrates improved performance: IWVGAE (1) shows mitigated overlap between classes, while IWVGAE (100) further enhances this effect by achieving smaller intra-class variations and clearer inter-class separations.
[See PDF for image]
Fig. 4
The t-SNE visualization of the structure of latent variables on the ACM dataset
[See PDF for image]
Fig. 5
Visualization of the similarity matrices of latent representations on the ACM dataset
For the node similarity matrices, we first sort the nodes based on their class labels to group nodes of the same class together. Then, we visualize the matrices using heatmaps, as shown in Fig. 5. In the heatmap, red indicates positive feature correlation, while blue indicates negative correlation. Darker colors represent stronger (positive or negative) correlations. The visualization results show that VGAE [18] and DGVAE [20] exhibit generally light-colored patterns, indicating a limited ability to capture the common features of nodes within the same class. FT-VGAE [32] displays strong correlations overall, suggesting that it tends to map features into a similar latent space but fails to effectively distinguish features across different classes. In contrast, our proposed method demonstrates better feature representation: IWVGAE (1) is able to more effectively learn node representations, while IWVGAE (100), benefiting from a tighter variational lower bound via Monte Carlo estimation, exhibits even stronger correlations among latent representations of nodes. This indicates an improved ability to capture both intra-class commonalities and inter-class differences.
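Visualizations of this kind are typically produced by projecting the latent vectors with t-SNE and by plotting a class-sorted pairwise similarity heatmap; the sketch below shows one common way to do this, using cosine similarity as the similarity measure (an assumption, since the paper does not specify the exact measure) and made-up toy embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Z: learned latent representations (N, d); labels: class label per node. Toy stand-ins here.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 50)
Z = rng.normal(size=(150, 16)) + labels[:, None]      # toy embeddings with some class structure

# (1) t-SNE scatter plot of the latent space.
emb = TSNE(n_components=2, random_state=0).fit_transform(Z)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=8)

# (2) Node-similarity heatmap: sort nodes by class, then show pairwise cosine similarities.
order = np.argsort(labels)
plt.figure()
plt.imshow(cosine_similarity(Z[order]), cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```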
Conclusion
This paper proposes a novel graph autoencoder model called IWVGAE based on the proposed theory regarding the variational lower bound. IWVGAE adopts the Monte Carlo approach to sample from the approximate posterior distribution, thereby making the approximate posterior distribution more flexible. Then, an importance weighting strategy is applied to the log-likelihood gradients during backpropagation. Extensive experiments indicate that our method not only validates the proposed theory but also shows its effectiveness in both link prediction and graph clustering tasks. The important finding is that as the number of samples increases, the variational lower bound continuously improves, and meanwhile the model’s expressive power is significantly enhanced.
Data Availability
All datasets used in the experiments are publicly available and can be accessed at: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html.
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Burda Y, Grosse R, Salakhutdinov R (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519
2. Chen J, Gao K, Li G, He K (2022) NAGphormer: a tokenized graph transformer for node classification in large graphs. arXiv preprint arXiv:2206.04910
3. Chiu B, Sahu SK, Thomas D, Sengupta N, Mahdy M (2020) Autoencoding keyword correlation graph for document clustering. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 3974–3981
4. Duan, J; Wang, S; Zhang, P; Zhu, E; Hu, J; Jin, H; Liu, Y; Dong, Z. Graph anomaly detection via multi-scale contrastive learning networks with augmented view. Proc AAAI Conf Artif Intell; 2023; 37, pp. 7459-7467.
5. Faez, F; Ommi, Y; Baghshah, MS; Rabiee, HR. Deep graph generators: a survey. IEEE Access; 2021; 9, pp. 106675-106702. [DOI: https://dx.doi.org/10.1109/ACCESS.2021.3098417]
6. Gasteiger J, Bojchevski A, Günnemann S (2018) Predict then propagate: graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997
7. Guo, L; Dai, Q. Graph clustering via variational graph embedding. Pattern Recognit; 2022; 122, [DOI: https://dx.doi.org/10.1016/j.patcog.2021.108334] 108334.
8. Hao X, Shafto P (2023) Coupled variational autoencoder. arXiv preprint arXiv:2306.02565
9. Hou, D; Gao, C; Li, X; Wang, Z. Dag-aware variational autoencoder for social propagation graph generation. Proc AAAI Conf Artif Intell; 2024; 38, pp. 8508-8516.
10. Huang PY, Frederking R et al (2019) RWR-GAE: random walk regularization for graph auto encoders. arXiv preprint arXiv:1908.04003
11. Huang, Y; Zhao, J; He, D; Jin, D; Huang, Y; Wang, Z. Does GCL need a large number of negative samples? Enhancing graph contrastive learning with effective and efficient negative sampling. Proc AAAI Conf Artif Intell; 2025; 39, pp. 17511-17518.
12. Hui, B; Zhu, P; Hu, Q. Collaborative graph convolutional networks: unsupervised learning meets semi-supervised learning. Proc AAAI Conf Artif Intell; 2020; 34, pp. 4215-4222.
13. Ju W, Wang Y, Qin Y, Mao Z, Xiao Z, Luo J, Yang J, Gu Y, Wang D, Long Q et al (2024) Towards graph contrastive learning: a survey and beyond. arXiv preprint arXiv:2405.11868
14. Khoshraftar, S; An, A. A survey on graph representation learning methods. ACM Trans Intell Syst Technol; 2024; 15,
15. Kim, J; Park, KH; Yoon, H; Kang, U. Accurate link prediction for edge-incomplete graphs via PU learning. Proc AAAI Conf Artif Intell; 2025; 39, pp. 17877-17885.
16. Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114
17. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
18. Kipf TN, Welling M (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308
19. Li J, Shomer H, Mao H, Zeng S, Ma Y, Shah N, Tang J, Yin D (2024) Evaluating graph neural networks for link prediction: current pitfalls and new benchmarking. In: Advances in neural information processing systems, vol 36
20. Li, J; Yu, J; Li, J; Zhang, H; Zhao, K; Rong, Y; Cheng, H; Huang, J. Dirichlet graph variational autoencoder. Adv Neural Inf Process Syst; 2020; 33, pp. 5274-5283.
21. Liu C, Fan W, Liu Y, Li J, Li H, Liu H, Tang J, Li Q (2023) Generative diffusion models on graphs: methods and applications. arXiv preprint arXiv:2302.02591
22. Liu M, Liu Y, Liang K, Tu W, Wang S, Zhou S, Liu X (2023) Deep temporal graph clustering. arXiv preprint arXiv:2305.10738
23. Liu, Y; Jin, M; Pan, S; Zhou, C; Zheng, Y; Xia, F; Philip, SY. Graph self-supervised learning: a survey. IEEE Trans Knowl Data Eng; 2022; 35,
24. Liu Y, Xia J, Zhou S, Yang X, Liang K, Fan C, Zhuang Y, Li SZ, Liu X, He K (2022) A survey of deep graph clustering: taxonomy, challenge, application, and open resource. arXiv preprint arXiv:2211.12875
25. Liu, Y; Yang, X; Zhou, S; Liu, X; Wang, S; Liang, K; Tu, W; Li, L. Simple contrastive graph clustering. IEEE Trans Neural Netw Learn Syst; 2023; 35, pp. 13789-13800. [DOI: https://dx.doi.org/10.1109/TNNLS.2023.3271871]
26. Liu Y, Zhou S, Liu X, Tu W, Yang X (2022) Improved dual correlation reduction network. arXiv preprint arXiv:2202.12533
27. Liu Y, Zhu S, Xia J, Ma Y, Ma J, Zhong W, Zhang G, Zhang K, Liu X (2024) End-to-end learnable clustering for intent learning in recommendation. arXiv preprint arXiv:2401.05975
28. Lv, C; Qi, M; Li, X; Yang, Z; Ma, H. SGFormer: semantic graph transformer for point cloud-based 3D scene graph generation. Proc AAAI Conf Artif Intell; 2024; 38, pp. 4035-4043.
29. Van der Maaten, L; Hinton, G. Visualizing data using t-SNE. J Mach Learn Res; 2008; 9,
30. Manduchi L, Vandenhirtz M, Ryser A, Vogt J (2024) Tree variational autoencoders. In: Advances in neural information processing systems, vol 36
31. Min E, Chen R, Bian Y, Xu T, Zhao K, Huang W, Zhao P, Huang J, Ananiadou S, Rong Y (2022) Transformer for graphs: an overview from architecture perspective. arXiv preprint arXiv:2202.08455
32. Mrabah N, Bouguessa M, Ksantini R (2022) Escaping feature twist: a variational graph auto-encoder for node clustering. In: IJCAI, pp 3351–3357
33. Mrabah N, Bouguessa M, Ksantini R (2023) Beyond the evidence lower bound: dual variational graph auto-encoders for node clustering. In: Proceedings of the 2023 SIAM international conference on data mining (SDM). SIAM, pp 100–108
34. Mrabah, N; Bouguessa, M; Touati, MF; Ksantini, R. Rethinking graph auto-encoder models for attributed graph clustering. IEEE Trans Knowl Data Eng; 2022; 35,
35. Nguyen TK, Fang Y (2024) Diffusion-based negative sampling on graphs for link prediction. In: Proceedings of the ACM web conference 2024, pp 948–958
36. Pan S, Hu R, Long G, Jiang J, Yao L, Zhang C (2018) Adversarially regularized graph autoencoder for graph embedding. arXiv preprint arXiv:1802.04407
37. Rampášek, L; Galkin, M; Dwivedi, VP; Luu, AT; Wolf, G; Beaini, D. Recipe for a general, powerful, scalable graph transformer. Adv Neural Inf Process Syst; 2022; 35, pp. 14501-14515.
38. Ribeiro LF, Saverese PH, Figueiredo DR (2017) struc2vec: learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 385–394
39. Sánchez-Martin, P; Rateike, M; Valera, I. VACA: designing variational graph autoencoders for causal queries. Proc AAAI Conf Artif Intell; 2022; 36, pp. 8159-8168.
40. Shchur O, Mumme M, Bojchevski A, Günnemann S (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868
41. Shen, X; Sun, D; Pan, S; Zhou, X; Yang, LT. Neighbor contrastive learning on learnable graph augmentation. Proc AAAI Conf Artif Intell; 2023; 37, pp. 9782-9791.
42. Shi, L; Hu, B; Zhao, D; He, J; Zhang, Z; Zhou, J. Structural information enhanced graph representation for link prediction. Proc AAAI Conf Artif Intell; 2024; 38, pp. 14964-14972.
43. Su, H; Li, Z; Yuan, CA; Vladimir, FF; Huang, DS. Variational graph neural network with diffusion prior for link prediction. Appl Intell; 2025; 55,
44. Su, X; Hu, P; You, ZH; Philip, SY; Hu, L. Dual-channel learning framework for drug–drug interaction prediction via relation-aware heterogeneous graph transformer. Proc AAAI Conf Artif Intell; 2024; 38, pp. 249-256.
45. Tan Q, Zhang X, Liu N, Zha D, Li L, Chen R, Choi SH, Hu X (2023) Bring your own view: graph neural networks for link prediction with personalized subgraph selection. In: Proceedings of the sixteenth ACM international conference on web search and data mining, pp 625–633
46. Velickovic, P; Cucurull, G; Casanova, A; Romero, A; Lio, P; Bengio, Y et al. Graph attention networks. stat; 2017; 1050,
47. Wang C, Liu Y, Yang Y, Li W (2024) HeterGCL: graph contrastive learning framework on heterophilic graph. In: Proceedings of the thirty-third international joint conference on artificial intelligence, pp 2397–2405
48. Wu Q, Zhao W, Yang C, Zhang H, Nie F, Jiang H, Bian Y, Yan J (2024) Simplifying and empowering transformers for large-graph representations. In: Advances in neural information processing systems, vol 36
49. Wu X, Cheng Q (2022) Stabilizing and enhancing link prediction through deepened graph auto-encoders. In: IJCAI: proceedings of the conference, vol 2022. p 3587 (NIH Public Access)
50. Yang R, Yang Y, Zhou F, Sun Q (2024) Directional diffusion models for graph representation learning. In: Advances in neural information processing systems, vol 36
51. Yang Z, Cohen W, Salakhudinov R (2016) Revisiting semi-supervised learning with graph embeddings. In: International conference on machine learning. PMLR, pp 40–48
52. Yu Z, Lu Y, Wang Y, Tang F, Wong KC, Li X (2022) ZINB-based graph embedding autoencoder for single-cell RNA-Seq interpretations. Proc AAAI Conf Artif Intell 36:4671–4679
53. Yuan, L; Jiang, P; Wen, Z; Li, J. Improving variational graph autoencoders with multi-order graph convolutions. IEEE Access; 2024; 12, pp. 46919-46929. [DOI: https://dx.doi.org/10.1109/ACCESS.2024.3380012]
54. Zhang Y, Yuan Y, Wang Q (2023) Multi-level graph contrastive prototypical clustering. In: IJCAI, pp 4611–4619
55. Zhao H, Yang X, Wang Z, Yang E, Deng C (2021) Graph debiased contrastive learning with joint representation clustering. In: IJCAI, pp 3434–3440
56. Zhou, X; Miao, C. Disentangled graph variational auto-encoder for multimodal recommendation with interpretability. IEEE Trans Multimedia; 2024; 26, pp. 7543-7554. [DOI: https://dx.doi.org/10.1109/TMM.2024.3369875]
57. Zhu, P; Wang, Q; Wang, Y; Li, J; Hu, Q. Every node is different: dynamically fusing self-supervised tasks for attributed graph clustering. Proc AAAI Conf Artif Intell; 2024; 38, pp. 17184-17192.
58. Zhu, Z; Zhang, Z; Xhonneux, LP; Tang, J. Neural Bellman—Ford networks: a general graph neural network framework for link prediction. Adv Neural Inf Process Syst; 2021; 34, pp. 29476-29490.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).