Content area
The geo-social group query is to find a group of users for the query point based on location and social information. In this paper, we propose the range constrained group query (RCGQ) on attribute social graph, considering social information, spatial information, keyword information and user group size. We prove that RCGQ problem is NP-hard. For the query, we propose four methods, namely the combination-based group expansion method (COM), the single–multi group expansion method (S–M), the single–single group expansion method (S–S) and the multi–multi group expansion method (M–M). The first method is based on combination. The last three methods are based on social relations. COM uses combination to find user groups. The social relations are not used in the combinatorial process. S–M, S–S and M–M use the social relations to find user groups. Pruning strategies are proposed for the four methods. Finally, experiments demonstrate the efficiency of the proposed methods.
Introduction
With the rapid development of GPS, mobile network and social network, many location-based social network (LBSN) services have been widely used, such as Foursquare, Flickr and Gowalla. These applications make people's life more convenient, but with the increase of people's personalized needs, more queries are needed. One of the queries is geo-social group query, which aims to find a group of users for the query point based on location and social information.
Consider a scenario where Alice, a query user on geo-social networks, wants to organize a dinner party with a clear theme at home. The keywords contained in the theme are music and movie. She has four requirements: (1) the size of the user group is 5 and Alice is in the user group; (2) in order to reduce travel costs, the distance between the selected participants and Alice is less than 10 km; (3) in order to ensure the comfort of cooperation, the selected participants should know at least three other participants; (4) top-1 user group that is most interested in the theme of the party should be returned.
Figure 1 illustrates an attribute social graph. Let Alice be u1 in Fig. 1, the distance between all users and u1 in Fig. 1 is less than 10 km. According to Alice's requirements, user groups A = {u1, u2, u3, u4, u6}, B = {u1, u3, u4, u5, u6} may be found. Finally, the user group B should be returned, because B is more interested in the theme of the party than A.
[See PDF for image]
Fig. 1
Example of attribute social graph
To our knowledge, for the above example, the most related works include [1, 2, 3, 4, 5, 6–7]. [1, 2, 3–4] studied the problem of geo-social group query, but they don’t consider the influence of user’s keyword information on the results. For Fig. 1, they cannot determine which user group is more suitable for querying users, and cannot guarantee to return a more satisfactory user group, such as B. For the papers that considered the keyword information, [5, 6–7] do not consider a fixed user group size. For Fig. 1, they will not be able to guarantee the return of user groups with a user size of 5. So, all these works could not meet Alice’s query requirements.
Based on this, we study a new problem, Range Constrained Group Query (RCGQ) on attribute social graph, considering four constraints at the same time: (i)range constraints, (ii)social constraints (k-core), (iii)keyword constraints, and (iv)user group size constraints.
RCGQ problem is NP-hard, which will be proved in Sect. 3. We need to return the user group with the best score from all the candidate user groups that meet the requirements. We will face the challenge of how to return the result in a reasonable time without enumerating all possible combinations.
For the query, we propose four methods. The first method uses combination to find user groups. The information of social relation is not used in the combination process. The last three methods are based on social relation, using the information of social relation to find user groups. Several pruning rules are proposed to reduce the search space and the number of users, which greatly improve the efficiency of the query.
The main contributions of this work are summarized as follows:
We define the query RCGQ, which is based on attribute social graph.
We propose a combination-based group expansion method and three group expansion methods based on social relation.
We propose several effective pruning strategies for each method.
We conduct extensive experiments using four datasets to demonstrate the efficiency of our methods.
Related work
We divide group queries into four categories: social group query, social group query with keywords, geo-social group query and geo-social group query with keywords.
Social group query
Given a social network and a query point, the group query on social network returns a group with special social relationships. Sozio et al. [8] studied a query-dependent variant of the community-detection problem, which is called the community-search problem. They proposed a greedy algorithm and two heuristic algorithms to solve the community query. Based on [8], which uses global search, Cui et al. [9] proposed a local search algorithm to search in the neighborhood of a vertex to find the best community for the vertex. Huang et al. [10] studied online community search and proposed a novel community model based on the k-truss concept. They also investigated the k-truss community search problem in a dynamic graph setting. Huang et al. [11] proposed a novel community search model called Closest Truss Community (CTC) and effective approximate algorithms. Barbieri et al. [12] tackled the problem of finding the smallest-size solution to community search and devised an approach that exploits the core decomposition of the graph as precomputed information. Song et al. [13] proposed SGEQ query. Given a connected k-core of size m, SGEQ returns a connected k-core of size m + n. This query can be used for a work team to recruit new team members. Cai et al. [14] proposed the closest community query problem. They developed a Baseline algorithm, and then progressively improved it to IndexedLO algorithm and CCS algorithm. Yang et al. [15] proposed the Social-Temporal Group Query to find the activity time and attendees with the minimum total social distance to the initiator.
Social group query with keywords
When users use social software, they will upload pictures and add description information, which will become the user's attribute information. Fang et al. [16] proposed the attributed community query (ACQ). To facilitate ACQ evaluation, they developed the CL-tree index and query algorithms. Huang et al. [17] proposed an Attributed Truss Community (ATC) model and several strategies to quickly find high-quality ATC. They designed an index, ATindex, and implemented a query processing algorithm for ATC. Li et al. [18] proposed a novel skyline community model to detect interesting communities in a multi-valued network. They developed a basic and a novel space-partition algorithm to find all the skyline communities. Zhang et al. [19] studied the novel problem of Keyword-Centric Community Search (KCCS) over attributed graphs, and proposed efficient algorithms as well as the optimizations and indexes to solve KCCS.
Geo-social group query
We can obtain user's location information as well as their social information through Geo-social networks. The goal of group query in geo-social networks is to return user groups that are spatially adjacent and satisfy specified social relationships. Liu et al. [20] proposed a new type of query, called Circle of Friend Query (CoFQ), which allows to find a group of friends in a geo-social network whose members are close to each other socially and geographically. Armenatzoglou et al. [21] proposed the general framework for GeoSN query processing, which enables flexible data management and algorithmic design. Shen et al. [22] proposed the Multiple Rally-Point Social Spatial Group Query (MRGQ), to select an appropriate activity location for a group of nearby attendees with tight social relationships.
Fang et al. [23] studied the problem of searching a Spatial-Aware community (or SAC), whose vertices are close structurally and spatially. For finding a SAC, they proposed two exact algorithms, and three efficient approximation algorithms. Zhu et al. [24] proposed a new family of Geo-Social Group Queries with minimum acquaintance constraint (GSGQs), which guarantees the worst-case acquaintance level. Chen et al. [25] studied the Maximum Co-located Community (MCC) search problem. They proposed an exact algorithm and an approximation algorithm to solve the problem. TQ-tree index is proposed to improve the efficiency of the approximation algorithm. Ma et al. [1] proposed a new type of query, Personalized Geo-Social Group (PGSG) query, which aims to retrieve both a user group and a venue. To tackle the problem of the PGSG query, they proposed the GVPS search algorithm to find the optimal user group and venue simultaneously, and extended the PGSG query to top-k personalized geo-social group (TkPGSG) query. Li et al. [26] defined a new skyline cohesive group model, which can measure the cohesiveness of both social and spatial dimensions flexibly. They proposed exact algorithms with effective pruning techniques and efficient greedy solutions based on a newly designed cd-tree. Kim et al. [27] proposed the Geo-social Community Search problem (GCS) which aims to find a social community and a cluster of spatial locations.
Geo-social group query with keywords
Some recent research works mainly aim at finding user groups on geo-social networks with keywords. Guo et al. [2] studied a new problem: a group nearest neighbor search on a road network that incorporates cohesive social relationships. In addition, the query would identify suitable spatial-textual objects as assembly points for a group of optimal attendees over road social networks. However, the attendee's keyword information is not considered. Liu et al. [6] studied a new attributed community search problem called Vertex-centric Attributed Community (VAC) search, which can handle different types of attributes. To solve the VAC problem, they proposed two exact algorithms and an approximate algorithm. Guo et al. [7] proposed the Multi-Attributed Community (MAC) model in road-social networks. They devised an index structure and proposed two algorithms to compute the top-j MACs. Chen et al. [5] studied the geo-social group search with multi-constraint. It considers the keyword attributes, social familiarity and spatial proximity of participants, and returns a k-truss subgraph that can cover the query keywords in a heuristic way and an effective search framework. However, the application scenario of [5] is more inclined to the cooperation among group members, which requires that for each input keyword key, the number of users covering key is at least ρ.
In this paper, we study the Range Constrained Group Query (RCGQ) on attribute social graph. The most relevant studies are [1, 2, 3, 4, 5, 6–7]. Although these studies are most closely related to our problem, user group evaluation rules are different. Among these works, [1, 2, 3–4] don’t consider user’s keyword information, and [6, 7] don’t have limit on the size of user groups. There are three main differences between this paper and [5]: (i)The social constraints in [5] is k-truss, while ours is k-core. (ii)The number of result group in [5] is 1, while ours is t, which is top-t problem. (iii)The size of user group in [5] is no less than ρ ×|φ|, while ours is h, which is fixed.
Problem statement
In this paper, the geo-social network with keywords is modeled as an attribute graph G = (VG, EG, TG), where VG represents the set of vertices, EG represents the set of edges, and TG is a mapping: VG → keyPset (keyPset is the power set of all keyword sets). TG assigns a keyword set to each vertex. The keyword set of any vertex v ∈ VG is denoted as TG(v) = {wv1, wv2, …, wvi}.
For any vertex v ∈ VG, we denote the neighbor set of v as nbr(v, G) = {u: (u, v) ∈ EG}, where (u, v) denotes the edge between vertex u and v. The degree of v is denoted as degG(v) =|nbr(v, G)|. The Euclidean distance between u and v in G is defined as dist(u, v), and dist(v, v) = 0 for any vertex v ∈ VG.
Based on [28, 29], we give the definition of k-core.
Definition 1
(k-core) Given an attribute graph G = (VG, EG, TG) and an integer k, a k-core is a connected subgraph H ⊆ G, such that ∀v ∈ VH, degH(v) ≥ k.
Definition 2
(User keyword scores(v)) Given an attribute graph G, for a vertex u ∈ VG, the keyword set of u is denoted by TG(u) = {wu1, wu2, …, wun}. Assuming that the query user is v and the query keyword set is T = {w1, w2, …, wm}, then the keyword score of u is denoted by s(u), which is defined as following:
1
Definition 3
(User group keyword score s(H)) Given an attribute graph G, let a user group be a subset of VG. For any user group H = {v1, …, vi} of G, where |H|= i, the keyword score s(H) of H is defined as.
2
Definition 4
(Candidate user group (CG)) Given an attribute graph G, a query user uq, and three parameters k, r, h, a candidate user group (CG) is defined to be a user group that satisfies the following conditions.
Structural cohesion: CG’s induced subgraph G[CG] is a k-core containing uq;
Spatial cohesion: for any v ∈ CG, there is dist(uq, v) ≤ r;
The user group size is fixed: |CG|= h.
Problem statement (RCGQ)
Given an attribute graph G and q = < uq, r, h, k, t, T > , where r is the range threshold, h is the user group size, k indicates k-core, t is the number of user groups to be returned, and T is a query keyword set, RCGQ returns a candidate user group set R = {CG1, …, CGt}, such that for any candidate user group J ∉ R, we have s(J) ≤ s(Rmin), where Rmin = .
Example 1
Take the attribute graph G in Fig. 2 as an example. Assume an RCGQ query q = < uq = u5, r = 3 km, h = 5, k = 2, t = 3, T > , where the query keyword set T = {Chinese, Mathematics, English, Sports, Physics}. All users whose Euclidean distance from the query user u5 is no more than 3 km are in the dotted circle. The keyword set for each user is shown in Table 1. All the candidate user groups containing u5 are CG1 = {u5, u1, u2, u3, u10}, CG2 = {u5, u9, u10, u11, u6}, CG3 = {u5, u9, u10, u3, u6}, CG4 = {u5, u10, u11, u6, u3}, CG5 = {u5, u11, u6, u3, u4}, CG6 = {u5, u10, u11, u9, u3} and CG7 = {u5, u10, u6, u3, u4}. It can be observed that the vertices of these candidate user groups are closely related (2-core). Finally, the query result is R = {CG1, CG4, CG7}, where s(CG1) = 3.6, s(CG4) = 3.2, s(CG7) = 3.2.
[See PDF for image]
Fig. 2
Geo-social network G
Table 1. User keyword set in Fig. 2
user ID | keyword set |
|---|---|
u1 | {Chinese, Mathematics, English, Sports} |
u2 | {History, Biochemistry, Sports, Physics} |
u3 | {Politics, English, History, Mathematics, Sports, Arts} |
u4 | {Mathematics, English, Politics} |
u5 | {Chinese, Mathematics, English, Sports, Physics} |
u6 | {Politics, Arts, Physics, Chinese} |
u7 | {Drawing, Piano, Dance} |
u8 | {Singing, Boxing, Chinese} |
u9 | {Chinese, Geography, History} |
u10 | {Mathematics, English, Sports, Physics} |
u11 | {English, History, Biochemistry, Physics} |
Theorem 1
RCGQ problem is NP-hard.
Proof
Given a graph G, a vertex uq, integers k and h ≥ k, (k, h)-core problem [26] is to test whether there is a connected k-core with h vertices containing uq. We will prove that (k, h)-core problem can be reduced to RCGQ problem. For an RCGQ q = < uq, r = ∞, h, k, t = 1, T > , the query result obtained by q is the best candidate user group, which is also a (k, h)-core. And if there are multiple (k, h)-cores, then the (k, h)-core with highest score will be the query result obtained by q. If there exists only one (k, h)-core, then this (k, h)-core will be the query result obtained by q. Therefore, (k, h)-core problem can be reduced to RCGQ problem. Because (k, h)-core problem is NP-complete, we can prove that RCGQ problem is NP-hard. □
Combination-based group expansion method
The goal of RCGQ is to return t candidate user groups CG with the best score, so we can get the candidate user groups simply by enumerating all combinations with h vertices. For each combination, if it is a candidate user group, we first calculate its keyword score according to Formula (2). Finally, t candidate user groups with the best keyword score are returned. Based on this, we propose a combination-based group expansion method (COM).
We use the single vertex tree (STree) to represent the expansion process of recursion. During the expansion process, only the current path starting from the root is kept to save storage space.
Definition 5
(Single Vertex Tree (STree)) Given an attribute graph G, the single vertex tree takes the query user uq as the root, and the expansion process starts from uq. For the current vertex u, a vertex v would be the child of u, if v should be expanded by u according to the expansion rule.
[See PDF for image]
Algorithm 1
Preprocess(G, q)
Algorithm 1 is the preparation work before getting the combination by recursion. HMap is a hash table to store the information of G′, where the key is v, and the value includes s(v) and FMap. FMap is a hash table to store the vertices in nbr(v, G′), where the key is the vertex in nbr(v, G′) and the value is an empty string. In line 2, the function RangeQuery() is used to find the subgraph Gr of G within the range specified by the query. That is, the distance from the user in Gr to uq is less than or equal to r. The implementation of RangeQuery() is based on R-tree [30]. In lines 3–4, the k-core G′ containing the query user uq is obtained from G. In lines 5–8, HMap is created.[See PDF for image]
Algorithm 2
COM(G, q)
Algorithm 2 gives the details of the combination-based group expansion method. In line 1, HMap is obtained by invoking Algorithm 1. In line 2, path is a stack to store the expanded sequence, which is called the current path starting from uq. R (in line 2) is a min heap to store t candidate user groups, where the key is the best keywords score of the candidate user groups. The query result will eventually be in R. In line 3, Child is a list to store the vertex set to be expanded by the current vertex, where G′ is the k-core obtained by Preprocess(G, q). In lines 4–5, the vertices in Child \ {uq} are sorted. In line 6, Algorithm 3 is invoked to find the combination by recursion.
[See PDF for image]
Algorithm 3
COMexpand(u, h, k, t, HMap, Child, R, path).
Algorithm 3 gives the details of a recursive function to find t candidate user groups with the best keyword score. In line 1, the current vertex u is pushed into the stack path. In lines 2–7, determine whether the vertex set Path is a candidate user group, where Path denotes the set of vertices in path. In line 3, IsConnected() is used to judge whether G′[Path] is connected. There are at most t candidate user groups in the min heap R. So, in line 6, if there are less than t candidate user groups in R, (Path, s(Path)) is put into R. Otherwise, if s(Path) is larger than the score of the top of R, (Path, s(Path)) is put into R to replace the top of R. Line 7 ends the expansion and traces back to the vertex who expands the current vertex, that is, the execution of line 12 is completed. Line 8 indicates that the current path needs to be expanded. In lines 9–10, a vertex set Child′ is obtained by expanding u. Line 12 recursively invokes Algorithm 3 itself to continue the expansion of path. Line 13 removes the last vertex in the path.
Example 2
We assume that a geo-social network G is shown in Fig. 2. Let the query q be < uq = u5, r = 3 km, h = 5, k = 2, t = 3, T > , where T = {Chinese, Mathematics, English, Sports, Physics}. Firstly, the 2-core G′ is shown in Fig. 3, which is obtained according to lines 2–4 of Algorithm 1. The keyword score of each user is shown in Table 2. According to lines 4–5 of Algorithm 2, the vertices in Table 2 are sorted in descending order by keyword score. If the scores are the same, they are sorted in ascending order according to the user ID. After sorting, we have Child = {u1, u10, u3, u2, u4, u6, u11, u9}. We use an STree to represent the process of discovering the candidate user group. Part of STree is shown in Fig. 4. The query user u5 is the root of STree. After Algorithm 3 is invoked, u5 is pushed into path, and we have path = {u5}. In line 10 of Algorithm 3, we have Child′ = {u1, u10, u3, u2, u4, u6, u11, u9}, which is represented in light gray. After u1 is expanded by invoking Algorithm 3, we have path = {u5, u1}. In line 10 of Algorithm 3, we have the new Child′ = {u10, u3, u2, u4, u6, u11, u9}, which is represented in gray. We have path = {u5, u1, u10, u3, u2} as shown in Fig. 4, which is a candidate user group after judgment. Then after recursive return, we have path = {u5, u1, u10, u3}. After all the paths are traversed, we have R = {({u5, u1, u10, u3, u2}, 3.6), ({u5, u10, u3, u6, u11}, 3.2), ({u5, u10, u6, u3, u4}, 3.2)}.
[See PDF for image]
Fig. 3
2-core G2 of G
Table 2. Keyword score of vertex in Fig. 3
user ID | u1 | u2 | u3 | u4 | u5 | u6 | u9 | u10 | u11 |
|---|---|---|---|---|---|---|---|---|---|
Keyword score | 0.8 | 0.4 | 0.6 | 0.4 | 1.0 | 0.4 | 0.2 | 0.8 | 0.4 |
[See PDF for image]
Fig. 4
Part of the STree of q in Example 2
Group expansion methods based on social relations
In this section, we propose three group expansion methods based on social relations, namely the single–multi group expansion method (S–M), the single–single group expansion method (S–S) and multi–multi group expansion method (M–M).
Single–multi group expansion method
We will modify the existing work to solve the RCGQ. [1] proposed the PGSG query. Its group expansion part adopts breadth first search method, but there are some results that cannot be found. Therefore, we will introduce the group expansion process of PGSG query [1], and modify this method to make it applicable to RCGQ.
[See PDF for image]
Algorithm 4
Group extension of PGSG algorithm
Algorithm 4 describes part of the process of PGSG group expansion [1]. Line 1 initializes UQ, DQ and DV, where UQ and DQ are priority queues. Newly generated user groups will be put into UQ, the user group of size h would be put into DQ after leaving UQ, and DV stores users that completes expansion as expansion center. Lines 3–16 search vertex sets with size h in Gk according to breadth search. In each expansion, the expansion center is the vertex with the largest degree in the current user group and is not in DV. DV is a global variable. Next, from Example 3, we can see that DV will affect finding all the results.
Example 3
Given an attributed graph G, we have an RCGQ query q = < uq = u5, r = 3 km, h = 5, k = 2, t = 3, T > , where T = {Chinese, Mathematics, English, Sports, Physics}. Based on Example 1, the result set is R = {(CG1, 3.6), (CG4, 3.2), (CG7, 3.2)}, where CG1 = {u5, u1, u2, u3, u10}, CG4 = {u5, u10, u11, u6, u3}, and CG7 = {u5, u10, u6, u3, u4}.
We get the graph G′ (shown in Fig. 3) stored in HMap by invoking Preprocess(G, q) (Algorithm 1). Then we can get the RCGQ query result by invoking Algorithm 4 with parameters G′ and qPGSG = < uq = u5, h = 5, k = 2 > . Because G′ is the 2-core containing uq, Gk is the same as G′.
Because of the use of DV, PGSG may miss the result CG1. Next is the PGSG group expansion process using Algorithm 4.
For the query qPGSG, PGSG query first takes the query user u5 as the expansion center. In line 4, we have Pi = {u5}. Because |Pi|< 5, we continue to expand. In line 7, we get DV = {u5}. In line 8, we select the non-empty subset of nbr(u5, Gk), where nbr(u5, Gk) = {u3, u6, u10, u9, u11}, and combine the subset with Pi to form a new user group. Put the newly formed user group into the priority queue UQ. According to the condition in line 8, for Vp ⊆ nbr(u5, Gk) \ Pi, we would have |Vp|∈ [2, 4], so the 25 user groups would be put into UQ. Since PGSG query does not give priority rules of UQ, there are 4 possibilities for Pi permutation in UQ. Due to the limited space, only some user groups that will affect the correctness of the results are given. After u5 completes the expansion, the ranking of user groups in UQ is shown in Table 3.
Table 3. UQ under different priority rules after u5 completes the expansion
priority rules | UQ | Score |
|---|---|---|
(1) Descending by the number of edges in Pi | {u5, u3, u6, u10, u9}, {u5, u3, u11, u10, u9}, {u5, u6, u11, u10, u9}, {u5, u3, u6, u11, u10}, {u5, u3, u6, u11, u9}, {u5, u3, u10, u11}, {u5, u3, u6, u10}, …, {u5, u3, u10}, … | 7, 7, 7, 7, 6, 5, 5, …, 3, … |
(2) Ascending by the number of edges in Pi | {u5, u3, u11}, {u5, u6, u10}, …, {u5, u3, u10}, … | 2, 2, …, 3, … |
(3) Descending order by user group size |Pi| | {u5, u3, u6, u10, u9}, {u5, u3, u11, u10, u9}, {u5, u6, u11, u10, u9}, {u5, u3, u6, u11, u10}, {u5, u3, u6, u11, u9}, {u5, u3, u10, u11}, {u5, u3, u6, u10}, …, {u5, u3, u10}, … | 5, 5, 5, 5, 5, 4, 4, …, 3, … |
(4) Ascending order by user group size |Pi| | {u5, u3, u11}, {u5, u6, u10}, …, {u5, u3, u10}, … | 3, 3, …, 3, … |
Because only {u5, u3, u10} in UQ is the subset of CG1, in order to get CG1 for query qPGSG, we have to expand u1 or u2 from {u5, u3, u10}. For {u5, u3, u10}, if u3 ∈ DV and u10 ∈ DV, then we would not obtain CG1. In the following we consider whether u3 ∈ DV and u10 ∈ DV when {u5, u3, u10} leaves UQ.
Take priority rule (1) as an example. Table 3 shows the UQ under this priority rule. The top 5 user groups with the highest priority in UQ have satisfied the user group size constraint, so they are put into DQ after leaving UQ. Now the user groups with the highest priority are {u5, u3, u10, u11} and {u5, u3, u6, u10}, which have the same priority. Assume that {u5, u3, u10, u11} leaves UQ first. Since u3 and u10 have the maximum degree, u3 or u10 is preferred as the expansion center. Suppose that u3 is selected as the expansion center and put into DV. Now we have DV = {u5, u3}. When the user group {u5, u3, u6, u10} leaves UQ, only u10 has the largest degree among the users who are not in DV, so the user group chooses u10 as the expansion center to update the DV. We have DV = {u5, u3, u10}. Therefore, when the user group {u5, u3, u10} leaves UQ, u3 and u10 already exist in DV so CG1 will not be found.
Take priority rule (4) as an example. There are multiple user groups in UQ with the same priority containing u3 or u10. If {u5, u3, u11} and {u5, u6, u10} leave UQ before {u5, u3, u10}, u3 and u10 would be put into DV because u3 and u10 have the maximum degree. Therefore, when {u5, u3, u10} leave UQ, CG1 may not be found.
From the above example, there may exist some user group that cannot be found by using the PGSG group expansion method. The reason is that each vertex can only be the expansion center once.
Therefore, we modify Algorithm 4 to make it applicable to RCGQ query. Based on Algorithm 4, we get Algorithm 5, which is called single–multi group expansion method (S–M). There are two modifications.
In Algorithm 5, we do not use DQ, and store the result set R, which is the same as that in Algorithm 2. There are t optimal candidate user groups in R.
DV is no longer used in the expansion process in Algorithm 5. Lines 5–11 of Algorithm 4 are replaced by lines 6–13 of Algorithm 5.
[See PDF for image]
Algorithm 5
S–M (G, q)
Example 4
Given an attributed graph G, we have an RCGQ query q = < uq = u5, r = 3 km, h = 5, k = 2, t = 3, T > , where T = {Chinese, Mathematics, English, Sports, Physics}. As the query of Example 3, the result set is R = {(CG1, 3.6), (CG4, 3.2), (CG7, 3.2)}, where CG1 = {u5, u1, u2, u3, u10}, CG4 = {u5, u10, u11, u6, u3}, and CG7 = {u5, u10, u6, u3, u4}.
We get the graph G′ (shown in Fig. 3) stored in HMap by invoking Preprocess(G, q) (Algorithm 1). In line 3 of Algorithm 5, we have the queue UQ = {u5}. Lines 4–18 will be repeated until UQ is empty.
(1) When line 4 of Algorithm 5 is executed, since queue UQ ≠ Ø, vertex set {u5} leaves UQ, and we have P = {u5}. In line 10 of Algorithm 5, we have Vp = {u3, u6, u9, u10, u11}. Executing lines 10–12 multiple times, we have UQ = {{u5, u3, u6}, {u5, u3, u6, u9}, {u5, u3, u6, u9, u10}, {u5, u3, u6, u9, u11}, {u5, u3, u6, u10}, …, {u5, u9, u11}, {u5, u10, u11}}.
(2) When line 4 of Algorithm 5 is executed, since queue UQ ≠ Ø, vertex set {u5, u3, u6} leaves UQ, and we have P = {u5, u3, u6}. In line 7 of Algorithm 5, when v is u5, lines 8–9 are executed. When v is u3, we have Vp = {u2, u4, u10}, and vertex set {u5, u3, u6, u2}, {u5, u3, u6, u2, u4}, …, {u5, u3, u6, u4, u10} are put into UQ. When v is u6, we have Vp = {u4, u11}, and vertex set {u5, u3, u6, u4, u11}is put into UQ.
(3) When line 4 of Algorithm 5 is executed, since queue UQ ≠ Ø, vertex set {u5, u3, u6, u9, u10} leaves UQ. After lines 14–17 are executed, we have R = {(CG3, 3.0)}, where CG3 = {u5, u9, u10, u3, u6}.
Because queue UQ is not empty, the while loop will continue to be called until queue UQ is empty. After all the vertex sets in UQ are explored, we have R = {(CG1, 3.6), (CG4, 3.2), (CG7, 3.2)}, where CG1 = {u5, u1, u2, u3, u10}, CG4 = {u5, u10, u11, u6, u3}, and CG7 = {u5, u10, u6, u3, u4}.
In the following, we will prove the correctness of Algorithm 5 with Theorem 2.
Theorem 2
Given an attribute graph G, query q = < uq, r, h, k, t, T > , S–M group expansion method can find out all candidate user groups.
Proof
Given a candidate user group P, we can see that P is connected according to Definition 4. So, we classify the vertices in P according to their social distance to uq. Let Li contain the vertex whose social distance to uq is i. We give the proof by induction.
The basis: there exists a user group {uq} ∪ L1 in UQ according to Algorithm 5.
Assume that there exists a vertex set Pi = {uq} ∪ L1 ∪ … ∪ Li in UQ, then we will prove that Pi+1 = Pi ∪ Li+1 must be in UQ too. Without loss of generality, let Li = {a1, a2, …, am, b1, …, bn}, and Li+1 = A1 ∪ A2 ∪ … ∪ Am, where Aj ⊆ nbr(aj, G[P]), Aj ≠ Ø, nbr(bs, G[P]) ∩ Li+1 = Ø, 1 ≤ j ≤ m, and 1 ≤ s ≤ n.
We give the proof by induction too. (1) For Pi, when a1 acts as the expansion center and completes the expansion, there must be a user group Pi ∪ A1, which is put into UQ. (2) Assume that there is a user group Pi ∪ A1 ∪ … ∪ Aj in UQ, we will prove that Pi ∪ A1 ∪ … ∪ Aj ∪ Aj+1 must be in UQ.
If Aj+1 \ (A1 ∪ … ∪ Aj) = ∅, it shows that Pi ∪ A1 ∪ … ∪ Aj = Pi ∪ A1 ∪ … ∪ Aj ∪ Aj+1. If Aj+1 \ (A1 ∪ … ∪ Aj) ≠ ∅, when Pi ∪ A1 ∪ … ∪ Aj leaves UQ and aj+1 acts as the expansion center, there must be a user group Pi ∪ A1 ∪ … ∪ Aj ∪ Aj+1. According to the induction, we have proved that Pi+1 = Pi ∪ Li+1 must be in UQ. □
Single–single group expansion method
Based on COM group expansion method, we propose the single–single group expansion method (S–S). Similar to COM, we use the single vertex tree (STree) (Definition 5) to represent the expansion process implemented by recursion. During the expansion process, only the current path starting from the root is kept.
[See PDF for image]
Algorithm 6
S–S(G, q)
Algorithm 6 is the preparation of S–S. In line 1, Algorithm 1 is invoked to obtain HMap, which stores the k-core G′ containing uq within the range specified by the query. In line 2, some parameters are initialized to be used for the function SSexpand(). Child is a vertex set and path is a stack, which is used to store the current path. R is a minheap, which is the same as that in Algorithm 2. ALB is used to store the left brothers of all vertices in the current path path.
[See PDF for image]
Algorithm 7
SSexpand(u, pos, h, k, t, HMap, Child, R, path, ALB)
Algorithm 7 describes the details of S–S. In line 1, the current vertex u is pushed into path. In lines 2–3, ALB is updated by adding the prior vertex of u in Child. In lines 4–8, determine whether the vertex set Path is a candidate user group, where Path denotes the set of vertices in path. Lines 5–7 are the same with lines 4–6 of Algorithm 3. In lines 10–13, construct Child′, which is the child vertex set of u. In lines 15–21, u expands each vertex in Child′ by invoking the function SSexpand() (Algorithm 7) recursively. Line 18 removes the last vertex in the path. In lines 19–21, after all the expansions from u are completed, ALB should be updated accordingly.
Theorem 3
Given an attribute graph G and a query q = < uq, r, h, k, t, T > , all candidate user groups can be found by using S–S.
Proof
Given any candidate user group P, we will prove that there must be a path found by S–S, such that Path = P.
(1) When starting from the query user uq, we have path = {uq}. Without loss of generality, let Child = {…, a, …, b, …, c, …} be the neighbor set of uq in G′, where the neighbor set of uq in P is Child ∩ P = {a, b, c}. Suppose that vertex a has the highest priority in {a, b, c}. The vertex whose priority is higher than a will be expanded before a by uq. After all the vertices before a in Child′ are expanded by uq, a is expanded by uq, and we have path = {uq, a}.
(2) For path = {uq, a}, we would expand from a. Without loss of generality, let Child′ = {…, b, …, d, …, c, …} be the vertex set to be expanded by a in G′, where the vertex set to be expanded by a in P is Child′ ∩ P = {b, d, c}. Suppose that vertex b has the highest priority in {b, d, c}. The vertex whose priority is higher than b will be expanded before b by a. After all the vertices before b in Child′ are expanded by a, b is expanded by a, and we have path = {uq, a, b}.
The above process could expand the vertex in P each time. The reason is that P is a candidate user group, so the vertex in P is connected to uq. Each expanded vertex is the neighbor vertex of vertex in the current path. ALB in Algorithm 7 is used to avoid finding duplicate paths. Continuing to expand in this way, all the remaining vertices in P would be expanded and we have path = {uq, a, b, …}, such that Path = P. □
Theorem 4
Given an attribute graph G, for the query q = < uq, r, h, k, t, T > , S–S is used to find the candidate user group. For any two different path path1 and path2 in STree, we would have Path1 ≠ Path2.
Proof
For any two different paths path1 and path2 in STree, without loss of generality, let path1 = {uq, n1, …, ni, a, …}, and path2 = {uq, n1, …, ni, b, …}. According to the S–S group expansion method, all the right brothers of the current vertex v are added to Child′ of v. If b comes after a in Child′(line 14 in Algorithm 7), a would not exist in the subtree of b, that is to say, a would not be in path2. Therefore, the vertex set of path1 must be different from that of path2. □
Theorem 4 shows that S–S would not get two identical candidate user groups.
Multi–multi group expansion method
Inspired by S–M, the multi–multi group expansion method (M–M) is proposed. We use the multi-vertex tree (MTree) to represent the multi–multi group expansion process implemented by recursion. During the expansion process, only the current path starting from the root is kept.
Definition 6
(Multi-vertex Tree (MTree)) Given an attribute graph G, for the query user uq, the multi-vertex tree takes {uq} as the root, and the expansion process starts from {uq}. Each node contains at least one vertex. For the current node Node, a node Node′ would be the child of Node if Node′ should be expanded by Node according to the expanding rule.
Definition 7
(Vertex set of the path in the MTree (Path)) Let the current path in the MTree be path = {Node1, Node2, …, Nodem}, then the vertex set of path is Path, which is defined as.
3
Definition 8
(Neighbors of node in MTree (Nbr(Node, G′))) Given a k-core G′ containing uq, for a node Node in the MTree, the neighbors of Node in G′ is denoted by Nbr(Node, G′), which is defined as
4
Definition 9
(Related vertex set of path(Allbro(path)) Given a k-core G′ containing uq and an MTree with {uq} as the root, let the current path in the MTree be path = {Node1, Node2, …, Nodem}, the related vertex set of path is denoted by Allbro(path), which is defined as
5
[See PDF for image]
Algorithm 8
M–M (G, q)
Algorithm 8 is the preparation of M–M. In line 1, Algorithm 1 is invoked to obtain HMap, which stores the k-core G′ containing uq within the range specified by the query. In line 2, some parameters are initialized to be used for the function MMexpand(). Node is used to store the current node, and path is a stack, which is used to store the current path. R is a minheap, which is the same as that in Algorithm 2. Path is used to store the vertex set of path, and Allbro is used to store Allbro(path) (Definition 9).
[See PDF for image]
Algorithm 9
MMexpand(Node, h, k, t, HMap, R, path, Path, Allbro)
Algorithm 9 describes the details of M–M. In lines 1–2, the current node Node is pushed into path and Path is updated accordingly. In lines 3–7, if there are h users in Path, we should judge whether Path is a candidate user group. If yes, R will be updated using the new result in line 6. Allbro in line 8 stores Allbro(path) for the current path path, and Allbro in line 9 stores the Allbro(path′) for path′ obtained by expanding a new node based on path. In lines 10–11, we get any subset Vp of Child that can be expanded and is not expanded. In lines 12–18, MMexpand() is invoked with Vp as the new current node. When MMexpand() returns, Node′ is popped from path and Path is updated accordingly. In lines 17–18, we would get a new subset Vp of Child. In lines 19–20, after all the expanding from Node are completed, Allbro should be updated accordingly.
Theorem 5
Given an attribute graph G, for the query q = < uq, r, h, k, t, T > , M–M can find all candidate user groups.
Proof
If P is a candidate user group, then according to the social distance from the query user, the vertices in P can be partitioned into a node sequence {{uq}, Node1, Node2, …, Nodem}, where Nodei represents that the shortest social distance between uq and each vertex in Nodei is i.
Now we prove that P can be found with M–M by induction.
According to the expansion rule of M–M, start from {uq}, then Node1 could be found because we have Node1 ⊆ Nbr({uq}, G′), where G′ is the k-core containing uq within the range specified by the query. According to Definition 9, we have Allbro(path) = {uq} ∪ Nbr({uq}, G′), where path = {{uq}, Node1}.
Assume that path = {{uq}, Node1, …, Nodei} could be found. According to the algorithm, we have Nodei+1 ⊆ Nbr(Nodei, G′) \ Allbro(path), so Nodei+1 can be expanded. Therefore, path = {{uq}, Node1, …, Nodei, Nodei+1} could also be found. □
Theorem 6
Given an attribute graph G, for the query q = < uq, r, h, k, t, T > , M–M is used to find the candidate user group. For any two different path path1 and path2 in the MTree, we would have Path1 ≠ Path2.
Proof
For any two different paths path1 and path2 in the MTree, without loss of generality, let path1 = {{uq}, Node1, …, Nodei-1, {a, b}, …}, and path2 = {{uq}, Node1, …, Nodei-1, {a, c}, …}. We can see that path1 and path2 is expanded from path1′ = {{uq}, Node1, …, Nodei-1, {a, b}} and path2′ = {{uq}, Node1, …, Nodei-1, {a, c}} respectively. For path1′ and path2′, only the last node is different. According to Definition 9, we have Allbro(path1′) = Allbro(path2′). For path1, we would have c ∉ Path1, because c ∈ Allbro(path1′). While for path2, we would have c ∈ Path2. Therefore, we would have Path1 ≠ Path2. □
Theorem 6 shows that M–M would not get two identical candidate user groups.
Example 5
Given an attribute graph G and a query q = < uq, r, h, k, t, T > , the expansion process using M–M is represented with a partial MTree, which is shown in Fig. 5.
[See PDF for image]
Fig. 5
The expansion process using M–M
Take Fig. 5 as an example, if path = {{uq}, Node1, Node2, Node3}, the current path path consists of gray nodes in Fig. 5. Allbro(path) contains all vertices in the nodes circling by the dotted ellipses. The child node of Node3 is a subset of Child = Nbr(Node3, G′) \ Allbro(path), where G′ is the k-core containing uq within the range specified by the query.
Pruning techniques
In this section, we present three kinds of pruning strategies: (1) User group diameter pruning strategy based on the bounded diameter property [23, 31] (Sect. 6.1); (2) Minimum degree pruning strategy (Sect. 6.2); (3) Keyword score pruning strategy (Sect. 6.3).
User group diameter pruning strategy (Diam)
Diam is proposed based on n-plex [31]. In order to better understand Diam, this section first introduces the definition of n-plex, then gives the relationship between it and the candidate user group, and finally gives the pruning strategy Diam.
Definition 10
(n-plex [26, 31]) Given an attribute graph G and an integer n, an n-plex is a connected subgraph H ⊆ G, such that ∀v ∈ VH, degH(v) ≥|VH|− n. H is a maximal n-plex if there is no any other n-plex H′ ⊇ H, denoted as n-.
Definition 10 shows that in H, each vertex is connected with at least other |VH|− n vertices. In a k-core, each vertex is connected with at least k other vertices. So, we can have the following equivalence property.
Property 1.
(Equivalence property [26]) A k-core of size h is a (h − k)-plex of size h.
Property 2.
(Bounded diameter of (h − k)-plex [26, 32]) Let H be a (h − k)-plex of size h and diam(H) be the diameter of H.
If h < 2 k + 2, then diam(H) ≤ 2;
If h ≥ 2 k + 2, then diam(H) ≤ h − 2 k + 2.
where diam(H) is the maximum social distance between any two vertices in H. The social distance is defined to be the shortest path length in the social network. User group diameter pruning strategy (Diam) is got according to the above properties.
Example 6
Take Fig. 6 as an example to illustrate the use of Diam pruning. Let the query user be u11, h = 5, and k = 2. The 2-core containing the query user u11 is shown in Fig. 6(a). Now we would use Diam pruning for the 2-core in Fig. 6(a). According to Property 1 and 2, because 5 < 2 × 2 + 2, diam(H) should be less than or equal to 2. Because the social distance between u2 and u11 is 3, u2 can be pruned. After u2 is removed, the degree of user u1 in Fig. 6(a) is less than 2, which cannot meet the social constraint 2-core, so, it also needs to be removed. Finally, the 2-core containing the query user u11 is shown in Fig. 6(b).
[See PDF for image]
Fig. 6
Diam for q = < uq = u11, r = 3 km, h = 5, k = 2, t = 3, T >
Minimum degree pruning strategy (MinDeg)
The main idea of minimum degree pruning is that we can judge in advance that when the user group size reaches h, it cannot meet the social constraints.
Theorem 7
(Minimum degree pruning strategy (MinDeg)) Given the k-core G′ containing uq within the range specified by the query and the current user group be P (|P|< h), if there is a vertex v ∈ P with degG′[P] (v) < k − (h −|P|), then the user group of size h containing P cannot be a candidate user group.
Proof
Let the upper bound of degree of vertex v ∈ P in the user group of size h containing P be UDegP(v), which can be calculated in Formula (6).
6
That is, when UDegP(v) < k, the vertex v cannot satisfy the social constraint k, so P cannot be a candidate user group. If there exists degG′[P](v) + (h −|P|) = UDegP(v) < k for v ∈ P, that is, degG′[P](v) < k − (h −|P|), then the user group of size h containing P cannot be a candidate user group. □
We can see that if the user group of size h containing the current user group P cannot be a candidate user group, then P need not be expanded. So, according to Theorem 7, we can judge in advance whether the current user group could be pruned.
When |P| is small enough, Theorem 7 does not need to be used to judge whether the pruning is needed. We would give the lower bound of |P| for the use of Theorem 7 in the following.
For COM, the method based on combination without considering social relation, there may exist a vertex v in the current user group P such that degG′[P](v) = 0. That is, we have degG′[P](v) ≥ 0, ∀ v ∈ P. Therefore, if |P|≤ h − k, that is, degG′[P](v) = 0 ≥ k − (h −|P|), then Theorem 7 need not be used.
For S–M, S–S and M–M, the methods based on social relation, the current user group P is connected. So, we have degG′[P](v) ≥ 1, ∀v ∈ P. Therefore, if |P|≤ h − k + 1, that is, degG′[P](v) = 1 ≥ k − (h −|P|), then Theorem 7 need not be used.
Because it is time consuming to recalculate the degree of each user in the current group each time, we should give an effective method to realize the minimum degree pruning strategy. We create a hash table PathMap to record the degree of users in the current user group P. PathMap changes with P.
In the following, we will show the application of minimum degree pruning strategy in COM, S–M, S–S and M–M.
(1) COM and S–S
We would use Theorem 7 before expanding for the current user group Path. For Path, we would expand the vertex in Child′ (in Algorithms 3 and 7). Before expanding, if |Path|+ 1 > h − k in COM or |Path|+ 1 > h − k + 1 in S–S, we could use Theorem 7 to check each vertex in Child′. For each vertex v in Child′, we could use Theorem 7 to judge whether Path ∪ {v} could be pruned. If yes, v is removed from Child′. During the process, Path does not change, and so does PathMap.
After checking all vertices in Child′, we would get a new Child′, which may be smaller than the old one. Because the expansion uses recursion, Path changes incrementally, and so does PathMap.
(2) S–M
The current user group is recorded with P, which is got by dequeuing from UQ in line 5 of Algorithm 5. So, PathMap has to be recreated for each P dequeued from UQ. For the current user group P, before expanding, if |P|+ 1 > h − k + 1, we could use Theorem 7 to prune the vertex to be expanded.
For v ∈ P, we would expand the vertex in nbr(v, G′) \ P (in Algorithm 5). For each vertex u in nbr(v, G′) \ P, we could use Theorem 7 to judge whether P ∪ {u} could be pruned. If yes, u is removed from nbr(v, G′). During this process, P does not change, and so does PathMap.
After checking all vertices in nbr(v, G′) \ P, we would get a new set, which may be smaller than the old one. For each subset Vp in the new set that satisfies the conditions in line 10 of Algorithm 5, we would get a new current user group V = P ∪ Vp, and PathMap is changed accordingly. After using Theorem 7, if V could not be pruned, it is added to UQ in line 13 of Algorithm 5. After Vp is processed, we would restore P and PathMap to process the next subset Vp. During this process, P changes incrementally, and so does PathMap.
(3) M–M
Given the current user group Path, we could use Theorem 7 to judge whether Path could be pruned. For Path, we would expand the vertex in Child (in Algorithm 9). Before expanding, if |Path|+ 1 > h − k + 1, we could use Theorem 7 to check each vertex in Child. For each vertex v in Child, we could use Theorem 7 to judge whether Path ∪ {v} could be pruned. If yes, v is removed from Child. During the process, Path does not change, and so does PathMap.
After checking all vertices in Child, we would get a new Child, which may be smaller than the old one. For each subset Vp of the new Child satisfying the conditions in line 11 of Algorithm 9, we would get a new current user group Path = Path ∪ Vp, and PathMap is changed accordingly. Because the expansion uses recursion, Path changes incrementally, and so does PathMap.
Keyword score pruning strategy
Let the minimum keyword score in the t candidate user groups be st when the t candidate user groups are known. For the k-core G′ containing uq within the range specified by the query, let smax be . The main idea of keyword score pruning strategy is to judge whether the keyword score upper bound of the current user group is less than or equal to st. If yes, the current user group could be pruned.
For the current user group P (|P|≤ h), the upper bound of s(P) is denoted as UB(P), which can be calculated in Formula (7).
7
With the help of UB(P), we would give the multi-path keyword score pruning (MUScoreA) in the following.
Theorem 8
(Multi-path keyword score pruning strategy for SS and COM (MUScoreA)).
For COM and S–S, let the current path be path. According to S–S and COM, Path is the current user group. For path, we would expand the vertex in Child′ (in Algorithms 3 and 7). For a vertex v in Child′, if UB(Path ∪ {v}) ≤ st, we need not expand v and the vertex v′ after v in Child′.
Proof
For any candidate user group H containing Path ∪ {v}, we would have s(H) ≤ UB(Path ∪ {v}) according to Formula (7). So, if UB(Path ∪ {v}) ≤ st, we need not expand v. Because v′ comes after v in Child′, we have s(v) ≥ s(v′). So, s(Path ∪ {v}) ≥ s(Path ∪ {v′}) and then UB(Path ∪ {v}) ≥ UB(Path ∪ {v′}). For any candidate user group H containing Path ∪ {v′}, we would have s(H) ≤ UB(Path ∪ {v′}) according to Formula (7). If UB(Path ∪ {v}) ≤ st, we have UB(Path ∪ {v′}) ≤ UB(Path ∪ {v}) ≤ st. So, we need not expand v′. □
For the method COM, given a current path pathi, we would expand the vertex in Child′ (in Algorithm 3), where Child′ is denoted by Childi in Theorem 9. In order to understand Theorem 9, Fig. 7 is given.
[See PDF for image]
Fig. 7
Theorem 9
Theorem 9
(Multi path keyword score pruning strategy for COM (MUScoreB))
For COM, let the current path be pathi-1 = {uq, …, vi-1}. After vi is expanded for pathi-1, we obtain a new current path pathi = {uq, …, vi-1, vi}. Let pathl be obtained after vl is expanded for pathl-1, where vl is the successor of vl-1 in Childl-2, i + 1 ≤ l ≤ j. If |Pathj|< h, vj+1 is the successor of vj in Childj-1 and UB(Pathj ∪ {vj+1}) ≤ st, we need not expand all vertices after vl in Childl-1 for pathl-1, where i ≤ l ≤ j.
Proof
If vj+1 is the successor of vj in Childj-1 and UB(Pathj ∪ {vj+1}) ≤ st, we prove that for any vertex vl′ after vl in Childl-1 (i + 1 ≤ l ≤ j), we need not expand vl′ for pathl-1.
There are two cases:
(1) We could not get a path containing Pathl-1 ∪ {vl′} of size |Pathj ∪ {vj+1}|, which shows that we would not get a candidate user group containing Pathl-1 ∪ {vl′}.
(2) We could get a path containing Pathl-1 ∪ {vl′} of size |Pathj ∪ {vj+1}|, which is path′ = {uq, …, vl-1, vl′, vl+1′,…, vj+1′}. According to COM, s(vm′) ≤ s(vm), where l ≤ m ≤ j + 1. According to Formula (7), we have UB(Path′) ≤ UB(Pathj ∪ {vj+1}) ≤ st.
Therefore, we need not expand vl′ for pathl-1. □
For COM, after sorting the vertices in VG′ by keyword score in descending order, the candidate user group is got by traversing the subset of VG′.
In S–M, given the current user group P, for v ∈ P, we would traverse the subset of nbr(v, G′) \ P (in Algorithm 5). The order of traversed subsets is the same as that of COM. So, if we sort the vertices in nbr(v, G′) \ P by keyword score in descending order, MUScoreA (Theorem 8) can be used in S–M.
In M–M, given the current user group Path, we would traverse the subset of Child (in Algorithm 9). The order of traversed subsets is the same as that of COM. So, if we sort the vertices in Child by keyword score in descending order, MUScoreA (Theorem 8) can be used in M–M.
MUScoreB (Theorem 9) could not be used in S–M and M–M. Let's illustrate it with the following example.
Example 7
As shown in Fig. 8, for S–M, if the current user group is P = {uq, …, u}, the current expansion center is vertex u, current vertex set to be expanded is Child = {a, b, c, d, e}, the vertices in Child are sorted by keyword score in descending order. If UB(P ∪ {a, b, c}) ≤ st, S–M can use Theorem 8, but cannot use Theorem 9. Next, the reason is given.
[See PDF for image]
Fig. 8
Example 7
Because s(e) ≤ s(d) ≤ s(c), we have UB(P ∪ {a, b, e}) ≤ UB(P ∪ {a, b, d}) ≤ UB(P ∪ {a, b, c}) ≤ st. Therefore, Theorem 8 can be used to prune the subsets {a, b, d}, {a, b, e} of Child.
If subset {a, c, d} and {a, c, e} of the Child is pruned by Theorem 9, it will not affect the correctness. Because s(e) ≤ s(d) ≤ s(c) ≤ s(b), UB(P ∪ {a, c, e}) ≤ UB(P ∪ {a, c, d}) ≤ UB(P ∪ {a, b, c}) ≤ st. If the subset {a, c} of Child is pruned by Theorem 9, the candidate user group containing P ∪ {a, c} will miss. Because according to the expansion rule of S–M, when P ∪ {a, c} is the current user group, there is a chance to expand another vertex except {a, b, c, d, e}, such as vertex f. When s(f) > s(b), we have UB(P ∪ {a, c} ∪ {f}) > UB(P ∪ {a, b, c}), that is, there may be a candidate user group containing P ∪ {a, c} ∪ {f}. Therefore, the subset {a, c} should not be pruned by Theorem 9, that is, Theorem 9 cannot be used in S–M.
Similarly, M–M can use Theorem 8, but cannot use Theorem 9.
Experimental evaluation
In this section, we evaluate the four methods, namely the combination-based group expansion method (COM), the single–multi group expansion method (S–M), the single–single group expansion method (S–S) and the multi–multi group expansion method (M–M).
Datasets and setting
Three real datasets Brightkite [33], Gowalla [33], Foursquare [34, 35] and a synthetic dataset Syn [36] are used in the experiment. The basic information of the datasets is shown in Table 4.
Table 4. Statistics of datasets
Dataset | # of vertices | # of edges | Max degree | Min degree | Average degree |
|---|---|---|---|---|---|
Brightkite | 58,228 | 214,078 | 1,134 | 1 | 7.35 |
Gowalla | 196,591 | 950,327 | 14,730 | 1 | 9.67 |
Foursquare | 2,153,469 | 13,549,236 | 107,676 | 0 | 12.58 |
Syn | 4,000,000 | 33,600,469 | 27 | 14 | 16.80 |
For Brightkite and Gowalla, each user contains a lot of check-in information, while RCGQ only needs one check-in information, so if the user has more than one check-in information, we select the latest check-in information as the user's location. For Syn, we randomly assign a location to any user A in the social network. Then if A's friend B has not yet been assigned a location, B will be randomly assigned a location, and the distance between B and A will not exceed 80,000. In this way, we obtain the spatial location for each user in Syn. In this paper, each user in the dataset is assigned a real user comment from Twitter [37, 38, 39–40] as the user's keywords information, which is similar to [41]. The assignment follows Zipf distribution [42]. For each experimental result, we run 50 random queries to get the average value. If the average query time exceeds 1000 s, the query time is set to INF.
Implementation We implement all the algorithms on the Eclipse platform using Java. The experimental machine configuration is the Windows 10 operating system, Intel(R) Core(TM) i5-10500 CPU @ 3.10 GHz processor and 8G RAM.
Parameter settings
Table 5 is a summary of the pruning strategies used in this paper. The query parameter setting is shown in Table 6, where w is the number of query keywords, and the value range of w is [1, 5], which is similar to that of [43].
Table 5. Summary of Pruning Strategy
Pruning strategy | Description |
|---|---|
Diam | User group diameter pruning strategy (Sect. 6.1) |
MinDeg | Minimum degree pruning strategy (Sect. 6.2) |
MUScoreA | Multi-path keyword score pruning strategy (Sect. 6.3) |
MUScoreB | Multi-path keyword score pruning strategy for COM (Sect. 6.3) |
Table 6. Parameters setting
Parameter | Description | Value range | Default setting |
|---|---|---|---|
h | User group size | 6, 7, 8, 9, 10 | 8 |
k | Indicates the k-core | 4, 5, 6, 7 | 5 |
w | Number of query keywords | 1, 2, 3, 4, 5 | 3 |
t | Result number | 10, 20, 40, 80, 160 | 10 |
freq | Query keyword frequency | L, M − , M, M + , H | M |
nf | Degree of the query user | [0,10], [11, 20], [21, 30], [31, 40] | [11, 20] |
r (km) | Radius for the query | 20, 40, 80, 160, 320 | 80 |
The frequency of query keywords may have an impact on query performance. We use freq to represent the type of query keyword frequency, including H, M + , M, M − and L. In the following, for the given query q = < uq, r, h, k, t, T = Ø > , we give the steps to obtain w query keywords, and put them into T.
(1) Find the subgraph Gr of G within the range specified by the query q.
(2) Using Diam pruning for Gr to get a new subgraph Gr′.
(3) Get the k-core G′ containing uq for Gr′.
(4) Obtain the keywords of all users in G′, which are stored in a list keys.
(5) Sort the keywords in keys by the number of occurrences in descending order.
(6) If freq = H, T = keys[0.. (w − 1)].
If freq = L, T = keys[(|keys|− w).. (|keys|− 1)].
If freq = M, T = keys[mid.. (mid + w − 1)], where mid =|keys| / 2.
If freq = M + , T = keys[mid.. (mid + w − 1)], where mid =|keys| / 4.
If freq = M − , T = keys[mid.. (mid + w − 1)], where mid =|keys| * 3 / 4.
In the experiments, we may have to use different set of queries to evaluate the effect of different parameters. In Table 7, Qi represents a set of 50 queries for each dataset, so does Qi*. Given Qi, let Ui = {q.uq | q ∈ Qi} and Ti = {q.T | q ∈ Qi}. For Qi and Qj, if i ≠ j, we would have Ui ≠ Uj and Ti ≠ Tj. Given Qi*, let Ui* = {q.uq | q ∈ Qi*} and Ti* = {q.T | q ∈ Qi*}. For Qi and Qi*, we have Ui = Ui* and Ti ≠ Ti*.
Table 7. Set of queries for different parameters
Parameter | Experimental result | Set of queries |
|---|---|---|
h | Figure 9 | Q1 |
freq | Figure 10 | Q1* |
w | Figure 11 | Q1* |
t | Figure 12 | Q2 |
nf | Figure 13 | Q3([0,10]), Q1( [11, 20]), Q4([21, 30]), Q5([31, 40]) |
r | Figure 14 | Q1* |
k | Figure 15 | Q6* |
Scalability | Figure 16 | Q1 |
Effect of pruning strategy on running time
In this section, to evaluate the effect of pruning rules, we use the query set Q1, and the default parameters. In Table 8–11, ‘all’ indicates the query time of the method using all pruning strategies, and ‘all-noXX’ indicates the query time when all pruning strategies except XX are used for this method. In Table 8, ‘all-noMUScoreAB’ indicates the query time when COM uses all pruning strategies except MUScoreA and MUScoreB.
Table 8. The effect of pruning strategies on COM
Dataset | All | All-noDiam | All-noMinDeg | All-noMUScoreAB | All-noMUScoreB |
|---|---|---|---|---|---|
BrightKite | 0.485 s | 0.674 s | INF | 0.706 s | 0.495 s |
Gowalla | 0.450 s | 8.657 s | INF | 0.564 s | 0.470 s |
In Table 8, it can be seen that MinDeg has the most obvious effect on COM. Diam has obvious effect on COM for Gowalla.
In Table 9, it can be seen that MinDeg has the most obvious effect on S–M. The effect of Diam on S–M is not obvious. MUScoreA has some effect on S–M.
Table 9. The effect of pruning strategies on S–M
Dataset | All | All-noDiam | All-noMinDeg | All-noMUScoreA |
|---|---|---|---|---|
BrightKite | 0.871 s | 0.886 s | 45.894 s | 1.247 s |
Gowalla | 1.115 s | 1.119 s | 460.350 s | 2.039 s |
In Table 10, it can be seen that MinDeg has the most obvious effect on S–S. The effect of other pruning strategies is not obvious.
Table 10. The effect of pruning strategies on S–S
Dataset | All | All-noDiam | All-noMinDeg | All-noMUScoreA |
|---|---|---|---|---|
BrightKite | 0.508 s | 0.538 s | INF | 0.570 s |
Gowalla | 0.548 s | 0.636 s | INF | 0.577 s |
In Table 11, it can be seen that MinDeg has the most obvious effect on M–M. Other pruning strategies has some effect on M–M.
Table 11. The effect of pruning strategies on M–M
Dataset | All | All-noDiam | All-noMinDeg | All-noMUScoreA |
|---|---|---|---|---|
BrightKite | 0.013 s | 0.023 s | 0.751 s | 0.044 s |
Gowalla | 0.023 s | 0.059 s | 1.783 s | 0.041 s |
From the above results, we can see that MinDeg has the most obvious effect on all the methods. The advantages of M–M are more obvious than other methods. In the later experiment, the methods with all the pruning strategies will be evaluated. The pruning strategies used by each method are shown in Table 12.
Table 12. Summary of Methods
Method | The used pruning strategy |
|---|---|
COM | Diam + MinDeg + MUScoreA + MUScoreB |
S–M | Diam + MinDeg + MUScoreA |
S–S | Diam + MinDeg + MUScoreA |
M–M | Diam + MinDeg + MUScoreA |
Varying the query parameters
In this section, we study the effect of seven parameters on the performance of our proposed methods. We evaluate the impact of a parameter change on query time, and use default values for other parameters.
Varying h. Figure 9 shows the impact of user group size h on the running time, where h represents the user group size. With the increase of h, the runtime of all methods is increasing, and the running time of M–M is significantly less than other methods. According to Property 2, when Diam pruning strategy is used, with h increasing, less vertices are pruned and more users need to be searched. On the other side, with h increasing, the number of user groups with size of h increases. From Fig. 9, when h > 8, S–M consumes more time than other methods. One of the reasons is that the same candidate user group may be found more than once in S–M.
[See PDF for image]
Fig. 9
Varying h
Varying freq. Figure 10 shows the effect of the query keyword frequency freq on the running time. The performance of all methods will increase with the increasing of freq from L to H. With the increase of freq, for any user in VG′(in Algorithm 1), its user keyword score may be larger. So, it is easier to find candidate user groups with high score, such that the keyword score pruning strategy may work earlier.
[See PDF for image]
Fig. 10
Varying freq
Varying w. Figure 11 shows the impact of w on the running time, where w is the number of query keywords. We can see that the number of query keywords has little effect on the running time. According to Fig. 10, freq affects the query time. But in Fig. 11, freq is set to the default value. This experiment shows that the frequency of query keywords is more important than the number of query keywords.
[See PDF for image]
Fig. 11
Varying w
Varying t. Figure 12 shows the effect of the number of result sets t on the running time. With the increase of t, the running time of all algorithms changes little. The reason may be that the t results are not obtained incrementally. In order to confirm that the candidate user group found is the final result, more user groups need to be searched. With the increase of t, the search space may not change much.
[See PDF for image]
Fig. 12
Varying t
Varying nf. Figure 13 shows the impact of nf on the running time, where nf represents the degree of the query user in G. With the increase of nf, the runtime of all methods is increasing. The reason is that if the query user has more friends, there will be more users within the given range. So, there may be more candidate user groups, which leads to a larger search space.
[See PDF for image]
Fig. 13
Varying nf
Varying r. Figure 14 shows the effect of r on the running time, where r is the query radius. With the increase of r, the query time of all algorithms is on the rise. The larger r is, the more users will be searched. Then more candidate user groups need to be searched, which will take more time.
[See PDF for image]
Fig. 14
Varying r
Varying k. In order to keep the queries in Q6* unchanged with the change of k, we set t to 1. Other parameters still use the default values. Figure 15 shows the impact of k on the running time. With the increase of k, the running time of all methods decreases. With k increasing, there will be less users in k-core because the vertex in k-core should have larger degree. So, fewer candidate user groups need to be searched.
[See PDF for image]
Fig. 15
Varying k
Scalability. Figure 16 shows the impact of user group size h on the running time. In Foursquare dataset, with the increase of h, the runtime of all methods is increasing, and the running time of M–M is significantly less than other methods. In Syn dataset, all algorithms require less time to find the top-t results, maybe because of the degree of user. In Table 4, although the average degree in Foursquare and Syn is similar, the maximum degree in Foursquare is 107,676 and the minimum degree is 0, while the maximum degree in Syn is 27 and the minimum degree is 14. Syn has fewer candidate user groups, leading to less running time.
[See PDF for image]
Fig. 16
Scalability
Conclusion
In this paper, we define a new kind of query, namely, the range constrained group query on attribute social graph (RCGQ). RCGQ aims to return t candidate user groups with the best score. Four methods are proposed to solve the RCGQ problem, including COM, S–M, S–S and M–M. Several effective pruning strategies are given, including user group diameter pruning strategy (Diam), minimum degree pruning strategy (MinDeg), and keyword score pruning strategy. Diam is based on the property of n-plex. Finally, we conduct extensive experiments and then demonstrate the efficiency of our methods. Because RCGQ problem is NP-hard, in order to deal with the larger group size h, we should propose an approximate algorithm for RCGQ in the future work.
Author contributions
Zijun Chen: Conceptualization, Methodology, Writing – review & editing. Wenwen Shao: Conceptualization, Methodology, Software, Writing – original draft. Wenyuan Liu: Writing – review & editing. All authors have read and agreed to the published version of the manuscript.
Declarations
Conflict of interest
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Ma, Y., Ye, Y., Wang, G., Bi, X., Wang, Y.: Personalized geo-social group queries in location-based social networks. In: Proceedings of the 23rd International Conference on Database Systems for Advanced Applications, pp. 388–405 (2018)
2. Guo, F., Yuan, Y., Wang, G., Chen, L., Lian, X., Wang, Z.: Cohesive group nearest neighbor queries over road-social networks. In: Proceedings of the 35th IEEE International Conference on Data Engineering, pp. 434–445 (2019)
3. Yao, K; Chang, L. Efficient size-bounded community search over large networks. Proc. VLDB Endow.; 2021; 14,
4. Liu, B., Zhang, F., Zhang, W., Lin, X., Zhang Y.: Efficient community search with size constraint. In: Proceedings of the 37th IEEE International Conference on Data Engineering, pp. 97–108 (2021)
5. Chen, L., Liu, C., Zhou, R., Xu, J., Li, J.: Finding effective geo-social group for impromptu activity with multiple demands (2019). arXiv:1912.08322
6. Liu, Q., Zhu, Y., Zhao, M., Huang, X., Xu, J., Gao, Y.: VAC: vertex-centric attributed community search. In: Proceedings of the 36th IEEE International Conference on Data Engineering, pp. 937–948 (2020)
7. Guo, F., Yuan, Y., Wang, G., Zhao, X, Sun, H.: Multi-attributed community search in road-social networks. In: Proceedings of the 37th IEEE International Conference on Data Engineering, pp. 109–120 (2021)
8. Sozio, M., Gionis, A.: The community-search problem and how to plan a successful cocktail party. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 938–948 (2010)
9. Cui, W., Xiao, Y., Wang, H., Wang, W.: Local search of communities in large graphs. In: Proceedings of the International Conference on Management of Data, pp. 991–1002 (2014)
10. Huang, X., Cheng, H., Qin, L., Tian, W., Yu, J. X.: Querying k-truss community in large and dynamic graphs. In: Proceedings of the International Conference on Management of Data, pp. 1311–1322 (2014)
11. Huang, X; Lakshmanan, LVS; Yu, JX; Cheng, H. Approximate closest community search in networks. Proc. VLDB Endow.; 2015; 9,
12. Barbieri, N; Bonchi, F; Galimberti, E; Gullo, F. Efficient and effective community search. Data Min. Knowl. Disc.; 2015; 29,
13. Song, X; Wang, B; Yang, X; Qin, J; Zhao, L; Niu, L. SGEQ: a new social group enlarging query with size constraints. IEEE Access; 2020; 8, pp. 193608-193620. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3032987]
14. Cai, M., Chang, L.: Efficient closest community search over large graphs. In: Proceedings of the 25th International Conference on Database Systems for Advanced Applications, pp. 569–587 (2020)
15. Yang, D-N; Chen, Y-L; Lee, W-C; Chen, M-S. On social-temporal group query with acquaintance constraint. Proc. VLDB Endow.; 2011; 4,
16. Fang, Y; Cheng, R; Luo, S; Hu, J. Effective community search for large attributed graphs. Proc. VLDB Endow.; 2016; 9,
17. Huang, X; Lakshmanan, LV. Attribute-driven community search. Proc. VLDB Endow.; 2017; 10,
18. Li, R.-H., Qin, L., Ye, F., Yu, J. X., Xiao, X., Xiao, N., Zheng, Z.: Skyline community search in multi-valued networks. In: Proceedings of the 2018 International Conference on Management of Data, pp. 457–472 (2018)
19. Zhang, Z., Huang, X., Xu, J., Choi, B., Shang, Z.: Keyword-centric community search. In: Proceedings of the 35th IEEE International Conference on Data Engineering, pp. 422–433 (2019)
20. Liu, W., Sun, W., Chen, C., Huang, Y., Jing, Y., Chen, K.: Circle of friend query in geo-social networks. In: Proceedings of the 17th International Conference on Database Systems for Advanced Applications, pp. 126–137 (2012)
21. Armenatzoglou, N; Papadopoulos, S; Papadias, D. A general framework for geo-social query processing. Proc. VLDB Endow.; 2013; 6,
22. Shen, C-Y; Yang, D-N; Huang, L-H; Lee, W-C; Chen, M-S. Socio-spatial group queries for impromptu activity planning. IEEE Trans. Knowl. Data Eng.; 2016; 28,
23. Fang, Y; Cheng, R; Li, X; Luo, S; Hu, J. Effective community search over large spatial graphs. Proc. VLDB Endow.; 2017; 10,
24. Zhu, Q; Hu, H; Xu, C; Xu, J; Lee, W-C. Geo-social group queries with minimum acquaintance constraints. VLDB J.; 2017; 26,
25. Chen, L; Liu, C; Zhou, R; Li, J; Yang, X; Wang, B. Maximum co-located community search in large scale social networks. Proc. VLDB Endow.; 2018; 11,
26. Li, Q., Zhu, Y., Yu, J. X.: Skyline cohesive group queries in large road-social networks. In: Proceedings of the 36th IEEE International Conference on Data Engineering, pp. 397–408 (2020)
27. Kim, J., Guo, T., Feng, K., Cong, G., Khan, A., Choudhury, F. M.: Densely connected user community and location cluster search in location-based social networks. In: Proceedings of the 2020 International Conference on Management of Data, pp. 2199–2209 (2020)
28. Seidman, SB. Network structure and minimum degree. Soc. Netw.; 1983; 5,
29. Batagelj, V., Zaversnik, M.: An O(m) algorithm for cores decomposition of networks (2003). arXiv:cs/0310049
30. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of the SIGMOD'84, Proceedings of Annual Meeting, pp. 47–57 (1984)
31. Seidman, SB; Foster, BL. A graph theoretic generalization of the clique concept. J. Math. Sociol.; 1978; 6,
32. Brandes, U; Erlebach, T. Network Analysis: Methodological Foundations; 2005; Berlin, Springer: [DOI: https://dx.doi.org/10.1007/b106453]
33. Cho, E., Myers, S. A., Leskovec, J.: Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1082–1090 (2011)
34. Sarwat, M; Levandoski, JJ; Eldawy, A; Mokbel, MF. Lars*: an efficient and scalable location-aware recommender system. IEEE Trans. Knowl. Data Eng.; 2014; 26,
35. Levandoski, J. J., Sarwat, M., Eldawy, A., Mokbel, M. F.: LARS: a location-aware recommender system. In: Proceedings of the IEEE 28th International Conference on Data Engineering, pp. 450–461 (2012)
36. Newman, MEJ; Watts, DJ. Renormalization group analysis of the small-world network model. Phys. Lett. A; 1999; 263, pp. 341-346.1732095 [DOI: https://dx.doi.org/10.1016/S0375-9601(99)00757-4]
37. Lou, T; Tang, J; Hopcroft, JE; Fang, Z; Ding, X. Learning to predict reciprocity and triadic closure in social networks. ACM Trans. Knowl. Discov. Data; 2013; 7,
38. Hopcroft, J. E., Lou, T., Tang, J.: Who will follow you back? Reciprocal relationship prediction. In: Proceedings of the 20th ACM Conference on Information and Knowledge Management, pp. 1137–1146 (2011)
39. Dong, Y., Tang, J., Wu, S., Tian, J., Chawla, N. V., Rao, J., Cao, H.: Link prediction and recommendation across heterogeneous social networks. In: Proceedings of the 12th IEEE International Conference on Data Mining, pp. 181–190 (2012)
40. Zhang, J; Fang, Z; Chen, W; Tang, J. Diffusion of “following” links in microblogging networks. IEEE Trans. Knowl. Data Eng.; 2015; 27,
41. Attique, M; Afzal, M; Ali, F; Mehmood, I; Ijaz, MF; Cho, H-J. Geo-social top-k and skyline keyword queries on road networks. Sensors; 2020; 20,
42. Joachims, T.: A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 128–136 (2001)
43. Liu, Q; Zhu, Z; Xu, J; Gao, Y. MaxiZone: maximizing influence zone over geo-textual data. IEEE Trans. Knowl. Data Eng.; 2021; 33,
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.