Content area
Clustering of protein association networks is crucial for understanding protein relationships and cellular functions. This research employs a Mixed Integer Linear Programming (MILP) approach to cluster proteins in the FprAl flavodiiron protein network, containing 61 proteins and 230 connections. The first stage applies MILP to minimize the maximum diameter within the clusters, focusing only on the topological characteristics of the network. A refined model is then followed, designed to maximize the functional similarity within each cluster. This is achieved using a Jaccard similarity matrix based on the molecular function aspect of the Gene Ontology (GO) terms, which emphasizes biological relevance in the clustering process. The integration of topological and functional criteria into the second MILP model enables effective clustering that captures both connectivity and biological context. Validation through gene sequence alignment supports the functional relevance of the formed clusters, revealing biologically significant groupings. The findings suggest that incorporating functional similarities into the clustering improves the biological interpretability of gene groups, demonstrating the potential for refined prediction of gene function. Future directions include incorporating additional GO aspects such as biological processes and cellular components, as well as advanced metrics for sequence similarity, to further improve the precision of clustering.
Abstract
Clustering of protein association networks is crucial for understanding protein relationships and cellular functions. This research employs a Mixed Integer Linear Programming (MILP) approach to cluster proteins in the FprAl flavodiiron protein network, containing 61 proteins and 230 connections. The first stage applies MILP to minimize the maximum diameter within the clusters, focusing only on the topological characteristics of the network. A refined model is then followed, designed to maximize the functional similarity within each cluster. This is achieved using a Jaccard similarity matrix based on the molecular function aspect of the Gene Ontology (GO) terms, which emphasizes biological relevance in the clustering process. The integration of topological and functional criteria into the second MILP model enables effective clustering that captures both connectivity and biological context. Validation through gene sequence alignment supports the functional relevance of the formed clusters, revealing biologically significant groupings. The findings suggest that incorporating functional similarities into the clustering improves the biological interpretability of gene groups, demonstrating the potential for refined prediction of gene function. Future directions include incorporating additional GO aspects such as biological processes and cellular components, as well as advanced metrics for sequence similarity, to further improve the precision of clustering.
Keywords
Mixed Integer Linear Programming (MILP), Graph Clustering, Gene Ontology (GO).
(ProQuest: ... denotes formulae omitted.)
1. Introduction
In bioinformatics, comprehending how genes are arranged functionally inside biological networks is a key challenge. Graph clustering plays an important role in identifying functionally related gene groups within biological networks [1]. Gene interaction networks can identify gene groupings with similar biological activities by capturing essential functional interactions between proteins. Clustering is also crucial for understanding functional linkages, including protein function prediction, disease gene discovery, and biomarker identification [2]. Clustering biological networks poses distinct challenges due to their high connectedness, complex connections, and functional heterogeneity [3]. The complex structure of biological networks is frequently not well captured by conventional clustering techniques like к-means and hierarchical clustering, particularly when working with large-scale, highly interconnected datasets [4].
To address these limitations, Mixed Integer Linear Programming (MILP) has emerged as a powerful tool for precise and explainable clustering in gene networks [5]. MILP integrates both topological structure and biological function, ensuring that clusters maintain physical proximity and functional coherence. However, existing MILP-based clustering models primarily focus on network topology and often overlook functional similarities derived from Gene Ontology (GO) terms [6, 7]. This topology-driven clustering may fail to capture biologically relevant groupings, limiting its effectiveness in function prediction.
To refine gene clusters, this research adopts a two MILP models. The first model minimizes the maximum cluster diameter, ensuring that genes grouped together are topologically close within the network. The second model enhances the functional significance of these clusters by maximizing intra-cluster similarity using GO term-based Jaccard similarity. By comparing the two approaches, we gain insights into the structural meaning and biological interpretability of the final clusters.
Despite the effectiveness of MILP-based clustering, scalability remains a key challenge, particularly in large-scale protein association networks [8]. The integration of functional similarity metrics introduces computational complexity due to the high-dimensional nature of GO annotations [9]. While previous studies have explored heuristic and metaheuristic strategies for optimizing MILP clustering models, a gap remains in systematically balancing topological and functional constraints [10].
To bridge this gap, this study proposes a novel MILP-based framework that systematically integrates functional similarity measures into gene network clustering. The method is validated using a dataset derived from the STRING protein association network [11], specifically for the Flavodiiron protein FprAl network, containing 61 genes and 230 interactions. The effectiveness of the approach is assessed through functional validation using gene sequence alignments, ensuring biologically meaningful clusters.
By systematically integrating functional and topological considerations, this study enhances the precision of gene function clustering and supports downstream applications such as functional annotation, disease gene discovery, and biomarker identification. The proposed framework demonstrates the potential of optimization-based approaches in bioinformatics, paving the way for more robust data-driven gene function prediction.
2. Mixed Integer Linear Programming Formulation
Graph clustering is essential for analyzing complex networks by grouping nodes based on similarity or connectivity. In bioinformatics, it helps identify functional modules in gene interaction networks. Clustering algorithms optimize criteria such as modularity, connectivity, or specific properties like diameter or edge weights. Mixed Integer Linear Programming (MILP) provides a structured optimization approach, allowing precise definition of objectives and constraints for customized clustering solutions.
2.1 MILP Algorithm for Diameter Minimization
The MILP formulation for minimizing the maximum diameter within clusters aims to create compact clusters, ensuring that the longest shortest path between any two nodes within the same cluster is as short as possible. The mathematical model is as follows:
... (1)
Subject to:
... (2)
... (3)
... (4)
... (5)
Where хц = 1 if node i belongs to cluster I, and k is the number of clusters. The objective minimizes Dmax, the largest intra-cluster distance, ensuring compact clusters. Constraint (3) ensures Dmax captures the longest pairwise distance d¡j within each cluster. This approach is beneficial for generating similar groups where the intra-cluster distances are minimized, promoting tighter and more cohesive clusters [12].
2.2 MILP Model for Maximizing Weight Within Clusters
The second MILP model refocuses the clustering objective to maximize the total weight of edges within each cluster, where the weights are determined by the go function similarity scores between nodes. The model can be described with the following elements:
... (6)
Subject to:
... (7)
... (8)
... (9)
... (10)
This model aims to maximize the sum of weighted edge similarities within each cluster, where wij represents the Gene Ontology (GO) functional similarity score between nodes i and j. By promoting intra-cluster connectivity among functionally similar proteins, this approach enhances the biological relevance of the clusters, making it suitable for biological network analysis where clusters correspond to functional modules.
To achieve this, the formulation introduces the binary variable indicating whether an edge between nodes i and j exists within cluster c. Constraints (8) and (9) ensure that an edge is assigned to a cluster only if both nodes belong to the same cluster. Additionally, constraint (10) maintains balanced cluster sizes by ensuring that the number of nodes in each cluster remains close to with a tolerance 8. This model shifts the optimization focus from minimizing cluster diameter (as in the first model) to maximizing intra-cluster similarity, leveraging functional relationships to refine clustering outcomes.
3. Methodology
The study employs network analysis and MILP-based clustering on STRING-databse gene interaction data, evaluating results through clustering metrics and gene sequence alignment for biological relevance.
3.1 Data Collection and Go Function Similarity Score
The dataset for this research was sourced from the STRING database [11], which contains detailed information on gene interactions and associated metrics such as physical proximity on chromosomes, experimental interactions, and predicted associations based on computational methods. Additionally, sequences for all the genes were collected to facilitate subsequent sequence alignment analyses, enhancing the robustness of the clustering approach by validating the functional groupings. The calculation of functional similarity between genes was based on the Jaccard similarity coefficient [13], which is defined as:
... (11)
Here, A and B represent sets of Gene Ontology (GO) terms associated with the genes being compared. For each gene, GO terms were aggregated into sets, and the Jaccard index was computed to quantify the overlap in functional annotations between every pair of genes in the study. This measure of functional similarity was instrumental in refining the gene clustering process, ensuring that genes grouped together shared a higher degree of functional relatedness.
3.2 Identifying Optimum Number of Cluster
A Graph Convolutional Network (GCN) was utilized to analyze the gene interaction network, leveraging its ability to capture complex node dependencies through connection patterns and edge features. This makes GCNs well-suited for learning from non-Euclidean data, where traditional machine learning methods often fall short [14]. The network was transformed into a GCN-compatible format using node degree as the node feature and multiple interaction attributes as edge features. Key edge features included phylogenetic co-occurrence, experimentally determined interactions, gene fusion etc. A model was then constructed to project these features into a lower-dimensional embedding space, facilitating clustering by simplifying complex network structures.
To determine the optimal number of clusters, the Elbow Method was applied. This technique plots the within-cluster sum of squares (inertia) against the number of clusters, identifying the point where further increases provide diminishing returns. This "elbow" represents a balance between compactness and cluster count, ensuring effective partitioning of the network [15].
3.3 Solving the Optimization Model
The gene network was clustered using a Mixed Integer Linear Programming (MILP) model, solved with the Gurobi optimizer [16] via the Python API (gurobipy). The number of clusters (k) was determined using the Elbow Method and incorporated into the MILP constraints. Two models were implemented: Maximum Diameter Minimization (Equations 1-5), which minimized the longest intra-cluster distance, and Weight Maximization (Equations 6-10), which optimized functional similarity. Binary variables were assigned to indicate gene-cluster memberships, ensuring that each gene was placed in exactly one cluster and that the total number of clusters did not exceed k. The Gurobi solver was then executed to find optimal solutions, where the topology-based model focused on compact clusters, while the weight-based model integrated functional similarity for biologically meaningful groupings.
3.4 Comparison of Gene Sequence Similarity Inside the Clusters
To compare the clusters derived from the Maximum Diameter Minimization and Weight Maximization models, multiple clustering evaluation metrics were employed to assess both structural and functional properties. Modularity was used to measure the strength of intra-cluster connectivity compared to a random connectivity configuration, with higher values indicating better-defined clusters. Cohesion was measured using the shortest path distances between nodes within the same cluster, where lower values indicate more compact and tightly connected clusters. Separation was evaluated by calculating the shortest path distances between nodes belonging to different clusters, with higher values signifying well-separated clusters. Additionally, average conductance was used to assess inter-cluster connectivity, where lower values indicate stronger intra-cluster connections relative to inter-cluster edges, ensuring well-defined clusters. Weighted Mean Sequence Similarity was computed as the weighted average sequence similarity within each cluster using the BLOSUM62 substitution matrix for global alignments, providing insight into biological relevance. These metrics were collectively used to provide a quantitative assessment of clustering effectiveness by integrating both network topology and functional similarity.
4. Results and Discussion
The optimal number of clusters (k) for gene clustering was determined using the Elbow Method, applied to embeddings generated by a Graph Convolutional Network (GCN). As shown in Figure 1, the inertia-representing the within-cluster sum of squares-declines sharply before stabilizing as the number of clusters increases. The most pronounced reduction in inertia occurs when transitioning from one to two clusters, after which further increases yield minimal improvement in within-cluster variance. This pattern indicates that k=2 is the optimal choice for this analysis. This selection of this input parameter was crucial as it served as an input parameter for the Mixed Integer Linear Programming (MILP) models used in subsequent phases of the study. With this constraint, the MILP models were designed to minimize the maximum intra-cluster diameter while maximizing intra-cluster functional similarity, leveraging both topological and functional data from the gene network. The choice of k=2 ensured that the resulting clusters were both computationally efficient and biologically meaningful, capturing distinct patterns of gene interactions while maintaining interpretability. This strategic decision was instrumental in balancing model performance, biological relevance, and computational feasibility, ultimately enhancing the effectiveness of the MILP-bascd clustering approach.
Figure 2 depicts the MILP model that minimizes the maximum intra-cluster diameter, resulting in 24 and 37 genes assigned to two clusters based on network topology. In contrast, Figure 3 illustrates the weight maximization MILP model, which optimizes functional similarity, yielding 29 and 32 genes per cluster. This distinction highlights how different optimization criteria influence the structural and functional composition of the clusters.
The Weight Maximization model outperforms the Diameter Minimization model in modularity, separation, and weighted mean sequence similarity, indicating stronger biological relevance and better-defined clusters. Meanwhile, the Diameter Minimization model exhibits higher conductance which reflects stronger inter-cluster connectivity, may facilitate interactions between functionally related genes despite their separation.
The small difference in sequence similarity between the two models suggests that topological clustering alone captures a significant portion of functional relationships. This indicates that while maximizing intra-cluster functional similarity improves clustering quality, sequence similarity alone may not fully distinguish functionally related genes due to evolutionary divergence and sequence redundancy. Future work could incorporate additional Gene Ontology aspects, refine similarity scoring methods, or integrate advanced sequence comparison techniques such as profile-based alignments to enhance functional clustering. Expanding the dataset to larger gene networks could further validate the robustness of the models and optimize clustering effectiveness for complex biological systems.
5. Conclusion
This study successfully applied Mixed Integer Linear Programming (MILP) to cluster genes within a network by leveraging both topological structure and functional similarity derived from Gene Ontology (GO) terms. Two MILP models were developed: Maximum Diameter Minimization, which focused solely on network topology, and Weight Maximization, which integrated functional similarity into the clustering process. The Elbow Method determined the optimal number of clusters, and sequence similarity analysis was conducted using BLOSUM62 to evaluate the biological relevance of the clustering results. The findings indicate that while both models effectively grouped genes, the Weight Maximization approach produced clusters with higher modularity, improved separation, and stronger functional coherence, as reflected in the Calinski-Harabasz Index and Weighted Mean Sequence Similarity. However, The small difference in sequence similarity suggests that functional coherence is inherently captured by the topology-based model, as the STRING interaction network already embeds sequence similarity.
Acknowledgements
This research is supported by USDA NIFA grant 2021-67015-34532.
References
[1] Y. Lu, Z. Yu, Y Wang, Z. Ma, K.-C. Wong, and X. Li, "Gmhcc: High-throughput analysis of biomolecular data using graph-based multiple hierarchical consensus clustering," Bioinformatics, vol. 38, no. 11, pp. 3020-3028, 2022.
[2] K. Berahmand, E. Naşiri, and Y. Li, "Spectral clustering on protein-protein interaction networks via constructing affinity matrix using attributed graph embedding," Computers in Biology and Medicine, vol. 138, p. 104937, 2021.
[3] J. Guo, M. Wang, and X. Guo, "Hierarchical hyperedge graph transformer: Toward dynamic interactions of brain networks for neurodevelopmental disease diagnosis," IEEE Access, vol. 12, pp. 162145-162156, 2024.
[4] A. A. Wani, "Comprehensive analysis of clustering algorithms: Exploring limitations and innovative solutions," PeerJ Computer Science, vol. 10, p. e2286, 2024.
[5] H. Pirim, B. Eksioglu, and F. W. Glover, "A novel mixed integer linear programming model for clustering relational networks," Journal of Optimization Theory and Applications, vol. 178, pp. 123-145, 2018.
[6] J. P. Sáráivá, M. Oswald, A. Biering, D. Röll, C. Assmann, T. Klassert, M. Blaess, K. Czakai, R. Claus, J. Löffler, H. Slevogt, and R. König, "Fungal biomarker discovery by integration of classifiers," BMC Genomics, vol. 18, p. 601, 2017.
[7] S. Wang, X. Chen, B. J. Frederisy, B. A. Mbakogu, A. D. Kanne, P. Khosravi, and W. B. Hayes, "On the current failure-but bright future-of topology-driven biological network alignment," arXiv preprint, vol. 2204.11999, 2022.
[8] M. E. Vinyard, Engineering Computational Single-Cell and CRISPR-Based Functional Genomics Tools to Study Dynamic Gene Regulation in Hematopoietic Development and Disease. PhD thesis, Harvard University, 2024.
[9] S. Panda, Algorithms for Large-Scale Adversarial Decision Problems. PhD thesis, Vanderbilt University, 2018.
[10] A. Name(s), "Title of the paper," Journal/Conference Name, vol. XX, no. X, pp. XX-XX, 2023.
[11] C. v. Mering, M. Huynen, D. Jaeggi, S. Schmidt, P. Bork, and B. Snel, "String: a database of predicted functional associations between proteins," Nucleic acids research, vol. 31, no. 1, pp. 258-261, 2003.
[12] H. Pirim, A. Aghalari, and M. Marufuzzaman, "Clustering network data using mixed integer linear programming," Recent Applications in Graph Theory, p. 89, 2022.
[13] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, "Using of jaccard coefficient for keywords similarity," in Proceedings of the international multiconference of engineers and computer scientists, vol. 1, pp. 380-384, 2013.
[14] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, "Graph convolutional networks: a comprehensive review," Computational Social Networks, vol. 6, no. 1, pp. 1-23, 2019.
[15] P. Patel, B. Sivaiah, and R. Patel, "Approaches for finding optimal number of clusters using к-means and agglomerative hierarchical clustering techniques," in 2022 international conference on intelligent controller and computing for smart power (IC1CCSP), pp. 1-6, IEEE, 2022.
[16] R. Anand, D. Aggarwal, and V. Kumar, "A comparative analysis of optimization solvers," Journal of Statistics and Management Systems, vol. 20, no. 4, pp. 623-635, 2017.
Copyright Institute of Industrial and Systems Engineers (IISE) 2025