The discovery of gene-disease links is an important challenge in the biological and biomedical domains, as it presents opportunities in tasks such as disease detection and drug repurposing. Machine Learning approaches that predict gene-disease associations significantly accelerate this process by leveraging the biological knowledge represented in ontologies and the structure that knowledge graphs use to organize data.
State-of-the-art approaches for gene-disease association typically use Knowledge Graph Embeddings together with other Machine Learning algorithms, modeling the problem as a binary classification task over gene-disease pairs. Although this is generally the logic behind a Machine Learning approach, the effectiveness of link classification approaches is limited by the need to generate negative examples, by the absence of relationships between genes and diseases, and by the fact that only some Knowledge Graph Embeddings are able to directly predict gene-disease associations.
This dissertation explores the differences between addressing the gene-disease association problem as a link classification task and as a link prediction task. We compare means of combining vectors and classification algorithms for the link classification approach. We also analyze the influence of considering several knowledge graph embeddings in both the link classification and link prediction approaches. The methods were evaluated using biomedical data sources such as DisGeNET and popular ontologies.
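To illustrate what "combining vectors" can mean in the link classification setting, the following is a minimal sketch, assuming that each gene and each disease already has an embedding vector of the same dimension. The operator names (`concatenate`, `hadamard`, `average`) and the helper `pair_features` are hypothetical choices for illustration; the abstract does not state which operators the dissertation compares.

```python
from typing import Callable, Dict, List

Vector = List[float]


def concatenate(gene: Vector, disease: Vector) -> Vector:
    """Place the two embeddings side by side (output has twice the dimension)."""
    return gene + disease


def hadamard(gene: Vector, disease: Vector) -> Vector:
    """Element-wise product of the two embeddings."""
    return [g * d for g, d in zip(gene, disease)]


def average(gene: Vector, disease: Vector) -> Vector:
    """Element-wise mean of the two embeddings."""
    return [(g + d) / 2 for g, d in zip(gene, disease)]


# Hypothetical registry of combination operators (an assumption, not the
# dissertation's actual list).
OPERATORS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concatenate": concatenate,
    "hadamard": hadamard,
    "average": average,
}


def pair_features(gene: Vector, disease: Vector, operator: str = "hadamard") -> Vector:
    """Build one feature vector for a gene-disease pair.

    The result would be passed to a binary classifier that labels the
    pair as associated or not associated.
    """
    return OPERATORS[operator](gene, disease)
```

The resulting feature vector would then be fed to a standard classifier, which is where the additional design choices of the link classification approach arise.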
Our results show that enriching the semantic representation of diseases does not improve the performance of link classification methods, nor that of link prediction methods when predicting disease-linked genes. However, it does improve the performance of link prediction methods when predicting gene-linked diseases. The results also suggest that link prediction methods better exploit the semantic richness encoded in knowledge graphs through various ontologies and additional links between ontology classes.
Employing link prediction over link classification provides advantages across design aspects and techniques. For instance, link prediction leverages relationships between target entities within knowledge graphs and does not require the synthetic generation of negative examples. While link prediction methods offer an end-to-end approach that directly generates predictions from the learned embeddings, link classification methods require integrating various Machine Learning methods with strategies to combine the embeddings, leading to increased complexity and potential loss of information.
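As a rough illustration of how a link prediction method can generate predictions directly from learned embeddings, the sketch below ranks candidate diseases for a gene using a translational scoring function in the style of TransE. This is an assumption made for illustration only: the abstract does not name the embedding models used, and the function names `translational_score` and `rank_diseases` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def translational_score(head: Vector, relation: Vector, tail: Vector) -> float:
    """TransE-style plausibility: negative Euclidean norm of (head + relation - tail).

    Scores closer to zero indicate a more plausible triple.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_diseases(
    gene: Vector,
    relation: Vector,
    diseases: Dict[str, Vector],
) -> List[Tuple[str, float]]:
    """Score every candidate disease for one gene and sort from most to least plausible."""
    scored = [
        (name, translational_score(gene, relation, vec))
        for name, vec in diseases.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy embeddings, invented purely for this example.
    gene_vec = [0.0, 0.0]
    associated_with = [1.0, 0.0]
    candidates = {"disease_a": [1.0, 0.0], "disease_b": [3.0, 4.0]}
    for name, score in rank_diseases(gene_vec, associated_with, candidates):
        print(name, round(score, 3))
```

In this setup no separate classifier or vector combination step is needed: the ranking comes straight from the embedding model's own scoring function, which is the end-to-end property described above.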