1. Introduction
Drug discovery is a complicated, costly process with a low success rate. Bringing a new drug from the initial concept to the market is estimated to take about 10∼15 years and 0.8∼1.5 billion dollars. Despite the enormous cost and time that pharmaceutical companies invest, only about 10% of drugs successfully pass FDA evaluation every year [1,2]. Nobel Laureate James Black argued that the most solid foundation for new drug discovery is to start from old drugs [3]. Drug repurposing, which repositions existing drugs to find new therapeutic uses for them, can shorten drug research and development time, reduce unexpected drug toxicity, and move drugs into clinical phases sooner [4,5,6]. Jin et al. [7] reported that repositioned drugs account for about 30% of newly FDA-approved drugs and vaccines.
Drug-target interaction (DTI) identification, which aims to find potential targets for existing drugs or potential drugs for existing targets, is an important step in drug repositioning. With the integration of numerous heterogeneous biological data, a variety of computational approaches have been developed to systematically infer possible DTIs. Several reviews [4,8] have summarized this work. Inspired by these reviews, in this study we discuss relevant data repositories, the different computational models and their advantages, and the challenges of DTI identification.
2. Data Representation and Repositories
2.1. Benchmark Data Set
Most computational models for DTI identification use the datasets provided by Yamanishi et al. [9]. The details are shown in Table 1. Yamanishi et al. [9] provided three types of data: a drug similarity matrix $S^d$, a target similarity matrix $S^t$, and a drug-target interaction network $Y$, where $y_{ij} = 1$ if drug $d_i$ and target $t_j$ are linked; otherwise, $y_{ij} = 0$.
2.2. Flowchart
Various DTI inference algorithms have been designed over the past two decades. These methods usually integrate the datasets provided by Yamanishi et al. [9] with other biological information from various public databases, train the proposed computational model, and finally score the interaction probabilities for unknown drug-target pairs. Figure 1 summarizes this workflow.
2.3. DTI Relevant Databases
Experimental data provide abundant information for DTI identification and have significantly improved the performance of DTI prediction models. It is feasible to merge DTI data from different databases. To resolve conflicts between data values from different repositories during merging, Liu et al. [10], for example, set a priority for each DTI and gave precedence to the more reliable data source. Liu et al. [10] merged compound-protein interaction data retrieved from Matador, DrugBank, and STITCH. Matador and DrugBank are manually curated databases. STITCH is a comprehensive repository collected from four sources: manually curated databases, experimental validation, text mining, and model prediction. In particular, STITCH assigns each DTI a score ranging from 0 to 1000 that indicates how strongly the DTI is supported by these four types of evidence. Because DTIs from Matador and DrugBank are supported by biochemical experiments and the literature, Liu et al. [10] gave these DTIs the highest score of 1000.
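The priority-based merging described above can be sketched as follows. This is a minimal illustration, not the procedure of Liu et al. [10]: the record layout, the numeric priorities, and the function name `merge_dtis` are assumptions made for the example.

```python
# Hypothetical priorities: a higher value wins when two sources disagree.
SOURCE_PRIORITY = {"Matador": 2, "DrugBank": 2, "STITCH": 1}
MAX_SCORE = 1000  # top of the STITCH confidence scale


def merge_dtis(records):
    """Merge (drug, target, source, score) records into one entry per pair.

    Records from the curated sources (Matador, DrugBank) receive the
    maximum score. On conflict, the higher-priority source is kept and
    ties are broken by the higher score.
    """
    merged = {}
    for drug, target, source, score in records:
        if source in ("Matador", "DrugBank"):
            score = MAX_SCORE
        key = (drug, target)
        candidate = (SOURCE_PRIORITY[source], score, source)
        if key not in merged or candidate[:2] > merged[key][:2]:
            merged[key] = candidate
    return {pair: (src, score) for pair, (_, score, src) in merged.items()}


records = [
    ("D1", "T1", "STITCH", 450),
    ("D1", "T1", "DrugBank", None),  # curated entry, no native score
    ("D2", "T3", "STITCH", 700),
    ("D2", "T3", "STITCH", 910),
]
merged = merge_dtis(records)
```

Here the curated DrugBank record overrides the lower-confidence STITCH record for the same pair, and the stronger of two STITCH records is retained.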
Luo et al. [11] designed a novel network integration pipeline for DTI prediction, DTINet. DTINet merges DTI data from a multi-view perspective through the following steps:
Step 1. Extracting related data from different databases: (i) drugs, DTIs and drug-drug interactions from DrugBank; (ii) proteins, and protein-protein interactions from HPRD [12]; (iii) diseases, drug-disease and protein-disease associations from the Comparative Toxicogenomics Database [13]; (iv) side-effects and drug-side-effect associations from SIDER [14].
Step 2. Excluding isolated entities (nodes) which have no edges in the network.
Step 3. Integrating four types of nodes and six types of associations (edges) in Step 1 and constructing a heterogeneous network.
Step 4. Building multiple similarity networks to further increase the network heterogeneity.
Step 5. Removing homologous proteins or similar drugs from the constructed heterogeneous network to reduce potential redundancy in the DTIs, using a preset cutoff for each criterion: (i) removing the DTIs involving homologous proteins with sequence identity scores above the cutoff; (ii) removing the DTIs involving similar drugs with Tanimoto coefficients above the cutoff; (iii) removing the DTIs involving drugs with Jaccard similarity scores of side effects above the cutoff; (iv) removing the DTIs involving proteins or drugs associated with similar diseases (Jaccard similarity scores above the cutoff); (v) removing the DTIs involving either homologous proteins with sequence identity scores above a cutoff or similar drugs with Tanimoto coefficients above a cutoff.
Some DTI repositories are described below:
2.3.1. DrugBank
The DrugBank database [15] (
2.3.2. SuperTarget
The SuperTarget database [16] (
2.3.3. STITCH
The STITCH database [17] (
2.3.4. ZINC
The ZINC database [18] (
2.3.5. IUPHAR/BPS Guide to PHARMACOLOGY
IUPHAR/BPS Guide to PHARMACOLOGY [19] (
2.3.6. SIDER
The SIDER database [20] (
2.3.7. BindingDB
BindingDB [21] (
2.3.8. TTD
Therapeutic Target Database (TTD) [22] (
2.3.9. MATADOR
The MATADOR database [16] (
2.3.10. ChEMBL
The ChEMBL database [23] (
2.3.11. DCDB
The DCDB database [24] (
2.4. DTI Relevant Software Packages
Drug and target features are important for unknown DTI classification. Researchers have developed various software packages to extract abundant drug and target features.
2.4.1. RDKit
RDKit [25] (
2.4.2. ChemDes
ChemDes [26] (
2.4.3. OpenBabel
OpenBabel [27] (
2.4.4. Rchemcpp
Rchemcpp [28] (
2.4.5. PyDPI
PyDPI [29] (
2.4.6. Rcpi
Rcpi [30] (
2.4.7. KeBABS
KeBABS [31] (
2.4.8. PROFEAT
PROFEAT [32] (
2.4.9. Pse-in-One
Pse-in-One [33] (
2.4.10. ProtrWeb
ProtrWeb [34] (
2.5. On-Line Tools/Web-Service for DTI Prediction
Stimulated by the increasing interest in DTI identification and the availability of various open data repositories, many online tools have been developed to find new DTIs. Users can run these tools without dealing with the underlying mathematical models or their computational complexity, which significantly lowers the collaboration barriers among researchers from different disciplines. Several online tools are described below [35].
2.5.1. DrugE-Rank
DrugE-Rank [36] (
2.5.2. DINIES
DINIES [37] (
2.5.3. Drug2Gene
Drug2Gene [38] (
2.5.4. iGPCR-Drug
iGPCR-Drug [39] (
2.5.5. SynSysNet
SynSysNet [40] (
2.5.6. SDTNBI
SDTNBI [41] (
2.5.7. DTome
DTome [42] (
2.5.8. PharmMapper
PharmMapper [43] (
2.5.9. SwissTargetPrediction
SwissTargetPrediction [44] (
2.5.10. TargetNet
TargetNet [45] (
2.5.11. DT-Web
DT-Web [46] (
3. Network-Based Methods
Computational methods for DTI prediction can be roughly classified into four categories: ligand-based approaches, docking approaches, network-based approaches, and machine learning-based approaches. Ligand-based approaches assume that similar drugs tend to bind similar targets and predict underlying DTIs from ligand similarities. However, their predictions may be unreliable when a protein has too few known ligands. Docking approaches fully exploit the 3D structures of proteins, but they cannot find new DTIs when those structures are unknown. Network-based and machine learning-based approaches aim to address the limitations of these two types of methods. Network-based methods predict potential DTIs efficiently by integrating graph-based techniques with various biological data.
3.1. DSSI
Campillos et al. [47] developed a drug side-effect similarity-based inference method (DSSI). DSSI can be divided into three steps:
Step 1: Developing a measure to compute the probability that two drugs share a common target based on drugs’ chemical similarity (2D Tanimoto coefficient, y):
(1)
Step 2: Measuring the probability that two drugs simultaneously interact with a target based on their phenotypic side-effect similarity(x):
(2)
Step 3: Designing a sigmoid function to compute the probabilities of two drugs sharing the same target incorporating chemical similarities and phenotypic side-effect similarities.
(3)
where the parameters of the sigmoid function are fitted to the data.
DSSI can find possible DTIs; however, it can only infer potential associations for drugs with known side-effect information, which seriously limits its application.
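To make the two ingredients of DSSI concrete, the sketch below computes a 2D Tanimoto coefficient on sets of fingerprint bits and passes the two similarities through a generic logistic function. The functional form and the coefficients `w_x`, `w_y`, and `bias` are illustrative placeholders chosen for this example; they are not the fitted equations of Campillos et al. [47].

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def shared_target_probability(x, y, w_x=4.0, w_y=4.0, bias=-5.0):
    """Generic logistic combination of side-effect similarity x and
    chemical similarity y. The weights are placeholders, NOT the fitted
    DSSI parameters."""
    return 1.0 / (1.0 + math.exp(-(w_x * x + w_y * y + bias)))


fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}
y = tanimoto(fp_a, fp_b)  # 3 shared bits out of 6 distinct bits
p_low = shared_target_probability(x=0.1, y=y)   # dissimilar side effects
p_high = shared_target_probability(x=0.9, y=y)  # similar side effects
```

With equal chemical similarity, the pair with more similar side-effect profiles receives the higher probability of sharing a target, which is the behavior the three steps aim for.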
3.2. MTOI
Yang et al. [48] developed a robust computational model to mine new drug targets based on multiple target optimal intervention solutions (MTOI). MTOI has two stages: drug target identification and optimal multi-target control solution inference. In stage 1, MTOI first defined the disease state by combining experimental data from patients and cells in abnormal conditions, and the desired state as the normal physiological state to be restored; it then selected the activities of potential drug targets and calculated the median deviation (m.d.) of these activities between the normal and disease states to score candidate drug targets. In stage 2, MTOI added drug reactants to the screened drug targets and obtained a multi-target intervention solution by selecting intensities. MTOI identified candidate drug targets and best restored an inflammation-related network to a normal state. Figure 2 describes the details.
3.3. NRWRH
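Chen et al. [49] assumed that similar drugs tend to interact with similar targets and presented Network-based Random Walk with Restart on the Heterogeneous network (NRWRH), which integrates the drug similarity network, the protein similarity network, and the DTI network into a single heterogeneous network. NRWRH computes the interaction probabilities for unknown drug-target pairs by a random walk on the heterogeneous network, whose transition matrix is:
$$M = \begin{bmatrix} M_{TT} & M_{TD} \\ M_{DT} & M_{DD} \end{bmatrix} \qquad (4)$$
where the diagonal blocks hold the transition probabilities within the target and drug networks and the off-diagonal blocks hold the transition probabilities between the two networks. NRWRH finally defined the following iteration model to compute the interaction probability:
$$p_{t+1} = (1 - r) M^{T} p_t + r p_0 \qquad (5)$$
where $p_t$ is the probability vector over all nodes at step $t$, $p_0$ is the initial distribution on the seed nodes, and $\lambda$ and $r$ are the probability of jumping from the target/drug network to the drug/target network and the probability of restarting the walk at the seed nodes, respectively. Figure 3 describes the details.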
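A random walk with restart on a small heterogeneous network can be sketched as below. This is a simplified illustration under stated assumptions: nodes are ordered drugs first and then targets, the walk is seeded on the query drug only, and the toy matrices and the values of `lam` and `r` are invented for the example rather than taken from [49].

```python
def row_normalize(row):
    """Scale a row to sum to 1; an all-zero row stays all-zero."""
    total = sum(row)
    return [v / total for v in row] if total else [0.0] * len(row)


def build_transition(S_d, S_t, Y, lam):
    """Row-stochastic transition matrix; nodes are drugs, then targets."""
    Y_T = [list(col) for col in zip(*Y)]
    M = []
    for i, links in enumerate(Y):
        # A drug without known targets never jumps to the target network.
        stay = 1.0 - lam if sum(links) else 1.0
        M.append([stay * v for v in row_normalize(S_d[i])]
                 + [lam * v for v in row_normalize(links)])
    for j, links in enumerate(Y_T):
        stay = 1.0 - lam if sum(links) else 1.0
        M.append([lam * v for v in row_normalize(links)]
                 + [stay * v for v in row_normalize(S_t[j])])
    return M


def random_walk_with_restart(M, p0, r, tol=1e-12, max_iter=10000):
    """Iterate p <- (1 - r) * M^T p + r * p0 until the L1 change < tol."""
    size = len(M)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [(1.0 - r) * sum(M[k][i] * p[k] for k in range(size)) + r * p0[i]
               for i in range(size)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


S_d = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]  # 3 drugs
S_t = [[1.0, 0.3], [0.3, 1.0]]                             # 2 targets
Y = [[1, 0], [1, 1], [0, 0]]                               # known DTIs
M = build_transition(S_d, S_t, Y, lam=0.3)
p0 = [1.0, 0.0, 0.0, 0.0, 0.0]       # seed: the first drug
p = random_walk_with_restart(M, p0, r=0.7)
target_scores = p[len(Y):]           # stationary probabilities of the targets
```

The stationary probabilities of the target nodes rank the candidate targets of the seed drug; in this toy network the known target of the first drug scores higher than the unlinked one.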
3.4. DBSI, TBSI, and NBI
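Cheng et al. [50] viewed a DTI network with $n$ drugs and $m$ targets as a bipartite graph and developed three DTI prediction methods: drug-based similarity inference (DBSI), target-based similarity inference (TBSI), and network-based inference (NBI). DBSI assumes that a query drug $d_i$ that is similar to known drugs interacting with a target $t_j$ may also associate with $t_j$, and defines a linkage score between $d_i$ and $t_j$:
$$v^{D}_{ij} = \frac{\sum_{l=1, l \neq i}^{n} S^{d}(d_i, d_l)\, y_{lj}}{\sum_{l=1, l \neq i}^{n} S^{d}(d_i, d_l)} \qquad (6)$$
TBSI assumes that a query target $t_j$ that is similar to known targets interacting with a drug $d_i$ may also associate with $d_i$, and defines a linkage score between $d_i$ and $t_j$:
$$v^{T}_{ij} = \frac{\sum_{l=1, l \neq j}^{m} S^{t}(t_j, t_l)\, y_{il}}{\sum_{l=1, l \neq j}^{m} S^{t}(t_j, t_l)} \qquad (7)$$
Given a target $t_j$, NBI defines the score of a drug $d_i$ associated with $t_j$:
$$f(i) = \sum_{l=1}^{m} \frac{y_{il}}{k(t_l)} \sum_{o=1}^{n} \frac{y_{ol}\, f_0(o)}{k(d_o)} \qquad (8)$$
where $f_0(o) = y_{oj}$ is the initial score of drug $d_o$, $k(d_o)$ is the number of targets interacting with $d_o$, and $k(t_l)$ is the number of drugs associating with $t_l$.
3.5. DTINet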
Luo et al. [11] integrated information from multiple heterogeneous networks and presented a novel network integration pipeline for DTI prediction, DTINet. DTINet uses a compact feature learning method to handle the noisy, high-dimensional, and incomplete nature of large-scale biological data and obtains low-dimensional but informative vector representations of drugs and targets. Figure 4 describes the details.
4. Machine Learning-Based Methods
Besides network-based methods, researchers have developed numerous machine learning models and algorithms to find missing DTIs. These methods can be roughly classified into five groups: Bipartite Local Model (BLM), regularized least squares, matrix factorization, deep learning, and other methods.
4.1. BLM
4.1.1. KRM
Yamanishi et al. [9] developed a Kernel Regression Method (KRM). KRM scores the interaction likelihoods for unknown drug-target pairs in three stages: constructing the pharmacological space, learning a kernel regression model that represents the correlation between the chemical/genomic space and the pharmacological feature space, and calculating feature-based similarity scores. Figure 5 describes the details.
The weights of the regression model can be computed by optimizing the following loss function:
(9)
4.1.2. BLM
Bleakley et al. [51] proposed a supervised learning-based Bipartite Local Model (BLM) to predict a novel link between a drug $d_i$ and a target $t_j$ in the following way:
Step 1: Excluding target $t_j$. For drug $d_i$, listing all its other known targets in the bipartite network and giving them the label $+1$; listing the targets not known to be targeted by $d_i$ and giving them the label $-1$.
Step 2: Finding a classification rule that discriminates the $+1$-labeled data from the $-1$-labeled data based on genomic sequence information for the targets.
Step 3: Applying this rule to predict the label of $t_j$, and thus inferring whether there is a link between $d_i$ and $t_j$.
Step 4: Fixing the same target $t_j$ and excluding drug $d_i$; listing all other known drugs interacting with $t_j$ in the bipartite network and giving them the label $+1$; listing the drugs not known to interact with $t_j$ and giving them the label $-1$.
Step 5: Finding a classification rule that discriminates the $+1$-labeled data from the $-1$-labeled data based on chemical structure information for the drugs.
Step 6: Applying this rule to predict the label of $d_i$, and thus inferring whether there is a link between $d_i$ and $t_j$.
Bleakley et al. [51] used a support vector machine (SVM) as the local classifier.
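The six steps can be sketched in a few lines. To keep the example dependency-free, the SVM of [51] is swapped for a similarity-weighted vote, which is a much simpler local classifier; averaging the two local predictions is also an assumption made here, and the toy matrices are invented.

```python
def local_score(similarities, labels, exclude):
    """Similarity-weighted vote over the training items.

    Stand-in for the local SVM: returns a value in [-1, 1], where
    positive means 'predicted to interact'.
    """
    num = den = 0.0
    for idx, (s, y) in enumerate(zip(similarities, labels)):
        if idx == exclude:  # the query item is held out
            continue
        num += s * y
        den += s
    return num / den if den else 0.0


def blm_score(Y, S_d, S_t, i, j):
    """Average of the target-side and drug-side local predictions."""
    n, m = len(Y), len(Y[0])
    # Steps 1-3: fix drug d_i, label the other targets, predict t_j.
    target_labels = [1 if Y[i][l] else -1 for l in range(m)]
    from_targets = local_score(S_t[j], target_labels, exclude=j)
    # Steps 4-6: fix target t_j, label the other drugs, predict d_i.
    drug_labels = [1 if Y[k][j] else -1 for k in range(n)]
    from_drugs = local_score(S_d[i], drug_labels, exclude=i)
    return 0.5 * (from_targets + from_drugs)


Y = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
S_d = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
S_t = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
likely = blm_score(Y, S_d, S_t, i=1, j=1)    # second drug, second target
unlikely = blm_score(Y, S_d, S_t, i=1, j=2)  # second drug, third target
```

The unknown pair formed by the second drug and second target scores high because the drug resembles one that binds this target and the target resembles one that the drug already binds.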
4.1.3. BLM-NII
Mei et al. [52] incorporated a Neighbor-based Interaction-profile Inferring model (NII) into the BLM (BLM-NII) to find potential DTIs, especially for new drugs and targets. BLM-NII consists of five steps: computing NII, computing the drug and target similarity matrices, learning a local model, computing the interaction probability, and obtaining the final results. Figure 6 describes the details.
4.2. Regularized Least Squares
4.2.1. LapRLS, NetLapRLS
Xia et al. [53] designed Laplacian regularized least squares (LapRLS) and LapRLS incorporating the DTI network (NetLapRLS) to identify underlying DTIs based on a data-dependent manifold regularization model. The details are described in Figure 7.
The model uses two undirected graphs, one over the drug domain and one over the protein domain, each including both labeled and unlabeled samples.
4.2.2. RLS
Van Laarhoven et al. [54] assumed that drugs exhibiting a similar pattern of interaction and non-interaction with the targets of a known DTI network are likely to behave similarly with respect to new targets, and that the same holds for targets. Based on this assumption, van Laarhoven et al. [54] developed a Regularized Least Squares method combined with a Gaussian Interaction Profile (GIP) kernel (RLS). RLS predicts new DTIs in three steps: separately computing the GIP kernels of drugs and targets, obtaining the kernels $K_d$ and $K_t$ by adding a small multiple of the identity matrix and integrating the two kernels into the GIP kernel, and predicting DTIs with the RLS classifier. Figure 8 describes the details.
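The first step, computing the GIP kernels, can be sketched as follows. The sketch assumes the usual Gaussian form $K(d_i, d_j) = \exp(-\gamma \|y_{d_i} - y_{d_j}\|^2)$, with the bandwidth $\gamma$ normalized by the mean squared norm of the interaction profiles; the function name and the toy matrix are illustrative, and the RLS step itself is not shown.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between binary profiles.

    gamma is gamma_prime divided by the mean squared norm of the
    profiles, so the matrix must contain at least one interaction.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-gamma * sq_dist)
    return K


Y = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]          # rows: drugs, columns: targets
K_d = gip_kernel(Y)                             # drug kernel from the rows
K_t = gip_kernel([list(c) for c in zip(*Y)])    # target kernel from the columns
```

Drugs with identical interaction profiles get a kernel value of 1, and the value decays as the profiles diverge.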
4.2.3. WNN, WNN-GIP
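Van Laarhoven et al. [55] developed a weighted nearest neighbor (WNN) method to infer association candidates for new drugs/targets. WNN defines an interaction profile score for a new drug $d$ as follows:
$$\hat{y}_d = \sum_{i=1}^{n} w_i\, y_{d_i} \qquad (10)$$
where $d_1, \ldots, d_n$ are the known drugs sorted by decreasing similarity to $d$, $y_{d_i}$ is the interaction profile of $d_i$, and the weight $w_i$ is computed from a given decay value $\eta \leq 1$ as $w_i = \eta^{i-1}$.
Van Laarhoven et al. [55] then extended GIP [54] with WNN into WNN-GIP to identify possible associations for new drugs (or targets): for a new drug $d$, WNN-GIP adds $\hat{y}_d$ as a new row to the original DTI matrix $Y$ and applies GIP to obtain the interaction profile of $d$.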
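A minimal sketch of the WNN profile, assuming that the weight of the i-th most similar known drug is the decay value raised to the power i - 1. The function name `wnn_profile` and the toy data are illustrative.

```python
def wnn_profile(similarities, Y, eta=0.8):
    """Inferred interaction profile of a new drug.

    similarities[i] is the similarity between the new drug and known
    drug i; Y[i] is the interaction profile of known drug i.
    """
    # Rank known drugs from most to least similar to the new drug.
    order = sorted(range(len(Y)), key=lambda i: similarities[i], reverse=True)
    profile = [0.0] * len(Y[0])
    for rank, i in enumerate(order):
        weight = eta ** rank  # 1 for the nearest neighbor, then eta, eta^2, ...
        for j, y in enumerate(Y[i]):
            profile[j] += weight * y
    return profile


Y = [[1, 0], [0, 1], [1, 1]]
similarities = [0.2, 0.9, 0.5]
profile = wnn_profile(similarities, Y, eta=0.5)
```

The resulting row can then be appended to the interaction matrix so that a kernel method such as GIP has a non-empty profile for the new drug.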
4.2.4. Kron-RLS
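Pahikkala et al. [56] presented a Kronecker Regularized Least Squares-based method (Kron-RLS) to score unknown drug-target pairs. Given a training set $X = \{x_1, \ldots, x_N\}$ (each $x_i$ is a drug-target pair) and their real labels $y = (y_1, \ldots, y_N)^{T}$ ($y_i = 1$ if the drug interacts with the target in $x_i$; $y_i = 0$ otherwise), Kron-RLS formulates DTI prediction as minimizing the following objective function:
$$J(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \|f\|_k^2 \qquad (11)$$
where $\|f\|_k$ is the norm of $f$ associated with the kernel function $k$. By the representer theorem, the minimizer of the above function can be written as:
$$f(x) = \sum_{i=1}^{N} a_i\, k(x, x_i) \qquad (12)$$
where the coefficient vector $a$ can be computed from the following equation:
$$(K + \lambda I)\, a = y \qquad (13)$$
where $K = K_d \otimes K_t$ when $X$ includes all drug-target pairs, and $K_d$ and $K_t$ are the kernel matrices of the drugs and targets in the training set.
4.2.5. KMDR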
Kuang et al. [57] assumed that two similar entities tend to link to similar nodes and developed a kernel matrix dimension reduction method (KMDR). KMDR defined a general formulation:
(14)
where is a drug-target pair vector, is the predicted drug-target association score matrix, and is a kernel matrix. KMDR comprises three independent sub-algorithms: KMDR-KP, KMDR-KS, and KMDR-avg.
KMDR-KP defined K as where , , , , and scored the interaction probabilities for unknown drug-target pairs by the following equation:
(15)
where , and is a diagonal matrix of .
KMDR-KS is similar to KMDR-KP but .
KMDR-avg defined two kernels: and , and scored unknown drug-target pairs based on these two kernels, respectively:
(16)
(17)
The final scores can be calculated as:
(18)
4.3. Matrix Factorization
As shown in Figure 9, matrix factorization methods can be used to complete the missing values in the DTI matrix. This type of method first factorizes $Y \in \mathbb{R}^{n \times m}$ into two matrices $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{m \times k}$ satisfying $Y \approx AB^{T}$, where the rows of $A$ and $B$ are the latent feature vectors of the drugs and targets, respectively, and $k$ is the number of latent features.
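The basic idea can be sketched with a plain regularized factorization trained by stochastic gradient descent. This is a generic illustration rather than any of the specific methods below: every entry of Y is treated as observed (unknown pairs count as 0), and the hyperparameters `k`, `lr`, `reg`, and `steps` are arbitrary choices for the toy matrix.

```python
import random


def factorize(Y, k=2, steps=2000, lr=0.05, reg=0.01, seed=0):
    """Factorize Y (n x m) into A (n x k) and B (m x k) with Y ~ A B^T."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    A = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                err = Y[i][j] - sum(A[i][f] * B[j][f] for f in range(k))
                for f in range(k):
                    a, b = A[i][f], B[j][f]
                    # Gradient step on the squared error with L2 penalty.
                    A[i][f] += lr * (err * b - reg * a)
                    B[j][f] += lr * (err * a - reg * b)
    return A, B


def squared_error(Y, A, B):
    """Sum of squared differences between Y and A B^T."""
    return sum((Y[i][j] - sum(a * b for a, b in zip(A[i], B[j]))) ** 2
               for i in range(len(Y)) for j in range(len(Y[0])))


Y = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
A0, B0 = factorize(Y, steps=0)   # random initialization only
A, B = factorize(Y)
err_before = squared_error(Y, A0, B0)
err_after = squared_error(Y, A, B)
```

After training, the product of the two factors serves as the score matrix, and the scores of unknown pairs are read off as predictions.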
4.3.1. KBMF2K
Gönen [58] treated DTI prediction as a binary classification problem and developed a Kernelized Bayesian Matrix Factorization with twin Kernels (KBMF2K). KBMF2K integrates three experimental settings into a single unified framework: (i) finding targets that interact with a new drug, (ii) finding drugs that interact with a new target, and (iii) estimating potential associations between a new drug and a new target.
KBMF2K designed a deterministic variational approximation method based on fully conjugate probabilistic model and projected drugs and targets into a unified subspace. Figure 10 illustrates the proposed probabilistic model.
In this model, $\Lambda$ and $P$ represent the priors and the projection matrices for a chosen subspace dimensionality, respectively. The drug kernel matrix $K_d$ is used to project the drugs into a low-dimensional space, and $G_d$ consists of the low-dimensional feature representations of the drugs. Similarly, $G_t$ can be computed for the targets. Finally, the predicted interaction matrix can be calculated from $G_d$ and $G_t$.
4.3.2. PMF
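Cobanoglu et al. [59] developed a probabilistic matrix factorization method (PMF) based on the collaborative filtering algorithm. Using a probabilistic model with Gaussian noise, PMF defines the conditional probability of the observed interactions as follows:
$$p(Y \mid A, B, \sigma^2) = \prod_{i=1}^{n} \prod_{j=1}^{m} \left[ \mathcal{N}(y_{ij} \mid a_i^{T} b_j, \sigma^2) \right]^{I_{ij}} \qquad (19)$$
where $\mathcal{N}(y_{ij} \mid a_i^{T} b_j, \sigma^2)$ denotes the Gaussian probability density function for $y_{ij}$ with mean $a_i^{T} b_j$ and variance $\sigma^2$, $a_i$ and $b_j$ are the latent vectors of drug $d_i$ and target $t_j$, and $I_{ij}$ is an indicator function equal to 1 if $y_{ij}$ is known and 0 otherwise.
PMF places zero-mean spherical Gaussian priors on A and B:
$$p(A \mid \sigma_A^2) = \prod_{i=1}^{n} \mathcal{N}(a_i \mid 0, \sigma_A^2 I) \qquad (20)$$
$$p(B \mid \sigma_B^2) = \prod_{j=1}^{m} \mathcal{N}(b_j \mid 0, \sigma_B^2 I) \qquad (21)$$
PMF then computes the log of the posterior distribution over A and B:
$$\ln p(A, B \mid Y, \sigma^2, \sigma_A^2, \sigma_B^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{m} I_{ij} (y_{ij} - a_i^{T} b_j)^2 - \frac{1}{2\sigma_A^2} \sum_{i=1}^{n} a_i^{T} a_i - \frac{1}{2\sigma_B^2} \sum_{j=1}^{m} b_j^{T} b_j + C \qquad (22)$$
where $C$ is a constant that does not depend on A and B.
Finally, the underlying DTI score matrix can be computed:
$$\hat{Y} = AB^{T} \qquad (23)$$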
4.3.3. MSCMF
Zheng et al. [60] proposed a Multiple Similarities Collaborative Matrix Factorization (MSCMF) method by integrating matrix factorization, collaborative filtering and relevant biological information including chemical structures and ATC codes of drugs and genomic sequence, GO and protein-protein interaction network of targets. MSCMF found possible DTIs based on the following seven steps:
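Step 1: Building an objective function that minimizes the squared error between $Y$ and the product of $A$ and $B$:
$$\min_{A,B} \|Y - AB^{T}\|_F^2 \qquad (24)$$
Step 2: Introducing a weighted low-rank approximation model to distinguish labeled drug-target pairs from unlabeled pairs:
$$\min_{A,B} \|W \circ (Y - AB^{T})\|_F^2 \qquad (25)$$
where $W$ is a weight matrix and $\circ$ denotes the element-wise product; $W_{ij} = 1$ if $y_{ij}$ is labeled, namely, interacting or non-interacting; otherwise, $W_{ij} = 0$.
Step 3: Applying Tikhonov regularization to avoid overfitting of A and B to the training data:
$$\min_{A,B} \|W \circ (Y - AB^{T})\|_F^2 + \lambda_l \left( \|A\|_F^2 + \|B\|_F^2 \right) \qquad (26)$$
where $\lambda_l$ is a regularization coefficient.
Step 4: Representing the similarity between two drugs as an approximation of the inner product of the corresponding drug feature vectors:
$$S^{d} \approx AA^{T} \qquad (27)$$
Similarly, target similarity can be represented as:
$$S^{t} \approx BB^{T} \qquad (28)$$
Step 5: Linearly combining multiple similarity matrices:
$$S^{d} = \sum_{q} \omega_d^{q} S_d^{q}, \qquad S^{t} = \sum_{q} \omega_t^{q} S_t^{q} \qquad (29)$$
where $\sum_{q} \omega_d^{q} = 1$ and $\sum_{q} \omega_t^{q} = 1$. $\omega_d$ and $\omega_t$ are the weights of the multiple similarity matrices of drugs and targets, respectively.
Step 6: Developing the entire objective function and scoring unknown drug-target pairs:
$$\min_{A,B,\omega_d,\omega_t} \|W \circ (Y - AB^{T})\|_F^2 + \lambda_l \left( \|A\|_F^2 + \|B\|_F^2 \right) + \lambda_d \|S^{d} - AA^{T}\|_F^2 + \lambda_t \|S^{t} - BB^{T}\|_F^2 + \lambda_\omega \left( \|\omega_d\|^2 + \|\omega_t\|^2 \right) \qquad (30)$$
where $\lambda_d$, $\lambda_t$, and $\lambda_\omega$ are regularization coefficients.
The model can be solved with the alternating least squares algorithm.
Step 7: Computing the interaction probabilities for unknown drug-target pairs:
$$\hat{Y} = AB^{T} \qquad (31)$$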
4.3.4. NRLMF
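Liu et al. [61] designed a Neighborhood Regularized Logistic Matrix Factorization method (NRLMF) to model the probability that a drug interacts with a target. NRLMF first models the interaction probability between a drug $d_i$ and a target $t_j$ based on logistic matrix factorization:
$$p_{ij} = \frac{\exp(a_i^{T} b_j)}{1 + \exp(a_i^{T} b_j)} \qquad (32)$$
where $a_i$ and $b_j$ are the latent vectors of $d_i$ and $t_j$. Placing zero-mean spherical Gaussian priors on $A$ and $B$, NRLMF then minimizes the following objective function to calculate the interaction probabilities for unknown drug-target pairs:
$$\min_{A,B} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ (1 + c y_{ij} - y_{ij}) \ln\left(1 + \exp(a_i^{T} b_j)\right) - c y_{ij} a_i^{T} b_j \right] + \frac{\lambda_d}{2} \|A\|_F^2 + \frac{\lambda_t}{2} \|B\|_F^2 \qquad (33)$$
where $c$ is a constant that weights the observed interactions, $\sigma_d^2$ and $\sigma_t^2$ are used to control the Gaussian distribution variances, $\lambda_d = 1/\sigma_d^2$, $\lambda_t = 1/\sigma_t^2$, and $\|A\|_F$ and $\|B\|_F$ denote the Frobenius norms of A and B, respectively.
The final objective function, which adds the neighborhood regularization, can be described as:
$$\min_{A,B} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ (1 + c y_{ij} - y_{ij}) \ln\left(1 + \exp(a_i^{T} b_j)\right) - c y_{ij} a_i^{T} b_j \right] + \frac{1}{2} \mathrm{tr}\left[ A^{T} (\lambda_d I + \alpha L^{d}) A \right] + \frac{1}{2} \mathrm{tr}\left[ B^{T} (\lambda_t I + \beta L^{t}) B \right] \qquad (34)$$
where $L^{d}$ and $L^{t}$ are the graph Laplacians of the drug and target nearest-neighbor graphs, and $\alpha$ and $\beta$ weight the neighborhood regularization.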
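The logistic matrix factorization at the core of NRLMF can be sketched as below. This is a simplified illustration: the neighborhood regularization terms are omitted, a single regularization coefficient `lam` is used for both factors, plain batch gradient descent replaces the optimizer of [61], and all hyperparameter values are arbitrary.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_mf(Y, k=2, c=5.0, lam=0.1, lr=0.02, steps=2000, seed=0):
    """Weighted logistic matrix factorization by batch gradient descent."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    A = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
    B = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        # Derivative of the loss with respect to each logit a_i . b_j:
        # (1 + c*y - y) * p_ij - c*y, with known interactions weighted by c.
        R = [[(1 + c * Y[i][j] - Y[i][j]) * sigmoid(dot(A[i], B[j]))
              - c * Y[i][j] for j in range(m)] for i in range(n)]
        grad_A = [[lam * A[i][f] + sum(R[i][j] * B[j][f] for j in range(m))
                   for f in range(k)] for i in range(n)]
        grad_B = [[lam * B[j][f] + sum(R[i][j] * A[i][f] for i in range(n))
                   for f in range(k)] for j in range(m)]
        A = [[A[i][f] - lr * grad_A[i][f] for f in range(k)] for i in range(n)]
        B = [[B[j][f] - lr * grad_B[j][f] for f in range(k)] for j in range(m)]
    return A, B


Y = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
A, B = logistic_mf(Y)
P = [[sigmoid(dot(a, b)) for b in B] for a in A]  # interaction probabilities
```

Because known interactions are weighted c times more than unknown pairs, the model is pushed harder to explain the positives than to drive every unlabeled pair toward zero.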
4.3.5. DNILMF
Hao et al. [62] extended NRLMF and proposed a Dual-Network integrated Logistic Matrix Factorization method (DNILMF). DNILMF first calculated the interaction probabilities for unknown drug-target pairs:
(35)
DNILMF then computed the final interaction scores by maximizing the following objective function:
(36)
where ∘ denotes the Hadamard product.
4.4. Deep Learning
4.4.1. DeepDTIs
Wen et al. [63] used Deep Belief Network (DBN) to infer potential DTIs without classifying each target into different classes. DeepDTIs identified novel DTIs through three steps:
Step 1: Choosing the most simple and common features to describe drugs and targets: representing chemical compounds with extended connectivity fingerprints and targets with protein sequence composition descriptors.
Step 2: Abstracting feature representations based on DBN. DBN used by DeepDTIs consisted of five layers: the first layer (the input layer) is the calculated features, the second, third and fourth layer are the hidden layers, and the last layer is output layer.
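Suppose that $x$ is a training sample. DeepDTIs models the joint probability distribution between $x$ and the $l$ hidden layers $h^1, \ldots, h^l$ based on the DBN:
$$P(x, h^1, \ldots, h^l) = \left( \prod_{k=0}^{l-2} P(h^k \mid h^{k+1}) \right) P(h^{l-1}, h^l) \qquad (37)$$
where $x = h^0$, $P(h^k \mid h^{k+1})$ is the visible-hidden conditional probability distribution at level $k$, and $P(h^{l-1}, h^l)$ is the visible-hidden joint probability distribution at the top level.
Step 3: Building a classification model with known label DTIs.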
4.4.2. EENN
Gao et al. [64] developed an End-to-End Neural Network (EENN) model to identify DTI candidates directly from raw chemical structures and amino acid sequences. EENN contains four parts: describing drugs and proteins based on related biological information, projecting drugs and proteins into dense vector spaces by integrating a graph-based convolutional neural network and long short-term memory recurrent neural networks, forming the context matrix for drugs and proteins with an attentive pooling network and computing weighted sums of the context matrix, and predicting the interaction probabilities for unknown drug-target pairs by inference with a siamese network. The details are shown in Figure 11.
4.4.3. Stacked Autoencoder
Wang et al. [65] designed a novel computational model that finds possible DTIs using a stacked autoencoder, a deep learning model. The proposed method can automatically extract hidden information from raw data and select highly representative features through iterations over multiple layers.
The method can be grouped into four parts: describing each DTI (sample) based on 881 chemical structures of drugs and the position-specific scoring matrix of the protein, reconstructing features with the stacked autoencoder, classifying unknown drug-target pairs with a random forest classifier, and predicting labels for test samples. The details are shown in Figure 12.
In step 2, Wang et al. first encoded the training sample $X$ into the hidden representation $H$ by the mapping $f_{\theta}$:
$$H = f_{\theta}(X) = s_f(WX + b) \qquad (38)$$
where $s_f$ is the activation function, and $W$ and $b$ are the weight parameters and bias vector $\theta = \{W, b\}$, respectively. The representation of the hidden layer $H$ is then mapped into the output layer $Z$ by the mapping $g_{\theta'}$:
$$Z = g_{\theta'}(H) = s_g(W'H + b') \qquad (39)$$
where $s_g$ is the activation function, and $W'$ and $b'$ are the weight parameters and bias vector $\theta' = \{W', b'\}$, respectively. The parameters can be learned by minimizing the following loss function:
(40)
where the two terms of the loss are the reconstruction error and the weight decay cost, respectively. The hidden layer learns the features and reduces the dimension of the original data through this mapping. The highest hidden layer of the autoencoder can be used as the features of the raw data extracted by the stacked autoencoder.
4.5. Other Methods
4.5.1. RBM
Wang et al. [66] learned associated probabilities of unknown drug-target pairs using a two-layer restricted Boltzmann machine (RBM) where visible units encoded types of DTIs and hidden units represented latent features of DTIs. Figure 13 describes the details.
4.5.2. NetCBP
Chen et al. [67] exploited a semi-supervised learning-based prediction model (NetCBP) combined with network consistency. NetCBP assumed that there existed coherent interactions between drugs ranked based on their correlations to a query drug and targets ranked based on their correlations to the hidden targets of the query drug, and then designed a learning model to maximize the rank coherences relevant to known DTIs. The details are described in Figure 14.
5. Discussion
Drug repurposing involves various computational methods [1,3]. Of these techniques, DTI inference is one of the most important foundations [68,69]. In this paper, we summarized the data sources and related representations involved in DTI prediction. We mainly introduced two classes of typical computational models, network-based methods and machine learning-based methods. These two types of models can be applied to target proteins without any known 3D structure information and achieve effective prediction performance [52,70]. More importantly, almost all the methods can infer novel DTIs for drugs that interact with at least one target protein [4]. Furthermore, some algorithms can effectively identify DTI candidates for new drug molecules that have no known target associations by combining drug similarity, target similarity, and DTIs [4,52,71]. However, several limitations remain.
Network-based methods are limited in application because DTI data are severely imbalanced: there are many more unknown drug-target pairs than known DTIs in a DTI network [4,72,73]. For example, the ion channel dataset provided by Yamanishi et al. [9] contains 42,840 possible drug-target pairs, but only 1476 of them are known interactions. More importantly, a DTI network usually contains several isolated subnetworks, and network-based models cannot find new associations for orphan drugs (or targets) that have no known interactions in the DTI network [4,70]. Finally, most network-based methods are biased toward drugs (or targets) that interact with many targets (or drugs) [4,73]. Network-based methods should therefore be developed further to solve these problems.
Machine learning-based methods have substantially improved DTI prediction. Table 2 and Table 3 illustrate the performance of some machine learning-based methods from Refs. [52,61]. Table 2 lists the AUC and AUPR values provided by Mei et al. [52] for KRM, BLM, RLS, and BLM-NII. These methods are BLM-based methods. The results show that BLM-NII performs better than the other BLM-based methods and suggest that the neighbor-based interaction profile helps to predict new DTIs.
Table 3 lists the AUC and AUPR values provided by Liu et al. [61] for NetLapRLS, BLM-NII, WNN-GIP, KBMF2K, and NRLMF, where NetLapRLS and WNN-GIP are regularized least squares-based methods, BLM-NII is a BLM-based method, and the remaining methods are matrix factorization-based. The minor difference between the BLM-NII results in Table 2 and Table 3 may be caused by different experimental settings.
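For reference, the two metrics can be computed from ranked predictions as sketched below. AUC is computed here as the fraction of positive-negative pairs ranked correctly, and AUPR is approximated by average precision; published AUPR values may instead use a trapezoidal or interpolated estimate, so small differences are expected. The labels and scores are made-up toy values.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [1, 0, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
auc_value = auc(labels, scores)
aupr_value = average_precision(labels, scores)
```

On imbalanced DTI data, AUPR penalizes highly ranked false positives more strongly than AUC, which is why both are usually reported.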
Matrix factorization models obtain better performance for DTI identification [59,60,61,62,74]. However, this type of method has more parameters to set and is sensitive to them [73]. Although RLS-WNN does not outperform matrix factorization methods, it is much faster and more robust to parameter selection [73,75]. BLMs process far fewer unknown drug-target pairs at a time, and thus exhibit much lower complexity than global algorithms. Furthermore, BLMs are usually fast and memory-efficient when the dataset is large [52,73]. Nevertheless, BLMs cannot handle the case in which neither the drug nor the target is included in the training dataset unless they are integrated with other methods, as in BLM-NII [74]. Deep learning-based methods achieve further improvement because of their powerful representation learning ability and are one class of powerful models for DTI prediction [63,65,76,77].
In summary, although various machine learning-based methods have been already proven to be effective for DTI identification, various challenges still remain.
(i) Most supervised learning methods suffer from the negative sample selection problem because no experimentally validated non-DTI data exist. These methods can therefore only select negative samples at random from unlabeled drug-target pairs; however, the selected negatives may contain true DTIs, which severely affects the classification performance and generalization ability of the models [4,10,56,71,73,74].
(ii) Machine learning-based prediction models are usually built and evaluated with excessively simplified experimental settings. Such settings may deviate from real-world conditions and produce overfitted results [4,74]. In particular, most machine learning-based models simply regard a DTI as an on-off association and do not consider other key factors such as quantitative affinities and molecule concentrations [56,74]. Pahikkala et al. [56] illustrated that at least four factors may lead to overoptimistic predictive results when building and evaluating supervised machine learning-based methods: experimental setting, evaluation dataset, problem formulation, and evaluation setup. Therefore, DTI identification should be modeled as a ranking or regression problem rather than a binary classification problem [74].
(iii) When predicting possible DTIs based on binary classification, the classification accuracy is biased because the results are from the simple average of two different classification models, which are constructed based on drugs and targets, respectively [4].
(iv) Most machine learning-based methods are poorly interpretable, so it is difficult to understand the potential mechanism of action of a drug from a pharmacological viewpoint [74].
Although semi-supervised learning methods overcome the negative sample selection limitation by making use of the unlabeled data, they still cannot solve the problem of classifier combination [4].
6. Conclusions and Further Research
In this section, we attempt to provide some suggestions of further research on how to improve DTI prediction performance.
6.1. Heterogeneous Data Integration
Most models incorporate chemical and genomic information; in addition, previous works have used pharmacological or phenotypic information, such as side-effect data, gene expression data, and other associated data. These data capture different aspects of drugs and targets and can boost prediction accuracy if used together. However, most existing models are limited to homogeneous information and cannot be directly applied to heterogeneous networks.
Heterogeneous data sources give diverse information and help find possible DTIs from a multi-view perspective. For instance, some genes encoding target proteins are tightly associated with particular diseases, and the therapeutic effects of drugs on these diseases reflect their biological activity on these targets. Therefore, integrating various heterogeneous data sources, such as gene-disease association networks, drug-disease association networks, and metabolic networks associated with specific diseases, can potentially improve accuracy and provide new insights.
Although several network-based strategies incorporate heterogeneous data source and derive the associated scores through network diffusion method, most existing models have some limitations and fail to give satisfactory integration paradigms: first, the noise and high-dimensionality natures of biological data easily cause predicted bias. Moreover, some network-specific information may be lost in the process of integrating multiple different networks into a single network, since edges from multiple heterogeneous networks are mixed indiscriminately in such process. Therefore, designing appropriate models to incorporate multiple relevant heterogeneous data sources still remains an open problem.
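As an illustration of the network diffusion idea discussed above, the sketch below runs a random walk with restart on a tiny merged network. It is a simplified sketch, not the NRWRH or DTINet implementation: the node names, edges, and restart probability are hypothetical, and all edge types are deliberately merged into one adjacency matrix, which is exactly the indiscriminate mixing that the paragraph above identifies as a limitation.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, tol=1e-12, max_iter=10000):
    """Return steady-state visiting probabilities for a walker restarting at `seed`."""
    n = len(adjacency)
    # Column-normalise the adjacency matrix to obtain transition probabilities.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    prob = start[:]
    for _ in range(max_iter):
        updated = [
            (1.0 - restart) * sum(transition[i][j] * prob[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(updated, prob))
        prob = updated
        if change < tol:
            break
    return prob


# Hypothetical toy network: two drugs (d0, d1) and three targets (t0, t1, t2).
NODES = ["d0", "d1", "t0", "t1", "t2"]
EDGES = [
    ("d0", "d1"),  # drug-drug similarity
    ("t0", "t1"),  # target-target similarity
    ("t1", "t2"),  # target-target similarity
    ("d0", "t0"),  # known DTI
    ("d1", "t1"),  # known DTI
]
index = {name: k for k, name in enumerate(NODES)}
adjacency = [[0.0] * len(NODES) for _ in NODES]
for a, b in EDGES:
    adjacency[index[a]][index[b]] = 1.0
    adjacency[index[b]][index[a]] = 1.0

# Diffuse from drug d0 and read off scores for targets without a known link to d0.
scores = dict(zip(NODES, random_walk_with_restart(adjacency, index["d0"])))
for target in ("t1", "t2"):
    print(target, round(scores[target], 4))
```

In this toy network, t1 receives a higher score than t2 because t2 is reachable from d0 only through t1. A model that keeps edge types separate would need a different transition scheme for each sub-network.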
6.2. Reliable Negative Sample Selection
Existing DTI datasets contain a limited number of known DTIs (positive samples) and a massive number of unknown drug-target pairs. In addition, there are no experimentally validated non-DTIs (negative samples), so most supervised classification algorithms have no choice but to randomly select unlabeled drug-target pairs as negative samples. However, these randomly selected negative samples may well contain true DTIs, which severely degrades the classification accuracy of supervised learning techniques. Therefore, although extracting positive drug-target pairs from unconfirmed data is an urgent task, designing an effective method to screen negative DTIs is more challenging [10]. To the best of our knowledge, positive-unlabeled learning [71,78,79] can learn high-quality positive samples and reliable negative samples from unlabeled data and may be one effective way to select strong negative DTIs.
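The idea of screening reliable negatives, rather than sampling them at random, can be sketched as follows. This is a simple similarity-based heuristic assumed here for illustration only; it is not the positive-unlabeled learning procedure of [71,78,79], and the function name, matrices, and values are hypothetical.

```python
def reliable_negatives(interactions, drug_sim, target_sim, n_negatives):
    """Pick the unlabeled pairs that are least similar to any known DTI.

    The closeness of an unlabeled pair (d, t) to a known pair (d', t') is taken
    as drug_sim[d][d'] * target_sim[t][t']; a pair is scored by its maximum
    closeness over all known DTIs, and the lowest-scoring pairs are returned.
    """
    positives = [
        (d, t)
        for d, row in enumerate(interactions)
        for t, label in enumerate(row)
        if label == 1
    ]
    if not positives:
        raise ValueError("at least one known DTI is required")
    scored = []
    for d, row in enumerate(interactions):
        for t, label in enumerate(row):
            if label == 1:
                continue  # known positives are never candidates
            closeness = max(
                drug_sim[d][dp] * target_sim[t][tp] for dp, tp in positives
            )
            scored.append((closeness, d, t))
    scored.sort()
    return [(d, t) for _, d, t in scored[:n_negatives]]


# Hypothetical toy data: 3 drugs x 3 targets with two known DTIs.
interactions = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
drug_sim = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
target_sim = [[1.0, 0.7, 0.1], [0.7, 1.0, 0.3], [0.1, 0.3, 1.0]]

negatives = reliable_negatives(interactions, drug_sim, target_sim, n_negatives=3)
print(negatives)
```

Pairs such as (drug 0, target 1), which closely resemble known DTIs, are kept out of the negative set, whereas random sampling could select them as negatives.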
6.3. Noncoding RNAs as Targets
Noncoding RNAs (ncRNAs) [80,81] are another new class of targets and are worth considering as drug targets. ncRNAs can control gene expression and affect disease progression, which makes them candidate targets in drug research and discovery. ncRNAs comprise multiple functionally important RNAs, including transfer RNA (tRNA), microRNA (miRNA), intronic RNA, ribosomal RNA (rRNA), long noncoding RNA, and repetitive RNA. Each class of RNA has different endogenous functions, which provides many opportunities for drug discovery and design.
ncRNAs have received increasing attention as targets. For example, microRNAs have been well reviewed as candidates for therapeutic targeting [82,83]. Both microRNA mimics and inhibitors are being designed against targets and tested in clinical trials. For instance, the drugs BMN 044/PRO044, BMN 045/PRO045, BMN 053/PRO053, and SRP-4053 can be used to treat Duchenne muscular dystrophy (DMD) by targeting dystrophin pre-mRNA [81]. Research on targeting repetitive RNAs, intronic RNAs, and miRNAs has advanced recently; long ncRNAs, which are regarded as a challenging class of possible drug targets, will require further attention.
The researchers exploited several ncRNA databases, such as NONCODE (
6.4. Environmental Factors and Genetic Factors
Various studies have reported that associations between genetic factors (GFs) and environmental factors (EFs) can greatly influence phenotypes and diseases [84,85]. Computational modeling of GF-EF interactions considerably enriches our knowledge of the mechanisms behind them. For instance, drugs, one important class of EFs, have been shown to interact with targets (GFs) [84,85]. Qiu et al. [85] suggested that miRNA biomarker signatures of drugs could be applied to evaluate the effects of cancer treatments. Therefore, the analysis and identification of interactions between drugs and genetic factors could help infer novel indications for FDA-approved drugs.
6.5. Deep Learning
In the era of big data, the quantity of biological data is increasing dramatically. The availability of these datasets has promoted the development of various modeling approaches [63,76]. Deep learning is a type of representation-learning method that can be applied to complex tasks involving heterogeneous and high-dimensional datasets. The accumulation of massive drug and target data provides large numbers of biomedical features and accelerates the application of deep learning to DTI prediction [77,86]. Although several deep learning methods [63,64,65] have been used to identify possible DTIs, many challenges remain, such as interpreting deep learning results, selecting appropriate deep architectures and model parameters, and coping with the small sample sizes and high dimensionality of the datasets. Therefore, building an appropriate deep model may be one efficient way to improve DTI prediction performance.
6.6. Sparse Representation
DTI data in DTI networks are sparse and imbalanced: there are few known DTIs and abundant unknown drug-target pairs. For example, in the datasets provided by Yamanishi et al. [9], the numbers of DTIs are 2926, 1476, 635, and 90 between 445, 210, 223, and 54 drugs and 664, 204, 95, and 26 target proteins for enzymes, ion channels, GPCRs, and nuclear receptors, respectively. The ratios of known DTIs to all drug-target pairs are 0.0099, 0.0345, 0.03, and 0.0641, respectively. The dataset provided by Wen et al. [63] contains only 6262 DTIs among 2,146,240 (1412 × 1520) possible drug-target pairs from 1412 drugs and 1520 targets, so the ratio of known DTIs to all drug-target pairs is 0.0029. More importantly, DTI prediction must be solved with small samples and high-dimensional drug and target information. Sparse representation can automatically discriminate between classes and provides a simple and effective way of rejecting invalid test samples that do not belong to any class in the training set, thereby reducing data dimension and computational cost [87]. Therefore, sparse representation-based methods may be further applied to DTI prediction.
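The ratios quoted above follow directly from the dataset sizes. The short Python sketch below recomputes them as interactions divided by (drugs × targets); the counts are those given in the text, while the dictionary layout and function name are only illustrative.

```python
# (number of drugs, number of targets, number of known DTIs), as quoted in the text.
DATASETS = {
    "enzyme": (445, 664, 2926),
    "ion channel": (210, 204, 1476),
    "GPCRs": (223, 95, 635),
    "nuclear receptor": (54, 26, 90),
    "Wen et al. [63]": (1412, 1520, 6262),
}


def interaction_density(n_drugs, n_targets, n_interactions):
    """Ratio of known DTIs to all possible drug-target pairs."""
    return n_interactions / (n_drugs * n_targets)


for name, (n_drugs, n_targets, n_dtis) in DATASETS.items():
    pairs = n_drugs * n_targets
    density = interaction_density(n_drugs, n_targets, n_dtis)
    print(f"{name}: {n_dtis} DTIs among {pairs} pairs, density {density:.4f}")
```

Even the densest benchmark (nuclear receptor) has fewer than 7% of its pairs labeled as interactions, which is why class imbalance dominates model evaluation on these data.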
6.7. Types of DTI
Different types of DTIs help us understand the molecular mechanism of drug action. Although the existing methods have achieved promising performance, the majority of them can only infer the binary interaction between a drug and a target, but cannot detect distinct types of interactions. However, the interactions between drugs and targets generally have different meanings, for example, direct interactions produced by protein-ligand binding and indirect interactions caused by either changed expression levels of a target protein or active metabolites induced by a drug [16,66]. In addition, DTIs can be annotated by different drug modes of action, such as activation and inhibition [17]. Therefore, how to use various biological data to identify different types of DTIs may be a challenging problem.
6.8. Personalized Medicine
The ultimate goal of DTI identification is to provide treatment clues for patients, especially cancer patients. However, it is inappropriate to simply use one or a few drugs for all patients [88]. Therefore, computational methods should be used to mine personalized drugs by integrating cancer-related networks, drug-drug interaction networks, protein-protein interaction networks, metabolic networks, and so on. By fusing this information with novel network-based models, researchers may find valuable drug discovery strategies. In addition, computational models could be applied to predict personalized drug targets, drug effects, and drug resistance for cancer treatment, and to infer personalized cancer risk for healthy individuals [89,90]. Therefore, personalized medicine based on DTI identification may be a topic of further research.
Author Contributions
DTI computational models, L.Z. and Z.L.; Data representation and repositories, G.T., H.W. and L.P.; flowcharts, F.L., M.C. and J.X.; discussion, J.Y.; conclusion and further research, L.H.P.; writing—original draft preparation, L.Z. and Z.L.; writing—review and editing, L.H.P., F.L., M.C.; supervision, L.H.P.; project administration, L.Z.; funding acquisition, G.T., H.W., L.P., J.Y.
Funding
This research was funded by the Natural Science Foundation of China (Grants 61672223, 61702054, 61803151), the Natural Science Foundation of Hunan Province (Grants 2018JJ2461, 2018JJ3568, 2018JJ3570, 2019JJ50187), the Project of Scientific Research Fund of Hunan Provincial Education Department (Grants 14B023, 17A052, 18B209), and the Training Program for Excellent Young Innovators of Changsha (Grant kq1802024).
Acknowledgments
We would like to thank all authors of the cited references.
Conflicts of Interest
The authors declare no conflict of interest.
Figures and Tables
Figure 1. The flowchart of standard drug-target interactions (DTI) identification models.
Figure 2. The flowchart of multiple target optimal intervention solutions (MTOI).
Figure 3. The flowchart of Network-based Random Walk with Restart on the Heterogeneous network (NRWRH).
Figure 4. The flowchart of a novel Network integration pipeline for DTI prediction (DTINet).
Figure 6. The flowchart of BLM with neighbor-based interaction-profile inferring (BLM-NII).
Figure 7. The flowchart of Laplacian regularized least square (LapRLS) incorporating DTI network (NetLapRLS).
Figure 9. The flowchart of DTI identification methods based on matrix factorization.
Figure 10. The flowchart of Kernelized Bayesian Matrix Factorization with twin Kernels (KBMF2K).
Table 1. Datasets provided by Yamanishi et al. [9].
Dataset | Drugs | Targets | Interactions |
---|---|---|---|
enzyme | 445 | 664 | 2926 |
ion channel | 210 | 204 | 1476 |
GPCRs | 223 | 95 | 635 |
nuclear receptor | 54 | 26 | 90 |
Performance comparison of BLM-based methods [52].
Metric | Dataset | KRM | BLM | RLS | BLM-NII |
---|---|---|---|---|---|
AUC | Enzyme | 86.4 | 97.6 | 97.8 | 98.8 |
AUC | Ion Channel | 81.9 | 97.3 | 98.4 | 99.0 |
AUC | GPCR | 76.5 | 95.5 | 95.4 | 98.4 |
AUC | Nuclear Receptor | 74.9 | 88.1 | 92.2 | 98.1 |
AUPR | Enzyme | 6.30 | 83.3 | 91.5 | 92.9 |
AUPR | Ion Channel | 17.2 | 78.1 | 94.3 | 95.0 |
AUPR | GPCR | 10.9 | 66.7 | 79.0 | 86.5 |
AUPR | Nuclear Receptor | 17.1 | 61.2 | 68.4 | 86.6 |
Performance comparison of different types of prediction models [61].
Metric | Dataset | NetLapRLS | BLM-NII | WNN-GIP | KBMF2K | NRLMF |
---|---|---|---|---|---|---|
AUC | Enzyme | 97.2 | 97.8 | 96.4 | 90.5 | 98.7 |
AUC | Ion Channel | 96.9 | 98.1 | 95.9 | 96.1 | 98.9 |
AUC | GPCR | 91.5 | 95.0 | 94.4 | 92.6 | 96.9 |
AUC | Nuclear Receptor | 85.0 | 90.5 | 90.1 | 87.7 | 95.0 |
AUPR | Enzyme | 78.9 | 75.2 | 70.6 | 65.4 | 89.2 |
AUPR | Ion Channel | 83.7 | 82.1 | 71.7 | 77.1 | 90.6 |
AUPR | GPCR | 61.6 | 52.4 | 52.0 | 57.8 | 74.9 |
AUPR | Nuclear Receptor | 46.5 | 65.9 | 58.9 | 53.4 | 72.8 |
© 2019 by the authors.
Abstract
Background: Identifying possible drug-target interactions (DTIs) has become an important task in drug research and development. Although high-throughput screening is becoming available, experimental methods can validate only a narrow range of candidates because of their extremely high cost, low success rate, and long duration. Therefore, various computational models have been exploited to infer DTI candidates. Methods: We introduced relevant databases and packages and provided a comprehensive review of computational models for DTI identification, including network-based algorithms and machine learning-based methods. Specifically, the machine learning-based methods mainly include the bipartite local model, matrix factorization, regularized least squares, and deep learning. Results: Although computational methods have significantly improved DTI prediction, these models have their limitations. We discussed potential avenues for boosting DTI prediction accuracy as well as further directions.
1 School of Computer Science, Hunan University of Technology, Zhuzhou 412007, China
2 School of Computer Science, Hunan Institute of Technology, Hengyang 421002, China
3 Geneis (Beijing) Co. Ltd., Beijing 100102, China
4 School of Computer Science, University of Science and Technology of Hunan, Xiangtan 411201, China
5 School of Computer Science and Engineering, Central South University, Changsha 410083, China; Neuroscience Research Center, Department of Basic Medical Sciences, Changsha Medical University, Changsha 410219, China