GPS-Lipid: a robust tool for the prediction of multiple lipid modification sites
Yubin Xie*, Yueyuan Zheng*, Hongyu Li*, Xiaotong Luo, Zhihao He, Shuo Cao, Yi Shi, Qi Zhao, Yu Xue, Zhixiang Zuo & Jian Ren
important mechanism for the regulation of various aspects of protein function. Over the last decades, regulation.
Most proteins in eukaryotic cells are post-translationally modified by a wide range of chemical groups. Among these post-translational modifications, the addition and removal of lipid groups at certain amino acids is a key modification that orchestrates the subcellular trafficking1,2, signaling3,4 and membrane association5 of proteins. With the rapid development of numerous innovative techniques, three prevalent forms of lipid modification, namely S-palmitoylation, prenylation and N-myristoylation, are now extensively studied.
The reversible attachment of the 16-carbon fatty acid palmitate to a protein via a thioester linkage is called S-palmitoylation6. By effectively increasing the hydrophobicity of its modified substrates, S-palmitoylation can dynamically regulate the membrane association of various cellular proteins1,2. Cellular proteins may also be covalently modified with the 14-carbon saturated fatty acid myristate, which is known as N-myristoylation. By recognizing an MGXXXS/T signature at the N-terminus, N-myristoyl transferase (NMT) catalyzes the addition of myristate to glycine via an amide bond7,8. Another important type of lipid modification is prenylation. This process involves the addition of a 15-carbon farnesyl group or a 20-carbon geranylgeranyl group to a C-terminal cysteine that conforms to a consensus CAAX motif9. Typically, farnesylation is catalyzed by protein farnesyltransferase (FTase)10, whereas geranylgeranylation is performed by protein geranylgeranyltransferase type I (GGTase-I)11,12. However, in the case of Rab proteins, geranylgeranyltransferase type II (GGTase-II), which recognizes a C-terminal CC/CXC motif, catalyzes the geranylgeranylation process9,13. Although the enzymatic processes of protein lipidation vary greatly, different types of lipid groups are still found to modify similar protein substrates, which implies a strong co-regulation between different lipid modifications. One of the most striking examples is the regulation of small GTPases in subcellular trafficking by prenylation
State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, Guangdong
SCIENTIFIC REPORTS
www.nature.com/scientificreports/
and palmitoylation14. In the Ras and Rho families, palmitoylation frequently occurs in the hypervariable domain adjacent to the prenylated C-terminal CAAX box15. These two types of lipid modification provide sufficient hydrophobicity for proteins to localize on cellular membranes, and the precise subcellular localization of these small GTPases is essential for their proper function16. Additionally, with the help of an intracellular palmitoylation-depalmitoylation cycle, the prenylated small GTPases are able to traffic dynamically from the Golgi apparatus to the plasma membrane17. A similar co-regulatory mechanism was also reported between myristoylation and palmitoylation. Because myristoylation by itself does not provide enough hydrophobicity for membrane association of the modified protein18, additional N-terminal palmitoylation of myristoylated proteins is usually required for stable membrane attachment and translocation to rafts/caveolae or intracellular liquid-ordered domains19,20. Some guanylate cyclase activating proteins21, most members of the Src family of protein tyrosine kinases22 and the Gi subfamily of alpha subunits of G proteins23 are examples that undergo this kind of regulation. Thus, dual-lipid modifications are responsible for the correct localization of many signaling proteins and play crucial roles in coordinating extracellular stimuli and intracellular signaling.
Due to their essential physiological functions, dysfunctions of lipid modification may lead to many kinds of disease. For example, the overexpression of palmitoyl acyltransferases (PATs) may be implicated in schizophrenia24 and Huntington's disease25. N-myristoylation is observed to mediate viral infectivity and eukaryotic infections26. Taken together, research on lipid modification, especially on the co-regulatory mechanisms, will be particularly important for identifying potential drug targets for further diagnostic and therapeutic consideration. However, due to the limited availability of integrative bioinformatics resources, investigations that focus on the co-regulation of lipid modifications are seldom performed. This deficiency may seriously hamper the development of effective therapies for disorders related to lipid modifications.
Recently, several prediction tools for lipid modifications have been constructed. CSS-Palm27 and CKSAAP-Palm28 are two widely used tools for predicting palmitoylation sites. NMT7 and Myristoylator29 were specifically designed for N-myristoylation prediction. PrePS9 was proposed to predict protein CAAX farnesylation, CAAX geranylgeranylation and Rab geranylgeranylation sites. Unfortunately, none of them can predict a complete set of lipid modification sites or assist research on the co-regulatory mechanisms between different lipid modifications. Moreover, the number of experimentally identified lipidation sites has expanded significantly in recent years, which provides an opportunity to improve the performance of lipid modification site prediction.
In this work, we present GPS-Lipid, a comprehensive predictor for multiple protein lipid modification sites. From the published literature, we manually collected 737 S-palmitoylation sites in 361 proteins, 106 S-farnesylation sites in 97 proteins, 95 S-geranylgeranylation sites in 70 proteins and 283 N-myristoylation sites in 281 proteins. Using this comprehensive dataset, we developed a new algorithm called GPS-Lipid, in which we incorporated the ALC-PSO30 method into our previous GPS algorithm for model training. Both a standalone package and an online service are freely available at http://lipid.biocuckoo.org.
Results
Previously, we developed the GPS (Group-based Prediction System) algorithm and successfully applied it to the prediction of post-translational modification sites such as phosphorylation31 and sumoylation32. In this work, for the prediction of protein lipid modification sites, we developed an updated version of the GPS algorithm, which adopts the ALC-PSO strategy shown in Fig. S1 to prevent premature convergence while maintaining fast training. We named this updated tool GPS-Lipid.
To evaluate the performance of GPS-Lipid, leave-one-out (LOO) and 4-, 6-, 8- and 10-fold cross-validations were performed for S-palmitoylation, N-myristoylation, S-farnesylation and S-geranylgeranylation. The AUCs for all tests are larger than 0.9 (Fig. 1A-D), indicating that GPS-Lipid is an accurate predictor. Furthermore, the ROC curves of the 4-, 6-, 8- and 10-fold cross-validations were close to that of the LOO validation for all lipid modification predictors, which demonstrates that GPS-Lipid is also a robust predictor. Since the positive and negative datasets are highly imbalanced in our training set, we calculated precision-recall curves to further evaluate the performance of GPS-Lipid. A similar result was observed, which further confirmed that GPS-Lipid is an accurate and robust predictor even with an imbalanced training set (Fig. S2). To further evaluate the accuracy of GPS-Lipid, an independent test dataset was used. Due to the limited data size, only S-palmitoylation was evaluated. The AUC of the prediction for this independent test dataset was 0.8712 (Fig. 1E), further indicating the robustness of GPS-Lipid.
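The validation scheme can be made concrete with a short sketch. The following Python snippet is only an illustration and not the GPS-Lipid implementation: the function names, the random shuffling and the fixed seed are assumptions introduced here. It partitions n sites into k disjoint test folds, and setting k equal to n gives the LOO case.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Split range(n_items) into k disjoint folds of near-equal size."""
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    order = list(range(n_items))
    random.Random(seed).shuffle(order)  # hypothetical: shuffle before splitting
    return [order[i::k] for i in range(k)]


def cross_validation_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices); k == n_items is leave-one-out."""
    for test in k_fold_indices(n_items, k, seed):
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, sorted(test)
```

Each site appears in exactly one test fold, so every site is predicted once by a model that did not see it during training.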
We then compared GPS-Lipid with known predictors such as NMT, CSS-Palm and CKSAAP-Palm. The LOO validation was carried out for all predictors. The performance of GPS-Lipid was superior to that of all other predictors: the AUCs for palmitoylation and myristoylation prediction using GPS-Lipid are 0.9434 and 0.9940, respectively, while the AUCs for CSS-Palm, CKSAAP-Palm and NMT are 0.8682, 0.8254 and 0.9028, respectively (Fig. 1F).
Availability and Utility of GPS-Lipid. GPS-Lipid can be accessed through either a standalone package or an interactive web server. The GPS-Lipid web server provides a friendly interactive interface (Fig. 2A). Below the sequence panel, several interactive options are provided to help users configure their own run of GPS-Lipid. The PTM Type panel lists the four supported lipid modification types, allowing users to predict any combination of lipid modifications. The threshold panel provides three thresholds, allowing users to run GPS-Lipid with high, medium or low stringency. The console panel allows users to input sequences from a file and provides example protein sequences for running GPS-Lipid.
Figure 1. Performance evaluation and comparison of GPS-Lipid. (A-D) Performance evaluation for the predictions of S-palmitoylation, N-myristoylation, S-farnesylation and S-geranylgeranylation. LOO and 4-, 6-, 8- and 10-fold cross-validations were performed. (E) An additional test set that was not included in the training set was used for further evaluation of palmitoylation prediction. (F) Performance comparison among GPS-Lipid, CSS-Palm, CKSAAP-Palm and NMT. To avoid any bias, the same data set was used and the LOO validation was performed.
Figure 2. A snapshot of GPS-Lipid. (A) The human tyrosine-protein kinase Yes (YES1), mouse guanine nucleotide-binding protein G(i) subunit alpha-2 (GNAI2) and Arabidopsis thaliana rac-like GTP-binding protein (ARAC3) were taken as examples to try out the predictor. All four supported lipidation types were selected and predicted using the default threshold. (B) The predicted results for these three protein sequences. Different modification types are marked with different colors. (C) Visualization of the predicted results. On clicking the visualize button in the result page, the lipid modification sites are illustrated in a domain graph. To distinguish between different lipid modifications, the visualization tool marks them with different colors.
Figure 3. The co-regulatory mechanism of lipid modifications. (A) The distribution of lipid-modified proteins in our collected sequence library. (B) The correlation between all six combinations of dual-lipid modification. The color strength represents the significance level calculated from the chi-square test. Positions in gray are nonsense dual-lipid modifications. Positions marked with an asterisk are cases with significant correlations (i.e. P < 0.05 or Significance > 2.99), while positions with two asterisks represent very significant correlations (i.e. P < 0.01 or Significance > 4.61). (C) The position distribution of different dual-lipid modifications. Four significantly correlated dual-lipid modifications were tested using the chi-square test. The horizontal axis represents the three tested flanking regions, while the vertical axis represents the significance of whether two types of lipid modification sites tend to be located adjacently. The red line denotes the significance level corresponding to a probability of 0.05. (D) The in situ crosstalk between prenylation and palmitoylation. The x-axis represents three pairs of potential in situ crosstalk, while the y-axis represents the enrichment ratio. Palm, Gera and Farn refer to S-palmitoylation, S-geranylgeranylation and S-farnesylation, respectively.
To illustrate the utility of the GPS-Lipid web server in more detail, we took the protein sequences of human tyrosine-protein kinase Yes (YES1), mouse guanine nucleotide-binding protein G(i) subunit alpha-2 (GNAI2) and Arabidopsis thaliana rac-like GTP-binding protein (ARAC3) as examples. Multiple sequences can be entered into GPS-Lipid in FASTA format. Alternatively, the sequences can be uploaded as a FASTA file. Using the default threshold, the potential lipid modification sites of our examples are predicted and presented as shown in Fig. 2B. YES1 is known to be modified by an N-myristoyl group at its N-terminal glycine, which is essential for its regulatory role in cellular transformation33. As expected, GPS-Lipid successfully predicted the N-myristoylation at the N-terminal glycine. As for GNAI2, GPS-Lipid identified an N-myristoylation site at position 2, which has been reported to be associated with the constitutive activation of alpha i2 signal transduction functions34.
We also identified several geranylgeranylation sites at the C-terminus of ARAC3, where a CAAX motif has been identified through an in vitro prenylation screen35,36. Taken together, GPS-Lipid is able to identify different kinds of known lipid modification sites, indicating its robustness and reliability.
In addition to the lipid modification predictor, we also developed a visualization tool to generate a schematic diagram as shown in Fig. 2C. By marking the functional domains of a protein together with its precise lipid modification sites, the visualization tool can assist researchers in illustrating the underlying mechanisms of the lipid modification process.
To conduct a deep analysis of the co-regulatory mechanisms in lipid modifications, we constructed a comprehensive lipid modification dataset that contains 1221 manually collected lipidation sites, 1447 palmitoylated proteins from high-throughput screens and 2257 GPS-Lipid predicted lipidation sites. Interestingly, a large proportion of proteins were modified by at least two types of lipid groups, indicating that a strong co-regulation exists between different lipid modifications
(Fig. 3A). It should be noted that the majority of the co-regulated proteins contain a palmitoylation site, demonstrating that palmitoylation is a key modification that coordinates with other lipidation types to modulate diverse cellular functions.
It has been reported that proteins are often modified sequentially with different lipids. To explore how the different lipid modifications are co-regulated, we performed a statistical analysis of the co-occurrence of all six combinations of dual-lipid modification. We found that S-palmitoylation significantly (P < 0.05) co-occurred with N-myristoylation, S-farnesylation and S-geranylgeranylation (Fig. 3B, Table S7). A clear dual-lipid modification of S-farnesylation and S-geranylgeranylation was also detected (Fig. 3B).
We next analyzed the spatial relationships of different lipid-modified residues. As expected, we observed that S-palmitoylation sites are significantly (P < 0.05) enriched in regions proximal to N-myristoylation, S-farnesylation and S-geranylgeranylation sites (Fig. 3C, Table S8). The highest spatial correlation was observed between S-farnesylation and S-geranylgeranylation (Fig. 3C). When the flanking region was extended from (5,5) to (15,15), the level of significance decreased in all cases of dual-lipid modification. In this regard, we propose that the adjacent modification by two lipid groups is a key mechanism for dual-lipid modification, and that palmitoylation should be the most essential lipid modification, one that is required for the attachment of other lipid groups.
Since the two different types of prenylation recognize the same C-terminal CAAX box, we speculated that the strong spatial relationship observed between S-farnesylation and S-geranylgeranylation was mainly caused by their in situ crosstalk. We performed a hypergeometric test to determine whether S-farnesylation and S-geranylgeranylation modify the same terminal cysteine. As palmitoylation can also occur at cysteine, we also calculated the significance of in situ crosstalk between palmitoylation and prenylation. Interestingly, we observed significant (P = 0) in situ crosstalk between S-farnesylation and S-geranylgeranylation (Fig. 3D, Table S9). In contrast, palmitoylation and prenylation were found to be unlikely to occur at the same cysteine residue. This result suggests that farnesylation and geranylgeranylation are two modifications of similar nature and function. There may be a dynamic regulatory process that maintains the equilibrium between S-farnesylation and S-geranylgeranylation in mammalian cells, which still needs further experimental verification.
Discussion
Attachment of lipophilic groups is a widespread modification that has essential functions in eukaryotic cells. Over the past decades, at least four different types of lipid modification have been increasingly studied, which has generated a large amount of data. We collected all experimentally validated lipid modification data, based on which GPS-Lipid was developed to predict different types of lipid modification sites at the same time. Together with the comprehensive lipid modification dataset and visualization tools, GPS-Lipid can greatly assist the research community in investigating lipid modification. One of the most important functions of GPS-Lipid is the ability to identify different types of lipid modification simultaneously, which should be of great help for research on the co-regulation between different types of lipid modification. As previously reported2,37, proteins are often modified sequentially with different types of lipid. Signal transducing proteins, such as guanine-nucleotide-binding protein α (Gα) subunits and non-receptor tyrosine kinases, are first post-translationally modified by a palmitoyl group at N-terminal Cys residues. Following this, the adjacent Gly residue is N-myristoylated and the protein is targeted to the plasma membrane. Similar dual-lipid modifications were also observed in Ras proteins and other monomeric GTPases. At their C termini, those proteins are first modified by a farnesyl group at the CAAX box; the adjacent palmitoylation then leads to subcellular trafficking from the endoplasmic reticulum to the Golgi apparatus and plasma membrane. In consideration of these important physiological functions, the study of co-regulatory mechanisms between different lipid modifications will be particularly important for further diagnostic and therapeutic considerations.
With GPS-Lipid, one can predict all potential lipid modification sites for given protein sequences of interest, which provides specific targets for subsequent experiments. We applied GPS-Lipid to our collected sequence library to systematically investigate co-regulation in lipidation. We observed that the four different types of lipidation clearly crosstalk with each other. Specifically, significant in situ crosstalk was detected between S-farnesylation and S-geranylgeranylation. However, as the training data set contained a limited number of prenylation sites, the resolution of the predictions for S-farnesylation and S-geranylgeranylation may be insufficient. Therefore, the level of in situ crosstalk between S-farnesylation and S-geranylgeranylation may be over-represented. By incorporating a larger data set for prenylation, it will be possible to further refine the prediction accuracy.
Although the current performance of GPS-Lipid is satisfactory, some lipid modification sites still cannot be properly predicted. Using our training dataset, we predicted the potential lipid modification sites with the default threshold and counted the number of misclassified sites (Fig. S3). The statistical results suggest that our models are highly sensitive; however, some negative sites are still incorrectly predicted as positive sites. To reveal the sequence characteristics of those false positive sites, we plotted their motif representation using WebLogo38. As expected, the misclassified lipid modification sites all share very low sequence similarity and no apparent motif is detected (Fig. S4). Since the GPS algorithm predicts based on groups, such low-similarity sites are predicted by the non-consensus models. However, the dataset assigned to non-consensus groups is very limited, so some false positive predictions are introduced into our tool. An effective way to solve this problem is to gather more low-similarity sites in the training dataset, in other words, to enlarge the training set in the non-consensus groups.
In the future, GPS-Lipid will be developed further, including extension to other types of lipid modification, such as GPI-anchor addition, S-diacylglycerol addition and cholesterol addition. In addition, a more precise GPS algorithm will be adopted in the next version of GPS-Lipid.
Methods
Data collection and preparation. The training data set for GPS-Lipid was manually collected by searching the scientific literature (published before Nov. 2014) in PubMed with keywords such as Palmitoylation, Myristoylation, Farnesylation and Geranylgeranylation. In total, we collected 737 S-palmitoylation sites in 361 proteins, 106 S-farnesylation sites in 97 proteins, 95 S-geranylgeranylation sites in 70 proteins and 283 N-myristoylation sites in 281 proteins. To provide full access to the collected data set, an online database was developed and the corresponding annotations from UniProt and NCBI were integrated. As previously described, redundant sites should be removed to avoid any overestimation of prediction accuracy, so CD-HIT39 with a threshold of 40% sequence identity was used to identify homologous proteins. If two proteins were modified by lipid groups at the same position and shared more than 40% sequence identity, only one protein was retained. In particular, 65 palmitoylation sites were randomly selected from the non-redundant dataset to construct an additional test set. Due to data limitations, additional test sets for the other lipid modifications were not constructed. For the preparation of training data sets, we took known lipid modification sites as the positive dataset, while all other non-modified residues of the same type, i.e. cysteine or glycine, in the same substrates were taken as the negative dataset. As a result, 579 S-palmitoylation sites, 226 N-myristoylation sites, 82 S-farnesylation sites and 71 S-geranylgeranylation sites from 277, 226, 78 and 52 protein substrates, respectively, were retained as the final positive training data set (Supplementary Tables S3-S6). The corresponding negative dataset contains 3002 non-palmitoylated sites, 6754 non-myristoylated sites, 613 non-farnesylated sites and 192 non-geranylgeranylated sites.
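The construction of the positive and negative datasets can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `split_sites`, the use of 1-based positions and the toy sequence in the usage note are assumptions made here, not part of the published pipeline.

```python
def split_sites(sequence, modified_positions, residue):
    """Separate candidate residues into positive and negative sites.

    sequence           -- protein sequence as a string of one-letter codes
    modified_positions -- 1-based positions of experimentally known sites
    residue            -- "C" (palmitoylation, prenylation) or "G" (myristoylation)
    """
    modified = set(modified_positions)
    positives, negatives = [], []
    for position, amino_acid in enumerate(sequence, start=1):
        if amino_acid != residue:
            continue  # only residues of the modifiable type are candidates
        if position in modified:
            positives.append(position)
        else:
            negatives.append(position)
    return positives, negatives
```

For a hypothetical sequence `MGCVCSKGC` with one known modified cysteine at position 3, the cysteines at positions 5 and 9 become negative sites. Because every unmodified cysteine or glycine of a substrate counts as a negative, the negative set is naturally much larger than the positive set.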
To include as many lipid modification sites as possible, another 1259 palmitoylated proteins verified by high-throughput experiments were collected from PubMed. Using GPS-Lipid with a high threshold, the exact palmitoylation sites of those proteins were predicted and integrated into the lipid modification database. Notably, we also constructed a sequence library for further identifying the co-regulation mechanisms of lipid modifications by integrating the collected data set and the high-throughput data set.
Based on the hypothesis that similar short peptides exhibit similar biochemical properties and biological functions31, a previously described GPS algorithm32 was applied to predict potential lipid modification sites. In the GPS algorithm, four sequential training procedures, namely modification site clustering, motif length selection (MLS), weight training (WT) and matrix mutation (MaM), were adopted to improve the prediction performance.
Before the training steps, we first took known lipid modification sites as the positive dataset, while all other non-modified residues (cysteine for palmitoylation and prenylation, glycine for myristoylation) in the same substrates were taken as the negative dataset. Detailed statistics of the training dataset are shown in Table S1. To enable the subsequent training steps, we extracted each lipid modification site peptide as a lipid-modified residue, i.e. cysteine or glycine, flanked by 30 residues upstream and 30 residues downstream.
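The peptide extraction step can be illustrated as follows. The sketch assumes that positions are 1-based and that peptides near a terminus are padded with a placeholder character (`*`) to keep a fixed length; the text does not state how termini are handled, so the padding is a hypothetical choice, as is the function name.

```python
def extract_peptide(sequence, position, flank=30, pad="*"):
    """Return the residue at 1-based `position` with `flank` residues on each side.

    Missing residues beyond either terminus are filled with `pad`
    (an assumption made for this sketch).
    """
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    upstream = sequence[max(0, index - flank):index]
    downstream = sequence[index + 1:index + 1 + flank]
    return (
        pad * (flank - len(upstream))
        + upstream
        + sequence[index]
        + downstream
        + pad * (flank - len(downstream))
    )
```

With `flank=30` every peptide has 61 characters and the candidate residue always sits at the central position, which keeps positions aligned when two peptides are compared.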
Considering that one type of post-translational modification (PTM) may recognize multiple motifs, modification site clustering was carried out to classify the lipid modification sites into several groups based on their recognition motifs. For N-myristoylation, the modified sites that follow an N-terminal MGXXXS/T motif were classified into a consensus set, while the other modified sites were classified into a non-consensus set. A similar strategy was applied to prenylation. The farnesylation sites were classified into consensus and non-consensus classes based on the C-terminal CAAX motif. For S-geranylgeranylation sites, the CAAX and CXC/CC motifs were used. Since no apparent motifs have been reported for S-palmitoylation, the palmitoylation sites were clustered into groups using a previously described k-means approach. As a result, based on the sequence similarity of the training data set, the S-palmitoylation sites were clustered into three optimal classes. In the k-means method, we first calculated the similarity score between two given S-palmitoylation peptides using Equation 1.
$$S(A, B) = \frac{\text{Num of conserved substitutions}}{\text{Num of all substitutions}} \tag{1}$$
A conserved substitution is a substitution with Score(a, b) > 0 in the BLOSUM62 matrix. S(A, B) ranges from 0 to 1. The distance between two PSP(m, n) is then defined as D(A, B) = 1/S(A, B). If S(A, B) = 0, we simply let D(A, B) = ∞. The k-means algorithm clusters the palmitoylation sites by exhaustive testing. First, two palmitoylation sites were randomly chosen as the centroids. Second, the other positive sites were compared with the two centroids and the distances were calculated; each positive site was then assigned to the group of the centroid at the shortest distance. Third, the centroids were updated to the sites with the highest average identity score. The optimal clustering was obtained by iteratively repeating the second and third steps.
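The similarity score, the distance and the assignment step described above can be sketched in Python. Several points are assumptions of this sketch rather than statements from the text: the number of all substitutions is taken to be the number of aligned positions, the small score table is a toy stand-in with illustrative values (not the full BLOSUM62 matrix), and all function names are hypothetical.

```python
import math

# Toy substitution scores for illustration only; a real run would use BLOSUM62.
TOY_SCORES = {("S", "T"): 1, ("L", "I"): 2, ("K", "R"): 2, ("L", "K"): -2}


def toy_score(a, b):
    """Stand-in for Score(a, b): identical residues score 4, unknown pairs -1."""
    if a == b:
        return 4
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))


def similarity(pep_a, pep_b, score=toy_score):
    """S(A, B): fraction of aligned positions with a positive substitution score."""
    if len(pep_a) != len(pep_b) or not pep_a:
        raise ValueError("peptides must be non-empty and of equal length")
    conserved = sum(1 for a, b in zip(pep_a, pep_b) if score(a, b) > 0)
    return conserved / len(pep_a)


def distance(pep_a, pep_b, score=toy_score):
    """D(A, B) = 1 / S(A, B), or infinity when S(A, B) = 0."""
    s = similarity(pep_a, pep_b, score)
    return math.inf if s == 0 else 1.0 / s


def assign_to_centroids(peptides, centroids, score=toy_score):
    """Second clustering step: put each peptide in the group of its nearest centroid."""
    groups = {centroid: [] for centroid in centroids}
    for peptide in peptides:
        nearest = min(centroids, key=lambda c: distance(peptide, c, score))
        groups[nearest].append(peptide)
    return groups
```

Because D is the reciprocal of S, two peptides whose aligned residues are all conservative substitutions have the minimum distance of 1, and peptides without any conservative substitution are infinitely far apart.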
To evaluate the amino acid preference of the modification enzymes, the WT method was adopted to optimize the scoring weight at each position of a lipid modification peptide. In the WT process, the scoring strategy was defined as Equation 2.
$$S(A, B) = \sum_{-m \le i \le n} w_i \, \mathrm{Score}(A_i, B_i) \tag{2}$$

Here w_i refers to the scoring weight of each position. Before the optimization steps, we re-expressed the weight training process as Equation 3.

$$w_i = 1 + \Delta w_i \tag{3}$$

Δw_i represents the numeric change of the scoring weight after the training process. Hence, the weight training process aims at finding a set of Δw_i that yields optimal performance. In GPS-Lipid, this step
maximizes the Sn value of the LOO validation under an Sp of 90%. Furthermore, to improve the prediction robustness, a MaM process was subsequently performed. Similar to WT, the MaM process can be described as Equation 4.
$$S(a, b) = \mathrm{Score}(a, b) + \Delta S(a, b) \tag{4}$$
Where S(a, b) is the optimal substitution score for amino acids a and b with respect to lipid modification. Score(a, b) is the substitution score in the BLOSUM62 matrix. ΔS(a, b) represents the numeric change of the substitution score for amino acids a and b. Thus, the MaM approach seeks a set of ΔS(a, b) that maximizes the prediction performance. In the most recent version of the GPS algorithm32, WT and MaM were performed with the Particle Swarm Optimization (PSO) strategy. Although the original PSO algorithm exhibited fast-converging behavior in our previous test32, we found that, as in other population-based optimization techniques, a leading particle located at a local optimum may trap the whole swarm and lead it to premature convergence. To improve the performance of PSO, a number of PSO variants have been developed. Recently, inspired by the phenomenon of aging observed in nature, a variant of PSO, namely PSO with an aging leader and challengers (ALC-PSO), was proposed by Weineng Chen30. ALC-PSO was integrated into GPS-Lipid following Weineng Chen's implementation. Using ALC-PSO, the problem of premature convergence in WT and MaM was overcome and the fast-converging feature of PSO was fully preserved. For more details of the GPS algorithm, see the supplementary methods and Fig. S1.
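To show how the trained quantities enter the score, the sketch below evaluates the position-weighted peptide score with adjustable weight changes and substitution score changes. It is an assumption of this sketch that the mutated substitution score replaces Score(a, b) inside the weighted sum; the function name, the argument layout and the identity scoring used in the usage note are likewise hypothetical. The optimizer that searches for the changes (ALC-PSO in GPS-Lipid) is not reproduced here.

```python
def peptide_score(pep_a, pep_b, base_score, delta_w=None, delta_s=None):
    """Weighted similarity of two aligned peptides.

    base_score -- callable giving the substitution score Score(a, b)
    delta_w    -- per-position weight changes, so that w_i = 1 + delta_w[i]
    delta_s    -- dict mapping residue pairs to substitution score changes
    """
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides must have equal length")
    delta_w = delta_w if delta_w is not None else [0.0] * len(pep_a)
    delta_s = delta_s if delta_s is not None else {}
    total = 0.0
    for i, (a, b) in enumerate(zip(pep_a, pep_b)):
        # Mutated substitution score: base score plus the trained change.
        change = delta_s.get((a, b), delta_s.get((b, a), 0.0))
        s_ab = base_score(a, b) + change
        # Position weight: 1 plus the trained change.
        total += (1.0 + delta_w[i]) * s_ab
    return total
```

With all changes set to zero the function reduces to the unweighted sum of base substitution scores, so training starts from the untrained matrix and uniform weights and only has to learn deviations from them.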
Performance evaluation and comparison. To evaluate the performance of GPS-Lipid, leave-one-out (LOO) and 4-, 6-, 8- and 10-fold cross-validations were carried out on the training data set. In each validation, the sensitivity (Sn), specificity (Sp), accuracy (Ac), Matthews correlation coefficient (MCC) and precision (Pr) were calculated. Based on this evaluation, the ROC curves were plotted and the areas under the ROC curves (AUC) were computed. In addition, an additional test set containing 65 S-palmitoylation sites was used to perform an extra evaluation of palmitoylation prediction. To further demonstrate the superiority of GPS-Lipid, we compared it with other existing tools. Since Myristoylator does not predict precise myristoylation sites in protein substrates and PrePS does not support batch prediction, the comparison was only performed among GPS-Lipid, CSS-Palm, CKSAAP-Palm and NMT. To avoid any bias, the same training data set used for GPS-Lipid was adopted for the other prediction tools. The prediction results, along with the predicted scores, were collected from these three pieces of software. To draw the ROC curves, we varied the prediction cutoff according to the collected scores and calculated the performance under each cutoff. Specifically, when performing the comparison, the prediction threshold for CSS-Palm was set to "all", while for CKSAAP-Palm we set the penalty factor C and Gamma to 100 and 0.0000015, respectively. For NMT, the default parameters were used.
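The five measures and the AUC can be computed from predicted scores and true labels as sketched below. The standard textbook definitions are used; the function names, the convention that a site is called positive when its score is at or above the cutoff, and the trapezoidal rule for the AUC are assumptions of this sketch.

```python
import math


def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN when sites scoring >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= cutoff
        if label and predicted:
            tp += 1
        elif label:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, tn, fn):
    """Return Sn, Sp, Ac, Pr and MCC for one cutoff."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": _ratio(tp, tp + fn),
        "Sp": _ratio(tn, tn + fp),
        "Ac": _ratio(tp + tn, tp + fp + tn + fn),
        "Pr": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, denom),
    }


def roc_auc(scores, labels):
    """Area under the ROC curve, sweeping the cutoff over all observed scores."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for cutoff in set(scores):
        tp, fp, tn, fn = confusion_counts(scores, labels, cutoff)
        points.add((_ratio(fp, fp + tn), _ratio(tp, tp + fn)))
    ordered = sorted(points)  # (1 - Sp, Sn) pairs ordered along the curve
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(ordered, ordered[1:])
    )
```

Each cutoff contributes one (1 - Sp, Sn) point, which mirrors the procedure of varying the prediction cutoff over the collected scores.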
Lipid modifications are known as key biological processes that affect diverse cellular functions by modulating the membrane targeting, subcellular trafficking, intracellular sorting and stability of proteins in most eukaryotes. However, due to the unavailability of comprehensive bioinformatics resources, the co-regulatory mechanisms have seldom been investigated. To analyze the relationship between different lipid modification sites, we designed the following statistical methods. First, we extracted all collected lipid-modified proteins from the original dataset and constructed a sequence library containing 1479 proteins. Then, GPS-Lipid was applied to predict potential lipid modification sites with a high threshold. All predicted sites were subsequently mapped to the experimentally verified sites and duplicate sites were removed. In total, 3729 amino acid residues were annotated as modified by lipid groups.
As reported in the literature, dual-lipid modifications are important co-regulatory mechanisms that orchestrate a wide range of molecular processes in eukaryotic cells. With a chi-square test, we statistically examined the dependency for all six combinations of dual-lipid modifications in our collected sequence library. Here, we take palmitoylation-myristoylation dual-lipid modification as an example to describe the statistical method. The null hypothesis is as follows: the presence of a myristoylation site in a protein is independent of the presence of palmitoylation sites in the same protein substrate. Based on this hypothesis, the number of proteins that contain both myristoylation sites and palmitoylation sites is counted. Similarly, the numbers of proteins that contain only myristoylation sites or only palmitoylation sites are also counted. Then, the probability is obtained from a chi-square distribution.
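A chi-square test of this kind on a 2 x 2 table of protein counts can be sketched as follows. The sketch assumes the Pearson statistic without continuity correction and one degree of freedom, and it takes the fourth cell to be the number of proteins carrying neither modification; the text specifies neither point. The function name and the counts in the usage note are hypothetical.

```python
import math


def chi_square_2x2(both, only_first, only_second, neither):
    """Pearson chi-square test of independence for a 2 x 2 table.

    Returns (chi2, p_value) with one degree of freedom.
    """
    a, b, c, d = both, only_first, only_second, neither
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

As a made-up example, `chi_square_2x2(20, 5, 5, 20)` gives a statistic of 18, far beyond the value of about 3.84 that corresponds to P = 0.05. The significance thresholds quoted in the Figure 3 caption (2.99 and 4.61) are numerically consistent with -ln(0.05) and -ln(0.01), although the transformation is not stated explicitly in the text.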
After identifying significant dual-lipid modifications, we further investigated the spatial relationship between different lipid modifications. Based on the prediction results from GPS-Lipid, we examined the flanking regions, such as (5,5), (10,10) and (15,15), around a given lipid modification site and used a chi-square test to analyze whether another type of lipid modification occurs significantly more often in those proximal regions. Again, we take palmitoylation-myristoylation dual-lipid modification as an example to describe the statistical method. First, for all predicted myristoylated glycine residues, we counted the number of myristoylation sites with at least one palmitoylation site in the flanking regions. The number of myristoylation sites without any palmitoylation site in the flanking regions was also counted. The same two counts were then made for the other, non-myristoylated glycine residues. The probability is calculated from a chi-square distribution.
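The four counts described here form a 2 x 2 table, which the following sketch builds for a single protein. It assumes that a flanking region (k, k) means k residues upstream and k residues downstream of the site and that a partner site at the same position also counts as proximal; both are interpretations made for this sketch, and the function name is hypothetical.

```python
def flanking_table(modified_sites, unmodified_sites, partner_sites, flank):
    """Count sites with and without a partner modification within +/- flank.

    modified_sites   -- positions of, e.g., myristoylated glycines
    unmodified_sites -- positions of the other (non-myristoylated) glycines
    partner_sites    -- positions of, e.g., palmitoylation sites
    Returns (modified_with, modified_without, unmodified_with, unmodified_without).
    """
    def has_partner(position):
        return any(abs(partner - position) <= flank for partner in partner_sites)

    modified_with = sum(1 for site in modified_sites if has_partner(site))
    unmodified_with = sum(1 for site in unmodified_sites if has_partner(site))
    return (
        modified_with,
        len(modified_sites) - modified_with,
        unmodified_with,
        len(unmodified_sites) - unmodified_with,
    )
```

Summing the four counts over all proteins in the sequence library yields the contingency table to which a chi-square test of independence can be applied, once for each tested flanking width.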
Finally, since prenylation and palmitoylation both modify cysteine residues, a hypergeometric test was carried out to evaluate the in situ crosstalk between prenylation and palmitoylation. In this analysis, we tested whether prenylation is more likely to occur on palmitoylated cysteines. A more detailed description of the statistical methods is presented in the Supplementary Methods.
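A one-sided hypergeometric test of this kind can be sketched as follows. The mapping of the quantities is our assumption, since the exact formulation is given only in the Supplementary Methods: the population is taken to be all cysteines considered, the "successes" are the palmitoylated cysteines, the draws are the prenylated cysteines, and the observed value is the number of cysteines carrying both modifications. The function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("successes and draws must lie between 0 and population")
    total = comb(population, draws)
    upper = min(successes, draws)  # largest possible overlap
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(max(observed, 0), upper + 1)
    )
    return tail / total


# Toy numbers, not data from the paper.
n_cysteines = 1000      # all cysteines considered
n_palmitoylated = 100   # palmitoylated cysteines
n_prenylated = 20       # prenylated cysteines
n_both = 8              # cysteines carrying both modifications

p_value = hypergeom_upper_tail(n_cysteines, n_palmitoylated, n_prenylated, n_both)
print(f"p = {p_value:.3g}")
```

A small upper-tail probability indicates that prenylation coincides with palmitoylated cysteines more often than expected if the two modifications were placed independently.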
References
1. Draper, J. M., Xia, Z. & Smith, C. D. Cellular palmitoylation and trafficking of lipidated peptides. Journal of lipid research 48, 1873–1884, doi: 10.1194/jlr.M700179-JLR200 (2007).
2. Linder, M. E. & Deschenes, R. J. Palmitoylation: policing protein stability and traffic. Nature reviews. Molecular cell biology 8, 74–84, doi: 10.1038/nrm2084 (2007).
3. Smotrys, J. E. & Linder, M. E. Palmitoylation of intracellular signaling proteins: regulation and function. Annual review of biochemistry 73, 559–587, doi: 10.1146/annurev.biochem.73.011303.073954 (2004).
4. Resh, M. D. Membrane targeting of lipid modified signal transduction proteins. Sub-cellular biochemistry 37, 217–232 (2004).
5. Levental, I., Grzybek, M. & Simons, K. Greasing their way: lipid modifications determine protein association with membrane rafts. Biochemistry 49, 6305–6316, doi: 10.1021/bi100882y (2010).
6. Nadolski, M. J. & Linder, M. E. Protein lipidation. The FEBS journal 274, 5202–5210, doi: 10.1111/j.1742-4658.2007.06056.x (2007).
7. Maurer-Stroh, S., Eisenhaber, B. & Eisenhaber, F. N-terminal N-myristoylation of proteins: prediction of substrate proteins from amino acid sequence. Journal of molecular biology 317, 541–557, doi: 10.1006/jmbi.2002.5426 (2002).
8. Towler, D. A., Gordon, J. I., Adams, S. P. & Glaser, L. The biology and enzymology of eukaryotic protein acylation. Annual review of biochemistry 57, 69–99, doi: 10.1146/annurev.bi.57.070188.000441 (1988).
9. Maurer-Stroh, S. & Eisenhaber, F. Refinement and prediction of protein prenylation motifs. Genome biology 6, R55, doi: 10.1186/gb-2005-6-6-r55 (2005).
10. Hougland, J. L. et al. Identification of novel peptide substrates for protein farnesyltransferase reveals two substrate classes with distinct sequence selectivities. Journal of molecular biology 395, 176–190, doi: 10.1016/j.jmb.2009.10.038 (2010).
11. Gangopadhyay, S. A., Losito, E. L. & Hougland, J. L. Targeted reengineering of protein geranylgeranyltransferase type I selectivity functionally implicates active-site residues in protein-substrate recognition. Biochemistry 53, 434–446, doi: 10.1021/bi4011732 (2014).
12. McGuire, T. F., Qian, Y., Vogt, A., Hamilton, A. D. & Sebti, S. M. Platelet-derived growth factor receptor tyrosine phosphorylation requires protein geranylgeranylation but not farnesylation. The Journal of biological chemistry 271, 27402–27407 (1996).
13. Pereira-Leal, J. B., Hume, A. N. & Seabra, M. C. Prenylation of Rab GTPases: molecular mechanisms and involvement in genetic disease. FEBS letters 498, 197–200 (2001).
14. Sanchez-Mir, L. et al. Rho2 palmitoylation is required for plasma membrane localization and proper signaling to the fission yeast cell integrity mitogen-activated protein kinase pathway. Molecular and cellular biology 34, 2745–2759, doi: 10.1128/mcb.01515-13 (2014).
15. Michaelson, D. et al. Differential localization of Rho GTPases in live cells: regulation by hypervariable regions and RhoGDI binding. The Journal of cell biology 152, 111–126 (2001).
16. Symons, M. & Settleman, J. Rho family GTPases: more than simple switches. Trends in cell biology 10, 415–419 (2000).
17. Rocks, O. et al. The palmitoylation machinery is a spatially organizing system for peripheral membrane proteins. Cell 141, 458–471, doi: 10.1016/j.cell.2010.04.007 (2010).
18. Navarro-Lerida, I., Alvarez-Barrientos, A., Gavilanes, F. & Rodriguez-Crespo, I. Distance-dependent cellular palmitoylation of de-novo-designed sequences and their translocation to plasma membrane subdomains. Journal of cell science 115, 3119–3130 (2002).
19. McCabe, J. B. & Berthiaume, L. G. Functional roles for fatty acylated amino-terminal domains in subcellular localization. Molecular biology of the cell 10, 3771–3786 (1999).
20. McCabe, J. B. & Berthiaume, L. G. N-terminal protein acylation confers localization to cholesterol, sphingolipid-enriched membranes but not to lipid rafts/caveolae. Molecular biology of the cell 12, 3601–3617 (2001).
21. Stephen, R., Bereta, G., Golczak, M., Palczewski, K. & Sousa, M. C. Stabilizing function for myristoyl group revealed by the crystal structure of a neuronal calcium sensor, guanylate cyclase-activating protein 1. Structure 15, 1392–1402, doi: 10.1016/j.str.2007.09.013 (2007).
22. Patwardhan, P. & Resh, M. D. Myristoylation and membrane binding regulate c-Src stability and kinase activity. Molecular and cellular biology 30, 4094–4107, doi: 10.1128/mcb.00246-10 (2010).
23. Preininger, A. M. et al. Myristoylation exerts direct and allosteric effects on Galpha conformation and dynamics in solution. Biochemistry 51, 1911–1924, doi: 10.1021/bi201472c (2012).
24. Mukai, J. et al. Evidence that the gene encoding ZDHHC8 contributes to the risk of schizophrenia. Nature genetics 36, 725–731, doi: 10.1038/ng1375 (2004).
25. Yanai, A. et al. Palmitoylation of huntingtin by HIP14 is essential for its trafficking and function. Nature neuroscience 9, 824–831, doi: 10.1038/nn1702 (2006).
26. Maurer-Stroh, S. & Eisenhaber, F. Myristoylation of viral and bacterial proteins. Trends in microbiology 12, 178–185, doi: 10.1016/j.tim.2004.02.006 (2004).
27. Ren, J. et al. CSS-Palm 2.0: an updated software for palmitoylation sites prediction. Protein engineering, design & selection : PEDS 21, 639–644, doi: 10.1093/protein/gzn039 (2008).
28. Wang, X. B., Wu, L. Y., Wang, Y. C. & Deng, N. Y. Prediction of palmitoylation sites using the composition of k-spaced amino acid pairs. Protein engineering, design & selection : PEDS 22, 707–712, doi: 10.1093/protein/gzp055 (2009).
29. Bologna, G., Yvon, C., Duvaud, S. & Veuthey, A. L. N-Terminal myristoylation predictions by ensembles of neural networks. Proteomics 4, 1626–1632, doi: 10.1002/pmic.200300783 (2004).
30. Chen, W. N. et al. Particle Swarm Optimization With an Aging Leader and Challengers. Evolutionary Computation, IEEE Transactions on 17, 241–258, doi: 10.1109/TEVC.2011.2173577 (2013).
31. Xue, Y. et al. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Molecular & cellular proteomics : MCP 7, 1598–1608, doi: 10.1074/mcp.M700574-MCP200 (2008).
32. Zhao, Q. et al. GPS-SUMO: a tool for the prediction of sumoylation sites and SUMO-interaction motifs. Nucleic acids research 42, W325–330, doi: 10.1093/nar/gku383 (2014).
33. Johnson, D. R., Bhatnagar, R. S., Knoll, L. J. & Gordon, J. I. Genetic and biochemical studies of protein N-myristoylation. Annual review of biochemistry 63, 869–914, doi: 10.1146/annurev.bi.63.070194.004253 (1994).
34. Gallego, C., Gupta, S. K., Winitz, S., Eisfelder, B. J. & Johnson, G. L. Myristoylation of the G alpha i2 polypeptide, a G protein alpha subunit, is required for its signaling and transformation functions. Proceedings of the National Academy of Sciences of the United States of America 89, 9695–9699 (1992).
35. Sorek, N. et al. Activation status-coupled transient S acylation determines membrane partitioning of a plant Rho-related GTPase. Molecular and cellular biology 27, 2144–2154, doi: 10.1128/mcb.02347-06 (2007).
36. Sorek, N. et al. Differential effects of prenylation and s-acylation on type I and II ROPS membrane interaction and function. Plant physiology 155, 706–720, doi: 10.1104/pp.110.166850 (2011).
37. Aicart-Ramos, C., Valero, R. A. & Rodriguez-Crespo, I. Protein palmitoylation and subcellular trafficking. Biochimica et biophysica acta 1808, 2981–2994, doi: 10.1016/j.bbamem.2011.07.009 (2011).
38. Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome research 14, 1188–1190, doi: 10.1101/gr.849004 (2004).
39. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152, doi: 10.1093/bioinformatics/bts565 (2012).
Acknowledgements
This work was supported by grants from the National Basic Research Program (973 project) [2013CB933900 and 2012CB911201]; National Natural Science Foundation of China [31471252]; Guangdong Natural Science Foundation [S20120011335, 2014TQ01R387 and 2014A030313181]; Program for New Century Excellent Talents in University [NCET-13-0610]; Program of International S&T Cooperation [2014DFB30020].
Author Contributions
Y.X. developed the software package and wrote the main manuscript. Y.Z. collected the original data set. H.L. designed the web server. Z.H., X.L., S.C., Y.S., Q.Z., Y.X. and Z.Z. reviewed the manuscript. J.R. conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors have read and approved the final manuscript.
Additional Information
Supplementary information accompanies this paper at http://www.nature.com/srep
Competing financial interests: The authors declare no competing financial interests.
How to cite this article: Xie, Y. et al. GPS-Lipid: a robust tool for the prediction of multiple lipid modification sites. Sci. Rep. 6, 28249; doi: 10.1038/srep28249 (2016).
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
Copyright Nature Publishing Group Jun 2016
Abstract
As one of the most common post-translational modifications in eukaryotic cells, lipid modification is an important mechanism for the regulation of various aspects of protein function. Over the last decades, three classes of lipid modifications have been increasingly studied, and the co-regulation of these different lipid modifications is beginning to be noticed. However, owing to the lack of integrated bioinformatics resources, studies of the co-regulatory mechanisms are still very limited. In this work, we developed a tool called GPS-Lipid for the prediction of four classes of lipid modifications by integrating the Particle Swarm Optimization with an aging leader and challengers (ALC-PSO) algorithm. GPS-Lipid proved clearly superior to other similar tools. To facilitate research on lipid modification, we host a publicly available web server at http://lipid.biocuckoo.org that provides not only the implementation of GPS-Lipid, but also an integrative database and a visualization tool. We performed a systematic analysis of the co-regulatory mechanisms between different lipid modifications with GPS-Lipid. The results demonstrated that proximal dual-lipid modifications among palmitoylation, myristoylation and prenylation are key mechanisms for regulating various protein functions. In conclusion, GPS-Lipid is expected to serve as a useful resource for research on lipid modifications, especially on their co-regulation.