Abstract: The Multiple Sequence Alignment (MSA) methods aims to align several sequences together and find similarities, or differences and infer structural, functional or evolutionary relationships. Since a multiple sequence alignment problem can be solved by a well known dynamic programming algorithm have exponential time complexity, the heuristic approaches are commonly used. In this paper, multiple sequence alignment techniques are explained and discussed how the Feng and Doolittle algorithm was implemented in one of the most popular programs used for MSA namely, CLUSTAL. Also, the statistical significance scores of an alignment are shown numerically and graphically based on EMBOSS, for the searched data.
Keywords: Sequence alignment, Progressive sequence alignment, Feng and Doolittle algorithm.
1. Introduction:
The primary goal of bioinformatics seeks to compare DNA / protein sequences, to search for a common or related pattern among them. During recent years, MSA methods in bioinformatics act as a challenging task for solving problems in computational biology for biological sequences such as biological macromolecules, DNA and proteins (Desmond, G. Higgins et al., 1988; Geoffrey, J. Barton, 1998; Julie, D. Thompson, et al; 1994; W.S. Juang and S.F. Su. 2008). Thus, sequence alignment techniques are considered as central to modern molecular biology. The most commonly used sequence alignment techniques possess three components: scoring scheme, gap model and the alignment optimization algorithm. The description is as follows:
The scoring scheme is a 20x20 matrix of numbers that defines the value for aligning each of the possible amino-acid pairs. The pair-score matrix is usually symmetrical and the simplest scoring scheme is the identity matrix. This scores 1 for an exact match of two amino acids, and o for a mismatch. Recently, PAM 250 or Dayhoff and BLOSUM series were the most widely used scoring matrices. In extensive tests of sequence database searching (S. Henikoff and J.G. Henikoff, 1992), pair-wise alignment (Vogt et al., 1995) and multiple sequence alignment (J.G. Barton, 1998), the BLOSUM series of matrices on average give results superior to the PAM and other matrices.
The gap model/gap penalty is used to help decide whether on not to accept a gap or insertion in an alignment when it is not possible to achieve a good alignment residue-to-residue at some other neighboring points in the sequence. The most commonly used scoring scheme for gaps is a function of the form (ul + v), where l is the length of the gap in residues.
The optimal alignment of any two sequences is the alignment that maximizes the sum-of-pairs scores less any penalty for introduced gaps. An efficient technique known as dynamic programming, allows the optimal alignment of two sequences to be found in the order of m x n steps, where m and n are the lengths of two sequences.
There are several MSA algorithms reported in the literature (Chen, et al., 1992, D.J.Lipman, et al., 1989, McClure, et al., 1994, Thompson, et al., 1999). A great majority of MSA algorithms are such as progressive (extension of DP), iterative and stochastic approaches such as Simulated Annealing (SA), Genetic Algorithms (GA), Evolutionary Programming (EP) are widely applied in many bioinformatics research areas. S.B. Needleman and C.D. Wunsch (1970) are often attributed as the first application of dynamic programming in molecular biology, while various formulations of the same algorithm were described by Sellers (1974) and Waterman et al., (1976). Desmond G. Higgins et al., (1988) performed multiple alignments of large numbers of amino acid or nucleotide sequences from a series of pair-wise alignments of clusters, following the order of phylogeny tree. It is proved that the method is sufficiently fast and economical with memory on microcomputers.
D.J. Thompson et al., (1994) used progressive multiple sequence alignment method for the alignment of divergent protein sequences through sequence weighting, position specific gap penalty and weight matrix choice using CLUSTALW. Geoffrey J. Barton (1998) discussed the basic algorithms for alignment of two or more protein sequences and described alternative methods for scoring substitutions and gaps for local and global alignment methods. W.S. Juang and S.F. Su (2008) presented a hybrid algorithm for multiple sequence alignment method and it combines the pair-wise dynamic programming and the particle swarm optimization techniques with comparisons. Most of the approaches to MSA problem are based on the progressive approach proposed by Feng and Doolittle. This heuristic algorithm use pair-wise alignments to construct a global alignment by aligning two more similar sequences, and then adding the other sequences one by one.
In this paper, a MSA technique was incorporated effectively on insulin receptors derived from Uni- Prot database. The objective of this paper was fulfilled in three aspects. Initially, a MSA is carried out using CLUSTAL (a popular program to MSA) for five insulin protein sequences. The phylogeny tree is obtained based on the similarity scores, using the similar tool. Secondly, the paper discussed the theoretical basis of Feng - Doolittle algorithm, one of the popular technique used in the context of progressive / hierarchical alignment technique. Thirdly, a pair-wise global alignment scores was evaluated with the help of EMBOSS, a tool for pair-wise sequence alignment for performing a standard dynamic programming method. The theoretical and graphical representations are summarized in the respective sections.
2. Materials and Methods:
(a) Data Source
This paper uses Uni-Prot, one such a central repository of protein data created by combining the Swiss-Prot, Tr-Embl and PIR-PSD databases, which is accessible at http:www/uni-prot/org. The five insulin protein sequences for different organisms with similar gene names and similar sequence length were chosen to illustrate the performance of multiple sequence alignment using CLUSTALW. The searched results are displayed in FASTA format shown in Table 1.
(b) Multiple Alignment Method
Multiple alignments act as a key factor for the prediction of protein secondary structure, residue accessibility, function and the identification of residues important for specificity. It also provides the basis for the most sensitive sequence searching algorithms (Barton, G. J and Sternberg, M.J.E, 1990). From theoretical point of view, it is well-known that MSA problem can be exactly solved by the dynamic programming algorithm (Needleman and Wunsch, 1970; Smith and Waterman, 1981) for local and global alignments which ensures an optimal alignment of the sequences. In practice, implementation of dynamic programming technique leads a trivial task for aligning more than three or four sequences because the computational cost increases exponentially with the number of dimensions, i.e. the number of sequences. Hence, the most widely used heuristic approach for multiple sequence alignments known as progressive / hierarchical technique, builds up a final solution.
Any multiple sequence alignment is built progressively by a series of pair-wise alignments leads the following stages: (a) Computation of all pair wise sequence similarities and hence observing the most closely related sequences. (b) Construction of dendogram based on similarity matrix and (c) perform multiple alignment of sequences in a pair wise manner followed by the order of clustering obtained in a dendogram. W. Bains (1986) recommended for exact multiple alignment procedures the problems with computer time and memory increase approximately exponentially with the number of sequences compared, so that all algorithms described for more than three sequences have been heuristic. For the above sequences shown in Table 1, the multiple alignments are performed with a popular program to MSA called CLUSTALW. The insulin sequences are then aligned and the results are shown in Figure 1 (Code names of the sequences are given at the extreme left, whereas the right part indicates the respective protein sequences with varying colored alphabets, single asterisks * indicates the highlight sequence similarities among the residues).
(C) Algorithm
Several computer programs like BLAST and FASTA implements Needleman-Wunsch and Smith- Waterman algorithms to do fast searches of databases. Both these algorithms uses a word based heuristic, which use approximation techniques to search similarity between sequences. In addition, Feng and Doolittle (1987) suggested the most probable method to MSA, called progressive or hierarchical. The basic idea behind the method is that it does not compare all of the sequences together, rather compares only a pair of sequences at a time. The primary advantage of implementing Feng - Doolittle algorithm is that it performs faster and more efficient than the generalized dynamic programming methods.
The computational procedure of Feng-Doolittle method is described as below:
(a) Construct a half-matrix of n (n-1) distances between all pairs of n sequences using standard pair wise alignment of dynamic programming method.
(b) Construct a phylogeny tree from the distance matrix using a tree clustering algorithm, which gives a representation of the relationships between the sequences. In this tree, sequences that are similar are arranged close to each other, and sequences that are dissimilar are arranged further away.
(c) The nearest two sequences as indicated by the guide tree are aligned using a pair wise alignment method. The next nearest sequence is then added to this alignment, and so on, until all sequences are part of the MSA. The procedure continues iteratively.
Using the above procedure for the respective protein sequences shown in Figure 1, the pair-wise similarity scores are constructed using CLUSTALW. The half matrix of n (n-1) distances between all pairs of five sequences using standard pair-wise alignment is shown in Table 2. Based on the pair-wise similarity scores, the phylogeny tree obtained is shown in Figure 2.
The final step of this algorithm was implemented with the support of EMBOSS, a tool for pair-wise sequence alignment. In this illustration, the nearest two sequences as indicated in phylogeny tree are initially aligned using a standard pair-wise alignment method. The next nearest sequences is then added to this alignment, and continues, until five sequences are part of the multiple sequence alignment method. Then, the pair-wise similarity scores with default parameters are performed using a standard dynamic programming method, called Needleman-Wunsch. The respective results are shown in Table 3.
3. Results and Discussions:
The paper used the Uni-Prot database for testing the performance of multiple sequence alignment by implementing CLUSTAL program. To fulfill the goal, the computational tasks are presented in two sections. Section 1 presents the theoretical view of multiple sequence alignment and its applications are outlined in the form of literature survey. Section 2 describes the data source (presented in Table 1) and analysis of progressive or hierarchical method to MSA, called Feng-Doolittle algorithm. A general procedure for aligning those sequences are discussed in Section 2(b) and Figure 1 shows how multiple alignments are performed using this tool with default parameters for the data sequences.
In Section 2(C), the computational work is carried out step-by-step procedure. Initially, the pair-wise similarity scores between all pairs of sequences are determined using a standard pair-wise dynamic programming method possessing a half matrix of n (n-1) distances, where n = 5. The respective result was shown in Table 2. Secondly, based on the similarity scores obtained in Table 2, CLUSTALW produced a phylogeny tree used to infer biologically relevant phylogenetic relationships between the sequences. The result is shown in Figure 2.
At a final stage, the nearest two sequences, as indicated by the guide tree are aligned using a pair-wise alignment method. For aligning those sequences, a standard dynamic programming algorithm called Needleman-Wunsch, was implemented by a popular tool called, EMBOSS. The respective results are displayed in Table 3, with default parameters of score matrix BLOSUM 62. From Table 3, it is evident that first two protein sequences possessed a high similarity score of 520.00 when compared with the other pairs of sequences.
4. Conclusions:
The paper focused the theoretical view of Feng-Doolittle algorithm, one of the most popular programs used for MSA, called CLUSTAL. An illustration was carried out to study the performance of the above technique in the context of bioinformatics. The multiple protein sequences were collected and analyzed with the help of CLUSTAL and pair-wise similarity scores using EMBOSS. It combines pair-wise alignments beginning with the most similar pair and progressing to the most distantly related, which finally builds up a MSA solution. The construction of phylogeny tree and pair-wise similarity scores using progressive alignment technique indicates the existence of high similarity score among the sequences 1 and 2. When compared with the other standard dynamic programming methods to MSA, the progressive or hierarchical approach have a greater advantage because of its simplicity, speed and flexibility.
References
[1] D.F. Feng and R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, Journal of Molecular Evolution, 25(1987), 351-360.
[2] D.G. Higgins, J.D. Thomson and T.J. Gibson, Using CLUSTAL for multiple sequence alignments, Methods Enzymol., 266(1996), 383-402.
[3] D.J. Thompson, D.G. Higgins and T.J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research, 22(1994), 4673-4680.
[4] D.J. Lipman, S.F. Altschul and J.D. Kececioglu, A tool for multiple sequence alignment, Proceedings of the National Academy of Sciences of the United States of America, 86(1989), 4412-4415.
[5] G.H. Desmond and M.S. Paul, CLUSTAL: A package for performing multiple sequence alignment on a microcomputer, Gene, 73(1), (1988), 237-244.
[6] G.J. Barton and M.J.E. Sternberg, Flexible protein sequence patterns - A sensitive method to detect weak structural similarities, Journal of Molecular Biology, 212(1990), 389-402.
[7] J.G. Barton, Protein sequence alignment techniques, Acta Cryst., 54(1988), 1139-1146.
[8] M.A. McClure, T.K. Vasi and W.M. Fitch, Comparative analysis of multiple protein sequence alignment methods, Molecular Biology and Evolution, 11(1994), 571-592.
[9] N. Gautham, Bioinformatics: Databases and Algorithms, Narosa Publishing House Ltd, 2007.
[10] S.B. Needleman and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, 4(1970), 443-453.
[11] S.C. Chen, A.K.C. Wong and D.K.Y. Chiu, A survey of multiple sequence comparison methods, Bulletin of Mathematical Biology, 54(1992), 563-598.
[12] S. Henikoff and J.G. Henikoff, Amino acids substitution matrices from protein blocks, Proc. Nat Acad. Sci. U.S.A., 89(1992), 109-115.
[13] T.F. Smith and M.S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology, 147(1981), 195-197.
[14] W. Bains, MULTAN: A program to align multiple DNA sequences, Nucleic Acids Research, 14(1986), 159-177.
[15] W.S. Juang and S.F. Su, Multiple sequence alignment using modified dynamic programming and particle swarm optimization, Journal of the Chinese Institute of Engineers, 31(2008), 659-673.
G. Suresh1, * and C. Vijayalakshmi2
1 Department of Mathematics, VV College of Engineering, Tamil Nadu, India
2 Department of Mathematics, VIT University, Tamil Nadu, India
* Corresponding author, e-mail: ([email protected])
(Received: 12-7-13; Accepted: 18-8-13)
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright International Journal of Pure and Applied Sciences and Technology Oct 2013
Abstract
The Multiple Sequence Alignment (MSA) methods aims to align several sequences together and find similarities or differences, and infer structural, functional or evolutionary relationships. Since a multiple sequence alignment problem can be solved by a well known dynamic programming algorithm that has exponential time complexity, heuristic approaches are commonly used. In this paper, multiple sequence alignment techniques explained and discussed how the Feng and Doolittle algorithm was implemented in one of the most popular programs used for MSA namely, CLUSTAL. Also, the statistical significance scores of an alignment are shown numerically and graphically based on EMBOSS, for the searched data.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer