gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Document Type : Research Article

Authors

1 Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, I.R. IRAN

2 Department of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, I.R. IRAN

Abstract

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods applied to this problem. We introduce gpALIGNER, a fast pairwise DNA-DNA global alignment algorithm. gpALIGNER uses similar score schema with DIALIGN-T to produce the final alignment. It also uses the concept of “spaced seeds” to determine locally aligned subsequences which construct semi-global alignment as the preliminaries of global alignment computation. This enables gpALIGNER to have the precision provided by the DIALIGN-T algorithm in considerably less time and space complexities. We performed benchmarking of our approach based on numerous datasets from standard benchmarking databases and real sequences of NCBI database where gpALIGNER performed three times faster than DIALIGN-T. gpALIGNER is a new alternative for having sensitivity and selectivity of DIALIGN-T but with less computational cost.

Keywords

Main Subjects


[1]  Needleman S.B., Wunsch C.D., A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins, J. Mol. Biol., 48, p. 443 (1970).
[2] Gish W., Miller W., Myers E.W., Altschul S.F., Lipman D., Basic Local Alignment Search Tool, J. Mol. Biol., 215, p. 403 (1990).
[3] Lipman D., Pearson W., Rapid and Sensitive Protein Similarity Searches, Science, 227, p. 1435 (1985).
[4] Schaffer A., Zhang J., Zhang Z., Miller W., Altschul S., Lipman D., Gapped Blast and Psiblast: A New Generation of Protein Database Search Programs, Nucleic Acids Res., 25, p. 3389 (1997).
[5] Fleischmann S., Peterson R.D., White J., Delcher O., Kasif A.L., Salzberg S.L., Alignment of Whole Genomes, Nucleic Acids Research., 27, p. 2369 (1999).
[6] Pachter L., Bray N., DubchakI., Avid: A Global Alignment Program, Genome Res., 13, p. 97 (2003).
[7] Brudno M., Chapman M., Göttgens B., Batzoglou S., Morgenstern B., Fast and Sensitive Multiple Alignment of Large Genomic Sequences, BMC Bioinformatics, 4, p. 66 (2003).
[8] Morgenstern B., DIALIGN 2: Improvement of the Segment-to-Segment Approach to Multiple Sequence Alignment, Bioinformatics, 15, p. 211 (1999).
[9] Subramanian A., Weyer-Menkhoff J., Kaufmann M., Morgenstern B., DIALIGN-T: an Improved Algorithm for Segment-Based Multiple Sequence Alignment, BMC ­Bioinformatics, 6, p. 66 (2005).
[10] MA B., TROMP J., LI M., PatternHunter: Faster and More Sensitive Homology Search, Bioinformatics, 18, p. 440 (2002).
[11] Noé L., Kucherov G., Improved Hit Criteria for DNA Local Alignment, BMC Bioinformatics, 5, p. 149 (2004).
[12] Buhler J.,KeichU., Sun Y., Designing Seeds for Similarity Search in Genomic DNA. in "Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB’03)", 10–13 April, Berlin, Germany, ACM Press, pp. 67–75 (2003).
[13] Brejova B., Brown D., Vinar T., Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity, "Proc. 3rd International Workshop on Algorithms in Bioinformatics (WABI'03)", LNCS, 2812, p.39 (2003).
[14] Brown D., Li M., Ma B., Homology  Search Methods. In Wong, L.(Ed.),"The Practical Bioinformatician",  Singapore World Scientific Press, pp. 217–244 (2004).
[15] Brejova B., Brown D., Vinar T., Optimal Spaced Seeds for Hidden Markov Models, with Application to Homo-Logous Coding Regions, In: "R. Baeza-Yates, E. Chavez, M. Crochemore, ed., Combinatorial Pttern Matching, 14th Annual Symposium (CPM)", LNCS, 2676, p. 42 (2003)
[16] Choi K.P., Zeng F., Zhang L., Good Spaced Seeds for Homology Search, Bioinformatics, 20, p. 1053 (2004).
[17] KeichU., Li M., Ma B., Tromp J., On Spaced Seeds for Similarity Search, Discrete Appl. Math., 138,
p. 253 (2004).
[18] Kucherov G., Noe´ L., Ponty Y., Estimating Seed Sensitivity on Homogeneous Alignments. In "Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE2004)", May 19–21,Taichung,Taiwan, IEEE Computer Society Press, pp. 387–394 (2004).
[19] Li M., Ma B., Kisman D., Tromp J., PatternHunter II: Highly Sensitive and Fast Homology Search,
J. Bioinform. Comput. Biol., 2, p. 417 (2004).
[20] Kurtz  S.,  Choudhuri  J.V.,  Ohlebusch  E., Schleiermacher C., Stoye J., Giegerich R., REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res., 29, p. 4633 (2001).
[21] Morgenstern B., Dress A., Werner T., Multiple DNA and Protein Sequence Alignment Based on Segment-to-Segment Comparison, Proc. Natl. Acad. Sci. USA., 93, p. 12098 (1996).
[22] Noé  L.,  Kucherov  G.,  YASS:  Enhancing the Sensitivity of DNA Similarity Search, Nucleic Acids Res., 33 (web-server issue):W540-W543 (2005).
[23] Thompson  J.D.,  Higgins  D.G.,  Gibson  T.J., CLUSTALW: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position Specific Gap Penalties and Weight Matrix Choice, Nucleic Acids Res., 22, p. 4673 (1994).
[24] Edgar R.C., MUSCLE: Multiple Sequence Alignment with High Score Accuracy and High Throughput, Nucleic Acids Res., 32, p. 1792 (2004).
[25] Wilm A., MainzI., Steger G., An Enhanced RNA Alignment Benchmark for Sequence Alignment Programs, Algorithms. Mol. Biol., 1, p. 19 (2006).
[26] Fickett  J.W.,  Wasserman  W.W.,  Discovery  and Modeling of Transcriptional Regulatory Regions, Curr. Opin. Biotechnol., 11, p. 19 (2000).
[27] Loots  G.G.,  Locksley  R.M.,  Blankespoor  C.M., Wang Z.E., Miller W., Rubin E.M., Frazer K.A., Identification of a Coordinate Regulator of Interleukins 4, 13, and 5 by Cross-Sspecies Sequence Comparisons, Science, 288, p. 136 (2000).
[28] Mayor C. et al.,  VISTA:  Visualizing  Global  DNA Sequence Alignments of Arbitrary Length, Bioinformatics, 16, p. 1046 (2000).