Finding Exact and Solo LTR-Retrotransposons in Biological Sequences Using SVM

Document Type: Research Article


Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and Center of Excellence in Biomathematics, University of Tehran, Tehran, I.R. IRAN


Finding repetitive subsequences in a genome is a challenging problem in bioinformatics. Many approaches have been proposed, and they can be divided into library-based and de novo methods. Library-based methods rely on a predetermined library of repetitive genomic subsequences, whereas library-less (de novo) methods attempt to discover repetitive subsequences analytically. In this article we propose a novel de novo methodology grounded in pattern recognition theory. Using Support Vector Machine (SVM) classification and clustering methods, our methodology can extract exact and solo LTR-retrotransposons. The methodology is used to show the efficiency, in terms of complexity, and the applicability of pattern recognition theory in bioinformatics and biomathematics. We demonstrate its applicability by comparing its results with those of another well-known de novo method. Both applications return classes of discovered repetitive subsequences, and their results show more than 90 percent similarity.
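The library-less idea described above can be illustrated with a minimal sketch. It is not the SVM-based methodology of this article; it only assumes that a repeat is an exact substring of a fixed length `k` occurring more than once, and finds such substrings without any predetermined library. The function name `exact_repeats` and the toy sequence are hypothetical.

```python
from collections import defaultdict


def exact_repeats(sequence, k):
    """Return every length-k substring that occurs more than once,
    mapped to the list of its start positions (0-based)."""
    positions = defaultdict(list)
    # Index every k-mer by the position at which it starts.
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    # Keep only k-mers seen at two or more positions.
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


# Toy sequence, chosen only for illustration.
genome = "ACGTACGTTTGACGTA"
repeats = exact_repeats(genome, 4)
for kmer, starts in sorted(repeats.items()):
    print(kmer, starts)
```

A library-based method would instead compare the sequence against known repeat families; here the repeated substrings emerge from the sequence itself.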


