Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM

Document Type: Review Article


Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry & Biophysics (IBB), University of Tehran, Tehran, I.R. IRAN


Structural repetitive subsequences are most important portion of biological sequences, which play crucial roles on corresponding sequence’s fold and functionality. Biggest class of the repetitive subsequences is “Transposable Elements” which has its own sub-classes upon contexts’ structures. Many researches have been performed to criticality determine the structure and function of repetitive subsequences. The sequencing noises and the sequences’ substitutions probability are obstacles of these researches. Some statistical and approximation algorithms have introduced to tackle these obstacles. By introducing conspicuous statistical machine learning methods upon Support Vector Machines, machine learning approaches act as potent methods to solve the pattern-finding problem. Support vector machines methods are time efficient approaches, which based on their parameters can be precise and accurate. In this Review, mathematical definition of structural repetitive subsequences are introduced, thereafter proposed algorithm to tackle simple pattern finding problem, which can be applicable on structural patterns are reviewed. Theoretical aspects of Support Vector Machines on computational biology platform are considered. Finally, novel evolutionary Fuzzy SVM will be introduced, which is applicable on wide range of bioinformatics problems especially the problem of structural repetitive subsequences.


Main Subjects

[1] Britten R.J., Graham D.E., Neufeld B.R., Analysis of Repeating DNA Sequences by Reassociation Kinetics, Methods Enzymol, 29, p. 363 (1974).

[2] Consortium I.H.G., Initial Sequencing and Analysis of the Human Genome. International Human Genome Consortium, Nature, 409, p. 860 (2001).

[3] Brejova B. et al., Finding Patterns in Biological Sequences, in Technical Report,UniversityofWaterloo. (2000).

[4] Linial M. et al., Global Self-Organization of All Known Protein Sequences Reveals Inherent Biological Signatures, J. Mol Bio., 268, p. 539 (1997).

[5] al., Dictionary Building via Unsupervised Hierarchical Motif Discovery in Sequence Space of Natural Proteins, Proteins, 37, p. 264 (1999).

[6] Nevill-Manning et al., Highly Specific Protein Sequence Motifs for Genome Analysis, Proceedings of the National Academy of Sciences of the United States of America, 95, p. 5865 (1998.).

[7] Savoie C.J. et al., Use of BONSAI Decision Trees for the Identication of Potential MHC Class I Peptide Epitope Motifs. In "Pacific Symposium on Biocomputing", pp. 182-189 (1999).

[8] Califano  A.,  SPLASH:  Structural  Pattern Localization Analysis by Sequential Histograms. Bioinformatics, 16, p. 341 (2000).

[9] Pedersen A.G. et al., The Biology of Eukaryotic Promoter Prediction, Computers and Chemistry, 23, p. 191 (1999).

[10] Mironov  A.A.  et al.,  Computer  Analysis  of Transcription Regulatory Patterns in Completely Sequenced Bacterial Genomes, Nucleic Acids Research, 27, p. 2981 (1999).

[11] Hughes J.D. et al., Computational Identication of Cis-Regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces Cerevisiae, Journal of Molecular Biology, 296, p. 1205 (2000).

[12] Gelfand  M.S.,  Koonin  E.V.,  Mironov  A.A., Prediction of Transcription Regulatory Sites in Archaea by a Comparative Genomic Approach. Nucleic Acids Research, 28, p. 695 (2000).

[13] Gorodkin J., Heyer L.J., Stormo G.D., Finding the Most Significant Common Sequence and Structure Motifs in a Set of RNA Sequences, Nucleic Acids Research, 25, p. 3724 (1997).

[14] Benson G., Tandem Repeats Finder: aA Program to Analyze DNA Sequences, Nucleic Acids Research, 27, p. 573 (1999).

[15] Wicker T., François Sabot, Hua-Van A., A Unified Classification System for Eukaryotic Transposable Elements, Nature Reviews Genetics, 8, p. 973 (2007).

[16] Dorohonceanu B., Nevill-Manning C.G., Accelerating Protein Classification Using Suffix Trees. In "Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pp. 128-133 (2000).

[17] Schneider T.D., Stephens R.M., Sequence Logos: A New Way to Display Consensus Sequences, Nucleic Acids Research, 18, p. 6097 (1990).

[18] Gorodkin(a) J., Heyer L.J., Stormo G.D., Displaying the Information Contents of Structural RNA Alignments: the Structure Logos, Computer Applications in the Biosciences, 13, p. 583 (1997).

[19] van Helden J., Andre B., Collado-Vides J., Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies, Journal of Molecular Biology, 281, p. 827 (1998).

[20] Tompa M., An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem. In "Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology", pp. 262-271 (1999).

[21] Smith H.O., Annau T.M., Chandrasegaran S., Finding Sequence Motifs in Groups of Functionally Related Proteins, Proceedings of the National Academy of Sciences of the United States of America, 87, p. 826 (1990).

[22] RigoutsosI., Floratos A., Motif Discovery without Alignment or Enumeration (Extended Abstract). In "Proceedings of the second annual international conference on Computational molecular biology", pp. 221-227 (1998).

[23] Pevzner P.A., Sze S.H., Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. In "Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology", pp. 269-278 (2000).

[24] LawrenceC.E. et al., Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment, Science, 262, pp. 208-214 (1993).

[25] Lawrence C.E.,  Reilly  A.A.,  An  Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences, Proteins, 7, p. 41 (1990).

[26] Durbin R. et al., "Biological Sequence Analysis",CambridgeUniversityPress (1998).

[27] Boser B.,GuyonI., Vapnik V., A Training Algorithm for Optimal Margin Classifiers, in "Annual Workshop on Computational Learning Theory",Pittsburgh: ACM Press (1992).

[28] Cortes C., Vapnik V., Support-Network. "Machine Learning", 20 (1995).

[29] Vapnik V., "The Nature of Statistical Learning Theory". New York.: Springer-Verlag (1995).

[30] Suykens J.A.K., Advances in Learning Theory: Methods, Models and Applications Nato Science  Series. Series III, Computer and Systems Sciences, 190,Amsterdam (2003): I O S Press. Cristianini, J.S.-T., "Supportector Machines and other kernel-based learning methods".CambridgeUniversity Press (2000).

[31] Wang L., "Support Vector Machines: Theory and Applications". Springer (2005).

[32] Kreßel U.H.-G., Pairwise Classification and Support Vvector Machines, "Advances in Kernel Methods: Support Vector Learning", ed. B. Sch¨ olkopf C.J.C.B., Smola A.J.,Cambridge,MA: The MIT Press (1999).

[34] Platt J.C., Cristianini N., Shawe-Taylor J., Large Margin DAGs for Multiclass Classification, in "Advances in Neural Information Processing Systems", Solla T.K.L.S.A., Muller K.-R., Editor, The MIT Press (2000).

[35] Inoue T., Abe S., Fuzzy Support Vector Machines for Pattern Classification. in IJCNN '01, "International Joint Conference Neural Networks",Washington,DC,USA: IEEE (2001).

[36] Abe S., Inoue T., Fuzzy Support Vector Machines for Multiclass Problems. in ESANN'2002 Proceedings, Bruges(Belgium): "European Symposium on Artificial Neural Networks" (2002).

[37] Dashti H, Masoudi-Nejad A., De novo LTR Discovery and Clustering, a Support Vector Machine Approach in Center of Excellence in Biomathematics, Institute of Biochemistry and Biophysics: Tehran (2007).

[38] Bao Z., Eddy S., Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes, Genome Res, 8, p. 1269 (2002).

[39] Pevzner P.A., Tang H., Tesler G., De Novo Repeat Classification and Fragment Assembly. Genome Res, 14, p. 1786 (2004).

[40] Price A.L., Jones N.C., Pevzner P.A., De Novo Identification of Repeat Families in Large Genomes, Bioinformatics, 21, p. i351 (2005).

[41] Masoudi-Nejad A., Movahedi S., Jáuregui R., Genome-Scale Computational Analysis of DNA Curvature and Repeats in Abidopsis and Rice Uncovers Plant-Specific Genomic Prop-Erties, BMC Genomics, 12, p. 214 (2011).

[42] Zamani Z., Hajihosseini A., Masoudi-Nejad A., Computational Methodologies for Analyzing, Modeling and Controlling Gene Regulatory Net-Works, Biomedical Engineering and Computational Biology, 2, p. 47 (2010).

[43] Askary A., Masoudi-Nejad A., Mizbani A., Naderi-Parizi S., Purmasjedi M., N4: A Precise and Highly Sensitive Promoter Predictor Using Neural Network Fed by Nearest Neigh-Bors, Genes and Genetic Systems, 84, 425-430 (2009).

[44] Masoudi-Nejad A., Tonomura K., Kawashima S., Itoh M., Kanehisa M., Endo T., Goto S., EGassembler: Online Bioinformatics Service for Large-Scale Processing, Clustering and Assembling ESTs and Genomic DNA Fragmets, Nucleic Acids Research, 34, W459-W462 (2006).

[45] Masoudi-Nejad A., Goto S., Jauregui R., Itoh M., Kawashima S., Moriya Y., Endo T., Kanehisa M., EGENES: Transcriptome-Based Plant Database of Genes with metabolic Pathway Information and EST in-Dices in KEGG, Plant Physiology, 144, p. 857 (2007).