ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs)

Chun Liang; Gang Wang; Lin Liu; Guoli Ji; Lin Fang; Yuansheng Liu; Carter, Kikia; Webb, Jason S; Dean, Jeffrey FD
January 2007
BMC Genomics;2007, Vol. 8, p134
Academic Journal
Background: With the advent of low-cost, high-throughput sequencing, the amount of public domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, they also present a serious challenge for data quality control and validation due to their inherent deficiencies, particularly for species without genome sequences.
Description: ConiferEST is an integrated system for data reprocessing, visualization and mining of conifer ESTs. In its current release, Build 1.0, it houses 172,229 loblolly pine EST sequence reads, which were obtained by reprocessing raw DNA sequencer traces with our software, WebTraceMiner. The trace files were downloaded from the NCBI Trace Archive. ConiferEST provides biologists with unique, easy-to-use data visualization and mining tools for a variety of putative sequence features, including cloning vector segments, adapter sequences, restriction endonuclease recognition sites, polyA and polyT runs, and their corresponding Phred quality values. Based on these putative features, verified sequence features such as the 3′ and/or 5′ termini of cDNA inserts in either the sense or the non-sense strand have been identified in silico. Interestingly, only 30.03% of the designated 3′ ESTs were found to have an authenticated 5′ terminus in the non-sense strand (i.e., polyT tails), while fewer than 5.34% of the designated 5′ ESTs had a verified 5′ terminus in the sense strand. Such previously ignored features provide valuable insight for data quality control and validation of error-prone ESTs, as well as the ability to identify novel functional motifs embedded in large EST datasets. We found that "double-termini adapters" were effective indicators of potential EST chimeras. For all sequences with an in-silico verified terminus or termini, we used InterProScan to assign protein domain signatures; the results are available for in-depth exploration through our biologist-friendly web interfaces.
Conclusion: ConiferEST represents a unique and complementary public resource for EST data integration and mining in conifers by reprocessing raw DNA traces, identifying putative sequence features, and determining and annotating in-silico verified features. Seamlessly integrated with other public resources, ConiferEST provides biologists with powerful tools to verify data, visualize abnormalities, including EST chimeras, and explore large EST datasets.
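
As a rough illustration of one kind of putative feature named in the Description, the sketch below flags polyA and polyT runs in a base-called read and reports the mean Phred quality over each run. It is not WebTraceMiner's algorithm, which the abstract does not describe; the function name `find_poly_runs`, the minimum run length of 10 bases, and the toy read are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple, Sequence


class PolyRun(NamedTuple):
    base: str            # "A" or "T"
    start: int           # 0-based, inclusive
    end: int             # 0-based, exclusive
    mean_quality: float  # mean Phred score over the run


def find_poly_runs(seq: str, quals: Sequence[int], min_len: int = 10) -> List[PolyRun]:
    """Return homopolymer A/T runs of at least min_len bases (hypothetical helper)."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality arrays must have equal length")
    # min_len is an assumed threshold, not a value taken from the article.
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    runs = []
    for match in pattern.finditer(seq.upper()):
        window = quals[match.start():match.end()]
        runs.append(
            PolyRun(match.group()[0], match.start(), match.end(), sum(window) / len(window))
        )
    return runs


# Toy read: a leading polyT run, a short insert, and a trailing polyA run.
read = "T" * 12 + "GCATGCCGTC" + "A" * 15
quals = [20] * 12 + [35] * 10 + [30] * 15
runs = find_poly_runs(read, quals)
for run in runs:
    print(run)
```

A real pipeline would also tolerate occasional miscalled bases inside a run and would combine these hits with vector and adapter matches before calling a terminus verified; this sketch only matches exact homopolymers.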


Related Articles

  • Statistical Data Mining's Challenges in Bioinformatics. Kuonen, Diego // Scientific Computing World;Nov/Dec2004, Issue 79, p31 

    The article focuses on the problems encountered by data miners and bioinformaticians concerning statistical data mining. From a bioinformatician's perspective, statistical data miners do not understand the biological questions, take too long to come up with answers, speak different scientific...

  • GWAS Integrator: a bioinformatics tool to explore human genetic associations reported in published genome-wide association studies. Wei Yu; Yesupriya, Ajay; Wulf, Anja; Hindorff, Lucia A.; Dowling, Nicole; Khoury, Muin J.; Gwinn, Marta // European Journal of Human Genetics;Oct2011, Vol. 19 Issue 10, p1095 

    Genome-wide association studies (GWAS) have successfully identified numerous genetic loci that are associated with phenotypic traits and diseases. GWAS Integrator is a bioinformatics tool that integrates information on these associations from the National Human Genome Research Institute (NHGRI)...

  • GenMiner: mining non-redundant association rules from integrated gene expression data and annotations. Ricardo Martinez; Nicolas Pasquier; Claude Pasquier // Bioinformatics;Nov2008, Vol. 24 Issue 22, p2643 

    Summary: GenMiner is an implementation of association rule discovery dedicated to the analysis of genomic data. It allows the analysis of datasets integrating multiple sources of biological data represented as both discrete values, such as gene annotations, and continuous values, such as gene...

  • UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets. Abu-Jamous, Basel; Fa, Rui; Roberts, David J.; Nandi, Asoke K. // BMC Bioinformatics;2015, Vol. 16 Issue 1, p1 

    Background: Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently...

  • Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19. König, Inke R.; Auerbach, Jonathan; Gola, Damian; Held, Elizabeth; Holzinger, Emily R.; Legault, Marc-André; Rui Sun; Tintle, Nathan; Hsin-Chou Yang // BMC Genetics;2/3/2016, Vol. 17, p49 

    In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting...

  • A DNA based Approach to find Closed Repetitive Gapped Subsequences from a Sequence Database. Lavanya, B.; Murugan, A. // International Journal of Computer Applications;Sep2011, Vol. 29, p45 

    In bioinformatics, the discovery of transcription factor binding affinities is important. This is done by sequence analysis of microarray data. Accurately determining continuous and gapped motifs from a given long sequence of data, say genetic data, is challenging and requires a...

  • An Algorithm for Classifying DNA Reads. Sahli, Mohammed; Shibuya, Tetsuo // International Proceedings of Chemical, Biological & Environmenta;2012, Vol. 31, p59 

    A DNA read comes from either strand of the genome, and for the last three decades it has been believed impossible to determine which strand a read actually comes from. Therefore, for the first time, we developed an algorithm for classifying DNA reads according to which strand they were...

  • Analysis of the role of retrotransposition in gene evolution in vertebrates. Zhan Yu; Morais, David; Ivanga, Mahine; Harrison, Paul M // BMC Bioinformatics;2007, Vol. 8, p308 

    Background: The dynamics of gene evolution are influenced by several genomic processes. One such process is retrotransposition, where an mRNA transcript is reverse-transcribed and reintegrated into the genomic DNA. Results: We have surveyed eight vertebrate genomes (human, chimp, dog, cow, rat,...

  • LeaderGene: A Fast Data-mining Tool for Molecular Genomics. Bragazzi, Nicola Luigi; Sivozhelezov, Victor; Nicolini, Claudio // Journal of Proteomics & Bioinformatics;2011, Vol. 4 Issue 4, p83 

    DNA microarrays are one of the most promising methods for molecular genomics, but this technique is often associated with experimental complications and difficulties in the analysis. Moreover, the greatest part of genes displayed on an array is often not directly involved in the cellular process...

