Title
Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads.
Abstract
The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a de Bruijn graph (DBG) is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99-111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644-652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086-1092, 3), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Repeats create complicated regions in a DBG, and assemblers that try to traverse these regions can infer erroneous transcripts. Based on this observation, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134-1144, 5) on both real and simulated datasets for detecting chimeras, and is therefore able to capture assembly errors missed by these methods.
Year
2017
DOI
10.1186/s13015-017-0091-2
Venue
Algorithms for Molecular Biology
Keywords
Alternative splicing, Assembly evaluation, De Bruijn graph topology, Enumeration algorithm, Formal model for representing repeats, RNA-seq, Repeats, Transcriptome assembly
Field
Assemblers, Genome, De novo transcriptome assembly, RNA-Seq, Computer science, Transcriptome, Heuristics, De Bruijn sequence, Bioinformatics, Genetics, Sequence assembly
DocType
Journal
Volume
12
Issue
1
ISSN
1748-7188
Citations
1
PageRank
0.37
References
7
Authors
8
Name                  Order  Citations  PageRank
Leandro Lima          1      2          1.40
B. Sinaimeri          2      47         11.75
Gustavo Sacomoto      3      45         5.81
Helene Lopez-Maestre  4      1          0.37
Camille Marchet       5      2          2.09
Vincent Miele         6      73         7.42
Marie-France Sagot    7      1337       109.23
Vincent Lacroix       8      301        21.03