Abstract | ||
---|---|---|
This paper provides a framework for statistical modeling of genomic sequences. Such a framework can be used as the basis for synthesizing similar sequences, which could in turn be used for further inference about the genomic sequences. We start by converting the sequence of nucleotides from the genome into a decimal sequence via Huffman coding. Using the Hodrick-Prescott filter (HP filter), this decimal sequence is decomposed into two components, namely trend and cyclic. Next, the ARIMA-GARCH statistical modeling approach is applied to the trend component, which exhibits heteroskedasticity. The autoregressive integrated moving average (ARIMA) model is used to capture the linear characteristics of the sequence, while the generalized autoregressive conditional heteroskedasticity (GARCH) model is applied to capture the statistical nonlinearity of the genome sequence. This modeling approach allows us to synthesize a given genomic sequence based on its statistical characteristics. Finally, the probability distribution function (PDF) of a given sequence is estimated using a Gaussian mixture model, and based on the estimated PDF, we determine a new PDF representing sequences that statistically counteract the original sequence. We applied the proposed framework to several genes, as well as to the HIV nucleotide sequence. The corresponding results show some promise. |
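
The trend/cyclic decomposition step mentioned in the abstract can be illustrated with a minimal sketch of the standard Hodrick-Prescott filter. This is not the authors' implementation: the function name `hp_filter`, the smoothing value `lam`, and the `TOY_CODE` nucleotide-to-number mapping are all assumptions made for illustration. In particular, `TOY_CODE` is a stand-in and does not reproduce the Huffman-based encoding described in the paper.

```python
import numpy as np


def hp_filter(y, lam=1600.0):
    """Split y into (trend, cycle) with the Hodrick-Prescott filter.

    lam is the smoothing parameter; 1600 is only a conventional default
    and is an assumption here, not a value taken from the paper.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    if n < 3:
        # Second differences are undefined; treat everything as trend.
        return y.copy(), np.zeros_like(y)
    # D is the (n-2) x n second-difference operator.
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    # The trend minimizes ||y - t||^2 + lam * ||D t||^2,
    # which gives the linear system (I + lam * D^T D) t = y.
    trend = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    return trend, y - trend


# Hypothetical stand-in encoding (NOT the paper's Huffman-based mapping).
TOY_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
seq = "ATGCGTACGTTAGC"
values = [TOY_CODE[base] for base in seq]
trend, cycle = hp_filter(values, lam=10.0)
```

By construction the two components sum back to the input sequence, and a larger `lam` yields a smoother trend. In the pipeline the abstract describes, the trend component would then be passed to the ARIMA-GARCH modeling stage.
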
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/BIBM49941.2020.9313090 | BIBM |
DocType | Citations | PageRank |
Conference | 0 | 0.34 |
References | Authors | |
0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Salman Mohamadi | 1 | 0 | 0.34 |
Donald A. Adjeroh | 2 | 811 | 64.20 |
Behnoush Behi | 3 | 0 | 0.34 |
Hamidreza Amindavar | 4 | 215 | 36.34 |