Yin, RuoFei (2023) Artifact of Detecting Biomarkers Associated with Sequencing Depth in RNA-Seq. Master's Thesis, University of Pittsburgh. (Unpublished)
Abstract
RNA-Seq is a highly sensitive and accurate technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, which makes it useful for studying gene behavior under different biological conditions.[1,2] An essential step in an RNA-Seq study is normalization, in which raw data are adjusted to account for systematic technical biases such as library size and transcript length.[3] Several popular normalization methods are widely used, including counts per million (CPM), transcripts per million (TPM), and reads per kilobase per million mapped reads (RPKM). Although normalization is expected to eliminate systematic experimental bias and technical variation, we found, surprisingly, that a large proportion of genes remained associated with library size in normalized RNA-Seq data from human post-mortem striatum. In this thesis, we confirmed that this problem is pervasive by systematically examining 159 Gene Expression Omnibus (GEO) datasets and 24 datasets from The Cancer Genome Atlas (TCGA). We conducted a simulation study to rule out potential causes arising from count data quantification. We then examined a potential correction for the artifact based on a previously published Poisson model with variable rates for different nucleotide patterns. We reproduced the results of that publication and applied the model to these data to see whether library size affected the regression. Linear regression of the model coefficients on library size did not show evidence of an association. As a future direction, we plan to replace the Poisson model with a negative binomial model, which may improve the model fit and could be developed into a solution that corrects the artifact. If successful, the new normalization will improve association analysis and biomarker detection in basic and clinical studies of disease.
Public health significance: Little research has focused on the artifact of biomarkers associated with sequencing depth in normalized RNA-Seq datasets. This artifact should be corrected to improve accuracy in downstream translational research. This thesis characterizes the artifact and examines a potential correction.
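The check described in the abstract, whether normalized expression still tracks library size, can be illustrated with a small sketch. This is not the thesis's procedure or its Poisson model. It assumes a genes-by-samples count matrix, applies CPM, and regresses each gene's log2(CPM + 1) on log2(library size) by ordinary least squares. The function names (`library_sizes`, `cpm`, `ols_slope`, `depth_slopes`) and the toy counts are hypothetical.

```python
import math


def library_sizes(counts):
    """Column sums of a genes-by-samples count matrix (a list of gene rows)."""
    return [sum(col) for col in zip(*counts)]


def cpm(counts):
    """Counts per million: scale each sample's counts by its library size."""
    sizes = library_sizes(counts)
    return [[c * 1e6 / n for c, n in zip(row, sizes)] for row in counts]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def depth_slopes(counts):
    """Per-gene slope of log2(CPM + 1) on log2(library size).

    A slope near zero means the normalized gene no longer tracks depth;
    a clearly nonzero slope is the kind of residual association at issue.
    """
    log_sizes = [math.log2(n) for n in library_sizes(counts)]
    return [
        ols_slope(log_sizes, [math.log2(v + 1) for v in row])
        for row in cpm(counts)
    ]


# Hypothetical toy data: 2 genes x 3 samples with library sizes 100, 200, 400.
proportional = [[10, 20, 40], [90, 180, 360]]   # each gene scales with depth
shifted = [[10, 40, 160], [90, 160, 240]]       # gene shares change with depth

slopes_proportional = depth_slopes(proportional)
slopes_shifted = depth_slopes(shifted)
```

In the first toy matrix every gene keeps the same share of its library, so CPM removes the depth effect and the slopes are close to zero. In the second the shares drift with depth, so the slopes stay nonzero after normalization. A real analysis would also need a significance test and multiple-testing control across genes, which this sketch omits.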
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Date: 11 May 2023
Date Type: Publication
Defense Date: 21 May 2023
Approval Date: 11 May 2023
Submission Date: 27 April 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 33
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: RNA-Seq; Normalization; Sequencing depth
Date Deposited: 11 May 2023 16:55
Last Modified: 11 May 2023 16:55
URI: http://d-scholarship.pitt.edu/id/eprint/44782