Artifact of Detecting Biomarkers Associated with Sequencing Depth in RNA-Seq

Yin, RuoFei (2023) Artifact of Detecting Biomarkers Associated with Sequencing Depth in RNA-Seq. Master's Thesis, University of Pittsburgh. (Unpublished)



RNA-Seq is a highly sensitive and accurate sequencing technique that uses next-generation sequencing (NGS) technology to reveal the presence and quantity of RNA in a biological sample at a given moment, which is useful for studying the behavior of genes under different biological conditions.[1,2] An essential step in an RNA-Seq study is normalization, in which raw data are adjusted to account for systematic technical biases such as library size and transcript length.[3] Several popular normalization methods have been proposed and are widely used, including counts per million (CPM), transcripts per million (TPM), and reads per kilobase per million mapped reads (RPKM). Although normalization is expected to eliminate systematic experimental bias and technical variation, we unexpectedly found that a large proportion of genes remained associated with library size in normalized RNA-Seq data from human post-mortem striatum. In this thesis, we confirmed that this problem is widespread by systematically examining 159 Gene Expression Omnibus (GEO) datasets and 24 datasets from The Cancer Genome Atlas (TCGA). We conducted a simulation study to rule out potential causes arising from count data quantification, and we examined a potential correction for the artifact based on a previously published Poisson model with variable rates for different nucleotide patterns. We reproduced the results of that publication and applied the published model to these data to test whether library size affected the regression. Linear regression of the model coefficients on library size showed no evidence of an association. As a future direction, we plan to replace the Poisson model with a negative binomial model, which may improve the model fit and could be developed into a correction for the artifact. If successful, the new normalization will improve association analysis and biomarker detection in basic and clinical studies of diseases.
Public health significance: Little research has focused on the artifact of biomarkers associated with sequencing depth in normalized RNA-Seq datasets, which should be corrected to improve accuracy in downstream translational research. This thesis investigates that artifact.
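
For readers unfamiliar with the normalization methods named in the abstract, the following is a minimal sketch of the standard CPM, RPKM, and TPM definitions, followed by a simple check of whether a normalized gene still tracks library size. It is not the thesis's analysis: the count matrix and gene lengths are made-up toy values, the function names are hypothetical, and a plain Pearson correlation stands in for whatever association test the thesis uses.

```python
import math


def cpm(counts, library_size):
    """Counts per million: scale each gene count by the sample's library size."""
    return [c / library_size * 1e6 for c in counts]


def rpkm(counts, lengths_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads."""
    return [
        c / (length / 1e3) / (library_size / 1e6)
        for c, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to sum to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Toy data (assumed for illustration only): rows are samples, columns are genes.
counts = [
    [10, 210, 780],
    [30, 400, 1570],
    [90, 790, 3120],
]
lengths_bp = [1000, 2000, 4000]  # hypothetical transcript lengths in base pairs

# Library size is taken here as the total count per sample.
library_sizes = [sum(row) for row in counts]
cpm_matrix = [cpm(row, n) for row, n in zip(counts, library_sizes)]

# If normalization removed the depth effect, CPM should not track library size.
for g in range(len(lengths_bp)):
    gene_cpm = [row[g] for row in cpm_matrix]
    r = pearson(gene_cpm, library_sizes)
    print(f"gene {g}: correlation of CPM with library size = {r:.3f}")
```

In this toy matrix the first gene's CPM rises with library size even after normalization, which is the kind of residual association the abstract describes. With real data one would use many samples and a formal test with multiple-testing correction rather than a raw correlation on three points.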




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Yin, RuoFei
Email: RUY28@pitt.edu
Pitt Username: RUY28
ORCID: 0009-0005-4117-380X
ETD Committee:
Committee Chair: Tseng, George C; Email Address: ctseng@pitt.edu; Pitt Username: ctseng
Committee Member: Carlson, Jenna Colavincenzo; Email Address: jnc35@pitt.edu; Pitt Username: jnc35
Committee Member: Fan,
Date: 11 May 2023
Date Type: Publication
Defense Date: 21 May 2023
Approval Date: 11 May 2023
Submission Date: 27 April 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 33
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: RNA-Seq; Normalization; Sequencing depth
Related URLs:
Date Deposited: 11 May 2023 16:55
Last Modified: 11 May 2023 16:55

