Lin, Chien-Wei
(2017)
Power calculation and study design in RNA-Seq and Methyl-Seq.
Doctoral Dissertation, University of Pittsburgh.
(Unpublished)
Abstract
Next generation sequencing (NGS) technology has emerged as a powerful tool in characterizing genomic profiles. Among several applications, RNA sequencing (RNA-Seq) and Methylation sequencing (Methyl-Seq) have gradually become standard tools for transcriptomic and epigenetic monitoring, respectively. Although the costs of NGS experiments have steadily decreased, high sequencing cost and bioinformatic complexity remain obstacles for many biomedical projects. Unlike earlier microarray technologies, NGS produces discrete count data, which the statistical modeling must take into account. In addition to sample size, sequencing depth is also directly related to experimental costs. Consequently, given a total budget and a pre-specified unit experimental cost, the study design issue in RNA-Seq/Methyl-Seq is a multi-dimensional constrained optimization problem rather than a one-dimensional sample size calculation in a traditional hypothesis setting. In the first part of this dissertation, we proposed a statistical framework, namely "RNASeqDesign", to utilize pilot data for power calculation and study design of RNA-Seq experiments. The approach was based on a mixture model fitting of the p-value distribution from pilot data and a parametric bootstrap procedure to infer genome-wide power for optimal sample size and sequencing depth. We further illustrated five practical study design tasks for practitioners. We performed simulations and real data applications to evaluate its performance and compare it with existing methods.
In the second part, we proposed another statistical framework, namely "MethylSeqDesign", specifically for Methyl-Seq data. There were two main challenges. Firstly, power calculation for Methyl-Seq data required a powerful statistical test based on a beta-binomial model. Secondly, there is an extremely high number of CpG sites (about 30M) in the human genome, which results in many CpG sites with very shallow coverage. Hence, we focused on a region-/capture-based method, which produced more counts per region/window so that power calculation became feasible.
Public health significance: As sequencing costs keep dropping, RNA-Seq and Methyl-Seq experiments will become more prevalent, and more projects with large sample sizes are expected. We believe our work will provide practical guidance for future study design to understand disease mechanisms and improve disease diagnosis and treatment.
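The budget constraint described in the abstract, where sample size and sequencing depth trade off against each other, can be illustrated with a minimal sketch. It is not the RNASeqDesign implementation: the function name, the two-group design, and the linear cost model (a fixed cost per sample plus a cost per million reads) are assumptions made for illustration only. It lists the feasible designs but does not estimate power, which the dissertation infers from pilot data.

```python
from typing import List, Tuple


def feasible_designs(
    budget: float,
    cost_per_sample: float,
    cost_per_million_reads: float,
    depths_millions: List[float],
) -> List[Tuple[int, float, float]]:
    """Enumerate (samples per group, depth, total cost) that fit the budget.

    Assumes a hypothetical two-group comparison with equal group sizes and
    a linear cost model; neither assumption is taken from the dissertation.
    """
    designs = []
    for depth in depths_millions:
        # Cost of one sequenced sample at this depth (assumed linear).
        unit_cost = cost_per_sample + cost_per_million_reads * depth
        # Largest per-group sample size affordable for two groups.
        n_per_group = int(budget // (2 * unit_cost))
        if n_per_group >= 2:  # need replicates in each group
            designs.append((n_per_group, depth, 2 * n_per_group * unit_cost))
    return designs


if __name__ == "__main__":
    for n, depth, cost in feasible_designs(10000, 200, 10, [10, 20, 50]):
        print(f"N per group = {n}, depth = {depth}M reads, cost = {cost}")
```

Each returned design lies on or just inside the budget boundary. A power estimate evaluated at each (N, depth) pair would then pick out the preferred design.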
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Lin, Chien-Wei
Date: 29 June 2017
Date Type: Publication
Defense Date: 14 April 2017
Approval Date: 29 June 2017
Submission Date: 4 March 2017
Access Restriction: 5 year -- Restrict access to University of Pittsburgh for a period of 5 years.
Number of Pages: 91
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Power calculation, Sample size, RNA-Seq data, Methyl-Seq data, Next Generation Sequencing (NGS), p-value mixture model
Date Deposited: 29 Jun 2017 23:44
Last Modified: 30 Jun 2022 15:22
URI: http://d-scholarship.pitt.edu/id/eprint/30934