
Power calculation and study design in RNA-Seq and Methyl-Seq

Lin, Chien-Wei (2017) Power calculation and study design in RNA-Seq and Methyl-Seq. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Submitted Version



Next-generation sequencing (NGS) technology has emerged as a powerful tool for characterizing genomic profiles. Among its applications, RNA sequencing (RNA-Seq) and methylation sequencing (Methyl-Seq) have become standard tools for transcriptomic and epigenetic monitoring, respectively. Although the cost of NGS experiments has steadily decreased, high sequencing cost and bioinformatic complexity remain obstacles for many biomedical projects. Unlike data from earlier microarray technologies, NGS data are discrete counts and must be modeled as such. In addition to sample size, sequencing depth also directly affects experimental cost. Consequently, given a total budget and a pre-specified unit experimental cost, study design in RNA-Seq/Methyl-Seq is a multi-dimensional constrained optimization problem rather than the one-dimensional sample size calculation of a traditional hypothesis testing setting. In the first part of this dissertation, we proposed a statistical framework, "RNASeqDesign", that uses pilot data for power calculation and study design of RNA-Seq experiments. The approach was based on fitting a mixture model to the p-value distribution from pilot data and on a parametric bootstrap procedure to infer genome-wide power, from which the optimal sample size and sequencing depth were determined. We further illustrated five practical study design tasks for practitioners. We performed simulations and real data applications to evaluate performance and to compare the framework with existing methods.
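
As a rough illustration of the constrained design problem described above, the following Python sketch enumerates candidate sequencing depths, spends the remaining budget on samples, and keeps the design with the highest power. It is not the RNASeqDesign procedure: the function names, the cost figures, and in particular `toy_power` (a normal-approximation placeholder standing in for a power estimate inferred from pilot data) are assumptions made for illustration only.

```python
import math
from statistics import NormalDist


def toy_power(n_per_group, depth_millions, effect=0.8, alpha=0.05):
    """Hypothetical placeholder power curve (NOT the dissertation's method).

    Assumes a two-group z-test whose noise variance has a fixed biological
    part (1.0) plus a counting part that shrinks with depth (10 / depth).
    """
    sigma = math.sqrt(1.0 + 10.0 / depth_millions)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_effect = effect * math.sqrt(n_per_group / 2.0) / sigma
    # Upper-tail rejection probability only; the lower tail is negligible here.
    return NormalDist().cdf(z_effect - z_crit)


def best_design(budget, cost_per_sample, cost_per_million_reads, depths,
                power_fn=toy_power):
    """Return (n_per_group, depth, power) maximizing power within the budget.

    Assumes power does not decrease with sample size, so for each candidate
    depth only the largest affordable sample size needs to be evaluated.
    Returns None if no two-group design with at least 2 samples per group fits.
    """
    best = None
    for depth in depths:
        unit_cost = cost_per_sample + cost_per_million_reads * depth
        n_per_group = int(budget // unit_cost) // 2
        if n_per_group < 2:
            continue
        power = power_fn(n_per_group, depth)
        if best is None or power > best[2]:
            best = (n_per_group, depth, power)
    return best


if __name__ == "__main__":
    # Illustrative costs only: 200 per sample, 10 per million reads.
    print(best_design(20000, 200, 10, depths=[5, 10, 20, 40, 80]))
```

Deeper sequencing lowers per-sample noise but leaves fewer samples affordable, so the search is over both dimensions at once. In practice the placeholder would be replaced by a power estimate derived from pilot data.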

In the second part, we proposed another statistical framework, "MethylSeqDesign", specifically for Methyl-Seq data. There were two main challenges. First, power calculation for Methyl-Seq data required a powerful statistical test based on a beta-binomial model. Second, the human genome contains an extremely large number of CpG sites (about 30 million), so many CpG sites have very shallow coverage. Hence, we focused on a region- or capture-based method, which yields more counts per region or window and thereby makes power calculation feasible.
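
The role of the beta-binomial model can be sketched in Python as follows. This is a minimal sketch, not the MethylSeqDesign test: the mean/dispersion parameterization (`mu`, `phi`) and the function names are assumptions chosen for illustration, since the abstract does not give the actual parameterization. Here `n` is the read coverage of a region and `k` the number of methylated reads.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) under a beta-binomial model.

    Assumed parameterization: mu is the mean methylation level (0 < mu < 1)
    and phi is the dispersion (0 < phi < 1), so that the underlying Beta
    distribution has a = mu * (1 - phi) / phi and b = (1 - mu) * (1 - phi) / phi.
    """
    if k < 0 or k > n:
        return 0.0
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def beta_binomial_moments(n, mu, phi):
    """Mean and variance of the methylated read count, summed from the pmf."""
    pmf = [beta_binomial_pmf(k, n, mu, phi) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


if __name__ == "__main__":
    n, mu, phi = 20, 0.3, 0.1
    mean, var = beta_binomial_moments(n, mu, phi)
    binomial_var = n * mu * (1.0 - mu)
    print(f"mean={mean:.3f} var={var:.3f} binomial var={binomial_var:.3f}")
```

Under this parameterization the variance is n·mu·(1 − mu)·(1 + (n − 1)·phi), which exceeds the binomial variance whenever n > 1. That extra term is the overdispersion a plain binomial model would miss.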

Public health significance: As sequencing costs keep dropping, RNA-Seq and Methyl-Seq experiments will become more prevalent, and more projects with large sample sizes can be expected. We believe our work will provide practical guidance for the design of future studies that aim to understand disease mechanisms and improve disease diagnosis and treatment.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creator: Lin, Chien-Wei (Email: chl169@pitt.edu; Pitt Username: chl169)
ETD Committee:
Committee Chair: Tseng,
Committee Member: Park,
Committee Member: Weeks, Daniel (Email: weeks@pitt.edu; ORCID: 0000-0001-9410-7228)
Committee Member: Krafty,
Date: 29 June 2017
Date Type: Publication
Defense Date: 14 April 2017
Approval Date: 29 June 2017
Submission Date: 4 March 2017
Access Restriction: 5 year -- Restrict access to University of Pittsburgh for a period of 5 years.
Number of Pages: 91
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Power calculation, Sample size, RNA-Seq data, Methyl-Seq data, Next Generation Sequencing (NGS), p-value mixture model
Date Deposited: 29 Jun 2017 23:44
Last Modified: 30 Jun 2022 15:22

