
Differentially expressed gene detection with covariate selection under small sample size genomic setting

Guo, Kexin (2018) Differentially expressed gene detection with covariate selection under small sample size genomic setting. Master's Thesis, University of Pittsburgh. (Unpublished)

Submitted Version



In the genomic setting, most data sets have a relatively small sample size (n) compared with the large number of covariates (p). For this type of data structure, it is not appropriate to fit simple linear regression models, since the estimates would have large variance and the model could over-fit. Methods for restricting the number of variables in the model are necessary.
In this study, constrained best subset (CBS) and LASSO methods were used to select covariates and detect differentially expressed (DE) genes. For comparison purposes, we evaluated each method under two simulation settings. Under the univariate setting, all methods controlled the type I error well, and the CBS methods were more powerful than LASSO. However, LASSO gave better prediction results than the CBS methods, even though it selected more false positive covariates. Under the genome-wide simulation setting, the FDR was well controlled only for the larger sample sizes (n = 50, 100). The other results followed a trend similar to that in the univariate setting.
Beyond simulations, eight transcriptomic studies of post-mortem brain tissue from major depressive disorder (MDD) patients were used as a real data application to further compare the CBS2 method and LASSO. In the meta-analysis combining all eight studies, the CBS2 method detected more DE genes and more significant pathways than LASSO. Our evaluations suggest that no method performs universally best in the small-n-large-p scenario, and that the choice of method depends on sample size, dimensionality, and the desired biological purpose. From a public health perspective, using the CBS2 method in the small sample size genomic setting could help detect more DE genes as well as more meaningful pathways.
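
As an illustration of the kind of covariate selection the abstract refers to, the following is a minimal sketch of LASSO fitted by cyclic coordinate descent on simulated small-n-large-p data. It is not the thesis's implementation: the function names, the simulated data, the penalty value, and the number of sweeps are all assumptions made for this example, and the CBS methods are not shown.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + penalty * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def simulate(n, p, true_coefs, noise_sd, seed):
    """Hypothetical data: Gaussian covariates, only the first few affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(c * row[j] for j, c in enumerate(true_coefs)) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y


# Assumed setting: n = 20 samples, p = 50 covariates, 3 of them truly associated.
X, y = simulate(n=20, p=50, true_coefs=[2.0, -3.0, 1.5], noise_sd=0.5, seed=0)
beta = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected covariate indices:", selected)
```

Larger penalty values shrink more coefficients exactly to zero, which is how LASSO restricts the number of variables in the model; in practice the penalty is usually chosen by cross-validation rather than fixed in advance as it is here.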




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Guo, Kexin (Email: keg105@pitt.edu; Pitt Username: keg105)
ETD Committee:
Committee Chair: Tseng,
Committee Member: Robert,
Committee Member: Park, Hyun
Date: 28 June 2018
Date Type: Publication
Defense Date: 27 March 2018
Approval Date: 28 June 2018
Submission Date: 2 April 2018
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 40
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: High-dimensional, CBS, LASSO, Variable selection, DE gene
Date Deposited: 28 Jun 2018 20:07
Last Modified: 02 Jul 2018 21:29

