
Classification and clustering for RNA-seq data with variable selection

Rahman, Md Tanbin (2019) Classification and clustering for RNA-seq data with variable selection. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

PDF
Submitted Version
Restricted to University of Pittsburgh users only until August 2024.



Clustering and classification play an important role in identifying subtypes of complex diseases and in building predictive models in medicine. In recent years, falling costs and high accuracy have made RNA-seq widely popular, and its use is expected to keep growing over the next few years. An important feature of RNA-seq data is its count structure. Although there is a large literature on both clustering and classification methods, most of these methods are either heuristic or designed for continuous data and do not directly generalize to count data.

In Chapter 2, we propose a classifier for RNA-seq count data with variable selection and covariate adjustment. In this chapter, we develop a negative binomial model within the generalized linear model framework, with double regularization for gene and covariate sparsity, to accommodate three key elements: adequate modeling of overdispersed count data, gene selection, and adjustment for covariate effects. The proposed sparse negative binomial classifier (snbClass) is evaluated in simulations and in two real applications, using cervical tumor miRNA-seq data and schizophrenia post-mortem brain tissue RNA-seq data, to demonstrate its superior performance in prediction accuracy and feature selection.
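The ingredients named above (a negative binomial likelihood with a log link, an L1-type penalty for gene selection, and likelihood-based class assignment) can be illustrated with a minimal sketch. This is not the snbClass implementation: the function names, the mean-dispersion parameterization (variance = mu + phi * mu^2), the `size_factor` normalization term, and the use of soft-thresholding as the sparsity operator are all assumptions made for illustration, and covariate adjustment is omitted for brevity.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with mean mu
    and dispersion phi (assumed parameterization: variance = mu + phi * mu**2)."""
    r = 1.0 / phi  # "size" parameter
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def soft_threshold(z, lam):
    """L1 proximal operator: shrinks z toward zero and sets small values
    to exactly zero, which is how an L1 penalty drops uninformative genes."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def classify(counts, size_factor, base_log_mean, class_effects, phi):
    """Return the class with the highest NB log-likelihood for one sample.

    class_effects[k][g] is the (hypothetical) log-scale shift of gene g in
    class k; a gene whose shifts are all zero does not affect the decision.
    """
    scores = {}
    for k, effects in class_effects.items():
        total = 0.0
        for g, y in enumerate(counts):
            # Log link: log(mu) = log(size_factor) + baseline + class effect
            mu = size_factor * math.exp(base_log_mean[g] + effects[g])
            total += nb_logpmf(y, mu, phi[g])
        scores[k] = total
    return max(scores, key=scores.get)
```

In this sketch, a penalized fit would apply `soft_threshold` to each estimated class effect during optimization, so that genes with weak class differences end up with exact zeros and drop out of `classify`.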

In Chapter 3, we discuss a model-based clustering method that exploits the count structure of the data. In this chapter, we develop a negative binomial mixture model with gene regularization to cluster samples (small n) with high-dimensional gene features (large p). The method is compared with the sparse Gaussian mixture model and sparse K-means using extensive simulations and two real transcriptomic applications in breast cancer and rat brain studies. The results show superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation by pathway enrichment analysis.
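As a rough illustration of how a negative binomial mixture assigns samples to clusters, the sketch below computes the posterior cluster probabilities for one sample, which is the E-step of a standard EM algorithm for mixtures. It is not the dissertation's method: the function names, the mean-dispersion parameterization (variance = mu + phi * mu^2), and the inputs are assumptions, and the gene regularization step (which would shrink cluster-specific means so that uninformative genes share a common mean) is not shown.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with mean mu
    and dispersion phi (assumed parameterization: variance = mu + phi * mu**2)."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def responsibilities(counts, cluster_means, weights, phi):
    """E-step for one sample: posterior probability of each cluster.

    cluster_means[k][g] is the mean of gene g in cluster k, weights[k] is
    the mixing proportion of cluster k, and phi[g] is the gene dispersion.
    """
    log_post = []
    for k, means in enumerate(cluster_means):
        ll = math.log(weights[k])
        for g, y in enumerate(counts):
            ll += nb_logpmf(y, means[g], phi[g])
        log_post.append(ll)
    # Log-sum-exp normalization keeps the computation stable when p is large
    m = max(log_post)
    unnorm = [math.exp(v - m) for v in log_post]
    s = sum(unnorm)
    return [u / s for u in unnorm]
```

A gene whose mean is identical across clusters contributes the same term to every entry of `log_post`, so it cancels in the normalization and has no influence on the clustering.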

Contribution to public health:
Transcriptomic data play an important role in identifying genes that are differentially expressed under various external conditions and diseases. RNA-seq is now the most popular method for measuring expression levels in transcriptomic studies. The methods proposed in this thesis are tailored to classification and clustering of RNA-seq count data.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Rahman, Md Tanbin (Email: mdr56@pitt.edu; Pitt Username: mdr56)
ETD Committee:
Committee Chair: Tseng,
Committee Member: Wahed,
Committee Member: Ding,
Committee Member: Park, Hyun
Date: 26 September 2019
Date Type: Publication
Defense Date: 7 June 2019
Approval Date: 26 September 2019
Submission Date: 8 July 2019
Access Restriction: 5 year -- Restrict access to University of Pittsburgh for a period of 5 years.
Number of Pages: 73
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Clustering, Classification, RNA-seq data, Count data, Machine learning
Date Deposited: 26 Sep 2019 16:44
Last Modified: 27 Sep 2019 18:12

