Classification and clustering for RNA-seq data with variable selection

Rahman, Md Tanbin (2019) Classification and clustering for RNA-seq data with variable selection. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Abstract

Clustering and classification play an important role in identifying subtypes of complex diseases and in building predictive models in medicine. In recent years, falling costs and high accuracy have made RNA-seq widely popular, and its use is expected to keep growing over the next few years. An important feature of RNA-seq data is its count structure. Although there is a large literature on both clustering and classification methods, most existing methods are either heuristic or designed for continuous data, and they do not generalize directly to count data.

In Chapter 2, we propose a classifier for the count structure of RNA-seq data with variable selection and covariate adjustment. We develop a negative binomial model within the generalized linear model framework, with double regularization for gene and covariate sparsity, to accommodate three key elements: adequate modeling of overdispersed count data, gene selection, and adjustment for covariate effects. The proposed sparse negative binomial classifier (snbClass) is evaluated in simulations and in two real applications, using cervical tumor miRNA-seq data and schizophrenia post-mortem brain tissue RNA-seq data, to demonstrate its superior performance in prediction accuracy and feature selection.
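As a rough illustration of the idea behind a sparse negative binomial classifier, the sketch below scores a sample under class-specific negative binomial means and assigns it to the class with the highest log-likelihood. It is not the snbClass implementation: the function names, the soft-thresholding penalty, the `size_factor` argument, and the omission of covariate adjustment and class priors are all assumptions made to keep the example small.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with
    mean mu and dispersion phi (variance = mu + phi * mu**2)."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def classify(counts, log_baseline, log_effects, phi, lam, size_factor=1.0):
    """Return (best_class, scores) for one sample.

    Assumed model (hypothetical, for illustration only):
        log mu[k][g] = log(size_factor) + log_baseline[g]
                       + soft_threshold(log_effects[k][g], lam)
    A gene whose shrunken effect is zero in every class contributes the same
    term to every class score, so it plays no role in the decision.
    """
    scores = []
    for effects_k in log_effects:
        total = 0.0
        for g, y in enumerate(counts):
            shrunk = soft_threshold(effects_k[g], lam)
            mu = size_factor * math.exp(log_baseline[g] + shrunk)
            total += nb_logpmf(y, mu, phi[g])
        scores.append(total)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy example: gene 0 separates the two classes, gene 1 has a weak effect
# that the penalty removes.
log_baseline = [math.log(10.0), math.log(10.0)]
log_effects = [[math.log(2.0), 0.05], [-math.log(2.0), -0.05]]
phi = [0.2, 0.2]
label_high, _ = classify([25, 10], log_baseline, log_effects, phi, lam=0.1)
label_low, _ = classify([3, 10], log_baseline, log_effects, phi, lam=0.1)
```

In this toy setup the penalty sets the weak effect of the second gene to zero, so only the first gene drives the class assignment, which is the sense in which regularization performs gene selection.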

In Chapter 3, we discuss a model-based clustering method that exploits the count structure of the data. We develop a negative binomial mixture model with gene regularization to cluster samples (small n) with high-dimensional gene features (large p). The method is compared with the sparse Gaussian mixture model and sparse K-means using extensive simulations and two real transcriptomic applications in breast cancer and rat brain studies. The results show superior performance of the proposed count data model in clustering accuracy, feature selection, and biological interpretation by pathway enrichment analysis.
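To make the mixture-model idea concrete, the following sketch runs a plain EM algorithm for a negative binomial mixture on a small count matrix. It is a simplified illustration, not the method developed in the thesis: the dispersion `phi` is assumed known and shared across genes, there are no library-size factors, the gene regularization step is left out, and the function names and initial means are hypothetical.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with
    mean mu and dispersion phi (variance = mu + phi * mu**2)."""
    r = 1.0 / phi
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def e_step(X, weights, means, phi):
    """Posterior cluster probabilities per sample, via log-sum-exp."""
    resp = []
    for x in X:
        logp = [math.log(w) + sum(nb_logpmf(y, m, phi) for y, m in zip(x, mu_k))
                for w, mu_k in zip(weights, means)]
        top = max(logp)
        probs = [math.exp(v - top) for v in logp]
        total = sum(probs)
        resp.append([p / total for p in probs])
    return resp

def m_step(X, resp, floor=1e-8):
    """Update mixing weights and cluster means.

    With the dispersion held fixed, the maximizer of the weighted negative
    binomial log-likelihood in the mean is the weighted sample mean.
    The floor only guards against empty clusters and zero means.
    """
    n, n_clusters, n_genes = len(X), len(resp[0]), len(X[0])
    weights, means = [], []
    for k in range(n_clusters):
        nk = max(sum(r[k] for r in resp), floor)
        weights.append(nk / n)
        means.append([max(sum(r[k] * x[g] for r, x in zip(resp, X)) / nk, floor)
                      for g in range(n_genes)])
    return weights, means

def fit_nb_mixture(X, init_means, phi=0.1, n_iter=50):
    """Run a fixed number of EM iterations and return hard cluster labels."""
    n_clusters = len(init_means)
    weights = [1.0 / n_clusters] * n_clusters
    means = [list(m) for m in init_means]
    for _ in range(n_iter):
        resp = e_step(X, weights, means, phi)
        weights, means = m_step(X, resp)
    resp = e_step(X, weights, means, phi)
    labels = [max(range(n_clusters), key=r.__getitem__) for r in resp]
    return labels, weights, means

# Toy data: six samples, three genes; the third gene carries no cluster signal.
X = [[50, 5, 20], [55, 4, 22], [48, 6, 19],
     [5, 50, 21], [6, 52, 20], [4, 47, 18]]
labels, weights, means = fit_nb_mixture(X, init_means=[[40, 10, 20], [10, 40, 20]])
```

A sparse variant would add a penalty that pulls each gene's cluster-specific means toward a common value, so that uninformative genes such as the third one above stop contributing to the clustering.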

Contribution to public health:
Transcriptomic data play an important role in identifying genes that are differentially expressed under various external conditions and diseases. RNA-seq is now the most popular method for measuring expression levels in transcriptomic studies. The methods proposed in this thesis are tailored to classification and clustering of RNA-seq data with its count structure.

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Rahman, Md Tanbin	mdr56@pitt.edu	mdr56

ETD Committee:

Title	Member	Email Address
Committee Chair	Tseng, George	ctseng@pitt.edu
Committee Member	Wahed, Abdus	WahedA@edc.pitt.edu
Committee Member	Ding, Ying	yingding@pitt.edu
Committee Member	Park, Hyun Jung	hyp15@pitt.edu

Date:

26 September 2019

Date Type:

Publication

Defense Date:

7 June 2019

Approval Date:

26 September 2019

Submission Date:

8 July 2019

Access Restriction:

5 year -- Restrict access to University of Pittsburgh for a period of 5 years.

Number of Pages:

Institution:

University of Pittsburgh

Schools and Programs:

School of Public Health > Biostatistics

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

Clustering, Classification, RNA-seq data, Count data, Machine learning

Date Deposited:

26 Sep 2019 16:44

Last Modified:

01 Sep 2024 05:15

URI:

http://d-scholarship.pitt.edu/id/eprint/37332
