rfTSP: A Non-parametric predictive model with order-based feature selection for transcriptomic data

Cahill, Kelly (2020) rfTSP: A Non-parametric predictive model with order-based feature selection for transcriptomic data. Master's Thesis, University of Pittsburgh. (Unpublished)

Abstract

Gene expression data have strong potential to predict biological classifications. For example, tumor subtype can be determined using machine learning models and gene expression profiles. We propose the use of Top Scoring Pairs in combination with machine learning to improve inter-study prediction from genomic profiles. Inter-study prediction refers to prediction across two studies that are completely independent, either in terms of platform or tissue. Top Scoring Pairs (TSPs) rank pairs of genes according to how well the relative ordering of their expression levels distinguishes different groups of subjects. For example, gene A will be expressed at a lower level than gene B in cases, while gene A will be expressed at a higher level than gene B in controls. The ordering within the pair is therefore reversed between the two groups. Using TSPs not only acts as a feature selection step but also provides a non-parametric method that transforms the continuous expression data to binary (0/1) values based on the rank order within each pair. Due to the robust nature of the transformed data, our results demonstrate that TSP binary data are much more effective for prediction than continuous data, particularly in cross-study prediction. Furthermore, we extend the use of TSPs beyond binary and multi-class label prediction to the prediction of continuous outcomes. The objective of this paper is to demonstrate that using dichotomized data from TSPs as the feature space for machine learning methods, particularly random forest, yields higher prediction accuracy across independent studies than traditional machine learning techniques applied to log2-transformed and quantile-normalized data. This work has significant public health impact, as accurate genomic prediction is crucial for early detection of many serious illnesses such as cancer.
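
The pair-based transformation described in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function names `tsp_scores` and `tsp_binarize`, the toy data, and the scoring rule (the absolute difference between groups in the proportion of samples where gene i is expressed below gene j, which is the usual TSP score for a binary label) are assumptions made for illustration only.

```python
from itertools import combinations

import numpy as np


def tsp_scores(X, y):
    """Score every gene pair (i, j) for a binary label (hypothetical helper).

    X is a samples-by-genes expression matrix and y holds 0/1 labels.
    The score is |P(X_i < X_j | y = 1) - P(X_i < X_j | y = 0)|.
    Returns (score, i, j) tuples sorted from highest to lowest score.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    cases, controls = X[y == 1], X[y == 0]
    scored = []
    for i, j in combinations(range(X.shape[1]), 2):
        p_case = np.mean(cases[:, i] < cases[:, j])
        p_ctrl = np.mean(controls[:, i] < controls[:, j])
        scored.append((abs(p_case - p_ctrl), i, j))
    scored.sort(key=lambda t: -t[0])
    return scored


def tsp_binarize(X, pairs):
    """Turn continuous expression into 0/1 features, one column per pair.

    A feature is 1 when gene i is expressed below gene j in that sample.
    """
    X = np.asarray(X, dtype=float)
    return np.column_stack([(X[:, i] < X[:, j]).astype(int) for i, j in pairs])


# Toy data (made up): 4 samples, 3 genes; the first two samples are cases.
X = np.array([[1, 5, 0],
              [2, 6, 7],
              [5, 1, 0],
              [6, 2, 7]], dtype=float)
y = np.array([1, 1, 0, 0])

scored = tsp_scores(X, y)
top_pairs = [(i, j) for _, i, j in scored[:1]]  # keep only the best pair
B = tsp_binarize(X, top_pairs)

# The binary features depend only on within-sample ordering, so a monotone
# rescaling of the data (a stand-in for a platform effect) leaves them unchanged.
B_rescaled = tsp_binarize(X * 10 + 100, top_pairs)
```

In this sketch the resulting 0/1 matrix would serve as the feature space for a downstream classifier such as a random forest. Because each feature depends only on the ordering of two genes within a sample, it is unaffected by any monotone transformation applied to that sample, which is one way to read the abstract's point about robustness across studies.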

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Cahill, Kelly	kmc152@pitt.edu	kmc152

ETD Committee:

Title	Member	Email Address	Pitt Username
Thesis Advisor	Tseng, George	ctseng@pitt.edu	ctseng
Committee Member	Carlson, Jenna	jnc35@pitt.edu	jnc35
Committee Member	Liu, Silvia	silvia.shuchang.liu@gmail.com

Date:

29 January 2020

Date Type:

Publication

Defense Date:

27 September 2019

Approval Date:

29 January 2020

Submission Date:

22 November 2019

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

Institution:

University of Pittsburgh

Schools and Programs:

School of Public Health > Biostatistics

Degree:

MS - Master of Science

Thesis Type:

Master's Thesis

Refereed:

Yes

Uncontrolled Keywords:

Genomics, public health, classification

Date Deposited:

29 Jan 2020 19:16

Last Modified:

29 Jan 2020 19:16

URI:

http://d-scholarship.pitt.edu/id/eprint/37878
