
rfTSP: A Non-parametric predictive model with order-based feature selection for transcriptomic data

Cahill, Kelly (2020) rfTSP: A Non-parametric predictive model with order-based feature selection for transcriptomic data. Master's Thesis, University of Pittsburgh. (Unpublished)

Submitted Version



Gene expression data have strong potential to predict biological classifications. For example, tumor subtype can be determined using machine learning models and gene expression profiles. We propose the use of Top Scoring Pairs (TSPs) in combination with machine learning to improve inter-study prediction from genomic profiles. Inter-study prediction refers to prediction across two studies that are completely independent, differing in platform or tissue. TSPs rank pairs of genes according to how well the relative ordering of their expression levels distinguishes different groups of subjects. For example, gene A is expressed at a lower level than gene B in cases, while gene A is expressed at a higher level than gene B in controls. The two genes in a pair thus show an inverse relationship with respect to one another. Using TSPs not only acts as a feature selection step but also provides a nonparametric method that transforms the continuous expression data to binary (0/1) values based on the rank order within each pair. Because the transformed data are robust, our results show that TSP-derived binary data are much more effective in prediction than continuous data, particularly in cross-study prediction. Furthermore, we extend the use of TSPs beyond binary and multi-class label prediction to the prediction of continuous outcomes. The objective of this paper is to demonstrate that using dichotomized data from TSPs as the feature space for machine learning methods, particularly random forest, yields higher prediction accuracy across independent studies than traditional machine learning techniques applied to log2-transformed and quantile-normalized data. This work has significant public health impact, as accurate genomic prediction is crucial for early detection of many serious illnesses such as cancer.
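The pair-ranking and 0/1 transformation described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the commonly used TSP score, |P(gene i < gene j | cases) - P(gene i < gene j | controls)|, and the function names (`tsp_scores`, `top_pairs`, `binarize`) and the toy data are hypothetical.

```python
from itertools import combinations


def tsp_scores(X, y):
    """Score each gene pair (i, j) by how differently the ordering
    X_i < X_j occurs in cases (y == 1) versus controls (y == 0).

    Assumed score: |P(X_i < X_j | case) - P(X_i < X_j | control)|.
    X is a list of samples (one list of expression values per sample);
    both classes must be present in y.
    """
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    scores = {}
    for i, j in combinations(range(len(X[0])), 2):
        p_case = sum(row[i] < row[j] for row in cases) / len(cases)
        p_ctrl = sum(row[i] < row[j] for row in controls) / len(controls)
        scores[(i, j)] = abs(p_case - p_ctrl)
    return scores


def top_pairs(scores, k):
    """Return the k highest-scoring pairs (the feature selection step)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def binarize(X, pairs):
    """Turn continuous expression into 0/1 features: 1 if X_i < X_j."""
    return [[int(row[i] < row[j]) for i, j in pairs] for row in X]


# Toy data (made up): 4 samples x 3 genes; the first two samples are cases.
X = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 9.0],
    [7.0, 3.0, 4.0],
    [8.0, 2.0, 9.5],
]
y = [1, 1, 0, 0]

scores = tsp_scores(X, y)
pairs = top_pairs(scores, k=1)
features = binarize(X, pairs)
```

Because each feature depends only on the ordering within a pair, any strictly increasing per-sample transformation of the expression values leaves the binary features unchanged, which is one way to see why they may carry across platforms. The resulting 0/1 matrix could then be passed to a classifier such as a random forest.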




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators:
Creators | Email | Pitt Username | ORCID
Cahill, Kelly | kmc152@pitt.edu | kmc152 |
ETD Committee:
Title | Member | Email Address | Pitt Username | ORCID
Thesis Advisor | Tseng, George | ctseng@pitt.edu | ctseng |
Committee Member | Carlson, Jenna | jnc35@pitt.edu | jnc35 |
Committee Member | Silvia,
Date: 29 January 2020
Date Type: Publication
Defense Date: 27 September 2019
Approval Date: 29 January 2020
Submission Date: 22 November 2019
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 28
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: Genomics, public health, classification
Date Deposited: 29 Jan 2020 19:16
Last Modified: 29 Jan 2020 19:16

