Stokes, Matthew
(2014)
Novel Extensions of Label Propagation for Biomarker Discovery in Genomic Data.
Doctoral Dissertation, University of Pittsburgh.
(Unpublished)
Abstract
One primary goal of analyzing genomic data is the identification of biomarkers which may be causative of, correlated with, or otherwise biologically relevant to disease phenotypes. In this work, I implement and extend a multivariate feature ranking algorithm called label propagation (LP) for biomarker discovery in genome-wide single-nucleotide polymorphism (SNP) data. This graph-based algorithm utilizes an iterative propagation method to efficiently compute the strength of association between a SNP and a phenotype.
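To make the idea of iterative propagation concrete, here is a minimal sketch of generic, clamped label propagation on a small weighted graph. It is not the dissertation's SNP-specific formulation: the abstract does not give the graph construction or the scoring rule, so the function name `propagate_labels`, the toy chain graph, and the +1/-1 label encoding are all illustrative assumptions.

```python
def propagate_labels(weights, labels, max_iter=1000, tol=1e-9):
    """Generic clamped label propagation (illustrative, not the thesis method).

    weights: square list of lists; weights[i][j] >= 0 is the edge weight.
    labels:  dict mapping node index -> known label score (e.g. +1.0 / -1.0).
    Returns one score per node.
    """
    n = len(weights)
    # Labeled nodes start at their label; unlabeled nodes start at 0.
    scores = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            if i in labels:
                # Clamp: labeled nodes keep their known value every round.
                new_scores.append(labels[i])
                continue
            degree = sum(weights[i])
            if degree == 0:
                # Isolated node: nothing can propagate to it.
                new_scores.append(scores[i])
                continue
            # Unlabeled node takes the weighted average of its neighbors.
            weighted_sum = sum(w * s for w, s in zip(weights[i], scores))
            new_scores.append(weighted_sum / degree)
        change = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Hypothetical toy graph: a chain 0 - 1 - 2 - 3 with the two ends labeled.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate_labels(chain, {0: 1.0, 3: -1.0})
```

In this toy example the unlabeled nodes settle at values between the two labeled ends, with the node nearer the +1 end scoring higher. In a feature-ranking setting, scores of this kind would be turned into a per-feature association measure, a step the abstract does not detail.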
I developed three extensions to the LP algorithm, with the goal of tailoring it to genomic data. The first extension is a modification to the LP score which yields a variable-level score for each SNP, rather than a score for each SNP genotype. The second extension incorporates prior biological knowledge that is encoded as a prior value for each SNP. The third extension enables the combination of rankings produced by LP and another feature ranking algorithm.
The LP algorithm, its extensions, and two control algorithms (chi-squared and sparse logistic regression) were applied to 11 genomic datasets, including a synthetic dataset, a semi-synthetic dataset, and nine genome-wide association study (GWAS) datasets covering eight diseases. The quality of each feature ranking algorithm was evaluated by using a subset of top-ranked SNPs to construct a classifier, whose predictive power was measured as the area under the receiver operating characteristic (ROC) curve. Top-ranked SNPs were also checked against the literature for prior evidence of association with disease.
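The evaluation metric can be illustrated with a short sketch. The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores below are assumptions for illustration. The abstract does not specify the classifier or the evaluation code used in the dissertation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = case, 0 = control).
    scores: iterable of classifier scores, higher means more likely a case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores from a classifier built on top-ranked features.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, y_score)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration. For larger datasets a rank-based computation or a library routine would be the usual choice.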
The LP algorithm was found to be effective at identifying predictive and biologically meaningful SNPs. The single-score extension performed significantly better than the original algorithm on the GWAS datasets. The prior knowledge extension did not improve on the feature ranking results, and in some cases it reduced the predictive power of top-ranked variants. The ranking combination method was effective for some pairs of algorithms, but not for others. Overall, this work’s main results are the formulation and evaluation of several algorithmic extensions of LP for use in the analysis of genomic data, as well as the identification of several disease-associated SNPs.
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Stokes, Matthew
Date: 25 September 2014
Date Type: Publication
Defense Date: 17 July 2014
Approval Date: 25 September 2014
Submission Date: 14 August 2014
Access Restriction: 1 year -- Restrict access to University of Pittsburgh for a period of 1 year.
Number of Pages: 134
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Intelligent Systems
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: feature selection, dimensionality reduction, bioinformatics, label propagation, SNP, genomics, biomarker discovery
Date Deposited: 25 Sep 2014 14:52
Last Modified: 15 Nov 2016 14:23
URI: http://d-scholarship.pitt.edu/id/eprint/22722