Geng, Ming
(2006)
A COMPARISON OF LOGISTIC REGRESSION TO RANDOM FORESTS FOR EXPLORING DIFFERENCES IN RISK FACTORS ASSOCIATED WITH STAGE AT DIAGNOSIS BETWEEN BLACK AND WHITE COLON CANCER PATIENTS.
Master's Thesis, University of Pittsburgh.
(Unpublished)
Abstract
Introduction: Colon cancer is one of the most common malignancies in America. According to the American Cancer Society, blacks have a lower survival rate than whites. Many previous studies have suggested that this is because blacks were more likely to be diagnosed at a late stage. Hence, it is crucial to determine the factors associated with colon cancer stage at diagnosis.
Objectives: The objectives of this study are twofold: 1) to compare logistic regression modeling to Random Forests classification with respect to variables selected and classification accuracy; and 2) to evaluate the factors related to colon cancer stage at diagnosis in a population-based study. Many studies have compared Classification and Regression Trees (CART) to logistic regression and found that they have very similar power with respect to the proportion correctly classified and the variables selected. This study extends previous methodological research by comparing the Random Forests classification technique to logistic regression modeling using a relatively small and incomplete dataset.
Methods and Materials: The data used in this research were from the National Cancer Institute Black/White Cancer Survival Study, which included 960 cases of invasive colon cancer. Stage at diagnosis was used as the dependent variable for fitting logistic regression models and Random Forests classification to multiple potential explanatory variables, some of which had missing data.
Results: The odds ratio (blacks vs. whites) decreased from 1.628 (95% CI: 1.068-2.481) to 1.515 (95% CI: 0.920-2.493) after adjustment for patient delay in diagnosis, occupation, histology, and grade of tumor. Race was no longer important after these variables were entered in the Random Forests. These four variables were identified as the most important variables associated with racial disparity in colon cancer stage at diagnosis in both logistic regression and Random Forests. The correct classification rate was 47.9% using logistic regression and 33.9% using Random Forests.
Conclusion: 1) Logistic regression and Random Forests had very similar power in variable selection. 2) Logistic regression had higher classification accuracy than Random Forests with respect to the overall correct classification rate.
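The odds ratios reported in the abstract can be sanity-checked with a short calculation. The sketch below assumes the intervals are Wald-type 95% intervals computed on the log-odds scale, which the abstract does not state; the function names `wald_summary` and `excludes_one` are hypothetical helpers introduced only for illustration. Under that assumption, the point estimate is the geometric midpoint of the interval, and an interval that contains 1 indicates that the race effect is no longer statistically significant at the 5% level.

```python
import math

# Standard normal quantile for a two-sided 95% interval.
Z_95 = 1.959964


def wald_summary(lower, upper):
    """Recover the odds ratio and log-scale standard error implied by a CI.

    Assumes a Wald interval: exp(log_or +/- Z_95 * se). This is an
    assumption; the abstract does not say how the intervals were built.
    """
    log_lo, log_hi = math.log(lower), math.log(upper)
    log_or = (log_lo + log_hi) / 2          # midpoint on the log scale
    se = (log_hi - log_lo) / (2 * Z_95)     # half-width divided by the quantile
    return math.exp(log_or), se


def excludes_one(lower, upper):
    """True when the interval does not contain the null odds ratio of 1."""
    return lower > 1 or upper < 1


# Interval bounds as reported in the abstract (blacks vs. whites).
unadjusted = wald_summary(1.068, 2.481)
adjusted = wald_summary(0.920, 2.493)

print("unadjusted OR, SE:", unadjusted, "excludes 1:", excludes_one(1.068, 2.481))
print("adjusted OR, SE:  ", adjusted, "excludes 1:", excludes_one(0.920, 2.493))
```

The recovered midpoints agree with the reported odds ratios of 1.628 and 1.515 to within rounding of the published bounds, which is consistent with the Wald assumption. Only the unadjusted interval excludes 1.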
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Geng, Ming
Date: 1 June 2006
Date Type: Completion
Defense Date: 21 January 2006
Approval Date: 1 June 2006
Submission Date: 12 April 2006
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: colon cancer; polytomous logistic regression; proportional odds model; random forests; stage at diagnosis
Other ID: http://etd.library.pitt.edu/ETD/available/etd-04122006-102254/, etd-04122006-102254
Date Deposited: 10 Nov 2011 19:36
Last Modified: 15 Nov 2016 13:39
URI: http://d-scholarship.pitt.edu/id/eprint/7034