
Prediction of Preterm Birth in Southwestern PA using Classification Models: A Comparative Analysis

Pudasainy, Sabnum (2022) Prediction of Preterm Birth in Southwestern PA using Classification Models: A Comparative Analysis. Master's Thesis, University of Pittsburgh. (Unpublished)



Background: Preterm birth is a global health burden and a leading cause of neonatal mortality and morbidity. This study compares binary classification models for predicting preterm birth and uses them to identify associated clinical, demographic, and environmental risk factors.

Methods: Data from 221,060 infants born between 2010 and 2020 to mothers who resided in eight southwestern Pennsylvania counties (Allegheny, Armstrong, Beaver, Butler, Fayette, Greene, Washington, Westmoreland) were used. Covariates were the mother’s and the neonate’s clinical and demographic features and the mother’s mean exposure, during gestation and in her geocoded area of residence, to five air pollutants: carbon monoxide (CO), nitrogen dioxide (NO2), particulate matter (PM2.5), ozone (O3), and sulfur dioxide (SO2). Exploratory data analysis, including an Empirical Bayes approach, was conducted to better understand the covariates and the outcome, preterm birth. Three supervised machine learning techniques, Elastic Net (GLMNET), Support Vector Machine (SVM), and Random Forest, were then used to build prediction models, which were compared on performance metrics such as area under the curve (AUC), sensitivity, and specificity.
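The three performance metrics named above can be illustrated with a minimal sketch. It is not the thesis code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. AUC is computed here with the rank-based (Mann-Whitney) definition, the probability that a randomly chosen preterm case receives a higher score than a randomly chosen term case.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at a given decision threshold.

    The 0.5 default is an illustrative assumption, not a value from the thesis.
    """
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair. This O(n^2) loop is fine for a
    small example but too slow for hundreds of thousands of records.
    """
    pos = [s for label, s in zip(y_true, y_score) if label == 1]
    neg = [s for label, s in zip(y_true, y_score) if label == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up toy data: 1 = preterm, 0 = term; scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

sens, spec = sensitivity_specificity(y_true, y_score)
model_auc = auc(y_true, y_score)
print(f"AUC={model_auc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason it is often used as the headline metric when comparing classifiers.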

Results: The Empirical Bayes analysis associated mothers with fewer prenatal visits (0-10) and mothers who resided in Allegheny County with a higher posterior average event probability. Among the three algorithms used to predict preterm birth, Random Forest performed best, with an AUC of 0.83 compared with 0.77 for both GLMNET and SVM. The most important predictors common to GLMNET and SVM were the total number of prenatal visits and the mother’s race and education. Random Forest additionally ranked mean pollutant exposures among its top features, along with the number of prenatal visits and Allegheny as the mother’s county of residence. The results from the Empirical Bayes exploration and the classification models were fairly consistent.
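The abstract does not state which Empirical Bayes formulation was used. As a hedged illustration of what a "posterior average for event probability" can mean, the sketch below assumes a beta-binomial model: a Beta prior is fitted to the raw group rates by the method of moments, and each group's posterior mean is then a compromise between its own rate and the overall mean. The group names, counts, and function name are made up for the example and are not data from the thesis.

```python
from statistics import mean, variance
from typing import Dict, Tuple


def eb_posterior_means(
    groups: Dict[str, Tuple[int, int]]
) -> Tuple[Dict[str, float], float]:
    """Beta-binomial Empirical Bayes shrinkage of group event rates.

    `groups` maps a group name to (events, births). Returns the posterior
    mean rate per group and the fitted prior mean. The prior is fitted with
    a simple method-of-moments estimate that ignores sampling noise in the
    raw rates, which is a simplification chosen for brevity.
    """
    raw = {name: events / n for name, (events, n) in groups.items()}
    prior_mean = mean(raw.values())
    prior_var = variance(raw.values())  # sample variance of the raw rates

    # For Beta(a, b): mean = a / (a + b), var = mean * (1 - mean) / (a + b + 1)
    strength = prior_mean * (1 - prior_mean) / prior_var - 1
    if strength <= 0:
        raise ValueError("rates too dispersed for a method-of-moments Beta fit")
    a = prior_mean * strength
    b = (1 - prior_mean) * strength

    posterior = {
        name: (a + events) / (a + b + n) for name, (events, n) in groups.items()
    }
    return posterior, prior_mean


# Made-up counts (events, births) for four hypothetical groups.
toy_groups = {
    "group_A": (30, 200),
    "group_B": (5, 20),
    "group_C": (90, 1000),
    "group_D": (12, 100),
}

posterior, prior_mean = eb_posterior_means(toy_groups)
for name, (events, n) in toy_groups.items():
    print(f"{name}: raw={events / n:.3f} posterior={posterior[name]:.3f}")
```

Under this assumed model, small groups are pulled strongly toward the overall mean while large groups stay close to their observed rate, so a group that keeps a high posterior average is unlikely to owe it to a small sample alone.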

Public Health Significance: Optimal prediction of preterm birth facilitates early identification and treatment of at-risk mothers, and enables targeted interventions to minimize infant mortality and morbidity, which would significantly benefit the community, nation, and the healthcare system as a whole. The environmental factors identified here should be explored further.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Pudasainy, Sabnum (Email: sap196@pitt.edu; Pitt Username: sap196)
ETD Committee:
Committee Chair: Buchanich, Jeanine M. (Email: jeanine@pitt.edu; Pitt Username: jeanine)
Committee Member: Carlson, Jenna C. (Email: jnc35@pitt.edu; Pitt Username: jnc35)
Committee Member: Talbott, Evelyn O. (Email: eot1@pitt.edu; Pitt Username: eot1)
Committee Member: Youk, Ada O. (Email: ayouk@pitt.edu; Pitt Username: ayouk)
Date: 12 May 2022
Date Type: Publication
Defense Date: 25 April 2022
Approval Date: 12 May 2022
Submission Date: 28 April 2022
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 93
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: preterm birth, empirical bayes, classification, prediction
Date Deposited: 12 May 2022 14:10
Last Modified: 12 May 2022 14:10

