
Experimental Design for Unbalanced Data Involving a Two level Logistic Model

Chen, Huanyu (2007) Experimental Design for Unbalanced Data Involving a Two level Logistic Model. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



The multilevel logistic model is used to analyze hierarchical data with binary outcomes and to detect variation both between and within clusters. I extended explicit variance formulae for a fixed effect in a two-level model for balanced binary data to account for imbalance both between and within clusters. The derivation of the variance is based on a linearization of the two-level logistic model using first-order marginal quasilikelihood (MQL1) estimation. In a simulation study, I used second-order penalized quasilikelihood (PQL2) estimation to corroborate the accuracy of the analytic variance formula based on the observed racial distribution in a multi-center study of racial disparities. Using the site-specific racial distributions, I simulated the log odds ratio for black race that could be detected with 80% power. These methods are illustrated in the context of a multi-center study of racial disparities in 30-day mortality in the Veterans Affairs (VA) Healthcare System, where the racial distributions are dramatically unbalanced across the 149 sites. I also consider a subset of 42 sites that include a majority of the black hospitalizations. The same analytic variance is obtained when one has equal numbers of observations per site, a constant proportion of black veterans across sites, or both. The observed racial imbalance both within and across sites increases the variance of the race coefficient more in the random coefficient (RC) model than in the random intercept (RI) model. Compared to PQL2, the analytic variances using MQL1 are severely downwardly biased with smaller variance components. The simulation variances are virtually identical to the analytic variances for these data. For a given power, somewhat smaller log odds ratios can be detected in the RI model than in the RC model. The derived formulas provide a basis for planning multi-center studies when a predictor of primary importance is highly imbalanced both between and within sites.
In studies of racial disparities in health care, the site-specific population distributions are often known from administrative data. The public health relevance of this work is that these methods for unbalanced data may facilitate more effective planning of multi-center studies of racial disparities.
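
As a rough illustration of how a variance for the race coefficient translates into a detectable effect size, the sketch below uses the standard two-sided Wald test approximation: the smallest detectable log odds ratio is (z_{1-alpha/2} + z_{power}) times the standard error. This is a generic textbook relation, not the dissertation's analytic variance formula, which is not reproduced in this record. The function names and the standard error of 0.1 are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution used for quantiles and tail probabilities.
_Z = NormalDist()


def detectable_log_odds_ratio(se, alpha=0.05, power=0.80):
    """Smallest |log odds ratio| detectable by a two-sided Wald test.

    `se` is the standard error of the estimated coefficient. Here it is a
    hypothetical input; in a planning exercise it would come from a
    variance formula or a simulation.
    """
    return (_Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)) * se


def wald_power(log_or, se, alpha=0.05):
    """Approximate power of a two-sided Wald test for a given log odds ratio."""
    crit = _Z.inv_cdf(1 - alpha / 2)
    shift = abs(log_or) / se
    # Probability that the test statistic falls in either rejection region.
    return _Z.cdf(shift - crit) + _Z.cdf(-shift - crit)


if __name__ == "__main__":
    se = 0.1  # hypothetical standard error, for illustration only
    delta = detectable_log_odds_ratio(se)
    print(f"detectable log odds ratio: {delta:.4f}")
    print(f"power at that effect size: {wald_power(delta, se):.4f}")
```

Because the detectable effect scales linearly with the standard error, a model with a larger coefficient variance requires a larger log odds ratio to reach the same power, which is consistent with the RI versus RC comparison described in the abstract.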




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators (Name; Email; Pitt Username; ORCID):
Chen, Huanyu; chenhy98@hotmail.com; HUC6
ETD Committee (Title; Member; Email Address; Pitt Username; ORCID):
Committee Chair; Stone, Roslyn A; roslyn@pitt.edu; ROSLYN
Committee Member; Jeong, Jong-Hyeon; jeong@nsabp.pitt.edu; JJEONG
Committee Member; Fine, Michael
Committee Member; Sharma, Ravi K; rks1946@pitt.edu; RKS1946
Committee Member; Mazumdar, Sati; maz1@pitt.edu; MAZ1
Date: 21 June 2007
Date Type: Completion
Defense Date: 23 April 2007
Approval Date: 21 June 2007
Submission Date: 13 April 2007
Access Restriction: 5 year -- Restrict access to University of Pittsburgh for a period of 5 years.
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: random coefficient model; first order marginal quasi-likelihood estimation; health service research; random intercept model; second order penalized quasi-likelihood estimation; racial disparities
Other ID: etd-04132007-121242
Date Deposited: 10 Nov 2011 19:37
Last Modified: 15 Nov 2016 13:40

