
Discoverability and interpretability of spurious associations in data-driven decisions

Teng, Xian (2024) Discoverability and interpretability of spurious associations in data-driven decisions. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



Big data and machine learning tools have jointly empowered humans to make data-driven decisions. Many of these tools capture empirical associations that may be spurious because of confounding factors and subgroup heterogeneity. The well-known Simpson's paradox is one such phenomenon: aggregated and subgroup-level associations contradict each other, causing confusion and making decisions difficult. Existing algorithms and systems offer insufficient support for people, especially the broad range of non-experts in causal inference and machine learning, who need to locate and understand spurious associations, identify subgroups free of spuriousness, and make reliable decisions in practice.
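The reversal between aggregated and subgroup-level associations can be made concrete with a small numerical sketch. The counts below are illustrative only and are not taken from the dissertation; the names `counts`, `subgroup_winner`, and `aggregate_winner` are hypothetical helpers introduced for this example. The sketch simply compares success rates within each subgroup against the pooled rates; it is not the Deparadox Tree algorithm.

```python
from fractions import Fraction

# Illustrative (successes, trials) counts per subgroup and treatment arm.
# These numbers are assumptions chosen to exhibit the reversal.
counts = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    """Exact success rate, avoiding floating-point ties."""
    return Fraction(successes, trials)

def subgroup_winner(group):
    """Arm with the higher success rate inside one subgroup."""
    arms = counts[group]
    return max(arms, key=lambda arm: rate(*arms[arm]))

def aggregate_winner():
    """Arm with the higher success rate after pooling all subgroups."""
    totals = {}
    for arms in counts.values():
        for arm, (s, n) in arms.items():
            prev_s, prev_n = totals.get(arm, (0, 0))
            totals[arm] = (prev_s + s, prev_n + n)
    return max(totals, key=lambda arm: rate(*totals[arm]))

subgroup_winners = {group: subgroup_winner(group) for group in counts}
overall = aggregate_winner()

# A reversal occurs when every subgroup favors one arm
# while the pooled data favors the other.
is_paradox = all(winner != overall for winner in subgroup_winners.values())
```

In this sketch, arm A has the higher rate in both subgroups, yet arm B has the higher pooled rate because the two arms are distributed unevenly across subgroups. Here the subgroup variable acts as a confounder.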

Motivated by identified research gaps and audience needs, this dissertation has a twofold objective: first, to empower users to identify spurious associations and paradoxical trends in their data, and second, to facilitate the interpretation of these phenomena, enabling more informed decision-making with observational data. To accomplish these aims, it introduces a novel solution comprising three key components: a data-driven algorithm named Deparadox Tree, a practical human-centric workflow called Deparadox Workflow, and a visual analytic system titled VISPUR. The Deparadox Tree automatically uncovers subgroup patterns behind paradoxical associations, employing split criteria that balance confounders and homogenize inconsistent effects throughout a recursive partitioning process. The Deparadox Workflow is developed from in-depth semi-structured interviews and engagements with three diverse target user groups, integrating their challenges and requirements in handling spurious or paradoxical phenomena in data analysis. Aligned closely with the workflow, VISPUR is designed to perform four main tasks: identifying confounding factors, exploring diverse subgroup patterns susceptible to causal misinterpretation, interpreting paradoxical phenomena, and aiding informed decision-making. The integration of quantitative and visual signals, coupled with interactive features, yields a more comprehensive and nuanced understanding of the underlying data.

My research bridges the divide between causality theory and practical applications, with an emphasis on both discoverability and interpretability. Grounded in causal theory, it translates complex causal factors behind spurious associations into clear visual signals and interactive features, fostering a clearer understanding of data patterns and preventing misinterpretations. The development, execution, and evaluation of my solutions are shaped by extensive collaboration with target users. This user-centric methodology ensures that the toolkit addresses the challenges and needs of a diverse audience, including data scientists and policymakers. Moreover, my research contributes to the broader realm of data-driven decision-making and knowledge discovery. Through both quantitative and qualitative evaluations (expert interviews and controlled user experiments), it demonstrates the potential to raise awareness of pitfalls arising from spurious associations, to enhance comprehension of the causal mechanisms underpinning Simpson's paradox, and to foster reliable, accountable, and informed decisions.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Teng, Xian (Email: xit22@pitt.edu; Pitt Username: xit22)
ETD Committee:
Committee Chair: Lin
Committee Member: Farzan
Committee Member: Brusilovsky
Committee Member: Gregory
Date: 13 May 2024
Date Type: Publication
Defense Date: 2 April 2024
Approval Date: 13 May 2024
Submission Date: 17 April 2024
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 139
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Information Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Simpson's paradox, Data-driven decision making, Confounding, Heterogeneous effects, Spurious association, Causal inference, Visual analytics system, Subgroups, Decision trees, Kernel mean embedding, Kernel distance
Date Deposited: 13 May 2024 17:28
Last Modified: 13 May 2024 17:28

