Teng, Xian
(2024)
Discoverability and interpretability of spurious associations in data-driven decisions.
Doctoral Dissertation, University of Pittsburgh.
(Unpublished)
Abstract
Big data and machine learning tools have jointly empowered humans to make data-driven decisions. Many of these tools capture empirical associations that might be spurious due to confounding factors and subgroup heterogeneity. The famous Simpson's paradox is one such phenomenon, in which aggregated and subgroup-level associations contradict each other, causing confusion and making decisions difficult. Existing algorithms and systems fail to offer sufficient support for humans, especially the broad range of non-experts in causal inference and machine learning, to locate and understand spurious associations, identify subgroups free of spuriousness, and make reliable decisions in practice.
Motivated by identified research gaps and audience needs, this dissertation's objective is twofold: first, to empower users to identify the presence of spurious associations and paradoxical trends within their data, and second, to facilitate the interpretation of these phenomena, enabling more informed decision-making when dealing with observational data. To accomplish these aims, it introduces a novel, multifaceted solution comprising three key components: a data-driven algorithm named Deparadox Tree, a practical human-centric workflow called Deparadox Workflow, and a visual analytic system titled VISPUR. In particular, the Deparadox Tree automatically uncovers subgroup patterns behind paradoxical associations, employing innovative split criteria that balance confounders and homogenize inconsistent effects throughout a recursive partitioning process. The Deparadox Workflow is developed from in-depth semi-structured interviews and engagements with three diverse target user groups, integrating their challenges and requirements in handling spurious or paradoxical phenomena in data analysis. Aligned closely with the workflow, VISPUR is designed to perform four main tasks: identifying confounding factors, exploring diverse subgroup patterns susceptible to misinterpretations of causality, interpreting paradoxical phenomena, and aiding in informed decision-making. The integration of quantitative and visual signals, coupled with interactive features, yields a more comprehensive and nuanced understanding of the underlying data.
My research bridges the divide between causality theory and practical applications, with a strong emphasis on both discoverability and interpretability. Grounded in causal theory, it translates complex causal factors behind spurious associations into clear visual signals and interactive features, fostering a clearer understanding of data patterns and preventing misinterpretations. The development, execution, and evaluation of my solutions are shaped through extensive collaboration with target users. This user-centric methodology ensures that the toolkit caters to the challenges and needs of a diverse array of audiences, including data scientists and policymakers. Moreover, my research contributes to the broader realm of data-driven decision-making and knowledge discovery. Through both quantitative and qualitative evaluations (expert interviews and controlled user experiments), it demonstrates great potential for raising awareness of pitfalls arising from spurious associations, enhancing comprehension of the critical causal mechanisms underpinning Simpson's paradox, and fostering reliable, accountable, and informed decisions.
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Teng, Xian
Date: 13 May 2024
Date Type: Publication
Defense Date: 2 April 2024
Approval Date: 13 May 2024
Submission Date: 17 April 2024
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 139
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Information Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Simpson's paradox, Data-driven decision making, Confounding, Heterogeneous effects, Spurious association, Causal inference, Visual analytics system, Subgroups, Decision trees, Kernel mean embedding, Kernel distance
Date Deposited: 13 May 2024 17:28
Last Modified: 13 May 2024 17:28
URI: http://d-scholarship.pitt.edu/id/eprint/46156