
Domain Robustness in Multi-modality Learning and Visual Question Answering

Zhang, Mingda (2022) Domain Robustness in Multi-modality Learning and Visual Question Answering. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



Humans perceive the world via multiple modalities, as information from a single modality is usually partial and incomplete. This observation motivates the development of machine learning algorithms capable of handling multi-modal data and performing intelligent reasoning. The recent resurgence of deep learning brings both opportunities and challenges to multi-modal reasoning. On the one hand, its strong representation learning capability provides a unified approach to representing information across multiple modalities. On the other hand, properly training such models typically requires enormous amounts of data, which are not always available, especially in the multi-modal setting.
One promising direction to mitigate the lack of data for deep learning models is to transfer knowledge (e.g., gained from solving related problems) to low-resource domains. This procedure is known as transfer learning or domain adaptation, and it has demonstrated great success in various visual and linguistic applications. However, how to effectively transfer knowledge in a multi-modality setting remains an open research question. In this thesis, we choose multi-modal reasoning as our target task and aim to improve the performance of deep neural networks on low-resource domains via domain adaptation. We first briefly discuss our prior work on advertisement understanding (a typical multi-modal reasoning problem) and share our experience in addressing the data-availability challenge. Next, we turn to visual question answering (VQA), a more general problem that involves more complicated reasoning. We evaluate mainstream VQA models and classic single-modal domain adaptation strategies and show that existing methods usually suffer significant performance degradation when directly applied to the multi-modal setting. We measure the domain gaps in different modalities and design an effective strategy to manually control domain shifts on individual modalities, which helps us better understand the problem. Lastly, we present a systematic study across real datasets to answer a few fundamental questions regarding knowledge transfer in VQA, such as the sensitivity of various models to different types of supervision (i.e., unsupervised, self-supervised, semi-supervised, and fully supervised). We conclude by sharing the limitations and our vision for future research directions.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators | Email | Pitt Username | ORCID
Zhang, Mingda | mzhang@cs.pitt.edu | miz44 |
ETD Committee:
Title | Member | Email Address | Pitt Username | ORCID
Committee Chair | Kovashka, Adriana | kovashka@cs.pitt.edu | aik85 |
Committee CoChair | Hwa, Rebecca | hwa@cs.pitt.edu | reh23 |
Committee Member | Litman, Diane | dlitman@pitt.edu | dlitman |
Committee Member | Hwang, Seong Jae | sjh95@pitt.edu | sjh95 |
Committee Member | He, Daqing | dah44@pitt.edu | dah44 |
Date: 17 January 2022
Date Type: Publication
Defense Date: 30 November 2021
Approval Date: 17 January 2022
Submission Date: 9 December 2021
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 115
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Domain Robustness, Multi-modal Reasoning, Visual Rhetoric, Advertisement Understanding, Visual Question Answering
Date Deposited: 17 Jan 2022 15:03
Last Modified: 17 Jan 2022 15:03

