Domain Robustness in Multi-modality Learning and Visual Question Answering

Zhang, Mingda (2022) Domain Robustness in Multi-modality Learning and Visual Question Answering. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Abstract

Humans perceive the world via multiple modalities, as information from a single modality is usually partial and incomplete. This observation motivates the development of machine learning algorithms capable of handling multi-modal data and performing intelligent reasoning. The recent resurgence of deep learning brings both opportunities and challenges to multi-modal reasoning. On the one hand, its strong representation learning capability provides a unified approach to representing information across multiple modalities. On the other hand, properly training such models typically requires enormous amounts of data, which are not always available, especially in the multi-modal setting.
One promising direction to mitigate the lack of data for deep learning models is to transfer knowledge (e.g., gained from solving related problems) to low-resource domains. This procedure is known as transfer learning or domain adaptation, and it has demonstrated great success in various visual and linguistic applications. However, how to effectively transfer knowledge in a multi-modality setting remains an open research question. In this thesis, we choose multi-modal reasoning as our target task and aim to improve the performance of deep neural networks on low-resource domains via domain adaptation. We first briefly discuss our prior work on advertisement understanding (a typical multi-modal reasoning problem) and share our experience in addressing the data-availability challenge. Next, we turn to visual question answering (VQA), a more general problem that involves more complicated reasoning. We evaluate mainstream VQA models and classic single-modal domain adaptation strategies and show that existing methods usually suffer significant performance degradation when directly applied to the multi-modal setting. We measure the domain gaps in different modalities and design an effective strategy to manually control domain shifts on individual modalities, which helps to better understand the problem. Lastly, we present a systematic study across real datasets to answer a few fundamental questions regarding knowledge transfer in VQA, such as the sensitivity of various models to different types of supervision (i.e., unsupervised, self-supervised, semi-supervised, and fully supervised). We conclude by sharing the limitations and our vision for future research directions.

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Zhang, Mingda	mzhang@cs.pitt.edu	miz44

ETD Committee:

Title	Member	Email Address	Pitt Username
Committee Chair	Kovashka, Adriana	kovashka@cs.pitt.edu	aik85
Committee CoChair	Hwa, Rebecca	hwa@cs.pitt.edu	reh23
Committee Member	Litman, Diane	dlitman@pitt.edu	dlitman
Committee Member	Hwang, Seong Jae	sjh95@pitt.edu	sjh95
Committee Member	He, Daqing	dah44@pitt.edu	dah44

Date:

17 January 2022

Date Type:

Publication

Defense Date:

30 November 2021

Approval Date:

17 January 2022

Submission Date:

9 December 2021

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

115

Institution:

University of Pittsburgh

Schools and Programs:

School of Computing and Information > Computer Science

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

Domain Robustness, Multi-modal Reasoning, Visual Rhetoric, Advertisement Understanding, Visual Question Answering

Date Deposited:

17 Jan 2022 15:03

Last Modified:

17 Jan 2022 15:03

URI:

http://d-scholarship.pitt.edu/id/eprint/42059
