Multimodal knowledge integration for object detection and visual reasoning

Ye, Keren (2021) Multimodal knowledge integration for object detection and visual reasoning. Doctoral Dissertation, University of Pittsburgh. (Unpublished)


Abstract

We humans still perceive and reason differently from artificial intelligence models. We witness, we listen, we touch; we understand the world through multi-modal sensing, whereas machine models rely on only one or a few modalities and ignore abundant information. In this thesis, we explore techniques for reducing the perception gap between machines and humans, focusing on two families of tasks: reasoning and detection. First, we incorporate information from text, audio, motion, and external knowledge bases when training computer vision models. We find that inputs from a wider range of channels provide complementary information that improves the models. Second, we study how multimodal inputs can be fully utilized. We argue that most existing deep learning methods are prone to paying too much attention to shallow patterns in the input features, which biases the resulting models. We propose robust training to overcome this issue. Third, we extend the benefits of multi-modal information from the inputs to the supervision signals, by learning a weakly supervised detection model from the natural supervision of textual captions or audio narrations. With the help of NLP constituency parsing, it is possible to extract structural knowledge from the captions and narrations and hence determine the entities and relations of visual objects.


Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Ye, Keren	yekeren.cn@gmail.com	key36	0000-0002-7349-7762

ETD Committee:

Title	Member	Email Address	Pitt Username
Committee Chair	Kovashka, Adriana	kovashka@pitt.edu	kovashka
Committee Member	Litman, Diane	dlitman@pitt.edu	dlitman
Committee Member	Hauskrecht, Milos	milos@pitt.edu	milos
Committee Member	He, Daqing	dah44@pitt.edu	dah44
Committee Member	Hwang, Seong Jae	sjh95@pitt.edu	sjh95

Date:

8 September 2021

Date Type:

Publication

Defense Date:

8 July 2021

Approval Date:

8 September 2021

Submission Date:

28 July 2021

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

216

Institution:

University of Pittsburgh

Schools and Programs:

School of Computing and Information > Computer Science

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

weakly supervised learning, object detection, scene graphs generation, cross-modal retrieval, multi-modal learning, advertisements, external knowledge, vision and language, representation learning, question answering

Date Deposited:

08 Sep 2021 13:22

Last Modified:

08 Sep 2021 13:22

URI:

http://d-scholarship.pitt.edu/id/eprint/41514
