Ye, Keren (2021) Multimodal knowledge integration for object detection and visual reasoning. Doctoral Dissertation, University of Pittsburgh. (Unpublished)
Abstract
Humans still perceive and reason differently from artificial intelligence models. We see, we listen, we touch; we understand the world through multimodal sensing, while machine models rely on only one or a few modalities and ignore abundant information. In this thesis, we explore techniques for reducing the perception gap between machines and humans, focusing on two families of tasks: reasoning and detection. First, we incorporate information from text, audio, motion, and external knowledge bases when training computer vision models. We find that inputs from a wider range of channels provide complementary information that improves the models. Second, we study how multimodal inputs can be fully utilized. We argue that most existing deep learning methods are prone to paying too much attention to shallow patterns in the input features, which biases the resulting models. We propose robust training to overcome this issue. Third, we extend the benefits of multimodal information from the inputs to the supervision signals, by learning a weakly supervised detection model from the natural supervision of textual captions or audio narrations. With the help of NLP constituency parsing, it is possible to extract structural knowledge from the captions and narrations, and hence to determine the entities and relations of visual objects.
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Ye, Keren
ETD Committee:
Date: 8 September 2021
Date Type: Publication
Defense Date: 8 July 2021
Approval Date: 8 September 2021
Submission Date: 28 July 2021
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 216
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: weakly supervised learning, object detection, scene graphs generation, cross-modal retrieval, multi-modal learning, advertisements, external knowledge, vision and language, representation learning, question answering
Date Deposited: 08 Sep 2021 13:22
Last Modified: 08 Sep 2021 13:22
URI: http://d-scholarship.pitt.edu/id/eprint/41514