
Multimodal knowledge integration for object detection and visual reasoning

Ye, Keren (2021) Multimodal knowledge integration for object detection and visual reasoning. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



Humans still perceive and reason differently from artificial intelligence models. We witness, we listen, we touch, and we understand the world through multimodal sensing, while machine models rely on only one or a few modalities and ignore abundant information. In this thesis, we explore techniques for reducing the perception gap between machines and humans, focusing on two families of tasks: reasoning and detection. First, we incorporate information from text, audio, motion, and external knowledge bases when training computer vision models. We find that inputs from a wider range of channels provide complementary information that improves the models. Second, we study how multimodal inputs can be fully utilized. We argue that most existing deep learning methods are prone to paying too much attention to shallow patterns in the input features, which biases the resulting models. We propose robust training to overcome this issue. Third, we extend the benefits of multimodal information from the inputs to the supervision signals, by learning a weakly supervised detection model from the natural supervision of textual captions or audio narrations. With the help of NLP constituency parsing, it is possible to extract structural knowledge from the captions and narrations, and hence to determine the entities and relations of visual objects.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Ye, Keren
ETD Committee:
Committee Chair: Kovashka, Adriana (email: kovashka@pitt.edu; Pitt username: kovashka)
Committee Member: Litman, Diane (email: dlitman@pitt.edu; Pitt username: dlitman)
Committee Member: Hauskrecht, Milos (email: milos@pitt.edu; Pitt username: milos)
Committee Member: He, Daqing (email: dah44@pitt.edu; Pitt username: dah44)
Committee Member: Hwang, Seong Jae (email: sjh95@pitt.edu; Pitt username: sjh95)
Date: 8 September 2021
Date Type: Publication
Defense Date: 8 July 2021
Approval Date: 8 September 2021
Submission Date: 28 July 2021
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 216
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: weakly supervised learning, object detection, scene graphs generation, cross-modal retrieval, multi-modal learning, advertisements, external knowledge, vision and language, representation learning, question answering
Date Deposited: 08 Sep 2021 13:22
Last Modified: 08 Sep 2021 13:22

