
Image-caption alignment and object naming variability as supervision for multi-modal object detection

Nebbia, Giacomo (2024) Image-caption alignment and object naming variability as supervision for multi-modal object detection. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

This is the latest version of this item.


In this work, we show how studying image-text alignment can uncover novel sources of supervision present in multi-modal datasets useful for training deep learning models for grounding and object detection.
The relationship between an image and its caption is seen as binary: they either match or they do not, and a model is trained to match an image with its caption.
We challenge the idea of this binary relationship and consider image-text alignment: how well a caption describes an image can be quantified on a spectrum.
From relaxing the binary image-text matching relationship, we uncover two sources of supervision.
First, we introduce almost-matching captions. We define an almost-matching caption as a non-matching caption that partially overlaps the matching one, and we show that this partial overlap can be leveraged for better training. We hypothesize that, by teaching a model to ground a limited portion of a caption with an image, the model can learn better representations for this shared caption portion.
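As an illustration of the definition above, the following minimal sketch flags a non-matching caption as almost matching when it shares at least one content word with the matching caption. The word-level overlap criterion, the stopword list, and the function names are assumptions made for this example; the dissertation may define overlap differently (for instance, over phrases or parsed objects).

```python
# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({"a", "an", "the", "of", "on", "in", "at", "with", "and", "is"})


def content_words(caption):
    """Lowercase the caption, strip basic punctuation, and drop stopwords."""
    tokens = (t.strip(".,!?").lower() for t in caption.split())
    return {t for t in tokens if t and t not in STOPWORDS}


def shared_portion(matching_caption, other_caption):
    """Return the content words that both captions have in common."""
    return content_words(matching_caption) & content_words(other_caption)


def is_almost_matching(matching_caption, other_caption):
    """True when the other caption overlaps the matching one only partially."""
    matching = content_words(matching_caption)
    other = content_words(other_caption)
    return bool(matching & other) and matching != other


matching = "A dog runs on the beach."
print(is_almost_matching(matching, "A dog sleeps on a sofa."))   # shares "dog"
print(is_almost_matching(matching, "Two planes at an airport."))  # no overlap
```

In this sketch, the shared portion ("dog") is the part of the non-matching caption that could still be grounded in the image.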
Second, we hypothesize that quantifying the alignment between a caption and its image provides useful information for training detection models. We explore alignment scores to guide training using Curriculum Learning.
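One way alignment scores could drive a curriculum is sketched below. It assumes that each image-caption pair already has a scalar alignment score and that better-aligned pairs should be presented first; both the scoring source and the easy-to-hard ordering are assumptions of this example, not details confirmed by the abstract.

```python
import math


def curriculum_stages(samples, num_stages):
    """Build cumulative training stages from (sample_id, alignment_score) pairs.

    Assumption: a higher score means a better-aligned (easier) pair, so the
    first stage holds the best-aligned samples and each later stage adds more.
    """
    ordered = sorted(samples, key=lambda s: s[1], reverse=True)
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = math.ceil(len(ordered) * k / num_stages)
        stages.append([sample_id for sample_id, _ in ordered[:cutoff]])
    return stages


# Hypothetical scores, for illustration only.
scored = [("a", 0.9), ("b", 0.2), ("c", 0.6), ("d", 0.4)]
for stage in curriculum_stages(scored, num_stages=2):
    print(stage)
```

Each stage would then be used as the training pool for a portion of the training schedule, with the final stage covering the full dataset.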
We observe that image-text alignment is influenced by the words used in a caption, and shift our focus from leveraging whole captions as sources of supervision to leveraging individual words.
First, we analyze the impact of Named Entities. We hypothesize that Named Entities represent a wasted learning opportunity: they are rare, and a model cannot learn to ground them. If each Named Entity were replaced with a common word (i.e., a hypernym), a grounding model could learn from such a mention.
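The replacement idea can be sketched with a plain lookup table, as below. The table entries and the function name are hypothetical; a real pipeline would need a named entity recognizer and a lexical resource to find entities and choose their hypernyms, neither of which is shown here.

```python
# Hypothetical mapping from Named Entity mentions to common-word hypernyms.
ENTITY_HYPERNYMS = {
    "Barack Obama": "a man",
    "the Eiffel Tower": "a tower",
}


def replace_named_entities(caption, hypernyms):
    """Replace each known entity mention in the caption with its hypernym.

    Longer mentions are replaced first so that a shorter entry cannot
    break up a longer one that contains it.
    """
    for entity in sorted(hypernyms, key=len, reverse=True):
        caption = caption.replace(entity, hypernyms[entity])
    return caption


print(replace_named_entities(
    "Barack Obama stands near the Eiffel Tower", ENTITY_HYPERNYMS
))
```

After replacement, the caption contains only common words ("man", "tower") that occur often enough in a dataset for a grounding model to learn from.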
Second, we argue that current approaches do not model the different words that can be used to describe the same concept (e.g., synonyms). We analyze the extent of this issue in current multi-modal models for open-vocabulary object detection, and we experiment with ways to ameliorate this problem.
We design experiments to test each of the four hypotheses, and we report results to show the promise of switching the focus of multi-modal self-supervised training from image-text matching to image-text alignment.



Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators (Name | Email | Pitt Username | ORCID):
Nebbia, Giacomo | gin2@pitt.edu | gin2 | 0000-0002-4766-6278
ETD Committee (Title | Member | Email Address | Pitt Username | ORCID):
Committee Chair | Kovashka, Adriana | kovashka@cs.pitt.edu | aik85 | 0000-0003-1901-9660
Committee Member | Cooper, Gregory | gfc@pitt.edu | gfc | 0000-0003-1687-6202
Committee Member | He, Daqing | dah44@pitt.edu | DAH44 | 0000-0002-4645-8696
Committee Member | Litman, Diane J. | litman@cs.pitt.edu | DLITMAN |
Date: 10 January 2024
Date Type: Publication
Defense Date: 28 November 2023
Approval Date: 10 January 2024
Submission Date: 4 December 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 149
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Intelligent Systems
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: multi-modal deep learning, self-supervision, grounding, object detection, image-text alignment, concept relationships
Date Deposited: 10 Jan 2024 14:30
Last Modified: 10 Jan 2024 14:30
