
Image-caption alignment and object naming variability as supervision for multi-modal object detection

Nebbia, Giacomo (2024) Image-caption alignment and object naming variability as supervision for multi-modal object detection. Doctoral Dissertation, University of Pittsburgh. (Unpublished)


Full text: PDF (6MB)

Abstract

In this work, we show how studying image-text alignment can uncover novel sources of supervision in multi-modal datasets that are useful for training deep learning models for grounding and object detection.
The relationship between an image and its caption is typically treated as binary: they either match or they do not, and a model is trained to match an image with its caption.
We challenge the idea of this binary relationship and consider image-text alignment: how well a caption describes an image can be quantified on a spectrum.
From relaxing the binary image-text matching relationship, we uncover two sources of supervision.
First, we introduce almost-matching captions. We define an almost-matching caption as a non-matching caption that partially overlaps with the matching one, and we show that this partial overlap can be leveraged for better training. We hypothesize that, by teaching a model to ground a limited portion of a caption with an image, it can learn better representations for this shared caption portion.
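
As a concrete illustration, one simple way to mine such captions is by content-word overlap; the sketch below uses a small stopword list and an overlap threshold as illustrative assumptions, not the dissertation's actual criterion for partial overlap.

    # Illustrative sketch: mine almost-matching captions by content-word overlap.
    # Stopword list and overlap threshold are assumptions for this example.
    STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "is", "are", "to"}

    def content_words(caption):
        """Lowercased tokens with punctuation stripped and stopwords removed."""
        tokens = (w.lower().strip(".,!?") for w in caption.split())
        return {t for t in tokens if t and t not in STOPWORDS}

    def almost_matching_captions(true_caption, candidate_captions, min_overlap=2):
        """Return non-matching captions sharing at least `min_overlap` content
        words with the matching caption, i.e., captions that partially describe
        the same image."""
        target = content_words(true_caption)
        return [c for c in candidate_captions
                if len(content_words(c) & target) >= min_overlap]
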
Second, we hypothesize that quantifying the alignment between a caption and its image provides useful information for training detection models. We explore alignment scores to guide training using Curriculum Learning.
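
For illustration, the sketch below orders image-caption pairs by an alignment score and presents the best-aligned pairs first, an easy-to-hard curriculum; CLIP cosine similarity computed with the open_clip library stands in for the alignment score, which is an assumption rather than the dissertation's implementation.

    # Illustrative sketch: rank image-caption pairs by alignment and train on
    # them best-aligned first. CLIP similarity is one possible alignment score.
    import torch
    import open_clip

    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    @torch.no_grad()
    def alignment_score(image, caption):
        """Cosine similarity between the image and caption embeddings."""
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer([caption]))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item()

    def curriculum_order(pairs):
        """Sort (PIL image, caption) pairs so the best-aligned pairs come first."""
        return sorted(pairs, key=lambda p: alignment_score(*p), reverse=True)
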
We observe that image-text alignment is influenced by the words used in a caption, and shift our focus from leveraging whole captions as sources of supervision to leveraging individual words.
First, we analyze the impact of Named Entities. We hypothesize that Named Entities represent a wasted learning opportunity: they are rare, and a model cannot learn to ground them. If each Named Entity were replaced with a common word (i.e., a hypernym), a grounding model could learn from such mentions.
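
As an illustration only, the following sketch replaces recognized Named Entities with common hypernyms; the spaCy NER model and the label-to-hypernym mapping are assumptions chosen for this example, not the dissertation's method.

    # Illustrative sketch: swap each recognized Named Entity for a common
    # hypernym chosen by entity label.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    HYPERNYMS = {"PERSON": "person", "ORG": "organization", "GPE": "place",
                 "LOC": "place", "FAC": "building", "PRODUCT": "object"}

    def replace_named_entities(caption):
        """Rewrite a caption with Named Entities replaced by common words."""
        doc = nlp(caption)
        pieces, last = [], 0
        for ent in doc.ents:
            if ent.label_ in HYPERNYMS:
                pieces.append(caption[last:ent.start_char])
                pieces.append(HYPERNYMS[ent.label_])
                last = ent.end_char
        pieces.append(caption[last:])
        return "".join(pieces)
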
Second, we argue that current approaches do not model the different words that can be used to describe the same concept (e.g., synonyms). We analyze the extent of this issue in current multi-modal models for open-vocabulary object detection, and we experiment with ways to ameliorate this problem.
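
One way this naming variability could be probed is sketched below; the detect function is a hypothetical placeholder for any open-vocabulary detector that scores a free-form category prompt on an image region, and the synonym sets are purely illustrative.

    # Illustrative sketch: measure how much a detector's confidence for the same
    # region changes across synonymous class names. `detect` is a hypothetical
    # stand-in for an open-vocabulary detector.
    SYNONYM_SETS = {
        "couch": ["couch", "sofa", "settee"],
        "cup": ["cup", "mug"],
    }

    def synonym_score_spread(detect, image, box):
        """Max-minus-min detection score per concept across its synonyms."""
        spread = {}
        for concept, names in SYNONYM_SETS.items():
            scores = [detect(image, prompt=name, box=box) for name in names]
            spread[concept] = max(scores) - min(scores)
        return spread
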
We design experiments to test each of the four hypotheses, and we report results to show the promise of switching the focus of multi-modal self-supervised training from image-text matching to image-text alignment.



Details

Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors:
Creator: Nebbia, Giacomo | Email: gin2@pitt.edu | Pitt Username: gin2 | ORCID: 0000-0002-4766-6278
ETD Committee:
Committee Chair: Kovashka, Adriana | Email: kovashka@cs.pitt.edu | Pitt Username: aik85 | ORCID: 0000-0003-1901-9660
Committee Member: Cooper, Gregory | Email: gfc@pitt.edu | Pitt Username: gfc | ORCID: 0000-0003-1687-6202
Committee Member: He, Daqing | Email: dah44@pitt.edu | Pitt Username: DAH44 | ORCID: 0000-0002-4645-8696
Committee Member: Litman, Diane J. | Email: litman@cs.pitt.edu | Pitt Username: DLITMAN
Date: 10 January 2024
Date Type: Publication
Defense Date: 28 November 2023
Approval Date: 10 January 2024
Submission Date: 4 December 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 149
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Intelligent Systems
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: multi-modal deep learning, self-supervision, grounding, object detection, image-text alignment, concept relationships
Date Deposited: 10 Jan 2024 14:30
Last Modified: 10 Jan 2024 14:30
URI: http://d-scholarship.pitt.edu/id/eprint/45612
