Looking Similar, Sounding Different

Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning

MIT, Netflix
CVPR 2024

Abstract

Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond to a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show that this general approach improves performance on a range of downstream auditory and audiovisual tasks, without substantially affecting performance on linguistic tasks. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
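
To make the idea concrete, the sketch below shows one way a cross-modal contrastive objective could treat every dubbed audio track of a clip as a positive for that clip's video embedding, while audio from other clips in the batch serves as negatives. This is not the authors' released implementation; the function name, batch layout, and temperature value are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a dub-augmented
# cross-modal contrastive loss: all dubbed audio tracks of clip i are
# positives for video i, audio tracks of other clips are negatives.
import torch
import torch.nn.functional as F

def dubbed_clip_contrastive_loss(video_emb, audio_embs, temperature=0.07):
    """
    video_emb:  (B, D)    one embedding per video clip
    audio_embs: (B, K, D) K dubbed tracks per clip (e.g. DE/EN/ES/FR/IT/JA)
    """
    B, K, D = audio_embs.shape
    v = F.normalize(video_emb, dim=-1)                      # (B, D)
    a = F.normalize(audio_embs.reshape(B * K, D), dim=-1)   # (B*K, D)

    logits = v @ a.t() / temperature                        # (B, B*K)
    # audio column j belongs to clip j // K
    targets = torch.arange(B * K, device=logits.device) // K

    # video -> audio: each video should score all K of its dubs highly
    pos_mask = targets.unsqueeze(0) == torch.arange(B, device=logits.device).unsqueeze(1)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss_v2a = -(log_prob * pos_mask).sum(dim=1) / K

    # audio -> video: each dubbed track should retrieve its own video
    loss_a2v = F.cross_entropy(logits.t(), targets, reduction='none')

    return 0.5 * (loss_v2a.mean() + loss_a2v.mean())
```

Averaging the log-probabilities over all K dubs of a clip nudges the audio encoder toward representations that are invariant to the spoken language yet still discriminative between scenes, which is the behavior the dubbed-audio augmentation is intended to encourage.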

Example Clips

Fourteen example clips, each playable with dubbed audio tracks in German (DE), English (EN), Spanish (ES), French (FR), Italian (IT), and Japanese (JA).