CoCa: Contrastive Captioners are Image-Text Foundation Models

Research on multimodal pre-trained models that can be applied to a wide range of downstream tasks spanning vision and language has been very active recently.

Inc Lomin

Oct 11, 2022


Contents

Introduction

Proposed Method

Contrastive Captioners Pretraining

Experiment

Results

Conclusions

Introduction


Vision-Language Pretraining (VLP)

This line of research aims to jointly encode vision and language in a fusion model.

Early works : LXMERT, UNITER, VinVL

ViLT, VLMo

Image-Text Foundation Models

CLIP (2021) and ALIGN (2021) showed that dual-encoder models trained with a contrastive objective on noisy image-text pairs learn strong image and text representations.

Another line of work (SimVLM, [17], [34]) uses encoder-decoder models with a generative loss and achieves strong results on vision-language benchmarks.

Florence

LiT, BASIC

SimVLM

ALBEF[36] : contrastive loss, MLM, dual-encoder

CoCa is an image-text encoder-decoder foundation model with a minimalist design that subsumes the capabilities of both contrastive models such as CLIP and generative models such as SimVLM.

Proposed Method

CoCa has the following characteristics.

An image-text foundation model with an encoder-decoder architecture

Trained with both a contrastive loss and a captioning (generative) loss

The decoder transformer is split into two parts

Approach

3 foundation model families that utilize natural language supervision differently

1. single-encoder classification pre-training

The classic single-encoder approach pretrains on large annotated datasets (e.g., ImageNet, Instagram, JFT) using a cross-entropy loss.
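To make the cross-entropy objective concrete, here is a minimal sketch in plain Python. The logits and labels are made-up values for illustration, and `softmax_cross_entropy` is a hypothetical helper name, not something taken from the paper.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

# Hypothetical classifier logits over 3 classes for a batch of 2 images.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 2]

# The batch loss is the mean of the per-example losses.
loss = sum(
    softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)
) / len(batch_labels)
print(loss)
```

The label here is a single class index from a fixed vocabulary, which is exactly the restriction that the next two approaches relax by using free-form text.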

2. dual-encoder contrastive learning

Compared with pretraining a single-encoder classification model, this approach can use web-scale noisy data in addition to human-annotated datasets, and it learns from free-form text. The two encoders are optimized jointly by contrasting paired image-text examples.

Because the dual-encoder approach trains an aligned text encoder together with the image encoder, it can be applied to crossmodal alignment tasks such as image-text retrieval and zero-shot image classification.
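A minimal sketch of a symmetric contrastive loss over a batch of image-text pairs may help here. It assumes the embeddings have already been produced by the two encoders; the temperature of 0.07 and all function names are illustrative assumptions, not values or APIs from the post.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(row, idx):
    """log softmax(row)[idx], computed stably."""
    m = max(row)
    return row[idx] - (m + math.log(sum(math.exp(z - m) for z in row)))

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Sum of the image-to-text and text-to-image losses over N pairs."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image i should pick out text i among all texts in the batch.
    i2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text j should pick out image j among all images (columns of sim).
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax_at(sim_t[j], j) for j in range(n)) / n
    return i2t + t2i

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Every other pair in the batch acts as a negative, which is why this objective works with noisy web data and needs no class vocabulary.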

3. encoder decoder image captioning

While the dual-encoder approach encodes the text as a whole, the generative approach (a.k.a. captioner) aims for finer granularity and trains the model to predict the tokenized text autoregressively.
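The autoregressive objective can be sketched as a per-token negative log-likelihood. The sketch below assumes the decoder has already produced one row of vocabulary logits per caption position, each conditioned on the image and the preceding tokens; the function name and toy values are hypothetical.

```python
import math

def captioning_loss(step_logits, target_tokens):
    """Sum over positions t of -log P(y_t | y_<t, image)."""
    total = 0.0
    for logits, target in zip(step_logits, target_tokens):
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_norm - logits[target]
    return total

# Toy example: a 3-token caption over a 4-word vocabulary.
step_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 3.0, 0.1],
    [0.1, 1.5, 0.1, 0.1],
]
target_tokens = [0, 2, 1]
print(captioning_loss(step_logits, target_tokens))
```

Unlike the contrastive loss, which scores one embedding per caption, this loss gives a training signal at every token.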

Contrastive Captioners Pretraining

CoCa uses a simple encoder-decoder architecture and seamlessly unifies the three training paradigms.

It uses a ViT (or a ConvNet) as the image encoder and a transformer decoder as the text decoder.
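For reference, the two losses are combined as a weighted sum. The formulation below follows the notation of the CoCa paper as best it can be restated here; the symbols are not defined elsewhere in this post, so treat them as assumptions: x_i and y_j are normalized image and text embeddings, N is the batch size, σ is the temperature, and λ_Con, λ_Cap are loss weights.

```latex
\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\left(
  \sum_{i=1}^{N} \log \frac{\exp(x_i^{\top} y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \sigma)}
  + \sum_{i=1}^{N} \log \frac{\exp(y_i^{\top} x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^{\top} x_j / \sigma)}
\right)

\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T} \log P_{\theta}\left(y_t \mid y_{<t}, x\right)

\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}}
```

Annotated class labels can be treated as text, so the single-encoder classification setup becomes a special case of the same objective.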

Decoupled Text Decoder and CoCa Architecture

As described above, the captioning approach optimizes the conditional likelihood of the text, whereas the contrastive approach uses an unconditional text representation. To reconcile this difference within a single model, the decoder is split in half into a unimodal component, which does not use cross-attention, and a multimodal component, which does.
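The split can be illustrated with a heavily simplified toy decoder. This is a sketch under strong assumptions: single-head attention with no learned weights, no feed-forward blocks, and no layer normalization, and every name is hypothetical. It only shows where cross-attention does and does not appear.

```python
import math

def attend(queries, keys, values, causal=False):
    """Single-head dot-product attention on lists of vectors."""
    out = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)  # causal: position i sees keys 0..i
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys[:limit]
        ]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values[:limit])) / total
            for d in range(len(values[0]))
        ])
    return out

def add(xs, ys):
    """Residual connection: element-wise sum of two token sequences."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def decoupled_decoder(text_tokens, image_tokens, n_layers=4):
    """First half of the layers is unimodal, second half is multimodal."""
    h = text_tokens
    for _ in range(n_layers // 2):
        h = add(h, attend(h, h, h, causal=True))           # causal self-attention only
    unimodal = h  # text-only representation, usable for the contrastive loss
    for _ in range(n_layers // 2):
        h = add(h, attend(h, h, h, causal=True))           # causal self-attention
        h = add(h, attend(h, image_tokens, image_tokens))  # cross-attention to image
    return unimodal, h  # h feeds the captioning loss

text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_a = [[1.0, 2.0], [3.0, 4.0]]
image_b = [[-1.0, 0.5], [0.0, -2.0]]
uni_a, multi_a = decoupled_decoder(text, image_a)
uni_b, multi_b = decoupled_decoder(text, image_b)
print(uni_a == uni_b, multi_a == multi_b)
```

The unimodal output never sees the image, so it is an unconditional text representation, while the multimodal output depends on the image and models the conditional likelihood.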

Attentional Pooler

According to the authors' preliminary experiments, a single pooled image embedding serves as a global representation and helps visual recognition tasks, while additional visual tokens help multimodal understanding tasks that require region-level features.

CoCa therefore uses task-specific attentional pooling so that it can serve different downstream tasks. The pooler consists of a single multi-head attention layer.
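A minimal sketch of attentional pooling follows, assuming a single attention head and no learned projections (the post describes a multi-head layer). The query vectors would be learnable in practice; here they are fixed toy values, and `attentional_pool` is a hypothetical name.

```python
import math

def attentional_pool(queries, image_tokens):
    """Each query attends over all image tokens and yields one pooled vector."""
    pooled = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, tok)) / math.sqrt(len(q))
            for tok in image_tokens
        ]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        pooled.append([
            sum(w * tok[d] for w, tok in zip(weights, image_tokens)) / total
            for d in range(len(image_tokens[0]))
        ])
    return pooled

image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One query gives a single global embedding (the contrastive-style use).
global_emb = attentional_pool([[0.0, 0.0]], image_tokens)

# Several queries give several visual tokens (the captioning-style use).
visual_tokens = attentional_pool([[1.0, 0.0], [0.0, 1.0]], image_tokens)
print(len(global_emb), len(visual_tokens))
```

The number of queries is the only thing that changes between the two uses, so the same encoder output can be adapted per task. The paper reports one query for the contrastive loss and 256 for the generative loss.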

Pretraining Efficiency

With this design, both losses can be computed in a single forward propagation, and most of the computation is shared between them, so the overhead over a basic encoder-decoder model is minimal.

Unlike earlier approaches [30, 32, 33, 35, 36, 37], which train in multiple stages on various data and/or modalities, CoCa is pretrained end-to-end directly from scratch on the various data sources.

Experiment

Training Setup

Data

Uses web-scale alt-text data and annotated images (JFT-3B)

Optimization

Lingvo framework, GSPMD

batch size : 65536

Adafactor optimizer with beta_1 = 0.9, beta_2 = 0.999 and decoupled(?) weight decay ratio 0.01 (For memory efficiency?)

Learning-rate warm-up, then linear decay

5 days on 2,048 CloudTPUv4 chips
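The warm-up and linear-decay schedule listed above can be sketched as follows. The step count, peak learning rate, and warm-up fraction used in the example are placeholders rather than values from this post, and decaying all the way to zero is an assumption.

```python
def learning_rate(step, total_steps, peak_lr, warmup_fraction):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Placeholder settings for illustration only.
for step in (0, 10, 20, 500, 1000):
    print(step, learning_rate(step, total_steps=1000, peak_lr=1e-3, warmup_fraction=0.02))
```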

Results

core tasks

visual recognition

crossmodal alignment

image captioning and multimodal understanding

1. Visual Recognition Tasks

Visual recognition experiments are conducted on ImageNet [9] as the image recognition benchmark, and on multiple video datasets, including Kinetics-400 [57], Kinetics-600 [58], Kinetics-700 [59], and Moments-in-Time [60], as test beds for video action recognition.

2. Crossmodal Alignment Tasks

Zero-Shot Image-Text Retrieval

We evaluate CoCa on the two standard image-text retrieval benchmarks: MSCOCO [63] and Flickr30K [62].

Zero-Shot Image Classification

Zero-Shot Video Retrieval

We evaluate video-text retrieval using CoCa on MSR-VTT [71] using the full split.

3. Image Captioning and Multimodal Understanding Tasks

Multimodal Understanding

VQA v2

SNLI-VE

NLVR2

Image Captioning

MSCOCO Karpathy-test

NoCaps

Conclusions

Dual-encoder models are excellent at zero-shot image classification, but they are less suitable for joint vision-language understanding. Encoder-decoder approaches, on the other hand, are good at image captioning and visual question answering but not at retrieval-style tasks.

https://www.marktechpost.com/2022/05/30/google-ai-proposes-contrastive-captioner-coca-a-novel-encoder-decoder-model-that-simultaneously-produces-aligned-unimodal-image-and-text-embeddings/

Zero-shot cross-modal retrieval

For image-text retrieval, following (Xu et al. 2018; Chi and Peng 2019), we evaluate crossmodal retrieval in two scenarios: image-to-text (Img2Txt) and text-to-image (Txt2Img), each of which takes data from one modality as the query, i.e., images (texts), to retrieve related items in the other modality, i.e., texts (images). The widely used mean average precision (MAP) score, computed over all returned results, is used as the evaluation measure.
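As a rough illustration of the MAP measure mentioned above, the sketch below computes average precision per query from a ranked list of relevance flags and then averages over queries. The function names and toy rankings are hypothetical, and the exact MAP variant used in the cited works may differ.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[k] is True if the k-th result is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked_relevance):
    """Mean of the per-query average precision scores."""
    return sum(average_precision(r) for r in all_ranked_relevance) / len(
        all_ranked_relevance
    )

# Toy Img2Txt results: each inner list is the ranked text results for one image query.
img2txt = [[True, False, True], [False, True]]
print(mean_average_precision(img2txt))
```

The same function applies to Txt2Img by swapping which modality supplies the queries.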