Towards Zero-shot Document Query System

BLIP, BLIP-2, LLaVA, MiniGPT-4와 같은 최신 멀티모달 언어 모델을 통해 Zero-shot 문서 질의 시스템의 가능성을 탐구합니다. 각 모델의 핵심 기능과 적용 가능성을 비교 분석합니다.

Jun 22, 2023

Towards Zero-shot Document Query System

Contents

1. Survey on Recent Advances in Multimodal LLMs

- Emergence of Visual Instruction Model -

1. Survey on Recent Advances in Multimodal LLMs BLIP BLIP-2 LLaVA LLaMA-Adapter-V2 MiniGPT-4 ETC

1. Survey on Recent Advances in Multimodal LLMs

BLIP

notion image

BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and geration

Motivation

Vision-language pre-training (VLP)의 두 가지 방법 약점을 보완

Encoder-based model (e.g. CLIP): 이미지 캡셔닝과 같은 text generation task에 약함
Encoder-decoder model (e.g. VL-T5): image-text retrieval에 약함

Model perspective

Multimodal mixture of Encoder-Decoder (MED)
세 가지 vision-language task를 동시에 학습

1) Image-text contrastive learning (ITC)

2) Image-text matching (ITM)

3) Image-contitioned language modeling (LM)

Data perspective

Captionin and Filtering (CapFilt)
Noisy image-text pair로부터 학습하기 위한 방법
VLP는 다운스트림 태스크에 효과적이지만, 데이터셋의 규모를 키움으로써 얻은 효과 때문에 데이터의 노이즈로 인한 영향이 잘 드러나지 않았음. 단순한 필터로는 부족함.
Captioner: 웹 이미지로부터 합성 캡션을 생성
Filter: 노이지 캡션을 제거

Method

Model Architecture and Pre-training objectives

notion image

2개의 understanding-based objectives + 1개(LM)의 generation-based objective.
Text encoder와 decoder의 CA, FFN은 파라미터를 공유함
Unimodal encoder

Image encoder: ViT
Text encoder: BERT
이 둘은 contrastive loss를 사용하여 학습

Image-grounded text encoder

이미지 feature와 Cross-Attention 수행
image text matching loss (ITM): binary classification
Hard negative mining 사용

Image-grounded text decoder

Cross entropy loss를 사용한 일반적인 auto-regressive 텍스트 생성 objective

CapFilt

notion image

Finetuning

Captioner와 Filter는 모두 같은 pre-trained MED로 초기화
Captioner: MED의 image-grounded text decoder와 COCO 데이터셋으로 fine-tuning
Filter: MED의 image-grounded text encoder를 사용하여 텍스트가 이미지와 맞는지 판별하도록 fine-tuning

Bootstraping

Web에서 수집된 텍스트와 Captioner가 생성한 텍스트를 Filter가 필터링해서 데이터셋에 추가

BLIP-2

notion image

BLIP-2: Bootstrapping Language-Image Pre-training with frozen unimodal models

Generic and compute-efficient VLP method by bootstrapping from off-the-shelf pre-trained vision models and laugage models

Motivation

Vision-language model을 end-to-end 학습하는 것은 너무 cost가 높음. 이미 잘 학습된 모델을 활용하면 좋을 듯?
그러나 frozen unimodal 모델들은 서로 다른 모달리티(이미지, 텍스트)를 본 적이 없기 때문에 이 갭을 메꾸는 것 = 이미지 feature를 텍스트 공간으로 align 시키는 것이 핵심.
이미지 모델과 텍스트 모델 사이의 Q-Former 구조와 이를 학습시키기 위한 2단계 방법을 제안
기존 work과의 차이점

Frozen: Image encoder를 fine-tuning
Flamingo: LM을 fine-tuning
BLIP-2의 차이점: image encoder와 LM 모두 freeze하고 그 사이의 Q-Former만 학습함

Method

Model Architecture (Q-Former)

notion image

두 개의 Transformer sub-module로 구성됨. BLIP과 동일한 것으로 추정됨.

Image transformer

Image Encoder와 Cross Attention(CA) 레이어를 통해 interact 하는 것으로 추정됨
Learable query embedding을 학습하여 image transformer에 입력

Text transformer

두 transformer 모두 BERT로 초기화. 파라미터 갯수는 188M.

Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder

BLIP과 3개의 training objective (ITC, ITM, ITG)를 사용하는 것은 거의 동일하지만 모델 아키텍처와 self-attention 메커니즘에 차이가 있음
Image-Text Contrastive Learning (ITC):

이미지 transformer와 텍스트 transformer로부터의 output representation을 align 시키도록 contrastive loss로 학습.

Output representation까지 두 Transformer간에 정보가 새지 않도록 self-attention을 마스킹하여 서로 차단함

Image-grounded Text Generation (ITG)

Text generation은 causal process이므로 마스킹 방법이 다름

Query는 text에 attend 할 수 없음

Text는 모든 query와 previous token에 attend할 수 있음

Visual feature가 text transformer에 직접적으로 연관되지 않으므로, self-attention을 통해 정보를 전달하도록 학습되어야 함.

Image-Text Matching (ITM)

Bi-directional self-attention mask 사용
Binary classification으로 학습

Bootstrap Vision-Language Generative Learning from a Frozen LLM

notion image

이미지 Transformer의 output query embedding을 (frozen) LLM의 input에 넣을 수 있도록 같은 dimension으로 projection 하는 FC 레이어 추가.

Q-Former는 언어 모델에 필요한(language-informative) 시각적 representation을 추출하도록 학습되었으므로, 이는 불필요한 정보를 제거하고 LLM에 정확히 필요한 정보만을 전달함.

두 가지 LLM을 실험

Decoder-based: Visual repr 으로부터 텍스트를 디코딩하도록 학습
Encoder-decoder: Decoder가 prefix text를 입력받고 suffix를 디코딩하도록 학습

Model Pre-training

BLIP과 동일하게 총 129M 장의 데이터셋으로 pre-training: COCO, Visual Genome, CC3M, CC12M, SBU, LAION300M
BLIP에서 제안한 CapFilt 방식 사용
ViT-L/14 from CLIP, ViT-g/14 from EVA-CLIP
OPT, FlanT5
FlanT5는 Bfloat16 사용
제일 무거운 모델일 때 A100 x 16개로 총 9일 걸림

결과

notion image

LLaVA

Motivation

Vision-language model이 많은 성공을 거두고 있지만, 각각의 task를 하나의 모델로 해결
Task에 대한 instruction은 암시적으로 모델 디자인에 반영되어있음
반면 최근 LLM들은 instruction을 바탕으로 하나의 모델이지만 여러 task를 수행함
본 논문에서는 Instruction-tuning을 multimodal 공간으로 확장하여 general-purpose visual assistant를 만들고자 함
Contribution

Multimodal instruction-following data: image-text 페어 데이터와 ChatGPT/GPT-4를 사용하여 instruction-following 데이터 생성
Large multimodal models: CLIP과 LLaMA를 사용하여 end-to-end 학습
Open-source: 데이터셋과 코드베이스, 모델, 데모를 공개

GPT-assisted Visual Instruction Data Generation

기본적으로 가지고 있는 것: Image Captioning 데이터셋

notion image

ChatGPT/GPT-4를 사용하여 이미지를 설명하도록 지시하는 질문을 생성

notion image

이 질문으로부터 심플한 instruction-following 데이터 (image-text pair) 생성할 수 있음:

notion image

만들기는 쉬우나, 다양성이 부족하고 깊이 있는 추론을 학습시키에 부족함

ChatGPT/GPT4를 사용한 데이터 생성 기법

COCO 데이터셋 사용
(이미지를 입력하지 못하므로) 데이터셋에 라벨링 된 정보인 1) 캡션 2) 박스를 텍스트 입력으로 ChatGPT/GPT4에 입력
사람이 직접 만든 예시를 in-context learning으로 사용하여 적절한 프롬프트를 통해 세 가지 instruction-following data 생성

notion image

Conversation: AI assistant가 질문에 대답하는 것과 같은 대화 패턴.
Detailed description: 사진에 대한 상세한 설명.
Complex reasoning: 위 두 개로부터 생성. 단계별 추론이 필요한 추론 질문.
예시:

notion image

Visual Instruction Tuning

notion image

Language Model: LLaMA

Vision Encoder: ViT-L/14

Projection: Simple linear projection matrix

Projection layer에는 더 복잡한 방법 - 예를 들면 Flamingo, Q-former 등을 사용할 수 있을 것

Training

notion image

Stage 1: Pre-training for Feature Alignment

Visual encoder와 LLM이 frozen 상태에서 projection layer만 학습
이미지 feature를 pre-trained LLM word embedding에 align 시키기 위한 작업

Stage 2: Fine-tuning end-to-end

Visual encoder는 여전히 frozen
Projection layer와 LLM은 update

Result

notion image

notion image

LLaMA-Adapter-V2

notion image

Motivation

LLaMA-adapter는 instruction following model로 만들어졌고 adaptation prompts에 visual feature를 사용함으로써 쉽게 visual instruction model이 될 수 있었지만, 좋은 데이터셋이 없었음.

⇒ image-text pair 데이터셋과 instruction-following 데이터셋은 각자 많음. 이 둘을 변경 없이 사용할 수 있으면 편리할 것.

두 가지 문제를 해결해야 함: 1) Aligning visual feature 2) Instruction-following. 같이 학습시키면 Visual feature alignment가 dominate해서 학습이 제대로 되지 않았음.

⇒ 서로 다른 성격의 두 task를 하나의 모델에서, 하나의 학습 방법론으로 해결해야 함

Method

Bias Tuning of Linear Layers

LLaMA-Adapter (V1)은 마지막 몇 개의 레이어에 대해서만 adaptation prompts를 추가하는 방식이었기 때문에 fine-tuning 범위가 제한적이었음
더 많은 범위를 학습하기 위해 다음과 같이 추가함

모든 normalization layer를 unfreeze
Linear layer에 bias, scale factor를 추가하여 학습함 (각각 0, 1로 init)

notion image

Joint Training with Disjoint Parameters

notion image

고퀄리티의 visual instruction data를 많이 만드는 것이 어렵기 때문에, 500K 개의 image-text 페어 데이터셋과 50K 개의 text-only instruction 데이터셋을 사용하여 joint training.
단순한 joint training으로는 갯수가 적은 instruction following이 간섭(interference)되어 잘 학습되지 않았음
대신, image-text alignment 학습과 instruction-following 학습에 사용되는 파라미터 그룹을 분리

Image-text alignment: Visual projection layer, Early (zero-initialized) attention
Instruction-following: 나머지

이러한 방식으로 1) 학습의 안정성 2) 데이터의 부재 두 마리의 토끼를 모두 잡았음

Early Fusion of Visual Knowledge

notion image

Visual, language fine-tuning이 서로 간섭하지 않도록 adapter의 위치를 분리함
(Recap) LLaMA-Adapter V1 에서의 multi-modal 실험에서는 visual feature를 합쳐서 모든 레이어의 language adaption prompt (late fusion)에 더하는 식으로 넣었음
V2에서는 (projected) visual feature는 맨 앞에서 word token에 concat해서, adaptation prompts는 V1과 같이 맨 뒤쪽 몇 개의 레이어에 넣어줬음
Join training과 함께 이 기법을 통해 서로 다른 두 fine-tuning의 충돌을 해결함

Integration with Experts

LLaVA나 MiniGPT-4와는 다르게 LLaMA-Adapter-V2는 훨씬 적은 수의 데이터로 학습되었으므로 이미지 이해 능력이 상대적으로 떨어짐.
캡셔닝, OCR, 검색 엔진 등의 전문 모델을 image encoder로 사용하여 더 나은 결과를 얻을 수 있음

notion image

notion image

MiniGPT-4

notion image

Language: Vicuna (LLaMA + instruction-tuning)

Image: BLIP-2 (ViT from CLIP + Q-Former for projection)

Training

Pre-training using image captioning dataset
Generating detailed description (which is noisy)
Clean descriptions using GPT-4
Manual refinement of data
Fine-tuning

ETC

Flamingo

Share article