LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Oct 31, 2021

Contents

Introduction
Proposed Method
Experiment

Introduction

LayoutLM is a simple yet effective text-and-layout pre-training method for visually-rich document understanding (VrDU) tasks.

Unlike earlier text-only pre-trained models, LayoutLM uses 2D position embeddings in addition to text embeddings.

Two training objectives are used in the pre-training stage:

Masked visual-language model
Multi-label document classification

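This paper presents LayoutLMv2, an improved version of LayoutLM.

The previous version combines image embeddings only at the fine-tuning stage. LayoutLMv2 instead integrates image information during pre-training, using the Transformer architecture to learn the cross-modality interaction between visual and textual information.

It uses 1D and 2D relative positional biases.

Two new pre-training strategies, text-image matching and text-image alignment, are added to enforce alignment across the different modalities.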

Proposed Method

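An enhanced multi-modal Transformer architecture for VrDU tasks serves as the backbone of LayoutLMv2.

The multi-modal Transformer accepts three inputs: text, image, and layout.

The input of each modality is converted into an embedding sequence, and the sequences are fused by the encoder.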

Text Embedding

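Off-the-shelf OCR tools and PDF parsers are used to recognize the text and serialize it in a reasonable reading order.

Following common practice, the text sequence is tokenized with WordPiece, and each token is assigned to a segment.

[CLS] is added at the beginning of the token sequence and [SEP] at the end.

The text sequence length is capped at the maximum sequence length L. Sequences shorter than L are filled with [PAD] tokens.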
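
The sequence construction above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `build_text_sequence` is hypothetical, and truncating over-long input is an assumption about how the length cap is enforced.

```python
def build_text_sequence(tokens, max_len):
    """Wrap WordPiece tokens with [CLS]/[SEP], then pad with [PAD] up to max_len.

    Assumption: over-long input is simply truncated so the result fits max_len.
    """
    body = tokens[: max_len - 2]              # leave room for [CLS] and [SEP]
    seq = ["[CLS]"] + body + ["[SEP]"]
    seq += ["[PAD]"] * (max_len - len(seq))   # no-op when the sequence is already full
    return seq


short = build_text_sequence(["in", "##voice"], max_len=6)
long = build_text_sequence(list("abcdefgh"), max_len=6)
```

Here `short` is padded with two [PAD] tokens, while `long` keeps only its first four tokens between [CLS] and [SEP].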

The final text embedding is the sum of three embeddings:

The token embedding represents the token itself.
The 1D positional embedding represents the token index.
The segment embedding distinguishes different text segments.
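
As a rough sketch of this sum, the snippet below looks up three toy embedding tables and adds them element-wise. The table sizes, the random initialization, and the names `make_table` and `text_embedding` are illustrative assumptions; a real model would use learned embedding layers.

```python
import random


def make_table(rows, dim, seed):
    """Build a toy embedding table (rows x dim) filled with random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


def text_embedding(token_ids, segment_ids, tok_table, pos_table, seg_table):
    """Per token: token embedding + 1D positional embedding + segment embedding."""
    out = []
    for i, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        out.append([t + p + s for t, p, s in
                    zip(tok_table[tok], pos_table[i], seg_table[seg])])
    return out


DIM = 4
tok_table = make_table(10, DIM, seed=0)   # hypothetical vocabulary of 10 ids
pos_table = make_table(8, DIM, seed=1)    # up to 8 positions
seg_table = make_table(2, DIM, seed=2)    # two segments

token_ids = [1, 5, 7, 2, 0, 0]            # hypothetical ids; 0 plays the role of [PAD]
segment_ids = [0, 0, 0, 0, 0, 0]
emb = text_embedding(token_ids, segment_ids, tok_table, pos_table, seg_table)
```

The two padding positions share a token id but receive different vectors, because the positional term differs.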

Visual Embedding

The ResNeXt-FPN architecture (Xie et al., 2016; Lin et al., 2017) is used as the backbone of the visual encoder.

Given a document page image I, it is resized to 224 × 224 and fed into the backbone.

The output feature map is then average-pooled to a fixed size (W × H).

It is flattened into a visual embedding sequence of length W × H.

A linear projection layer is applied to each visual token embedding to match the dimensionality of the text embeddings.

Because the CNN-based visual backbone cannot capture positional information, a 1D positional embedding is added to the image token embeddings.

This 1D positional embedding is shared with the text embedding layer.

All visual tokens are assigned to the visual segment [C].
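
The pooling, flattening, and projection steps can be sketched on a tiny feature map. This is a simplified illustration under stated assumptions: the feature map is a nested list of shape (H, W, C), its size divides evenly by the target grid, and the helper names and the projection weights are made up for the example. The 1D positional and segment embeddings would be added afterwards.

```python
def average_pool(fmap, out_h, out_w):
    """Average-pool an (H, W, C) feature map to (out_h, out_w, C).

    Assumes H and W are divisible by out_h and out_w.
    """
    in_h, in_w, ch = len(fmap), len(fmap[0]), len(fmap[0][0])
    kh, kw = in_h // out_h, in_w // out_w
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            cell = [fmap[r * kh + i][c * kw + j]
                    for i in range(kh) for j in range(kw)]
            row.append([sum(v[k] for v in cell) / len(cell) for k in range(ch)])
        pooled.append(row)
    return pooled


def flatten(pooled):
    """Flatten the pooled grid row by row into a sequence of W x H vectors."""
    return [vec for row in pooled for vec in row]


def project(vec, weight):
    """Linear projection without bias; weight has shape (d, C)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


# Toy 4 x 4 feature map with 2 channels: channel 0 = row index, channel 1 = column index.
fmap = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # hypothetical projection to d = 3

tokens = [project(v, weight) for v in flatten(average_pool(fmap, 2, 2))]
```

With a 2 × 2 target grid this yields four visual tokens of dimension 3.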

Layout Embedding

All coordinates are normalized to integers in the range [0, 1000].

Two embedding layers embed the x-axis and y-axis features separately.

Given the normalized bbox of the i-th text/visual token, (x0, x1, y0, y1, w, h), the layout embedding layer concatenates the six bbox features to build the 2D positional embedding.

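Because a CNN performs local transformations, the visual token embeddings can be mapped back to image regions one by one.

From the layout embedding layer's perspective, the visual tokens can be treated as evenly divided grids, so their bbox coordinates are easy to compute.

An empty bounding box, box_PAD = (0, 0, 0, 0, 0, 0), is attached to the special tokens [CLS], [SEP], and [PAD].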
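
The coordinate handling described above can be sketched as follows. The helper names are hypothetical, the input box format (x0, y0, x1, y1) in pixels is an assumption, and flooring to an integer is one possible rounding choice.

```python
def normalize_bbox(box, page_w, page_h):
    """Scale a pixel box (x0, y0, x1, y1) to integers in [0, 1000]."""
    x0, y0, x1, y1 = box
    return (int(1000 * x0 / page_w), int(1000 * y0 / page_h),
            int(1000 * x1 / page_w), int(1000 * y1 / page_h))


def bbox_features(nbox):
    """Return the six layout features (x0, x1, y0, y1, w, h) of a normalized box."""
    x0, y0, x1, y1 = nbox
    return (x0, x1, y0, y1, x1 - x0, y1 - y0)


def grid_bboxes(grid_w, grid_h):
    """Boxes for visual tokens, treating them as an evenly divided grid (row-major)."""
    boxes = []
    for r in range(grid_h):
        for c in range(grid_w):
            boxes.append((1000 * c // grid_w, 1000 * r // grid_h,
                          1000 * (c + 1) // grid_w, 1000 * (r + 1) // grid_h))
    return boxes


BOX_PAD = (0, 0, 0, 0, 0, 0)    # features used for [CLS], [SEP], and [PAD]

word_box = normalize_bbox((100, 50, 300, 100), page_w=1000, page_h=500)
word_feats = bbox_features(word_box)
visual_boxes = grid_bboxes(7, 7)
```

Each of the six features would then index the x-axis or y-axis embedding layer, and the looked-up vectors would be concatenated.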

Multi-modal Encoder with Spatial-Aware Self-Attention Mechanism

The encoder concatenates the visual embeddings and text embeddings into a unified sequence.

Layout embeddings are then added to this sequence to fuse in spatial information.

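The original self-attention mechanism can only implicitly capture the relationship between input tokens, relying on their absolute position information.

To model local invariance in document layout efficiently, relative position information needs to be supplied explicitly.

A spatial-aware self-attention mechanism is therefore introduced into the self-attention layers.

The original self-attention mechanism captures the correlation between query and key by projecting the two vectors and computing their scaled dot product.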

Spatial relative positions are modeled as learnable bias terms and added to the attention score.

The anchor of the i-th bounding box is taken to be its top-left corner coordinates.

Finally, each output vector is the weighted average of all projected value vectors, weighted by the normalized spatial-aware attention scores.
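
A single-head sketch of this computation is shown below. It is a simplification under several assumptions: q, k, and v are already projected, the biases are plain dictionaries keyed by raw relative offset (in a trained model they would be learned parameters), and a missing offset contributes zero. All names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_aware_attention(q, k, v, pos, xy, bias_1d, bias_x, bias_y):
    """Scaled dot-product attention plus 1D and 2D relative position biases.

    pos: 1D index of each token; xy: top-left (x, y) anchor of each token's bbox.
    bias_*: dict mapping a relative offset to a scalar bias (assumed 0.0 if absent).
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        scores = []
        for j in range(len(k)):
            s = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            s += bias_1d.get(pos[j] - pos[i], 0.0)
            s += bias_x.get(xy[j][0] - xy[i][0], 0.0)
            s += bias_y.get(xy[j][1] - xy[i][1], 0.0)
            scores.append(s)
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out


q = [[0.0, 0.0], [0.0, 0.0]]        # zero queries make the dot-product term vanish
k = [[1.0, 2.0], [3.0, 4.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
pos = [0, 1]
xy = [(10, 10), (10, 10)]

plain = spatial_aware_attention(q, k, v, pos, xy, {}, {}, {})
biased = spatial_aware_attention(q, k, v, pos, xy, {1: 50.0}, {}, {})
```

With no biases both tokens attend uniformly. Adding a large bias for the 1D offset +1 makes the first token attend almost entirely to the second, which shows how the bias term steers attention independently of content.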

PRE-TRAINING

Masked Visual-Language Modeling

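Some text tokens are randomly masked, and the model is trained to recover the masked tokens.
The layout information is left unchanged, so the model knows the positions of the masked tokens.
Before the image is fed into the visual encoder, the regions of the original image that correspond to the masked tokens are masked out as well.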

Text-Image Alignment

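Some text tokens are randomly selected, and their image regions are covered on the document image.
During pre-training, a classification layer is built on top of the encoder outputs.
This layer predicts, for each token, a label indicating whether it is covered.
When MVLM and TIA are performed simultaneously, the tokens masked in MVLM are excluded from the TIA loss.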
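
How the two objectives interact can be sketched by building both label sets in one pass. The function name, the label strings, and the probabilities are illustrative placeholders, not values from the source; `None` marks a position that is ignored by the corresponding loss.

```python
import random


def make_pretraining_labels(tokens, mask_prob, cover_prob, seed=0):
    """Build MVLM and TIA labels; MVLM-masked tokens are excluded from TIA."""
    rng = random.Random(seed)
    inputs, mvlm_labels, tia_labels = [], [], []
    for tok in tokens:
        masked = rng.random() < mask_prob
        covered = rng.random() < cover_prob
        inputs.append("[MASK]" if masked else tok)
        # MVLM: only masked positions carry a target (the original token).
        mvlm_labels.append(tok if masked else None)
        # TIA: masked positions are ignored; others are labeled by cover status.
        if masked:
            tia_labels.append(None)
        else:
            tia_labels.append("covered" if covered else "not covered")
    return inputs, mvlm_labels, tia_labels


tokens = ["total", "amount", "due", ":", "42", ".", "00"]
inputs, mvlm_labels, tia_labels = make_pretraining_labels(
    tokens, mask_prob=0.3, cover_prob=0.3)
```

Every position contributes to exactly one of the two losses: a masked token has an MVLM target and no TIA label, and an unmasked token has a TIA label and no MVLM target.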

Text-Image Matching

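The output at the [CLS] token is fed into a classifier that predicts whether the image and the text come from the same document page.
Regular inputs are positive samples. To build a negative sample, the image is either replaced with an image from another document or dropped (zero padding?).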
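
The sampling scheme can be sketched as below. The function name, the negative-sampling probability, and the even split between replacing and dropping are assumptions for illustration. A dropped image is represented by `None`, which leaves open how the drop is realized in practice.

```python
import random


def make_tim_sample(text, image, other_images, neg_prob, rng):
    """Return (text, image_or_None, label); label 1 = matched, 0 = not matched."""
    if rng.random() >= neg_prob:
        return text, image, 1                       # positive: the regular input
    if other_images and rng.random() < 0.5:
        return text, rng.choice(other_images), 0    # negative: image from another document
    return text, None, 0                            # negative: image dropped


rng = random.Random(0)
positive = make_tim_sample("page text", "page_a", ["page_b", "page_c"], 0.0, rng)
negative = make_tim_sample("page text", "page_a", ["page_b", "page_c"], 1.0, rng)
```

Setting `neg_prob` to 0.0 or 1.0 forces a positive or negative sample, which is convenient for checking the labels.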

Experiment

Settings

LayoutLMv2 BASE
d = 768
12-layer, 12-head Transformer encoder
ResNeXt101-FPN
number of parameters: 200M

LayoutLMv2 LARGE
d = 1024
24-layer, 16-head Transformer encoder
ResNeXt101-FPN
number of parameters: 426M

Other settings

maximum sequence length L = 512
The output shape of the adaptive pooling layer is 7 × 7, so the feature map yields 49 image tokens.

For the text embedding layer and the encoder, LayoutLMv2 uses the same architecture as UniLMv2 (Bao et al., 2020).

Results

Entity-level precision, recall, F1 score

Classification accuracy

ANLS score

ABLATION STUDY
