ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents

Nov 18, 2021

Contents

Introduction
Proposed Method
Experiment
Conclusions

Introduction

Most document understanding algorithms use a transformer or a GNN as their base network. Each word to be classified is treated as one token or several tokens and fed to the transformer or GNN as input. Rather than building on a transformer or GNN, this paper approaches document understanding with a CNN.

Related Works

Two methods are closely related to this paper: CharGrid and BertGrid. Both apply CNNs to document understanding. An ordinary CNN takes only an image as input, so it is natural to assume that these methods take only the document image. In fact, their input is a tensor formed by concatenating a particular feature map to the document image.

CharGrid adds embedded character-level detection results to the image data. An embedding vector is defined for every character in the character set, and a feature map is built so that the region occupied by a character (its bounding box) holds that character's embedding vector. The input constructed this way is passed through a CNN to perform instance segmentation. In other words, document understanding is recast as the task of segmenting the regions that belong to each class.
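The input construction described above can be sketched in a few lines. This is a minimal illustration rather than the CharGrid implementation: the function names, the random embedding table (a stand-in for learned embeddings), and the box format `(x0, y0, x1, y1)` with exclusive upper bounds are all assumptions, and plain lists are used to keep the example dependency-free.

```python
import random


def build_char_grid(height, width, char_boxes, embeddings, dim):
    """Paint each character's embedding into the pixels covered by its box.

    char_boxes: list of (char, (x0, y0, x1, y1)), with x1 and y1 exclusive.
    embeddings: dict mapping a character to a list of `dim` floats.
    Pixels not covered by any box keep an all-zero vector.
    """
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for char, (x0, y0, x1, y1) in char_boxes:
        vector = embeddings[char]
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = list(vector)
    return grid


def concat_channels(image, grid):
    """Concatenate image channels and grid channels pixel by pixel."""
    return [
        [image[y][x] + grid[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]


# Toy example: a 3x4 RGB image with two characters on the top two rows.
rng = random.Random(0)
dim = 4
embeddings = {c: [rng.uniform(-1, 1) for _ in range(dim)] for c in "AB"}
char_boxes = [("A", (0, 0, 2, 2)), ("B", (2, 0, 4, 2))]
image = [[[0.5, 0.5, 0.5] for _ in range(4)] for _ in range(3)]

grid = build_char_grid(3, 4, char_boxes, embeddings, dim)
cnn_input = concat_channels(image, grid)  # shape: 3 x 4 x (3 + dim)
```

Each pixel of `cnn_input` carries the three image channels followed by the embedding of the character covering it, or zeros where there is no text.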

BertGrid differs only in that, instead of character-level embeddings, it uses BERT to extract token features and builds the input feature map from them; everything else is the same as CharGrid.

Proposed Method

The paper argues for why CNN-based methods have performed worse than transformer- or GNN-based ones. The first reason is that simple methods, rather than BERT, are used for text embedding. The second is that, even when BERT is used, its weights are frozen at the pretrained values.

ViBERTgrid consists of three stages:

Multi-modal backbone

Segmentation head

Word-level field classification head

Multi-modal backbone

The multi-modal backbone takes the image as input, and a feature map built from text features is injected at an intermediate stage. The CNN is an FPN based on ResNet18. The kind of text feature map used in CharGrid and BertGrid is fed into an intermediate stage of the FPN. CharGrid builds its feature map from character embeddings, while BertGrid obtains token embeddings from BERT and builds its feature map from them. Here, text features are extracted with BERT as in BertGrid, but at the word level rather than the token level.

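$D = (w^{(1)}, \dots, w^{(N)})$: the words arranged into a sequence by reading them in a top-left to bottom-right order ($D$: document)

$T = (t^{(1)}, \dots, t^{(M)})$: tokenizing $D$ into a sub-word token sequence of length $M$

$e(t^{(i)})$: embedding of each token with a BERT encoder

$E(w^{(i)})$: word embedding of $w^{(i)}$, obtained by averaging the embeddings of its tokens

Feature map $G$: $G_{x,y,:} = E(w^{(i)})$ if pixel $(x, y)$ lies inside $B^{(i)}$, and $\mathbf{0}$ otherwise

$B^{(i)}$: bounding box of each word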
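The steps above can be sketched as follows. This is a simplified illustration, not the authors' code: the token embeddings are hard-coded stand-ins for BERT outputs, `word_ids` (the index of the word each token belongs to) is an assumed input, every word is assumed to have at least one token, and boxes use the hypothetical format `(x0, y0, x1, y1)` with exclusive upper bounds.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def word_embeddings_from_tokens(token_embeddings, word_ids, num_words):
    """Average sub-word token embeddings per word.

    word_ids[i] is the index of the word that token i belongs to.
    """
    per_word = [[] for _ in range(num_words)]
    for embedding, word_id in zip(token_embeddings, word_ids):
        per_word[word_id].append(embedding)
    return [average_vectors(tokens) for tokens in per_word]


def build_word_grid(height, width, boxes, word_embeddings):
    """Fill each word's bounding box with its embedding; zeros elsewhere."""
    dim = len(word_embeddings[0])
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x0, y0, x1, y1), vector in zip(boxes, word_embeddings):
        for y in range(y0, y1):
            for x in range(x0, x1):
                grid[y][x] = list(vector)
    return grid


# Toy example: two words; the first is split into two sub-word tokens.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
word_ids = [0, 0, 1]
boxes = [(0, 0, 2, 1), (2, 0, 3, 1)]

words = word_embeddings_from_tokens(token_embeddings, word_ids, num_words=2)
grid = build_word_grid(2, 3, boxes, words)
```

Because every pixel of a word's box holds the same averaged vector, the grid keeps the 2D layout of the page while carrying word-level text features.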

Segmentation head

This head is used only during training and performs pixel-level segmentation of the document. Looking a little closer, it ends in two branches.

One branch performs binary segmentation, deciding whether each pixel belongs to a key-value region we want to identify.

The other branch performs pixel-level segmentation that also predicts the class.

Overall loss function
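As a rough sketch of how the two branches could be trained together, the snippet below computes a per-pixel binary loss plus a per-pixel multi-class loss. Several things here are assumptions for illustration only: the two losses are combined as an unweighted sum, the class branch is evaluated on every pixel with label 0 meaning background, and pixels are flattened into plain lists.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(prob, target):
    """Binary cross-entropy for one predicted probability."""
    return -(target * math.log(prob + EPS)
             + (1.0 - target) * math.log(1.0 - prob + EPS))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(logits, target_class):
    """Multi-class cross-entropy for one pixel."""
    return -math.log(softmax(logits)[target_class] + EPS)


def segmentation_loss(field_probs, class_logits, class_labels):
    """Mean over pixels of (binary branch loss + class branch loss).

    field_probs[p]:  predicted probability that pixel p lies in a field.
    class_logits[p]: class scores for pixel p (index 0 = background).
    class_labels[p]: ground-truth class of pixel p (0 = background).
    """
    total = 0.0
    for prob, logits, label in zip(field_probs, class_logits, class_labels):
        is_field = 1.0 if label > 0 else 0.0
        total += binary_cross_entropy(prob, is_field)
        total += cross_entropy(logits, label)
    return total / len(class_labels)


# Two pixels: the first belongs to class 1, the second is background.
good = segmentation_loss([0.99, 0.01], [[0.0, 10.0], [10.0, 0.0]], [1, 0])
bad = segmentation_loss([0.01, 0.99], [[10.0, 0.0], [0.0, 10.0]], [1, 0])
```

Predictions that agree with the labels give a loss near zero (`good`), while swapped predictions give a much larger value (`bad`).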

Word-level field classification head

This head determines which class each word belongs to. Unlike the segmentation head, which is used only in training and not at inference, this head is part of the inference pipeline.

The features for each word region are extracted from the full feature map with RoI Align. The word's embedding vector is then added to the extracted features, and the resulting vector is used for classification.
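The flow of this head can be illustrated with a small sketch. It is deliberately simplified and not the paper's implementation: mean pooling over the box is used as a crude stand-in for RoI Align (which samples with bilinear interpolation), the fusion is an element-wise addition that assumes both vectors have the same size, and the single linear layer with hand-picked weights is hypothetical.

```python
def mean_pool_box(feature_map, box):
    """Average the feature vectors inside box (x0, y0, x1, y1), exclusive ends.

    Stand-in for RoI Align, used here only to keep the example short.
    """
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return [sum(values) / len(cells) for values in zip(*cells)]


def linear(vector, weights, bias):
    """Apply a dense layer: one output score per row of `weights`."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def classify_word(feature_map, box, word_embedding, weights, bias):
    """Pool the word region, add the word embedding, and pick a class."""
    pooled = mean_pool_box(feature_map, box)
    fused = [p + e for p, e in zip(pooled, word_embedding)]
    scores = linear(fused, weights, bias)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2x2 feature map with 2 channels; identity weights map channels to classes.
feature_map = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

top_word = classify_word(feature_map, (0, 0, 2, 1), [0.5, 0.0], weights, bias)
bottom_word = classify_word(feature_map, (0, 1, 2, 2), [0.0, 0.5], weights, bias)
```

Adding the word embedding after pooling gives the classifier direct access to the text features of the word, in addition to the visual and layout context pooled from the feature map.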

During training, this classification head also learns from two losses through two branches.

A binary classification loss for whether the word is a key-value we want to identify

An n-way classification loss for the class label we want to identify

Overall loss function

Experiment

Dataset

SROIE

Receipts dataset
626 for training and 347 for testing
4 classes : company, date, address, total
Use GT bounding boxes and transcription for training

INVOICE

authors' in-house dataset
24,175 for training and 643 for testing
14 classes : CustomerAddress, TotalAmount, DueDate, PONumber, Subtotal, BillingAddress, CustomerName, InvoiceDate, TotalTax, VendorName, InvoiceNumber, CustomerID, VendorAddress and ShippingAddress
Use MS Azure READ API for training

Implementation Details

Note that for many document pages in INVOICE dataset, the number of input sequences can be very large, which could lead to the out-of-memory issue due to the limited memory of GPU. To solve this problem, for each document page, we just randomly select at most L (L=10) sequences to propagate gradients so that the memory of intermediate layers belonging to the other sequences can be directly released after acquiring output token embeddings.
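The selection step in the quoted passage can be sketched as below. The function name and return format are hypothetical. In an actual training loop, the sequences in the second list would be encoded with gradient tracking disabled (for example inside `torch.no_grad()` if PyTorch is used, which is an assumption here), so their intermediate activations do not need to be kept.

```python
import random


def select_gradient_sequences(num_sequences, max_with_grad=10, rng=None):
    """Split sequence indices into those that receive gradients and the rest.

    At most `max_with_grad` indices are chosen uniformly at random; all other
    sequences are meant to be encoded without storing activations.
    """
    rng = rng or random.Random()
    indices = list(range(num_sequences))
    if num_sequences <= max_with_grad:
        return indices, []
    with_grad = sorted(rng.sample(indices, max_with_grad))
    chosen = set(with_grad)
    without_grad = [i for i in indices if i not in chosen]
    return with_grad, without_grad


# A page with 25 input sequences: 10 get gradients, 15 do not.
with_grad, without_grad = select_gradient_sequences(25, 10, random.Random(0))
```

Because the selection is redrawn for every page and every iteration, each sequence still contributes gradients over the course of training, while the peak memory per page stays bounded by L.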

Experiments

Drawback of LayoutLM: it needs domain-specific pretraining

SROIE

INVOICE

Ablation studies

Effectiveness of joint training

Effectiveness of multi-modality features

Effectiveness of the CNN module