LayoutLM: Pre-training of text and layout for document image understanding

Oct 29, 2021

Contents

Introduction Proposed Method Experiment Conclusions

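This post reviews a paper related to Document VQA. Answering questions about a document image requires understanding both the words in the image and the meaning carried by their layout. LayoutLM is the model used by the team that placed third in DocVQA 2020, hosted by ICDAR, and that team currently still ranks third on the leaderboard.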

https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1

https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3

In addition, it ranks second on SROIE 2019 (first at the time of submission) and first on tasks such as document image classification.

Introduction

LayoutLM is a model that extracts useful information from scanned document images by using both text and layout information. Many earlier methods also tried to extract information from document images.

Approaches of document information extraction

1) A table detection method for pdf documents based on convolutional neural networks(2016)

Performs table detection on a dataset converted from PDF documents.

2) Later, models built on Faster R-CNN, Mask R-CNN, and similar detectors appeared

DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images (2017)

etc...

3) Fast CNN-based document layout analysis (2017 ICCV)

Uses an end-to-end multimodal approach that extracts text embeddings from a pretrained NLP model and extracts image and layout features with a convolutional network.

4) Graph Convolution for Multimodal Information Extraction from Visually Rich Documents (NAACL 2019)

Combines text and visual information using a graph convolutional network.

These earlier approaches share two problems.

1) They depend on human-labeled training sets and cannot exploit large amounts of unlabeled data.

2) They use pretrained CV and NLP models but do not consider joint pre-training of text and layout.

To address these problems, LayoutLM makes the following contributions.

1) It is the first to use textual and layout information extracted from documents for pre-training, and it achieves SOTA results.

2) It applies a pre-training scheme that uses a masked visual-language model and multi-label document classification as training objectives.

Proposed Method

Model Architecture

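The overall structure of LayoutLM is shown in the figure above. Beyond the text embedding, which is standard and not described here, LayoutLM takes two additional kinds of input as rich representation embeddings.

1) Document Layout Information: a 2-D position embedding that represents the relative position within the document

Unlike the 1-D embedding of a word's position within a sequence, the 2-D position embedding encodes the relative position of a token on the page.

The top-left corner of the document is the origin, and a bounding box is written as (x0, y0, x1, y1). There are four embedding layers, one per coordinate, backed by two embedding tables: x0 and x1 share one table, and y0 and y1 share the other.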
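
The lookup with two shared tables can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the tiny hidden size, the random initialization, the 0-1000 coordinate range, and the choice to sum the four vectors are all assumptions made for the example.

```python
import random

MAX_COORD = 1000  # assumed coordinate range: 0..1000 inclusive
HIDDEN = 8        # tiny hidden size, for illustration only


def make_table(num_rows, dim, seed):
    """Build a randomly initialized embedding table (one row per coordinate)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


# Two tables only: one for the x axis and one for the y axis.
x_table = make_table(MAX_COORD + 1, HIDDEN, seed=0)
y_table = make_table(MAX_COORD + 1, HIDDEN, seed=1)


def layout_embedding(bbox):
    """Look up x0 and x1 in the x table, y0 and y1 in the y table, then sum.

    Summing the four vectors is an assumption of this sketch.
    """
    x0, y0, x1, y1 = bbox
    rows = [x_table[x0], y_table[y0], x_table[x1], y_table[y1]]
    return [sum(values) for values in zip(*rows)]


emb = layout_embedding((348, 805, 402, 832))
print(len(emb))
```

Because x0 and x1 read from the same table, a given horizontal position maps to the same vector whether it is the left or the right edge of a box.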

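2) Visual Information: an image embedding that carries information such as font and color

For each bbox region given by the OCR result, the image feature from Faster R-CNN is used.

For the [CLS] token, a feature extracted from the whole scanned image is used so that downstream tasks can rely on it.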

Pre-training LayoutLM

Task #1: Masked Visual-Language Model

Inspired by the masked language model, the Masked Visual-Language Model (MVLM) learns language representations using the 2-D position embeddings and text embeddings as clues. During pre-training, some input tokens are masked at random while their 2-D position embeddings and the other text embeddings are kept, and the model predicts the masked tokens. In this way LayoutLM not only understands the language context but also uses the 2-D position information to learn the correlation between the visual and language modalities.

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

Task #2: Multi-label Document Classification

Many document image understanding tasks need high-quality document-level representations. The IIT-CDIP Test Collection provides multiple tags for each document image, so a multi-label document classification (MDC) loss is used during pre-training. Given a scanned document, the model is trained in a supervised way to predict the document's tags.

* The experiments compare pre-training with MVLM only against pre-training with MVLM+MDC.
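
A multi-label loss of this kind is often written as a sigmoid per tag followed by binary cross-entropy. The sketch below assumes that formulation; the post does not state the exact form of the MDC loss, and the tag names in the comment are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy, treating every tag as an independent yes/no label."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(logits)


# Hypothetical tags for one document: [letter, invoice, handwritten]
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]
loss = multilabel_bce(logits, targets)
print(loss)
```

Unlike a softmax over classes, each tag gets its own probability, so a document can carry several tags at once.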

Fine-tuning LayoutLM

The pre-trained LayoutLM model is fine-tuned on three document image understanding tasks.

1) form understanding task

2) receipt understanding task

Predicts tags such as {B, I, E, S, O}

3) document image classification task

Predicts the class label from the representation of the [CLS] token
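
To make the {B, I, E, S, O} tags concrete, here is a small sketch that converts entity spans into per-token tags. It assumes the usual reading of the scheme (Begin, Inside, End, Single, Outside); the function name, the token list, and the spans are made up for illustration.

```python
def bieso_tags(num_tokens, spans):
    """Turn entity spans into B/I/E/S/O tags.

    spans: list of (start, end) token index ranges, end exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"          # single-token entity
        else:
            tags[start] = "B"          # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"          # tokens strictly inside
            tags[end - 1] = "E"        # last token of the entity
    return tags


tokens = ["Date", ":", "29", "Oct", "2021", "Total", "42"]
tags = bieso_tags(len(tokens), [(2, 5), (6, 7)])
print(list(zip(tokens, tags)))
# tags == ['O', 'O', 'B', 'I', 'E', 'O', 'S']
```

In practice each tag is usually combined with an entity type, but the position part of the tag works as shown.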

Experiment

Pre-training dataset

The model is pre-trained on IIT-CDIP Test Collection 1.0, which contains 6 million documents with 11 million scanned document images. Each document comes with an XML file holding its text and metadata. The text content was produced by applying OCR to the document images.

Fine-tuning dataset

1) The FUNSD Dataset

FUNSD stands for Form Understanding in Noisy Scanned Documents.

A dataset of 199 images

149 images for training

50 images for testing

- The semantic labels include question, answer, header, and so on.

2) The SROIE Dataset

The model is evaluated on receipt information extraction (task 3). This receipt dataset consists of 626 training images and 347 test images. Each receipt ...

3) The RVL-CDIP Dataset

A subset of the IIT-CDIP Test Collection

It consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. The split is 320,000 training images, 40,000 validation images, and 40,000 test images.

The largest dimension of each image does not exceed 1000 pixels. The 16 classes are as follows.


- letter
- form
- email
- handwritten
- advertisement
- scientific report
- scientific publication
- specification
- file folder
- news article
- budget
- invoice
- presentation
- questionnaire
- resume
- memo

https://www.cs.cmu.edu/~aharley/rvl-cdip/

Document pre-processing

To use each document's layout information, the position of every token is needed. However, IIT-CDIP consists of pure text without bbox information. To obtain the missing information, Tesseract OCR was applied to the document images, and the extracted OCR output was saved in hOCR format.


# Sample of hOCR
...
<p class='ocr_par' lang='deu' title="bbox930">
  <span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
    <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span> 
    <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span> 
    <span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span> 
    <span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span> 
    <span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span> 
    <span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span> 
    <span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span> 
    <span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span> 
    <span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span> 
  </span>
  ...

https://en.wikipedia.org/wiki/HOCR
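
Reading words and their boxes back out of hOCR could look like the sketch below. It is a rough illustration under stated assumptions: the regular expression only matches the exact quoting and attribute order used in the sample above, and the helper name `parse_hocr_words` is made up. A robust reader would use a real HTML parser instead.

```python
import re

# A shortened copy of the sample, with only well-formed word entries.
HOCR_SAMPLE = """
<span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
  <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
  <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
  <span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span>
</span>
"""

# Matches one ocrx_word span: four bbox integers, a confidence, and the word text.
WORD_RE = re.compile(
    r"<span class='ocrx_word' title='bbox (\d+) (\d+) (\d+) (\d+); "
    r"x_wconf (\d+)'>([^<]*)</span>"
)


def parse_hocr_words(hocr):
    """Return a list of dicts with the text, bbox, and confidence of each word."""
    words = []
    for x0, y0, x1, y1, conf, text in WORD_RE.findall(hocr):
        words.append(
            {
                "text": text,
                "bbox": (int(x0), int(y0), int(x1), int(y1)),
                "conf": int(conf),
            }
        )
    return words


words = parse_hocr_words(HOCR_SAMPLE)
for word in words:
    print(word["text"], word["bbox"], word["conf"])
```

Entries whose bbox is incomplete simply fail to match and are skipped, which is one reasonable way to treat damaged OCR output.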

Model Pre-training

LayoutLM's initial weights come from BERT. The architecture is also the same: a 12-layer transformer with a hidden size of 768 and 12 attention heads.

The LARGE setting uses 24 layers, a hidden size of 1024, and 16 attention heads, and is initialized from the BERT LARGE model.

The pre-training task was carried out as follows.

15% of the tokens are selected for prediction.

Of the selected tokens, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and 10% are left unchanged.

Token prediction is trained with a cross-entropy loss.
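
The 15% selection and the 80/10/10 replacement rule can be sketched as follows. This is a simplified illustration, not the authors' code: it works on whole words rather than subword tokens, ignores special tokens, and the function name and arguments are assumptions. The bounding boxes are returned untouched, which is the point of MVLM.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, bboxes, vocab, rng, select_prob=0.15):
    """Apply BERT-style masking to tokens while keeping every bbox as it is.

    Returns (masked_tokens, bboxes, labels); labels[i] is the original token
    for selected positions and None elsewhere.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                      # not selected for prediction
        labels[i] = token
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, list(bboxes), labels


rng = random.Random(0)
tokens = ["Die", "Darlehenssumme", "ist", "in", "ihrem"]
bboxes = [
    (348, 805, 402, 832),
    (421, 804, 697, 832),
    (717, 803, 755, 831),
    (773, 803, 802, 831),
    (821, 803, 917, 830),
]
masked, kept_bboxes, labels = mask_tokens(tokens, bboxes, vocab=tokens, rng=rng)
print(masked, labels)
```

The loss would then be computed only at positions where `labels` is not None.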

On top of this, the 2-D position embedding layers are added to obtain the four position embedding representations. The actual pixel coordinates are not used directly; they are scaled to virtual coordinates in the range 0 to 1000.
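
Scaling pixel coordinates into a fixed 0-1000 range can be sketched like this. The function name, the integer rounding, and the page size in the example are assumptions for illustration.

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Scale a pixel bbox (x0, y0, x1, y1) into the range 0..scale.

    Integer floor division keeps the result usable as an embedding index.
    """
    x0, y0, x1, y1 = bbox
    return (
        scale * x0 // page_width,
        scale * y0 // page_height,
        scale * x1 // page_width,
        scale * y1 // page_height,
    )


# Hypothetical page size in pixels.
box = normalize_bbox((100, 200, 300, 400), page_width=1000, page_height=2000)
print(box)  # (100, 100, 300, 200)
```

With this scaling, pages of different resolutions share the same coordinate vocabulary, so one pair of embedding tables can serve all documents.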

The Faster R-CNN model used to extract visual features has a ResNet-101 backbone and was pre-trained on the Visual Genome dataset.

Task specific fine-tuning

1) Form Understanding

This task extracts key-value pairs from scanned images. It has two subtasks: semantic labeling and semantic linking.

Semantic labeling groups words into semantic entities and assigns each entity a pre-defined label.

Semantic linking predicts the relations between these semantic entities.

Here LayoutLM is fine-tuned with a focus on semantic labeling, which is treated as a sequence labeling problem. The final representation (the feature produced by LayoutLM) is passed through a linear layer and a softmax to predict the label of each token.
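
The linear layer plus softmax on top of each token's feature can be sketched in plain Python. The weights, the three-dimensional features, and the label list below are toy values chosen for illustration, not trained parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(features, weights, bias, labels):
    """Apply one linear layer and a softmax to every token feature vector.

    weights has one row per label; returns (label, probability) per token.
    """
    predictions = []
    for vec in features:
        logits = [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)
        ]
        probs = softmax(logits)
        best = max(range(len(labels)), key=lambda k: probs[k])
        predictions.append((labels[best], probs[best]))
    return predictions


LABELS = ["question", "answer", "header"]
WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy values
BIAS = [0.0, 0.0, 0.0]

token_features = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]  # stand-ins for LayoutLM outputs
preds = classify_tokens(token_features, WEIGHTS, BIAS, LABELS)
print(preds)
```

During fine-tuning, the cross-entropy between these probabilities and the gold labels would be minimized for both the linear layer and the LayoutLM weights.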

Layout pre-training performs better than semantic labeling with text alone, and adding the other representations improves the results further.

2) Receipt Understanding

Given a receipt, the model must extract information such as the company, address, date, and total. These are treated as pre-defined keys, and a sequence labeling method is applied.

3) Document Image Classification

Given a document image, the model performs category classification. Unlike earlier image-based approaches, the text and layout representations are added to help tell the classes apart.

The image-based classification methods appear as the baselines. Text-only methods perform worse than the image-based methods, but applying LayoutLM gives higher performance.

Conclusions

The paper does not introduce a theoretically new technique, but it is valuable because it shows a range of experiments worth trying on document understanding tasks.