Introduction
Motivation
Previous multi-modality models did not account for the distinct nature of visual and language tokens, nor for cross-modality feature correlation, which makes it hard to learn the correspondence between text and its associated visual features.
Contributions
- Multi-modal framework fusing text, visual, and spatial features
- Pre-trained on several unsupervised tasks, including two new ones: learning to reconstruct and multi-modal masked language modeling
- DocFormer is end-to-end trainable
- DocFormer does not rely on a pre-trained object detection network
- DocFormer does not use a custom OCR engine, unlike some recent papers
Proposed Method
Core Idea
DocFormer proposes a novel framework for multi-modal training, which the authors call Discrete Multi-Modal.
While many previous works used only text or text + spatial features, this paper suggests utilizing all three modalities: text, spatial, and visual features.
- Visual features - ResNet-50, without transfer learning from an object-detection task
- Text features - OCR output run through a word-piece tokenizer (Wt initialized from LayoutLMv1)
- Spatial features - for each word, its bounding-box coordinates are encoded as trainable embeddings: Wx(x1, x3, w, Δx) + Wy(y1, y3, h, Δy) + Pabs (absolute 1D positional embedding)
- Separate trainable weights for visual and text spatial embeddings.
The multi-modal transformer encoder is then Fenc(θ, V, Vs, T, Ts), where V, T are the visual and text features, Vs, Ts their spatial embeddings, and θ the network parameters (a simplified sketch of assembling these inputs follows).
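As a rough illustration, here is a minimal PyTorch sketch of how the three input streams could be assembled. The class name, the pooled 8×64 visual grid, and the quantized coordinate range are assumptions made for readability, not details taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class DocFormerInputs(nn.Module):
    """Simplified sketch of building the three DocFormer input streams.

    Assumptions not taken from the paper: hidden size 768, 512 token positions,
    coordinates quantized to integers in [0, 1024), and an 8x64 pooled visual
    grid so the visual sequence matches the text sequence length.
    """

    def __init__(self, hidden=768, vocab=30522, max_tokens=512, max_coord=1024):
        super().__init__()
        # Visual branch: plain ResNet-50 backbone (no object-detection pre-training).
        backbone = torchvision.models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.visual_proj = nn.Linear(2048, hidden)

        # Text branch: word-piece token embeddings (could be initialized from LayoutLMv1).
        self.token_emb = nn.Embedding(vocab, hidden)

        # Spatial branch: x/y coordinate embeddings plus absolute 1D position.
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)
        self.abs_pos = nn.Embedding(max_tokens, hidden)
        self.max_tokens = max_tokens

    def forward(self, image, token_ids, boxes):
        # image: (B, 3, H, W); token_ids: (B, N); boxes: (B, N, 4) integer (x1, y1, x3, y3)
        feat = self.cnn(image)                                      # (B, 2048, h, w)
        feat = F.adaptive_avg_pool2d(feat, (8, self.max_tokens // 8))
        V = self.visual_proj(feat.flatten(2).transpose(1, 2))      # (B, max_tokens, hidden)

        T = self.token_emb(token_ids)                               # (B, N, hidden)

        pos = torch.arange(token_ids.shape[1], device=token_ids.device).unsqueeze(0)
        S = (self.x_emb(boxes[..., 0]) + self.x_emb(boxes[..., 2])
             + self.y_emb(boxes[..., 1]) + self.y_emb(boxes[..., 3])
             + self.abs_pos(pos))
        # The paper keeps separate trainable spatial weights for the visual and
        # text streams; a single spatial tensor is reused here for brevity.
        return V, T, S
```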
Pre-training is done using 3 different types of tasks:
- Multi-Modal Masked Language Modeling (MM-MLM) - a modification of MLM in which the corresponding visual features are not masked
- Learn To Reconstruct (LTR) - reconstruction of the document image from the features of all modalities, with an L1 loss
- Text Describes Image (TDI) - teaching the network to predict whether a text description matches the document image representation
Total loss looks like:
Lpt = λ*L_MM_MLM + β*L_LTR + γ*L_TDI. In this paper λ = 5, β = 1 and γ = 5.
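In code, the combined objective is simply a weighted sum of the three task losses. The sketch below assumes standard loss choices for MM-MLM (cross-entropy) and TDI (binary cross-entropy) on top of the L1 loss stated for LTR:

```python
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_targets, recon_image, image,
                     tdi_logits, tdi_labels, lam=5.0, beta=1.0, gamma=5.0):
    # MM-MLM: cross-entropy over the vocabulary (masked-only targets, -100 elsewhere).
    l_mm_mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_targets, ignore_index=-100)
    # LTR: L1 loss between the reconstructed and the original document image.
    l_ltr = F.l1_loss(recon_image, image)
    # TDI: binary prediction of whether the text describes the image.
    l_tdi = F.binary_cross_entropy_with_logits(tdi_logits, tdi_labels)
    # Weighted sum with the coefficients reported in the paper.
    return lam * l_mm_mlm + beta * l_ltr + gamma * l_tdi
```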
Pre-training is done for 5 epochs.
After pre-training, all task-specific heads are removed, a linear projection head is added, and the model is fine-tuned for downstream tasks (a sketch of this head swap follows).
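A minimal sketch of what this head swap could look like for a token-level downstream task (the class and argument names are hypothetical):

```python
import torch.nn as nn

class DocFormerForTokenClassification(nn.Module):
    """Sketch: the pre-trained encoder with its MM-MLM / LTR / TDI heads removed
    and a single linear projection head added for a downstream task."""

    def __init__(self, pretrained_encoder, hidden=768, num_labels=7):
        super().__init__()
        self.encoder = pretrained_encoder
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, *inputs):
        hidden_states = self.encoder(*inputs)   # (B, N, hidden)
        return self.classifier(hidden_states)   # per-token logits
```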
Multi-Modal Self Attention
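The central architectural piece is a self-attention layer that processes the text and visual streams jointly while sharing the spatial features. The sketch below is a heavy simplification assumed for illustration: it adds the spatial embeddings to the queries and keys of two separate attention branches and sums their outputs, and it omits the relative-position attention terms used in the actual layer:

```python
import torch.nn as nn

class MultiModalSelfAttentionLayer(nn.Module):
    """Heavily simplified sketch of DocFormer-style multi-modal self-attention.

    Assumption: text and visual streams each run standard multi-head attention,
    the shared spatial embeddings are injected into both streams, and the two
    outputs are summed into one fused representation.
    """

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, T, V, S):
        # T, V, S: (B, N, hidden) text, visual, and shared spatial features.
        t_q = T + S                               # inject spatial info into the text stream
        v_q = V + S                               # inject spatial info into the visual stream
        T_out, _ = self.text_attn(t_q, t_q, T)    # text attends over text
        V_out, _ = self.vis_attn(v_q, v_q, V)     # visual attends over visual
        return T_out + V_out                      # fused multi-modal representation
```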
Experimental Results
Two versions of DocFormer:
- base - hidden size 768, 12 attention heads
- large - hidden size 1024, 16 attention heads
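Restated as a hypothetical config dict, covering only the values listed above (layer counts and other settings omitted):

```python
# Hypothetical config sketch; only the numbers mentioned in the post.
DOCFORMER_CONFIGS = {
    "base":  {"hidden_size": 768,  "num_attention_heads": 12},
    "large": {"hidden_size": 1024, "num_attention_heads": 16},
}
```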
Sequence Labeling Task
Document Classification Task
Shared or Independent Spatial embeddings?
Effect of pretraining
Deeper Projection Head
fc → ReLU → LayerNorm → fc
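A sketch of that deeper head in PyTorch (the hidden size and number of labels are assumptions):

```python
import torch.nn as nn

num_labels = 7  # hypothetical; depends on the downstream task

# fc -> ReLU -> LayerNorm -> fc, replacing the single linear projection head.
deeper_head = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.LayerNorm(768),
    nn.Linear(768, num_labels),
)
```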
Importance of Different Pre-training Tasks
Importance of Modalities
Example
Conclusions
DocFormer from Amazon looks promising as an end-to-end trainable model for OCR-related document-understanding tasks. The main concern is the amount of data needed to adequately pre-train the model; runtime performance should also be taken into account.