DocFormer: End-to-End Transformer for Document Understanding

Inc Lomin
Oct 29, 2021
 
 

Introduction

Motivation

Previous multi-modal models did not account for the distinct nature of visual and language tokens, nor for cross-modality feature correlation, which makes it hard to learn the correspondence between text and its associated visual features.
notion image

Contributions

  • Multi-modal framework fusing text, visual, and spatial features
  • Trained on a few unsupervised tasks, including two new ones: learning to reconstruct and multi-modal masked language modeling
  • DocFormer is end-to-end trainable
  • DocFormer does not rely on a pre-trained object detection network
  • DocFormer does not use custom OCR, unlike some recent papers
 

Proposed Method

Core Idea

DocFormer proposes a novel framework for multi-modal training, which the authors call Discrete Multi-Modal.
 
While many previous works used only text or text + spatial features, this paper suggests utilizing all three modalities: text, spatial, and visual features.
  • Visual features - ResNet-50 features, without transfer learning from an object detection task
  • Text features - OCR output passed through a word-piece tokenizer (W_t initialized from LayoutLMv1)
  • Spatial features - for each word, encoded embeddings → W_x(x1, x3, w, A_x) + W_y(y1, y3, h, A_y) + P_abs
    • Separate trainable weights for the visual and text spatial embeddings.
The transformer encoder is then F_enc(η, V, V_s, T, T_s).
notion image
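To make the feature construction concrete, here is a rough PyTorch sketch of how the three streams could be assembled. The module and parameter names are hypothetical (not from the paper's code), and shapes are simplified:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DocFormerFeatures(nn.Module):
    # Rough sketch of the three feature streams; names and shapes are assumptions.
    def __init__(self, hidden=768, vocab_size=30522, max_pos=512):
        super().__init__()
        # Visual: plain ResNet-50 trunk (no object-detection pre-training), projected to `hidden`.
        trunk = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])  # drop avgpool + fc
        self.visual_proj = nn.Linear(2048, hidden)
        # Text: word-piece token embeddings (W_t, could be initialized from LayoutLMv1).
        self.text_embed = nn.Embedding(vocab_size, hidden)
        # Spatial: separate trainable weights for the visual and text streams,
        # fed with per-word box features (x1, x3, w, A_x, y1, y3, h, A_y).
        self.spatial_text = nn.Linear(8, hidden)
        self.spatial_visual = nn.Linear(8, hidden)
        self.pos_abs = nn.Embedding(max_pos, hidden)  # P_abs, absolute 1-D position

    def forward(self, image, token_ids, boxes):
        # image: (B, 3, H, W); token_ids: (B, N); boxes: (B, N, 8)
        B, N = token_ids.shape
        fmap = self.backbone(image)                               # (B, 2048, h, w)
        V = self.visual_proj(fmap.flatten(2).transpose(1, 2))     # (B, h*w, hidden)
        # Bring the visual tokens to the same length as the text sequence (simplified here).
        V = F.interpolate(V.transpose(1, 2), size=N).transpose(1, 2)
        T = self.text_embed(token_ids)                            # (B, N, hidden)
        pos = self.pos_abs(torch.arange(N, device=token_ids.device))
        Ts = self.spatial_text(boxes.float()) + pos               # text spatial features
        Vs = self.spatial_visual(boxes.float()) + pos             # visual spatial features
        return V, Vs, T, Ts                                       # inputs to F_enc(η, V, Vs, T, Ts)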
Pre-training is done using 3 different types of tasks:
  • Multi-Modal Masked Language Modeling (MM-MLM) - a modification of MLM in which the corresponding visual features are not masked
  • Learn To Reconstruct (LTR) - reconstruction of the document image from the features of all modalities, trained with an L1 loss
  • Text Describes Image (TDI) - teaching the network to predict whether a text description matches a given document image representation
Total loss looks like:
L_pt = λ·L_MM-MLM + β·L_LTR + γ·L_TDI. In this paper λ = 5, β = 1 and γ = 5.
Pre-training is done for 5 epochs.
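A minimal sketch of how the combined objective could be computed during pre-training; the head outputs and label formats here are assumptions for illustration only:

import torch.nn.functional as F

# Loss weights from the summary above: λ = 5, β = 1, γ = 5.
LAMBDA, BETA, GAMMA = 5.0, 1.0, 5.0

def pretraining_loss(mlm_logits, mlm_labels, recon_img, target_img, tdi_logits, tdi_labels):
    # MM-MLM: cross-entropy over masked text tokens (-100 marks unmasked positions).
    l_mm_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # LTR: L1 loss between the reconstructed and the target document image.
    l_ltr = F.l1_loss(recon_img, target_img)
    # TDI: binary prediction of whether the text actually describes the image.
    l_tdi = F.binary_cross_entropy_with_logits(tdi_logits, tdi_labels.float())
    return LAMBDA * l_mm_mlm + BETA * l_ltr + GAMMA * l_tdi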
notion image
After pre-training, all task-specific heads are removed, linear projection heads are added, and the model is fine-tuned for downstream tasks.

Multi-Modal Self Attention

notion image
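As a deliberately simplified illustration of the idea shown in the figure (not the exact DocFormer layer): each modality runs its own self-attention, the spatial embeddings bias the queries and keys, and the two streams are fused by addition. Treat this only as a mental model:

import torch.nn as nn

class SimplifiedMultiModalSelfAttention(nn.Module):
    # Deliberately simplified; the real layer is more involved.
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, V, Vs, T, Ts):
        # Spatial embeddings bias each stream's queries and keys, while the values stay
        # "content only", so layout information is re-injected at every layer.
        v_out, _ = self.vis_attn(query=V + Vs, key=V + Vs, value=V)
        t_out, _ = self.text_attn(query=T + Ts, key=T + Ts, value=T)
        # The two modality outputs are fused by simple addition.
        return v_out + t_out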

Experiment results

Two versions of DocFormer are evaluated:
  • base - hidden size 768 and 12 attention heads
  • large - hidden size 1024 and 16 attention heads
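In code form, the two sizes boil down to something like the following (a hypothetical config dict; layer counts are omitted since they are not listed here):

DOCFORMER_CONFIGS = {
    "base":  {"hidden_size": 768,  "num_attention_heads": 12},
    "large": {"hidden_size": 1024, "num_attention_heads": 16},
}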

Sequence Labeling Task

notion image
notion image
 

Document Classification Task

notion image
 

Shared or Independent Spatial embeddings?

notion image
 

Effect of pretraining

notion image
 

Deeper Projection Head

fc → ReLU → LayerNorm → fc
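In PyTorch terms, the deeper head that replaces a single linear layer would look roughly like this (dimensions are assumptions):

import torch.nn as nn

def deeper_projection_head(hidden, num_classes):
    # fc -> ReLU -> LayerNorm -> fc, in place of a single linear classification head.
    return nn.Sequential(
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.LayerNorm(hidden),
        nn.Linear(hidden, num_classes),
    )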
notion image
 

Importance of Different Pre-training Tasks

notion image
 

Importance of Modalities

 
notion image
 

Example

notion image

Conclusions

DocFormer from Amazon looks promising as an end-to-end trainable meta-model for OCR-related tasks. The main concern is the amount of data needed to adequately pre-train the model. Inference speed should also be taken into account.
 
 
 
 