Introduction (brief background on the paper and its topic)
Previous multi-modal models did not account for the different nature of visual and language tokens, nor for cross-modality feature correlation. This becomes a problem when the correlation between text and its associated visual features is hard to learn.
- Multi-modal framework fusing: text, visual and spatial features
- Pre-trained on a few unsupervised tasks, including new ones: learning to reconstruct and multi-modal masked language modeling
- DocFormer is end-to-end trainable
- DocFormer does not rely on a pre-trained object detection network
- DocFormer does not use custom OCR, unlike some recent papers
Proposed Method
Core Idea
DocFormer proposes a novel framework for multi-modal training, which the authors call Discrete Multi-Modal.
While many previous works used only text or text + spatial features, this paper uses all 3 modalities: text, spatial and visual features.
- Visual features - ResNet50 w/o transfer learning from object detection task
- Text features - OCR followed by a word-piece tokenizer (W_t initialized from LayoutLMv1)
- Spatial features - each word's bounding box is encoded as an embedding: W_x(x1, x3, w, A_x) + W_y(y1, y3, h, A_y) + P_abs.
- Separate trainable weights for visual and text spatial embeddings.
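The transformer encoder then takes the form F_enc(η, V, V_s, T, T_s), where η denotes the transformer parameters, V and T the visual and text features, and V_s and T_s their spatial features.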
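As a rough illustration of what goes into W_x and W_y, the sketch below builds the per-word input tuples from OCR bounding boxes, taking (x1, y1) as the top-left and (x3, y3) as the bottom-right corner. The `Box` type and the `spatial_inputs` function are hypothetical names, and A_x / A_y are assumed here to be the offset to the next word's top-left corner; the relative-distance features in the paper are richer than this.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned word box: (x1, y1) top-left, (x3, y3) bottom-right."""
    x1: float
    y1: float
    x3: float
    y3: float


Features = Tuple[Tuple[float, ...], Tuple[float, ...]]


def spatial_inputs(boxes: List[Box]) -> List[Features]:
    """Return, per word, the x-tuple (x1, x3, w, a_x) and y-tuple (y1, y3, h, a_y).

    ASSUMPTION: a_x / a_y are the offsets from this box's top-left corner
    to the next box's top-left corner (0 for the last word).
    """
    feats = []
    for i, b in enumerate(boxes):
        w = b.x3 - b.x1  # box width
        h = b.y3 - b.y1  # box height
        if i + 1 < len(boxes):
            nxt = boxes[i + 1]
            a_x, a_y = nxt.x1 - b.x1, nxt.y1 - b.y1
        else:
            a_x, a_y = 0.0, 0.0
        feats.append(((b.x1, b.x3, w, a_x), (b.y1, b.y3, h, a_y)))
    return feats
```

In a real model each tuple would be passed through learned embedding layers and summed with an absolute position encoding, once with the visual weights and once with the text weights.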
Pre-training is done using 3 different types of tasks:
- Multi-Modal Masked Language Modeling (MM-MLM) - a modification of MLM in which the visual features corresponding to masked text are not masked.
- Learn To Reconstruct (LTR) - reconstruction of the document image from the features of all modalities, trained with an L1 loss.
- Text Describes Image (TDI) - teaches the network to tell whether a text description matches a document image representation.
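The MM-MLM idea can be sketched as follows: text tokens are replaced by a mask symbol, while the visual features for the same positions are passed through untouched. This is a minimal, framework-free illustration; the function name, the `[MASK]` string, and the default masking probability of 0.15 are assumptions for the example, not values taken from the paper.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK = "[MASK]"  # assumed mask symbol


def mask_text_only(
    tokens: Sequence[str],
    visual_feats: Sequence[Sequence[float]],
    mask_prob: float = 0.15,  # assumed default, not from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]], List[List[float]]]:
    """Mask text tokens at random; leave visual features unmasked.

    Returns (masked_tokens, labels, visual_feats_copy), where labels hold
    the original token at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)  # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # position is ignored by the loss
    # Visual features are copied as-is: nothing is masked on the visual side.
    visual_out = [list(v) for v in visual_feats]
    return masked, labels, visual_out
```

Keeping the visual side intact means the model can use the image region of a masked word as a hint when predicting it.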
The total pre-training loss is:
L_pt = λ*L_MM_MLM + β*L_LTR + γ*L_TDI. In this paper λ = 5, β = 1 and γ = 5.
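The weighted sum above is simple enough to write down directly. The helper below is a small sketch with a hypothetical function name; the default weights are the ones quoted from the paper, and the individual loss values are assumed to be plain floats computed elsewhere.

```python
def pretraining_loss(
    l_mm_mlm: float,
    l_ltr: float,
    l_tdi: float,
    lam: float = 5.0,    # weight of the MM-MLM loss
    beta: float = 1.0,   # weight of the LTR loss
    gamma: float = 5.0,  # weight of the TDI loss
) -> float:
    """Combine the three pre-training losses into one scalar."""
    return lam * l_mm_mlm + beta * l_ltr + gamma * l_tdi
```

For example, per-task losses of 1.0, 2.0 and 3.0 give 5*1.0 + 1*2.0 + 5*3.0 = 22.0.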
Pre-training is done for 5 epochs.
After pre-training, all task-specific heads are removed, linear projection heads are added, and the model is fine-tuned for the downstream tasks.
Multi-Modal Self Attention
Experiment results
Two versions of DocFormer:
- base - 768 hidden states and 12 attention heads
- large - 1024 hidden states and 16 attention heads
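The two variants can be captured in a small config object. This is only a sketch: the class and field names are hypothetical, and the per-head dimension assumes standard multi-head attention, where the hidden size is split evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocFormerConfig:
    """Hypothetical config holding the two sizes listed above."""
    hidden_size: int
    num_attention_heads: int

    @property
    def head_dim(self) -> int:
        # Assumes the hidden size is divided evenly among attention heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


BASE = DocFormerConfig(hidden_size=768, num_attention_heads=12)
LARGE = DocFormerConfig(hidden_size=1024, num_attention_heads=16)
```

Under that assumption both variants end up with 64 dimensions per head; the large model is wider rather than using bigger heads.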
Sequence Labeling Task
Document Classification Task
Shared or Independent Spatial embeddings?
Effect of pretraining
Deeper Projection Head
fc → ReLU → LayerNorm → fc
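To make the layer order concrete, here is a dependency-free sketch of the fc → ReLU → LayerNorm → fc head operating on a single feature vector. All function names are hypothetical, the LayerNorm omits its learnable scale and shift, and the random initialization is arbitrary; a real implementation would use a deep learning framework.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer; weight has shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean / unit variance (no learnable scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def init_linear(in_f: int, out_f: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    """Arbitrary small random weights and zero bias, for illustration only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_f)] for _ in range(out_f)]
    return weight, [0.0] * out_f


def projection_head(x: Vector, fc1: Tuple[Matrix, Vector], fc2: Tuple[Matrix, Vector]) -> Vector:
    """fc -> ReLU -> LayerNorm -> fc, applied in that order."""
    h = linear(x, *fc1)
    h = relu(h)
    h = layer_norm(h)
    return linear(h, *fc2)
```

A call such as `projection_head(features, init_linear(768, 768, rng), init_linear(768, num_classes, rng))` would map one token's encoder output to class logits, where `num_classes` depends on the downstream task.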
Importance of Different Pre-training Tasks
Importance of Modalities
DocFormer from Amazon looks promising as an end-to-end trainable meta-model for OCR-related tasks. The main drawback is the amount of data needed to adequately pre-train the model. Speed should also be taken into account.