FormNet: Beyond Sequential Modeling for Form-Based Document Understanding
Form documents pose unique challenges compared to natural-language documents, stemming from their structural characteristics (serialization errors are hard to address!)
Oct 11, 2022
What does it do?
- sequential entity tagging and extraction
Problem
- Form documents pose unique challenges compared to natural-language documents, stemming from their structural characteristics (serialization errors are hard to address!)
Novelty / solution to the problem
- Rich Attention → replaces ETC’s attention mechanism
- leverages the spatial relationship between tokens in a form for more precise attention score calculation
- Super-Tokens → embeddings
- constructs a super-token for each word by embedding representations from its neighboring tokens through graph convolutions
Pretraining
- MLM (masked language modeling) only!
- steps (see the pipeline sketch after this block)
- OCR: document -> OCR words + bboxes
- tokenization with the BERT-multilingual vocabulary: OCR words + bboxes -> tokens
- GCN (graph construction & message passing): tokens + bboxes (2D coordinates) -> super-tokens (graph embeddings)
- ETC (Extended Transformer Construction) w/ Rich Attention: super-tokens -> entity BIOES logits
- Viterbi (decode and obtain the final entities for output): entity BIOES logits -> entity extraction outputs
- setup
- max sequence length : 1024
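A minimal sketch of the five-step pipeline above. Every stage is passed in as a callable because the names and signatures here (ocr, tokenizer, gcn, etc_rich_att, viterbi_decode) are placeholders for illustration, not FormNet's actual API.

```python
def extract_entities(document_image, ocr, tokenizer, gcn, etc_rich_att, viterbi_decode,
                     max_length=1024):
    """Glue code only; each stage is supplied by the caller."""
    # 1) OCR: document image -> OCR words + bounding boxes
    words, bboxes = ocr(document_image)

    # 2) tokenize the OCR words with the BERT-multilingual vocabulary
    tokens, token_bboxes = tokenizer(words, bboxes, max_length=max_length)

    # 3) GCN: build a graph from the 2D coordinates, run message passing -> super-tokens
    super_tokens = gcn(tokens, token_bboxes)

    # 4) ETC w/ Rich Attention: super-tokens -> per-token BIOES logits
    bioes_logits = etc_rich_att(super_tokens, token_bboxes)

    # 5) Viterbi: decode the BIOES logits into the final extracted entities
    return viterbi_decode(bioes_logits)
```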
RichAtt(Rich Attention)
- replace ETC’s attention with rich attention
- why not just use ETC?
- ETC uses relative positional encoding → however, token offsets measured on the error-prone serialization may limit the power of the positional encoding
- Rich attention
- avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely → computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid
- Attention score (schematic notation; a numpy sketch follows this section)
- for each pair of token representations and each attention head, the usual dot-product score is adjusted with an order term and a distance term, computed separately for the x and y axes of the layout grid: $a_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + s^{\text{order}}_{ij} + s^{\text{dist}}_{ij}$
- actual order: $o_{ij} = \mathbb{1}[x_i < x_j]$
- actual log-distance: $d_{ij} = \log(1 + |x_i - x_j|)$
- attention score between word $i$ and word $j$: when $i < j$ (word $i$ sits to the left), a head that expects this order incurs no penalty → higher attention score → a word gives more attention to words on its left. (In English, adjectives sit to the left of nouns, so in an <adjective, noun> sequence the noun attends more to the adjective right before it.)
- when the distance between word $i$ and word $j$ is short, $d_{ij}$ is small and $s^{\text{dist}}_{ij}$ becomes larger (smaller in absolute value) → higher attention score → a word gives more attention to words close to itself.
- what do those bias terms do?
- penalizing attention edges for violating soft order/distance constraints
- → the model will learn logical implication rules such as the following:
- “Lazy” is to the right, so it does not modify “crow”, and its attention edge is penalized.
- “Sly” is many tokens away, so its attention edge is penalized.
- “Cunning” receives no large penalty, so by the process of elimination above (from the Negative Error case) it is selected as the most likely candidate adjective to attend to.
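A rough numpy sketch of how order and log-distance could adjust the attention score along one layout axis, following the description above. The way the head's expected order probability and expected log-distance are predicted from each query/key pair (simple dot products with w_order / w_dist) is an assumption for illustration, not the paper's exact parameterization.

```python
import numpy as np

def rich_attention_scores(q, k, x, w_order, w_dist, sigma=1.0):
    """Sketch of order/distance-aware attention scores for one head, one layout axis.

    q, k            : (n, d) query / key vectors
    x               : (n,)  token coordinates along the axis (e.g. x-centers)
    w_order, w_dist : (2*d,) weights used to predict the head's expected order
                      probability and expected log-distance from each (q_i, k_j)
                      pair -- an illustrative parameterization, not the paper's.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                          # standard dot-product term

    # actual order and actual log-distance, read off the layout grid
    order = (x[:, None] < x[None, :]).astype(float)        # o_ij = 1 if x_i < x_j
    log_dist = np.log1p(np.abs(x[:, None] - x[None, :]))   # d_ij = log(1 + |x_i - x_j|)

    for i in range(n):
        for j in range(n):
            pair = np.concatenate([q[i], k[j]])

            # order term: Bernoulli log-likelihood, near 0 when the head's expected
            # order matches the actual order, strongly negative when it does not
            p = 1.0 / (1.0 + np.exp(-pair @ w_order))
            p = np.clip(p, 1e-6, 1 - 1e-6)
            scores[i, j] += order[i, j] * np.log(p) + (1 - order[i, j]) * np.log(1 - p)

            # distance term: squared error between the actual log-distance and the
            # head's expected log-distance (a soft distance constraint)
            mu = pair @ w_dist
            scores[i, j] += -0.5 * ((log_dist[i, j] - mu) / sigma) ** 2

    return scores  # fed into the softmax as usual
```

The order term is a log-likelihood (close to zero when the head's expectation matches the actual order, strongly negative otherwise) and the distance term is a squared error on log-distance, which is exactly the "penalize attention edges that violate soft order/distance constraints" behavior described above.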
Super-Token by Graph Learning
- poor serialization can still block significant attention weight calculation between related word tokens → every token attends only to tokens that are nearby in the serialized sequence
- graph edges are constructed with strong inductive biases, so that connected tokens have higher probabilities of belonging to the same entity type (toy sketch below)
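A toy sketch of the super-token idea: connect each token to spatially nearby tokens and aggregate their representations by message passing. The k-nearest-neighbor edge rule and mean aggregation here are simplifying assumptions; the paper's actual graph construction and GCN are richer.

```python
import numpy as np

def build_super_tokens(token_embs, bboxes, k=8, num_layers=2):
    """Toy super-token construction: k-nearest spatial neighbors + mean message passing.

    token_embs : (n, d) token embeddings
    bboxes     : (n, 4) [x0, y0, x1, y1] boxes; their centers define spatial distance.
    """
    centers = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2,
                        (bboxes[:, 1] + bboxes[:, 3]) / 2], axis=1)
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]   # k nearest tokens on the page (self excluded)

    h = token_embs
    for _ in range(num_layers):
        # each token mixes its own state with messages averaged over its spatial neighbors
        h = 0.5 * h + 0.5 * h[neighbors].mean(axis=1)
    return h                                            # "super-token" embeddings
```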
Fine-tuning (dataset)
- MLM-pretraining → CORD, FUNSD
- CORD standard evaluation set (1,000 documents): train (800), validation (100), test (100)
- FUNSD (about 200 documents): 75-25 split for training and test
- (no pre-training) → Payment (10K documents, 7 semantic entity labels) // not publicly available?
Pre-training
- Following DocFormer (2021), we collect around 700k unlabeled form documents for unsupervised pre-training. We adopt the Masked Language Model (MLM) objective (a masking sketch follows this block).
- We train the models from scratch using Adam optimizer with batch size of 512. The learning rate is set to 0.0002 with a warm-up proportion of 0.01.
- From DocFormer
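A small sketch of the MLM masking step. The note only says "MLM", so the standard BERT recipe (15% of tokens selected; 80% [MASK] / 10% random / 10% unchanged) is assumed here rather than taken from the paper.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style MLM masking (assumed recipe). Returns masked inputs and labels."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(tok)                            # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # random replacement
            # else: keep the original token
        else:
            labels.append(-100)                           # position ignored in the loss
    return inputs, labels
```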
Result (benchmark)
- FormNet-A2 outperforms DocFormer while using a model 2.5x smaller.
- FormNet-A3 achieves state-of-the-art performance with an F1 score of 97.28%.
Importance of RichAtt and Super-Tokens (GCN)
Attention Visualization
we visualize the local-to-local attention scores for specific examples from the CORD dataset for the standard ETC and FormNet models. Qualitatively, we confirm that the tokens attend primarily to other tokens within the same visual block for FormNet. Moreover, for that model, specific attention heads attend to tokens that are aligned horizontally, which is a strong signal of meaning for form documents. No clear attention pattern emerges for the ETC model, suggesting that RichAtt and Super-Token by GCN enable the model to learn the structural cues and leverage layout information effectively.
Conclusion
- RichAtt mechanism and Super-Token components help the ETC transformer excel at form understanding in spite of sub-optimal, noisy serialization.
- FormNet recovers local syntactic information that may have been lost during text serialization and achieves state-of-the-art performance on three benchmarks.
Others
- Viterbi algorithm: finds the tag sequence that maximizes the posterior probability (see the decoding sketch at the end of this section)
- BIOES tagging
- B (begin), I (inside), O (outside), E (end); a single-word entity → S (single)
- NER (named-entity recognition) == KV extraction
- ETC
- ETC scales to relatively long inputs by replacing standard attention, which has quadratic complexity, with a sparse global-local attention mechanism that distinguishes between global and long input tokens. The global tokens attend to and are attended by all tokens, but the long tokens attend only locally to other long tokens within a specified local radius, reducing the complexity so that it is more manageable for long sequences.
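A small sketch of Viterbi decoding over per-token BIOES logits for a single entity type. The hand-written transition mask (forbidding sequences like O → I) and the zero scores for allowed transitions are illustrative assumptions, not the paper's decoder.

```python
import numpy as np

TAGS = ["B", "I", "O", "E", "S"]
ALLOWED = {                 # prev tag -> tags that may follow it (single entity type)
    "B": {"I", "E"},
    "I": {"I", "E"},
    "O": {"B", "O", "S"},
    "E": {"B", "O", "S"},
    "S": {"B", "O", "S"},
}

def viterbi(logits):
    """logits: np.ndarray of shape (seq_len, 5) in TAGS order -> best tag sequence.

    Transitions violating the BIOES scheme get a large negative score; constraints
    on the first tag are omitted for brevity.
    """
    n, t = logits.shape
    trans = np.full((t, t), -1e9)
    for i, prev in enumerate(TAGS):
        for j, nxt in enumerate(TAGS):
            if nxt in ALLOWED[prev]:
                trans[i, j] = 0.0

    score = logits[0].copy()                 # best path score ending in each tag
    back = np.zeros((n, t), dtype=int)       # backpointers
    for pos in range(1, n):
        cand = score[:, None] + trans + logits[pos][None, :]
        back[pos] = cand.argmax(axis=0)
        score = cand.max(axis=0)

    path = [int(score.argmax())]             # best final tag, then walk backwards
    for pos in range(n - 1, 0, -1):
        path.append(int(back[pos, path[-1]]))
    return [TAGS[i] for i in reversed(path)]
```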
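And a sketch of the global-local connectivity pattern the ETC description above refers to, expressed as a boolean attention mask (True = attention allowed). This is just one simple way to realize the pattern; ETC's real implementation is more involved.

```python
import numpy as np

def etc_attention_mask(num_global, num_long, local_radius):
    """Boolean mask (True = attention allowed) over [global tokens | long tokens]."""
    n = num_global + num_long
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True          # global tokens attend to every token
    mask[:, :num_global] = True          # every token attends to the global tokens
    for i in range(num_long):            # long tokens: local window only
        lo = max(0, i - local_radius)
        hi = min(num_long, i + local_radius + 1)
        mask[num_global + i, num_global + lo:num_global + hi] = True
    return mask
```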