FormNet: Beyond Sequential Modeling for Form-Based Document Understanding
Form documents pose unique challenges compared to natural-language documents, stemming from their structural characteristics (serialization errors are hard to address!)
Oct 11, 2022
What does it do?
- sequential entity tagging and extraction
Problem
- Form documents pose unique challenges compared to natural-language documents, stemming from their structural characteristics (serialization errors are hard to address!)
Novelty / solution to the problem
- Rich Attention → replaces ETC’s attention mechanism
- leverages the spatial relationship between tokens in a form for more precise attention score calculation
- Super-Tokens → embeddings
- constructs a super-token for each word by embedding representations from its neighboring tokens through graph convolutions
Pretraining
- MLM (masked language modeling) only!
- steps (see the pipeline sketch after this block)
- OCR: document -> OCR words + bboxes
- tokenization with the BERT-multilingual vocabulary: OCR words + bboxes -> tokens
- GCN (graph construction & message passing): tokens + bboxes (2D coordinates) -> super-tokens (graph embeddings)
- ETC (Extended Transformer Construction) w/ Rich Attention: super-tokens -> entity BIOES logits
- Viterbi (decode and obtain the final entities for output): entity BIOES logits -> entity extraction outputs
- setup
- max sequence length : 1024
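A minimal sketch of the five-step pipeline above. Every stage is passed in as a callable because the names and signatures here (ocr, tokenizer, gcn, etc_rich_att, viterbi_decode) are placeholders for illustration, not FormNet's actual API.

```python
def extract_entities(document_image, ocr, tokenizer, gcn, etc_rich_att, viterbi_decode,
                     max_length=1024):
    """Glue code only; each stage is supplied by the caller."""
    # 1) OCR: document image -> OCR words + bounding boxes
    words, bboxes = ocr(document_image)

    # 2) tokenize the OCR words with the BERT-multilingual vocabulary
    tokens, token_bboxes = tokenizer(words, bboxes, max_length=max_length)

    # 3) GCN: build a graph from the 2D coordinates, run message passing -> super-tokens
    super_tokens = gcn(tokens, token_bboxes)

    # 4) ETC w/ Rich Attention: super-tokens -> per-token BIOES logits
    bioes_logits = etc_rich_att(super_tokens, token_bboxes)

    # 5) Viterbi: decode the BIOES logits into the final extracted entities
    return viterbi_decode(bioes_logits)
```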
RichAtt(Rich Attention)
- replace ETC’s attention with rich attention
- why not just use ETC?
- ETC uses relative positional encoding → however, token offsets measured on the error-prone serialization may limit the power of the positional encoding
- Rich attention
- avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely → computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid
- Attention score (schematic notation; a numpy sketch follows this section)
- for each pair of token representations and each attention head, the usual dot-product score is adjusted with an order term and a distance term, computed separately for the x and y axes of the layout grid: $a_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} + s^{\text{order}}_{ij} + s^{\text{dist}}_{ij}$
- actual order: $o_{ij} = \mathbb{1}[x_i < x_j]$
- actual log-distance: $d_{ij} = \log(1 + |x_i - x_j|)$
- attention score between word $i$ and word $j$: when $i < j$ (word $i$ sits to the left), a head that expects this order incurs no penalty → higher attention score → a word gives more attention to words on its left. (In English, adjectives sit to the left of nouns, so in an <adjective, noun> sequence the noun attends more to the adjective right before it.)
- when the distance between word $i$ and word $j$ is short, $d_{ij}$ is small and $s^{\text{dist}}_{ij}$ becomes larger (smaller in absolute value) → higher attention score → a word gives more attention to words close to itself.
- what do those bias terms do?
- penalizing attention edges for violating soft order/distance constraints
- → the model will learn logical implication rules such as the following:
- “Lazy” is to the right, so it does not modify “crow”, and its attention edge is penalized.
- “Sly” is many tokens away, so its attention edge is penalized.
- “Cunning” receives no large penalty, so by the process of elimination above (from the Negative Error case) it is selected as the most likely candidate adjective to attend to.
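A rough numpy sketch of how order and log-distance could adjust the attention score along one layout axis, following the description above. The way the head's expected order probability and expected log-distance are predicted from each query/key pair (simple dot products with w_order / w_dist) is an assumption for illustration, not the paper's exact parameterization.

```python
import numpy as np

def rich_attention_scores(q, k, x, w_order, w_dist, sigma=1.0):
    """Sketch of order/distance-aware attention scores for one head, one layout axis.

    q, k            : (n, d) query / key vectors
    x               : (n,)  token coordinates along the axis (e.g. x-centers)
    w_order, w_dist : (2*d,) weights used to predict the head's expected order
                      probability and expected log-distance from each (q_i, k_j)
                      pair -- an illustrative parameterization, not the paper's.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                          # standard dot-product term

    # actual order and actual log-distance, read off the layout grid
    order = (x[:, None] < x[None, :]).astype(float)        # o_ij = 1 if x_i < x_j
    log_dist = np.log1p(np.abs(x[:, None] - x[None, :]))   # d_ij = log(1 + |x_i - x_j|)

    for i in range(n):
        for j in range(n):
            pair = np.concatenate([q[i], k[j]])

            # order term: Bernoulli log-likelihood, near 0 when the head's expected
            # order matches the actual order, strongly negative when it does not
            p = 1.0 / (1.0 + np.exp(-pair @ w_order))
            p = np.clip(p, 1e-6, 1 - 1e-6)
            scores[i, j] += order[i, j] * np.log(p) + (1 - order[i, j]) * np.log(1 - p)

            # distance term: squared error between the actual log-distance and the
            # head's expected log-distance (a soft distance constraint)
            mu = pair @ w_dist
            scores[i, j] += -0.5 * ((log_dist[i, j] - mu) / sigma) ** 2

    return scores  # fed into the softmax as usual
```

The order term is a log-likelihood (close to zero when the head's expectation matches the actual order, strongly negative otherwise) and the distance term is a squared error on log-distance, which is exactly the "penalize attention edges that violate soft order/distance constraints" behavior described above.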
Super-Token by Graph Learning
- poor serialization can still block significant attention weight calculation between related word tokens → every token attends only to tokens that are nearby in the serialized sequence
- graph edges are constructed with strong inductive biases, so that connected tokens have higher probabilities of belonging to the same entity type (toy sketch below)
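A toy sketch of the super-token idea: connect each token to spatially nearby tokens and aggregate their representations by message passing. The k-nearest-neighbor edge rule and mean aggregation here are simplifying assumptions; the paper's actual graph construction and GCN are richer.

```python
import numpy as np

def build_super_tokens(token_embs, bboxes, k=8, num_layers=2):
    """Toy super-token construction: k-nearest spatial neighbors + mean message passing.

    token_embs : (n, d) token embeddings
    bboxes     : (n, 4) [x0, y0, x1, y1] boxes; their centers define spatial distance.
    """
    centers = np.stack([(bboxes[:, 0] + bboxes[:, 2]) / 2,
                        (bboxes[:, 1] + bboxes[:, 3]) / 2], axis=1)
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]   # k nearest tokens on the page (self excluded)

    h = token_embs
    for _ in range(num_layers):
        # each token mixes its own state with messages averaged over its spatial neighbors
        h = 0.5 * h + 0.5 * h[neighbors].mean(axis=1)
    return h                                            # "super-token" embeddings
```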
Fine-tuning (dataset)
- MLM-pretraining → CORD, FUNSD
- CORD standard evaluation set (1,000 documents): train (800), validation (100), test (100)
- FUNSD (about 200 documents): 75-25 split for training and test
- (no pre-training) → Payment (10K documents, 7 semantic entity labels) // not publicly available?
Pre-training
- Following DocFormer (2021), we collect around 700k unlabeled form documents for unsupervised pre-training. We adopt the Masked Language Model (MLM) objective (a masking sketch follows this block).
- We train the models from scratch using Adam optimizer with batch size of 512. The learning rate is set to 0.0002 with a warm-up proportion of 0.01.
- From DocFormer
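A small sketch of the MLM masking step. The note only says "MLM", so the standard BERT recipe (15% of tokens selected; 80% [MASK] / 10% random / 10% unchanged) is assumed here rather than taken from the paper.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style MLM masking (assumed recipe). Returns masked inputs and labels."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(tok)                            # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # random replacement
            # else: keep the original token
        else:
            labels.append(-100)                           # position ignored in the loss
    return inputs, labels
```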
Result (benchmark)
- FormNet-A2 outperforms DocFormer while using a model 2.5x smaller.
- FormNet-A3 achieves state-of-the-art performance with an F1 score of 97.28%.
Importance of RichAtt and Super-Tokens (GCN)
Attention Visualization
we visualize the local-to-local attention scores for specific examples from the CORD dataset for the standard ETC and FormNet models. Qualitatively, we confirm that the tokens attend primarily to other tokens within the same visual block for FormNet. Moreover, for that model, specific attention heads attend to tokens that are aligned horizontally, which is a strong signal of meaning for form documents. No clear attention pattern emerges for the ETC model, suggesting that RichAtt and Super-Token by GCN enable the model to learn the structural cues and leverage layout information effectively.
Conclusion
- RichAtt mechanism and Super-Token components help the ETC transformer excel at form understanding in spite of sub-optimal, noisy serialization.
- FormNet recovers local syntactic information that may have been lost during text serialization and achieves state-of-the-art performance on three benchmarks.
Others
- Viterbi algorithm: finds the tag sequence that maximizes the posterior probability (see the decoding sketch at the end of this section)
- BIOES tagging
- B (begin), I (inside), O (outside), E (end); a single-word entity → S (single)
- NER (named-entity recognition) == KV extraction
- ETC
- ETC scales to relatively long inputs by replacing standard attention, which has quadratic complexity, with a sparse global-local attention mechanism that distinguishes between global and long input tokens. The global tokens attend to and are attended by all tokens, but the long tokens attend only locally to other long tokens within a specified local radius, reducing the complexity so that it is more manageable for long sequences.
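A small sketch of Viterbi decoding over per-token BIOES logits for a single entity type. The hand-written transition mask (forbidding sequences like O → I) and the zero scores for allowed transitions are illustrative assumptions, not the paper's decoder.

```python
import numpy as np

TAGS = ["B", "I", "O", "E", "S"]
ALLOWED = {                 # prev tag -> tags that may follow it (single entity type)
    "B": {"I", "E"},
    "I": {"I", "E"},
    "O": {"B", "O", "S"},
    "E": {"B", "O", "S"},
    "S": {"B", "O", "S"},
}

def viterbi(logits):
    """logits: np.ndarray of shape (seq_len, 5) in TAGS order -> best tag sequence.

    Transitions violating the BIOES scheme get a large negative score; constraints
    on the first tag are omitted for brevity.
    """
    n, t = logits.shape
    trans = np.full((t, t), -1e9)
    for i, prev in enumerate(TAGS):
        for j, nxt in enumerate(TAGS):
            if nxt in ALLOWED[prev]:
                trans[i, j] = 0.0

    score = logits[0].copy()                 # best path score ending in each tag
    back = np.zeros((n, t), dtype=int)       # backpointers
    for pos in range(1, n):
        cand = score[:, None] + trans + logits[pos][None, :]
        back[pos] = cand.argmax(axis=0)
        score = cand.max(axis=0)

    path = [int(score.argmax())]             # best final tag, then walk backwards
    for pos in range(n - 1, 0, -1):
        path.append(int(back[pos, path[-1]]))
    return [TAGS[i] for i in reversed(path)]
```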
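And a sketch of the global-local connectivity pattern the ETC description above refers to, expressed as a boolean attention mask (True = attention allowed). This is just one simple way to realize the pattern; ETC's real implementation is more involved.

```python
import numpy as np

def etc_attention_mask(num_global, num_long, local_radius):
    """Boolean mask (True = attention allowed) over [global tokens | long tokens]."""
    n = num_global + num_long
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True          # global tokens attend to every token
    mask[:, :num_global] = True          # every token attends to the global tokens
    for i in range(num_long):            # long tokens: local window only
        lo = max(0, i - local_radius)
        hi = min(num_long, i + local_radius + 1)
        mask[num_global + i, num_global + lo:num_global + hi] = True
    return mask
```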