Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis

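HTS (Hierarchical Text Spotter) is a model that unifies text spotting and layout analysis: it predicts accurate polygons for text lines and analyzes the layout of text in a document. It reports state-of-the-art results on multiple benchmarks and is presented as the first method to handle text and layout jointly.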

Dec 08, 2023

Contents

Introduction Related Works Methodology Experiments Ablation Studies Limitations Conclusion

Introduction

Most existing text spotting methods extract text only at the word level

HTS is the first method to unify text spotting with geometric layout analysis

Achieves SOTA performance on both text spotting and layout analysis

The two main components of the Hierarchical Text Spotter (HTS)

Unified-Detector-Polygon (UDP)

Predicts Bezier curve polygons for text lines

Produces an affinity matrix for grouping text lines into paragraphs

The conventional approach of regressing Bezier curve control points with an L1 loss does not capture text shape accurately

Location and Shape Decoupling Module (LSDM): a new method that decouples the learning of location and shape representations

Line-to-Character-to-Word (L2C2W)

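Transformer encoder-decoder architecture

Jointly predicts character-level bounding boxes and character classes

Whitespace is also predicted, as a special character

Splits each line into characters and then merges them back into words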

Core contributions

Unifies word-level text spotting and geometric layout analysis

The LSDM module, which enables accurate polygon prediction for text lines

The L2C2W recognizer, which takes over part of the layout analysis and text localization

State-of-the-art results on text spotting and geometric layout analysis benchmarks without fine-tuning on the evaluation datasets

Related Works

Text Detection

Text Recognition

Text Spotting

Layout analysis

Methodology

Unified Detection Stage

Line Recognition Stage

1. Unified Detection of Text Line and Paragraph

N: number of queries

m: degree of the Bezier curve

A. KMaX-DeepLab

The goal is to obtain mask embedding vectors by applying cross-attention between learnable object queries and the image features

K-means cross-attention

https://arxiv.org/abs/2207.04044
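
As a rough illustration of the idea behind k-means cross-attention, the sketch below hard-assigns each pixel to its best-matching query (an argmax over queries rather than a softmax over pixels) and then adds the assigned pixel values to that query. This is a simplified sketch in plain Python, not the kMaX-DeepLab implementation: the function name, the list-based tensors, and the omission of projections and normalization are all assumptions made for readability.

```python
def kmeans_cross_attention(queries, keys, values):
    """One k-means-style update (illustrative, hypothetical helper).

    queries: N vectors of length D (object queries acting as cluster centers)
    keys, values: P vectors of length D (one pair per pixel)
    """
    n = len(queries)
    dim = len(values[0])
    updates = [[0.0] * dim for _ in range(n)]
    for key, value in zip(keys, values):
        # Cluster-wise argmax: pick the query with the highest dot-product score.
        scores = [sum(q * k for q, k in zip(query, key)) for query in queries]
        best = max(range(n), key=scores.__getitem__)
        for d in range(dim):
            updates[best][d] += value[d]
    # Residual update of the cluster centers.
    return [
        [q + u for q, u in zip(query, update)]
        for query, update in zip(queries, updates)
    ]
```

Each pixel contributes to exactly one query, which is what makes the update resemble a k-means center update.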

B. Bezier Head

Location Head: predicts Axis-Aligned Bounding Boxes (AABB)

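Shape Head: predicts local Bezier curves

Each Bezier curve has m + 1 control points

One polygon is described by two curves, so it has 2 * (m + 1) control points

That is, 4 * (m + 1) coordinates are predicted per polygon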
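
To make the control-point count concrete, here is a minimal sketch of how two Bezier curves with m + 1 control points each can be turned into a polygon outline using the standard Bernstein form. The function names, the sampling density, and the assumption that both curves list their control points left to right are illustrative choices, not details taken from the paper.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve of degree m = len(control_points) - 1 at t in [0, 1]."""
    m = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        # Bernstein basis polynomial B_{i,m}(t).
        weight = comb(m, i) * (t ** i) * ((1 - t) ** (m - i))
        x += weight * px
        y += weight * py
    return (x, y)


def polygon_from_beziers(top, bottom, samples=5):
    """Sample both curves and join them into one closed outline (hypothetical helper)."""
    ts = [k / (samples - 1) for k in range(samples)]
    top_pts = [bezier_point(top, t) for t in ts]
    # Walk the bottom curve in reverse so the vertices go around the polygon.
    bottom_pts = [bezier_point(bottom, t) for t in reversed(ts)]
    return top_pts + bottom_pts
```

With m = 3, `top` and `bottom` each hold 4 points, giving 8 control points and 16 coordinates, which matches 4 * (m + 1).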

Global Bezier: the local Bezier curves, scaled and translated using the AABB
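
The scale-and-translate step can be sketched as follows. The exact parameterization is not given in these notes, so this example assumes that local control points lie in [0, 1] relative to the box and that the AABB is given as (x_min, y_min, x_max, y_max); both are assumptions, as is the function name.

```python
def local_to_global(local_points, aabb):
    """Map control points from box-relative coordinates to image coordinates.

    local_points: (u, v) pairs, assumed to lie in [0, 1] relative to the box
    aabb: (x_min, y_min, x_max, y_max) in image pixels (assumed layout)
    """
    x_min, y_min, x_max, y_max = aabb
    width, height = x_max - x_min, y_max - y_min
    # Scale by the box size, then translate by the box origin.
    return [(x_min + u * width, y_min + v * height) for u, v in local_points]
```

Decoupling the two predictions this way lets the shape head describe the curve independently of where the line sits and how large it is.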

C. Layout Head

Head used for layout grouping

Produces an N * N affinity matrix
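
One plausible way to turn an N * N affinity matrix into paragraph groups is to threshold it and take connected components with union-find. The notes do not specify the grouping procedure, so the threshold value, the function name, and the use of union-find are assumptions in this sketch.

```python
def group_lines(affinity, threshold=0.5):
    """Group line indices into paragraphs (illustrative, hypothetical helper).

    Lines i and j are linked when affinity[i][j] exceeds the threshold;
    paragraphs are the connected components of the resulting graph.
    """
    n = len(affinity)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if affinity[i][j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, four lines where lines 0 and 1 are strongly linked and lines 2 and 3 are strongly linked come out as two paragraphs.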

Loss

2. Line-to-Character-to-Word Recognition

BezierAlign

https://arxiv.org/pdf/2002.10200v2.pdf

Text Line Recognition Model

Image pixels are encoded with a MobileNetV2 CNN backbone

The features are fed to a Transformer encoder together with positional encodings

An auto-regressive Transformer decoder produces the output

Character Localization

Adds a 2-layer FFN head

It runs in parallel with the token classification head and outputs a 4-d vector holding the top-left and bottom-right corners

Loss

character classification : cross-entropy loss

character localization : L1 loss

αt : indicator of whether a bounding box is available for step t

ϵ : a small positive number that avoids division by zero
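
The sketch below shows one way the terms listed above could fit together: a cross-entropy term for character classification plus an L1 box term that is masked by αt and normalized by the number of boxes, with ϵ guarding the division. The exact formula is not reproduced in these notes, so the normalization, the `box_weight` factor, and the function name are assumptions rather than the paper's definition.

```python
from math import log


def recognition_loss(probs, targets, pred_boxes, gt_boxes, alphas,
                     box_weight=1.0, eps=1e-6):
    """Combined classification and localization loss (illustrative sketch).

    probs[t]: predicted class distribution at step t
    targets[t]: index of the true class at step t
    pred_boxes[t], gt_boxes[t]: 4-d boxes
    alphas[t]: 1 if step t has a ground-truth box, else 0
    """
    # Cross-entropy averaged over decoding steps.
    ce = -sum(log(p[y]) for p, y in zip(probs, targets)) / len(targets)
    # L1 box error, counted only where a ground-truth box exists.
    l1 = sum(
        a * sum(abs(p - g) for p, g in zip(pb, gb))
        for a, pb, gb in zip(alphas, pred_boxes, gt_boxes)
    )
    # eps keeps the division defined when no step has a box.
    return ce + box_weight * l1 / (sum(alphas) + eps)
```

When no step has a box annotation, the localization term is simply zero instead of raising a division error.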

Post-processing

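The predicted space characters are used to split each text line into words

The predicted coordinates are projected back to original image coordinates through the bijection obtained during BezierAlign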
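
The word-splitting step can be sketched as below: characters are accumulated until a predicted space is seen, and the character boxes of each word are merged into one enclosing box. The function name and the (x_min, y_min, x_max, y_max) box layout are assumptions, and the projection back to image coordinates through the BezierAlign mapping is left out of this sketch.

```python
def chars_to_words(chars, boxes):
    """Split a recognized line at spaces and merge character boxes per word.

    chars: predicted characters, with ' ' marking word boundaries
    boxes: per-character (x_min, y_min, x_max, y_max); boxes of spaces are ignored
    """
    words, text, word_boxes = [], "", []

    def flush():
        nonlocal text, word_boxes
        if text:
            # The word box is the tightest box covering all character boxes.
            merged = (
                min(b[0] for b in word_boxes), min(b[1] for b in word_boxes),
                max(b[2] for b in word_boxes), max(b[3] for b in word_boxes),
            )
            words.append((text, merged))
        text, word_boxes = "", []

    for ch, box in zip(chars, boxes):
        if ch == " ":
            flush()
        else:
            text += ch
            word_boxes.append(box)
    flush()
    return words
```

Consecutive or trailing spaces produce no empty words, because a word is only emitted when it contains at least one character.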

Experiments

Results on End-to-End Text Spotting

Results on Geometric Layout Analysis

Ablation Studies

LSDM (Location and Shape Decoupling Module)

In the ablation, LSDM is replaced with a single AABB prediction head

Limitations

Latency

Runs at 7.8 FPS on an A100 GPU

Line labels

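Line labels are often missing from datasets, but line-grouping annotations are cheap to collect

Line boxes are easy to derive from word-level labels combined with line-grouping annotations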

Character Localization

Most benchmark datasets provide no character-level labels, so character localization accuracy cannot be evaluated

Conclusion

The first paper to perform text spotting and layout analysis jointly

Achieves SOTA performance on the benchmarks for each task
