Lomin Official Blog | The Data for AI

Document Parsing Solutions: 5 Criteria to Check Before Adoption

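A detailed look at the five key criteria to check before adopting a document parsing solution. From AI-native document generation to RAG integration, a complete guide to document processing in the AI era.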

Aug 06, 2025

AI Insights

DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting

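DEER (Detection-agnostic End-to-End Recognizer) is a new approach to text spotting that departs from conventional detect-then-recognize pipelines, proposing a recognition architecture that depends less on text detection errors. This lets it recognize text of varied shapes and sizes effectively and maintain performance without a complex detection mechanism.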

Apr 25, 2024

Tech

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

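An innovative AI model for OCR-free document understanding, featuring high-resolution image processing, token optimization, and text spotting. It demonstrates strong performance on 12 benchmarks, opening a new horizon for document understanding AI.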

Apr 25, 2024

Tech

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

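SliceGPT proposes a structured pruning method to improve the efficiency of Transformer-based language models. It uses principal component analysis (PCA) to transform the weight matrices so that rows and columns can be deleted, cutting inference compute to as little as 64% of the dense model's while largely maintaining performance.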

Apr 25, 2024

Tech

Improving Post-Training Quantization on Object Detection with Task Loss-Guided Lp Metric

Of the two approaches to quantization, QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization), this paper targets PTQ and presents DetPTQ, a new method for mitigating its performance drop.

Apr 25, 2024

Tech

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Uses DetPTQ and ODOL to substantially improve PTQ performance for Document AI models, achieving efficient quantization without performance degradation.

Apr 25, 2024

Tech

LLMs

Compares the training data and optimization techniques of recent LLMs such as LLaMA, PaLM, and Mistral. See each model's characteristics and the latest technical trends at a glance.

Apr 25, 2024

Tech

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Introduces Grounding DINO, which extends the DINO model to open-set and referring object detection, strengthening object detection driven by natural-language queries.

Apr 25, 2024

Tech

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Proposes Self-RAG to improve the quality and factuality of LLMs. The LM is trained to retrieve passages on demand, only when needed. It then self-checks the retrieved passages using special tokens called reflection tokens, improving the quality and factuality of its generation.

Apr 25, 2024

Tech

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Analyzes how RAG models, by incorporating external knowledge, mitigate LLM hallucination and improve performance.

Apr 25, 2024

Tech

A Graphical Approach to Document Layout Analysis

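GLAM (Graph-based Layout Analysis Model) is a lightweight graph neural network for layout analysis of PDF documents. It represents text boxes as nodes and infers the layout through node and edge classification. On the DocLayNet and PubLayNet datasets it is more efficient than existing computer vision models; its small model size and fast inference are its main strengths.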

Dec 08, 2023

Tech

ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction

ICL-D3IE is an in-context learning framework for document information extraction (DIE) that uses diverse demonstrations to improve LLM accuracy and performance. It outperforms existing methods on benchmarks such as FUNSD, CORD, and SROIE.

Dec 08, 2023

Tech

Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis

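HTS (Hierarchical Text Spotter) is a model that unifies text spotting and layout analysis, predicting accurate polygons for text lines and analyzing the text layout within a document. It achieves top performance on multiple benchmarks and has drawn attention as the first model to handle text and layout simultaneously.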

Dec 08, 2023

Tech

CoOp, CoCoOp, KgCoOp

CoOp, CoCoOp, and KgCoOp are frameworks that improve prompt optimization and generalization for vision-language models, strengthening zero-shot transfer and the handling of unseen classes.

Dec 08, 2023

Tech

Towards Zero-shot Document Query System

Explores the feasibility of a zero-shot document query system using recent multimodal language models such as BLIP, BLIP-2, LLaVA, and MiniGPT-4, comparing each model's core capabilities and applicability.

Jun 22, 2023

Tech

CLIP: Learning Transferable Visual Models From Natural Language Supervision

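This study explores the feasibility of a Zero-shot Document Query System that extracts information from new documents without additional training. It analyzes approaches based on recent multimodal language models such as BLIP, BLIP-2, LLaVA, and MiniGPT-4, along with their performance.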

Jun 22, 2023

Tech

LLaMA: Open and Efficient Foundation Language Models

Compared with existing LLMs that use far more parameters (GPT-3: 175B, PaLM: 540B), the LLaMA-13B model proposed in this paper performs well on most benchmarks with only 13B parameters.

Jun 22, 2023

Tech

COLT5: Faster Long-Range Transformers with Conditional Computation

This paper proposes COLT5, a new model built on LongT5 that restructures the attention and feedforward layers to process long inputs quickly.

Jun 22, 2023

Tech

Language Is Not All You Need: Aligning Perception with Language Models

To overcome the limitations of LLMs, the authors propose KOSMOS-1, a Multimodal Large Language Model (MLLM).

Jun 22, 2023

Tech

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

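Unless the training objective and the architecture are considered together, it is hard to claim that optimal performance has been reached. The authors therefore argue that the network architecture should be adapted to match the training method.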

Jun 22, 2023

Tech

GMN: Generative Multi-modal Network for Practical Document Information Extraction

This paper is motivated by the long-standing problems of the sequence labeling approach that has mainly been proposed for the DIE task.

Jun 22, 2023

Tech

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

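Real-time object detection is a very important topic in computer vision. When workloads such as object tracking, autonomous driving, and robotics run on edge devices, the emphasis is on faster execution. YOLOv7 targets the full range from edge to cloud, including CPUs and mobile GPUs.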

Jun 22, 2023

Tech

ULSD: Unified Line Segment Detection across Pinhole, Fisheye, and Spherical Cameras

General line detection has already been studied with deep learning, but distorted lines such as those from fisheye and spherical cameras have not yet been studied.

May 04, 2023

Tech

Scene Text Recognition with Permuted Autoregressive Sequence Models

The paper's model, PARSeq, uses permutation language modeling to train an ensemble of internal autoregressive (AR) language models (LMs) with shared weights.

May 04, 2023

Tech

BEiT-3: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

This paper introduces BEiT-3, a general-purpose multimodal foundation model that achieves SOTA performance on vision and vision-language tasks.

Oct 11, 2022

Tech

FormNet: Beyond Sequential Modeling for Form-Based Document Understanding

Form documents pose unique challenges compared with natural language documents, stemming from their structural characteristics, which make serialization errors hard to address.

Oct 11, 2022

Tech

A ConvNet for the 2020s

This study re-examines the design spaces and tests the limits of what a pure ConvNet can achieve.

Oct 11, 2022

Tech

PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks

Most OCR-based Key Information Extraction (KIE) methods use only textual and position features. To obtain a richer semantic representation, however, it can help to also use visual features and the global layout.

Oct 11, 2022

Tech

Test-Time Adaptation for Visual Document Understanding

Most Visual Document Understanding (VDU) tasks consist of self-supervised pre-training followed by fine-tuning.

Oct 11, 2022

Tech

Look Closer to Supervise Better: One-Shot Font Generation via Component-Based Discriminator

This paper proposes a Component-Aware Module (CAM) that drives the font generator to separate content and style at a more fine-grained level.

Oct 11, 2022

Tech

YOLO Detector Family - Overview and History

Too many objects in an image make it extremely crowded. This creates various challenges for an object detection model: occlusions can be large, objects can be small, and scale can be inconsistent.

Oct 11, 2022

Tech

CoCa: Contrastive Captioners are Image-Text Foundation Models

Multimodal pre-trained models applicable to a wide range of downstream tasks spanning vision and language have recently been an active area of research.

Oct 11, 2022

Tech

LinkBERT: Pretraining Language Models with Document Links

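Existing models such as BERT train only on text from within a single document and do not learn dependencies across documents. This study proposes LinkBERT, which uses links between documents to learn cross-document dependencies as well.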

Oct 11, 2022

Tech

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.

Oct 11, 2022

Tech

Diffusion Models Beat GANs on Image Synthesis

GANs have also been shown to capture less diversity than SOTA likelihood-based models. Moreover, GANs are hard to train, collapsing without carefully selected hyperparameters and regularizers.

Oct 11, 2022

Tech

PaddleOCR

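PaddleOCR is an OCR solution built on PaddlePaddle, Baidu's deep learning framework. Its latest version, PP-OCRv3, offers lightweight models and strong multilingual recognition, while PP-Structure supports layout analysis, table recognition, VQA, and more to aid document structure analysis.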

Oct 11, 2022

Tech

Exploring Plain Vision Transformer Backbones for Object Detection

Object detection is a fundamental computer vision task, typically performed by detectors comprising a task-agnostic backbone and independently developed necks and heads that incorporate detection-specific prior knowledge.

Jun 28, 2022

Tech

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Self-supervised pre-training techniques have driven much progress in Document AI. Most multimodal models pre-train the text modality with masked language modeling (MLM), but approaches to pre-training the image modality vary.

Jun 28, 2022

Tech

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

This paper answers the question: can the strengths of CNNs and ViTs be combined to build a light and fast model for mobile vision tasks?

Jun 28, 2022

Tech

LongT5: Efficient Text-To-Text Transformer for Long Sequences

Transformer models that can handle long inputs have recently achieved strong results on NLP tasks. Studies have also reported that scaling up Transformer models helps performance.

May 19, 2022

Tech

QuadTree Attention for Vision Transformers

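The quadratic complexity of Transformers is also a problem in computer vision. This paper introduces Quadtree attention, which reduces the quadratic computational complexity to linear. The Quadtree Transformer builds token pyramids and computes attention in a coarse-to-fine manner.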

Apr 20, 2022

Tech

TableFormer: Table Structure Understanding with Transformers.

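TableFormer is a Transformer-based model for table structure understanding that simultaneously predicts HTML tags and cell bounding boxes from an image. The paper also proposes SynthTabNet, a dataset covering diverse table styles and complexity.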

Apr 20, 2022

Tech

Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features

This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features to improve recognition performance.

Apr 20, 2022

Tech

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

DETR is a Transformer-based detection algorithm that treats object detection as a set prediction task and assigns labels through bipartite graph matching.

Apr 20, 2022

Tech

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

Encoder-decoder models can address several of the problems of sequence labeling.

Apr 19, 2022

Tech

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Sparse R-CNN performs object detection in a sparse manner. With its sparse-in, sparse-out design it trains quickly and performs on par with well-known one-stage and two-stage detectors.

Apr 19, 2022

Tech

Parsing Table Structures in the Wild

Builds a large-scale dataset for parsing complex table structures and introduces Cycle-CenterNet, which optimizes a cycle-pairing module with a pairing loss to accurately group the discrete cells of structured tables. It achieves SOTA table structure analysis performance on the WTW and ICDAR2019 datasets.

Apr 19, 2022

Tech

Meta Self-Learning for Multi-Source Domain Adaptation: A Benchmark

This paper introduces a new method for domain adaptation through a meta self-learning approach.

Apr 19, 2022

Tech