1,820 Annotated Tables, 9 Languages: New Benchmark Points to Extraction Errors

Nearly half the tables in PulseBench-Tab contain merged cells; most commercial extractors score below 60% on the new graph-based metric.

pulsebench tab, table extraction, multilingual benchmark, document intelligence, graph evaluation

48.1% of tables in the new PulseBench-Tab benchmark have merged or spanning cells — a structural quirk that flattens most commercial extractors.

Existing table extraction benchmarks dodge hard cases. PulseBench-Tab doesn't. Its 1,820 human-annotated tables come from 380 real-world documents: financial filings, government reports, and regulatory disclosures. Nine languages across Latin, CJK, Arabic, and Cyrillic scripts. Table sizes range from 2 cells to 1,183. No cherry-picked clean layouts.

T-LAG Turns Tables Into Graphs

The authors propose T-LAG (Table Logical Adjacency Graph), an evaluation metric that models each table as a directed graph over cell adjacencies. Instead of pixel-level overlap or cell-by-cell F1, T-LAG computes structural and content fidelity in a single score via optimal bipartite matching. If an extractor misreads a column span or drops a row boundary, the graph mismatch shows up immediately. Traditional cell-level metrics score that as a single error; T-LAG sees the ripple effect.
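
To make the idea concrete, here is a minimal Python sketch of an adjacency-graph comparison. It is not the paper's T-LAG implementation: the `Cell` type, the `adjacency_edges` and `edge_f1` helpers, and the toy table are all hypothetical, and cells are matched by exact text rather than by the optimal bipartite matching the authors describe. Because edges are keyed by text, tables with repeated cell contents would need a real matching step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    # Hypothetical cell record: text plus its top-left grid slot and spans.
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


def occupancy(cells):
    """Map every (row, col) grid slot to the cell that covers it."""
    grid = {}
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[(r, c)] = cell
    return grid


def adjacency_edges(cells):
    """Directed (source text, direction, target text) edges between neighbours."""
    grid = occupancy(cells)
    edges = set()
    for (r, c), cell in grid.items():
        for direction, slot in (("right", (r, c + 1)), ("down", (r + 1, c))):
            neighbour = grid.get(slot)
            # Skip slots outside the table and slots covered by the same cell.
            if neighbour is not None and neighbour is not cell:
                edges.add((cell.text, direction, neighbour.text))
    return edges


def edge_f1(truth, predicted):
    """F1 over adjacency edges; a simplified stand-in for a graph-based score."""
    t, p = adjacency_edges(truth), adjacency_edges(predicted)
    if not t and not p:
        return 1.0
    hits = len(t & p)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(p), hits / len(t)
    return 2 * precision * recall / (precision + recall)


# Toy ground truth: "Revenue" spans two columns above "10" and "12".
truth = [
    Cell("Region", 0, 0), Cell("Revenue", 0, 1, colspan=2),
    Cell("EMEA", 1, 0), Cell("10", 1, 1), Cell("12", 1, 2),
]
# Toy prediction: the span is missed and an empty cell appears instead.
predicted = [
    Cell("Region", 0, 0), Cell("Revenue", 0, 1), Cell("", 0, 2),
    Cell("EMEA", 1, 0), Cell("10", 1, 1), Cell("12", 1, 2),
]

print(round(edge_f1(truth, predicted), 3))
```

In this toy case every cell's text is extracted correctly, yet the missed column span removes one true edge ("Revenue" above "12") and adds two spurious ones, so the edge-level score falls to 10/13, about 0.77. That is the kind of ripple a cell-by-cell comparison would not register.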

Nine Systems, Uneven Results

PulseBench-Tab evaluates 9 commercial and open-source table extraction systems. The paper reports per-language breakdowns. Systems that perform well on clean English PDFs drop sharply on Arabic or CJK documents with merged cells. Nearly half the benchmark tables contain spans — the kind of layout that breaks naive row-column parsers.

The full dataset, scoring code, and all provider outputs are publicly available on GitHub. Teams can now pinpoint exactly which cell-relationships their extractors get wrong — and that's where the next generation of table parsers will improve.


Source: PulseBench-Tab: A Multilingual Benchmark for Table Extraction with Graph-Based Evaluation
Domain: arxiv.org
