48.1% of tables in the new PulseBench-Tab benchmark have merged or spanning cells — a structural quirk that flattens most commercial extractors.
Existing table extraction benchmarks dodge hard cases. PulseBench-Tab doesn't. Its 1,820 human-annotated tables come from 380 real-world documents: financial filings, government reports, and regulatory disclosures. Nine languages across Latin, CJK, Arabic, and Cyrillic scripts. Table sizes range from 2 cells to 1,183. No cherry-picked clean layouts.
T-LAG Turns Tables Into Graphs
The authors propose T-LAG (Table Logical Adjacency Graph), an evaluation metric that models each table as a directed graph over cell adjacencies. Instead of pixel-level overlap or cell-by-cell F1, T-LAG computes structural and content fidelity in a single score via optimal bipartite matching. If an extractor misreads a column span or drops a row boundary, the graph mismatch shows up immediately. Traditional cell-level metrics score that as a single error; T-LAG sees the ripple effect.
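To make the idea concrete, here is a minimal sketch of a graph-based table comparison. It is not the paper's implementation: the `(row, col, rowspan, colspan, text)` cell layout, the function names, the brute-force matching, and the similarity-weighted F1 over edges are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical cell layout: (row, col, rowspan, colspan, text).

def adjacency_edges(cells):
    """Directed (source, target, direction) edges between touching cells."""
    grid = {}  # grid position -> index of the cell covering it
    for idx, (r, c, rs, cs, _) in enumerate(cells):
        for dr in range(rs):
            for dc in range(cs):
                grid[(r + dr, c + dc)] = idx
    edges = set()
    for (r, c), i in grid.items():
        for nr, nc, direction in ((r, c + 1, "right"), (r + 1, c, "down")):
            j = grid.get((nr, nc))
            if j is not None and j != i:
                edges.add((i, j, direction))
    return edges

def text_sim(a, b):
    """Content similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def best_matching(gt, pred):
    """Optimal bipartite matching by brute force (tiny tables only)."""
    slots = list(range(len(pred))) + [None] * max(0, len(gt) - len(pred))
    best, best_total = {}, -1.0
    for perm in permutations(slots, len(gt)):
        total = sum(text_sim(gt[i][4], pred[j][4])
                    for i, j in enumerate(perm) if j is not None)
        if total > best_total:
            best_total = total
            best = {i: j for i, j in enumerate(perm) if j is not None}
    return best

def graph_score(gt, pred):
    """Similarity-weighted F1 over adjacency edges (an assumed scoring rule)."""
    match = best_matching(gt, pred)
    gt_edges, pred_edges = adjacency_edges(gt), adjacency_edges(pred)
    if not gt_edges and not pred_edges:
        return 1.0
    hit = 0.0
    for i, j, d in gt_edges:
        pi, pj = match.get(i), match.get(j)
        if pi is not None and pj is not None and (pi, pj, d) in pred_edges:
            hit += text_sim(gt[i][4], pred[pi][4]) * text_sim(gt[j][4], pred[pj][4])
    if hit == 0:
        return 0.0
    precision, recall = hit / len(pred_edges), hit / len(gt_edges)
    return 2 * precision * recall / (precision + recall)

# Ground truth: "Revenue" spans two columns. Prediction: the span is lost.
gt = [(0, 0, 1, 2, "Revenue"), (1, 0, 1, 1, "2023"), (1, 1, 1, 1, "2024")]
pred = [(0, 0, 1, 1, "Revenue"), (1, 0, 1, 1, "2023"), (1, 1, 1, 1, "2024")]

perfect = graph_score(gt, gt)
degraded = graph_score(gt, pred)
```

In this toy case every cell's text is extracted correctly, so a content-only comparison would report no error, yet the lost column span removes the edge from "Revenue" down to "2024" and pulls `degraded` below `perfect`. The brute-force search is only workable for a handful of cells; a realistic scorer would use a polynomial-time assignment solver such as SciPy's `linear_sum_assignment`.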
Nine Systems, Uneven Results
PulseBench-Tab evaluates nine commercial and open-source table extraction systems and reports per-language breakdowns. Systems that perform well on clean English PDFs drop sharply on Arabic or CJK documents with merged cells. Spans appear in 48.1% of the benchmark's tables, and they are exactly the kind of layout that breaks naive row-column parsers.
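A small sketch shows why spans trip up a naive parser. The table contents and the `(text, colspan)` row format are hypothetical, and the example handles column spans only.

```python
# Each row is a list of (text, colspan) pairs; the data is made up.
rows = [
    [("Region", 1), ("Revenue", 2)],
    [("", 1), ("2023", 1), ("2024", 1)],
    [("EMEA", 1), ("10", 1), ("12", 1)],
]

def naive_grid(rows):
    """Ignore spans: treat every cell as exactly one column."""
    return [[text for text, _ in row] for row in rows]

def span_aware_grid(rows):
    """Repeat each cell's text across every column it spans."""
    return [[text for text, span in row for _ in range(span)] for row in rows]

naive = naive_grid(rows)
aware = span_aware_grid(rows)

# Distinct row widths: more than one value means columns no longer line up.
naive_widths = {len(row) for row in naive}
aware_widths = {len(row) for row in aware}
```

The naive grid ends up ragged, with a two-column header over three-column data rows, so "2024" has no header above it. The span-aware grid keeps every row at the same width.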
The full dataset, scoring code, and all provider outputs are publicly available on GitHub. Teams can now pinpoint exactly which cell relationships their extractors get wrong, and that's where the next generation of table parsers will improve.
Source: PulseBench-Tab: A Multilingual Benchmark for Table Extraction with Graph-Based Evaluation
Domain: arxiv.org