Source linked

Baidu's Unlimited-OCR parse des PDF de 30 pages en un seul coup, pas de chunking

Le nouveau modèle OCR de Baidu traite des documents jusqu'à 32 768 jetons, traitant des PDF multi-page en tant que séquence d'image unique sans division au niveau de la page.

baiduunlimited ocrdeepseek ocrocrdocument parsinglarge language models

Baidu's Unlimited-OCR swallows a 30-page PDF as a single image sequence and spits back the parsed text - no page-level chunking, no sliding window nonsense.

That 32,768-token context window is the key. Most OCR pipelines break documents into pages, run each through a model, then stitch results together. Unlimited-OCR treats the whole document as one long-horizon parsing problem, feeding up to 32,768 tokens worth of image and text in a single forward pass. The README shows a max_length of 32768 and ngram_window up to 1024 for multi-page jobs.

Two Modes: Gundam for Dense Pages, Base for Uniform Docs

Baidu ships two inference configurations. The "gundam" mode uses base_size=1024, image_size=640, crop_mode=True - designed for densely packed layouts like scientific papers or forms. It crops the image into overlapping patches then reconstructs the parse. The "base" mode (image_size=1024, crop_mode=False) handles whole-page images without cropping, suitable for uniform documents and PDF-to-image conversions.

Single images use one config; multi-page or PDF parsing forces base mode with image_size=1024. The pdf_to_images helper converts each page at 300 DPI via PyMuPDF, then infer_multi processes them as a batch - but importantly, the model sees all pages together, not one at a time.

Built on Deepseek-OCR, Pushed Further

Baidu explicitly calls Unlimited-OCR an attempt to "push Deepseek-OCR one step further." The model uses a custom logit processor (DeepseekOCRNoRepeatNGramLogitProcessor) with no_repeat_ngram_size=35 and adjustable ngram_window to suppress repetitive text - a common failure mode when decoding long documents. The ngram_window defaults to 128 for single images and 1024 for multi-page inference, reflecting the longer context.

Inference is supported via Huggingface Transformers (with trust_remote_code=True) and via SGLang for lower latency. The SGLang setup includes a custom logit processor and streaming API compatible with OpenAI's chat completions endpoint.

The paper is up on arXiv, and the model is available on ModelScope. Baidu isn't just releasing weights; they're giving engineers two concrete deployment paths and a clear recommendation for when to use each mode. That's the level of detail that makes a tool actually usable.

One-shot long-horizon parsing means the end of page-level glue code for document workflows. Next step: see how it holds up against a 100-page legal brief.


Source: Unlimited OCR: One-Shot Long-Horizon Parsing
Domain: github.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.