Source linked

Baidu's Unlimited-OCR Parses 30-Seiten-PDFs in einem Schuss, kein Chunking

Das neue OCR-Modell von Baidu verarbeitet Dokumente mit bis zu 32.768 Token und verarbeitet mehrseitige PDFs als eine einzelne Bildsequenz ohne Aufteilung auf Seitenebene.

baiduunlimited ocrdeepseek ocrocrdocument parsinglarge language models

Baidu's Unlimited-OCR swallows a 30-page PDF as a single image sequence and spits back the parsed text - no page-level chunking, no sliding window nonsense.

That 32,768-token context window is the key. Most OCR pipelines break documents into pages, run each through a model, then stitch results together. Unlimited-OCR treats the whole document as one long-horizon parsing problem, feeding up to 32,768 tokens worth of image and text in a single forward pass. The README shows a max_length of 32768 and ngram_window up to 1024 for multi-page jobs.

Two Modes: Gundam for Dense Pages, Base for Uniform Docs

Baidu ships two inference configurations. The "gundam" mode uses base_size=1024, image_size=640, crop_mode=True - designed for densely packed layouts like scientific papers or forms. It crops the image into overlapping patches then reconstructs the parse. The "base" mode (image_size=1024, crop_mode=False) handles whole-page images without cropping, suitable for uniform documents and PDF-to-image conversions.

Single images use one config; multi-page or PDF parsing forces base mode with image_size=1024. The pdf_to_images helper converts each page at 300 DPI via PyMuPDF, then infer_multi processes them as a batch - but importantly, the model sees all pages together, not one at a time.

Built on Deepseek-OCR, Pushed Further

Baidu explicitly calls Unlimited-OCR an attempt to "push Deepseek-OCR one step further." The model uses a custom logit processor (DeepseekOCRNoRepeatNGramLogitProcessor) with no_repeat_ngram_size=35 and adjustable ngram_window to suppress repetitive text - a common failure mode when decoding long documents. The ngram_window defaults to 128 for single images and 1024 for multi-page inference, reflecting the longer context.

Inference is supported via Huggingface Transformers (with trust_remote_code=True) and via SGLang for lower latency. The SGLang setup includes a custom logit processor and streaming API compatible with OpenAI's chat completions endpoint.

The paper is up on arXiv, and the model is available on ModelScope. Baidu isn't just releasing weights; they're giving engineers two concrete deployment paths and a clear recommendation for when to use each mode. That's the level of detail that makes a tool actually usable.

One-shot long-horizon parsing means the end of page-level glue code for document workflows. Next step: see how it holds up against a 100-page legal brief.


Source: Unlimited OCR: One-Shot Long-Horizon Parsing
Domain: github.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.