LCG Solves Multi-Image Consistency with Sparse Relational Attention

600,000 training sequences, each with up to 20 images: this framework uses a Sparse Relational Attention mechanism and routing constraints to keep characters consistent across visual narratives.

600,000 training sequences, each containing 6 to 20 images - that’s the scale of the Long-Context Consistency Dataset (LCCD) backing the new LCG framework for multi-image generation. Single-image models are great at producing one beautiful frame, but put them in a sequence and characters shift, backgrounds warp, and the narrative coherence collapses. LCG goes after that problem with two targeted mechanisms.

Where Single-Image Models Fall Short

Comics, storyboards, and visual narratives demand consistent identity and layout across dozens of images. Current text-to-image generators treat each image independently, so a character’s face, clothing, and pose drift from panel to panel. LCG treats the entire sequence as one long-context generation task, not a series of isolated prompts.

Sparse Relational Attention and the Routing Constraint

LCG’s first trick is Sparse Relational Attention (SRA). Letting every image attend to every other image makes the cost grow quadratically with the total length of the sequence. SRA instead attends selectively to core features across the extended visual context, which keeps semantic and layout propagation computationally tractable even for sequences of 20 images.
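To make the saving concrete, here is a minimal Python sketch of one possible sparsity pattern. The article does not say how SRA picks its core features, so the rule used here (each token sees its own image in full, plus a fixed set of "core" token positions in every other image) and the names `build_sparse_mask` and `core_positions` are illustrative assumptions, not the authors' implementation.

```python
def build_sparse_mask(num_images, tokens_per_image, core_positions):
    """Return a boolean attention mask over all tokens in the sequence.

    mask[q][k] is True when query token q may attend to key token k.
    Assumed rule (hypothetical): full attention inside an image, and
    attention only to `core_positions` of every other image.
    """
    total = num_images * tokens_per_image
    core = set(core_positions)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_img = q // tokens_per_image
        for k in range(total):
            k_img, k_tok = divmod(k, tokens_per_image)
            mask[q][k] = (k_img == q_img) or (k_tok in core)
    return mask


def count_allowed(mask):
    """Number of query-key pairs the mask lets through."""
    return sum(sum(row) for row in mask)


# Toy sizes: 4 images, 8 tokens each, 2 core tokens per image.
mask = build_sparse_mask(num_images=4, tokens_per_image=8, core_positions=[0, 1])
sparse_pairs = count_allowed(mask)   # 32 * (8 + 3 * 2) = 448
dense_pairs = (4 * 8) ** 2           # 1024
print(sparse_pairs, dense_pairs)
```

Under this assumed pattern the number of attended pairs is N * T * (T + (N - 1) * k) rather than (N * T)^2, so the cross-image term grows with the small core size k instead of the full token count T.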

The second piece is the Routing Consistency Constraint (RCC). RCC uses identity-aware masks to align structural patterns across different generation branches. When multiple characters appear, RCC prevents appearance drift by enforcing that the same character shares geometric and textural features across frames. No more random shirt color changes between panels.
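The article does not give RCC's exact formulation. As a rough illustration of what an identity-masked consistency penalty could look like, the sketch below averages the features under one character's mask in each frame and penalizes how far those per-frame averages spread apart. The function names, the flat list-of-vectors feature layout, and the squared-distance penalty are all assumptions made for this example.

```python
def masked_mean(features, mask):
    """Mean feature vector over positions where mask is 1, or None if empty."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def identity_consistency_penalty(frames, masks):
    """Spread of one character's mean feature across frames (hypothetical loss).

    frames: per-frame lists of feature vectors (one vector per position).
    masks:  per-frame 0/1 lists marking where the character appears.
    Returns the mean squared distance of per-frame means from their centroid.
    """
    means = [masked_mean(f, m) for f, m in zip(frames, masks)]
    means = [m for m in means if m is not None]  # skip frames without the character
    if len(means) < 2:
        return 0.0
    dim = len(means[0])
    center = [sum(m[d] for m in means) / len(means) for d in range(dim)]
    return sum(
        sum((m[d] - center[d]) ** 2 for d in range(dim)) for m in means
    ) / len(means)


frame_a = [[1.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
frame_b = [[1.0, 0.0], [9.0, 9.0], [1.0, 0.0]]
drifted = [[3.0, 0.0], [3.0, 0.0], [5.0, 5.0]]

consistent = identity_consistency_penalty([frame_a, frame_b], [[1, 1, 0], [1, 0, 1]])
inconsistent = identity_consistency_penalty([frame_a, drifted], [[1, 1, 0], [1, 1, 0]])
print(consistent, inconsistent)
```

The mask is what makes the penalty identity-aware: background positions (the `[5.0, 5.0]` and `[9.0, 9.0]` vectors above) never enter the average, so only changes in the character's own features are penalized.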

A Dataset Built for This Task

LCCD is synthetic, deliberately constructed to cover varied situational contexts with character-centric multi-image sequences. It comprises 600K training sequences and a held-out test set of 1K sequences, with each sequence ranging from 6 to 20 images. That’s enough data to train and evaluate consistency at scale. The authors report that LCG outperforms baseline methods on both prompt alignment and character consistency, including scenes with multiple characters.
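The sequence counts and length range imply bounds on the total number of images, which the article does not state directly. A quick back-of-the-envelope calculation, with constant and function names chosen only for this example:

```python
TRAIN_SEQUENCES = 600_000
TEST_SEQUENCES = 1_000
MIN_IMAGES, MAX_IMAGES = 6, 20  # images per sequence, as reported


def image_count_bounds(num_sequences):
    """Lower and upper bound on total images for a given number of sequences."""
    return num_sequences * MIN_IMAGES, num_sequences * MAX_IMAGES


train_bounds = image_count_bounds(TRAIN_SEQUENCES)  # (3_600_000, 12_000_000)
test_bounds = image_count_bounds(TEST_SEQUENCES)    # (6_000, 20_000)
print(train_bounds, test_bounds)
```

So the training split holds somewhere between 3.6 million and 12 million images. The actual total depends on the distribution of sequence lengths, which is not given here.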

What this enables is straightforward: reproducible, consistent character appearance across long visual narratives, without hand-tweaking each frame. That has been a missing piece for automated storyboarding and comic generation, and LCG is a concrete step toward filling the gap.


Source: LCG: Long-Context Consistent Image Generation with Sparse Relational Attention
Domain: arxiv.org
