Most long-video models still treat motion as an afterthought, stuffing every frame into a transformer and praying for temporal coherence. GOPAgen takes a smarter route: it piggybacks on the video codec's native Groups of Pictures (GOPs) to build a motion-aware agent that actually understands what moved where, and when.
Motion Agent Trained on Codec's GOP Structure
The key insight is that video codecs already compress motion into GOP hierarchies (I-frames, P-frames, and B-frames), yet long-video agents rarely train directly on those GOPs. The GOPAgen team does exactly that: they train a motion agent on GOPs extracted from the compressed stream, giving the model a built-in motion vocabulary without custom optical flow networks. That agent feeds into a GOP tree reasoning algorithm that mirrors the codec's temporal pyramid, letting the model zoom from coarse scene changes down to specific object displacements.
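To make the GOP tree idea concrete, here is a minimal sketch of how such a hierarchy could be assembled from a stream's frame types. It is not the paper's algorithm: `GopNode`, `split_into_gops`, `build_tree`, and the `fanout` parameter are hypothetical names, and the sketch assumes a closed-GOP stream in which every GOP starts at an I-frame.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GopNode:
    """Hypothetical tree node covering frames start..end (both inclusive)."""
    start: int
    end: int
    children: List["GopNode"] = field(default_factory=list)


def split_into_gops(frame_types: str) -> List[GopNode]:
    """Cut a frame-type string such as 'IPBBPIPB' into GOPs.

    Assumption: each GOP begins at an I-frame and runs until the next one.
    """
    starts = [i for i, t in enumerate(frame_types) if t == "I"]
    if not starts or starts[0] != 0:
        raise ValueError("stream must begin with an I-frame")
    bounds = starts + [len(frame_types)]
    return [GopNode(s, e - 1) for s, e in zip(bounds, bounds[1:])]


def build_tree(leaves: List[GopNode], fanout: int = 2) -> GopNode:
    """Merge adjacent nodes `fanout` at a time until one root remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not leaves:
        raise ValueError("need at least one GOP")
    level = leaves
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [GopNode(g[0].start, g[-1].end, g) for g in groups]
    return level[0]


gops = split_into_gops("IPBBPIPBIPPP")
root = build_tree(gops)
```

Each leaf is one GOP, and each parent covers the union of its children's frame ranges, so walking down from the root narrows the time window step by step. A real system would read frame types and motion vectors from the decoder rather than from a string.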
Structural Memory with Coarse-to-Fine Zoom-In
GOPAgen pairs this agent with a structural memory mechanism that stores local motion information alongside dense captions in structured pages. Inference runs a coarse-to-fine zoom-in algorithm that first retrieves relevant motion vectors from a dedicated database, then drills into the most informative GOPs. This avoids the quadratic cost of full-frame attention over minutes of video. The motion vector database supports retrieval at multiple granularities, so the model can answer “did the car turn left at 3:12?” without reprocessing the whole clip.
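The zoom-in step can be sketched as a greedy descent over a tree of memory pages. This illustrates the general coarse-to-fine pattern rather than GOPAgen's actual retrieval code: the `Page` structure, the pooled `motion` descriptor, the dot-product `score`, and the `beam` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Hypothetical memory page: a motion descriptor plus a caption."""
    label: str
    motion: List[float]          # assumed pooled motion descriptor
    caption: str = ""
    children: List["Page"] = field(default_factory=list)


def score(query: List[float], page: Page) -> float:
    """Dot-product relevance between a query vector and a page's motion."""
    return sum(q * m for q, m in zip(query, page.motion))


def zoom_in(root: Page, query: List[float], beam: int = 1) -> List[Page]:
    """Descend level by level, keeping only the `beam` best pages each time."""
    frontier = [root]
    while any(p.children for p in frontier):
        candidates: List[Page] = []
        for p in frontier:
            # Expand internal pages; pages that are already leaves stay put.
            candidates.extend(p.children if p.children else [p])
        candidates.sort(key=lambda p: score(query, p), reverse=True)
        frontier = candidates[:beam]
    return frontier


a1 = Page("gop-0", [0.9, 0.1], "car turns left")
a2 = Page("gop-1", [0.2, 0.0], "car idles")
b1 = Page("gop-2", [0.0, 0.8], "camera tilts up")
scene_a = Page("scene-a", [1.0, 0.0], children=[a1, a2])
scene_b = Page("scene-b", [0.0, 1.0], children=[b1])
video = Page("video", [0.5, 0.5], children=[scene_a, scene_b])

hits = zoom_in(video, query=[1.0, 0.0])
```

With a fixed beam, the search scores only the children of the pages it keeps at each level, so the work grows with the tree's depth and branching factor rather than with the total number of frames.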
On benchmarks, the authors report superior Video Question Answering performance on both MotionBench (designed for motion-heavy queries) and EgoSchema (egocentric activity understanding). The abstract gives no absolute numbers, but the architectural shift is what matters: by aligning the reasoning hierarchy with the video codec's own compression hierarchy, the design aims to cut the memory and compute overhead that plagues existing long-video agents.
This approach is a direct challenge to the frame-stacking status quo. The next real test will be whether the motion vector database scales to hour-long video without index blowup, but the principle of leveraging codec internals is too clean to ignore.
Source: GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning
Domain: arxiv.org