GOPAgen Taps the Video Codec's GOP Structure for Long-Video Understanding

By training a motion agent on Groups of Pictures (GOPs) taken from the video codec, GOPAgen achieves superior VQA on MotionBench and EgoSchema without brute-force compute.

gopagen, video understanding, motion agent, motionbench, egoschema, video codec

Most long-video models still treat motion as an afterthought, stuffing every frame into a transformer and praying for temporal coherence. GOPAgen takes a smarter route: it piggybacks on the video codec's native Groups of Pictures (GOPs) to build a motion-aware agent that actually understands what moved, where, and when.

Motion Agent Trained on Codec's GOP Structure

The key insight is that video codecs already compress motion into GOP hierarchies (I-frames, P-frames, B-frames), but vision agents have rarely been trained directly on those GOPs. The GOPAgen team does exactly that: they train a motion agent on GOPs extracted from the compressed stream, giving the model a built-in motion vocabulary without custom optical flow networks. That agent feeds into a GOP tree reasoning algorithm that mirrors the codec's temporal pyramid, letting the model zoom from coarse scene changes down to specific object displacements.
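To make the GOP tree idea concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the names `GopNode`, `split_into_gops`, and `build_tree`, the fixed fan-out, and the rule "a new GOP starts at every I-frame" are illustrative assumptions. Real streams can contain open GOPs and reordered B-frames, which this ignores.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GopNode:
    """A span of frames; leaves are single GOPs, inner nodes merge neighbours."""
    start: int  # index of the first frame covered (inclusive)
    end: int    # index of the last frame covered (inclusive)
    children: List["GopNode"] = field(default_factory=list)


def split_into_gops(frame_types: str) -> List[GopNode]:
    """Split a frame-type string such as 'IPBBIPP' into GOPs.

    Assumption: every I-frame opens a new GOP (closed GOPs, decode order).
    """
    gops: List[GopNode] = []
    for i, frame_type in enumerate(frame_types):
        if frame_type == "I" or not gops:
            gops.append(GopNode(i, i))
        else:
            gops[-1].end = i  # extend the current GOP with a P- or B-frame
    return gops


def build_tree(nodes: List[GopNode], fanout: int = 2) -> GopNode:
    """Merge adjacent nodes bottom-up until one root spans the whole video."""
    if not nodes:
        raise ValueError("need at least one GOP")
    while len(nodes) > 1:
        groups = [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]
        nodes = [GopNode(g[0].start, g[-1].end, g) for g in groups]
    return nodes[0]
```

A reasoning step can then start at the root (the whole clip) and descend only into the children whose frame spans look relevant, which is the coarse-to-fine pattern the article describes.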

Structural Memory with Coarse-to-Fine Zoom-In

GOPAgen couples this agent with a structural memory mechanism that stores local motion information alongside dense captions in structured pages. Inference runs a coarse-to-fine zoom-in algorithm that first retrieves relevant motion vectors from a dedicated database, then drills into the most informative GOPs. This avoids the quadratic cost of full-frame attention over minutes of video. The motion vector database supports retrieval at multiple granularities, so the model can answer “did the car turn left at 3:12?” without reprocessing the whole clip.
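The following Python sketch illustrates one way a coarse-to-fine lookup over structured pages could work. Everything in it is an assumption made for illustration: the `Page` and `GopEntry` layouts, the word-overlap scoring in `coarse_select`, and the direction projection in `fine_select` are hypothetical stand-ins, not the retrieval method from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GopEntry:
    start_s: float                    # GOP start time in seconds
    end_s: float                      # GOP end time in seconds
    mean_motion: Tuple[float, float]  # average (dx, dy) motion vector, image coords


@dataclass
class Page:
    caption: str          # dense caption for this stretch of video
    gops: List[GopEntry]  # local motion information stored next to the caption


def coarse_select(pages: List[Page], query: str, k: int = 1) -> List[Page]:
    """Coarse step: rank pages by word overlap between query and caption."""
    q = set(query.lower().split())
    ranked = sorted(
        pages,
        key=lambda p: len(q & set(p.caption.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def fine_select(page: Page, direction: Tuple[float, float]) -> GopEntry:
    """Fine step: pick the GOP whose mean motion best matches a direction."""
    return max(
        page.gops,
        key=lambda g: g.mean_motion[0] * direction[0] + g.mean_motion[1] * direction[1],
    )
```

For the article's example question, a caller would run `coarse_select` with the query text, then call `fine_select` with `(-1.0, 0.0)` for leftward motion in image coordinates. Only the selected page's GOPs are inspected, so the rest of the clip is never reprocessed.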

On benchmarks, the authors report superior Video Question Answering (VQA) performance on both MotionBench (designed for motion-heavy queries) and EgoSchema (egocentric activity understanding). The abstract gives no absolute numbers, but the architectural shift is what matters: by aligning the reasoning hierarchy with the video codec's own compression hierarchy, the design aims to cut the memory and compute overhead that plagues existing long-video agents.

This approach is a direct challenge to the frame-stacking status quo. The next real test will be whether the motion vector database scales to hour-long video without index blowup, but the principle of leveraging codec internals is too clean to ignore.


Source: GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning
Domain: arxiv.org
