OmniMem Trims the KV Cache for Audio-Visual LLMs with an Accuracy Gain of Up to 4%

OmniMem's perturbation-aware memory compression improves the accuracy of streaming audio-visual LLMs by 2 to 4% under the same memory budget, with a further 1 to 2% after fine-tuning.

OmniMem wrings up to 4% extra absolute accuracy out of streaming audio-visual LLMs on long-form video benchmarks without using more memory. It does so by treating visual and audio tokens as separate problems and keeping only the states that actually matter.

Audio-visual LLMs choke on long video because every new frame adds tokens and bloats the key-value cache linearly. Existing compression methods treat all tokens uniformly, which is a bad bet when audio streams are sparse and visual feeds are dense. OmniMem's core insight: allocate memory budget per modality, then pick which KV states survive using a perturbation-aware selection algorithm that preserves informative, non-redundant content.
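To make the linear growth concrete, here is a minimal back-of-the-envelope sketch. The function name and every model dimension in it (layer count, KV heads, head size, token rate) are illustrative assumptions, not figures taken from the paper or from any specific model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size for a decoder-only transformer.

    The leading factor of 2 counts one key vector and one value vector
    per token, per layer, per KV head. bytes_per_value=2 assumes fp16/bf16.
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
TOKENS_PER_SECOND = 100  # assumed combined audio + visual token rate

for minutes in (5, 30, 60):
    tokens = minutes * 60 * TOKENS_PER_SECOND
    size_gb = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 1e9
    print(f"{minutes:>3} min -> {tokens:>7} tokens -> {size_gb:.2f} GB of KV cache")
```

Whatever the real dimensions are, the cache size is a constant times the token count, so doubling the video length doubles the memory unless something is evicted.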

Modality-Aware Allocation Fixes Imbalance

Vanilla compression strategies don't account for the fact that a 5-minute video might have 10x more visual tokens than audio tokens. OmniMem separates the two contexts and assigns memory budgets proportional to each modality's actual token count and redundancy. This alone prevents the visual stream from starving the audio channel of retention slots.
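One way to picture a per-modality budget split is the sketch below. It is an assumption-laden illustration, not the paper's formula: the weighting rule (token count times one minus a redundancy score), the function name `allocate_budget`, and the example numbers are all hypothetical.

```python
def allocate_budget(total_slots, token_counts, redundancy):
    """Split a fixed number of KV slots across modalities.

    Assumed rule (not from the paper): each modality's weight is its
    token count scaled by (1 - redundancy), where redundancy is a score
    in [0, 1]. Slots are divided in proportion to the weights, and the
    largest-remainder method keeps the integer budgets summing to
    total_slots.
    """
    weights = {m: token_counts[m] * (1.0 - redundancy[m]) for m in token_counts}
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one modality needs a positive weight")

    exact = {m: total_slots * w / total_weight for m, w in weights.items()}
    budget = {m: int(share) for m, share in exact.items()}

    # Hand out slots lost to rounding, largest fractional part first.
    leftover = total_slots - sum(budget.values())
    by_remainder = sorted(exact, key=lambda m: exact[m] - budget[m], reverse=True)
    for m in by_remainder[:leftover]:
        budget[m] += 1
    return budget


# Hypothetical stream: 10x more visual tokens, but far more redundant.
budget = allocate_budget(
    total_slots=1000,
    token_counts={"visual": 10_000, "audio": 1_000},
    redundancy={"visual": 0.8, "audio": 0.2},
)
print(budget)
```

Under these made-up numbers the audio stream holds about 9% of the tokens but receives more than a quarter of the slots, because its tokens are scored as less redundant. That is the kind of rebalancing the paragraph above describes.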

The authors evaluate on VideoMME Long, LVBench, and LVOmniBench using video-SALMONN 2+ and Qwen-2.5-Omni. Across the board, OmniMem outperforms strong training-free baselines (e.g., uniform KV dropping, random eviction) by 2–4% absolute accuracy under identical memory constraints.

Perturbation Detection Picks What to Keep

Not all KV states deserve immortality. OmniMem measures how much a given state's removal perturbs the model's output distribution. States whose removal causes the least perturbation—or that are redundant with others—get evicted first. The retained set stays compact but still captures long-range dependencies. No hand-tuned heuristics, no modality-blind truncation.
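The eviction idea can be sketched on a toy single-query attention layer. This is only an illustration under stated assumptions: it uses the attention output as a stand-in for the model's output distribution, scores each state by brute-force leave-one-out, and evicts greedily. The paper's actual scoring and selection procedure may differ, and the function names here are hypothetical.

```python
import math


def attention_output(query, keys, values):
    """Softmax attention of one query over the given key/value states."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def select_states(query, keys, values, budget):
    """Greedily evict the state whose removal shifts the output least."""
    if budget < 1:
        raise ValueError("budget must keep at least one state")
    kept = list(range(len(keys)))
    while len(kept) > budget:
        reference = attention_output(
            query, [keys[i] for i in kept], [values[i] for i in kept]
        )
        best_idx, best_shift = None, float("inf")
        for candidate in kept:
            rest = [i for i in kept if i != candidate]
            out = attention_output(
                query, [keys[i] for i in rest], [values[i] for i in rest]
            )
            shift = math.dist(reference, out)  # perturbation caused by removal
            if shift < best_shift:
                best_idx, best_shift = candidate, shift
        kept.remove(best_idx)
    return kept


# Toy cache: states 0 and 1 are duplicates, state 2 is barely attended to,
# state 3 is strongly attended to and carries a distinct value.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [4.0, 0.0], [-4.0, 0.0], [4.0, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(select_states(query, keys, values, budget=2))
```

In this toy cache the barely attended state goes first, then one of the two duplicates, so the surviving pair still covers both distinct values. The leave-one-out loop is quadratic in the cache size and is written for clarity only; a practical system would need a much cheaper perturbation estimate.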

Fine-Tuning Squeezes Out Another 1–2%

Compression alone gets you most of the way, but OmniMem also offers a budget-aware fine-tuning stage. The model is trained to consolidate useful information into the retained memory slots, effectively learning to be a better packer. That adds 1–2% more accuracy on top of the training-free version, narrowing the gap to full-memory inference without enlarging the cache.

What this means practically: streaming audio-visual models should be able to handle far longer video feeds on the same hardware, since accuracy holds up better under a fixed memory budget. The next step is seeing how far the same perturbation-aware logic generalizes to other modalities, or to autoregressive models beyond the LLM stack.


Source: OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs
Domain: arxiv.org
