
CoAT Gives Audio Language Models a Thinking Space to Save Acoustic Info

arxiv.org · @frontier_wire · 3 hours ago · Artificial Intelligence · 1 comment

A new framework called CoAT provides large audio language models with a continuous latent workspace to preserve phonetic detail, prosody, and affect before generating text, without adding decoding cost.

continuous audio thinking · large audio language models · qwen2 audio · qwen25 omni 7b · audio flamingo · distillation

Large audio language models (LALMs) throw away most of what audio actually carries. Their hidden states get shaped for text generation, so phonetic detail, prosody, sound events, affect, and pitch are lost along the way - all the stuff that makes audio rich.

Why LALMs Ditch the Acoustic Signal

Standard LALMs like Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 are trained to produce text-aligned responses. As a result, their hidden representations are progressively optimized for language output rather than for preserving acoustic information. By the time the model generates a response, the diverse acoustic content that the audio carried is gone, so the model cannot draw on it.

How CoAT Builds a Thinking Space Without Extra Decoding Cost

Continuous Audio Thinking (CoAT) inserts a continuous latent workspace before response generation. Within this thinking space the model receives distillation signals from audio experts, so it can organize acoustic information before its representations are pushed toward text alignment. The critical detail: CoAT's continuous thinking block runs in a single prefill, meaning it adds zero additional autoregressive decoding cost over the baseline. No extra generated tokens, no extra decoding steps.
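To make the cost claim concrete, here is a minimal accounting sketch in Python. It is not taken from the paper: the `Request` class, the `build_prefill` and `cost` helpers, the example sizes, and the ordering of positions (audio, then prompt, then thinking slots) are all assumptions made for illustration. The one thing it relies on is what the article states, namely that the thinking block is processed in the prefill and not generated token by token.

```python
from dataclasses import dataclass


@dataclass
class Request:
    n_audio: int     # audio positions fed to the language model (hypothetical size)
    n_prompt: int    # text prompt tokens
    n_response: int  # text tokens generated autoregressively


def build_prefill(req: Request, n_think: int) -> list[str]:
    """Lay out the prefill sequence; n_think=0 reproduces the baseline.

    Assumption: thinking slots sit after the audio and prompt positions,
    immediately before the response starts.
    """
    return ["audio"] * req.n_audio + ["prompt"] * req.n_prompt + ["think"] * n_think


def cost(req: Request, n_think: int) -> dict[str, int]:
    prefill = build_prefill(req, n_think)
    return {
        "prefill_positions": len(prefill),  # handled together in one prefill pass
        "decode_steps": req.n_response,     # one forward pass per generated token
    }


req = Request(n_audio=750, n_prompt=20, n_response=64)
baseline = cost(req, n_think=0)
coat = cost(req, n_think=16)
print("baseline:", baseline)
print("coat:    ", coat)
```

In this accounting the thinking slots lengthen the prefill by `n_think` positions, while the number of autoregressive decode steps depends only on the response length and is therefore identical for both configurations.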

Across three LALMs - Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 - CoAT showed performance gains on a broad benchmark suite covering audio reasoning, audio understanding, music classification, speech emotion, and speech transcription. Analysis confirms that the auxiliary supervision propagates from the thinking positions into the model's textual responses.
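One way to picture auxiliary supervision that touches only the thinking positions is a two-term training objective, sketched below in plain Python. The article does not give the actual loss, so everything here is an assumption: the mean-squared-error form, the `weight` factor, the function names `mse` and `coat_style_loss`, and the toy vectors standing in for hidden states and audio-expert targets.

```python
def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def coat_style_loss(text_loss, hidden, expert_targets, think_positions, weight=1.0):
    """Return (total, aux): text loss plus a weighted distillation term.

    Assumption: the distillation term is an MSE averaged over the thinking
    positions only; all other positions receive no auxiliary signal.
    """
    aux = sum(mse(hidden[p], t) for p, t in zip(think_positions, expert_targets))
    aux /= len(think_positions)
    return text_loss + weight * aux, aux


# Toy hidden states for three positions; positions 1 and 2 are thinking slots.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
expert_targets = [[1.0, 1.0], [2.0, 4.0]]  # hypothetical audio-expert embeddings
total, aux = coat_style_loss(0.5, hidden, expert_targets, [1, 2], weight=0.5)
print(total, aux)

# Changing a non-thinking position leaves the auxiliary term untouched.
hidden_changed = [[9.0, 9.0], [1.0, 1.0], [2.0, 2.0]]
_, aux_changed = coat_style_loss(0.5, hidden_changed, expert_targets, [1, 2], weight=0.5)
```

In a real model the gradient of the auxiliary term would flow through the thinking positions into shared weights, which is one plausible route by which it could influence the text that follows; the sketch only shows where the extra term is attached.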

This is a clean architectural fix for a problem that many in the field hand-waved away. Giving models a separate latent workspace for acoustic reasoning, without paying decoding overhead, is the kind of practical innovation that could become standard in multimodal architectures.

Source: Continuous Audio Thinking for Large Audio Language Models
Domain: arxiv.org
