
CoAT Gives Audio Language Models a Thinking Space to Save Acoustic Info

A new framework called CoAT provides large audio language models with a continuous latent workspace to preserve phonetic detail, prosody, and affect before generating text, without adding decoding cost.


Large audio language models throw away most of what audio actually carries. Their hidden states get shaped for text generation, so phonetic detail, prosody, sound events, affect, and pitch are lost along the way: all the stuff that makes audio rich.

Why LALMs Ditch the Acoustic Signal

Standard LALMs like Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 are trained to produce text-aligned responses. That means their hidden representations are progressively optimized for language output, not for preserving acoustic information. By the time the model generates a response, the diverse acoustic content that audio carries is gone, and the model cannot draw on it.

How CoAT Builds a Thinking Space Without Extra Decoding Cost

Continuous Audio Thinking (CoAT) inserts a continuous latent workspace before response generation. Within this thinking space the model receives distillation signals from audio experts, so it can organize acoustic information before its representations are forced into text-aligned form. The critical detail: CoAT's continuous thinking block runs in a single prefill, meaning it adds zero additional autoregressive decoding cost over the baseline. No extra decoded tokens, no extra decoding steps.
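To make the cost argument concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The slot count, the mean-squared-error distillation loss, and every name (`SequenceLayout`, `n_think`, `distillation_loss`) are assumptions made for illustration, since the summary above does not spell out those details. The sketch shows only two things: thinking slots sit in the prefill portion of the sequence, and an auxiliary loss can be computed at exactly those positions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SequenceLayout:
    """Hypothetical layout: [audio frames][prompt tokens][thinking slots] -> response."""
    n_audio: int     # audio-encoder frames fed to the language model
    n_prompt: int    # text prompt tokens
    n_think: int     # assumed number of continuous thinking slots (0 = baseline)
    n_response: int  # text tokens generated one at a time

    def prefill_positions(self) -> int:
        # Thinking slots are part of the input, so they are processed in the
        # same forward pass as the audio and the prompt.
        return self.n_audio + self.n_prompt + self.n_think

    def decode_steps(self) -> int:
        # Only response tokens require one autoregressive step each.
        return self.n_response

    def think_slice(self) -> slice:
        # Index range of the thinking slots within the prefill sequence.
        start = self.n_audio + self.n_prompt
        return slice(start, start + self.n_think)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    hidden: List[List[float]],
    layout: SequenceLayout,
    expert_targets: List[List[float]],
) -> float:
    """Assumed auxiliary loss: match thinking-slot states to audio-expert embeddings."""
    think_states = hidden[layout.think_slice()]
    if len(think_states) != len(expert_targets):
        raise ValueError("need one expert target per thinking slot")
    if not think_states:
        return 0.0
    total = sum(mse(h, t) for h, t in zip(think_states, expert_targets))
    return total / len(think_states)


# Same audio, prompt, and response length; only the thinking slots differ.
baseline = SequenceLayout(n_audio=100, n_prompt=10, n_think=0, n_response=20)
coat = SequenceLayout(n_audio=100, n_prompt=10, n_think=8, n_response=20)
```

In this sketch `coat.decode_steps()` equals `baseline.decode_steps()`, while `coat.prefill_positions()` is larger by `n_think`. That matches the claim as worded: the extra work lands in the single prefill pass, not in the token-by-token decoding loop.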

Across three LALMs (Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3), CoAT showed performance gains on a broad benchmark suite covering audio reasoning, audio understanding, music classification, speech emotion, and speech transcription. Analysis confirms that the auxiliary supervision propagates from the thinking positions into the model's textual responses.

This is a clean architectural fix for a problem that many in the field hand-waved away. Giving models a separate latent workspace for acoustic reasoning, without paying decoding overhead, is the kind of practical innovation that could become standard in multimodal architectures.


Source: Continuous Audio Thinking for Large Audio Language Models
Domain: arxiv.org
