Gemini 3.1 Flash TTS: the Next Generation of Expressive AI Speech

DeepMind has announced Gemini 3.1 Flash TTS, a new text-to-speech (TTS) model in the company's Gemini family of generative AI systems. DeepMind positions the model as the next step toward expressive, controllable AI-generated audio.

Granular Audio Tags

The core innovation of Gemini 3.1 Flash TTS is granular audio tags. These tags let developers specify fine-grained attributes of the generated speech, such as pitch, tempo, emphasis, and emotional tone, at the level of individual words or phrases. With this level of control, developers can steer the model toward more natural, context-appropriate vocalisations, which is particularly valuable for applications that require nuanced emotional expression or precise prosody.
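
To make word- and phrase-level tagging concrete, the following is a minimal Python sketch of how a script might attach attributes to individual spans of text and serialize them into one string. The TaggedSpan class, the render function, and the bracket markup are all hypothetical and invented for illustration; the article does not specify the model's actual tag syntax.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedSpan:
    """A run of text plus the speech attributes that apply to it."""
    text: str
    # Attribute names follow the article (pitch, tempo, emphasis, tone);
    # the values and the whole schema are assumptions, not the real format.
    attrs: dict[str, str] = field(default_factory=dict)


def render(spans: list[TaggedSpan]) -> str:
    """Serialize spans into one string using made-up bracket markup."""
    parts = []
    for span in spans:
        if span.attrs:
            # Sort keys so the output is deterministic.
            tag = " ".join(f"{k}={v}" for k, v in sorted(span.attrs.items()))
            parts.append(f"[{tag}]{span.text}[/]")
        else:
            parts.append(span.text)
    return " ".join(parts)


script = [
    TaggedSpan("I"),
    TaggedSpan("really", {"emphasis": "strong"}),
    TaggedSpan("did not expect that.", {"tone": "surprised", "tempo": "slow"}),
]
prompt = render(script)
print(prompt)
```

Here a single word ("really") carries its own emphasis, while the closing phrase carries a tone and tempo, which is the kind of granularity the article describes.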

Expressive Audio Generation

According to DeepMind, the new tags enable the model to generate audio that is expressive as well as intelligible. The synthesized voice can convey subtle variations in mood, intent, and emphasis that earlier TTS systems often miss. The result is a more engaging listening experience across a wide range of use cases, from virtual assistants to audiobooks and interactive storytelling.

Technical Overview

Gemini 3.1 Flash TTS is built on the same underlying architecture as previous Gemini models: large-scale transformer networks trained on diverse audio-text datasets. The audio-tagging mechanism is implemented as a lightweight conditioning layer that modulates the network's output based on the supplied tags. Because the layer is lightweight, this design keeps inference latency low while adding tag-driven expressive control.
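
The article does not say how the conditioning layer works internally. One common way to condition a network on side information is feature-wise linear modulation (FiLM), in which the conditioning input selects a scale and a shift that are applied to hidden features. The sketch below illustrates that general technique in plain Python; it is not a description of the model itself, and the TAG_PARAMS table, the tag names, and all numbers are invented for illustration.

```python
# Hypothetical per-tag (scale, shift) pairs for a 3-dimensional hidden state.
# In a trained model these would be learned; the numbers here are invented.
TAG_PARAMS = {
    "emphasis": ([1.5, 1.0, 1.0], [0.0, 0.25, 0.0]),
    "slow": ([1.0, 0.5, 1.0], [0.0, 0.0, 0.0]),
}


def modulate(hidden: list[float], tags: list[str]) -> list[float]:
    """FiLM-style conditioning: out = scale * hidden + shift, once per tag."""
    out = list(hidden)
    for tag in tags:
        scale, shift = TAG_PARAMS[tag]
        out = [s * h + b for s, h, b in zip(scale, out, shift)]
    return out


hidden_state = [2.0, 4.0, 6.0]
untagged = modulate(hidden_state, [])                  # no tags: unchanged
tagged = modulate(hidden_state, ["emphasis", "slow"])  # tags applied in order
print(untagged, tagged)
```

The appeal of this style of conditioning is that it adds only an elementwise multiply and add per layer, which is consistent with the article's claim that the mechanism is lightweight.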

Availability

DeepMind has made the model available through its standard API, so developers can experiment with the new tags and integrate expressive speech synthesis into their products. The public API release date is 2026-04-15, and the model is documented on the DeepMind blog.

For more details, see the official announcement on DeepMind’s blog: https://deepmind.google/blog/gemini-3-1-flash-tts-the-next-generation-of-expressive-ai-speech/.

Source: Gemini 3.1 Flash TTS: the next generation of expressive AI speech
Domain: deepmind.google
