NVIDIA Cosmos 3 unifies physical reasoning and action in a single model

A single Mixture-of-Transformers architecture replaces separate models for world generation, scene understanding, and robot policy generation.

Tags: nvidia, cosmos 3, mixture of transformers, robotics, artificial intelligence, machine learning

NVIDIA has collapsed the fragmented pipeline of physical AI models into a single omni-model called Cosmos 3. Instead of juggling separate architectures for world generation, scene understanding, and policy execution, Cosmos 3 handles all of these tasks in one model built on a Mixture-of-Transformers (MoT) backbone.

Mixture-of-Transformers Architecture

Cosmos 3 moves away from the previous paradigm, in which developers had to switch between distinct models such as Cosmos Predict for generation and Cosmos Reason for understanding. The new MoT architecture processes text, image, video, audio, and action inputs through a shared representation space. Each modality is encoded by a dedicated encoder (for example, a ViT for visual understanding or a VAE for generation) and then projected into the unified backbone.

The model splits its processing into two distinct subsequences: an autoregressive (AR) subsequence for next-token prediction and reasoning, and a diffusion (DM) subsequence for iterative denoising and generation. While AR and DM tokens use separate parameter sets within each transformer layer, they interact through joint attention. This allows the model to seamlessly transition between acting as a vision-language model (VLM), a forward/inverse dynamics model, or a robot policy without architectural changes.
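The split described above can be illustrated with a toy layer in plain Python. This is a sketch under stated assumptions, not NVIDIA's implementation: the class name `ToyMoTLayer`, the helper functions, the single attention head, and the random weights are all hypothetical, and masking, feed-forward blocks, and normalization are left out. It shows only the two ideas from the article: each token is projected with the weights of its own stream (AR or DM), and attention is computed jointly over both streams.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Small random matrix, standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class ToyMoTLayer:
    """Hypothetical single-head layer: per-stream weights, one joint attention."""

    def __init__(self, d_model, rng):
        # Separate parameter sets for the AR and DM streams.
        self.params = {
            stream: {name: rand_matrix(d_model, d_model, rng)
                     for name in ("q", "k", "v", "o")}
            for stream in ("ar", "dm")
        }
        self.scale = 1.0 / math.sqrt(d_model)

    def project(self, tokens, streams, name):
        # Each token is multiplied by the weight matrix of its own stream.
        return [matmul([tok], self.params[s][name])[0]
                for tok, s in zip(tokens, streams)]

    def forward(self, tokens, streams):
        q = self.project(tokens, streams, "q")
        k = self.project(tokens, streams, "k")
        v = self.project(tokens, streams, "v")
        # Joint attention: every token attends over AR and DM tokens together.
        k_transposed = [list(col) for col in zip(*k)]
        scores = matmul(q, k_transposed)
        weights = [softmax([s * self.scale for s in row]) for row in scores]
        mixed = matmul(weights, v)
        out = self.project(mixed, streams, "o")
        # Residual connection around the attention block.
        return [[t + o for t, o in zip(tok, upd)] for tok, upd in zip(tokens, out)]

rng = random.Random(0)
layer = ToyMoTLayer(d_model=4, rng=rng)
tokens = rand_matrix(5, 4, rng)              # 5 tokens, width 4
streams = ["ar", "ar", "ar", "dm", "dm"]     # 3 AR tokens followed by 2 DM tokens
out = layer.forward(tokens, streams)
print(len(out), len(out[0]))                 # prints: 5 4
```

Because the attention step runs over the whole sequence, changing a DM token changes the outputs of the AR tokens as well, even though the two streams never share projection weights.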

Deployment Scales and Capabilities

NVIDIA is shipping Cosmos 3 in two distinct sizes to target different compute environments. The 8B parameter Cosmos 3 Nano is optimized for efficient inference on workstation-grade hardware like the RTX PRO 6000 GPU. For large-scale research and synthetic data generation (SDG), the 32B parameter Cosmos 3 Super is designed to run on NVIDIA Hopper and Blackwell architectures.
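As a rough guide to what these sizes mean in practice, the storage needed for the weights alone can be estimated from the parameter count. The sketch below assumes BF16 (2 bytes per parameter) and FP8 (1 byte per parameter); the article does not say which precisions are supported, and the estimate ignores activations, attention caches, and the modality encoders.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts from the article.
MODELS = {"Cosmos 3 Nano": 8e9, "Cosmos 3 Super": 32e9}
# Assumed precisions for illustration only; not stated in the source.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

for name, params in MODELS.items():
    for dtype, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(params, nbytes)
        print(f"{name} @ {dtype}: ~{gb:.0f} GB of weights")
```

Under these assumptions, Nano needs roughly 16 GB in BF16 and Super roughly 64 GB, before any runtime overhead.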

The model's versatility is defined by its ability to map diverse inputs to specific physical outputs:

  • Text/Image/Video to Video: High-fidelity world generation.
  • Text/Video to Vision Language Model: Complex scene reasoning.
  • Action/Image/Text to Video: Forward dynamics modeling.
  • Text/Video to Action: Inverse dynamics and policy generation.
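The list above can be read as a lookup table from input modalities to an output. The sketch below encodes it that way; the `Task` type and `candidate_tasks` function are hypothetical and not part of any Cosmos API. Two assumptions are baked in: each slash-separated list is treated as the set of inputs a task accepts, and the output of the Vision Language Model mode is taken to be text, which the article does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    accepted_inputs: frozenset
    output: str

TASKS = [
    Task("world generation", frozenset({"text", "image", "video"}), "video"),
    # Output modality "text" is an assumption; the article says "Vision Language Model".
    Task("scene reasoning", frozenset({"text", "video"}), "text"),
    Task("forward dynamics", frozenset({"action", "image", "text"}), "video"),
    Task("inverse dynamics and policy generation", frozenset({"text", "video"}), "action"),
]

def candidate_tasks(provided_inputs, wanted_output):
    """Return task names that accept all provided inputs and yield the wanted output."""
    provided = set(provided_inputs)
    return [
        task.name
        for task in TASKS
        if provided <= task.accepted_inputs and task.output == wanted_output
    ]

print(candidate_tasks({"text", "video"}, "action"))
print(candidate_tasks({"action", "image"}, "video"))
print(candidate_tasks({"text", "image"}, "video"))
```

The first call matches only inverse dynamics and policy generation, the second only forward dynamics, and the third matches both world generation and forward dynamics, since a text and image prompt is valid for either.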

This unification enables more robust training for robotics and autonomous driving by providing a foundation that understands not just pixels, but the underlying motion, causality, and physics of the real world. The release of Cosmos 3 provides the groundwork for more sophisticated autonomous systems capable of simulating and reacting to complex, long-tail physical scenarios.


Source: Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action
Domain: huggingface.co
