Netflix's Vera Edits Videos by Generating Only What Needs to Change

Unlike video editing models that regenerate every pixel, Vera generates only the edit layer and an alpha matte, then composites them over the untouched source footage.
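The compositing step can be illustrated with the standard "over" operator, where each output pixel is alpha * edit + (1 - alpha) * source. The sketch below is not Netflix's code: it assumes straight (non-premultiplied) alpha, a single grayscale frame stored as nested lists, and a hypothetical function name `composite_over`.

```python
def composite_over(source, edit, alpha):
    """Blend one frame: out = alpha * edit + (1 - alpha) * source, per pixel.

    All three arguments are same-sized 2D lists of floats in [0, 1].
    (Illustrative helper, not part of any published Vera code.)
    """
    out = []
    for src_row, edit_row, alpha_row in zip(source, edit, alpha):
        row = []
        for s, e, a in zip(src_row, edit_row, alpha_row):
            a = min(max(a, 0.0), 1.0)  # clamp the matte to a valid range
            row.append(a * e + (1.0 - a) * s)
        out.append(row)
    return out


# Toy 3x3 frame: the matte is opaque at the centre, half-transparent at one
# neighbouring pixel, and zero everywhere else.
source = [[0.2] * 3 for _ in range(3)]
edit = [[0.9] * 3 for _ in range(3)]
alpha = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0],
]

out = composite_over(source, edit, alpha)
```

Wherever the matte is zero, the output pixel equals the source pixel, which is why content outside the edited region can be preserved rather than re-synthesized.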

Most existing video diffusion models regenerate the entire clip for any edit, which can distort identity, performance, and background details that should have stayed untouched. Netflix's research team built Vera to decouple the edit from the source. Vera uses a Mixture-of-Transformers (MoT) architecture with three separate DiTs (one each for the edit layer, the alpha matte, and the composite video) that share joint self-attention. This lets each branch specialize while cross-layer interactions keep the composite physically consistent.
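To make "separate branches that share joint self-attention" concrete, here is a minimal single-head sketch in plain Python. It assumes, following the general Mixture-of-Transformers idea, that each branch has its own query/key/value projections while attention runs over the concatenated tokens of all branches. The dimensions, random weights, and function names are illustrative assumptions, not details from Netflix's model.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(branch_tokens, branch_weights):
    """branch_tokens: {name: [token vectors]}
    branch_weights: {name: (Wq, Wk, Wv)}, one projection set per branch."""
    queries, keys, values, owners = [], [], [], []
    for name, tokens in branch_tokens.items():
        Wq, Wk, Wv = branch_weights[name]  # branch-specific parameters
        for tok in tokens:
            queries.append(matvec(Wq, tok))
            keys.append(matvec(Wk, tok))
            values.append(matvec(Wv, tok))
            owners.append(name)

    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = {name: [] for name in branch_tokens}
    for q, owner in zip(queries, owners):
        # Every query attends over the keys of ALL branches.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs[owner].append(mixed)
    return outputs


random.seed(0)
DIM = 4
names = ["edit", "alpha", "composite"]


def random_matrix(n):
    return [[random.gauss(0, 0.5) for _ in range(n)] for _ in range(n)]


tokens = {n: [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]
          for n in names}
weights = {n: (random_matrix(DIM), random_matrix(DIM), random_matrix(DIM))
           for n in names}

result = joint_self_attention(tokens, weights)
```

Because the keys and values are pooled across branches, a change to the edit-layer tokens also changes what the alpha and composite branches compute, which is the cross-layer interaction the architecture relies on.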

Vera Generates Edit Layers, Not Full Videos

Training a model that outputs distinct layers required a custom dataset. Netflix compiled 486k frames at 832x480 resolution across three tiers: synthetic composites with high-quality alpha mattes, realistic single-object clips with matting and inpainting, and multi-object videos with effects like shadows. Vera comes in two variants, with 1.3B and 14B parameters, both initialized from a pretrained T2V base.

Benchmarks on 72 object-addition and 69 background-change test pairs show Vera significantly outperforms all baselines on content preservation, measured by pixel-level and perceptual similarity, while instruction compliance and video quality match or exceed the strongest baselines. A human preference study with 19 creative reviewers across 512 trials points the same way: Vera-1.3B was preferred over every baseline for both content preservation and instruction compliance.

VOID Uses Physics Reasoning to Inpaint Interactions

Simple object removal often breaks physics. Remove a hand holding a lamp, and current models leave the lamp floating or ignore falling motion. Netflix's VOID adds a quadmask-based reasoning pipeline that uses a VLM to identify causally affected regions (objects that will fall, collide, or change trajectory).

VOID runs inference in two passes. First pass: the diffusion model (a CogVideoX-Fun-V1.5-5b-InP backbone fine-tuned with a Gen-Omnimatte checkpoint) takes the video and quadmasks and generates a physically plausible counterfactual. Second pass: if object morphing is detected, VOID re-runs generation using flow-warped noise from the first pass to stabilize the object's shape along its new trajectory.
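The control flow of the two passes can be sketched as follows. This is only an outline under stated assumptions: `run_two_pass`, `generate`, `morphing_detected`, and `warp_noise` are hypothetical names standing in for the diffusion model, the morphing check, and the flow-based noise warping, none of which are specified here. The stand-ins below exist only to make the sketch runnable.

```python
def run_two_pass(video, quadmasks, generate, morphing_detected, warp_noise):
    """Return (result, number_of_passes). All callables are injected stand-ins."""
    # Pass 1: generate the counterfactual from the video and quadmasks.
    first = generate(video, quadmasks, init_noise=None)
    if not morphing_detected(first):
        return first, 1

    # Pass 2: re-run, seeding generation with noise warped along the
    # motion observed in the first pass.
    noise = warp_noise(first)
    second = generate(video, quadmasks, init_noise=noise)
    return second, 2


# Stand-in components (hypothetical; they only record what was called).
calls = []


def fake_generate(video, quadmasks, init_noise=None):
    calls.append(init_noise)
    return {"frames": list(video), "init_noise": init_noise}


def fake_warp_noise(first_pass_result):
    return "flow-warped-noise"


result, passes = run_two_pass(
    ["frame0", "frame1"], ["mask0", "mask1"],
    fake_generate, lambda r: True, fake_warp_noise,
)
```

The second pass is conditional, so clips where the first pass already keeps object shapes stable pay for only one generation.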

Training data came from Kubric simulations and HUMOTO motion capture, with scenes re-simulated after the target objects were removed. The resulting counterfactuals obey physical laws: gravity, collisions, inertia.

Both Models Beat Strong Baselines in User Studies

VOID was evaluated against six baselines (open and closed source) on 75 real-world scenarios. Twenty-five creative reviewers rated outputs for visual quality, temporal consistency, blending, and realism of scene evolution. VOID was selected 64.8% of the time, substantially ahead of every competitor. Baselines produced physically implausible results, such as impossible water splashes or spinning tops that move without hands.

Netflix acknowledges limitations. Vera struggles with complex effects like lightning and smoke because such examples are sparse in its training data, and it sometimes fails to keep background motion consistent with camera movement. VOID can't handle unusual camera angles or very close shots, and it has length and resolution constraints. These prototypes are early efforts, but the core ideas (layered generation and physics-aware inpainting) point in the right direction for giving artists precise control without sacrificing source integrity.

Source: Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix
Domain: netflixtechblog.com
