GF-DiT Dynamically Reshuffles GPU Parallelism for 6x DiT Throughput

By treating GPU parallelism as a schedulable resource, GF-DiT cuts mean latency by 95% and SLO violations by 90% compared with fixed-pipeline execution.

Tags: GF-DiT, vLLM-Omni, diffusion transformer, GPU parallelism, AI inference, systems engineering

778 milliseconds to form a GPU communication group: that's the tax conventional collectives charge every time a Diffusion Transformer (DiT) serving system wants to change its parallel layout. The GF-DiT runtime drops that to roughly 60 microseconds, then uses the reclaimed slack to reassign GPUs per request on the fly. Relative to static pipelines, throughput jumps 6.01×, mean latency falls 95%, and SLO violation rates crater by 90%.

Why Static Parallelism Fails for DiT

Diffusion Transformers generate images and videos through iterative denoising, meaning each request undergoes multiple execution stages with vastly different compute and memory demands. Existing systems lock a request into a fixed parallel configuration — say, 4 GPUs with tensor parallelism — from start to finish. That's fine for batch inference on uniform workloads. For DiT serving, it's wasteful: early stages underutilize GPUs, later stages bottleneck, and you can't shift resources to a request that suddenly needs more.

The GF-DiT authors, building on vLLM-Omni, argue that GPU parallelism should be a first-class schedulable resource, not a deployment-time setting. They decompose each request into independently schedulable "trajectory tasks" and decouple the execution pipeline so the runtime can reallocate GPUs across active requests while those requests are still running.
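To make the decomposition idea concrete, here is a minimal sketch of one way a request could be split into separately schedulable units. The article does not define the internals of a trajectory task, so everything here is an assumption for illustration: the `TrajectoryTask` class, the `decompose` helper, and the encode / denoise / decode stage names are hypothetical and are not GF-DiT's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrajectoryTask:
    """Hypothetical scheduling unit: one stage of one request."""
    request_id: str
    stage: str   # illustrative stage name, e.g. "encode", "denoise[3]", "decode"
    index: int   # position of this stage within the request


def decompose(request_id: str, denoise_steps: int) -> List[TrajectoryTask]:
    """Split a request into an ordered list of tasks.

    Assumes a typical diffusion pipeline shape: one encode stage,
    `denoise_steps` denoising stages, and one decode stage.
    """
    stages = ["encode"]
    stages += [f"denoise[{i}]" for i in range(denoise_steps)]
    stages += ["decode"]
    return [TrajectoryTask(request_id, s, i) for i, s in enumerate(stages)]


tasks = decompose("req-0", denoise_steps=4)
for task in tasks:
    print(task.index, task.stage)
```

Once a request is a list of tasks rather than one opaque job, a scheduler is free to give each task a different GPU count, which is the property the article attributes to GF-DiT.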

Elastic Parallelism and Group-Free Collectives

To make reallocation cheap enough to matter, GF-DiT introduces group-free collectives, a lightweight communication abstraction that forms and reconfigures arbitrary execution groups on the fly. Traditional collectives require building NCCL communicators or MPI groups, which takes hundreds of milliseconds and kills responsiveness. GF-DiT's approach lowers group-setup overhead from 778 ms to roughly 60 µs, making it practical to change parallelism mid-request.

The runtime is policy-programmable: operators can specify objectives (e.g., minimize tail latency, maximize throughput) and GF-DiT decides which requests get how many GPUs at each timestep. The paper evaluates on representative image and video diffusion workloads, showing consistent wins over static pipelines.
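As a rough illustration of what "policy-programmable" could look like, the sketch below treats a policy as a plain function that orders active requests, and a separate allocator that hands out GPUs in that order. This is a toy model under stated assumptions, not the paper's scheduler: `ActiveRequest`, `allocate`, both policy functions, and the per-request cap of 4 GPUs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ActiveRequest:
    """Hypothetical snapshot of one in-flight request."""
    request_id: str
    remaining_steps: int  # denoising steps still to run
    slack_s: float        # seconds left before the SLO deadline


# A policy returns the requests in priority order, highest priority first.
Policy = Callable[[List[ActiveRequest]], List[ActiveRequest]]


def min_tail_latency(reqs: List[ActiveRequest]) -> List[ActiveRequest]:
    """Prioritize the request closest to missing its deadline."""
    return sorted(reqs, key=lambda r: r.slack_s)


def max_throughput(reqs: List[ActiveRequest]) -> List[ActiveRequest]:
    """Prioritize the request with the least remaining work."""
    return sorted(reqs, key=lambda r: r.remaining_steps)


def allocate(reqs: List[ActiveRequest], total_gpus: int, policy: Policy,
             max_per_request: int = 4) -> Dict[str, int]:
    """Give every request one GPU, then hand out spares in policy order."""
    if len(reqs) > total_gpus:
        raise ValueError("more active requests than GPUs")
    alloc = {r.request_id: 1 for r in reqs}
    spare = total_gpus - len(reqs)
    for r in policy(reqs):
        extra = min(spare, max_per_request - 1)
        alloc[r.request_id] += extra
        spare -= extra
    return alloc


reqs = [
    ActiveRequest("A", remaining_steps=40, slack_s=2.0),
    ActiveRequest("B", remaining_steps=5, slack_s=9.0),
    ActiveRequest("C", remaining_steps=20, slack_s=0.5),
]
latency_alloc = allocate(reqs, total_gpus=8, policy=min_tail_latency)
throughput_alloc = allocate(reqs, total_gpus=8, policy=max_throughput)
print(latency_alloc)     # {'A': 3, 'B': 1, 'C': 4}
print(throughput_alloc)  # {'A': 1, 'B': 4, 'C': 3}
```

The same eight GPUs end up split differently depending on the objective: the latency policy favors request C, which is nearest its deadline, while the throughput policy favors request B, which is nearest completion. A real scheduler would also have to respect which GPU counts a model's parallel layout supports; this sketch ignores that.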

What 60 µs Means in Practice

That 60 µs overhead means the scheduler can re-evaluate resource allocation at every denoising step without adding meaningful latency. It can peel GPUs away from a request that's past its peak memory phase and hand them to a newly arrived job. The result is a system that adapts to the heterogeneity of DiT serving instead of fighting it with static configuration.
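A quick back-of-the-envelope calculation shows why the gap matters for per-step rescheduling. The two setup costs come from the article; the step count (50) and per-step compute time (100 ms) are assumptions picked only to make the arithmetic concrete, not figures from the paper.

```python
# Group setup costs quoted in the article.
STATIC_SETUP_S = 778e-3      # 778 ms with conventional group creation
GROUP_FREE_SETUP_S = 60e-6   # roughly 60 µs with group-free collectives

# Assumed workload shape (hypothetical values, not from the paper).
DENOISING_STEPS = 50         # denoising steps per request
STEP_COMPUTE_S = 0.100       # compute time per denoising step


def added_latency_s(setup_s: float, steps: int) -> float:
    """Extra latency if the group is re-formed before every step."""
    return setup_s * steps


def overhead_share(setup_s: float, step_s: float) -> float:
    """Fraction of each step's wall time spent on group setup."""
    return setup_s / (setup_s + step_s)


cost_ratio = STATIC_SETUP_S / GROUP_FREE_SETUP_S
static_added = added_latency_s(STATIC_SETUP_S, DENOISING_STEPS)
group_free_added = added_latency_s(GROUP_FREE_SETUP_S, DENOISING_STEPS)
static_share = overhead_share(STATIC_SETUP_S, STEP_COMPUTE_S)
group_free_share = overhead_share(GROUP_FREE_SETUP_S, STEP_COMPUTE_S)

print(f"setup cost ratio: {cost_ratio:.0f}x")
print(f"added latency per request: {static_added:.1f} s vs {group_free_added * 1e3:.1f} ms")
print(f"share of step time: {static_share:.1%} vs {group_free_share:.3%}")
```

Under these assumed numbers, re-forming a group before every step would add about 38.9 s per request and eat nearly 89% of wall time at 778 ms, versus about 3 ms and 0.06% at 60 µs. The setup cost ratio itself, roughly 13,000×, depends only on the two figures quoted in the article.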

I've seen plenty of inference optimizations that look good on paper but fall apart under real workload skew. GF-DiT's combination of elastic parallelism and sub-millisecond group reconfiguration addresses the core structural inefficiency of current DiT serving stacks. This points to a future where serving systems treat parallelism as a dynamic resource pool, not a fixed deployment artifact.


Source: GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving
Domain: arxiv.org
