Netflix Replaced Custom Batch Queuing With Kueue in 4 Weeks

Millions of batch jobs now run through Kueue, cutting custom CMB logic and enabling preemption-based fair sharing across tenant hierarchies.

netflix, kueue, kubernetes, batch compute, titus, infrastructure engineering

Netflix has run millions of batch jobs through a custom scheduler called CMB (Compute Managed Batch) since 2018. Last year the team replaced CMB's queuing and scheduling core with Kueue, a cloud-native job queueing system, and the production migration took only 4 weeks.

Why Kueue beat YuniKorn and Volcano for Netflix

CMB was built before the Kubernetes ecosystem had mature batch offerings, but maintaining custom logic for fair sharing, hierarchical tenants, and capacity management became a drag. The team evaluated YuniKorn and Volcano but rejected both because they replace pod scheduling in the kube-scheduler, which would fragment job placement across Titus cells and hurt efficiency. Kueue sits above the scheduler, letting Titus keep its existing scheduling profiles while handling queue admission and quota management.

Kueue also supports multi-tenant quota management over heterogeneous hardware, operates on standard v1.Pod and batch/v1.Job primitives, and natively includes features such as preemption, all-or-nothing scheduling, and topology-aware scheduling. These are exactly the features the team wanted to build into CMB but could not easily add.
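Because Kueue works with plain batch/v1.Job objects, routing a job through it is mostly a matter of labeling. The following is a minimal sketch in Python, not Netflix's submission path: the job name, image, and the LocalQueue name tenant-a-queue are hypothetical. The kueue.x-k8s.io/queue-name label and the Job's suspend field are the standard mechanisms Kueue uses to hold a job until it is admitted.

```python
import json


def queued_job(name: str, queue: str, image: str, cpu: str, memory: str) -> dict:
    """Build a batch/v1 Job manifest that is submitted through a Kueue LocalQueue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Kueue reads this label to find the LocalQueue for the job.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # The job starts suspended; Kueue unsuspends it on admission.
            "suspend": True,
            "parallelism": 1,
            "completions": 1,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            # Requests are what Kueue counts against quota.
                            "resources": {"requests": {"cpu": cpu, "memory": memory}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    manifest = queued_job("demo-job", "tenant-a-queue", "busybox:1.36", "1", "1Gi")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied like any other manifest. Pod placement is still decided by the regular scheduler once the job is unsuspended, which is the separation the team was after.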

Zero-lift migration for users, one button for operators

Netflix Batch (the migration project) required zero changes from end users. The team maintained API parity with CMB, so existing submission flows kept working. Under the hood, a single UI toggle converts an existing CMB tenant hierarchy into Kueue's Cohort, ClusterQueue, and LocalQueue objects. Reserved and shared capacity map to resource flavors and nominal quotas. Rolling back uses the same button.
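To make the object mapping concrete, here is a rough Python sketch of what such a conversion could emit for one tenant. It is an illustration under stated assumptions, not the team's converter: the tenant fields, the naming scheme, and the default-flavor name are all hypothetical, only reserved CPU is modeled, and cohort membership is expressed through the ClusterQueue's cohort field rather than separate Cohort objects.

```python
import json

KUEUE_API = "kueue.x-k8s.io/v1beta1"


def tenant_to_kueue_objects(name: str, namespace: str, group: str,
                            reserved_cpu: int,
                            flavor: str = "default-flavor") -> list:
    """Translate one (hypothetical) tenant record into Kueue manifests."""
    resource_flavor = {
        "apiVersion": KUEUE_API,
        "kind": "ResourceFlavor",
        "metadata": {"name": flavor},
    }
    cluster_queue = {
        "apiVersion": KUEUE_API,
        "kind": "ClusterQueue",
        "metadata": {"name": f"{name}-cq"},
        "spec": {
            # Tenants in the same group share a cohort and can lend/borrow.
            "cohort": group,
            "namespaceSelector": {},  # empty selector matches all namespaces
            "resourceGroups": [
                {
                    "coveredResources": ["cpu"],
                    "flavors": [
                        {
                            "name": flavor,
                            "resources": [
                                # Reserved capacity becomes the nominal quota.
                                {"name": "cpu", "nominalQuota": str(reserved_cpu)}
                            ],
                        }
                    ],
                }
            ],
        },
    }
    local_queue = {
        "apiVersion": KUEUE_API,
        "kind": "LocalQueue",
        "metadata": {"name": f"{name}-queue", "namespace": namespace},
        # The namespaced LocalQueue points at the tenant's ClusterQueue.
        "spec": {"clusterQueue": cluster_queue["metadata"]["name"]},
    }
    return [resource_flavor, cluster_queue, local_queue]


if __name__ == "__main__":
    # Hypothetical tenant; real hierarchies would loop over many of these.
    for obj in tenant_to_kueue_objects("tenant-a", "tenant-a-ns", "org-x", 100):
        print(json.dumps(obj, indent=2))
```

Generating the objects from the existing tenant records, rather than hand-writing them, is also what would make a one-button rollback plausible: the source of truth stays in one place.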

That one-click conversion hides serious complexity. The team had to run Kueue with much higher QPS, Burst, and groupKindConcurrency settings than the defaults, a configuration they derisked through load tests in a development environment that mimics Titus production cells.
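These three knobs live in Kueue's controller Configuration. The sketch below only shows where they sit; every number in it is a hypothetical placeholder, since the article does not give the values Netflix settled on.

```python
import json


def kueue_config(qps: int, burst: int, concurrency: dict) -> dict:
    """Build a Kueue controller Configuration with raised client and worker limits."""
    return {
        "apiVersion": "config.kueue.x-k8s.io/v1beta1",
        "kind": "Configuration",
        # Rate limits for the controller's calls to the Kubernetes API server.
        "clientConnection": {"qps": qps, "burst": burst},
        # Number of concurrent reconcilers per resource kind.
        "controller": {"groupKindConcurrency": dict(concurrency)},
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    config = kueue_config(
        qps=200,
        burst=400,
        concurrency={
            "Job.batch": 20,
            "Pod": 20,
            "Workload.kueue.x-k8s.io": 20,
        },
    )
    print(json.dumps(config, indent=2))
```

Raising these values shifts load onto the API server, which is presumably why the team validated them with load tests before production.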

Preemption-based fair sharing finally works

CMB's fair sharing applied only at admission: once a job started, it ran to completion, even if higher-priority work appeared. With Kueue, the team sets reclaimWithinCohort: Any and withinClusterQueue: LowerPriority in the ClusterQueue spec. Idle reserved capacity can now be lent to other tenants and reclaimed on demand, and priority-based preemption gives business-critical workloads faster turnaround. The Compute team saw a significant increase in average resource utilization after deploying these features.
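In manifest terms, the two settings named above belong to the preemption block of a ClusterQueue. A minimal Python sketch, assuming a hypothetical ClusterQueue named tenant-a-cq in a cohort named org-x:

```python
import copy
import json


def with_preemption(cluster_queue: dict) -> dict:
    """Return a copy of a ClusterQueue manifest with the preemption policy set."""
    cq = copy.deepcopy(cluster_queue)  # leave the caller's manifest untouched
    cq.setdefault("spec", {})["preemption"] = {
        # Reclaim nominal quota lent to other queues in the cohort,
        # regardless of the priority of the workloads borrowing it.
        "reclaimWithinCohort": "Any",
        # Inside this queue, a pending workload may preempt only
        # workloads with lower priority.
        "withinClusterQueue": "LowerPriority",
    }
    return cq


if __name__ == "__main__":
    # Hypothetical base manifest; quota fields are omitted for brevity.
    base = {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": "tenant-a-cq"},
        "spec": {"cohort": "org-x"},
    }
    print(json.dumps(with_preemption(base), indent=2))
```

The first setting is what makes lending safe for the owner of reserved capacity; the second is what lets urgent work displace lower-priority work within a tenant's own queue.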

The team also learned to migrate the most complex customer first: the largest batch user. That built confidence and compressed the production rollout to 4 weeks. Lessons from this migration are already being picked up by internal teams building Kubernetes-native training infrastructure.


Source: How Netflix Simplified Batch Compute with Kueue
Domain: medium.com
