
Meta's Instantaneous PowerLoss Storm: Full Region Blackout Recovery Tested

engineering.fb.com · @systems_wire · 2 hours ago · Systems Engineering · 1 comment

Meta deliberately de-energized a large production region housing critical storage, AI, and data warehouse workloads to prove its infrastructure can survive sudden power loss without data loss or permanent damage.

meta · twine orchestrator · instantaneous powerloss storm · data center reliability · disaster readiness · systems engineering

Meta deliberately cut power to an entire production data center region, one housing critical storage, AI, and data warehouse workloads, to prove that its infrastructure can survive an instant blackout without data loss or permanent damage. This was not a simulation: the facility was actually de-energized.

Built for Blackout from the Ground Up

Meta’s data center stack, from mechanical and electrical facilities up through the Twine orchestrator and its control plane services (Scheduler, Allocator, Broker, Zelos), was designed with power loss tolerance built in. Batteries and the Power Loss Siren (PLS) allow in-memory data to be persisted when racks go dark. Unavailability events (UEs) provide asynchronous region-wide signaling to coordinate shutdown and recovery. Still, those capabilities had been battle-tested only on single fault domains: racks or individual data centers. A region is 50–60x larger than a typical fault domain, and bootstrapping a dead region means millions of services starting simultaneously with no external coordination.

The Ouroboros and Boomerang Problems

Two critical failure modes emerged. The first, the “ouroboros”, is a circular dependency: control plane services like Scheduler and Allocator depend on each other to start, creating a chicken-and-egg problem when the whole region is dark. Meta addressed this with Belljar tests in CI/CD that detect dependency cycles early, plus a purpose-built “Twine recovery kit” (Twrko) that jumpstarts critical control plane services if a cycle slips through. The second, the “boomerang”, was subtler: the UEs that orchestrate shutdown also targeted the orchestrator itself, so control plane services shut themselves down and left orphaned services behind. Meta’s fix was brutally simple: allow control plane services to ignore power-related UEs entirely.
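As a rough illustration of both fixes, here is a minimal Python sketch. It is not Meta's Belljar or Twine code: the function names, the `UEKind` enum (including its `MAINTENANCE` member), and the example dependency map are all hypothetical, chosen only to show the idea. The first function finds a startup dependency cycle with a depth-first search, the kind of check a CI test could run. The second expresses the rule that control plane services ignore power-related UEs.

```python
from enum import Enum, auto


def find_cycle(deps):
    """Return one dependency cycle as a list of service names, or None.

    `deps` maps each service to the services it needs before it can start.
    The map itself is a hypothetical input, not a real Twine structure.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, done
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # Back edge: dep is already on the current path, so the
                # slice from dep to here, closed with dep, is a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for service in deps:
        if color.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


class UEKind(Enum):
    """Hypothetical categories of unavailability event."""
    POWER_LOSS = auto()
    MAINTENANCE = auto()


def should_shut_down(is_control_plane, ue_kind):
    """Control plane services ignore power-related UEs; all others obey."""
    return not (is_control_plane and ue_kind is UEKind.POWER_LOSS)


# Example: Scheduler and Allocator waiting on each other (the "ouroboros").
example = {"scheduler": ["allocator"], "allocator": ["scheduler"], "zelos": []}
print(find_cycle(example))  # ['scheduler', 'allocator', 'scheduler']
```

A check like this only catches cycles that are declared in the dependency map. That limitation is one reason a recovery kit for cycles that slip through remains useful.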

Tradeoffs: What They Let Break

Perfection is the enemy of reliability at scale. Meta drew a hard line: no data loss, no permanent facility damage, no impact beyond a single region. Everything else (transient service errors, bounded staleness in routing tables, rack failures within a predefined threshold) was considered tolerable as long as post-incident remediation could restore service within a reasonable mean time to recovery (MTTR). Over-engineering the safeguards risked false positives during normal operations, so Meta chose pragmatic boundaries.
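One way to picture this split between hard lines and tolerable damage is as an acceptance check on the outcome of an exercise. The sketch below is an assumption-laden illustration, not Meta's tooling: the `StormOutcome` fields and the `violations` function are hypothetical, and the thresholds are caller-supplied because the article gives no actual numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StormOutcome:
    """Hypothetical summary of one power loss exercise."""
    data_lost: bool
    permanent_facility_damage: bool
    regions_impacted: int
    failed_rack_fraction: float  # share of racks that failed, 0.0 to 1.0
    recovery_minutes: float      # observed time to restore service


def violations(outcome, max_failed_rack_fraction, max_recovery_minutes):
    """List which boundaries were crossed; an empty list means a pass.

    Both thresholds are supplied by the caller; the source does not
    state real values for them.
    """
    found = []
    # Hard lines: never acceptable, regardless of thresholds.
    if outcome.data_lost:
        found.append("data loss")
    if outcome.permanent_facility_damage:
        found.append("permanent facility damage")
    if outcome.regions_impacted > 1:
        found.append("impact beyond a single region")
    # Tolerable damage: acceptable only while it stays within bounds.
    if outcome.failed_rack_fraction > max_failed_rack_fraction:
        found.append("rack failures above threshold")
    if outcome.recovery_minutes > max_recovery_minutes:
        found.append("recovery slower than target MTTR")
    return found


# Illustrative numbers only; they are not from the article.
run = StormOutcome(False, False, 1, 0.01, 30.0)
print(violations(run, max_failed_rack_fraction=0.02, max_recovery_minutes=60.0))  # []
```

Keeping the hard lines as unconditional checks and the rest as thresholds mirrors the article's point: only a few outcomes are forbidden outright, while everything else is judged against a budget.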

Validation by De-Energizing a Real Region

Meta started small, validating dependency resolution in pre-production regions and in shadow regions that replicate production. Next, it powered off its smallest production region. Finally, it pulled the plug on a large region with critical workloads. The power supply fault was injected with no preemptive actions, so the test was truly zero-notice. MTTR for drain actions mirrored real incidents. The Storm exercises trained both the infrastructure and the engineers iteratively, with the aim of handling the loss of a region as seamlessly as a single rack failure.

Next: Live Client Traffic Under the Knife

Having proven it can recover a dark region, Meta is expanding the test to regions carrying live client traffic, the hardest test yet, using the same incremental strategy. Each cycle uncovers architectural improvements that feed back into the infrastructure. Meta frames reliability and velocity as two sides of the same coin: this foundation enables faster innovation in data center design and capacity deployment. A follow-up post is promised with details on the live-traffic tests.


Source: Lights Out, Systems On: Validating Instant Power Loss Readiness
Domain: engineering.fb.com
