
Meta Rebuilds BLOB Storage: O(1) Metadata and Zero Proxy Slash GPU Stalls

engineering.fb.com · @systems_wire · 3 hours ago · Systems Engineering · 3 comments

Meta rebuilt its BLOB-storage metadata subsystem from layered lookups to O(1) ZippyDB-backed resolution and eliminated the dataplane proxy, achieving 80% distributed cache hit rate and sub-2ms metadata access to nearly...

meta · blob storage · tectonic · ai training · gpu stalls · zippydb

Meta reports an 80% hit rate on its distributed data cache and metadata lookups of 1–2 ms, after ripping out a stateful metadata stack that added hundreds of milliseconds of latency. The rewrite targets the primary cause of GPU stalls in AI training at scale: slow storage reads.

Why a Slow Storage Read Stalls Thousands of GPUs

Every step in model training synchronizes hundreds of thousands of GPUs. If one GPU’s dataloader stalls waiting on a storage read, every other GPU in that training job idles. Meta’s old BLOB-storage architecture, built for Facebook and Instagram photo uploads, required lookups across multiple layers (namelayer, volumeslayer, containerlayer), sometimes crossing regions, which added up to hundreds of milliseconds per getObject("/bucket/path") call. That is untenable for AI workloads, which need predictable pMax (worst-case) latency.
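To make the cost of one slow read concrete, here is a minimal Python sketch. It assumes a fully synchronous step in which data loading is not overlapped with compute; the function names and the numbers are illustrative and are not taken from Meta's systems.

```python
def step_time_ms(compute_ms: float, read_ms_per_worker: list[float]) -> float:
    """A synchronous step finishes only after the slowest worker has its batch."""
    return compute_ms + max(read_ms_per_worker)


def idle_gpu_ms(read_ms_per_worker: list[float]) -> float:
    """Total GPU time spent waiting for the slowest read, summed over workers."""
    slowest = max(read_ms_per_worker)
    return sum(slowest - r for r in read_ms_per_worker)


# Illustrative numbers: seven workers read in 2 ms, one straggler takes 300 ms.
reads = [2.0] * 7 + [300.0]
total_ms = step_time_ms(100.0, reads)   # the step is gated by the 300 ms read
wasted_ms = idle_gpu_ms(reads)          # every fast worker waits 298 ms
```

In this toy model the step time is set entirely by the worst read, which is why tail latency matters more than the average for training jobs.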

Collapsing the Metadata Stack into O(1) Lookups

Meta replaced the scattered metadata stores with a unified flat schema backed by ZippyDB. The old getObject path needed three stateful lookups before even touching Tectonic blocks. The new path issues a getReadPlan() that resolves a path to (blockId, offset, size) in O(1) – one lookup per chunk. No more cross-region metadata cascades. At the same time, Meta eliminated the dataplane proxy from the read path; the client SDK now streams bytes directly from Tectonic storage servers. That cuts power overhead and latency in one move.
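The difference between the two read paths can be sketched with in-memory dictionaries standing in for the metadata stores. Everything here is hypothetical: the class names, the (path, chunk index) key layout, and the way one layer maps to the next are assumptions for illustration, and a Python dict is only a stand-in for a key-value store such as ZippyDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkLocation:
    block_id: str
    offset: int
    size: int


class LayeredMetadata:
    """Old-style path: three dependent lookups (name -> volume -> container)."""

    def __init__(self, names, volumes, containers):
        self.names = names            # path -> volume id (assumed mapping)
        self.volumes = volumes        # volume id -> container id (assumed mapping)
        self.containers = containers  # container id -> ChunkLocation
        self.lookups = 0

    def _get(self, table, key):
        self.lookups += 1             # each layer costs one round trip
        return table[key]

    def resolve(self, path):
        volume_id = self._get(self.names, path)
        container_id = self._get(self.volumes, volume_id)
        return self._get(self.containers, container_id)


class FlatMetadata:
    """New-style path: one key-value lookup per chunk."""

    def __init__(self, table):
        self.table = table            # (path, chunk index) -> ChunkLocation
        self.lookups = 0

    def get_read_plan(self, path, chunk_index):
        self.lookups += 1
        return self.table[(path, chunk_index)]


loc = ChunkLocation(block_id="blk-42", offset=0, size=8 * 1024 * 1024)
layered = LayeredMetadata({"/bucket/path": "vol-1"}, {"vol-1": "ctr-9"}, {"ctr-9": loc})
flat = FlatMetadata({("/bucket/path", 0): loc})

old_result = layered.resolve("/bucket/path")       # three lookups
new_result = flat.get_read_plan("/bucket/path", 0)  # one lookup
```

Both paths return the same location; the flat schema simply gets there without a chain of dependent lookups, any of which could cross a region in the old design.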

Tiered Caching Turns Cross-Region Copying into On-Demand Hydration

Copying terabytes across regions before a training run wasted hours. Meta borrowed OS page-cache ideas: treat the global HDD-backed BLOB store as the source of truth, then layer GPU-host memory (L1), host flash (L2), and regional flash-backed BLOB storage (L3) as caches. A prefetch() API in the SDK lets dataloaders hydrate data minutes ahead. Auto-lifecycle policies (TTL, LRU) keep hot data across epochs. Ingestion times dropped from hours to minutes for most jobs. Researchers no longer manually snapshot data to a region; storage just fetches on demand.
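A read-through cache hierarchy of this shape can be sketched as follows. This is a toy model under stated assumptions, not Meta's SDK: the class names, the per-tier LRU policy, the capacities, and the prefetch signature are all hypothetical, and a dict plays the role of the global source-of-truth store.

```python
from collections import OrderedDict


class LruTier:
    """A fixed-capacity cache tier with least-recently-used eviction."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


class TieredReader:
    """Checks tiers fastest-first, then falls back to the source of truth."""

    def __init__(self, tiers, source):
        self.tiers = tiers
        self.source = source
        self.served_from = []                # where each read() was satisfied

    def _fetch(self, key):
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                for faster in self.tiers[:i]:  # hydrate the faster tiers
                    faster.put(key, value)
                return value, tier.name
        value = self.source[key]
        for tier in self.tiers:                # hydrate every tier on a full miss
            tier.put(key, value)
        return value, "source"

    def read(self, key):
        value, origin = self._fetch(key)
        self.served_from.append(origin)
        return value

    def prefetch(self, keys):
        """Warm the tiers ahead of time; nothing is returned to the caller."""
        for key in keys:
            self._fetch(key)


source = {"a": b"A", "b": b"B", "c": b"C"}
reader = TieredReader([LruTier("L1", 1), LruTier("L2", 2), LruTier("L3", 3)], source)
reader.read("a")        # full miss, served from the source
reader.read("a")        # now in L1
reader.read("b")        # full miss; evicts "a" from the one-slot L1
reader.read("a")        # L1 miss, L2 hit
reader.prefetch(["c"])  # hydrate ahead of the dataloader
reader.read("c")        # already in L1 thanks to the prefetch
```

The point of the sketch is the access pattern: nothing is copied up front, the first touch pays the full cost, and later reads (or reads preceded by a prefetch) are served from the fastest tier that still holds the data.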

Meta’s new stack is lean enough that storage overhead on top of Tectonic is negligible. Next up: scaling to network limits and eliminating checkpoint stalls at even larger cluster sizes. The storage bottleneck is no longer the problem – GPUs can finally focus on compute.


Source: Meta's AI Storage Blueprint at Scale
Domain: engineering.fb.com
