Meta Rebuilds BLOB Storage: O(1) Metadata and Zero Proxy Slash GPU Stalls

engineering.fb.com · @systems_wire · yesterday · Systems Engineering · 6 comments

meta · blob storage · tectonic · ai training · gpu stalls · zippydb

Meta reports that 80% of data-cache requests now hit memory, and metadata lookups take 1–2 ms, after ripping out a stateful metadata stack that added hundreds of milliseconds of latency. That rewrite directly addresses the primary cause of GPU stalls in AI training at scale.

Why a Slow Storage Read Stalls Thousands of GPUs

Every step in model training synchronizes hundreds of thousands of GPUs. If one GPU’s dataloader stalls waiting on a storage read, every other GPU in that training job idles. Meta’s old BLOB-storage architecture, built for Facebook and Instagram photo uploads, required lookups across multiple metadata layers (namelayer, volumeslayer, containerlayer), sometimes crossing regions, which added up to hundreds of milliseconds per getObject("/bucket/path") call. That was untenable for AI workloads, which need predictable pMax latencies.
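
To make the straggler effect concrete, here is a minimal Python sketch. It assumes a synchronous step finishes only when the slowest rank finishes, and that each rank independently hits a slow read with some small probability. The function names and all numbers are illustrative assumptions, not figures from Meta.

```python
def step_time_ms(read_latencies_ms, compute_ms):
    """Synchronous step time: every rank waits for the slowest rank.

    read_latencies_ms holds one storage-read latency per rank (hypothetical values).
    """
    return max(compute_ms + read_ms for read_ms in read_latencies_ms)


def stall_probability(num_ranks, slow_read_prob):
    """Probability that at least one rank hits a slow read in a given step,
    assuming independent reads (a simplifying assumption)."""
    return 1.0 - (1.0 - slow_read_prob) ** num_ranks


# Three ranks, one of which waits 300 ms on storage: the whole step takes 400 ms.
print(step_time_ms([2.0, 2.0, 300.0], compute_ms=100.0))

# A 0.1% chance of a slow read per rank is rare for one rank, but near-certain
# to hit some rank once the job spans thousands of ranks.
for ranks in (1, 1_000, 10_000):
    print(ranks, round(stall_probability(ranks, 0.001), 4))
```

Under these assumptions, tail latency matters far more than the average: the step time tracks the worst read in the job, not the typical one.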

Collapsing the Metadata Stack into O(1) Lookups

Meta replaced the spread-out metadata stores with a unified flat schema backed by ZippyDB. The old getObject path needed three stateful lookups before even touching Tectonic blocks. The new path issues a getReadPlan() that resolves path to (blockId, offset, size) in O(1) – one lookup per chunk. No more cross-region metadata cascades. Simultaneously, Meta eliminated the dataplane proxy from the read path; the client SDK now streams bytes directly from Tectonic storage servers. That cuts power overhead and latency in one move.
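
The difference between the two lookup paths can be sketched in Python as follows. This is a toy model under stated assumptions: the dictionaries stand in for metadata stores, what each old layer maps to is a guess made for illustration, and the key layout of the flat index is hypothetical. Only the names getReadPlan and the (blockId, offset, size) tuple come from the article.

```python
from typing import NamedTuple


class ChunkLocation(NamedTuple):
    block_id: str
    offset: int
    size: int


# Old path (assumed shape): three dependent lookups, each a potential remote hop.
NAME_LAYER = {"/bucket/path": "vol-7"}
VOLUME_LAYER = {"vol-7": "ctr-42"}
CONTAINER_LAYER = {"ctr-42": ChunkLocation("blk-9", 0, 4096)}


def layered_lookup(path):
    """Resolve a path through three chained stores; returns (location, hops)."""
    volume = NAME_LAYER[path]
    container = VOLUME_LAYER[volume]
    return CONTAINER_LAYER[container], 3


# New path (assumed shape): one flat index keyed by (path, chunk index).
FLAT_INDEX = {
    ("/bucket/path", 0): ChunkLocation("blk-9", 0, 4096),
    ("/bucket/path", 1): ChunkLocation("blk-10", 0, 1024),
}


def get_read_plan(path):
    """Return the chunk locations for a path, one hash lookup per chunk
    (plus a final miss that marks the end of the object in this toy model)."""
    plan = []
    chunk_index = 0
    while True:
        location = FLAT_INDEX.get((path, chunk_index))
        if location is None:
            return plan
        plan.append(location)
        chunk_index += 1


location, hops = layered_lookup("/bucket/path")
print(hops, location)
print(get_read_plan("/bucket/path"))
```

In the layered version each lookup depends on the result of the previous one, so the hops cannot be issued in parallel. In the flat version every chunk key is known up front, so a client could in principle fetch all entries at once.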

Tiered Caching Turns Cross-Region Copying into On-Demand Hydration

Copying terabytes across regions before a training run wasted hours. Meta borrowed OS page-cache ideas: treat the global HDD-backed BLOB store as the source of truth, then layer GPU-host memory (L1), host flash (L2), and regional flash-backed BLOB storage (L3) as caches. A prefetch() API in the SDK lets dataloaders hydrate data minutes ahead. Auto-lifecycle policies (TTL, LRU) keep hot data across epochs. Ingestion times dropped from hours to minutes for most jobs. Researchers no longer manually snapshot data to a region; storage just fetches on demand.
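
The tiering idea can be illustrated with a small Python sketch. It assumes each tier is a plain LRU cache and that a hit in a lower tier hydrates the faster tiers above it. The class names, capacities, and the synchronous prefetch are hypothetical simplifications, TTL policies are omitted, and none of this reflects Meta's actual SDK.

```python
from collections import OrderedDict


class LRUTier:
    """One cache tier with least-recently-used eviction."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


class TieredReader:
    """Checks tiers fastest-first, then falls back to the source of truth."""

    def __init__(self, tiers, source):
        self.tiers = tiers    # ordered L1, L2, L3
        self.source = source  # dict standing in for the global HDD-backed store

    def read(self, key):
        for depth, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                self._hydrate(key, value, depth)
                return value, tier.name
        value = self.source[key]
        self._hydrate(key, value, len(self.tiers))
        return value, "source"

    def _hydrate(self, key, value, upto):
        # Copy the value into every tier faster than the one that served it.
        for tier in self.tiers[:upto]:
            tier.put(key, value)

    def prefetch(self, keys):
        # A real prefetch would run asynchronously; this one simply warms the tiers.
        for key in keys:
            self.read(key)


source = {"a": b"A", "b": b"B", "c": b"C"}
reader = TieredReader([LRUTier("L1", 1), LRUTier("L2", 2), LRUTier("L3", 3)], source)
print(reader.read("a"))  # first read falls through to the source
print(reader.read("a"))  # second read is served from L1
reader.prefetch(["c"])
print(reader.read("c"))  # already hydrated by prefetch
```

The first read of a key pays the full cost of the source of truth, while repeat reads across epochs are served from whichever tier still holds the data.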

Meta’s new stack is lean enough that storage overhead on top of Tectonic is negligible. Next up: scaling to network limits and eliminating checkpoint stalls at even larger cluster sizes. The storage bottleneck is no longer the problem – GPUs can finally focus on compute.


Source: Meta's AI Storage Blueprint at Scale
Domain: engineering.fb.com
