
OpenAI Diagnoses 18-Year-Old libunwind Bug via Crash Epidemiology

Two unrelated bugs, silent hardware corruption on an Azure host and a race condition in GNU libunwind, caused seemingly impossible crashes in OpenAI's data infrastructure.


Two bugs, one an 18-year-old race condition in GNU libunwind and the other a silent hardware error on a single Azure host, were masquerading as a single impossible crash in OpenAI's ChatGPT data infrastructure. The crashes were real: C++ functions in the Rockset service would finish normally, then either return to a NULL address or resume with the stack pointer 8 bytes off. Engineers spent weeks chasing a ghost until they switched from deep-diving individual core dumps to building a population-level crash database.

Symptoms That Shouldn't Exist in Normal C++ Code

Rockset’s DocumentTree::updateDocument method would call an unknown function X, X's frame would get corrupted, and then X would return to an address that wasn't code. In some cores the saved return address was NULL; in others the stack pointer register %rsp had been mysteriously decremented by exactly 8 bytes. These are not normal segfaults. A stray write corrupting only the return address is astronomically unlikely; %rsp misalignment without setcontext, longjmp, or inline assembly is effectively impossible in compiled code using standard calling conventions.
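To make the 8-byte symptom concrete, here is a toy model of the x86-64 call/ret discipline. It is a simplified sketch, not anything from OpenAI's tooling: the ToyStack class, the run helper, and the addresses are invented for illustration, and real frames also hold saved registers and locals.

```python
WORD = 8  # bytes per stack slot on x86-64


class ToyStack:
    """Toy model of call/ret: the stack grows toward lower addresses."""

    def __init__(self, rsp):
        self.rsp = rsp
        # address -> 8-byte value. Unwritten slots read as 0 purely for
        # convenience; a real stack would hold whatever was left there.
        self.memory = {}

    def call(self, return_address):
        # `call` pushes the return address: decrement rsp, then store.
        self.rsp -= WORD
        self.memory[self.rsp] = return_address

    def ret(self):
        # `ret` pops whatever rsp points at and jumps to it.
        target = self.memory.get(self.rsp, 0)
        self.rsp += WORD
        return target


def run(skew):
    """Call once, shift rsp by `skew` bytes inside the callee, then return."""
    stack = ToyStack(rsp=0x7FFF0000)
    stack.call(return_address=0x401000)
    stack.rsp += skew  # 0 models a healthy callee; -8 models the shift
    return stack.ret(), stack.rsp


healthy_target, healthy_rsp = run(skew=0)
skewed_target, skewed_rsp = run(skew=-8)
```

In the healthy run, ret jumps back to the pushed address and %rsp ends where it started. With %rsp 8 bytes lower, ret reads the neighbouring slot instead, so control transfers to whatever value happens to sit there.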

Every hypothesis (a compiler bug, a kernel signal delivery issue, a memory error that ASAN runs in staging had missed) had strong evidence against it. The team couldn't even reliably classify crashes from application logs, because corrupted stack traces produced both false positives and false negatives.

Epidemiological Turn: Population-Level Core Dump Analysis

Instead of staring at a few cores, the team built a high-quality dataset of all Rockset crashes across the fleet, extracting structured features from each core: stack frame pointers, register values, and the exact memory addresses around the corruption point. This shift, thinking like an epidemiologist rather than a microscope-wielding debugger, revealed two distinct clusters of crashes.
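A minimal sketch of what "structured features per core" could look like, assuming each core dump has already been reduced to a small record. The CrashRecord fields, the signature labels, and the sample rows are all hypothetical; the source does not describe the real schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    """Hypothetical per-core feature row; the field names are assumptions."""
    host: str
    region: str
    saved_return_address: int
    rsp_delta: int  # observed %rsp minus the value the caller's frame implies


def signature(record):
    """Reduce one crash to a coarse label that can be compared across cores."""
    if record.rsp_delta == -8:
        return "rsp_off_by_8"
    if record.saved_return_address == 0:
        return "null_return_address"
    return "other"


def count_signatures(records):
    """Tally how many crashes fall under each signature."""
    return Counter(signature(r) for r in records)


# Synthetic rows, for illustration only.
records = [
    CrashRecord("host-01", "region-1", 0x0, 0),
    CrashRecord("host-02", "region-2", 0x0, 0),
    CrashRecord("host-03", "region-1", 0x401000, -8),
    CrashRecord("host-04", "region-3", 0x401000, -8),
    CrashRecord("host-05", "region-2", 0x401000, 0),
]
counts = count_signatures(records)
```

Deriving the label from register and stack values rather than from log text is the point of the design: it sidesteps the corrupted stack traces that made log-based classification unreliable.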

Cluster A correlated perfectly with a single Azure host. Cluster B appeared across multiple regions and hardware types, but only when GNU libunwind was used for stack unwinding during crash handling. The first traced to a rare CPU that simply didn't compute arithmetic correctly: silent hardware corruption. The second traced to a race condition in GNU libunwind that had been present since 2008, triggered when multiple threads crashed simultaneously and libunwind's internal state got mangled.
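One simple way to surface a host correlation like Cluster A's is to ask, for each crash cluster, what share of its crashes landed on the most common host. The sketch below assumes crashes arrive as (cluster, host) pairs; the top_host_share function and the host names are illustrative, not taken from the source.

```python
from collections import Counter, defaultdict


def top_host_share(crashes):
    """For each cluster, return (most common host, its share of that cluster).

    `crashes` is an iterable of (cluster, host) pairs.
    """
    hosts_by_cluster = defaultdict(Counter)
    for cluster, host in crashes:
        hosts_by_cluster[cluster][host] += 1

    result = {}
    for cluster, hosts in hosts_by_cluster.items():
        host, count = hosts.most_common(1)[0]
        result[cluster] = (host, count / sum(hosts.values()))
    return result


# Synthetic data: cluster A sits on one host, cluster B is spread evenly.
crashes = [("A", "host-17")] * 6 + [
    ("B", "host-02"),
    ("B", "host-09"),
    ("B", "host-23"),
    ("B", "host-31"),
]
shares = top_host_share(crashes)
```

A share near 1.0 points at a single machine, while a share close to one divided by the number of hosts suggests the cause travels with the software rather than the hardware.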

Two Bugs, One Fix, and a Lesson in Scale Debugging

Once the clusters were separated, each bug became tractable. The hardware fault was handled by replacing the single misbehaving Azure node. The libunwind race was patched by fixing the library's locking around its internal data structures — a change that OpenAI contributed back upstream.
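The source gives no detail on the actual patch, so the following is only a generic illustration of the technique it names: serializing access to a shared internal structure. SharedCache and its placeholder computation are invented for this sketch and do not mirror libunwind's code.

```python
import threading


class SharedCache:
    """Stand-in for a lookup table that several threads fill concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}
        self.computations = 0

    def lookup(self, key):
        # Hold the lock across the whole check-and-insert so that no thread
        # can observe, or create, a half-updated entry.
        with self._lock:
            if key not in self._entries:
                self.computations += 1
                self._entries[key] = key * 2  # placeholder for real work
            return self._entries[key]


def worker(cache, keys):
    for key in keys:
        cache.lookup(key)


cache = SharedCache()
threads = [
    threading.Thread(target=worker, args=(cache, range(100)))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With the lock in place, each of the 100 keys is computed exactly once regardless of how the eight threads interleave. Without it, two threads could both see a missing key and both write the entry.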

Neither bug would have been found by conventional core dump inspection. The hardware bug looked like software because its crashes appeared across multiple regions, until the team realized that one host's effects were propagating through load balancing. The libunwind bug looked like hardware because it produced stack misalignments that application code shouldn't be able to generate. Only by treating the crashes as a population of cases, each with a precise set of measurements, did the split become visible.

This investigation shows that even with modern C++ tooling (-fno-omit-frame-pointer, folly signal handlers, ASAN), some bugs require you to zoom out to the fleet level. The 18-year-old libunwind bug went undiagnosed for so long because no one had ever looked at enough cores at once.


Source: Core dump epidemiology: fixing an 18-year-old bug
Domain: openai.com
