Source linked

A Memory Pattern That Runs 30% Slower Than Random Access

blog.weineng.me@vast_jaguar2 hours ago·Systems Engineering·2 comments

By striding across page boundaries instead of random jumps, you can make a simple sum routine take 2 billion cycles - over 30% worse than randomized access.

intelcpumemory hierarchycacheperformance optimizationx86 64

Random access is not the slowest way to traverse memory. A carefully crafted access pattern that hops across page boundaries can beat random by over 30% - 2.05 billion cycles vs 1.57 billion on an Intel Core Ultra 7 268V.

Why Random Is Not the Worst

Linear access sums 2^26 integers in 133 million cycles - the CPU loves sequential reads. Shuffle the positions with a Fisher-Yates permutation and the same operation blows up to 1.57 billion cycles. That's a 10x penalty. But you can do worse.

Wei Neng's blog post builds the slowest possible permutation step by step. The key insight: a CPU's hardware prefetchers are excellent at detecting stride patterns, even random-looking ones, as long as they stay within a 4 KiB page. Cross a page boundary and the prefetchers give up - they don't risk speculative fetches across virtual pages that might map to non-adjacent physical pages.

Striding Across Pages to Break Prefetch

First attempt: separate every access by exactly one cache line (64 bytes). That yields 719 million cycles - 4x slower than linear, but still twice as fast as random. The prefetcher still sees a regular 64-byte stride and pulls in future cache lines, just not across pages.

Second attempt: stride by a full page (4096 bytes). Each access lands in a different page. The TLB must reload a translation for every load. The prefetcher sees no pattern it can follow across boundaries. Result: 2.05 billion cycles - 30% worse than random, which at least sometimes hits the same page twice and gets a TLB hit.

The code is straightforward: nested loops that iterate over the 1024 elements within a page, then move to the next page's corresponding element. Every load misses L1, L2, likely L3, and triggers a page walk.

What This Teaches About Modern CPUs

This isn't a lab curiosity. If you write code that walks a large array in a page-strided pattern - say, accessing column-major data in a row-major loop - you'll silently pay a massive penalty. The hardware prefetchers are tuned for streaming within pages, not for arbitrary striding across them.

Wei Neng's full experiment, including the separated_by_a_page function, runs on an Intel Core Ultra 7 268V with L1d 320 KiB, L2 14 MiB, L3 12 MiB. Huge pages are disabled, which is the default in most user-space code. The same effect applies on any x86-64 CPU with a page-based TLB and limited cross-page prefetch.

Next time you see a hot loop that jumps across pages, don't blame the random number generator - blame the page boundary.


Source: Data Access Patterns That Makes Your CPU Really Angry
Domain: blog.weineng.me

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.