Inside Modern CPUs: L1 Cache at 3 Cycles, Divisions at 20, and Why RIPC Rarely Hits 4

A deep dive into CPU physics shows why physical distance drives latency, why an L1 cache hit takes 3 cycles, how division improved to 20 cycles, and why superscalar efficiency is harder to reach than you think.

cpu architecture, cache hierarchy, superscalar, branch prediction, performance optimization, 64 bit cpus

L1 data cache access takes 3 CPU cycles; division operations—once a 100+ cycle monster—now complete in 20. That improvement is real, but the physics that makes L1 slow hasn't budged.

Distance Drives Latency from L1 to DRAM

Electrical signals pay for every millimeter. Parasitic capacitance grows linearly with connection length, and it, rather than the speed of light, sets the delay. For a 0.5mm round trip, light needs ~3e-12 seconds; a real register-register operation takes ~3e-10 seconds, two orders of magnitude more. Capacitance, not relativity, is the bottleneck. L1D hits cost 3 cycles, L2 hits 10–15 cycles. Writes are "almost instant" because the CPU just issues a command and moves on without waiting for the write to complete.
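The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, not a measurement: the 3.3 GHz clock is an assumption (the article gives no clock speed), and 0.5 mm is read as the one-way distance, which is the reading that matches the ~3e-12 second figure.

```python
# Back-of-the-envelope check: light travel time vs. one CPU cycle.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by SI definition

# Assumptions, not taken from the article:
ASSUMED_CLOCK_HZ = 3.3e9      # hypothetical clock frequency
ONE_WAY_DISTANCE_M = 0.5e-3   # 0.5 mm read as the one-way distance


def light_travel_time_s(distance_m: float) -> float:
    """Time for light in vacuum to cover distance_m metres."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S


def cycles_to_seconds(cycles: float, clock_hz: float) -> float:
    """Convert a latency in cycles to seconds at the given clock."""
    return cycles / clock_hz


light_s = light_travel_time_s(2 * ONE_WAY_DISTANCE_M)  # there and back
op_s = cycles_to_seconds(1, ASSUMED_CLOCK_HZ)          # one-cycle operation
l1_hit_s = cycles_to_seconds(3, ASSUMED_CLOCK_HZ)      # 3-cycle L1D hit

print(f"light round trip: {light_s:.2e} s")
print(f"one-cycle op:     {op_s:.2e} s")
print(f"L1D hit:          {l1_hit_s:.2e} s")
print(f"op / light ratio: {op_s / light_s:.0f}x")
```

Under these assumptions the ratio comes out a little under 100, which is consistent with the "two orders of magnitude" claim.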

Superscalar Can Retire 12 IPC, But You'll Be Lucky to See 4

Modern CPUs pack multiple ALUs and SIMD units, enabling more than one operation per cycle. That's why micro-benchmarks occasionally report 0.75 cycles per operation: it's statistical throughput, not sub-cycle execution. Retired Instructions Per Cycle (RIPC) can theoretically hit 10–12, but in real ALU-bound code achieving even 4 is a stretch. RAM-bound algorithms fare worse. That gap between theory and practice is where optimization lives.
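A figure like 0.75 cycles per operation is just total cycles divided by total operations. The sketch below shows that arithmetic; the cycle and operation counts are hypothetical inputs chosen to reproduce 0.75, and the peak and realistic IPC values are the article's 12 and 4.

```python
# Throughput arithmetic: cycles per operation vs. operations per cycle.

def cycles_per_op(total_cycles: float, ops: int) -> float:
    """Average cycles spent per operation over a whole run."""
    if ops <= 0:
        raise ValueError("ops must be positive")
    return total_cycles / ops


def ops_per_cycle(total_cycles: float, ops: int) -> float:
    """Average operations completed per cycle (the inverse view)."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return ops / total_cycles


# Hypothetical micro-benchmark counters (not from the article).
measured_cycles = 750_000
measured_ops = 1_000_000

cpo = cycles_per_op(measured_cycles, measured_ops)
ipc = ops_per_cycle(measured_cycles, measured_ops)

# Figures quoted in the article.
THEORETICAL_PEAK_IPC = 12
REALISTIC_IPC = 4
fraction_of_peak = REALISTIC_IPC / THEORETICAL_PEAK_IPC

print(f"cycles per op:    {cpo:.2f}")
print(f"ops per cycle:    {ipc:.2f}")
print(f"fraction of peak: {fraction_of_peak:.0%}")
```

No single operation finishes in under a cycle here; the average drops below 1 only because several operations overlap in the same cycle.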

Branch Mispredictions Are the One Instruction-Latency Trap

Outside of branching, instruction handling is near-perfectly optimized: delays come from data, not instructions. A mispredicted branch flushes the pipeline and costs many cycles. Compiler hints like [[likely]] / [[unlikely]] help, but the hardware predictor is already good. Still, if you're counting cycles, branches remain the one place where instruction flow matters.

These numbers aren't trivia. They're the physics your compiler hides and your profiler reveals. Knowing that L1 costs 3 cycles and division costs 20 tells you when to inline, when to use a table lookup, and when to stop optimizing.
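One way to turn these figures into a table-lookup decision is a simple expected-cost model. The sketch below is a simplification under stated assumptions: every lookup either hits L1 or pays a single miss cost, and overlap with other work is ignored. The 3-cycle and 20-cycle constants come from the article; the function names and the 200-cycle miss cost are hypothetical.

```python
# Expected-cost model: table lookup vs. recomputing with a division.
L1_HIT_CYCLES = 3      # from the article
DIVISION_CYCLES = 20   # from the article


def expected_lookup_cycles(l1_hit_rate: float, miss_cycles: float) -> float:
    """Average lookup cost when a fraction l1_hit_rate of accesses
    hits L1 and the rest pay miss_cycles each."""
    if not 0.0 <= l1_hit_rate <= 1.0:
        raise ValueError("l1_hit_rate must be between 0 and 1")
    return l1_hit_rate * L1_HIT_CYCLES + (1.0 - l1_hit_rate) * miss_cycles


def lookup_beats_division(l1_hit_rate: float, miss_cycles: float) -> bool:
    """True when the average lookup is cheaper than one division."""
    return expected_lookup_cycles(l1_hit_rate, miss_cycles) < DIVISION_CYCLES


# Misses served by L2 at 15 cycles (upper end of the article's range).
small_table = expected_lookup_cycles(0.9, miss_cycles=15)

# Misses at a hypothetical 200 cycles (an assumed far-memory cost,
# not a figure from the article).
large_table = expected_lookup_cycles(0.9, miss_cycles=200)

print(f"table in L1/L2:        {small_table:.1f} cycles")
print(f"table spilling to RAM: {large_table:.1f} cycles")
print(lookup_beats_division(0.9, 15), lookup_beats_division(0.9, 200))
```

With the same 90% L1 hit rate, the lookup wins while misses stay in L2 and loses once misses cost hundreds of cycles, so the answer depends on whether the table stays in cache.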


Source: On CPU Physics and CPU Cycles
Domain: 6it.dev
