Inside Modern CPUs: L1 Cache at 3 Cycles, Divisions at 20, and Why RIPC Rarely Reaches 4

A deep dive into CPU physics shows why physical distance drives latency, why L1 cache hits take 3 cycles, how divisions have improved to 20 cycles, and why superscalar efficiency is harder to achieve than you might think.

cpu architecture, cache hierarchy, superscalar, branch prediction, performance optimization, 64 bit cpus

L1 data cache access takes 3 CPU cycles; division operations—once a 100+ cycle monster—now complete in 20. That improvement is real, but the physics that makes L1 slow hasn't budged.

Distance Drives Latency from L1 to DRAM

Electrical signals pay for every millimeter. Parasitic capacitance grows linearly with the length of a connection, and it is this capacitance, not the speed of light, that sets the delay. For a 0.5 mm round trip, light needs ~3e-12 seconds; a real register-register operation takes ~3e-10 seconds, two orders of magnitude more. Capacitance, not relativity, is the bottleneck. L1D hits cost 3 cycles, L2 hits 10–15 cycles. Writes appear "almost instant" because the CPU just issues the command and moves on without waiting for the write to complete.
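The two-orders-of-magnitude gap can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions not stated in the source: the round trip is read as 0.5 mm each way (1 mm in total), and the core clock is taken to be 3 GHz, so that one cycle matches the ~3e-10 s figure. All function names are made up for this example.

```cpp
#include <cstdio>

// Speed of light in vacuum, metres per second (exact by SI definition).
constexpr double kSpeedOfLight = 299'792'458.0;

// Time for light to cover `metres` in vacuum, in seconds.
constexpr double light_time_s(double metres) {
    return metres / kSpeedOfLight;
}

// Duration of one clock cycle at `hertz`, in seconds.
constexpr double cycle_time_s(double hertz) {
    return 1.0 / hertz;
}

// How many times longer one cycle is than light's travel time over the path.
constexpr double cycle_to_light_ratio(double metres, double hertz) {
    return cycle_time_s(hertz) / light_time_s(metres);
}

// Prints the comparison for the assumed path length and clock.
inline void print_latency_comparison() {
    constexpr double path_m = 2 * 0.5e-3;  // assumption: 0.5 mm each way
    constexpr double clock_hz = 3.0e9;     // assumption: 3 GHz core clock
    std::printf("light: %.2e s, cycle: %.2e s, ratio: %.0f\n",
                light_time_s(path_m), cycle_time_s(clock_hz),
                cycle_to_light_ratio(path_m, clock_hz));
}
```

Under these assumptions the ratio comes out at roughly 100, which matches the "two orders of magnitude" claim. A different clock speed or path length shifts the number but not the conclusion.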

Superscalar Can Retire 12 IPC, But You'll Be Lucky to See 4

Modern CPUs pack multiple ALUs and SIMD units, enabling more than one operation per cycle. That's why micro-benchmarks occasionally report 0.75 cycles per operation: it's statistical throughput, not sub-cycle execution. Retired Instructions Per Cycle (RIPC) can theoretically hit 10–12, but in real ALU-bound code achieving even 4 is a stretch. RAM-bound algorithms fare worse. That gap between theory and practice is where optimization lives.
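One way to see why sub-cycle figures are throughput rather than latency is to compare a single dependency chain with several independent ones. The sketch below is a minimal illustration with hypothetical function names, not code from the original article. Both functions compute the same sum, but the second gives the core four additions per iteration that do not depend on each other.

```cpp
#include <cstddef>
#include <cstdint>

// One dependency chain: every add has to wait for the previous result,
// so the loop is limited by the latency of a single add.
constexpr std::uint64_t sum_one_chain(const std::uint64_t* data, std::size_t n) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Four independent chains: the adds into a0..a3 do not depend on each
// other, so a superscalar core is free to issue several in one cycle.
constexpr std::uint64_t sum_four_chains(const std::uint64_t* data, std::size_t n) {
    std::uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        a0 += data[i];
    }
    return (a0 + a1) + (a2 + a3);
}

// Average cycles per operation implied by a given IPC figure.
constexpr double cycles_per_op(double ipc) {
    return 1.0 / ipc;
}
```

Because unsigned addition is associative, an optimizing compiler may already turn the first loop into something like the second, or vectorize both, so any measured difference depends heavily on compiler flags and hardware. With floating-point data the compiler usually cannot reorder the sum on its own, and the manual split matters more.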

Branch Mispredictions Are the One Instruction-Latency Trap

Outside of branching, instruction handling is near-perfectly optimized: delays come from data, not instructions. A mispredicted branch flushes the pipeline and costs many cycles. Compiler hints like [[likely]]/[[unlikely]] help, but the hardware predictor is already good. Still, if you're counting cycles, branches remain the one place where instruction flow matters.

These numbers aren't trivia. They're the physics your compiler hides and your profiler reveals. Knowing that L1 costs 3 cycles and division costs 20 tells you when to inline, when to use a table lookup, and when to stop optimizing.
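As an illustration of the branch hints discussed above, the sketch below uses the C++20 `[[unlikely]]` attribute on a rarely taken path and sets it next to a variant that avoids the conditional jump altogether. The function names and the "negative values are rare" scenario are assumptions made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: the skip path is marked as rarely taken, which lets
// the compiler lay out the common path as the fall-through case.
constexpr std::uint64_t sum_non_negative_hinted(const std::int32_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < 0) [[unlikely]] {
            continue;  // assumed-rare case: skip negative values
        }
        total += static_cast<std::uint64_t>(data[i]);
    }
    return total;
}

// Branch-free version: multiplying by the comparison result (0 or 1)
// replaces the conditional jump in the source. The compiler still picks
// its own instruction sequence.
constexpr std::uint64_t sum_non_negative_branchless(const std::int32_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::int32_t v = data[i];
        total += static_cast<std::uint64_t>(v) * static_cast<std::uint64_t>(v >= 0);
    }
    return total;
}
```

The hint only influences code layout and cannot rescue a branch that is genuinely unpredictable. In that case the branch-free form tends to be the safer bet, at the price of always doing the multiply. Which one wins on a given machine is something to measure rather than assume.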


Source: On CPU Physics and CPU Cycles
Domain: 6it.dev
