sync.RWMutex, the reflexive fix for contended reads, is a trap: at 8 cores it plateaus at 2x single-core throughput while a 256-shard lock-striped map reaches 6.9x. A detailed benchmark of six in-memory Go cache designs, all using only the standard library, makes the case brutally clear.
The author built the same string-to-string cache six ways: naive (no locking), a single sync.Mutex, a single sync.RWMutex, sync.Map, a sharded map with 256 mutexes, and a copy-on-write map using atomic.Pointer. All six implement the same interface, so one test harness drives them identically. The benchmark used 1,000,000 keys on an i7-14700K, scaling from 1 to 8 physical P-cores (pinned so the chip's hybrid P-core/E-core layout does not distort the results) and measuring ns/op under uniform and Zipfian key distributions.
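To make the "one interface, one harness" idea concrete, here is a minimal sketch. The interface name, its method set, and the exercise helper are assumptions for illustration; the source does not show the author's actual definitions. Only two of the six designs are included.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a hypothetical shared interface; the author's real method set
// is not given in the source.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// MutexCache guards one map with a single sync.Mutex.
type MutexCache struct {
	mu sync.Mutex
	m  map[string]string
}

func NewMutexCache() *MutexCache { return &MutexCache{m: make(map[string]string)} }

func (c *MutexCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *MutexCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = value
}

// RWMutexCache lets readers share the lock; writers still take it exclusively.
type RWMutexCache struct {
	mu sync.RWMutex
	m  map[string]string
}

func NewRWMutexCache() *RWMutexCache { return &RWMutexCache{m: make(map[string]string)} }

func (c *RWMutexCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *RWMutexCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = value
}

// exercise drives any Cache the same way: concurrent writers fill it,
// then every key is read back. It returns the number of successful reads.
func exercise(c Cache, workers, perWorker int) int {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Set(fmt.Sprintf("k%d-%d", w, i), "v")
			}
		}(w)
	}
	wg.Wait()

	hits := 0
	for w := 0; w < workers; w++ {
		for i := 0; i < perWorker; i++ {
			if _, ok := c.Get(fmt.Sprintf("k%d-%d", w, i)); ok {
				hits++
			}
		}
	}
	return hits
}

func main() {
	fmt.Println("mutex hits:", exercise(NewMutexCache(), 4, 100))
	fmt.Println("rwmutex hits:", exercise(NewRWMutexCache(), 4, 100))
}
```

Because both types satisfy the same interface, swapping one design for another changes only the constructor call, which is what lets a single benchmark compare them fairly.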
The Numbers That Matter
At 8 cores with a read-only workload, the sharded design delivers 11.5 ns/op. The single mutex gives 168 ns/op. sync.RWMutex manages 53 ns/op: about 3x better than the mutex, but still more than 4x slower than sharded, and it hits that ceiling at just 4 cores. The copy-on-write design matches sharded at 11.5 ns/op on reads but explodes to 82,500,000 ns per write (about 82.5 milliseconds) because every Set copies the entire million-entry map. sync.Map lands at 30 ns/op for read-only and also trails sharded on every other mix.
Under a balanced mix (50% reads, 50% writes), sharded takes 24 ns/op. The single mutex slows to 190 ns/op, and sync.RWMutex is even worse at 282 ns/op: each writer has to wait for in-flight readers to drain, and the lock does more bookkeeping than a plain mutex. The copy-on-write design sinks to 46,500,000 ns/op (about 46.5 milliseconds).
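The copy-on-write numbers follow directly from the design's shape, which a minimal sketch makes visible. This is an illustration under assumptions, not the author's code: the type name is hypothetical, and serializing writers with a mutex is one choice among several (a compare-and-swap retry loop is another). It needs Go 1.19 or later for atomic.Pointer.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// COWCache is a hypothetical copy-on-write cache. Readers never lock;
// they load the current map pointer atomically and read from it.
type COWCache struct {
	mu sync.Mutex // assumption: serializes writers only, readers ignore it
	p  atomic.Pointer[map[string]string]
}

func NewCOWCache() *COWCache {
	c := &COWCache{}
	m := make(map[string]string)
	c.p.Store(&m)
	return c
}

// Get costs one atomic load plus one map lookup, which is why reads are cheap.
func (c *COWCache) Get(key string) (string, bool) {
	m := *c.p.Load()
	v, ok := m[key]
	return v, ok
}

// Set copies every existing entry into a fresh map before publishing it.
// The cost is proportional to the map size, not to the single key written.
func (c *COWCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := *c.p.Load()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = value
	c.p.Store(&next)
}

// snapshot returns the map currently published; it is never mutated later.
func (c *COWCache) snapshot() map[string]string { return *c.p.Load() }

func main() {
	c := NewCOWCache()
	c.Set("a", "1")
	before := c.snapshot()
	c.Set("b", "2")
	fmt.Println("old snapshot size:", len(before))
	fmt.Println("new snapshot size:", len(c.snapshot()))
}
```

Published maps are immutable, so readers need no lock at all. The price is the full copy inside Set, which with a million entries dominates everything else in any mix that contains writes.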
Why Skew Isn't Simply Worse
Real workloads are often Zipfian: a few hot keys see most of the traffic. The common assumption is that skew hurts performance. The benchmark shows it's more nuanced. Reads speed up for almost every design because hot keys stay in CPU cache. The sharded map, however, slows down under skew: hot keys land on a few shards, so those locks contend while the rest sit idle. For a balanced mix under skew, sharded runs at 0.82x its uniform throughput. The copy-on-write design is unaffected because its write cost is uniform: every write copies the whole map regardless of key.
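The shard-concentration effect can be seen without timing anything, by counting where keys land. The sketch below is illustrative only: the Zipf parameters (s = 1.1, v = 1), the FNV-1a hash, and the helper names are assumptions, since the source does not state what the benchmark used.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"strconv"
)

const (
	numShards = 256       // power of two, so a bitmask selects the shard
	numKeys   = 1_000_000 // matches the key count quoted in the article
)

// shardOf hashes a key with FNV-1a (an assumed hash) and masks to a shard index.
func shardOf(key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	return h.Sum64() & (numShards - 1)
}

// hottestShare draws n key IDs from next and returns the fraction of them
// that landed on the single busiest shard.
func hottestShare(next func() uint64, n int) float64 {
	var counts [numShards]int
	for i := 0; i < n; i++ {
		counts[shardOf(strconv.FormatUint(next(), 10))]++
	}
	peak := 0
	for _, c := range counts {
		if c > peak {
			peak = c
		}
	}
	return float64(peak) / float64(n)
}

// compareSkew returns the busiest-shard share under uniform and Zipfian keys.
func compareSkew(seed int64, n int) (uniform, zipf float64) {
	r := rand.New(rand.NewSource(seed))
	uniform = hottestShare(func() uint64 { return uint64(r.Int63n(numKeys)) }, n)
	z := rand.NewZipf(r, 1.1, 1, numKeys-1) // assumed skew parameters
	zipf = hottestShare(z.Uint64, n)
	return uniform, zipf
}

func main() {
	u, z := compareSkew(1, 200_000)
	fmt.Printf("busiest shard, uniform keys: %.2f%% of operations\n", u*100)
	fmt.Printf("busiest shard, Zipfian keys: %.2f%% of operations\n", z*100)
	fmt.Printf("even split would be:         %.2f%%\n", 100.0/numShards)
}
```

Under uniform keys the busiest shard stays close to the even split, while under Zipfian keys one shard absorbs a much larger share because the hottest key always hashes to the same place. That one lock becomes the bottleneck for any mix that includes writes.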
The Winner in 15 Lines
The sharded design is trivially simple: N independent maps, each behind its own mutex. A hash of the key picks the shard. With 256 shards and evenly spread keys, the chance that two operations want the same lock drops by roughly 256x. The implementation is about 15 lines of core logic. The benchmark routes uint64 hash keys to shards with a bitmask; the author explains that because 256 is a power of two, the modulo reduces to a single bitwise AND (h & 255 gives the same result as h % 256).
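A minimal sketch of that design follows. It is not the author's code: the type and function names are hypothetical, and FNV-1a is an assumed hash, since the source does not say which one the benchmark uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// numShards is a power of two, so h & (numShards-1) equals h % numShards.
const numShards = 256

// shard is one independent map with its own lock.
type shard struct {
	mu sync.Mutex
	m  map[string]string
}

// ShardedCache is a hypothetical lock-striped cache.
type ShardedCache struct {
	shards [numShards]shard
}

func NewShardedCache() *ShardedCache {
	c := &ShardedCache{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]string)
	}
	return c
}

// shardIndex hashes the key (FNV-1a, an assumption) and masks to 0..255.
func shardIndex(key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	return h.Sum64() & (numShards - 1)
}

func (c *ShardedCache) Get(key string) (string, bool) {
	s := &c.shards[shardIndex(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[key]
	return v, ok
}

func (c *ShardedCache) Set(key, value string) {
	s := &c.shards[shardIndex(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
}

func main() {
	c := NewShardedCache()

	// Eight goroutines write disjoint keys; most of the time they hold
	// different shard locks and do not block each other.
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Set(fmt.Sprintf("w%d-k%d", w, i), "v")
			}
		}(w)
	}
	wg.Wait()

	v, ok := c.Get("w3-k42")
	fmt.Println(v, ok)
}
```

Each operation takes exactly one lock, and two operations only wait on each other when their keys hash to the same shard. A plain sync.Mutex per shard matches the "256 mutexes" the article describes; the core logic is the shard struct, shardIndex, Get, and Set.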
For any Go in-memory cache handling concurrent access, this leaves little doubt: shard your locks.
Source: Shard your locks: benchmarking 6 Go cache designs
Domain: strebkov.dev