ClickHouse Ships Silk: A Fiber Runtime That Yields in 3.6 Nanoseconds

Silk is a stackful-fiber scheduler built on NUMA-aware work stealing and io_uring. It delivers about 15x the throughput of boost::asio at a single connection and 5.9 million file IOPS.

clickhouse, silk, fiber runtime, cpp, io uring, databases

3.6 nanoseconds per fiber yield with cross-CPU work stealing. That's the headline number from Silk, the stackful-fiber runtime ClickHouse just shipped.

Silk is not a toy. It sits alongside ClickHouse's existing thread-per-core query engine and targets the I/O-bound tail: distributed cache lookups, object-storage reads, HTTP fan-out, replica coordination. For those workloads the 99.9th-percentile latency is what matters, and at that tail a kernel thread context switch costs microseconds and the OS scheduler becomes the limiting factor.

Why ClickHouse Built Its Own Fiber Runtime

Off-the-shelf options like boost::asio or C++20 coroutines did not meet all four requirements: a fiber yield in tens of nanoseconds, work stealing that respects NUMA topology, zero heap allocation in the steady state, and io_uring treated as the I/O ground truth rather than a bolted-on backend.

Stackless coroutines are cheap but viral. Every function on a suspension path must itself become a coroutine and be awaited with co_await at each call site, and the compiler's heap allocation elision (HALO) breaks when the coroutine handle escapes to a scheduler queue. OS threads cost a few microseconds per context switch and eat kilobytes of stack per instance. Silk's stackful fibers let any function yield without any language-level marking: the stack is a normal stack, mmap'd per fiber with guard pages. No slab allocation, no cache aliasing. The 13% overhead that Alibaba's Photon paper attributed to slab-allocated stacks does not appear here because the precondition does not exist.
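To make the stackful idea concrete, here is a minimal sketch, not Silk's code: it maps a private stack with mmap, turns the lowest page into a guard page with mprotect, and uses the POSIX ucontext calls to let an ordinary nested function suspend its fiber. Every name in it (FiberStack, alloc_stack, fiber_yield, run_demo) is hypothetical. swapcontext is used only because it is portable; glibc's version also saves the signal mask, so it is far slower than the hand-tuned switch a runtime quoting single-digit nanoseconds would need.

```cpp
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Hypothetical illustration type; not part of Silk.
struct FiberStack {
    void*  base  = nullptr;  // start of the mapping (guard page comes first)
    size_t total = 0;        // guard page + usable bytes
    size_t guard = 0;        // size of the inaccessible guard region

    void*  usable() const      { return static_cast<char*>(base) + guard; }
    size_t usable_size() const { return total - guard; }
};

// Map a private stack and make its lowest page inaccessible. Stacks grow
// downward on x86-64 and AArch64, so an overflow hits the guard and faults.
inline bool alloc_stack(FiberStack& s, size_t usable_bytes) {
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    usable_bytes = (usable_bytes + page - 1) / page * page;  // round up to pages
    s.guard = page;
    s.total = usable_bytes + page;
    s.base  = mmap(nullptr, s.total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (s.base == MAP_FAILED) { s.base = nullptr; return false; }
    if (mprotect(s.base, s.guard, PROT_NONE) != 0) {
        munmap(s.base, s.total);
        s.base = nullptr;
        return false;
    }
    return true;
}

inline void free_stack(FiberStack& s) {
    if (s.base) munmap(s.base, s.total);
    s.base = nullptr;
}

static ucontext_t g_main_ctx, g_fiber_ctx;
static std::vector<int> g_trace;

// A plain function: no coroutine return type, no co_await, yet it suspends
// the fiber it runs on and hands control back to the caller of the fiber.
static void fiber_yield() { swapcontext(&g_fiber_ctx, &g_main_ctx); }

// A helper two frames below the fiber entry point that yields mid-call.
static void deep_helper(int step) {
    g_trace.push_back(step);
    fiber_yield();
}

static void fiber_entry() {
    deep_helper(1);
    deep_helper(2);
    g_trace.push_back(3);
}

// Runs one fiber to completion, resuming it three times, and returns the
// interleaving of fiber steps (positive) and main-side steps (negative).
inline std::vector<int> run_demo() {
    g_trace.clear();
    FiberStack st;
    if (!alloc_stack(st, 64 * 1024)) return g_trace;

    getcontext(&g_fiber_ctx);
    g_fiber_ctx.uc_stack.ss_sp   = st.usable();
    g_fiber_ctx.uc_stack.ss_size = st.usable_size();
    g_fiber_ctx.uc_link          = &g_main_ctx;  // resume main when the fiber ends
    makecontext(&g_fiber_ctx, fiber_entry, 0);

    swapcontext(&g_main_ctx, &g_fiber_ctx);  // runs until the first yield
    g_trace.push_back(-1);
    swapcontext(&g_main_ctx, &g_fiber_ctx);  // runs until the second yield
    g_trace.push_back(-2);
    swapcontext(&g_main_ctx, &g_fiber_ctx);  // runs to completion

    free_stack(st);
    return g_trace;
}
```

Calling run_demo() yields the trace 1, -1, 2, -2, 3: the fiber and its caller alternate even though deep_helper is an ordinary function. With stackless coroutines, both fiber_entry and deep_helper would have to be rewritten as coroutines.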

Silk's Architecture: Per-CPU Scheduler with rseq

Silk runs one OS thread per CPU, pinned. Each scheduler owns a per-CPU ProcessorState with a bounded ready queue (Vyukov MPMC, cache-line-aligned), an io_uring ring, a sleep tree ordered by deadline, and an eventfd for wakeup. When a local queue runs dry, work stealing pulls tasks from other cores. The per-CPU lock-free stack is built on Linux rseq (restartable sequences) and benchmarks 2068x faster than a global lock-free stack at 32 threads.
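The ready queue named above follows a well-known design: Dmitry Vyukov's bounded MPMC ring, in which every cell carries a sequence number that tells producers and consumers whose turn it is. The sketch below is a generic rendering of that published algorithm plus a naive steal loop, not Silk's implementation. BoundedMpmcQueue, next_task, and the 64-byte cache line are assumptions, and the steal loop scans victims in plain round-robin order rather than by NUMA distance.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Bounded multi-producer/multi-consumer ring after Vyukov's design.
// Capacity must be a power of two; T must be default-constructible.
template <typename T>
class BoundedMpmcQueue {
public:
    explicit BoundedMpmcQueue(size_t capacity_pow2)
        : mask_(capacity_pow2 - 1), cells_(new Cell[capacity_pow2]) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & mask_) == 0);
        // Cell i starts with sequence i: "free, waiting for ticket i".
        for (size_t i = 0; i < capacity_pow2; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }

    bool try_push(T value) {
        size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            size_t seq = c.seq.load(std::memory_order_acquire);
            intptr_t diff = static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos);
            if (diff == 0) {  // cell is free for ticket pos: try to claim it
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    c.value = std::move(value);
                    c.seq.store(pos + 1, std::memory_order_release);  // publish
                    return true;
                }
            } else if (diff < 0) {
                return false;  // cell still holds an unconsumed item: full
            } else {
                pos = tail_.load(std::memory_order_relaxed);  // lost a race
            }
        }
    }

    bool try_pop(T& out) {
        size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& c = cells_[pos & mask_];
            size_t seq = c.seq.load(std::memory_order_acquire);
            intptr_t diff = static_cast<intptr_t>(seq) - static_cast<intptr_t>(pos + 1);
            if (diff == 0) {  // cell was published for ticket pos
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    out = std::move(c.value);
                    // Mark the cell free for the producer one lap ahead.
                    c.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // nothing published here yet: empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct alignas(64) Cell {  // assumed 64-byte cache line
        std::atomic<size_t> seq;
        T value;
    };
    const size_t mask_;
    std::unique_ptr<Cell[]> cells_;
    alignas(64) std::atomic<size_t> head_{0};
    alignas(64) std::atomic<size_t> tail_{0};
};

// Naive stealing: drain our own queue first; only when it is empty, scan the
// other CPUs' queues in round-robin order (no NUMA awareness in this sketch).
template <typename T>
bool next_task(std::vector<BoundedMpmcQueue<T>*>& queues, size_t self, T& out) {
    if (queues[self]->try_pop(out)) return true;
    for (size_t i = 1; i < queues.size(); ++i) {
        size_t victim = (self + i) % queues.size();
        if (queues[victim]->try_pop(out)) return true;
    }
    return false;
}
```

The bounded capacity is what keeps the steady state allocation-free: a full queue rejects the push instead of growing, so the caller has to decide where the overflow goes.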

Benchmark methodology is solid: the repository includes a harness (./bb) that runs identical workloads through Silk and competitors, with controlled CPU pinning, fixed warmup, percentile tracking, and JSON output anyone can reproduce.
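For readers who have not built such a harness, the three ingredients named here (CPU pinning, a fixed warmup, percentile tracking) can be sketched in a few lines. This is an illustration under assumptions, not the ./bb harness: pin_current_thread, measure_ns, and percentile are hypothetical helpers, and the percentile uses the simple nearest-rank definition.

```cpp
#include <sched.h>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pin the calling thread to one CPU so the scheduler cannot migrate it
// mid-run. Linux-specific; returns false if the kernel refuses.
inline bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Run `op` `warmup` times unmeasured (caches, branch predictors, page faults),
// then record one latency sample in nanoseconds per measured iteration.
template <typename Fn>
std::vector<int64_t> measure_ns(Fn&& op, size_t warmup, size_t iterations) {
    using clock = std::chrono::steady_clock;
    for (size_t i = 0; i < warmup; ++i) op();
    std::vector<int64_t> samples;
    samples.reserve(iterations);
    for (size_t i = 0; i < iterations; ++i) {
        auto t0 = clock::now();
        op();
        auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    return samples;
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Returns 0 for an empty input.
inline int64_t percentile(std::vector<int64_t> samples, double p) {
    if (samples.empty()) return 0;
    std::sort(samples.begin(), samples.end());
    size_t rank = static_cast<size_t>(std::ceil(p / 100.0 * samples.size()));
    rank = std::max<size_t>(1, std::min(rank, samples.size()));
    return samples[rank - 1];
}
```

Timing each call individually is only meaningful for operations well above the clock's own overhead. For something as short as a fiber yield, a harness would normally time a large batch and divide.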

Benchmarks That Back the Claims

Silk delivers roughly 3.6 ns per fiber yield with cross-CPU stealing, 7.6 microseconds for an io_uring ping-pong, and 5.9 million file IOPS at a working configuration. Against boost::asio, Silk shows about 15x higher throughput at one connection and roughly 4x at high concurrency. These numbers are not hypothetical; they come from the same harness you can run today.
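The three headline figures are easier to compare once they are converted to rates. The snippet below is plain arithmetic on the numbers quoted above, with no new measurements. The constant and function names are made up for illustration, and the results are ceilings implied by the quoted figures, not observed throughput.

```cpp
// Figures quoted in the article.
constexpr double kYieldNs    = 3.6;    // ns per fiber yield
constexpr double kPingPongUs = 7.6;    // microseconds per io_uring ping-pong
constexpr double kFileIops   = 5.9e6;  // file I/O operations per second

// Upper bound on yields per second for a thread that does nothing but yield:
// about 278 million.
constexpr double yields_per_second() { return 1e9 / kYieldNs; }

// Strictly sequential ping-pong round trips per second on one chain:
// about 131,600.
constexpr double round_trips_per_second() { return 1e6 / kPingPongUs; }

// Average time between I/O completions across the whole system at 5.9M IOPS:
// about 169 ns.
constexpr double ns_per_io() { return 1e9 / kFileIops; }

static_assert(yields_per_second() > 2.77e8 && yields_per_second() < 2.78e8,
              "1e9 / 3.6 is roughly 278 million");
static_assert(ns_per_io() > 169.0 && ns_per_io() < 170.0,
              "1e9 / 5.9e6 is roughly 169 ns");
```

Put side by side, one io_uring round trip costs about as much as 2,100 yields (7,600 ns / 3.6 ns), which is why the yield path is small next to the I/O it is waiting on.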

ClickHouse plans to integrate Silk first into its distributed cache. If the benchmarks hold under production load, the tail-latency improvements for object-storage I/O could reshape how the engine handles remote reads.


Source: Announcing Silk: a silky smooth fiber runtime for ClickHouse
Domain: clickhouse.com
