REMOP Cuts Remote-Memory Transfer Rounds by 97% in DuckDB

A new operator optimization framework for DuckDB reduces remote-memory transfer rounds by up to 97% and cuts the average runtime of spilling queries by roughly 23-26% on the TPC-H and TPC-DS benchmarks.

Tags: DuckDB, REMOP, remote memory, database operators, TPC-H, TPC-DS

Spilling to remote memory instead of disk sounds like a clean win for analytical databases, but it comes with a hidden tax: each transfer pays a fixed round-trip latency on top of its bandwidth cost. Classical spilling strategies that minimize total I/O volume can actually hurt here, because they tend to issue many small transfers. REMOP tackles this head-on by making the number of transfer rounds a first-class term in the cost model. The authors implement the approach for three operators inside DuckDB: blocked nested-loop join, external merge sort, and external hash join. On a two-node compute-memory testbed, transfer rounds drop by up to 97% and operator runtime by up to 48% on spill-heavy microbenchmarks.
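To make the latency tax concrete, here is a minimal sketch of a transfer-time estimate with a per-round term. It is not the paper's cost model: the `Link` class, the `transfer_time` function, and the latency and bandwidth figures are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical remote-memory link; both values are assumed, not measured."""
    latency_s: float       # fixed cost paid once per transfer round
    bandwidth_bps: float   # sustained throughput in bytes per second


def transfer_time(total_bytes: int, batch_bytes: int, link: Link) -> tuple[int, float]:
    """Return (rounds, estimated seconds) for moving total_bytes in batches."""
    rounds = math.ceil(total_bytes / batch_bytes)
    # Latency scales with the number of rounds, bandwidth with the byte count.
    seconds = rounds * link.latency_s + total_bytes / link.bandwidth_bps
    return rounds, seconds


if __name__ == "__main__":
    link = Link(latency_s=5e-6, bandwidth_bps=10e9)  # assumed figures
    for batch in (4096, 4 * 2**20):
        rounds, seconds = transfer_time(2**30, batch, link)
        print(f"batch={batch} bytes: rounds={rounds}, time={seconds:.3f}s")
```

The byte count is the same in both runs, so the bandwidth term is identical; only the number of rounds, and with it the latency term, changes.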

Why Disk-Optimized Heuristics Fail Over RDMA

Disk spilling costs are dominated by seek times and sequential bandwidth. Minimizing total pages written (or read) is a sensible proxy for runtime. But remote memory over RDMA or similar interconnects flips that: per-transfer latency becomes the bottleneck, not the total byte count. A naive buffer-allocation heuristic might fragment writes into many small round trips, each wasting microseconds on handshake overhead.

REMOP builds operator-specific buffer-partitioning strategies that batch pages into transfers, aggressively reducing the count of round trips even if total I/O volume rises slightly. For a blocked nested-loop join, the framework partitions the inner relation's memory budget into large chunks that minimize the number of times the outer relation must be rescanned from remote memory. For external merge sort, REMOP adjusts run sizes to fit the transfer window, and for hash join it skews partition sizes to favor fewer, larger transfers.
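The trade-off between rounds and volume can be sketched with a textbook-style blocked nested-loop join. This is a toy model under stated assumptions, not REMOP's partitioning formulas. It assumes a memory budget split into an outer block and an inner transfer buffer, that each outer block arrives in a single transfer, and that the inner relation is streamed one buffer per round. The names `bnlj_plan` and `best_split` and the per-round and per-page costs are hypothetical.

```python
import math


def bnlj_plan(outer_pages: int, inner_pages: int,
              budget_pages: int, outer_block: int) -> tuple[int, int]:
    """Return (transfer rounds, pages moved) for one split of the budget."""
    inner_buf = budget_pages - outer_block
    blocks = math.ceil(outer_pages / outer_block)
    # One round to load each outer block, then one round per inner buffer fill.
    rounds = blocks * (1 + math.ceil(inner_pages / inner_buf))
    # The inner relation is fetched once per outer block.
    pages = outer_pages + blocks * inner_pages
    return rounds, pages


def best_split(outer_pages: int, inner_pages: int, budget_pages: int,
               round_cost: float, page_cost: float) -> int:
    """Brute-force the outer block size that minimizes the estimated cost."""
    def cost(outer_block: int) -> float:
        rounds, pages = bnlj_plan(outer_pages, inner_pages,
                                  budget_pages, outer_block)
        return rounds * round_cost + pages * page_cost

    return min(range(1, budget_pages), key=cost)


if __name__ == "__main__":
    R, S, M = 1000, 1000, 100  # assumed sizes, in pages
    # round_cost=0 mimics a volume-only heuristic; the other costs are assumed.
    for label, round_cost in (("volume-only", 0.0), ("round-aware", 5.0)):
        b = best_split(R, S, M, round_cost=round_cost, page_cost=0.41)
        rounds, pages = bnlj_plan(R, S, M, b)
        print(f"{label}: outer_block={b}, rounds={rounds}, pages={pages}")
```

Setting `round_cost` to zero reproduces the disk-era objective of minimizing pages moved. Giving rounds a nonzero price shifts budget toward the inner buffer, so the plan moves somewhat more data in far fewer transfers.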

Measured Gains That Matter

The numbers aren't incremental. On TPC-H queries that spill, average runtime falls 22.7%; on TPC-DS, 26.4%. Those are end-to-end query runtimes, not just operator-level microbenchmarks. Not every query spills, but when memory pressure hits, REMOP makes a remote spill considerably less painful. DuckDB's architecture made this possible because its operator implementations are modular enough to swap in round-aware memory policies without heavy refactoring. The paper provides buffer-partitioning formulas for each operator type, which should make the approach portable to other engines that expose similar operator internals.

What This Means for Disaggregated Databases

Disaggregated memory is gaining traction in cloud deployments where compute nodes pool remote memory over a fabric such as CXL or RDMA. Analytical engines facing that architecture are likely to need a REMOP-style cost model that prices transfer rounds as well as bytes. The 97% reduction in transfer rounds suggests that existing systems may be leaving significant performance on the table. Expect follow-up work to extend these policies to window functions, aggregations, and other memory-hungry operators.


Source: REMOP: REmote-Memory-aware OPerator Optimization
Domain: arxiv.org
