Tiara Cuts 10-Hop RDMA Chain Latency by 2.85x with a Programmable NIC ISA

Graph traversals, page-table walks, and disaggregated LLM inference all hit the same wall: each level of indirection costs a separate network round-trip over RDMA. Tiara's FPGA prototype cuts latency on 10-hop chains by 2.85x.

RDMA one-sided verbs are the natural primitive for memory disaggregation, but they demand the client provide the exact remote address upfront. When that address depends on data you must first read from remote memory, you are stuck in a sequential chain of 1-RTT operations. That pattern crops up everywhere: graph traversals following pointers hop by hop, address translation walking multi-level page tables, distributed coordination with conditional multi-host logic, and disaggregated LLM inference resolving paged KV caches through block-table lookups. Each level of indirection costs one round-trip.
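The dependency chain described above can be made concrete with a toy model. The sketch below is not RDMA code: `RemoteMemory`, `read`, and `chase` are hypothetical names, and a Python dict stands in for the remote memory region. It assumes each one-sided READ costs exactly one round-trip and only counts how many a 10-hop pointer chase needs.

```python
class RemoteMemory:
    """Toy stand-in for a remote region reachable only by one-sided READs."""

    def __init__(self, cells):
        self.cells = dict(cells)   # address -> stored value (here: next pointer)
        self.round_trips = 0

    def read(self, addr):
        # Assumption: every one-sided READ costs one network round-trip.
        self.round_trips += 1
        return self.cells[addr]


def chase(mem, head, hops):
    """Follow `hops` pointers starting at `head`, one READ per hop."""
    addr = head
    for _ in range(hops):
        # The next address is unknown until the previous READ returns,
        # so these READs cannot be issued in parallel.
        addr = mem.read(addr)
    return addr


# A 10-node chain: address 0 points to 8, 8 points to 16, and so on.
cells = {i * 8: (i + 1) * 8 for i in range(10)}
mem = RemoteMemory(cells)
final = chase(mem, head=0, hops=10)
print(final, mem.round_trips)
```

In this model the round-trip count grows linearly with the depth of indirection, which is the cost the rest of the article is concerned with.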

Why RDMA's One-Sided Verbs Hit a Wall

Existing ways of offloading indirection on RDMA NICs either burn remote CPU cycles or suffer limited throughput. The result is the Indirection Wall, a bottleneck that makes pointer-chasing workloads over disaggregated memory painfully slow. Handing the work to the remote CPU defeats the purpose of one-sided RDMA, which exists to bypass that CPU, and fixed-function custom hardware is too rigid to handle varied patterns like PagedAttention block tables or MoE expert lists.

Tiara's Compact ISA: eBPF for the NIC

Tiara introduces a compact, statically verifiable instruction set that executes directly on the memory-side NIC. Operators pre-register Tiara programs, analogous to eBPF programs in the kernel. A single Tiara program resolves the whole chain of indirection locally, collapsing multiple sequentially dependent round-trips into one. The ISA is minimal enough to run at line rate on an FPGA, yet expressive enough to handle pointer chasing, conditional lookups, and arithmetic needed for page-table walks.
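To illustrate the register-then-invoke idea, here is a minimal sketch of a memory-side executor. It is not Tiara's ISA: the opcodes `LOAD` and `ADDI`, the limits `MAX_STEPS` and `NUM_REGS`, and the `MemorySideNic` class are all assumptions made up for this example. The verifier here accepts only short straight-line programs, which is one simple way to guarantee termination; the real design may permit more.

```python
MAX_STEPS = 64   # hypothetical bound on program length
NUM_REGS = 4     # hypothetical register file size


def verify(program):
    """Accept only short straight-line programs over known opcodes."""
    if len(program) > MAX_STEPS:
        return False
    for op, *args in program:
        if op == "LOAD" and len(args) == 2:      # LOAD dst, src
            regs = args
        elif op == "ADDI" and len(args) == 3:    # ADDI dst, src, imm
            regs = args[:2]
        else:
            return False                         # unknown opcode or bad arity
        if not all(isinstance(r, int) and 0 <= r < NUM_REGS for r in regs):
            return False
    return True


class MemorySideNic:
    """Toy executor that runs pre-registered programs next to the memory."""

    def __init__(self, cells):
        self.cells = dict(cells)
        self.programs = {}
        self.round_trips = 0

    def register(self, name, program):
        if not verify(program):
            raise ValueError("program rejected by verifier")
        self.programs[name] = list(program)

    def invoke(self, name, arg):
        self.round_trips += 1        # one request/response for the whole chain
        regs = [0] * NUM_REGS
        regs[0] = arg
        for op, *args in self.programs[name]:
            if op == "LOAD":
                dst, src = args
                regs[dst] = self.cells[regs[src]]   # local memory access
            else:                                   # ADDI
                dst, src, imm = args
                regs[dst] = regs[src] + imm
        return regs[0]


cells = {i * 8: (i + 1) * 8 for i in range(10)}
nic = MemorySideNic(cells)
nic.register("chase10", [("LOAD", 0, 0)] * 10)   # 10 dependent loads, unrolled
result = nic.invoke("chase10", 0)
print(result, nic.round_trips)
```

The client pays for one round-trip regardless of how many dependent loads the registered program performs, because every intermediate address is resolved where the data lives.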

Benchmarks That Prove the Point

On an FPGA-based prototype, Tiara posts the following results against one-sided RDMA: 10-hop graph-traversal latency drops 2.85x while throughput rises 3.4x. Page-table walk latency falls 62%. Uncontended distributed-lock latency improves 2.9x. For disaggregated PagedAttention at 8 KB blocks, throughput jumps 2.8x. MoE expert-gather latency with 32 experts improves 1.88x. These are not simulation estimates; they are hardware measurements.
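The results above mix two units: speedup factors ("2.85x") and percentage reductions ("62%"). A small conversion puts them on the same scale. This sketch assumes "Nx" means baseline latency divided by new latency, and "falls P%" means new latency is (1 - P) of baseline; the function names are illustrative.

```python
def reduction_to_speedup(reduction):
    """Convert a fractional latency reduction (0.62) to a speedup factor."""
    return 1.0 / (1.0 - reduction)


def speedup_to_reduction(speedup):
    """Convert a speedup factor (2.85) to a fractional latency reduction."""
    return 1.0 - 1.0 / speedup


print(f"{reduction_to_speedup(0.62):.2f}x")   # 2.63x
print(f"{speedup_to_reduction(2.85):.1%}")    # 64.9%
```

Under those assumptions, the 62% page-table-walk reduction corresponds to roughly a 2.63x speedup, in the same range as the 2.85x graph-traversal figure.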

What This Enables Next

Tiara treats the network as a computation substrate rather than a dumb pipe, offloading indirection resolution to where the data lives. That opens the door for truly disaggregated datacenters where graph analytics, virtual memory, distributed locks, and LLM serving all share pooled memory without round-trip penalties. The next step is a production ASIC that takes Tiara's FPGA prototype and pushes it to line rate at 100 Gbps.

Source: Tiara: A Programmable Line-Rate ISA for Remote Memory Access
Domain: arxiv.org