
Why the LDT Can't Fix the Two-Byte Syscall Patching Problem

A deep dive into the two-byte patching constraint, and the 256-byte target window it leaves for punned jumps, reveals why every syscall instrumentation approach on x86-64 requires tradeoffs, and why Stephen Kell's LDT-based idea hits a dead end.


Stephen Kell's latest blog post on system call instrumentation reads like a confession: after a deep exploration of x86-64 segmentation, his LDT-based approach ended in a near miss. The core problem is that a two-byte syscall instruction (0f 05) needs to become a jump, but every jump with a useful reach is at least five bytes long (the two-byte short jump, eb, reaches only about 128 bytes in either direction). That mismatch forces everyone who patches syscalls into a tight corner.

The Two-Byte Jail

System calls on x86-64 Linux are normally made with syscall (0f 05, 2 bytes); the legacy entry point int 0x80 (cd 80) is also 2 bytes. To redirect them to instrumentation, you need to replace those two bytes with something that transfers control elsewhere. But a direct relative jump (e9) takes 5 bytes: one opcode byte and a 32-bit signed displacement. You have only two bytes to work with, so every patching scheme is a way of papering over that three-byte gap.
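The size mismatch can be made concrete with a minimal sketch. The helper name `encode_jmp_rel32` and both addresses are hypothetical, chosen only for illustration; the encoding itself (opcode e9 followed by a little-endian 32-bit displacement measured from the end of the jump) is the standard one.

```python
import struct

SYSCALL = bytes([0x0F, 0x05])  # the two-byte instruction we would like to replace


def encode_jmp_rel32(site: int, target: int) -> bytes:
    """Encode `jmp rel32` (opcode e9) placed at `site` so that it lands on `target`."""
    # The displacement is measured from the end of the 5-byte jump instruction.
    disp = target - (site + 5)
    if not -2**31 <= disp < 2**31:
        raise ValueError("target is out of rel32 range")
    return b"\xe9" + struct.pack("<i", disp)


# Hypothetical patch site and handler address.
jump = encode_jmp_rel32(0x401000, 0x7F0000)
print(jump.hex(" "), "needs", len(jump), "bytes; syscall occupies", len(SYSCALL))
```

The jump is three bytes longer than the instruction it would replace, so writing it in full would clobber whatever follows the syscall.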

Kell's libsystrap library currently replaces each syscall with ud2 (0f 0b, also 2 bytes), which raises SIGILL, and then performs the real system call from the signal handler. That is two traps per call, and it is slow.

Instruction punning, from the Liteinst paper, exploits the fact that the instruction bytes following the syscall can do double duty. If the code reads 0f 05 xx yy zz, you can overwrite it with e9 WW xx yy zz: the jump opcode replaces 0f, WW replaces 05, and xx yy zz, the first three bytes of whatever follows the syscall, are left untouched. Because x86 is little-endian, WW becomes the least significant byte of the 32-bit jump displacement while xx yy zz supply its upper three bytes, so you can choose among only 256 consecutive target addresses. The technique is therefore statistical, and you need a fallback for sites where it fails, such as when the high-order byte is zero or very small. E9Patch extends the trick with compound punning to increase coverage.
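The 256-address window can be computed directly. The sketch below is illustrative rather than taken from Liteinst or E9Patch: the function names, the patch-site address, the sample byte sequences, and the simple "fits in the lower 47 bits" usability test are all assumptions.

```python
import struct

USER_TOP = 1 << 47  # assumed upper bound of user-space addresses


def punned_window(site: int, next3: bytes):
    """Return (lowest, highest) jump target reachable by punning at `site`.

    `next3` holds the three untouched bytes that follow the two-byte syscall.
    """
    if len(next3) != 3:
        raise ValueError("expected exactly three following bytes")
    targets = []
    for ww in range(256):
        # e9 WW xx yy zz: WW is the only displacement byte we get to choose.
        (disp,) = struct.unpack("<i", bytes([ww]) + next3)
        targets.append(site + 5 + disp)
    return min(targets), max(targets)


def window_is_usable(window) -> bool:
    """Crude check: the whole window must lie in the assumed user-space range."""
    lo, hi = window
    return 0 <= lo and hi < USER_TOP


# Following bytes 48 8b 05 give a positive displacement: a usable window.
good = punned_window(0x401000, bytes.fromhex("488b05"))
# Following bytes 48 89 c7 give a negative displacement that goes below zero.
bad = punned_window(0x401000, bytes.fromhex("4889c7"))
print(hex(good[0]), hex(good[1]), window_is_usable(good), window_is_usable(bad))
```

In the second case the pinned bytes make the displacement negative and larger than the site address, so no choice of WW yields a mappable target, which is the kind of site that needs a fallback.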

The Known Escapes (and Their Costs)

zpoline takes a cleaner but riskier route: replace syscall with ff d0 (call *%rax). Since %rax holds the syscall number (a small nonnegative integer, a few hundred at most), the call lands at a very low address equal to that number. You must therefore map executable code at page zero, a nop sled that slides into the handler, which breaks null-pointer protection. The paper mitigates this with Intel MPK (memory protection keys) to make that memory execute-only, and with bitmap validation of return addresses. But MPK isn't universal, validation adds latency, and mapping low memory requires privileges. Buggy code that puts a high value in %rax will jump to a random address instead of getting a clean ENOSYS.
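A toy model shows why a single trampoline at address zero can serve every syscall number. This is a simulation only, not zpoline's code: the sled length of 512, the marker byte standing in for the handler, and the function name `land` are assumptions made for illustration.

```python
NOP = 0x90
SLED_LEN = 512       # assumed: long enough to cover every valid syscall number
HANDLER_MARK = 0xCC  # placeholder byte standing in for the handler's first instruction

# Simulated low memory: a nop sled starting at address 0, then the handler.
low_memory = bytearray([NOP] * SLED_LEN) + bytearray([HANDLER_MARK])


def land(rax: int):
    """Simulate `call *%rax`: start at address rax and slide along the nops.

    Returns the address where the handler is reached, or None when rax
    points outside the mapped trampoline (a wild jump in the real scheme).
    """
    if not 0 <= rax < len(low_memory):
        return None
    pc = rax
    while low_memory[pc] == NOP:
        pc += 1
    return pc


print(land(0), land(1), land(59), land(10**6))
```

Every in-range syscall number reaches the same handler address, while an out-of-range value in %rax falls outside the trampoline entirely.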

Neither approach is free. Punning wastes virtual address space (one page per patch site, though E9Patch shares physical pages). zpoline weakens security and increases complexity. Kell wondered: could segmentation help?

Why Segmentation Failed

x86-64 long mode treats memory as flat, but the segmentation machinery is still present, including the Global and Local Descriptor Tables (GDT, LDT). Linux exposes modify_ldt() for per-process LDT entries. The idea: find a two-byte far call instruction that indirects through an LDT selector to reach instrumentation. Kell tried ff 18 (lcall *(%rax)), hoping the segment selector in %rax would route via the LDT. That's not how lcall works. The instruction treats %rax as an address and loads a far pointer (selector:offset) from that memory location; it does not take a raw selector from a register. Even if it did, carrying a 16-bit selector in the 64-bit %rax, which already holds the syscall number, would be awkward. The LDT approach collapses: user code cannot modify the GDT, and the call gate mechanism requires kernel privilege or specific descriptors. No two-byte shortcut exists via segmentation.
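To see what lcall actually reads, it helps to lay out a selector and a far pointer by hand. This is a minimal sketch assuming the default 32-bit operand size (an m16:32 far pointer: a 4-byte offset followed by a 2-byte selector); the helper names and the example values are hypothetical.

```python
import struct


def make_selector(index: int, use_ldt: bool = True, rpl: int = 3) -> int:
    """Build a 16-bit segment selector: index in bits 3-15, TI in bit 2, RPL in bits 0-1."""
    if not 0 <= index < 8192:
        raise ValueError("descriptor index must fit in 13 bits")
    if not 0 <= rpl <= 3:
        raise ValueError("RPL must be in the range 0..3")
    return (index << 3) | (int(use_ldt) << 2) | rpl


def far_pointer_m16_32(offset: int, selector: int) -> bytes:
    """Memory image read by a far call: little-endian 32-bit offset, then the selector."""
    return struct.pack("<IH", offset, selector)


selector = make_selector(1)                     # LDT entry 1, requested privilege level 3
operand = far_pointer_m16_32(0x1000, selector)  # what would have to sit at the address in %rax
print(hex(selector), operand.hex(" "))
```

The six bytes must live in memory at the address held in %rax, which is why a bare selector in the register gets the instruction nowhere.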

Kell admits the attempt was naive, but the journey uncovered why every current method has unavoidable drawbacks. The 256-byte window from instruction punning remains the most direct path, but it's a tight fit. zpoline's zero-page mapping is clever but fragile.

For now, the field remains stuck choosing among virtual address space hunger, trap overhead, and weakened null-pointer protection. Until someone finds a two-byte instruction on x86-64 that can transfer control far enough, syscall instrumentation will stay a game of compromises.


Source: System call instrumentation on Linux/x86-64 using memory-indirect calls (in vain?), part one
Domain: humprog.org
