Source linked

Prevención de inyección de prompt perfecto demostrado imposible en modelos de embalaje compartido

Una nueva prueba formal demuestra que en los LLM con arquitecturas de embalaje compartido, ninguna defensa en la tubería puede lograr el control semántico fiel, haciendo que la inyección rápida sea inevitable sin separación arquitectónica.

prompt injectionshared embedding architecturesformal proofllm securitymathematical impossibilityarchitectural separation

Prompt injection isn't a failure of better filters—it's a mathematical certainty. A new preprint posted to arXiv formalizes the problem and proves that perfect prevention is impossible in any shared-embedding sequence model.

The paper defines Prompted Action Models, where outputs include control-authoritative actions like refusal decisions, tool authorization, and memory writes. The goal is Semantic-Faithful Control (SFC): behavior that depends only on the meaning of untrusted input, not its encoding. Three independent impossibility results show SFC is unachievable.

Three Hard Limits on In-Pipeline Defenses

First, provenance-recovery impossibility: shared representations make trusted and untrusted content statistically inseparable, bounded by total variation distance. Second, control-path exposure: untrusted tokens enter control-relevant computation through the same attention value-aggregation that determines outputs—there's no separate channel. Third, a finite-coverage invariance gap: no finite training can certify invariance over infinite semantic-equivalence classes. Each result is grounded in measurements on production tokenizers and models.

These aren't engineering gaps or alignment failures. They're structural. The paper draws a direct parallel to the code-data confusion in Von Neumann machines that gave rise to buffer overflows—a vulnerability class no single mechanism ever fixed. It took decades of layered defenses (DEP, W^X, ASLR, stack canaries, memory-safe languages) to contain, never eliminate, that structural flaw.

The implication is the same: prompt injection cannot be eliminated by better in-pipeline classification or alignment alone. It requires architectural separation of instruction and data channels. The proof doesn't prescribe the new architecture—it says everything else is mathematically doomed to fail.


Source: On the Inseparability of Instructions and Data in Shared-Embedding Sequence Models
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.