MIRAGE exploits safety moderation to block AI image editing

MIRAGE flips the script on AI image protection: instead of trying to disrupt the generative model, it poisons the safety filter that runs before any edit is made.

How MIRAGE Turns Safety Against Itself

Prior immunization methods required access to model weights and the exact editing prompt, a non-starter for closed-source editors like GPT-Image, Gemini Flash Image (Nano Banana), and Grok Imagine. MIRAGE takes a system-level view and exploits a shared vulnerability: pre-generation safety moderation. It adds adversarial perturbations that align an image with policy-violating concepts in the representation space of an ensemble of open-source embedding and moderation models, so the image triggers an automatic refusal regardless of the text prompt. No model internals needed. No prompt guessing.
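
The alignment step described above can be sketched as projected signed-gradient ascent on cosine similarity. This is a toy illustration under stated assumptions, not the authors' implementation: the encoders here are random linear maps standing in for the open-source embedding and moderation models, the target vectors stand in for embeddings of policy-violating concepts, and every name and default (`immunize`, `make_toy_setup`, `eps`, `step`, `iters`) is hypothetical.

```python
import math
import random


def embed(W, x):
    """Toy linear encoder z = W x (stand-in for a real embedding model)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)


def cosine_grad(W, x, t):
    """Gradient of cos(W x, t) with respect to the input x."""
    z = embed(W, x)
    nz = math.sqrt(sum(v * v for v in z))
    nt = math.sqrt(sum(v * v for v in t))
    dot = sum(p * q for p, q in zip(z, t))
    # d cos / d z = t / (|z||t|) - (z.t) z / (|z|^3 |t|)
    gz = [ti / (nz * nt) - dot * zi / (nz ** 3 * nt) for zi, ti in zip(z, t)]
    # Chain rule through the linear map: W^T gz
    return [sum(W[i][j] * gz[i] for i in range(len(W))) for j in range(len(x))]


def ensemble_score(encoders, targets, x):
    """Mean cosine similarity to the target concept across the ensemble."""
    sims = [cosine(embed(W, x), t) for W, t in zip(encoders, targets)]
    return sum(sims) / len(sims)


def clip01(v):
    return min(1.0, max(0.0, v))


def immunize(image, encoders, targets, eps=0.1, step=0.01, iters=40):
    """Return image + delta with |delta| <= eps per pixel (L-infinity budget)."""
    delta = [0.0] * len(image)
    best_delta = delta[:]
    best_score = ensemble_score(encoders, targets, image)
    for _ in range(iters):
        x = [clip01(p + d) for p, d in zip(image, delta)]
        grad = [0.0] * len(image)
        for W, t in zip(encoders, targets):
            grad = [a + b for a, b in zip(grad, cosine_grad(W, x, t))]
        # Signed ascent step, then projection back onto the eps-ball.
        delta = [max(-eps, min(eps, d + step * math.copysign(1.0, g)))
                 for d, g in zip(delta, grad)]
        x_new = [clip01(p + d) for p, d in zip(image, delta)]
        score = ensemble_score(encoders, targets, x_new)
        if score > best_score:  # keep the best perturbation seen so far
            best_score, best_delta = score, delta[:]
    return [clip01(p + d) for p, d in zip(image, best_delta)]


def make_toy_setup(seed=0, pixels=16, dim=8, n_models=2):
    """Hypothetical data: a flat 'image', random encoders, random targets."""
    rng = random.Random(seed)
    image = [rng.uniform(0.2, 0.8) for _ in range(pixels)]
    encoders = [[[rng.gauss(0, 1) for _ in range(pixels)] for _ in range(dim)]
                for _ in range(n_models)]
    targets = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_models)]
    return image, encoders, targets
```

Calling `immunize(*make_toy_setup())` returns a perturbed copy of the toy image whose ensemble similarity to the targets is higher than the original's, while each pixel stays within `eps` of its starting value. A realistic version would presumably swap the linear maps for actual vision encoders and use automatic differentiation in place of the hand-written gradient.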

Over 88% Success on Closed-Source APIs

The authors evaluated MIRAGE against multiple commercial image editing APIs and report success rates exceeding 88%. That means nearly 9 out of 10 attempts to edit a protected image were blocked before any manipulation occurred. The method is prompt-agnostic: whether someone types “add a mustache” or “make this person look angry”, the moderation classifier sees a policy violation in the image and stops the edit.
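
To make the prompt-agnostic behavior concrete, here is a minimal sketch of a pre-generation moderation gate and of how a block rate could be tallied. Everything in it is hypothetical: `edit_pipeline`, `block_rate`, the scalar `image_score`, and the `threshold` are illustrative stand-ins and do not describe any real editor's API or the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class EditResult:
    refused: bool
    detail: str


def edit_pipeline(image_score, prompt, threshold=0.5):
    """Hypothetical editor: image moderation runs before any generation.

    image_score is an assumed moderation score in [0, 1] for the input image.
    """
    if image_score >= threshold:
        # The prompt is never consulted once the image itself is flagged.
        return EditResult(True, "refused by pre-generation moderation")
    return EditResult(False, f"edited with prompt: {prompt}")


def block_rate(image_scores, prompts, threshold=0.5):
    """Fraction of (image, prompt) edit attempts that were refused."""
    results = [edit_pipeline(s, p, threshold)
               for s in image_scores for p in prompts]
    return sum(r.refused for r in results) / len(results)
```

In this toy model, an image whose moderation score crosses the threshold is refused for every prompt, so the block rate depends only on how many protected images are flagged.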

Why This Changes the Personal Image Protection Game

Current defenses are either ineffective or impractical for real-world use. Watermarks can be cropped or inpainted. Model-specific perturbations require black-box optimization against each target editor. MIRAGE's approach works across multiple closed-source systems with a single, precomputed perturbation. The trade-off: it relies on the moderation classifiers staying static. If those classifiers adapt, for example by being retrained to ignore the adversarial patterns, the protection degrades. For now, though, it is a concrete, deployable way to stop unauthorized editing of personal images at scale.

MIRAGE opens a new line of defense that doesn't require model access, but its reliance on moderation systems means it's a race between adversarial immunizations and adaptive filter classifiers.

Source: MIRAGE: Protecting against Malicious Image Editing via False Moderation
Domain: arxiv.org
