
MIRAGE exploits safety moderation to block AI image editing

By targeting pre-generation moderation classifiers instead of the generative model itself, MIRAGE achieves more than 88% success in preventing unauthorized edits of personal images on GPT-Image, Gemini Flash Image and...

Tags: mirage, openai, google deepmind, grok, image editing, adversarial attacks

MIRAGE flips the script on AI image protection: instead of trying to disrupt the generative model, it poisons the safety filter that runs before any edit is made.

How MIRAGE Turns Safety Against Itself

Prior immunization methods required access to model weights and the exact editing prompt - a non-starter for closed-source editors like GPT-Image, Gemini Flash Image (Nano Banana), and Grok Imagine. MIRAGE takes a system-level view and exploits a shared vulnerability: pre-generation safety moderation. It adds adversarial perturbations that align an image with policy-violating concepts in the representation space of an ensemble of open-source embedding and moderation models, so the protected image triggers an automatic refusal regardless of the text prompt. No model internals needed. No prompt guessing.
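To make the idea concrete, here is a minimal sketch of an ensemble alignment attack under an L-infinity budget. It is not the authors' implementation: random linear maps stand in for the open-source embedding and moderation models, random vectors stand in for embeddings of policy-violating concepts, and the names `immunize`, `eps`, `step`, and `iters` are hypothetical. The loss (mean cosine similarity) and the optimizer (signed-gradient ascent) are assumptions chosen for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad_wrt_input(W, x, target):
    """Gradient of cos(W @ x, target) with respect to x for a linear encoder W."""
    f = W @ x
    nf, nt = np.linalg.norm(f), np.linalg.norm(target)
    grad_f = target / (nf * nt) - (f @ target) * f / (nf**3 * nt)
    return W.T @ grad_f

def ensemble_score(image, encoders, targets):
    """Mean cosine similarity between the image embeddings and the target concepts."""
    x = image.ravel()
    return float(np.mean([cosine(W @ x, t) for W, t in zip(encoders, targets)]))

def immunize(image, encoders, targets, eps=8 / 255, step=1 / 255, iters=100):
    """Signed-gradient ascent on the ensemble score, kept within an L-infinity budget."""
    x0 = image.ravel()
    x = x0.copy()
    for _ in range(iters):
        grad = sum(cosine_grad_wrt_input(W, x, t)
                   for W, t in zip(encoders, targets)) / len(encoders)
        x = x + step * np.sign(grad)
        x = np.clip(x, x0 - eps, x0 + eps)  # stay within the perturbation budget
        x = np.clip(x, 0.0, 1.0)            # stay within the valid pixel range
    return x.reshape(image.shape)

# Toy stand-ins (assumptions): a small random "image", three random linear
# "encoders", and one random "policy-violating concept" vector per encoder.
rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(8, 8, 3))
encoders = [rng.normal(size=(16, image.size)) for _ in range(3)]
targets = [rng.normal(size=16) for _ in range(3)]
eps = 8 / 255

protected = immunize(image, encoders, targets, eps=eps)
before = ensemble_score(image, encoders, targets)
after = ensemble_score(protected, encoders, targets)
print(f"ensemble alignment before: {before:.3f}, after: {after:.3f}")
```

Because the objective depends only on the image, the text prompt never enters the optimization, which is what makes this kind of perturbation prompt-agnostic. A real attack would swap the linear maps for actual encoders and obtain gradients by automatic differentiation.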

88% Success on Closed-Source APIs

The authors evaluated MIRAGE against multiple commercial image editing APIs and report success rates exceeding 88%. That means nearly 9 out of 10 attempts to edit a protected image were blocked before any manipulation occurred. The method is prompt-agnostic - it doesn't matter if someone types “add a mustache” or “make this person look angry”; the moderation classifier sees a policy violation and stops the edit.
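The summary does not spell out the evaluation protocol, so the following is only one plausible way to compute such a success rate: the fraction of (image, prompt) edit requests that are refused. The `block_rate` helper, the stub moderator, its 0.5 threshold, and the scores are all hypothetical and are not the paper's data.

```python
from typing import Callable, Sequence

def block_rate(images: Sequence[float],
               prompts: Sequence[str],
               is_refused: Callable[[float, str], bool]) -> float:
    """Fraction of (image, prompt) edit requests that were refused."""
    attempts = [(img, p) for img in images for p in prompts]
    refused = sum(1 for img, p in attempts if is_refused(img, p))
    return refused / len(attempts)

def stub_is_refused(image_score: float, prompt: str) -> bool:
    """Hypothetical moderator: decides from an image-side score only.

    The prompt is deliberately ignored, mirroring the prompt-agnostic
    behaviour described above.
    """
    return image_score >= 0.5

# Made-up moderation scores for four protected images (illustration only).
image_scores = [0.9, 0.8, 0.7, 0.2]
prompts = ["add a mustache", "make this person look angry"]

rate = block_rate(image_scores, prompts, stub_is_refused)
print(f"block rate: {rate:.0%}")
```

Since the stub never reads the prompt, each image is either refused for every prompt or for none, so the rate depends only on how many images cross the threshold.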

Why This Changes the Personal Image Protection Game

Current defenses are either ineffective or impractical for real-world use. Watermarks can be cropped or inpainted. Model-specific perturbations require black-box optimization against each target editor. MIRAGE's approach works across multiple closed-source systems with a single, precomputed perturbation. The trade-off: it relies on the moderation classifiers staying static. If those classifiers adapt, for example by being retrained to ignore the adversarial patterns, the protection degrades. But for now, it's a concrete, deployable way to stop unauthorized editing of personal images at scale.

MIRAGE opens a new line of defense that doesn't require model access, but its reliance on moderation systems makes it a race between adversarial immunization and adaptive moderation classifiers.


Source: MIRAGE: Protecting against Malicious Image Editing via False Moderation
Domain: arxiv.org
