
The Safety-Usefulness Tradeoff Model Works Only in the Wrong Scenarios

alignmentforum.org · @frontier_wire · 2 hours ago · Artificial Intelligence · 2 comments

A 20% serial slowdown is the textbook price of a safety intervention, but the model assumes that developers optimize efficiently, an assumption that collapses when regulators, staff, or the public hold different beliefs.

alignment forum · ai safety · aigovernance · safety usefulness tradeoff · ai control · regulation

A safety intervention can cost a 20% serial slowdown in AI development—that’s the price of doing business under the rushed-reasonable-developer model. But that model only applies when the developer already agrees with you on what safety means. Step outside that narrow window and the whole tradeoff framework breaks down.

Two Regimes, One Flawed Model

The standard safety-usefulness tradeoff model assumes developers choose safety actions that maximize marginal safety gain per unit cost. The post on the Alignment Forum identifies two regimes where this holds: the “rushed reasonable developer” who shares your values but faces competitive pressure, and the “limited political will” scenario where the developer concedes to a fixed cost threshold. In both cases, the developer implements the most efficient interventions, so safety tech improvements (pushing out the Pareto frontier) and safety budget increases (accepting more usefulness loss) are the obvious levers.
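The efficient-developer assumption can be sketched in a few lines of Python. This is only an illustration, not the post's own formalism: the `Intervention` class, the `choose_efficient` helper, and every number in `MENU` are hypothetical, and greedy ranking by safety gain per unit cost is a simple heuristic standing in for "picks the most efficient interventions".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    name: str
    safety_gain: float      # hypothetical units of risk reduction
    usefulness_cost: float  # hypothetical units of lost usefulness


def choose_efficient(interventions, budget):
    """Greedily pick interventions by safety gain per unit cost within a cost budget."""
    ranked = sorted(
        interventions,
        key=lambda i: i.safety_gain / i.usefulness_cost,
        reverse=True,
    )
    chosen, spent = [], 0.0
    for item in ranked:
        if spent + item.usefulness_cost <= budget:
            chosen.append(item)
            spent += item.usefulness_cost
    return chosen


def total_safety(chosen):
    """Sum the safety gain of the selected interventions."""
    return sum(i.safety_gain for i in chosen)


# Made-up menu; none of these names or numbers come from the post.
MENU = [
    Intervention("option A", safety_gain=6.0, usefulness_cost=2.0),  # ratio 3.0
    Intervention("option B", safety_gain=4.0, usefulness_cost=2.0),  # ratio 2.0
    Intervention("option C", safety_gain=5.0, usefulness_cost=5.0),  # ratio 1.0
]

small = choose_efficient(MENU, budget=4.0)  # a tight safety budget
large = choose_efficient(MENU, budget=9.0)  # a larger safety budget

print([i.name for i in small], total_safety(small))
print([i.name for i in large], total_safety(large))
```

In this toy setup, raising `budget` corresponds to the "safety budget" lever, while raising an option's `safety_gain` at the same cost corresponds to better safety tech pushing out the frontier.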

But the real world is messier. The developer doesn’t always get to choose the interventions. When pressure comes from regulators, poorly-informed staff, or the public with different beliefs, the developer optimizes for their satisfaction—not for safety as you define it. The connection between actual safety value and what gets implemented weakens dramatically. You can’t just assume the developer will pick the best tech; you have to consider political feasibility, legibility, verifiability, and which constituencies happen to like a given ask.
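A toy comparison makes the weakened link concrete. The sketch below assumes a made-up `Ask` record with an `appeal` score (how much the pressuring constituency values an ask) alongside its actual `safety_gain`; all names and numbers are hypothetical and chosen only to illustrate the argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ask:
    name: str
    safety_gain: float  # hypothetical: actual risk reduction
    appeal: float       # hypothetical: value to the pressuring constituency
    cost: float         # hypothetical: usefulness lost


def pick(asks, budget, score):
    """Greedy selection by score per unit cost under a cost budget."""
    chosen, spent = [], 0.0
    for ask in sorted(asks, key=lambda a: score(a) / a.cost, reverse=True):
        if spent + ask.cost <= budget:
            chosen.append(ask)
            spent += ask.cost
    return chosen


# Illustrative asks: one effective but hard to see, two legible but weak.
ASKS = [
    Ask("effective, low-legibility ask", safety_gain=8.0, appeal=1.0, cost=4.0),
    Ask("legible ask 1", safety_gain=1.0, appeal=6.0, cost=2.0),
    Ask("legible ask 2", safety_gain=2.0, appeal=5.0, cost=2.0),
]

# Same budget, different objective.
by_safety = pick(ASKS, budget=4.0, score=lambda a: a.safety_gain)
by_appeal = pick(ASKS, budget=4.0, score=lambda a: a.appeal)

safety_when_efficient = sum(a.safety_gain for a in by_safety)
safety_when_pressured = sum(a.safety_gain for a in by_appeal)
print(safety_when_efficient, safety_when_pressured)
```

Both selections spend the same budget, but the one driven by constituency appeal buys far less safety in this example, which is why the usefulness cost alone says little about what actually gets implemented.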

When 10% Annual Takeover Risk Becomes a Political Fight

The post flags a sobering concrete example: the author expects AI companies to impose AI takeover risk levels of 10% or 20% per year, levels that would seem insane to potential regulators. But if regulation does set a risk threshold, the failure will almost certainly be an inaccurate risk assessment rather than a threshold set too high. The model’s neat risk-assessment constraint, “you aren’t allowed to impose more than X% risk,” is implausible in practice.
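A minimal sketch of why a threshold rule depends entirely on the assessment: the check below only ever sees the estimated risk. The `passes_threshold` function, the 1% cap, and the 0.5% estimate are assumptions made up for illustration; only the 10% per year figure comes from the text above.

```python
def passes_threshold(estimated_risk, threshold):
    """Hypothetical compliance check: the rule only ever sees the estimate."""
    return estimated_risk <= threshold


THRESHOLD = 0.01        # assumed regulatory cap on annual risk (illustrative)
TRUE_RISK = 0.10        # the 10% per year level the author expects
ESTIMATED_RISK = 0.005  # assumed optimistic assessment (illustrative)

compliant_on_paper = passes_threshold(ESTIMATED_RISK, THRESHOLD)
compliant_in_fact = passes_threshold(TRUE_RISK, THRESHOLD)

print(compliant_on_paper, compliant_in_fact)  # True False
```

Under these assumed numbers the developer satisfies a strict-looking rule while the true risk is ten times the cap, so tightening the threshold does nothing unless the assessment itself improves.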

This has direct implications for safety research. Techniques that are robust to being implemented by a half-hearted company—like AI control, which can be more reliably externally evaluated than alignment—become more valuable. The post also notes that political feasibility depends less on usefulness cost and more on factors like how easy the ask is to verify or what constituencies support it. Pushing for “the best safety per cost” intervention is wasted effort if the intervention is politically invisible.

What This Means for Strategy

The author, who previously leaned on the tradeoff model, now reports being “less into this theory of change” and more interested in pushing companies to make bigger tradeoffs rather than efficient ones. The key takeaway: when you’re developing safety techniques, pay attention to how inconvenient and expensive they are, because the efficient ones may never get deployed in the settings that matter most.

The future of AI risk mitigation will involve both efficient internal negotiations and messy external pressure. The safety-usefulness tradeoff model is a useful tool for the first kind of situation; relying on it in the second kind leads to wrong priorities. The next step is figuring out which interventions are both politically feasible and actually risk-reducing, a design constraint that most alignment research still neglects.


Source: Efficient tradeoffs and the safety-usefulness tradeoff model
Domain: alignmentforum.org
