Source linked

ASSERT превращает правила простого языка в AI Test Suites

Новая рамка открытого исходного кода Microsoft превращает описания политики на естественном языке в оцененные испытательные случаи, позволяя разработчикам проверить, что их ИИ ведет себя точно так, как планировалось.

microsoftassertresponsible aiai evaluationopen source

ASSERT can turn a sentence like “no external emails” into a full test suite that scores compliance. Microsoft’s open‑source framework, Adaptive Spec‑Driven Scoring for Evaluation and Regression Testing (ASSERT), reads plain‑language policy descriptions, generates structured test cases, runs them against a target model, and returns a numeric score. It also records the internal decision path, so developers can see exactly why a failure occurred.

How ASSERT Turns Text Into Tests

ASSERT begins with a high‑level goal or policy written in natural language. The tool parses the description, identifies acceptable and unacceptable behaviors, and produces a set of problem scenarios. For each scenario, it creates test cases that invoke the model, capture the output, and compare it against the policy. The scoring engine assigns a score based on the proportion of cases that pass. Developers can inject system context, available tools, and constraints to narrow the evaluation space. For example, a document‑research agent can be told to limit confidential data to C‑level executives and to provide concise summaries that reference prior context. ASSERT then generates test cases that check each of those constraints.

Continuous Monitoring and Regression Checks

ASSERT is not limited to a one‑off audit. Sarah Bird, Microsoft’s chief product officer of Responsible AI, says the framework supports evaluation during development, after deployment, and for ongoing monitoring. Because the test suite is generated from text, updating a policy is as simple as editing a sentence. The new tests can be rerun automatically, flagging regressions when a model’s behavior drifts. This capability aligns with industry trends toward repeatable testing, as seen in Stanford’s HELM, MLCommons’ AILuminate, and METR benchmarks.

Why It Matters for Trustworthy AI

General benchmarks measure broad capabilities, but they miss the nuances that matter in a product’s context. ASSERT fills that gap by tying evaluation directly to the rules that govern a specific application. By scoring compliance and exposing the decision path, developers gain actionable insight into why an AI behaves a certain way. The framework also encourages a culture of continuous evaluation, making it easier to meet internal compliance standards and external regulatory expectations.

ASSERT’s release signals a shift toward policy‑driven testing in the AI ecosystem. As models grow more capable, the ability to automatically generate and score tests from plain‑language rules will become a critical tool for building trustworthy systems.


Source: New Microsoft tool lets devs spin up AI behavior tests using text descriptions
Domain: techcrunch.com

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.