
First Proof Math Test: AI Scores 6/10, Hallucinates, and Burns Cash

scientificamerican.com · @science_desk · 2 hours ago · Artificial Intelligence

The top entry, IMProofBench's multi-model ensemble, solved 6 to 7 of 10 research-level problems, but in some cases burned nearly $1,000 in API fees on a single wrong answer.

first proof · openai · chatgpt 5 5 pro · improofbench · eth zurich · large language models

The best AI solved six out of ten research-level math problems correctly; the rest was garbage that heroic human graders had to filter out. That's the verdict from the second batch of First Proof problems, a project organized by mathematicians at Harvard, Stanford, and other institutions to evaluate whether LLMs are actually useful for professional math research.

Six Out of Ten, With a Catch

The test pitted OpenAI's ChatGPT-5.5 Pro (4–5 correct) against three academic models. IMProofBench, built by scientists at ETH Zurich and Aarhus University, took top honors with 6 or 7 out of 10. Graders applied the same standard math journals use: "accept with minor revisions." Some answers sat on the fuzzy edge of that threshold, hence the toss-up in the final score. But the models also churned out "copious amounts of garbage," as the First Proof team put it, requiring two days of intensive peer review at Harvard's Center of Mathematical Sciences and Applications.

The Council of Models and Its Price Tag

ChatGPT-5.5 Pro isn't a single model; it's a unified framework combining several LLMs. When the base model gets lazy or evasive, other LLMs automatically check its work, provide feedback, and force it to persist. IMProofBench takes this further: when stuck, its core ChatGPT model can consult a "council" of Claude (Anthropic) and Gemini (Google). This Frankenstein approach got the best score, but at a cost. Mohammed Abouzaid, a mathematician at Stanford and First Proof team member, reports that in some cases the overlapping LLMs racked up nearly $1,000 in query charges, only to get the wrong answer. He worries about a future where grant proposals include line items for purchasing tokens from tech companies.

Citations Missing, Norms Violated

Every model flagrantly violated academic norms. "There were a lot of missing citations," said Lauren Williams, Harvard mathematician and First Proof team member. "If it was a human, one might call it plagiarism." The team hopes the math community will pressure AI companies to align their products with scientific ethics. The models are undeniably useful at digging up obscure references and grinding through tedious calculations — in one case, an AI executed a strategy the problem's authors had identified but found too boring to pursue — but the cost, both in dollars and ethical standards, is far from settled.


Source: AI scores a 'C-' on its hardest math test yet
Domain: scientificamerican.com
