LLaMA 65B Scored 63.7 and 48.8 on MMLU - Here's the Fix

LLaMA 65B has been scored at both 63.7 and 48.8 on MMLU, depending on who ran the eval. That roughly 15-point gap isn't a model issue. It's a reporting problem, and Every Eval Ever and Hugging Face Community Evals now address it with a cross-compatible pipeline.

The Fix for Reproducibility

The EvalEval Coalition launched Every Eval Ever (EEE) in February 2026 as a single JSON schema for evaluation results. It captures who ran the eval, which model, how it was accessed, generation settings, and what the metric actually means. No more losing the harness version or the temperature in a footnote. A companion JSONL file stores per-sample outputs. Since launch, the datastore on Hugging Face has grown to ~229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, pulled from 31 different reporting formats. Reproducing those runs from scratch would cost hundreds of thousands of dollars, so keeping the existing results in one structured place is far cheaper than regenerating them.
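To make the idea concrete, here is a minimal Python sketch of a record that carries the context listed above, plus a helper that reports which settings differ between two runs of the same model. Every field name (evaluator, access, generation_config and so on) and every value other than the two scores is a hypothetical placeholder. This is not the actual Every Eval Ever schema, and the placeholder settings are not the real cause of the two published scores.

```python
import json


def make_record(evaluator, model, access, generation_config, benchmark, metric, score):
    """Bundle one score with the context needed to interpret it.

    All keys are illustrative; the real EEE schema may name them differently.
    """
    return {
        "evaluator": evaluator,
        "model": model,
        "access": access,
        "generation_config": generation_config,
        "benchmark": benchmark,
        "metric": metric,
        "score": score,
    }


def differing_settings(a, b):
    """Return the generation-config keys whose values differ between two records."""
    cfg_a, cfg_b = a["generation_config"], b["generation_config"]
    return sorted(k for k in set(cfg_a) | set(cfg_b) if cfg_a.get(k) != cfg_b.get(k))


def to_jsonl(samples):
    """Serialize per-sample outputs as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(s, sort_keys=True) for s in samples)


# Placeholder settings: only the two scores come from the article.
metric = {"name": "accuracy", "unit": "percent"}
run_a = make_record("lab-a", "LLaMA 65B", "local weights",
                    {"harness": "harness-A", "temperature": 0.0}, "MMLU", metric, 63.7)
run_b = make_record("lab-b", "LLaMA 65B", "local weights",
                    {"harness": "harness-B", "temperature": 0.0}, "MMLU", metric, 48.8)

print(differing_settings(run_a, run_b))
print(to_jsonl([{"sample_id": 0, "output": "B", "correct": True}]))
```

With both records stored in the same shape, a disagreement between two scores becomes a diff over recorded settings rather than a guess.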

What the Converter Does

Hugging Face Community Evals, also launched in February 2026, decentralizes score reporting. A benchmark registers itself via an eval.yaml in a dataset repo, then model pages collect scores from .eval_results/*.yaml files. Anyone can submit a PR with a YAML file, and each score gets a badge: author-submitted, community-submitted, or independently verified. The new converter takes an EEE JSON record and writes the YAML that Hugging Face expects. Submit once through your organization's verified Hugging Face account, and your result shows up on the model page with a source badge linking back to the full EEE record: generation config, reproducibility notes, instance-level data, everything.
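The conversion step can be pictured with a short Python sketch. It is not the actual converter. The input keys (benchmark_repo, score), the output YAML layout, and the example.org URL are all assumptions made for illustration, so check the Hugging Face documentation for the real .eval_results format before relying on any of it. Only the .eval_results/*.yaml location comes from the description above.

```python
import json
from pathlib import PurePosixPath


def _quote(text):
    """Quote a string for YAML; a JSON string is a valid YAML double-quoted scalar."""
    return json.dumps(str(text))


def eee_to_eval_results_yaml(record, record_url):
    """Render one score as YAML text, with a source link back to the full record.

    The layout below is hypothetical and only mirrors the idea of
    "score plus a pointer to where it came from".
    """
    lines = [
        "- dataset:",
        f"    id: {_quote(record['benchmark_repo'])}",
        f"  value: {float(record['score'])}",
        "  source:",
        f"    name: {_quote('Every Eval Ever')}",
        f"    url: {_quote(record_url)}",
    ]
    return "\n".join(lines) + "\n"


def eval_results_path(filename_stem):
    """Path inside the model repo where the YAML file would be placed."""
    return PurePosixPath(".eval_results") / f"{filename_stem}.yaml"


# Placeholder record and URL, not real identifiers.
record = {"benchmark_repo": "example-org/example-benchmark", "score": 63.7}
print(eval_results_path("example-benchmark"))
print(eee_to_eval_results_yaml(record, "https://example.org/eee/records/placeholder"))
```

Keeping the YAML small and pointing back to the full record matches the split described here: the model page shows the number, and the EEE record holds the settings behind it.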

Why This Matters Now

The same evaluation now surfaces in two places that do different jobs. Hugging Face puts the score where people look at models. EEE keeps the structured record that makes the score interpretable, powering Eval Cards that compose run data with benchmark and model metadata. Third-party evaluators and first-party model authors both get a single workflow for cross-posting. The result is a visible, legible, and verifiable evaluation that doesn't depend on trusting a screenshot from a paper. Next time someone claims a score, you can click back to the exact settings that produced it, or spot the discrepancy before it propagates.

Source: Featuring Every Eval Ever Results on Hugging Face Model Pages
Domain: huggingface.co
