PoQ-Judge Cuts Decentralized LLM Evaluation Cost 72.7% With a 0.747-Pearson Judge

Reference-free quality evaluation for decentralized LLM inference now has a cost-effective option: PoQ-Judge's best model reaches a Pearson correlation of 0.747 with ground-truth proxies on held-out tests, outperforming the reference-based evaluators from prior work.

Three Architectures, One Clear Winner

The PoQ-Judge framework trains dedicated judge models to score query-output pairs without ground-truth references. That property is essential for Proof-of-Quality (PoQ) in decentralized inference networks, where you can't assume a trusted answer exists. The authors tested three architectures spanning the quality-cost tradeoff: a lightweight TextCNN, a MiniLM cross-encoder, and a heavier DeBERTa judge. Two-stage training, first on UltraFeedback and then on GPT-labeled in-domain data, pushed DeBERTa to the top score of 0.747 Pearson correlation. Used as the reference-free component in a composite scoring pipeline, it reached 0.645 Pearson, matching the best single reference-based evaluator while eliminating the need for a reference answer entirely.
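For readers unfamiliar with the headline metric, here is a minimal sketch of how a Pearson correlation between judge scores and proxy scores could be computed. The function name `pearson` and the sample score lists are illustrative assumptions, not taken from the paper, and the paper's actual evaluation code may differ.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four query-output pairs (made-up numbers).
judge_scores = [0.2, 0.5, 0.9, 0.4]   # what a judge model predicted
proxy_scores = [0.1, 0.6, 0.8, 0.5]   # the ground-truth proxy for each pair
print(round(pearson(judge_scores, proxy_scores), 3))
```

A value near 1 means the judge ranks and spaces outputs much like the proxy does; a value near 0 means its scores carry little linear information about the proxy.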

Cascade Evaluation: 72.7% Cost Reduction

More interesting than raw accuracy is the cost story. PoQ-Judge introduces online calibration, which identifies semantic quality as the dominant evaluation dimension. With cascade evaluation, which routes each check through smaller models first and escalates to the big judge only when needed, the system reduces inference cost by 72.7% with only modest quality degradation. That's the kind of number that makes decentralized inference economically viable for high-throughput applications.
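The cascade idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the escalation rule (escalate when the score falls inside an uncertain band), the `low`/`high` thresholds, the relative costs, and the toy judge functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A judge maps (query, output) to a quality score in [0, 1].
Judge = Callable[[str, str], float]


@dataclass
class Stage:
    name: str
    judge: Judge
    cost: float  # hypothetical relative cost of one call


def cascade_score(
    query: str,
    output: str,
    stages: List[Stage],
    low: float = 0.3,
    high: float = 0.7,
) -> Tuple[float, float, str]:
    """Run stages cheapest first and stop at the first confident score.

    A score outside the band [low, high] is treated as confident (assumed
    rule). Returns (score, total cost spent, name of the deciding stage).
    """
    total_cost = 0.0
    for i, stage in enumerate(stages):
        score = stage.judge(query, output)
        total_cost += stage.cost
        is_last = i == len(stages) - 1
        if is_last or score < low or score > high:
            return score, total_cost, stage.name
    raise ValueError("stages must not be empty")


# Toy stand-ins for a small and a large judge model (not real models).
def small_judge(query: str, output: str) -> float:
    if not output.strip():
        return 0.0  # empty output: confidently bad
    return 0.5 if len(output.split()) < 5 else 0.9


def large_judge(query: str, output: str) -> float:
    return 0.8


stages = [Stage("small", small_judge, 1.0), Stage("large", large_judge, 10.0)]
print(cascade_score("What is PoQ?", "", stages))
print(cascade_score("What is PoQ?", "Proof of Quality", stages))
```

Only the second call pays for the large judge, because the small judge's score lands inside the uncertain band. The savings in practice depend on how often that happens, so the 72.7% figure is a property of the paper's setup, not of this sketch.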

Open Problems: Summarization Still Lags

The framework isn't a universal fix. Results on QA tasks are much stronger than on summarization, where the proxy quality metrics themselves become the bottleneck. The authors point to this as the main remaining limitation: the judge can only be as good as the proxy it learns from. PoQ-Judge gives the decentralized LLM stack a real-time, cost-aware quality gate, but closing the summarization gap will require better ground-truth proxies, not just smarter judges.

Source: PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference
Domain: arxiv.org
