GPT-5 Hits 62.7% Accuracy on Production Incidents, Falls Short of 72.7% Expert Baseline

According to a new benchmark from Datadog and Carnegie Mellon, GPT-5 achieved 62.7% accuracy on ARFBench, short of the 72.7% scored by human domain experts. ARFBench is the first AI benchmark built from real production incidents: it draws on 63 incidents and contains 750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points, with no synthetic data.

AI models struggle most on cross-metric reasoning (Tier III questions), where GPT-5 scored just 47.5% F1. A theoretical model-expert oracle that combines AI and human judgment reaches 87.2% accuracy, suggesting that collaboration could outperform either one alone. Datadog’s hybrid model, Toto-1.0-QA-Experimental, topped the leaderboard at 63.9% accuracy and outperformed GPT-5 on anomaly identification.
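To make the oracle figure concrete, here is a minimal sketch of how such a number could be computed. It assumes the oracle counts a question as correct when either the model or the expert answers it correctly; the article does not spell out the exact definition, so treat that as an assumption. The function names and the eight-question toy data are hypothetical and are not taken from ARFBench.

```python
def accuracy(correct):
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def oracle_accuracy(model_correct, expert_correct):
    """Upper-bound accuracy if, per question, we could always pick
    whichever of the two (model or expert) got it right.
    Assumption: this is how a 'model-expert oracle' is scored."""
    if len(model_correct) != len(expert_correct):
        raise ValueError("both lists must cover the same questions")
    either = [m or e for m, e in zip(model_correct, expert_correct)]
    return sum(either) / len(either)


# Hypothetical toy data: True means the answer to that question was correct.
model_correct = [True, True, False, False, True, False, True, True]
expert_correct = [True, False, True, False, True, True, True, False]

print(accuracy(model_correct))                        # 0.625
print(accuracy(expert_correct))                       # 0.625
print(oracle_accuracy(model_correct, expert_correct))  # 0.875
```

In this toy case the model and the expert each score 62.5%, but they miss different questions, so the oracle reaches 87.5%. The same effect, errors that only partly overlap, would explain how an oracle can sit well above both individual scores.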
