AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful.

A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found that only 16 percent of 445 LLM benchmarks for natural language processing and machine learning use rigorous scientific methods to compare model performance.

What's more, about half the benchmarks claim to measure abstract ideas like reasoning or harmlessness without offering a clear definition of those terms or how to measure them.

In a statement, Andrew Bean, lead author of the study said, "Benchmarks underpin nearly all claims about advances in AI. But without sh

See Full Page