How AI Killed Scientistry

Building on some of what I learned while writing PROBABILITY ZERO, Claude Athos and I have teamed up to write another paper:

AIQ: Measuring Artificial Intelligence Scientific Discernment

We propose AIQ as a metric for evaluating artificial intelligence systems’ ability to distinguish valid scientific arguments from credentialed nonsense. We tested six AI models using three papers: one with sound methodology and correct mathematics, one with circular reasoning and fabricated data from prestigious institutions, and one parody with obvious tells including fish-pun author names and taxonomic impossibilities. Only one of six models correctly ranked the real work above both fakes. The worst performer exhibited severe anti-calibration, rating fabricated nonsense 9/10 while dismissing sound empirical work as “pseudoscientific” (1/10). Surprisingly, the model that delivered the sharpest critiques of both fake papers was still harsher on the real work, demonstrating that critical thinking ability does not guarantee correct application of scrutiny. We norm the scale so that a random number generator would achieve AIQ ~100; models that reliably invert correct rankings score below this baseline. Our results suggest that most current AI systems evaluate scientific aesthetics rather than scientific validity, with profound implications for AI-assisted peer review, research evaluation, and automated scientific discovery.
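To get a feel for the ~100 baseline, here is a minimal sketch (mine, not from the paper; the paper's actual AIQ formula isn't given in this post). It estimates how often a ranker choosing uniformly at random places the real paper above both fakes among three candidates, which is the chance level any model must beat. The paper names and the `random_ranker_success` helper are hypothetical.

```python
import random

def random_ranker_success(trials=100_000, seed=0):
    """Estimate the chance a random ranking puts the real paper first of three."""
    rng = random.Random(seed)
    papers = ["real", "fake_prestigious", "fake_parody"]
    hits = 0
    for _ in range(trials):
        order = papers[:]
        rng.shuffle(order)
        # Success = real work ranked above both fake papers
        if order[0] == "real":
            hits += 1
    return hits / trials

print(random_ranker_success())  # ≈ 1/3
```

With 3 papers there are 6 equally likely orderings, and the real paper is first in 2 of them, so chance performance is 1/3. That one of six models cleared this bar is consistent with roughly chance-level discernment across the group.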

Read the rest at AI Central. The results are fascinating.

DISCUSS ON SG