Machine Learning Street Talk

Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)

16 min episode · 2 min read

Topics

Fundraising & VC, Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • TrueSkill Methodology: Prolific runs AI model tournaments using Microsoft's TrueSkill framework (originally built for Xbox Live matchmaking), choosing which model pairs to compare next based on expected information gain, so ranking uncertainty shrinks with far fewer comparisons.
  • Representative Sampling: Unlike Chatbot Arena's anonymous users, Humane stratifies participants by age, ethnicity, and political alignment using US and UK census data, yielding demographically representative preference data.
  • Actionable Metrics: Breaking "preference" into six specific dimensions (helpfulness, communication, adaptiveness, personality, trust, and cultural understanding) gives AI labs concrete feedback on where a model needs improvement, rather than a single overall preference vote.
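The pair-selection idea above can be sketched in a few lines. This is a minimal illustration, not Prolific's implementation: it uses the standard TrueSkill two-player match-quality formula over a simplified Gaussian skill model, and picks the pair whose outcome is hardest to predict (close skills, high uncertainty), since that comparison carries the most information. The model names and parameter values are invented for the example.

```python
import math
from itertools import combinations

class Skill:
    """Gaussian skill estimate: mean `mu` and uncertainty `sigma`
    (TrueSkill's conventional defaults: mu=25, sigma=25/3)."""
    def __init__(self, mu=25.0, sigma=25.0 / 3):
        self.mu = mu
        self.sigma = sigma

def match_quality(a, b, beta=25.0 / 6):
    """TrueSkill 1-vs-1 match quality: near 1 when skills are close
    and uncertain (an informative comparison), near 0 when the
    outcome is already predictable."""
    denom = 2 * beta**2 + a.sigma**2 + b.sigma**2
    return math.sqrt(2 * beta**2 / denom) * math.exp(
        -((a.mu - b.mu) ** 2) / (2 * denom)
    )

def next_pair(models):
    """Pick the pair of models whose head-to-head comparison is
    expected to reduce ranking uncertainty the most."""
    return max(
        combinations(models, 2),
        key=lambda p: match_quality(models[p[0]], models[p[1]]),
    )

# Hypothetical tournament state: two close, uncertain models and one
# clearly weaker, well-measured model.
models = {
    "model-a": Skill(mu=27, sigma=2),
    "model-b": Skill(mu=26, sigma=8),
    "model-c": Skill(mu=15, sigma=2),
}
print(next_pair(models))  # ('model-a', 'model-b')
```

Comparing model-a and model-b is chosen because their skills overlap and model-b's estimate is still noisy; pitting either against the clearly weaker model-c would waste a human judgment on a near-certain outcome.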

What It Covers

Andrew Gordon and Nora Petrova from Prolific explain why current AI benchmarks miss critical user experience factors and introduce their human-centered evaluation methodology called Humane.


Notable Moment

Initial testing with 500 participants revealed that models scored significantly lower on personality and cultural understanding than on helpfulness, suggesting current training may not produce the personalities users actually want.

