AI Summary
→ WHAT IT COVERS
Andrew Gordon and Nora Petrova from Prolific explain why current AI benchmarks miss critical user-experience factors and introduce Humane, their human-centered evaluation methodology.

→ KEY INSIGHTS
- **TrueSkill Methodology:** Prolific adapts Microsoft's TrueSkill framework, originally built for Xbox Live matchmaking, to run AI model tournaments. Model pairs are selected for maximum information gain, so rating uncertainty shrinks with far fewer comparisons (sketched in code below).
- **Representative Sampling:** Unlike Chatbot Arena's anonymous users, Humane stratifies participants by age, ethnicity, and political alignment against US and UK census data, yielding demographically representative preference data (see the quota-sampling sketch below).
- **Actionable Metrics:** Breaking preference into six specific dimensions (helpfulness, communication, adaptiveness, personality, trust, and cultural understanding) gives AI labs concrete feedback on where a model falls short, rather than a single preference vote.

→ NOTABLE MOMENT
Initial testing with 500 participants revealed that models scored significantly lower on personality and cultural understanding than on helpfulness, suggesting current training may not produce the personalities users actually want.

💼 SPONSORS
Prolific

🏷️ AI Benchmarking, Human Evaluation, AI Safety
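The TrueSkill pair-selection idea can be illustrated with the open-source `trueskill` Python package. This is a minimal sketch under assumed settings: the model names, draw probability, and selection loop are illustrative, not Prolific's actual pipeline, and match quality is used here as a common proxy for expected information gain.

```python
from itertools import combinations
import trueskill

env = trueskill.TrueSkill(draw_probability=0.10)  # assumed tie rate
ratings = {m: env.create_rating() for m in ["model-a", "model-b", "model-c"]}

def next_pair(ratings):
    # quality_1vs1 peaks for the most evenly matched pair, i.e. the
    # comparison whose outcome is most uncertain and hence most
    # informative -- a simple stand-in for information-gain selection.
    return max(combinations(ratings, 2),
               key=lambda pair: env.quality_1vs1(ratings[pair[0]], ratings[pair[1]]))

def record_vote(winner, loser):
    # Bayesian update of both models' ratings after a human preference vote.
    ratings[winner], ratings[loser] = env.rate_1vs1(ratings[winner], ratings[loser])

a, b = next_pair(ratings)
record_vote(a, b)  # suppose the participant preferred model `a`
print({m: round(r.mu, 2) for m, r in ratings.items()})
```

Because each vote shrinks the sigma of the two ratings involved, repeatedly picking the highest-quality pair converges on a stable ranking with far fewer comparisons than exhaustive round-robins.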
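The census-stratified sampling can be sketched as quota sampling over one demographic axis. The age bands and proportions below are invented for illustration; Humane's real strata (age, ethnicity, political alignment) and census quotas are not given in the episode.

```python
import random

# Illustrative age-band quotas (fractions of the target sample).
age_quota = {"18-29": 0.21, "30-44": 0.25, "45-64": 0.33, "65+": 0.21}

def stratified_sample(pool, total_n):
    """pool: list of (participant_id, age_band); returns a quota sample."""
    sample = []
    for band, frac in age_quota.items():
        eligible = [pid for pid, b in pool if b == band]
        k = min(len(eligible), round(total_n * frac))
        sample.extend(random.sample(eligible, k))
    return sample

pool = [(i, random.choice(list(age_quota))) for i in range(2000)]
print(len(stratified_sample(pool, 500)))  # ~500, split per the quotas
```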