The AI Breakdown

GPT 5.4 First Test Results

28 min episode · 2 min read

AI-Generated Summary

Key Takeaways

  • Computer Use Benchmark: GPT-5.4 scores 75% on OSWorld Verified, surpassing human-level performance at 72.4% and representing a 28-percentage-point jump from GPT-5.2's 47.3%. For teams running autonomous desktop agents, this shifts the core question from capability to trust — whether organizations are willing to grant models sufficient system access.
  • Token Efficiency via Tool Search: GPT-5.4 introduces on-demand tool loading rather than front-loading all tool definitions into every prompt. Tested across 250 tasks from Scale's MCP Atlas, this approach cuts total token usage by 47% with no accuracy loss — a direct cost reduction for any team running high-volume agentic workflows with large tool libraries (see the sketch after this list).
  • Professional Task Performance (GDPVal): On the GDPVal benchmark spanning 44 occupations across 9 industries, GPT-5.4 beats or ties human professionals 82-83% of the time, with ties counted in that figure. Ethan Mollick calculates that this translates to saving approximately 4 hours and 38 minutes on a standard 7-hour professional knowledge-work task.
  • Codex CLI Friction Reduction: The updated Codex CLI requires significantly fewer user confirmations than Claude Code and provides real-time interstitial progress updates during long-running tasks rather than operating as a black box. In direct testing, a completed deployment produced zero errors on first run — a reliability outcome the host had not previously experienced with Claude Code.
  • UI Design as a Consistent Weakness: GPT-5.4 performs poorly on front-end visual design, a weakness reported by multiple independent testers. When asked to evaluate the outputs, Claude identified specific failures including muddy gradient backgrounds, absent typographic hierarchy, and dated dark-mode aesthetics. Teams using GPT-5.4 for full-stack builds should route UI and design tasks to alternative models such as Claude Opus.
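
To make the tool-search idea concrete, here is a minimal Python sketch of the general pattern, assuming nothing about OpenAI's actual interface: the `Tool` class, `TOOL_LIBRARY`, and `search_tools` are all hypothetical, and the keyword scoring stands in for whatever retrieval the real feature uses. The point is the contrast between serializing every tool definition into every prompt and exposing one small search tool that pulls definitions in on demand.

```python
# Hypothetical sketch of on-demand tool loading, not OpenAI's API.
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str
    json_schema: dict = field(default_factory=dict)

# Invented stand-in for a large MCP-style tool library; a real
# deployment might hold hundreds of definitions.
TOOL_LIBRARY = [
    Tool("jira_create_issue", "Create a ticket in Jira"),
    Tool("gdrive_search", "Search files in Google Drive"),
    Tool("sql_run_query", "Run a read-only SQL query against the warehouse"),
]

def search_tools(query: str, limit: int = 5) -> list[Tool]:
    """Naive keyword scoring; a placeholder for whatever retrieval
    (embeddings, BM25, ...) the real feature uses."""
    terms = query.lower().split()
    scored = []
    for tool in TOOL_LIBRARY:
        haystack = f"{tool.name} {tool.description}".lower()
        score = sum(term in haystack for term in terms)
        if score:
            scored.append((score, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:limit]]

# Front-loading: every definition enters every prompt, so context
# cost grows with the size of the library.
front_loaded = [{"name": t.name, "schema": t.json_schema} for t in TOOL_LIBRARY]

# On-demand: the model calls the search tool first, and only the
# matching definitions are appended to the next turn's context.
relevant = search_tools("create a jira ticket")
on_demand = [{"name": t.name, "schema": t.json_schema} for t in relevant]
print([t.name for t in relevant])  # 'jira_create_issue' ranks first
```

With a large library, only the handful of definitions the search returns ever occupy context, which is the kind of mechanism that would produce the token reduction the episode reports.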

What It Covers

OpenAI releases GPT-5.4, a professional-focused frontier model combining reasoning, coding, and computer use capabilities. The episode covers benchmark results, community reactions, and a first-hand test building an agent portfolio tool using both ChatGPT 5.4 and the updated Codex CLI.

Key Questions Answered

  • Computer Use Benchmark: GPT-5.4 scores 75% on OSWorld Verified, surpassing human-level performance at 72.4% and representing a 28-percentage-point jump from GPT-5.2's 47.3%. For teams running autonomous desktop agents, this shifts the core question from capability to trust — whether organizations are willing to grant models sufficient system access.
  • Token Efficiency via Tool Search: GPT-5.4 introduces on-demand tool loading rather than front-loading all tool definitions into every prompt. Tested across 250 tasks from Scale's MCP Atlas, this approach cuts total token usage by 47% with no accuracy loss — a direct cost reduction for any team running high-volume agentic workflows with large tool libraries.
  • Professional Task Performance (GDPVal): On the GDPVal benchmark spanning 44 occupations across 9 industries, GPT-5.4 ties or beats human professionals 82-83% of the time when including ties. Ethan Mollick calculates this translates to saving approximately 4 hours and 38 minutes on a standard 7-hour professional knowledge work task.
  • Codex CLI Friction Reduction: The updated Codex CLI requires significantly fewer user confirmations than Claude Code, and provides real-time interstitial progress updates during long-running tasks rather than operating as a black box. In direct testing, a completed deployment produced zero errors on first run — a reliability outcome the host had not previously experienced with Claude Code.
  • UI Design as a Consistent Weakness: GPT-5.4 performs poorly on front-end visual design across multiple independent testers. When evaluating outputs, Claude identified specific failures including muddy gradient backgrounds, absent typographic hierarchy, and dated dark-mode aesthetics. Teams using 5.4 for full-stack builds should route UI and design tasks to alternative models like Claude Opus.

Notable Moment

During hands-on testing, the host repeatedly struggled to move GPT-5.4 from planning into execution mode. After multiple redirects, the model acknowledged it had stayed too long in abstraction — then responded with another multi-paragraph description instead of building the requested clickable prototype.
