The AI Breakdown

What I Learned Testing GPT-5.5

36 min episode · 2 min read


AI-Generated Summary

Key Takeaways

  • Benchmark interpretation: Comparing models by cost-per-token alone misleads buyers. GPT-5.5 costs $5 input and $30 output per million tokens — double GPT-5.4 — but dominates the cost-performance frontier on Artificial Analysis when measured by intelligence-per-dollar. For agent workflows like Codex, efficiency per completed task matters far more than raw token pricing (a worked cost comparison follows this list).
  • Coding agent durability: GPT-5.5 sustains long-running autonomous coding tasks in ways previous models could not. Independent testers report uninterrupted runs of 7–31 hours, compared to prior 30-minute ceilings. For developers using Codex, this enables queuing multi-step migrations or RL runs overnight without manual intervention, fundamentally changing what autonomous coding workflows can accomplish.
  • Multi-model task routing: Practitioners find the optimal setup is Opus 4.7 at extra-high thinking effort for planning, then GPT-5.5 at high reasoning effort for execution. This split outperforms any single-model configuration. For teams building agent pipelines, explicitly separating the planning and execution phases across models — rather than defaulting to one — produces measurably better outputs on complex, multi-step tasks (a routing sketch follows this list).
  • SWE-Bench Pro as a misleading signal: GPT-5.5 underperforms Opus 4.7 on SWE-Bench Pro, but CodeRabbit's independent code review evaluation shows GPT-5.5 finding 79.2% of expected issues versus a 58.3% baseline. Practitioners should weigh real-world task evaluations more heavily than SWE-Bench Pro scores, a benchmark that OpenAI's own February research argues no longer measures frontier coding capability accurately.
  • Codex monothread workflow: Users are experimenting with a single continuously updated Codex thread — leveraging OpenAI's improved context compaction — instead of splitting work across multiple project conversations. Starting with a structured model interview to build background context and then routing all strategic and iterative questions through one thread preserves continuity and reduces context-switching overhead across long-running projects.
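
To make the cost-per-token takeaway concrete, here is a minimal sketch of comparing models by cost per completed task rather than cost per token. The GPT-5.5 prices come from the episode; the GPT-5.4 prices are inferred from the "double GPT-5.4" remark, and the token counts and success rates are hypothetical illustration values, not measurements.

```python
# Sketch: compare cost per completed task rather than cost per token.
# GPT-5.5 prices ($5/M input, $30/M output) are from the episode; GPT-5.4
# prices are inferred as half. Token counts and success rates below are
# hypothetical illustration values, not measurements.

def attempt_cost(input_price, output_price, input_tokens, output_tokens):
    """Dollar cost of a single attempt, given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def cost_per_completed_task(cost, success_rate):
    """Expected spend per successful completion, assuming failed runs are retried."""
    return cost / success_rate

# Hypothetical agent task: 200k input tokens and 40k output tokens per attempt.
gpt55 = attempt_cost(5.00, 30.00, 200_000, 40_000)   # $2.20 per attempt
gpt54 = attempt_cost(2.50, 15.00, 200_000, 40_000)   # $1.10 per attempt

# Hypothetical success rates: the pricier model finishes more tasks on the first try.
print(f"GPT-5.5: ${cost_per_completed_task(gpt55, 0.95):.2f} per completed task")
print(f"GPT-5.4: ${cost_per_completed_task(gpt54, 0.45):.2f} per completed task")
```

Under these made-up assumptions, the model that costs twice as much per token ends up cheaper per completed task (about $2.32 versus $2.44), which is the intelligence-per-dollar framing the episode uses.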
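
The planning/execution split can likewise be sketched as a simple two-stage pipeline. This is a hypothetical illustration: call_model is a stand-in for whatever SDK or gateway you actually use, and the model names and effort settings simply echo the configuration described in the episode, not confirmed API values.

```python
# Sketch of a two-stage plan/execute pipeline split across two models.
# call_model is a hypothetical stand-in for your real model client or gateway;
# model names and effort levels echo the episode's discussion, not real API values.

from dataclasses import dataclass

@dataclass
class Stage:
    model: str    # which model handles this stage
    effort: str   # thinking/reasoning effort to request
    prompt: str   # instruction template for the stage

def call_model(model: str, effort: str, prompt: str) -> str:
    """Placeholder: route the request to whichever provider hosts `model`."""
    raise NotImplementedError("wire this up to your model client or gateway")

PLANNER = Stage(
    model="opus-4.7",
    effort="extra-high",
    prompt="Write a numbered implementation plan for this task:\n{task}",
)
EXECUTOR = Stage(
    model="gpt-5.5",
    effort="high",
    prompt="Carry out this plan step by step, producing code where needed:\n{plan}",
)

def run_pipeline(task: str) -> str:
    # Stage 1: the planning model drafts the approach at maximum thinking effort.
    plan = call_model(PLANNER.model, PLANNER.effort, PLANNER.prompt.format(task=task))
    # Stage 2: the execution model carries the plan out.
    return call_model(EXECUTOR.model, EXECUTOR.effort, EXECUTOR.prompt.format(plan=plan))
```

Keeping the two stages as separate, explicit calls is what lets each one be assigned to the model that handles it best, rather than defaulting every step to a single model.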

What It Covers

OpenAI releases GPT-5.5, scoring 82.7% on Terminal Bench 2.0 versus Opus 4.7's 69.4% and reclaiming the top position on Artificial Analysis benchmarks by a three-point margin. The episode covers benchmark comparisons, coding performance, knowledge work capabilities, and what the release signals about OpenAI's competitive positioning against Anthropic.

Notable Moment

A researcher described setting GPT-5.5 a large-scale reinforcement learning task before a holiday weekend, expecting it to stall. Returning days later, the model had run autonomously for 31 hours straight — something no prior model had sustained — completing an industrial-scale run without interruption.
