Hidden Forces

Investing on the Front Lines of the AI Arms Race | Nathan Benaich

53 min episode · 2 min read

Topics

Investing, Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Inference-Time Scaling: Models now spend more compute during the answer phase rather than during training, using chain-of-thought reasoning to explore multiple solution paths before responding. This approach yields better performance on math, coding, and scientific tasks without requiring larger models (a minimal sketch follows this list).
  • Prompt Engineering Impact: Users who provide detailed context, persona descriptions, and scaffolding get significantly better responses because those cues help the model navigate its high-dimensional answer space. Poor prompting, not just model limitations, accounts for much of the variability in responses, giving informed users a measurable advantage.
  • Model Regression Trade-offs: GPT-4o outperforms GPT-5 at writing tasks because foundation models are tuned against hundreds of competing optimization signals across domains. Each update pulls the model in different directions, so capability regressions in some areas are inevitable even as others improve, making uniformly consistent performance impossible.
  • DeepSeek Cost Narrative: The reported five-million-dollar training cost for DeepSeek R1 covered only the final training run, excluding research and development, data annotation, infrastructure, and prior experimental training runs. It is like quoting only the cost of a Formula One qualifying lap while ignoring the expenses of the entire race weekend.
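
To make the inference-time scaling idea concrete, here is a minimal Python sketch of one common way to spend extra compute at answer time: sample several independent chain-of-thought completions and majority-vote their final answers (self-consistency). The `generate` callable and the `toy_generate` stub are hypothetical placeholders for any model client, not the specific systems discussed in the episode.

```python
import random
from collections import Counter
from typing import Callable


def solve_with_inference_scaling(
    question: str,
    generate: Callable[[str], str],  # any LLM client: prompt in, completion out
    n_paths: int = 8,
) -> str:
    """Sample several chain-of-thought paths, then majority-vote the final answers.

    More samples means more inference-time compute and (usually) better accuracy,
    with no change to the underlying model's size.
    """
    prompt = (
        f"{question}\n"
        "Think step by step, then give the final answer on a line "
        "starting with 'Answer:'."
    )
    answers = []
    for _ in range(n_paths):
        completion = generate(prompt)
        # Keep only the final answer; the reasoning trace is intermediate scratch work.
        for line in completion.splitlines():
            if line.startswith("Answer:"):
                answers.append(line[len("Answer:"):].strip())
                break
    # Aggregate the independent reasoning paths by majority vote.
    return Counter(answers).most_common(1)[0][0] if answers else ""


# Toy stand-in for a real model, just to make the sketch runnable end to end.
def toy_generate(prompt: str) -> str:
    return "Step 1: ...\nStep 2: ...\nAnswer: " + random.choice(["42", "42", "41"])


if __name__ == "__main__":
    print(solve_with_inference_scaling("What is 6 * 7?", toy_generate))
```

The aggregation step here is a simple majority vote; production reasoning models instead fold the search into one long reasoning trace, but the trade-off is the same: more tokens spent at answer time rather than more parameters in the model.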

What It Covers

Nathan Benaich, founder of Air Street Capital and creator of the annual State of AI Report, examines breakthrough developments in artificial intelligence, including DeepSeek's innovations, reasoning models, and the shift from pre-training to inference-time scaling.

Notable Moment

Benaich recalls that telling ChatGPT to think step by step two years ago improved performance because it decomposed complex tasks into smaller hops, letting the system debug its own reasoning. That observation directly led developers to train models on explicit reasoning traces from domain experts. A small prompt-scaffolding illustration follows.
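
As a rough illustration of the prompting points above, the snippet below contrasts a bare request with one that adds a persona, context, and step-by-step scaffolding. The wording, persona, and numbers are invented for this example, not quoted from the episode.

```python
# The same task asked bare versus with persona, context, and step-by-step scaffolding.
task = "Estimate how much RAM a service needs to cache 10 million 2 KB records."

bare_prompt = task

scaffolded_prompt = "\n".join([
    "You are a capacity-planning engineer reviewing a caching layer.",            # persona
    "Context: records average 2 KB, and we expect roughly 10 million of them.",   # context
    f"Task: {task}",
    "Think step by step:",                                                         # decomposition
    "1. Compute the raw payload size.",
    "2. State an overhead assumption for keys and data-structure bookkeeping.",
    "3. Give a final recommendation with a safety margin.",
])

# Either string could be sent to a model; the scaffolded version narrows the
# answer space and breaks the task into hops that can be checked one by one.
print(scaffolded_prompt)
```

The scaffolded version constrains the model's high-dimensional answer space and makes each reasoning hop inspectable, which is exactly why the step-by-step instruction helped before explicit reasoning traces were trained into the models themselves.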
