Why Vision Language Models Ignore What They See with Munawar Hayat - #758
Episode
57 min
Read time
2 min
Topics
Productivity, Fundraising & VC, Artificial Intelligence
AI-Generated Summary
Key Takeaways
- ✓Vision Token Attention Failure: Vision language models attend poorly to visual tokens despite having images as input. Cross-attention modules injected every fourth transformer block with auxiliary loss maximizing attention on relevant segmentation masks improves visual grounding while reducing compute complexity from m+n squared to n squared plus mn.
- ✓Physics Understanding Gap: Foundation models fail simple physical reasoning tasks like unstacking boxes, generating deformed objects with changed sizes and properties. Prompt expansion describing physical constraints during training helps, but models trained on trillions of text tokens versus billions of image-text pairs struggle with spatial correspondence and physical world simulation.
- ✓Generalized Contrastive Learning: Standard CLIP training uses image-text pairs only, failing at composed queries mixing modalities. Training on all permutations of image, text, and fused embeddings without collecting new triplet data enables cross-modal retrieval and generalizes to video benchmarks despite no video training data, maintaining CLIP's 300-400 million parameter efficiency.
- ✓Multi-Person Generation Solution: Models lose facial identity when generating multiple people and fail accurate person counts beyond three to four subjects. Defining attention masks preventing tokens of one person from attending to another person's tokens reduces identity leakage, enabling inference-only personalization without fine-tuning adapters for each face.
What It Covers
Munawar Hayat from Qualcomm AI Research discusses three NeurIPS papers addressing critical failures in vision language models: why they ignore visual input, physics-based generation limitations, and multi-person image generation challenges with proposed solutions.
Key Questions Answered
- •Vision Token Attention Failure: Vision language models attend poorly to visual tokens despite having images as input. Cross-attention modules injected every fourth transformer block with auxiliary loss maximizing attention on relevant segmentation masks improves visual grounding while reducing compute complexity from m+n squared to n squared plus mn.
- •Physics Understanding Gap: Foundation models fail simple physical reasoning tasks like unstacking boxes, generating deformed objects with changed sizes and properties. Prompt expansion describing physical constraints during training helps, but models trained on trillions of text tokens versus billions of image-text pairs struggle with spatial correspondence and physical world simulation.
- •Generalized Contrastive Learning: Standard CLIP training uses image-text pairs only, failing at composed queries mixing modalities. Training on all permutations of image, text, and fused embeddings without collecting new triplet data enables cross-modal retrieval and generalizes to video benchmarks despite no video training data, maintaining CLIP's 300-400 million parameter efficiency.
- •Multi-Person Generation Solution: Models lose facial identity when generating multiple people and fail accurate person counts beyond three to four subjects. Defining attention masks preventing tokens of one person from attending to another person's tokens reduces identity leakage, enabling inference-only personalization without fine-tuning adapters for each face.
Notable Moment
When testing proprietary foundation models on simple box unstacking tasks, researchers found models that generate intricate visual details fail basic physics, producing deformed boxes with altered sizes, revealing a fundamental gap in spatial reasoning despite impressive general capabilities.
You just read a 3-minute summary of a 54-minute episode.
Get The TWIML AI Podcast summarized like this every Monday — plus up to 2 more podcasts, free.
Pick Your Podcasts — FreeKeep Reading
More from The TWIML AI Podcast
Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut - #769
Jun 9 · 51 min
Planet Money
Don't hate the replicator, hate the game
Feb 27
More from The TWIML AI Podcast
Relational Foundation Models for Enterprise Data with Jure Leskovec - #768
May 21 · 66 min
The School of Greatness
Your Brain Is Built for God, Not Scarcity | Dr. Lisa Miller
Jun 8
More from The TWIML AI Podcast
We summarize every new episode. Want them in your inbox?
Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut - #769
Relational Foundation Models for Enterprise Data with Jure Leskovec - #768
How to Find the Agent Failures Your Evals Miss with Scott Clark - #767
How to Engineer AI Inference Systems with Philip Kiely - #766
How Capital One Delivers Multi-Agent Systems with Rashmi Shetty - #765
Similar Episodes
Related episodes from other podcasts
Planet Money
Feb 27
Don't hate the replicator, hate the game
The School of Greatness
Jun 8
Your Brain Is Built for God, Not Scarcity | Dr. Lisa Miller
Hard Fork
Jun 5
Hot I.P.O Summer + What Is A.I. Doing to Math? + HatGPT
Cognitive Revolution
May 26
Your Biggest Lever: Designing your AI Career for Maximum Impact, with 80,000 Hours founder Ben Todd
TED Radio Hour
May 22
What we'll eat on a warmer planet
Explore Related Topics
This podcast is featured in Best AI Podcasts (2026) — ranked and reviewed with AI summaries.
Read this week's AI & Machine Learning Podcast Insights — cross-podcast analysis updated weekly.
You're clearly into The TWIML AI Podcast.
Every Monday, we deliver AI summaries of the latest episodes from The TWIML AI Podcast and 192+ other podcasts. Free for up to 3 shows.
Start My Monday DigestNo credit card · Unsubscribe anytime