Why Vision Language Models Ignore What They See with Munawar Hayat - #758
Episode · 57 min
AI-Generated Summary
Key Takeaways
- ✓Vision Token Attention Failure: Vision language models attend poorly to visual tokens even when images are part of the input. Injecting cross-attention modules every fourth transformer block, with an auxiliary loss that maximizes attention on the relevant segmentation masks, improves visual grounding while reducing attention complexity from O((m+n)²) to O(n² + mn) for m visual and n text tokens.
- ✓Physics Understanding Gap: Foundation models fail at simple physical reasoning tasks such as unstacking boxes, generating deformed objects with altered sizes and properties. Expanding prompts to describe physical constraints during training helps, but models trained on trillions of text tokens versus only billions of image-text pairs still struggle with spatial correspondence and simulating the physical world.
- ✓Generalized Contrastive Learning: Standard CLIP training uses only image-text pairs and fails on composed queries that mix modalities. Training contrastively on all permutations of image, text, and fused embeddings, without collecting new triplet data, enables cross-modal retrieval and generalizes to video benchmarks despite no video training data, while retaining CLIP's 300-400 million parameter efficiency.
- ✓Multi-Person Generation Solution: Models lose facial identity when generating multiple people and cannot produce accurate person counts beyond three to four subjects. Attention masks that prevent one person's tokens from attending to another person's tokens reduce identity leakage, enabling inference-only personalization without fine-tuning an adapter for each face.
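The complexity claim in the first takeaway can be checked with quick arithmetic: full self-attention over the concatenated sequence of m visual and n text tokens scores (m+n)² pairs, while keeping self-attention over text only and adding text-to-vision cross-attention scores n² + mn pairs. A minimal sketch (the token counts below are illustrative, not from the episode):

```python
def full_self_attention_pairs(m, n):
    """Pairs scored by self-attention over the concatenated (m + n)-token sequence."""
    return (m + n) ** 2

def injected_cross_attention_pairs(m, n):
    """Text-only self-attention (n^2) plus text-to-vision cross-attention (m * n)."""
    return n ** 2 + m * n

# Illustrative counts: 576 visual tokens (a 24x24 patch grid), 128 text tokens.
m, n = 576, 128
print(full_self_attention_pairs(m, n))       # 495616
print(injected_cross_attention_pairs(m, n))  # 90112
```

For these counts the injected design scores roughly 5x fewer pairs; the gap widens as the visual token count m grows.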
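The generalized contrastive idea can be sketched as a standard InfoNCE loss applied over every pairing of image, text, and fused embeddings in a batch, rather than image-text only. A minimal NumPy sketch, where the averaging fusion and the temperature value are assumptions for illustration, not the paper's method:

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two embedding batches (N, D); matched rows are positives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def ce(l):  # cross-entropy with the diagonal as targets, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return (ce(logits) + ce(logits.T)) / 2

def generalized_contrastive_loss(img, txt):
    """Average InfoNCE over every pairing of image, text, and fused views."""
    fused = (img + txt) / 2  # assumed fusion; a real model would learn this step
    views = [img, txt, fused]
    pairs = [(a, b) for i, a in enumerate(views)
                    for j, b in enumerate(views) if i < j]
    return sum(info_nce(a, b) for a, b in pairs) / len(pairs)
```

No new triplet data is needed: the fused view is derived from the same image-text pairs CLIP already trains on, which is the property the takeaway highlights.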
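The multi-person masking idea can be illustrated by building an attention mask in which each person's tokens may attend to themselves and to shared tokens, but never to another person's tokens. The token layout and counts here are hypothetical:

```python
import numpy as np

def person_isolation_mask(person_ids):
    """Boolean attention mask over a token sequence.

    person_ids[i] is the person index owning token i, or -1 for shared
    (background/text) tokens. mask[i, j] is True when token i may attend
    to token j: always when either token is shared, otherwise only within
    the same person, which blocks identity leakage between subjects.
    """
    ids = np.asarray(person_ids)
    same_person = ids[:, None] == ids[None, :]
    shared = (ids[:, None] == -1) | (ids[None, :] == -1)
    return same_person | shared

# Hypothetical layout: 2 shared tokens, then 3 tokens for person 0, 3 for person 1.
mask = person_isolation_mask([-1, -1, 0, 0, 0, 1, 1, 1])
print(mask[2, 5])  # False: person 0's token cannot attend to person 1's token
print(mask[2, 3])  # True: attention within the same person is allowed
print(mask[2, 0])  # True: shared tokens stay visible to everyone
```

A mask like this can be passed to a standard masked-attention implementation at inference time, which is what makes the personalization inference-only: no per-face adapter is fine-tuned.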
What It Covers
Munawar Hayat from Qualcomm AI Research discusses three NeurIPS papers addressing critical failures in vision language models: why they ignore visual input, limitations in physics-based generation, and multi-person image generation, along with proposed solutions.
Key Questions Answered
- •Why do vision language models attend so poorly to visual tokens, and how does injecting cross-attention with an auxiliary grounding loss fix it?
- •Why do foundation models that generate intricate visual detail still fail simple physics tasks like unstacking boxes?
- •How can CLIP-style contrastive training be generalized to composed, mixed-modality queries without collecting new triplet data?
- •How do attention masks enable multi-person generation that preserves facial identity without per-face fine-tuning?
Notable Moment
When researchers tested proprietary foundation models on simple box-unstacking tasks, models that generate intricate visual details failed basic physics, producing deformed boxes with altered sizes and revealing a fundamental gap in spatial reasoning despite impressive general capabilities.