Munawar Hayat

AI Summary

→ WHAT IT COVERS

Munawar Hayat of Qualcomm AI Research discusses three NeurIPS papers that address critical failures in vision-language models: why they ignore visual input, physics-based generation limitations, and the challenges of multi-person image generation, along with proposed solutions.

→ KEY INSIGHTS

- **Vision Token Attention Failure:** Vision-language models attend poorly to visual tokens even when images are part of the input. Injecting cross-attention modules every fourth transformer block, with an auxiliary loss that maximizes attention on the relevant segmentation masks, improves visual grounding while reducing attention complexity from O((m+n)²) to O(n² + mn).
- **Physics Understanding Gap:** Foundation models fail at simple physical reasoning tasks such as unstacking boxes, generating deformed objects with altered sizes and properties. Prompt expansion that describes physical constraints during training helps, but models trained on trillions of text tokens versus only billions of image-text pairs still struggle with spatial correspondence and simulating the physical world.
- **Generalized Contrastive Learning:** Standard CLIP training uses image-text pairs only and fails on composed queries that mix modalities. Training on all permutations of image, text, and fused embeddings, without collecting new triplet data, enables cross-modal retrieval and generalizes to video benchmarks despite no video training data, while keeping CLIP's 300-400 million parameter efficiency.
- **Multi-Person Generation Solution:** Models lose facial identity when generating multiple people and cannot produce accurate person counts beyond three to four subjects. Defining attention masks that prevent one person's tokens from attending to another person's tokens reduces identity leakage, enabling inference-only personalization without fine-tuning adapters for each face.
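The cross-attention idea behind the first insight can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the single-head attention, the segmentation mask as a binary vector over vision tokens, and the log-based auxiliary loss are all assumptions made for clarity. Scoring n text queries against m vision keys costs O(nm), versus O((m+n)²) for self-attention over the concatenated sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, vision):
    """Text queries attend to vision keys/values.
    Scores are (n, m): O(n*m), not O((m+n)^2) as in joint self-attention."""
    d = text.shape[-1]
    scores = text @ vision.T / np.sqrt(d)        # (n, m)
    attn = softmax(scores, axis=-1)              # rows sum to 1
    return attn @ vision, attn

def grounding_aux_loss(attn, mask):
    """Illustrative auxiliary loss: push attention mass onto vision
    tokens inside the relevant segmentation mask (mask: (m,) in {0,1}).
    Loss is 0 when all mass falls on masked tokens, positive otherwise."""
    mass_on_mask = (attn * mask).sum(axis=-1)    # per text token
    return -np.log(mass_on_mask + 1e-8).mean()
```

With a mask covering every vision token the loss vanishes; shrinking the mask raises it, which is the gradient signal that concentrates attention on the grounded region.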
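The generalized contrastive objective can also be sketched. Everything here beyond "contrast all permutations of image, text, and fused embeddings" is an assumption: the fusion operator (a normalized sum), the temperature, and the symmetric InfoNCE form are stand-ins for whatever the paper actually uses.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(a, b, temp=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (B, d)."""
    logits = l2norm(a) @ l2norm(b).T / temp      # (B, B) similarity matrix
    labels = np.arange(len(a))
    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # stable log-softmax
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()      # diagonal = positives
    return 0.5 * (ce(logits) + ce(logits.T))

def generalized_clip_loss(img, txt, fuse=lambda i, t: l2norm(i + t)):
    """Contrast all pairings of image, text, and fused embeddings,
    not just the image-text pairing of standard CLIP. No new triplet
    data is needed: the fused view is derived from existing pairs."""
    fused = fuse(img, txt)
    pairs = [(img, txt), (img, fused), (txt, fused)]
    return sum(info_nce(a, b) for a, b in pairs) / len(pairs)
```

Dropping the last two pairings recovers ordinary CLIP training, which is why the generalized objective subsumes it.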
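The multi-person attention mask reduces to a simple predicate over token pairs. A hedged sketch, assuming tokens are pre-labeled with a person index (with -1 for shared background/prompt tokens, a convention invented here for illustration):

```python
import numpy as np

def identity_attention_mask(person_ids):
    """Build a boolean (T, T) mask where True = attention allowed.
    person_ids[i] is the person index of token i, or -1 for shared
    (background/prompt) tokens. Tokens of one person may not attend
    tokens of another person, which curbs identity leakage between
    faces without fine-tuning a per-face adapter."""
    ids = np.asarray(person_ids)
    same = ids[:, None] == ids[None, :]                 # same person
    shared = (ids[:, None] == -1) | (ids[None, :] == -1)  # background either side
    return same | shared
```

At inference the mask would be applied additively (disallowed positions set to -inf) to the attention logits, so personalization stays inference-only.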
→ NOTABLE MOMENT

When testing proprietary foundation models on simple box-unstacking tasks, researchers found that models capable of generating intricate visual detail fail at basic physics, producing deformed boxes with altered sizes, revealing a fundamental gap in spatial reasoning despite impressive general capabilities.

💼 SPONSORS: Qualcomm (twimlai.com/qualcomm)

🏷️ Vision Language Models, Multimodal AI, Image Generation, Contrastive Learning
