Software Engineering Daily

Vespa AI and Surpassing the Limits of Vector Search

38 min episode · 2 min read

Topics: Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Hybrid search outperforms vectors alone: Combining BM25 lexical search with embedding models consistently beats either approach in isolation. Even though most modern embedding models beat BM25 off the shelf, fusing the two signals still outperforms the stronger model on its own (a minimal score-fusion sketch follows this list). Production systems should implement both signals rather than defaulting to vector similarity as the sole relevance measure.
  • Tensor-based ranking enables future-proof retrieval: Representing data as tensors rather than flat vectors lets Vespa natively support new retrieval techniques like ColPali multi-vector search and Bayesian BM25 normalization without architectural rewrites. Practitioners should model ranking signals as named tensor dimensions so that scoring becomes a fast dot product rather than a slower scripted field calculation (see the dot-product sketch after this list).
  • Multi-stage re-ranking on content nodes reduces latency bottlenecks: Vespa runs first-phase ranking across all documents and second-phase re-ranking on the top-N results directly on content nodes, avoiding expensive data movement. A third, global re-ranking phase on a stateless GPU layer handles complex models. This architecture lets more sophisticated models run within acceptable latency budgets (sketched after this list).
  • Chunking strategy directly impacts vector relevance quality: Compressing an entire book or long document into one vector creates a lossy representation that loses specificity. Models like ColPali address PDF complexity by generating one vector per patch in a 32×32 grid across each page, enabling precise retrieval of specific tables or graphs within documents when text and image queries share the same vector space.
  • Agent accuracy compounds with retrieval quality: When an AI agent runs 10 sequential searches, each at 90% accuracy, compound success probability drops to 0.9^10 ≈ 35%. Improving single-query retrieval precision therefore multiplies directly into aggregate agent reliability. Poor retrieval context also increases hallucination rates, because models rely on whatever context they are given, making high-precision search infrastructure a prerequisite for reliable agentic systems.
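
The episode doesn't specify a fusion method, but reciprocal rank fusion (RRF) is a common, assumption-light way to combine a BM25 result list with a vector result list. A minimal sketch; the function name, the k=60 constant, and the sample ids are illustrative, not Vespa's API:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists into one list, best first.

    A document's fused score is the sum over lists of 1 / (k + rank),
    where rank is its 1-based position in that list. Documents that
    rank well in both lexical and vector results rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top hits from a BM25 query and a nearest-neighbor query.
bm25_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc1 and doc3 lead
```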
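To make the tensor point concrete, here is a minimal sketch, in plain Python/NumPy rather than Vespa's ranking-expression language, of scoring as one dot product over named signal dimensions. The signal names and weights are made up for illustration:

```python
import numpy as np

# Named signal dimensions shared by documents and queries (illustrative).
SIGNALS = ["bm25", "vector_sim", "freshness", "popularity"]

def score(doc_signals, query_weights):
    """Align both sides to the same named dimensions, then take one
    dot product instead of a scripted per-field ranking expression."""
    d = np.array([doc_signals.get(name, 0.0) for name in SIGNALS])
    w = np.array([query_weights.get(name, 0.0) for name in SIGNALS])
    return float(d @ w)

doc = {"bm25": 12.4, "vector_sim": 0.83, "freshness": 0.9}
weights = {"bm25": 0.05, "vector_sim": 1.0, "freshness": 0.2}
print(score(doc, weights))  # 0.05*12.4 + 1.0*0.83 + 0.2*0.9 = 1.63
```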
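And a toy version of the multi-stage idea: score everything cheaply, then re-score only the survivors with something more expensive. Both scoring functions below are stand-ins; in Vespa the first two phases run on the content nodes, and a third, global phase can run a heavyweight model on a stateless layer:

```python
def multi_stage_rank(docs, query, cheap_score, expensive_score,
                     n_first=1000, n_second=100):
    """Phase 1: cheap score over every candidate document.
    Phase 2: expensive re-score of only the top n_first survivors.
    (A third, global phase would re-rank the final n_second results.)"""
    phase1 = sorted(docs, key=lambda d: cheap_score(d, query),
                    reverse=True)[:n_first]
    return sorted(phase1, key=lambda d: expensive_score(d, query),
                  reverse=True)[:n_second]

docs = [{"id": i, "text": f"doc {i} about vector search"} for i in range(5)]
cheap = lambda d, q: sum(term in d["text"] for term in q.split())
expensive = lambda d, q: cheap(d, q) - 0.01 * d["id"]  # stand-in for a model
top = multi_stage_rank(docs, "vector search", cheap, expensive,
                       n_first=3, n_second=2)
print([d["id"] for d in top])
```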

What It Covers

Vespa software engineer Radu Gheorghe explains why vector similarity alone fails in production search systems, how tensor-based retrieval generalizes ranking beyond single-signal approaches, and where multi-stage re-ranking architectures create efficiency trade-offs in RAG pipelines and AI agent workflows.

Key Questions Answered

  • Why does combining BM25 lexical search with embedding models beat either approach alone?
  • How does tensor-based ranking let Vespa adopt new techniques like ColPali multi-vector search without architectural rewrites?
  • Where should first- and second-phase re-ranking run to avoid moving data off the content nodes?
  • How does chunking strategy (one vector per document versus one per page patch) change relevance quality?
  • Why does single-query retrieval precision compound into overall AI agent reliability?

Notable Moment

Radu describes how Vespa's tensor framework supported ColPali multi-vector retrieval from day one of the model's release — not because Vespa anticipated it, but because the underlying mathematical plumbing for mapping patch IDs to vectors and computing MaxSim was already in place.
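
For readers who haven't met MaxSim: it is the late-interaction scoring used by ColBERT-style models, including ColPali. Each query-token vector keeps its best match among a page's patch vectors, and those best matches are summed. A minimal NumPy sketch; the dimensions and random data are illustrative:

```python
import numpy as np

def maxsim(query_vecs, patch_vecs):
    """Late-interaction score: for every query-token vector, take its
    maximum dot product over all patch vectors, then sum those maxima.

    query_vecs: (num_query_tokens, dim)
    patch_vecs: (num_patches, dim), e.g. 1024 patches for a 32x32 grid
    """
    sims = query_vecs @ patch_vecs.T       # (tokens, patches) similarities
    return float(sims.max(axis=1).sum())   # best patch per token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))      # 8 query-token embeddings
p = rng.normal(size=(1024, 128))   # one vector per page patch
print(maxsim(q, p))
```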
