🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik
Episode: 85 min · Read time: 3 min
AI-Generated Summary
Key Takeaways
- ✓ Patient selection as root cause: 90-95% of cancer drugs fail in clinical trials not because of poor pharmacology or target selection, but because trials enroll the wrong patients. Drugs often work in a subset of patients, but without models to identify that subset upfront, trials run on broad populations, diluting the signal and leading to cancellation of molecules that could help specific subgroups.
- ✓ Spatial transcriptomics as training data: Noetik generates multimodal tissue data stacking H&E pathology images, multiplex fluorescence protein stains, and spatial transcriptomics capturing up to 20,000 genes per spatial location. Each data point functions like a 20,000-channel image rather than a standard RGB image. Over 100 million spatially resolved cells have been generated, representing at least one order of magnitude more paired data than any known public dataset.
- ✓ H&E as universal inference input: Despite training on expensive multimodal data, Noetik's models run inference using only standard H&E pathology slides at deployment. Because H&E is collected for virtually every cancer patient globally, this allows retrospective analysis of existing trial cohorts — splitting responders from non-responders using images already on file, without requiring new data collection from past participants.
- ✓ Autoregressive scaling for spatial biology: Noetik's Tario model applies next-token autoregressive training — the same objective that scales LLMs — to spatial transcriptomics data. Larger models only outperform smaller ones at longer context lengths, meaning the model must observe larger tissue regions simultaneously to capture nonlinear spatial patterns. This mirrors LLM scaling behavior and suggests tissue context length is a key variable for biological foundation model performance.
- ✓ In vivo perturbation validation via barcoded mouse tumors: To validate human model predictions without relying on cell lines, Noetik uses a multiplexed CRISPR knockout platform injecting ~100 barcoded cancer cell variants into a single mouse, producing hundreds of genetically distinct tumors per animal. Human-trained models then run inference directly on mouse H&E, and their predictions about immune infiltration and tumor phenotype are validated against known pathway biology across multiple gene knockouts simultaneously.
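The "20,000-channel image" framing above can be made concrete with array shapes. A minimal sketch — the tile dimensions, counts distribution, and row-by-row tokenization here are hypothetical illustrations, not Noetik's actual data format:

```python
import numpy as np

# Hypothetical dimensions for illustration: a 16x16 grid of spatial
# locations, each carrying expression counts for 20,000 genes.
H, W, GENES = 16, 16, 20_000
rng = np.random.default_rng(0)

# Spatial transcriptomics tile: like an image, but with 20,000
# channels per location instead of the 3 (RGB) of a photograph.
spatial_tile = rng.poisson(0.1, size=(H, W, GENES)).astype(np.float32)

# An H&E image of the same tissue region, for comparison: 3 channels.
he_tile = rng.integers(0, 256, size=(H, W, 3), dtype=np.uint8)

print(spatial_tile.shape)  # (16, 16, 20000)
print(he_tile.shape)       # (16, 16, 3)

# Flattened row by row, the tile becomes a sequence of 256 "tokens"
# for next-token (autoregressive) training; a longer context window
# means the model sees a larger tissue region at once.
tokens = spatial_tile.reshape(H * W, GENES)
print(tokens.shape)  # (256, 20000)
```

The shape comparison is the whole point: per spatial location, the model sees a 20,000-dimensional expression vector where an H&E pixel carries only 3 values.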
What It Covers
Noetik co-founders Ron Alfa and Daniel Bear explain how 90-95% of cancer drug trial failures stem from poor patient selection rather than bad pharmacology. They describe building multimodal foundation models trained on spatially-resolved human tumor data — combining H&E pathology, multiplex protein imaging, and 20,000-gene spatial transcriptomics — to match drugs to the right patient subpopulations.
Key Questions Answered
- • Data generation must precede model development: Noetik spent roughly 18 months generating data before training any functional model. The lesson for AI biotech startups: design datasets around the specific ML problem first, control for batch effects by distributing each patient sample across multiple slides and arrays, and reach a critical data threshold before expecting meaningful model signal. Cutting training data to 10-40% of the full set causes substantial generalization failure, particularly on cancer types not seen during training.
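The batch-effect control described above — spreading each patient's sample across multiple slides so that patient and slide are not confounded — can be sketched as a round-robin assignment. This is a minimal sketch under assumed names and counts, not Noetik's actual pipeline:

```python
import itertools

def distribute_sections(patients, slides, sections_per_patient):
    """Assign each patient's tissue sections to slides round-robin,
    so no patient's data comes from a single slide and each slide
    carries a balanced mix of patients."""
    cycle = itertools.cycle(slides)
    return {p: [next(cycle) for _ in range(sections_per_patient)]
            for p in patients}

patients = [f"patient_{i}" for i in range(6)]
assignment = distribute_sections(
    patients, ["slide_A", "slide_B", "slide_C"], sections_per_patient=2
)

# Every patient appears on more than one slide, so a model cannot
# explain a patient's signal with slide-specific artifacts alone.
assert all(len(set(s)) > 1 for s in assignment.values())
print(assignment["patient_0"])  # ['slide_A', 'slide_B']
```

Because the cycle continues across patients, successive patients start on different slides, which also balances how many sections each slide receives.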
Notable Moment
Noetik ran its lab for approximately 18 months — sourcing human tumors, building processing pipelines, and running two-week spatial transcriptomics machine cycles — before accumulating enough data to train a single model. There was no prior evidence any of it would work. The first functional foundation model, Octo VC, emerged roughly two years after the company launched.