
Ishaan Bhat

1 episode · 1 podcast

We have 1 summarized appearance for Ishaan Bhat so far. Browse all podcasts to discover more episodes.

Featured On 1 Podcast

All Appearances

1 episode

AI Summary

→ WHAT IT COVERS

Princeton researchers Kevin Wang, Ishan Durugkar, Nicole Holt, and Ben Eisenbach present their NeurIPS best paper on scaling reinforcement learning networks to 1000 layers using self-supervised learning. They demonstrate how combining architectural innovations such as residual connections with contrastive objectives enables deep networks in RL, challenging the field's reliance on shallow two-to-four-layer models.

→ KEY INSIGHTS

- **Self-supervised RL objective:** The breakthrough shifts from traditional value-based Q-learning to representation learning with a contrastive loss, where states along the same trajectory are pushed together and states from different trajectories pushed apart. This reframes RL as a binary classification problem rather than noisy TD-error regression, enabling scalability similar to language and vision models without requiring human-crafted reward signals.
- **Critical depth thresholds:** Performance improvements are non-linear and require specific combinations of factors. Simply doubling network depth initially degraded performance, but combining residual connections, layer normalization, and sufficient depth created critical thresholds where performance multiplied dramatically. The team found 64 layers often sufficient for near-perfect performance, though networks scaled successfully to 1000 layers in GPU-accelerated environments.
- **Parameter efficiency through depth:** Scaling network depth grows parameters linearly, while scaling width grows parameters quadratically. For resource-constrained applications, depth scaling therefore provides better performance per parameter. The team demonstrated state-of-the-art goal-conditioned RL performance on JAX GCRL environments using a single 80GB H100 GPU, making the approach accessible rather than requiring massive distributed compute infrastructure.
- **Batch size unlocking:** Deep networks unlock additional scaling dimensions previously ineffective in traditional RL.
The research shows that scaling batch size only becomes effective when network capacity is sufficient to leverage the additional data. Their GPU-accelerated JAX environments collect thousands of parallel trajectories simultaneously, requiring 50+ million transitions to observe the dramatic performance increases from depth scaling.
- **Implicit world modeling:** The contrastive objective performs next-state prediction through binary classification rather than explicit frame prediction. This approach learns meaningful state-action representations for goals without high-dimensional complexity, functioning as an implicit world model. The method draws parallels to next-token prediction in language models, but applies classification to whether future states belong to the same or different trajectories.

→ NOTABLE MOMENT

Lead researcher Kevin Wang describes running experiments where doubling network depth initially produced no improvement, but doubling depth again while adding architectural components suddenly caused performance to skyrocket in one environment. This discovery of non-linear critical depth thresholds was unexpected and required combining multiple factors simultaneously rather than incremental hyperparameter tuning.

💼 SPONSORS

None detected

🏷️ Deep Reinforcement Learning, Self-Supervised Learning, Network Architecture, Goal-Conditioned RL, Robotics Applications
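The binary-classification framing of the contrastive objective can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the encoder outputs, batch layout (row i of both arrays comes from the same trajectory), and exact loss form are assumptions for the sketch.

```python
import numpy as np

def binary_contrastive_loss(state_emb, goal_emb):
    """Illustrative binary contrastive loss (NOT the paper's code).

    state_emb, goal_emb: (batch, dim) outputs of two encoder networks.
    Pair (i, i) comes from the same trajectory (label 1); every
    cross pairing (i, j != i) is a negative (label 0), so the RL
    objective becomes plain binary classification on embedding
    similarities instead of TD-error regression.
    """
    logits = state_emb @ goal_emb.T          # (batch, batch) similarities
    labels = np.eye(len(state_emb))          # positives on the diagonal
    # Numerically stable sigmoid cross-entropy with logits:
    loss = np.maximum(logits, 0) - logits * labels \
        + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

# Matched pairs (same trajectory) should score lower loss than
# deliberately mismatched ones:
rng = np.random.default_rng(0)
s = rng.normal(size=(8, 4))
aligned = binary_contrastive_loss(3 * s, 3 * s)
shuffled = binary_contrastive_loss(3 * s, 3 * np.roll(s, 1, axis=0))
```

Because the labels are self-generated from trajectory membership, no human-crafted reward signal is needed — the same property that lets language and vision models scale.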
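The depth-versus-width parameter claim is simple arithmetic: hidden-to-hidden weight matrices cost width² parameters each, so adding layers grows the count linearly while widening grows it quadratically. A small sketch (the layer sizes are hypothetical, chosen only to show the scaling):

```python
def mlp_param_count(depth, width, in_dim=64, out_dim=8):
    """Parameter count (weights + biases) of a plain MLP with
    `depth` hidden layers of size `width`. Illustrative only."""
    params = in_dim * width + width                  # input projection
    params += (depth - 1) * (width * width + width)  # hidden-to-hidden layers
    params += width * out_dim + out_dim              # output projection
    return params

base = mlp_param_count(depth=16, width=256)
deeper = mlp_param_count(depth=32, width=256)   # 2x depth -> ~2x params
wider = mlp_param_count(depth=16, width=512)    # 2x width -> ~4x params
```

This is why, at a fixed parameter budget (e.g. one 80GB GPU), depth is the cheaper axis to scale.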
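The summary's point that depth only helps once residual connections and layer normalization are combined can be illustrated with a generic pre-norm residual block — a standard construction, sketched here in NumPy; the block layout and sizes are assumptions, not the paper's architecture.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, w1, w2):
    """Pre-norm residual block: x + MLP(LayerNorm(x)).

    The identity skip path keeps the signal (and its gradient)
    flowing even through very deep stacks, which is what makes
    depths like 64-1000 layers trainable at all.
    """
    h = layer_norm(x)
    h = np.maximum(h @ w1, 0.0)   # ReLU
    return x + h @ w2

# Stacking 64 such blocks stays numerically well-behaved:
rng = np.random.default_rng(0)
dim = 32
x = rng.normal(size=(4, dim))
for _ in range(64):
    w1 = rng.normal(scale=0.02, size=(dim, dim))
    w2 = rng.normal(scale=0.02, size=(dim, dim))
    x = residual_block(x, w1, w2)
```

Without the `x +` skip term, the same 64-layer stack of small random layers would collapse activations toward zero or blow them up, matching the episode's observation that naive depth doubling degraded performance.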
