Kaizen! Let it crash (Friends)
Episode length: 101 min
AI-Generated Summary
Key Takeaways
- ✓ Memory fragmentation solution: Varnish crashed 43 times in three months because large MP3 files (30-100 MB each) fragmented its memory. Moving large files from malloc (in-memory) storage to a file-based cache with pre-allocated disk space eliminated the crashes while maintaining a 93% cache hit ratio across 15 global regions.
- ✓ Concurrency misconfiguration impact: Setting the Fly.io proxy's concurrency type to connections instead of requests let 2,700 long-running connections block new traffic in the Newark region. HTTP/2 clients received response headers successfully but timed out waiting for the response body. The fix was to switch the concurrency mode to requests and explicitly set 60-second idle timeouts.
- ✓ Thread pool architecture benefits: Varnish runs as a daemon with multiple threads, so an out-of-memory kill restarts only individual threads within two seconds rather than the entire VM. This let-it-crash philosophy from the Erlang ecosystem keeps the system stable despite component failures; zero thread failures were recorded after five days of uptime.
- ✓ Bandwidth abuse detection: Episode 456 generated 30 terabytes of traffic from San Jose alone in 60 days, with 10,000+ distinct IPs downloading it repeatedly. Honeycomb observability reveals patterns such as 170,000 favicon requests in two hours and weekly Python/Go clients scraping all MP3s, which calls for rate limiting with vmod-throttle.
- ✓ Regional traffic optimization: San Jose and Tokyo handle the highest CDN load, peaking at 2.29 gigabits per second. Automated hourly checks using hurl test all 15 regions, downloading full MP3s to validate that response times stay under 100 seconds. Fly.io allows per-region instance sizing but requires manual scaling after the initial deployment.
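The rate limiting mentioned in the bandwidth-abuse takeaway can be pictured as a per-client token bucket. The sketch below is only an illustration of that general idea in Python: it is not the vmod-throttle API and does not claim to match how vmod-throttle works internally. The class names, the burst size of 2, and the refill rate of one request per minute are all hypothetical values chosen for the example.

```python
class TokenBucket:
    """Permit short bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # a new client starts with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Throttle:
    """Keep one bucket per client key (here: the client IP address)."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, client_ip: str, now: float) -> bool:
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, now)
            self.buckets[client_ip] = bucket
        return bucket.allow(now)


# Hypothetical policy: bursts of 2 downloads, then 1 more per minute per IP.
throttle = Throttle(capacity=2, refill_per_second=1 / 60)
for second in (0.0, 1.0, 2.0, 62.0):
    print(second, throttle.allow("203.0.113.7", now=second))
```

With these assumed settings, a client that requests the same MP3 three times within two seconds is refused on the third attempt and allowed again about a minute later, while other IPs are unaffected. Timestamps are passed in explicitly so the behaviour is easy to reason about and test.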
What It Covers
Gerhard Lazu debugs Changelog's CDN infrastructure after 43 out-of-memory crashes since October, implementing file-based caching for MP3s, fixing Fly.io proxy misconfigurations, and discovering massive bandwidth abuse from 10,000+ IPs downloading episode 456 repeatedly.
Notable Moment
One episode from August 2021 about OAuth complexity has been downloaded over one million times, generating 400 gigabytes of traffic every four hours from thousands of IP addresses in Asia. The team suspects speed-testing or archiving bots rather than genuine listeners, which is pushing them to implement throttling to control bandwidth costs.
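To put the traffic figures quoted in this summary on a common scale, the following sketch converts them to an average sustained bandwidth. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and perfectly even traffic over each window, neither of which the summary states; the function name is made up for this example.

```python
GB = 1e9   # assumption: decimal gigabytes
TB = 1e12  # assumption: decimal terabytes


def average_mbps(total_bytes: float, seconds: float) -> float:
    """Average throughput in megabits per second over the given window."""
    return total_bytes * 8 / seconds / 1e6


# 400 GB every four hours (the OAuth episode).
oauth_episode = average_mbps(400 * GB, 4 * 3600)

# 30 TB from San Jose in 60 days (episode 456).
episode_456 = average_mbps(30 * TB, 60 * 24 * 3600)

print(f"OAuth episode: {oauth_episode:.1f} Mbps")  # 222.2 Mbps
print(f"Episode 456:   {episode_456:.1f} Mbps")    # 46.3 Mbps
```

Under those assumptions, the OAuth episode alone averages roughly 222 Mbps, about a tenth of the 2.29 Gbps peak reported for the busiest regions.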