When Cloudflare Sneezed and Half the Internet Caught a Cold


Woke up this morning. Made coffee. Opened X to check what fresh chaos Tuesday had in store.
Nothing loaded.
Tried ChatGPT. Error. Letterboxd. Error. Even tried DownDetector to see if anyone else was having problems. DownDetector was down. That's when I knew this was bad.
It was 6:20 AM Eastern Time when reports started flooding in. Cloudflare, the infrastructure giant that powers roughly 20% of the internet, was having a very bad morning. And when Cloudflare has a bad morning, the rest of us can't even complain about it online because the sites we'd use to complain are also down.
You've probably done this. Clicked refresh fifteen times. Checked your wifi. Restarted your router. Wondered if you forgot to pay your internet bill. Then you realize it's not you. It's everyone.
What broke
The culprit was embarrassingly simple. A configuration file used by Cloudflare's automatic threat-management system grew too big. Just... grew beyond its expected size. And when it hit that limit, it crashed the entire traffic management system.
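To make the failure mode concrete, here's a minimal sketch in Python. It is not Cloudflare's actual code; the file format, the 200-entry cap, and the loader are all invented. The point is how a hard limit that "should never be hit" turns a routine data change into a crash instead of a graceful fallback.

```python
# Hypothetical sketch, not Cloudflare's real code: names and limits are invented.

MAX_RULES = 200  # preallocated cap the loader assumes will "never" be hit


def load_threat_rules(path: str) -> list[str]:
    """Load threat-management rules, one per line."""
    with open(path) as f:
        rules = [line.strip() for line in f if line.strip()]

    # The dangerous part: an oversized file is treated as a fatal error
    # instead of, say, keeping the last-known-good copy and alerting.
    if len(rules) > MAX_RULES:
        raise RuntimeError(
            f"rule file has {len(rules)} entries, limit is {MAX_RULES}"
        )
    return rules


if __name__ == "__main__":
    import tempfile

    # Simulate the bad day: an upstream job writes twice as many rules as usual.
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
        f.write("\n".join(f"rule-{i}" for i in range(2 * MAX_RULES)))

    load_threat_rules(f.name)  # raises RuntimeError, taking the process with it
```

A loader that falls back to the last file it successfully parsed would have had a very different morning.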
Three hours. That's how long major chunks of the internet were basically unusable.
Here's what went down:
X couldn't load tweets
ChatGPT refused to chat
League of Legends players couldn't start matches
Even nuclear plant background check systems went offline
The error messages were almost funny. "Please unblock challenges.cloudflare.com to proceed." Like the system was blaming you for its own failure. Classic.
The spike nobody saw coming
At 11:20 UTC, Cloudflare noticed something weird. A spike in unusual traffic hitting one of their services. Not an attack. Not malicious. Just... unusual. That spike caused the configuration file to balloon past its limits.
And boom. Global meltdown.
By 9:57 AM ET, Cloudflare said they'd implemented a fix. But for three hours before that, millions of people worldwide just stared at loading screens.
The irony killed me. People trying to check if Cloudflare was down on DownDetector. But DownDetector runs on Cloudflare. So they got an error message telling them to check Cloudflare's status. Which they couldn't check. Because Cloudflare was down.
Why one company failing breaks everything
I used to think the internet was this distributed, resilient thing. Can't take it down. Too many paths. Too many redundancies.
Then you learn about Cloudflare.
They provide DNS, security protection, and content delivery for millions of websites. When they fail, it's not just their customers who suffer. It's everyone trying to reach those customers.
Most people don't even know Cloudflare exists. It works behind the scenes. Invisible. Until it isn't.
This wasn't Cloudflare's first rodeo either. They had a nearly 30-minute global outage in July 2019 from a bad firewall rule. Another one in June 2022 from a network configuration error. Same pattern. Small mistake, huge consequences.
We've built an internet where a handful of companies are load-bearing walls. When one cracks, entire neighborhoods collapse.
The 500 error nobody understands
If you saw "Error 500: Internal Server Error" this morning, join the club.
That's web-speak for "something broke on the server side, but we're not going to tell you what." Generic. Useless. Frustrating.
Users trying to access sites saw messages like "Internal Server Error on Cloudflare's network" with helpful advice to "try again in a few minutes." As if refreshing would magically fix a global infrastructure failure.
The technical reality was worse. Sites weren't actually down. Your connection was fine. The websites themselves were running. But Cloudflare sits between you and them. When it stops working, nobody gets through.
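If you were curious whether an error page like that was coming from Cloudflare's edge or from the site itself, the response headers give it away. A quick sketch using Python's third-party requests library; the URL is just a placeholder, and the CF-RAY header is what I'd look for, since Cloudflare attaches it to responses it serves:

```python
# Quick check: is this response being served through Cloudflare's edge?
# Uses the third-party "requests" package; the URL is just a placeholder.
import requests

resp = requests.get("https://example.com", timeout=10)

server = resp.headers.get("Server", "")
cf_ray = resp.headers.get("CF-RAY")  # header lookups are case-insensitive

print("status:", resp.status_code)
print("served via Cloudflare:", server.lower() == "cloudflare" or cf_ray is not None)
if cf_ray:
    # CF-RAY identifies the Cloudflare data center that handled the request,
    # a strong hint that the error page came from the edge, not the origin server.
    print("CF-RAY:", cf_ray)
```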
The London lockdown
Around hour two of the crisis, something weird happened in London.
Cloudflare temporarily disabled WARP access across the entire city. Just shut it off. WARP is their VPN-like service that routes traffic through their network.
When your network is the problem, sometimes the fix is to stop using your network.
Londoners trying to work suddenly lost connection. No warning. Just gone. Cloudflare was essentially amputating limbs to save the body.
Twenty minutes later, they re-enabled it after making changes that allowed the service to recover. Crisis management in real time. Chaotic but effective.
The cost nobody wants to calculate
Someone always tries to put a dollar figure on these things.
One estimate claimed the outage could cost between $5 billion and $15 billion per hour of downtime. Another comparison: last month's AWS outage was pegged at over $75 million per hour.
These numbers are probably inflated. But they're not wrong in spirit.
Every business using the internet lost productivity. Every creator couldn't post. Every gamer couldn't play. Every transaction that should have happened, didn't.
The real cost is trust. Every time this happens, people remember how fragile everything is.
The thing nobody talks about
We keep having these conversations about cloud resilience. Redundancy. Failover strategies.
And then a configuration file gets too big and breaks the internet.
It comes less than a month after the AWS outage. Before that, Microsoft Azure went down. Before that, CrowdStrike's bad update grounded flights and shut down hospitals.
Different companies. Same result. When you centralize infrastructure, you centralize failure.
The expert takes are predictable. "This highlights the fragility of modern internet architecture." "Companies need multi-CDN strategies." "We can't rely on single points of failure."
All true. Also, nobody's going to change anything.
Because switching away from Cloudflare means rebuilding your entire infrastructure. Most companies would rather risk occasional outages than spend months migrating to a backup they hope they'll never need.
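For the record, here's roughly what the "multi-CDN strategy" advice means in practice: a health check that decides which provider to send traffic to. This is a toy sketch with made-up endpoints, and it conveniently skips the part that makes migration painful, which is everything else.

```python
# Toy sketch of health-check-based failover between two CDN providers.
# Every name and URL here is hypothetical; a real multi-CDN setup also needs
# DNS automation, certificates on both providers, cache warming, and testing.
import urllib.request

ENDPOINTS = {
    "primary-cdn": "https://primary.example-cdn.net/health",
    "backup-cdn": "https://backup.example-cdn.net/health",
}


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection resets, and timeouts
        return False


def pick_cdn() -> str:
    """Prefer the primary CDN; fall back to the backup if it looks down."""
    for name, url in ENDPOINTS.items():
        if is_healthy(url):
            return name
    return "backup-cdn"  # last resort: point at the backup and hope


if __name__ == "__main__":
    # In production this decision would drive a low-TTL DNS update,
    # which is the part most companies never actually build or rehearse.
    print("serving traffic via:", pick_cdn())
```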
What scheduled maintenance really means
Here's where it gets suspicious.
Cloudflare had posted about scheduled maintenance in their Santiago datacenter for November 18 between 12:00 and 15:00 UTC. The outage started at 11:20 UTC.
Coincidence? Maybe. Cloudflare insists there's no evidence this was related to the maintenance, but nobody's explicitly ruled it out either.
"Scheduled maintenance" is supposed to be invisible. You reroute traffic. Update systems. Nobody notices. That's the whole point of distributed infrastructure.
When a global outage lands right next to a scheduled maintenance window, people ask questions. Cloudflare promised a detailed post-mortem blog post. As of this writing, it hasn't been published.
The memes were immediate
Some observations from people who couldn't work:
"You know it's a bad Cloudflare outage when it even takes out down detector."
Someone posted a screenshot of their AI girlfriend app showing an error message. Caption: "my ai gf because of cloudflare." Can't even have a fake relationship when infrastructure fails.
Another user: "Lol Cloudflare outage for the 36th time this year ONLY." Probably exaggerating. But it feels true.
The gallows humor is the same every time. We've been through this enough to have the script memorized.
What the CTO said
Dane Knecht, Cloudflare's CTO, posted on X: "I won't mince words: earlier today we failed our customers and the broader Internet when a problem in @Cloudflare network impacted large amounts of traffic that rely on us."
At least he owned it. No corporate speak. No deflection.
But "failed" is doing heavy lifting there. They didn't just fail customers. They accidentally demonstrated how much of modern life depends on a single company's configuration files being the right size.
The recovery nobody celebrates
By 9:57 AM ET, most services were back. Error rates returning to normal. Crisis averted.
Some users still had issues with the Cloudflare dashboard. Some regions recovered slower than others. But the worst was over.
Nobody throws a parade when the internet comes back. We just go back to work. Back to scrolling. Back to pretending this won't happen again.
Cloudflare's statement: "Given the importance of Cloudflare's services, any outage is unacceptable." Nice sentiment. Also: outages will keep happening.
What should have been obvious
The lesson isn't complicated.
When you build critical infrastructure on a small number of massive platforms, you're gambling. Most days you win. The platforms work. Everything's fast and secure.
Then a configuration file grows too big. Or a firewall rule has a typo. Or scheduled maintenance goes sideways.
And suddenly twenty percent of the internet stops working because one company's automatic threat management system couldn't handle the size of its own data.
I still think about that morning. Sitting there with my coffee, unable to do anything online. Wondering how long it would last. Whether I should just go for a walk.
Should have walked. But I refreshed X one more time instead. Still broken. So I made more coffee and waited.
At least my coffee maker doesn't run on Cloudflare.