Server meltdown

2007-06-05

My post on building a better soda can stove was a test run for my new blog engine. For some reason, it got a lot more attention than I was hoping for. It rose, in turn, to the front page of Reddit, Del.icio.us, Digg, and Makezine. Oh my...

What happened exactly? I'm not sure, but I will tell my side of the story. I run this website on my home DSL, with an upload cap of 80 kb/s. This is not great, but until now it was plenty. Once my new blogging engine was running, I did a few tests with Apache bench to see how responsive it was. Everything was fine, and it didn't take much CPU load to saturate my uplink. But it didn't feel right; this test was way too artificial. At 7h16, I submitted my howto to Reddit, just to see a real-world traffic spike. I didn't expect much, maybe a few hundred hits before I was downvoted into oblivion. Exactly seven seconds after hitting submit, I had my first visitor. Not bad, I thought. Then another one, and another one, and it kept pouring like that for 36 hours. As I write this, I am seeing brief periods of sub-saturation for the first time.

My uplink got saturated really fast after I submitted to Reddit. It took about 20 minutes. But the site was still responsive; I was barely saturated. It got really bad when I climbed to the front page, about one hour later at 8h04.

Suddenly, my bandwidth usage dropped to a sub-saturation level. The server was melting down. The site was unresponsive under a massive number of requests, yet I was pushing only 25% of what my link could carry. My blog is a Pylons application running its own web server, with Apache and mod_rewrite in front serving static files and forwarding the requests for the dynamic stuff. Pylons and Apache run on different computers, and both had a really low system load. Oh my...

A quick test, anything. I restarted Apache and I instantly got back to saturation level. Whew! That was close. But it happened again. I was not going to babysit Apache all day. Connecting directly to the Pylons app was really responsive. Good folks on #pylons suggested lowering Apache's MaxKeepAliveRequests. I went from 100 to 50 and lowered the Timeout while I was there. It worked. I was back on track, serving at full speed with a really flat usage graph.

What happened is that when people were done loading the page, their browsers kept the connection alive, and no one else could connect until that connection was closed or expired. I also tried lowering the number of worker threads, but that was not the solution: I instantly got a sub-saturated, jittery graph and got back to full speed when I restored the defaults. This was an interesting experience. One would think that the lack of a good link was a shield against meltdown, but I definitely had to tidy up my config, fast.
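To make that reasoning concrete, here is a rough back-of-envelope model in Python. It is only a sketch: the worker count, the transfer time and the idle keep-alive times are made-up numbers, not values measured on my server, and real Apache behaviour is more subtle than one formula. The idea is simply that each visitor pins a worker slot while the page is sent and then for as long as the idle connection stays open.

```python
def max_pages_per_second(workers, transfer_seconds, idle_keepalive_seconds):
    """Upper bound on page loads per second when each visitor pins one worker.

    A worker is busy while it sends the page, then stays reserved while the
    idle keep-alive connection waits to be closed or to expire.
    """
    seconds_per_visitor = transfer_seconds + idle_keepalive_seconds
    return workers / seconds_per_visitor


# All numbers below are illustrative assumptions, not measurements.
WORKERS = 150            # hypothetical number of Apache worker slots
TRANSFER_SECONDS = 5.0   # hypothetical time to push one page over a slow uplink

for idle in (15.0, 5.0, 0.0):
    rate = max_pages_per_second(WORKERS, TRANSFER_SECONDS, idle)
    print(f"idle keep-alive {idle:4.1f}s -> at most {rate:.1f} pages/s")
```

In this toy model, the less time a finished visitor holds on to a worker, the higher the ceiling. The same formula also suggests why cutting the number of workers made things worse rather than better: fewer slots means a lower ceiling.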

The rest of the storm passed without major events on the server side; it ran smoothly at saturation level. But that didn't stop the traffic from pouring in. I got popular on Del.icio.us at 13h53, and I was on the front page at 17h01. Around 18h30, I was still saturated but my response time was around 20 seconds, the best it had been since the morning. I got the first hit from Digg at 19h04, submitted by Kevin Rose himself. I was on the front page at 22h19. Now it was pouring hard.

Don't these guys ever sleep? My system ran at full speed all night and the whole morning. I thought it was over, then I made the front page on Makezine at 14h20. The big rush is mostly over now. That was something.

I got about 40000 hits in 36 hours. And I was mirrored almost as soon as I was on Reddit's front page. I don't know how many requests the mirror served. I would like to guess just as many, but probably fewer: to find a mirror you need to dig into the comment section of a news site, and people are lazy. Still, the mirrors must have served many pages, because I had a response time of over five minutes and I still managed to get many up-votes. No way those all came from people seeing an ultra-slow page with half the images failing to load. I received around 10k hits from Reddit, 20k from Digg and 5k from various feeds. The browser usage is interesting. Roughly, 62% Firefox, 10% IE6, 7% Safari, 3% Opera, 2% IE3, 1% IE7. Anyone building a web application that requires a specific browser today is plainly crazy.
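For the curious, the round numbers above can be turned into averages with a few lines of Python. This is only arithmetic on the approximate figures quoted in this post; the "other" bucket is simply whatever the three named sources leave over, not something I measured separately.

```python
# Approximate figures quoted in the post.
TOTAL_HITS = 40_000
HOURS = 36
SOURCES = {"Reddit": 10_000, "Digg": 20_000, "feeds": 5_000}

# Average request rate over the whole storm.
hits_per_hour = TOTAL_HITS / HOURS
hits_per_second = hits_per_hour / 3600

# Hits not attributed to any of the three named sources.
other_hits = TOTAL_HITS - sum(SOURCES.values())

# Share of the total for each source, in percent.
shares = {name: 100 * hits / TOTAL_HITS for name, hits in SOURCES.items()}
shares["other"] = 100 * other_hits / TOTAL_HITS

print(f"average: {hits_per_hour:.0f} hits/hour, {hits_per_second:.2f} hits/s")
for name, percent in shares.items():
    print(f"{name:>6}: {percent:.1f}%")
```

An average hides the peaks, of course: the rate right after hitting a front page was certainly much higher than the rate in the small hours.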

You demonstrated that you like to read howtos about hardware hacking. Fair enough. I'll get a better link and give you more. Thanks to everyone who waited for my slow server to spit out a half-broken page. Thanks to all those who voted me up. I won't let you down. This is only the beginning. Praise "Bob", we have a deal.
