Welcome back! I’m sure some of you noticed a slowdown over the weekend as our ‘Building a High-Performance Web Server’ article got linked by a number of popular sites, including Slashdot.org. Naturally the server got a serious workout as hordes of people, thousands at a time, tried to reach the article’s pages. We pulled through all right, though not without some extra work at the console.
To be honest, I was actually amazed we stayed up; Tom's Hardware, for example, is usually unreachable when they're Slashdotted, and they have a rack full of servers and quite possibly a load balancer in front. We have one single box, so do the math.
But the problems were all software related, and Apache took the bulk of them. The trouble with Apache is that the per-connection overhead is high: generally a couple of megabytes per connection, and if you use keep-alives (enabled by default), each connection's process stays tied up for as long as 30 seconds after the request has completed.
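As a rough illustration, here is the sort of tuning involved. The directive names are standard Apache ones, but the values below are illustrative guesses, not what we actually run, and defaults vary by Apache version:

```apache
# Hypothetical httpd.conf excerpt -- tuning for a flash crowd.

# Either turn keep-alives off entirely...
KeepAlive Off

# ...or keep them but shorten how long an idle child is held:
#KeepAlive On
#KeepAliveTimeout 2
#MaxKeepAliveRequests 50

# Cap the process pool so the box queues connections in the
# listen backlog instead of swapping itself to death:
MaxClients 150
```

With keep-alives off you pay the TCP setup cost per request, but under a flood that is usually cheaper than holding a multi-megabyte process idle for the full timeout.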
Additionally, since Apache handles connections with a pool of individual processes, there is no way to share a global resource between them. In the case of database connections, that means a 1:1 relationship between db connections and HTTP processes. The result is HTTP processes holding open db connections while serving images and other static files that don't need db links at all, so you end up using far more db connections than you actually need.
What we need to do to handle such loads in the future is switch from Apache to something that uses a worker-thread model within a single process. Apache 2.0 can be set up to work this way, but I believe it uses a hybrid model that still relies on processes for dynamic content like PHP. Apache 2.0 will definitely help a little, though.
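The single-process idea can be sketched in a few lines. This is only an illustration of the model, not real server code; `FakeDbConnection` is a placeholder standing in for a MySQL handle, and the numbers are arbitrary:

```python
# Worker threads in one process sharing a small db pool -- the thing
# a pool of separate Apache processes cannot do.
import queue
from concurrent.futures import ThreadPoolExecutor

class FakeDbConnection:
    """Stand-in for a real database handle (illustration only)."""
    def query(self, sql):
        return f"result of {sql}"

class ConnectionPool:
    """N connections serve any number of worker threads."""
    def __init__(self, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(FakeDbConnection())

    def acquire(self):
        return self._pool.get()       # blocks if all connections are busy

    def release(self, conn):
        self._pool.put(conn)

POOL = ConnectionPool(size=4)         # 4 db links, regardless of load

def handle_request(path):
    # Static files never touch the pool -- unlike process-per-connection
    # Apache, where every child drags its own db link around.
    if path.endswith((".gif", ".jpg", ".css")):
        return f"served static {path}"
    conn = POOL.acquire()
    try:
        return conn.query(f"SELECT page WHERE path='{path}'")
    finally:
        POOL.release(conn)

# 100 concurrent requests, 20 worker threads, still only 4 db links.
paths = [f"/article/{i}" for i in range(50)] + \
        [f"/img/{i}.gif" for i in range(50)]
with ThreadPoolExecutor(max_workers=20) as workers:
    results = list(workers.map(handle_request, paths))
```

Half of those requests are images and never acquire a connection at all, and the db-backed half queue up behind just four links instead of demanding one each.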
What's also happening is that MySQL can only handle so many requests at once, and beyond that you get HTTP processes piling up waiting for it. So if we can cut down on the number of queries per page, that will make a pretty significant difference when multiplied across thousands of users.
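The simplest way to cut queries per page is to cache results that every page load repeats. A minimal sketch, where the query strings, the 60-second TTL, and `fake_db` are all made up for illustration:

```python
# Cache repeated query results in memory so each page view costs
# fewer round-trips to the database.
import time

_cache = {}   # sql -> (expiry_time, result)

def cached_query(db, sql, ttl=60):
    """Run sql through db() unless a fresh cached result exists."""
    now = time.time()
    hit = _cache.get(sql)
    if hit is not None and hit[0] > now:
        return hit[1]                 # served from memory, no db hit
    result = db(sql)                  # fall through to the database
    _cache[sql] = (now + ttl, result)
    return result

# Count how many queries actually reach the "database".
calls = 0
def fake_db(sql):
    global calls
    calls += 1
    return f"rows for {sql}"

# Ten page views, each issuing the same two queries:
for _ in range(10):
    cached_query(fake_db, "SELECT * FROM menu")
    cached_query(fake_db, "SELECT title FROM articles")
```

Twenty page-level queries collapse into two actual database hits; everything else is served from memory.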
So yes, I think the Apache keep-alives ‘did us in’ for the most part, and the pool of child processes becomes unmanageable at some point with many thousands of simultaneous connections. The worst part is that optimizations like this can't be found in the manual; you need to have been in the ‘trenches’ to know about them. Fortunately we have a great team, and Vitaliy, our CTO, is really on top of things. He actually had a great time this weekend, or as he put it, ‘this is better than simulation’.
Overall I’m more than happy with the performance of the server. It was never designed to handle such loads, yet it kept on running; it never faltered, and it certainly did not turn into a smoking heap of rubble as some suggested. We were just a little slow serving out those pages, and must’ve been unreachable to some visitors on slower connections.
If anybody has additional comments or insights, I’d be happy to discuss this further or go into greater detail. After all, we’re all here to learn, right?