As background, think of the forum and wiki like a ride at Disneyland. You can only have a certain number of people on the ride at a time (i.e., the maximum number of PHP instances), but you can have a larger number of people waiting in line for the ride (the pending connections). Some people are okay just watching a video of the ride (Varnish cache for non-logged-in users), and so they can come to the end of the line, watch the video, and leave. Others, however, are stuck waiting for a space on the ride.
That long delay before the page starts loading is the waiting-in-line part. Once the page starts loading, that's you actually being on the ride. You can see that the ride itself has gotten considerably faster, but the wait in line less so.
Now, there are two factors that determine how quickly that line moves: the speed of the ride (i.e., the server's CPU capacity) and the number of people you can fit on the ride (i.e., the server's RAM capacity). The faster the ride goes, the quicker you can get people back off and get the next group on. The more people you can fit on per run, the bigger a chunk of the line you can take away each time the ride comes back to put new people on.
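For anyone who wants the arithmetic behind that, here's a rough sketch. All the numbers are made up for illustration (they are not measurements from our servers), and the function names are just ones I picked for this example.

```python
import math

def runs_to_clear_line(people_in_line: int, seats_per_run: int) -> int:
    """How many times the ride has to run to get everyone in line through."""
    return math.ceil(people_in_line / seats_per_run)

def time_to_clear_line(people_in_line: int, seats_per_run: int,
                       seconds_per_run: float) -> float:
    """Total time to drain the line, assuming nobody new shows up meanwhile."""
    return runs_to_clear_line(people_in_line, seats_per_run) * seconds_per_run

# Hypothetical numbers: 200 people in line, 10 seats, 2 seconds per run.
baseline = time_to_clear_line(200, 10, 2.0)     # 20 runs * 2.0 s = 40.0 s
faster_ride = time_to_clear_line(200, 10, 1.0)  # faster CPU: 20 runs * 1.0 s = 20.0 s
bigger_ride = time_to_clear_line(200, 20, 2.0)  # more RAM: 10 runs * 2.0 s = 20.0 s

print(baseline, faster_ride, bigger_ride)
```

In this toy version, doubling the speed and doubling the seats help equally. The real thing is messier, since people keep joining the line while it drains.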
When the line gets gummed up, that's when things go wrong. Too many people in line, and the amusement park staff cap the line and refuse to let anyone else get in. (That's the 500 error you get sometimes.) Enough of those 500 errors, and Cloudflare itself marks the site offline. So, clearly, we need to keep the line from getting too long, and not just for performance reasons!
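If it helps to see the capped line in code, here's a toy model. It's only a sketch of the idea: the `Ride` class, its method names, and the tiny numbers are all invented for this example, and the real server does this inside the web server and PHP stack rather than in anything resembling this script.

```python
from collections import deque

class Ride:
    """Toy model: a fixed number of seats plus a capped waiting line."""

    def __init__(self, seats: int, max_line: int):
        self.seats = seats        # stand-in for the maximum number of PHP instances
        self.max_line = max_line  # stand-in for the cap on pending connections
        self.riding = 0
        self.line = deque()

    def arrive(self, visitor) -> str:
        # A free seat and nobody waiting: go straight on the ride.
        if self.riding < self.seats and not self.line:
            self.riding += 1
            return "riding"
        # Otherwise join the line, if the staff haven't capped it yet.
        if len(self.line) < self.max_line:
            self.line.append(visitor)
            return "waiting"
        # Line is full: this is the visitor who gets the error page.
        return "refused"

    def finish_one(self):
        """One rider gets off; the next person in line (if any) takes the seat."""
        if self.riding == 0:
            return None
        self.riding -= 1
        if self.line:
            self.riding += 1
            return self.line.popleft()
        return None

ride = Ride(seats=2, max_line=3)
outcomes = [ride.arrive(n) for n in range(7)]
print(outcomes)
# ['riding', 'riding', 'waiting', 'waiting', 'waiting', 'refused', 'refused']
```

Visitors 5 and 6 are the ones who get turned away. Nobody new gets into the line until a rider gets off and the line moves up.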
But what if we just, I dunno, crammed more people in the ride? Those safety guidelines are for wusses, right? Well, turns out, if we risk that, that's when we hit the server's RAM limits. Lots of page faults, and the processors become bogged down, and we end up in the death spiral. People flying off the ride screaming, things on fire, things turning just generally Not Good. That's the part where instead of the server recovering on its own, I get back to the keys and have to try to balance things (or Ran just shuts things off until it recovers).
Now, previously, the Wiki and Forum were two tracks of the same ride; they shared capacity and the line waiting to get in. This obviously wasn't ideal. We've split them onto two separate servers (made them into separate rides), each with its own line and its own capacity. There are still some shared resources (notably the SQL database backing both, off on a third machine), but it's an improvement.
And for a couple of days, everything ran very smoothly.
However, the forum has apparently proven quite capable of hungrily devouring all new capacity; the faster things go, the more people show up, and the quicker we get back to where we were. You all collectively are basically Cookie Monster, but for servers. (Which is actually appropriate, if anyone's seen it: an early version of the Muppet that became Cookie Monster appeared in an IBM training film that Henson was hired to do, where he ate a computer. But I digress.)
To stay with the amusement park analogy: the ride itself now runs quickly, so we can move people through faster that way, but the capacity of the ride hasn't increased much, so we can still only run a small number of people at a time.
As such, we're probably going to increase the RAM of the machine later this week, so that I can increase the number of concurrent PHP instances (i.e., the capacity of the ride).
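For the curious, the sizing logic is roughly the following back-of-the-envelope calculation. Every number here is a placeholder I made up for the example (not our actual RAM, per-process footprint, or limits), and `max_php_workers` is just an illustrative helper, not a real tool.

```python
def max_php_workers(total_ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Rough upper bound on PHP instances that fit in RAM without swapping.

    total_ram_mb:  RAM on the machine
    reserved_mb:   RAM set aside for the OS, web server, caches, etc.
    per_worker_mb: typical memory footprint of one PHP instance
    """
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        return 0
    return usable_mb // per_worker_mb

# Hypothetical before/after: 8 GB vs. 16 GB, 2 GB reserved, 64 MB per instance.
before = max_php_workers(8192, 2048, 64)    # 6144 // 64 = 96
after = max_php_workers(16384, 2048, 64)    # 14336 // 64 = 224

print(before, after)
```

The point of leaving a reserve is to stay off the "cram more people in the ride" path described above: set the cap higher than RAM allows and you're back to page faults and the death spiral.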
...and now I have mental images of Muppets riding amusement park rides, and periodically eating them. Clearly, I need more coffee.
(This post has been brought to you by the letter W and the number 16. Status Posts are a production of the Decaffeinated Sysadmin's Workshop.)