After talking to Ran a bit about things the other night, I've been experimenting with Varnish on a little test site, but we're not going to try putting it in place on the server until *after* the episode tonight. After last night's fun little adventure with the caching PHP engine blowing up (in a very creative way), we don't want to introduce too many changes during the immediately-after-the-episode traffic. Probably Tuesday or Wednesday, I'll take the board down briefly to set up Varnish and see if it improves performance.
In general, the performance under the new setup has been really good as it is, especially compared to the old and much more standard LAMP (Linux, Apache, MySQL, PHP) stack we had been using. The problem is just that the new setup is so complicated that if one thing goes wrong, well, we saw what happened last night. So introducing one more element to the formula right during the heaviest load is, uh, let's just go with "probably unwise."
As an illustration, last night's fun adventure was caused by some bot sending malformed requests to the wiki. Those requests made PHP-FPM spit out errors so rapidly that it was generating gigabytes of log files. And PHP-FPM will go into a weird error state if its error log is over 2GB, where it simply silently errors out on three out of every four requests or so. That was... fun. An upgrade to our PHP-FPM setup got us a patch that fixed the initial issue, and blowing away the enormous pile of error logs that PHP-FPM had vomited all over the system fixed the rest, but tracking the issue down was an exercise in forensic system administration.
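For the curious, here is a rough sketch of how one might hunt for logs that have crossed that 2GB mark. It is only an illustration, not what we actually ran: the function name `oversized_logs` and the example directory `/var/log/php-fpm` are made up for this post, and the 2 GiB threshold is just the figure mentioned above.

```python
import os

# 2 GiB, the size at which the error log started causing trouble (per the post above)
SIZE_LIMIT = 2 * 1024**3


def oversized_logs(log_dir, limit=SIZE_LIMIT):
    """Return (path, size_in_bytes) for every file under log_dir at or above limit.

    Results are sorted largest first. log_dir is whatever directory the
    logs live in; nothing here is specific to PHP-FPM.
    """
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it and keep scanning
                continue
            if size >= limit:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical log location; adjust for the actual system
    for path, size in oversized_logs("/var/log/php-fpm"):
        print(f"{size / 1024**3:.1f} GiB  {path}")
```

Something like this could be run from cron so a runaway log gets noticed before it hits the limit, though proper log rotation is the more usual fix.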
Edited by Sparks, 13 May 2012 - 07:52 PM.