Interests
I ride horses, I take pictures, I write stories and hope to eventually be published. For my day job, I make computer software.
In my oh-so-copious free time I also run a small private ISP that hosts a number of largish websites as favors to friends, including Westeros.org (and the ASoIaF board, and Blood of Dragons, and so on...). This caused GRRM to quip once that I was 'The Sysadmin of Ice and Fire,' hence my board title.
Since the only actual error that happens when that crops up is that the sidebar uses an older copy of the 'news' feed, I've just made the server strip out the error messages for the moment. I figure we can sort out the rest after the seasonal traffic dies down.
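For the curious, the "falls back to an older copy" behaviour looks roughly like the sketch below. This is only an illustration in Python, not the actual PHP on the site; `latest_news`, `fetch`, and `cache` are made-up names, and the real sidebar code may work quite differently.

```python
def latest_news(fetch, cache):
    """Return fresh news items if fetch() works, otherwise the last good copy.

    `fetch` is a hypothetical callable that pulls the feed from the main site;
    `cache` is a plain dict standing in for whatever stores the older copy.
    """
    try:
        items = fetch()
    except Exception:
        # The fetch blew up: quietly serve the older copy instead of an error.
        # This is None if nothing has ever been cached.
        return cache.get("news")
    cache["news"] = items  # remember the last good copy for next time
    return items
```

The trade-off is the one described above: readers never see the error, but the sidebar may be a little stale until the next successful fetch.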
The PHP errors seem to be an issue in the stuff that ties back to the main Westeros.org site, for the "The Latest News" sidebar. They're happening very sporadically, so they're hard to track down, but I'll poke a bit this week.
Tapatalk doesn't play terribly nicely with high traffic situations. In my general experience, a single Tapatalk connection, due to the way it queries data, tends to eat about the same amount of resources as 3-4 normal browser connections. Which means 100 Tapatalk users hitting the site at once is roughly akin to 300-400 normal browser users.
In short: the servers become dark, and full of CPU contention.
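To put numbers on that, here is the back-of-the-envelope arithmetic as a tiny Python sketch. The 3-4x multiplier is just my rough estimate from above, not a measured figure, and `browser_equivalent_load` is a name made up for the example.

```python
def browser_equivalent_load(tapatalk_users, low=3, high=4):
    """Estimate how many normal browser connections a Tapatalk crowd resembles.

    `low` and `high` are the rough per-connection multipliers (3-4x), which are
    an estimate from experience rather than a benchmark.
    """
    return tapatalk_users * low, tapatalk_users * high


print(browser_equivalent_load(100))  # (300, 400)
```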
More solutions are being investigated before the next season, but changing anything on the site during the season is a recipe for disaster. (We've learned that the hard way in years past.) So the site configuration gets pretty well locked down as the season goes live, and notes get taken for changes post-season.
To elaborate slightly, there's an issue with this particular version of Mediawiki that arose after the upgrade. This issue is also the thing which led the forum and wiki to not co-exist well, which is why the forum was shifted back over to another machine; at least this way if the wiki does Strange Things, it doesn't take the forum down by virtue of being on the same webserver.
There is a watchdog running which, if the wiki goes wonky, should function to shut it down and repair things. It means you'll get a Varnish error for about a minute or two when that happens, but then the wiki should come back properly after that. Hopefully we'll have a fix for the wonky wiki weirdness (yes, I'm just going for the alliteration at this point) soon and that will go away.
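In case anyone wonders what a watchdog like that does, the general shape is sketched below in Python. This is not the script actually running on the server; the function and every callable passed into it (`is_healthy`, `shut_down`, `repair`, `start`) are hypothetical stand-ins for whatever checks and commands the real one uses.

```python
import time


def watchdog(is_healthy, shut_down, repair, start, checks, interval=60.0,
             sleep=time.sleep):
    """Poll a service `checks` times and recover it whenever a check fails.

    All four callables are hypothetical hooks supplied by the caller.
    Returns the number of times a recovery was performed.
    """
    recoveries = 0
    for _ in range(checks):
        if not is_healthy():
            # Service has gone wonky: take it down, fix things up, bring it back.
            # Visitors see an error page while this is in progress.
            shut_down()
            repair()
            start()
            recoveries += 1
        sleep(interval)
    return recoveries
```

The outage window visitors notice is roughly the time between `shut_down()` and `start()` finishing, which matches the minute or two of Varnish errors mentioned above.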
Yeah. I didn't have any idea the Mediawiki upgrade would basically dynamite everything, or else I would've raised and waved red flags like mad, as if there was a bridge out ahead. But it seemed a pretty straightforward upgrade, and the test version of it didn't seem to blow everything up, soooo.
To elaborate slightly: the upgrade to MediaWiki has thrown a bit of a spanner in the works for all the load-balancing magic; that upgrade required some changes to the version of PHP we were using, which required changes to the PHP bytecode cache, which introduced an issue with Invision's use of that PHP bytecode cache... follow the chain of dominoes.
The short form: right now, when the load gets heavy, Invision seems to corrupt the PHP bytecode cache, which causes MediaWiki to start hitting things harder, which causes load to increase, which causes Invision to catch fire and run screaming off a cliff. If the caching is turned off, the corruption doesn't happen, but then the load sets the server on fire anyway.
I'm working on finding some more magic glue to stick the hamsters to their wheels. I put something new in place this morning (Pacific time) and we'll see how it goes, and depending on that, I may tear down a bunch of things and rebuild tonight. Alternatively, as a last resort, since they no longer are required to be on the same server by their configuration, I may yank the Wiki and Invision installations onto separate physical servers, so that they aren't sharing a PHP bytecode cache and see if we can't solve the issue that way.
FYI, just to follow up on Linda's post, there appear to have been two things that were going wrong since the season started.
First, one of the switches that handles the private network between Skrog (the server that handles the web/PHP stuff) and Segfault (the server that handles the database) started to malfunction. That was replaced earlier by the colocation provider, but during that time the connection between Skrog's PHP handler and Segfault's MySQL instance was bogging down.
Second, however, the most recent MediaWiki upgrade had a change in how it handled caching; I didn't realize that, so I didn't account for it during the upgrade (mea culpa), and that caused issues with all of the optimizations. It happened only sporadically (though it ruined everything when it did), so I wasn't able to diagnose it until the episode tonight, when it happened pretty reliably and I was able to see what was going wrong.
A couple of hours ago (as of the time I post this), I tweaked how things are optimized; it seems to be running much better now. Ran and I are still keeping an eye on things to make sure everything's solid with the tweak, though.
FYI (since I know people are going to come here to post), server's going to be slow for the next hour or so while I do a comprehensive backup of various things. It's been too long since I took a full restorable snapshot (as opposed to just incremental backups), and I want to get that up-to-date.
I decided to leave things up and running as much as possible to avoid taking things offline; everything should be readable at all times, though there might be a few sporadic spots where the database is locked to read-only for a couple of minutes. Just bear with me, and it'll be done before too long.
As best I can tell, what's happening is that some ad (or ads) loads only partially. This has several effects:
1. This breaks the skin, causing the Flaming Tower of Doom.
2. It screws up the skin loading behind the scenes, too, which
3. ...means PHP doesn't exit cleanly for a long while, which itself
4. ...means we spawn lots of additional PHP processes as people hit reload, which
5. ...means the server gets loaded down, making everything sluggish even when you do get in.
I've done what I can to alleviate points #3-5, though the side effect is that individual page loads will be a little slower for the moment, but #1-2 are outside of my control.
I've made some changes to the backend to hopefully alleviate a little stress. I'm stuck at work, so I'm keeping only half an eye on things, and it will probably take 15 minutes or so for everything to settle down after the changes I made. We'll see if it improves anything.
(The APC pool thing is separate, however; for some reason, PHP-FPM has started leaking memory where it was not before. I'm investigating what might have changed, and hopefully can have it cleared up before too long.)