Sparks

Administrators
  • Posts: 40

Everything posted by Sparks

  1. Sparks

    Board Issues 4

    (Having everyone involved out-of-town the Monday after the first episode was, sadly, not ideal! Pesky offline life considerations.)
  2. Sparks

    Board Issues 4

    Okay, mail should be working again. Sorry about that!
  3. Sparks

    Board Issues 4

    There seems to be a problem with sending email; Ran and I will look into it.
  4. Sparks

    Board Issues 4

    Okay, no, sorry. There was a replication error I had to solve. NOW we're up and running properly.
  5. Sparks

    Board Issues 4

    And as a followup, we now have a replicated database here on Scramble (the new server), and it's up and running fine. Hopefully everyone will find things a little more zippy now. There is also a test copy of the wiki running on Scramble, which I'll leave up to Ran as to whether or not we move everyone to that one eventually. Scramble has enough oomph it might be able to take on everything.
  6. Sparks

    Board Issues 4

    We've moved onto the new machine in preparation for the season. I will probably be tuning and tweaking things over the next few days, so bear with me and call out to me or Ran here if things break.
  7. Sparks

    Board Issues 4

    If it's happening to you multiple times a day, then there is a known bug with Invision (the forum software we use): if you have an old Invision login cookie around and a new one, you'll potentially get logged out during certain operations. I had blocked that bug before by having Varnish eat old Invision login cookies (so even if you had one, the board never saw it), but with our move to nginx I couldn't do quite the same filtering. That might be the cause. The fix is just to clear your westeros.org cookies out, and then log back in again.
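The cookie filtering described above can be sketched roughly as follows. This is a Python illustration of the idea, not the actual Varnish or nginx configuration, and the cookie names (`old_session_id`, `new_session_id`) are hypothetical stand-ins for the stale and current Invision login cookies.

```python
def strip_cookies(cookie_header: str, blocked_names: set) -> str:
    """Return the Cookie header with any blocked cookie names removed."""
    kept = []
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Everything before the first "=" is the cookie name.
        name = pair.split("=", 1)[0].strip()
        if name not in blocked_names:
            kept.append(pair)
    return "; ".join(kept)


# Hypothetical cookie names, for illustration only.
header = "old_session_id=abc123; new_session_id=def456; theme=dark"
filtered = strip_cookies(header, {"old_session_id"})
print(filtered)
```

With a filter like this in front of the board, the forum software only ever sees the current login cookie, so the conflict between old and new cookies never arises.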
  8. Sparks

    Board Issues 4

    That's not a 'Board Issue' per se—i.e., something not working right/broken on the board itself—so probably could've gone in a separate thread in the forum. That said, changing your signature is fairly straightforward. Just go to the Account Settings page (which is also available from the little dropdown that pops off your name, from the top of the board), and select Signature from the list on the left-hand side. Et voila, an editor to change your signature!
  9. Sparks

    Board Issues 4

    Ugh. I hadn't noticed that nginx blocked path-type URLs. I've added some rewrite logic which hopefully fixes things.
  10. Sparks

    Board Issues 4

    As background, think of the forum and wiki like a ride at Disneyland. You can only have a certain number of people on the ride at a time (i.e., the maximum number of PHP instances), but you can have a larger number of people waiting in line for the ride (the pending connections). Some people are okay just watching a video of the ride (Varnish cache for non-logged-in users), and so they can come to the end of the line, watch the video, and leave. Others, however, are stuck waiting for a space on the ride. That long delay before the page starts loading, that's the waiting-in-line part. As soon as the page starts loading, that's you actually being on the ride; you can see that the latter part has gotten considerably faster, but less so the first part.

    Now, there are two factors that determine how quickly that line moves: the speed of the ride (i.e., the server's CPU capacity) and the number of people you can fit on the ride (i.e., the server's RAM capacity). The faster the ride goes, the quicker you can get people back off and get the next group on. The more people you can fit on per run, the bigger a chunk of the line you can take away each time the ride comes back to put new people on.

    When the line gets gummed up, that's when things go wrong. Too many people in line, and the amusement park staff cap the line and refuse to let anyone else get in. (That's the 500 error you get sometimes.) Enough of those 500 errors, and Cloudflare itself marks the site offline. So, clearly, we need to keep the line from getting too long, and not just for performance reasons!

    But what if we just, I dunno, crammed more people in the ride? Those safety guidelines are for wusses, right? Well, turns out, if we risk that, that's when we hit the server's RAM limits. Lots of page faults, and the processors become bogged down, and we end up in the death spiral. People flying off the ride screaming, things on fire, things turning just generally Not Good. That's the part where instead of the server recovering on its own, I get back to the keys and have to try to balance things (or Ran just shuts things off until it recovers).

    Now, previously, the Wiki and Forum were two tracks of the same ride; they shared capacity and the line waiting to get in. This obviously wasn't ideal. We've split them onto two separate servers (made them into separate rides), with separate waiting lists and separate capacity. There are still some shared resources (notably the SQL database backing it, off on a third machine), but it's an improvement. And for a couple of days, everything ran very smoothly.

    However, apparently the forum has proven quite capable of hungrily devouring all new capacity; the faster things go, the more people show up, and the quicker we get back to where we were. You all collectively are basically Cookie Monster, but for servers. (Which is actually appropriate, if anyone's seen the very first thing the Muppet that became Cookie Monster came from; he was made for an IBM training film that Henson was hired to do, where he ate a computer. But I digress.)

    To stay with the amusement park analogy: the ride itself doesn't take long to go on, so we can move people through faster that way, but the capacity of the ride hasn't increased much so we still can only run a small number of people at a time. As such, we're probably going to increase the RAM of the machine later this week, so that I can increase the number of concurrent PHP sessions (i.e., the capacity of the ride).

    ...and now I have mental images of Muppets riding amusement park rides, and periodically eating them. Clearly, I need more coffee.

    (This post has been brought to you by the letter W and the number 16. Status Posts are a production of the Decaffeinated Sysadmin's Workshop.)
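The relationship between RAM, ride capacity, and how fast the line moves can be put into a back-of-the-envelope sketch. Every number below (total RAM, memory reserved for the OS, memory per PHP instance, seconds per page) is made up for illustration; none of them are the real server's figures.

```python
def max_php_workers(ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """How many PHP instances fit in RAM without spilling into swap."""
    usable_mb = max(ram_mb - reserved_mb, 0)
    return usable_mb // per_worker_mb


def pages_per_second(workers: int, seconds_per_page: float) -> float:
    """Upper bound on throughput when every worker is kept busy."""
    return workers / seconds_per_page


# Illustrative assumptions only: 1 GB reserved for the OS, 64 MB per PHP instance.
before = max_php_workers(ram_mb=8192, reserved_mb=1024, per_worker_mb=64)
after = max_php_workers(ram_mb=16384, reserved_mb=1024, per_worker_mb=64)

print("workers before RAM bump:", before)
print("workers after RAM bump:", after)
print("pages/sec before:", pages_per_second(before, 0.5))
print("pages/sec after:", pages_per_second(after, 0.5))
```

Making the ride faster (a smaller `seconds_per_page`) and adding seats (more RAM, so more workers) both raise throughput, but only the second one raises how many people can be on the ride at once.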
  11. Sparks

    Board Issues 4

    I think I found the issue; it should be fixed, but you may have to log in again for the fix to take for your user session. (I.e., you might get logged out one more time, but that should be the last time.)
  12. Sparks

    Board Issues 4

    Mm. Copying over the cache/upload directory caused all the webserver's permissions on those directories to be rewritten. Try again!
  13. Sparks

    Board Issues 4

    Search is far and away the single most expensive operation to perform on a forum this large. The search itself chewing RAM or CPU is actually less of a giant looming problem this year than in past years, because the database and search now happen on an entirely different machine than the forum/wiki processing happens on. However, it still presents a problem due to the length of time it takes a search result to return. That holds up both a connection slot and one of the script-processing instances for far longer than any other forum operation, and we have a limited number of both. (It's a large number, but still a limited one, and "large" ceases to have quite the same meaning when you have 25,000 users hitting the site simultaneously.)

    Even though Varnish will hold onto a connection until a new slot becomes free, think of it like rush hour on a freeway: if you have 2-4 cars waiting to merge in behind each car, and this repeats over and over, you end up with a rather spectacular traffic jam. That's what happens to the server under load when search is enabled, just by virtue of how long searches take.

    To continue the freeway analogy, most of the year we need a freeway with 2 lanes in each direction; when we hit the HBO season (rush hour), we need one of those 6-lanes-in-each-direction monstrosities. Unfortunately, the six-lane freeway is overkill 97% of the time (and very, very expensive overkill at that), so we try to mitigate things during the 3% of the time when I become focused on the "Fire" in "Song of Ice and Fire". (Largely due to the fact that I'm mildly concerned the servers might figuratively burst into flame and melt.) One of those mitigations is 'temporarily disabling search'. Hopefully that helps provide a little more backstory and context.
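Why a handful of slow searches hurts so much can be seen with Little's law: the average number of busy slots equals the request rate multiplied by how long each request holds a slot. The rates and durations below are invented for illustration and are not measurements from the board.

```python
def busy_slots(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average slots in use = arrival rate x time each request holds a slot."""
    return requests_per_second * seconds_per_request


# Illustrative assumptions: ordinary page views are frequent but quick,
# searches are rare but slow.
page_slots = busy_slots(requests_per_second=200, seconds_per_request=0.25)
search_slots = busy_slots(requests_per_second=5, seconds_per_request=10.0)

print("slots held by page views:", page_slots)
print("slots held by searches:", search_slots)
```

Under these assumed numbers, searches are only about 2% of the requests but tie up as many slots as all the ordinary page views combined, which is why turning search off frees so much capacity.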
  14. Sparks

    Board Issues 4

    Ran posted about this; we had to temporarily turn off email notifications because, due to the level of board activity, the board was trying to send several hundred emails pretty much constantly. This had two effects:

      • The server load was scary, and things stopped working right.
      • We got flagged as spam by several ISPs that were receiving a big chunk of that mail; as a result, we got blacklisted and so a huge chunk of those mails were bouncing anyway. Which caused them to hit the mail server on the way back in, and... yeah.

    Email notifications will get turned back on after the season ends, or when I figure out a way to stem the tide so the email flood doesn't get us flagged as bulk-mail spammers.
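One common way to stem that kind of flood, and not necessarily what was or will be done here, is to collapse many notifications into a single digest per recipient. The sketch below assumes notifications arrive as hypothetical `(recipient, subject)` pairs; the addresses and subjects are made up.

```python
from collections import defaultdict


def build_digests(notifications):
    """Group (recipient, subject) pairs into one list of subjects per recipient."""
    per_recipient = defaultdict(list)
    for recipient, subject in notifications:
        per_recipient[recipient].append(subject)
    return dict(per_recipient)


# Hypothetical pending notifications, for illustration only.
pending = [
    ("a@example.com", "New reply in thread 1"),
    ("b@example.com", "New reply in thread 1"),
    ("a@example.com", "New reply in thread 2"),
    ("a@example.com", "New reply in thread 1"),
]
digests = build_digests(pending)
print(len(pending), "notifications become", len(digests), "emails")
```

Sending one digest per recipient per interval puts a ceiling on outgoing mail volume no matter how busy the board gets.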
  15. Sparks

    Board Issues 4

    For the record, please only use the Varnish bypass if nothing else is working for you. And if you do log in successfully, try turning it back on once the cookie's in place to see if things still work afterwards. (I.e., hit http://asoiaf.westeros.org/varnish-bypass.php?enable=0 and then come back to the forums and see if you're still logged in properly.)

    Basically, some browser/extension add-ons seem to cause issues, where the browser will just re-retrieve the pre-login Varnish-cached copy of a page, instead of actually sending the cookie (which tells Varnish "I am now logged in, stop caching anything that isn't static"). I wrote a bypass which adds a special cookie to your Westeros session that says "stop using Varnish", to allow folks to try to get past the problem. However, to bypass the weird browser behavior, this turns off all Varnish caching including the board CSS and images, which increases the load on the server (and not incidentally makes your own session load slower). With four or five people, that doesn't add up to much, but if like half the board turns the bypass on "just in case" or something similar, it will dramatically affect performance during the season.

    In theory, the browser caching issue should only affect you prior to sign-on, so you should be okay to turn the bypass behavior back off once you're logged in. If you don't know whether or not you have the bypass enabled, just hit http://asoiaf.westeros.org/varnish-bypass.php and the page will tell you your bypass status.
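The caching rules described above can be summarized as a small decision function. This is a Python sketch of the logic only, not the real Varnish configuration; the cookie names `varnish_bypass` and `login_session` and the list of static file suffixes are assumptions made for illustration.

```python
# Assumed set of file types treated as static content.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif")


def cache_decision(path: str, cookies: dict) -> str:
    """Return "pass" to send the request to the backend, or "cache" to serve from cache."""
    # Bypass cookie set: skip the cache for everything, even CSS and images.
    if cookies.get("varnish_bypass") == "1":
        return "pass"
    is_static = path.endswith(STATIC_SUFFIXES)
    # Logged-in users get fresh pages, but static files can still be cached.
    if "login_session" in cookies and not is_static:
        return "pass"
    return "cache"


print(cache_decision("/index.php", {}))
print(cache_decision("/index.php", {"login_session": "abc"}))
print(cache_decision("/style.css", {"login_session": "abc"}))
print(cache_decision("/style.css", {"varnish_bypass": "1"}))
```

The last case is the expensive one: with the bypass on, even static files go to the backend, which is why leaving it enabled "just in case" adds load for everyone.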
  16. Sparks

    Board Issues 4

    Yes. If you actually do want to break that pairing, go to your Account Settings and pick 'Disassociate Twitter' or 'Disassociate Facebook' under the appropriate tab. That will revoke the authentication token.
  17. Sparks

    Board Issues 4

    You should be able to use the board's password reset link if you have to, then, even if you're locked out of Facebook, as long as you have access to the email account that your board account is registered under.
  18. Sparks

    Board Issues 4

    Facebook logins SHOULD be working again. (And for the record, I see the 'No Permission' quote there too right now. It looks like something's a little odd with Invision's cache that's being used for quotes; either Ran can take a look at it now, or I can poke later when I'm not at work.) Edit: Oh, I see what happened. It appears the original post being linked/quoted there is gone (since the problem is fixed, the thread on the workaround was nuked), and so Invision's quoting system has become rather confused about what to show there; it's applying visibility settings from the original (now gone) post to the quote. SO! No one worry.
  19. Sparks

    Board Issues 4

    Which should be fixed by now.
  20. Just so the reply is tracked here too (and people don't panic): when we did the resource shuffle last night, Ran had misremembered when wiki duplication had happened, and told me the wiki data didn't need to be migrated. Suffice to say, that was Not Quite True. There was an old version of the wiki from testing back in September there, sure, but not the real wiki data. The wiki data has now been migrated, and while changes done this morning (that went into the oops-testing-copy wiki) did indeed get lost, all the rest of the data is back. There may be one more wiki downtime in the next week or two (an hour or so) if we decide to bump the RAM in the server that's now handling just PHP, but otherwise things should be good.
  21. Heya, all.  Sorry I've not been around much lately; if you're looking to find me for a sysadmin thing, poking me through Ran is best.  Day-job work has eaten me alive; I'm traveling a lot, and the board's so high-traffic that I'm more likely to respond to a direct Google Hangouts IM than a board IM right now.  Sorry for any inconvenience! :(

  22. Sparks

    Board Issues 4

    There was an issue with the indexer's automated runs not working right; I tweaked it this morning, so hopefully Sphinx now works in all indexing cases.
  23. Sparks

    Board Issues 4

    Since the only actual error that happens when that crops up is that the sidebar uses an older copy of the 'news' feed, I've just made the server strip out the error messages for the moment. I figure we can sort out the rest after the seasonal traffic dies down. :P
  24. Sparks

    Board Issues 4

    The PHP errors seem to be an issue in the stuff that ties back to the main Westeros.org site, for the "The Latest News" sidebar. They're happening very sporadically, so they're hard to track down, but I'll poke a bit this week.