
ICE CROW


I can guarantee you, at this point, that the issue is that PHP-FPM is leaking RAM, and when we approach the physical memory limit it goes into a death spiral.

How long does it take on average (after a restart) before the 'death spiral' starts?

The question I still have to resolve is what has caused PHP-FPM to start leaking RAM when it had not previously done so.
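While the root cause is being tracked down, a common stopgap is to have PHP-FPM recycle its workers after a fixed number of requests, which releases whatever memory they've leaked. A hedged sketch of a pool config -- the path and every value here are illustrative, not the board's actual settings:

```ini
; e.g. /etc/php-fpm.d/www.conf -- illustrative values only
pm = dynamic
pm.max_children = 20     ; hard cap on workers, bounding total RSS
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 6
pm.max_requests = 500    ; recycle each worker after 500 requests,
                         ; freeing any memory it leaked
```

The `pm.max_requests` directive exists precisely for leak-prone workloads: it trades a little fork overhead for a bounded worst case.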

So you haven't found any vague clues or leads in the error logs? Do you have any theory or suspicion as to what might be the cause of the leak?

Either way, the RAM was a planned upgrade; we need it to allow some tuning to MySQL that I've wanted to do for quite a while. So this was going to happen anyway; I'm just trying to push it forward now as a result of this situation.

Yeah, adding more RAM won't hurt, especially when the upgrade was planned anyway :)

Link to comment
Share on other sites

I've resisted doing this, but for today and tomorrow -- or until some bolt from the sky suddenly resolves all the problems with processes fighting to the death for the pleasure of the Gamemasters of Triskelion (I bid 50 quatloos on the newcomer!) -- I've set the forum to members-only to see if it helps with the issues.

ETA: To clarify, this is a temporary measure. The technical problems are surmountable with time, and we will surmount them, but a Sunday evening is a tough time to resolve anything with finality when we've been battling things for the last few days. We absolutely do not intend to make members-only access a permanent thing under any circumstances, so please don't bother suggesting that we do so.


An update on the situation. Things are likely to be very bad today, as you've already seen; the site may well be unusable for most of the day.

But there is good news: we've finally lined up a new server. A server with much more hard disk space (probably not so important), a much snappier CPU (about 70% faster, according to PassMark), a handy backup/imaging service (which should reduce some of Sparks's work) and 8GB of RAM (four times what the old server had), which is especially important. And for various reasons, the price bump is less than what we expected simply upgrading the old server to 4GB of RAM would cost.

Some time in the Pacific evening -- barring unforeseen circumstances -- the forum and wiki will be moving to shiny, spacious new digs. With luck, everything will run without an issue and the beefed-up server will mean that the lag and delays are a thing of the past.

For those struggling to get to www.westeros.org, that will remain slow for a while longer, though eventually it may also get a new server.


Hmm so:

New server = you'll be able to separate the web and database servers, increasing scalability.

New hard disk = more space, but mostly better data transfer rates, probably set up in a nice hardware RAID array (so hard diskS).

Snappier CPU and more memory = probably a nice multi-core chip; with more memory and some configuration adjustments you'll get better performance, letting the server handle multiple requests at once and serve data faster. And with a bigger cache and a higher hit rate, server load drops further.

Probably another load of goodies that the additional resources and your specific setup will allow (probably not something I will understand, as I am more than a decade out of date).

Great news. I suspect there will be the usual birthing pains from the change of setup and from finding the optimal configuration settings, but it seems that you guys are on the right track. My only suggestion is that you update all your core software components as well; after all, the server is already down, and it's better to do it now than to face further downtime (or delays) later on.


The new server is replacing the current one. Sparks can't have three at the same time. :)

Updates and such are up to Sparks, I imagine she's running the versions she's happy with, but I could be wrong.


I've been wondering: is spinning off a part of the forum a viable solution to the problems?

Take the Miscellaneous section encompassing General Chatter, Literature, and Entertainment and let that become a separate forum called "A Forum of Ice and Fire Nights"

I'm only half-kidding


The new server is replacing the current one. Sparks can't have three at the same time. :)

Is that because you are trying to return to business as usual as fast as possible, avoiding any changes that might complicate the situation further and try the users' patience with the errors?

Because unless there is some physical limitation, I doubt that Sparks can't handle more than three at the same time (two or thirty-seven is really the same, just a bigger light show, more noise and heat, and a bigger bill, of course). This step is something any server admin should be excited to do (or was it just us geeks :rolleyes: ), and with two servers she can easily set up and test the new one with minimal downtime.


The new server is replacing the current one. Sparks can't have three at the same time. :)

More accurately, the old server is under hosting agreements that were established two companies ago; my original hosting company was bought by another, which was then bought by a third. They would really prefer not to have old co-located tower boxes to support, and so the whole reason we're getting a Shiny Spiffy New Leased Server at a rate not a lot higher than I pay for the hosting on this one is that it became clear last night they would *really* like me to transition to hardware more in line with what the company supports now.

So the previous statement isn't quite accurate; I *could* have three servers, but then this new one would cost twice as much per month. They're cutting me a break as an incentive to get stack (this machine) off their hands and out of their datacenter. (They want me to do the same with my other legacy hosting server, and I'm going to take them up on that offer as well, but not until *after* stack's been transferred and decommissioned.) ;)


I'll add to this that I'm aiming for as little downtime as possible. I had to briefly shut *everything* down to set up a replication scheme between stack and segfault (the new server) so that MySQL will be all nice and ready for us; that's done and now it's slowly, slowly synching all the other data. Later tonight I'll shut us down briefly, kick segfault's MySQL server over to master rather than slave, set up stack to point all MySQL queries to segfault and repoint DNS. That should allow stack to (somewhat sluggishly) still handle requests for things on segfault until DNS kicks over for everyone. The goal's to require the server to actually be properly, truly offline for maybe 10-15 minutes during the transfer by getting as much done ahead of time as possible, rather than hours and hours.
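For the curious, the promotion step described above looks roughly like this in MySQL of that era -- a hedged sketch using the server names from the post; Sparks's exact commands and settings may well differ:

```sql
-- On segfault (currently the replica), once replication has caught up:
STOP SLAVE;                  -- stop applying changes from stack
RESET SLAVE;                 -- discard replication state; segfault is
                             -- now a standalone master
SET GLOBAL read_only = OFF;  -- allow writes (if it had been read-only)

-- On stack: repoint the forum/wiki configs' MySQL host at segfault,
-- then update DNS. Stack keeps serving web requests against
-- segfault's database until clients' DNS caches expire.
```

The point of the ordering is that the database only has one writable copy at any moment, so nothing committed during the cut-over can be lost on the old side.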


Okay, guys. If you're reading this, you're seeing the forums and wiki live on the new site. I may have to do some fine-tuning as we settle in, but things should be MUCH happier now.


Okay, guys. If you're reading this, you're seeing the forums and wiki live on the new site. I may have to do some fine-tuning as we settle in, but things should be MUCH happier now.

This is good news, nice job! But with a server name like that (segfault), it's doomed to fail! ;) Next mission: Hunt down the source of the problems :)


This is the only forum I belong to where the admin has actually explained the problem and the steps they are taking to fix it. I have no idea what it all means but I'm very grateful to have all the information provided. I wish I could send you some baked goods, Sparks!


This is good news, nice job! But with a server name like that (segfault), it's doomed to fail! ;)

Actually, the most reliable servers I've had have all been named things like 'segfault' 'glitch' and so on. Stack was meant to be a reference to stack overflows, but the name was never obvious. It's also the server I've had the most headaches from. So we're going back to names that redirect all bad-juju into the name instead of the server's operation. ;)

Next mission: Hunt down the source of the problems :)

I actually did manage to -- around 1am last night -- sort out what was happening, when I thought to run a check against *hardware* on the old server while setting up the new server. So the mystery is solved, at least.

For the record, the hard drive controller was silently failing reads under heavy load... not enough to crash the server, but enough to make paging operations have to happen multiple times. Combined with the old server's limited memory, this meant SQL calls were returning slowly, which caused PHP-FPM calls to return slowly, which caused more PHP-FPM processes to spawn, which caused more memory paging, which caused more read failures, which caused everything to run still slower... hence the death spiral.

In hindsight, the fact that our performance took a sudden and inexplicably precipitous nose-dive just before the weekend without any of the *software* involved having changed should've suggested to me a hardware issue, but nothing was showing up in the system diagnostic logs and I was too busy trying to put out fires to really think it through.
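The post doesn't say which hardware check was run; one common approach is smartmontools' `smartctl`, whose attribute dump flags pending or reallocated sectors. A minimal sketch of the filtering step, run here against inlined sample output (in `smartctl -A` format) rather than a live drive, so it can execute anywhere:

```shell
# Sample attribute lines; on a real box you'd pipe `smartctl -A /dev/sda`
# into the filter instead of this inlined text.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       12'

# Flag any pending/reallocated/uncorrectable attribute whose raw value
# (the last field) is nonzero.
bad=$(printf '%s\n' "$sample" |
  awk '$2 ~ /Pending|Reallocated|Uncorrect/ && $NF + 0 > 0 {print $2}')
echo "$bad"   # -> Current_Pending_Sector
```

A nonzero raw value on those attributes is exactly the kind of silent read trouble that never surfaces in ordinary system logs.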


Okay, guys. If you're reading this, you're seeing the forums and wiki live on the new site. I may have to do some fine-tuning as we settle in, but things should be MUCH happier now.

Sparks, based on the time stamp of your post, it looks like whatever you did took some posts with it. I would say the window was maybe an hour to 90 minutes before this post?


So we're going back to names that redirect all bad-juju into the name instead of the server's operation. ;)

Just make sure to write the name in very big letters when you (or the on-site technician) put the name sticker on; this will increase the absorption rate of bad mojo ;)

In hindsight, the fact that our performance took a sudden and inexplicably precipitous nose-dive just before the weekend without any of the *software* involved having changed should've suggested to me a hardware issue, but nothing was showing up in the system diagnostic logs and I was too busy trying to put out fires to really think it through.

Nice to hear that the source of the problem has been dealt with :)


Sparks, based on the time stamp of your post, it looks like whatever you did took some posts with it. I would say the window was maybe an hour to 90 minutes before this post?

That's possible. There was an issue about 70 minutes before the switch where, due to everything getting backed up over on the old server, things were running slowly enough that the ibf_sessions table actually hit a size limit and the SQL replication process I was using to duplicate the database without having to take the site down temporarily stopped. When I unclogged things, MySQL claimed to recover and get everything copied over, but there's a small possibility things didn't get properly replicated after that. (Because it had become clear the server issue was hardware related, I just wanted us *off* ASAP, before anything went horribly wrong.)
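For reference, the usual culprit when a sessions table "hits a size limit" is a MEMORY-engine table bumping into MySQL's `max_heap_table_size`. A hedged sketch of checking and raising it -- the table name is from the post, but the engine guess and the 64 MB figure are purely illustrative:

```sql
-- Check the engine and current size of the sessions table:
SHOW TABLE STATUS LIKE 'ibf_sessions';

-- If it's a MEMORY table, raise the cap (here: 64 MB) and rebuild it
-- so the new limit applies:
SET GLOBAL max_heap_table_size = 67108864;
ALTER TABLE ibf_sessions ENGINE = MEMORY;
```

The `ALTER TABLE` rebuild matters because MEMORY tables take their size limit at creation time, not from the current global setting.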

The old database is still intact on the other site; if there's anything super critical, I can rig something to let Linda and Ran pull things from there as-needed, since they're keeping the old server online for the rest of this month to ensure everything necessary is off before it gets decommissioned.


Also fyi: as traffic picks back up, I'm slowly tweaking the limits on things. Within a day or so, we should have relatively ideal settings for the new box. :)

(We're currently handling roughly 55 requests per second and the box is handling it like a champ, so as a few requests were having to sit in the queue waiting for more connections to free up, I just turned up all the limits a little bit.)
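Those limit tweaks mostly come down to arithmetic against the new box's 8GB: roughly, how many PHP-FPM workers fit after reserving memory for MySQL and the OS. A hedged sketch with illustrative numbers -- the actual worker footprint and reservations on segfault may differ:

```shell
# Illustrative figures, not measurements from segfault:
total_kb=8388608       # 8 GB of RAM, in KB
reserved_kb=2097152    # 2 GB held back for MySQL and the OS
per_worker_kb=65536    # assumed average php-fpm worker RSS (64 MB)

# Workers that safely fit in what's left:
max_children=$(( (total_kb - reserved_kb) / per_worker_kb ))
echo "pm.max_children = $max_children"   # -> pm.max_children = 96
```

Capping workers at a figure like this is what prevents a repeat of the old spiral: under load, excess requests queue briefly instead of spawning processes into swap.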


In a twist of capricious fate, something's gone wonky with Cloudflare's DNS service that's caused it to refuse lookups for a couple of people (me included). The forum and wiki may be harder to reach for a little while as a result. This one's beyond my control, alas. :(


Archived

This topic is now archived and is closed to further replies.
