Site downtime, server "overload" problem, analysis and possible actions

Copy of email, placed here for the record, in case anybody else has any ideas, and for future reference (especially if in the future the site/server management is taken over by another party.)


Subject:    Ongoing "overload" problem
Date:   25 February 2009 12:02:14 GMT+01:00
To:       stef

Hi Stef,

I noticed you'd gone in and done the needed process killing - thanks. I assumed you did the reboot too, but it seems that was unrelated, as described in the mail from xtrahost. Tuesday and Thursday afternoons I have to be out, and although I did actually ssh in from my mobile, you'd already fixed it by that time.

Ok, the rest below is partly just to get this documented - you don't have to respond in any detail!!

Possible causes

Yesterday's occurrence means that we have eliminated the "filecache" module as a possible cause - it has been disabled for the past few days; in fact, all caching has been disabled. So, unless we have multiple causes, filecache is not the culprit.

It seems there's no pattern to these "overload" events, true? Makes it really difficult to diagnose. What I think we have confirmed:

  • Apparently random occurrence

  • Server load goes to huge levels (10+ or even 20+) whereas normal running is around 0.5 to 1.0 most of the time.

  • Cause invariably seems to be a single (?) Apache process consuming large amounts of resources - although possibly this is an effect, not cause?

  • We also tend to see high MySQL usage, but that's not surprising under the circumstances, probably a side-effect.

  • Possibly Xen related, or an Apache-Xen combination problem. Possibly even OS, PHP5, etc. All seem unlikely. However, we know that xtrahost have reported an "unexpected restart" and "a problem with the Xen Hypervisor software" - so, who knows?
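To test whether the events really are random, it would help to have a continuous record of the load. A minimal sketch, assuming Python is available on the server and that this is run once a minute from the system cron; the log path and the "elevated" band are my own assumptions, while the two thresholds come from the figures above:

```python
import os
import time

NORMAL_LOAD_MAX = 1.0  # normal running is around 0.5 to 1.0 (see above)
OVERLOAD_MIN = 10.0    # overload events reach 10+ or even 20+ (see above)


def classify_load(load_1min):
    """Label a 1-minute load average as normal, elevated or overload."""
    if load_1min >= OVERLOAD_MIN:
        return "overload"
    if load_1min > NORMAL_LOAD_MAX:
        return "elevated"  # assumed middle band, not something we have measured
    return "normal"


def log_line(load_1min, when=None):
    """Format one log entry: timestamp, load figure, classification."""
    if when is None:
        when = time.localtime()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", when)
    return "%s load=%.2f %s" % (stamp, load_1min, classify_load(load_1min))


def record(path="/tmp/load.log"):  # hypothetical log location
    """Append the current 1-minute load average to the log file."""
    line = log_line(os.getloadavg()[0])
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return line
```

Calling `record()` every minute would give a timeline that can later be searched for "overload" lines, rather than relying on one of us noticing.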

Thoughts

Assuming the Apache process is running PHP scripts, why doesn't it get killed by Apache? We have a 20 second timeout set.

  • Perhaps it gets killed but then restarted - could happen if a user is somehow triggering the problem, and continually retries to view a page etc.

  • Perhaps it's the Drupal cron process - this (necessarily) has a much longer timeout set. I can't remember how long, but will check; a couple of minutes, I think.

  • If neither of the above, then the process has presumably hung in some way so won't timeout on its own. That would imply some kind of Apache bug - I've searched but can't find anything that looks like a match.

Plan

  • I will check the timeouts, and reduce the cron process timeout to a sensible minimum. I'll also reduce the standard timeout (that set for normal PHP pages) if possible.

  • I will check to see if the "overload" events seem to coincide with the Drupal cron (cron.php) run times (every 15 minutes currently.) If they don't, that would eliminate cron.php runs as a possible cause. If they do, then cron.php is a possible culprit. This could be confirmed by removing that script from cron and running it manually instead.
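The cron comparison in the plan above could be sketched like this. It assumes that cron.php is scheduled on the quarter hours (:00, :15, :30, :45), which I have not checked, and that we have a reasonably accurate start time for each overload event; the 3 minute window is an arbitrary choice:

```python
from datetime import datetime

CRON_INTERVAL_MINUTES = 15  # cron.php currently runs every 15 minutes


def minutes_since_cron(event, interval=CRON_INTERVAL_MINUTES):
    """Minutes between the last scheduled cron run and the event.

    Assumes runs are aligned to the hour (:00, :15, :30, :45).
    """
    return (event.minute % interval) + event.second / 60.0


def coincides_with_cron(event, window_minutes=3.0):
    """True if the event started within the window after a cron run."""
    return minutes_since_cron(event) <= window_minutes


def summarise(events, window_minutes=3.0):
    """Return (events that coincide with cron, total events)."""
    hits = [e for e in events if coincides_with_cron(e, window_minutes)]
    return len(hits), len(events)
```

One caution on reading the result: with a 3 minute window in a 15 minute cycle, about one unrelated event in five would match by chance, so a handful of events will not prove much either way.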

Still, just shooting in the dark :(

Alternatives

  • Could this be Xen related? Maybe we should move to a physical server. I used serverpronto.com a couple of years ago and they were very good. They don't do any backups though (hardly surprising, for the price) so that would need to be set up - rsync or MySQL replication to another location.

  • Move to another Xen host? I'm pretty sure xtrahost know their stuff, it seems unlikely that this is some kind of Xen mis-configuration. The only reason for moving would be to eliminate their Xen configuration as a possible cause. That's a lot of work for quite possibly nothing. I'd rather switch to a physical machine and so completely eliminate the possibility of a VPS related issue.

  • If there's a bug at all, I'd guess it's the Apache PHP module. We could go to standalone PHP either with Apache or even lighttpd (apparently works very well with Drupal if you know how to set it up.) A lot of work though.

  • Buy in some help from an Apache/Xen guru? I rather doubt that anybody will know the answer, but possibly somebody knows more about debugging tools that could be used to shed more light. Or maybe get somebody in to help set up lighttpd, whether on a VPS or physical.

Summary

  • We can do a little more investigative work and, fingers crossed, might find the cause.

  • We can just accept this happens - it doesn't happen often, and the result is usually nothing more than 15 minutes or so of down-time (so long as you or I are around to log in and kill the offending process.) However, in the worst case the server might crash (has done once or twice) and that could result in hard corruption of the data. Or, more likely, we could have soft corruption - incomplete Drupal processes resulting in "funny" data - unlikely to be critical, but nonetheless can't be ignored as a risk.

  • Move to a physical server. No guarantee of a solution, but at least that way we will know it's not a VPS issue, and potentially we can install more powerful tools to track down the real cause. In that case, stick with Ubuntu/Debian or switch to CentOS/Fedora?

  • Consider a switch to lighttpd as a possible (probable?) solution.
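If we do just accept that this happens, the manual step (log in, find the runaway Apache process, kill it) could in principle be scripted for the times neither of us is around. A rough sketch only, not something that is running on the server: the 50% memory threshold and the process names are assumptions, and it defaults to a dry run. It would hide the symptom rather than fix the cause.

```python
import os
import signal

MEMORY_LIMIT_PERCENT = 50.0          # assumed threshold; the runaway was seen at 70%+
APACHE_NAMES = ("apache2", "httpd")  # assumed process names; check with ps first


def find_runaways(ps_output, limit=MEMORY_LIMIT_PERCENT):
    """Return (pid, pmem) for Apache processes above the memory limit.

    Expects the text output of `ps -eo pid,pmem,comm`.
    """
    runaways = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pmem, comm = line.split(None, 2)
        if comm.strip() in APACHE_NAMES and float(pmem) > limit:
            runaways.append((int(pid), float(pmem)))
    return runaways


def terminate(pids, dry_run=True):
    """Send SIGTERM to each pid, or only report it when dry_run is set."""
    for pid in pids:
        if dry_run:
            print("would send SIGTERM to", pid)
        else:
            os.kill(pid, signal.SIGTERM)
```

The input for `find_runaways` would come from something like `subprocess.check_output(["ps", "-eo", "pid,pmem,comm"]).decode()`.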

That's it for now - fingers crossed it won't happen again today...

Andy

On 24 Feb 2009, at 23:06, Stef wrote:

Cool, thanks. Dunno if you saw, but the server load went mental (>16) a couple of times today. I ssh'd in and checked it out, it was a runaway apache process, consuming 70%+ of system memory and swapping hugely as a result, so I killed the offending process. (This happened twice, with me killing the process each time.)

Cheers,

Stef

On 24 Feb 2009, at 20:09, Andy wrote:

Hi Stef, just FYI.

Begin forwarded message:

From: Xtraordinary Networks Support
Date: 24 February 2009 18:15:56 GMT+01:00
To: support(?)xtrahost [dot] co [dot] uk
Subject: Xtraordinary Networks Xenshell22 Reboot
Reply-To: support(?)xtrahost [dot] co [dot] uk

Dear Customer,

Today (24 February 2009) at 16:30 BST the Xen host server your VPS is hosted on unexpectedly restarted. Our initial investigations have uncovered a problem with the Xen Hypervisor software causing the system restart. Following the restart full service was restored at 17:00 BST.

We are currently further investigating the issue.

Apologies for any inconvenience caused. We will let you know if we take further action impacting your service.

Kind Regards,

Technical Support Team
Xtraordinary Hosting : http://www.xtrahost.co.uk
Tel: +44 (0)845 345 0919 email: support(?)xtrahost [dot] co [dot] uk

Update by ng