[TriLUG] How would you diagnose "random" system hangs?
Steve Litt
slitt at troubleshooters.com
Mon May 7 12:48:47 EDT 2007
On Sunday 06 May 2007 19:31, Andrew Perrin wrote:
> My home system has been freezing up at apparently random times -- usually
> when I'm at work, so I come home to a frozen machine that has to be
> cold-booted. How would you go about checking this out? I've let memtest86
> run continuously for 24 hours with no errors, so I don't think it's
> memory. I have the sensors reporting hourly to a log, and there are no
> temperature concerns (generally between 37 and 40C). There's nothing of
> interest in syslog that rings any bells to me. Any ideas?
>
> The machine is an ASUS A8N-E, nForce chipset, with an Athlon64 dual-core
> CPU and 4GB of RAM in it. It's running debian etch, but with a
> home-compiled kernel 2.6.20.7.
I feel your pain, brother!
The following is what *I* do under those circumstances. I've run into others
who have strong objections to my methods, but I think those people are
wrong :-)
First I'd determine the cost of these intermittent freezes. If they're
happening once a month, and the situation is not safety-critical or
business-critical, I'd make sure my backups are good and live with it. The
rarer an intermittent, the more costly it is to solve, but the cheaper it is
to ignore. Even if I ignored it, I would still be on the lookout for a
symptom reproduction sequence (in other words, what tends to make it
happen).
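
One way to start hunting for a reproduction sequence is to find out when the
box actually froze. Here is a minimal sketch in Python, standard library only:
a heartbeat script appends a timestamp to a log at a fixed interval and asks
the kernel to flush it to disk, so after the cold boot the last line shows
roughly when the machine stopped. The file name heartbeat.log, the 60-second
interval and the function names are illustrative assumptions, not part of the
original setup.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path, now=None):
    """Append one timestamp line to the log and ask the OS to flush it to disk."""
    if now is None:
        now = datetime.now()
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as f:
        f.write(stamp + "\n")
        f.flush()
        # Without fsync the line may still sit in a cache when the box hangs.
        os.fsync(f.fileno())
    return stamp


def last_heartbeat(path):
    """Return the last timestamp recorded, or None if the log is missing or empty."""
    try:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None


def run(path="heartbeat.log", interval_seconds=60):
    """Write a heartbeat forever; start this at boot and leave it running."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

After a freeze, calling last_heartbeat("heartbeat.log") before restarting the
loop gives a time to compare against cron schedules, the sensors log and
anything else that runs on a timer.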
Before continuing my suggestions, you might want to read some articles I've
written on intermittents:
http://www.troubleshooters.com/tpromag/200504/200504.htm
http://www.troubleshooters.com/tpromag/200507/200507.htm
http://www.troubleshooters.com/tpromag/9812.htm
The next thing I would have done is what you did -- memtest and check for
temperature.
I'd make sure the box is on a known good UPS with good surge suppression.
Next, with the machine running, I'd wiggle everything in the box to see if I
could repeatedly trigger the intermittent by wiggling or bending something.
Yeah, I know somebody's going to say "you do that and you're going to break
your computer even worse!" While it's true there's a *possibility* of that,
the *probability* is that the only result of wiggling will be to find the
root cause. If my computer cost $10,000 I might think twice about wiggling,
but at $700, with the most expensive single component costing $150 brand new,
my time's too valuable to forgo this intermittent-busting technique.
Besides, if it's good enough for NASA, it's good enough for me :-)
Because finding intermittents is so difficult as to justify time-consuming
troubleshooting measures, I'd next use an electronics lubricant on all
connectors, internal and external, to rule out fretting corrosion on
contacts.
If wiggling and electronic lubricating don't locate it, I might temporarily
turn my attention to software. What does cron kick off in my absence? If I
disable cron, boot and run nothing, does it ever freeze? If I boot Knoppix
and leave it, does it ever freeze?
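
To answer the "what does cron kick off" question, it helps to have the
scheduled jobs in one list. The sketch below parses text in the per-user
crontab format (what `crontab -l` prints) into schedule/command pairs. The
function name parse_crontab is an assumption for illustration, and the parser
deliberately does not handle /etc/crontab or /etc/cron.d files, which carry an
extra user field before the command.

```python
import re


def parse_crontab(text):
    """Return (schedule, command) pairs from per-user crontab text."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Skip environment settings such as MAILTO=root.
        if re.match(r"^[A-Za-z_][A-Za-z0-9_]*\s*=", line):
            continue
        # Shortcut schedules such as @reboot or @daily: one field, then the command.
        if line.startswith("@"):
            parts = line.split(None, 1)
            if len(parts) == 2:
                jobs.append((parts[0], parts[1]))
            continue
        # Regular entries: five time fields, then the command.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # malformed line, ignore it
        jobs.append((" ".join(fields[:5]), fields[5]))
    return jobs
```

Comparing these schedules against the time of the last sign of life before a
freeze can show whether a particular job keeps turning up just before the hang.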
When I fix an intermittent, I trade off likelihood against cost, and also
trade off effort against cost. I've seen intermittents take months to solve.
We're all too busy to spend months finding the root cause on a $700 machine,
so performing the proper tradeoffs is essential.
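
One possible way to make that tradeoff concrete is to rank candidate checks by
likelihood per unit of cost and work down the list. The sketch below is only an
illustration of the idea: the ratio is one assumed scoring rule among many, and
every likelihood and cost figure in it is a made-up placeholder, not a
measurement.

```python
def rank_checks(checks):
    """Sort checks so the best likelihood-to-cost ratio comes first."""
    return sorted(
        checks,
        key=lambda c: c["likelihood"] / c["cost"],
        reverse=True,
    )


# Hypothetical numbers: likelihood is a guess between 0 and 1 that the check
# finds the root cause, cost is effort in arbitrary units.
checks = [
    {"name": "replace the whole box", "likelihood": 0.9, "cost": 10.0},
    {"name": "boot Knoppix and leave it", "likelihood": 0.15, "cost": 1.0},
    {"name": "wiggle everything", "likelihood": 0.3, "cost": 0.5},
    {"name": "lubricate connectors", "likelihood": 0.2, "cost": 1.0},
]

for check in rank_checks(checks):
    print(check["name"])
```

The point is not the exact numbers but the habit: a cheap check with a modest
chance of success can deserve to go ahead of an expensive near-certain one.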
Next, I'd try to rule out things outside the box itself. Could the AC power,
network data or peripherals be doing it?
Finally, if I just could not find the intermittent, and the intermittent was
causing too much hassle to ignore any more, I might replace the whole box and
use the intermittent box for less essential tasks. Yes, that's horrendously
expensive, but intermittents can get more expensive than that.
Not that the following point pertains to you, Andrew, but as a general comment,
when I find an intermittent component or component group, I immediately use a
marker pen to mark it "intermittent", and throw it in the garbage. This
prevents me from inadvertently putting it in a box a year later and suddenly
finding the box intermittent. I would never knowingly install an intermittent
component in even the least essential computer -- my time's worth more than
that.
HTH and good luck.
SteveT