[TriLUG] How would you diagnose "random" system hangs?

Steve Litt slitt at troubleshooters.com
Mon May 7 12:48:47 EDT 2007


On Sunday 06 May 2007 19:31, Andrew Perrin wrote:
> My home system has been freezing up at apparently random times -- usually
> when I'm at work, so I come home to a frozen machine that has to be
> cold-booted. How would you go about checking this out? I've let memtest86
> run continuously for 24 hours with no errors, so I don't think it's
> memory. I have the sensors reporting hourly to a log, and there are no
> temperature concerns (generally between 37 and 40C). There's nothing of
> interest in syslog that rings any bells to me. Any ideas?
>
> The machine is an ASUS A8N-E, nForce chipset, with an Athlon64 dual-core
> CPU and 4GB of RAM in it. It's running debian etch, but with a
> home-compiled kernel 2.6.20.7.

I feel your pain, brother!

The following is what *I* do under those circumstances. I've run into others 
that have strong objections to my methods, but I think those people are 
wrong :-)

First I'd determine the cost of these intermittent freezes. If they're 
happening once a month, and the situation is not safety-critical or 
business-critical, I'd make sure my backups are good and live with it. The 
more sparse an intermittent, the more costly its solution, but the cheaper it 
is to ignore it. If I were to ignore it, I would still be on the lookout for 
a symptom reproduction sequence (in other words, what tends to make it 
happen).
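
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch 
in Python. Every figure in it (freezes per year, hours lost per freeze, 
hourly rate, cost of the fix) is a made-up placeholder, and the function 
names are hypothetical, so plug in your own estimates.

```python
def yearly_cost_of_ignoring(freezes_per_year, hours_lost_per_freeze, hourly_rate):
    """Rough yearly cost of simply living with the freezes."""
    return freezes_per_year * hours_lost_per_freeze * hourly_rate


def worth_chasing(freezes_per_year, hours_lost_per_freeze, hourly_rate,
                  estimated_fix_cost):
    """True if a year of ignoring the problem costs more than fixing it."""
    yearly = yearly_cost_of_ignoring(freezes_per_year, hours_lost_per_freeze,
                                     hourly_rate)
    return yearly > estimated_fix_cost


if __name__ == "__main__":
    # Hypothetical numbers: one freeze a month, half an hour lost each time,
    # time valued at $50/hour, compared against a $700 replacement box.
    print(yearly_cost_of_ignoring(12, 0.5, 50))
    print(worth_chasing(12, 0.5, 50, 700))
```

With a freeze every few days instead of once a month, the same arithmetic 
tips the other way quickly.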

Before continuing my suggestions, you might want to read some articles I've 
written on intermittents:

http://www.troubleshooters.com/tpromag/200504/200504.htm
http://www.troubleshooters.com/tpromag/200507/200507.htm
http://www.troubleshooters.com/tpromag/9812.htm

The next thing I would have done is what you did -- memtest and check for 
temperature.

I'd make sure the box is on a known good UPS with good surge suppression.

Next, with the machine running, I'd wiggle everything in the box to see if I 
could repeatedly trigger the intermittent by wiggling or bending something. 
Yeah, I know somebody's going to say "you do that and you're going to break 
your computer even worse!" While it's true there's a *possibility* of that, 
the *probability* is that the only result of wiggling will be to find the 
root cause. If my computer cost $10,000 I might think twice about wiggling, 
but at $700, with the most expensive single component costing $150 brand new, 
my time's too valuable to forgo this intermittent-busting technique. 
Besides, if it's good enough for NASA, it's good enough for me :-)

Because finding intermittents is so difficult as to justify time-consuming 
troubleshooting measures, I'd next use an electronics lubricant on all 
connectors, internal and external, to rule out fretting corrosion on 
contacts.

If wiggling and electronics lubricant don't locate it, I might temporarily 
turn my attention to software. What does cron kick off in my absence? If I 
disable cron, boot and run nothing, does it ever freeze? If I boot Knoppix 
and leave it, does it ever freeze?
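
One way to get at the cron question is to pin down when the box froze and 
see which scheduled jobs started just before that. The sketch below assumes 
something I'm making up for illustration: a heartbeat file with one ISO 
timestamp per line (in the same spirit as the hourly sensors log), plus a 
hand-typed table of daily job start times. The job names and times are 
placeholders, not read from any real crontab.

```python
from datetime import datetime, timedelta


def last_heartbeat(lines):
    """Return the newest timestamp from lines of ISO-format timestamps."""
    stamps = [datetime.fromisoformat(line.strip())
              for line in lines if line.strip()]
    return max(stamps) if stamps else None


def jobs_near_freeze(freeze_time, daily_jobs, window_minutes=10):
    """Names of daily jobs that started within the window before the freeze.

    daily_jobs maps a job name to its (hour, minute) start time.
    """
    suspects = []
    for name, (hour, minute) in daily_jobs.items():
        start = freeze_time.replace(hour=hour, minute=minute,
                                    second=0, microsecond=0)
        if start > freeze_time:
            # Today's run hadn't started yet, so look at yesterday's run.
            start -= timedelta(days=1)
        if freeze_time - start <= timedelta(minutes=window_minutes):
            suspects.append(name)
    return sorted(suspects)


if __name__ == "__main__":
    # Hypothetical heartbeat lines and job table, for illustration only.
    heartbeat = ["2007-05-06T06:24:00\n", "2007-05-06T06:27:00\n"]
    jobs = {"daily": (6, 25), "weekly": (6, 47), "backup": (2, 0)}
    print(jobs_near_freeze(last_heartbeat(heartbeat), jobs))
```

A job that shows up as a suspect after several separate freezes is worth 
disabling for a while to see whether the freezes stop. One that shows up 
once proves nothing.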

When I fix an intermittent, I trade off likelihood against cost, and also 
trade off effort against cost. I've seen intermittents take months to solve. 
We're all too busy to spend months finding the root cause on a $700 machine, 
so performing the proper tradeoffs is essential.
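
As a rough illustration of that tradeoff, the sketch below orders candidate 
checks by likelihood per hour of effort, so cheap and plausible checks come 
first. The likelihoods and hour figures are guesses invented for the 
example, and rank_checks is a hypothetical helper, not part of any tool.

```python
def rank_checks(checks):
    """Sort (name, likelihood, hours) tuples by likelihood per hour, best first."""
    return sorted(checks, key=lambda check: check[1] / check[2], reverse=True)


if __name__ == "__main__":
    # Guessed numbers, for illustration only.
    candidates = [
        ("swap the power supply", 0.20, 2.0),
        ("reseat and wiggle cables", 0.15, 0.25),
        ("boot Knoppix overnight", 0.30, 1.0),
    ]
    for name, likelihood, hours in rank_checks(candidates):
        print(f"{name}: {likelihood / hours:.2f} per hour")
```

The exact numbers matter less than the ordering they force you to think 
about.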

Next, I'd try to rule out things outside the entire box. Could the AC power, 
network data or peripherals be doing it? I'd try to rule that out.

Finally, if I just could not find the intermittent, and the intermittent was 
causing too much hassle to ignore any more, I might replace the whole box and 
use the intermittent box for less essential tasks. Yes, that's horrendously 
expensive, but intermittents can get more expensive than that.

Not that the following point pertains to you, Andrew, but as a general comment, 
when I find an intermittent component or component group, I immediately use a 
marker pen to mark it "intermittent", and throw it in the garbage. This 
prevents me from inadvertently putting it in a box a year later and suddenly 
finding the box intermittent. I would never knowingly install an intermittent 
component in even the least essential computer -- my time's worth more than 
that.

HTH and good luck.

SteveT


