[TriLUG] Opteron >4GB w/RHEL 3

Mark T. Voelker markvoelker at fast-mail.org
Mon Feb 16 09:12:18 EST 2004


I'm working on setting up some new lab hardware, among which is a shiny
new dual Opteron server running RHEL 3.  The box has two Opteron 244
CPUs and 6GB of DDR ECC RAM installed in six 1GB sticks (there are 8
total DIMM slots).  The motherboard is a Tyan S2882 (a.k.a. Thunder K8S
Pro).  Everything seems to run fine in the limited testing I've done so
far, but every few seconds I see this appear in the syslog:

Feb 12 18:04:11 rtp-wbu-sh-m1 kernel: CPU 0: Silent Northbridge MCE
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel: Northbridge status a40000000005001b
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     GART TLB error generic level generic
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     extended error gart error
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     link number 0
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     error address valid
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     error uncorrected
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     previous error lost
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     error address 00000000fafe1a68
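
For anyone who wants to check the raw numbers, here is a minimal Python
sketch that pulls a few fields out of the status word and shows where the
error address sits relative to the 4GB mark.  The bit positions are an
assumption based on the general AMD64 machine-check status layout, not
on the source of the kernel that printed the log, and the function names
are made up for illustration.

```python
# Sketch only: bit positions assume the generic AMD64 machine-check
# status layout; they are not taken from the kernel that logged this.

TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_nb_status(status):
    """Split a 64-bit northbridge status word into a few named fields."""
    code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "address_valid": bool((status >> 58) & 1),
        "extended_code": (status >> 16) & 0xF,
        "error_code": code,
    }
    # TLB-class error codes have the form 0000 0000 0001 TTLL.
    if code >> 4 == 0x1:
        fields["kind"] = "TLB"
        fields["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        fields["level"] = LEVEL[code & 0x3]
    return fields


def bytes_below_4gib(address):
    """Distance from an address up to the 4 GiB boundary, in bytes."""
    return 2**32 - address


status = decode_nb_status(0xA40000000005001B)
gap = bytes_below_4gib(0x00000000FAFE1A68)

print(status)
print("error address is %.1f MiB below 4 GiB" % (gap / 2**20))
```

With the values from the log, the low bits decode to a TLB-class error
with generic transaction type and generic level, which matches the
kernel's text, and the error address lands roughly 80 MiB below the 4GB
boundary.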

I thought this looked like possibly bad RAM.  But when I pull out
two--*any* two--sticks of RAM, the error message goes away.  It seems to
be tied in to the fact that I have >4GB of memory.  According to the
motherboard manual, when you use more than 4 DIMMs on this board, you're
using a 128-bit (interleaved) memory configuration as opposed to a
64-bit (noninterleaved) configuration with 4 or fewer DIMMs (ref. page
30 of ftp://ftp.tyan.com/manuals/m_s2882_101.pdf), if that's any hint. 
I've tried rearranging the DIMMs in every valid way listed in the
motherboard's manual to no avail.  I even ran memtest86
(www.memtest.org) just to be sure I didn't have bad RAM.  I'm using RHEL
stock kernel 2.4.21-9.ELsmp and had the same problem on 2.4.21-4.ELsmp. 
The box seems to run fine, but those errors clogging up my syslog have
me worried.  

Anyone know what might be happening here?  I'm not sure whether to
complain to the vendor that something is fishy with their hardware or
whether this is a software issue.

At Your Service,

-- 
Mark T. Voelker

[root@localhost root]# free
             total       used       free     shared    buffers     cached
Mem:       5976880     657272    5319608          0     105080     222268
-/+ buffers/cache:     329924    5646956
Swap:      2040244          0    2040244

[root@localhost root]# uname -a
Linux localhost.localdomain 2.4.21-9.ELsmp #1 SMP Thu Feb 12 16:03:39 EST 2004 x86_64 x86_64 x86_64 GNU/Linux
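
As a rough sanity check on the "free" output, this small Python sketch
converts the Mem total into GiB and compares it with the 6GB installed.
It assumes that "free" is reporting KiB (its usual default) and that each
1GB stick is exactly 1 GiB; the variable names are just for illustration.

```python
# Sketch only: assumes `free` reports KiB and each stick is exactly 1 GiB.

KIB_PER_GIB = 1024 * 1024

installed_kib = 6 * KIB_PER_GIB   # six 1GB sticks
total_kib = 5976880               # "Mem: total" column from free

visible_gib = total_kib / KIB_PER_GIB
missing_mib = (installed_kib - total_kib) / 1024

print("kernel sees %.2f GiB of 6 GiB" % visible_gib)
print("about %.0f MiB is not counted in the total" % missing_mib)
```

So the kernel is counting about 5.70 GiB, a little over 300 MiB short of
the full 6 GiB.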
