[TriLUG] Linux(RedHat) kernel question(long/involved)

Lance A. Brown lance at bearcircle.net
Wed Oct 27 15:27:51 EDT 2010


Leslie(Pete) Boyd wrote:
>     Torque is used to queue jobs for the cluster and MPICH is used
>     to distribute the job across the nodes. The RAIDS are mounted 
>     using EXT4. NFS with automounter is used to distribute the disks
>     to each of the individual servers.
>     
>     Problem: When several jobs are running on the cluster, the load
>              average on the disk servers climbs above 8. Sometimes as
>              much as 12 and the performance of the running jobs 
>              drops.
>              

How many nfsd threads do you have configured on the NFS server(s)?  I've
seen this behaviour when there aren't enough nfsd threads running to
cover the server load.  If this is a standard RHEL or CentOS server
install, you almost certainly don't have enough of them running.  Even
properly tuned, you can get into bad IO wait if you swamp the NFS server
with requests.
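
If you want a quick way to see how many nfsd threads are actually
running, here is a rough sketch in Python. It assumes a Linux kernel NFS
server that exposes its statistics in /proc/net/rpc/nfsd, where the first
number on the "th" line is the thread count. The function name and the
layout of the script are my own choices for illustration, not something
from your setup.

```python
import os

# Statistics file exported by the Linux kernel NFS server (assumed path).
NFSD_STATS = "/proc/net/rpc/nfsd"


def nfsd_thread_count(stats_text):
    """Return the thread count from the 'th' line, or None if it is absent."""
    for line in stats_text.splitlines():
        fields = line.split()
        # The 'th' line looks like: th <threads> <other counters...>
        if len(fields) >= 2 and fields[0] == "th":
            return int(fields[1])
    return None


# Only read the file when it exists, i.e. on a host running the NFS server.
if __name__ == "__main__" and os.path.exists(NFSD_STATS):
    with open(NFSD_STATS) as fh:
        print("nfsd threads:", nfsd_thread_count(fh.read()))
```

As far as I recall, on RHEL and CentOS the count is set with
RPCNFSDCOUNT in /etc/sysconfig/nfs, and it can be changed on a running
server with rpc.nfsd followed by the number of threads, but check the
docs for your release before relying on that.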

--[Lance]

-- 
 GPG Fingerprint: 409B A409 A38D 92BF 15D9 6EEE 9A82 F2AC 69AC 07B9
 CACert.org Assurer


