[TriLUG] System overload issues

John Vaughters jvaughters04 at yahoo.com
Fri May 24 11:53:45 EDT 2013


Brian,
 
From that data, I see that your processor got slammed even before the time frame you sent, and then there are long spikes of io wait. This suggests a pattern of heavy processing followed by heavy writing to disk, which is a self-destructive pattern if you are still trying to serve pages. Check your memory and swap stats as well. If you end up swapping too much, you are killing your performance. If you are bzip'ing huge files, this could be the spark that starts the bad behavior and gets you into an unrecoverable cycle. Someone mentioned using nice on some of your processes. That could be a way to spread the operation over a longer period of time, but it can cause you issues if you need certain processes finished by a certain time. If you end up with a backup that runs until the next backup starts because of a nice, you are defeating your purpose.
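
As a rough way to check the swap side of this, here is a minimal Python sketch that reads /proc/meminfo on a Linux box and reports how much swap is in use. The function names are made up for illustration. A high figure alone does not prove you are thrashing; it is the sustained swapping in and out that hurts, so treat this as a first look only.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are in kB; a few (such as HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_fraction(info):
    """Return the fraction of swap in use (0.0 when no swap is configured)."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        stats = parse_meminfo(f.read())
    print("MemFree (kB):", stats.get("MemFree"))
    print("Swap used: {:.0%}".format(swap_used_fraction(stats)))
```

Running this from cron every minute or so during the backup window would give you a swap history to line up against the processor and io wait data.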
 
I recommend finding the start times of your issues and seeing whether the processing issues follow the disk io waits or vice versa.
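
One way to sketch that comparison in Python, assuming you have already collected samples with a time, a CPU busy percentage, and an io wait percentage (for example from sar output). The field names and the 80% and 30% thresholds are assumptions picked for illustration, not values taken from your data.

```python
def first_crossing(samples, field, threshold):
    """Return the time of the first sample whose field exceeds threshold, else None."""
    for sample in samples:
        if sample[field] > threshold:
            return sample["time"]
    return None


def which_leads(samples, cpu_threshold=80.0, iowait_threshold=30.0):
    """Report whether the CPU spike or the io wait spike shows up first.

    samples must be in chronological order, and the "time" values must be
    comparable with < (seconds, or same-day "HH:MM:SS" strings).
    """
    cpu_t = first_crossing(samples, "cpu", cpu_threshold)
    io_t = first_crossing(samples, "iowait", iowait_threshold)
    if cpu_t is None and io_t is None:
        return "neither"
    if io_t is None:
        return "cpu only"
    if cpu_t is None:
        return "iowait only"
    if cpu_t < io_t:
        return "cpu first"
    if io_t < cpu_t:
        return "iowait first"
    return "same sample"
```

If this keeps answering "same sample", the sampling interval is probably too coarse to tell which one starts the trouble, and a shorter interval is worth trying.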
 
It's hard to make recommendations without information, so these are very general and you may have covered them.
 
Possible improvements:
 
1. Find a way to reduce your backup. Archive data if possible, remove any unnecessary large files, trim down backups.
2. Determine whether the disk load precedes the processor load or vice versa.
3. Investigate the offending processes and consider nice'ing them.
4. During these resource storms, are you self-destructing because of heat? Both processor and disks may protect themselves by throttling under these conditions. Be sure to clean all the dust out of the machine and find a way to measure the heat. Computers vary on that topic, so you will have to investigate your hardware.
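
For the nice idea in item 3, here is a minimal Python sketch that launches a command at lowered CPU priority on a Unix box. The run_niced name and the increment of 10 are assumptions for illustration. Keep in mind that nice only adjusts CPU scheduling priority, so it may not do much for the io wait side of the problem by itself.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run cmd with its niceness raised by increment and return the result.

    os.nice() runs in the child just before exec, so only the child is
    deprioritized; the calling process keeps its own priority. Unix only.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Demonstration only: the child prints the niceness it was started with.
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```

In a backup script, the command list would be the compression or copy step instead of the demonstration child. Time a full run afterwards to confirm it still finishes before the next backup is due.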
 
More radical changes may include:
 
1. Run the OS from a different disk than your data RAIDs. This will help if you have swap issues, and in general it can help eliminate resource fights between the OS and applications.
2. Consider faster RAID disks.
3. Consider a more capable box. Servers self-destruct if they get into these spike conditions; a more capable box ideally would not find itself at the self-destruct point.
 
I personally would analyze the crap out of the box before getting a new one, because the issue could just move right over to that new box, so understanding it is the first step. 
 
Good Luck,
 
John Vaughters

