[TriLUG] software raid failures?

David Brain dbrain at gmail.com
Wed Jan 14 13:04:00 EST 2009


Another anecdote...

Beyond a number of machines that we have running software RAID 1 on
their OS drives (just for redundancy), I have a couple of large
software RAID 5 and 6 arrays in production use - one is around 2.5TB on
14 SCSI drives, the other is around 3.5TB on 15 SAS drives.

There have been a couple of instances of problems with the SAS drives
not always being 'hot swappable', requiring a server restart to pick
up replacement drives, but I think this may be due to the specific
kernel compile options on the box.  However, in 2-3 years of operation
we haven't lost data.  The box with the SAS drives actually has 2
drive trays with one running software raid on one controller and the
2nd running a hardware raid10 on a separate controller (as we needed
some additional write performance).

Performance-wise it's pretty decent - obviously the hardware RAID 10
beats out the software RAID 6 (as does the SAN...), but it's been very
stable.

Couple of points that might be worth thinking about - consider RAID6
rather than RAID5 for larger arrays (the bigger of our arrays is
actually raid6 + a hotspare).  It's not uncommon for another drive to
go bad during the rebuild process, which can ruin your day if you are
just doing RAID5.  Also, to help mitigate the 'bad drive on rebuild'
problem, re-scan the array relatively frequently (a hardware card
will normally do this automatically - the terms I've seen used
there are 'patrol reads' or 'scrubbing').  I run 'echo check >>
/sys/block/md5/md/sync_action' on a weekly basis; this will pick up
any failed drives early, before they bite you during a rebuild.
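
For anyone who wants to wrap that weekly check in a script, here is a
minimal sketch rather than the exact job described above. The function
name start_md_check and the MD_SYSFS_ROOT override are made up for
illustration (the override only exists so the function can be tried
against a scratch directory); the sync_action file and its 'idle' and
'check' values are the standard md sysfs interface.

```shell
#!/bin/sh
# Sketch: ask the md driver to start a consistency check ("scrub") on one array.
# start_md_check and MD_SYSFS_ROOT are hypothetical names used for illustration.

start_md_check() {
    md_dev="$1"    # array name, e.g. md5
    action_file="${MD_SYSFS_ROOT:-/sys/block}/$md_dev/md/sync_action"

    # The array must exist and we must be allowed to write to it (normally root).
    if [ ! -w "$action_file" ]; then
        echo "cannot write $action_file" >&2
        return 1
    fi

    # Do not interrupt a resync, recovery or check that is already running.
    current=$(cat "$action_file")
    if [ "$current" != "idle" ]; then
        echo "$md_dev is busy ($current), not starting a check" >&2
        return 2
    fi

    # Same request as the one-liner above: start a read-only check pass.
    echo check > "$action_file"
}
```

Sourced from a weekly cron script, it would be called as
'start_md_check md5'.  Progress shows up in /proc/mdstat while the
check runs, and /sys/block/md5/md/mismatch_cnt can be read afterwards.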

David.

On Tue, Jan 13, 2009 at 3:13 PM, Cristóbal Palmer <cmp at cmpalmer.org> wrote:
> Anybody here ever lost data because of a problem with Linux software
> raid? If so, please describe the circumstances. Now, of course you had
> separate, off-site backups, so you probably only lost a day's worth of
> data when this happened, but...
>
> I have two people who are software raid skeptics and need convincing.
> An official-looking document that basically says, "This is why you
> won't lose data due to a kernel panic or drive failure or..." would be
> great.
>
> Thanks in advance,
> --
> Cristóbal M. Palmer
> "Small acts of humanity amid the chaos of inhumanity provide hope. But
> small acts are insufficient."
>    -- Paul Rusesabagina
> --
> TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
> TriLUG FAQ  : http://www.trilug.org/wiki/Frequently_Asked_Questions
>


