[TriLUG] software raid failures?
kevin at flanagannc.net
Tue Jan 13 20:36:55 EST 2009
Of course it largely depends on the hardware you are using, but in true
server class machines with hot swap SCSI drives you get a lot of added
features. For example, in HP ProLiant servers both the disks and the array
controller hold the array configuration. You can swap out disks live with
zero outage. If the array controller dies, you shut down the machine and
put in a new controller; as you bring the machine up, the array controller
BIOS will ask if you want to use the configuration stored on the disks. If
you say yes, it boots right back up. One of the keys is that you don't
have to know anything about the configuration of the array, just which disk
is dead. The software, OS and all, knows nothing about the failure that
caused you to replace a part.
All that said, there are lots of cases where that cost is more than the
value you would get out of it. But when the data really is important, the
cost of high end disks, array controllers, and so on is justified because
you are much less likely to experience any outage in spite of some hardware
failure.
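For the skeptics: the reason a single dead disk doesn't cost you data on a
RAID-5 array, hardware or software, comes down to XOR parity. Here is a
minimal sketch of the idea in Python. The block contents are made up and
xor_blocks is just a name I picked for illustration; this is not how md or
any array controller actually implements it.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data blocks, one per disk, forming a single stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block for the stripe lives on a fourth disk.
parity = xor_blocks(data)

# Pretend the disk holding data[1] died: XOR the survivors with the parity
# block to get the missing block back.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Here rebuilt comes out identical to the lost block. Lose a second disk
before the rebuild finishes and there is nothing left to XOR against, which
is why the backups matter either way.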
On Tue, Jan 13, 2009 at 5:23 PM, Aaron Joyner <aaron at joyner.ws> wrote:
> I've personally supported software raid in Linux for well more than 6
> years, in large scale production usage. I've never lost data. I know
> that sounds broad, but I've never been in a situation where I had more
> than one drive fail faster than I could do a timely back up, shut down
> the box, replace the drive, and bring it back on line. I've had a box
> which had unstable hardware (a flaky but irreplaceable PCI card) which
> would lock the hardware, cause kernel panics, and all manner of other
> generally unfriendly shutdowns on a weekly basis, for months on end,
> and even in that near worst-case scenario I didn't lose data on the
> relevant Linux md software raid-5 array.
> I agree with others that if you're staking your production system on
> it, you should compile as large as possible a set of anecdotes, but
> that's about as close as you're going to come to realistic data. Far
> more important than the droves of "it works great for me" responses
> you're likely to get is looking for those few telltale "I lost data
> in this situation" responses you may get. Those will be the anecdotes
> that will trouble your skeptics. The best you can do, should you be
> able to find anyone willing to say that, is to try to establish that
> either (a) their particular, very unusual situation wouldn't apply to
> you, or (b) they're not really a trustworthy source. If you can't
> convince yourself of either (a) or (b) for anyone you can find who
> *does* report a failure, then you might be on to something, and
> perhaps you should listen to your skeptics.
> Generally though, I'm of the opinion that if it were a real problem,
> you'd hear a lot more grumbling about it on public lists, and it'd be
> readily discoverable via Google. Because of my positive experiences,
> I've never really gone looking for that type of concern, so it might
> be out there. I'd be very surprised.
> Aaron S. Joyner
> On Tue, Jan 13, 2009 at 3:13 PM, Cristóbal Palmer <cmp at cmpalmer.org> wrote:
> > Anybody here ever lost data because of a problem with Linux software
> > raid? If so, please describe the circumstances. Now, of course you had
> > separate, off-site backups, so you probably only lost a day's worth of
> > data when this happened, but...
> > I have two people who are software raid skeptics and need convincing.
> > An official-looking document that basically says, "This is why you
> > won't lose data due to a kernel panic or drive failure or..." would be
> > great.
> > Thanks in advance,
> > --
> > Cristóbal M. Palmer
> > "Small acts of humanity amid the chaos of inhumanity provide hope. But
> > small acts are insufficient."
> > -- Paul Rusesabagina
> TriLUG mailing list : http://www.trilug.org/mailman/listinfo/trilug
> TriLUG FAQ : http://www.trilug.org/wiki/Frequently_Asked_Questions