SATA vs. SCSI reliability
Here's a guy who discusses SATA
vs. SCSI disk reliability. Short conclusion: actual disk failures
(MTBF) are almost exactly as likely with cheap SATA disks as expensive SCSI
disks. But the bit error rate of SATA is much higher. In other
words, the likelihood of not being able to read a sector because it got
corrupted on SATA is vastly higher than SCSI. By his calculation, on a 1TB
disk, you have about a 56% chance of not being able to read every single
sector, which means rebuilding your RAID correctly in the case of a failed
disk is usually impossible.
It's true, and I learned this the hard way myself: we ran into exactly this problem at NITI. Back
in the days when we introduced our software RAID support, typical disk sizes
were around 20 GB, about 50x smaller than they are now. (Wow!) The bit
error rates now are about the same as they were then, which means,
assuming the failure percentage declines linearly(1), about a
1.1% failure rate in recovering a RAID.
In general, that 1.1% failure rate isn't so bad. Remember, it's 1.1% on top
of the rather low chance that your RAID failed in the first place, and even
then it doesn't result in total data loss - just some inconvenience and the
loss of a sector here or there. Anyway, the failure rate was small enough
that nobody knew about it, including us. So when we had about a 1.1% rate
of weird tech support cases involving RAID problems, we looked into it, but
blamed it on bad luck with hard drive vendors.
By the time disks were 200GB and failure rates were more like 10%, we were
having some long chats with those hard drive vendors. Um, guys? Your
disks. They're dropping dead at a pretty ridiculous pace, here.
You see, we were still proceeding under the assumption that IDE disks are
either entirely good, or they're bad. That is, if you get a bad sector on
an IDE disk, it's supposed to be the beginning of the end. That's because
modern disks have a spare
sector remapping feature that's supposed to automatically (and silently)
stop using sectors when the disk finds that they're bad. The problem,
though, is it has to discover this at write time, not at read
time. If you're writing a sector, you can just read it back, make sure it
was written correctly, and if not, write it to your spare sector area. But
if you read it back and it fails the checksum - what then?
This is the "bit error rate" problem. It's not nice to think about, but
modern disks just plain lose data over time. You can write it one day, and
read-verify it right afterwards without a problem, and then the data can be
missing again tomorrow. Ugh.
And the frequency - per bit - with which this happens is the same as ever.
With SCSI it's less than with SATA, but as we have more bits per disk, the
frequency per disk is getting ridiculous. A 56% chance that you
can't read all the data from a particular disk now.
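To make the per-bit versus per-disk distinction concrete, here is a minimal sketch of the arithmetic, assuming bit errors are independent. The function name and the error rate in the loop are placeholders I made up for illustration, not figures from this post; plug in whatever your drive's datasheet claims.

```python
import math

def p_unreadable(disk_bytes, bit_error_rate):
    """Chance that at least one bit fails when reading the whole disk once.

    Assumes every bit fails independently with the same probability.
    """
    bits = disk_bytes * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 so tiny rates don't round to zero
    return -math.expm1(bits * math.log1p(-bit_error_rate))

# Hypothetical per-bit error rate, chosen only to show the trend.
ASSUMED_RATE = 1e-14

for size_gb in (20, 200, 1000):
    chance = p_unreadable(size_gb * 10**9, ASSUMED_RATE)
    print(f"{size_gb} GB: {chance:.2%} chance of at least one unreadable bit")
```

Whatever rate you pick, the per-bit number never changes; only the number of bits you multiply it by does.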
There are two reasons you probably haven't heard about this problem. First,
you probably don't run a RAID. Let's face it, if your home disk has a
terabyte of stuff on it, you just probably aren't accessing all that data.
Most of the files on your disk, you will probably never access again.
Face it! It's true. If you filled up a 1TB disk, you probably filled it
with a bunch of movies, and most of those movies suck and you will never
watch them again. Statistically speaking, the part of your disk that loses
data is probably in the movies that suck, not the movies that are good,
simply because the vast majority of movies suck.
But if you're using a RAID, you occasionally need to read the entire disk,
so the system will find those bad sectors, even in files you don't
care about. Maybe the system will be smart enough not to report those bad
sectors to you, but it'll find them.
Secondly, even when you lose a sector here and there, you usually don't even
care. Movies, again: MPEG streams are designed to recover from occasional
data loss, because they're designed to be delivered over much less reliable
streams than hard disks. What happens if you get a corrupt blob of data in
your movie? A little sprinkle of digital junk on your screen. And within a
second or so, the MPEG decoder hits another keyframe and the junk is gone.
Whatever, just another decoder glitch, right? Maybe. Maybe not. But you
don't really care either way.
The Solution
At NITI, we eventually settled on a clever solution to this that won't lose
data on RAIDs. Of course we can't protect data on a non-RAID disk in any
direct sense, but we strongly recommended that our customers do frequent incremental backups instead.
But on a RAID, the problem is actually easier: simply catch the problem
before a disk fails. In the background, we would be constantly, but
slowly, reading through the contents of all your disks, averaging about one
pass per week. If your RAID is still intact but we find a bad sector, no
data has been lost yet: the other disks in the RAID can still be used to
reconstruct it. So that's exactly what we would do! Reconstruct the bad
sector, and write it back to the failing disk, which could then
automatically use its sector remapping code to make the bad sector disappear
forever.
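The scrubbing idea above can be sketched roughly like this. It is not the NITI code (that was never released); it's a toy model that assumes an XOR-parity layout, and FakeDisk, xor_blocks and scrub are all names I invented for the example. A real version would also throttle itself so a full pass takes days, not seconds.

```python
from functools import reduce

class FakeDisk:
    """Hypothetical in-memory stand-in for a disk; sectors in `bad` fail on read."""
    def __init__(self, sectors):
        self.sectors = list(sectors)
        self.bad = set()

    def read(self, n):
        if n in self.bad:
            raise IOError(f"unreadable sector {n}")
        return self.sectors[n]

    def write(self, n, data):
        # On a real drive, this write is what triggers the spare-sector remap.
        self.sectors[n] = data
        self.bad.discard(n)

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def scrub(disks, sector_count):
    """Read every sector of every disk once; rebuild and rewrite any that fail.

    Returns the list of (disk index, sector) pairs that were repaired.
    """
    repaired = []
    for n in range(sector_count):
        for i, disk in enumerate(disks):
            try:
                disk.read(n)
            except IOError:
                # With XOR parity, a missing block is the XOR of the same
                # sector on all the other disks. If one of those reads also
                # fails, the IOError propagates: that sector is truly lost.
                others = [d.read(n) for j, d in enumerate(disks) if j != i]
                disk.write(n, xor_blocks(others))
                repaired.append((i, n))
    return repaired
```

The point of the design is in the except branch: the repair only works while the other disks can still supply that sector, which is why you want to run it before anything has actually died.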
The read-reconstruction part was never open sourced, so if you want that,
you'd have to write it yourself. Luckily, it was easy, and now that we have
ionice you don't have to be
nearly as careful to do it slowly in the background.
The other part was to make sure Linux's software RAID could recover in case
it ran into individual bad sectors. You see, they made the same bad
assumption that we did: if you get a bad sector, the disk is bad, so drop it
out of the RAID right away and use the remaining good disks. The problem is
that nowadays, every disk in the RAID is likely to have sector
errors, so it will be impossible to rebuild the RAID under that
assumption. Not only that, but throwing the disk out of the RAID is the
worst thing you can do, because it prevents you from recovering the
bad sectors on the other disks!
A co-worker of mine at the time, Peter
Zion(2), modified the Linux 2.4 RAID code to do something
much smarter: it would keep a list of bad sectors it found on each disk, and
simply choose to read that sector from the set of other disks whenever you
tried to read it. Of course it would then report the problem back to
userspace through a side channel, where we could report a warning
about your disks being potentially bad, and accelerate the background
auto-recovery process.
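I can't show you Peter's patch, but the general shape of the idea looks something like the sketch below. It is a userspace toy, simplified to a plain mirror rather than the real kernel code, and every name in it (SimDisk, MirrorWithBadSectorMap, on_warning) is made up for illustration.

```python
class SimDisk:
    """Hypothetical in-memory disk; sectors listed in `failing` raise on read."""
    def __init__(self, sectors, failing=()):
        self.sectors = list(sectors)
        self.failing = set(failing)

    def read(self, n):
        if n in self.failing:
            raise IOError(f"unreadable sector {n}")
        return self.sectors[n]

class MirrorWithBadSectorMap:
    """Mirror that remembers bad sectors per disk instead of ejecting the disk."""
    def __init__(self, disks, on_warning=print):
        self.disks = disks
        self.bad = [set() for _ in disks]  # known-bad sectors, one set per disk
        self.on_warning = on_warning       # stand-in for the userspace side channel

    def read(self, n):
        for i, disk in enumerate(self.disks):
            if n in self.bad[i]:
                continue  # already known bad here: go straight to another disk
            try:
                return disk.read(n)
            except IOError:
                # Remember the sector and warn, but keep the disk in the array:
                # its other sectors may be needed to cover errors elsewhere.
                self.bad[i].add(n)
                self.on_warning(f"disk {i}: sector {n} unreadable")
        raise IOError(f"sector {n} unreadable on every disk")
```

Note that a read only fails outright when every disk has lost the same sector, which is a much rarer event than any one disk having a bad sector somewhere.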
Sadly, while the code for this must be GPL as it modified the Linux kernel,
the old svn repository at svn.nit.ca seems to have disappeared. I imagine
it's an accident, albeit a legally ambiguous one. But I can't point you to
it right now.
I also don't know if the latest Linux RAID code is any smarter out of the
box than it used to be. As we learned, it used to be pretty darn dumb. But
I don't have to know; I don't work there anymore. Still, please feel free
to let me know if you've got some interesting information about it.
Footnotes
(1) Of course the failure rate is not exactly a linear function
of the disk size, for simple reasons of probability. The probability of a
1TB disk having no errors (0.44, apparently) is actually the same as
the probability that every disk in a set of 50 20GB disks has no errors.
The probability of no failures on any one disk is thus the 50th root of
that, or 98.4%. In other words, the probability of failure was more like
1.6% back in the day, not 1.1%.
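If you want to check that arithmetic yourself, here it is as a few lines of Python. The 0.44 and the factor of 50 come straight from the footnote above; the variable names are just mine.

```python
p_ok_1tb = 0.44   # probability a 1TB disk reads back with no errors (from the footnote)
ratio = 50        # one 1TB disk is about fifty 20GB disks

# Fifty independent 20GB disks all being clean has probability p**50,
# so the per-disk figure is the 50th root.
p_ok_20gb = p_ok_1tb ** (1 / ratio)
exact_failure = 1 - p_ok_20gb

# The lazy linear estimate used in the main text.
linear_failure = (1 - p_ok_1tb) / ratio

print(f"exact: {exact_failure:.1%}, linear: {linear_failure:.1%}")
# prints "exact: 1.6%, linear: 1.1%"
```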
(2) Peter is now a member of The
Navarra Group, a software contracting group which
may be able to solve your Linux kernel problems too. (Full disclosure: I'm
an advisor and board member at Navarra.)
Update 2008/10/08: Andras Korn wrote in to tell me that the
calculations are a bit wrong... by about an order of magnitude, if you
assume a 1TB RAID. Though many modern RAIDs will be bigger than that, since
the individual disks are around 1TB. Do with that information what you will
:)
Update 2008/10/30: I should also note that my calculations are wrong
because I misread what the Permabit guy calculated in the first place, and
didn't bother to redo the math myself. His calculations are
not off by an order of magnitude, although there is some disagreement
about whether the correct number is 0.44 or 0.56.
October 30, 2008 16:36