[H-GEN] Linux file server
Russell Stuart
russell at stuart.id.au
Fri Jul 2 03:41:56 EDT 2004
David Jericho wrote:
> Russell Stuart wrote:
> If your system does heavy disk IO, I highly recommend a SCSI style
> controller, or a second processor.
As it happens, all the Dells do have at least two processors.
They aren't small machines.
According to Dell's own internal benchmarks, you are better
off, speed-wise, on a file server without a Raid card. This
shouldn't have been a surprise, but it was to me. The main
CPUs are each roughly an order of magnitude faster than the
single CPU on the Raid card - possibly because there is no
easy way to get 100 watts to a PCI slot. And they have an
order of magnitude more memory to play with. The only down
side is that it isn't battery backed up RAM, so write
caching isn't as effective. Still, I wasn't the only one
on the list surprised by Dell's admission, going by the
responses posted.
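As an aside, if you want a rough feel for the CPU argument,
you can time how quickly the host CPU chews through the XOR
that Raid-5 parity needs. The little Python sketch below is
purely illustrative - the chunk size, drive count and round
count are numbers I made up, not anything out of Dell's
benchmark - but the MB/s figure it prints gives you something
to hold against whatever a Raid card's spec sheet claims:

    import time

    CHUNK = 64 * 1024   # 64 KiB per drive; an illustrative stripe chunk size
    DRIVES = 4          # data drives contributing to one parity block
    ROUNDS = 2000       # number of stripes to push through

    # Fake stripe data - on a real software Raid array this comes off the drives.
    chunks = [bytes([i + 1]) * CHUNK for i in range(DRIVES)]

    start = time.time()
    for _ in range(ROUNDS):
        # Raid-5 parity is just the XOR of the data chunks.  Treating each
        # chunk as one big integer keeps the XOR in C, which is the closest
        # pure Python gets to the optimised XOR loops the md driver uses.
        parity = 0
        for chunk in chunks:
            parity ^= int.from_bytes(chunk, "little")
        parity_block = parity.to_bytes(CHUNK, "little")
    elapsed = time.time() - start

    mib = DRIVES * CHUNK * ROUNDS / float(1024 * 1024)
    print("XORed %.0f MiB of stripe data in %.2f s (%.0f MiB/s)"
          % (mib, elapsed, mib / elapsed))

The point isn't the exact number - it is that the parity maths
is cheap next to everything else the box is doing, so a fast
host CPU has no trouble wearing the software Raid overhead.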
>> 2. Software Raid is more reliable than hardware Raid. This
>> is primarily because of bugs in Raid cards and firmware
>> bugs/incompatibilities in the SCSI disk drives.
> I beg to differ here. If the research has been done properly, using
> vendor recommended hardware, hardware RAID is as good as, if not better
> than software RAID. I have had corrupted software RAID arrays, as well
> as corrupted hardware RAID arrays.
I have used Raid controllers from three different vendors. In
one case the Raid controller card failed. I don't know why - I
will put it down to a hardware failure. The array was lost.
In another case the SCSI disk drives (IBMs) had a firmware
bug which corrupted the SCSI bus and destroyed the array.
After a year or so IBM issued a firmware upgrade for the
drives and all was well.
In a third case, the Adaptec SCSI host controller had a firmware
bug. The bug was triggered by an incompatibility with the
Seagate U320 drives. This led to a protocol error on the SCSI
bus. I am not sure whose fault the protocol error was - no
one has owned up. I am quoting Adaptec when I say it was
triggered by the Seagate U320 drives (true, it didn't happen
with other drives, but that hardly makes it Seagate's fault,
and Seagate certainly didn't think it was). Maybe a bad
cable or connector was stuffing things up - who knows. But
the problem continued after everything, including the disk
drives, was replaced. A second identical machine runs
without a problem.
To continue the story: the protocol error caused the host
controller to take recovery actions, which included
flushing all cached data in the battery backed up RAM to
disk. The flush happened in a high priority process
within the host controller - so high that a priority
inversion occurred, locking out the task that handled
the comms over the SCSI bus to the CPU. The "bug" has
to do with this flush. Either it wasn't supposed to
happen, or it wasn't supposed to happen at this priority,
or it was just taking too long. I never did get out
of Adaptec what it actually was (and, for that matter,
still is). The lock-out caused timeouts in the OS driver
supplied by Adaptec, which in turn led to disk corruption.
I don't know why it led to corruption, but it did happen
several times. The "fix" currently being issued by Adaptec
is to alter the driver in the OS to hide the long time the
controller was sometimes taking to do a SCSI reset - I kid
you not.
> You're way off base here, even with the qualification of your original
> post. The likelihood of a single drive death in a RAID array increases
> in a linear fashion with respect to the number of drives in an array. The
> likelihood of a single drive death resulting in lost or corrupted data
> on a properly designed RAID array however decreases faster than linear.
>
> A paranoid RAID 1 controller will read both disks, and compare the
> results informing you if they differ. For performance reasons, some RAID
> controllers may skip this step.
Here we will have to differ. Every Raid array I have had
has failed. And it's not as if I haven't had a number of
Raid arrays, from a number of vendors, each one more expensive
than the last. (I wonder why that was :).) They have failed
multiple times. I have had single HDDs fail as well, of
course. But not a lot in comparison to the number I have
deployed. So we have:
Total Raid array failure per array owned: 100%
Total HDD failures per HDD owned: 5% (?)
If you look at the failure modes I describe above, the
failures in the Raid arrays were not hardware failures; they
were caused by design errors / bugs / whatever you want
to call them. What you describe above is how Raid arrays
are meant to cope with the failure modes the designers
anticipated - primarily failures in the HDDs. And yes,
I agree they do this very well. The issue is, I have not
had a single hardware failure of an HDD in a Raid array.
Not one. HDDs just don't fail that often.
Design errors in any complex system are almost a
certainty. It requires a huge amount of effort to get
rid of them, and even that only works if you have
control over the environment the software is going to
be used in. The shuttle software was one example of
where those two conditions were met: the effort was
expended, and they had control of the environment. For
the rest of us, who don't have the sort of resources the
shuttle team had, we do our best and then give the
result to the users. If we have a lot of users, they
will supply the huge man-power needed to do the testing
over all the possible configurations, and if we are
careful to fix all the problems they report, we will
end up with a product fairly free of design errors.
The 2.2 Linux kernel is an example of this process.
Even in that case, once a user has got his system
going well, if he is savvy he will change as little
as possible (i.e. keep control of the environment) to
ensure the system stays reliable. This is why many
sys admins loathe Microsoft's automatic upgrade
feature. Walk out of the building one night leaving
a working system. Walk in the next morning to
discover Microsoft's latest automatically installed
patch was incompatible with a previous hot-fix to SQL
Server. Just wonderful. The mere thought brings on
heart palpitations. Rule 1: if you want reliable
systems, resist change. If you can't resist change,
then manage it - install on a dummy system, and test
and re-test.
Raid arrays occupy a bad space. They are complex. Users
demand the best performance, so they are forever changing.
Their makers can't control the drives you use, the
electrical noise near the SCSI bus/cable, and probably a
whole host of other things - so they don't have control
of the environment. And finally they don't sell a lot of
them, so they don't have a lot of users to foist the
testing onto. In other words, the damned things are
always going to have design errors.
I grant you this is just a long-winded post-justification
of why I have had so many Raid failures. But it does
explain why a single HDD is more reliable than a Raid
array. They are simple. They ship millions of them.
And usually the changes they make are just a matter of
scale.
This will hold while Moore's law still applies. Moore's
law is the base driver behind the continual churn we
have in computer design. When Moore's law ends and
the design of a Raid card stabilises for 20 years, I have
no doubt they will be rock solid. Unfortunately I will be
retired by then.
> I would be very interested to see your methodology for coming to this
> conclusion. If your application or operating system is corrupting data
> then you've got bigger issues to worry about than a drive dying.
You and I live in different worlds, it seems. I have had
CVS corrupt my sources, system admins do rm -r's from
root, database admins carefully test their query on 100
rows then find the real run destroys 100,000 of them,
bugs in ext3, bugs in cpio, users delete emails they need
back now, databases exceed the system's maximum file size
and self destruct, and god knows what else. In my world
users and buggy programs cause many more failures than
dying HDDs.
>> If you are mirroring the damage is reflected on all drives immediately.
>
>
> RAID is not a backup solution, it is an uptime solution, and a risk
> mitigation solution.
Yes, I agree. I believe we were discussing a backup
solution, and I was saying: don't use Raid.
However, this is probably not what caused you to reply. I
was also saying that in small systems (read: where the
data will happily live on one spindle), Raid causes a
net reduction in uptime as well.
One of the reasons for this is that when the array is
corrupted, it takes a lot of effort to fix. I am not
talking about a single HDD failure - Raid handles that
very well. I am talking about the data on the array
being damaged.
Regardless of whether your data is stored on a single drive
or a Raid array, you have to rip out and replace the faulty
component. You don't dare leave it in place in case it
happens again. Ripping out a single drive is quick and
easy, very easy in comparison to replacing a Raid array.
You then drop the dead drive in the nearest bin (or pull
it apart like I do and remove the cool rare-earth
magnets). You can't do that with the Raid array: it's
too expensive, it has to be repaired.
And then, of course, you have to have a spare. If you
don't have a spare HDD you go down to your nearest
shop and buy one. If your Raid card suffers that rarest
of things - a hardware failure - and the technician zaps
the spare, you ring up the shop and ask for a new one.
They say: sorry, no, they don't make them any more.
So you say, "well, is there another card that uses the
same disk format?" And they say: "err - I don't think
so". Then you say: "but I need this working again
today, I think I can recover the data on the drives
if I can read them". And they say: "sorry sir". So you
buy a single big HDD, restore yesterday's backup, and
wonder what you did wrong in a past life. True story.
> Quite simply you cannot talk about data reliability and integrity
> without talking about RAID. 30 minutes with a notepad and a pen using
> high school mathematics will illustrate that. It's not the only
> solution, but it does make up part of the solution.
We must use different notepads and pens!
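For what it is worth, here is my 30 minutes with a notepad
reduced to a few lines of Python. Every number in it is an
assumption plugged in for illustration: a 5% annual drive
failure rate (my guess from above), a one day rebuild window,
and a guess at how often the array electronics / firmware /
driver eats the data through a design error. The high school
maths is right as far as it goes - mirroring makes loss from
drive deaths vanishingly rare. The catch is the last term: if
the array itself fails anywhere near as often as a single
drive does, the mirror buys you nothing overall.

    # Back-of-the-envelope Raid-1 data-loss odds.  Every number below
    # is an assumption picked for illustration - plug in your own.

    P_DRIVE_YEAR = 0.05   # chance a given drive dies in a year (my ~5% guess)
    REBUILD_DAYS = 1.0    # how long the mirror runs degraded after a drive dies
    P_ARRAY_YEAR = 0.05   # chance the controller/firmware/driver eats the
                          # array through a design error - pure assumption

    # Single drive: you lose the data if the drive dies.
    p_single = P_DRIVE_YEAR

    # Two-way mirror: from drive deaths alone, you roughly only lose the
    # data if the second drive dies inside the rebuild window that follows
    # the first death.
    p_mirror_drives = 2 * P_DRIVE_YEAR * (P_DRIVE_YEAR * REBUILD_DAYS / 365.0)

    # ...but the array electronics/firmware sits in front of both drives,
    # so it is a single point of failure all of its own.
    p_mirror_total = p_mirror_drives + P_ARRAY_YEAR

    print("single drive, loss per year:      %.4f" % p_single)
    print("mirror, drive deaths only:        %.6f" % p_mirror_drives)
    print("mirror, including array faults:   %.4f" % p_mirror_total)

Run it and the drive-death term all but vanishes, exactly as
you say - and then whatever figure you believe for the array
eating itself dominates the answer.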
> To talk of serious data integrity and storage on a single spindle is
> irresponsible at best.
>
Ahh yes. You have got to the heart of the matter. I used
to think this too, and I suspect most people would agree
with you. But over the years I have come to believe it is
wrong. So I decided to spell out my reasons for believing
this in a post to H-GEN, in the hopes that others would not
make the same mistakes I did. I hope my reasoning is clear,
even if it isn't convincing.