[H-GEN] My SATA, motherboard nightmare

Brendon Higgins blhiggins at gmail.com
Sun Jan 28 21:14:59 EST 2007


Hi everyone,

Here's a problem I've been having with SATA hard drives, filesystem
corruption, and my ASUS A8N-SLI motherboard. It's a bit lengthy, but
it details pretty much everything I've tried (which is quite a bit). I'm
looking for confirmation (or otherwise) that what I've done and the
conclusions I've come to are sane. This one has been driving me nuts.

I've been using Debian on this particular machine for just over a year
now with no problems (at least, not with the hardware). But just
recently I ran into all sorts of strife with my SATA hard drives.

It started after I moved the guts of my machine into a new case. I
also changed the PSU at the same time. What I get are a bunch of
errors complaining about timeouts and SATA bus errors (see end for
examples). These come up at somewhat random times, often during
startup. They seem more probable and occur more frequently the more a
particular disk is used. I let this happen a few times while I was
wondering what was wrong, and eventually it caused filesystem
corruption, lost /sbin/init, and made Linux unbootable.

There are a bunch of peculiarities with this. First is that it seems
to be worst at the start of the day, or after the machine has been
shut down for a couple of hours. Things tend to be okay once it's been
up and running without incident for a while. (If one of those errors
does occur, though, things only tend to get worse.)

I tested the hard drives (and cables, too) on another system (which
had a different PSU). They seemed fine, with the caveat that I had to
jumper the drives to force them to SATA 1.5Gbps mode so that the
motherboard would detect them. I expected that, though; it's a VIA.
Doing this, in Knoppix, I was able to salvage most of my files. The
drives normally run at 3.0 Gbps.
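
One way to confirm which speed a link actually negotiated, with or
without the jumper, is to look for libata's "SATA link up" lines in the
kernel log (the same lines that appear in the error examples at the end
of this message). A minimal sketch, assuming that message format;
link_speed is a hypothetical helper name, not an existing tool:

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<port> <speed>" for every "SATA link up" line it finds.
# Assumes the libata wording seen in the logs below, e.g.
#   ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
link_speed() {
    sed -n 's/^.*\(ata[0-9][0-9]*\): SATA link up \([0-9.][0-9.]*\) Gbps.*$/\1 \2/p'
}
```

Typical use would be something like "dmesg | link_speed". With the
jumper fitted, the speed column should read 1.5 rather than 3.0.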

At this point I ruled out the hard drives as culprits (not crashed,
worked on another system), the case and PSU (come on, really?), and
the kernel and drivers (nothing had changed, and several different
bootable CDs with different kernel versions exhibit the same thing).
So I was left with something on the motherboard failing.

Thinking this was where the problem lay, I went to warranty the
motherboard, but got it sent back to me with the message "No fault
found. Linux doesn't support SATA natively. It's probably the drivers
you are using." (They also said that they wouldn't warranty it anyway
because the place I had bought it from was no longer part of their
franchise.)

You can *only begin* to imagine how pissed-off I was.

Admittedly I hadn't tried it in anything other than Linux. I have a
separate partition of WinXP for games, so after making sure I had a
backup of that partition, I gave it a try. I noticed that the first time
Windows failed to start - just stuck at that stupid progress bar (the
one that doesn't actually show the progress). Usually it boots in a
couple of seconds (because I have almost no daemons running on it, I
guess). This happened once or twice, but more often than not Windows
booted okay and ran fine. I'm not sure if that was just because I
don't hammer the HD nearly as much in Windows, luck, or whatever.

After much more general stuffing about, I seem to have discovered that
things are a whole lot more stable with the drives locked at 1.5 Gbps.
I have not experienced the problem on either drive since putting those
jumpers in yesterday morning (still keeping fingers crossed). To test
it I tried this:
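# Ten passes; each appends 100 MiB of random data, then 1000 MiB of
# zeros (dd's default block size is 512 bytes).
for x in 1 2 3 4 5 6 7 8 9 0; do
    dd if=/dev/urandom of=largetestfile oflag=append conv=notrunc count=200k
    dd if=/dev/zero of=largetestfile oflag=append conv=notrunc count=2000k
done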
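
A write-only loop like this mostly provokes the kernel errors; it does
not show whether the data that landed on disk is intact. A minimal
sketch of a read-back check follows, assuming a POSIX shell with dd and
cksum available. verify_roundtrip and its arguments are hypothetical
names introduced here for illustration:

```shell
# Hypothetical helper: write random data, copy it, and compare checksums.
#   $1 = source file to create, $2 = copy to create, $3 = number of 512-byte blocks
# Returns 0 if both files have the same checksum, 1 if they differ,
# 2 if writing or copying failed outright.
verify_roundtrip() {
    src=$1
    copy=$2
    blocks=$3
    dd if=/dev/urandom of="$src" bs=512 count="$blocks" 2>/dev/null || return 2
    cp "$src" "$copy" || return 2
    sync    # ask the kernel to flush dirty pages to the drive
    sum_src=$(cksum < "$src")
    sum_copy=$(cksum < "$copy")
    [ "$sum_src" = "$sum_copy" ]
}
```

For example: verify_roundtrip testfile.a testfile.b 204800 && echo OK.
Note that the read-back may be served from the page cache rather than
the drive. For a stricter test the file should be much larger than RAM,
or the cache should be dropped first (on Linux 2.6.16 and later, as
root, by writing 3 to /proc/sys/vm/drop_caches after a sync).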

So this is what I'm left with as the possible culprits:
Hard drives: Not likely. I couldn't have lost both drives to the same
problem at the same time with the same faultless behaviour at 1.5
Gbps, surely. The odds against must be huge.
Case: Could it be that the unusual layout of the case (an Antec P180)
has anything to do with it? Could it be causing some kind of
interference? I don't think that's likely, either.
PSU: Might the new PSU be causing this? I would think an Antec
Truepower Trio would be pretty reliable.
Motherboard: I still think this is the most likely explanation.
Somehow the SATA controller has become brain damaged and can't handle
SATA 3.0 Gbps anymore. Screw whatever the warranty tech guys said -
they lost all respect with the "doesn't support SATA" line.

I'm looking for opinions. Am I right? Do I have any idea what I'm
talking about? Is there something I've missed that I ought to try?

I'm seriously considering just leaving the machine as it is at 1.5
Gbps for a while, probably until it dies totally, and then I'll just
do an upgrade cycle. I really don't feel like fighting with this
anymore. Can I expect it to last as it is, or should gradual
degradation of stability be expected?

If you have no idea what to suggest after all this, well, I hope this
has at least been entertaining for you.

Peace,
Brendon


Examples of errors seen - ones that I've managed to record:
1
ata1: command 0x25 timeout, stat 0xd0 host_stat 0x1
ata1: translated ATA stat/err 0x25/00 to SCSI SK/ASC/ACSQ 0x4/00/00
ata1: status=0x25 { DeviceFault CorrectedError Error }
sd 2:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key=0x4
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sda, sector 407006998
Buffer I/O error on device sda2, logical block 330826768
ATA abnormal status 0xD0 on port 0x9F7
ATA abnormal status 0xD0 on port 0x9F7
ATA abnormal status 0xD0 on port 0x9F7
ata1: command 0x25 timeout, stat 0xd0 host_stat 0x1
ata1: command 0x25 timeout, stat 0xd0 host_stat 0x1
ata1: command 0x25 timeout, stat 0xd0 host_stat 0x1
ata1: command 0x25 timeout, stat 0xd0 host_stat 0x1
ata1: command 0x25 timeout, stat 0xd0 host_stat 0x1

2
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: limiting speed to UDMA/100
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/100
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: limiting speed to UDMA/66
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2 frozen
ata1.00: (BMDMA stat 0x1)
ata1.00: tag 0 cmd 0x35 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: port failed to respond (30 secs, Status 0xd0)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/66
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: limiting speed to UDMA/44
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/44
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: limiting speed to UDMA/33
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x2
ata1.00: (BMDMA stat 0x0)
ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
ata1: soft resetting port
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/33
ata1: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
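
For long captures like the second example, it can help to tally how
often each failure reason shows up per device. A minimal sketch,
assuming the "ataN.NN: tag ... (reason)" line format shown above;
tally_ata_errors is a hypothetical helper name:

```shell
# Hypothetical helper: read kernel log text on stdin and count the
# parenthesised failure reasons (e.g. "ATA bus error", "timeout")
# per device, based on lines such as
#   ata1.00: tag 0 cmd 0x35 Emask 0x10 stat 0x51 err 0x84 (ATA bus error)
tally_ata_errors() {
    sed -n 's/^.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): tag .* (\(.*\))$/\1 \2/p' |
        sort | uniq -c
}
```

Typical use would be something like "dmesg | tally_ata_errors", which
prints one line per device and reason, prefixed by a count.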



