[H-GEN] Disk array about to croak?

Wed Sep 17 07:22:46 EDT 2014

On 17/09/2014 8:16 PM, Benjamin Fowler wrote:
> Hello all,
>
> I have a little HP Mediasmart server which I've redone with Debian. It 
> runs a 4-drive SATA disk array, which runs ext4 over LVM over 
> MD/softraid (raid 5). It's a neat little machine, which has been going 
> quite nicely for hosting all my media and network backups.
>
> Until now, that is. I've been noticing the sort of output pasted at the 
> end of this message in my daily logwatch emails.
>
> So what I _think_ is happening, is that the first disk in the array is 
> getting read errors. It hasn't failed out yet. Would I be right in 
> saying that the first disk is about to give up the ghost?
>
> (Guess it's time to start thinking about moving the root and boot 
> disks off the array -- this little server only has 4 disk controllers, 
> and all of them are for the disk array. If I lose the first drive, the 
> (headless!!) machine is basically toast until I can rebuild a network 
> installer and TFTP boot into a recovery disk image with a network 
> console :-/...)
>
>
> WARNING: Kernel Errors Present
>          res 41/40:00:58:94:43/00:00:1a:00:00/40 Emask 0x409 (media 
> error) <F> ...:  6 Time(s)
> ata1.00: error: { UNC } ...:  6 Time(s)
> end_request: I/O error, dev sda, sector ...:  1 Time(s)
> md/raid:md1: read error corrected (8 sec ...:  1 Time(s)
> sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocat 
> ...:  1 Time(s)
> sd 0:0:0:0: [sda]  Sense Key : Medium Error [current] [descr ...:  1 
> Time(s)
>
>  1 Time(s):         1a 43 94 58
>  1 Time(s):         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
>  1 Time(s): Descriptor sense data with sense descriptors (in hex):
>  2 Time(s): ata1.00: cmd 60/08:00:58:94:43/00:00:1a:00:00/40 tag 0 ncq 
> 4096 in
>  3 Time(s): ata1.00: cmd 60/08:08:58:94:43/00:00:1a:00:00/40 tag 1 ncq 
> 4096 in
>  1 Time(s): ata1.00: cmd 60/08:28:58:94:43/00:00:1a:00:00/40 tag 5 ncq 
> 4096 in
>  6 Time(s): ata1.00: configured for UDMA/133
>  5 Time(s): ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x0
>  1 Time(s): ata1.00: exception Emask 0x0 SAct 0x60 SErr 0x0 action 0x0
>  6 Time(s): ata1.00: failed command: READ FPDMA QUEUED
>  6 Time(s): ata1.00: irq_stat 0x40000008
>  6 Time(s): ata1.00: status: { DRDY ERR }
>  6 Time(s): ata1: EH complete
>  1 Time(s): raid5_end_read_request: 43 callbacks suppressed
>  1 Time(s): sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK 
> driverbyte=DRIVER_SENSE
>  1 Time(s): sd 0:0:0:0: [sda] CDB: Read(10): 28 00 1a 43 94 58 00 00 08 00
>  1 Time(s): sd 0:0:0:0: [sda] Unhandled sense code
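
All six failed reads in that log hit the same spot on sda. As a rough 
sketch of how to read it (the variable names are just mine, and I'm 
assuming 512-byte logical sectors, which fits the "ncq 4096 in" for an 
8-sector read), the LBA and request size can be pulled out of the 
Read(10) CDB like this:

```shell
# LBA bytes copied from the quoted line
#   "CDB: Read(10): 28 00 1a 43 94 58 00 00 08 00"
lba_hex=1a439458
# Transfer length from the same CDB (the "00 08" near the end), in sectors
count_hex=0008

# Convert hex to decimal with shell arithmetic
lba=$((0x$lba_hex))
count=$((0x$count_hex))

# Assumption: 512-byte logical sectors
echo "first bad LBA: $lba"
echo "request size:  $((count * 512)) bytes"
```

That comes out as LBA 440636504 and a 4096-byte request, the same address 
that shows up in the sense data ("1a 43 94 58") and in the ata1.00 cmd lines.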

Certainly not a healthy disk, Ben. I'd run a smartctl test on it to be 
sure, but I would bet money that the drive is on its way out. To run a 
quick test:

# smartctl --test=short /dev/sda

That should take a minute or two (smartctl prints the expected completion 
time when it starts the test). Then run:

# smartctl -a /dev/sda

That should display the full SMART report for the drive, including the 
attribute table and the self-test log.
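
The report is long, so here is a small sketch for picking out the lines 
that usually matter for a disk throwing media errors. The function name 
is made up, and the attribute names are an assumption: they are the 
common ones, but vendors differ, so check them against your own output. 
It assumes the usual "smartctl -A" table layout, where the attribute name 
is the 2nd column and the raw value is the 10th.

```shell
# Print "name raw_value" for the sector-health attributes.
# Reads `smartctl -A` style output on stdin. Assumed layout: attribute
# name in field 2, raw value in field 10. Attribute names vary by vendor.
bad_sector_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Usage (as root), assuming the suspect drive is /dev/sda as in the logs:
#   smartctl -A /dev/sda | bad_sector_attrs
```

Non-zero raw values there, especially ones that keep growing between 
runs, would back up the read errors in the kernel log.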

-- 
Snowy Angelique Maslov <snowy at snowy.org>
