I'd look into the controller. The Promise PCI SATA cards are known to be
problematic with Linux. I don't know the details, but I have seen two of
them have problems with software RAID, at least. The problem can be very
sporadic, so it seems to fit your symptoms: the PHYRdyChg link resets and
ATA bus errors on ata3 in your syslog look more like a controller or link
problem than a failing disk. Google for the details.

Please let me know if/when you find a PCI SATA card that works!

Alex Dean wrote:
> I have a fileserver which I used to make backups on my local network.
> The 2 disks (Seagate 640GB, purchased new about 6 months ago) are set up
> with software RAID1. Each disk has 4 partitions, assembled into 4 RAID1
> devices.
>
> On Monday, I lost power to the machine. (Power cord yanked out, not
> power failure.) I powered the machine back on, and it started up
> without issues. About an hour later, I got emails from mdadm saying
> that 2 of my 4 RAID devices had failed.
>
> I assumed disk had not actually physically failed, since 2 of my 4
> devices were still online and (apparently) fine. I removed the failed
> partitions from their arrays, and then re-added them. mdadm happily
> re-synced them, and everything seemed fine. Now, a few days later, the
> same 2 partitions (md0, md1) are again saying they have failed, and md3
> has also failed. The failures didn't happen at the same time.
>
> I would appreciate some help troubleshooting here. What else should I
> do to investigate the problem partitions, or the disk itself? Prior to
> this little poweroff incident, this machine had been running without any
> issues for several months. I'm using an old workstation with a cheapo
> IDE SATA controller. (Maybe that's the problem?)
>
> *** Timeline ***
>
> 7/13 18:00 (roughly) : power yanked
> 7/13 22:00 (roughly) : power restored
> 7/14 9:24 : email from mdadm. sda5 (md1) failed.
> 7/14 9:29 : email from mdadm. sda1 (md0) failed.
> 7/15 morning (can't find the log entry to get exact time) : remove/add
> sda5 and sda1. resynced w/o issues.
> 7/15 20:20 : email from mdadm. sda1 (md0) failed.
> 7/16 0:07 : email from mdadm. sda7 (md3) failed.
> 7/16 9:38 : email from mdadm. sda5 (md1) failed.
>
> *** Syslog Errors ***
> I notice errors like this in /var/log/syslog in a few places. I can try
> to track them all down if that's helpful.
>
> Jul 14 09:24:29 gringotts kernel: [50073.107428] ata3.00: exception
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 09:24:29 gringotts kernel: [50073.107509] ata3: SError: {
> PHYRdyChg }
> Jul 14 09:24:29 gringotts kernel: [50073.107550] ata3.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 14 09:24:29 gringotts kernel: [50073.107553] res
> d0/d0:d0:d0:d0:d0/00:00:00:00:00/d0 Emask 0x12 (ATA bus error)
> Jul 14 09:24:29 gringotts kernel: [50073.107639] ata3.00: status: { Busy }
> Jul 14 09:24:29 gringotts kernel: [50073.107670] ata3.00: error: { ICRC
> UNC IDNF }
> Jul 14 09:24:29 gringotts kernel: [50073.107714] ata3: hard resetting link
> Jul 14 09:24:35 gringotts kernel: [50078.590072] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 14 09:24:35 gringotts kernel: [50078.651141] ata3.00: configured for
> UDMA/100
> Jul 14 09:24:35 gringotts kernel: [50078.651190] end_request: I/O error,
> dev sda, sector 58604927
> Jul 14 09:24:35 gringotts kernel: [50078.651244] md: super_written gets
> error=-5, uptodate=0
> Jul 14 09:24:35 gringotts kernel: [50078.651257] raid1: Disk failure on
> sda5, disabling device.
> Jul 14 09:24:35 gringotts kernel: [50078.651260] raid1: Operation
> continuing on 1 devices.
> Jul 14 09:24:35 gringotts kernel: [50078.651353] ata3: EH complete
> Jul 14 09:24:35 gringotts kernel: [50078.653030] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 09:24:35 gringotts kernel: [50078.664063] sd 2:0:0:0: [sda] Write
> Protect is off
> Jul 14 09:24:35 gringotts kernel: [50078.664080] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 14 09:24:35 gringotts kernel: [50078.664184] sd 2:0:0:0: [sda] Write
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Jul 14 09:24:35 gringotts kernel: [50078.677073] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.677094] --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.677103] disk 0, wo:1, o:0,
> dev:sda5
> Jul 14 09:24:35 gringotts kernel: [50078.677109] disk 1, wo:0, o:1,
> dev:sdb5
> Jul 14 09:24:35 gringotts kernel: [50078.679371] RAID1 conf printout:
> Jul 14 09:24:35 gringotts kernel: [50078.679389] --- wd:1 rd:2
> Jul 14 09:24:35 gringotts kernel: [50078.679399] disk 1, wo:0, o:1,
> dev:sdb5
> Jul 14 09:24:35 gringotts mdadm[3495]: Fail event detected on md device
> /dev/md1, component device /dev/sda5
>
> Jul 14 19:02:05 gringotts kernel: [84728.741110] ata3: exception Emask
> 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 14 19:02:05 gringotts kernel: [84728.741190] ata3: SError: {
> PHYRdyChg }
> Jul 14 19:02:05 gringotts kernel: [84728.741234] ata3: hard resetting link
> Jul 14 19:02:11 gringotts kernel: [84734.530052] ata3: link is slow to
> respond, please be patient (ready=-19)
> Jul 14 19:02:15 gringotts kernel: [84738.790046] ata3: COMRESET failed
> (errno=-16)
> Jul 14 19:02:15 gringotts kernel: [84738.790108] ata3: hard resetting link
> Jul 14 19:02:16 gringotts kernel: [84739.600066] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 14 19:02:16 gringotts kernel: [84739.660599] ata3.00: configured for
> UDMA/100
> Jul 14 19:02:16 gringotts kernel: [84739.660627] ata3: EH complete
> Jul 14 19:02:16 gringotts kernel: [84739.661665] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 14 19:02:16 gringotts kernel: [84739.662300] sd 2:0:0:0: [sda] Write
> Protect is off
> Jul 14 19:02:16 gringotts kernel: [84739.662315] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 14 19:02:16 gringotts kernel: [84739.663229] sd 2:0:0:0: [sda] Write
> cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Jul 16 09:38:16 gringotts kernel: [223700.170900] ata3.00: exception
> Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
> Jul 16 09:38:16 gringotts kernel: [223700.170985] ata3: SError: {
> PHYRdyChg }
> Jul 16 09:38:16 gringotts kernel: [223700.171029] ata3.00: cmd
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jul 16 09:38:16 gringotts kernel: [223700.171033] res
> d0/d0:d0:d0:d0:d0/d0:d0:d0:d0:d0/d0 Emask 0x12 (ATA bus error)
> Jul 16 09:38:16 gringotts kernel: [223700.171120] ata3.00: status: { Busy }
> Jul 16 09:38:16 gringotts kernel: [223700.171151] ata3.00: error: { ICRC
> UNC IDNF }
> Jul 16 09:38:16 gringotts kernel: [223700.171197] ata3: hard resetting link
> Jul 16 09:38:22 gringotts kernel: [223705.600073] ata3: SATA link up 1.5
> Gbps (SStatus 113 SControl 310)
> Jul 16 09:38:22 gringotts kernel: [223705.660564] ata3.00: configured
> for UDMA/100
> Jul 16 09:38:22 gringotts kernel: [223705.660609] end_request: I/O
> error, dev sda, sector 58604927
> Jul 16 09:38:22 gringotts kernel: [223705.660667] md: super_written gets
> error=-5, uptodate=0
> Jul 16 09:38:22 gringotts kernel: [223705.660678] raid1: Disk failure on
> sda5, disabling device.
> Jul 16 09:38:22 gringotts kernel: [223705.660682] raid1: Operation
> continuing on 1 devices.
> Jul 16 09:38:22 gringotts kernel: [223705.660771] ata3: EH complete
> Jul 16 09:38:22 gringotts kernel: [223705.660902] sd 2:0:0:0: [sda]
> 1250263728 512-byte hardware sectors (640135 MB)
> Jul 16 09:38:22 gringotts kernel: [223705.660954] sd 2:0:0:0: [sda]
> Write Protect is off
> Jul 16 09:38:22 gringotts kernel: [223705.660962] sd 2:0:0:0: [sda] Mode
> Sense: 00 3a 00 00
> Jul 16 09:38:22 gringotts kernel: [223705.661043] sd 2:0:0:0: [sda]
> Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
>
> *** Background Info ***
> root@gringotts:~# uname -a
> Linux gringotts 2.6.27-11-server #1 SMP Thu Jan 29 20:19:41 UTC 2009
> i686 GNU/Linux
>
> root@gringotts:~# cat /proc/mdstat
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10]
> md3 : active raid1 sda7[2](F) sdb7[1]
>       593874752 blocks [2/1] [_U]
>
> md2 : active raid1 sda6[0] sdb6[1]
>       1951744 blocks [2/2] [UU]
>
> md1 : active raid1 sda5[2](F) sdb5[1]
>       19534912 blocks [2/1] [_U]
>
> md0 : active raid1 sda1[2](F) sdb1[1]
>       9767424 blocks [2/1] [_U]
>
> unused devices:
>
>
> root@gringotts:~# mount
> /dev/md0 on / type ext3 (rw,relatime,errors=remount-ro)
> /dev/md1 on /home type ext3 (rw,relatime,user_xattr)
> /dev/md3 on /opt/backup type ext3 (rw,relatime,user_xattr)
>
> NOTE /dev/md2 is swap space.
>
> root@gringotts:~# mdadm --detail /dev/md0
> /dev/md0:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:22 2009
>      Raid Level : raid1
>      Array Size : 9767424 (9.31 GiB 10.00 GB)
>   Used Dev Size : 9767424 (9.31 GiB 10.00 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 10:20:08 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
>
>            UUID : e002f922:2f6ba792:a34321e9:b82a201b
>          Events : 0.10924
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       17        1      active sync   /dev/sdb1
>
>        2       8        1        -      faulty spare   /dev/sda1
>
>
> root@gringotts:~# mdadm --detail /dev/md1
> /dev/md1:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:32 2009
>      Raid Level : raid1
>      Array Size : 19534912 (18.63 GiB 20.00 GB)
>   Used Dev Size : 19534912 (18.63 GiB 20.00 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 10:20:21 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
>
>            UUID : 799faf28:c7ea5c83:21a2963b:0dddc562
>          Events : 0.1602
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       21        1      active sync   /dev/sdb5
>
>        2       8        5        -      faulty spare   /dev/sda5
>
> root@gringotts:/var/log# mdadm --detail /dev/md2
> /dev/md2:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:40 2009
>      Raid Level : raid1
>      Array Size : 1951744 (1906.32 MiB 1998.59 MB)
>   Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 2
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 07:23:32 2009
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : d78f4817:88606cc2:cd273dbb:85d5ca20
>          Events : 0.18
>
>     Number   Major   Minor   RaidDevice State
>        0       8        6        0      active sync   /dev/sda6
>        1       8       22        1      active sync   /dev/sdb6
>
>
> root@gringotts:/var/log# mdadm --detail /dev/md3
> /dev/md3:
>         Version : 00.90
>   Creation Time : Sun Feb 22 10:07:47 2009
>      Raid Level : raid1
>      Array Size : 593874752 (566.36 GiB 608.13 GB)
>   Used Dev Size : 593874752 (566.36 GiB 608.13 GB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 3
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Jul 16 01:43:28 2009
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 1
>   Spare Devices : 0
>
>            UUID : 44f192d1:72c1a98b:1027f96a:5aea9347
>          Events : 0.2524
>
>     Number   Major   Minor   RaidDevice State
>        0       0        0        0      removed
>        1       8       23        1      active sync   /dev/sdb7
>
>        2       8        7        -      faulty spare   /dev/sda7
>

--
-Eric 'shubes'

---------------------------------------------------
PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss