LVM volumes have corrupt ext3 fs after replacing a failing drive in an underlying software raid 5 array
Hi all.
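We have a server running CentOS 4.7, kernel 2.6.9-78.0.5.ELsmp.

It has six 300 GB SAS drives that are configured as follows:
* /boot and / are in software raid 1 arrays with ext3 on them.
* A small software raid 0 array for speed with ext3 on it.
* Swap is in a raid 0 array. This is a leftover from a previous admin and it will be changed eventually.
* And a large software raid 5 array which houses LVM volumes that are formatted with ext3.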
We got some errors in the messages log sometime last week about "bad segments" on one of the drives. Running smartctl -t long /dev/sda confirmed that the drive wasn't happy. We decided to replace it before things got worse.

Here is the procedure we followed to replace the drive:

1) back up the failing disk's partition table:
    sfdisk -d /dev/sda > /etc/partitions.sda

2) fail and remove the disk from all of its arrays:
    mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
    ...

3) turn off swap, since it's in a swap array, and shut down the swap array:
    swapoff -a
    mdadm -S /dev/md4

4) tell the OS to kill the drive for hot swap and pull the bad drive:
    echo "scsi remove-single-device 0 0 0 0" > /proc/scsi/scsi

5) insert the new drive and tell the OS to find it:
    echo "scsi add-single-device 0 0 0 0" > /proc/scsi/scsi

6) create the partition table on the new disk:
    sfdisk /dev/sda < /etc/partitions.sda

7) add the drive back to all of its arrays:
    mdadm /dev/md0 --add /dev/sda1
    ...

8) re-create the swap array, format it, and turn swap back on:
    mdadm --create --verbose /dev/md4 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkswap /dev/md4
    swapon -a

Everything seemed fine. All arrays synced, data was accessible, so we went home and went to bed.

Sadly, things were not fine a few hours later. One of our LVM volumes remounted itself read-only, and our attempt to remount it rw led to the machine hard locking. Rebooting the machine gave us nothing but fsck complaining about TONs of inodes on all but two of the LVM volumes. We ended up having to wipe the raid 5 array clean and restore from backups.

Here are the errors from when the kernel remounted the LVM volume ro. Fsck complained in a very similar manner on the reboot.

Dec 13 02:37:40 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #16335409: rec_len % 4 != 0 - offset=0, inode=3325888793, rec_len=36875, name_len=128
Dec 13 02:37:40 farrell kernel: Aborting journal on device dm-0.
Dec 13 02:37:42 farrell kernel: ext3_abort called.
Dec 13 02:37:42 farrell kernel: EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Dec 13 02:37:42 farrell kernel: Remounting filesystem read-only
Dec 13 02:45:08 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #10028449: rec_len % 4 != 0 - offset=0, inode=3453369297, rec_len=43883, name_len=176
Dec 13 02:45:57 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #17728222: rec_len % 4 != 0 - offset=0, inode=4248137906, rec_len=41903, name_len=120
Dec 13 02:48:01 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #37765142: rec_len % 4 != 0 - offset=0, inode=189228823, rec_len=2857, name_len=49
Dec 13 02:50:17 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #10027806: rec_len % 4 != 0 - offset=0, inode=2494885026, rec_len=6898, name_len=164
Dec 13 02:51:37 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #17089522: rec_len % 4 != 0 - offset=0, inode=3224714177, rec_len=24635, name_len=56
Dec 13 03:22:44 farrell kernel: EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #31687011: rec_len % 4 != 0 - offset=0, inode=934687742, rec_len=46451, name_len=57
Dec 13 03:37:43 farrell kernel: EXT3-fs error (device dm-3): ext3_readdir: bad entry in directory #1033375: directory entry across blocks - offset=0, inode=2770636924, rec_len=62712, name_len=4
Dec 13 03:37:43 farrell kernel: Aborting journal on device dm-3.
Dec 13 03:37:43 farrell kernel: ext3_abort called.
Dec 13 03:37:43 farrell kernel: EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Dec 13 03:37:43 farrell kernel: Remounting filesystem read-only
Dec 13 03:52:21 farrell kernel: EXT3-fs error (device dm-7): ext3_readdir: bad entry in directory #1442857: rec_len % 4 != 0 - offset=0, inode=732481902, rec_len=52555, name_len=181
Dec 13 03:52:21 farrell kernel: Aborting journal on device dm-7.
Dec 13 03:52:21 farrell kernel: ext3_abort called.
Dec 13 03:52:21 farrell kernel: EXT3-fs error (device dm-7): ext3_journal_start_sb: Detected aborted journal
Dec 13 03:52:21 farrell kernel: EXT3-fs error (device dm-7): ext3_readdir: bad entry in directory #1442857: rec_len % 4 != 0 - offset=0, inode=732481902, rec_len=52555, name_len=181
Dec 13 03:52:21 farrell kernel: Aborting journal on device dm-7.
Dec 13 03:52:21 farrell kernel: ext3_abort called.
Dec 13 03:52:21 farrell kernel: EXT3-fs error (device dm-7): ext3_journal_start_sb: Detected aborted journal
Dec 13 03:52:21 farrell kernel: Remounting filesystem read-only
Dec 13 03:53:08 farrell kernel: EXT3-fs error (device dm-7): ext3_readdir: bad entry in directory #1606426: rec_len % 4 != 0 - offset=0, inode=3306786808, rec_len=1430, name_len=20

Does anybody else run a similar setup, and if so, what do your disk replacement procedures look like? Or has anybody ever run into similar errors for any other reason (people on Google mention a possible kernel bug)?

Basically, I'm just trying to figure out whether something I did (or didn't do) caused the FS on the LVM volumes to become corrupt, and if not, what did.

--
Dustin Minnich
Nicholas IT
613-8148
_______________________________________________
Dulug mailing list
Dulug@...
https://lists.dulug.duke.edu/mailman/listinfo/dulug
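The "all arrays synced" check in the procedure above can be made a little more mechanical by scanning /proc/mdstat for any array that md still reports as degraded or rebuilding. What follows is only a minimal sketch and was not part of the original procedure: the function name md_unhealthy is made up for illustration, and it assumes the usual /proc/mdstat layout (an "mdN :" line followed by a status line carrying a [UU_] map). It reports only what md itself believes, so a clean result does not prove that the data on a rebuilt array is intact.

```shell
# Hypothetical helper: reads /proc/mdstat-format text on stdin and prints
# "<array> degraded" or "<array> rebuilding" for every array that is not
# fully in sync. Prints nothing when every array looks healthy.
md_unhealthy() {
    awk '
        # Remember which array the following status lines belong to.
        /^md[0-9]+ :/ { name = $1; next }

        # A status map such as [UU_] with an underscore means a member is missing.
        name != "" && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_") > 0) bad[name] = "degraded"
        }

        # A recovery/resync progress line means the rebuild is still running
        # (this overrides "degraded" for the same array).
        name != "" && /(recovery|resync) *=/ { bad[name] = "rebuilding" }

        END { for (n in bad) print n, bad[n] }
    ' | sort
}
```

Run as `md_unhealthy < /proc/mdstat`, an empty result would mean md considers every array complete; any output is a reason not to leave the machine unattended yet.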
Re: LVM volumes have corrupt ext3 fs after replacing a failing drive in an underlying software raid 5 array

On Tue, 16 Dec 2008, Dustin Minnich wrote:
> Hi all.
>
> We have a server running Cent OS 4.7 kernel 2.6.9-78.0.5.ELsmp.
>
> It has 6 300Gb SAS drives that are configured as follows:
> * /boot and / are in software raid 1 arrays with ext3 on them.
> * A small software raid 0 array for speed with ext3 on it.
> * Swap is in a raid 0 array. This is a leftover from a previous admin and it will be changed eventually.
> * And a large software raid 5 array which houses LVM volumes that are formatted with ext3.
>
> We got some errors in the messages log sometime last week about "bad segments" on one of the drives. Doing a smartctl -t long /dev/sda confirmed that the drive wasn't happy. We decided to replace it before things got worse.
>
> Here is the procedure we followed to replace the drive.....
>
> 1) backup failing disks partition table:
> sfdisk -d /dev/sda > /etc/partitions.sda
>
> 2) fail and remove the disk from all of its arrays:
> mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
> ...
> 3) turn off swap since its in a swap array. shutdown the swap array:
> swapoff -a
> mdadm -S /dev/md4
>
> 4) tell the OS to kill the drive for hot swap and pull the bad drive:
> echo "scsi remove-single-device 0 0 0 0" > /proc/scsi/scsi
>
> 5) insert new drive and tell the OS to find it:
> echo "scsi add-single-device 0 0 0 0" > /proc/scsi/scsi
>
> 6) create the partition table on the new disk:
> sfdisk /dev/sda < /etc/partitions.sda
>
> 7) add the drive back to all of its arrays:
> mdadm /dev/md0 --add /dev/sda1
> ...
>
> 8) re-create the swap array, format it, turn swap back on:
> mdadm --create --verbose /dev/md4 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
> mkswap /dev/md4
> swapon -a
>
> Everything seemed fine. All arrays synced, data was accessible, so we went home and went to bed.

Did anything happen right before the new errors occurred? Specifically, did your backups start or anything that started exercising the disks - or more likely the cables/backplane?

Anything in the logs before the journal errors started?

-sv
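One way to answer the "anything in the logs before the journal errors started?" question is to print the handful of syslog lines that precede the first EXT3-fs error. This is a minimal sketch only: the function name lines_before_first_error and the default of 20 lines are assumptions made for illustration, and it simply matches the literal text "EXT3-fs error" seen in the messages quoted in this thread.

```shell
# Hypothetical helper: reads syslog text on stdin and prints up to N lines
# of context (default 20) leading up to the first "EXT3-fs error" line,
# followed by that line itself. Prints nothing if no such line exists.
lines_before_first_error() {
    awk -v n="${1:-20}" '
        BEGIN { if (n < 1) n = 1 }

        /EXT3-fs error/ {
            # Replay the ring buffer in original order, oldest line first.
            start = (NR - n > 1) ? NR - n : 1
            for (i = start; i < NR; i++) print buf[i % n]
            print
            exit
        }

        # Keep only the most recent n lines.
        { buf[NR % n] = $0 }
    '
}
```

For example, `lines_before_first_error 50 < /var/log/messages` would show the 50 lines before the first error, which is where SCSI, controller, or md messages would turn up if the hardware complained first.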
Re: LVM volumes have corrupt ext3 fs after replacing a failing drive in an underlying software raid 5 array

Actually, yeah, rdiff should have started running at 2:30 and the first error was at 2:37.

I think the arrays finished rebuilding around 11 or 12 though.
How do you think the two incidents could be related? Was the array not actually rebuilt successfully, and the backup job was the first to notice when it couldn't find a specific file? Or do you think the constant heavy load had something to do with it?

Dustin Minnich
Nicholas IT
613-8148

Seth Vidal wrote:
> Did anything happen right before the new errors occurred?
> Specifically, did your backups start or anything that started
> exercising the disks - or more likely the cables/backplane?
>
> Anything in the logs before the journal errors started?
>
> -sv
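To line the errors up against the 2:30 backup start, it can help to pull out the first EXT3-fs error logged for each device. The sketch below is an illustration only: the function name first_error_per_device is hypothetical, and it assumes the standard syslog layout used in this thread, where the first three fields are month, day, and time.

```shell
# Hypothetical helper: reads syslog text on stdin and prints
# "<device> <month> <day> <time>" for the first EXT3-fs error seen on each
# device, in the order the devices first appear in the log.
first_error_per_device() {
    awk '
        match($0, /\(device [^)]+\)/) && /EXT3-fs error/ {
            # Strip the leading "(device " (8 chars) and the trailing ")".
            dev = substr($0, RSTART + 8, RLENGTH - 9)
            if (!(dev in seen)) {
                seen[dev] = 1
                print dev, $1, $2, $3
            }
        }
    '
}
```

Fed the kernel messages quoted in the original post, this gives dm-0 at Dec 13 02:37:40, dm-3 at Dec 13 03:37:43, and dm-7 at Dec 13 03:52:21, so the three volumes failed over a span of about 75 minutes, all after the backup was due to start.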
Re: LVM volumes have corrupt ext3 fs after replacing a failing drive in an underlying software raid 5 array

On Tue, 16 Dec 2008, Dustin Minnich wrote:
> Actually, yeah, rdiff should have started running at 2:30 and the first error
> was at 2:37.
>
> I think the arrays finished re-buidling around 11 or 12 though.
> How do you think the two incidents could be related? Array wasn't actually
> rebuilt successfully and the backup job was the first to notice that when it
> couldn't find a specific file? Or do you think the constant heavy load had
> something to do with it?
>

The array rebuild (unless you increased the resync rate) won't necessarily hit the array very hard. However, if you have a less-than-great backplane or adapter, it could be overheating or being overtaxed under heavy use and simply losing its mind.

Test it, if you can. Run a fast, hard bonnie++ or tiobench test on it and see if it goes bonkers. If it does, call your hw rep and get them to swap the backplane and any/all cables.

-sv
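The two things mentioned above, the resync rate and a hard I/O test, can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested recipe: the function names are made up, the md throttle is assumed to be exposed as speed_limit_min and speed_limit_max under /proc/sys/dev/raid, and the bonnie++ command uses only its -d (test directory), -s (file size in MiB), and -u (user to run as) options. The second function only prints the command so it can be reviewed before anything is run.

```shell
# Hypothetical helper: print the md resync throttle values. $1 optionally
# overrides the directory (assumed default: /proc/sys/dev/raid).
md_speed_limits() {
    dir="${1:-/proc/sys/dev/raid}"
    for f in speed_limit_min speed_limit_max; do
        printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
    done
}

# Hypothetical helper: print (do not run) a bonnie++ command line.
# $1 = directory on the volume under test, $2 = file size in MiB,
# $3 = unprivileged user to run the test as.
bonnie_cmd() {
    printf 'bonnie++ -d %s -s %s -u %s\n' "$1" "$2" "$3"
}
```

A file size comfortably larger than the machine's RAM is the usual choice, so that the test exercises the disks, cables, and backplane instead of the page cache. Watching the messages log while the test runs would show whether the hardware misbehaves under load.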