forcing a raid recovery

Hi all,

I'm putting together a backup system at my job, and in doing so I set up the good ol' RAID 5 array. While I was putting the disk array together, I read about a problem you can run into: you replace a failed drive, and the rebuild process trips over a bad sector on one of the drives that looked good before the rebuild started, so you end up with a screwed-up RAID array. So I was thinking of a way to avoid this problem. One solution is to kick off a job once a week or month that forces the whole RAID array to be read. I was thinking of possibly forcing a checksum of all the files I have stored on the disk. The other idea I had was to force one of the drives into a failed state and then add it back in, thus forcing the RAID to rebuild. The rebuild process takes about 3 hours on my system, so I could easily run it at 2am every Sunday morning.

Can anyone comment on this as a reliable way to exercise the disks in the array, so that a bad sector doesn't go unnoticed until a rebuild occurs?

thanks. Steve.

_______________________________________________
Discuss mailing list
Discuss@...
http://lists.blu.org/mailman/listinfo/discuss
Re: forcing a raid recovery

I have done this too. Even with a good disk backup, a tape copy is not a bad idea, and it could double as your 'once a week to validate reads' job.

In one mainframe shop I had 3x the maximum disk space needed, and did a round-robin copy into what were effectively 3 partitions on different drives. It reorganized the database and gave me other maintenance space too. Copy C was erased, giving work room; copy A (this week) was copied into copy B; copy B was reorganized in place. Copy B becomes 'live', copy A becomes a static backup, and copy C is the prior week's backup, which will be erased before next week's 'roll and update' procedure.

Just some thoughts from what has worked.

><> ... Jack

On Tue, Nov 3, 2009 at 1:12 PM, Stephen Adler <adler@...> wrote:
> One solution is to kick off a job once a week or
> month in which you force the whole raid array to be read.
Re: forcing a raid recovery

On Tue, Nov 03, 2009 at 02:12:29PM -0500, Stephen Adler wrote:
> While I was putting the disk array together, I
> read that one could encounter a problem in which you replace a failed
> drive, the rebuilding processes will trip over another bad sector in one
> of the drives which was good before starting the rebuilding process and
> thus you end up with a screwed up raid array.

That's why I don't use RAID5, and I do use RAID10, and I also
have backups.

The incremental disk and controller cost is paid back in
man-hours and uptime.

-dsr-

-- 
http://tao.merseine.nu/~dsr/eula.html is hereby incorporated by reference.
You can't defend freedom by getting rid of it.
Re: forcing a raid recovery

On Tue, Nov 3, 2009 at 2:12 PM, Stephen Adler <adler@...> wrote:
> So I was thinking of a way
> to avoid this problem. One solution is to kick off a job once a week or
> month in which you force the whole raid array to be read. I was thinking
> of possibly forcing a check sum of all the files I had stored on the
> disk.

Reading all the files (whether you checksum them or not) won't read all of the allocated blocks on the disk:

1. With RAID 5, the parity blocks are only read if a drive error occurs when reading the data blocks. The result is that the parity blocks won't ever get read during your testing (unless a failure occurs).

2. If the filesystem you are using supports snapshots, you will only be reading the data blocks for the current version of each file. (You could read all the snapshots as well, but that would result in the same physical block on the disk being 'read' multiple times, once for each snapshot that includes it.)

If you have direct read access to the drives (partitions), you might try just reading from them directly. Any drive on which you get read errors can then be taken offline and a rebuild can be forced. I think this is slightly better than what you suggest below, because you are at least taking a drive with a known problem (bad blocks) offline, rather than ignoring all of the good data on a drive you picked at random to force an error.

What I think you really want is RAID scrubbing. Here is a link to some Gentoo Linux RAID docs on the subject:

http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Scrubbing

If you are using hardware RAID, you should investigate similar commands for your hardware controller.

> The other idea I had was to force one of the drives into a failed
> state and then add it back in and thus force the raid to rebuild. The
> rebuilding processes takes about 3 hours on my system which I could
> easily execute at 2am every Sunday morning.

And what if one of the drives you didn't take offline has a failure during that window?

Bill Bogstad
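
The direct-read idea above can be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function name `scan_for_read_errors` and the 1 MiB chunk size are made up for the example, and it assumes a Unix-like system where Python's `os.pread` is available and the caller has read permission on the device node.

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` from start to end and return the byte offsets of
    chunks that could not be read.

    `path` can be a regular file or a block device node. The name and
    the default chunk size are hypothetical choices for this sketch.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Treat any I/O error as a bad chunk and keep going,
                # so a single bad sector does not stop the scan.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of file or device
            offset += len(data)
    finally:
        os.close(fd)
    return bad_offsets
```

Something like `scan_for_read_errors("/dev/sdb")` (the device name is a placeholder) would then list the chunks that failed. Note that reads may be satisfied from the page cache rather than the platters, so a plain read like this is weaker than a scrub performed by the RAID layer itself.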
Re: forcing a raid recovery

Stephen Adler wrote:
> One solution is to kick off a job once a week or
> month in which you force the whole raid array to be read.

If you are using Linux software RAID, on Ubuntu (and probably Debian) the default setup for mdadm includes a cron job that runs checkarray monthly, on the first Sunday of the month. (The checkarray script sends the same command to the md driver as the one described at the "Data Scrubbing" link Bill Bogstad posted.)

> Can anyone comment on this as a reliable way to exercise the disks in
> the array so that a bad sector doesn't get touched until a rebuild occurs?

As Bill Bogstad mentioned, the above won't exercise the entire disk, which is why I'd recommend running smartd and configuring it to run a long test weekly, which supposedly performs a read scan of the entire drive. A SMART failure won't trigger a RAID failure, but if you set up alerts from smartd, you can manually fail the bad drive and replace it.

-Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
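
For reference, the scrub request mentioned above can be sketched like this, assuming Linux software RAID (md) and its sysfs interface, where writing `check` to an array's `sync_action` file starts a read-only consistency pass and `mismatch_cnt` reports what the last pass found. The function names, the `md0` array name, and the `sysfs_root` parameter are illustrative assumptions, not part of mdadm or of the checkarray script.

```python
from pathlib import Path


def start_md_check(array="md0", sysfs_root="/sys/block"):
    """Ask the md driver to start a check pass on `array`.

    Needs root on a real system. `sysfs_root` is only a parameter so
    the function can be pointed at a scratch directory for testing.
    """
    sync_action = Path(sysfs_root) / array / "md" / "sync_action"
    sync_action.write_text("check\n")


def read_mismatch_count(array="md0", sysfs_root="/sys/block"):
    """Return the mismatch count recorded by the most recent check."""
    mismatch = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    return int(mismatch.read_text().strip())
```

Run from cron, `start_md_check("md0")` would be a rough equivalent of a scheduled scrub. Progress shows up in `/proc/mdstat` while the pass runs, and a nonzero `read_mismatch_count("md0")` afterwards is worth investigating.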
Re: forcing a raid recovery

On Tue, 3 Nov 2009, Dan Ritter wrote:
> That's why I don't use RAID5, and I do use RAID10, and I also
> have backups.

The OP is using the RAID5 for a backup, not primary storage, so I am sympathetic to his desire to use RAID5. For a backup, the refusal to reconstruct past an error on an unfailed disk is not as serious as you might suppose at first glance. You can still read all the files in degraded mode; you just can't reconstruct the backup disks in place. You could mkfs the backup drives and run the backup again (with the bad drives replaced) and get a new backup. There wouldn't necessarily be any loss of data or a great amount of sysadmin work. It would have to be a full backup, not incremental, but that may not be a fatal objection.

The problem with crashing on a double failure is more acute when it is primary storage, since that storage has to be copied in its entirety to alternate primary storage, or restored from a backup, which will likely cause hours or days of downtime.

Taking a checksum of all the files would be insurance that any in-use bad sectors would be noticed, but as another poster pointed out, it probably wouldn't help with reconstruction in place.

Daniel Feenberg

> The incremental disk and controller cost is paid back in
> man-hours and uptime.
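
A minimal sketch of the "checksum all the files" pass discussed in this thread, for anyone who wants to try it. The function name `checksum_tree` and the choice of SHA-256 are assumptions made for the example; any digest would serve, since the point is simply to force every in-use data block to be read and to notice files that cannot be.

```python
import hashlib
import os


def checksum_tree(root, chunk_size=1024 * 1024):
    """Read every file under `root` and return (digests, failures).

    digests maps each readable file path to its SHA-256 hex digest;
    failures lists the paths that raised an I/O error while being read.
    """
    digests = {}
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            try:
                with open(path, "rb") as f:
                    # Read in chunks so large backup files do not
                    # have to fit in memory.
                    for block in iter(lambda: f.read(chunk_size), b""):
                        digest.update(block)
            except OSError:
                failures.append(path)
                continue
            digests[path] = digest.hexdigest()
    return digests, failures
```

Comparing the digests from one run to the next also catches silent corruption in files that should not have changed. As noted earlier in the thread, this only touches data blocks of current files, not RAID 5 parity, so it complements a scrub rather than replacing one.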
