forcing a raid recovery

forcing a raid recovery

by Stephen Adler

Hi all,

I'm putting together a backup system at my job and in doing so set up the
good ol' RAID 5 array. While I was putting the disk array together, I
read that one could encounter a problem in which you replace a failed
drive and the rebuilding process trips over a bad sector on one of the
drives that was good before the rebuild started, and thus you end up
with a screwed-up RAID array. So I was thinking of a way to avoid this
problem. One solution is to kick off a job once a week or month in which
you force the whole RAID array to be read. I was thinking of possibly
forcing a checksum of all the files I had stored on the disk. The other
idea I had was to force one of the drives into a failed state and then
add it back in and thus force the RAID to rebuild. The rebuilding
process takes about 3 hours on my system, so I could easily run it at
2am every Sunday morning.

Can anyone comment on this as a reliable way to exercise the disks in
the array so that a bad sector doesn't go unnoticed until a rebuild occurs?

thanks. Steve.

_______________________________________________
Discuss mailing list
Discuss@...
http://lists.blu.org/mailman/listinfo/discuss

Re: forcing a raid recovery

by Jack Coats at coats.org

I have done this too. Even with a good disk backup, a tape copy is not a
bad idea, and it could be your 'once a week to validate reads' too.

In one mainframe shop I had 3x the maximum disk needed, and did a
round-robin copy into what were effectively 3 partitions on different
drives. It reorganized the database and gave me other maintenance space
too. Copy C was erased, giving work room; copy A (this week) was copied
into copy B; copy B was reorganized in place. Copy B becomes 'live', copy
A becomes a static backup, and copy C is the prior week's backup that will
be erased before next week's 'roll and update' procedure.

Just some thoughts from what has worked.

><> ... Jack




Re: forcing a raid recovery

by Dan Ritter-2

On Tue, Nov 03, 2009 at 02:12:29PM -0500, Stephen Adler wrote:

> Hi all,
>
> I'm putting together a backup system at my job and in doing so set up the
> good ol' RAID 5 array. While I was putting the disk array together, I
> read that one could encounter a problem in which you replace a failed
> drive and the rebuilding process trips over a bad sector on one of the
> drives that was good before the rebuild started, and thus you end up
> with a screwed-up RAID array. So I was thinking of a way to avoid this
> problem. One solution is to kick off a job once a week or month in which
> you force the whole RAID array to be read. I was thinking of possibly
> forcing a checksum of all the files I had stored on the disk. The other
> idea I had was to force one of the drives into a failed state and then
> add it back in and thus force the RAID to rebuild. The rebuilding
> process takes about 3 hours on my system, so I could easily run it at
> 2am every Sunday morning.

That's why I don't use RAID5, and I do use RAID10, and I also
have backups.

The incremental disk and controller cost is paid back in
man-hours and uptime.

-dsr-


--
http://tao.merseine.nu/~dsr/eula.html is hereby incorporated by reference.
You can't defend freedom by getting rid of it.

Re: forcing a raid recovery

by Bill Bogstad

On Tue, Nov 3, 2009 at 2:12 PM, Stephen Adler <adler@...> wrote:

> Hi all,
>
> I'm putting together a backup system at my job and in doing so set up the
> good ol' RAID 5 array. While I was putting the disk array together, I
> read that one could encounter a problem in which you replace a failed
> drive and the rebuilding process trips over a bad sector on one of the
> drives that was good before the rebuild started, and thus you end up
> with a screwed-up RAID array. So I was thinking of a way to avoid this
> problem. One solution is to kick off a job once a week or month in which
> you force the whole RAID array to be read. I was thinking of possibly
> forcing a checksum of all the files I had stored on the disk.

Reading all the files (whether you checksum them or not) won't read
all of the allocated blocks on the disk:

1. With RAID 5, the parity blocks are only read if a drive error
occurs when reading the data blocks. The result is that the parity
blocks won't ever get read during your testing (unless a failure
occurs).

2.  If the filesystem you are using supports snapshots, you will only
be reading the data blocks for the current version of the file.
(You could read all the snapshots as well, but that is going to result
in the same physical block on the disk being 'read' multiple times
(once for each snapshot in which it is included).)

If you have direct read access to the drives (partitions), you might
try just reading from them directly. Any drive on which you get read
errors can then be taken offline and a rebuild can be forced. I think
this is slightly better than what you suggest below, because you are at
least taking a drive with a known problem (bad blocks) offline rather
than ignoring all of the good data on the drive you are randomly
picking to force an error.
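
A minimal sketch of that direct read, in Python, in case it helps. The
function name and the example device path are hypothetical, reading a
raw device normally requires root, and blocks that were read recently
may be served from the page cache rather than from the platter, so
treat this as an illustration and not a tested tool:

```python
import os


def scan_device(path, chunk_size=1024 * 1024):
    """Read `path` from start to end.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the
    starting offset of every chunk the kernel refused to read.
    `path` would be something like /dev/sdb1 (hypothetical name).
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size for block devices and files.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                # I/O error on this chunk: record it and keep going so
                # the rest of the device still gets exercised.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # unexpected early end of device
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

A non-empty list of bad offsets from, say, scan_device("/dev/sdb1")
would mark that drive as the one to take offline and rebuild.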

What I think you really want is RAID scrubbing.  Here is a link to
some GENTOO Linux RAID docs on the subject:

http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Scrubbing

If you are using hardware RAID, you should investigate similar
commands for your hardware controller.
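
For Linux software RAID, scrubbing is driven through the md driver's
sysfs files: writing "check" to sync_action starts a read-and-compare
pass over the array, and mismatch_cnt reports what it found. A rough
Python sketch follows; the helper names and the array name "md0" are
assumptions for illustration, and writing to sysfs needs root:

```python
from pathlib import Path


def md_dir(array, sysfs_root="/sys/block"):
    """Return the md control directory for `array` (e.g. "md0")."""
    return Path(sysfs_root) / array / "md"


def start_check(array, sysfs_root="/sys/block"):
    """Ask the md driver to read and verify every stripe of the array."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")


def scrub_status(array, sysfs_root="/sys/block"):
    """Return (current sync action, mismatch count) for the array."""
    d = md_dir(array, sysfs_root)
    action = (d / "sync_action").read_text().strip()
    mismatches = int((d / "mismatch_cnt").read_text().strip())
    return action, mismatches
```

Calling start_check("md0") from a weekly cron job and looking at
scrub_status("md0") once the action returns to "idle" would be one way
to use it. The sysfs_root parameter is only there so the helpers can be
pointed at a scratch directory for testing.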

> The other idea I had was to force one of the drives into a failed
> state and then add it back in and thus force the RAID to rebuild. The
> rebuilding process takes about 3 hours on my system, so I could
> easily run it at 2am every Sunday morning.

And what if one of the drives you didn't take offline has a failure
during that window?

Bill Bogstad

Re: forcing a raid recovery

by Tom Metro-16

Stephen Adler wrote:
> One solution is to kick off a job once a week or
> month in which you force the whole raid array to be read.

If you are using Linux software RAID, on Ubuntu (and probably Debian)
the default setup for mdadm includes a cron job that runs checkarray
monthly on the first Sunday of the month. (The checkarray script sends
the same command to the md driver as what is described at the "Data
Scrubbing" link Bill Bogstad posted.)


> Can anyone comment on this as a reliable way to exercise the disks in
> the array so that a bad sector doesn't go unnoticed until a rebuild occurs?

As Bill Bogstad mentioned, the above won't exercise the entire disk,
which is why I'd recommend running smartd, and configuring it to run a
long test weekly, which supposedly will perform a read scan of the
entire drive. A SMART failure won't trigger a RAID failure, but if you
set up alerts from smartd, you can manually fail the bad drive and
replace it.
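
If you'd rather kick off the long test yourself (from cron, say) than
have smartd schedule it, something like the Python sketch below would
do it. This is a swapped-in alternative, not smartd's own scheduling;
the function names and the device path are hypothetical, and it assumes
smartmontools is installed and the script runs as root:

```python
import subprocess


def long_test_command(device):
    """Build the smartctl invocation that starts an extended self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_command(device):
    """Build the smartctl invocation that prints the self-test log."""
    return ["smartctl", "-l", "selftest", device]


def start_long_test(device):
    # The drive runs the test in the background; smartctl returns at once.
    return subprocess.run(
        long_test_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
```

After start_long_test("/dev/sda") has had a few hours to finish, running
the command from selftest_log_command("/dev/sda") shows whether the scan
hit a read failure.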

  -Tom

--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/

Re: forcing a raid recovery

by Daniel Feenberg



On Tue, 3 Nov 2009, Dan Ritter wrote:

> On Tue, Nov 03, 2009 at 02:12:29PM -0500, Stephen Adler wrote:
>> Hi all,
>>
>> I'm putting together a backup system at my job and in doing so set up the
>> good ol' RAID 5 array. While I was putting the disk array together, I
>> read that one could encounter a problem in which you replace a failed
>> drive and the rebuilding process trips over a bad sector on one of the
>> drives that was good before the rebuild started, and thus you end up
>> with a screwed-up RAID array. So I was thinking of a way to avoid this
>> problem. One solution is to kick off a job once a week or month in which
>> you force the whole RAID array to be read. I was thinking of possibly
>> forcing a checksum of all the files I had stored on the disk. The other
>> idea I had was to force one of the drives into a failed state and then
>> add it back in and thus force the RAID to rebuild. The rebuilding
>> process takes about 3 hours on my system, so I could easily run it at
>> 2am every Sunday morning.
>
> That's why I don't use RAID5, and I do use RAID10, and I also
> have backups.

The OP is using the RAID5 for a backup, not primary storage, so I am
sympathetic to his desire to use RAID5.

As a backup, the refusal to reconstruct past an error on an unfailed disk
is not so serious as you might at first glance suppose. You can still read
all the files in degraded mode, you just can't reconstruct the backup
disks in place. You could mkfs the backup drives and run the backup again
(with the bad drives replaced) and get a new backup. There wouldn't
necessarily be any loss of data or great amount of sys-admin work. It
would have to be a full backup, not incremental, but that may not be a
fatal objection.

The problem with crashing on a double failure is more acute when it is
primary storage, and that storage has to be copied in its entirety to
alternate primary storage, or back from a backup, and will likely cause
hours or days of downtime.

Taking a checksum of all the files would be insurance that any in-use bad
sectors would be noticed, but as another poster pointed out, it probably
wouldn't help with reconstruction in place.
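
For what it's worth, the checksum pass could be as simple as the Python
sketch below. The function name and the choice of SHA-256 are my own
assumptions; the point is only that every data block belonging to a file
gets read, and that files which fail to read are reported:

```python
import hashlib
import os


def checksum_tree(root, chunk_size=1024 * 1024):
    """Read every regular file under `root`.

    Returns (digests, errors): digests maps path -> SHA-256 hex digest,
    errors maps path -> message for files that could not be read.
    """
    digests = {}
    errors = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            try:
                with open(path, "rb") as f:
                    # Read in chunks so large backup files fit in memory.
                    for chunk in iter(lambda: f.read(chunk_size), b""):
                        digest.update(chunk)
            except OSError as exc:
                # A read error here is the early warning being looked for.
                errors[path] = str(exc)
                continue
            digests[path] = digest.hexdigest()
    return digests, errors
```

Keeping the digests from one run and comparing them with the next would
also catch files whose contents changed without a read error, though
that goes a little beyond the original question.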

Daniel Feenberg

>
> The incremental disk and controller cost is paid back in
> man-hours and uptime.
>
> -dsr-