colonialone: 2nd disk of /dev/md2 dead + filesystem errors on newly created LV

Hi,

There are 2 issues on colonialone:

- First, /dev/md2 is running on only a single RAID disk.

- Second, I got a filesystem error on an LV I had just created (on
  vg_1.0tb, on md1), which was then remounted read-only.

It's pretty strange, so I wonder if you have anything to say / suggest
about this.

--
Sylvain
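For the first issue, here is a minimal sketch of how a degraded array could be spotted from `/proc/mdstat`. It assumes Linux software RAID and relies on the `[configured/active]` counter that mdstat prints for each array; the function name `degraded_arrays` is made up for illustration and is not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Read mdstat-formatted text on stdin and print the name of every array
# whose "[configured/active]" counter shows fewer active members than
# configured (for example "[2/1]").
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }               # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {          # e.g. "[4/2]"
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

An empty result would mean every array has all of its members; any name printed is an array that deserves a closer look with `mdadm --detail`.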
Re: [gnu.org #498996] Hard-disk failures on colonialone

> On Thu, Oct 29, 2009 at 01:20:55PM -0400, Daniel Clark via RT wrote:
> > Ah I see, I was waiting for comments on this - should be able to go out this weekend to do
> > replacements / reshuffles / etc, but I need to know if savannah-hackers has a strong
> > opinion on how to proceed:
> >
> > (1) Do we keep the 1TB disks?
> > > - Now that the cause of the failure is known to be a software failure,
> > > do we forget about this, or still pursue the plan to remove 1.0TB
> > > disks that are used nowhere else at the FSF?
> >
> > That was mostly a "this makes no sense, but that's the only thing that's different about
> > that system" type of response; it is true they are not used elsewhere, but if they are
> > actually working fine I am fine with doing whatever savannah-hackers wants to do.
> >
> > (2) Do we keep the 2 eSATA drives connected?
> > > - If not, do you recommend moving everything (but '/') on the 1.5TB
> > > disks?
> >
> > Again if they are working fine it's your call; however the bigger issue is if you want to
> > keep the 2 eSATA / external drives connected, since that is a legitimate extra point of
> > failure, and there are some cases where errors in the external enclosure can bring a system
> > down (although it's been up and running fine for several months now).
> >
> > (3) Do we make the switch to UUIDs now?
> > > - About UUIDs, everything in fstab is using mdX, which I'd rather not
> > > mess with.
> >
> > IMHO it would be better to mess with this when the system is less critical; not using UUIDs
> > everywhere tends to screw you during recovery from hardware failures.
> >
> > And BTW totally off-topic, but eth1 on colonialone is now connected via crossover ethernet
> > cable to eth1 on savannah (and colonialone is no longer on fsf 10. management network,
> > which I believe we confirmed no one cared about)
> >
> > (4) We need to change to some technique that will give us RAID1 redundancy even if one
> > drive dies. I think the safest solution would be to not use eSATA, and use 4 1.5TB drives
> > all inside the computer in a 1.5TB quad RAID1 array, so all 4 drives would need to fail to
> > bring savannah down. Other option would be 2 triple RAID1s using eSATA, each with 2 disks
> > inside the computer and the 3rd disk in the external enclosure.

On Thu, Oct 29, 2009 at 07:29:50PM +0100, Sylvain Beucler wrote:
> Hi,
>
> As far as the hardware is concerned, I think it is best that we do
> what the FSF sysadmins think is best.
>
> We don't have access to the computer, don't really know anything about
> what it's made of, don't understand the eSATA/internal
> differences. We're even using Xen as you do, to ease this kind of
> interaction. In short, you're more often than not in a better position
> to judge the hardware issues.
>
>
> So:
>
> If you think it's safer to use 4x1.5TB RAID-1, then let's do that.
>
> Only, we need to discuss how to migrate the current data, since
> colonialone is already in production.
>
> In particular, fixing the DNS issues I reported would help if
> temporary relocation is needed.

I see that there are currently 4x 1.5TB disks.

  sda 1TB inside
  sdb 1TB inside
  sdc 1.5TB inside?
  sdd 1.5TB inside?
  sde 1.5TB external/eSATA?
  sdf 1.5TB external/eSATA?

Here's what I started doing:

- recreate 4 partitions on sdc and sde (but 2 of them in an extended
  partition)

- added sdc and sdd to the current RAID-1 arrays

  mdadm /dev/md0 --add /dev/sdc1
  mdadm /dev/md0 --add /dev/sdd1
  mdadm /dev/md1 --add /dev/sdc2
  mdadm /dev/md1 --add /dev/sdd2
  mdadm /dev/md2 --add /dev/sdc5
  mdadm /dev/md2 --add /dev/sdd5
  mdadm /dev/md3 --add /dev/sdc6
  mdadm /dev/md3 --add /dev/sdd6
  mdadm /dev/md0 --grow -n 4
  mdadm /dev/md1 --grow -n 4
  mdadm /dev/md2 --grow -n 4
  mdadm /dev/md3 --grow -n 4

  colonialone:~# cat /proc/mdstat
  Personalities : [raid1]
  md3 : active raid1 sdd6[4] sdc6[5] sdb4[1] sda4[0]
        955128384 blocks [4/2] [UU__]
        [>....................] recovery = 0.0% (43520/955128384) finish=730.1min speed=21760K/sec

  md2 : active raid1 sdc5[2] sdd5[3] sdb3[1] sda3[0]
        19534976 blocks [4/4] [UUUU]

  md1 : active raid1 sdd2[2] sdc2[3] sda2[0] sdb2[1]
        2000000 blocks [4/4] [UUUU]

  md0 : active raid1 sdd1[2] sdc1[3] sda1[0] sdb1[1]
        96256 blocks [4/4] [UUUU]

- install GRUB on sdc and sdd


With this setup, the data is both on the 1TB and the 1.5TB disks.

If you confirm that this is OK, we can:

* extend this to sde and sdf,

* unplug sda+sdb and plug all the 1.5TB disks internally

* reboot while you are at the colo, and ensure that there's no device
  renaming mess

* add the #7 partitions in sdc/d/e/f as a new RAID device / LVM
  Physical Volume and get the remaining 500GB


Can you let me know if this sounds reasonable?

--
Sylvain
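The twelve mdadm commands in the message above follow a regular pattern, so they can be generated rather than typed. The sketch below only prints the commands and runs nothing. The array-to-partition mapping (md0 to partition 1, md1 to 2, md2 to 5, md3 to 6) is read off the commands in the message, and the function name `plan_raid1_extension` is hypothetical.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that add one partition per new
# disk to each RAID-1 array, then grow each array to the target member
# count. The mapping md0:1 md1:2 md2:5 md3:6 is an assumption taken from
# the commands quoted in the thread.
plan_raid1_extension() {
    members=$1; shift              # target number of members per array
    for pair in md0:1 md1:2 md2:5 md3:6; do
        md=${pair%%:*}             # array name, e.g. md0
        part=${pair##*:}           # partition number on each disk
        for disk in "$@"; do
            echo "mdadm /dev/$md --add /dev/$disk$part"
        done
    done
    for md in md0 md1 md2 md3; do
        echo "mdadm /dev/$md --grow -n $members"
    done
}

# Reproduces the command list from the message for sdc and sdd.
plan_raid1_extension 4 sdc sdd
```

Reviewing the printed list before pasting it into a root shell keeps a typo in a device name from reaching a production array. Calling it as `plan_raid1_extension 6 sde sdf` would print the analogous commands for the two remaining disks, if six members per array is indeed what is wanted during the migration.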
Re: [gnu.org #498996] Hard-disk failures on colonialone

On Sat, Oct 31, 2009 at 11:13:51AM +0100, Sylvain Beucler wrote:
> [...]
> Can you let me know if this sounds reasonable?

up!

--
Sylvain
Re: [gnu.org #498996] Hard-disk failures on colonialone

On Thu, Nov 12, 2009 at 12:33:17PM +0100, Sylvain Beucler wrote:
> On Sat, Oct 31, 2009 at 11:13:51AM +0100, Sylvain Beucler wrote:
> > [...]
> > Can you let me know if this sounds reasonable?
>
> up!

Seriously, can you tell me whether it's OK to move the RAID from the
1TB to the 1.5TB disks and plan a disk re-plug soon?

--
Sylvain
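Before any disk is unplugged, it would make sense to confirm that no array is still rebuilding. A minimal sketch, assuming the `recovery = N%` (or `resync = N%`) progress line that `/proc/mdstat` prints during a rebuild, as in the output quoted earlier in the thread; the function name `rebuild_progress` is invented for this example.

```shell
#!/bin/sh
# Read mdstat-formatted text on stdin and print "<array> <percent>" for
# every array that reports a recovery or resync in progress.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }               # remember the current array
        /(recovery|resync) *=/ {
            if (match($0, /[0-9]+\.[0-9]+%/))     # first percentage on the line
                print name, substr($0, RSTART, RLENGTH)
        }
    '
}

# Usage on a live system:
#   rebuild_progress < /proc/mdstat
```

No output would suggest that nothing is rebuilding at that moment; it says nothing about whether every member is present, so it complements rather than replaces a look at the `[UUUU]` status map.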