DRBD crash on two nodes cluster. Some help please?
Hello all,
I finally managed to capture a log during a DRBD crash.

I have a two-node RHEL 5.3 cluster with kernel 2.6.18-164.el5xen and a self-compiled drbd-8.3.1-3. Both nodes have a dedicated 1G Ethernet back-to-back connection over RTL8169sb/8110sb cards.

When I run applications that constantly read from or write to the disks (active/active config), DRBD keeps crashing. Now I have the logs that show the reason:

______________________
ON TWEETY1

Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed

___________________________
ON TWEETY2

Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header on sock: r=-104
Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by peer.
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm fence-peer minor-2

____________________
DRBD.CONF

#
# drbd.conf
#
global {
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate 100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/sbin/obliterate";
        pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout 60;
        degr-wfc-timeout 60;    # 1 minute.
        become-primary-on both;
    }

    disk {
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size 512k;
        timeout 60;             # 6 seconds (unit = 0.1 seconds)
        connect-int 10;         # 10 seconds (unit = 1 second)
        ping-int 10;            # 10 seconds (unit = 1 second)
        ping-timeout 50;        # 5 seconds (unit = 0.1 seconds)
        max-buffers 2048;
        max-epoch-size 2048;
        ko-count 10;
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "*****";
        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        rr-conflict call-pri-lost;
        data-integrity-alg "crc32c";
    }
}

resource r0 {
    device /dev/drbd0;
    disk /dev/hda4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7788; }
    on tweety-2 { address 10.254.254.254:7788; }
}

resource r1 {
    device /dev/drbd1;
    disk /dev/hdb4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7789; }
    on tweety-2 { address 10.254.254.254:7789; }
}

resource r2 {
    device /dev/drbd2;
    disk /dev/sda1;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7790; }
    on tweety-2 { address 10.254.254.254:7790; }
}

_________

Also available at http://pastebin.ca/1633173

How can I solve this?

Thank you All for your time.

_______________________________________________
drbd-user mailing list
drbd-user@...
http://lists.linbit.com/mailman/listinfo/drbd-user
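For anyone comparing the two nodes' logs, here is a minimal Python sketch that pulls the "name( old -> new )" transitions out of the state-change lines quoted in the post. The function name `parse_state_change` and the regular expression are illustrative assumptions, not part of DRBD or its tools; the two sample lines are copied from the logs above.

```python
import re

# Matches one "name( old -> new )" group in a DRBD state-change log line.
# The pattern is an assumption based only on the lines quoted in this thread.
STATE_CHANGE = re.compile(r"(\w+)\(\s*(\S+)\s*->\s*(\S+)\s*\)")


def parse_state_change(line):
    """Return {field: (old, new)} for every transition found in the line."""
    return {name: (old, new) for name, old, new in STATE_CHANGE.findall(line)}


# Sample lines copied from the TWEETY1 and TWEETY2 logs in the post.
tweety1 = (
    "Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) "
    "conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )"
)
tweety2 = (
    "Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) "
    "conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )"
)

for node, line in (("TWEETY1", tweety1), ("TWEETY2", tweety2)):
    changes = parse_state_change(line)
    print(node, "conn:", " -> ".join(changes["conn"]))
```

Run on the two lines, this shows the asymmetry in the logs: the node that reported the failed digest check went to ProtocolError, while its peer saw the socket reset and went to BrokenPipe.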
Re: DRBD crash on two nodes cluster. Some help please?
Hello all again.
In continuation to the below-described issue, with integrity check enabled, I used to get a crash at least once per 24 hours.

Now I have integrity check disabled, and the cluster has been running without crashes for the last 9 days.

Could someone kindly provide some hints about the possible reasons for this observed behavior?

Off-loading is disabled on both dedicated gigabit NICs.

Also, is integrity-check really needed (I have read the documentation :) ) if it keeps on breaking the cluster?

Thank you All for your time.

Theophanis Kontogiannis

On Tue, 2009-10-20 at 20:31 +0300, Theophanis Kontogiannis wrote:
> Hello all,
Re: DRBD crash on two nodes cluster. Some help please?
On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> Hello all again.
>
> In continuation to the below described issue, with integrity check
> enabled, I used to get a crash at least once per 24 hours.

No. You don't get "crashes".
You configured it to fence its peer on connection loss, and that is what it does.

> Now I have integrity check disabled and the cluster is running without
> crashes for the last 9 days.
>
> Could someone kindly provide some hints for the possible reasons of
> this observed behavior?
>
> Off-loading is disabled on both dedicated gigabit NICs.

Either something modifies in-flight buffers, which may or may not be intentional, and may or may not be "safe" wrt file system data integrity.

Or you actually _do_ have data corruption.

If drbd detects checksum mismatch (== data corruption, or more general: data received is not the same as it was when calculating the checksum before it was sent), rather than knowingly writing diverging data, drbd disconnects, and tries to reconnect, hoping for the bitmap based resync to send "better" data this time.

On disconnect, if so configured, a primary will call its fence-peer handler. You configured "obliterate" as fence peer handler. So it "obliterates" its peer.

> Also is integrity-check really needed (I have read the
> documentation :) ) if it keeps on breaking the cluster?

If you rather have silent data corruption :-)

==> Find the cause of the checksum mismatch.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
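The first case Lars describes (a buffer modified after its checksum was calculated but before it was sent) can be illustrated with a minimal Python sketch. This is not DRBD code and does not model its wire protocol. `zlib.crc32` is used as a stand-in because the Python standard library has no crc32c, which is what the posted `data-integrity-alg` setting uses, and the helper names `digest` and `receive` are hypothetical.

```python
import zlib


def digest(data):
    """Checksum over the payload. zlib.crc32 stands in for crc32c here."""
    return zlib.crc32(bytes(data))


def receive(payload, sent_digest):
    """Receiver side: recompute the digest and compare with the sender's."""
    return digest(payload) == sent_digest


# Sender computes the digest over the buffer as it looks right now.
buf = bytearray(b"block of file system data")
sent_digest = digest(buf)

# Case 1: the buffer is unchanged when it goes on the wire.
ok_unmodified = receive(bytes(buf), sent_digest)

# Case 2: something modifies the buffer after the digest was computed,
# so the bytes on the wire no longer match the digest sent with them.
buf[0] ^= 0xFF
ok_modified = receive(bytes(buf), sent_digest)

print("unmodified buffer passes check:", ok_unmodified)
print("modified buffer passes check:  ", ok_modified)
```

The receiver cannot tell from the mismatch alone whether the bytes changed on the sender before transmission or were corrupted on the way, which is why both possibilities are on the table in the reply above.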
Re: DRBD crash on two nodes cluster. Some help please?
Hello Lars and All,
Please look below.

On Thu, 2009-10-29 at 16:53 +0100, Lars Ellenberg wrote:
> On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> > Hello all again.
> >
> > In continuation to the below described issue, with integrity check
> > enabled, I used to get a crash at least once per 24 hours.
>
> No. You don't get "crashes".
> You configured it to fence its peer on connection loss, and that is what it does.

Correct in strict terminology. I just had in mind that both nodes get fenced, so I get a "crash" in the sense of having no service. But yes, the actual thing is that it gets fenced.

> > Now I have integrity check disabled and the cluster is running without
> > crashes for the last 9 days.
> >
> > Could someone kindly provide some hints for the possible reasons of
> > this observed behavior?
> >
> > Off-loading is disabled on both dedicated gigabit NICs.
>
> Either something modifies in-flight buffers, which may or may not be intentional, and may or may not be "safe" wrt file system data integrity.
>
> Or you actually _do_ have data corruption.
>
> If drbd detects checksum mismatch (== data corruption, or more general: data received is not the same as it was when calculating the checksum before it was sent), rather than knowingly writing diverging data, drbd disconnects, and tries to reconnect, hoping for the bitmap based resync to send "better" data this time.
>
> On disconnect, if so configured, a primary will call its fence-peer handler. You configured "obliterate" as fence peer handler. So it "obliterates" its peer.
>
> > Also is integrity-check really needed (I have read the
> > documentation :) ) if it keeps on breaking the cluster?
>
> If you rather have silent data corruption :-)
>
> ==> Find the cause of the checksum mismatch.

Is there any way to track the CRC error down at a really low level? Turn on insane debugging on drbd, or something else? I cannot think of any good way to go low level for that!

Thank you All for your time.

T.K.