DRBD crash on two nodes cluster. Some help please?


by Theophanis Kontogiannis-2

Hello all,

I finally managed to capture a log during a DRBD crash.

I have a two-node RHEL 5.3 cluster running kernel 2.6.18-164.el5xen and a self-compiled drbd-8.3.1-3.

Both nodes have a dedicated back-to-back 1 Gbit Ethernet connection over RTL8169sb/8110sb cards.

Whenever I run applications that constantly read from or write to the disks (active/active configuration), DRBD keeps crashing.

I now have logs that show the reason:


______________________
ON TWEETY1

Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed

___________________________

ON TWEETY2


Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header on sock: r=-104
Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by peer.
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm fence-peer minor-2
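
The state-change lines in the logs above pack several transitions into one message. As a reading aid, here is a minimal Python sketch that splits such a line into its fields. The regular expression and the function name `parse_transitions` are illustrative assumptions, not part of DRBD or its tools; the sample line is copied from the TWEETY1 log.

```python
import re

# Matches one "name( old -> new )" group as it appears in the log lines above.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")

def parse_transitions(line):
    """Return {field: (old, new)} for every 'field( old -> new )' group in line."""
    return {name: (old, new) for name, old, new in TRANSITION.findall(line)}

sample = ("Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) "
          "conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) "
          "susp( 0 -> 1 )")

changes = parse_transitions(sample)
for field, (old, new) in changes.items():
    print(f"{field}: {old} -> {new}")
```

On TWEETY1 the connection state goes to ProtocolError (the side that saw the failed digest), while TWEETY2 reports BrokenPipe (the side whose socket was reset).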

____________________


DRBD.CONF


#
# drbd.conf
#

global {
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate 100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/sbin/obliterate";
        pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout 60;
        degr-wfc-timeout 60;    # 1 minute
        become-primary-on both;
    }

    disk {
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size 512k;

        timeout       60;    #  6 seconds (unit = 0.1 seconds)
        connect-int   10;    # 10 seconds (unit = 1 second)
        ping-int      10;    # 10 seconds (unit = 1 second)
        ping-timeout  50;    #  5 seconds (unit = 0.1 seconds)

        max-buffers     2048;
        max-epoch-size  2048;
        ko-count 10;

        allow-two-primaries;

        cram-hmac-alg "sha1";
        shared-secret "*****";

        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        rr-conflict call-pri-lost;

        data-integrity-alg "crc32c";
    }
}

resource r0 {
    device      /dev/drbd0;
    disk        /dev/hda4;
    meta-disk   internal;

    on tweety-1 { address 10.254.254.253:7788; }
    on tweety-2 { address 10.254.254.254:7788; }
}

resource r1 {
    device      /dev/drbd1;
    disk        /dev/hdb4;
    meta-disk   internal;

    on tweety-1 { address 10.254.254.253:7789; }
    on tweety-2 { address 10.254.254.254:7789; }
}

resource r2 {
    device      /dev/drbd2;
    disk        /dev/sda1;
    meta-disk   internal;

    on tweety-1 { address 10.254.254.253:7790; }
    on tweety-2 { address 10.254.254.254:7790; }
}
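
The net section mixes options counted in tenths of a second with options counted in whole seconds, which is easy to misread. The following Python sketch only does the unit arithmetic for the four values configured above; the table `UNITS_PER_SECOND` and the helper `to_seconds` are illustrative names, not anything DRBD provides.

```python
# How many configured units make up one second, per option.
# timeout and ping-timeout are given in tenths of a second;
# connect-int and ping-int are given in whole seconds.
UNITS_PER_SECOND = {
    "timeout": 10,
    "connect-int": 1,
    "ping-int": 1,
    "ping-timeout": 10,
}

# Values taken from the net section of the drbd.conf above.
configured = {"timeout": 60, "connect-int": 10, "ping-int": 10, "ping-timeout": 50}

def to_seconds(option, value):
    """Convert a configured value to seconds using the option's unit."""
    return value / UNITS_PER_SECOND[option]

for option, value in configured.items():
    print(f"{option:13s} {value:3d} -> {to_seconds(option, value):g} s")
```

With these units, `timeout 60` is 6 seconds and `ping-timeout 50` is 5 seconds.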

_________

Also available at http://pastebin.ca/1633173


How can I solve this?

Thank you all for your time.



_______________________________________________
drbd-user mailing list
drbd-user@...
http://lists.linbit.com/mailman/listinfo/drbd-user

Re: DRBD crash on two nodes cluster. Some help please?

by Theophanis Kontogiannis-2

Hello all again.

Following up on the issue described below: with the integrity check enabled, I used to get a crash at least once every 24 hours.

I have now disabled the integrity check, and the cluster has been running without crashes for the last 9 days.

Could someone kindly offer some hints about the possible causes of this behavior?

Offloading is disabled on both dedicated gigabit NICs.

Also, is the integrity check really needed (I have read the documentation :) ) if it keeps breaking the cluster?

Thank you all for your time.

Theophanis Kontogiannis


On Tue, 2009-10-20 at 20:31 +0300, Theophanis Kontogiannis wrote:
> [full quote of the original message, logs and drbd.conf snipped]

Re: DRBD crash on two nodes cluster. Some help please?

by Lars Ellenberg

On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> Hello all again.
>
> In continuation to the below described issue, with integrity check
> enabled, I used to get a crash at least once per 24 hours.

No.
You don't get "crashes".

You configured it to fence its peer on connection loss,
and that is what it does.

> Now I have integrity check disabled and the cluster is running without
> crashes for the last 9 days.
>
> Could someone kindly provide some hints for the possible reasons  of
> this observed behavior?
>
> Off-loading is disabled on both dedicated gigabit NICs.

Either something modifies in-flight buffers,
which may or may not be intentional,
and may or may not be "safe" wrt file system data integrity.

Or you actually _do_ have data corruption.
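
Both possibilities look the same to the receiver: the checksum taken before sending no longer matches the bytes that arrive. Here is a minimal Python sketch of that effect, under loud assumptions: `send` and `receive` are toy functions, not DRBD code, and `zlib.crc32` (plain CRC-32) stands in for the crc32c algorithm named in the configuration.

```python
import zlib

def send(buffer, mutate_in_flight=False):
    """Toy sender: checksum the buffer first, then put its bytes on the wire."""
    digest = zlib.crc32(buffer)       # checksum is calculated first
    if mutate_in_flight:
        buffer[0] ^= 0xFF             # something changes the buffer afterwards
    return digest, bytes(buffer)      # digest plus what the peer really receives

def receive(digest, payload):
    """Toy receiver: recompute the checksum and compare with the one sent."""
    return zlib.crc32(payload) == digest

block = b"x" * 4096
print("untouched buffer passes:", receive(*send(bytearray(block))))
print("modified buffer passes: ", receive(*send(bytearray(block), mutate_in_flight=True)))
```

The receiver cannot tell from the mismatch alone whether the buffer was rewritten by its owner while in flight or damaged by hardware on the way.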

If DRBD detects a checksum mismatch (== data corruption,
or more generally: the data received is not the same as
it was when the checksum was calculated before it was
sent), then rather than knowingly writing diverging data,
DRBD disconnects and tries to reconnect,
hoping for the bitmap-based resync to send
"better" data this time.

On disconnect, if so configured, a primary will call its
fence-peer handler.

You configured "obliterate" as the fence-peer handler.

So it "obliterates" its peer.
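
The rule just described can be written as a small Python sketch. This is a deliberately simplified model, not DRBD's actual logic: the function `calls_fence_peer` is hypothetical, and it assumes that "so configured" means any fencing policy other than `dont-care`.

```python
# Values accepted by the "fencing" option in the disk section.
FENCING_POLICIES = ("dont-care", "resource-only", "resource-and-stonith")

def calls_fence_peer(role, fencing):
    """Simplified model: does this node run its fence-peer handler on disconnect?"""
    if fencing not in FENCING_POLICIES:
        raise ValueError(f"unknown fencing policy: {fencing}")
    return role == "Primary" and fencing != "dont-care"

# In this cluster both nodes are Primary (allow-two-primaries) and the
# disk section sets "fencing resource-and-stonith".
nodes = {"tweety-1": "Primary", "tweety-2": "Primary"}
for name, role in nodes.items():
    print(name, "calls fence-peer:", calls_fence_peer(role, "resource-and-stonith"))
```

Under this model, a disconnect in a dual-primary setup makes both nodes invoke the handler, each against the other.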

> Also is integrity-check really needed (I have read the
> documentation :) ) if it keeps on breaking the cluster?

If you would rather have silent data corruption :-)

==> Find the cause of the checksum mismatch.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed

Re: DRBD crash on two nodes cluster. Some help please?

by Theophanis Kontogiannis-2

Hello Lars and All,

Please see my replies inline below.

On Thu, 2009-10-29 at 16:53 +0100, Lars Ellenberg wrote:
> On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> > Hello all again.
> >
> > In continuation to the below described issue, with integrity check
> > enabled, I used to get a crash at least once per 24 hours.
>
> No.
> You don't get "crashes".
>
> You configured it to fence its peer on connection loss,
> and that is what it does.

That is correct in strict terminology. What I had in mind is that both nodes get fenced, so I get a "crash" in the sense of having no service.
But yes, what actually happens is that the node gets fenced.

> > Now I have integrity check disabled and the cluster is running without
> > crashes for the last 9 days.
> >
> > Could someone kindly provide some hints for the possible reasons of
> > this observed behavior?
> >
> > Off-loading is disabled on both dedicated gigabit NICs.
>
> Either something modifies in-flight buffers,
> which may or may not be intentional,
> and may or may not be "safe" wrt file system data integrity.
>
> Or you actually _do_ have data corruption.
>
> If DRBD detects a checksum mismatch (== data corruption,
> or more generally: the data received is not the same as
> it was when the checksum was calculated before it was
> sent), then rather than knowingly writing diverging data,
> DRBD disconnects and tries to reconnect,
> hoping for the bitmap-based resync to send
> "better" data this time.
>
> On disconnect, if so configured, a primary will call its
> fence-peer handler.
>
> You configured "obliterate" as the fence-peer handler.
>
> So it "obliterates" its peer.
>
> > Also is integrity-check really needed (I have read the
> > documentation :) ) if it keeps on breaking the cluster?
>
> If you would rather have silent data corruption :-)
>
> ==> Find the cause of the checksum mismatch.

Is there any way to trace the CRC error at a really low level? Can I turn on insane debugging in DRBD, or something else?
I cannot think of any good way to dig that deep.
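
One modest starting point, short of low-level debugging, is to count how often the failure message appears and when, so it can be compared against load or other events. This is a minimal Python sketch under stated assumptions: the helper `failures_per_hour` is hypothetical, and the sample lines are copied from the TWEETY1 log earlier in the thread.

```python
from collections import Counter

MARKER = "Digest integrity check FAILED"

def failures_per_hour(lines):
    """Count marker lines per hour, keyed like 'Oct 20 15' (month, day, hour)."""
    counts = Counter()
    for line in lines:
        if MARKER in line:
            month, day, clock = line.split()[:3]
            counts[f"{month} {day} {clock[:2]}"] += 1
    return counts

sample = [
    "Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.",
    "Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!",
]

for hour, count in sorted(failures_per_hour(sample).items()):
    print(hour, count)
```

To run it over a real log, pass an open file object instead of `sample`; which file holds the kernel messages depends on the syslog configuration of each node.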

Thank you all for your time.
T.K.

