Primary/Diskless node cannot reconnect

If you are in a situation where one node is Primary/Diskless and the
other Secondary/UpToDate, DRBD works just fine, redirecting the I/O to
the secondary node. However, if you then lose the network connection for
any reason (say, some sort of transient issue), DRBD will not allow the
connection to be re-established even though nothing has actually
changed.

When the connect request comes in, you see this trace message:

    Can only connect to data with current UUID=XXXXX

Which is output by receive_uuids in drbd_receiver.c. I understand why
this check should be made _if_ you actually have a local disk on the
Primary, but if you are Diskless I think it is not necessary and results
in unnecessary problems. My proposed fix is to add an explicit test for
Diskless in the if statement so that it only does the check for current
UUID if the disk state is > Diskless.

Proposed patch attached against 8.2.

Simon
_______________________________________________
drbd-dev mailing list
drbd-dev@...
http://lists.linbit.com/mailman/listinfo/drbd-dev
Re: Primary/Diskless node cannot reconnect

On Sun, Nov 01, 2009 at 05:25:44PM -0500, Graham, Simon wrote:
> If you are in a situation where one node is Primary/Diskless and the
> other Secondary/UpToDate, DRBD works just fine, redirecting the I/O to
> the secondary node. However, if you then lose the network connection for
> any reason (say, some sort of transient issue), DRBD will not allow the
> connection to be re-established even though nothing has actually
> changed.
>
> When the connect request comes in, you see this trace message:
>
>     Can only connect to data with current UUID=XXXXX
>
> Which is output by receive_uuids in drbd_receiver.c. I understand why
> this check should be made _if_ you actually have a local disk on the
> Primary, but if you are Diskless I think it is not necessary and results
> in unnecessary problems. My proposed fix is to add an explicit test for
> Diskless in the if statement so that it only does the check for current
> UUID if the disk state is > Diskless.
>
> Proposed patch attached against 8.2.

8.2 is dead.

has been fixed differently in 8.3 already,
where the corresponding code looks like

    if (mdev->state.conn < C_CONNECTED &&
        mdev->state.disk < D_INCONSISTENT &&
        mdev->state.role == R_PRIMARY &&
        (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] & ~((u64)1))) {

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: Primary/Diskless node cannot reconnect

> 8.2 is dead.

Hmmm... it hasn't stopped moving yet... are you saying you won't make
any more fixes to it?

> has been fixed differently in 8.3 already,
> where the corresponding code looks like
>
>     if (mdev->state.conn < C_CONNECTED &&
>         mdev->state.disk < D_INCONSISTENT &&
>         mdev->state.role == R_PRIMARY &&
>         (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] &
>         ~((u64)1))) {
>

I must admit I haven't looked at 8.3 in any detail yet but that code you
quote looks suspiciously like the 8.2 code to me -- D_DISKLESS is still
a value less than D_INCONSISTENT...

Shouldn't this be:

    if (mdev->state.conn < C_CONNECTED &&
        mdev->state.disk > D_DISKLESS &&
        mdev->state.disk < D_INCONSISTENT &&
        mdev->state.role == R_PRIMARY &&
        (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] & ~((u64)1))) {

To fix this same issue???

Simon
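Editorial sketch: the two conditions differ only in the disk-state clause, so that clause can be compared in isolation. The enum below is a stand-in. That D_DISKLESS sorts below D_INCONSISTENT comes from the message above, while the names and order of the intermediate states are assumptions made for illustration. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the DRBD disk states up to D_INCONSISTENT. The names
 * and order of the intermediate states are assumptions. */
enum disk_state {
    D_DISKLESS,
    D_ATTACHING,
    D_FAILED,
    D_NEGOTIATING,
    D_INCONSISTENT
};

/* Disk clause as quoted from 8.3. */
static bool disk_clause_quoted(enum disk_state d)
{
    return d < D_INCONSISTENT;
}

/* Disk clause with the additional "> D_DISKLESS" test. */
static bool disk_clause_proposed(enum disk_state d)
{
    return d > D_DISKLESS && d < D_INCONSISTENT;
}

/* Show where the two clauses agree and where they differ. */
static void compare_clauses(void)
{
    /* Diskless: the quoted clause applies the UUID check, the proposal skips it. */
    assert(disk_clause_quoted(D_DISKLESS));
    assert(!disk_clause_proposed(D_DISKLESS));

    /* States strictly between Diskless and Inconsistent: both apply it. */
    assert(disk_clause_quoted(D_FAILED) && disk_clause_proposed(D_FAILED));

    /* Inconsistent and above: neither applies it. */
    assert(!disk_clause_quoted(D_INCONSISTENT));
    assert(!disk_clause_proposed(D_INCONSISTENT));
}
```

Under these assumptions the proposed clause exempts exactly one state, Diskless, from the UUID comparison and leaves every other case unchanged.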
Re: Primary/Diskless node cannot reconnect

On Mon, Nov 02, 2009 at 10:47:54PM -0500, Graham, Simon wrote:
> > 8.2 is dead.
>
> Hmmm... it hasn't stopped moving yet... are you saying you won't make
> any more fixes to it?

Yes.

If we change something on the 8.0 branch, we sometimes still merge it
through the 8.2 branch first, as that helps in merging it into the 8.3
one, because of all the whitespace changes and constant renames ...

But if we fix something on 8.3, which would be relevant for 8.2,
we don't much care to merge it back.

"8.2.8" was officially 8.3.0, and no more 8.2 will happen.

> > has been fixed differently in 8.3 already,
> > where the corresponding code looks like
> >
> >     if (mdev->state.conn < C_CONNECTED &&
> >         mdev->state.disk < D_INCONSISTENT &&
> >         mdev->state.role == R_PRIMARY &&
> >         (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] &
> >         ~((u64)1))) {
> >
>
> I must admit I haven't looked at 8.3 in any detail yet but that code you
> quote looks suspiciously like the 8.2 code to me -- D_DISKLESS is still
> a value less than D_INCONSISTENT...
>
> Shouldn't this be:
>
>     if (mdev->state.conn < C_CONNECTED &&
>         mdev->state.disk > D_DISKLESS &&
>         mdev->state.disk < D_INCONSISTENT &&
>         mdev->state.role == R_PRIMARY &&
>         (mdev->ed_uuid & ~((u64)1)) != (p_uuid[UI_CURRENT] &
>         ~((u64)1))) {
>
> To fix this same issue???

No.
The correct fix for your problem probably is not only this,
but some addition to the "exposed data uuid" stuff as well.

Because it is Primary, there may be cached pages,
file system and applications usually have a rough idea
what data they expect to live where.

What this is supposed to do is avoid a timewarp into stale data,
if you lose network first, hum along for hours,
and then lose the disk as well.

Or vice versa.

You are then only allowed to attach or connect to the
data you had last access to, not to the other set,
as the other set would mean a time warp into stale data.

--
: Lars Ellenberg
Re: Primary/Diskless node cannot reconnect

>
> No.
> The correct fix for your problem probably is not only this,
> but some addition to the "exposed data uuid" stuff as well.
>
> Because it is Primary, there may be cached pages,
> file system and applications usually have a rough idea
> what data they expect to live where.
>
> What this is supposed to do is avoid a timewarp into stale data,
> if you lose network first, hum along for hours,
> and then lose the disk as well.
>
> Or vice versa.
>
> You are then only allowed to attach or connect to the
> data you had last access to, not to the other set,
> as the other set would mean a time warp into stale data.
>

Good point -- if you lose the network first then I agree. However, if
you lose the primary side disk first then I don't think you can hit this
'time warp'.

My first thought when looking at this was to NOT attempt to update the
current UUID on the Primary if it is diskless when you lose the
connection - however, this doesn't work in the specific case that caused
us to see this problem -- in that case, we had a DRBD device sitting on
a physical disk which had actually gone bad; however, we didn't see this
until we tried to write the meta-data with the updated UUID when we lost
the network connection...

Maybe we just need to back out the UUID update if you can't flush it to
disk...

Simon
Re: Primary/Diskless node cannot reconnect

On Tue, Nov 03, 2009 at 07:40:44AM -0500, Graham, Simon wrote:
> >
> > No.
> > The correct fix for your problem probably is not only this,
> > but some addition to the "exposed data uuid" stuff as well.
> >
> > Because it is Primary, there may be cached pages,
> > file system and applications usually have a rough idea
> > what data they expect to live where.
> >
> > What this is supposed to do is avoid a timewarp into stale data,
> > if you lose network first, hum along for hours,
> > and then lose the disk as well.
> >
> > Or vice versa.
> >
> > You are then only allowed to attach or connect to the
> > data you had last access to, not to the other set,
> > as the other set would mean a time warp into stale data.
> >
>
> Good point -- if you lose the network first then I agree. However, if
> you lose the primary side disk first then I don't think you can hit this
> 'time warp'.

Sure you can.

You could first lose the disk, then lose the link,
and then admin tries to attach the disk.
And the latter now needs to fail.

If the "exposed data uuid" (mdev->ed_uuid) does not match the
"to be connected to" uuid, or the "to be attached" uuid, respectively,
connecting or attaching is refused.

Which is what that check does, or at least is supposed to do.

> My first thought when looking at this was to NOT attempt to update the
> current UUID on the Primary if it is diskless when you lose the
> connection - however, this doesn't work in the specific case that caused
> us to see this problem -- in that case, we had a DRBD device sitting on
> a physical disk which had actually gone bad; however, we didn't see this
> until we tried to write the meta-data with the updated UUID when we lost
> the network connection...
>
> Maybe we just need to back out the UUID update if you can't flush it to
> disk...

Please try to reproduce whatever issue you have had with drbd-8.3.5.
I was under the impression all combinations of how things can go wrong
here would have been exercised and found to work.

--
: Lars Ellenberg