NLM and CTDB recovery master node failure


NLM and CTDB recovery master node failure

by Sergey Kleyman

Hi, all

I'm trying to implement clustered Samba on my cluster file system
using Samba+CTDB (version 3.4.2). I noticed the following sentence on
the CTDB wiki page (http://wiki.samba.org/index.php/CTDB_Project):

"To become a recovery master, a node must be able to acquire an
exclusive lock on that file."

So I am wondering how CTDB deals with recovery master failure. What
happens if the node that the CTDB recovery master is running on has a
hardware failure and doesn't come back up for a very long time (or
ever)? The NLM server of the underlying clustered file system will hold
the lock until the client comes back up, which might never happen, so
the remaining nodes will not be able to select a new leader because none
of them will be able to acquire an exclusive lock. Am I missing
something?

Thank you in advance, Sergey


Re: NLM and CTDB recovery master node failure

by Volker Lendecke

On Thu, Oct 29, 2009 at 10:41:01AM +0200, Sergey Kleyman wrote:

> I'm trying to implement clustered Samba on my cluster file system by
> using Samba+CTDB (version 3.4.2). I noticed on CTDB wiki page
> (http://wiki.samba.org/index.php/CTDB_Project) the following sentence:
>
> "To become a recovery master, a node must be able to acquire an
> exclusive lock on that file."
>
> So I am wondering how CTDB deals with recovery master failure. What
> happens if the node, CTDB recovery master is running on, has hardware
> failure and doesn't come up for a very long time (or even never)? NLM
> server of the underlying clustered file system will hold the lock until
> the client comes back up which might never happen so remaining nodes
> will not be able to select a new leader because none of them will be
> able to acquire an exclusive lock. Am I missing something?

So you're saying that a node takes a lock, the node dies and
until that node comes back up, nobody will be able to take
that lock? Our assumption so far is that shared fcntl locks
behave like local fcntl locks: If a process that holds a
lock dies, then the lock is released. It should not matter
for what reason that process dies. A node being killed is a
particularly nasty death for a process, but the lock must
nevertheless be released.
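
The local behaviour described here can be checked with a short sketch.
The Python below is an illustration only, assuming a POSIX host and a
file on a local file system; the helper names are hypothetical and this
is not how ctdb itself takes the recovery lock. It forks a child that
takes an exclusive fcntl lock, kills the child, and then tries the lock
from the parent. Whether a cluster file system behaves the same way when
a whole node dies is exactly the open question in this thread.

```python
import fcntl
import os
import signal
import tempfile
import time


def try_exclusive_lock(fd):
    """Return True if a non-blocking exclusive fcntl lock was obtained."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def lock_released_after_holder_dies(path):
    """Fork a lock holder, kill it, then try to take the lock ourselves."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: take the lock, report the result, then wait to be killed.
        try:
            fd = os.open(path, os.O_RDWR)
            got = try_exclusive_lock(fd)
            os.write(ready_w, b"1" if got else b"0")
            time.sleep(60)
        finally:
            os._exit(0)
    # Parent: wait until the child has reported its lock attempt.
    child_got_lock = os.read(ready_r, 1) == b"1"
    fd = os.open(path, os.O_RDWR)
    blocked_while_alive = not try_exclusive_lock(fd)
    # Simulate a "nasty death" of the holder.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    acquired_after_death = try_exclusive_lock(fd)
    os.close(fd)
    os.close(ready_r)
    os.close(ready_w)
    return child_got_lock, blocked_while_alive, acquired_after_death


def demo():
    """Run the check against a temporary file on the local file system."""
    with tempfile.NamedTemporaryFile() as f:
        return lock_released_after_holder_dies(f.name)
```

Calling demo() on a local file system should show the child holding the
lock, the parent blocked while the child is alive, and the parent
succeeding once the child has been killed.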

You *can* run ctdb without that shared lock. But the shared
lock was there for a reason: We need to make sure that we
have the same view of cluster membership as the cluster fs
below has.

You should look at

ctdb setvar VerifyRecoveryLock 0

to work without a recovery lock. But be aware that this is
NOT recommended.

Volker



RE: NLM and CTDB recovery master node failure

by Sergey Kleyman

> -----Original Message-----
> From: Volker Lendecke [mailto:Volker.Lendecke@...]
> Sent: Thursday, October 29, 2009 11:21
> To: Sergey Kleyman
> Cc: samba-technical@...
> Subject: Re: NLM and CTDB recovery master node failure
>
> On Thu, Oct 29, 2009 at 10:41:01AM +0200, Sergey Kleyman wrote:
> > I'm trying to implement clustered Samba on my cluster file system by
> > using Samba+CTDB (version 3.4.2). I noticed on CTDB wiki page
> > (http://wiki.samba.org/index.php/CTDB_Project) the following
> > sentence:
> >
> > "To become a recovery master, a node must be able to acquire an
> > exclusive lock on that file."
> >
> > So I am wondering how CTDB deals with recovery master failure. What
> > happens if the node, CTDB recovery master is running on, has
> > hardware failure and doesn't come up for a very long time (or even
> > never)? NLM server of the underlying clustered file system will hold
> > the lock until the client comes back up which might never happen so
> > remaining nodes will not be able to select a new leader because none
> > of them will be able to acquire an exclusive lock. Am I missing
> > something?
>
> So you're saying that a node takes a lock, the node dies and until
> that node comes back up, nobody will be able to take that lock? Our
> assumption so far is that shared fcntl locks behave like local fcntl
> locks: If a process that holds a lock dies, then the lock is released.
> It should not matter for what reason that process dies. A node being
> killed is a particularly nasty death for a process, but the lock must
> nevertheless be released.
>
> You *can* run ctdb without that shared lock. But the shared lock was
> there for a reason: We need to make sure that we have the same view of
> cluster membership as the cluster fs below has.
>
> You should look at
>
> ctdb setvar VerifyRecoveryLock 0
>
> to work without a recovery lock. But be aware that this is NOT
> recommended.
>
> Volker

Thanks for the reply, but allow me to disagree that "shared fcntl locks
behave like local fcntl locks".

According to the section "Client Failure and Restart" in
http://www.opengroup.org/onlinepubs/009629799/chap9.htm#tagcjh_10:

"... the client NSM issues an SM_NOTIFY RPC to the NSM on the named
host. In this example it will issue an SM_NOTIFY to the server NSM,
including the client name and the new client state... The callback
procedure in the server NLM notes that the client state has changed and
releases all locks held on behalf of the client."

So the NLM server releases locks only when notified by the client (in
our case the NLM client in the Linux kernel), but obviously this happens
only when the node that was holding the lock comes back up. The problem
is that the NLM server has no way to distinguish between a failed client
and a client that holds a lock for a very long time. There's no
proactive heartbeat like the one CTDB has. The document even says so
explicitly (section "NSM Protocol"):

"... The NSM does not actively "probe" hosts it has been asked to
monitor; instead it waits for the monitored host to notify it that the
monitored host's status has changed (that is, crashed and rebooted). "

This is not the case for the local kernel, which can easily distinguish
between a process that has died (and so should have all its locks
released automatically) and a process that is still running and holding
a lock. Please correct me if I'm wrong.

As for your advice about running CTDB without a recovery lock: I would
obviously prefer to use the recommended configuration, but I wonder what
functionality will suffer if I run without it?

Thanks Sergey

Re: NLM and CTDB recovery master node failure

by Volker Lendecke

On Thu, Oct 29, 2009 at 04:11:01PM +0200, Sergey Kleyman wrote:

> Thanks for the reply but allow me to disagree about "shared fcntl locks
> behave like local fcntl locks"
>
> According to this
> http://www.opengroup.org/onlinepubs/009629799/chap9.htm#tagcjh_10
> "Client Failure and Restart"
>
> "... the client NSM issues an SM_NOTIFY RPC to the NSM on the named
> host. In this example it will issue an SM_NOTIFY to the server NSM,
> including the client name and the new client state... The callback
> procedure in the server NLM notes that the client state has changed and
> releases all locks held on behalf of the client."
>
> So NLM server releases locks only when notified by client (in our case
> NLM client in Linux kernel) but obviously this happens only when the
> node that was holding the lock comes back up. So the problem is that NLM
> server doesn't have an ability to distinguish between failed client and
> client that holds a lock for a very long time. There's no proactive
> heartbeat as CTDB has. The document even says so explicitly (section
> "NSM Protocol")

Ok, this is your implementation choice. The behaviour we
expect is different. We view the cluster not as a group of
NFS clients whose servers have to adhere to that standard
behaviour. In fact, in Samba we definitely do not support
re-exporting NFS imports, problems with locking being the
main reason for this.

Please use a different cluster file system that does not
exhibit this behaviour or run without the central
reclockfile.

Volker



Re: NLM and CTDB recovery master node failure

by Volker Lendecke

On Thu, Oct 29, 2009 at 04:34:14PM +0100, Volker Lendecke wrote:
> Please use a different cluster file system that does not
> exhibit this behaviour or run without the central
> reclockfile.

Ok, I've got a question: can another API on your system give us
the same result that we use the fcntl lock on the reclockfile
for?

We need to very quickly determine correct cluster membership
of all ctdb nodes: If nobody can get the reclock lock, then
we're broken. If more than one can get it, we've got a split
brain. How can we get that info reliably out of your
cluster fs without using the fcntl lock?
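
The three cases named here can be written down as a small decision
function. This is only a sketch of the rule as stated in this message:
the function name and the returned strings are hypothetical and do not
come from ctdb.

```python
def classify_reclock_outcome(lock_holders):
    """Classify cluster state from the nodes that obtained the reclock.

    lock_holders is an iterable of node identifiers that currently
    believe they hold the exclusive recovery lock.
    """
    holders = set(lock_holders)
    if not holders:
        # Nobody can get the lock: the cluster is broken.
        return "broken"
    if len(holders) == 1:
        # Exactly one holder: ctdb and the cluster fs agree on membership.
        return "consistent"
    # More than one node got an "exclusive" lock: split brain.
    return "split-brain"
```

For example, classify_reclock_outcome(["node0", "node2"]) reports a
split brain, because two nodes both believe they hold the exclusive
lock.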

Volker



RE: NLM and CTDB recovery master node failure

by Sergey Kleyman

> -----Original Message-----
> From: Volker Lendecke [mailto:Volker.Lendecke@...]
> Sent: Thursday, October 29, 2009 17:48
> To: Sergey Kleyman
> Cc: samba-technical@...
> Subject: Re: NLM and CTDB recovery master node failure
>
> On Thu, Oct 29, 2009 at 04:34:14PM +0100, Volker Lendecke wrote:
> > Please use a different cluster file system that does not exhibit
> > this behaviour or run without the central reclockfile.
>
> Ok, I've got a question: Can we achieve the same result we use the
> fcntl lock on the reclockfile for with another API on your system?
>
> We need to very quickly determine correct cluster membership of all
> ctdb nodes: If nobody can get the reclock lock, then we're broken. If
> more than one can get it, we've got a split brain. How can we get that
> info reliably out of your cluster fs without using the fcntl lock?
>
> Volker

We have an internal API implemented on top of the Spread Toolkit
(http://www.spread.org/), but our goal is to make as few changes to
Samba as possible, so changing the election code to use our API is not
the optimal solution. I guess it will be easier to adhere to Samba's
assumptions about NLM and provide automatic lock clean-up in case of
node failure. Are you sure that GPFS and/or GFS have this capability?

As a side note: if I understand you correctly, CTDB is assumed to be
running on the same machines as the underlying file system. I was under
the impression that it's possible to run the file system on machines A
and B while Samba+CTDB runs on different machines C and D that see the
clustered file system through NFS mounts, in which case C and D are just
NLM clients of the file system.

One more point I wanted to inquire about: if an smbd daemon dies for
some reason (abnormal exit, panic, etc.), what happens to the CIFS locks
it was holding? Are those locks automatically cleaned up?

Thanks, Sergey

Re: NLM and CTDB recovery master node failure

by Volker Lendecke

On Thu, Oct 29, 2009 at 09:20:30PM +0200, Sergey Kleyman wrote:
> We have our internal API that are implemented on top of Spread Toolkit
> (http://www.spread.org/) but our goal is to make as less changes to
> Samba as possible so changing election code to use our API is not the
> optimal solution. I guess it'll be easier to adhere to Samba's
> assumptions about NLM and provide automatic lock clean-up in case of the
> node failure. Are you sure that GPFS and/or GFS have this capability?

I haven't tested it myself, but this is a basic assumption
in ctdb. Tridge might answer this authoritatively.

> As a side note: if I understand you correctly CTDB is assumed to be
> running on the same machines as underlying file system. I was under the
> impression that it's possible to run file system on machines A and B,
> while Samba+CTDB will run on different machines C and D that will see
> clustered file system through NFS mounts in which case C and D are just
> NLM clients to the file system.

Why would you want to do that? Going through the network
twice is a very bad idea for performance. And as I said, the
fcntl locking problems plus very frequent client lockups due
to buggy NFS clients under CIFS load really tell us that you
are asking for more trouble than you will appreciate.

> One more point I wanted to inquire about: if smbd daemons dies for some
> reason (abnormal exit - panic, etc.) what happens to CIFS locks it was
> holding? Are those locks automatically cleaned up?

They are cleaned up. Look for example at the for-loop in
source3/locking/locking.c:650ff in current master. We also
send immediate retry messages to all processes in case the
parent smbd detects a child has died.
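
The general idea of that kind of clean-up can be sketched as follows.
This is not the code in source3/locking/locking.c; it is a simplified
illustration, and the entry layout and function names are hypothetical.
It assumes a POSIX host and treats each lock entry as belonging to one
process ID.

```python
import os


def process_exists(pid):
    """Return True if a process with this PID currently exists."""
    try:
        # Signal 0 performs the existence check without sending a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def prune_dead_lock_entries(entries):
    """Keep only the lock entries whose owning process is still alive."""
    return [entry for entry in entries if process_exists(entry["pid"])]
```

A caller would run prune_dead_lock_entries() over the entries for a file
before deciding whether a new lock request conflicts with an existing
one.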

Volker



Re: NLM and CTDB recovery master node failure

by ronnie sahlberg

On Fri, Oct 30, 2009 at 6:20 AM, Sergey Kleyman
<Sergey.Kleyman@...> wrote:

>> -----Original Message-----
>> From: Volker Lendecke [mailto:Volker.Lendecke@...]
>> Sent: Thursday, October 29, 2009 17:48
>> To: Sergey Kleyman
>> Cc: samba-technical@...
>> Subject: Re: NLM and CTDB recovery master node failure
>>
>> On Thu, Oct 29, 2009 at 04:34:14PM +0100, Volker Lendecke wrote:
>> > Please use a different cluster file system that does not exhibit
>> > this behaviour or run without the central reclockfile.
>>
>> Ok, I've got a question: Can we achieve the same result we use the
>> fcntl lock on the reclockfile for with another API on your system?
>>
>> We need to very quickly determine correct cluster membership of all
>> ctdb nodes: If nobody can get the reclock lock, then we're broken. If
>> more than one can get it, we've got a split brain. How can we get that
>> info reliably out of your cluster fs without using the fcntl lock?
>>
>> Volker
>
> We have our internal API that are implemented on top of Spread Toolkit
> (http://www.spread.org/) but our goal is to make as less changes to
> Samba as possible so changing election code to use our API is not the
> optimal solution. I guess it'll be easier to adhere to Samba's
> assumptions about NLM and provide automatic lock clean-up in case of the
> node failure. Are you sure that GPFS and/or GFS have this capability?

Yes. Locks and open files need to be recovered by the cluster
filesystem very promptly anyway, since if an I/O is blocked for 40
seconds or more you are very likely to cause the redirector to time
out, with data corruption as a result.


>
> As a side note: if I understand you correctly CTDB is assumed to be
> running on the same machines as underlying file system. I was under the
> impression that it's possible to run file system on machines A and B,
> while Samba+CTDB will run on different machines C and D that will see
> clustered file system through NFS mounts in which case C and D are just
> NLM clients to the file system.

Do not re-export NFS; bad things happen, which is why knfsd, for
example, refuses to re-export NFS shares.
Also, do not use NFS for locking, or to store the reclock file.
NFS file locking in v2/v3 is very unreliable and will break things.


Instead, if you need split-brain protection but cannot use
open()/fcntl() on a reclock file because of your cluster filesystem's
semantics, you have two options. You can run without a reclockfile,
which opens up the possibility of split brain, so it is probably
sub-optimal. Or you can replace the recovery lock with a different
mechanism that uses some other type of shared resource as arbitrator,
which should be reasonably easy.

Most of what you need would be to replace ctdb_recovery_lock() with an
alternative function that uses something else.
Perhaps have a shared, dedicated SCSI device and use persistent
reservations? That would be useful.
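
As a rough picture of what such a replacement might look like, here is a
sketch of a pluggable arbitrator. Everything in it is hypothetical: the
class, the method names and the election helper are illustrations, not
ctdb interfaces, and the in-memory arbitrator merely stands in for a
real shared resource such as a reclock file or a SCSI persistent
reservation.

```python
import threading


class InMemoryArbitrator:
    """Stand-in for a shared resource that at most one node can hold."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._holder = None

    def try_acquire(self, node):
        """Return True if this node now holds the resource."""
        with self._mutex:
            if self._holder is None or self._holder == node:
                self._holder = node
                return True
            return False

    def release(self, node):
        """Give up the resource if this node holds it."""
        with self._mutex:
            if self._holder == node:
                self._holder = None


def elect_recovery_master(candidates, arbitrator):
    """Return the first candidate that wins the arbitrator, else None."""
    for node in candidates:
        if arbitrator.try_acquire(node):
            return node
    return None
```

The point of the design is that the election code only depends on
try_acquire() and release(), so the mechanism behind them can be
swapped. A real arbitrator would also need a way to revoke the resource
from a dead holder, which is the hard part discussed in this thread.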


(Just don't use NFS. NFS file locking is broken by design, so this
will cause more problems than it is worth.)


>
> One more point I wanted to inquire about: if smbd daemons dies for some
> reason (abnormal exit - panic, etc.) what happens to CIFS locks it was
> holding? Are those locks automatically cleaned up?
>
> Thanks, Sergey
>