nss-ldap timeouts when used with nscd and gnutls

by Douglas E. Engert

We have seen a number of issues with nss-ldap when going from
Ubuntu Dapper to Ubuntu Hardy. (Intrepid has shown similar problems.)
Dapper clients, and Solaris 9 and 10 clients using Sun's nss ldap, work
fine with our two LDAP servers.

Hardy, based on nss-ldap_258, has the problems. The code for 260
and 264 appears to have the same problems.

First problem:

The /etc/ldap.conf file implies the default for timeout is 30 seconds,
but it is unlimited in the code. This has caused nscd to lock up: it
keeps accepting requests while all its worker threads wait on the
nss-ldap lock, and one thread sits in ldap_result waiting for the
response. netstat -a shows the connection is in CLOSE_WAIT. The systems
keep running slowly, as each caller of nscd times out waiting on nscd,
then goes ahead and does the LDAP request itself. Nscd uses one file
descriptor for each request, eventually runs out of file descriptors,
and starts using 100% CPU.

Setting timeout 30 at least helps get out of this situation.
Suggestion: in util.c: set result->ldc_timelimit = 30;  (See attachment)
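
To illustrate why the default matters: the OpenLDAP ldap_result() call
takes a struct timeval pointer, and a NULL pointer means "block until a
result arrives". The sketch below is not nss_ldap code. It assumes the
configured limit is turned into that pointer in roughly this way, and
uses select() on a pipe as a stand-in for the LDAP socket. NO_LIMIT,
limit_to_timeval, wait_readable and demo_bounded_wait are hypothetical
names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

#define NO_LIMIT 0		/* hypothetical stand-in for "no time limit" */

/* Convert a limit in seconds to a timeval; NULL means "wait forever". */
static struct timeval *
limit_to_timeval (int timelimit, struct timeval *tv)
{
  assert (tv != NULL);
  if (timelimit == NO_LIMIT)
    return NULL;
  tv->tv_sec = timelimit;
  tv->tv_usec = 0;
  return tv;
}

/* Wait for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
static int
wait_readable (int fd, int timelimit)
{
  struct timeval tv;
  fd_set readfds;
  int rc;

  FD_ZERO (&readfds);
  FD_SET (fd, &readfds);
  rc = select (fd + 1, &readfds, NULL, NULL,
	       limit_to_timeval (timelimit, &tv));
  return rc > 0 ? 1 : rc;
}

/* Nothing is ever written to the pipe, so this models a server that
   never answers. With a 1 second limit the wait returns 0 (timeout);
   with NO_LIMIT it would never return. */
static int
demo_bounded_wait (void)
{
  int fds[2];
  int rc;

  if (pipe (fds) != 0)
    return -1;
  rc = wait_readable (fds[0], 1);
  close (fds[0]);
  close (fds[1]);
  return rc;
}
```

With a finite limit the waiting thread eventually gets control back and
can release the lock the other nscd workers are queued on; with no limit
it holds that lock for as long as the dead connection stays silent.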

Second problem:

In ldap-nss.c, if do_result gets a timeout (or error), it writes to
syslog: "nss_ldap: could not get LDAP result" and sets stat = NSS_UNAVAIL;

But the __session.ls_state is still set to LS_CONNECTED_TO_DSA
and the next operation tries to use the same connection which will also
time out.

Suggestion: in ldap-nss.c (see attachment)
Add a call to do_close() in the two places where do_result gets a timeout
or other connection error. This change will cause the next request to
reconnect. It may take 30 seconds, but the new connection will not time
out again.
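
The effect of that change can be modelled with a small state machine.
This is a simplified sketch, not the nss_ldap source: struct session,
session_open, session_close and session_lookup are hypothetical names,
and link_dead simulates a peer that has gone away without the client
noticing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified session model; not nss_ldap's own types. */
enum session_state { STATE_UNINITIALIZED, STATE_CONNECTED };

struct session
{
  enum session_state state;
  int connects;			/* how many times a connection was opened */
  int link_dead;		/* 1 if the current connection is stale */
};

static void
session_open (struct session *s)
{
  s->state = STATE_CONNECTED;
  s->connects++;
  s->link_dead = 0;		/* assume a fresh connection works */
}

static void
session_close (struct session *s)
{
  s->state = STATE_UNINITIALIZED;
}

/* Perform one lookup. Returns 0 on success, -1 on failure.
   close_on_error = 0 models the current behavior,
   close_on_error = 1 models the proposed one. */
static int
session_lookup (struct session *s, int close_on_error)
{
  assert (s != NULL);
  if (s->state != STATE_CONNECTED)
    session_open (s);
  if (s->link_dead)
    {
      /* The request in flight fails either way. */
      if (close_on_error)
	session_close (s);	/* force a reconnect next time */
      return -1;
    }
  return 0;
}
```

Without the close, every later lookup reuses the stale connection and
fails again. With it, only the request that hit the error is lost and
the following one opens a new connection.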


These problems may be related to the Ubuntu conversion from using OpenSSL
to using GnuTLS. It may be that OpenSSL or GnuTLS fails to shut down the
connection correctly, or fails to tell ldap_search that the connection is
down.

In any case, if do_result fails with some timeout or connection problem,
the conservative thing to do is to go through do_with_reconnect and try
a different server.

Has anyone seen any similar problems?

What we are testing now is using the Intrepid version of nss-ldap based on
260 on Hardy with the attached changes.

Packages being used:
      libnss-ldap     260-1ubuntu2-dee1   (-dee1 has my changes)
      libldap-2.4-2   2.4.9-0ubuntu0.8.04.2
      libgnutls13     2.0.4-1ubuntu2.3
      nscd            2.7-10ubuntu4

--

  Douglas E. Engert  <DEEngert@...>
  Argonne National Laboratory
  9700 South Cass Avenue
  Argonne, Illinois  60439
  (630) 252-5444

diff -u -r nss_ldap-260/ldap-nss.c nss_ldap-260-dee1/ldap-nss.c
--- nss_ldap-260/ldap-nss.c 2009-04-15 10:13:08.000000000 -0500
+++ nss_ldap-260-dee1/ldap-nss.c 2009-04-20 14:32:28.000000000 -0500
@@ -1577,6 +1577,7 @@
  }
       else
  {
+  syslog (LOG_ERR, "nss_ldap: do_open: do_start_tls failed:stat=%d", stat);
   do_close ();
   debug ("<== do_open (TLS startup failed)");
   return stat;
@@ -2472,6 +2473,7 @@
 #endif /* LDAP_OPT_ERROR_NUMBER */
   syslog (LOG_AUTHPRIV | LOG_ERR, "nss_ldap: could not get LDAP result - %s",
   ldap_err2string (rc));
+  do_close();
   stat = NSS_UNAVAIL;
   break;
  case LDAP_RES_SEARCH_ENTRY:
@@ -2507,6 +2509,7 @@
   syslog (LOG_AUTHPRIV | LOG_ERR,
   "nss_ldap: could not get LDAP result - %s",
   ldap_err2string (rc));
+  do_close();
  }
       else if (resultControls != NULL)
  {
Only in nss_ldap-260-dee1: ldap-nss.o
Binary files nss_ldap-260/nss_ldap.so and nss_ldap-260-dee1/nss_ldap.so differ
diff -u -r nss_ldap-260/util.c nss_ldap-260-dee1/util.c
--- nss_ldap-260/util.c 2008-03-04 04:05:12.000000000 -0600
+++ nss_ldap-260-dee1/util.c 2009-04-15 12:40:26.000000000 -0500
@@ -625,7 +625,7 @@
 #else
   result->ldc_version = LDAP_VERSION2;
 #endif /* LDAP_VERSION3 */
-  result->ldc_timelimit = LDAP_NO_LIMIT;
+  result->ldc_timelimit = 30;  
   result->ldc_bind_timelimit = 30;
   result->ldc_ssl_on = SSL_OFF;
   result->ldc_sslpath = NULL;
Only in nss_ldap-260-dee1: util.o

Re: nss-ldap timeouts when used with nscd and gnutls

by Howard Chu

Douglas E. Engert wrote:
> We have seen a number of issues with nss-ldap when going from
> Ubuntu Dapper to Ubuntu Hardy. (Intrepid has shown similar problems.)
> Dapper clients, and Solaris 9 and 10 clients using Sun's nss ldap, work
> fine with our two LDAP servers.
>
> Hardy, based on nss-ldap_258, has the problems. The code for 260
> and 264 appears to have the same problems.

Your analysis makes sense to me. But at the moment I'm no longer interested in
nss-ldap since nss-ldapd ( + slapd nssov) works better and offers easier
administration.

--
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

Re: nss-ldap timeouts when used with nscd and gnutls

by Douglas E. Engert

Howard Chu wrote:

> Douglas E. Engert wrote:
>> We have seen a number of issues with nss-ldap when going from
>> Ubuntu Dapper to Ubuntu Hardy. (Intrepid has shown similar problems.)
>> Dapper clients, and Solaris 9 and 10 clients using Sun's nss ldap, work
>> fine with our two LDAP servers.
>>
>> Hardy, based on nss-ldap_258, has the problems. The code for 260
>> and 264 appears to have the same problems.
>
> Your analysis makes sense to me. But at the moment I'm no longer
> interested in nss-ldap since nss-ldapd ( + slapd nssov) works better and
> offers easier administration.

Sounds interesting, but we are trying to stick with what is offered by Ubuntu.

--

  Douglas E. Engert  <DEEngert@...>
  Argonne National Laboratory
  9700 South Cass Avenue
  Argonne, Illinois  60439
  (630) 252-5444

Re: nss-ldap timeouts when used with nscd and gnutls

by Arthur de Jong

On Tue, 2009-04-21 at 15:22 -0500, Douglas E. Engert wrote:
> > Your analysis makes sense to me. But at the moment I'm no longer
> > interested in nss-ldap since nss-ldapd ( + slapd nssov) works better
> > and offers easier administration.
>
> Sounds interesting, but we are trying to stick with what is offered by
> Ubuntu.

FWIW some releases of Ubuntu have nss-ldapd (libnss-ldapd) but I would
avoid version 0.5. The 0.6.7 release is known to work quite well and is
included in Debian stable. There is however no packaged version of the
nssov in slapd as far as I know (but you can use nss-ldapd without it).

Since we're working hard on a PAM module (actually Howard Chu is doing
all the hard work at the moment) as a side effect we may also make it
more easily possible to use the nss-ldapd NSS module together with a
packaged slapd-nssov package (if such a package would be made).

(it's a bit awkward to post a more or less nss-ldapd promotional message
on the nss_ldap list)

--
-- arthur - arthur@... - http://ch.tudelft.nl/~arthur --


Re: nss-ldap timeouts when used with nscd and gnutls

by Douglas E. Engert

Arthur de Jong wrote:

> On Tue, 2009-04-21 at 15:22 -0500, Douglas E. Engert wrote:
>>> Your analysis makes sense to me. But at the moment I'm no longer
>>> interested in nss-ldap since nss-ldapd ( + slapd nssov) works better
>>> and offers easier administration.
>> Sounds interesting, but we are trying to stick with what is offered by
>> Ubuntu.
>
> FWIW some releases of Ubuntu have nss-ldapd (libnss-ldapd) but I would
> avoid version 0.5. The 0.6.7 release is known to work quite well and is
> included in Debian stable. There is however no packaged version of the
> nssov in slapd as far as I know (but you can use nss-ldapd without it).

Thanks, we will have to look at that.

I did see in the archives that Howard Wilkinson, on Dec 9, 2008 in
"Mega patch against nss_ldap 264", said:

"My intention with this is to make the critical path through the code run
  the minimal code when a connection to the LDAP server exists, make
  recovery on failure more resilient, and provide for multiple SASL mechs
  without need to alter the ldap-nss code."

If it handles the cases where do_result fails, so that timeout and
connection errors cause a reconnect to any available server, that may fix
the issue I have seen.

--

  Douglas E. Engert  <DEEngert@...>
  Argonne National Laboratory
  9700 South Cass Avenue
  Argonne, Illinois  60439
  (630) 252-5444

Re: nss-ldap timeouts when used with nscd and gnutls

by Howard Wilkinson

Douglas E. Engert wrote:

>
>
> Arthur de Jong wrote:
>> On Tue, 2009-04-21 at 15:22 -0500, Douglas E. Engert wrote:
>>>> Your analysis makes sense to me. But at the moment I'm no longer
>>>> interested in nss-ldap since nss-ldapd ( + slapd nssov) works better
>>>> and offers easier administration.
>>> Sounds interesting, but we are trying to stick with what is offered by
>>> Ubuntu.
>>
>> FWIW some releases of Ubuntu have nss-ldapd (libnss-ldapd) but I would
>> avoid version 0.5. The 0.6.7 release is known to work quite well and is
>> included in Debian stable. There is however no packaged version of the
>> nssov in slapd as far as I know (but you can use nss-ldapd without it).
>
> Thanks, we will have to look at that.
>
> I did see in the archives that Howard Wilkinson on Dec 9, 2008
> "Mega patch against nss_ldap 264" said:
>
> "My intention with this is to make the critical path through the code run
>  the minimal code when a connection to the LDAP server exists, make
>  recovery on failure more resilient, and provide for multiple SASL mechs
>  without need to alter the ldap-nss code."
>
Yes, I said this, but I have yet to finish this piece of code. What I have
done runs better than it did before, but it does not address some of the
stability issues I found.

You will need to apply the patch and see how you get on. I am hoping to
find time next month to revisit this, but as I am having trouble finding
paying work (as most of the UK seems to be) this may slip if somebody
finds something else for me to do.

The major piece of work that is needed, apart from fixing my patch to be
style compatible with the rest of nss_ldap, is to remove some recursion
from the code that breaks if the underlying connection to the LDAP
server disconnects. This needs to be replaced with a list walking
operation so that the reconnects can recover and continue if the remote
server has gone away. I forget which piece of code this is, but I think
it was in the groups generation operation.
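
As a rough illustration of the "list walking" idea (this is not code from
the patch, and every name in it is hypothetical): the work still to be
done is kept in an explicit pending list rather than on the call stack,
so a failed fetch can be retried after a reconnect without losing track
of where the walk was. The nested[] table stands in for nested group
membership on the server, and drop_at_fetch simulates one disconnect.

```c
#include <assert.h>

#define NGROUPS 5
#define NO_GROUP (-1)

/* Hypothetical data: each group nests up to two other groups. */
static const int nested[NGROUPS][2] = {
  {1, 2}, {3, NO_GROUP}, {3, 4}, {NO_GROUP, NO_GROUP}, {NO_GROUP, NO_GROUP}
};

static int connection_up = 1;
static int fetches = 0;
static int drop_at_fetch = 2;	/* simulate a disconnect on this fetch */

/* Simulated server query. Returns 0 on success, -1 if the link is down. */
static int
fetch_nested (int group, const int **children)
{
  if (!connection_up)
    return -1;
  if (++fetches == drop_at_fetch)
    {
      connection_up = 0;
      return -1;
    }
  *children = nested[group];
  return 0;
}

static void
reconnect (void)
{
  connection_up = 1;
}

/* Visit every group reachable from root. Returns the number of distinct
   groups visited, or -1 if a fetch still fails after one reconnect. */
static int
expand_groups (int root)
{
  int pending[NGROUPS];		/* each group is queued at most once */
  int seen[NGROUPS] = { 0 };
  int npending = 0, visited = 0, i;

  assert (root >= 0 && root < NGROUPS);
  pending[npending++] = root;
  seen[root] = 1;
  while (npending > 0)
    {
      int group = pending[npending - 1];	/* peek, do not pop yet */
      const int *children;

      if (fetch_nested (group, &children) != 0)
	{
	  reconnect ();		/* the pending list survives the reconnect */
	  if (fetch_nested (group, &children) != 0)
	    return -1;
	}
      npending--;
      visited++;
      for (i = 0; i < 2; i++)
	if (children[i] != NO_GROUP && !seen[children[i]])
	  {
	    seen[children[i]] = 1;
	    pending[npending++] = children[i];
	  }
    }
  return visited;
}
```

Because an entry is only removed from the list after its fetch succeeds,
the simulated disconnect costs one retry and the walk still reaches all
five groups.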

> If it handles the cases where do_result fails, and timeout and connection
> errors reconnect to any server that may fix the issue I have seen.
>
>>
>> Since we're working hard on a PAM module (actually Howard Chu is doing
>> all the hard work at the moment) as a side effect we may also make it
>> more easily possible to use the nss-ldapd NSS module together with a
>> packaged slapd-nssov package (if such a package would be made).
>>
>> (it's a bit awkward to post a more or less nss-ldapd promotional message
>> on the nss_ldap list)
>>
>
I had intended to get the nss_ldap work finished and then look at
porting the functionality into the nss-ldapd environment. But again time
has not been on my side.

If I can help then please feel free to ping me.

Howard.



Re: nss-ldap timeouts when used with nscd and gnutls

by Douglas E. Engert

Howard Wilkinson wrote:

> Douglas E. Engert wrote:
>>
>>
>> Arthur de Jong wrote:
>>> On Tue, 2009-04-21 at 15:22 -0500, Douglas E. Engert wrote:
>>>>> Your analysis makes sense to me. But at the moment I'm no longer
>>>>> interested in nss-ldap since nss-ldapd ( + slapd nssov) works better
>>>>> and offers easier administration.
>>>> Sounds interesting, but we are trying to stick with what is offered by
>>>> Ubuntu.
>>>
>>> FWIW some releases of Ubuntu have nss-ldapd (libnss-ldapd) but I would
>>> avoid version 0.5. The 0.6.7 release is known to work quite well and is
>>> included in Debian stable. There is however no packaged version of the
>>> nssov in slapd as far as I know (but you can use nss-ldapd without it).
>>
>> Thanks, we will have to look at that.
>>
>> I did see in the archives that Howard Wilkinson on Dec 9, 2008
>> "Mega patch against nss_ldap 264" said:
>>
>> "My intention with this is to make the critical path through the code run
>>  the minimal code when a connection to the LDAP server exists, make
>>  recovery on failure more resilient, and provide for multiple SASL mechs
>>  without need to alter the ldap-nss code."
>>
> Yes I said this but I have yet to finish this piece of code. What I have
> done runs better than it did before but it does not address some of the
> stability issues I found.
>
> You will need to apply the patch and see how you get on. I am hoping to
> find time next month to revisit this, but as I am having trouble finding
> paying work (as most of the UK seems to be) this may slip if somebody
> finds something else for me to do.
>

OK, I was not sure where this major modification stood.

> The major piece of work that is needed, apart from fixing my patch to be
> style compatible with the rest of nss_ldap, is to remove some recursion
> from the code that breaks if the underlying connection to the LDAP
> disconnects. This needs to be replaced with a list walking operation so
> that the reconnects can recover and continue if the remote server has
> gone away. I forget which piece of code this is, but I think it was in
> the groups generation operation.

Your change may address the two bugs I turned in today, #391 and #392.
If so, that would be great. I was hoping to get #392 into the code upstream
of Debian and Ubuntu so they would pick it up. The #392 change really
just adds two calls to do_close() when a connection has an error or times
out.

This is not a perfect fix, as the active request may still fail. But what
we see is that nscd stops working, and a caller such as sshd, cron, or ls
detects that nscd is not working and makes its calls to LDAP directly,
bypassing nscd. So nothing appears to fail, but an ls can take 15 seconds,
or a login 30 seconds more than expected.

--

  Douglas E. Engert  <DEEngert@...>
  Argonne National Laboratory
  9700 South Cass Avenue
  Argonne, Illinois  60439
  (630) 252-5444