Multiple LDAP servers, single URI, server shutting down, hangs or fails!
We have what I think of as a 'standard' mixed environment, and
everything works under normal operation, BUT when one of our LDAP servers
is shutting down we get failures. I think this is a shortcoming in the
OpenLDAP library's handling of the 'uri' setting, but I would like more
information and wondered if anybody can shed some light on this. I
have traced through the library code to the 'ldap_connect_to_host'
routine in os-ip.c and think this is where the problem arises, but I
have no direct evidence.
Our set-up is as follows. We use Active Directory as our LDAP/KDC
supplier (these are Win2K3R2 boxes, but I have seen this with other
flavours). In this particular environment we have 2 servers, both fairly
lightly loaded most of the time. However, one of these servers runs
Exchange 2000 and, when shutting down, can take up to 25 minutes to get to
the point where the network interface stops responding to pings.
The Unix side is configured with nss_ldap (264 + my Kerberos patches)
and uses Kerberos SASL connections to the LDAP service under AD.
The system is also configured to use pam_krb5 as the authenticator, which
may amplify the problem as the KDC seems to shut down before the LDAP
service.
The ldap.conf file contains a single 'uri' statement which looks like this.
uri ldap://active-directory-domain
Looking up the domain name returns multiple addresses, in our case
192.168.10.1 and 192.168.10.3 (the second is our Exchange server).
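
To make the lookup behaviour concrete, here is a minimal Python sketch of how a single host name can expand to several candidate addresses. It is only an illustration of the resolver step, not what the OpenLDAP library does internally; the function name resolve_all is made up for this example.

```python
import socket

def resolve_all(host, port=389):
    """Return every distinct address the resolver gives for host, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        addr = sockaddr[0]
        if addr not in addresses:  # drop duplicates, keep first-seen order
            addresses.append(addr)
    return addresses
```

Calling resolve_all("active-directory-domain") on our network would be expected to return both domain controller addresses; a client that only ever tries the first entry has no fallback.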
While the Exchange server machine is shutting down we get login failures
(pam_krb5 reports an incorrect password) and 'getent passwd' does not
report user entries.
We run NSCD on our boxes just to complicate matters.
It looks to me like the LDAP code connects to the LDAP server on the
machine that is shutting down, but as it cannot get service it reports a
failure, which results in the upper-level code not listing the users.
That is, the socket is still accepting connections but the LDAP server
has already died on the Active Directory box. This is potentially a
Microsoft bug, but we should be working around it, as a partially crashed
server would give the same results elsewhere.
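
The workaround I have in mind can be sketched as follows, in Python for brevity. This is not the OpenLDAP code, just an illustration of the idea: a successful TCP connect is not treated as proof of service, and the client moves on to the next address if a real request fails. The names first_working, connect_tcp and probe are hypothetical; probe stands for any caller-supplied check that performs an actual LDAP operation (for example a rootDSE read) and returns True on success.

```python
import socket

def connect_tcp(addr, port=389, timeout=5.0):
    """Open a TCP connection with a bounded timeout so a dying host cannot hang us."""
    return socket.create_connection((addr, port), timeout=timeout)

def first_working(addresses, probe, connect=connect_tcp):
    """Return (address, connection) for the first address that passes probe.

    probe(conn) is a hypothetical caller-supplied health check.
    """
    failures = []
    for addr in addresses:
        try:
            conn = connect(addr)
        except OSError as exc:
            failures.append((addr, f"connect failed: {exc}"))
            continue
        try:
            healthy = probe(conn)
        except OSError as exc:
            healthy = False
            reason = f"probe error: {exc}"
        else:
            reason = "probe returned False"
        if healthy:
            return addr, conn
        # The socket accepted us but the service behind it is not answering.
        conn.close()
        failures.append((addr, reason))
    raise ConnectionError(f"no usable server: {failures}")
```

The point of passing probe in is that only the caller knows what "getting service" means; for nss_ldap that would presumably be a bind or a trivial search.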
Now I could use a URI with multiple host names, and it looks like this
might work a bit better, as the code seems to have a mechanism to
iterate through the hosts. But I was wondering if this should be fixed
in the OpenLDAP library, especially as listing the domain name allows us
to add and remove AD servers dynamically, with the DNS providing the lookup.
As an alternative, or in addition, should we be handling the sites and
services information in the DNS and binding via SRV lookups? Again, is
this a job for the OpenLDAP library, or should nss_ldap handle this?
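
For the SRV idea, the part a client would have to implement after fetching the records (AD publishes them under _ldap._tcp.<domain>) is the ordering. Below is a rough Python sketch of that step: lowest priority first, weighted random choice within a priority. It approximates the RFC 2782 selection rather than following it to the letter, it does no DNS query itself, and SrvRecord and order_srv are names invented for this example.

```python
import random
from typing import NamedTuple

class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str

def order_srv(records, rng=random):
    """Order SRV records: lowest priority first, weighted random within a priority."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            total = sum(r.weight for r in group)
            if total > 0:
                pick = rng.choices(group, weights=[r.weight for r in group])[0]
            else:
                pick = rng.choice(group)  # all remaining weights are zero
            group.remove(pick)
            ordered.append(pick)
    return ordered
```

The resulting list could then be fed to the same try-each-server loop as a multi-host URI, which is why I am unsure whether this belongs in the library or in nss_ldap.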
I am struggling to work out which mailing list in the OpenLDAP fora
would be appropriate to try to discuss this and was hoping somebody here
could also point me down that path.
Regards, Howard.