swiftmq HA hangs after switching to standby instance

by Yu L

Hi,

This is SwiftMQ 7.4.0 HA running on two Solaris 9 boxes (IP addresses ending in 212 and 213). One is the primary, the other the backup. I started the primary first, then the backup. The first became "Active/Active" and the second "Standby/Standby". The primary ran on the 212 box, along with all JMS clients. The backup ran on the 213 box with no JMS clients on it.

Then I shut down the primary SwiftMQ with "stop.sh" on 212 and saw the backup on 213 change its state to "standalone/standalone". My JMS clients on 212 kept working because their provider URL is set up to point to both instances. This is from the log:

 "2009-04-08 11:26:01,156 DEBUG [main] (MDriverSwiftMQ.java:66) - providerUrl = smqp://admin:resolve@192.168.1.212:4004/host2=192.168.1.213;port2=4004;type=com.swiftmq.net.JSSESocketFactory;reconnect=true;retrydelay=5000;maxretries=720;keepalive=5000;timeout=5000;"

Then I restarted the primary SwiftMQ on 212 and saw all TCP connections switch back to port 4004 on 212, which is the primary. My JMS clients on 212 were still responding and working. So far so good.

Then I repeated this process a couple of times: kill primary, wait for backup swiftmq to pick up, test my jms clients, restart primary...

After about 3 or 4 repetitions, sometimes more, sometimes fewer, with the primary SwiftMQ down and the backup in "standalone/standalone" mode, all my JMS clients would hang and stop responding. Even if I restarted my JMS client application, it would still hang. It seems to hang while the JMS client is doing the JNDI lookup with SwiftMQ. Here is the last log output from my JMS client when I tried to restart it:

2009-04-08 19:48:23,842 DEBUG [main] (MServer.java:45) - JMS platform: SWIFTMQ
2009-04-08 19:48:23,844  INFO [main] (MServer.java:86) - Initializing JMS service
2009-04-08 19:48:23,854 DEBUG [main] (MDriverSwiftMQ.java:66) - providerUrl = smqp://admin:resolve@localhost:4004/host2=192.168.1.213;port2=4004;type=com.swiftmq.net.JSSESocketFactory;reconnect=true;retrydelay=5000;maxretries=720;keepalive=5000;timeout=5000;
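
For reference, the parameters in the provider URL logged above can be unpacked with a few lines of Java. This is only a sketch: the class and method names (`ProviderUrlParams`, `parse`, `reconnectWindowMillis`) are made up for illustration, and it assumes that `retrydelay` is the pause in milliseconds between reconnect attempts and `maxretries` the number of attempts, so that their product approximates how long a client keeps retrying.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ProviderUrlParams {

    // Splits the section after "host:port/" into key=value pairs, keeping their order.
    static Map<String, String> parse(String providerUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int schemeEnd = providerUrl.indexOf("://");
        int slash = providerUrl.indexOf('/', schemeEnd + 3);
        if (schemeEnd < 0 || slash < 0) {
            return params; // no parameter section found
        }
        for (String pair : providerUrl.substring(slash + 1).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return params;
    }

    // Assumption: the retry window is roughly retrydelay (ms) times maxretries.
    static long reconnectWindowMillis(Map<String, String> params) {
        long delay = Long.parseLong(params.getOrDefault("retrydelay", "0"));
        long retries = Long.parseLong(params.getOrDefault("maxretries", "0"));
        return delay * retries;
    }

    public static void main(String[] args) {
        // URL copied from the client log above.
        String url = "smqp://admin:resolve@localhost:4004/host2=192.168.1.213;port2=4004;"
                + "type=com.swiftmq.net.JSSESocketFactory;reconnect=true;retrydelay=5000;"
                + "maxretries=720;keepalive=5000;timeout=5000;";
        Map<String, String> params = parse(url);
        params.forEach((k, v) -> System.out.println(k + " = " + v));
        System.out.println("approx. reconnect window (ms): " + reconnectWindowMillis(params));
    }
}
```

Under that assumption, 5000 ms times 720 retries comes to 3,600,000 ms, so a client that keeps retrying could look hung for up to about an hour before it gives up.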

So basically, when the backup SwiftMQ instance is running in standalone mode after the primary has died, the JNDI lookup and other JMS calls on the client side sometimes simply hang. But after I restarted the backup instance, without starting the primary, all my JMS clients recovered and were able to connect to the backup instance running in standalone mode.

Attached are primary and backup swiftmq instances log files, as well as routerconfig.xml files.

Any suggestions about this problem? Thanks very much.

Yu

Attachments: info.log, routerconfig.xml, error.log, warning.log (one set each for the primary and backup instances)

Re: swiftmq HA hangs after switching to standby instance

by IIT Software

You had a so-called split brain, see the warning.log/error.log. Here is a description of the problem and how to solve it.

In your case the split brain occurred after you started the STANDBY instance while the former STANDALONE instance was down. You then have only 2 minutes to start the other instance; otherwise you get a negotiation timeout and the STANDBY switches to STANDALONE. See your warning.log. If you now start the other (former STANDALONE) instance, both detect that they are in STANDALONE and one of them is shut down. Administrative action is required. See link above.

Re: swiftmq HA hangs after switching to standby instance

by Yu L

Thanks for looking at this problem. Actually, the problem you referred to as "split brain" went away after I restarted the primary and backup SwiftMQ a couple of times, without any administrative action. If you check the dates and times of the messages in "error.log" and "warning.log", you will see that the "split brain" only happened on 4/7, not on 4/8. It never happened again in my HA testing.

But the problem described in my original post is a completely different one. It happened on 4/8 and recurred many times in my testing. Please check "info.log" for exceptions and other details, because I can't find any related error or warning messages in "error.log" or "warning.log".

Yu  

Re: swiftmq HA hangs after switching to standby instance

by IIT Software

I see, sorry.

Everything looks OK in the logs. The instances fail over as expected, including preferred active failover.

The problem seems to be your configuration. I guess you did not run the HAWizard to create an HA-compliant configuration.

Here is a snippet from the default config of a HA router, JMS listener part:

      <listener name="plainsocket" hostname="localhost" hostname2="localhost" port="4001" port2="4002">

It states that both HA instances are on localhost, on different ports.

Here is the same part from your config:

      <listener name="sslsocket" connectaddress="192.168.1.213" connectaddress2="192.168.1.212" port="4004" port2="4004" socketfactory-class="com.swiftmq.net.JSSESocketFactory">

Yours is wrong. The clients need the "host2" attribute set; otherwise they can't fail over because they don't know where the other HA instance is.
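
To make the difference between the two listener elements concrete, here is a minimal Java sketch that lists which attribute names appear in one element but not the other. The class and method names (`ListenerAttributes`, `attributesOf`, `onlyIn`) are hypothetical, and a closing `/>` is added to each snippet so it parses as a standalone element. It only reports which names differ; it does not say which attributes the router actually requires.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;

class ListenerAttributes {

    // Parses a single XML element and returns its attributes sorted by name.
    static Map<String, String> attributesOf(String elementXml) throws Exception {
        NamedNodeMap attrs = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(elementXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement()
                .getAttributes();
        Map<String, String> result = new TreeMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            result.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
        }
        return result;
    }

    // Attribute names present in 'a' but absent from 'b'.
    static Set<String> onlyIn(Map<String, String> a, Map<String, String> b) {
        Set<String> names = new TreeSet<>(a.keySet());
        names.removeAll(b.keySet());
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Both elements are copied from this thread; "/>" is added so each parses on its own.
        String defaultListener = "<listener name=\"plainsocket\" hostname=\"localhost\" "
                + "hostname2=\"localhost\" port=\"4001\" port2=\"4002\"/>";
        String postedListener = "<listener name=\"sslsocket\" connectaddress=\"192.168.1.213\" "
                + "connectaddress2=\"192.168.1.212\" port=\"4004\" port2=\"4004\" "
                + "socketfactory-class=\"com.swiftmq.net.JSSESocketFactory\"/>";
        Map<String, String> def = attributesOf(defaultListener);
        Map<String, String> posted = attributesOf(postedListener);
        System.out.println("only in default config: " + onlyIn(def, posted));
        System.out.println("only in posted config:  " + onlyIn(posted, def));
    }
}
```

For these two snippets, the default listener carries `hostname` and `hostname2`, which the posted `sslsocket` listener lacks; the posted one carries `connectaddress`, `connectaddress2` and `socketfactory-class` instead.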

What you refer to, the URL, is the JNDI provider URL. It only fails over the JNDI context. The JMS failover is handled inside the connection factory.

Please run the HAWizard on your configuration to ensure everything is properly configured.

Re: swiftmq HA hangs after switching to standby instance

by Yu L

Thanks for the quick response. We don't use "plainsocket", so its hostname and hostname2 were left at the default "localhost". We only use "sslsocket", which is set to point to "212" and "213" for the primary and backup SwiftMQ instances. This routerconfig.xml was working fine: failover between the primary and backup instances was always successful, and my JMS clients were able to fail over to the backup initially after the primary went down. The problem is that my JMS clients would sometimes hang after SwiftMQ had failed over a couple of times between the primary and backup instances. So this does not look like a configuration problem in routerconfig.xml. Otherwise, why were my JMS clients able to fail over to the backup initially?

Yu

Re: swiftmq HA hangs after switching to standby instance

by IIT Software

Try

-Dswiftmq.reconnect.debug=true

at your client to see the reconnect output of the client.


Re: swiftmq HA hangs after switching to standby instance

by IIT Software

It could just be a half-open socket. In that case the client waits up to 5 min until the keepalive kicks in and disconnects. You should see that in the debug output.
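
While checking for a half-open socket, it can help to put an upper bound on the blocking call on the client side, so a stuck call shows up as a timeout in the client log instead of a silent hang. The following is a minimal sketch using only `java.util.concurrent`; `BoundedCall` and `callWithTimeout` are hypothetical names, and the sleeping task merely stands in for a blocking call such as the JNDI lookup. It is a diagnostic aid, not a fix for the underlying connection problem.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedCall {

    // Runs a blocking call on a daemon thread and gives up after timeoutMillis.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // do not keep the JVM alive if the call never returns
            return t;
        });
        try {
            Future<T> future = executor.submit(call);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupts the worker; a thread blocked in socket I/O may not react to this.
                future.cancel(true);
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a blocking call that never returns (hypothetical).
        Callable<String> stuck = () -> {
            Thread.sleep(60_000);
            return "never";
        };
        try {
            callWithTimeout(stuck, 500);
        } catch (TimeoutException e) {
            System.out.println("call did not return within 500 ms");
        }
        System.out.println(callWithTimeout(() -> "ok", 500));
    }
}
```

The caller gets a `TimeoutException` after the chosen limit, which can be logged with a timestamp and compared against the reconnect debug output.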