Node fails to join after restart


Node fails to join after restart

by Dima Gutzeit-4

Latest version of 2.8.
 
If I restart a node without waiting ~10 seconds between shutdown and startup, I get the following (two nodes in total, one of which is being restarted):
 
[WARN] [EngineConfigurator] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [OOB-1,null] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [OOB-1,null] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [OOB-1,null] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [OOB-1,null] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
[WARN] [OOB-1,null] org.jgroups.protocols.TCP - null: no physical address for e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6, dropping message
[WARN] [EngineConfigurator] org.jgroups.protocols.pbcast.GMS - join(WebLynx02) sent to e1ec4d1f-5cf7-1149-db6a-22e432f3c3c6 timed out (after 3000 ms), retrying
And it never ends.
 
My config is:
 
<config>
        <TCP
                bind_addr="xx.xx.xx.xx"
                loopback="true"
                recv_buf_size="20000000"
                send_buf_size="640000"
                discard_incompatible_packets="true"
                max_bundle_size="64000"
                max_bundle_timeout="30"
                enable_bundling="true"
                use_send_queues="false"
                sock_conn_timeout="300"
                skip_suspected_members="true"

                thread_pool.enabled="true"
                thread_pool.min_threads="1"
                thread_pool.max_threads="50"
                thread_pool.keep_alive_time="5000"
                thread_pool.queue_enabled="false"
                thread_pool.queue_max_size="100"
                thread_pool.rejection_policy="Run"

                oob_thread_pool.enabled="true"
                oob_thread_pool.min_threads="1"
                oob_thread_pool.max_threads="15"
                oob_thread_pool.keep_alive_time="5000"
                oob_thread_pool.queue_enabled="true"
                oob_thread_pool.queue_max_size="1000"
                oob_thread_pool.rejection_policy="Run"
                singleton_name="my_channels"/>
        <MPING timeout="4000" receive_on_all_interfaces="true" send_on_all_interfaces="true" mcast_addr="228.8.8.11" mcast_port="60111" ip_ttl="8" num_initial_members="2" num_ping_requests="1"/>
        <MERGE2 max_interval="10000" min_interval="5000"/>
        <FD_SOCK/>
        <FD_ALL timeout="10000" interval="5000"/>
        <VERIFY_SUSPECT timeout="1500"/>
        <pbcast.NAKACK use_mcast_xmit="false" gc_lag="50" retransmit_timeout="600,1200,2400,4800" discard_delivered_msgs="true"/>
        <UNICAST timeout="1200,2400,3600"/>
        <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400000"/>
        <VIEW_SYNC avg_send_interval="60000"/>
        <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
        <FC max_credits="2000000" min_threshold="0.10" max_block_times="500:2,1500:5,5000:50,20000:200,100000:500,1000000:1000"/>
        <FRAG2 frag_size="60000"/>
        <pbcast.STATE_TRANSFER/>
        <pbcast.FLUSH timeout="5000"/>
</config>
 
I thought I would not have join-related issues anymore in 2.8 :(
 
 
Thanks in advance.
 
Regards,
Dima Gutzeit.

 


------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
javagroups-users mailing list
javagroups-users@...
https://lists.sourceforge.net/lists/listinfo/javagroups-users

Re: Node fails to join after restart

by Dima Gutzeit-4

Some more information:
 
If I wait between node stop and start, it starts ok but the log is filled with:
 
[WARN] [OOB-1,null] org.jgroups.protocols.pbcast.NAKACK - WebLynx02: discarded message from non-member  OtherNode01, my view is MergeView::[WebLynx02|43] [WebLynx02, OtherNode01], subgroups=[[WebLynx02|42] [WebLynx02], [OtherNode01|42] [OtherNode01]]
 
My group has only two nodes...
 
Please ... help :)
 
Regards,
Dima Gutzeit.
 
 

From: dima@...
Sent: Wednesday, November 04, 2009 12:26 PM
Subject: Node fails to join after restart


Re: Node fails to join after restart

by Bela Ban

Well, this is simple:

    * You have A and B
    * You kill A and restart it immediately
    * The new A (let's call it A') will do a discovery in which B
      returns A (the old, killed) as coordinator
          o The reason is that B hasn't yet excluded A and become the
            new coordinator itself
    * A' sends a JOIN request to A, which will fail until A has been
      removed and B has become the new coord
    * A' then sends a JOIN request to B, which succeeds
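
The sequence above can be sketched as a toy model. This is not JGroups
code: the class and method names are hypothetical, and it assumes that
each JOIN attempt is sent to whichever coordinator discovery returns at
the moment the attempt starts, and that a failed attempt takes exactly
join_timeout (3000 ms in the posted config) before the next one begins.

```java
// Toy model only; this is not JGroups code and all names are hypothetical.
final class JoinRetryModel {

    /**
     * Counts the JOIN attempts that time out before the surviving node
     * has excluded the killed coordinator.
     * Assumes attempt k starts at (k - 1) * joinTimeoutMs and fails if the
     * old coordinator is still in the view at that moment.
     */
    static long failedJoinAttempts(long exclusionDelayMs, long joinTimeoutMs) {
        if (joinTimeoutMs <= 0) {
            throw new IllegalArgumentException("joinTimeoutMs must be positive");
        }
        if (exclusionDelayMs <= 0) {
            return 0; // old coordinator already gone: first JOIN goes to B
        }
        // Ceiling division: every attempt started before the exclusion fails.
        return (exclusionDelayMs + joinTimeoutMs - 1) / joinTimeoutMs;
    }

    public static void main(String[] args) {
        long joinTimeoutMs = 3000; // join_timeout from the posted GMS config
        System.out.println(failedJoinAttempts(3000, joinTimeoutMs));
        System.out.println(failedJoinAttempts(16000, joinTimeoutMs));
    }
}
```

Under this model the number of "timed out, retrying" warnings is bounded
by how long B needs to exclude the old A, so a short run of warnings
right after a fast restart would be expected rather than alarming.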


Voila. If you simply kill and restart a node, it should take about 2-3
seconds to exclude it from the cluster. If you pull the plug or CTRL-Z
it, then it will take about 12-16 seconds, during which the restarted
node will try to join the old coordinator.
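
As a rough cross-check of the longer figure, here is a small sketch of
where such a window could come from in the posted config. It assumes
(this is my reading, not something stated in the thread) that the delay
is the FD_ALL timeout, plus up to one FD_ALL interval until the next
check runs, plus the VERIFY_SUSPECT timeout. The class and method names
are hypothetical.

```java
// Back-of-the-envelope estimate; the additive model is an assumption.
final class ExclusionWindow {

    /**
     * Returns {bestCaseMs, worstCaseMs} for excluding a silently dead member,
     * assuming the three configured delays simply add up.
     */
    static long[] estimateMs(long fdAllTimeoutMs,
                             long fdAllIntervalMs,
                             long verifySuspectTimeoutMs) {
        // Best case: the check fires right as the heartbeat timeout elapses.
        long best = fdAllTimeoutMs + verifySuspectTimeoutMs;
        // Worst case: the check fires almost a full interval later.
        long worst = fdAllTimeoutMs + fdAllIntervalMs + verifySuspectTimeoutMs;
        return new long[] {best, worst};
    }

    public static void main(String[] args) {
        // Values from the posted config: FD_ALL timeout="10000" interval="5000",
        // VERIFY_SUSPECT timeout="1500".
        long[] window = estimateMs(10000, 5000, 1500);
        System.out.println(window[0] + " ms to " + window[1] + " ms");
    }
}
```

With the posted values this gives a window of 11.5 to 16.5 seconds,
which is in line with the 12-16 seconds mentioned above.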




--
Bela Ban
Lead JGroups / Clustering Team
JBoss




Re: Node fails to join after restart

by Dima Gutzeit-4

But I've waited several minutes ...

--------------------------------------------------
From: "Bela Ban" <belaban@...>
Sent: Wednesday, November 04, 2009 1:20 PM
To: "Dima Gutzeit" <dima.gutzeit@...>
Cc: <javagroups-users@...>
Subject: Re: [javagroups-users] Node fails to join after restart

