New clients sometimes don't join the group


by Kai Timmer

Hello,
I sometimes have the problem that new clients don't join an
existing group. Then I get lots of these messages on both clients:
Feb 29, 2000 7:41:07 AM org.jgroups.logging.JDKLogImpl warn
WARNING: localhost-54698: discarded message from non-member
l4work-57960, my view is [localhost-54698|0] [localhost-54698]

Most of the time they perform a merge after a while:
Feb 29, 2000 7:42:49 AM org.jgroups.logging.JDKLogImpl warn
WARNING: localhost-54698: discarded message from non-member
l4work-1893, my view is [localhost-54698|0] [localhost-54698]
view: MergeView::[localhost-54698|1] [localhost-54698, l4work-1893],
subgroups=[[localhost-54698|0] [localhost-54698], [l4work-1893|0]
[l4work-1893]]

But sometimes they don't. Then I can wait almost forever and they don't merge:
Feb 29, 2000 7:43:45 AM org.jgroups.logging.JDKLogImpl warn
WARNING: localhost-54698: discarded message from non-member
l4work-31577, my view is [localhost-54698|2] [localhost-54698]
[...]
Feb 29, 2000 7:56:10 AM org.jgroups.logging.JDKLogImpl warn
WARNING: localhost-54698: discarded message from non-member
l4work-31577, my view is [localhost-54698|2] [localhost-54698]

I have no idea how to reproduce each one of these behaviors. Shouldn't
they just enter the group immediately after the clients start?

Greets,
--
Kai Timmer | http://kaitimmer.de
Email : email@...
Jabber (Google Talk): kai@...
ICQ: 67765488

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
javagroups-users mailing list
javagroups-users@...
https://lists.sourceforge.net/lists/listinfo/javagroups-users

Re: New clients sometimes don't join the group

by Bela Ban

Version and config, please?


--
Bela Ban
Lead JGroups / Clustering Team
JBoss




Re: New clients sometimes don't join the group

by Kai Timmer

2009/11/6 Bela Ban <belaban@...>:
> Version and config please ?

I attached the protocol stack configuration to this mail and I'm using
version: 2.8.0CR3

Greets,
--
Kai Timmer | http://kaitimmer.de
Email : email@...
Jabber (Google Talk): kai@...
ICQ: 67765488


<!--
  Default stack using IP multicasting. It is similar to the "udp"
  stack in stacks.xml, but doesn't use streaming state transfer and flushing
  author: Bela Ban
  version: $Id: udp.xml,v 1.32 2009/06/17 16:35:43 belaban Exp $
-->

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">
    <UDP
         ip_mcast="true"
         mcast_addr="${jgroups.udp.mcast_addr:232.10.10.10}"
         mcast_port="${jgroups.udp.mcast_port:45588}"
         tos="8"
         ucast_recv_buf_size="20000000"
         ucast_send_buf_size="640000"
         mcast_recv_buf_size="25000000"
         mcast_send_buf_size="640000"
         loopback="false"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         ip_ttl="${jgroups.udp.ip_ttl:2}"
         enable_bundling="true"
         enable_diagnostics="true"
         thread_naming_pattern="cl"

         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="8"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="true"
         thread_pool.queue_max_size="10000"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="Run"/>

    <PING timeout="10000"
            num_initial_members="1"/>
    <MERGE2 max_interval="30000"
            min_interval="10000"/>
    <FD_ALL interval="1000" timeout="5000" />
    <VERIFY_SUSPECT timeout="2000"/>
    <pbcast.NAKACK use_stats_for_retransmission="false"
                   exponential_backoff="150"
                   use_mcast_xmit="true" gc_lag="0"
                   xmit_from_random_member="true"
                   retransmit_timeout="50,300,600,1200"
                   discard_delivered_msgs="false"/>
    <UNICAST timeout="300,600,1200"/>
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="1000000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"

                view_bundling="true"/>
    <FC max_credits="500000"
                    min_threshold="0.20"/>
    <FRAG2 frag_size="60000"  />
    <!--pbcast.STREAMING_STATE_TRANSFER /-->
    <pbcast.STATE_TRANSFER  />
    <pbcast.FLUSH />
</config>


Re: New clients sometimes don't join the group

by david.forget

Using 2.8RC3 we are also seeing nodes unable to join and high CPU. We are also seeing nodes leaving the cluster when others try to join.

We are currently trying to reproduce it with the log level set to debug and will provide more detail soon.

UDP(bind_addr=10.4.68.61;enable_diagnostics=false;mcast_addr=228.8.8.8;mcast_port=54000;loopback=false;mcast_recv_buf_size=120000):PING(timeout=10000;num_initial_members=10000;num_ping_requests=1):MERGE2(max_interval=10000;min_interval=5000):FD_ALL(interval=5000;timeout=16000):VERIFY_SUSPECT(timeout=3000):BARRIER():pbcast.NAKACK():UNICAST():pbcast.STABLE():pbcast.GMS(join_timeout=11000;print_local_addr=true;view_bundling=true):FRAG2(frag_size=60000):pbcast.STATE_TRANSFER()}


David Forget

Re: New clients sometimes don't join the group

by Vladimir Blagojevic-2

David,

Is there a reason why num_ping_requests is 1? With that setting, only one
ping request is sent out in the 10 sec time frame. Ping requests are sent
over an unreliable UDP socket (PING is below UNICAST) and could be lost in
your case. Take a thorough look at all PING properties, experiment with a
higher value for num_ping_requests, and do not forget to report back.

Regards,
Vladimir


Re: New clients sometimes don't join the group

by david.forget

Hi Vladimir,
Good suggestion. I will increase PING::timeout and set PING::num_ping_requests=2. We set PING to every 10 sec to reduce the number of UDP messages as much as possible. As you know, the coordinator sends PING to every member of the cluster periodically for the entire duration of the cluster, and every node has to reply. Some of our clusters should reach over 350 nodes in early 2010, and more aggressive PING values have an impact on the network and create CPU spikes on the coordinator.

David Forget

Re: New clients sometimes don't join the group

by Vladimir Blagojevic-2

David, there is no need to increase PING:timeout, I suppose; keep it
aligned with your GMS:join_timeout, which you already do. I agree with
your increase of PING::num_ping_requests to 2. Also, what is your
reasoning behind setting num_initial_members to 10000? See the details
of the discovery algorithm in Discovery.java.

Hearing about 350+ nodes deployment and getting your feedback is
certainly very interesting!

Regards,
Vladimir



Re: New clients sometimes don't join the group

by Bela Ban



david.forget@... wrote:
> Hi Vladimir,
> Good suggestion I will increase PING::timeout and set
> PING::num_ping_requests=2. We set PING on every 10 sec to reduce as
> much as possible the amount of UDP Messages. As you know the
> coordinator sents PING to every members of the cluster periodically
> for the entire duration of the cluster and every nodes have to reply,

That shouldn't be an issue though, as you use IP multicasting, which only
sends 1 multicast packet. Yes, every receiver responds with a UDP
datagram back to the coordinator though...

After startup, PING is only used by MERGE2, so if you have relatively
high values for min_interval and max_interval in MERGE2, then you decrease
the frequency at which PING is sending out messages.

> Some of our cluster should reach over 350 nodes in early 2010

Interesting !

> and having PING with more aggressive value have an impact on network
> and create CPU spike on coordinator.

understood.

--
Bela Ban
Lead JGroups / Clustering Team
JBoss



Re: New clients sometimes don't join the group

by Bela Ban



david.forget@... wrote:
> Using 2.8RC3 we are also seeing nodes unable to join and high CPU. We
> are also seeing nodes leaving the cluster when others try to join.
>
> We are currently trying to reproduce it with log level to debug and
> will provide more detail soon.

If you can reproduce this, the sooner the better. 2.8.0 has 4 issues
left and I'm working towards closing them, so I can release GA.


Comments about your config below.


> UDP(bind_addr=10.4.68.61;enable_diagnostics=false;mcast_addr=228.8.8.8;mcast_port=54000;loopback=false;mcast_recv_buf_size=120000):

That's a small receive buffer. The bigger the better, but don't forget
to set net.core.rmem_max (def: 131K IIRC) too (on UNIX systems).
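
A quick way to see whether the OS actually grants a requested receive buffer is to ask the socket after setting it. The sketch below uses only the standard java.net.DatagramSocket API and is independent of JGroups; the class and method names are made up for illustration, and the two sizes are the mcast_recv_buf_size values from the configs in this thread. On Linux the granted size is typically capped by net.core.rmem_max, so the reported value may differ from the requested one.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

public class ReceiveBufferCheck {

    /** Requests a receive buffer of the given size and returns what the OS reports back. */
    static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket sock = new DatagramSocket()) { // bound to an ephemeral local port
            sock.setReceiveBufferSize(requestedBytes);     // only a hint to the OS
            return sock.getReceiveBufferSize();            // size actually in effect
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Values taken from the two configurations posted in this thread.
        int[] requested = {120000, 25000000};
        for (int r : requested) {
            System.out.println("requested=" + r + " granted=" + grantedReceiveBuffer(r));
        }
    }
}
```

If the granted value stays well below the requested one, raising the kernel limit is the step to look at before tuning the JGroups attribute further.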

> PING(timeout=10000;num_initial_members=10000;num_ping_requests=1):

As Vladimir pointed out, num_initial_members=10000 ? What's the
rationale here ?

> MERGE2(max_interval=10000;min_interval=5000):

> FD_ALL(interval=5000;timeout=16000):


No FD_SOCK ? You'll have to wait for up to 23 seconds (worst case) to
discover a crashed member...

> VERIFY_SUSPECT(timeout=3000):

> BARRIER():

> pbcast.NAKACK():

> UNICAST():

> pbcast.STABLE():

> pbcast.GMS(join_timeout=11000;print_local_addr=true;view_bundling=true):

join_timeout 11 seconds ? Seems too high...

Note that I'd set max_bundling_time (def: 50ms) too if you set
view_bundling=true: this way, many concurrent joins generate only a
few view changes rather than 1 view change per JOIN.
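
To make the effect of view bundling concrete, here is a toy model, not the GMS implementation. It assumes that the first pending JOIN opens a window of max_bundling_time and that every JOIN arriving within that window is installed with the same view change. The class name, method name and arrival times are hypothetical.

```java
public class ViewBundlingSketch {

    /**
     * Counts view changes for JOIN requests arriving at the given times (ms, ascending),
     * assuming all requests within windowMs of the first pending one share a single view.
     */
    static int viewChanges(long[] sortedArrivalsMs, long windowMs) {
        int views = 0;
        int i = 0;
        while (i < sortedArrivalsMs.length) {
            long windowEnd = sortedArrivalsMs[i] + windowMs;
            // Consume every JOIN that falls into the current bundling window.
            while (i < sortedArrivalsMs.length && sortedArrivalsMs[i] <= windowEnd) {
                i++;
            }
            views++;
        }
        return views;
    }

    public static void main(String[] args) {
        long[] joins = {0, 10, 20, 45, 200, 210}; // made-up arrival times in ms
        System.out.println("window 50 ms: " + viewChanges(joins, 50)); // 2 view changes
        System.out.println("no bundling:  " + viewChanges(joins, 0));  // 6 view changes
    }
}
```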


> FRAG2(frag_size=60000):

> pbcast.STATE_TRANSFER()}

--
Bela Ban
Lead JGroups / Clustering Team
JBoss



Re: New clients sometimes don't join the group

by Bela Ban

I suggest you change PING.timeout to 3000 and PING.num_initial_members
to 3. This way, you reduce the chances of members not finding each
other, or of discovery returning after having found only another
(starting) member.
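
The interplay of the two PING settings can be sketched with a simplified model. It assumes that discovery returns as soon as num_initial_members responses have arrived or the timeout elapses, whichever comes first; the real logic lives in Discovery.java and may differ in detail. The class name, method names and response times below are hypothetical.

```java
public class DiscoveryWaitSketch {

    /** Time (ms after the request) at which discovery returns under the simplified model. */
    static long returnsAfterMs(long[] sortedRspTimesMs, long timeoutMs, int numInitialMembers) {
        int count = 0;
        for (long t : sortedRspTimesMs) {
            if (t > timeoutMs) {
                break;                        // response arrives too late to count
            }
            if (++count >= numInitialMembers) {
                return t;                     // enough responses: return early
            }
        }
        return timeoutMs;                     // otherwise wait for the full timeout
    }

    /** Number of responses collected before discovery returns. */
    static int responsesSeen(long[] sortedRspTimesMs, long timeoutMs, int numInitialMembers) {
        int count = 0;
        for (long t : sortedRspTimesMs) {
            if (t > timeoutMs || count == numInitialMembers) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] rsps = {5, 40, 900}; // made-up response arrival times in ms

        // timeout=10000, num_initial_members=1: returns on the very first response.
        System.out.println(returnsAfterMs(rsps, 10000, 1) + " ms, " + responsesSeen(rsps, 10000, 1) + " rsp");
        // timeout=3000, num_initial_members=3: waits for three responses.
        System.out.println(returnsAfterMs(rsps, 3000, 3) + " ms, " + responsesSeen(rsps, 3000, 3) + " rsps");
        // timeout=10000, num_initial_members=10000: always waits the full timeout.
        System.out.println(returnsAfterMs(rsps, 10000, 10000) + " ms, " + responsesSeen(rsps, 10000, 10000) + " rsps");
    }
}
```

In this model, num_initial_members=1 lets discovery return on a single response, which may come from another member that is itself still starting, while a very large value means every discovery round lasts the full timeout.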

Kai Timmer wrote:

> 2009/11/6 Bela Ban <belaban@...>:
>  
>> Version and config please ?
>>    
>
> I attached the protocol stack configuration to this mail and I'm using
> version: 2.8.0CR3
>
> Greets,
>  

--
Bela Ban
Lead JGroups / Clustering Team
JBoss




Re: New clients sometimes don't join the group

by Bela Ban

OK, I know what the problem is. The stack trace below shows what happens:

    * An incoming thread passes a unicast message up
    * UNICAST receives the message and sends back an ACK
    * The ACK is sent unicast and arrives at send(), which gets a
      marshaller from the marshaller pool and also fetches the lock (say L1)
    * TP.sendToSingleMember() doesn't find the physical address for the
      destination and calls up(GET_PHYSICAL_ADDRESS, dest)
    * up() is run *in the same thread* !
    * Discovery handles this event and sends a GET_MBRS_REQ out (still
      on the same thread !)
    * TP.send() now happens to fetch a *different* marshaller, therefore
      gets lock L2 !
    * Although this is the same thread, L1 and L2 are different,
      therefore the thread cannot acquire L2 (held by someone else) !
    * If marshaller_pool_size was 1, we'd always get the same lock,
      therefore this thread would be able to re-acquire the lock already
      held by itself !
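
The locking behaviour in the steps above can be reproduced in isolation with plain java.util.concurrent locks. This is a minimal sketch, not JGroups code: the two ReentrantLock instances stand in for the locks of two pooled marshallers, tryLock() stands in for the blocking lock() call so the demo terminates, and all class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class MarshallerLockSketch {

    /** While holding 'held', reports whether the calling thread could also take 'wanted'. */
    static boolean canNest(ReentrantLock held, ReentrantLock wanted) {
        held.lock();
        try {
            boolean acquired = wanted.tryLock(); // non-blocking stand-in for lock()
            if (acquired) {
                wanted.unlock();
            }
            return acquired;
        } finally {
            held.unlock();
        }
    }

    /** Same check while a second thread holds 'wanted'. 'held' and 'wanted' must differ. */
    static boolean canNestWhileHeldElsewhere(ReentrantLock held, ReentrantLock wanted) {
        CountDownLatch locked = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        Thread other = new Thread(() -> {
            wanted.lock();                       // another sender owns this marshaller
            try {
                locked.countDown();
                done.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                wanted.unlock();
            }
        });
        other.start();
        try {
            locked.await();
            boolean result = canNest(held, wanted);
            done.countDown();
            other.join();
            return result;
        } catch (InterruptedException e) {
            done.countDown();
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ReentrantLock l1 = new ReentrantLock();
        ReentrantLock l2 = new ReentrantLock();
        // Pool size 1: the nested send gets the same lock, which is reentrant.
        System.out.println("same lock (L1, L1): " + canNest(l1, l1));                         // true
        // Pool size > 1: the nested send picks L2, which another thread holds.
        System.out.println("different lock (L1, L2): " + canNestWhileHeldElsewhere(l1, l2)); // false
    }
}
```

With a blocking lock() instead of tryLock(), the second case is where the thread would park while still holding L1, which matches the TP.send frames at the top of the stack trace.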

Okay, so what can be done ?

You're lucky, this has already been fixed in CR5 ! Here's what I did there:

    * The lock in TP.send() has been removed. Every caller allocates its
      own output buffer and therefore doesn't need the lock, because
      this is not a shared resource any longer ! Allocating output
      streams is really quick, at a rate of hundreds of thousands this
      is not an issue
    * There is still a lock in TP.Bundler, but it's only accessed by a
      single thread
    * The marshaller pool has been removed

So upgrading to CR5 will eliminate your problem.

There are also workarounds:

    * By default the (oob_)thread_pool.rejection_policy is "run", which
      is not good (I changed this to "discard" on CVS head). You should
      add this property to UDP and set it to "discard" for now
    * Set marshaller_pool_size="1"



"Incoming-2,df148_BCC:54052,devplatform4.nokia.com-21540" id=41 idx=0xb0 tid=30010 prio=5 alive, in native, parked
    at jrockit/vm/Locks.park0(J)V(Native Method)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at jrockit/vm/Locks.park(Locks.java:2507)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at sun/misc/Unsafe.park(ZJ)V(Native Method)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/locks/LockSupport.park(LockSupport.java:118)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/locks/AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:716)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:746)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1076)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/locks/ReentrantLock$NonfairSync.lock(ReentrantLock.java:184)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:256)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.send(TP.java:1101)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.down(TP.java:963)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/PING.sendMcastDiscoveryRequest(PING.java:72)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/PING.sendGetMembersRequest(PING.java:55)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/Discovery.up(Discovery.java:370)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/PING.up(PING.java:67)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.up(TP.java:903)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.sendToSingleMember(TP.java:1136)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.doSend(TP.java:1124)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.send(TP.java:1107)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.down(TP.java:963)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/Discovery.down(Discovery.java:454)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/MERGE2.down(MERGE2.java:176)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/FD_ALL.down(FD_ALL.java:195)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:69)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/BARRIER.down(BARRIER.java:92)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/pbcast/NAKACK.down(NAKACK.java:641)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/UNICAST.sendAck(UNICAST.java:682)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/UNICAST.sendAckForMessage(UNICAST.java:695)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/UNICAST.handleDataReceived(UNICAST.java:581)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/UNICAST.up(UNICAST.java:272)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/pbcast/NAKACK.up(NAKACK.java:712)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/BARRIER.up(BARRIER.java:121)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/FD_ALL.up(FD_ALL.java:180)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/stack/Protocol.up(Protocol.java:344)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/Discovery.up(Discovery.java:283)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/PING.up(PING.java:67)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.passMessageUp(TP.java:1025)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP.access$100(TP.java:53)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP$IncomingPacket.handleMyMessage(TP.java:1537)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at org/jgroups/protocols/TP$IncomingPacket.run(TP.java:1514)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
INFO   | jvm 1    | 2009/11/11 10:22:48 |     at java/lang/Thread.run(Thread.java:595)



david.forget@... wrote:

> Hi,
> We have been able to reproduce a problem and are investigating to
> find the root cause. We are also continuing to reproduce the
> disconnect issue with all nodes set to TRACE level. Basically, we set up 6
> identical machines (DL380, 2 CPUs, quad cores, 16G RAM, CentOS 5.3 64b)
> on IPv4. Each machine has several NICs, then we set
> UDP(bind_addr=10.4.3.5 {eth0}) and also set the multicast route to eth0.
>
> Setup:
> - On 3 machines {M1, M2, M3} we started 1 node
> - On 3 machines {M4, M5, M6} we started 23 nodes
>
> We found the first issue when starting the first node on the last
> machine (M6: GMS: address=devplatform4.nokia.com-21540,
> cluster=df148_BCC:54052, physical address=10.4.3.5:18445)
>
> This is our configuration: (we did not apply your suggestion because
> we would like to find the root cause of this problem first)
>
> UDP(bind_addr=10.4.3.5;enable_diagnostics=false;mcast_addr=228.8.8.8;mcast_port=54052;loopback=false;mcast_recv_buf_size=120000):PING(timeout=10000;num_initial_members=10000;num_ping_requests=1):MERGE2(max_interval=10000;min_interval=5000):FD_ALL(interval=5000;timeout=16000):VERIFY_SUSPECT(timeout=3000):BARRIER():pbcast.NAKACK():UNICAST():pbcast.STABLE():pbcast.GMS(join_timeout=11000;print_local_addr=true;view_bundling=true):FRAG2(frag_size=60000):pbcast.STATE_TRANSFER()
>
>
> Problem:
> The Java process started normally and the node joined the cluster normally
> (as per the coordinator log). Then it looks like a DEADLOCK situation occurred:
> the node was detected as suspect, then removed from the cluster,
> and never joined again afterwards.
> After several minutes we took 3 thread dumps and restarted this node, which
> joined the cluster normally this time.
> The thread dump shows that most threads are locked at
> org/jgroups/protocols/TP.send(TP.java:1101).
>
>
>

--
Bela Ban
Lead JGroups / Clustering Team
JBoss

