Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

Hi,

I am experiencing a high message passing delay between 2 Erlang nodes after an abnormal network disconnection. The 2 nodes are on a WAN, with multiple hubs, switches, routers, etc. between them. If the message-receiving Erlang node is stopped gracefully, the delay doesn't arise. Calling net_adm:ping/1 on that node returns "pang" with no delay. However, gen_event:notify/2, gen_server:cast/2, etc. take about 10 seconds to return.

What's the issue and how can this be avoided?

Thanks,
- Eranga

_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: Erlang message passing delay after abnormal network disconnection

by chandru-2

Have you tried putting a snoop to see whether the delay is on the
sending/receiving side?

This might be useful: http://www.erlang.org/contrib/erlsnoop-1.0.tgz

cheers
Chandru

Re: Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

The problem occurs when the network connectivity is broken (abnormally). The receiving node is not receiving messages. The sending processes are blocked, since the message delivery calls (gen_event:notify/2, etc.) take about 10 seconds to return. We checked the implementation of these calls and noticed that the functions wait until the messages are delivered to the receiving node. Is there a way (a system flag, maybe) to avoid this blocking and return immediately?

BRgds,
- Eranga

Re: Erlang message passing delay after abnormal network disconnection

by Ulf Wiger (TN/EAB)

It sounds as if the sending node is blocked in auto-connect.

Try the kernel environment variable {dist_auto_connect, once}.
It will ensure that any attempt to send to a disconnected node
immediately fails. If one of the nodes restarts, they will
automatically reconnect, as usual. You can explicitly connect
the two nodes by calling net_kernel:connect(Node).
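
A minimal sketch of the setting described above, assuming the node is started with a sys.config file (the file name and layout are illustrative assumptions, not something given in this thread):

```erlang
%% sys.config (illustrative): parameters for the kernel application.
[
 {kernel,
  [
   %% With `once`, a connection that has gone down is not re-established
   %% implicitly by the next send; it has to be set up explicitly.
   {dist_auto_connect, once}
  ]}
].
```

The same parameter can also be passed on the command line as `erl -kernel dist_auto_connect once`. In current OTP releases the documented call for connecting explicitly is net_kernel:connect_node(Node).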

BR,
Ulf W

Re: Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

I can reproduce the behavior by stopping the network interface on the far node (Linux ifdown). That host runs the connected Erlang node, which was receiving the messages. I wonder whether this is how Erlang is implemented or whether it is specific to this particular setup.

Also I use HiPE. I'll try what you suggested, and also try without HiPE.

Thanks,
- Eranga

Re: Erlang message passing delay after abnormal network disconnection

by Kostis Sagonas-2

Eranga Udesh wrote:
> I can reproduce the behavior by stopping the network interface on the
> far node (Linux ifdown). That host runs the connected Erlang node, which
> was receiving the messages. I wonder whether this is how Erlang is
> implemented or whether it is specific to this particular setup.
>
> Also I use HiPE. I'll try what you suggested, and also try without HiPE.

Why would HiPE have any effect on what you are describing?

Kostis

Re: Erlang message passing delay after abnormal network disconnection

by Kenneth Lundin

When connectivity is broken abnormally, the sending node will detect
this within 45-60 seconds by default. This can be changed with the
net_ticktime environment variable in the kernel application.
Before the detection, the sending node will try to send the message and,
if that is not possible, the message will be queued in the inet driver.
If the queue grows beyond a certain maximum, a so-called "busy port"
will occur, which will block the sending Erlang process.
This happens when the receiving side of the distribution socket does
not read what is sent to it, which is the case when you have no
connectivity.
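
As a rough illustration of the tick-time setting mentioned above, here is a sys.config fragment using the kernel parameter documented as net_ticktime. The file name and the value 10 are assumptions chosen for the example, not a recommendation from this thread:

```erlang
%% sys.config (illustrative): parameters for the kernel application.
[
 {kernel,
  [
   %% Tick time in seconds; the default is 60. A smaller value means a
   %% dead connection is noticed sooner, at the cost of more tick
   %% traffic and a higher risk of false alarms on a slow link.
   {net_ticktime, 10}
  ]}
].
```

The equivalent command-line form is `erl -kernel net_ticktime 10`. Connected nodes should normally be configured with the same value.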

Another scenario is that the receiving node is detected as down and
an auto-connect (including handshake) is performed for the first
message sent after the broken connection. This will take on the order
of 10 seconds before it times out.

If you want to avoid this for a very crucial process (i.e. avoid
blocking that particular Erlang process), you can send the message
with erlang:send_nosuspend/2 or /3. Warning! These functions should be
used with extreme care. Read the manual!
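
A minimal sketch of how such a send might be wrapped. The module name, function names and error atoms are hypothetical; only erlang:send_nosuspend/2,3 and the noconnect option come from OTP:

```erlang
-module(nosuspend_demo).
-export([try_send/2, try_send_noconnect/2]).

%% Send Msg to Dest without ever suspending the calling process.
%% erlang:send_nosuspend/2 returns true if the message was sent and
%% false if sending would have suspended the caller. In the false case
%% the message is NOT sent, so the caller must decide what to do.
try_send(Dest, Msg) ->
    case erlang:send_nosuspend(Dest, Msg) of
        true  -> ok;
        false -> {error, would_suspend}
    end.

%% Same idea with the noconnect option: the call also returns false
%% when the destination node is not currently reachable, instead of
%% trying to set up a connection first.
try_send_noconnect(Dest, Msg) ->
    case erlang:send_nosuspend(Dest, Msg, [noconnect]) of
        true  -> ok;
        false -> {error, not_sent}
    end.
```

A caller would then use something like nosuspend_demo:try_send_noconnect({my_server, 'b@host'}, Msg), where the registered name and node name are placeholders. Because a false return means the message was silently not delivered, this only suits traffic the application can afford to lose or retry.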

Note that this has nothing to do with HiPE (i.e. native code).
An abnormal termination of the connectivity, for example by unplugging
the network cable, will have this effect.

/Kenneth Erlang/OTP team Ericsson

Re: Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

Thanks for the info; it makes sense.

In a "busy port" situation, do queued messages get discarded if the queue grows beyond a maximum? Is the queue FIFO or LIFO? Is there a way to configure the size of this message queue? Can one inet_drv "busy port" block communication with all other connected (live) nodes?

As I said before, net_adm:ping/1 returns "pang" immediately. Why, then, doesn't the message delivery function recognize that the remote node is inaccessible and return immediately with an error?

How is message delivery implemented in Erlang? Does the call return as soon as the message is handed over to the local inet_drv, or only after it has been delivered to the receiving Erlang node's inet_drv and some confirmation has been received?

- Eranga

Re: Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

It's just a guess. When compiling natively I've found some problems from time to time:
- Garbage collection is not working well. The old heap is kept unnecessarily. I've written to the list about this once before.
- Code loading crashes a running process more often.

It probably has nothing to do with HiPE, but I thought I would simulate the same scenario without HiPE and check.

- Eranga

Re: Erlang message passing delay after abnormal network disconnection

by Scott Lystig Fritchie

Hi, everyone.  I've read forward in the thread ... and am wondering if
there's a simpler cause?  Since the default distribution mechanism rides
on top of TCP, the delay might be caused by TCP's exponential back-off
when packet loss is encountered?  A quick packet capture could verify
this theory: there would be a big delay after the network partition is
fixed (i.e. plug cable back in, "ifconfig {IFACE} up", whatever) and
before the next packet (in either direction) is transmitted.

-Scott

Re: Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

The problem I am talking about occurs while the network is partitioned. When the network connection is re-established and the Erlang node is reconnected with net_adm:ping/1, the message queue drains quickly and the nodes start working normally.

As I said before, this delay occurs only after an abnormal network disconnection. If the receiving Erlang node is shut down gracefully, the message delay doesn't occur.

I suspect this occurs only when the packets sent out go into a black hole and nothing responds to say that the destination TCP entity is unavailable.

- Eranga

Re: Erlang message passing delay after abnormal network disconnection

by Eranga Udesh-3

Excellent, the net_ticktime environment variable works. Thanks for the advice.
I would still appreciate knowing how inet_drv behaves, per the questions I asked in my previous email.

Cheers,
- Eranga