Comet handler starts terminating TCP connections with RST?

View: New views
17 Messages — Rating Filter:   Alert me  

Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello,
we're having problems with grizzly 1.0.20 connector (backported to gf2ur1 running on Debian 4 adm64). After some hours, the connector starts to drop (TCP) connections immediately after setup without reaching any of our application code. Also, I cannot see anything in the glassfish logs. Load is very moderate with only 1-4 clients polling every minute.

Here's what happens eventually:

client <->  server
 -> SYN
 <- SYN, ACK
 -> ACK
 <- FIN, ACK
 -> ACK
 -> HTTP GET...
 <- RST

Once this starts to happen, also non-bayeux calls start getting occasional connection drops in exactly the same way. All bayeux connections are dropped immediately.

I'm not sure how to investigate this any further. Does anyone have any ideas on under which circumstances the comet handler would drop the connection immediately? This might be some kind of resource exhaustion problem, but we're not (yet) that familiar with the grizzly codebase to figure out e.g. what is not being released.

Best regards,

    Jussi Kuosa



Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



Jussi Kuosa wrote:
> Hello,
> we're having problems with grizzly 1.0.20 connector (backported to gf2ur1
> running on Debian 4 adm64). After some hours, the connector starts to drop
> (TCP) connections immediately after setup without reaching any of our
> application code.

Can you do a run using
-Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true

(and also make sure you always run with):

<jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>

Also, when that happens, can you grab a jstack PID to see if threads are
available.

Also, I cannot see anything in the glassfish logs. Load is

> very moderate with only 1-4 clients polling every minute.
>
> Here's what happens eventually:
>
> client <->  server
>  -> SYN
>  <- SYN, ACK
>  -> ACK
>  <- FIN, ACK
>  -> ACK
>  -> HTTP GET...
>  <- RST
>
> Once this starts to happen, also non-bayeux calls start getting occasional
> connection drops in exactly the same way. All bayeux connections are dropped
> immediately.

Strange.

>
> I'm not sure how to investigate this any further. Does anyone have any ideas
> on under which circumstances the comet handler would drop the connection
> immediately? This might be some kind of resource exhaustion problem, but
> we're not (yet) that familiar with the grizzly codebase to figure out e.g.
> what is not being released.

One thing you might want to try is to update you v2 installation to use
grizzly 1.0.22.jar

https://maven-repository.dev.java.net/nonav/repository/grizzly/jars/

Thanks

-- Jeanfrancois




>
> Best regards,
>
>     Jussi Kuosa
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Hi Jean-Francois,
and thank you for your quick reply!

I will try your suggestions this evening and report back tomorrow morning. We'll patch v2 after we have identified the problem.

Best regards,

    Jussi Kuosa

> Can you do a run using
> -Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true
> (and also make sure you always run with):
> <jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>
>
> Also, when that happens, can you grab a jstack PID to see if threads are
> available.

> One thing you might want to try is to update you v2 installation to use
> grizzly 1.0.22.jar


Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi again,
it took a while before I got to tackle this issue... I've been trying to solve this for a week now and here's what I've come up with: the TCP resets are caused by a Comet-enabled selector thread consuming 100% CPU. As the CPU is hot, the Linux networking stack begins to drop connections from that CPUs packet queue (confirmed by /proc/net/softnet_stat).

After a varying period of time (20s - 4h), one core (of 2) goes to and stays at 100% spinning wildly in the (epoll) selection loop. I used the JTop plugin in JConsole to track down the CPU hogging thread (SelectorThread-8282). From jstack output I found out that the thread seems to sit in sun.nio.ch.EPollArrayWrapper.epollWait all the time. I attached the NB6.1 debugger to GF and found that the SelectorThread.doSelect for all listeners keep returning with 0 state and there are no ready keys. Evaluating selector.keys() returns only one key with OP_ACCEPT interestOps (i.e. the server socket in port 8282), so I guess all client keys have been cleared out? I have no idea why the epoll returns immediately in this case?

I used the jstack trace to find [1] and a JDK defect 6595055 [2] that lead me to 6403933 [3]. The defect is flagged as 10-Fix Delivered, but I guess the problem hasn't really been fixed yet (at least in jdk-6u7)... I also run a second server on top of Windows Server 2003sp2 and it hasn't had any problems, so that is currently our backup plan (that we really wouldn't like to use due to Windows NTFS (jar) file locking problems during redeployments).

I've tried desperatly to make the problem reproduce systematically, but without any luck. All the test clients that I've found in JDK defects (and a similar defect in Twisted framework) run fine, as do my own attempts to break the disconnect cycle. The only time I even noticed a TCP reset happening before the CPU trashing is reset_sequence.txt (the checksums are offloaded). That was caused by a service shutdown on the client side, but the problem occurs also with the clients doing normal CONNECTs to a subscribed channel.

Is anyone on the list running 2.6 amd64 Linux cometd/bayeux servers and have you experienced similar behavior?

Does someone have any other suggestions as to what I could try next?

Can I provide more information to help solve this problem?

Best regards,

    Jussi Kuosa


[1] http://forums.java.net/jive/thread.jspa?messageID=255525
[2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6595055
[3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933

> Can you do a run using
> -Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true
> (and also make sure you always run with):
> <jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>

Yes, I put those in.

> Also, when that happens, can you grab a jstack PID to see if threads are
> available.

Looks ok, all HttpWorkerThreads for 8282 are waiting in the pipeline? jstack_l.txt

> One thing you might want to try is to update you v2 installation to use
> grizzly 1.0.22.jar

Yes, I bumped our GF to v2ur2-b04 (with 1.0.22) and we use JDK 1.6.0_07-b06.

Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut,

Jussi Kuosa wrote:
> Hi again,
> it took a while before I got to tackle this issue... I've been trying to
> solve this for a week now and here's what I've come up with: the TCP resets
> are caused by a Comet-enabled selector thread consuming 100% CPU. As the CPU
> is hot, the Linux networking stack begins to drop connections from that CPUs
> packet queue (confirmed by /proc/net/softnet_stat).

Grrrr I've reported many many times such JDK issues!!


>
> After a varying period of time (20s - 4h), one core (of 2) goes to and stays
> at 100% spinning wildly in the (epoll) selection loop. I used the JTop
> plugin in JConsole to track down the CPU hogging thread
> (SelectorThread-8282). From jstack output I found out that the thread seems
> to sit in sun.nio.ch.EPollArrayWrapper.epollWait all the time. I attached
> the NB6.1 debugger to GF and found that the SelectorThread.doSelect for all
> listeners keep returning with 0 state and there are no ready keys.
> Evaluating selector.keys() returns only one key with OP_ACCEPT interestOps
> (i.e. the server socket in port 8282), so I guess all client keys have been
> cleared out? I have no idea why the epoll returns immediately in this case?
>
> I used the jstack trace to find [1] and a JDK defect 6595055 [2] that lead
> me to 6403933 [3]. The defect is flagged as 10-Fix Delivered, but I guess
> the problem hasn't really been fixed yet (at least in jdk-6u7)... I also run
> a second server on top of Windows Server 2003sp2 and it hasn't had any
> problems, so that is currently our backup plan (that we really wouldn't like
> to use due to Windows NTFS (jar) file locking problems during
> redeployments).
>
> I've tried desperatly to make the problem reproduce systematically, but
> without any luck. All the test clients that I've found in JDK defects (and a
> similar defect in Twisted framework) run fine, as do my own attempts to
> break the disconnect cycle. The only time I even noticed a TCP reset
> happening before the CPU trashing is
> http://www.nabble.com/file/p21037201/reset_sequence.txt reset_sequence.txt
> (the checksums are offloaded). That was caused by a service shutdown on the
> client side, but the problem occurs also with the clients doing normal
> CONNECTs to a subscribed channel.
>

Thanks for the info. I've forwarded this to the NIO lead as this is
clearly a JDK bug.


> Is anyone on the list running 2.6 amd64 Linux cometd/bayeux servers and have
> you experienced similar behavior?
>
> Does someone have any other suggestions as to what I could try next?
>
> Can I provide more information to help solve this problem?

One workaround for you is to set the
CometContext.setExpirationDelay(-1). This will not enabled that extra
Selector.

Would you be able to produce a test case? Let me try to find a
workaround (again) by trying to detect the scenario and trash the
Selector. Meanwhile, can you file an issue here (as a p2):

https://glassfish.dev.java.net/servlets/ProjectIssues

Make sure you state it is a JDK issue. I'm asking because I might be
able to commit the workaround so GF 2.1 isn't suffering the issue. The
window is short by I should be able to send you a patch by the end of today.

Thanks

-- Jeanfrancois




>
> Best regards,
>
>     Jussi Kuosa
>
>
> [1] http://forums.java.net/jive/thread.jspa?messageID=255525
> [2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6595055
> [3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933
>
>> Can you do a run using
>> -Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true
>> (and also make sure you always run with):
>> <jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>
>
> Yes, I put those in.
>
>> Also, when that happens, can you grab a jstack PID to see if threads are
>> available.
>
> Looks ok, all HttpWorkerThreads for 8282 are waiting in the pipeline?
> http://www.nabble.com/file/p21037201/jstack_l.txt jstack_l.txt
>
>> One thing you might want to try is to update you v2 installation to use
>> grizzly 1.0.22.jar
>
> Yes, I bumped our GF to v2ur2-b04 (with 1.0.22) and we use JDK 1.6.0_07-b06.
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> One workaround for you is to set the
> CometContext.setExpirationDelay(-1). This will not enabled that extra
> Selector.

Ok, I'll try that still.

> Would you be able to produce a test case?

I've tried, but I haven't been able to do that - sorry :-(

> Let me try to find a workaround (again) by trying to
> detect the scenario and trash the Selector. Meanwhile,
> can you file an issue here (as a p2):

Yup, created issue 6940 [1].

> The window is short by I should be able to send you a patch by the end of today.

Ok, I also set up the GF2 build environment (cvs and maven1), I have SJSAS91_UR2_BRANCH checked out ATM.

-Jussi

[1] https://glassfish.dev.java.net/issues/show_bug.cgi?id=6940

Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut,

here is one tentative. Just do:

% cd ${glassfish.home}/lib
% unzip grizzly-comet...
% cp appserv-rt.jar appserv-rt.jar.org
% jar uvf appserv-rt.jar com

Let me know if that fix the issue.

A+

-- Jeanfrancois

Jussi Kuosa wrote:

>> One workaround for you is to set the
>> CometContext.setExpirationDelay(-1). This will not enabled that extra
>> Selector.
>
> Ok, I'll try that still.
>
>> Would you be able to produce a test case?
>
> I've tried, but I haven't been able to do that - sorry :-(
>
>> Let me try to find a workaround (again) by trying to
>> detect the scenario and trash the Selector. Meanwhile,
>> can you file an issue here (as a p2):
>
> Yup, created issue 6940 [1].
>
>> The window is short by I should be able to send you a patch by the end of
>> today.
>
> Ok, I also set up the GF2 build environment (cvs and maven1), I have
> SJSAS91_UR2_BRANCH checked out ATM.
>
> -Jussi
>
> [1] https://glassfish.dev.java.net/issues/show_bug.cgi?id=6940


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...

grizzly-comet-selector-spin-20081217.jar (373K) Download Attachment

Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


> Let me know if that fix the issue.

Hi Jean-Francois,
and thank you for your incredibly fast response!

I put your fix to test this morning and it ran almost 4 hours before going into 100% for one core. The symptoms are the same: selector thread for port 8282 is hot sitting in epollWait. As always, eden space mem graph looks like a heart-attack after the thrashing due to the amount of temporary objects that are created just to notice that there's nothing to do...

>> One workaround for you is to set the
>> CometContext.setExpirationDelay(-1). This will not enabled that extra
>> Selector.

I was supposed to do this in the morning, but I'll put that to test now. I'll let you know ASAP.

-Jussi

Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut,

Jussi Kuosa wrote:

>
>> Let me know if that fix the issue.
>
> Hi Jean-Francois,
> and thank you for your incredibly fast response!
>
> I put your fix to test this morning and it ran almost 4 hours before going
> into 100% for one core. The symptoms are the same: selector thread for port
> 8282 is hot sitting in epollWait. As always, eden space mem graph looks like
> a heart-attack after the thrashing due to the amount of temporary objects
> that are created just to notice that there's nothing to do...

Thanks for the feedback....let me think about another workaround.


>
>>> One workaround for you is to set the
>>> CometContext.setExpirationDelay(-1). This will not enabled that extra
>>> Selector.
>
> I was supposed to do this in the morning, but I'll put that to test now.
> I'll let you know ASAP.

Many Many Thanks!

-- Jeanfrancois

>
> -Jussi

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut,

second try. I've added some logging when the issue arise, and when I try
to recover from the failure.

Thanks!

-- Jeanfrancois

Jussi Kuosa wrote:

>
>> Let me know if that fix the issue.
>
> Hi Jean-Francois,
> and thank you for your incredibly fast response!
>
> I put your fix to test this morning and it ran almost 4 hours before going
> into 100% for one core. The symptoms are the same: selector thread for port
> 8282 is hot sitting in epollWait. As always, eden space mem graph looks like
> a heart-attack after the thrashing due to the amount of temporary objects
> that are created just to notice that there's nothing to do...
>
>>> One workaround for you is to set the
>>> CometContext.setExpirationDelay(-1). This will not enabled that extra
>>> Selector.
>
> I was supposed to do this in the morning, but I'll put that to test now.
> I'll let you know ASAP.
>
> -Jussi


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...

grizzly-comet-selector-patch2.jar (43K) Download Attachment

Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> second try. I've added some logging when the issue arise,
> and when I try to recover from the failure.

Awesome - I just put that to test (in addition to the CometContext.setExpirationDelay(-1) test). If those changes help, I can rule out which fix was the cause (or were both needed) tomorrow and send you some logs.

I have to run for today, but I'll get back to you tomorrow morning.

-Jussi

Re: Comet handler starts terminating TCP connections with RST?

by David M. Lloyd-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

It sounds a lot like
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933 to me...?

- DML

On 12/18/2008 09:16 AM, Jeanfrancois Arcand wrote:

> Salut,
>
> second try. I've added some logging when the issue arise, and when I try
> to recover from the failure.
>
> Thanks!
>
> -- Jeanfrancois
>
> Jussi Kuosa wrote:
>>
>>> Let me know if that fix the issue.
>>
>> Hi Jean-Francois,
>> and thank you for your incredibly fast response!
>>
>> I put your fix to test this morning and it ran almost 4 hours before
>> going
>> into 100% for one core. The symptoms are the same: selector thread for
>> port
>> 8282 is hot sitting in epollWait. As always, eden space mem graph
>> looks like
>> a heart-attack after the thrashing due to the amount of temporary objects
>> that are created just to notice that there's nothing to do...
>>
>>>> One workaround for you is to set the
>>>> CometContext.setExpirationDelay(-1). This will not enabled that
>>>> extra Selector.
>>
>> I was supposed to do this in the morning, but I'll put that to test now.
>> I'll let you know ASAP.
>>
>> -Jussi
>
> ------------------------------------------------------------------------
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@...
> For additional commands, e-mail: users-help@...

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut,

Jussi Kuosa wrote:
>> second try. I've added some logging when the issue arise,
>> and when I try to recover from the failure.
>
> Awesome - I just put that to test (in addition to the
> CometContext.setExpirationDelay(-1) test).

OK with -1 for sure the problem will never arise (I shouldn't say never :-))

  If those changes help, I can rule
> out which fix was the cause (or were both needed) tomorrow and send you some
> logs.
>

Just don't set to -1 to see if the CometSelector can recover. You will
see log messages when it try to recover.

Thanks!!!

-- Jeanfrancois


> I have to run for today, but I'll get back to you tomorrow morning.
>
> -Jussi
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Salut,

David M. Lloyd wrote:
> It sounds a lot like
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933 to me...?

Or that one:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6693490

The patch will show the real exception, if exception happens. If not,
then I do agree it will be the same as 6403933...which I filled but
can't apply the same workaround (patch 1 failed).

A+

-- jeanfrancois


>
> - DML
>
> On 12/18/2008 09:16 AM, Jeanfrancois Arcand wrote:
>> Salut,
>>
>> second try. I've added some logging when the issue arise, and when I
>> try to recover from the failure.
>>
>> Thanks!
>>
>> -- Jeanfrancois
>>
>> Jussi Kuosa wrote:
>>>
>>>> Let me know if that fix the issue.
>>>
>>> Hi Jean-Francois,
>>> and thank you for your incredibly fast response!
>>>
>>> I put your fix to test this morning and it ran almost 4 hours before
>>> going
>>> into 100% for one core. The symptoms are the same: selector thread
>>> for port
>>> 8282 is hot sitting in epollWait. As always, eden space mem graph
>>> looks like
>>> a heart-attack after the thrashing due to the amount of temporary
>>> objects
>>> that are created just to notice that there's nothing to do...
>>>
>>>>> One workaround for you is to set the
>>>>> CometContext.setExpirationDelay(-1). This will not enabled that
>>>>> extra Selector.
>>>
>>> I was supposed to do this in the morning, but I'll put that to test now.
>>> I'll let you know ASAP.
>>>
>>> -Jussi
>>
>> ------------------------------------------------------------------------
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@...
>> For additional commands, e-mail: users-help@...
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@...
> For additional commands, e-mail: users-help@...
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


> OK with -1 for sure the problem will never arise (I shouldn't say never :-))

And it didn't - as you expected :-)

> Just don't set to -1 to see if the CometSelector can recover. You will
> see log messages when it try to recover.

I just begun this test with you increased logging and our normal test expiration delay of 5 seconds.

-Jussi

Re: Comet handler starts terminating TCP connections with RST?

by Jussi Kuosa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


>> Just don't set to -1 to see if the CometSelector can recover. You will
>> see log messages when it try to recover.

> I just begun this test with you increased logging and our normal test expiration delay of 5 seconds.

Ok, I had some trashing, but the server recovered from it!

I will send you the log files privately.

Best regards,

    Jussi Kuosa

Re: Comet handler starts terminating TCP connections with RST?

by Jeanfrancois Arcand-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



Jussi Kuosa wrote:
>
>>> Just don't set to -1 to see if the CometSelector can recover. You will
>>> see log messages when it try to recover.
>
>> I just begun this test with you increased logging and our normal test
>> expiration delay of 5 seconds.
>
> Ok, I had some trashing, but the server recovered from it!

Great. I will make sure the fix goes in v2!

Many thanks!

-- Jeanfrancois

>
> I will send you the log files privately.
>
> Best regards,
>
>     Jussi Kuosa
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...