8.0-RC1 NFS client timeout issue

View: New views
12 Messages — Rating Filter:   Alert me  

8.0-RC1 NFS client timeout issue

by Olaf Seibert-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I see an annoying behaviour with NFS over TCP. It happens both with nfs
and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
some Linux or perhaps Solaris, I'm not entirely sure.

After trying to find something in packet traces, I think I have found
something.

The scenario seems to be as follows. Sorry for the width of the lines.


No.     Time        Source                Destination           Protocol Info
   2296 2992.216855 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2297), DH:0x3819da36/w
   2297 2992.217107 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2296) Error:NFS3ERR_NOENT
   2298 2992.217141 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2299), DH:0x170cb16a/bin
   2299 2992.217334 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2298), FH:0x61b8eb12
   2300 2992.217361 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call (Reply In 2301), FH:0x61b8eb12
   2301 2992.217582 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 ACCESS Reply (Call In 2300)
   2302 2992.217605 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2303), DH:0x61b8eb12/w
   2303 2992.217860 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2302) Error:NFS3ERR_NOENT
   2304 2992.318770 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238293 Ack=230289 Win=8192 Len=0 TSV=86492342 TSER=12393434
   2306 3011.537520 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 GETATTR Reply (Call In 2305)  Directory mode:2755 uid:4100 gid:4100
   2307 3011.637744 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238429 Ack=230405 Win=8192 Len=0 TSV=86511662 TSER=12395366
   2308 3371.534980 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [FIN, ACK] Seq=230405 Ack=238429 Win=49232 Len=0 TSV=12431366 TSER=86511662

        The server decides, for whatever reason, to terminate the
        connection and sends a FIN.

   2309 3371.535018 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238429 Ack=230406 Win=8192 Len=0 TSV=86871578 TSER=12431366

        Client acknowledges this,

   2310 3375.379693 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call, FH:0x008002a2

        but tries to sneak in another call anyway. [A]

   2311 3375.474788 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [ACK] Seq=230406 Ack=238569 Win=49232 Len=0 TSV=12431760 TSER=86875423

        Server ACKs but doesn't send anything else... [B]

        Time passes...

   2312 3675.366081 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [FIN, ACK] Seq=238569 Ack=230406 Win=8192 Len=0 TSV=87175425 TSER=12431760

        Client finally decides after 300 secs to close the connection
        too

   2313 3675.366149 xxx.xxx.31.43         xxx.xxx.16.142        TCP      904 > nfs [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=5 TSV=87175425 TSER=0

        and to re-open a new one.

   2314 3675.366318 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [ACK] Seq=230406 Ack=238570 Win=49232 Len=0 TSV=12461749 TSER=87175425
   2315 3675.366446 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 904 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=12461749 TSER=87175425 MSS=1460 WS=0
   2316 3675.366483 xxx.xxx.31.43         xxx.xxx.16.142        TCP      904 > nfs [ACK] Seq=1 Ack=1 Win=66592 Len=0 TSV=87175425 TSER=12461749
   2317 3675.366506 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call (Reply In 2319), FH:0x008002a2
   2318 3675.366660 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 904 [ACK] Seq=1 Ack=141 Win=49092 Len=0 TSV=12461749 TSER=87175425
   2319 3675.367356 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 ACCESS Reply (Call In 2317)
   2320 3675.367425 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 GETATTR Call (Reply In 2322), FH:0x170cb16a
   2321 3675.367644 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 904 [ACK] Seq=125 Ack=277 Win=49232 Len=0 TSV=12461749 TSER=87175426
   2322 3675.367730 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 GETATTR Reply (Call In 2320)  Directory mode:2755 uid:4100 gid:4100
   2323 3675.367759 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call (Reply In 2325), FH:0x170cb16a

Point [A] seems somwehat worrisome to me:

Though technically the connection is closed in one direction
only, the intention of the server seems clear, and it would be better to
be careful and make a new connection right away.

[B] would be a bug of the server in my opinion. If it ACKs a call, it
should send a reply. And if it can't, it shouldn't.

Please Cc me on replies, I am not subscribed to this list.

-Olaf.
--
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by kometen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> I see an annoying behaviour with NFS over TCP. It happens both with nfs
> and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> some Linux or perhaps Solaris, I'm not entirely sure.

I used nfs with tcp on a 7.2-client without problems on a solaris
nfs-server. When I upgraded to RC1 I had 'server not responding -
alive again' messages so I swithced to udp which works flawlessly. I
haven't had time to investigate it though.
--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentler gamester is the soonest winner.

Shakespeare

twitter.com/kometen
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Olaf Seibert-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue 27 Oct 2009 at 18:01:11 +0100, Claus Guttesen wrote:
> I used nfs with tcp on a 7.2-client without problems on a solaris
> nfs-server. When I upgraded to RC1 I had 'server not responding -
> alive again' messages so I swithced to udp which works flawlessly. I
> haven't had time to investigate it though.

Sounds like the same as my problem. I have switched to UDP and so far,
with light useage, I have seen no problem. (It would appear with light
useage only, anyway).

I have looked at some source, but not being an expert on the terrirory,
I haven't seen anything obvious yet.

> Claus
-Olaf.
--
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Rick Macklem :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


First off, I know that cross posting is evil, but I wanted to try
and make sure developers saw it.

On Tue, 27 Oct 2009, Olaf Seibert wrote:

> I see an annoying behaviour with NFS over TCP. It happens both with nfs
> and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> some Linux or perhaps Solaris, I'm not entirely sure.
>
> After trying to find something in packet traces, I think I have found
> something.
>
> The scenario seems to be as follows. Sorry for the width of the lines.
>
>
> No.     Time        Source                Destination           Protocol Info
>   2296 2992.216855 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2297), DH:0x3819da36/w
>   2297 2992.217107 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2296) Error:NFS3ERR_NOENT
>   2298 2992.217141 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2299), DH:0x170cb16a/bin
>   2299 2992.217334 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2298), FH:0x61b8eb12
>   2300 2992.217361 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call (Reply In 2301), FH:0x61b8eb12
>   2301 2992.217582 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 ACCESS Reply (Call In 2300)
>   2302 2992.217605 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2303), DH:0x61b8eb12/w
>   2303 2992.217860 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2302) Error:NFS3ERR_NOENT
>   2304 2992.318770 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238293 Ack=230289 Win=8192 Len=0 TSV=86492342 TSER=12393434
>   2306 3011.537520 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 GETATTR Reply (Call In 2305)  Directory mode:2755 uid:4100 gid:4100
>   2307 3011.637744 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238429 Ack=230405 Win=8192 Len=0 TSV=86511662 TSER=12395366
>   2308 3371.534980 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [FIN, ACK] Seq=230405 Ack=238429 Win=49232 Len=0 TSV=12431366 TSER=86511662
>
> The server decides, for whatever reason, to terminate the
> connection and sends a FIN.
>
>   2309 3371.535018 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238429 Ack=230406 Win=8192 Len=0 TSV=86871578 TSER=12431366
>
> Client acknowledges this,
>
>   2310 3375.379693 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call, FH:0x008002a2
>
> but tries to sneak in another call anyway. [A]
>
Probably not the best behaviour, but I think it is technically allowed by
TCP. (My TCP is very rusty, but I think the socket should be in
TCPS_CLOSE_WAIT at this point and the BSD code will have called
socantrcvmore(), but not socantsndmore().)

>   2311 3375.474788 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [ACK] Seq=230406 Ack=238569 Win=49232 Len=0 TSV=12431760 TSER=86875423
>
> Server ACKs but doesn't send anything else... [B]
>
> Time passes...
>
This is where it seems interesting. It looks to me like the socket upcall
for receiving the FIN would have happened before this point, setting the
ct_error.re_status to RPC_CANTRECV, but the code in clnt_vc_call() doesn't
check for this. (It does check for it happening during and after the
sosend(), but not before it, from what I can see.)

>
> [B] would be a bug of the server in my opinion. If it ACKs a call, it
> should send a reply. And if it can't, it shouldn't.
>
I'll leave this one for the TCP wizzards. I'm not sure what the
correct behaviour is when data is received on a connection. (I think
it is waiting for a FIN from the client side at this point.)

If you could try the following patch and see if it helps, that would be
appreciated, rick
ps: I'll try to reproduce the situation here, but I'm not sure if I can.
--- rpc/clnt_vc.c.sav 2009-10-28 15:44:20.000000000 -0400
+++ rpc/clnt_vc.c 2009-10-28 15:49:57.000000000 -0400
@@ -413,6 +413,19 @@

  cr->cr_xid = xid;
  mtx_lock(&ct->ct_lock);
+ /*
+ * Check to see if the other end has already started to close down
+ * the connection. If it happens after this point, it will be
+ * detected below, when cr->cr_error is checked.
+ */
+ if (ct->ct_error.re_status == RPC_CANTRECV) {
+ if (errp != &ct->ct_error) {
+ errp->re_errno = ct->ct_error.re_errno;
+ errp->re_status = RPC_CANTRECV;
+ }
+ stat = RPC_CANTRECV;
+ goto out;
+ }
  TAILQ_INSERT_TAIL(&ct->ct_pending, cr, cr_link);
  mtx_unlock(&ct->ct_lock);

_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Olaf Seibert-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed 28 Oct 2009 at 16:38:27 -0400, Rick Macklem wrote:
> I'll leave this one for the TCP wizzards. I'm not sure what the
> correct behaviour is when data is received on a connection. (I think
> it is waiting for a FIN from the client side at this point.)

After writing, I realised that it is indeed perfectly allowed for the
client to send data. But since the server already sent its FIN, it can't
send anything more, not even an error message. So with that in mind, the
client shouldn't send anything any more either.

> If you could try the following patch and see if it helps, that would be
> appreciated, rick

Thanks, it looks like it should do the trick. I can't try it before
monday, though.

-Olaf.
--
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Rick Macklem :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Thu, 29 Oct 2009, Olaf Seibert wrote:

>
> After writing, I realised that it is indeed perfectly allowed for the
> client to send data. But since the server already sent its FIN, it can't
> send anything more, not even an error message. So with that in mind, the
> client shouldn't send anything any more either.
>
Here is what I am seeing without and with the patch. The client is a
pretty recent FreeBSD-CURRENT (nfsv4-test) and the server Solaris10
(nfsv4-solaris). I don't get the 5 minute reconnect delay without
the patch. I think the reason is that the resets (Rst's) "inspire"
the FreeBSD-CURRENT TCP to do the new connection. (I don't remember
seeing any RSTs in your tcpdump?) I deleted a few irrelevant lines
(packets not between nfsv4-test and nfsv4-solaris), which is why
the packet #s aren't contiguous. (and appologies for the long lines)

Hopefully someone with TCP expertise will know if the RSTs are
done correctly and whether or not your server should be generating
them.

After the patch things look fine to me. Hopefully your (and others)
testing will go ok.

Snoop trace without the patch (what FreeBSD-CURRENT will do now):
   2 209.18617 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=699 S=2049 Fin Ack=1066292217 Seq=1580479914 Len=0 Win=49232 Options=<nop,nop,tstamp 25243063 5673171>
   3 209.18645 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=699 Ack=1580479915 Seq=1066292217 Len=0 Win=16588 Options=<nop,nop,tstamp 6031630 25243063>
   4 209.18656 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=699 Rst Ack=1580479915 Seq=1066292217 Len=0 Win=0
   5 209.18662 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=699 Rst Ack=1580479915 Seq=1066292217 Len=0 Win=0
   7 528.97250 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris NFS C FSSTAT3 FH=9D01
   8 528.97261 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=699 S=2049 Rst Seq=1580479915 Len=0 Win=0
   9 528.97311 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=818 Syn Seq=3293207355 Len=0 Win=65535 Options=<mss 1460,nop,wscale 3,sackOK,tstamp 6350137 0>
  10 528.97329 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=818 S=2049 Syn Ack=3293207356 Seq=1757551152 Len=0 Win=49232 Options=<nop,nop,tstamp 25275042 6350137,mss 1460,nop,wscale 0,nop,nop,sackOK>
  11 528.97354 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=818 Ack=1757551153 Seq=3293207356 Len=0 Win=8326 Options=<nop,nop,tstamp 6350138 25275042>
  12 528.97375 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris NFS C FSSTAT3 FH=9D01
  13 528.97382 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=818 Rst Ack=1757551153 Seq=3293207356 Len=0 Win=0
  14 528.97439 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=818 S=2049 Ack=3293207488 Seq=1757551153 Len=0 Win=49100 Options=<nop,nop,tstamp 25275042 6350138>
  15 528.97524 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca NFS R FSSTAT3 OK
  16 529.07565 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=818 Ack=1757551325 Seq=3293207488 Len=0 Win=16588 Options=<nop,nop,tstamp 6350240 25275042>

and what happens after the patch is applied:
   6   1.35481 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris NFS C FSSTAT3 FH=9D01
   7   1.35516 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca NFS R FSSTAT3 OK
   8   1.45564 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=651 Ack=2612339569 Seq=837212133 Len=0 Win=16588 Options=<nop,nop,tstamp 144535 25632953>
   9 361.34400 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=651 S=2049 Fin Ack=837212133 Seq=2612339569 Len=0 Win=49232 Options=<nop,nop,tstamp 25668953 144535>
  10 361.34434 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=651 Ack=2612339570 Seq=837212133 Len=0 Win=16588 Options=<nop,nop,tstamp 502984 25668953>
  11 361.34441 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=651 Rst Ack=2612339570 Seq=837212133 Len=0 Win=0
  12 361.34447 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=651 Rst Ack=2612339570 Seq=837212133 Len=0 Win=0
  14 501.80966 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=881 Syn Seq=1926848102 Len=0 Win=65535 Options=<mss 1460,nop,wscale 3,sackOK,tstamp 642887 0>
  15 501.80979 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=881 S=2049 Syn Ack=1926848103 Seq=2754679721 Len=0 Win=49232 Options=<nop,nop,tstamp 25682999 642887,mss 1460,nop,wscale 0,nop,nop,sackOK>
  16 501.81006 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=881 Ack=2754679722 Seq=1926848103 Len=0 Win=8326 Options=<nop,nop,tstamp 642888 25682999>
  17 501.81024 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris NFS C FSSTAT3 FH=9D01
  18 501.81089 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca TCP D=881 S=2049 Ack=1926848235 Seq=2754679722 Len=0 Win=49100 Options=<nop,nop,tstamp 25682999 642888>
  19 501.81169 nfsv4-solaris -> nfsv4-test.cis.uoguelph.ca NFS R FSSTAT3 OK
  20 501.91218 nfsv4-test.cis.uoguelph.ca -> nfsv4-solaris TCP D=2049 S=881 Ack=2754679894 Seq=1926848235 Len=0 Win=16588 Options=<nop,nop,tstamp 642990 25682999>

Anyone with TCP expertise have opinions on these? rick
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Daniel Braniss-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

>
> First off, I know that cross posting is evil, but I wanted to try
> and make sure developers saw it.
>
> On Tue, 27 Oct 2009, Olaf Seibert wrote:
>
> > I see an annoying behaviour with NFS over TCP. It happens both with nfs
> > and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> > some Linux or perhaps Solaris, I'm not entirely sure.
> >
> > After trying to find something in packet traces, I think I have found
> > something.
> >
> > The scenario seems to be as follows. Sorry for the width of the lines.
> >
> >
> > No.     Time        Source                Destination           Protocol Info
> >   2296 2992.216855 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2297), DH:0x3819da36/w
> >   2297 2992.217107 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2296) Error:NFS3ERR_NOENT
> >   2298 2992.217141 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2299), DH:0x170cb16a/bin
> >   2299 2992.217334 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2298), FH:0x61b8eb12
> >   2300 2992.217361 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call (Reply In 2301), FH:0x61b8eb12
> >   2301 2992.217582 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 ACCESS Reply (Call In 2300)
> >   2302 2992.217605 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 2303), DH:0x61b8eb12/w
> >   2303 2992.217860 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 2302) Error:NFS3ERR_NOENT
> >   2304 2992.318770 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238293 Ack=230289 Win=8192 Len=0 TSV=86492342 TSER=12393434
> >   2306 3011.537520 xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 GETATTR Reply (Call In 2305)  Directory mode:2755 uid:4100 gid:4100
> >   2307 3011.637744 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238429 Ack=230405 Win=8192 Len=0 TSV=86511662 TSER=12395366
> >   2308 3371.534980 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [FIN, ACK] Seq=230405 Ack=238429 Win=49232 Len=0 TSV=12431366 TSER=86511662
> >
> > The server decides, for whatever reason, to terminate the
> > connection and sends a FIN.
> >
> >   2309 3371.535018 xxx.xxx.31.43         xxx.xxx.16.142        TCP      934 > nfs [ACK] Seq=238429 Ack=230406 Win=8192 Len=0 TSV=86871578 TSER=12431366
> >
> > Client acknowledges this,
> >
> >   2310 3375.379693 xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call, FH:0x008002a2
> >
> > but tries to sneak in another call anyway. [A]
> >
> Probably not the best behaviour, but I think it is technically allowed by
> TCP. (My TCP is very rusty, but I think the socket should be in
> TCPS_CLOSE_WAIT at this point and the BSD code will have called
> socantrcvmore(), but not socantsndmore().)
>
> >   2311 3375.474788 xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 934 [ACK] Seq=230406 Ack=238569 Win=49232 Len=0 TSV=12431760 TSER=86875423
> >
> > Server ACKs but doesn't send anything else... [B]
> >
> > Time passes...
> >
> This is where it seems interesting. It looks to me like the socket upcall
> for receiving the FIN would have happened before this point, setting the
> ct_error.re_status to RPC_CANTRECV, but the code in clnt_vc_call() doesn't
> check for this. (It does check for it happening during and after the
> sosend(), but not before it, from what I can see.)
>
> >
> > [B] would be a bug of the server in my opinion. If it ACKs a call, it
> > should send a reply. And if it can't, it shouldn't.
> >
> I'll leave this one for the TCP wizzards. I'm not sure what the
> correct behaviour is when data is received on a connection. (I think
> it is waiting for a FIN from the client side at this point.)
>
> If you could try the following patch and see if it helps, that would be
> appreciated, rick
> ps: I'll try to reproduce the situation here, but I'm not sure if I can.
> --- rpc/clnt_vc.c.sav 2009-10-28 15:44:20.000000000 -0400
> +++ rpc/clnt_vc.c 2009-10-28 15:49:57.000000000 -0400
> @@ -413,6 +413,19 @@
>
>   cr->cr_xid = xid;
>   mtx_lock(&ct->ct_lock);
> + /*
> + * Check to see if the other end has already started to close down
> + * the connection. If it happens after this point, it will be
> + * detected below, when cr->cr_error is checked.
> + */
> + if (ct->ct_error.re_status == RPC_CANTRECV) {
> + if (errp != &ct->ct_error) {
> + errp->re_errno = ct->ct_error.re_errno;
> + errp->re_status = RPC_CANTRECV;
> + }
> + stat = RPC_CANTRECV;
> + goto out;
> + }
>   TAILQ_INSERT_TAIL(&ct->ct_pending, cr, cr_link);
>   mtx_unlock(&ct->ct_lock);

Did a make buildworld -j8, using as server netapp, freebsd,
and all seems ok.

danny


_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Rick Macklem :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Thu, 29 Oct 2009, Olaf Seibert wrote:

>
> Thanks, it looks like it should do the trick. I can't try it before
> monday, though.
>
Although I think the patch does avoid sending the request on the
partially closed connection, it doesn't fix the "real problem",
so I don't know if it is worth testing?

I'm hoping that the "Help TCP Wizards..." thread I just started
on freebsd-current comes up with something.

At least I can reproduce the problem now. (For some reason, I have
to reboot the Solaris10 server before the problem appears for me.
I can't think why this matters, but that's networking for you:-)

rick

_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Olaf Seibert-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun 01 Nov 2009 at 17:17:15 -0500, Rick Macklem wrote:
> On Thu, 29 Oct 2009, Olaf Seibert wrote:
>
> >
> > Thanks, it looks like it should do the trick. I can't try it before
> > monday, though.
> >
> Although I think the patch does avoid sending the request on the
> partially closed connection, it doesn't fix the "real problem",
> so I don't know if it is worth testing?

Well, I tested it anyway, just in case. It seems to work fine for me, so
far.

I don't see your extra RSTs either. Maybe that is because in my case the
client used a different port number for the new connection. (Usually,
this is controlled by the TCP option SO_REUSEADDR from <sys/socket.h>).

Here is a new packet trace. I had to cut out some packets since I forgot
to kill some (failing) mount attempts of another directory on the same
server.
(sorry again for the long lines)

No.     Time        Source                Destination           Protocol Info
    486 60.438406   xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 LOOKUP Call (Reply In 487), DH:0x61b8eb12/date
    487 60.438629   xxx.xxx.16.142        xxx.xxx.31.43         NFS      V3 LOOKUP Reply (Call In 486) Error:NFS3ERR_NOENT
    488 60.538796   xxx.xxx.31.43         xxx.xxx.16.142        TCP      hello-port > nfs [ACK] Seq=36477 Ack=44701 Win=8192 Len=0 TSV=228817 TSER=1575935

last real action on old connection (client port "hello-port")

    537 420.437763  xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > hello-port [FIN, ACK] Seq=44701 Ack=36477 Win=49232 Len=0 TSV=1611935 TSER=228817
    538 420.437805  xxx.xxx.31.43         xxx.xxx.16.142        TCP      hello-port > nfs [ACK] Seq=36477 Ack=44702 Win=8192 Len=0 TSV=588734 TSER=1611935

server ends connection

    563 605.334262  xxx.xxx.31.43         xxx.xxx.16.142        TCP      hello-port > nfs [FIN, ACK] Seq=36477 Ack=44702 Win=8192 Len=0 TSV=773641 TSER=1611935

some time later, client now ends connection before sending its request on new connection
(port 875)

    564 605.334303  xxx.xxx.31.43         xxx.xxx.16.142        TCP      875 > nfs [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=5 TSV=773641 TSER=0
    565 605.334440  xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > hello-port [ACK] Seq=44702 Ack=36478 Win=49232 Len=0 TSV=1630424 TSER=773641
    566 605.334564  xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 875 [SYN, ACK] Seq=0 Ack=1 Win=49232 Len=0 TSV=1630424 TSER=773641 MSS=1460 WS=0
    567 605.334588  xxx.xxx.31.43         xxx.xxx.16.142        TCP      875 > nfs [ACK] Seq=1 Ack=1 Win=66592 Len=0 TSV=773641 TSER=1630424

new connection set up

    568 605.334605  xxx.xxx.31.43         xxx.xxx.16.142        NFS      V3 ACCESS Call (Reply In 570), FH:0x008002a2
    569 605.334828  xxx.xxx.16.142        xxx.xxx.31.43         TCP      nfs > 875 [ACK] Seq=1 Ack=141 Win=49092 Len=0 TSV=1630424 TSER=773641

and in use

> I'm hoping that the "Help TCP Wizards..." thread I just started
> on freebsd-current comes up with something.
>
> At least I can reproduce the problem now. (For some reason, I have
> to reboot the Solaris10 server before the problem appears for me.
> I can't think why this matters, but that's networking for you:-)

Maybe it depends on server load or something. This particular server is
a central file server at a university, it may have some more pressure to
terminate unused connections.

> rick
-Olaf.
--
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Rick Macklem :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Mon, 2 Nov 2009, Olaf Seibert wrote:

>> Although I think the patch does avoid sending the request on the
>> partially closed connection, it doesn't fix the "real problem",
>> so I don't know if it is worth testing?
>
> Well, I tested it anyway, just in case. It seems to work fine for me, so
> far.
>
Yes, I think the patch is ok, but it doesn't completely resolve the
reconnect issue. It's good to hear that it helps for your case.

> I don't see your extra RSTs either. Maybe that is because in my case the
> client used a different port number for the new connection. (Usually,
> this is controlled by the TCP option SO_REUSEADDR from <sys/socket.h>).
>
For my packet trace, it is using different port#s. The problem is that,
for some reason, it sends the RST from the new port# instead of the port#
for the old connection just closed via soclose().

I don't know why you don't see the extra RSTs, but consider yourself
lucky, since you should be ok without them. (It may simply be that your
server isn't Solaris10 --> a different TCP stack in it.)

Do you happen to know what your server is?

>> I'm hoping that the "Help TCP Wizards..." thread I just started
>> on freebsd-current comes up with something.
>>
>> At least I can reproduce the problem now. (For some reason, I have
>> to reboot the Solaris10 server before the problem appears for me.
>> I can't think why this matters, but that's networking for you:-)
>
> Maybe it depends on server load or something. This particular server is
> a central file server at a university, it may have some more pressure to
> terminate unused connections.
>
Or type of server (ie. not Solaris10). It definitely depends upon timing
in the client. (I'm about to try introducing a 1sec delay before the
soconnect() call and see if that makes the RSTs go away. Not much of a
fix, but...)

I now recall that I ran into a similar problem (although I didn't dig
into the packet traces then) when testing my Mac OS X 10 client, which
uses essentially the reconnect code from Mac OS X 10.4 Tiger. I "fixed"
it by adding a 1sec delay before the reconnect.

Thanks for helping with testing, rick

_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Rick Macklem :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Mon, 2 Nov 2009, Olaf Seibert wrote:

>> Although I think the patch does avoid sending the request on the
>> partially closed connection, it doesn't fix the "real problem",
>> so I don't know if it is worth testing?
>
> Well, I tested it anyway, just in case. It seems to work fine for me, so
> far.
>
> I don't see your extra RSTs either. Maybe that is because in my case the
> client used a different port number for the new connection. (Usually,
> this is controlled by the TCP option SO_REUSEADDR from <sys/socket.h>).
>
It seems that the pesky RSTs I was seeing were generated by the net chip
in the machine I was using (Intel 82801BA/CAM - fxp driver) when TSO
was enabled for it.

sysctl net.inet.tcp.tso=0

got rid of the problem and, with the patch you already tested, thinks
are testing well here.

If anyone is still having NFS over TCP reconnect problems after
applying the patch, please try the above and see if it helps.

Thanks, rick
_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."

Re: 8.0-RC1 NFS client timeout issue

by Rick Macklem :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



On Tue, 27 Oct 2009, Olaf Seibert wrote:

> I see an annoying behaviour with NFS over TCP. It happens both with nfs
> and newnfs. This is with FreeBSD/amd64 8.0-RC1 as client. The server is
> some Linux or perhaps Solaris, I'm not entirely sure.
>
[good stuff snipped...]
>
> Though technically the connection is closed in one direction
> only, the intention of the server seems clear, and it would be better to
> be careful and make a new connection right away.
>
I believe that r199293 committed to stable/8 should fix this. It did not
make it into FreeBSD8.0, so users of FreeBSD8.0 will need to switch to
using "udp" or apply the patch themselves, if slow reconnects after a
non-FreeBSD NFS server are causing them grief.

rick

_______________________________________________
freebsd-stable@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@..."