[patch] Increase send time out on a socket long enough in order to eliminate any timeouts on large sends


by Shirish Pargaonkar

In spite of a set of data integrity patches in cifs last year, errors
still persist that are caused by socket timeouts: a send times out partway
through, incomplete data reaches the server, and data integrity errors follow.

The proposed socket send timeout is large enough to eliminate that possibility.
Tests with this patch showed no data integrity errors over an 80-hour test run;
without it, they show up within a matter of hours.
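
As a rough illustration of how a send timeout can leave a large write only
partly sent, here is a minimal userspace sketch in Python. It is not the patch
(which is kernel code); the helper names send_with_timeout and demo, the 8 MiB
payload and the 0.2 s timeout are assumptions chosen for the demonstration. The
peer never reads, so the send buffer fills and the timeout fires mid-write.

```python
import socket


def send_with_timeout(sock, payload, timeout_s):
    """Try to send all of payload; return how many bytes were queued
    before the send timed out (or len(payload) if everything went out)."""
    sock.settimeout(timeout_s)
    view = memoryview(payload)
    sent = 0
    try:
        while sent < len(view):
            sent += sock.send(view[sent:])
    except socket.timeout:
        pass  # the send buffer stayed full for timeout_s seconds
    return sent


def demo():
    # b never reads, so a's send buffer fills up and the send stalls.
    a, b = socket.socketpair()
    try:
        payload = b"x" * (8 * 1024 * 1024)
        return send_with_timeout(a, payload, 0.2), len(payload)
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    sent, total = demo()
    print(f"queued {sent} of {total} bytes before the timeout")
```

A longer timeout only makes this outcome less likely; whenever the peer stalls
for longer than the timeout, the caller is still left with a partly sent write.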

Regards,

Shirish


_______________________________________________
linux-cifs-client mailing list
linux-cifs-client@...
https://lists.samba.org/mailman/listinfo/linux-cifs-client

Attachment: cifs.sndtimeo.1.patch (926 bytes)

Re: [patch] Increase send time out on a socket long enough in order to eliminate any timeouts on large sends

by Jeff Layton-2

On Wed, 22 Jul 2009 20:14:38 -0500
Shirish Pargaonkar <shirishpargaonkar@...> wrote:

> In spite of a set of data integrity patches in cifs last year, errors
> still persist that are caused by socket timeouts: a send times out partway
> through, incomplete data reaches the server, and data integrity errors follow.
>
> The proposed socket send timeout is large enough to eliminate that possibility.

On what evidence do you base the above statement? Who's to say that 30s
is long enough if someone has a high-latency enough connection?

> Tests with this patch showed no data integrity errors over an 80-hour test run;
> without it, they show up within a matter of hours.
>

Also, can you give some details about these data integrity errors? Were
writes failing? If so, were they not reported at fsync or close?

My suspicion is that the main problem here is the default of "soft" for
CIFS mounts. It's well known that that's a recipe for data corruption
with NFS and there's no reason why it wouldn't be the same for CIFS.

Instead of this patch, how about doing a patch that fixes the
hard/soft mount options for CIFS and see whether you can still
reproduce the data corruption with a hard mount?

--
Jeff Layton <jlayton@...>

Re: [patch] Increase send time out on a socket long enough in order to eliminate any timeouts on large sends

by Shirish Pargaonkar

On Thu, Jul 23, 2009 at 7:05 AM, Jeff Layton<jlayton@...> wrote:

> Also, can you give some details about these data integrity errors? Were
> writes failing? If so, were they not reported at fsync or close?

The errors logged by the cifs client looked like the log excerpt below.
This is what I had seen last year when the patches were developed.
The entire write could not be sent because of a socket timeout; another thread
then fills in the rest of the 56K write with its own data, so the second 56K
write gets no response and the client logs 'No response for cmd'.
The longer timeout seems to be long enough for the server to receive the entire
smbwrite (56K).

May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: server not responding
May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response for cmd 50 mid 20646
May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response to cmd 47 mid 20647
May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -11, wrote 0
May 12 05:17:11 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -9, wrote 0
May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: server not responding
May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: No response for cmd 50 mid 21347
May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: No response to cmd 47 mid 21348
May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -11, wrote 0
May 12 05:17:39 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -9, wrote 0
May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: server not responding
May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response to cmd 46 mid 24859
May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: Send error in read = -11
May 12 05:18:09 voyBCSsles11-rc3 kernel:  CIFS VFS: No response for cmd 50 mid 24858
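
For readers decoding the return values in the log above, the following small
Python sketch maps them to errno names. It assumes Linux errno numbering (the
log comes from a Linux kernel), where 11 is EAGAIN and 9 is EBADF; the helper
name decode_write2 and the lookup table are hypothetical and cover only the two
values that appear in this log.

```python
import re

# Linux errno values for the two codes seen in the log (assumption: Linux client).
LINUX_ERRNO = {9: "EBADF", 11: "EAGAIN"}

WRITE2 = re.compile(r"Write2 ret (-?\d+), wrote (\d+)")


def decode_write2(line):
    """Return (errno_name, bytes_written) for a 'Write2 ret' log line,
    or None if the line is not a Write2 message."""
    m = WRITE2.search(line)
    if m is None:
        return None
    ret, wrote = int(m.group(1)), int(m.group(2))
    # The kernel reports errors as negative errno values.
    return LINUX_ERRNO.get(-ret, "unknown"), wrote


if __name__ == "__main__":
    line = "May 12 05:17:09 voyBCSsles11-rc3 kernel:  CIFS VFS: Write2 ret -11, wrote 0"
    print(decode_write2(line))
```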



Re: [patch] Increase send time out on a socket long enough in order to eliminate any timeouts on large sends

by Jeff Layton-2

On Thu, 23 Jul 2009 09:51:32 -0500
Shirish Pargaonkar <shirishpargaonkar@...> wrote:

> The errors logged by the cifs client looked like the log excerpt below.
> This is what I had seen last year when the patches were developed.
> The entire write could not be sent because of a socket timeout; another thread
> then fills in the rest of the 56K write with its own data, so the second 56K
> write gets no response and the client logs 'No response for cmd'.
> The longer timeout seems to be long enough for the server to receive the entire
> smbwrite (56K).

It sounds like the original bug was never fixed then, only made less
likely by changing the timing. This patch looks like it just does the
same thing.

Rather than papering over the bug by increasing the timeout, I think a
patch is needed that fixes the actual bug. That is, you need to make it
impossible for these sorts of interleaved sends to occur.
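
To make the interleaving problem concrete, here is a small Python simulation of
a length-prefixed byte stream. The 4-byte big-endian length prefix is a
simplification chosen for the sketch, not the exact on-the-wire SMB framing,
and the names frame, parse, write_a and write_b are hypothetical. It shows what
a receiver sees when one sender stops after 10 bytes and a second sender then
writes a complete frame onto the same stream.

```python
import struct


def frame(payload):
    """Prefix payload with a 4-byte big-endian length (simplified framing)."""
    return struct.pack(">I", len(payload)) + payload


def parse(stream):
    """Split a byte stream into payloads the way a length-prefixed receiver would."""
    out, pos = [], 0
    while pos + 4 <= len(stream):
        (n,) = struct.unpack_from(">I", stream, pos)
        body = stream[pos + 4 : pos + 4 + n]
        if len(body) < n:
            break  # receiver is still waiting for the rest of this frame
        out.append(body)
        pos += 4 + n
    return out


write_a = frame(b"A" * 16)
write_b = frame(b"B" * 16)

# Both frames sent whole: the receiver recovers both payloads.
clean = parse(write_a + write_b)

# Sender 1 stops after 10 bytes of its frame; sender 2 then sends its whole frame.
torn = parse(write_a[:10] + write_b)

if __name__ == "__main__":
    print(clean)
    print(torn)
```

In the torn case the receiver completes the first frame with the header and
part of the payload of the second, then waits for more data that never comes.
That is one payload of mixed data and one request that is never answered, which
is consistent with the failure described earlier in the thread.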

--
Jeff Layton <jlayton@...>

Re: [patch] Increase send time out on a socket long enough in order to eliminate any timeouts on large sends

by Shirish Pargaonkar

On Thu, Jul 23, 2009 at 10:34 AM, Jeff Layton<jlayton@...> wrote:

> It sounds like the original bug was never fixed then, only made less
> likely by changing the timing. This patch looks like it just does the
> same thing.

The first step was to change the socket from non-blocking to blocking
to prevent interleaved sends.
A longer send timeout gives the send enough time to complete
instead of returning prematurely.

I cannot think of a way to abort a partially sent request to the server, and
I do not know whether it is possible to be sure that the entire 56K buffer is
available before dispatching a send on a (test-induced) stressed socket.

>
> Rather than papering over the bug by increasing the timeout, I think a
> patch is needed that fixes the actual bug. That is, you need to make it
> impossible for these sorts of interleaved sends to occur.

Re: [patch] Increase send time out on a socket long enough in order to eliminate any timeouts on large sends

by Jeff Layton-2

On Thu, 23 Jul 2009 12:00:25 -0500
Shirish Pargaonkar <shirishpargaonkar@...> wrote:

> The first step was to change the socket from non-blocking to blocking
> to prevent interleaved sends.
> A longer send timeout gives the send enough time to complete
> instead of returning prematurely.
>
> I cannot think of a way to abort a partially sent request to the server, and
> I do not know whether it is possible to be sure that the entire 56K buffer is
> available before dispatching a send on a (test-induced) stressed socket.
>

I think we already discussed this several months ago and agreed that the
right thing to do is to detect when a partial send has occurred and to
reconnect the socket when it does. I can dig up the discussion again,
but you probably remember it...

The question I have is -- why didn't that happen here? That should have
prevented these interleaved sends...right?

Increasing the send timeout will have other effects too that you're not
accounting for here. You're increasing the total send timeout from 15s
to 90s (since Steve wanted to keep this loop in smb_sendv instead
of just letting the socket layer handle it). That potentially changes
the overall timeout for SMB calls.
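
The 15s and 90s totals can be reproduced with simple arithmetic. The sketch
below assumes three send attempts in the retry loop and a current per-attempt
timeout of 5s; both numbers are inferred from the figures quoted in this thread
(90s / 30s = 3 and 15s / 3 = 5s) and are not stated explicitly here. The
function name worst_case_send_seconds is hypothetical.

```python
# Assumption: three attempts, inferred from 90 s total / 30 s per attempt.
ATTEMPTS = 3


def worst_case_send_seconds(per_attempt_timeout_s, attempts=ATTEMPTS):
    """Worst-case time spent in the send loop if every attempt times out."""
    return per_attempt_timeout_s * attempts


current = worst_case_send_seconds(5)    # assumed current per-attempt timeout
proposed = worst_case_send_seconds(30)  # per-attempt timeout discussed in the thread

if __name__ == "__main__":
    print(current, proposed)
```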

I'm very leery of increasing the send timeout and hoping for the best.
Since the consequences of getting this wrong are data corruption, we
need a real fix or a detailed explanation of how this is guaranteed to
prevent the problem in the future.

--
Jeff Layton <jlayton@...>