[patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

In spite of a set of data integrity patches in cifs last year, errors still persist that are caused by timeouts, which result in incomplete data being sent and hence in data integrity errors.

The proposed socket send timeout is large enough to eliminate that possibility.

Tests with this patch have eliminated data integrity errors over an 80-hour test run; without the patch, the errors manifest within a matter of hours.

Regards,

Shirish

_______________________________________________
linux-cifs-client mailing list
linux-cifs-client@...
https://lists.samba.org/mailman/listinfo/linux-cifs-client
Re: [patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

On Wed, 22 Jul 2009 20:14:38 -0500
Shirish Pargaonkar <shirishpargaonkar@...> wrote:

> In spite of a set of data integrity patches in cifs last year, there
> still persist errors caused due to timeouts resulting in sending
> incomplete data and hence data integrity errors.
>
> The proposed socket send timeout is large enough to eliminate that
> possibility.

On what evidence do you base the above statement? Who's to say that 30s is long enough if someone has a high-latency enough connection?

> The tests with this patch have resulted in eliminating data integrity
> errors on an 80-hour test run, which otherwise manifest in a matter of
> hours of a test run.

Also, can you give some details about these data integrity errors? Were writes failing? If so, were they not reported at fsync or close?

My suspicion is that the main problem here is the default of "soft" for CIFS mounts. It's well known that that's a recipe for data corruption with NFS and there's no reason why it wouldn't be the same for CIFS.

Instead of this patch, how about doing a patch that fixes the hard/soft mount options for CIFS and see whether you can still reproduce the data corruption with a hard mount?

--
Jeff Layton <jlayton@...>
Re: [patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

On Thu, Jul 23, 2009 at 7:05 AM, Jeff Layton <jlayton@...> wrote:

> Also, can you give some details about these data integrity errors? Were
> writes failing? If so, were they not reported at fsync or close?

The errors logged by the cifs client looked like the ones below. This is what I had seen last year when the patches were developed.

The entire write could not be sent because of the socket timeout. Another thread's send then fills in the rest of the 56K write, so the second 56K write never gets a response and the client logs 'No response for cmd'.

The longer timeout seems to be long enough for the server to receive the entire smbwrite (56K).

May 12 05:17:09 voyBCSsles11-rc3 kernel: CIFS VFS: server not responding
May 12 05:17:09 voyBCSsles11-rc3 kernel: CIFS VFS: No response for cmd 50 mid 20646
May 12 05:17:09 voyBCSsles11-rc3 kernel: CIFS VFS: No response to cmd 47 mid 20647
May 12 05:17:09 voyBCSsles11-rc3 kernel: CIFS VFS: Write2 ret -11, wrote 0
May 12 05:17:11 voyBCSsles11-rc3 kernel: CIFS VFS: Write2 ret -9, wrote 0
May 12 05:17:39 voyBCSsles11-rc3 kernel: CIFS VFS: server not responding
May 12 05:17:39 voyBCSsles11-rc3 kernel: CIFS VFS: No response for cmd 50 mid 21347
May 12 05:17:39 voyBCSsles11-rc3 kernel: CIFS VFS: No response to cmd 47 mid 21348
May 12 05:17:39 voyBCSsles11-rc3 kernel: CIFS VFS: Write2 ret -11, wrote 0
May 12 05:17:39 voyBCSsles11-rc3 kernel: CIFS VFS: Write2 ret -9, wrote 0
May 12 05:18:09 voyBCSsles11-rc3 kernel: CIFS VFS: server not responding
May 12 05:18:09 voyBCSsles11-rc3 kernel: CIFS VFS: No response to cmd 46 mid 24859
May 12 05:18:09 voyBCSsles11-rc3 kernel: CIFS VFS: Send error in read = -11
May 12 05:18:09 voyBCSsles11-rc3 kernel: CIFS VFS: No response for cmd 50 mid 24858
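The failure mode described in this message (a partly sent 56K write whose remainder is filled in by another thread's send) can be reproduced in miniature. The following is only a sketch, not CIFS code: it assumes a hypothetical 4-byte length prefix in place of the real SMB framing, and it models the socket as a plain byte string. It shows why the receiver ends up with one mixed write and never sees a complete second request.

```python
import struct

WRITE_SIZE = 56 * 1024  # the 56K write size mentioned in the thread


def frame(payload):
    """Prefix a payload with a 4-byte big-endian length (hypothetical framing)."""
    return struct.pack(">I", len(payload)) + payload


def parse_frames(stream):
    """Split a byte stream into payloads, as a length-prefixed receiver would."""
    payloads, pos = [], 0
    while pos + 4 <= len(stream):
        (length,) = struct.unpack_from(">I", stream, pos)
        if pos + 4 + length > len(stream):
            break  # the receiver is still waiting for the rest of this frame
        payloads.append(stream[pos + 4:pos + 4 + length])
        pos += 4 + length
    return payloads


first = frame(b"A" * WRITE_SIZE)
second = frame(b"B" * WRITE_SIZE)

# Healthy case: both requests reach the wire whole and in order.
clean = parse_frames(first + second)

# Faulty case: the first send times out halfway through, then another
# thread sends its own complete request on the same socket.
partial = first[:len(first) // 2]
corrupt = parse_frames(partial + second)

print(len(clean), len(corrupt))
```

In the faulty case the receiver completes the first request using bytes from the second one, so the data it stores is a mix of both writes, and what is left of the second request no longer parses as a request at all.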
Re: [patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

On Thu, 23 Jul 2009 09:51:32 -0500
Shirish Pargaonkar <shirishpargaonkar@...> wrote:

> The errors logged by cifs client were like this
> This is what I had seen last year when the patches were developed.
> The entire write could not be sent because of socket timeout, other thread
> fills in rest of the 56K write so that second 56K is not responded and client
> logs 'No response for cmd'.
> The longer timeout seems to be long enough for server to receive entire
> smbwrite (56K).

It sounds like the original bug was never fixed then, only made less likely by changing the timing. This patch looks like it just does the same thing.

Rather than papering over the bug by increasing the timeout, I think a patch is needed that fixes the actual bug. That is, you need to make it impossible for these sorts of interleaved sends to occur.

--
Jeff Layton <jlayton@...>
Re: [patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

On Thu, Jul 23, 2009 at 10:34 AM, Jeff Layton <jlayton@...> wrote:

> It sounds like the original bug was never fixed then, only made less
> likely by changing the timing. This patch looks like it just does the
> same thing.

The first step was to change the socket from non-blocking to blocking to prevent interleaved sends. A longer send timeout gives the send enough time to complete instead of returning prematurely.

I cannot think of a way to abort a partially sent request to the server, and I do not know whether it is possible to be sure that the entire 56K buffer is available before dispatching a send on a (test-induced) stressed socket.

> Rather than papering over the bug by increasing the timeout, I think a
> patch is needed that fixes the actual bug. That is, you need to make it
> impossible for these sorts of interleaved sends to occur.
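How a send with a timeout can return with only part of the buffer queued can be seen from user space. This is a minimal sketch under stated assumptions, not the kernel code path: `socket.socketpair()` stands in for the connection to the server, the peer deliberately never reads, and `send_with_timeout` is a hypothetical helper name.

```python
import socket


def send_with_timeout(sock, data, timeout_s):
    """Try to send all of data; return how many bytes were actually queued."""
    sock.settimeout(timeout_s)  # each send() call may wait at most timeout_s
    view = memoryview(data)
    sent = 0
    try:
        while sent < len(data):
            sent += sock.send(view[sent:])
    except socket.timeout:
        pass  # the timeout expired with part of the data still unsent
    return sent


client, server = socket.socketpair()  # stand-in for the client/server link
payload = b"x" * (16 * 1024 * 1024)   # much larger than the socket buffers

# The peer never reads, so the buffers fill up and the next send() times out.
sent = send_with_timeout(client, payload, timeout_s=0.2)
is_partial = 0 < sent < len(payload)
print(sent, is_partial)

client.close()
server.close()
```

The caller only learns after the fact that the request went out incomplete. A longer timeout makes that outcome less likely on a stalled connection, but it does not rule it out.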
Re: [patch] Increase send timeout on a socket long enough in order to eliminate any timeouts on large sends

On Thu, 23 Jul 2009 12:00:25 -0500
Shirish Pargaonkar <shirishpargaonkar@...> wrote:

> The first step was to change the socket from non-blocking to blocking
> to prevent interleaved sends.
> A longer send timeout makes sure the send has enough duration to
> complete the send instead of returning prematurely.
>
> I cannot think of a way to abort a partially sent request to the server and
> I do not know whether it is possible to be sure that entire 56K buffer is
> available before dispatching a send on a (test induced) stressed socket.

I think we already discussed this several months ago and agreed that the right thing to do is to detect when a partial send has occurred and to reconnect the socket when it does. I can dig up the discussion again, but you probably remember it...

The question I have is -- why didn't that happen here? That should have prevented these interleaved sends...right?

Increasing the send timeout will have other effects too that you're not accounting for here. You're increasing the total send timeout from 15s to 90s (since Steve wanted to keep this loop in smb_sendv instead of just letting the socket layer handle it). That potentially changes the overall timeout for SMB calls.

I'm very leery of increasing the send timeout and hoping for the best. Since the consequences of getting this wrong are data corruption, we need a real fix or a detailed explanation of how this is guaranteed to prevent the problem in the future.

--
Jeff Layton <jlayton@...>
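The approach referred to above (detect a partial send, then force a reconnect so that no other request can be appended to the broken one) can be sketched as follows. This is not the smb_sendv code. `Connection`, `FlakyTransport`, `PartialSendError` and `worst_case_wait` are hypothetical names, and the retry count of three is an inference from the figures in the thread (15s and 90s totals, 30s proposed per send), not something the thread states.

```python
import threading


class PartialSendError(Exception):
    """A request was only partly written, or the connection is already poisoned."""


class Connection:
    """Hypothetical wrapper: transport.send(data) returns a count or raises TimeoutError."""

    def __init__(self, transport, max_attempts=3):
        self.transport = transport
        self.max_attempts = max_attempts    # assumed retry budget
        self.needs_reconnect = False
        self._send_lock = threading.Lock()  # one request on the wire at a time

    def send_request(self, data):
        with self._send_lock:
            if self.needs_reconnect:
                raise PartialSendError("reconnect before sending again")
            sent, attempts = 0, 0
            while sent < len(data) and attempts < self.max_attempts:
                try:
                    n = self.transport.send(data[sent:])
                except TimeoutError:
                    n = 0
                if n == 0:
                    attempts += 1  # no progress on this attempt
                sent += n
            if sent == len(data):
                return sent
            if sent > 0:
                # Part of the request is on the wire: poison the connection so
                # that no other sender can fill in the rest of this request.
                self.needs_reconnect = True
                raise PartialSendError("sent %d of %d bytes" % (sent, len(data)))
            raise TimeoutError("nothing was sent; the request can be retried")


class FlakyTransport:
    """Test double: accepts `capacity` bytes in total, then always times out."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.wire = b""

    def send(self, data):
        room = self.capacity - len(self.wire)
        if room <= 0:
            raise TimeoutError("send timed out")
        chunk = data[:room]
        self.wire += chunk
        return len(chunk)


def worst_case_wait(per_attempt_timeout_s, max_attempts):
    """Upper bound on time spent in the retry loop when no attempt makes progress."""
    return per_attempt_timeout_s * max_attempts


conn = Connection(FlakyTransport(capacity=100))
try:
    conn.send_request(b"A" * 150)
    outcome = "sent"
except PartialSendError:
    outcome = "partial"

try:
    conn.send_request(b"B" * 10)  # a second sender on the same connection
    second = "sent"
except PartialSendError:
    second = "refused"

print(outcome, second, worst_case_wait(5, 3), worst_case_wait(30, 3))
```

Once a request is known to be incomplete, the second sender is refused and none of its bytes reach the wire, whatever timeout value is in use. The last two figures show how a per-attempt timeout multiplies under the assumed three attempts.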