Question about big_writes

by Bradley W. Settlemyer

Hello

   I am developing a fuse file system that will need to leverage
reasonably large write buffers.  For example my clients tend to write
hundreds of MB in a single read or write call, and I need my fuse file
system to operate on write buffers of at least 128MB.

   I've mounted with -odirect_io,big_writes,max_write=128000000 flags,
but I still seem to only be receiving 128KB buffers for each write.

  For example, my fuse debug output:

write[1] 131072 bytes to 127401968 flags: 0x8002
    write[1] 131072 bytes to 127401968
    unique: 1123, success, outsize: 24
unique: 1124, opcode: WRITE (16), nodeid: 22, insize: 131152
write[1] 131072 bytes to 127533040 flags: 0x8002
    write[1] 131072 bytes to 127533040
    unique: 1124, success, outsize: 24
unique: 1125, opcode: WRITE (16), nodeid: 22, insize: 131152
write[1] 131072 bytes to 127664112 flags: 0x8002
    write[1] 131072 bytes to 127664112
    unique: 1125, success, outsize: 24

So it appears that fuse knows the client's write buffer is 128MB, but
fuse seems to be chunking up the write no matter what I do.  Any guidance?

Cheers,
Brad

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
fuse-devel mailing list
fuse-devel@...
https://lists.sourceforge.net/lists/listinfo/fuse-devel

Re: Question about big_writes

by Miklos Szeredi

On Thu, 05 Nov 2009, Bradley W. Settlemyer wrote:

> Hello
>
>    I am developing a fuse file system that will need to leverage
> reasonably large write buffers.  For example my clients tend to write
> hundreds of MB in a single read or write call, and I need my fuse file
> system to operate on write buffers of at least 128MB.
>
>    I've mounted with -odirect_io,big_writes,max_write=128000000 flags,
> but I still seem to only be receiving 128KB buffers for each write.
>
>   For example, my fuse debug output:
>
> write[1] 131072 bytes to 127401968 flags: 0x8002
>     write[1] 131072 bytes to 127401968
>     unique: 1123, success, outsize: 24
> unique: 1124, opcode: WRITE (16), nodeid: 22, insize: 131152
> write[1] 131072 bytes to 127533040 flags: 0x8002
>     write[1] 131072 bytes to 127533040
>     unique: 1124, success, outsize: 24
> unique: 1125, opcode: WRITE (16), nodeid: 22, insize: 131152
> write[1] 131072 bytes to 127664112 flags: 0x8002
>     write[1] 131072 bytes to 127664112
>     unique: 1125, success, outsize: 24
>
> So it appears that fuse knows the client's write buffer is 128MB, but
> fuse seems to be chunking up the write no matter what I do.  Any guidance?

Currently fuse doesn't support >128k writes "out of the box".  There
have been patches floating around that raise the limit, but they have
some side effects, so I'm not integrating these yet.

Why do you need these very large buffers?

Thanks,
Miklos


Re: Question about big_writes

by Bradley W. Settlemyer

Performance.  I want to write the data out in parallel -- a special
parallel file system implemented in fuse.  I could build my own
buffering, but then a 128MB write could require as much as 256MB of RAM.

   I can perhaps get by with 128K writes; it just doesn't work in my
prototype implementation.  I may still be able to get decent
performance with less parallelism.  I will see if I can engineer around
it.  Thanks for the response.

Cheers,
Brad


Re: Question about big_writes

by Goswin von Brederlow-2

"Bradley W. Settlemyer" <settlemyerbw@...> writes:

> Performance.  I want to write the data out in parallel -- a special
> parallel file system implemented in fuse.  I could build my own
> buffering, but then a 128MB write could require as much as 256MB of RAM.

Actually it would cost 128MB + <size of write (128k)> * <num of
parallel requests (10)>. That is 1% overhead.
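
A quick check of that estimate, as a sketch only. The 128 KiB request size and the count of 10 in-flight requests are the figures from the sentence above; the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def buffering_overhead(cache_bytes, request_bytes, parallel_requests):
    """Return (extra_bytes, extra_fraction) spent on in-flight request buffers.

    cache_bytes is the filesystem's own accumulation cache; the extra cost
    is one request-sized buffer per request that is in flight at once.
    """
    extra = request_bytes * parallel_requests
    return extra, extra / cache_bytes


extra, fraction = buffering_overhead(128 * MIB, 128 * KIB, 10)
print(f"in-flight buffers: {extra / MIB:.2f} MiB ({fraction:.2%} of the cache)")
```

With these numbers the in-flight buffers come to 1.25 MiB, a little under 1% of the 128 MiB cache.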

But it also costs time to copy the data between the fuse buffer and
your own cache. I'm currently trying (again, and better this time) to
add better buffer lifetime management to libfuse.

The way I have in mind (see below for options), the application would
set callbacks for allocating and freeing buffers. Libfuse would
allocate memory for a request and free it, via the callback, either
when the request's callback returns or when the request is replied to.
You could then ignore the free callback (or decrement a reference
count) and instead just link the request's buffer into your cache
structure and free it when the cache has been committed.

You would still need your own buffering, but without the cost of memcpy().


There is one thing I'm undecided about, though. Maybe you can weigh in
on which would be better.

Option A)

Add a pointer to the allocated buffer to the request structure. A
callback may overwrite the pointer with NULL. If the pointer is !=
NULL then libfuse will free the buffer when the callback returns.

Option B)

Add a pointer to the allocated buffer to the request structure. A
callback may overwrite the pointer with NULL. If the pointer is !=
NULL then libfuse will free the buffer when the request is replied to.

Option C)

Add a pointer to the allocated buffer to the request structure. Also
add an enum to the request { FREE_NOW, FREE_ON_REPLY, DONT_FREE
}. Depending on the enum, libfuse frees the buffer when the callback
returns, when the request is replied to, or never, respectively.

Option D)

Add alloc_buffer and free_buffer callbacks an application can set. The
application can then decide itself if free_buffer() should already
free the buffer. The request structure contains a pointer to the
buffer and free_buffer() is called when a request is replied to.


I currently favour option D, as that would also allow aligning the
buffer for use with libaio, or adding a private header and returning
to libfuse a pointer to just after the header. The header could contain
prev/next links, a reference count or information for a garbage
collector.
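
To make option D concrete, here is a minimal model of the idea in Python. It is not libfuse code and no such API exists yet: `BufferPool`, `Buffer`, `retain` and `simulate` are hypothetical names, and only `alloc_buffer`/`free_buffer` correspond to the callbacks proposed above. The "private header" is reduced to a reference count.

```python
class Buffer:
    """A payload plus a private header the library never looks at."""

    def __init__(self, size):
        self.refcount = 1            # header: the library's own reference
        self.payload = bytearray(size)


class BufferPool:
    """Hypothetical application-side allocator behind the two callbacks."""

    def __init__(self):
        self.freed = 0               # how many buffers were really released

    def alloc_buffer(self, size):
        # Called by the library when it needs memory for a request.
        return Buffer(size)

    def retain(self, buf):
        # Called by the filesystem when it links the buffer into its cache.
        buf.refcount += 1

    def free_buffer(self, buf):
        # Called by the library on reply, and by the cache after a flush.
        buf.refcount -= 1
        if buf.refcount == 0:
            buf.payload = None       # stand-in for actually freeing memory
            self.freed += 1


def simulate():
    """Run one write request through the pool; return freed counts."""
    pool, cache = BufferPool(), []
    buf = pool.alloc_buffer(128 * 1024)   # library allocates for a request
    pool.retain(buf)                      # write callback keeps the buffer
    cache.append(buf)
    pool.free_buffer(buf)                 # library: request replied to
    after_reply = pool.freed
    for cached in cache:                  # later: cache committed
        pool.free_buffer(cached)
    return after_reply, pool.freed


print(simulate())
```

The point of the sketch is that the library can call `free_buffer()` unconditionally on reply, while the application decides whether that call really releases the memory.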

MfG
        Goswin


Re: Question about big_writes

by Bradley W. Settlemyer

If I may ask, what is the problem that requires limiting the number of
pages to 32?

   I see the #define in fuse_i.h, and I'm tempted to just crank it up.
But it's not clear how that will affect system memory usage in terms of
long-lived buffers, etc.

   What would be the problem with making max pages a setting
configurable in /proc?
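
For reference, the 32-page limit lines up with the debug output earlier in the thread. The arithmetic below is a sketch under stated assumptions: a 4 KiB page size, and 40 bytes each for the request header and the write header, which are assumed sizes for the FUSE protocol structs and are not stated anywhere in this thread.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
MAX_PAGES_PER_REQ = 32    # the page limit asked about above
IN_HEADER_BYTES = 40      # assumed size of the generic request header
WRITE_IN_BYTES = 40       # assumed size of the write-specific header

# Largest write payload that fits in one request.
max_payload = PAGE_SIZE * MAX_PAGES_PER_REQ

# Total request size as it would show up as "insize" in the debug log.
insize = max_payload + IN_HEADER_BYTES + WRITE_IN_BYTES

# Write offsets copied from the debug output in the first message.
offsets = [127401968, 127533040, 127664112]
strides = [b - a for a, b in zip(offsets, offsets[1:])]

print(max_payload, insize, strides)
```

Under those assumptions the payload is 131072 bytes, the request size matches the logged insize of 131152, and consecutive write offsets advance by exactly one payload.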

Cheers,
Brad

OTOH, no one else on my target system uses FUSE, so


Re: Question about big_writes

by Bradley W. Settlemyer

On 11/06/2009 11:40 AM, Goswin von Brederlow wrote:

> "Bradley W. Settlemyer" <settlemyerbw@...> writes:
>
>> Performance.  I want to write the data out in parallel -- a special
>> parallel file system implemented in fuse.  I could build my own
>> buffering, but then a 128MB write could require as much as 256MB of RAM.
>
> Actually it would cost 128MB + <size of write (128k)> * <num of
> parallel requests (10)>. That is 1% overhead.
>
> But it also costs time to copy the data between the fuse buffer and
> your own cache. I'm currently (again, better this time) trying to add
> better buffer lifetime to libfuse.
>

Hmm, what am I missing?  Say I have 8 threads.  And each wants to
operate on 16MB.  I have to accumulate 128MB of data before any thread
will begin releasing data.  The client still has in his buffer 128MB
pending on the write call.  So that comes to twice the memory cost
(minus, perhaps, the last 128K which may be shared if I use a writev
type technique to send the data).  Is fuse able to optimize away this
cost somehow?

Now, I'm not afraid to get a bit complicated.  What I could do is
accumulate the pointers to the buffers, assuming that direct_io gives me
the same pointer as exists in the client's userspace, and then use a
writev call to push a large amount of data across the network.  The
problem is that I can't tell the difference between 128MB writes and
128K writes, and I will violate POSIX semantics on the latter to speed
up performance on the former.  I'm also not clear on when I need to copy
the buffers fuse gives me and when I can get away with just continuing
to use the buffer.

Not a major deal to me, but a headache for my users that run 3rd party
code.  If they actually need to do a small write, I would like to offer
them the correct semantic if possible.

Note, we have prototyped our threaded write technique and we need the
larger buffers to truly achieve the performance we desire.  But I
suppose we will take what we can get.

Cheers,
Brad


Re: Question about big_writes

by Goswin von Brederlow-2

"Bradley W. Settlemyer" <settlemyerbw@...> writes:

> On 11/06/2009 11:40 AM, Goswin von Brederlow wrote:
>> "Bradley W. Settlemyer" <settlemyerbw@...> writes:
>>
>>> Performance.  I want to write the data out in parallel -- a special
>>> parallel file system implemented in fuse.  I could build my own
>>> buffering, but then a 128MB write could require as much as 256MB of RAM.
>>
>> Actually it would cost 128MB + <size of write (128k)> * <num of
>> parallel requests (10)>. That is 1% overhead.
>>
>> But it also costs time to copy the data between the fuse buffer and
>> your own cache. I'm currently (again, better this time) trying to add
>> better buffer lifetime to libfuse.
>>
>
> Hmm, what am I missing?  Say I have 8 threads.  And each wants to
> operate on 16MB.  I have to accumulate 128MB of data before any thread
> will begin releasing data.  The client still has in his buffer 128MB
> pending on the write call.  So that comes to twice the memory cost
> (minus, perhaps, the last 128K which may be shared if I use a writev
> type technique to send the data).  Is fuse able to optimize away this
> cost somehow?

No matter how big or small the fuse writes are, your client will have a
128MB buffer and write 128MB in one chunk. The kernel then splits this
up into 4k chunks and passes them around a few times until they
eventually end up in fuse, which gathers them into 128k chunks and sends
those to /dev/fuse. In your filesystem you can act on 128k chunks and
waste no memory, or you can cache chunks until you have 128MB and then
flush them to the net.

So you have 128MB client, 128MB filesystem (+any pending requests).

Now what happens if you had 128MB chunks in fuse?

The client still has 128MB memory that it writes. No change there. The
128MB are on the fly in the kernel. Again no change. The request
swells up to 128MB so libfuse allocates that much for the buffer. But
your filesystem doesn't have to cache the chunk so it saves 128MB
instead.

Overall the change is about 0.


Well, actually not. Since, if I understand the inner loop right,
libfuse allocates one buffer per thread, you would use much more
memory: <max number of filesystem operations started in parallel ever>
* 128MB. Your application runs: 1-2 threads = 128-256MB. You type df
in another shell: another 128MB. updatedb runs in the background:
another 128MB. You would easily waste a lot of memory on minuscule
requests.
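
The scenario above can be put into numbers with a small sketch. It assumes, as the paragraph does, one maximum-sized buffer per worker thread; the function name and the thread counts in the table are illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB


def worker_buffer_memory(max_parallel_ops, max_write_bytes):
    """Memory pinned by request buffers if every worker thread holds one
    buffer sized for the largest possible write."""
    return max_parallel_ops * max_write_bytes


# Peak number of operations in flight in each situation described above.
scenarios = [
    ("application, 2 threads", 2),
    ("... plus df in another shell", 3),
    ("... plus updatedb in the background", 4),
]

for label, threads in scenarios:
    big = worker_buffer_memory(threads, 128 * MIB) // MIB
    small = worker_buffer_memory(threads, 128 * KIB) // KIB
    print(f"{label}: {big} MiB with 128MB buffers, {small} KiB with 128k buffers")
```

The buffers are sized for the largest write, so a df or updatedb request that carries almost no data still pins a full 128MB each.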

> Now, I'm not afraid to get a bit complicated, what I could do is
> accumulate the pointers to the buffers assuming that direct_io gives
> me the same pointer as exists in the client's userspace, and then use
> a writev call to push a large amount of data across the network.  The
> problem is I can't tell the difference between 128MB writes and 128K
> writes, and I will violate POSIX semantics on the latter to speed the
> performance on the former.  I'm also not clear on when I need to copy
> the buffers fuse gives me and when I can get away with just continuing
> to use the buffer.

You assume fuse doesn't (need to) copy the buffer. The kernel doesn't
just pass the pointer around. That is a goal but not current reality.

As for POSIX semantics, I don't see where you need to violate anything.
For full POSIX compliance the hard part will be distributed locking of
the files/byte ranges you read/write. Whether you do that for 128KB or
128MB doesn't really matter. That is, if you even allow opening a file
on multiple hosts in parallel.

Now, the tricky part is the buffers. The pointer you get in the write
callback is only valid until that callback returns. Therefore to cache
data beyond the callback you need to copy the data. That is the part
I'm trying to fix. I want to move the pointer into my cache and tell
fuse to not free it. I will do that when I flush the data to disk. But
it might be a while till I have a good patch for that that the fuse
authors will accept.
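
The lifetime rule can be illustrated with a small stand-in, not real libfuse code. `RequestLoop` is a hypothetical dispatcher that recycles one buffer for every request, which is an assumption made here to show the effect; the callback that copies keeps its data, the one that merely keeps the pointer does not.

```python
class RequestLoop:
    """Stand-in for a library that reuses one buffer for every request."""

    def __init__(self, size):
        self._buf = bytearray(size)

    def dispatch(self, data, callback):
        n = len(data)
        self._buf[:n] = data                  # request arrives in the buffer
        callback(memoryview(self._buf)[:n])   # valid only during the call
        self._buf[:n] = bytes(n)              # buffer is recycled afterwards


cache_copied = []     # safe: owns its own copy of the data
cache_borrowed = []   # unsafe: still points into the library's buffer


def write_callback(view):
    cache_copied.append(bytes(view))   # the memcpy this thread wants to avoid
    cache_borrowed.append(view)        # keeps the pointer past the callback


loop = RequestLoop(16)
loop.dispatch(b"first chunk", write_callback)
loop.dispatch(b"second", write_callback)

print(cache_copied)
print([bytes(v) for v in cache_borrowed])
```

After both requests the copied cache still holds the two chunks, while the borrowed views show whatever the dispatcher left in its buffer.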

> Not a major deal to me, but a headache for my users that run 3rd party
> code.  If they actually need to do a small write, I would like to
> offer them the correct semantic if possible.

Lock a small chunk on the first write, extend to 128MB when the second
chunk comes in, flush data and free the lock if chunks stop coming in
/ too many chunks for other 128MB blocks come in.

Unless you have 3rd party software where each client opens the same
file and writes 1MB chunks at different positions you should be fine.
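
A rough sketch of the accumulate-and-flush half of that scheme follows. It leaves out the distributed lock entirely, and the class and method names are made up. A timer for "chunks stop coming in" is also omitted, so the caller has to invoke `flush_pending()` itself.

```python
class WriteCoalescer:
    """Gather contiguous chunks into one block and flush it as a unit."""

    def __init__(self, block_size, flush):
        self.block_size = block_size
        self.flush = flush        # callable(offset, data) that sends a block
        self.start = None         # file offset of the pending block
        self.chunks = []
        self.length = 0

    def write(self, offset, data):
        contiguous = (self.start is not None
                      and offset == self.start + self.length)
        if self.start is not None and not contiguous:
            self.flush_pending()  # a chunk for another block came in
        if self.start is None:
            self.start = offset   # first chunk: the small lock would go here
        self.chunks.append(bytes(data))
        self.length += len(data)
        if self.length >= self.block_size:
            self.flush_pending()  # block is full

    def flush_pending(self):
        if self.start is not None:
            self.flush(self.start, b"".join(self.chunks))
        self.start, self.chunks, self.length = None, [], 0


flushed = []
c = WriteCoalescer(block_size=8, flush=lambda off, d: flushed.append((off, d)))
c.write(0, b"aaaa")
c.write(4, b"bbbb")      # contiguous and fills the block: flushed together
c.write(100, b"cc")      # small isolated write
c.write(200, b"dd")      # non-contiguous: forces the previous one out
c.flush_pending()        # stands in for "chunks stopped coming in"
print(flushed)
```

Small isolated writes pass through as their own flush, so only runs of contiguous chunks are merged.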

> Note, we have prototyped our threaded write technique and we need the
> larger buffers to truly achieve the performance we desire.  But we
> will take what we can get I suppose also.

I would be surprised if 128MB chunks in fuse would give you the best
speed. Try some of the patches for bigger write size and see how they
scale. I think people have used anything from 1-16MB in the past.

> Cheers,
> Brad

Just a thought. Have you tried glusterfs?

MfG
        Goswin
