file:read with read_ahead and binaries broken

    dd if=/dev/urandom of=/tmp/file.rnd bs=1M count=20

    test(Hdl) ->
        test(Hdl, []).

    test(Hdl, Acc) ->
        case file:read(Hdl, 1) of
            {ok, <<Num:1/binary>>} ->
                {ok, _Pos} = file:position(Hdl, {cur, 1}),
                test(Hdl, [Num|Acc]);
            eof ->
                Acc
        end.

    1> f(), {ok, Hdl} = file:open("/tmp/file.rnd", [read, read_ahead, binary, raw]),
       X = test:test(Hdl), ok = file:close(Hdl).

Erlang will die. Badly. erlang:memory() shows that of the 4GB erlang has claimed before I kill it, 3.9GB of that is binary data.

Ways to stop this going nuts:
1) Don't use read_ahead
2) Remove the position call - instead, read 2 bytes and skip the second
3) Add any random term, say 'foo' to the Acc, rather than Num.
4) Have Num as an int, not a binary.
5) Do the following:

    {ok, <<Num:8>>} ->
        {ok, _Pos} = file:position(Hdl, {cur, 1}),
        <<Num2:1/binary>> = <<Num:8>>,
        test(Hdl, [Num2|Acc]);

My guess is that what's happening is that the read is reading in a whole disk page (as it should), Num is a pointer into the start of that page, but the rest of the page beyond the first byte, isn't reclaimed. Then the position seemingly invalidates the entire page. This is confirmed by the fact that strace -f -c -p $PID shows the same number of calls to read in both the read_ahead and non read_ahead versions. Interestingly though, there are twice as many calls to lseek in the read_ahead version.

From inspecting the size of the file itself, both the read_ahead and non versions are really issuing a read for every single byte read, and the read_ahead version also has the advantage of issuing twice as many seeks.

A quick test shows this happens at least as far back as R12B5, and still happens in R13B02.

Oh and if you follow suggestion (5), you'll find the read_ahead version is about 8 times slower than the non read_ahead version.

Matthew

________________________________________________________________
erlang-bugs mailing list. See http://www.erlang.org/faq.html
erlang-bugs (at) erlang.org
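Workaround (2) from the list above can be sketched as a small self-contained module. This is only an illustration: the module name skip_read and the function every_other/1 are made up here, and the sketch assumes that dropping the file:position/2 call is enough to keep the read_ahead buffer from being replaced on every iteration, as the post suggests. Error tuples from file:read/2 are deliberately not handled.

```erlang
-module(skip_read).
-export([every_other/1]).

%% Hypothetical helper: collect every other byte of File as 1-byte
%% binaries, in file order.
every_other(File) ->
    {ok, Hdl} = file:open(File, [read, read_ahead, binary, raw]),
    try
        lists:reverse(loop(Hdl, []))
    after
        file:close(Hdl)
    end.

loop(Hdl, Acc) ->
    %% Read two bytes and discard the second instead of calling
    %% file:position/2 to skip it.
    case file:read(Hdl, 2) of
        {ok, <<Byte:1/binary, _/binary>>} ->
            %% Also matches a final 1-byte read on odd-sized files.
            loop(Hdl, [Byte | Acc]);
        eof ->
            Acc
    end.
```

Unlike the original test/2, this returns the bytes in file order rather than reversed; the reversal is a convenience added for the sketch and has no bearing on the memory behaviour.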

Re: file:read with read_ahead and binaries broken

Hi Matthew,

We are aware of this issue and a more aggressive gc-strategy is being developed. This will be in place in the next release unless something unforeseen happens.

The new strategy involves virtual heaps for binaries that will also trigger gc:s when binary heap boundaries are reached instead of only procbins and binary overhead counting triggers.

The new strategy will also take care of past old heap binary problems.

Regards,
Björn-Egil
Erlang/OTP

Matthew Sackman wrote:
> dd if=/dev/urandom of=/tmp/file.rnd bs=1M count=20
>
> test(Hdl) ->
>     test(Hdl, []).
>
> test(Hdl, Acc) ->
>     case file:read(Hdl, 1) of
>         {ok, <<Num:1/binary>>} -> {ok, _Pos} = file:position(Hdl, {cur, 1}),
>                                   test(Hdl, [Num|Acc]);
>         eof -> Acc
>     end.
>
> 1> f(), {ok, Hdl} = file:open("/tmp/file.rnd", [read, read_ahead, binary, raw]),
>    X = test:test(Hdl), ok = file:close(Hdl).
>
> Erlang will die. Badly. erlang:memory() shows that of the 4GB erlang
> has claimed before I kill it, 3.9GB of that is binary data.
>
> Ways to stop this going nuts:
> 1) Don't use read_ahead
> 2) Remove the position call - instead, read 2 bytes and skip the second
> 3) Add any random term, say 'foo' to the Acc, rather than Num.
> 4) Have Num as an int, not a binary.
> 5) Do the following:
>     {ok, <<Num:8>>} -> {ok, _Pos} = file:position(Hdl, {cur, 1}),
>                        <<Num2:1/binary>> = <<Num:8>>,
>                        test(Hdl, [Num2|Acc]);
>
> My guess is that what's happening is that the read is reading in a whole
> disk page (as it should), Num is a pointer into the start of that page,
> but the rest of the page beyond the first byte, isn't reclaimed. Then the
> position seemingly invalidates the entire page. This is confirmed by the
> fact that strace -f -c -p $PID shows the same number of calls to read in
> both the read_ahead and non read_ahead versions. Interestingly though,
> there are twice as many calls to lseek in the read_ahead version.
>
> From inspecting the size of the file itself, both the read_ahead and non
> versions are really issuing a read for every single byte read, and the
> read_ahead version also has the advantage of issuing twice as many
> seeks.
>
> A quick test shows this happens at least as far back as R12B5, and still
> happens in R13B02.
>
> Oh and if you follow suggestion (5), you'll find the read_ahead version
> is about 8 times slower than the non read_ahead version.
>
> Matthew

Re: file:read with read_ahead and binaries broken

Hi Björn-Egil,

Thanks for the reply, and good to know a solution is in the pipeline. However, your solution is only addressing one issue. The other issue is why is a read issued when the position call does not move the file handle outside of the region currently cached by the read ahead buffer? In truth, both the seek and read libc calls can be avoided, or at the least, the position can be delayed until some other non-(position or read) call, e.g. truncate or write.

Matthew
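The behaviour asked for here can be modelled in a few lines. The sketch below is a hypothetical toy model, not the efile driver's actual code: the module name ra_model, the {FileOff, Buf, Pos} tuple and the cached/flush tags are all invented for illustration. It only shows the decision being proposed, namely that a relative seek landing inside the cached region needs neither an lseek nor a fresh read.

```erlang
-module(ra_model).
-export([new/2, position_cur/2]).

%% Hypothetical model of a read-ahead cache:
%%   FileOff - file offset of the first byte held in Buf
%%   Buf     - the cached binary
%%   Pos     - current absolute file position
new(FileOff, Buf) when is_integer(FileOff), is_binary(Buf) ->
    {FileOff, Buf, FileOff}.

%% Move Delta bytes relative to the current position. If the target is
%% still inside the cached region, only the position changes and the
%% buffer is kept; otherwise the caller would have to flush and seek.
position_cur({FileOff, Buf, Pos}, Delta) ->
    NewPos = Pos + Delta,
    case NewPos >= FileOff andalso NewPos =< FileOff + byte_size(Buf) of
        true  -> {cached, {FileOff, Buf, NewPos}};
        false -> {flush, NewPos}
    end.
```

In this model, skipping one byte after a one-byte read always stays in the cached branch until the end of the buffer is reached, which is the case the test program exercises.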

Re: file:read with read_ahead and binaries broken

Yes, I did hit send a bit prematurely.

The solution I was talking about does not solve this particular problem.

What's happening here is that the driver is keeping a read_ahead buffer which is a binary of size 64 kB (if I remember the default cache size correctly). Each read will generate a subbinary of the read_ahead buffer which is kept reachable in the process by pushing the subbinary to a list in the read-loop. Each file:position will flush the read_ahead cache and a new binary will be made to take its place. *repeat until eof*

Each subbinary will reference the binary and force the gc to keep those binaries since they are all live data. In this example the total memory consumption would be roughly ~20M x 64K bytes / 2 ~ 640 GB which is not the intention by the programmer I guess. =)

The main problem here is that each subbinary is kept. It is aggravated by producing a new binary cache for each read.

This is of course easily remedied by matching numbers instead of binaries. In this case using <<N:8>> instead of <<N:1/binary>>. Also instead of seeks one could read 2 bytes instead of one. Or, as you said, skip read_ahead since it won't give any boost because of the seeks. I realize that this is not the intent of the test though.

Is this a bug in the handling of binaries? No, but perhaps a limitation and not the "least astonishing result". Users must be aware of the fact that a subbinary will keep alive the whole binary it is referencing. And keeping the subbinaries reachable will keep those binaries from being gc:ed. In this case the user must also be aware of the fact that he is receiving subbinaries from the reads. I think that this could be clearer in the documentation.

One could argue that seeks should not always flush the cache. I fully agree with you that this should be avoided. This is something we will review.

One could also argue that subbinaries should be compacted. This is not wise for the most common cases. It would kill performance and actually bloat memory. A user can do this by himself by forcing a copy of the subbinary. This will generate a new separate smaller binary.

Some sort of smart automatic compacting of binaries could be done in the gc but it is not easily implemented for a number of reasons. Several strategies for compacting are on the table but it won't be realized until R14 at the earliest.

I hope you find this information helpful.

*hitting send*

Regards,
Björn-Egil
Erlang/OTP

Matthew Sackman wrote:
> Hi Björn-Egil,
>
> Thanks for the reply, and good to know a solution is in the pipeline.
> However, you're solution is only addressing one issue. The other issue
> is why is a read issued when the position call does not move the file
> handle outside of the region currently cached by the read ahead buffer?
> In truth, both the seek and read libc calls can be avoided, or at the
> least, the position can be delayed until some other non-(position or
> read) call - eg truncate or write.
>
> Matthew
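The two points in this reply, the memory estimate and forcing a copy of a subbinary, can be illustrated with a short sketch. It assumes a release that ships the binary module with binary:copy/1,2 and binary:referenced_byte_size/1; as far as I know that module is not available in the R13B02 release discussed in the thread, where a rebuild such as workaround (5) would be needed instead. The module name subbin_demo and both function names are made up for this example.

```erlang
-module(subbin_demo).
-export([retained_bytes/2, copy_effect/0]).

%% Estimate from the reply: every second byte of the file is read, and
%% each 1-byte subbinary keeps one whole read_ahead buffer alive.
retained_bytes(FileSize, BufSize) ->
    (FileSize div 2) * BufSize.

%% Compare how much memory a subbinary and a forced copy of it keep
%% referenced. Big stands in for a 64 kB read_ahead buffer.
copy_effect() ->
    Big = binary:copy(<<0>>, 64 * 1024),
    <<Sub:1/binary, _/binary>> = Big,   % subbinary pointing into Big
    Copy = binary:copy(Sub),            % new, separate 1-byte binary
    {binary:referenced_byte_size(Sub),
     binary:referenced_byte_size(Copy)}.
```

For the 20 MB test file and a 64 kB buffer, retained_bytes(20 * 1024 * 1024, 64 * 1024) evaluates to 640 * 1024 * 1024 * 1024 bytes, which matches the rough 640 GB figure above. Copying costs an allocation per element, so it only pays off when the small pieces outlive the large binary they were cut from.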