|
View:
New views
8 Messages
—
Rating Filter:
Alert me
|
|
|
potential compiler improvement?I've been looking a lot of BEAM code lately, so I can better understand how
the VM works. I have generally been trying to avoid looking at generated code quality too much, because I can easily get obsessed with it (see http://prog21.dadgum.com/50.html). But I have noticed one general thing which might be worth investigating in the future. In a pattern match, all referenced values in tuples are loaded into BEAM registers. This happens right up front as part of the matching process. And it happens even if the values aren't used until some time later, and even if they're not used at all in a particular branch of code. Here's an example: test({A, B, C, D, E, F, G}) -> case A of this -> B + C + D; that -> E + F + G end. In this case, all seven tuple elements, A-G, are loaded into registers before the "case" is executed. This is even though three values are unneeded in each of the two possible branches. I'm no expert on how the compiler works, so I'm not going to hazard a guess at the difficulty of this. James |
|
|
Re: potential compiler improvement?2009/10/26, James Hague <james.hague@...>:
[...] > In a pattern match, all referenced values in tuples are loaded into BEAM > registers. This happens right up front as part of the matching process. And > it happens even if the values aren't used until some time later, and even if > they're not used at all in a particular branch of code. Here's an example: > > test({A, B, C, D, E, F, G}) -> > case A of > this -> B + C + D; > that -> E + F + G > end. > > In this case, all seven tuple elements, A-G, are loaded into registers > before the "case" is executed. This is even though three values are unneeded > in each of the two possible branches. > > I'm no expert on how the compiler works, so I'm not going to hazard a guess > at the difficulty of this. I don't know much about the compiler, but I think it hasn't got the slightest idea of A being 'this', 'that' or something completely different, so it can't really optimize. Actually if A is not 'this' or 'that', you'll get a crash report which includes all parameters, so it is possible that all seven variables are used. ________________________________________________________________ erlang-questions mailing list. See http://www.erlang.org/faq.html erlang-questions (at) erlang.org |
|
|
Re: potential compiler improvement?> > test({A, B, C, D, E, F, G}) ->
> > case A of > > this -> B + C + D; > > that -> E + F + G > > end. > > I don't know much about the compiler, but I think it hasn't got the > slightest idea of A being 'this', 'that' or something completely > different, so it can't really optimize. You don't need to know the value of A ahead of time. You simply don't unpack B, C, D until you're in the 'this' branch (and once you've gone down that path then you know you don't need to bother unpacking E, F, G). ________________________________________________________________ erlang-questions mailing list. See http://www.erlang.org/faq.html erlang-questions (at) erlang.org |
|
|
Re: potential compiler improvement?2009/10/26 James Hague <james.hague@...>
> > In a pattern match, all referenced values in tuples are loaded into BEAM > registers. This happens right up front as part of the matching process. And > it happens even if the values aren't used until some time later, and even > if > they're not used at all in a particular branch of code. Here's an example: > > test({A, B, C, D, E, F, G}) -> > case A of > this -> B + C + D; > that -> E + F + G > end. > > In this case, all seven tuple elements, A-G, are loaded into registers > before the "case" is executed. This is even though three values are > unneeded > in each of the two possible branches. > Now the compiler just looks at which elements from the tuple may accessed and only extracts those. So if you only used A, B and C it would only extract the first 3 elements. It could save the whole argument and only extract the arguments when they are used but this would entail saving the whole tuple for the future as long as it may be used. In the general case it could maybe create more instructions, or maybe less, depending on what follows. The analysis would be more difficult, though probably not excessively so. Now the analysis and code generation is easier. It could also delay freeing the tuple which would mean more work the garbage collector. I have absolutely no idea of the results of these trade-offs. Robert |
|
|
Re: potential compiler improvement?----- Original Message ---- > In this case, all seven tuple elements, A-G, are loaded into registers > before the "case" is executed. This is even though three values are unneeded > in each of the two possible branches. > > I'm no expert on how the compiler works, so I'm not going to hazard a guess > at the difficulty of this. A long time ago, I experimented with that in the old Ultrasparc Hipe compiler. The idea then was to sink loads to the places where it was used in all successor branches (or the same basic block), in order to avoid redundant loads. The dataflow analysis seemed pretty straightforward, though I didn't prove it correct or anything. The net effect, alas, was a slowdown due to increased load stalls, since the loads moved closer to their uses :-) Anyway, for BEAM on a modern architecture the net outcome may of course be quite different. Best, Thomas ________________________________________________________________ erlang-questions mailing list. See http://www.erlang.org/faq.html erlang-questions (at) erlang.org |
|
|
Re: potential compiler improvement?I suspect unpacking 7 or 5 arguments takes more or less the same time.
This involves accessing *p and *(p+1) etc. (in C) My understanding of things was once you had accessed *p accessing *(p+1) was free. Accessing *p pulls the data you need into the cache (which is potentially expensive) once in the cache accessing *(p+1) is free. What matters is cache misses. Aligning the start of the tuple on a cache line boundary should give a great win, also making sure that the entire tuple will fit in cache - this is very processor specific, so you'd need a JIT and deep magic to do this. Sticking stuff in different parts of memory and making sure that code does not span page boundaries will also help - don't have a tight loop spanning a page boundary. Erik Hagersten used to optimize C by physically moving C subroutines in a file - shuffle - recompile and measure. And the programs went faster ( a bit) we thought he was nuts, now he has a company that measures this :-) - It's the cache. Tricky stuff /Joe On Mon, Oct 26, 2009 at 10:02 PM, Robert Virding <rvirding@...> wrote: > 2009/10/26 James Hague <james.hague@...> > >> >> In a pattern match, all referenced values in tuples are loaded into BEAM >> registers. This happens right up front as part of the matching process. And >> it happens even if the values aren't used until some time later, and even >> if >> they're not used at all in a particular branch of code. Here's an example: >> >> test({A, B, C, D, E, F, G}) -> >> case A of >> this -> B + C + D; >> that -> E + F + G >> end. >> >> In this case, all seven tuple elements, A-G, are loaded into registers >> before the "case" is executed. This is even though three values are >> unneeded >> in each of the two possible branches. >> > > Now the compiler just looks at which elements from the tuple may accessed > and only extracts those. So if you only used A, B and C it would only > extract the first 3 elements. It could save the whole argument and only > extract the arguments when they are used but this would entail saving the > whole tuple for the future as long as it may be used. In the general case it > could maybe create more instructions, or maybe less, depending on what > follows. The analysis would be more difficult, though probably not > excessively so. Now the analysis and code generation is easier. It could > also delay freeing the tuple which would mean more work the garbage > collector. > > I have absolutely no idea of the results of these trade-offs. > > Robert > ________________________________________________________________ erlang-questions mailing list. See http://www.erlang.org/faq.html erlang-questions (at) erlang.org |
|
|
Re: potential compiler improvement?> I suspect unpacking 7 or 5 arguments takes more or less the same time.
First off, good discussion! In C--and HiPE--perhaps the times are the same, but I was thinking about threaded BEAM code where unpacking the additional three values involves three more BEAM instructions (assuming those aren't collapsed into one super-instruction at load time). Also, in lots of functions a significant portion of the instructions are moves from temporary registers to parameter registers. (Typically there's a string of these right before a function call, one per parameter.) So it is *possible* that delaying the loads of tuple elements until right before needed may reduce the amount of register shuffling going on. Maybe. I completely agree with Joe that this is tricky stuff! James ________________________________________________________________ erlang-questions mailing list. See http://www.erlang.org/faq.html erlang-questions (at) erlang.org |
|
|
Re: potential compiler improvement?On Tue, Oct 27, 2009 at 5:58 PM, James Hague <james.hague@...> wrote:
> In C--and HiPE--perhaps the times are the same, but I was thinking > about threaded BEAM code where unpacking the additional three values > involves three more BEAM instructions (assuming those aren't collapsed > into one super-instruction at load time). They *are* collapsed into super-instructions. Not one, but three for your test program. You can see it if you do erts_debug:df(Mod) (The disassemlbly willl appear in Mod.dis) /Bjorn -- Björn Gustavsson, Erlang/OTP, Ericsson AB ________________________________________________________________ erlang-questions mailing list. See http://www.erlang.org/faq.html erlang-questions (at) erlang.org |
| Free embeddable forum powered by Nabble | Forum Help |