potential compiler improvement?

View: New views
8 Messages — Rating Filter:   Alert me  

potential compiler improvement?

by James Hague :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I've been looking a lot of BEAM code lately, so I can better understand how
the VM works. I have generally been trying to avoid looking at generated
code quality too much, because I can easily get obsessed with it (see
http://prog21.dadgum.com/50.html). But I have noticed one general thing
which might be worth investigating in the future.

In a pattern match, all referenced values in tuples are loaded into BEAM
registers. This happens right up front as part of the matching process. And
it happens even if the values aren't used until some time later, and even if
they're not used at all in a particular branch of code.  Here's an example:

test({A, B, C, D, E, F, G}) ->
   case A of
      this -> B + C + D;
      that -> E + F + G
   end.

In this case, all seven tuple elements, A-G, are loaded into registers
before the "case" is executed. This is even though three values are unneeded
in each of the two possible branches.

I'm no expert on how the compiler works, so I'm not going to hazard a guess
at the difficulty of this.

James

Re: potential compiler improvement?

by Attila Rajmund Nohl :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/10/26, James Hague <james.hague@...>:
[...]

> In a pattern match, all referenced values in tuples are loaded into BEAM
> registers. This happens right up front as part of the matching process. And
> it happens even if the values aren't used until some time later, and even if
> they're not used at all in a particular branch of code.  Here's an example:
>
> test({A, B, C, D, E, F, G}) ->
>    case A of
>       this -> B + C + D;
>       that -> E + F + G
>    end.
>
> In this case, all seven tuple elements, A-G, are loaded into registers
> before the "case" is executed. This is even though three values are unneeded
> in each of the two possible branches.
>
> I'm no expert on how the compiler works, so I'm not going to hazard a guess
> at the difficulty of this.

I don't know much about the compiler, but I think it hasn't got the
slightest idea of A being 'this', 'that' or something completely
different, so it can't really optimize. Actually if A is not 'this' or
'that', you'll get a crash report which includes all parameters, so it
is possible that all seven variables are used.

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org


Re: potential compiler improvement?

by James Hague :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> > test({A, B, C, D, E, F, G}) ->
> >    case A of
> >       this -> B + C + D;
> >       that -> E + F + G
> >    end.
>
> I don't know much about the compiler, but I think it hasn't got the
> slightest idea of A being 'this', 'that' or something completely
> different, so it can't really optimize.

You don't need to know the value of A ahead of time. You simply don't
unpack B, C, D until you're in the 'this' branch (and once you've gone
down that path then you know you don't need to bother unpacking E, F,
G).

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org


Re: potential compiler improvement?

by Robert Virding :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/10/26 James Hague <james.hague@...>

>
> In a pattern match, all referenced values in tuples are loaded into BEAM
> registers. This happens right up front as part of the matching process. And
> it happens even if the values aren't used until some time later, and even
> if
> they're not used at all in a particular branch of code.  Here's an example:
>
> test({A, B, C, D, E, F, G}) ->
>   case A of
>      this -> B + C + D;
>      that -> E + F + G
>   end.
>
> In this case, all seven tuple elements, A-G, are loaded into registers
> before the "case" is executed. This is even though three values are
> unneeded
> in each of the two possible branches.
>

Now the compiler just looks at which elements from the tuple may accessed
and only extracts those. So if you only used A, B and C it would only
extract the first 3 elements. It could save the whole argument and only
extract the arguments when they are used but this would entail saving the
whole tuple for the future as long as it may be used. In the general case it
could maybe create more instructions, or maybe less, depending on what
follows. The analysis would be more difficult, though probably not
excessively so. Now the analysis and code generation is easier. It could
also delay freeing the tuple which would mean more work the garbage
collector.

I have absolutely no idea of the results of these trade-offs.

Robert

Re: potential compiler improvement?

by Thomas Lindgren :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message





----- Original Message ----
> In this case, all seven tuple elements, A-G, are loaded into registers
> before the "case" is executed. This is even though three values are unneeded
> in each of the two possible branches.
>
> I'm no expert on how the compiler works, so I'm not going to hazard a guess
> at the difficulty of this.

A long time ago, I experimented with that in the old Ultrasparc Hipe compiler.
The idea then was to sink loads to the places where it was used in all successor branches (or the same
basic block), in order to avoid redundant loads. The dataflow analysis seemed pretty straightforward, though
I didn't prove it correct or anything.

The net effect, alas, was a slowdown due to increased load stalls, since the loads moved closer to their uses :-)

Anyway, for BEAM on a modern architecture the net outcome may of course be quite different.

Best,
Thomas



     

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org


Re: potential compiler improvement?

by Joe Armstrong-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I suspect unpacking 7 or 5 arguments takes more or less the same time.

This involves accessing *p and *(p+1) etc. (in C) My understanding of
things was once you
had accessed *p accessing *(p+1) was free. Accessing *p pulls the data you need
into the cache (which is potentially expensive) once in the cache
accessing *(p+1) is
free. What matters is cache misses. Aligning the start of the tuple on
a cache line
boundary should give a great win, also making sure that the entire
tuple will fit
in cache - this is very processor specific, so you'd need a JIT and
deep magic to do this.

Sticking stuff in different parts of memory and making sure that code
does not span
page boundaries will also help - don't have a tight loop spanning a
page boundary.

Erik Hagersten used to optimize C by physically moving C subroutines in a file
- shuffle - recompile and measure. And the programs went faster ( a
bit) we thought he was nuts, now he has a company that measures this
:-) - It's the cache.

Tricky stuff

/Joe

On Mon, Oct 26, 2009 at 10:02 PM, Robert Virding <rvirding@...> wrote:

> 2009/10/26 James Hague <james.hague@...>
>
>>
>> In a pattern match, all referenced values in tuples are loaded into BEAM
>> registers. This happens right up front as part of the matching process. And
>> it happens even if the values aren't used until some time later, and even
>> if
>> they're not used at all in a particular branch of code.  Here's an example:
>>
>> test({A, B, C, D, E, F, G}) ->
>>   case A of
>>      this -> B + C + D;
>>      that -> E + F + G
>>   end.
>>
>> In this case, all seven tuple elements, A-G, are loaded into registers
>> before the "case" is executed. This is even though three values are
>> unneeded
>> in each of the two possible branches.
>>
>
> Now the compiler just looks at which elements from the tuple may accessed
> and only extracts those. So if you only used A, B and C it would only
> extract the first 3 elements. It could save the whole argument and only
> extract the arguments when they are used but this would entail saving the
> whole tuple for the future as long as it may be used. In the general case it
> could maybe create more instructions, or maybe less, depending on what
> follows. The analysis would be more difficult, though probably not
> excessively so. Now the analysis and code generation is easier. It could
> also delay freeing the tuple which would mean more work the garbage
> collector.
>
> I have absolutely no idea of the results of these trade-offs.
>
> Robert
>

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org


Re: potential compiler improvement?

by James Hague :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> I suspect unpacking 7 or 5 arguments takes more or less the same time.

First off, good discussion!

In C--and HiPE--perhaps the times are the same, but I was thinking
about threaded BEAM code where unpacking the additional three values
involves three more BEAM instructions (assuming those aren't collapsed
into one super-instruction at load time).

Also, in lots of functions a significant portion of the instructions
are moves from temporary registers to parameter registers. (Typically
there's a string of these right before a function call, one per
parameter.) So it is *possible* that delaying the loads of tuple
elements until right before needed may reduce the amount of register
shuffling going on.  Maybe.  I completely agree with Joe that this is
tricky stuff!

James

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org


Re: potential compiler improvement?

by Björn Gustavsson :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Oct 27, 2009 at 5:58 PM, James Hague <james.hague@...> wrote:

> In C--and HiPE--perhaps the times are the same, but I was thinking
> about threaded BEAM code where unpacking the additional three values
> involves three more BEAM instructions (assuming those aren't collapsed
> into one super-instruction at load time).

They *are* collapsed into super-instructions. Not one, but three for your test
program.

You can see it if you do

erts_debug:df(Mod)

(The disassemlbly willl appear in Mod.dis)

/Bjorn
--
Björn Gustavsson, Erlang/OTP, Ericsson AB

________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org