RFC: ARM Cortex-A8 and floating point performance

RFC: ARM Cortex-A8 and floating point performance

by Siarhei Siamashka

Hello,

Currently gcc (at least version 4.5.0) does a very poor job generating single
precision floating point code for ARM Cortex-A8.

The source of this problem is the use of VFP instructions which are run on a
slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
(flush denormals to zero, disable exceptions) just provides a relatively minor
performance gain.

The right solution seems to be the use of NEON instructions for doing most of
the single precision calculations.

I wonder if it would be difficult to introduce the following changes to the
gcc generated code when optimizing for cortex-a8:
1. Allocate single precision variables only to even-numbered (or only to
odd-numbered) s-registers.
2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
'vadd.f32 d0, d0, d1' instead.

The number of single precision floating point registers gets effectively
halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
(packing/unpacking of register pairs may be needed to ensure proper parameter
passing to functions). Also there may be other problems, like dealing with
strict IEEE-754 compliance (maybe a special variable attribute for relaxing
compliance requirements could be useful). But this looks like the only
way to fix the poor performance on the ARM Cortex-A8 processor.

Actually clang 2.7 seems to be working exactly this way. And it is
outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
floating point tests that I tried on ARM Cortex-A8.

--
Best regards,
Siarhei Siamashka

Re: RFC: ARM Cortex-A8 and floating point performance

by Richard Guenther

On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
<siarhei.siamashka@...> wrote:

> Hello,
>
> Currently gcc (at least version 4.5.0) does a very poor job generating single
> precision floating point code for ARM Cortex-A8.
>
> The source of this problem is the use of VFP instructions which are run on a
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.
>
> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.
>
> I wonder if it would be difficult to introduce the following changes to the
> gcc generated code when optimizing for cortex-a8:
> 1. Allocate single precision variables only to evenly or oddly numbered
> s-registers.
> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
> 'vadd.f32 d0, d0, d1' instead.
>
> The number of single precision floating point registers gets effectively
> halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
> (packing/unpacking of register pairs may be needed to ensure proper parameters
> passing to functions). Also there may be other problems, like dealing with
> strict IEEE-754 compliance (maybe a special variable attribute for relaxing
> compliance requirements could be useful). But this looks like the only
> solution to fix poor performance on ARM Cortex-A8 processor.
>
> Actually clang 2.7 seems to be working exactly this way. And it is
> outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
> floating point tests that I tried on ARM Cortex-A8.

On i?86 we have -mfpmath={sse,x87}, I suppose you could add
-mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
and requiring neon support).

Richard.

> --
> Best regards,
> Siarhei Siamashka
>

Re: RFC: ARM Cortex-A8 and floating point performance

by Andrew Pinski

On Jun 16, 2010, at 6:04 AM, Richard Guenther <richard.guenther@...> wrote:

> On i?86 we have -mfpmath={sse,x87}, I suppose you could add
> -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
> and requiring neon support).

Except that, unlike sse, neon does not fully support IEEE. So this
should only be done with -ffast-math :). The fact that it is slow is
not a good enough reason to change it to something that is wrong and fast.

Re: RFC: ARM Cortex-A8 and floating point performance

by Ramana Radhakrishnan

On Wed, 2010-06-16 at 15:52 +0000, Siarhei Siamashka wrote:
> Hello,
>
> Currently gcc (at least version 4.5.0) does a very poor job generating single
> precision floating point code for ARM Cortex-A8.
>
> The source of this problem is the use of VFP instructions which are run on a
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.

> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.

Only in situations where the user has opted in with -ffast-math. I will
point out that single precision floating point operations on NEON are
not completely IEEE compliant.

cheers
Ramana


Re: RFC: ARM Cortex-A8 and floating point performance

by Siarhei Siamashka

On Wednesday 16 June 2010 15:22:32 Ramana Radhakrishnan wrote:

> On Wed, 2010-06-16 at 15:52 +0000, Siarhei Siamashka wrote:
> > Currently gcc (at least version 4.5.0) does a very poor job generating
> > single precision floating point code for ARM Cortex-A8.
> >
> > The source of this problem is the use of VFP instructions which are run
> > on a slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on
> > RunFast mode (flush denormals to zero, disable exceptions) just provides
> > a relatively minor performance gain.
> >
> > The right solution seems to be the use of NEON instructions for doing
> > most of the single precision calculations.
>
> Only in situations that the user is aware about -ffast-math. I will
> point out that single precision floating point operations on NEON are
> not completely IEEE compliant.

Sure. The way gcc deals with IEEE compliance in the generated code should
preferably be consistent and clearly defined. That's why I reported the
following problem earlier: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43703

Generating fast floating point code for Cortex-A8 with the -ffast-math option
can be a good starting point. And ideally it would be nice to be able to mix
IEEE compliant and non-IEEE compliant parts of code. Supporting something like
this would be handy:

typedef __attribute__((ieee_noncompliant)) float fast_float;

For example, Qt defines its own 'qreal' type, which currently defaults to
'float' for ARM and to 'double' for all the other architectures. Many
applications are not that sensitive to strict IEEE compliance or even
precision. But some applications and libraries are, so they need to be
respected too.

But in any case, ARM Cortex-A8 has some hardware to do reasonably fast single
precision floating point calculations (with some compliance issues). It makes
a lot of sense to be able to utilize this hardware efficiently from a high
level language such as C/C++ without rewriting tons of existing code.

AFAIK x86 had its own bunch of issues with the 80-bit extended precision, when
just 32-bit or 64-bit precision is needed.

By the way, I tried to solve or work around this floating point performance
issue by making a C++ wrapper class, overloading operators and using neon
intrinsics. It provided a nice speedup in some cases. But gcc still has
trouble generating efficient code for neon intrinsics, and there were other
issues, like the size of this newly defined type, which make it not very
practical overall.

--
Best regards,
Siarhei Siamashka

Re: RFC: ARM Cortex-A8 and floating point performance

by David Brown

On 16/06/2010 17:09, Andrew Pinski wrote:
>

>> On i?86 we have -mfpmath={sse,x87}, I suppose you could add
>> -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
>> and requiring neon support).
>
> Except unlike sse, neon does not fully support IEEE support. So this
> should only be done with -ffast-math :). The point that it is slow is
> not good enough to change it to be something that is wrong and fast.
>

How relevant is full IEEE support?  I would imagine that only a tiny
minority of programs actually need it.  But I would also imagine that
the great majority of programs are compiled without -ffast-math, even
though that would be perfectly good for them, and generate smaller and
faster code on almost all platforms.

So if -ffast-math were the default, would many people actually see wrong
code?  Obviously such a change could not be done arbitrarily - as you
say, slow code is better than wrong code.  It would have to be done
in stages over successive gcc releases: adding a "-fieee-math"
flag (synonymous with -fno-fast-math), and giving warning messages if
neither "-fieee-math" nor "-ffast-math" were specified, before the default
could be changed.  And it would only apply to the -std=gnu... standards.  But
the end result is that the majority of users would generate faster code.


Re: RFC: ARM Cortex-A8 and floating point performance

by Richard Earnshaw

On Wed, 2010-06-16 at 08:09 -0700, Andrew Pinski wrote:

> > On i?86 we have -mfpmath={sse,x87}, I suppose you could add
> > -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
> > and requiring neon support).
>
> Except unlike sse, neon does not fully support IEEE support. So this  
> should only be done with -ffast-math :). The point that it is slow is  
> not good enough to change it to be something that is wrong and fast.
>

We could document -mfpmath=neon as implying fast-math (or at least, no
denormals and default NaNs).  If the user explicitly asks for floating
point to be done via the Neon unit, it would be somewhat churlish to say
"I don't believe you know what you're asking for... so I'll ignore you".

This might be a better approach than just using Neon automatically when
fast math is enabled.  On some cores a mix of double and single-precision
code might run quite slowly if single precision was done in Neon and double
precision on the VFP unit.

R.



Re: RFC: ARM Cortex-A8 and floating point performance

by Richard Guenther

On Thu, Jul 1, 2010 at 5:30 PM, Richard Earnshaw <rearnsha@...> wrote:

>
> On Wed, 2010-06-16 at 08:09 -0700, Andrew Pinski wrote:
>> > On i?86 we have -mfpmath={sse,x87}, I suppose you could add
>> > -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
>> > and requiring neon support).
>>
>> Except unlike sse, neon does not fully support IEEE support. So this
>> should only be done with -ffast-math :). The point that it is slow is
>> not good enough to change it to be something that is wrong and fast.
>>
>
> We could document -mfpmath=neon as implying fast-math (or at least, no
> denormals and default NaNs).  If the user explicitly asks for floating
> point to be done via the Neon unit, it would be somewhat churlish to say
> "I don't believe you know what you're asking for... so I'll ignore you".

It's certainly reasonable to do that when the limitations are documented.
Not by enabling -ffast-math, but by simply stating the correct facts for
the MODE_HAS_* macros in real.h.

Richard.