RFC: x86: Emit rsqrt with only -ffast-math

View: New views
10 Messages — Rating Filter:   Alert me  

RFC: x86: Emit rsqrt with only -ffast-math

by Michael Matz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

I made some experiments regarding usage of rcpss and rsqrtss generally and
not only with -mrecip.  Due to the NR step all of them are somewhat
reasonably precise (off by about 2.5 ULPs) but common benchmarks that
people care about break in different ways:

* replacing division with rcpss+NR (swdivf) breaks 464.h264ref and
  482.sphinx3.  This is still the case with a somewhat changed expansion
  of the NR step (that better targets a/b, not only 1/b), which needs one
  more addition.
* replacing sqrt(r) with r*[rsqrt(r)+NR] breaks aermod of polyhedron
* replacing 1.0/sqrt(r) with [rsqrt(r)+NR] breaks nothing I tested
  (polyhedron and spec2006).

The last transformation boosts performance of 435.gromacs by 40% .

Now there are multiple ways how to eat this cake, one of them the attached
patch which simply enables the last transformation (and only that one)
under -ffast-math (without -mrecip).  Another possibility would be to
introduce a new flag for just the last transformation.  What do people
think?


Ciao,
Michael.
--
        * config/i386/i386.c (ix86_builtin_reciprocal): Remove dependency
        on TARGET_RECIP.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c (revision 153713)
+++ config/i386/i386.c (working copy)
@@ -25285,7 +25285,7 @@ static tree
 ix86_builtin_reciprocal (unsigned int fn, bool md_fn,
  bool sqrt ATTRIBUTE_UNUSED)
 {
-  if (! (TARGET_SSE_MATH && TARGET_RECIP && !optimize_insn_for_size_p ()
+  if (! (TARGET_SSE_MATH && !optimize_insn_for_size_p ()
  && flag_finite_math_only && !flag_trapping_math
  && flag_unsafe_math_optimizations))
     return NULL_TREE;

Re: RFC: x86: Emit rsqrt with only -ffast-math

by Richard Guenther-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Nov 3, 2009 at 2:35 PM, Michael Matz <matz@...> wrote:

> Hi,
>
> I made some experiments regarding usage of rcpss and rsqrtss generally and
> not only with -mrecip.  Due to the NR step all of them are somewhat
> reasonably precise (off by about 2.5 ULPs) but common benchmarks that
> people care about break in different ways:
>
> * replacing division with rcpss+NR (swdivf) breaks 464.h264ref and
>  482.sphinx3.  This is still the case with a somewhat changed expansion
>  of the NR step (that better targets a/b, not only 1/b), which needs one
>  more addition.
> * replacing sqrt(r) with r*[rsqrt(r)+NR] breaks aermod of polyhedron
> * replacing 1.0/sqrt(r) with [rsqrt(r)+NR] breaks nothing I tested
>  (polyhedron and spec2006).
>
> The last transformation boosts performance of 435.gromacs by 40% .
>
> Now there are multiple ways how to eat this cake, one of them the attached
> patch which simply enables the last transformation (and only that one)
> under -ffast-math (without -mrecip).  Another possibility would be to
> introduce a new flag for just the last transformation.  What do people
> think?

It would follow existing practice of things we allow in
-funsafe-math-optimizations.  Existing practice in that we
want to allow -ffast-math use with common benchmarks we care
about.

Thus, I'm ok with that if the target maintainers agree.  I think
this needs a documentation update to mrecip though.

Richard.

>
> Ciao,
> Michael.
> --
>        * config/i386/i386.c (ix86_builtin_reciprocal): Remove dependency
>        on TARGET_RECIP.
>
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 153713)
> +++ config/i386/i386.c  (working copy)
> @@ -25285,7 +25285,7 @@ static tree
>  ix86_builtin_reciprocal (unsigned int fn, bool md_fn,
>                         bool sqrt ATTRIBUTE_UNUSED)
>  {
> -  if (! (TARGET_SSE_MATH && TARGET_RECIP && !optimize_insn_for_size_p ()
> +  if (! (TARGET_SSE_MATH && !optimize_insn_for_size_p ()
>         && flag_finite_math_only && !flag_trapping_math
>         && flag_unsafe_math_optimizations))
>     return NULL_TREE;
>

Parent Message unknown Re: RFC: x86: Emit rsqrt with only -ffast-math

by Uros Bizjak-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello!

> I made some experiments regarding usage of rcpss and rsqrtss generally and
> not only with -mrecip.  Due to the NR step all of them are somewhat
> reasonably precise (off by about 2.5 ULPs) but common benchmarks that
> people care about break in different ways:
>
> * replacing division with rcpss+NR (swdivf) breaks 464.h264ref and
>    482.sphinx3.  This is still the case with a somewhat changed expansion
>    of the NR step (that better targets a/b, not only 1/b), which needs one
>    more addition.
> * replacing sqrt(r) with r*[rsqrt(r)+NR] breaks aermod of polyhedron
> * replacing 1.0/sqrt(r) with [rsqrt(r)+NR] breaks nothing I tested
>    (polyhedron and spec2006).
>
> The last transformation boosts performance of 435.gromacs by 40% .
>
> Now there are multiple ways how to eat this cake, one of them the attached
> patch which simply enables the last transformation (and only that one)
> under -ffast-math (without -mrecip).  Another possibility would be to
> introduce a new flag for just the last transformation.  What do people
> think?
>    

There were some problems with the patch that enabled rsqrt
transformation in -ffast-math, see PR34709 [1]. The root cause was
analysed and the fix was proposed [2], but unfortunately, the fix was
never applied. Your  rsqrt patch [3] probably fixed this issue, but
please check the testcase at [2] and eventually add the test to the
testsuite.

The patch itself is OK with the testcase from [2], but please be
prepared for eventual surprises ...

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34709
[2] http://gcc.gnu.org/ml/gcc-patches/2008-01/msg00747.html
[3] http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01698.html

Thanks,
Uros.

Re: RFC: x86: Emit rsqrt with only -ffast-math

by Michael Matz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

On Tue, 3 Nov 2009, Uros Bizjak wrote:

> > introduce a new flag for just the last transformation.  What do people
> > think?
>
> There were some problems with the patch that enabled rsqrt
> transformation in -ffast-math, see PR34709 [1]. The root cause was
> analysed and the fix was proposed [2], but unfortunately, the fix was
> never applied.

That testcase explicitely checks for outputs being +-Inf and gives Inf as
input in one case.  As you noted in your followup there [1], this is
outside the scope of -ffinite-math.  Hence that after the change NaN is
given is completely okay.  The testcase wasn't the root cause for 481.wrf
breaking.  The root cause was to transform sqrt(a/b) into 1/sqrt(b/a) at
all, which I indeed fixed (and added a testcase for already).  That
transformation lead to the followup problem that suddenly behaviour at Inf
mattered, when it really shouldn't.

So, I don't think we are obliged to make this testcase work as the author
expected (it will change behaviour with my proposed patch).  It would
unnecessarily slow down the rsqrt+NR expansion.

> The patch itself is OK with the testcase from [2], but please be
> prepared for eventual surprises ...

So, given the above, still OK, even without that testcase?  I of course
checked that 481.wrf doesn't break (neither the other cpu2006 or
polyhedron benchmarks).  And regstrapped on x86_64-linux, with all
languages+Ada.

[1] http://gcc.gnu.org/ml/gcc-patches/2008-01/msg00750.html


Ciao,
Michael.

Re: RFC: x86: Emit rsqrt with only -ffast-math

by Uros Bizjak-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 11/04/2009 06:14 PM, Michael Matz wrote:

> Hi,
>
> On Tue, 3 Nov 2009, Uros Bizjak wrote:
>
>    
>>> introduce a new flag for just the last transformation.  What do people
>>> think?
>>>        
>> There were some problems with the patch that enabled rsqrt
>> transformation in -ffast-math, see PR34709 [1]. The root cause was
>> analysed and the fix was proposed [2], but unfortunately, the fix was
>> never applied.
>>      
> That testcase explicitely checks for outputs being +-Inf and gives Inf as
> input in one case.  As you noted in your followup there [1], this is
> outside the scope of -ffinite-math.  Hence that after the change NaN is
> given is completely okay.  The testcase wasn't the root cause for 481.wrf
> breaking.  The root cause was to transform sqrt(a/b) into 1/sqrt(b/a) at
> all, which I indeed fixed (and added a testcase for already).  That
> transformation lead to the followup problem that suddenly behaviour at Inf
> mattered, when it really shouldn't.
>
> So, I don't think we are obliged to make this testcase work as the author
> expected (it will change behaviour with my proposed patch).  It would
> unnecessarily slow down the rsqrt+NR expansion.
>
>    
>> The patch itself is OK with the testcase from [2], but please be
>> prepared for eventual surprises ...
>>      
> So, given the above, still OK, even without that testcase?  I of course
> checked that 481.wrf doesn't break (neither the other cpu2006 or
> polyhedron benchmarks).  And regstrapped on x86_64-linux, with all
> languages+Ada.
>
> [1] http://gcc.gnu.org/ml/gcc-patches/2008-01/msg00750.html
>
>    

Yes, following your analysis, the patch is OK.

Thanks,
Uros.


Re: RFC: x86: Emit rsqrt with only -ffast-math

by Geert Bosch :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Nov 3, 2009, at 08:35, Michael Matz wrote:
> * replacing sqrt(r) with r*[rsqrt(r)+NR] breaks aermod of polyhedron

Wouldn't you want to do NR on the whole expression directly?
This way the final accuracy will have twice the number of bits
of the initial rsqrt estimate. Doing the multiplication after
the Newton Raphson iteration will cost an additional bit of
precision.

Something like:

   float sqrt(float r) {
     float q = rsqrt(r);   // approx. 1.0 / sqrt (r);
     float y = q * r;      // approx sqrt (r)
     float e = y * y - r;  // y == sqrt (r + e), with e small
     return y - e / (2.0 * y); // Newton Raphson correction
   }

If the initial approximation y is sufficiently close, you
might be able to get away with y - e * rcp(2.0 * y).
Basically, if y is already correct to 12 bits, the division
result needs only to be accurate to 12 bits itself, to get
close to 24 bits of final accuracy.
Anyway, I think the proposed replacement needs a bit
more discussion and numerical analysis.

Using the reciprocal approximations does not in itself mean
reduced accuracy. The IA64 for example relies on these
techniques to implement correctly rounded and efficient
division and square root operations.

The question here is what our speed/accuracy trade-off is.

   -Geert

PS. It would be interesting to know the problematic sqrt
     evaluations in the aermod test case...


Re: RFC: x86: Emit rsqrt with only -ffast-math

by Michael Matz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Geert,

On Wed, 4 Nov 2009, Geert Bosch wrote:

> On Nov 3, 2009, at 08:35, Michael Matz wrote:
> > * replacing sqrt(r) with r*[rsqrt(r)+NR] breaks aermod of polyhedron
>
> Wouldn't you want to do NR on the whole expression directly?

That would be a better implementation, yes.  Note that I don't propose the
above transformation, it is already done that way under -mrecip (it's also
the recommended one from optimization manuals, but of course this doesn't
mean that there aren't better ways).

> This way the final accuracy will have twice the number of bits
> of the initial rsqrt estimate. Doing the multiplication after
> the Newton Raphson iteration will cost an additional bit of
> precision.
>
> Something like:
>
>   float sqrt(float r) {
>     float q = rsqrt(r);   // approx. 1.0 / sqrt (r);
>     float y = q * r;      // approx sqrt (r)
>     float e = y * y - r;  // y == sqrt (r + e), with e small
>     return y - e / (2.0 * y); // Newton Raphson correction
>   }

That itself has a division in it, and hence wouldn't be better than just
using the normal sqrt (no recriprocal) instruction.

> If the initial approximation y is sufficiently close, you
> might be able to get away with y - e * rcp(2.0 * y).
> Basically, if y is already correct to 12 bits, the division
> result needs only to be accurate to 12 bits itself, to get
> close to 24 bits of final accuracy.

Are you sure?  y to 12 bits, e/(2*y) to 12 bits, how should y-e/(2*y) be
correct to 24 bits?

> Using the reciprocal approximations does not in itself mean reduced
> accuracy. The IA64 for example relies on these techniques to implement
> correctly rounded and efficient division and square root operations.
>
> The question here is what our speed/accuracy trade-off is.

It's fairly tight, as divss and sqrtss are actually quite fast (latency
wise).  The throughput is not very good, which is why three muls and some
adds are still better, even with the intermediate check for zero that is
necessary for sqrt(0.0).  But it's not much better anymore, so anything we
add that adds latency makes the whole thing useless.

> PS. It would be interesting to know the problematic sqrt
>     evaluations in the aermod test case...

Actually I misremembered.  It's the division (a/b -> a*rcp(b)) that makes
aermod segfault, see PR34702 which has the correct analysis for this.  I
don't remember anymore which benchmark if any miscompared with only
sqrt(x)->x*rsqrt(x).  But as explained above this transformation is not
generally useful, it needs three registers, several constants, three muls,
a compare, an and, some moves and a plus, and is (for a single call) only
a tad faster than sqrtss.

The situation is different with 1/sqrt(x), because this will save an
sqrtss _and_ a divss instruction (and doesn't need a compare).


Ciao,
Michael.

Re: RFC: x86: Emit rsqrt with only -ffast-math

by Michael Matz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

On Tue, 3 Nov 2009, Richard Guenther wrote:

> Thus, I'm ok with that if the target maintainers agree.  I think this
> needs a documentation update to mrecip though.

r153940.  I've added this sentence to the -mrecip docu:

+Note that GCC implements 1.0f/sqrtf(x) in terms of RSQRTSS (or RSQRTPS)
+already with @option{-ffast-math} (or the above option combination), and
+doesn't need @option{-mrecip}.


Ciao,
Michael.

Re: RFC: x86: Emit rsqrt with only -ffast-math

by Paolo Bonzini-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 11/05/2009 02:57 PM, Michael Matz wrote:

> >
> >     float sqrt(float r) {
> >       float q = rsqrt(r);   // approx. 1.0 / sqrt (r);
> >       float y = q * r;      // approx sqrt (r)
> >       float e = y * y - r;  // y == sqrt (r + e), with e small
> >       return y - e / (2.0 * y); // Newton Raphson correction
> >     }
>
> That itself has a division in it, and hence wouldn't be better than just
> using the normal sqrt (no recriprocal) instruction.

Yes, that would be

   return y - e * q * 0.5;

Paolo


Re: RFC: x86: Emit rsqrt with only -ffast-math

by Geert Bosch :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Nov 5, 2009, at 08:57, Michael Matz wrote:

> On Wed, 4 Nov 2009, Geert Bosch wrote:
>>  float sqrt(float r) {
>>    float q = rsqrt(r);   // approx. 1.0 / sqrt (r);
>>    float y = q * r;      // approx sqrt (r)
>>    float e = y * y - r;  // y == sqrt (r + e), with e small
>>    return y - e / (2.0 * y); // Newton Raphson correction
>>  }
>
> That itself has a division in it, and hence wouldn't be better than  
> just
> using the normal sqrt (no recriprocal) instruction.
Indeed, that's why I suggested using rcp for the divide,
which might be sufficient. Anyway, as you mention, this
transformation is only borderline useful anyway.

>
>> If the initial approximation y is sufficiently close, you
>> might be able to get away with y - e * rcp(2.0 * y).
>> Basically, if y is already correct to 12 bits, the division
>> result needs only to be accurate to 12 bits itself, to get
>> close to 24 bits of final accuracy.
>
> Are you sure?  y to 12 bits, e/(2*y) to 12 bits, how should y-e/
> (2*y) be
> correct to 24 bits?
If y is correct to 12 bits, then the correction term e/(2*y) is
a factor 2^-12 smaller than y, so the error in the correction term
is on the order of 2^-24 * y.

   -Geert