Toom symmetries

View: New views
9 Messages — Rating Filter:   Alert me  

Toom symmetries

by nisse-4 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

I've been looking into switching points for interpolate_7pts, from (0,
oo, 1, -1, 2, 1/2, -1/2) to (0, oo, +1, -1, 2, -2, 1/2). Evaluation
and interpolation is equivalent for most practical purposes, so the
reason I tried this was very mundane: Then I can reuse the same
helper function for evaluation in 2 and -2 for both toom44 and toom43
(the latter uses interpolation with the points 0, oo, 1, -1, 2, -2).

This got me to look a little closer to the symmetries of the choosen
points.

The most symmetry we can have is if the set of points is invariant
under both negation and inversion. Then the point set is a union of
sets of the types

  {0, oo}       Each point invariant under negation, and they are swapped
                by inversion.
               
  {1, -1}       Each point invariant under inversion, and they are
                swapped by negation.

  {a, -a, 1/a, -1/a} Swapped pairwise by the two operations.

For a point pair (a, -a), a reasonable interpolation strategy (which seems to
be part of at least some of the optimal sequences) is to split in even
and odd parts, by a butterfly-like operation

  (w_i, w_j) <- ( (w_i + w_j)/2, (w_i - w_j)/(2a) )

Then one question is how to best do this, with current operations that
would be either one addsub followed by two shifts, or one rshadd and
and rshsub. A combined operation might be useful. Currently, I don't
think any interpolation code uses addsub.

Furthermore, if we also have symmetry under inversion for all points,
then the resulting matrices for even and odd parts are the same!

The simplest example, 4 points,

  w0 = f(0)
  w1 = f(1)
  w2 = f(-1)
  w3 = f(oo)

corresponds to the matrix

  1  0  0  0
  1  1  1  1
  1 -1  1 -1
  0  0  0  1

After addsub/2, we have

  1  0  0  0
  0  1  0  1
  1  0  1  0
  0  0  0  1

We get (1, 0; 1,1) for the even coefficients and (1,1; 0,1) for the
even coefficients, which only differ by permutation.

For the next example, say we use eight points, 0, oo, 1, -1, 2, -2,
1/2, -1/2. The full matrix is

   1     0  0  0   0   0  0    0
   1     1  1  1   1   1  1    1
   1    -1  1 -1   1  -1  1   -1
   1     2  4  8  16  32 64  128
   1    -2  4 -8  16 -32 64 -128
   128  64 32  16  8   4  2    1
   128 -64 32 -16  8  -4  2   -1
   0     0  0   0  0   0  0    1

After butterfly operations on the ± pairs, we get

   1     0  0   0  0   0  0    0
   0     1  0   1  0   1  0    1
   1     0  1   0  1   0  1    0
   0     1  0   4  0  16  0   64
   1     0  4   0  16  0 64    0
   0    64  0  16  0   4  0    1
  64     0  16  0  4   0   1   0
   0     0  0   0  0   0  0    1

We get an even part

   1  0  0  0
   1  1  1  1
   1  4 16 64
  64 16  4  1

and an odd part

   1  1  1  1
   1  4 16 64
  64 16  4  1
   0  0  0  1

which again are equivalent. We can use the same row operations for both
matrices, which means that 8 point interpolation can be implemented
compactly using butterfly operations and reductions of the above
4x4 matrix as building blocks.

Then the difficult question is choice of points. Next example with
full symmetry like above would by 12 points, where we might dd the
four points corresponding to a = 4, or a = 2^(GMP_NUMB_BITS) / 2.

Also for 6 points, we will get more symmetry (and hence compactness of
interpolation) if we use 1/2, -1/2 rather than +1 and -1. This is
likely slower than the current strategy. Anyway, after butterflies, we
get the same matrix

   1 0  0
   1 4 16
  16 4  1

for odd and even parts, which can be reduced by

  w1 = w1 - w0
  w2 = w2 - 16*w0
  w1 = (w1 - w2)/15
  w2 = (w2 - w1)/4

(I think this sequence is close to optimal). The butterfly operations
(with no special addsub primitive) would be

  w2 = (w1 - w2)/2
  w1 = w1 - w2
  w4 = (w3 - w4)/2
  w3 = (w3 - w4)/2

In total 12 subtractions, 2 left shifts, 5 right shifts and 2
divisions. This actually not so bad compared to the current code,
there's just one more left shift (and then evaluation obviously also
needs more shifts). Using the four points corresponding to a = sqrt(B)
= 2^{GMP_NUMB_BITS /2} might also be worth investigating, in this case
the same reduction procedure will involved only division by (B-1).

/Niels
_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by David Harvey-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Jun 9, 2009, at 4:11 PM, Niels Möller wrote:

> The simplest example, 4 points,
>
>   w0 = f(0)
>   w1 = f(1)
>   w2 = f(-1)
>   w3 = f(oo)
>
> corresponds to the matrix
>
>   1  0  0  0
>   1  1  1  1
>   1 -1  1 -1
>   0  0  0  1
>
> After addsub/2, we have
>
>   1  0  0  0
>   0  1  0  1
>   1  0  1  0
>   0  0  0  1

Hmmm... sorry to interrupt your train of thought, this gives me an  
idea. Maybe you can delay some of the division by two until after  
part of the recomposition?

i.e.: instead of addsub/2, just do addsub, then you have

   1  0  0  0
   0  2  0  2
   2  0  2  0
   0  0  0  1

i.e. if the output poly coefficients are c0, c1, c2, c3, you have

c0
2c1 + 2c3
2c0 + 2c2
c3

Now start doing recomposition:

          |=== 2c1 + 2c3 ===|
                   |=== 2c0 + 2c2 ===|

Then divide by 2:

          |==== c1 + c3 ====|
                   |==== c0 + c2 ====|

Then continue as usual, i.e. subtract c3, c0 from right places, add  
them at beginning and end (and also use the hi(c0) - lo(c3)  
redundancy trick...), to get

|====== c0 =======|
          |====== c1 =======|
                   |====== c2 =======|
                            |====== c3 =======|

I haven't checked the operation count, but it seems plausible that  
something like this could save half a shift.

> In total 12 subtractions, 2 left shifts, 5 right shifts and 2
> divisions. This actually not so bad compared to the current code,
> there's just one more left shift (and then evaluation obviously also
> needs more shifts). Using the four points corresponding to a = sqrt(B)
> = 2^{GMP_NUMB_BITS /2} might also be worth investigating, in this case
> the same reduction procedure will involved only division by (B-1).

I did try the 2^(GMP_NUMB_BITS/2) idea at some stage --- one  
difficulty is that the pointwise multiplications get slightly bigger.

david

_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by bodrato :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Ciao!

>> The simplest example, 4 points,

Unfortunately the best examples of symmetry came from the
"Toom-and-a-half", Toom-2.5, Toom-3.5 and so on (as I called them with
Zanoni). This means unbalanced only methods. The balanced always have an
odd number of points...

>>   1  0  0  0
>>   1  1  1  1
>>   1 -1  1 -1
>>   0  0  0  1
>>
>> After addsub/2, we have
>>
>>   1  0  0  0
>>   0  1  0  1
>>   1  0  1  0
>>   0  0  0  1

An addsub/2 can be useful here, currently the effect of
(a,b) <- ( (a+b)/2, (a-b)/2 )
is given by the sequence:
a <- (a+b)/2
b <-  a-b
which implies one shift only.

In other sequences I wrote, you can find something like:
a <- a+b
b <- 2*b
b <- a-b
this should be replaced by addsub (avoiding the shift)

> Hmmm... sorry to interrupt your train of thought, this gives me an
> idea. Maybe you can delay some of the division by two until after
> part of the recomposition?
[...]
> I haven't checked the operation count, but it seems plausible that
> something like this could save half a shift.

There are a lot of tricks like this, the first one in Karatsuba (due to?)
saving half an addition. There is one in Toom-3, saving half a sum again,
and it is implemented in the current[1] interpolate_5pts, another in
Toom-3.5, saving two half additions, implemented in interpolate_6pts.
But in this particular case, I can not see it, because with the currently
implemented sequence we compute one shift only, not two...

Anyway the general idea is good, mix the recomposition phase and the
interpolation, to save some half-operations. I'll give a real example I
developed some times ago (but I did not have the time to really
implement/test it).

Again, the trick works perfectly for any Toom-and-a-half, the example is
Toom-4.5, i.e. interpolation_8pts. I start from the matrix Niels wrote:

>> For the next example, say we use eight points, 0, oo, 1, -1, 2, -2,
>> 1/2, -1/2. The full matrix is
>>
>>    1     0  0  0   0   0  0    0
>>    1     1  1  1   1   1  1    1
>>    1    -1  1 -1   1  -1  1   -1
>>    1     2  4  8  16  32 64  128
>>    1    -2  4 -8  16 -32 64 -128
>>    128  64 32  16  8   4  2    1
>>    128 -64 32 -16  8  -4  2   -1
>>    0     0  0   0  0   0  0    1

Niels proposed the butterfly operations, I suggest something slightly
different, removing also the values on the first and last columns, and
with slightly different shifts, the result I need is:

   1   0  0   0  0   0  0   0
   0   1  0   1  0   1  0   0
   0   0  1   0  1   0  1   0
   0   1  0   4  0  16  0   0
   0   0  1   0  4   0 16   0
   0  16  0   4  0   1  0   0
   0   0 16   0  4   0  1   0
   0   0  0   0  0   0  0   1

As you can see, we have a lot of [x,0;0,x] blocks, this does not only mean
we have two matrices that are identical and can be interpolated with one
single function... we can do something better!
We can "recompose" couples of lines! We have:

|==== r0 =====|
      ||==== r1 =====|
             ||==== r2 =====|
                    ||==== r3 =====|
                           ||==== r4 =====|
                                  ||==== r5 =====|
                                         ||==== r6 =====|
                                                 |==== r7 =====|
and we recompose some pairs to get
|==== r0 =====|
      ||==== r1 and r2 =====|
                    ||==== r3 and r4 =====|
                                  ||==== r5 and r6 =====|
                                                 |==== r7 =====|

then we can interpolate the internal matrix

 1  1  1
 1  4 16
16  4  1

only once, operating on slices that are 3/2 times longer than the original
ones, thus saving one half on _every_ operation after this step:
additions, shifts, divisions... everything.

Moreover, interlacing also the evaluation steps, we can also save on
memory footprint: here is the strategy.

- split in slices of "n" limbs
- evaluate in {0, oo} (2n limbs each)
- for each "a"
  - evaluate in {a, -a} (2n limbs + carry, each)
  - butterfly and remove first/last column
  - partially recompose (only 3n limbs + carry, n limbs [and a carry] freed)
- interpolate the inner matrix

For example, Toom-4.5 should need for the interpolation phase: 2 values of
2n limbs and 3 values of 3n limbs, a total of 13n limbs; instead of
8*2n=16n limbs...

It should not be hard to write a single generic function doing
butterfly/remove/recompose for any {a} (if a is a power of 2).

Unfortunately, I have to underline it again... this strategy only works
for the Toom-and-a-half algorithms, I mean unbalanced only :-(

Have a nice day :-D
Marco

[1]
http://gmplib.org:8000/gmp/file/79e81acfb9b3/mpn/generic/toom_interpolate_5pts.c#l134

--
http://bodrato.it/toom-cook/

_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by Torbjorn Granlund-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

bodrato@... writes:

  Unfortunately, I have to underline it again... this strategy only works
  for the Toom-and-a-half algorithms, I mean unbalanced only :-(
 
Well, nothing stops us from over-sampling a balanced case in order to
invoke this new interpolation strategy...

--
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by David Harvey-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Jun 10, 2009, at 3:00 AM, bodrato@... wrote:

> An addsub/2 can be useful here, currently the effect of
> (a,b) <- ( (a+b)/2, (a-b)/2 )
> is given by the sequence:
> a <- (a+b)/2
> b <-  a-b
> which implies one shift only.

Ah yes of course, this is already better than what I suggested.

david

_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by David Harvey-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Jun 10, 2009, at 3:00 AM, bodrato@... wrote:

> There are a lot of tricks like this, the first one in Karatsuba  
> (due to?)
> saving half an addition. There is one in Toom-3, saving half a sum  
> again,
> and it is implemented in the current[1] interpolate_5pts, another in
> Toom-3.5, saving two half additions, implemented in interpolate_6pts.

Marco,

At some point I skimmed over your paper with Zanoni on the optimality  
of Toom-Cook matrices; I recall your automated search did not take  
into account these kind of tricks that merge the interpolation and  
recomposition stages. Is this correct? Have you tried this since  
writing the paper? It seems that it shouldn't be difficult in theory  
to extend the search method, although I have no idea if it's  
practical (the search spaces are very large...?).

david

_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by Alberto Zanoni :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Alle 14:08, mercoledì 10 giugno 2009, David Harvey ha scritto:

> On Jun 10, 2009, at 3:00 AM, bodrato@... wrote:
> > There are a lot of tricks like this, the first one in Karatsuba
> > (due to?)
> > saving half an addition. There is one in Toom-3, saving half a sum
> > again,
> > and it is implemented in the current[1] interpolate_5pts, another in
> > Toom-3.5, saving two half additions, implemented in interpolate_6pts.
>
> Marco,
>
> At some point I skimmed over your paper with Zanoni on the optimality
> of Toom-Cook matrices; I recall your automated search did not take
> into account these kind of tricks that merge the interpolation and
> recomposition stages. Is this correct?

Hi, I'm Alberto (Zanoni).

Yes, it is. The paper deals only with the interpolation stage. We didn't
consider its merging with evaluation and/or recomposition there.

> Have you tried this since
> writing the paper? It seems that it shouldn't be difficult in theory
> to extend the search method, although I have no idea if it's
> practical (the search spaces are very large...?).

Just to know, in our case the search space - for interpolation only - for
Toom-3.5 was analyzed in about one week of running time (while Toom-3 took
just some seconds).

It seems difficult to use the same tool from the very beginning of
interpolation phase with higher Toom. In our paper, for higher Toom
some "manual" steps had to be done, in order to have a sufficient number of
zero entries in the matrix, so that the analysis could be done in reasonable
time.

This limitation was in practice the starting point to look for other
techniques searching for optimization or generalization somewhere else
(stages merging, use of the "divide by 2" technique, the "general" Toom using
powers of 2,...)

We didn't try to insert these techniques in our "old" automated search tool.

Alberto

_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by bodrato :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Ciao,

>   Unfortunately, I have to underline it again... this strategy only works
>   for the Toom-and-a-half algorithms, I mean unbalanced only :-(
>
> Well, nothing stops us from over-sampling a balanced case in order to
> invoke this new interpolation strategy...

Maybe... As soon as the Fool-Toom-Tool implement it, we will know if this
is a good idea :-D

Marco

--
http://bodrato.it/

_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Toom symmetries

by nisse-4 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

bodrato@... writes:

> As you can see, we have a lot of [x,0;0,x] blocks, this does not only mean
> we have two matrices that are identical and can be interpolated with one
> single function... we can do something better!
> We can "recompose" couples of lines! We have:

That's clever. To see if I understand this, let's consider the simplest
four-point interpolation first. The number we're trying to construct
is x=f(B), with

  f(t) = x_0 + x_1 t + x_2 t^2 + x_3 t^3

and B is the power of two used for splitting. To start with, we are
given

  w0 = f(0) = x0
  w1 = f(1) = x0 + x1 + x2 + x3
  w2 = f(-1) = x0 - x1 + x2 - x3
  w3 = f(oo) = x3

First, a butterfly operation, replacing w1 and w2 by

  w1 = x1 + x3
  w2 = x0 + x2

Then x2 = w2 - w0 and x1 = w1 - w3, so we can write

  x = w0 + (w1 - w3) B + (w2 - w0) B^2 + w3 B^3
    = -(w3 B - w0) + (w1 + w2 B) B + (w3 B - w0) B^2

Let L w0 and H w0 be the low and high part of w0. Define

  y = w3 - H w0
  z = w1 + B w2

Then we have

  x = Lw0 - y B + z B + y B^2

or graphically
      B^4  B^3  B^2   B    1
  |    |    |    |    |    |
            +---------+----+
            |   - y   |Lw0 |
      +-----+---------+
      |      z        |
  +---+---------------+
  |   y     |-Lw0|
  +---------+----+

So the inputs to the recombination, at this stage, are y (2n limbs +
sign), z (2n+1 limbs) and Lw0 (n limbs), for a total of 5n + 1 limbs.
Compared to 8*n + 2 needed to represent the original w:s. We need
scratch space for the recursive computations, and for the storage of
z, but no more. (To compute z, put factors in the product area, first
product in z, second in the product area, then do the butterflies and
combination. w0 and w3 can be computed directly into the product area
and combined into y.

I don't fully understand the current toom32_mul, but maybe it's doing
precisely this? The temporary allocation makes me suspicious, though:

toom32_mul_itch returns 4*n + 2. toom32_mul does this,

  #define scratch_out scratch + 4 * n + 2
  [...]
  TOOM22_MUL_N_REC (vm1, asm1, bsm1, n, scratch_out);
 
The recursive call is to toom22_mul, which needs ~ 2n limbs of scratch
space, according to toom22_mul_itch, and this area is beyond the
scratch area that toom32_mul asks for.

Regards,
/Niels
_______________________________________________
gmp-devel mailing list
gmp-devel@...
http://gmplib.org/mailman/listinfo/gmp-devel