3.9.1

3.9.1

by Clint Whaley

Guys,

I have released 3.9.1.  It is a bugfix release on 3.9.0.

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel

dgemm performance dependence on CacheEdge value

by Mikhail Kuzminsky

I performed simultaneous installations of four ATLAS 3.9.1 instances on an
Opteron 2350.
The CacheEdge values obtained were 384K and 512K (depending on the
build/tune thread). OK, 512K is what I want: 512K = size(L3)/4.

I thought that I would see gemm kernel performance differences, at least
in the large-matrix tests. But make time shows practically no difference
between results for different CacheEdge values (2 MB, 512K or 384K).
So I have a few questions.

1) How does (d)gemm performance (for large matrices) depend on the CacheEdge
value?

2) Does ATLAS 3.9.x "know" that the Opteron K10 has a 512K L2 cache *in
addition* to the L3 cache?
I saw that 3.8.2 used the *L2* cache size for the CacheEdge value.

3) Do the gemm kernels use software prefetch? IMHO prefetch on K10 (as
opposed to K8) goes directly into the L1 cache (instead of the L2
cache on K8).

Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

Re: dgemm performance dependence on CacheEdge value

by Clint Whaley-2

Sorry, just found this in my queue . . .

>I performed simultaneous installations of four ATLAS 3.9.1 instances on an
>Opteron 2350.
>The CacheEdge values obtained were 384K and 512K (depending on the
>build/tune thread).

Not sure what 4 simult installs is supposed to do for you.  If you want to
tune for parallel performance, then I recommend:
   http://math-atlas.sourceforge.net/errata.html#SMPCE

>OK, 512K is what I want: 512K = size(L3)/4.

Not sure that CE is hitting the shared L3: I would guess the L2 . . .

>I thought that I would see gemm kernel performance differences, at least
>in the large-matrix tests. But make time shows practically no difference
>between results for different CacheEdge values (2 MB, 512K or 384K).

It's been a while since I've seen more than 5% from varying CacheEdge.
The last machine for which it was critical was the DEC ev5, where you
had a tiny L1 and a 96K L2 that CE blocked for;  I invented CE for this
machine, where it gave a 20% boost in performance.  It seems to me that
CE gives less of a boost than it used to: I put this down to ATLAS having
better prefetch support these days, so that you get only modest improvements
for 2-level cache blocking when the kernel is already L1-blocked with
aggressive prefetch . . .

>So I have a few questions.
>
>1) How does (d)gemm performance (for large matrices) depend on the CacheEdge
>value?

These days, it provides a limitation on workspace, but doesn't make huge
differences.  It *can* improve overall application performance, particularly
in parallel (though, again, effects are small).
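
As a rough illustration of how a blocking parameter that partitions K (as CE is described as doing later in this thread) can bound workspace, here is a minimal Python sketch. It is not ATLAS code: the rule in `kblock_from_cacheedge`, the block size `nb`, and the assumption that the workspace holds copies of the current K-panels of A and B are all hypothetical, chosen only to make the idea concrete.

```python
def kblock_from_cacheedge(cache_edge_bytes, nb, elem_size=8):
    """Hypothetical rule (not ATLAS's): largest multiple of nb such that a
    kb x nb panel of A plus a kb x nb panel of B fit in cache_edge_bytes."""
    max_kb = cache_edge_bytes // (2 * nb * elem_size)
    return max(nb, (max_kb // nb) * nb)


def panel_workspace_elems(m, n, kb):
    """Assumed workspace: copies of an m x kb panel of A and a kb x n panel of B."""
    return (m + n) * kb


def gemm_k_partitioned(A, B, C, kb):
    """Accumulate C += A*B one K-panel at a time.

    A is m x k, B is k x n, C is m x n (lists of lists of floats).
    Returns the number of passes made over C, one per K-panel.
    """
    m, k, n = len(A), len(B), len(B[0])
    passes = 0
    for k0 in range(0, k, kb):
        k1 = min(k0 + kb, k)
        for i in range(m):
            for j in range(n):
                s = 0.0
                for p in range(k0, k1):
                    s += A[i][p] * B[p][j]
                C[i][j] += s  # C is read and written once per K-panel
        passes += 1
    return passes


if __name__ == "__main__":
    # nb=56 is an arbitrary illustrative block size, not a value from this thread.
    kb = kblock_from_cacheedge(512 * 1024, nb=56)
    print("K panel depth:", kb)
    print("workspace elements, full K:", panel_workspace_elems(1600, 1600, 1600))
    print("workspace elements, one panel:", panel_workspace_elems(1600, 1600, kb))
```

Under these assumptions the trade-off is visible directly: a smaller K-panel shrinks the copied panels, but every extra panel costs another read and write of C.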

>2) Does ATLAS 3.9.x "know" that the Opteron K10 has a 512K L2 cache *in
>addition* to the L3 cache?
>I saw that 3.8.2 used the *L2* cache size for the CacheEdge value.

ATLAS only does 2 levels of explicit blocking.  AFAIK, the K10h is kind of
funky: I believe the caches are exclusive, so the L3 is kind of like
a huge victim cache.  In this case, ATLAS will almost assuredly block for
the L2 with CE (since the L3 is slower).

>3) Do the gemm kernels use software prefetch? IMHO prefetch on K10 (as
>opposed to K8) goes directly into the L1 cache (instead of the L2
>cache on K8).

ATLAS's GEMM kernels heavily use prefetch.  I believe the earlier AMD machines
also prefetched to the L1 (I know the original Athlon did).  If you use
3.9.1, ATLAS has a kernel that targets the K10h (with prefetch) more
effectively.  I am presently working on 3.9.2, which should be out next
week at the latest.  In the meantime, if you want to use 3.9.1, be sure
to apply the bug fixes documented at:
   http://sourceforge.net/tracker/index.php?func=detail&aid=2024948&group_id=23725&atid=379482

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************

Re: dgemm performance dependence on CacheEdge value

by Mikhail Kuzminsky

In message from Clint Whaley <whaley@...> (Fri, 08 Aug 2008 09:12:57 -0500):
>>I performed simultaneous installations of four ATLAS 3.9.1 instances on an
>>Opteron 2350.
>>The CacheEdge values obtained were 384K and 512K (depending on the
>>build/tune thread).
>
>Not sure what 4 simult installs is supposed to do for you.  If you want to
>tune for parallel performance, then I recommend:
>   http://math-atlas.sourceforge.net/errata.html#SMPCE

Thanks! It's a better approach than mine :-)

>>OK, 512K is what I want: 512K = size(L3)/4.
>Not sure that CE is hitting the shared L3: I would guess the L2 . . .

  ATLAS3.9.0:
    STAGE 2-1-2: CacheEdge DETECTION
       CacheEdge set to 2621440 bytes
=================^IMHO it's L3 ^ ============

or from dMMCACHEEDGE.LOG:

gemm xdfindCE -f res/atlas_cacheedge.h
TA  TB       M       N       K   alpha    beta  CacheEdge       TIME    MFLOPS
==  ==  ======  ======  ======  ======  ======  =========  =========  ========

 T   N    1600    1600    1600    1.00    1.00          0      1.228   6671.69
 T   N    1600    1600    1600    1.00    1.00         16     -2.000      0.00
 T   N    1600    1600    1600    1.00    1.00         32     -2.000      0.00
 T   N    1600    1600    1600    1.00    1.00         64      1.514   5409.95
 T   N    1600    1600    1600    1.00    1.00        128      1.279   6405.71
 T   N    1600    1600    1600    1.00    1.00        256      1.248   6565.56
 T   N    1600    1600    1600    1.00    1.00        512      1.239   6610.67
 T   N    1600    1600    1600    1.00    1.00       1024      1.227   6673.78
 T   N    1600    1600    1600    1.00    1.00       2048      1.227   6673.98
 T   N    1600    1600    1600    1.00    1.00       4096      1.227   6674.22
 T   N    1600    1600    1600    1.00    1.00       8192      1.228   6672.71

Initial CE=4096KB, mflop=6674.22

 T   N    1600    1600    1600    1.00    1.00       3072      1.227   6674.70
 T   N    1600    1600    1600    1.00    1.00       2560      1.227   6676.11
 T   N    1600    1600    1600    1.00    1.00       2304      1.227   6674.23
 T   N    1600    1600    1600    1.00    1.00       2816      1.227   6675.75

Best CE=2560KB, mflop=6676.11
====================================================================

These 3.9.0 data were the reason why I thought that L3 was being used for
CE.
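
As a sanity check on the numbers in this log, here is a small Python sketch. It assumes the MFLOPS column uses the conventional 2*M*N*K operation count for a real GEMM; that is an inference from the logged values, not something stated in the log. It also converts the byte value reported by STAGE 2-1-2 into KB for comparison with the "Best CE" line.

```python
def gemm_mflops(m, n, k, seconds):
    """MFLOPS for a real GEMM, assuming 2*M*N*K floating-point operations."""
    return 2.0 * m * n * k / seconds / 1.0e6


# (CacheEdge in KB, time in seconds, logged MFLOPS), copied from the log above.
rows = [(0, 1.228, 6671.69), (64, 1.514, 5409.95), (2560, 1.227, 6676.11)]

for ce_kb, seconds, logged in rows:
    estimate = gemm_mflops(1600, 1600, 1600, seconds)
    print("CE=%5d KB  estimated=%8.2f  logged=%8.2f" % (ce_kb, estimate, logged))

# "CacheEdge set to 2621440 bytes" expressed in KB.
cache_edge_kb = 2621440 // 1024
print("CacheEdge:", cache_edge_kb, "KB")
```

The estimates are computed from the printed times, which are rounded to three decimals, so small differences from the logged MFLOPS are expected.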

>>2) Does ATLAS 3.9.x "know" that the Opteron K10 has a 512K L2 cache *in
>>addition* to the L3 cache?
>>I saw that 3.8.2 used the *L2* cache size for the CacheEdge value.
>
>ATLAS only does 2 levels of explicit blocking.  AFAIK, the K10h is kind of
>funky: I believe the caches are exclusive, so the L3 is kind of like
>a huge victim cache.  In this case, ATLAS will almost assuredly block for
>the L2 with CE (since the L3 is slower).

L2 also holds only victims from L1. But looking at the CE value obtained
(about 2 MB!), I thought of L3.

>If you use 3.9.1, ATLAS has a kernel that targets the K10h (with prefetch)
>more effectively.  I am presently working on 3.9.2, which should be out next
>week at the latest.  In the meantime, if you want to use 3.9.1, be sure
>to apply the bug fixes documented at:
>   http://sourceforge.net/tracker/index.php?func=detail&aid=2024948&group_id=23725&atid=379482

I tuned CE with 4 simultaneous installations of 3.9.1. There was nothing
like a hang, although 1 of the 4 installations finished with an error.

Yours
Mikhail

Re: dgemm performance dependence on CacheEdge value

by Clint Whaley-2

>gemm xdfindCE -f res/atlas_cacheedge.h
>TA  TB       M       N       K   alpha    beta  CacheEdge       TIME    MFLOPS
>==  ==  ======  ======  ======  ======  ======  =========  =========  ========
>
> T   N    1600    1600    1600    1.00    1.00          0      1.228   6671.69
> T   N    1600    1600    1600    1.00    1.00         16     -2.000      0.00
> T   N    1600    1600    1600    1.00    1.00         32     -2.000      0.00
> T   N    1600    1600    1600    1.00    1.00         64      1.514   5409.95
> T   N    1600    1600    1600    1.00    1.00        128      1.279   6405.71
> T   N    1600    1600    1600    1.00    1.00        256      1.248   6565.56
> T   N    1600    1600    1600    1.00    1.00        512      1.239   6610.67
> T   N    1600    1600    1600    1.00    1.00       1024      1.227   6673.78
> T   N    1600    1600    1600    1.00    1.00       2048      1.227   6673.98
> T   N    1600    1600    1600    1.00    1.00       4096      1.227   6674.22
> T   N    1600    1600    1600    1.00    1.00       8192      1.228   6672.71
>
>Initial CE=4096KB, mflop=6674.22
>
> T   N    1600    1600    1600    1.00    1.00       3072      1.227   6674.70
> T   N    1600    1600    1600    1.00    1.00       2560      1.227   6676.11
> T   N    1600    1600    1600    1.00    1.00       2304      1.227   6674.23
> T   N    1600    1600    1600    1.00    1.00       2816      1.227   6675.75
>
>Best CE=2560KB, mflop=6676.11
>====================================================================

If you look at this output, you see that the performance of CE=2M is absolutely
indistinguishable from CE=0 (no L2 blocking).  In such a case, ATLAS uses
CE, since its partitioning of K reduces workspace needs of large problems.
So, what you are seeing is that this system doesn't get any benefit from
L2 cache blocking, but that we can afford multiple writes of C for large
matrices . . .
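
To put numbers on "indistinguishable", here is a minimal Python sketch that expresses each CacheEdge result from the quoted log as a percentage difference from the CE=0 (no L2 blocking) row. The MFLOPS values are copied from the log; the helper name `percent_vs_baseline` is just an illustration. The CE=16 and CE=32 rows are left out because they report a negative time and 0.00 MFLOPS, so they carry no usable measurement.

```python
# CacheEdge (KB) -> MFLOPS, copied from the xdfindCE log quoted above.
results = {
    0: 6671.69, 64: 5409.95, 128: 6405.71, 256: 6565.56, 512: 6610.67,
    1024: 6673.78, 2048: 6673.98, 4096: 6674.22, 8192: 6672.71,
    3072: 6674.70, 2560: 6676.11, 2304: 6674.23, 2816: 6675.75,
}


def percent_vs_baseline(results, baseline_ce=0):
    """Return {CE: percent change in MFLOPS relative to the baseline CE row}."""
    base = results[baseline_ce]
    return {ce: 100.0 * (mflops - base) / base for ce, mflops in results.items()}


pct = percent_vs_baseline(results)
for ce in sorted(pct):
    print("CE=%5d KB  %+7.2f%% vs CE=0" % (ce, pct[ce]))
```

For this log, the chosen CE=2560KB comes out less than 0.1% above the CE=0 baseline, while CE=64KB is roughly 19% slower.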

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************
