Re: dgemm performance dependence on CacheEdge value

by Clint Whaley-2

Sorry, just found this in my queue . . .

>I performed 4 simultaneous installations of ATLAS 3.9.1 on an
>Opteron 2350.
>The CacheEdge values obtained were 384K and 512K (depending on the
>build/tune thread).

Not sure what 4 simultaneous installs are supposed to do for you.  If you want to
tune for parallel performance, then I recommend:
   http://math-atlas.sourceforge.net/errata.html#SMPCE

>OK, 512K is what I want: 512K = size(L3)/4.

Not sure that CE is hitting the shared L3: I would guess the L2 . . .

>I thought that I would see gemm kernel performance differences, at least
>for the large-matrix tests. But make time shows practically no difference
>between results for different CacheEdge values (2 MB, 512K or 384K).

It's been a while since I've seen more than 5% from varying CacheEdge.
The last machine for which it was critical was the DEC ev5, where you
had a tiny L1 and a 96K L2 that CE blocked for;  I invented CE for this
machine, where it gave a 20% boost in performance.  It seems to me that
CE gives less of a boost than it used to: I put this down to ATLAS having
better prefetch support these days, so that you get only modest improvements
for 2-level cache blocking when the kernel is already L1-blocked with
aggressive prefetch . . .
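
For readers unfamiliar with the term, here is a minimal sketch of what
"2-level cache blocking" can look like for C += A*B. This is an illustration
only and is not ATLAS's actual code: the function name, the row-major layout,
and the parameters nb (inner tile edge, the "L1" block) and kp (depth of the
outer K panel, standing in for what a CacheEdge-style bound would control)
are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smaller of two sizes. */
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * C += A * B, all row-major: A is M x K, B is K x N, C is M x N.
 * kp: depth of the outer K panel (second-level block, assumed name).
 * nb: edge of the inner square tile (first-level block, assumed name).
 */
void gemm_two_level(size_t M, size_t N, size_t K,
                    const double *A, const double *B, double *C,
                    size_t nb, size_t kp)
{
    assert(nb > 0 && kp > 0);
    for (size_t k0 = 0; k0 < K; k0 += kp) {            /* outer level: K panels */
        size_t k1 = min_sz(k0 + kp, K);
        for (size_t i0 = 0; i0 < M; i0 += nb) {        /* inner level: nb tiles */
            size_t i1 = min_sz(i0 + nb, M);
            for (size_t j0 = 0; j0 < N; j0 += nb) {
                size_t j1 = min_sz(j0 + nb, N);
                for (size_t t0 = k0; t0 < k1; t0 += nb) {
                    size_t t1 = min_sz(t0 + nb, k1);
                    /* Tile kernel: a real library would use a tuned kernel here. */
                    for (size_t i = i0; i < i1; i++)
                        for (size_t k = t0; k < t1; k++) {
                            double a = A[i * K + k];
                            for (size_t j = j0; j < j1; j++)
                                C[i * N + j] += a * B[k * N + j];
                        }
                }
            }
        }
    }
}
```

Changing kp only changes the order in which partial products are accumulated,
so the result is the same up to floating-point rounding; what varies is how
much of A and B is live in the larger cache at one time.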

>So there are a few questions.
>
>1) How does (d)gemm performance (for large matrices) depend on the CacheEdge
>value?

These days, it provides a limitation on workspace, but doesn't make huge
differences.  It *can* improve overall application performance, particularly
in parallel (though, again, effects are small).
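
To make the "limitation on workspace" point concrete, here is a rough sketch
of how a cache-size bound could translate into a cap on panel depth. The
working-set model below (one nb x kp panel of A, one kp x nb panel of B, one
nb x nb tile of C, all doubles) is an assumption chosen for illustration and
is not ATLAS's formula; panel_depth and nb are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed working-set model (NOT ATLAS's formula):
 *   (2 * nb * kp + nb * nb) doubles must fit in cache_bytes.
 * Returns the largest kp that is a multiple of nb and satisfies the model,
 * or nb if even a panel one tile deep does not fit.
 */
size_t panel_depth(size_t cache_bytes, size_t nb)
{
    assert(nb > 0);
    size_t elems = cache_bytes / sizeof(double);   /* capacity in doubles */
    if (elems < 3 * nb * nb)                       /* kp = nb would not fit */
        return nb;
    size_t kp = (elems - nb * nb) / (2 * nb);      /* solve the model for kp */
    return (kp / nb) * nb;                         /* round down to a multiple of nb */
}
```

Under this model, with an assumed nb of 56 and 8-byte doubles, a 512K bound
gives kp = 504 and a 384K bound gives kp = 392, so a smaller bound means
smaller panels and less workspace, regardless of whether the speed changes.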

>2) Does ATLAS 3.9.x "know" that the Opteron K10 has a 512K L2 cache *in
>addition* to the L3 cache?
>I saw that 3.8.2 used the *L2* cache size for the CacheEdge value.

ATLAS only does 2 levels of explicit blocking.  AFAIK, the K10h is kind of
funky: I believe the caches are exclusive, so the L3 is kind of like
a huge victim cache.  In this case, ATLAS will almost assuredly block for
the L2 with CE (since the L3 is slower).

>3) Do the gemm kernels use software prefetch? IMHO, prefetch on K10 (in
>contrast to K8) goes directly to the L1 cache (instead of the L2
>cache on K8).

ATLAS's GEMM kernels heavily use prefetch.  I believe the earlier AMD machines
also prefetched to the L1 (I know the original athlon did).  If you use
3.9.1, ATLAS has a kernel that targets the K10h (with prefetch) more
effectively.  I am presently working on 3.9.2, which should be out next
week at the latest.  In the meantime, if you want to use 3.9.1, be sure
to apply the bug fixes documented at:
   http://sourceforge.net/tracker/index.php?func=detail&aid=2024948&group_id=23725&atid=379482
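
As a generic illustration of software prefetch (this is not ATLAS kernel
code), the sketch below uses the GCC/Clang builtin __builtin_prefetch in a
dot product. The function name, the prefetch distance PF_DIST, and the choice
to issue one prefetch per 8 doubles (one 64-byte line, an assumed line size)
are all assumptions for the example; good values are machine-specific and
would have to be found by timing.

```c
#include <assert.h>
#include <stddef.h>

#define PF_DIST 64   /* hypothetical prefetch distance, in elements */

/* Dot product of x and y (length n) with software prefetch hints. */
double dot_prefetch(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Once per assumed cache line, hint at data PF_DIST elements ahead.
           The bounds check keeps the address expression inside the arrays. */
        if ((i & 7) == 0 && i + PF_DIST < n) {
            __builtin_prefetch(&x[i + PF_DIST], 0, 3);  /* read, keep in cache */
            __builtin_prefetch(&y[i + PF_DIST], 0, 3);
        }
        sum += x[i] * y[i];
    }
    return sum;
}
```

The prefetch calls are only hints: they do not change the computed result,
and which cache level the data lands in depends on the processor.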

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************

_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel