tuning of Atlas on x86/NUMA

tuning of Atlas on x86/NUMA

by Mikhail Kuzminsky

How is the ATLAS tuning process (for example, for the dgemm kernel)
organized on SMP/NUMA servers whose CPUs have a shared cache? For
example, on a dual-socket quad-core Opteron server?

If dgemm tuning takes the shared cache size into account but is run
only single-threaded (a sequential run), it will assume that it can use
the whole cache (for example, the 2 MB L3 of the Opteron 2350). But for
a multithreaded dgemm with 4 threads per CPU, only 512 KB of L3 is
available to each thread before cache misses become frequent. Therefore
the multithreaded version requires, IMHO, tuning that is independent of
the sequential version.
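
As a rough illustration of the arithmetic above, here is a small sketch. It assumes the L3 is divided evenly among the threads, which is an idealization: the hardware does not partition a shared cache this way. The variable names are only illustrative.

```sh
#!/bin/sh
# Even split of a shared L3 among threads (an idealization; a real shared
# cache is not partitioned like this).
l3_kb=2048        # 2 MB L3 of the Opteron 2350, as stated above
threads=4         # one dgemm thread per core of a quad-core CPU
per_thread_kb=$((l3_kb / threads))
doubles=$((per_thread_kb * 1024 / 8))   # 8-byte double-precision elements
echo "L3 per thread: ${per_thread_kb} KB (${doubles} doubles)"
```

That share holds a single 256x256 block of doubles. Since dgemm works on blocks of three matrices at once, the blocking that actually fits per thread would be smaller still.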

The second question is about using process affinity (taskset on Linux)
and NUMA memory allocation (numactl) during tuning. Does the tuning
process take these possibilities into account, or is there no serious
reason to use taskset/numactl when tuning?

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow    

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel

Re: tuning of Atlas on x86/NUMA

by Clint Whaley

Mikhail,

ATLAS presently does almost no SMP-related empirical tuning.  We are
presently looking at it, but the present package does none.  Here's a link
you may be interested in from the errata:
   http://math-atlas.sourceforge.net/errata.html#SMPCE

>How is the ATLAS tuning process (for example, for the dgemm kernel)
>organized on SMP/NUMA servers whose CPUs have a shared cache? For
>example, on a dual-socket quad-core Opteron server?
>
>If dgemm tuning takes the shared cache size into account but is run
>only single-threaded (a sequential run), it will assume that it can use
>the whole cache (for example, the 2 MB L3 of the Opteron 2350). But for
>a multithreaded dgemm with 4 threads per CPU, only 512 KB of L3 is
>available to each thread before cache misses become frequent. Therefore
>the multithreaded version requires, IMHO, tuning that is independent of
>the sequential version.
>
>The second question is about using process affinity (taskset on Linux)
>and NUMA memory allocation (numactl) during tuning. Does the tuning
>process take these possibilities into account, or is there no serious
>reason to use taskset/numactl when tuning?

We are presently looking at the effects of using processor affinity.  I have
no idea what numactl is, do you have a link?

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************

Re: tuning of Atlas on x86/NUMA

by Tim Mattox-2

Hi Clint,
For processor affinity, you may want to look at the
Portable Linux Processor Affinity (PLPA) project:
http://www.open-mpi.org/projects/plpa/

Here are links for numactl, which is a command-line utility:
http://freshmeat.net/projects/numactl/
http://oss.sgi.com/projects/libnuma/

From the SGI link:
"The numactl program allows you to run your application
program on specific cpu's and memory nodes. It does this
by supplying a NUMA memory policy to the operating
system before running your program."

On Tue, Jun 24, 2008 at 3:23 PM, Clint Whaley <whaley@...> wrote:

> We are presently looking at the effects of using processor affinity.  I have
> no idea what numactl is, do you have a link?



--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox@... || timattox@...
 I'm a bright... http://www.the-brights.net/

Re: tuning of Atlas on x86/NUMA

by Mikhail Kuzminsky

I have a dual-socket quad-core Opteron 2350 (2 GHz) based server.

1) For single-threaded ATLAS

For the current 3.8.2 I can use the taskset and/or numactl utilities
to force processor affinity (and NUMA memory allocation) during the
tuning phase.

What is the correct way to do this?

I could simply run the corresponding make commands under
taskset/numactl. Is it necessary to use

taskset -c <CPU_number> make build

IMHO, it is reasonable to use taskset/numactl for make check as well,
right?

BTW, I believe I should also use -D c -DPentiumCPS=2000, right?

That would make it possible to see the influence of NUMA/CPU affinity.

I could insert taskset/numactl somewhere more precise, but I don't
know where.

2) For pthreaded ATLAS

The simplest practical (stupid :-)) ) idea for seeing the influence of
cache sharing (the L3 on the quad-core Opteron) is to prepare a shell
script in which 8 instances of ATLAS tuning (I have 8 cores) run
simultaneously,

i.e. something like
#! /bin/sh
# One tuning build per core, each with its own log file.
numactl <parameters_for_core 1> make build > out_1 2>&1 &
numactl <parameters_for_core 2> make build > out_2 2>&1 &
...
numactl <parameters_for_core 8> make build > out_8 2>&1 &

Since the build time is relatively long, I think I can neglect the
small differences in start time.
I will then need 8 copies of the ATLAS directory tree :-)

In that case there will be 8 simultaneous tuning processes, which will
share the common L3 cache.
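
The idea above could be sketched as a loop rather than eight hand-written lines. This is only a sketch under assumptions that are not from the thread: the configured build trees are named build_0 .. build_7 (hypothetical names), the cores are numbered 0..7 as Linux does, and each build is pinned with numactl's --physcpubind and --localalloc options. By default it only prints the commands so they can be checked first.

```sh
#!/bin/sh
# Sketch: one ATLAS tuning build per core, each pinned with numactl.
# Assumed (hypothetical) layout: configured build trees build_0 .. build_7.
NCORES=${NCORES:-8}

# Print the command line that would be used for core $1.
build_cmd() {
    echo "numactl --physcpubind=$1 --localalloc make -C build_$1 build"
}

# With RUN=1 the builds are started in the background, logging to out_<core>;
# otherwise the commands are only printed.
launch_all() {
    i=0
    while [ "$i" -lt "$NCORES" ]; do
        cmd=$(build_cmd "$i")
        if [ "${RUN:-0}" = 1 ]; then
            $cmd > "out_$i" 2>&1 &
        else
            echo "$cmd"
        fi
        i=$((i + 1))
    done
    wait    # returns once all background builds have finished
}

launch_all
```

Running it with RUN=1 in the environment would start the builds for real. Setting NCORES=4 and using only the cores of one socket would exercise the sharing of a single L3.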

Is this "proposal" reasonable for my particular case?
(Of course, it would be better to insert numactl somewhere more
"precise".)

FYI: I'm especially interested in dgemm, which consumes most of the
CPU time of some of our applications.

Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

Re: tuning of Atlas on x86/NUMA

by dean gaudet-2

On Wed, 25 Jun 2008, Mikhail Kuzminsky wrote:

> I have a dual-socket quad-core Opteron 2350 (2 GHz) based server.
>
> 1) For single-threaded ATLAS
>
> For the current 3.8.2 I can use the taskset and/or numactl utilities
> to force processor affinity (and NUMA memory allocation) during the
> tuning phase.
>
> What is the correct way to do this?
>
> I could simply run the corresponding make commands under
> taskset/numactl. Is it necessary to use
>
> taskset -c <CPU_number> make build

the default memory policy is a local node allocation (assuming there are
free pages)... so it may not make any difference.  but yeah for
consistency in the timing it's probably best to use something like:

numactl --membind=0 --cpubind=0 make build
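
One way to check that such a binding actually took effect is to print the CPUs the process is allowed to run on. The sketch below is an assumption-laden illustration, not part of ATLAS: it reads the Linux-specific /proc/self/status file and falls back to "unknown" when the field is not available.

```sh
#!/bin/sh
# Sketch: report which CPUs the current process may run on.
# Relies on the Linux-specific Cpus_allowed_list field of /proc/self/status.
allowed_cpus() {
    if [ -r /proc/self/status ]; then
        awk '/^Cpus_allowed_list:/ { print $2; found = 1 }
             END { if (!found) print "unknown" }' /proc/self/status
    else
        echo unknown
    fi
}

echo "allowed CPUs: $(allowed_cpus)"
```

Saved under a hypothetical name such as show_cpus.sh and started with the same numactl prefix as the build, it should list only the CPUs of node 0. Started without the prefix, it would normally list all cores.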

-dean

Re: tuning of Atlas on x86/NUMA

by Mikhail Kuzminsky

I installed ATLAS 3.8.2 on my dual-socket quad-core Opteron 2350-based
server (i.e. 8 cores per server) using the TSC-based high-resolution
timer.

I found that ATLAS recognizes the L2 cache correctly but doesn't "see"
the L3 cache (shared by the 4 cores of a CPU).

Taking this into account, does it make sense to carry out my proposal
of running 4 instances of "make build" simultaneously (for one CPU),
i.e. will real sharing of the L3 by the "make build" processes have a
significant influence on ATLAS performance tuning?

BTW, I built the pthreaded ATLAS libraries and linked (as an example)
my Linpack (n=1000) code against them. I expected it to use all 8
threads after that, but I see only a very small performance improvement
compared with the sequential run. Are the LAPACK dgetrf/dgetrs routines
thread-parallelized in ATLAS? Or maybe I'm using ATLAS incorrectly
somewhere?

Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow



In message from "Mikhail Kuzminsky" <kus@...> (Wed, 25 Jun 2008
22:12:00 +0400):

>2) For pthreaded ATLAS
>
>The simplest practical (stupid :-)) ) idea for seeing the influence of
>cache sharing (the L3 on the quad-core Opteron) is to prepare a shell
>script in which 8 instances of ATLAS tuning (I have 8 cores) run
>simultaneously.