tuning of Atlas on x86/NUMA

How is the ATLAS tuning process (for example, for the dgemm kernel) organized for SMP/NUMA servers whose CPUs have a shared cache? For example, for a dual-socket quad-core Opteron server?

If dgemm tuning takes the shared cache size into account but is done only single-threaded (a sequential run), it will conclude that it can use the whole cache (for example, the 2 MB L3 of the Opteron 2350). But for multithreaded dgemm with 4 threads per CPU, only 512 KB of L3 is available to each thread without a lot of cache misses. The multithreaded version therefore requires, IMHO, tuning that is independent of the sequential version.

The second question is about using process affinity (taskset on Linux) and NUMA memory allocation (numactl) during the tuning process. Does the tuning take these possibilities into account, or is there no serious reason to use taskset/numactl in tuning?

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
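The cache arithmetic in the question above can be written out as a small shell sketch. It only divides a shared cache evenly among the threads that share it, which is the simplifying assumption the post makes; real cache occupancy depends on the access pattern. The function name `cache_share_kb` is made up for this illustration and is not part of ATLAS.

```shell
#!/bin/sh
# Even per-thread share of a shared cache, in KB.
# $1 = total cache size in KB, $2 = number of threads sharing it.
# cache_share_kb is a hypothetical helper name, not an ATLAS tool.
cache_share_kb() {
    total_kb=$1
    sharers=$2
    if [ "$sharers" -le 0 ]; then
        echo "number of sharers must be positive" >&2
        return 1
    fi
    echo $(( total_kb / sharers ))
}

# Figures from the post: 2 MB (2048 KB) L3 shared by 4 cores per socket.
cache_share_kb 2048 4    # prints 512
```

With the Opteron 2350 numbers this gives the 512 KB per thread mentioned in the question.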
|
|
Re: tuning of Atlas on x86/NUMA

Mikhail,

ATLAS presently does almost no SMP-related empirical tuning. We are looking at it, but the present package does none. Here's a link from the errata that you may be interested in:
http://math-atlas.sourceforge.net/errata.html#SMPCE

>How the atlas tuning process (for example, for dgemm kernel) is
>organized for the case of SMP/NUMA servers w/CPUs having shared cache ?
>For example, for dual-socket quad-core Opteron server ?
>
>If dgemm tuning takes into account shared cache size, and is tuned
>only "single threaded" (sequential run),
>then it'll propose that it can use whole cache (for example, 2 MB L3
>for Opteron 2350). But for multithreaded dgemm w/4 threads per CPU
>only 512K of L3 will be available w/o a lot of cache miss. Therefore
>multithreaded version requires, IMHO, "independent" (from sequential
>version) tuning.

>And the second question is about using of process affinity (taskset
>for Linux) and NUMA-allocation of memory
>(using of numactl) at the tuning process. Does it take into account
>these possibilities or there is no serious reasons
>to use taskset/numactl in tuning ?

We are presently looking at the effects of using processor affinity. I have no idea what numactl is; do you have a link?

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************
|
|
Re: tuning of Atlas on x86/NUMA

Hi Clint,

For processor affinity, you may want to look at the Portable Linux Processor Affinity (PLPA) project:
http://www.open-mpi.org/projects/plpa/

Here are links for numactl, which is a command-line utility:
http://freshmeat.net/projects/numactl/
http://oss.sgi.com/projects/libnuma/

From the SGI link:
"The numactl program allows you to run your application program on specific cpu's and memory nodes. It does this by supplying a NUMA memory policy to the operating system before running your program."

On Tue, Jun 24, 2008 at 3:23 PM, Clint Whaley <whaley@...> wrote:
> We are presently looking at the affects of using processor affinity. I have
> no idea what numactl is, do you have a link?

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmattox@... || timattox@...
I'm a bright... http://www.the-brights.net/
|
|
Re: tuning of Atlas on x86/NUMA

I have a dual-socket quad-core Opteron 2350 (2 GHz) based server.

1) For single-threaded ATLAS

With the current 3.8.2 I can use the taskset and/or numactl utilities to force processor affinity (and NUMA memory allocation) during the tuning phase. Which should I use to be correct?

I can simply run the corresponding makes under taskset/numactl. Is it necessary to use

    taskset -c <CPU_number> make build

IMHO, it's reasonable to use taskset/numactl for make check as well, right?

BTW, I believe I should also use -D c -DPentiumCPS=2000, right?

That would make it possible to see the influence of NUMA/CPU affinity. I could insert taskset/numactl somewhere more precisely, but I don't know where.

2) For pthreaded ATLAS

The simplest practical (stupid :-)) ) way to see the influence of cache sharing (L3 on the quad-core Opteron) is to prepare a shell script in which 8 instances of ATLAS tuning (I have 8 cores) run simultaneously, i.e. something like

    #!/bin/sh
    numactl <parameters_for_core 1> make build > out_1 2>&1 &
    numactl <parameters_for_core 2> make build > out_2 2>&1 &
    ...
    numactl <parameters_for_core 8> make build > out_8 2>&1 &

Since the build time is relatively long, I think I can neglect the small differences in start time. I'll then need 8 copies of the ATLAS directory tree :-)

In that case there will be 8 simultaneous tuning processes, which will share the common L3 cache.

Is this "proposal" reasonable for my particular case? (Of course, it would be better to insert numactl somewhere more precisely.)

FYI: I'm especially interested in dgemm, which eats most of the CPU time of some of our applications.

Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow
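The script proposed above leaves the numactl parameters open. A minimal sketch of one way to fill them in is given below, under several assumptions that are not from the thread: cores are numbered 0 to 7 as Linux reports them, each core gets its own copy of the build tree in a directory named `<prefix>_<core>` (a hypothetical layout), and `--physcpubind` plus `--localalloc` are the chosen numactl options. The function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Print one pinned "make build" command per core; nothing is executed here.
# $1 = number of cores, $2 = directory prefix of the per-core build trees.
# print_pinned_builds and the <prefix>_<core> layout are hypothetical.
print_pinned_builds() {
    ncores=$1
    prefix=$2
    core=0
    while [ "$core" -lt "$ncores" ]; do
        # --physcpubind pins the process to one CPU;
        # --localalloc allocates memory on the node of that CPU.
        echo "(cd ${prefix}_${core} && numactl --physcpubind=${core} --localalloc make build > out_${core} 2>&1) &"
        core=$(( core + 1 ))
    done
    # Final line makes the generated script wait for all background builds.
    echo "wait"
}

print_pinned_builds 8 ATLAS_build
```

Which core numbers sit on which socket differs between machines, so it is worth checking the output of `numactl --hardware` before trusting any particular numbering.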
|
|
Re: tuning of Atlas on x86/NUMA

On Wed, 25 Jun 2008, Mikhail Kuzminsky wrote:

> I have dual socket quad-core Opteron 2350/2 Ghz - based server.
>
> 1) For single-threaded Atlas
>
> For the current 3.8.2 I may use taskset or/and numactl utilities to
> force processor affinity (and NUMA memory allocation) at the tuning
> phase.
>
> What should I use to be correct ?
>
> I may use taskset/numactl simple for issuiing of the corresponding
> makes.
> Is it necessary to use
>
> taskset -c <CPU_number> make build

The default memory policy is local node allocation (assuming there are free pages), so it may not make any difference. But yeah, for consistency in the timing it's probably best to use something like:

    numactl --membind=0 --cpubind=0 make build

-dean
|
|
Re: tuning of Atlas on x86/NUMA

I installed ATLAS 3.8.2 on my dual-socket quad-core Opteron 2350-based server (i.e. 8 cores per server) using the TSC-based high-resolution timer.

I found that ATLAS recognizes the L2 cache correctly but doesn't "see" the L3 cache (shared by the 4 cores of each CPU). Taking this into account, does it make sense to carry out my proposal of running 4 instances of "make build" simultaneously (for one CPU)? That is, will real sharing of L3 among the "make build" processes have a significant influence on ATLAS performance tuning?

BTW, I built the pthreaded ATLAS libraries and linked (as an example) my Linpack (n=1000) code against them. I expected to use all 8 threads after that, but I see very little performance improvement compared with a sequential run. Are the LAPACK dgetrf/dgetrs routines thread-parallelized in ATLAS? Or maybe I'm using ATLAS wrongly somewhere?

Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

In message from "Mikhail Kuzminsky" <kus@...> (Wed, 25 Jun 2008 22:12:00 +0400):
>2) For pthreaded Atlas
>
>The simplest practical (stupid :-)) ) idea to see on influence of
>cache sharing (L3 for Opteron quad-core) is to prepare some shell
>script where 8 (I have 8 cores) examplars of Atlas tuning will run
>simultaneously.
>
>In that case there will be 8 simultaneous tuning processses whcih
>will share common L3 cache.
>
>Is this "proposal" reasonable for particular case I have ?
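Independently of what ATLAS detects, the cache levels and their sharing as reported by the Linux kernel can be listed from sysfs. The following is a minimal sketch, assuming a Linux system that exposes `/sys/devices/system/cpu/cpu0/cache/`; on systems without those files the loop simply prints nothing. The helper name `count_cpus` is made up for this illustration. It does not change anything about ATLAS tuning; it only shows how many CPUs share each cache.

```shell
#!/bin/sh
# Count the CPUs in a Linux CPU list string such as "0-3" or "0,2,4-7".
# count_cpus is a hypothetical helper name.
count_cpus() {
    list=$1
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                total=$(( total + $hi - $lo + 1 ))
                ;;
            *)
                total=$(( total + 1 ))
                ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# List the caches the kernel reports for cpu0, skipping entries sysfs does not expose.
for dir in /sys/devices/system/cpu/cpu0/cache/index*; do
    [ -r "$dir/level" ] && [ -r "$dir/type" ] && [ -r "$dir/size" ] \
        && [ -r "$dir/shared_cpu_list" ] || continue
    level=$(cat "$dir/level")
    ctype=$(cat "$dir/type")
    size=$(cat "$dir/size")
    shared=$(cat "$dir/shared_cpu_list")
    echo "L$level $ctype: $size, shared by $(count_cpus "$shared") CPU(s) [$shared]"
done
```

On a machine like the one described, one would expect the L3 entry to list four CPUs per socket, but the actual output depends on the kernel and hardware.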