I installed ATLAS 3.8.2 on my dual-socket quad-core Opteron 2350 based
server (i.e. 8 cores per server), using the TSC-based fine-grained
timer.
I found that ATLAS recognizes the L2 cache correctly but "doesn't see"
the L3 cache (shared by the 4 cores of each CPU).
Taking this into account, does it make sense to carry out my proposal
of simultaneously running 4 instances of "make build" per CPU, i.e.
will the real sharing of L3 among the "make build" processes have a
significant influence on ATLAS performance tuning?
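Before binding four tuning runs to one socket, it helps to know which logical CPUs actually share an L3. A minimal sketch, assuming a Linux kernel that exposes the cache topology under /sys/devices/system/cpu/cpu*/cache/index*/ including a shared_cpu_list file (older kernels may only provide shared_cpu_map); the function name l3_groups is made up for illustration:

```shell
#!/bin/sh
# Sketch: print each distinct set of logical CPUs that shares a level-3 cache,
# read from the Linux sysfs cache-topology files. The optional argument
# overrides the sysfs root (useful for testing). Assumes shared_cpu_list
# exists, which may not hold on older kernels.
l3_groups() {
    sysroot=${1:-/sys/devices/system/cpu}
    for idx in "$sysroot"/cpu[0-9]*/cache/index*; do
        # Skip unmatched globs and caches that are not level 3.
        [ -r "$idx/level" ] || continue
        [ "$(cat "$idx/level")" = 3 ] || continue
        cat "$idx/shared_cpu_list"
    done | sort -u
}
```

On the 2350 server one would expect two lines, one per socket (for example 0-3 and 4-7, although the numbering depends on how the kernel enumerates the cores); each line is then a natural core set for four simultaneous "make build" runs.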
BTW, I built the pthreaded ATLAS libraries and, as an example, linked
my Linpack (n=1000) codes with them. I expected all 8 threads to be
used after that, but I see only a very small performance improvement
compared with the sequential run. Are the LAPACK dgetrf/dgetrs routines
thread-parallelized in ATLAS? Or am I using ATLAS wrongly somewhere?
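One way to put a number on "very small performance improvement" is speedup = T_seq / T_threaded and parallel efficiency = speedup / p for p threads. A small sketch; the function names are hypothetical and the inputs are wall-clock timings in seconds that you measure yourself. Note also that with n = 1000 the LU factorization is only about (2/3)n³ ≈ 6.7 × 10⁸ floating-point operations, so it may be worth repeating the comparison at a larger n.

```shell
#!/bin/sh
# Sketch: speedup and parallel efficiency from two wall-clock timings.
#   speedup    = T_sequential / T_threaded
#   efficiency = T_sequential / (T_threaded * threads)
# $1: sequential time, $2: threaded time (seconds)
speedup() {
    awk -v ts="$1" -v tp="$2" 'BEGIN { printf "%.2f\n", ts / tp }'
}

# $1: sequential time, $2: threaded time, $3: number of threads
efficiency() {
    awk -v ts="$1" -v tp="$2" -v p="$3" 'BEGIN { printf "%.2f\n", ts / (tp * p) }'
}
```

For example, speedup 12 3 prints 4.00 and efficiency 12 3 8 prints 0.50.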
Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow
In message from "Mikhail Kuzminsky" <kus@...> (Wed, 25 Jun 2008 22:12:00 +0400):
>I have a dual-socket quad-core Opteron 2350 (2 GHz) based server.
>
>1) For single-threaded ATLAS
>
>For the current 3.8.2 I can use the taskset and/or numactl utilities to
>force processor affinity (and NUMA memory allocation) during the tuning
>phase.
>
>What is the correct way to do this?
>
>I could simply use taskset/numactl when issuing the corresponding
>make commands.
>Is it necessary to use
>
>taskset -c <CPU_number> make build
>
>IMHO, it's reasonable to use taskset/numactl for make check as well,
>right?
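A minimal sketch of binding both phases to the same core, assuming GNU make (for -C) and that BLDDIR is an already configured ATLAS build directory; CPU, BLDDIR, DRY_RUN and the function names are hypothetical. By default it only prints the commands:

```shell
#!/bin/sh
# Sketch: run ATLAS tuning ("make build") and testing ("make check") with the
# same CPU affinity. CPU, BLDDIR and DRY_RUN are hypothetical settings.
CPU=${CPU:-0}
BLDDIR=${BLDDIR:-.}
DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

pinned_build_and_check() {
    # taskset -c sets the affinity list; processes spawned by make inherit it.
    # numactl --physcpubind="$CPU" --localalloc would also keep memory local.
    run taskset -c "$CPU" make -C "$BLDDIR" build &&
        run taskset -c "$CPU" make -C "$BLDDIR" check
}
```

With CPU=2, BLDDIR pointing at a configured build directory and DRY_RUN=0 set before the sketch is sourced, calling pinned_build_and_check executes the two make steps instead of printing them.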
>
>BTW, I believe I should also use -D c -DPentiumCPS=2000, right?
>
>That would make it possible to see the influence of NUMA/CPU affinity.
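The timer flags go on the configure line. A sketch under the assumption that the ATLAS source tree is at a path you pass in (the path and function names are hypothetical); cpu_mhz reads the first "cpu MHz" entry of /proc/cpuinfo and rounds it. If frequency scaling (PowerNow/cpufreq) is active, the reported rate can be below the nominal 2000 MHz of the Opteron 2350, in which case the nominal value is the safer choice.

```shell
#!/bin/sh
# Sketch: clock rate in MHz for -DPentiumCPS, taken from the first
# "cpu MHz" line of /proc/cpuinfo (or of the file given as $1), rounded.
cpu_mhz() {
    awk -F: '/^cpu MHz/ { printf "%d\n", $2 + 0.5; exit }' "${1:-/proc/cpuinfo}"
}

# Print the configure command selecting the cycle-accurate x86 timer.
# $1: path to the ATLAS source tree (hypothetical), $2: clock rate in MHz
atlas_configure_cmd() {
    echo "$1/configure -D c -DPentiumCPS=$2"
}
```

For example, atlas_configure_cmd ../ATLAS "$(cpu_mhz)" prints the line to run from inside the build directory.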
>
>I could insert taskset/numactl somewhere "more precisely", but I don't
>know where.
>
>2) For pthreaded ATLAS
>
>The simplest practical (stupid :-)) idea for seeing the influence of
>cache sharing (the L3 on the quad-core Opteron) is to prepare a shell
>script in which 8 instances of ATLAS tuning (I have 8 cores) run
>simultaneously.
>
>i.e. something like
>#! /bin/sh
>numactl <parameters_for_core 1> make build > out_1 2>&1 &
>numactl <parameters_for_core 2> make build > out_2 2>&1 &
>...
>numactl <parameters_for_core 8> make build > out_8 2>&1 &
>wait
>
>Since the build time is relatively long, I think I can neglect the
>small differences in start time.
>I'll then need 8 copies of the ATLAS directory tree :-)
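The per-core copies and the simultaneous start can be sketched as below, assuming each hypothetical $TOPDIR/bld_<n> directory has already been configured, and that the core numbers passed in are Linux logical CPU numbers (for four runs on one socket, pass only that socket's cores). numactl --physcpubind pins each run and --localalloc keeps its memory on the local node; DRY_RUN=1 (the default) only prints the commands:

```shell
#!/bin/sh
# Sketch: one separate ATLAS build tree per core, each tuning run bound to its
# own core with node-local memory, all started together.
# The bld_<n> naming and DRY_RUN switch are hypothetical.
DRY_RUN=${DRY_RUN:-1}

# $1: top directory holding bld_<n>; remaining args: core numbers
launch_all() {
    topdir=$1
    shift
    for core in "$@"; do
        dir="$topdir/bld_$core"
        if [ "$DRY_RUN" = 1 ]; then
            echo "cd $dir && numactl --physcpubind=$core --localalloc make build > out_$core 2>&1"
        else
            (cd "$dir" && numactl --physcpubind="$core" --localalloc make build > "out_$core" 2>&1) &
        fi
    done
    # Wait for every background tuning run before returning.
    wait
}
```

For instance, with DRY_RUN=0, launch_all "$HOME/atlas" 0 1 2 3 starts four runs pinned to cores 0 to 3.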
>
>In that case there will be 8 simultaneous tuning processes, with the
>4 processes on each socket sharing that socket's L3 cache.
>
>Is this "proposal" reasonable for my particular case?
>(Of course, it would be better to insert numactl somewhere more "precisely".)
>
>FYI: I'm especially interested in dgemm, which consumes most of the CPU
>time of some of our applications.
>
>Yours
>Mikhail Kuzminsky
>Computer Assistance to Chemical Research Center
>Zelinsky Institute of Organic Chemistry
>Moscow
>
>-------------------------------------------------------------------------
>Check out the new SourceForge.net Marketplace.
>It's the best place to buy or sell services for
>just about anything Open Source.
>
>http://sourceforge.net/services/buy/index.php
>_______________________________________________
>Math-atlas-devel mailing list
>Math-atlas-devel@...
>https://lists.sourceforge.net/lists/listinfo/math-atlas-devel