k10h post-BIOS patch effects

Guys,

One of the things due for 3.9 is new kernels for Core2Duo and AMD Phenom/3rdgenOpteron. Since my Phenom continues to randomly slow down to half speed, I recently returned to the OpteronK10h for my first timings since applying the BIOS patch to fix the k10h TLB errata. It is not pretty!

My timings indicate as much as a 20% drop in performance in **GEMM**, which is not really dominated by memory costs. Level 2 performance dropped something more like 30%. The funny part was the AMD guy reassured me the performance effects had been way overblown by fringe internet weirdos . . .

Anyway, the good news is that the machine no longer crashes once a day (apparently a large threaded DGEMM is just as good as virtualization at crashing a processor with this errata), the bad news is that the performance is horrible. I now own two AMD K10h, neither of which I can trust for tuning. Fabulous.

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Math-atlas-devel mailing list
Math-atlas-devel@...
https://lists.sourceforge.net/lists/listinfo/math-atlas-devel
|
|
Re: k10h post-BIOS patch effects

In message from Clint Whaley <whaley@...> (Wed, 16 Jul 2008 09:34:48 -0500):
>Guys,
>
>One of the things due for 3.9 is new kernels for Core2Duo and AMD
>Phenom/3rdgenOpteron. Since my Phenom continues to randomly slow down
>to half speed,

Which Linux distro do you use? You should kill all the "power-sensual" daemons (like powersaved in SuSE) and remove the corresponding kernel daemons like cpufreq.

> I recently returned to the OpteronK10h for my first timings
>since applying the BIOS patch to fix the k10h TLB errata. It is not
>pretty!
>
>My timings indicate as much as a 20% drop in performance in **GEMM**,
>which is not really dominated by memory costs. Level 2 performance dropped
>something more like 30%. The funny part was the AMD guy reassured me
>the performance effects had been way overblown by fringe internet weirdos
>. . .

BIOS patch was declared as leading to some performance degradation. The better choice is to patch Linux kernel (the patch was published on AMD x86-64 electronic conference) - it must give minor performance decrease.

Yours
Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow

>Anyway, the good news is that the machine no longer crashes once a day
>(apparently a large threaded DGEMM is just as good as virtualization at
>crashing a processor with this errata), the bad news is that the
>performance is horrible. I now own two AMD K10h, neither of which I can
>trust for tuning.
>Fabulous.
>
>Cheers,
>Clint
|
|
Re: k10h post-BIOS patch effects

Guys,

>Which Linux distro do you use ? You should kill all the
>"power-sensual" daemons (like powersaved in SuSE) and remove the
>corresponding kernel daemons like cpufreq.

I tried two linux distros: kubuntu hardy heron & Fedora Core 9. FC9 does it a *lot* less than kubuntu, but it still does it. I have turned off "cool & quiet" in the bios, and cpuinfo shows full speed even as my timings drop by half. I verified that cpufreq doesn't work after the BIOS turnoff (the scaling directories are missing from ACPI).

I more & more suspect the problem is in the motherboard. Dean (I think it was) mentioned that thermal throttling is broken in the Phenom; I wonder if the mobo assumes it works and does some voltage things it can't handle in response to OS calls.

Whatever it is, it is affected by OS, so it is not pure hardware. But, I wonder if the OS sends some signal that the mobo should ignore, but instead attempts something the k10h can't do . . .

Anyway, if anyone can tell me OS & mobo combinations that they have seen work for the Phenom, I'd appreciate it.

>BIOS patch was declared as leading to some performance degradation.

Yeah, but I did not expect that massive die-off for a cache-dominated algorithm like GEMM. For HPC, the slowdown is massive and pervasive, but the TLB bug is triggered daily.

>The better choice is to patch Linux kernel (the patch was published on
>AMD x86-64 electronic conference) - it must give minor performance
>decrease.

Last time I checked, this was not in the standard linux kernel, or even a supported patch, but just some example code on some mailing list, where the AMD guy says, "I wouldn't use this if I were you" . . .

Cheers,
Clint
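The check described above (whether any cpufreq scaling directories are still exposed once Cool'n'Quiet is off) can be sketched as a small shell function. This is only an illustration: `count_cpufreq_dirs` is a hypothetical helper name, and the default path assumes the usual Linux sysfs layout, which may differ on a given kernel or distro.

```shell
#!/bin/sh
# count_cpufreq_dirs [ROOT]
# Print how many cpuN entries under ROOT contain a "cpufreq" directory.
# ROOT defaults to the usual sysfs location (an assumption about the system);
# passing another path makes the function easy to try on a scratch tree.
count_cpufreq_dirs() {
    root="${1:-/sys/devices/system/cpu}"
    n=0
    for d in "$root"/cpu[0-9]*/cpufreq; do
        # An unmatched glob stays literal, so the -d test also covers "no match".
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

A result of 0 from `count_cpufreq_dirs` would be consistent with frequency scaling being unavailable to the OS; the reported clock can be compared separately with `grep MHz /proc/cpuinfo`.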
|
|
Re: atlas-3.9.0

I just installed 3.9.0 version on Opteron 2350, but "make ptcheck" gives me:

gfortran -fomit-frame-pointer -mfpmath=sse -msse3 -O2 -falign-loops=32 -m64 -o xsinvtst_pt sinvtst_pt.o \
   /usr/local/atlas_390_opteron235x/lib/libtstatlas.a /usr/local/atlas_390_opteron235x/lib/liblapack.a /usr/local/atlas_390_opteron235x/lib/libptcblas.a /usr/local/atlas_390_opteron235x/lib/libptf77blas.a \
   /usr/local/atlas_390_opteron235x/lib/libatlas.a -lpthread -lm
/usr/local/atlas_390_opteron235x/lib/libatlas.a(ATL_ptflushcache.o): In function `ATL_ptFlushAreasByCL':
ATL_ptflushcache.c:(.text+0xc8): undefined reference to `ATL_FlushAreaByCL'
collect2: ld returned 1 exit status
make[3]: *** [xsinvtst_pt] Error 1
make[3]: Leaving directory `/usr/local/atlas_390_opteron235x/bin'
make[2]: *** [ptsanity_test] Error 2
make[2]: Leaving directory `/usr/local/atlas_390_opteron235x/bin'
make[1]: *** [ptsanity_test] Error 2
make[1]: Leaving directory `/usr/local/atlas_390_opteron235x'
make: *** [pttest] Error 2

BTW, what is default prefix value for configure - /usr/local/atlas or /usr/local/ATLAS ?

Mikhail Kuzminsky
Computer Assistance to Chemical Research Center
Zelinsky Institute of Organic Chemistry
Moscow
|
|
Re: atlas-3.9.0

>but "make ptcheck"
>gives me:
>ATL_ptflushcache.c:(.text+0xc8): undefined reference to
>`ATL_FlushAreaByCL'

Had two errors in this routine. See:
   https://sourceforge.net/tracker/index.php?func=detail&aid=2021878&group_id=23725&atid=379482

>BTW, what is default prefix value for configure-
>/usr/local/atlas or /usr/local/ATLAS ?

/usr/local/atlas

Cheers,
Clint
|
|
Re: atlas-3.9.0

In message from Clint Whaley <whaley@...> (Fri, 18 Jul 2008 17:20:58 -0500):
>>BTW, what is default prefix value for configure-
>>/usr/local/atlas or /usr/local/ATLAS ?
>
>/usr/local/atlas

Thanks! Then is it possible to execute 4 simultaneous "make build" w/ONE "shared" source directory tree and w/4 different target directories, i.e. something like

#! /bin/bash
# target directories are /home/local/atlas1 etc
# I assume that 4 configuration steps for each target tree were
# performed before this run
#
echo "start"
(cd /home/local/atlas1; numactl --membind=0 --cpunodebind=0 make build 2>&1 > makebuild_1.log &)
(cd /home/local/atlas2; numactl --membind=0 --cpunodebind=0 make build 2>&1 > makebuild_1.log &)
(cd /home/local/atlas3; numactl --membind=0 --cpunodebind=0 make build 2>&1 > makebuild_1.log &)
(cd /home/local/atlas4; numactl --membind=0 --cpunodebind=0 make build 2>&1 > makebuild_1.log &)
echo "finish"

- or I'll need also to have *4* source dir trees ?

Yours
Mikhail
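A side note on the redirections in the script above: in a POSIX shell, redirections are processed left to right, so `2>&1 > file` and `> file 2>&1` do different things. The sketch below shows the difference with a hypothetical stand-in command (`noisy`, not part of ATLAS or make) that writes one line to each stream. Whether the original script intended stderr to land in the log is not stated in the message.

```shell
#!/bin/sh
# Hypothetical stand-in for "make build": one line on stdout, one on stderr.
noisy() {
    echo "line on stdout"
    echo "line on stderr" >&2
}

# Same order as the script in the message: stderr is first pointed at the
# *current* stdout (e.g. the terminal), then stdout alone goes to the file.
run_err_first() {
    noisy 2>&1 > "$1"
}

# Reversed order: stdout goes to the file first, then stderr follows it there.
run_out_first() {
    noisy > "$1" 2>&1
}
```

Calling `run_err_first a.log` leaves only the stdout line in `a.log`, while `run_out_first b.log` captures both lines. Separately, each subshell in the script changes into its own directory, so the identical log name refers to four different files; all four builds are, however, bound to NUMA node 0 as written.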
|
|
Re: k10h post-BIOS patch effects

On Wed, 16 Jul 2008, Clint Whaley wrote:

> Guys,
>
> >Which Linux distro do you use ? You should kill all the
> >"power-sensual" daemons (like powersaved in SuSE) and remove the
> >corresponding kernel daemons like cpufreq.
>
> I tried two linux distros: kubuntu hardy heron & Fedora Core 9. FC9 does
> it a *lot* less than kubuntu, but it still does it. I have turned off
> "cool & quiet" in the bios, and cpuinfo shows full speed even as my
> timings drop by half. I verified that cpufreq doesn't work after the BIOS
> turnoff (the scaling directiries are missing from ACPI).
>
> I more & more suspect the problem is in the motherboard. Dean (I think it was)
> mentioned that thermal throttling is broken in the Phenom; I wonder if
> the mobo assumes it works and does some voltage things it can't handle
> in response to OS calls.

hmm that could be possible ... a few choices for figuring this out -- for reference, fam10h BKDG:

   http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116.PDF

and fam10h revision guide:

   http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF

try this:

   setpci -d 1022:1204 64.l

that should print out the "F3x64 Hardware Thermal Control (HTC) Register" ... if bit 0 is non-zero then HTC is enabled. try disabling it like so:

   setpci -d 1022:1204 64.l=0

> Whatever it is, it is affected by OS, so it is not pure hardware. But,
> I wonder if the OS sends some signal that the mobo should ignore,
> but instead attempts something the k10h can't do . . .
>
> Anyway, if anyone can tell me OS & mobo combinations that they have seen
> work for the Phenom, I'd appreciate it.

it's been a while since i've built atlas -- but i'll give it a spin on my phenom and report back. 3.9.0 is good enough?

> >BIOS patch was declared as leading to some performance degradation.
>
> Yeah, but I did not expect that massive die-off for a cache-dominated
> algorithm like GEMM. For HPC, the slowdown is massive and pervasive,
> but the TLB bug is triggered daily.

are you sure it's the TLB bug? in lots of testing i've never tripped the erratum 298 problem.

if you want to experiment with the workarounds, build http://code.google.com/p/iotools/ and put it into your PATH. then execute a script something like this:

for cpu in `awk '/^processor/ {print $3}' /proc/cpuinfo`; do
  # disable erratum 298 workaround
  wrmsr $cpu 0xc0010015 $(and $(rdmsr $cpu 0xc0010015) $(not $(shl 1 3)))
  wrmsr $cpu 0xc0011023 $(and $(rdmsr $cpu 0xc0011023) $(not $(shl 1 1)))
  # disable erratum 309 workaround
  wrmsr $cpu 0xc0011023 $(and $(rdmsr $cpu 0xc0011023) $(not $(shl 1 23)))
done

you can get more info on both workarounds from the revision guide above.

-dean
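Each `wrmsr` line in the script above is a read-modify-write that clears a single bit. A minimal sketch of just that arithmetic, using plain shell arithmetic, is given here. It assumes that iotools' `and`, `not`, and `shl` are ordinary bitwise AND, NOT, and shift-left (that reading is an assumption), and `clear_bit` is a hypothetical helper name, not an iotools command. Nothing here reads or writes an MSR.

```shell
#!/bin/sh
# clear_bit VALUE BIT
# Print VALUE with bit number BIT cleared, as a 16-digit hex string.
# Mirrors the shape of: and VALUE $(not $(shl 1 BIT))
clear_bit() {
    printf '0x%016x\n' $(( $1 & ~(1 << $2) ))
}

# Example with a made-up register value (not a measurement):
# 0x18 has bits 3 and 4 set; clearing bit 3 leaves only bit 4.
clear_bit 0x18 3
```

Clearing a bit that is already zero leaves the value unchanged, so the script is harmless to re-run on a machine where the workaround bits were never set.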
|
|
Re: k10h post-BIOS patch effects

on a phenom:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : AMD Phenom(tm) 9600 Quad-Core Processor
stepping        : 2
cpu MHz         : 2306.997

in a M3A32-MVP DELUXE mobo ... whose bios info i can describe only as:

        Vendor: American Megatrends Inc.
        Version: 0801

(based on dmidecode)

it's running ubuntu feisty server (and powernow/etc aren't loaded)

i get the following results.

-dean

*******************************************************************************
*******************************************************************************
*******************************************************************************
*       BEGAN ATLAS3.9.0 INSTALL OF SECTION 0-0-0 ON 07/19/2008 AT 09:50      *
*******************************************************************************
*******************************************************************************
*******************************************************************************

IN STAGE 1 INSTALL:  SYSTEM PROBE/AUX COMPILE
   Level 1 cache size calculated as 64KB.

   dFPU: Separate multiply and add instructions with 4 cycle pipeline.
         Apparent number of registers : 13
         Register-register performance=4511.70MFLOPS
   sFPU: Separate multiply and add instructions with 4 cycle pipeline.
         Apparent number of registers : 13
         Register-register performance=4511.70MFLOPS

IN STAGE 2 INSTALL:  TYPE-DEPENDENT TUNING

STAGE 2-1: TUNING PREC='d' (precision 1 of 4)

   STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm8x1x120_L1pf.c, NB=40, written by R. Clint Whaley
      Performance: 8057.51MFLOPS (349.42 percent of of detected clock rate)
        (Gen case got 3928.97MFLOPS)
      mmNN   : ma=0, lat=5, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1
               Performance = 3839.81 (47.66 of copy matmul, 166.51 of clock)
      mmNT   : ma=0, lat=6, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1
               Performance = 3291.98 (40.86 of copy matmul, 142.76 of clock)
      mmTN   : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1
               Performance = 3799.57 (47.16 of copy matmul, 164.77 of clock)
      mmTT   : ma=0, lat=2, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1
               Performance = 3296.25 (40.91 of copy matmul, 142.94 of clock)

   STAGE 2-1-2: CacheEdge DETECTION
      CacheEdge set to 3145728 bytes

   STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION

   STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION
      done.

   STAGE 2-1-4: LEVEL 3 BLAS TUNE
      done.

   STAGE 2-1-5: GEMV TUNE
      gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 100 percent of L1
              Performance = 1394.39 (17.31 of copy matmul, 60.47 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 100 percent of L1
              Performance = 1374.56 (17.06 of copy matmul, 59.61 of clock)

   STAGE 2-1-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.51 percent of L1 Cache
            Performance = 809.66 (10.05 of copy matmul, 35.11 of clock)

STAGE 2-2: TUNING PREC='s' (precision 2 of 4)

   STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm6x1x120_sse.c, NB=120, written by R. Clint Whaley
      Performance: 15012.39MFLOPS (651.01 percent of of detected clock rate)
        (Gen case got 4435.46MFLOPS)
      mmNN   : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3834.01 (25.54 of copy matmul, 166.26 of clock)
      mmNT   : ma=0, lat=2, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3370.67 (22.45 of copy matmul, 146.17 of clock)
      mmTN   : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3959.61 (26.38 of copy matmul, 171.71 of clock)
      mmTT   : ma=0, lat=3, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3486.42 (23.22 of copy matmul, 151.19 of clock)

   STAGE 2-2-2: CacheEdge DETECTION
      CacheEdge set to 3145728 bytes

   STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION

   STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION
      done.

   STAGE 2-2-4: LEVEL 3 BLAS TUNE
      done.

   STAGE 2-2-5: GEMV TUNE
      gemvN : chose routine 9:ATL_gemvN_32x4_1.c written by R. Clint Whaley
              Yunroll=32, Xunroll=4, using 100 percent of L1
              Performance = 1761.79 (11.74 of copy matmul, 76.40 of clock)
      gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley
              Yunroll=2, Xunroll=16, using 100 percent of L1
              Performance = 1984.77 (13.22 of copy matmul, 86.07 of clock)

   STAGE 2-2-6: GER TUNE
      ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 1.00 percent of L1 Cache
            Performance = 1323.34 ( 8.81 of copy matmul, 57.39 of clock)

STAGE 2-3: TUNING PREC='z' (precision 3 of 4)

   STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_dmm14x1x56_sse2pABC.c, NB=56, written by R. Clint Whaley
      Performance: 7856.61MFLOPS (340.70 percent of of detected clock rate)
        (Gen case got 4166.97MFLOPS)
      mmNN   : ma=0, lat=4, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1
               Performance = 3946.90 (50.24 of copy matmul, 171.16 of clock)
      mmNT   : ma=0, lat=8, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1
               Performance = 3589.62 (45.69 of copy matmul, 155.66 of clock)
      mmTN   : ma=0, lat=2, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1
               Performance = 3959.21 (50.39 of copy matmul, 171.69 of clock)
      mmTT   : ma=0, lat=4, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1
               Performance = 3599.80 (45.82 of copy matmul, 156.11 of clock)

   STAGE 2-3-2: CacheEdge DETECTION
      CacheEdge set to 3145728 bytes
      zdNKB set to 0 bytes

   STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION

   STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION
      done.

   STAGE 2-3-4: LEVEL 3 BLAS TUNE
      done.

   STAGE 2-3-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 99 percent of L1
              Performance = 2835.03 (36.08 of copy matmul, 122.94 of clock)
      gemvT : chose routine 102:ATL_cgemvT_2x2_0.c written by R. Clint Whaley
              Yunroll=2, Xunroll=8, using 99 percent of L1
              Performance = 2116.02 (26.93 of copy matmul, 91.76 of clock)

   STAGE 2-3-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.76 percent of L1 Cache
            Performance = 1609.07 (20.48 of copy matmul, 69.78 of clock)

STAGE 2-4: TUNING PREC='c' (precision 4 of 4)

   STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE
      The best matmul kernel was ATL_smm6x1x120_sse.c, NB=120, written by R. Clint Whaley
      Performance: 14625.27MFLOPS (634.23 percent of of detected clock rate)
        (Gen case got 4415.67MFLOPS)
      mmNN   : ma=0, lat=8, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3934.66 (26.90 of copy matmul, 170.63 of clock)
      mmNT   : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3615.18 (24.72 of copy matmul, 156.77 of clock)
      mmTN   : ma=0, lat=5, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3953.05 (27.03 of copy matmul, 171.42 of clock)
      mmTT   : ma=0, lat=5, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1
               Performance = 3678.05 (25.15 of copy matmul, 159.50 of clock)

   STAGE 2-4-2: CacheEdge DETECTION
      CacheEdge set to 3145728 bytes
      csNKB set to 0 bytes

   STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION

   STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION
      done.

   STAGE 2-4-4: LEVEL 3 BLAS TUNE
      done.

   STAGE 2-4-5: GEMV TUNE
      gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley
              Yunroll=32, Xunroll=1, using 86 percent of L1
              Performance = 5542.02 (37.89 of copy matmul, 240.33 of clock)
      gemvT : chose routine 102:ATL_cgemvT_2x2_0.c written by R. Clint Whaley
              Yunroll=2, Xunroll=8, using 86 percent of L1
              Performance = 2548.30 (17.42 of copy matmul, 110.51 of clock)

   STAGE 2-4-6: GER TUNE
      ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley
            mu=16, nu=1, using 0.75 percent of L1 Cache
            Performance = 3173.71 (21.70 of copy matmul, 137.63 of clock)

STAGE 3: GENERAL LIBRARY BUILD

STAGE 4: POST-BUILD TUNING
   done.

   STAGE 4-2: Threading install
      done.

*******************************************************************************
*******************************************************************************
*******************************************************************************
*      FINISHED ATLAS3.9.0 INSTALL OF SECTION 0-0-0 ON 07/19/2008 AT 10:02    *
*******************************************************************************
*******************************************************************************
*******************************************************************************
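For readers comparing numbers in the log above: the "of copy matmul" figures appear to be each case's MFLOPS expressed as a percentage of the best kernel's MFLOPS for that precision. That interpretation is inferred from the numbers in this log, not taken from ATLAS documentation. A small sketch to reproduce one of them, with `pct` as a hypothetical helper name:

```shell
#!/bin/sh
# pct PART WHOLE
# Print PART as a percentage of WHOLE, rounded to two decimals.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f\n", 100 * part / whole }'
}

# Double precision, from the log: mmNN got 3839.81 MFLOPS and the best
# kernel got 8057.51 MFLOPS; the log reports 47.66 for this case.
pct 3839.81 8057.51
```

The same ratio applied to the 'z' block (3946.90 against 7856.61) gives the 50.24 reported there, which is what suggests this reading.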
|
|
Re: k10h post-BIOS patch effectsbtw i should add:
# for cpu in `awk '/^processor/ {print $3}' /proc/cpuinfo`; do (echo $cpu; rdmsr $cpu 0xc0010015; rdmsr $cpu 0xc0011023) | fmt -w1000; done 0 0x0000000001000010 0x0000000000200020 1 0x0000000001000010 0x0000000000200020 2 0x0000000001000010 0x0000000000200020 3 0x0000000001000010 0x0000000000200020 so neither errata 298 nor 309 were enabled... and this is a B2 part... # setpci -d 1022:1204 64.l 00000000 nor was HTC. -dean On Sat, 19 Jul 2008, dean gaudet wrote: > on a phenom: > > processor : 0 > vendor_id : AuthenticAMD > cpu family : 16 > model : 2 > model name : AMD Phenom(tm) 9600 Quad-Core Processor > stepping : 2 > cpu MHz : 2306.997 > > in a M3A32-MVP DELUXE mobo ... whose bios info i can describe only as: > > Vendor: American Megatrends Inc. > Version: 0801 > > (based on dmidecode) > > it's running ubuntu feisty server (and powernow/etc aren't loaded) > > i get the following results. > > -dean > > > ******************************************************************************* > ******************************************************************************* > ******************************************************************************* > * BEGAN ATLAS3.9.0 INSTALL OF SECTION 0-0-0 ON 07/19/2008 AT 09:50 * > ******************************************************************************* > ******************************************************************************* > ******************************************************************************* > > > > > > IN STAGE 1 INSTALL: SYSTEM PROBE/AUX COMPILE > Level 1 cache size calculated as 64KB. > > dFPU: Separate multiply and add instructions with 4 cycle pipeline. > Apparent number of registers : 13 > Register-register performance=4511.70MFLOPS > sFPU: Separate multiply and add instructions with 4 cycle pipeline. 
> Apparent number of registers : 13 > Register-register performance=4511.70MFLOPS > > > IN STAGE 2 INSTALL: TYPE-DEPENDENT TUNING > > > STAGE 2-1: TUNING PREC='d' (precision 1 of 4) > > > STAGE 2-1-1 : BUILDING BLOCK MATMUL TUNE > The best matmul kernel was ATL_dmm8x1x120_L1pf.c, NB=40, written by R. Clint Whaley > Performance: 8057.51MFLOPS (349.42 percent of of detected clock rate) > (Gen case got 3928.97MFLOPS) > mmNN : ma=0, lat=5, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1 > Performance = 3839.81 (47.66 of copy matmul, 166.51 of clock) > mmNT : ma=0, lat=6, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1 > Performance = 3291.98 (40.86 of copy matmul, 142.76 of clock) > mmTN : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1 > Performance = 3799.57 (47.16 of copy matmul, 164.77 of clock) > mmTT : ma=0, lat=2, nb=40, mu=12, nu=1 ku=40, ff=0, if=12, nf=1 > Performance = 3296.25 (40.91 of copy matmul, 142.94 of clock) > > > > STAGE 2-1-2: CacheEdge DETECTION > CacheEdge set to 3145728 bytes > > > STAGE 2-1-3: LARGE/SMALL CASE CROSSOVER DETECTION > > > STAGE 2-1-3: COPY/NO-COPY CROSSOVER DETECTION > done. > > > STAGE 2-1-4: LEVEL 3 BLAS TUNE > done. > > > STAGE 2-1-5: GEMV TUNE > gemvN : chose routine 3:ATL_gemvN_1x1_1a.c written by R. Clint Whaley > Yunroll=32, Xunroll=1, using 100 percent of L1 > Performance = 1394.39 (17.31 of copy matmul, 60.47 of clock) > gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley > Yunroll=2, Xunroll=16, using 100 percent of L1 > Performance = 1374.56 (17.06 of copy matmul, 59.61 of clock) > > > STAGE 2-1-6: GER TUNE > ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley > mu=16, nu=1, using 0.51 percent of L1 Cache > Performance = 809.66 (10.05 of copy matmul, 35.11 of clock) > > > STAGE 2-2: TUNING PREC='s' (precision 2 of 4) > > > STAGE 2-2-1 : BUILDING BLOCK MATMUL TUNE > The best matmul kernel was ATL_smm6x1x120_sse.c, NB=120, written by R. 
Clint Whaley > Performance: 15012.39MFLOPS (651.01 percent of of detected clock rate) > (Gen case got 4435.46MFLOPS) > mmNN : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3834.01 (25.54 of copy matmul, 166.26 of clock) > mmNT : ma=0, lat=2, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3370.67 (22.45 of copy matmul, 146.17 of clock) > mmTN : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3959.61 (26.38 of copy matmul, 171.71 of clock) > mmTT : ma=0, lat=3, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3486.42 (23.22 of copy matmul, 151.19 of clock) > > > > STAGE 2-2-2: CacheEdge DETECTION > CacheEdge set to 3145728 bytes > > > STAGE 2-2-3: LARGE/SMALL CASE CROSSOVER DETECTION > > > STAGE 2-2-3: COPY/NO-COPY CROSSOVER DETECTION > done. > > > STAGE 2-2-4: LEVEL 3 BLAS TUNE > done. > > > STAGE 2-2-5: GEMV TUNE > gemvN : chose routine 9:ATL_gemvN_32x4_1.c written by R. Clint Whaley > Yunroll=32, Xunroll=4, using 100 percent of L1 > Performance = 1761.79 (11.74 of copy matmul, 76.40 of clock) > gemvT : chose routine 105:ATL_gemvT_2x16_1.c written by R. Clint Whaley > Yunroll=2, Xunroll=16, using 100 percent of L1 > Performance = 1984.77 (13.22 of copy matmul, 86.07 of clock) > > > STAGE 2-2-6: GER TUNE > ger : chose routine 1:ATL_ger1_axpy.c written by R. Clint Whaley > mu=16, nu=1, using 1.00 percent of L1 Cache > Performance = 1323.34 ( 8.81 of copy matmul, 57.39 of clock) > > > STAGE 2-3: TUNING PREC='z' (precision 3 of 4) > > > STAGE 2-3-1 : BUILDING BLOCK MATMUL TUNE > The best matmul kernel was ATL_dmm14x1x56_sse2pABC.c, NB=56, written by R. 
Clint Whaley > Performance: 7856.61MFLOPS (340.70 percent of of detected clock rate) > (Gen case got 4166.97MFLOPS) > mmNN : ma=0, lat=4, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1 > Performance = 3946.90 (50.24 of copy matmul, 171.16 of clock) > mmNT : ma=0, lat=8, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1 > Performance = 3589.62 (45.69 of copy matmul, 155.66 of clock) > mmTN : ma=0, lat=2, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1 > Performance = 3959.21 (50.39 of copy matmul, 171.69 of clock) > mmTT : ma=0, lat=4, nb=40, mu=8, nu=1 ku=40, ff=0, if=9, nf=1 > Performance = 3599.80 (45.82 of copy matmul, 156.11 of clock) > > > > STAGE 2-3-2: CacheEdge DETECTION > CacheEdge set to 3145728 bytes > zdNKB set to 0 bytes > > > STAGE 2-3-3: LARGE/SMALL CASE CROSSOVER DETECTION > > > STAGE 2-3-3: COPY/NO-COPY CROSSOVER DETECTION > done. > > > STAGE 2-3-4: LEVEL 3 BLAS TUNE > done. > > > STAGE 2-3-5: GEMV TUNE > gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley > Yunroll=32, Xunroll=1, using 99 percent of L1 > Performance = 2835.03 (36.08 of copy matmul, 122.94 of clock) > gemvT : chose routine 102:ATL_cgemvT_2x2_0.c written by R. Clint Whaley > Yunroll=2, Xunroll=8, using 99 percent of L1 > Performance = 2116.02 (26.93 of copy matmul, 91.76 of clock) > > > STAGE 2-3-6: GER TUNE > ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley > mu=16, nu=1, using 0.76 percent of L1 Cache > Performance = 1609.07 (20.48 of copy matmul, 69.78 of clock) > > > STAGE 2-4: TUNING PREC='c' (precision 4 of 4) > > > STAGE 2-4-1 : BUILDING BLOCK MATMUL TUNE > The best matmul kernel was ATL_smm6x1x120_sse.c, NB=120, written by R. 
Clint Whaley > Performance: 14625.27MFLOPS (634.23 percent of of detected clock rate) > (Gen case got 4415.67MFLOPS) > mmNN : ma=0, lat=8, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3934.66 (26.90 of copy matmul, 170.63 of clock) > mmNT : ma=0, lat=4, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3615.18 (24.72 of copy matmul, 156.77 of clock) > mmTN : ma=0, lat=5, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3953.05 (27.03 of copy matmul, 171.42 of clock) > mmTT : ma=0, lat=5, nb=40, mu=12, nu=1 ku=40, ff=1, if=13, nf=1 > Performance = 3678.05 (25.15 of copy matmul, 159.50 of clock) > > > > STAGE 2-4-2: CacheEdge DETECTION > CacheEdge set to 3145728 bytes > csNKB set to 0 bytes > > > STAGE 2-4-3: LARGE/SMALL CASE CROSSOVER DETECTION > > > STAGE 2-4-3: COPY/NO-COPY CROSSOVER DETECTION > done. > > > STAGE 2-4-4: LEVEL 3 BLAS TUNE > done. > > > STAGE 2-4-5: GEMV TUNE > gemvN : chose routine 3:ATL_cgemvN_1x1_1a.c written by R. Clint Whaley > Yunroll=32, Xunroll=1, using 86 percent of L1 > Performance = 5542.02 (37.89 of copy matmul, 240.33 of clock) > gemvT : chose routine 102:ATL_cgemvT_2x2_0.c written by R. Clint Whaley > Yunroll=2, Xunroll=8, using 86 percent of L1 > Performance = 2548.30 (17.42 of copy matmul, 110.51 of clock) > > > STAGE 2-4-6: GER TUNE > ger : chose routine 1:ATL_cger1_axpy.c written by R. Clint Whaley > mu=16, nu=1, using 0.75 percent of L1 Cache > Performance = 3173.71 (21.70 of copy matmul, 137.63 of clock) > > > STAGE 3: GENERAL LIBRARY BUILD > > > STAGE 4: POST-BUILD TUNING > done. > > > STAGE 4-2: Threading install > done. 
>
> *******************************************************************************
> *******************************************************************************
> *******************************************************************************
> * FINISHED ATLAS3.9.0 INSTALL OF SECTION 0-0-0 ON 07/19/2008 AT 10:02 *
> *******************************************************************************
> *******************************************************************************
> *******************************************************************************
|
|
Re: k10h post-BIOS patch effects

Dean (& guys),

OK, here are a few things. First, there is a modified xdfc available at:
   www.cs.utsa.edu/~whaley/dload/xdfc
It is my normal kernel timer, which has been modified to keep calling the K10h
kernel 1K times. When I run this on my Phenom, most of the numbers are roughly
8Gflop, but then it drops to 4Gflop for a lot of them. Can anyone with a
Phenom run this executable and make sure yours doesn't do this to you too
(i.e. the perf drop happens rarely enough that it can be missed)?

If this executable won't work for you (e.g., different libraries) you can
create it yourself by changing line 634 of ATLAS/tune/blas/gemm/fc.c from:
   #define NSAMPLE 3
to:
   #define NSAMPLE 1024
And then issuing (in $BLDdir/tune/blas/gemm):
   make ummcase pre=d DMCFLAGS="-x assembler-with-cpp" \
        mmrout=CASES/ATL_dmm8x1x120_L1pf.c nb=40

>try this:
>setpci -d 1022:1204 64.l

bit 0 was zero for me :(

>setpci -d 1022:1204 64.l=0

did this (despite above), and ./xdfc still behaves same way

>> Yeah, but I did not expect that massive die-off for a cache-dominated
>> algorithm like GEMM. For HPC, the slowdown is massive and pervasive,
>> but the TLB bug is triggered daily.
>
>are you sure it's the TLB bug? in lots of testing i've never tripped the
>erratum 298 problem.

No, but we were debugging a lot of large parallel DGEMMs, and the machine
was dying roughly once a day. I applied the patch, and the machine has been
stable since, so I just assumed. However, it could have been something we
are doing differently in our testing (as the code has changed), or an
unrelated other thing in the BIOS . . .

>if you want to experiment with the workarounds, build
>http://code.google.com/p/iotools/ and put it into your PATH.
>
>then execute a script something like this:
>
>for cpu in `awk '/^processor/ {print $3}' /proc/cpuinfo`; do
>  # disable erratum 298 workaround
>  wrmsr $cpu 0xc0010015 $(and $(rdmsr $cpu 0xc0010015) $(not $(shl 1 3)))
>  wrmsr $cpu 0xc0011023 $(and $(rdmsr $cpu 0xc0011023) $(not $(shl 1 1)))
>
>  # disable erratum 309 workaround
>  wrmsr $cpu 0xc0011023 $(and $(rdmsr $cpu 0xc0011023) $(not $(shl 1 23)))
>done

This program allows you to change stuff in the BIOS on the fly? Or is this
linux workarounds I need to be able to apply with an unpatched BIOS? I
guess I need to check it out with savana & compile it (I didn't see any
simple download link)?

Thanks,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************
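For anyone running the 1024-sample xdfc, a rare drop to half speed is easy to miss by eye. The following is a minimal sketch, not part of ATLAS, that scans timer output for samples well below the best one. The function name flag_slow_samples and the 0.75 threshold are assumptions made for illustration; the only thing taken from the thread is the "mflop=" field in each output line.

```shell
#!/usr/bin/env bash
# flag_slow_samples [THRESHOLD]
# Hypothetical helper: reads xdfc-style output on stdin (one "mflop=NNNN.NN"
# entry per line) and reports every sample below THRESHOLD * best sample.
# THRESHOLD defaults to 0.75, an arbitrary choice for this sketch.
flag_slow_samples() {
    local threshold="${1:-0.75}"
    awk -v thr="$threshold" '
        match($0, /mflop=[0-9.]+/) {
            # strip the 6-character "mflop=" prefix and keep the number
            v = substr($0, RSTART + 6, RLENGTH - 6) + 0
            n++
            val[n] = v
            if (v > best) best = v
        }
        END {
            slow = 0
            for (i = 1; i <= n; i++)
                if (val[i] < thr * best) {
                    printf "sample %d: %.2f mflop\n", i, val[i]
                    slow++
                }
            printf "%d of %d samples below %.0f%% of best (%.2f)\n", slow, n, thr * 100, best
        }'
}
```

Typical use would be "./xdfc | flag_slow_samples", so that a handful of 4Gflop samples among a thousand 8Gflop ones shows up in the final summary line.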
|
|
Re: k10h post-BIOS patch effects

On Sun, 20 Jul 2008, Clint Whaley wrote:

> Dean (& guys),
>
> OK, here are a few things. First, there is a modified xdfc available at:
>    www.cs.utsa.edu/~whaley/dload/xdfc
> It is my normal kernel timer, which has been modified to keep calling the K10h
> kernel 1K times. When I run this on my Phenom, most of the numbers are roughly
> 8Gflop, but then it drops to 4Gflop for a lot of them. Can anyone with a
> Phenom run this executable and make sure yours doesn't do this to you too
> (i.e. the perf drop happens rarely enough that it can be missed)?

it seems to be 8.4gflop for my 2.3ghz Phenom:

# ./xdfc
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8366.61
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.58
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.81
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8376.21
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.64
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.92
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.55
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.40
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8377.10
dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.314, mflop=8376.29
...

i've let it run for a couple minutes now, no changes... actually it
finished, still no significant changes in the mflop -- it climbed a
tiny amount:

dNB=40, ld=40,40,40, mu=4, nu=4, ku=1, lat=4, pf=0: time=0.313, mflop=8391.98
dNB=40, time=0.313, mflop=8394.20

> >if you want to experiment with the workarounds, build
> >http://code.google.com/p/iotools/ and put it into your PATH.
> >
> >then execute a script something like this:
> >
> >for cpu in `awk '/^processor/ {print $3}' /proc/cpuinfo`; do
> >  # disable erratum 298 workaround
> >  wrmsr $cpu 0xc0010015 $(and $(rdmsr $cpu 0xc0010015) $(not $(shl 1 3)))
> >  wrmsr $cpu 0xc0011023 $(and $(rdmsr $cpu 0xc0011023) $(not $(shl 1 1)))
> >
> >  # disable erratum 309 workaround
> >  wrmsr $cpu 0xc0011023 $(and $(rdmsr $cpu 0xc0011023) $(not $(shl 1 23)))
> >done
>
> This program allows you to change stuff in the BIOS on the fly? Or is this
> linux workarounds I need to be able to apply with an unpatched BIOS? I
> guess I need to check it out with savana & compile it (I didn't see any
> simple download link)?

yeah you probably need to check it out with svn and build it. you
probably also need to "modprobe msr".

the BIOS workarounds for those errata amount to setting those bits (i.e.
bit 3 of MSR 0xc0010015, bit 1 of 0xc0011023 and bit 23 of 0xc0011023) ...
for these specific workarounds we can tweak them dynamically.

if you execute that script i pasted you'll disable the workarounds...
which will make your B2 behave like a B3.

-dean
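The bit manipulation behind the quoted script can be checked without touching any hardware. The sketch below uses plain bash arithmetic to do what the nested and/not/shl calls do: clear one bit of a value and leave the rest alone. The helper names clear_bit and bit_is_set are made up for this illustration and are not iotools commands, and the sample values are invented rather than real MSR contents. Nothing here reads or writes an MSR.

```shell
#!/usr/bin/env bash
# Pure arithmetic sketch; hypothetical helpers, no MSR access of any kind.

# clear_bit VALUE BIT
# Prints VALUE with bit number BIT cleared, in hex.
# Equivalent in spirit to: and VALUE $(not $(shl 1 BIT))
clear_bit() {
    printf '0x%x\n' $(( $1 & ~(1 << $2) ))
}

# bit_is_set VALUE BIT
# Exit status 0 if bit number BIT of VALUE is set, 1 otherwise.
bit_is_set() {
    (( ($1 >> $2) & 1 ))
}
```

For example, "clear_bit 0xff 3" prints 0xf7, and "bit_is_set 0x8 3" succeeds. Checking a register value with bit_is_set before and after a change is one way to confirm whether a workaround bit was set in the first place.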
|
|
Re: atlas-3.9.0

>Then is it possible to execute 4 simultaneous "make build" w/ONE
>"shared" source directory tree and w/4 different target directories,
>i.e. something like

ATLAS can certainly build any number of BLDdirs from one SRCdir. I would
not recommend firing multiple ones off at once, though, as the load from
one install will interfere (probably strongly) with other installs' timings.
So, it is fine to use the same source tree for multiple installs, but I
suggest serializing the installs.

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assist Prof, UTSA ** www.cs.utsa.edu/~whaley **
**************************************************************************
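Serializing the installs can be scripted. This is a rough sketch under stated assumptions: the function name serial_installs and the DRY_RUN variable are hypothetical, SRCdir is assumed to be an absolute path, and configure is shown with no flags even though a real install would normally pass some. Only "make build" and the one-SRCdir, many-BLDdirs layout come from the thread.

```shell
#!/usr/bin/env bash
# serial_installs SRCDIR BLDDIR...
# Hypothetical helper: configures and builds each BLDDIR from the shared
# SRCDIR, one after another, so no install disturbs another one's timings.
# SRCDIR must be an absolute path because each build runs inside its BLDDIR.
# With DRY_RUN=1 the commands are printed instead of executed.
serial_installs() {
    local src="$1"; shift
    local bld
    for bld in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "mkdir -p $bld && cd $bld && $src/configure && make build"
        else
            # subshell keeps the cd local; stop at the first failed install
            ( mkdir -p "$bld" && cd "$bld" && "$src/configure" && make build ) || return 1
        fi
    done
}
```

Running it first as "DRY_RUN=1 serial_installs /path/to/ATLAS /path/to/bld1 /path/to/bld2" shows the commands without starting a multi-hour tune.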
|
|
Re: atlas-3.9.0

In message from Clint Whaley <whaley@...> (Mon, 21 Jul 2008 11:12:44 -0500):
>>Then is it possible to execute 4 simultaneous "make build" w/ONE
>>"shared" source directory tree and w/4 different target directories,
>>i.e. something like
>
>ATLAS can certainly build any number of BLDdirs from one SRCdir. I would
>not recommend firing multiple ones off at once, though, as the load from
>one install will interfere (probably strongly) with other installs' timings.
>So, it is fine to use the same source tree for multiple installs, but I
>suggest serializing the installs.

Eh, the idea is just to see what happens with simultaneous tuning :-)!
I.e., which CacheEdge value will be obtained if you run 4 simultaneous
builds on a 4-core CPU?

Mikhail
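To compare CacheEdge between a serialized install and simultaneous ones, the value can be pulled out of each install log. A minimal sketch follows, assuming the log contains lines of the form "CacheEdge set to N bytes" as in the install output quoted earlier in this thread. The function name cache_edges is hypothetical, and the log is read from stdin so that no log file location has to be assumed.

```shell
#!/usr/bin/env bash
# cache_edges
# Hypothetical helper: reads ATLAS install log text on stdin and prints each
# detected CacheEdge in bytes and KiB. Works on quoted ("> ") log lines too,
# since it looks for the words "CacheEdge set to" anywhere on the line.
cache_edges() {
    awk '/CacheEdge set to [0-9]+ bytes/ {
        for (i = 1; i + 3 <= NF; i++)
            if ($i == "CacheEdge" && $(i + 1) == "set" && $(i + 2) == "to")
                printf "%d bytes (%d KiB)\n", $(i + 3), $(i + 3) / 1024
    }'
}
```

Feeding it the log from each build directory in turn gives one line per tuned precision, which makes any shift in CacheEdge under load easy to spot.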