Massive performance regression of glibc string functions

Hi!

I have been benchmarking several string functions and discovered that some of them are *much* slower than in the past; the regressions are measured against glibc-2.9. I'm testing on small strings (4..128, though for 128 a much bigger sample of calls would be needed for a good comparison), following the common wisdom that operations on small strings make up the bulk of the calls.

In the case of strlen(), there seems to be a regression only with very small strings on AMD, so this is probably fine.

In the case of memcmp(), strcmp() and strncmp(), glibc-2.10.1 seems to improve performance somewhat, especially for larger strings, but glibc-2.11 has a massive performance drop across all vendors! (Interestingly, glibc-2.10.1 is also slightly slower than glibc-2.9 in these functions on Core i7.)

I'd like to ask how the string routine changes were benchmarked, and for which architectures and string sizes they are supposed to be optimized, and why. I think it would be good to do something about this regression. ;-)

For the benchmarking, I'm using

http://pasky.or.cz/~pasky/dev/glibc/strbench/

that I quickly hacked together.
Here is the data I have collected on various x86_64 systems, running with 2048 iterations; apply reasonable error margins, of course:


model name : AMD Opteron (tm) Processor 848
cache size : 1024 KB
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good nopl

func,size   2.9-vanilla  2.10.1-vanilla  2.11-vanilla  2.11-amd
strlen4     5.630000     6.890000        7.060000      5.660000
strlen8     4.940000     3.580000        3.700000      4.170000
strlen32    2.220000     1.340000        1.490000      2.310000
strlen128   1.220000     0.830000        0.900000      1.330000
memcmp4     3.350000     3.330000        4.400000      3.310000
memcmp8     1.840000     1.740000        2.660000      2.140000
memcmp32    0.970000     0.800000        1.770000      1.300000
memcmp128   0.330000     0.310000        1.050000      0.650000
strcmp4     2.400000     2.290000        5.620000      2.470000
strcmp8     1.600000     1.280000        3.260000      1.560000
strcmp32    0.950000     0.600000        1.630000      0.870000
strcmp128   0.350000     0.210000        1.010000      0.310000
strncmp4    2.560000     2.250000        5.880000      2.960000
strncmp8    1.400000     1.410000        3.230000      1.700000
strncmp32   0.710000     0.770000        1.370000      0.940000
strncmp128  0.270000     0.270000        0.670000      0.350000


model name : Dual Core AMD Opteron(tm) Processor 165
cache size : 1024 KB
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

func,size   2.9-vanilla  2.10.1-vanilla  2.11-vanilla  2.11-amd
strlen4     6.780000     8.350000        8.580000      6.850000
strlen8     5.920000     4.300000        4.420000      5.010000
strlen32    2.570000     1.440000        1.430000      2.660000
strlen128   1.260000     0.910000        0.850000      1.240000
memcmp4     3.960000     4.040000        5.160000      2.840000
memcmp8     2.020000     2.060000        3.000000      1.890000
memcmp32    0.770000     0.720000        1.350000      0.980000
memcmp128   0.260000     0.240000        0.540000      0.430000
strcmp4     2.740000     2.750000        6.790000      2.910000
strcmp8     1.410000     1.410000        3.600000      1.620000
strcmp32    0.630000     0.580000        1.260000      0.700000
strcmp128   0.200000     0.180000        0.620000      0.230000
strncmp4    3.080000     2.720000        7.180000      3.540000
strncmp8    1.580000     1.440000        3.940000      1.880000
strncmp32   0.720000     0.670000        1.310000      0.840000
strncmp128  0.240000     0.220000        0.550000      0.280000


model name : Intel(R) Xeon(R) CPU X3220 @ 2.40GHz
cache size : 4096 KB
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm

func,size   2.9-vanilla  2.10.1-vanilla  2.11-vanilla  2.11-amd
strlen4     3.870000     3.050000        3.270000      3.870000
strlen8     2.370000     1.530000        1.640000      3.450000
strlen32    1.040000     0.480000        0.470000      1.520000
strlen128   0.600000     0.290000        0.280000      0.680000
memcmp4     2.080000     2.260000        2.680000      1.800000
memcmp8     1.040000     1.130000        1.460000      1.860000
memcmp32    0.270000     0.270000        0.350000      0.770000
memcmp128   0.070000     0.070000        0.090000      0.190000
strcmp4     1.910000     1.910000        3.480000      1.920000
strcmp8     0.960000     0.950000        1.200000      0.960000
strcmp32    0.240000     0.240000        0.290000      0.240000
strcmp128   0.060000     0.060000        0.080000      0.060000
strncmp4    2.030000     1.690000        4.240000      2.810000
strncmp8    1.020000     0.850000        1.610000      1.410000
strncmp32   0.260000     0.210000        0.380000      0.360000
strncmp128  0.070000     0.060000        0.100000      0.080000


model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
cache size : 6144 KB
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority

func,size   2.9-vanilla  2.10.1-vanilla  2.11-vanilla  2.11-amd
strlen4     3.090000     2.960000        2.750000      3.450000
strlen8     1.890000     1.230000        1.360000      3.140000
strlen32    0.810000     0.370000        0.340000      1.220000
strlen128   0.460000     0.220000        0.200000      0.660000
memcmp4     2.160000     1.820000        2.500000      1.800000
memcmp8     1.100000     0.910000        1.500000      1.170000
memcmp32    0.310000     0.220000        0.320000      0.380000
memcmp128   0.090000     0.060000        0.090000      0.110000
strcmp4     1.860000     1.910000        3.530000      1.570000
strcmp8     0.960000     0.960000        1.170000      0.840000
strcmp32    0.280000     0.250000        0.300000      0.270000
strcmp128   0.050000     0.050000        0.090000      0.070000
strncmp4    1.740000     1.750000        3.790000      2.840000
strncmp8    0.940000     0.850000        1.380000      1.380000
strncmp32   0.220000     0.220000        0.320000      0.400000
strncmp128  0.050000     0.050000        0.090000      0.080000


model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
cache size : 8192 KB
flags      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida

func,size   2.9-vanilla  2.10.1-vanilla  2.11-vanilla  2.11-amd
strlen4     3.440000     3.500000        2.780000      3.320000
strlen8     2.260000     1.750000        1.440000      2.220000
strlen32    0.850000     0.500000        0.380000      0.900000
strlen128   0.470000     0.260000        0.200000      0.500000
memcmp4     2.180000     2.060000        2.500000      1.840000
memcmp8     1.100000     1.050000        1.320000      1.060000
memcmp32    0.270000     0.260000        0.350000      0.330000
memcmp128   0.080000     0.070000        0.090000      0.090000
strcmp4     1.660000     1.930000        2.250000      1.640000
strcmp8     0.830000     0.970000        1.140000      0.840000
strcmp32    0.210000     0.240000        0.240000      0.210000
strcmp128   0.050000     0.070000        0.080000      0.060000
strncmp4    1.740000     1.830000        2.490000      2.570000
strncmp8    0.870000     0.920000        1.220000      1.300000
strncmp32   0.220000     0.230000        0.260000      0.320000
strncmp128  0.050000     0.050000        0.090000      0.080000


* numbers after function names indicate string sizes
** 2.11-amd is a very old AMD-provided x86_64 string routines patch
   (it doesn't implement some of the new things, like bounded pointers
   checks support) that we still use in SUSE glibc:

	http://pasky.or.cz/~pasky/dev/glibc/amd64-string-2.11.diff

   (If the regression against 2.10.1 is fixed, it is probably not very
   interesting; it performs better only at very short memcmp()s.)

*** I can't seem to find newer AMD processors to test on right now,
    sorry. If you have any, feel free to run the benchmark there - just
    get the /strbench/ directory and run `./strbench.sh outfile`.

Kind regards,

--
				Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
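For readers who want to reproduce this kind of measurement without fetching strbench, the loop below is a minimal sketch of the idea: a fixed number of calls on one buffer size, timed as a whole. It is not the strbench code; the names `time_memcmp`, `ITERATIONS` and `sink`, the use of `clock_gettime`, and the buffer contents are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical iteration count, chosen to match the 2048 mentioned above. */
#define ITERATIONS 2048

/* Results are accumulated here so the calls cannot be optimized away. */
volatile int sink;

/* Time ITERATIONS calls of memcmp() on two equal buffers of `size` bytes.
   Returns the elapsed time in nanoseconds (assumption: CLOCK_MONOTONIC). */
long long time_memcmp(size_t size)
{
    static char a[128], b[128];
    /* Calling through a volatile pointer keeps the compiler from inlining
       memcmp or hoisting the loop-invariant call out of the loop. */
    int (*volatile fn)(const void *, const void *, size_t) = memcmp;
    struct timespec t0, t1;

    assert(size <= sizeof a);
    memset(a, 'x', sizeof a);
    memset(b, 'x', sizeof b);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        sink += fn(a, b, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling `time_memcmp` with 4, 8, 32 and 128 gives one number per size for whichever C library the program is linked against. With only 2048 calls the total is tiny, so timer overhead and noise can dominate the short sizes.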
Re: Massive performance regression of glibc string functions

I am using the rdtsc timing in glibc string tests. Here is strlen data on

Intel(R) Xeon(R) CPU X3350 @ 2.66GHz

Columns, in order: strlen_2_11, builtin_strlen, strlen in glibc 2.9

LAT: Pos 1, alignment 0: 8 16 16
LAT: Pos 2, alignment 0: 8 24 16
LAT: Pos 3, alignment 0: 8 24 16
LAT: Pos 4, alignment 0: 8 24 16
LAT: Pos 5, alignment 0: 8 24 16
LAT: Pos 6, alignment 0: 8 24 24
LAT: Pos 7, alignment 0: 8 24 16
LAT: Pos 1, alignment 1: 8 16 8
LAT: Pos 2, alignment 2: 8 24 16
LAT: Pos 3, alignment 3: 8 24 16
LAT: Pos 4, alignment 4: 8 32 24
LAT: Pos 5, alignment 5: 8 32 24
LAT: Pos 6, alignment 6: 16 32 24
LAT: Pos 7, alignment 7: 16 32 24
LAT: Pos 4, alignment 0: 8 24 16
LAT: Pos 4, alignment 1: 16 24 16
LAT: Pos 8, alignment 0: 8 24 16
LAT: Pos 8, alignment 1: 8 40 32
LAT: Pos 16, alignment 0: 16 24 24
LAT: Pos 16, alignment 1: 16 40 32
LAT: Pos 32, alignment 0: 16 32 24
LAT: Pos 32, alignment 1: 16 48 40
LAT: Pos 64, alignment 0: 24 40 40
LAT: Pos 64, alignment 1: 24 56 56
LAT: Pos 128, alignment 0: 32 64 64
LAT: Pos 128, alignment 1: 32 80 80
LAT: Pos 256, alignment 0: 56 136 128
LAT: Pos 256, alignment 1: 56 152 136
LAT: Pos 512, alignment 0: 96 264 256
LAT: Pos 512, alignment 1: 96 272 264
LAT: Pos 1024, alignment 0: 224 512 504
LAT: Pos 1024, alignment 1: 224 528 520
LAT: Pos 1, alignment 0: 8 16 16
LAT: Pos 2, alignment 0: 8 24 16
LAT: Pos 3, alignment 0: 8 24 16
LAT: Pos 4, alignment 0: 8 24 16
LAT: Pos 5, alignment 0: 8 24 16
LAT: Pos 6, alignment 0: 8 24 24
LAT: Pos 7, alignment 0: 8 24 16
LAT: Pos 1, alignment 1: 16 16 8
LAT: Pos 2, alignment 2: 8 24 16
LAT: Pos 3, alignment 3: 8 24 16
LAT: Pos 4, alignment 4: 8 32 24
LAT: Pos 5, alignment 5: 16 32 24
LAT: Pos 6, alignment 6: 8 32 24
LAT: Pos 7, alignment 7: 16 32 24
LAT: Pos 4, alignment 0: 8 24 16
LAT: Pos 4, alignment 1: 8 24 16
LAT: Pos 8, alignment 0: 8 24 16
LAT: Pos 8, alignment 1: 8 40 32
LAT: Pos 16, alignment 0: 16 24 24
LAT: Pos 16, alignment 1: 16 40 32
LAT: Pos 32, alignment 0: 16 32 24
LAT: Pos 32, alignment 1: 16 48 40
LAT: Pos 64, alignment 0: 24 40 40
LAT: Pos 64, alignment 1: 24 56 56
LAT: Pos 128, alignment 0: 32 64 64
LAT: Pos 128, alignment 1: 32 80 80
LAT: Pos 256, alignment 0: 56 136 128
LAT: Pos 256, alignment 1: 56 152 136
LAT: Pos 512, alignment 0: 96 264 256
LAT: Pos 512, alignment 1: 96 272 264
LAT: Pos 1024, alignment 0: 224 512 504
LAT: Pos 1024, alignment 1: 224 528 520
LAT: Pos 0, alignment 0: 8 16 16
LAT: Pos 1, alignment 0: 8 16 16
LAT: Pos 1, alignment 1: 8 16 8
LAT: Pos 2, alignment 0: 8 24 16
LAT: Pos 2, alignment 1: 16 24 8
LAT: Pos 2, alignment 2: 8 24 16
LAT: Pos 3, alignment 0: 8 24 16
LAT: Pos 3, alignment 1: 8 24 16
LAT: Pos 3, alignment 2: 16 24 16
LAT: Pos 3, alignment 3: 16 24 16
LAT: Pos 4, alignment 0: 8 24 16
LAT: Pos 4, alignment 1: 8 24 16
LAT: Pos 4, alignment 2: 16 24 16
LAT: Pos 4, alignment 3: 8 24 16
LAT: Pos 4, alignment 4: 16 32 24
LAT: Pos 5, alignment 0: 8 24 16
LAT: Pos 5, alignment 1: 8 32 24
LAT: Pos 5, alignment 2: 16 32 24
LAT: Pos 5, alignment 3: 16 32 24
LAT: Pos 5, alignment 4: 16 32 24
LAT: Pos 5, alignment 5: 8 32 24
LAT: Pos 6, alignment 0: 8 24 24
LAT: Pos 6, alignment 1: 16 32 24
LAT: Pos 6, alignment 2: 16 32 24
LAT: Pos 6, alignment 3: 8 32 24
LAT: Pos 6, alignment 4: 16 32 24
LAT: Pos 6, alignment 5: 16 32 24
LAT: Pos 6, alignment 6: 16 32 24
LAT: Pos 7, alignment 0: 8 24 16
LAT: Pos 7, alignment 1: 8 40 32
LAT: Pos 7, alignment 2: 16 32 32
LAT: Pos 7, alignment 3: 16 32 24
LAT: Pos 7, alignment 4: 8 32 24
LAT: Pos 7, alignment 5: 16 32 24
LAT: Pos 7, alignment 6: 8 32 24
LAT: Pos 7, alignment 7: 16 32 24
LAT: Pos 8, alignment 0: 8 24 16
LAT: Pos 8, alignment 1: 8 40 32
LAT: Pos 8, alignment 2: 16 32 32
LAT: Pos 8, alignment 3: 16 32 24
LAT: Pos 8, alignment 4: 8 32 32
LAT: Pos 8, alignment 5: 8 32 24
LAT: Pos 8, alignment 6: 8 32 24
LAT: Pos 8, alignment 7: 16 24 24
LAT: Pos 8, alignment 8: 16 24 16
LAT: Pos 9, alignment 0: 8 24 16
LAT: Pos 9, alignment 1: 16 40 32
LAT: Pos 9, alignment 2: 8 40 32
LAT: Pos 9, alignment 3: 16 32 24
LAT: Pos 9, alignment 4: 8 32 32
LAT: Pos 9, alignment 5: 16 32 24
LAT: Pos 9, alignment 6: 8 32 24
LAT: Pos 9, alignment 7: 16 24 16
LAT: Pos 9, alignment 8: 16 24 16
LAT: Pos 9, alignment 9: 8 40 32
LAT: Pos 10, alignment 0: 8 24 16
LAT: Pos 10, alignment 1: 16 40 32
LAT: Pos 10, alignment 2: 8 40 32
LAT: Pos 10, alignment 3: 16 40 32
LAT: Pos 10, alignment 4: 16 32 32
LAT: Pos 10, alignment 5: 8 32 24
LAT: Pos 10, alignment 6: 16 32 16
LAT: Pos 10, alignment 7: 16 24 24
LAT: Pos 10, alignment 8: 16 24 16
LAT: Pos 10, alignment 9: 16 40 32
LAT: Pos 10, alignment 10: 16 40 32
LAT: Pos 11, alignment 0: 8 24 16
LAT: Pos 11, alignment 1: 8 40 32
LAT: Pos 11, alignment 2: 8 40 32
LAT: Pos 11, alignment 3: 8 40 32
LAT: Pos 11, alignment 4: 8 32 32
LAT: Pos 11, alignment 5: 16 32 24
LAT: Pos 11, alignment 6: 16 32 24
LAT: Pos 11, alignment 7: 16 24 24
LAT: Pos 11, alignment 8: 16 24 16
LAT: Pos 11, alignment 9: 16 40 32
LAT: Pos 11, alignment 10: 16 40 32
LAT: Pos 11, alignment 11: 16 40 32
LAT: Pos 12, alignment 0: 8 24 16
LAT: Pos 12, alignment 1: 8 40 32
LAT: Pos 12, alignment 2: 8 40 32
LAT: Pos 12, alignment 3: 8 40 32
LAT: Pos 12, alignment 4: 16 32 32
LAT: Pos 12, alignment 5: 16 32 24
LAT: Pos 12, alignment 6: 16 32 24
LAT: Pos 12, alignment 7: 16 24 24
LAT: Pos 12, alignment 8: 16 24 16
LAT: Pos 12, alignment 9: 16 40 40
LAT: Pos 12, alignment 10: 16 40 32
LAT: Pos 12, alignment 11: 16 40 32
LAT: Pos 12, alignment 12: 16 32 32
LAT: Pos 13, alignment 0: 8 24 24
LAT: Pos 13, alignment 1: 8 40 40
LAT: Pos 13, alignment 2: 8 40 32
LAT: Pos 13, alignment 3: 16 32 32
LAT: Pos 13, alignment 4: 16 32 32
LAT: Pos 13, alignment 5: 16 32 24
LAT: Pos 13, alignment 6: 16 32 24
LAT: Pos 13, alignment 7: 16 24 24
LAT: Pos 13, alignment 8: 16 24 16
LAT: Pos 13, alignment 9: 16 40 40
LAT: Pos 13, alignment 10: 16 40 32
LAT: Pos 13, alignment 11: 16 32 32
LAT: Pos 13, alignment 12: 8 32 32
LAT: Pos 13, alignment 13: 16 32 24
LAT: Pos 14, alignment 0: 8 24 24
LAT: Pos 14, alignment 1: 16 40 32
LAT: Pos 14, alignment 2: 16 40 32
LAT: Pos 14, alignment 3: 16 32 32
LAT: Pos 14, alignment 4: 16 32 32
LAT: Pos 14, alignment 5: 16 32 24
LAT: Pos 14, alignment 6: 16 32 24
LAT: Pos 14, alignment 7: 16 32 24
LAT: Pos 14, alignment 8: 16 32 24
LAT: Pos 14, alignment 9: 16 40 32
LAT: Pos 14, alignment 10: 16 40 32
LAT: Pos 14, alignment 11: 16 40 32
LAT: Pos 14, alignment 12: 16 32 32
LAT: Pos 14, alignment 13: 16 32 24
LAT: Pos 14, alignment 14: 16 32 24
LAT: Pos 15, alignment 0: 8 24 24
LAT: Pos 15, alignment 1: 16 40 32
LAT: Pos 15, alignment 2: 16 40 32
LAT: Pos 15, alignment 3: 16 40 32
LAT: Pos 15, alignment 4: 16 32 32
LAT: Pos 15, alignment 5: 16 32 32
LAT: Pos 15, alignment 6: 16 32 32
LAT: Pos 15, alignment 7: 16 24 24
LAT: Pos 15, alignment 8: 16 24 24
LAT: Pos 15, alignment 9: 16 40 32
LAT: Pos 15, alignment 10: 16 40 32
LAT: Pos 15, alignment 11: 16 40 32
LAT: Pos 15, alignment 12: 8 32 32
LAT: Pos 15, alignment 13: 16 32 32
LAT: Pos 15, alignment 14: 16 32 32
LAT: Pos 15, alignment 15: 16 32 24

Data on memcmp and strcmp show similar results. The new ones in glibc 2.11 are much better than the old ones in glibc 2.9. If you believe there is a regression, please provide length as well as alignments on input data. I will take a look.

Thanks.

H.J.
----
On Fri, Nov 6, 2009 at 6:04 AM, Petr Baudis <pasky@...> wrote:
> ..snip..

--
H.J.
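The "rdtsc timing" mentioned above can be sketched as follows: read the time-stamp counter before and after a single call and keep the smallest difference seen. This is not the glibc test harness. The names `read_ticks` and `min_strlen_ticks`, the reading of "Pos" as the offset of the terminating NUL and "alignment" as the start offset within an aligned buffer, and the choice to keep the minimum are all assumptions; the `__rdtsc()` intrinsic from `<x86intrin.h>` assumes GCC or Clang on x86, with a nanosecond clock as a stand-in elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read a tick counter: the TSC on x86, otherwise (assumption) a
   nanosecond clock as a portable stand-in. */
uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Smallest observed tick count for one strlen() call on a string whose
   terminator sits `pos` bytes after a start that is `align` bytes into a
   64-byte-aligned buffer. The measured length is stored in *len_out. */
uint64_t min_strlen_ticks(size_t pos, size_t align, int repeats, size_t *len_out)
{
    static _Alignas(64) char buf[2048];
    /* Volatile pointer: prevents the call from being inlined or folded. */
    size_t (*volatile fn)(const char *) = strlen;
    uint64_t best = UINT64_MAX;

    assert(repeats > 0);
    assert(align + pos < sizeof buf);
    memset(buf, 'a', sizeof buf);
    buf[align + pos] = '\0';

    for (int i = 0; i < repeats; i++) {
        uint64_t t0 = read_ticks();
        size_t len = fn(buf + align);
        uint64_t t1 = read_ticks();

        if (t1 - t0 < best)
            best = t1 - t0;
        *len_out = len;
    }
    return best;
}
```

Keeping the minimum is one common way to filter out interrupts and cold-cache runs. The reading still includes the cost of the two counter reads themselves, which is worth remembering when comparing values this small.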
Re: Massive performance regression of glibc string functions

On Fri, Nov 06, 2009 at 10:20:41AM -0700, H.J. Lu wrote:
> I am using the rdtsc timing in glibc string tests. Here is strlen data on
>
> Intel(R) Xeon(R) CPU X3350 @ 2.66GHz

..snip..

> Data on memcmp and strcmp show similar results. The new ones
> in glibc 2.11 are much better than the old ones in glibc 2.9.

I think the one you have shown exactly matches my findings - I also think strlen() in glibc-2.11 is much better than in glibc-2.9 (except on AMD and very small strings). But that is the only one of these I tested; could you please post the same numbers for e.g. memcmp()?

> If you believe there is a regression, please provide length as well
> as alignments on input data. I will take a look.

The lengths are the numbers after function names - i.e. I'm testing with 4, 8, 32 and 128. All the values are 8-aligned, I can test misaligned strings too if you think 2.11 will do better there.

--
				Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
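Since the alignment of the inputs is the open question here, the following sketch shows one way to build test strings at a chosen offset from an aligned base, so that 8-aligned and misaligned cases can be produced from the same code. The helper names `make_string` and `misalignment8` are hypothetical, and `aligned_alloc` assumes a C11 library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer `offset` bytes past a 64-byte-aligned allocation that
   holds a NUL-terminated string of `len` 'x' characters. *base_out
   receives the pointer to pass to free(). Returns NULL on failure. */
char *make_string(size_t len, size_t offset, void **base_out)
{
    /* aligned_alloc() wants a size that is a multiple of the alignment. */
    size_t need = offset + len + 1;
    size_t size = (need + 63) / 64 * 64;
    char *base = aligned_alloc(64, size);

    if (base == NULL)
        return NULL;
    memset(base + offset, 'x', len);
    base[offset + len] = '\0';
    *base_out = base;
    return base + offset;
}

/* Distance of p from the previous 8-byte boundary (0 means 8-aligned). */
unsigned misalignment8(const void *p)
{
    return (unsigned)((uintptr_t)p % 8);
}
```

With `offset` set to 0 this yields the 8-aligned case described above; offsets 1 through 7 give the misaligned variants for the same lengths.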
Re: Massive performance regression of glibc string functions

On Sat, Nov 7, 2009 at 12:58 AM, Petr Baudis <pasky@...> wrote:
> On Fri, Nov 06, 2009 at 10:20:41AM -0700, H.J. Lu wrote:
>> I am using the rdtsc timing in glibc string tests. Here is strlen data on
>>
>> Intel(R) Xeon(R) CPU X3350 @ 2.66GHz
> ..snip..
>>
>> Data on memcmp and strcmp show similar results. The new ones
>> in glibc 2.11 are much better than the old ones in glibc 2.9.
>
> I think the one you have shown exactly matches my findings - I also
> think strlen() in glibc-2.11 is much better than in glibc-2.9 (except on
> AMD and very small strings). But that is the only one of these I tested;
> could you please post the same numbers for e.g. memcmp()?

Columns, in order: memcmp_2_11, memcmp 2.9

LAT: Len 1, alignment 13/13: 8 16
LAT: Len 1, alignment 13/13: 8 16
LAT: Len 1, alignment 13/13: 8 16
LAT: Len 2, alignment 12/12: 16 24
LAT: Len 2, alignment 12/12: 16 24
LAT: Len 2, alignment 12/12: 16 24
LAT: Len 3, alignment 10/10: 16 24
LAT: Len 3, alignment 10/10: 24 24
LAT: Len 3, alignment 10/10: 24 24
LAT: Len 4, alignment 8/ 8: 16 24
LAT: Len 4, alignment 8/ 8: 16 24
LAT: Len 4, alignment 8/ 8: 16 24
LAT: Len 5, alignment 6/ 6: 16 32
LAT: Len 5, alignment 6/ 6: 24 24
LAT: Len 5, alignment 6/ 6: 24 24
LAT: Len 6, alignment 4/ 4: 16 32
LAT: Len 6, alignment 4/ 4: 24 32
LAT: Len 6, alignment 4/ 4: 24 32
LAT: Len 7, alignment 2/ 2: 16 32
LAT: Len 7, alignment 2/ 2: 24 32
LAT: Len 7, alignment 2/ 2: 24 32
LAT: Len 8, alignment 0/ 0: 16 40
LAT: Len 8, alignment 0/ 0: 24 32
LAT: Len 8, alignment 0/ 0: 24 32
LAT: Len 9, alignment 14/14: 16 56
LAT: Len 9, alignment 14/14: 24 32
LAT: Len 9, alignment 14/14: 24 32
LAT: Len 10, alignment 12/12: 16 40
LAT: Len 10, alignment 12/12: 24 40
LAT: Len 10, alignment 12/12: 24 40
LAT: Len 11, alignment 10/10: 24 48
LAT: Len 11, alignment 10/10: 24 40
LAT: Len 11, alignment 10/10: 24 40
LAT: Len 12, alignment 8/ 8: 16 48
LAT: Len 12, alignment 8/ 8: 24 40
LAT: Len 12, alignment 8/ 8: 24 40
LAT: Len 13, alignment 6/ 6: 24 48
LAT: Len 13, alignment 6/ 6: 24 40
LAT: Len 13, alignment 6/ 6: 24 40
LAT: Len 14, alignment 4/ 4: 24 56
LAT: Len 14, alignment 4/ 4: 24 48
LAT: Len 14, alignment 4/ 4: 24 48
LAT: Len 15, alignment 2/ 2: 24 56
LAT: Len 15, alignment 2/ 2: 24 48
LAT: Len 15, alignment 2/ 2: 24 48
LAT: Len 1, alignment 0/ 0: 8 16
LAT: Len 1, alignment 0/ 0: 8 16
LAT: Len 1, alignment 0/ 0: 8 16
LAT: Len 2, alignment 0/ 0: 16 24
LAT: Len 2, alignment 0/ 0: 16 24
LAT: Len 2, alignment 0/ 0: 16 24
LAT: Len 3, alignment 0/ 0: 16 24
LAT: Len 3, alignment 0/ 0: 24 24
LAT: Len 3, alignment 0/ 0: 24 24
LAT: Len 4, alignment 0/ 0: 16 24
LAT: Len 4, alignment 0/ 0: 16 24
LAT: Len 4, alignment 0/ 0: 16 24
LAT: Len 5, alignment 0/ 0: 16 32
LAT: Len 5, alignment 0/ 0: 24 24
LAT: Len 5, alignment 0/ 0: 24 24
LAT: Len 6, alignment 0/ 0: 16 32
LAT: Len 6, alignment 0/ 0: 24 32
LAT: Len 6, alignment 0/ 0: 24 32
LAT: Len 7, alignment 0/ 0: 16 32
LAT: Len 7, alignment 0/ 0: 24 32
LAT: Len 7, alignment 0/ 0: 24 32
LAT: Len 8, alignment 0/ 0: 16 40
LAT: Len 8, alignment 0/ 0: 24 32
LAT: Len 8, alignment 0/ 0: 24 32
LAT: Len 9, alignment 0/ 0: 16 56
LAT: Len 9, alignment 0/ 0: 24 32
LAT: Len 9, alignment 0/ 0: 24 32
LAT: Len 10, alignment 0/ 0: 16 40
LAT: Len 10, alignment 0/ 0: 24 40
LAT: Len 10, alignment 0/ 0: 24 40
LAT: Len 11, alignment 0/ 0: 24 48
LAT: Len 11, alignment 0/ 0: 24 40
LAT: Len 11, alignment 0/ 0: 24 40
LAT: Len 12, alignment 0/ 0: 16 48
LAT: Len 12, alignment 0/ 0: 24 40
LAT: Len 12, alignment 0/ 0: 24 40
LAT: Len 13, alignment 0/ 0: 24 48
LAT: Len 13, alignment 0/ 0: 24 40
LAT: Len 13, alignment 0/ 0: 24 40
LAT: Len 14, alignment 0/ 0: 24 56
LAT: Len 14, alignment 0/ 0: 24 48
LAT: Len 14, alignment 0/ 0: 24 48
LAT: Len 15, alignment 0/ 0: 24 56
LAT: Len 15, alignment 0/ 0: 24 48
LAT: Len 15, alignment 0/ 0: 24 48
LAT: Len 4, alignment 0/ 0: 16 24
LAT: Len 4, alignment 0/ 0: 16 24
LAT: Len 4, alignment 0/ 0: 16 24
LAT: Len 32, alignment 0/ 0: 32 32
LAT: Len 32, alignment 13/14: 40 64
LAT: Len 32, alignment 0/ 0: 32 64
LAT: Len 32, alignment 0/ 0: 32 64
LAT: Len 8, alignment 0/ 0: 16 40
LAT: Len 8, alignment 0/ 0: 24 32
LAT: Len 8, alignment 0/ 0: 24 32
LAT: Len 64, alignment 0/ 0: 40 40
LAT: Len 64, alignment 14/12: 112 88
LAT: Len 64, alignment 0/ 0: 32 72
LAT: Len 64, alignment 0/ 0: 32 104
LAT: Len 16, alignment 0/ 0: 24 32
LAT: Len 16, alignment 0/ 0: 24 56
LAT: Len 16, alignment 0/ 0: 24 56
LAT: Len 128, alignment 0/ 0: 48 56
LAT: Len 128, alignment 14/12: 144 120
LAT: Len 128, alignment 0/ 0: 40 88
LAT: Len 128, alignment 0/ 0: 40 88

>> If you believe there is a regression, please provide length as well
>> as alignments on input data. I will take a look.
>
> The lengths are the numbers after function names - i.e. I'm testing with
> 4, 8, 32 and 128. All the values are 8-aligned, I can test misaligned
> strings too if you think 2.11 will do better there.
>

Your test compares timings of 2 implementations in 2 C libraries on 2 sets of random data. You should compare 2 implementations on the same set of data linked against the same C library.

--
H.J.
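The suggestion above, two implementations measured on the same data within one program, can be sketched with a function-pointer type so that both candidates go through an identical timing path. Everything named here is an assumption for illustration: `memcmp_bytewise` is a hypothetical stand-in for "the other" implementation, and `time_calls`, `same_result`, `sign_of` and `cmp_sink` are made-up helper names. Timing uses the standard `clock()` for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef int (*memcmp_fn)(const void *, const void *, size_t);

/* Results are accumulated here so the calls cannot be optimized away. */
volatile int cmp_sink;

/* Hypothetical second implementation: a plain byte-by-byte loop. */
int memcmp_bytewise(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;

    for (size_t i = 0; i < n; i++)
        if (p[i] != q[i])
            return p[i] < q[i] ? -1 : 1;
    return 0;
}

/* Reduce a memcmp-style result to -1, 0 or 1 so results are comparable. */
int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Run both implementations on the same buffers; returns 1 if they agree. */
int same_result(memcmp_fn f, memcmp_fn g, const void *a, const void *b, size_t n)
{
    return sign_of(f(a, b, n)) == sign_of(g(a, b, n));
}

/* Time `iters` calls of one implementation on the given buffers, in
   seconds of CPU time. Both candidates are passed through this same
   function with the same buffers, so only the implementation differs. */
double time_calls(memcmp_fn f, const void *a, const void *b, size_t n, int iters)
{
    memcmp_fn volatile fn = f;   /* volatile: keep the indirect call */
    clock_t t0 = clock();

    for (int i = 0; i < iters; i++)
        cmp_sink += fn(a, b, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Passing `memcmp` and `memcmp_bytewise` to `time_calls` with the same pair of buffers keeps the data, the alignment and the rest of the C library constant between the two measurements. Checking `same_result` first guards against timing an implementation that returns wrong answers.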