Massive performance regression of glibc string functions

Massive performance regression of glibc string functions

by Petr Baudis-2

  Hi!

  I have been benchmarking several string functions and discovered
that some of them are *much* slower than in the past; the regressions
are measured against glibc-2.9. I'm testing on small strings (4..128,
though for 128 a much bigger sample of calls would be needed for a good
comparison), following the common wisdom that operations on small
strings are the bulk of the calls.

  In case of strlen(), there seems to be a regression only with very
small strings on AMD, so this is probably fine.

  In case of memcmp(), strcmp() and strncmp(), glibc-2.10.1 seems to
improve performance somewhat, especially for larger strings, but
glibc-2.11 has a massive performance drop across all vendors!
(Interestingly, glibc-2.10.1 is also slightly slower than glibc-2.9 in
these functions on Core i7.)

  I'd like to ask how the string routine changes were benchmarked,
which architectures and string sizes they are supposed to be optimized
for, and why. I think it would be good to do something about this
regression. ;-)

  For the benchmarking, I'm using

        http://pasky.or.cz/~pasky/dev/glibc/strbench/

that I quickly hacked together. Here is the data I have collected
on various x86_64 systems, running with 2048 iterations; apply
reasonable error margins, of course:


model name : AMD Opteron (tm) Processor 848
cache size : 1024 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good nopl

func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4         5.630000 6.890000 7.060000 5.660000
strlen8         4.940000 3.580000 3.700000 4.170000
strlen32       2.220000 1.340000 1.490000 2.310000
strlen128       1.220000 0.830000 0.900000 1.330000
memcmp4         3.350000 3.330000 4.400000 3.310000
memcmp8         1.840000 1.740000 2.660000 2.140000
memcmp32       0.970000 0.800000 1.770000 1.300000
memcmp128       0.330000 0.310000 1.050000 0.650000
strcmp4         2.400000 2.290000 5.620000 2.470000
strcmp8         1.600000 1.280000 3.260000 1.560000
strcmp32       0.950000 0.600000 1.630000 0.870000
strcmp128       0.350000 0.210000 1.010000 0.310000
strncmp4       2.560000 2.250000 5.880000 2.960000
strncmp8       1.400000 1.410000 3.230000 1.700000
strncmp32       0.710000 0.770000 1.370000 0.940000
strncmp128     0.270000 0.270000 0.670000 0.350000


model name : Dual Core AMD Opteron(tm) Processor 165
cache size : 1024 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy

func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4         6.780000 8.350000 8.580000 6.850000
strlen8         5.920000 4.300000 4.420000 5.010000
strlen32       2.570000 1.440000 1.430000 2.660000
strlen128       1.260000 0.910000 0.850000 1.240000
memcmp4         3.960000 4.040000 5.160000 2.840000
memcmp8         2.020000 2.060000 3.000000 1.890000
memcmp32       0.770000 0.720000 1.350000 0.980000
memcmp128       0.260000 0.240000 0.540000 0.430000
strcmp4         2.740000 2.750000 6.790000 2.910000
strcmp8         1.410000 1.410000 3.600000 1.620000
strcmp32       0.630000 0.580000 1.260000 0.700000
strcmp128       0.200000 0.180000 0.620000 0.230000
strncmp4       3.080000 2.720000 7.180000 3.540000
strncmp8       1.580000 1.440000 3.940000 1.880000
strncmp32       0.720000 0.670000 1.310000 0.840000
strncmp128     0.240000 0.220000 0.550000 0.280000


model name : Intel(R) Xeon(R) CPU           X3220  @ 2.40GHz
cache size : 4096 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm

func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4         3.870000 3.050000 3.270000 3.870000
strlen8         2.370000 1.530000 1.640000 3.450000
strlen32       1.040000 0.480000 0.470000 1.520000
strlen128       0.600000 0.290000 0.280000 0.680000
memcmp4         2.080000 2.260000 2.680000 1.800000
memcmp8         1.040000 1.130000 1.460000 1.860000
memcmp32       0.270000 0.270000 0.350000 0.770000
memcmp128       0.070000 0.070000 0.090000 0.190000
strcmp4         1.910000 1.910000 3.480000 1.920000
strcmp8         0.960000 0.950000 1.200000 0.960000
strcmp32       0.240000 0.240000 0.290000 0.240000
strcmp128       0.060000 0.060000 0.080000 0.060000
strncmp4       2.030000 1.690000 4.240000 2.810000
strncmp8       1.020000 0.850000 1.610000 1.410000
strncmp32       0.260000 0.210000 0.380000 0.360000
strncmp128     0.070000 0.060000 0.100000 0.080000


model name : Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
cache size : 6144 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority

func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4         3.090000 2.960000 2.750000 3.450000
strlen8         1.890000 1.230000 1.360000 3.140000
strlen32       0.810000 0.370000 0.340000 1.220000
strlen128       0.460000 0.220000 0.200000 0.660000
memcmp4         2.160000 1.820000 2.500000 1.800000
memcmp8         1.100000 0.910000 1.500000 1.170000
memcmp32       0.310000 0.220000 0.320000 0.380000
memcmp128       0.090000 0.060000 0.090000 0.110000
strcmp4         1.860000 1.910000 3.530000 1.570000
strcmp8         0.960000 0.960000 1.170000 0.840000
strcmp32       0.280000 0.250000 0.300000 0.270000
strcmp128       0.050000 0.050000 0.090000 0.070000
strncmp4       1.740000 1.750000 3.790000 2.840000
strncmp8       0.940000 0.850000 1.380000 1.380000
strncmp32       0.220000 0.220000 0.320000 0.400000
strncmp128     0.050000 0.050000 0.090000 0.080000


model name : Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
cache size : 8192 KB
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida

func,size 2.9-vanilla 2.10.1-vanilla 2.11-vanilla 2.11-amd
strlen4         3.440000 3.500000 2.780000 3.320000
strlen8         2.260000 1.750000 1.440000 2.220000
strlen32       0.850000 0.500000 0.380000 0.900000
strlen128       0.470000 0.260000 0.200000 0.500000
memcmp4         2.180000 2.060000 2.500000 1.840000
memcmp8         1.100000 1.050000 1.320000 1.060000
memcmp32       0.270000 0.260000 0.350000 0.330000
memcmp128       0.080000 0.070000 0.090000 0.090000
strcmp4         1.660000 1.930000 2.250000 1.640000
strcmp8         0.830000 0.970000 1.140000 0.840000
strcmp32       0.210000 0.240000 0.240000 0.210000
strcmp128       0.050000 0.070000 0.080000 0.060000
strncmp4       1.740000 1.830000 2.490000 2.570000
strncmp8       0.870000 0.920000 1.220000 1.300000
strncmp32       0.220000 0.230000 0.260000 0.320000
strncmp128     0.050000 0.050000 0.090000 0.080000


  * numbers after function names indicate string sizes
  ** 2.11-amd is a very old AMD-provided patch of the x86_64 string
routines (it doesn't implement some of the new things, like bounded
pointer checks support) that we still use in SUSE glibc:

        http://pasky.or.cz/~pasky/dev/glibc/amd64-string-2.11.diff

(If the regression against 2.10.1 is fixed, it is probably not very
interesting; it performs better only at very short memcmp()s.)

  *** I can't seem to find newer AMD processors to test on right now,
sorry. If you have any, feel free to run the benchmark there - just
get the /strbench/ directory and run `./strbench.sh outfile`.
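
  To put a number on the regression, the slowdown of 2.11 against 2.9
can be computed directly from the tables above. The following is only a
small sketch: it assumes that larger table values mean more time per
call (which is how the tables are read in this message), the rows are
copied from the AMD Opteron 848 table, and the names `struct row`,
`opteron848` and `slowdown` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the benchmark tables: values for glibc 2.9 and 2.11. */
struct row {
    const char *name;
    double t_2_9;   /* 2.9-vanilla column  */
    double t_2_11;  /* 2.11-vanilla column */
};

/* Values copied from the AMD Opteron 848 table. */
const struct row opteron848[] = {
    { "memcmp4",   3.35, 4.40 },
    { "strcmp4",   2.40, 5.62 },
    { "strncmp4",  2.56, 5.88 },
    { "strcmp128", 0.35, 1.01 },
};

/* Slowdown factor of 2.11 relative to 2.9. A result above 1.0 means
 * 2.11 is slower, ASSUMING larger table values mean more time. */
double slowdown(const struct row *r)
{
    assert(r != NULL && r->t_2_9 > 0.0);
    return r->t_2_11 / r->t_2_9;
}
```

For example, `slowdown(&opteron848[1])` divides 5.62 by 2.40, so under
the assumption above strcmp4 takes a bit more than twice as long with
2.11 on that machine.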

  Kind regards,

--
                                Petr "Pasky" Baudis
A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth

Re: Massive performance regression of glibc string functions

by H.J. Lu-30

I am using the rdtsc timing in glibc string tests. Here is strlen data on

Intel(R) Xeon(R) CPU           X3350  @ 2.66GHz

Columns: strlen_2_11, builtin_strlen, strlen in glibc 2.9
LAT: Pos    1, alignment  0: 8 16 16
LAT: Pos    2, alignment  0: 8 24 16
LAT: Pos    3, alignment  0: 8 24 16
LAT: Pos    4, alignment  0: 8 24 16
LAT: Pos    5, alignment  0: 8 24 16
LAT: Pos    6, alignment  0: 8 24 24
LAT: Pos    7, alignment  0: 8 24 16
LAT: Pos    1, alignment  1: 8 16 8
LAT: Pos    2, alignment  2: 8 24 16
LAT: Pos    3, alignment  3: 8 24 16
LAT: Pos    4, alignment  4: 8 32 24
LAT: Pos    5, alignment  5: 8 32 24
LAT: Pos    6, alignment  6: 16 32 24
LAT: Pos    7, alignment  7: 16 32 24
LAT: Pos    4, alignment  0: 8 24 16
LAT: Pos    4, alignment  1: 16 24 16
LAT: Pos    8, alignment  0: 8 24 16
LAT: Pos    8, alignment  1: 8 40 32
LAT: Pos   16, alignment  0: 16 24 24
LAT: Pos   16, alignment  1: 16 40 32
LAT: Pos   32, alignment  0: 16 32 24
LAT: Pos   32, alignment  1: 16 48 40
LAT: Pos   64, alignment  0: 24 40 40
LAT: Pos   64, alignment  1: 24 56 56
LAT: Pos  128, alignment  0: 32 64 64
LAT: Pos  128, alignment  1: 32 80 80
LAT: Pos  256, alignment  0: 56 136 128
LAT: Pos  256, alignment  1: 56 152 136
LAT: Pos  512, alignment  0: 96 264 256
LAT: Pos  512, alignment  1: 96 272 264
LAT: Pos 1024, alignment  0: 224 512 504
LAT: Pos 1024, alignment  1: 224 528 520
LAT: Pos    1, alignment  0: 8 16 16
LAT: Pos    2, alignment  0: 8 24 16
LAT: Pos    3, alignment  0: 8 24 16
LAT: Pos    4, alignment  0: 8 24 16
LAT: Pos    5, alignment  0: 8 24 16
LAT: Pos    6, alignment  0: 8 24 24
LAT: Pos    7, alignment  0: 8 24 16
LAT: Pos    1, alignment  1: 16 16 8
LAT: Pos    2, alignment  2: 8 24 16
LAT: Pos    3, alignment  3: 8 24 16
LAT: Pos    4, alignment  4: 8 32 24
LAT: Pos    5, alignment  5: 16 32 24
LAT: Pos    6, alignment  6: 8 32 24
LAT: Pos    7, alignment  7: 16 32 24
LAT: Pos    4, alignment  0: 8 24 16
LAT: Pos    4, alignment  1: 8 24 16
LAT: Pos    8, alignment  0: 8 24 16
LAT: Pos    8, alignment  1: 8 40 32
LAT: Pos   16, alignment  0: 16 24 24
LAT: Pos   16, alignment  1: 16 40 32
LAT: Pos   32, alignment  0: 16 32 24
LAT: Pos   32, alignment  1: 16 48 40
LAT: Pos   64, alignment  0: 24 40 40
LAT: Pos   64, alignment  1: 24 56 56
LAT: Pos  128, alignment  0: 32 64 64
LAT: Pos  128, alignment  1: 32 80 80
LAT: Pos  256, alignment  0: 56 136 128
LAT: Pos  256, alignment  1: 56 152 136
LAT: Pos  512, alignment  0: 96 264 256
LAT: Pos  512, alignment  1: 96 272 264
LAT: Pos 1024, alignment  0: 224 512 504
LAT: Pos 1024, alignment  1: 224 528 520
LAT: Pos    0, alignment  0: 8 16 16
LAT: Pos    1, alignment  0: 8 16 16
LAT: Pos    1, alignment  1: 8 16 8
LAT: Pos    2, alignment  0: 8 24 16
LAT: Pos    2, alignment  1: 16 24 8
LAT: Pos    2, alignment  2: 8 24 16
LAT: Pos    3, alignment  0: 8 24 16
LAT: Pos    3, alignment  1: 8 24 16
LAT: Pos    3, alignment  2: 16 24 16
LAT: Pos    3, alignment  3: 16 24 16
LAT: Pos    4, alignment  0: 8 24 16
LAT: Pos    4, alignment  1: 8 24 16
LAT: Pos    4, alignment  2: 16 24 16
LAT: Pos    4, alignment  3: 8 24 16
LAT: Pos    4, alignment  4: 16 32 24
LAT: Pos    5, alignment  0: 8 24 16
LAT: Pos    5, alignment  1: 8 32 24
LAT: Pos    5, alignment  2: 16 32 24
LAT: Pos    5, alignment  3: 16 32 24
LAT: Pos    5, alignment  4: 16 32 24
LAT: Pos    5, alignment  5: 8 32 24
LAT: Pos    6, alignment  0: 8 24 24
LAT: Pos    6, alignment  1: 16 32 24
LAT: Pos    6, alignment  2: 16 32 24
LAT: Pos    6, alignment  3: 8 32 24
LAT: Pos    6, alignment  4: 16 32 24
LAT: Pos    6, alignment  5: 16 32 24
LAT: Pos    6, alignment  6: 16 32 24
LAT: Pos    7, alignment  0: 8 24 16
LAT: Pos    7, alignment  1: 8 40 32
LAT: Pos    7, alignment  2: 16 32 32
LAT: Pos    7, alignment  3: 16 32 24
LAT: Pos    7, alignment  4: 8 32 24
LAT: Pos    7, alignment  5: 16 32 24
LAT: Pos    7, alignment  6: 8 32 24
LAT: Pos    7, alignment  7: 16 32 24
LAT: Pos    8, alignment  0: 8 24 16
LAT: Pos    8, alignment  1: 8 40 32
LAT: Pos    8, alignment  2: 16 32 32
LAT: Pos    8, alignment  3: 16 32 24
LAT: Pos    8, alignment  4: 8 32 32
LAT: Pos    8, alignment  5: 8 32 24
LAT: Pos    8, alignment  6: 8 32 24
LAT: Pos    8, alignment  7: 16 24 24
LAT: Pos    8, alignment  8: 16 24 16
LAT: Pos    9, alignment  0: 8 24 16
LAT: Pos    9, alignment  1: 16 40 32
LAT: Pos    9, alignment  2: 8 40 32
LAT: Pos    9, alignment  3: 16 32 24
LAT: Pos    9, alignment  4: 8 32 32
LAT: Pos    9, alignment  5: 16 32 24
LAT: Pos    9, alignment  6: 8 32 24
LAT: Pos    9, alignment  7: 16 24 16
LAT: Pos    9, alignment  8: 16 24 16
LAT: Pos    9, alignment  9: 8 40 32
LAT: Pos   10, alignment  0: 8 24 16
LAT: Pos   10, alignment  1: 16 40 32
LAT: Pos   10, alignment  2: 8 40 32
LAT: Pos   10, alignment  3: 16 40 32
LAT: Pos   10, alignment  4: 16 32 32
LAT: Pos   10, alignment  5: 8 32 24
LAT: Pos   10, alignment  6: 16 32 16
LAT: Pos   10, alignment  7: 16 24 24
LAT: Pos   10, alignment  8: 16 24 16
LAT: Pos   10, alignment  9: 16 40 32
LAT: Pos   10, alignment 10: 16 40 32
LAT: Pos   11, alignment  0: 8 24 16
LAT: Pos   11, alignment  1: 8 40 32
LAT: Pos   11, alignment  2: 8 40 32
LAT: Pos   11, alignment  3: 8 40 32
LAT: Pos   11, alignment  4: 8 32 32
LAT: Pos   11, alignment  5: 16 32 24
LAT: Pos   11, alignment  6: 16 32 24
LAT: Pos   11, alignment  7: 16 24 24
LAT: Pos   11, alignment  8: 16 24 16
LAT: Pos   11, alignment  9: 16 40 32
LAT: Pos   11, alignment 10: 16 40 32
LAT: Pos   11, alignment 11: 16 40 32
LAT: Pos   12, alignment  0: 8 24 16
LAT: Pos   12, alignment  1: 8 40 32
LAT: Pos   12, alignment  2: 8 40 32
LAT: Pos   12, alignment  3: 8 40 32
LAT: Pos   12, alignment  4: 16 32 32
LAT: Pos   12, alignment  5: 16 32 24
LAT: Pos   12, alignment  6: 16 32 24
LAT: Pos   12, alignment  7: 16 24 24
LAT: Pos   12, alignment  8: 16 24 16
LAT: Pos   12, alignment  9: 16 40 40
LAT: Pos   12, alignment 10: 16 40 32
LAT: Pos   12, alignment 11: 16 40 32
LAT: Pos   12, alignment 12: 16 32 32
LAT: Pos   13, alignment  0: 8 24 24
LAT: Pos   13, alignment  1: 8 40 40
LAT: Pos   13, alignment  2: 8 40 32
LAT: Pos   13, alignment  3: 16 32 32
LAT: Pos   13, alignment  4: 16 32 32
LAT: Pos   13, alignment  5: 16 32 24
LAT: Pos   13, alignment  6: 16 32 24
LAT: Pos   13, alignment  7: 16 24 24
LAT: Pos   13, alignment  8: 16 24 16
LAT: Pos   13, alignment  9: 16 40 40
LAT: Pos   13, alignment 10: 16 40 32
LAT: Pos   13, alignment 11: 16 32 32
LAT: Pos   13, alignment 12: 8 32 32
LAT: Pos   13, alignment 13: 16 32 24
LAT: Pos   14, alignment  0: 8 24 24
LAT: Pos   14, alignment  1: 16 40 32
LAT: Pos   14, alignment  2: 16 40 32
LAT: Pos   14, alignment  3: 16 32 32
LAT: Pos   14, alignment  4: 16 32 32
LAT: Pos   14, alignment  5: 16 32 24
LAT: Pos   14, alignment  6: 16 32 24
LAT: Pos   14, alignment  7: 16 32 24
LAT: Pos   14, alignment  8: 16 32 24
LAT: Pos   14, alignment  9: 16 40 32
LAT: Pos   14, alignment 10: 16 40 32
LAT: Pos   14, alignment 11: 16 40 32
LAT: Pos   14, alignment 12: 16 32 32
LAT: Pos   14, alignment 13: 16 32 24
LAT: Pos   14, alignment 14: 16 32 24
LAT: Pos   15, alignment  0: 8 24 24
LAT: Pos   15, alignment  1: 16 40 32
LAT: Pos   15, alignment  2: 16 40 32
LAT: Pos   15, alignment  3: 16 40 32
LAT: Pos   15, alignment  4: 16 32 32
LAT: Pos   15, alignment  5: 16 32 32
LAT: Pos   15, alignment  6: 16 32 32
LAT: Pos   15, alignment  7: 16 24 24
LAT: Pos   15, alignment  8: 16 24 24
LAT: Pos   15, alignment  9: 16 40 32
LAT: Pos   15, alignment 10: 16 40 32
LAT: Pos   15, alignment 11: 16 40 32
LAT: Pos   15, alignment 12: 8 32 32
LAT: Pos   15, alignment 13: 16 32 32
LAT: Pos   15, alignment 14: 16 32 32
LAT: Pos   15, alignment 15: 16 32 24

Data on memcmp and strcmp show similar results. The new ones
in glibc 2.11 are much better than the old ones in glibc 2.9.

If you believe there is a regression, please provide length as well
as alignments on input data. I will take a look.
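
For reference, an rdtsc-style latency measurement for one (position,
alignment) pair could look roughly like the sketch below. This is not
the glibc test harness; `read_ticks` and `strlen_latency` are
hypothetical names, taking the minimum over several repetitions is an
assumption about how to filter noise, and the non-x86 fallback counts
nanoseconds instead of TSC ticks.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime in the fallback */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the CPU time-stamp counter. */
uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: nanoseconds from a monotonic clock. */
uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Smallest tick count seen over `reps` strlen calls on a string of
 * length `pos` that starts `align` bytes into a 64-byte-aligned buffer. */
uint64_t strlen_latency(size_t pos, size_t align, int reps)
{
    static _Alignas(64) char buf[4096];
    /* Calling through a volatile pointer forces a real library call
     * instead of the compiler's builtin strlen. */
    size_t (*volatile call_strlen)(const char *) = strlen;
    volatile size_t sink = 0;
    uint64_t best = UINT64_MAX;
    char *s = buf + align;

    assert(pos + align < sizeof buf && reps > 0);
    memset(s, 'a', pos);
    s[pos] = '\0';

    for (int i = 0; i < reps; i++) {
        uint64_t t0 = read_ticks();
        sink = call_strlen(s);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    (void)sink;
    return best;
}
```

A call such as `strlen_latency(16, 1, 100)` would correspond to a
"Pos 16, alignment 1" row, although the absolute numbers depend on the
machine and on the overhead of reading the counter.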

Thanks.


H.J.
----
On Fri, Nov 6, 2009 at 6:04 AM, Petr Baudis <pasky@...> wrote:

> ..snip..

Re: Massive performance regression of glibc string functions

by Petr Baudis-2

On Fri, Nov 06, 2009 at 10:20:41AM -0700, H.J. Lu wrote:
> I am using the rdtsc timing in glibc string tests. Here is strlen data on
>
> Intel(R) Xeon(R) CPU           X3350  @ 2.66GHz
..snip..
>
> Data on memcmp and strcmp show similar results. The new ones
> in glibc 2.11 are much better than the old ones in glibc 2.9.

I think the one you have shown exactly matches my findings - I also
think strlen() in glibc-2.11 is much better than in glibc-2.9 (except on
AMD and very small strings). But that is the only one of these I tested;
could you please post the same numbers for e.g. memcmp()?

> If you believe there is a regression, please provide length as well
> as alignments on input data. I will take a look.

The lengths are the numbers after the function names, i.e. I'm testing
with 4, 8, 32 and 128. All the values are 8-aligned; I can test
misaligned strings too if you think 2.11 will do better there.
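
To make the alignment question concrete, here is a minimal sketch of
how a test string with a chosen offset from an 8-byte boundary could be
produced. It is not taken from strbench; `make_test_string` is a
hypothetical helper, and offset 0 corresponds to the 8-aligned case
described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return a NUL-terminated string of `len` bytes whose start address is
 * exactly `offset` bytes past an 8-byte boundary (0 <= offset < 8).
 * `*base` receives the pointer that must later be passed to free(). */
char *make_test_string(size_t len, size_t offset, void **base)
{
    assert(offset < 8 && base != NULL);

    /* Over-allocate: up to 7 bytes to reach the next 8-byte boundary,
     * plus the requested offset, plus the string and its terminator. */
    char *raw = malloc(len + 1 + 8 + offset);
    if (raw == NULL)
        return NULL;

    /* Round the raw address up to a multiple of 8, then shift it. */
    uintptr_t aligned = ((uintptr_t)raw + 7) & ~(uintptr_t)7;
    char *s = (char *)aligned + offset;

    memset(s, 'x', len);
    s[len] = '\0';
    *base = raw;
    return s;
}
```

With this, `make_test_string(32, 0, &base)` gives the 8-aligned case
and `make_test_string(32, 3, &base)` a misaligned one; the same lengths
(4, 8, 32, 128) could then be rerun for each offset.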

--
                                Petr "Pasky" Baudis

Re: Massive performance regression of glibc string functions

by H.J. Lu-30

On Sat, Nov 7, 2009 at 12:58 AM, Petr Baudis <pasky@...> wrote:

> On Fri, Nov 06, 2009 at 10:20:41AM -0700, H.J. Lu wrote:
>> I am using the rdtsc timing in glibc string tests. Here is strlen data on
>>
>> Intel(R) Xeon(R) CPU           X3350  @ 2.66GHz
> ..snip..
>>
>> Data on memcmp and strcmp show similar results. The new ones
>> in glibc 2.11 are much better than the old ones in glibc 2.9.
>
> I think the one you have shown exactly matches my findings - I also
> think strlen() in glibc-2.11 is much better than in glibc-2.9 (except on
> AMD and very small strings). But that is the only one of these I tested;
> could you please post the same numbers for e.g. memcmp()?

                               memcmp_2_11     memcmp 2.9
LAT: Len    1, alignment 13/13: 8       16
LAT: Len    1, alignment 13/13: 8       16
LAT: Len    1, alignment 13/13: 8       16
LAT: Len    2, alignment 12/12: 16      24
LAT: Len    2, alignment 12/12: 16      24
LAT: Len    2, alignment 12/12: 16      24
LAT: Len    3, alignment 10/10: 16      24
LAT: Len    3, alignment 10/10: 24      24
LAT: Len    3, alignment 10/10: 24      24
LAT: Len    4, alignment  8/ 8: 16      24
LAT: Len    4, alignment  8/ 8: 16      24
LAT: Len    4, alignment  8/ 8: 16      24
LAT: Len    5, alignment  6/ 6: 16      32
LAT: Len    5, alignment  6/ 6: 24      24
LAT: Len    5, alignment  6/ 6: 24      24
LAT: Len    6, alignment  4/ 4: 16      32
LAT: Len    6, alignment  4/ 4: 24      32
LAT: Len    6, alignment  4/ 4: 24      32
LAT: Len    7, alignment  2/ 2: 16      32
LAT: Len    7, alignment  2/ 2: 24      32
LAT: Len    7, alignment  2/ 2: 24      32
LAT: Len    8, alignment  0/ 0: 16      40
LAT: Len    8, alignment  0/ 0: 24      32
LAT: Len    8, alignment  0/ 0: 24      32
LAT: Len    9, alignment 14/14: 16      56
LAT: Len    9, alignment 14/14: 24      32
LAT: Len    9, alignment 14/14: 24      32
LAT: Len   10, alignment 12/12: 16      40
LAT: Len   10, alignment 12/12: 24      40
LAT: Len   10, alignment 12/12: 24      40
LAT: Len   11, alignment 10/10: 24      48
LAT: Len   11, alignment 10/10: 24      40
LAT: Len   11, alignment 10/10: 24      40
LAT: Len   12, alignment  8/ 8: 16      48
LAT: Len   12, alignment  8/ 8: 24      40
LAT: Len   12, alignment  8/ 8: 24      40
LAT: Len   13, alignment  6/ 6: 24      48
LAT: Len   13, alignment  6/ 6: 24      40
LAT: Len   13, alignment  6/ 6: 24      40
LAT: Len   14, alignment  4/ 4: 24      56
LAT: Len   14, alignment  4/ 4: 24      48
LAT: Len   14, alignment  4/ 4: 24      48
LAT: Len   15, alignment  2/ 2: 24      56
LAT: Len   15, alignment  2/ 2: 24      48
LAT: Len   15, alignment  2/ 2: 24      48
LAT: Len    1, alignment  0/ 0: 8       16
LAT: Len    1, alignment  0/ 0: 8       16
LAT: Len    1, alignment  0/ 0: 8       16
LAT: Len    2, alignment  0/ 0: 16      24
LAT: Len    2, alignment  0/ 0: 16      24
LAT: Len    2, alignment  0/ 0: 16      24
LAT: Len    3, alignment  0/ 0: 16      24
LAT: Len    3, alignment  0/ 0: 24      24
LAT: Len    3, alignment  0/ 0: 24      24
LAT: Len    4, alignment  0/ 0: 16      24
LAT: Len    4, alignment  0/ 0: 16      24
LAT: Len    4, alignment  0/ 0: 16      24
LAT: Len    5, alignment  0/ 0: 16      32
LAT: Len    5, alignment  0/ 0: 24      24
LAT: Len    5, alignment  0/ 0: 24      24
LAT: Len    6, alignment  0/ 0: 16      32
LAT: Len    6, alignment  0/ 0: 24      32
LAT: Len    6, alignment  0/ 0: 24      32
LAT: Len    7, alignment  0/ 0: 16      32
LAT: Len    7, alignment  0/ 0: 24      32
LAT: Len    7, alignment  0/ 0: 24      32
LAT: Len    8, alignment  0/ 0: 16      40
LAT: Len    8, alignment  0/ 0: 24      32
LAT: Len    8, alignment  0/ 0: 24      32
LAT: Len    9, alignment  0/ 0: 16      56
LAT: Len    9, alignment  0/ 0: 24      32
LAT: Len    9, alignment  0/ 0: 24      32
LAT: Len   10, alignment  0/ 0: 16      40
LAT: Len   10, alignment  0/ 0: 24      40
LAT: Len   10, alignment  0/ 0: 24      40
LAT: Len   11, alignment  0/ 0: 24      48
LAT: Len   11, alignment  0/ 0: 24      40
LAT: Len   11, alignment  0/ 0: 24      40
LAT: Len   12, alignment  0/ 0: 16      48
LAT: Len   12, alignment  0/ 0: 24      40
LAT: Len   12, alignment  0/ 0: 24      40
LAT: Len   13, alignment  0/ 0: 24      48
LAT: Len   13, alignment  0/ 0: 24      40
LAT: Len   13, alignment  0/ 0: 24      40
LAT: Len   14, alignment  0/ 0: 24      56
LAT: Len   14, alignment  0/ 0: 24      48
LAT: Len   14, alignment  0/ 0: 24      48
LAT: Len   15, alignment  0/ 0: 24      56
LAT: Len   15, alignment  0/ 0: 24      48
LAT: Len   15, alignment  0/ 0: 24      48
LAT: Len    4, alignment  0/ 0: 16      24
LAT: Len    4, alignment  0/ 0: 16      24
LAT: Len    4, alignment  0/ 0: 16      24
LAT: Len   32, alignment  0/ 0: 32      32
LAT: Len   32, alignment 13/14: 40      64
LAT: Len   32, alignment  0/ 0: 32      64
LAT: Len   32, alignment  0/ 0: 32      64
LAT: Len    8, alignment  0/ 0: 16      40
LAT: Len    8, alignment  0/ 0: 24      32
LAT: Len    8, alignment  0/ 0: 24      32
LAT: Len   64, alignment  0/ 0: 40      40
LAT: Len   64, alignment 14/12: 112     88
LAT: Len   64, alignment  0/ 0: 32      72
LAT: Len   64, alignment  0/ 0: 32      104
LAT: Len   16, alignment  0/ 0: 24      32
LAT: Len   16, alignment  0/ 0: 24      56
LAT: Len   16, alignment  0/ 0: 24      56
LAT: Len  128, alignment  0/ 0: 48      56
LAT: Len  128, alignment 14/12: 144     120
LAT: Len  128, alignment  0/ 0: 40      88
LAT: Len  128, alignment  0/ 0: 40      88

>> If you believe there is a regression, please provide length as well
>> as alignments on input data. I will take a look.
>
> The lengths are the numbers after function names - i.e. I'm testing with
> 4, 8, 32 and 128. All the values are 8-aligned, I can test misaligned
> strings too if you think 2.11 will do better there.
>

Your test compares timings of 2 implementations in 2 C libraries on
2 sets of random data. You should compare 2 implementations on the
same set of data linked against the same C library.
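
A rough sketch of that setup: both implementations are built into one
program, called through the same function pointer type, and fed
identical, deterministically generated buffers. Everything named here
is hypothetical; `memcmp_bytewise` is only a stand-in for a second
implementation (in a real comparison it would be the other routine
built under a different name), and `clock()` is used for simplicity
rather than rdtsc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef int (*memcmp_fn)(const void *, const void *, size_t);

/* Stand-in for a second implementation: plain byte-by-byte compare. */
int memcmp_bytewise(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;
    for (size_t i = 0; i < n; i++)
        if (p[i] != q[i])
            return p[i] < q[i] ? -1 : 1;
    return 0;
}

/* Reduce a comparison result to -1, 0 or 1. */
int sign(int v) { return (v > 0) - (v < 0); }

/* Fill both buffers from a fixed seed so that every implementation
 * sees exactly the same input; only the last byte differs. */
void fill_inputs(unsigned char *a, unsigned char *b, size_t n, unsigned seed)
{
    for (size_t i = 0; i < n; i++) {
        seed = seed * 1103515245u + 12345u;   /* simple LCG */
        a[i] = b[i] = (unsigned char)(seed >> 16);
    }
    if (n > 0)
        b[n - 1] ^= 1;
}

/* Check that f and g agree on the same data for lengths 0..max_len;
 * returns the number of lengths where their result signs differ. */
int compare_on_same_data(memcmp_fn f, memcmp_fn g, size_t max_len)
{
    unsigned char a[256], b[256];
    int mismatches = 0;

    assert(max_len <= sizeof a);
    for (size_t n = 0; n <= max_len; n++) {
        fill_inputs(a, b, n, 42u + (unsigned)n);
        if (sign(f(a, b, n)) != sign(g(a, b, n)))
            mismatches++;
    }
    return mismatches;
}

/* CPU seconds spent in `iters` calls of f on one fixed input pair. */
double time_calls(memcmp_fn f, const void *a, const void *b,
                  size_t n, long iters)
{
    volatile int sink = 0;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        sink = f(a, b, n);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_calls(memcmp, a, b, n, iters)` and
`time_calls(memcmp_bytewise, a, b, n, iters)` on the same `a` and `b`
then gives two timings that differ only in the implementation under
test.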


--
H.J.