Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

View: New views
3 Messages — Rating Filter:   Alert me  

Parent Message unknown Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Corinna Vinschen on 11/6/2009 6:51 AM:

>> The problem *is* with grep (and sed), however, because there is no
>> good reason that UTF-8 should give us a penalty of being 100times
>> slower on most search operations, this is just poor programming of
>> grep and sed.
>
> The penalty on Linux is much smaller, about 15-20%.  It looks like
> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> Then it evaluates for each byte in the line whether the byte is a
> single byte or the start of a multibyte sequence using mbrtowc on
> every charatcer on the input line.  Then, for each potential match,
> it checks if it's the start byte of a multibyte sequence and ignores
> all other matches.  Eventually, it calls free, and the game starts
> over for the next line.

Adding bug-grep, since this slowdown caused by additional mallocs is
definitely the sign of a poor algorithm that could be improved by reusing
existing buffers.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkr0KxUACgkQ84KuGfSFAYCOCACgvjz2v65vK8DIcGg6zfnLQgcT
tfQAmwbpWbriBJSv0rjYobYgsh4KXOiZ
=B3nZ
-----END PGP SIGNATURE-----



Parent Message unknown Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Corinna Vinschen wrote:

> On Nov  6 06:56, Eric Blake wrote:
>  
>> According to Corinna Vinschen on 11/6/2009 6:51 AM:
>>    
>>>> The problem *is* with grep (and sed), however, because there is no
>>>> good reason that UTF-8 should give us a penalty of being 100times
>>>> slower on most search operations, this is just poor programming of
>>>> grep and sed.
>>>>        
>>> The penalty on Linux is much smaller, about 15-20%.  
On my Linux system, the penalty is a factor between 6 and 8.

>>> It looks like
>>> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
>>> Then it evaluates for each byte in the line whether the byte is a
>>> single byte or the start of a multibyte sequence using mbrtowc on
>>> every charatcer on the input line.  Then, for each potential match,
>>> it checks if it's the start byte of a multibyte sequence and ignores
>>> all other matches.  Eventually, it calls free, and the game starts
>>> over for the next line.
>>>      
>> Adding bug-grep, since this slowdown caused by additional mallocs is
>> definitely the sign of a poor algorithm that could be improved by reusing
>> existing buffers.
>>    
>
> I created a simple testcase:
>
> ==== SNIP ===
> ...
> ==== SNAP ====
>  
I extended your test program to demonstrate the inefficiency of the
standard mbrtowc function. Instead I use a function from my editor
(mined) to extract a Unicode character from a UTF-8 sequence. This is
the simple case only, not converting character sets other than UTF-8 but
that's the same thing mbrtowc does in the sample invocation. Program
attached. Results below.

> Under Cygwin (tcsh time output):
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
>
> Running on the same CPU under Linux:
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
>
> So, while Linux is definitely faster, the number are still comparable
> for 1 million iterations.  That still doens't explain why grep is a
> multitude slower when using UTF-8 as charset.
>  
Results of mbrtowc vs. utftouni on Linux:

thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
UTF-8
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m2.897s
user    0m2.836s
sys     0m0.012s
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.030s
user    0m0.028s
sys     0m0.000s
thw[en_US.UTF-8]@scotty:~/tmp:


Results of mbrtowc vs. utftouni on cygwin:

demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m3.034s
user    0m2.974s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.110s
user    0m0.070s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools:


The conclusion is, as long as calling mbrtowc is as inefficient, a
program caring about performance should not use it.

Thomas

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>

void utf8_info (u, length, ucs)
  char * u;
  int * length;
  unsigned long * ucs;
{
  char * textpoi = u;
  unsigned char c = * textpoi;
  int utfcount;
  unsigned long unichar;

        if ((c & 0x80) == 0x00) {
                utfcount = 1;
                unichar = c;
        } else if ((c & 0xE0) == 0xC0) {
                utfcount = 2;
                unichar = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
                utfcount = 3;
                unichar = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
                utfcount = 4;
                unichar = c & 0x07;
        } else if ((c & 0xFC) == 0xF8) {
                utfcount = 5;
                unichar = c & 0x03;
        } else if ((c & 0xFE) == 0xFC) {
                utfcount = 6;
                unichar = c & 0x01;
        } else if (c == 0xFE) {
                /* illegal UTF-8 code */
                utfcount = 1;
                unichar = 0;
        } else if (c == 0xFF) {
                /* illegal UTF-8 code */
                utfcount = 1;
                unichar = 0;
        } else {
                /* illegal UTF-8 sequence character */
                utfcount = 1;
                unichar = 0;
        }

        * length = utfcount;

        utfcount --;
        textpoi ++;
        while (utfcount > 0 && (* textpoi & 0xC0) == 0x80) {
                unichar = (unichar << 6) | (* textpoi & 0x3F);
                utfcount --;
                textpoi ++;
        }
        if (utfcount > 0) {
                /* too short UTF-8 sequence */
                unichar = 0;
                * length -= utfcount;
        }

        * ucs = unichar;
}

void utftouni (wchar_t * wpoi, char * s)
{
  unsigned long c;
  int l;
  int i = 0;

  while (* s) {
        utf8_info (s, & l, & wpoi [i++]);
        s += l;
  }
}

int main (int argc, char **argv)
{
  char in[] = "The quick brown fox jumps over the lazy dog";
  int line, i;
  mbstate_t mbs;
  size_t mbclen;
  size_t size = sizeof (in);
  wchar_t wc;
  int lines = argc > 1 ? atoi (argv[1]) : 1000;
  int do_malloc = 1;
  int do_mbrtowc = 1;
  int do_utftouni = 1;

  if (argc > 2)
    do_malloc = atoi (argv[2]);
  if (argc > 3)
    do_mbrtowc = atoi (argv[3]);
  if (argc > 4)
    do_utftouni = atoi (argv[4]);

  printf ("with malloc: %d, with mbrtowc: %d, with utftouni: %d\n", do_malloc, do_mbrtowc, do_utftouni);

  memset (&mbs, 0, sizeof mbs);
  for (line = 0; line < lines; ++line)
    {
      char *x;
      if (do_malloc) x = malloc (size);
      if (do_mbrtowc)
        for (i = 0; i < size; i += mbclen)
          if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
            break;
      if (do_utftouni)
        utftouni (&wc, in);
      if (do_malloc) free (x);
    }
  return 0;
}

Parent Message unknown Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

[forgot to CC to bug-grep before, so I'm resending this, with one more
comment, and leaving out cygwin-specific parts]

Corinna Vinschen wrote:
> On Nov  6 16:00, Thomas Wolff wrote:
>  
>>> ...
>>>      
>> I extended your test program to demonstrate the inefficiency of the
>> standard mbrtowc function. [...]
>>    
I later had to correct:
> Anyway, corrected results are still by a factor of 3 to 4 in favor of
> my algorithm.
Corinna wrote:
> That's sort of an unfair test.  Your utftouni function doesn't care for
> mbstate, error, and surrogate pair handling.
>  
This is a question of use cases:
* mbstate is needed e.g. if you feed results of read() which possibly
come in arbitrary chunks directly into mbtowc(); it's not needed if you
only transform complete lines of text at once. The stdlib function is a
little bit too generic (and thus complicated, too) for many applications.
* error handling is there, in my function; it's simplified, incorrect
sequences are all mapped to 0 for the test case but they could as well
return an error indication without performance impact.
* surrogate pair handling is only needed if you pass the string from/to
the Windows API. It's not needed for POSIX applications (provided
wchar_t would be sufficiently wide). So if wchar_t can be extended in
the newlib API, it might be useful to have two implementations; one for
applications (w/o surrogates), one for cygwin itself.


[...]

My main point was that, depending on the use case, some applications
would be better off using less generic, optimized functions.
The kind of dogmatic suggestion (as seen in the "locale scene") that
everybody should use the stdlib wide character functions is often
misleading.
grep and sed would certainly be well advised to change that.

Thomas