1.7] BUG - GREP slows to a crawl with large number of matches on a single file

View: New views
20 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 | Next >

1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by aputerguy :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Running grep on a 20MB file with ~100,000 matches takes an incredible almost 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 (on a 2nd machine).

The following cases show how grep under 1.7 grinds to a halt as the number of matches increases.

The data 'testfile' is a plain text file of the acl's of all the 108,000 files on my Windoze computer.

Note since the machines are different, compare relative times across cases rather than the times between the two machines.

Case 1] Zero matches
time grep "sfsdfdsfds" testfile | wc
 0       0       0

Cygwin 1.5
real    0m0.093s
user    0m0.092s
sys     0m0.030s

Cygwin 1.7
real    0m1.353s
user    0m1.342s
sys     0m0.062s

Case 2] One match
time grep ".lesshst" testfile | wc
      1       3      29

Cygwin 1.5 (~same as zero matches)
real    0m0.234s
user    0m0.091s
sys     0m0.061s

Cygwin 1.7 (~same as zero matches)
real    0m1.499s
user    0m1.404s
sys     0m0.046s

Case 3] ~1400 matches

Cygwin 1.5 (~ same as zero matches)
time grep ".bin" testfile | wc
   1439    5661   71067

real    0m0.110s
user    0m0.076s
sys     0m0.077s

Cygwin 1.7 (~6x zero matches case
real    0m7.537s
user    0m7.341s
sys     0m0.045s

Case 4] ~16000 matches
time grep "Documents and Settings" testfile | wc
  15824  131573 1918500

Cygwin 1.5 (~same as zero matches)
real    0m0.437s
user    0m0.092s
sys     0m0.092s

Cygwin 1.7 (~50x zero matches)
real    1m14.491s
user    1m8.904s
sys     0m0.031s


Case 5] ~100,000 matches
time grep "# file" testfile | wc
 106988  510944 7930558

Cygwin 1.5 (~1.5x zero matches)

real    0m0.475s
user    0m0.154s
sys     0m0.201s

Cygwin 1.7 (~350x zero matches)
real    7m51.771s
user    7m16.810s
sys     0m0.062s

Case 6] Test that nothing wrong with file system reads or 'wc'
time cat testfile | wc
 966300 1821815 20426592

Cygwin 1.5 (approx same time as grepping zero matches)
real    0m0.344s
user    0m0.201s
sys     0m0.186s

Cygwin 1.7 (approx same time as grepping zero matches)
real    0m1.662s
user    0m1.373s
sys     0m0.138s



Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Dave Korn-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

aputerguy wrote:

> The data 'testfile' is a plain text file of the acl's of all the 108,000
> files on my Windoze computer.

  So, the "find | xargs" trick worked then did it? :-)

    cheers,
      DaveK

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Christopher Faylor-8 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Nov 05, 2009 at 03:27:07PM -0800, aputerguy wrote:

>
>Running grep on a 20MB file with ~100,000 matches takes an incredible almost
>8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
>(on a 2nd machine).
>
>The following cases show how grep under 1.7 grinds to a halt as the number
>of matches increases.
>
>The data 'testfile' is a plain text file of the acl's of all the 108,000
>files on my Windoze computer.
>
>Note since the machines are different, compare relative times across cases
>rather than the times between the two machines.

We'll need an actual test case if you want us to track it down.

cgf

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Linda Walsh :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

aputerguy wrote:
> Running grep on a 20MB file with ~100,000 matches takes an incredible almost
> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
> (on a 2nd machine).
---
        I've seen nasty behavior with grep that isnt' cygwin
specific.  Try "pcregrep" and see if you have the same issue.

        I found it to be about ~100 times faster under _some_ searches
though 2-3x is more typical. The gnu re-parser isn't real
efficient under some circumstances.  

        If you find a big difference, you might also want to report
it to the bug-grep@... mailing list, but last time I did,
they told me "that's the way it is" due to some posix conformance
thing...

-l

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Christopher Faylor-8 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:

>aputerguy wrote:
>> Running grep on a 20MB file with ~100,000 matches takes an incredible almost
>> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
>> (on a 2nd machine).
>
>I've seen nasty behavior with grep that isnt' cygwin specific.  Try
>"pcregrep" and see if you have the same issue.
>
>I found it to be about ~100 times faster under _some_ searches though
>2-3x is more typical.  The gnu re-parser isn't real efficient under
>some circumstances.
>
>If you find a big difference, you might also want to report it to the
>bug-grep@... mailing list, but last time I did, they told me
>"that's the way it is" due to some posix conformance thing...

The fact that it behaves differently between Cygwin 1.5 and 1.7 would
suggest that this isn't a grep problem.

That's why I asked for a test case.

cgf

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by aputerguy :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

OK. Here is a simple test case:


X=100000
while [ $X -gt 0 ] ; do echo "The quick brown fox jumped over the lazy dog" ; let X=X-1; done  > testfile

time grep dog testfile | wc

Cygwin 1.5:
real    0m0.219s
user    0m0.232s
sys     0m0.045s

Cygwin 1.7:
real    7m46.575s
user    7m14.138s
sys     0m0.076s

While using sed on Cygwin 1.5, I get the reasonable result:
time sed -ne /dog/p testfile | wc

real    0m1.229s
user    0m1.202s
sys     0m0.046s


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Christopher Faylor wrote:

> On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
>  
>> aputerguy wrote:
>>    
>>> Running grep on a 20MB file with ~100,000 matches takes an incredible almost
>>> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
>>> (on a 2nd machine).
>>>      
>> I've seen nasty behavior with grep that isnt' cygwin specific.  Try
>> "pcregrep" and see if you have the same issue.
>>
>> I found it to be about ~100 times faster under _some_ searches though
>> 2-3x is more typical.  The gnu re-parser isn't real efficient under
>> some circumstances.
>>
>> If you find a big difference, you might also want to report it to the
>> bug-grep@... mailing list, but last time I did, they told me
>> "that's the way it is" due to some posix conformance thing...
>>    
>
> The fact that it behaves differently between Cygwin 1.5 and 1.7 would
> suggest that this isn't a grep problem.
>  
This is likely to be triggered by the transition to UTF-8 as a default
charset. The same problem is observed on Linux, with grep as well as
with sed.
That's why I have changed most of my shell scripts to use something like
LC_ALL=C grep or LC_ALL=C sed
where possible. Please try this.

The problem *is* with grep (and sed), however, because there is no good
reason that UTF-8 should give us a penalty of being 100times slower on
most search operations, this is just poor programming of grep and sed.

Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Christopher Faylor-8 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Fri, Nov 06, 2009 at 02:09:59PM +0100, Thomas Wolff wrote:

>Christopher Faylor wrote:
>> On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
>>  
>>> aputerguy wrote:
>>>    
>>>> Running grep on a 20MB file with ~100,000 matches takes an incredible almost
>>>> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
>>>> (on a 2nd machine).
>>>>      
>>> I've seen nasty behavior with grep that isnt' cygwin specific.  Try
>>> "pcregrep" and see if you have the same issue.
>>>
>>> I found it to be about ~100 times faster under _some_ searches though
>>> 2-3x is more typical.  The gnu re-parser isn't real efficient under
>>> some circumstances.
>>>
>>> If you find a big difference, you might also want to report it to the
>>> bug-grep@... mailing list, but last time I did, they told me
>>> "that's the way it is" due to some posix conformance thing...
>>>    
>>
>> The fact that it behaves differently between Cygwin 1.5 and 1.7 would
>> suggest that this isn't a grep problem.
>>  
>This is likely to be triggered by the transition to UTF-8 as a default
>charset. The same problem is observed on Linux, with grep as well as
>with sed.
>That's why I have changed most of my shell scripts to use something like
>LC_ALL=C grep or LC_ALL=C sed
>where possible. Please try this.

Thanks for catching this.  I'll hold off on trying the test case until I
hear a report about running the same test with LC_ALL=C.

cgf

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


RE: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Cooper, Karl (US SSA) :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thomas Wolff wrote:
> Christopher Faylor wrote:
..snip..
>>> aputerguy wrote:
>>>

>>>> Running grep on a 20MB file with ~100,000 matches takes an
>>>> incredible almost 8 minutes under Cygwin 1.7 while taking just 0.2
>>>> seconds under Cygwin 1.5 (on a 2nd machine).
..snip..
>> The fact that it behaves differently between Cygwin 1.5 and 1.7 would
>> suggest that this isn't a grep problem.
>>
> This is likely to be triggered by the transition to UTF-8 as a
> default charset. The same problem is observed on Linux, with grep as
> well as with sed.  
> That's why I have changed most of my shell scripts to use something
> like LC_ALL=C grep or LC_ALL=C sed where possible. Please try this.

I don't have Cygwin 1.5 for comparison, but the testcase provided does show grep using a long time on *my* Cygwin 1.7.
And LC_ALL=C didn't seem to help:

$ time grep dog testfile | wc
 100000  900000 4500000

real    3m28.229s
user    3m26.951s
Sys     0m0.170s

$ LC_ALL=C time grep dog testfile | wc
207.26user 0.06system 3:28.32elapsed 99%CPU (0avgtext+0avtdata 278784maxresident)k
0inputs+0outputs (1091major+0minor)pagefaults 0swaps
 100000  900000 4500000

$ time LC_ALL=C grep dog testfile | wc
 100000  900000 4500000

real    3m24.265s
user    3m24.124s
sys     0m0.202s

(Well . . . Doesn't help very *much*, anyway; a few seconds.)
I don't have the *latest* 1.7:

$ uname -a
CYGWIN_NT-5.1 gldlkcooper 1.7.0(0.214/5/3) 2009-10-03 14:33 i686 Cygwin

Karl Cooper

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Corinna Vinschen-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Nov  6 14:09, Thomas Wolff wrote:

> Christopher Faylor wrote:
> >On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
> >>aputerguy wrote:
> >>>Running grep on a 20MB file with ~100,000 matches takes an incredible almost
> >>>8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5
> >>>(on a 2nd machine).
> >>I've seen nasty behavior with grep that isnt' cygwin specific.  Try
> >>"pcregrep" and see if you have the same issue.
> >>
> >>I found it to be about ~100 times faster under _some_ searches though
> >>2-3x is more typical.  The gnu re-parser isn't real efficient under
> >>some circumstances.
> >>
> >>If you find a big difference, you might also want to report it to the
> >>bug-grep@... mailing list, but last time I did, they told me
> >>"that's the way it is" due to some posix conformance thing...
> >
> >The fact that it behaves differently between Cygwin 1.5 and 1.7 would
> >suggest that this isn't a grep problem.
> This is likely to be triggered by the transition to UTF-8 as a
> default charset. The same problem is observed on Linux, with grep as
> well as with sed.
> That's why I have changed most of my shell scripts to use something like
> LC_ALL=C grep or LC_ALL=C sed
> where possible. Please try this.

Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset
when calling nl_langinfo(CHARSET).

> The problem *is* with grep (and sed), however, because there is no
> good reason that UTF-8 should give us a penalty of being 100times
> slower on most search operations, this is just poor programming of
> grep and sed.

The penalty on Linux is much smaller, about 15-20%.  It looks like
grep is calling malloc for every input line if MB_CUR_MAX is > 1.
Then it evaluates for each byte in the line whether the byte is a
single byte or the start of a multibyte sequence using mbrtowc on
every charatcer on the input line.  Then, for each potential match,
it checks if it's the start byte of a multibyte sequence and ignores
all other matches.  Eventually, it calls free, and the game starts
over for the next line.

It appears that either our malloc is that slow, or the mbrtowc call.
But I can't really believe the latter.  The function should be quite
fast, as far as I can see...


Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Cooper, Karl (US SSA) on 11/6/2009 6:48 AM:
>> like LC_ALL=C grep or LC_ALL=C sed where possible. Please try this.
>
> I don't have Cygwin 1.5 for comparison, but the testcase provided does show grep using a long time on *my* Cygwin 1.7.
> And LC_ALL=C didn't seem to help:

LC_ALL=C still selects UTF-8 charset.  That's probably the culprit.  Try
LC_ALL=C.ASCII, to force the issue of a one-byte charset.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkr0KjAACgkQ84KuGfSFAYCwmgCdHqUtDXEIw01uyjPKW1+lv2Sx
+AAAoNY++kRbZqAWYjpqWndTaFj2/y2F
=7bwZ
-----END PGP SIGNATURE-----

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Corinna Vinschen on 11/6/2009 6:51 AM:

>> The problem *is* with grep (and sed), however, because there is no
>> good reason that UTF-8 should give us a penalty of being 100times
>> slower on most search operations, this is just poor programming of
>> grep and sed.
>
> The penalty on Linux is much smaller, about 15-20%.  It looks like
> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> Then it evaluates for each byte in the line whether the byte is a
> single byte or the start of a multibyte sequence using mbrtowc on
> every charatcer on the input line.  Then, for each potential match,
> it checks if it's the start byte of a multibyte sequence and ignores
> all other matches.  Eventually, it calls free, and the game starts
> over for the next line.

Adding bug-grep, since this slowdown caused by additional mallocs is
definitely the sign of a poor algorithm that could be improved by reusing
existing buffers.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkr0KxUACgkQ84KuGfSFAYCOCACgvjz2v65vK8DIcGg6zfnLQgcT
tfQAmwbpWbriBJSv0rjYobYgsh4KXOiZ
=B3nZ
-----END PGP SIGNATURE-----

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


RE: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Cooper, Karl (US SSA) :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Corinna Vinschen wrote:
> Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset
> when calling nl_langinfo(CHARSET).

Yes, this solves it:

$ time LC_ALL=C.ASCII grep dog testfile | wc
 100000  900000 4500000

real    0m0.359s
user    0m0.279s
sys     0m0.232s

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Corinna Vinschen-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Nov  6 06:56, Eric Blake wrote:

> According to Corinna Vinschen on 11/6/2009 6:51 AM:
> >> The problem *is* with grep (and sed), however, because there is no
> >> good reason that UTF-8 should give us a penalty of being 100times
> >> slower on most search operations, this is just poor programming of
> >> grep and sed.
> >
> > The penalty on Linux is much smaller, about 15-20%.  It looks like
> > grep is calling malloc for every input line if MB_CUR_MAX is > 1.
> > Then it evaluates for each byte in the line whether the byte is a
> > single byte or the start of a multibyte sequence using mbrtowc on
> > every charatcer on the input line.  Then, for each potential match,
> > it checks if it's the start byte of a multibyte sequence and ignores
> > all other matches.  Eventually, it calls free, and the game starts
> > over for the next line.
>
> Adding bug-grep, since this slowdown caused by additional mallocs is
> definitely the sign of a poor algorithm that could be improved by reusing
> existing buffers.

I created a simple testcase:

==== SNIP ===
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char **argv)
{
  const char in[] = "The quick brown fox jumps over the lazy dog";
  int line, i;
  mbstate_t mbs;
  size_t mbclen;
  size_t size = sizeof (in);
  wchar_t wc;
  int lines = argc > 1 ? atoi (argv[1]) : 1000;
  int do_malloc = 1;
  int do_mbrtowc = 1;

  if (argc > 2)
    do_malloc = atoi (argv[2]);
  if (argc > 3)
    do_mbrtowc = atoi (argv[3]);

  printf ("with malloc: %d, with mbrtowc: %d\n", do_malloc, do_mbrtowc);

  memset (&mbs, 0, sizeof mbs);
  for (line = 0; line < lines; ++line)
    {
      char *x;
      if (do_malloc) x = malloc (size);
      if (do_mbrtowc)
        for (i = 0; i < size; i += mbclen)
          if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
            break;
      if (do_malloc) free (x);
    }
  return 0;
}
==== SNAP ====

Under Cygwin (tcsh time output):

  $ setenv LANG en_US.UTF-8
  $ time ./mb 1000000 1 0
  with malloc: 1, with mbrtowc: 0
  0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
  $ time ./mb 1000000 0 1
  with malloc: 0, with mbrtowc: 1
  1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
  $ time ./mb 1000000 1 1
  with malloc: 1, with mbrtowc: 1
  2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w

Running on the same CPU under Linux:

  $ setenv LANG en_US.UTF-8
  $ time ./mb 1000000 1 0
  with malloc: 1, with mbrtowc: 0
  0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
  $ time ./mb 1000000 0 1
  with malloc: 0, with mbrtowc: 1
  1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
  $ time ./mb 1000000 1 1
  with malloc: 1, with mbrtowc: 1
  1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w

So, while Linux is definitely faster, the number are still comparable
for 1 million iterations.  That still doens't explain why grep is a
multitude slower when using UTF-8 as charset.


Puzzled,
Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Corinna Vinschen wrote:

> On Nov  6 06:56, Eric Blake wrote:
>  
>> According to Corinna Vinschen on 11/6/2009 6:51 AM:
>>    
>>>> The problem *is* with grep (and sed), however, because there is no
>>>> good reason that UTF-8 should give us a penalty of being 100times
>>>> slower on most search operations, this is just poor programming of
>>>> grep and sed.
>>>>        
>>> The penalty on Linux is much smaller, about 15-20%.  
On my Linux system, the penalty is a factor between 6 and 8.

>>> It looks like
>>> grep is calling malloc for every input line if MB_CUR_MAX is > 1.
>>> Then it evaluates for each byte in the line whether the byte is a
>>> single byte or the start of a multibyte sequence using mbrtowc on
>>> every charatcer on the input line.  Then, for each potential match,
>>> it checks if it's the start byte of a multibyte sequence and ignores
>>> all other matches.  Eventually, it calls free, and the game starts
>>> over for the next line.
>>>      
>> Adding bug-grep, since this slowdown caused by additional mallocs is
>> definitely the sign of a poor algorithm that could be improved by reusing
>> existing buffers.
>>    
>
> I created a simple testcase:
>
> ==== SNIP ===
> ...
> ==== SNAP ====
>  
I extended your test program to demonstrate the inefficiency of the
standard mbrtowc function. Instead I use a function from my editor
(mined) to extract a Unicode character from a UTF-8 sequence. This is
the simple case only, not converting character sets other than UTF-8 but
that's the same thing mbrtowc does in the sample invocation. Program
attached. Results below.

> Under Cygwin (tcsh time output):
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
>
> Running on the same CPU under Linux:
>
>   $ setenv LANG en_US.UTF-8
>   $ time ./mb 1000000 1 0
>   with malloc: 1, with mbrtowc: 0
>   0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 0 1
>   with malloc: 0, with mbrtowc: 1
>   1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
>   $ time ./mb 1000000 1 1
>   with malloc: 1, with mbrtowc: 1
>   1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
>
> So, while Linux is definitely faster, the number are still comparable
> for 1 million iterations.  That still doens't explain why grep is a
> multitude slower when using UTF-8 as charset.
>  
Results of mbrtowc vs. utftouni on Linux:

thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
UTF-8
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m2.897s
user    0m2.836s
sys     0m0.012s
thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.030s
user    0m0.028s
sys     0m0.000s
thw[en_US.UTF-8]@scotty:~/tmp:


Results of mbrtowc vs. utftouni on cygwin:

demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0

real    0m3.034s
user    0m2.974s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1

real    0m0.110s
user    0m0.070s
sys     0m0.030s
demsn702[C.UTF-8]@stbm8186:/H/tools:


The conclusion is, as long as calling mbrtowc is as inefficient, a
program caring about performance should not use it.

Thomas

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>

void utf8_info (u, length, ucs)
  char * u;
  int * length;
  unsigned long * ucs;
{
  char * textpoi = u;
  unsigned char c = * textpoi;
  int utfcount;
  unsigned long unichar;

        if ((c & 0x80) == 0x00) {
                utfcount = 1;
                unichar = c;
        } else if ((c & 0xE0) == 0xC0) {
                utfcount = 2;
                unichar = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
                utfcount = 3;
                unichar = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
                utfcount = 4;
                unichar = c & 0x07;
        } else if ((c & 0xFC) == 0xF8) {
                utfcount = 5;
                unichar = c & 0x03;
        } else if ((c & 0xFE) == 0xFC) {
                utfcount = 6;
                unichar = c & 0x01;
        } else if (c == 0xFE) {
                /* illegal UTF-8 code */
                utfcount = 1;
                unichar = 0;
        } else if (c == 0xFF) {
                /* illegal UTF-8 code */
                utfcount = 1;
                unichar = 0;
        } else {
                /* illegal UTF-8 sequence character */
                utfcount = 1;
                unichar = 0;
        }

        * length = utfcount;

        utfcount --;
        textpoi ++;
        while (utfcount > 0 && (* textpoi & 0xC0) == 0x80) {
                unichar = (unichar << 6) | (* textpoi & 0x3F);
                utfcount --;
                textpoi ++;
        }
        if (utfcount > 0) {
                /* too short UTF-8 sequence */
                unichar = 0;
                * length -= utfcount;
        }

        * ucs = unichar;
}

void utftouni (wchar_t * wpoi, char * s)
{
  unsigned long c;
  int l;
  int i = 0;

  while (* s) {
        utf8_info (s, & l, & wpoi [i++]);
        s += l;
  }
}

int main (int argc, char **argv)
{
  char in[] = "The quick brown fox jumps over the lazy dog";
  int line, i;
  mbstate_t mbs;
  size_t mbclen;
  size_t size = sizeof (in);
  wchar_t wc;
  int lines = argc > 1 ? atoi (argv[1]) : 1000;
  int do_malloc = 1;
  int do_mbrtowc = 1;
  int do_utftouni = 1;

  if (argc > 2)
    do_malloc = atoi (argv[2]);
  if (argc > 3)
    do_mbrtowc = atoi (argv[3]);
  if (argc > 4)
    do_utftouni = atoi (argv[4]);

  printf ("with malloc: %d, with mbrtowc: %d, with utftouni: %d\n", do_malloc, do_mbrtowc, do_utftouni);

  memset (&mbs, 0, sizeof mbs);
  for (line = 0; line < lines; ++line)
    {
      char *x;
      if (do_malloc) x = malloc (size);
      if (do_mbrtowc)
        for (i = 0; i < size; i += mbclen)
          if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
            break;
      if (do_utftouni)
        utftouni (&wc, in);
      if (do_malloc) free (x);
    }
  return 0;
}


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I wrote:
> Corinna Vinschen wrote:
>> ...
> I extended your test program to demonstrate the inefficiency of the
> standard mbrtowc function. Instead I use a function from my editor
> (mined) to extract a Unicode character from a UTF-8 sequence. This is
> the simple case only, not converting character sets other than UTF-8
> but that's the same thing mbrtowc does in the sample invocation.
> Program attached. Results below.
Actually, there was a bug in the test program, wc not being an array,
which led to variable corruption and thus incorrect test results in my
extension.
Sorry for my embarrassing mistake to overlook this.
Anyway, corrected results are still by a factor of 3 to 4 in favor of my
algorithm.
Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Corinna Vinschen-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Nov  6 16:00, Thomas Wolff wrote:

> Corinna Vinschen wrote:
> >I created a simple testcase:
> >
> >==== SNIP ===
> >...
> >==== SNAP ====
> I extended your test program to demonstrate the inefficiency of the
> standard mbrtowc function. [...]
> >Under Cygwin (tcsh time output):
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.328u 0.031s 0:00.34 102.9%    0+0k 0+0io 1834pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.921u 0.092s 0:02.09 96.1%     0+0k 0+0io 1827pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  2.062u 0.140s 0:02.15 102.3%    0+0k 0+0io 1839pf+0w
> >
> >Running on the same CPU under Linux:
> >
> >  $ setenv LANG en_US.UTF-8
> >  $ time ./mb 1000000 1 0
> >  with malloc: 1, with mbrtowc: 0
> >  0.088u 0.004s 0:00.09 88.8%     0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 0 1
> >  with malloc: 0, with mbrtowc: 1
> >  1.836u 0.000s 0:01.85 98.9%     0+0k 0+0io 0pf+0w
> >  $ time ./mb 1000000 1 1
> >  with malloc: 1, with mbrtowc: 1
> >  1.888u 0.000s 0:01.93 97.4%     0+0k 0+0io 0pf+0w
> >
> >So, while Linux is definitely faster, the number are still comparable
> >for 1 million iterations.  That still doens't explain why grep is a
> >multitude slower when using UTF-8 as charset.
> Results of mbrtowc vs. utftouni on Linux:
>
> thw[en_US.UTF-8]@scotty:~/tmp: locale charmap
> UTF-8
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0
> with malloc: 0, with mbrtowc: 1, with utftouni: 0
>
> real    0m2.897s
> user    0m2.836s
> sys     0m0.012s
> thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1
> with malloc: 0, with mbrtowc: 0, with utftouni: 1
>
> real    0m0.030s
> user    0m0.028s
> sys     0m0.000s
> thw[en_US.UTF-8]@scotty:~/tmp:
> [...]
> The conclusion is, as long as calling mbrtowc is as inefficient, a
> program caring about performance should not use it.

That's sort of an unfair test.  Your utftouni function doesn't care for
mbstate, error, and surrogate pair handling.

Having said that, I just experimented further with mbrtowc, and I was
able to speed up mbrtowc and wcrtomb calls on Cygwin by a factor of
almost 50 per cent, just by reducing the function call depth in newlib,
which is the result of reentrancy and isolation efforts.

Talking about your implementation, if you could come up with a faster
implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would
certainly be another welcome performance boost.


Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Corinna Vinschen wrote:
> On Nov  6 16:00, Thomas Wolff wrote:
>  
>>> ...
>>>      
>> I extended your test program to demonstrate the inefficiency of the
>> standard mbrtowc function. [...]
>>    
I later had to correct:
> Anyway, corrected results are still by a factor of 3 to 4 in favor of
> my algorithm.
Corinna wrote:
> That's sort of an unfair test.  Your utftouni function doesn't care for
> mbstate, error, and surrogate pair handling.
>  
This is a question of use cases:
* mbstate is needed e.g. if you feed results of read() which possibly
come in arbitrary chunks directly into mbtowc(); it's not needed if you
only transform complete lines of text at once. The stdlib function is a
little bit too generic (and thus complicated, too) for many applications.
* error handling is there, in my function; it's simplified, incorrect
sequences are all mapped to 0 for the test case but they could as well
return an error indication without performance impact.
* surrogate pair handling is only needed if you pass the string from/to
the Windows API. It's not needed for POSIX applications (provided
wchar_t would be sufficiently wide). So if wchar_t can be extended in
the newlib API, it might be useful to have two implementations; one for
applications (w/o surrogates), one for cygwin itself.

> Having said that, I just experimented further with mbrtowc, and I was
> able to speed up mbrtowc and wcrtomb calls on Cygwin by a factor of
> almost 50 per cent, just by reducing the function call depth in newlib,
> which is the result of reentrancy and isolation efforts.
>  
Great! That comes close to my corrected results  :-[

> Talking about your implementation, if you could come up with a faster
> implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would
> certainly be another welcome performance boost.
>  
A quick look at those function doesn't reveal much potential, except for
tiny optimizations like
-  if (ch >= 0xe0 && ch <= 0xef) /* three-byte sequence */
+  if (ch & 0xf0 == 0xe0) /* three-byte sequence */
But even that, given the way the compiler optimizes expressions, is
probably not an improvement.

Also, I remember some recent trouble was fixed by your tweaking of wide
character functions, so this is better not touched again.

My main point was that, depending on the use case, some applications
would be better off using less generic, optimized functions.
grep and sed would certainly be well advised to do that.

Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Thomas Wolff-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

[forgot to CC to bug-grep before, so I'm resending this, with one more
comment, and leaving out cygwin-specific parts]

Corinna Vinschen wrote:
> On Nov  6 16:00, Thomas Wolff wrote:
>  
>>> ...
>>>      
>> I extended your test program to demonstrate the inefficiency of the
>> standard mbrtowc function. [...]
>>    
I later had to correct:
> Anyway, corrected results are still by a factor of 3 to 4 in favor of
> my algorithm.
Corinna wrote:
> That's sort of an unfair test.  Your utftouni function doesn't care for
> mbstate, error, and surrogate pair handling.
>  
This is a question of use cases:
* mbstate is needed e.g. if you feed results of read() which possibly
come in arbitrary chunks directly into mbtowc(); it's not needed if you
only transform complete lines of text at once. The stdlib function is a
little bit too generic (and thus complicated, too) for many applications.
* error handling is there, in my function; it's simplified, incorrect
sequences are all mapped to 0 for the test case but they could as well
return an error indication without performance impact.
* surrogate pair handling is only needed if you pass the string from/to
the Windows API. It's not needed for POSIX applications (provided
wchar_t would be sufficiently wide). So if wchar_t can be extended in
the newlib API, it might be useful to have two implementations; one for
applications (w/o surrogates), one for cygwin itself.


[...]

My main point was that, depending on the use case, some applications
would be better off using less generic, optimized functions.
The kind of dogmatic suggestion (as seen in the "locale scene") that
everybody should use the stdlib wide character functions is often
misleading.
grep and sed would certainly be well advised to change that.

Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file

by Jim Reisert AD1C :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Fri, Nov 6, 2009 at 7:12 AM, Cooper, Karl (US SSA)
<karl.cooper@...> wrote:

> Corinna Vinschen wrote:
>> Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset
>> when calling nl_langinfo(CHARSET).
>
> Yes, this solves it:
>
> $ time LC_ALL=C.ASCII grep dog testfile | wc
>  100000  900000 4500000
>
> real    0m0.359s
> user    0m0.279s
> sys     0m0.232s


I just tried this on my system, I routinely grep groups of files
containing 100K lines.  I was *astounded* how fast "grep" is after
setting LC_ALL=C.ASCII !

--
Jim Reisert AD1C, <jjreisert@...>, http://www.ad1c.us

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

< Prev | 1 - 2 | Next >