|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 | Next > |
|
|
1.7] BUG - GREP slows to a crawl with large number of matches on a single fileRunning grep on a 20MB file with ~100,000 matches takes an incredible almost 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 (on a 2nd machine).
The following cases show how grep under 1.7 grinds to a halt as the number of matches increases. The data 'testfile' is a plain text file of the acl's of all the 108,000 files on my Windoze computer. Note since the machines are different, compare relative times across cases rather than the times between the two machines. Case 1] Zero matches time grep "sfsdfdsfds" testfile | wc 0 0 0 Cygwin 1.5 real 0m0.093s user 0m0.092s sys 0m0.030s Cygwin 1.7 real 0m1.353s user 0m1.342s sys 0m0.062s Case 2] One match time grep ".lesshst" testfile | wc 1 3 29 Cygwin 1.5 (~same as zero matches) real 0m0.234s user 0m0.091s sys 0m0.061s Cygwin 1.7 (~same as zero matches) real 0m1.499s user 0m1.404s sys 0m0.046s Case 3] ~1400 matches Cygwin 1.5 (~ same as zero matches) time grep ".bin" testfile | wc 1439 5661 71067 real 0m0.110s user 0m0.076s sys 0m0.077s Cygwin 1.7 (~6x zero matches case real 0m7.537s user 0m7.341s sys 0m0.045s Case 4] ~16000 matches time grep "Documents and Settings" testfile | wc 15824 131573 1918500 Cygwin 1.5 (~same as zero matches) real 0m0.437s user 0m0.092s sys 0m0.092s Cygwin 1.7 (~50x zero matches) real 1m14.491s user 1m8.904s sys 0m0.031s Case 5] ~100,000 matches time grep "# file" testfile | wc 106988 510944 7930558 Cygwin 1.5 (~1.5x zero matches) real 0m0.475s user 0m0.154s sys 0m0.201s Cygwin 1.7 (~350x zero matches) real 7m51.771s user 7m16.810s sys 0m0.062s Case 6] Test that nothing wrong with file system reads or 'wc' time cat testfile | wc 966300 1821815 20426592 Cygwin 1.5 (approx same time as grepping zero matches) real 0m0.344s user 0m0.201s sys 0m0.186s Cygwin 1.7 (approx same time as grepping zero matches) real 0m1.662s user 0m1.373s sys 0m0.138s |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileaputerguy wrote:
> The data 'testfile' is a plain text file of the acl's of all the 108,000 > files on my Windoze computer. So, the "find | xargs" trick worked then did it? :-) cheers, DaveK -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Thu, Nov 05, 2009 at 03:27:07PM -0800, aputerguy wrote:
> >Running grep on a 20MB file with ~100,000 matches takes an incredible almost >8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 >(on a 2nd machine). > >The following cases show how grep under 1.7 grinds to a halt as the number >of matches increases. > >The data 'testfile' is a plain text file of the acl's of all the 108,000 >files on my Windoze computer. > >Note since the machines are different, compare relative times across cases >rather than the times between the two machines. We'll need an actual test case if you want us to track it down. cgf -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileaputerguy wrote:
> Running grep on a 20MB file with ~100,000 matches takes an incredible almost > 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 > (on a 2nd machine). --- I've seen nasty behavior with grep that isnt' cygwin specific. Try "pcregrep" and see if you have the same issue. I found it to be about ~100 times faster under _some_ searches though 2-3x is more typical. The gnu re-parser isn't real efficient under some circumstances. If you find a big difference, you might also want to report it to the bug-grep@... mailing list, but last time I did, they told me "that's the way it is" due to some posix conformance thing... -l -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote:
>aputerguy wrote: >> Running grep on a 20MB file with ~100,000 matches takes an incredible almost >> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 >> (on a 2nd machine). > >I've seen nasty behavior with grep that isnt' cygwin specific. Try >"pcregrep" and see if you have the same issue. > >I found it to be about ~100 times faster under _some_ searches though >2-3x is more typical. The gnu re-parser isn't real efficient under >some circumstances. > >If you find a big difference, you might also want to report it to the >bug-grep@... mailing list, but last time I did, they told me >"that's the way it is" due to some posix conformance thing... The fact that it behaves differently between Cygwin 1.5 and 1.7 would suggest that this isn't a grep problem. That's why I asked for a test case. cgf -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOK. Here is a simple test case:
X=100000 while [ $X -gt 0 ] ; do echo "The quick brown fox jumped over the lazy dog" ; let X=X-1; done > testfile time grep dog testfile | wc Cygwin 1.5: real 0m0.219s user 0m0.232s sys 0m0.045s Cygwin 1.7: real 7m46.575s user 7m14.138s sys 0m0.076s While using sed on Cygwin 1.5, I get the reasonable result: time sed -ne /dog/p testfile | wc real 0m1.229s user 0m1.202s sys 0m0.046s |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileChristopher Faylor wrote:
> On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote: > >> aputerguy wrote: >> >>> Running grep on a 20MB file with ~100,000 matches takes an incredible almost >>> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 >>> (on a 2nd machine). >>> >> I've seen nasty behavior with grep that isnt' cygwin specific. Try >> "pcregrep" and see if you have the same issue. >> >> I found it to be about ~100 times faster under _some_ searches though >> 2-3x is more typical. The gnu re-parser isn't real efficient under >> some circumstances. >> >> If you find a big difference, you might also want to report it to the >> bug-grep@... mailing list, but last time I did, they told me >> "that's the way it is" due to some posix conformance thing... >> > > The fact that it behaves differently between Cygwin 1.5 and 1.7 would > suggest that this isn't a grep problem. > charset. The same problem is observed on Linux, with grep as well as with sed. That's why I have changed most of my shell scripts to use something like LC_ALL=C grep or LC_ALL=C sed where possible. Please try this. The problem *is* with grep (and sed), however, because there is no good reason that UTF-8 should give us a penalty of being 100times slower on most search operations, this is just poor programming of grep and sed. Thomas -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Fri, Nov 06, 2009 at 02:09:59PM +0100, Thomas Wolff wrote:
>Christopher Faylor wrote: >> On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote: >> >>> aputerguy wrote: >>> >>>> Running grep on a 20MB file with ~100,000 matches takes an incredible almost >>>> 8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 >>>> (on a 2nd machine). >>>> >>> I've seen nasty behavior with grep that isnt' cygwin specific. Try >>> "pcregrep" and see if you have the same issue. >>> >>> I found it to be about ~100 times faster under _some_ searches though >>> 2-3x is more typical. The gnu re-parser isn't real efficient under >>> some circumstances. >>> >>> If you find a big difference, you might also want to report it to the >>> bug-grep@... mailing list, but last time I did, they told me >>> "that's the way it is" due to some posix conformance thing... >>> >> >> The fact that it behaves differently between Cygwin 1.5 and 1.7 would >> suggest that this isn't a grep problem. >> >This is likely to be triggered by the transition to UTF-8 as a default >charset. The same problem is observed on Linux, with grep as well as >with sed. >That's why I have changed most of my shell scripts to use something like >LC_ALL=C grep or LC_ALL=C sed >where possible. Please try this. Thanks for catching this. I'll hold off on trying the test case until I hear a report about running the same test with LC_ALL=C. cgf -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
RE: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileThomas Wolff wrote:
> Christopher Faylor wrote: ..snip.. >>> aputerguy wrote: >>> >>>> Running grep on a 20MB file with ~100,000 matches takes an >>>> incredible almost 8 minutes under Cygwin 1.7 while taking just 0.2 >>>> seconds under Cygwin 1.5 (on a 2nd machine). ..snip.. >> The fact that it behaves differently between Cygwin 1.5 and 1.7 would >> suggest that this isn't a grep problem. >> > This is likely to be triggered by the transition to UTF-8 as a > default charset. The same problem is observed on Linux, with grep as > well as with sed. > That's why I have changed most of my shell scripts to use something > like LC_ALL=C grep or LC_ALL=C sed where possible. Please try this. I don't have Cygwin 1.5 for comparison, but the testcase provided does show grep using a long time on *my* Cygwin 1.7. And LC_ALL=C didn't seem to help: $ time grep dog testfile | wc 100000 900000 4500000 real 3m28.229s user 3m26.951s Sys 0m0.170s $ LC_ALL=C time grep dog testfile | wc 207.26user 0.06system 3:28.32elapsed 99%CPU (0avgtext+0avtdata 278784maxresident)k 0inputs+0outputs (1091major+0minor)pagefaults 0swaps 100000 900000 4500000 $ time LC_ALL=C grep dog testfile | wc 100000 900000 4500000 real 3m24.265s user 3m24.124s sys 0m0.202s (Well . . . Doesn't help very *much*, anyway; a few seconds.) I don't have the *latest* 1.7: $ uname -a CYGWIN_NT-5.1 gldlkcooper 1.7.0(0.214/5/3) 2009-10-03 14:33 i686 Cygwin Karl Cooper -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Nov 6 14:09, Thomas Wolff wrote:
> Christopher Faylor wrote: > >On Thu, Nov 05, 2009 at 07:11:02PM -0800, Linda Walsh wrote: > >>aputerguy wrote: > >>>Running grep on a 20MB file with ~100,000 matches takes an incredible almost > >>>8 minutes under Cygwin 1.7 while taking just 0.2 seconds under Cygwin 1.5 > >>>(on a 2nd machine). > >>I've seen nasty behavior with grep that isnt' cygwin specific. Try > >>"pcregrep" and see if you have the same issue. > >> > >>I found it to be about ~100 times faster under _some_ searches though > >>2-3x is more typical. The gnu re-parser isn't real efficient under > >>some circumstances. > >> > >>If you find a big difference, you might also want to report it to the > >>bug-grep@... mailing list, but last time I did, they told me > >>"that's the way it is" due to some posix conformance thing... > > > >The fact that it behaves differently between Cygwin 1.5 and 1.7 would > >suggest that this isn't a grep problem. > This is likely to be triggered by the transition to UTF-8 as a > default charset. The same problem is observed on Linux, with grep as > well as with sed. > That's why I have changed most of my shell scripts to use something like > LC_ALL=C grep or LC_ALL=C sed > where possible. Please try this. Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset when calling nl_langinfo(CHARSET). > The problem *is* with grep (and sed), however, because there is no > good reason that UTF-8 should give us a penalty of being 100times > slower on most search operations, this is just poor programming of > grep and sed. The penalty on Linux is much smaller, about 15-20%. It looks like grep is calling malloc for every input line if MB_CUR_MAX is > 1. Then it evaluates for each byte in the line whether the byte is a single byte or the start of a multibyte sequence using mbrtowc on every charatcer on the input line. Then, for each potential match, it checks if it's the start byte of a multibyte sequence and ignores all other matches. Eventually, it calls free, and the game starts over for the next line. It appears that either our malloc is that slow, or the mbrtowc call. But I can't really believe the latter. The function should be quite fast, as far as I can see... Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 According to Cooper, Karl (US SSA) on 11/6/2009 6:48 AM: >> like LC_ALL=C grep or LC_ALL=C sed where possible. Please try this. > > I don't have Cygwin 1.5 for comparison, but the testcase provided does show grep using a long time on *my* Cygwin 1.7. > And LC_ALL=C didn't seem to help: LC_ALL=C still selects UTF-8 charset. That's probably the culprit. Try LC_ALL=C.ASCII, to force the issue of a one-byte charset. - -- Don't work too hard, make some time for fun as well! Eric Blake ebb9@... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkr0KjAACgkQ84KuGfSFAYCwmgCdHqUtDXEIw01uyjPKW1+lv2Sx +AAAoNY++kRbZqAWYjpqWndTaFj2/y2F =7bwZ -----END PGP SIGNATURE----- -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 According to Corinna Vinschen on 11/6/2009 6:51 AM: >> The problem *is* with grep (and sed), however, because there is no >> good reason that UTF-8 should give us a penalty of being 100times >> slower on most search operations, this is just poor programming of >> grep and sed. > > The penalty on Linux is much smaller, about 15-20%. It looks like > grep is calling malloc for every input line if MB_CUR_MAX is > 1. > Then it evaluates for each byte in the line whether the byte is a > single byte or the start of a multibyte sequence using mbrtowc on > every charatcer on the input line. Then, for each potential match, > it checks if it's the start byte of a multibyte sequence and ignores > all other matches. Eventually, it calls free, and the game starts > over for the next line. Adding bug-grep, since this slowdown caused by additional mallocs is definitely the sign of a poor algorithm that could be improved by reusing existing buffers. - -- Don't work too hard, make some time for fun as well! Eric Blake ebb9@... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkr0KxUACgkQ84KuGfSFAYCOCACgvjz2v65vK8DIcGg6zfnLQgcT tfQAmwbpWbriBJSv0rjYobYgsh4KXOiZ =B3nZ -----END PGP SIGNATURE----- -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
RE: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileCorinna Vinschen wrote:
> Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset > when calling nl_langinfo(CHARSET). Yes, this solves it: $ time LC_ALL=C.ASCII grep dog testfile | wc 100000 900000 4500000 real 0m0.359s user 0m0.279s sys 0m0.232s -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Nov 6 06:56, Eric Blake wrote:
> According to Corinna Vinschen on 11/6/2009 6:51 AM: > >> The problem *is* with grep (and sed), however, because there is no > >> good reason that UTF-8 should give us a penalty of being 100times > >> slower on most search operations, this is just poor programming of > >> grep and sed. > > > > The penalty on Linux is much smaller, about 15-20%. It looks like > > grep is calling malloc for every input line if MB_CUR_MAX is > 1. > > Then it evaluates for each byte in the line whether the byte is a > > single byte or the start of a multibyte sequence using mbrtowc on > > every charatcer on the input line. Then, for each potential match, > > it checks if it's the start byte of a multibyte sequence and ignores > > all other matches. Eventually, it calls free, and the game starts > > over for the next line. > > Adding bug-grep, since this slowdown caused by additional mallocs is > definitely the sign of a poor algorithm that could be improved by reusing > existing buffers. I created a simple testcase: ==== SNIP === #include <stdio.h> #include <wchar.h> #include <stdlib.h> #include <string.h> int main (int argc, char **argv) { const char in[] = "The quick brown fox jumps over the lazy dog"; int line, i; mbstate_t mbs; size_t mbclen; size_t size = sizeof (in); wchar_t wc; int lines = argc > 1 ? atoi (argv[1]) : 1000; int do_malloc = 1; int do_mbrtowc = 1; if (argc > 2) do_malloc = atoi (argv[2]); if (argc > 3) do_mbrtowc = atoi (argv[3]); printf ("with malloc: %d, with mbrtowc: %d\n", do_malloc, do_mbrtowc); memset (&mbs, 0, sizeof mbs); for (line = 0; line < lines; ++line) { char *x; if (do_malloc) x = malloc (size); if (do_mbrtowc) for (i = 0; i < size; i += mbclen) if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0) break; if (do_malloc) free (x); } return 0; } ==== SNAP ==== Under Cygwin (tcsh time output): $ setenv LANG en_US.UTF-8 $ time ./mb 1000000 1 0 with malloc: 1, with mbrtowc: 0 0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w $ time ./mb 1000000 0 1 with malloc: 0, with mbrtowc: 1 1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w $ time ./mb 1000000 1 1 with malloc: 1, with mbrtowc: 1 2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w Running on the same CPU under Linux: $ setenv LANG en_US.UTF-8 $ time ./mb 1000000 1 0 with malloc: 1, with mbrtowc: 0 0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w $ time ./mb 1000000 0 1 with malloc: 0, with mbrtowc: 1 1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w $ time ./mb 1000000 1 1 with malloc: 1, with mbrtowc: 1 1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w So, while Linux is definitely faster, the number are still comparable for 1 million iterations. That still doens't explain why grep is a multitude slower when using UTF-8 as charset. Puzzled, Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileCorinna Vinschen wrote:
> On Nov 6 06:56, Eric Blake wrote: > >> According to Corinna Vinschen on 11/6/2009 6:51 AM: >> >>>> The problem *is* with grep (and sed), however, because there is no >>>> good reason that UTF-8 should give us a penalty of being 100times >>>> slower on most search operations, this is just poor programming of >>>> grep and sed. >>>> >>> The penalty on Linux is much smaller, about 15-20%. >>> It looks like >>> grep is calling malloc for every input line if MB_CUR_MAX is > 1. >>> Then it evaluates for each byte in the line whether the byte is a >>> single byte or the start of a multibyte sequence using mbrtowc on >>> every charatcer on the input line. Then, for each potential match, >>> it checks if it's the start byte of a multibyte sequence and ignores >>> all other matches. Eventually, it calls free, and the game starts >>> over for the next line. >>> >> Adding bug-grep, since this slowdown caused by additional mallocs is >> definitely the sign of a poor algorithm that could be improved by reusing >> existing buffers. >> > > I created a simple testcase: > > ==== SNIP === > ... > ==== SNAP ==== > (mined) to extract a Unicode character from a UTF-8 sequence. This is the simple case only, not converting character sets other than UTF-8 but that's the same thing mbrtowc does in the sample invocation. Program attached. Results below. > Under Cygwin (tcsh time output): > > $ setenv LANG en_US.UTF-8 > $ time ./mb 1000000 1 0 > with malloc: 1, with mbrtowc: 0 > 0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w > $ time ./mb 1000000 0 1 > with malloc: 0, with mbrtowc: 1 > 1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w > $ time ./mb 1000000 1 1 > with malloc: 1, with mbrtowc: 1 > 2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w > > Running on the same CPU under Linux: > > $ setenv LANG en_US.UTF-8 > $ time ./mb 1000000 1 0 > with malloc: 1, with mbrtowc: 0 > 0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w > $ time ./mb 1000000 0 1 > with malloc: 0, with mbrtowc: 1 > 1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w > $ time ./mb 1000000 1 1 > with malloc: 1, with mbrtowc: 1 > 1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w > > So, while Linux is definitely faster, the number are still comparable > for 1 million iterations. That still doens't explain why grep is a > multitude slower when using UTF-8 as charset. > thw[en_US.UTF-8]@scotty:~/tmp: locale charmap UTF-8 thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0 with malloc: 0, with mbrtowc: 1, with utftouni: 0 real 0m2.897s user 0m2.836s sys 0m0.012s thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1 with malloc: 0, with mbrtowc: 0, with utftouni: 1 real 0m0.030s user 0m0.028s sys 0m0.000s thw[en_US.UTF-8]@scotty:~/tmp: Results of mbrtowc vs. utftouni on cygwin: demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 1 0 with malloc: 0, with mbrtowc: 1, with utftouni: 0 real 0m3.034s user 0m2.974s sys 0m0.030s demsn702[C.UTF-8]@stbm8186:/H/tools: time ./uu.exe 1000000 0 0 1 with malloc: 0, with mbrtowc: 0, with utftouni: 1 real 0m0.110s user 0m0.070s sys 0m0.030s demsn702[C.UTF-8]@stbm8186:/H/tools: The conclusion is, as long as calling mbrtowc is as inefficient, a program caring about performance should not use it. Thomas #include <stdio.h> #include <wchar.h> #include <stdlib.h> #include <string.h> void utf8_info (u, length, ucs) char * u; int * length; unsigned long * ucs; { char * textpoi = u; unsigned char c = * textpoi; int utfcount; unsigned long unichar; if ((c & 0x80) == 0x00) { utfcount = 1; unichar = c; } else if ((c & 0xE0) == 0xC0) { utfcount = 2; unichar = c & 0x1F; } else if ((c & 0xF0) == 0xE0) { utfcount = 3; unichar = c & 0x0F; } else if ((c & 0xF8) == 0xF0) { utfcount = 4; unichar = c & 0x07; } else if ((c & 0xFC) == 0xF8) { utfcount = 5; unichar = c & 0x03; } else if ((c & 0xFE) == 0xFC) { utfcount = 6; unichar = c & 0x01; } else if (c == 0xFE) { /* illegal UTF-8 code */ utfcount = 1; unichar = 0; } else if (c == 0xFF) { /* illegal UTF-8 code */ utfcount = 1; unichar = 0; } else { /* illegal UTF-8 sequence character */ utfcount = 1; unichar = 0; } * length = utfcount; utfcount --; textpoi ++; while (utfcount > 0 && (* textpoi & 0xC0) == 0x80) { unichar = (unichar << 6) | (* textpoi & 0x3F); utfcount --; textpoi ++; } if (utfcount > 0) { /* too short UTF-8 sequence */ unichar = 0; * length -= utfcount; } * ucs = unichar; } void utftouni (wchar_t * wpoi, char * s) { unsigned long c; int l; int i = 0; while (* s) { utf8_info (s, & l, & wpoi [i++]); s += l; } } int main (int argc, char **argv) { char in[] = "The quick brown fox jumps over the lazy dog"; int line, i; mbstate_t mbs; size_t mbclen; size_t size = sizeof (in); wchar_t wc; int lines = argc > 1 ? atoi (argv[1]) : 1000; int do_malloc = 1; int do_mbrtowc = 1; int do_utftouni = 1; if (argc > 2) do_malloc = atoi (argv[2]); if (argc > 3) do_mbrtowc = atoi (argv[3]); if (argc > 4) do_utftouni = atoi (argv[4]); printf ("with malloc: %d, with mbrtowc: %d, with utftouni: %d\n", do_malloc, do_mbrtowc, do_utftouni); memset (&mbs, 0, sizeof mbs); for (line = 0; line < lines; ++line) { char *x; if (do_malloc) x = malloc (size); if (do_mbrtowc) for (i = 0; i < size; i += mbclen) if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0) break; if (do_utftouni) utftouni (&wc, in); if (do_malloc) free (x); } return 0; } -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileI wrote:
> Corinna Vinschen wrote: >> ... > I extended your test program to demonstrate the inefficiency of the > standard mbrtowc function. Instead I use a function from my editor > (mined) to extract a Unicode character from a UTF-8 sequence. This is > the simple case only, not converting character sets other than UTF-8 > but that's the same thing mbrtowc does in the sample invocation. > Program attached. Results below. Actually, there was a bug in the test program, wc not being an array, which led to variable corruption and thus incorrect test results in my extension. Sorry for my embarrassing mistake to overlook this. Anyway, corrected results are still by a factor of 3 to 4 in favor of my algorithm. Thomas -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Nov 6 16:00, Thomas Wolff wrote:
> Corinna Vinschen wrote: > >I created a simple testcase: > > > >==== SNIP === > >... > >==== SNAP ==== > I extended your test program to demonstrate the inefficiency of the > standard mbrtowc function. [...] > >Under Cygwin (tcsh time output): > > > > $ setenv LANG en_US.UTF-8 > > $ time ./mb 1000000 1 0 > > with malloc: 1, with mbrtowc: 0 > > 0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w > > $ time ./mb 1000000 0 1 > > with malloc: 0, with mbrtowc: 1 > > 1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w > > $ time ./mb 1000000 1 1 > > with malloc: 1, with mbrtowc: 1 > > 2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w > > > >Running on the same CPU under Linux: > > > > $ setenv LANG en_US.UTF-8 > > $ time ./mb 1000000 1 0 > > with malloc: 1, with mbrtowc: 0 > > 0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w > > $ time ./mb 1000000 0 1 > > with malloc: 0, with mbrtowc: 1 > > 1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w > > $ time ./mb 1000000 1 1 > > with malloc: 1, with mbrtowc: 1 > > 1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w > > > >So, while Linux is definitely faster, the number are still comparable > >for 1 million iterations. That still doens't explain why grep is a > >multitude slower when using UTF-8 as charset. > Results of mbrtowc vs. utftouni on Linux: > > thw[en_US.UTF-8]@scotty:~/tmp: locale charmap > UTF-8 > thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 1 0 > with malloc: 0, with mbrtowc: 1, with utftouni: 0 > > real 0m2.897s > user 0m2.836s > sys 0m0.012s > thw[en_US.UTF-8]@scotty:~/tmp: time ./uu 1000000 0 0 1 > with malloc: 0, with mbrtowc: 0, with utftouni: 1 > > real 0m0.030s > user 0m0.028s > sys 0m0.000s > thw[en_US.UTF-8]@scotty:~/tmp: > [...] > The conclusion is, as long as calling mbrtowc is as inefficient, a > program caring about performance should not use it. That's sort of an unfair test. Your utftouni function doesn't care for mbstate, error, and surrogate pair handling. Having said that, I just experimented further with mbrtowc, and I was able to speed up mbrtowc and wcrtomb calls on Cygwin by a factor of almost 50 per cent, just by reducing the function call depth in newlib, which is the result of reentrancy and isolation efforts. Talking about your implementation, if you could come up with a faster implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would certainly be another welcome performance boost. Corinna -- Corinna Vinschen Please, send mails regarding Cygwin to Cygwin Project Co-Leader cygwin AT cygwin DOT com Red Hat -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileCorinna Vinschen wrote:
> On Nov 6 16:00, Thomas Wolff wrote: > >>> ... >>> >> I extended your test program to demonstrate the inefficiency of the >> standard mbrtowc function. [...] >> I later had to correct: > Anyway, corrected results are still by a factor of 3 to 4 in favor of > my algorithm. Corinna wrote: > That's sort of an unfair test. Your utftouni function doesn't care for > mbstate, error, and surrogate pair handling. > This is a question of use cases: * mbstate is needed e.g. if you feed results of read() which possibly come in arbitrary chunks directly into mbtowc(); it's not needed if you only transform complete lines of text at once. The stdlib function is a little bit too generic (and thus complicated, too) for many applications. * error handling is there, in my function; it's simplified, incorrect sequences are all mapped to 0 for the test case but they could as well return an error indication without performance impact. * surrogate pair handling is only needed if you pass the string from/to the Windows API. It's not needed for POSIX applications (provided wchar_t would be sufficiently wide). So if wchar_t can be extended in the newlib API, it might be useful to have two implementations; one for applications (w/o surrogates), one for cygwin itself. > Having said that, I just experimented further with mbrtowc, and I was > able to speed up mbrtowc and wcrtomb calls on Cygwin by a factor of > almost 50 per cent, just by reducing the function call depth in newlib, > which is the result of reentrancy and isolation efforts. > Great! That comes close to my corrected results :-[ > Talking about your implementation, if you could come up with a faster > implementation of newlib's __utf8_wctomb/__utf8_mbtowc, it would > certainly be another welcome performance boost. > A quick look at those function doesn't reveal much potential, except for tiny optimizations like - if (ch >= 0xe0 && ch <= 0xef) /* three-byte sequence */ + if (ch & 0xf0 == 0xe0) /* three-byte sequence */ But even that, given the way the compiler optimizes expressions, is probably not an improvement. Also, I remember some recent trouble was fixed by your tweaking of wide character functions, so this is better not touched again. My main point was that, depending on the use case, some applications would be better off using less generic, optimized functions. grep and sed would certainly be well advised to do that. Thomas -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file[forgot to CC to bug-grep before, so I'm resending this, with one more
comment, and leaving out cygwin-specific parts] Corinna Vinschen wrote: > On Nov 6 16:00, Thomas Wolff wrote: > >>> ... >>> >> I extended your test program to demonstrate the inefficiency of the >> standard mbrtowc function. [...] >> I later had to correct: > Anyway, corrected results are still by a factor of 3 to 4 in favor of > my algorithm. Corinna wrote: > That's sort of an unfair test. Your utftouni function doesn't care for > mbstate, error, and surrogate pair handling. > This is a question of use cases: * mbstate is needed e.g. if you feed results of read() which possibly come in arbitrary chunks directly into mbtowc(); it's not needed if you only transform complete lines of text at once. The stdlib function is a little bit too generic (and thus complicated, too) for many applications. * error handling is there, in my function; it's simplified, incorrect sequences are all mapped to 0 for the test case but they could as well return an error indication without performance impact. * surrogate pair handling is only needed if you pass the string from/to the Windows API. It's not needed for POSIX applications (provided wchar_t would be sufficiently wide). So if wchar_t can be extended in the newlib API, it might be useful to have two implementations; one for applications (w/o surrogates), one for cygwin itself. [...] My main point was that, depending on the use case, some applications would be better off using less generic, optimized functions. The kind of dogmatic suggestion (as seen in the "locale scene") that everybody should use the stdlib wide character functions is often misleading. grep and sed would certainly be well advised to change that. Thomas -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
|
|
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single fileOn Fri, Nov 6, 2009 at 7:12 AM, Cooper, Karl (US SSA)
<karl.cooper@...> wrote: > Corinna Vinschen wrote: >> Or try LANG=C.ASCII since LANG=C will still return UTF-8 as charset >> when calling nl_langinfo(CHARSET). > > Yes, this solves it: > > $ time LC_ALL=C.ASCII grep dog testfile | wc > 100000 900000 4500000 > > real 0m0.359s > user 0m0.279s > sys 0m0.232s I just tried this on my system, I routinely grep groups of files containing 100K lines. I was *astounded* how fast "grep" is after setting LC_ALL=C.ASCII ! -- Jim Reisert AD1C, <jjreisert@...>, http://www.ad1c.us -- Problem reports: http://cygwin.com/problems.html FAQ: http://cygwin.com/faq/ Documentation: http://cygwin.com/docs.html Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple |
| < Prev | 1 - 2 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |