possible grep bug: -i switch with character class

There seems to be some sort of glitch with character classes and the
-i (--ignore-case) switch.

I am using Mac OS X 10.6.1 (Snow Leopard). grep --version shows
v2.5.1. I have not checked the latest grep version. Hopefully you can
do that without too much trouble.

To reproduce--

Create a text file called "test.txt" that consists of a single line of
text with the two letters T and K followed by a unix line feed, as
shown below (also see the attached file):

  TK

(I did not attempt to reduce the test case further or see if the
particular choice of letters matters.)

The case-insensitive search "grep -i [A-Z]K test.txt" finds no matches
while the same search done case-sensitively (i.e. without the -i
switch) does find a match. See below for a console session that shows
a couple other related searches that work.

  $ grep [A-Z]K test.txt
  TK
  $ grep [T]K test.txt
  TK
  $ grep -i [T]K test.txt
  TK
  $ grep -i [^A-Z]K test.txt
  TK
  $ grep -i [A-Z]K test.txt
  $

I will also attempt to report this here:

http://savannah.gnu.org/projects/grep/

Thanks,
--Chris

Re: possible grep bug: -i switch with character class

Chris Jerdonek wrote:
> There seems to be some sort of glitch with character classes and the
> -i (--ignore-case) switch.

Thank you for your report. I tried to reproduce the issue but believe
you have insufficient quoting to know for sure exactly what the
problem is in this case. It is either insufficient quoting or it is
your choice of locale which changes the sort ordering.

> Create a text file called "test.txt" that consists of a single line of
> text with the two letters T and K followed by a unix line feed, as
> shown below (also see the attached file):
>
> TK

Good. Also note that using 'echo' or 'printf' works well to create
small test cases like this.

> The case-insensitive search "grep -i [A-Z]K test.txt" finds no matches
> while the same search done case-sensitively (i.e. without the -i
> switch) does find a match. See below for a console session that shows
> a couple other related searches that work.

I think you may have insufficiently quoted your pattern. The brackets
are special characters to the shell and will match against filenames.
They are file globbing characters.

> $ grep [A-Z]K test.txt
> TK
> $ grep [T]K test.txt
> TK
> $ grep -i [T]K test.txt
> TK
> $ grep -i [^A-Z]K test.txt
> TK
> $ grep -i [A-Z]K test.txt
> $

Try this:

  $ echo TK | grep "[A-Z]K"
  TK
  $ echo TK | grep "[T]K"
  TK
  $ echo TK | grep -i "[T]K"
  TK
  $ echo TK | grep -i "[^A-Z]K"
  $ echo TK | grep -i "[A-Z]K"
  TK

Also note that the value of LANG, LC_COLLATE, and LC_ALL affect your
character collation sequence within your locale. Try setting LC_ALL=C
to set a standard environment.

  $ locale

In en_US.UTF-8 for example "[A-Z]" matches "aAbBcC...zZ" and doesn't
match lower case 'a' because that is the defined collation order for
that locale.

The shell can demonstrate the issue this way:

  $ LC_ALL=en_US.UTF-8 bash
  $ mkdir mytestdir && cd ./mytestdir
  $ touch a A b B c C X x Y y Z z
  $ echo [a-z]
  a A b B c C x X y Y z
  $ echo [A-Z]
  A b B c C x X y Y z Z

Setting LC_ALL=C restores a standard sort ordering.

Personally I use the following settings. But this won't be valid for
every combination. YMMV.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

Hope that helps,
Bob

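The quoting point can be checked in isolation. The sketch below is not
from the thread: it assumes a scratch directory, a hypothetical file
named XK that happens to match the glob, and LC_ALL=C so that locale
effects stay out of the picture. With such a file present, the shell
rewrites the unquoted pattern before grep ever sees it.

```shell
#!/bin/sh
# Pin the locale so only quoting differs between the two runs.
LC_ALL=C
export LC_ALL

# Work in a scratch directory so the glob only sees files created here.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'TK\n' > test.txt
touch XK    # hypothetical file whose name matches the glob [A-Z]K

# Unquoted: the shell expands [A-Z]K to XK, so grep searches for "XK".
# grep exits 1 when nothing matches, hence the "|| true".
unquoted=$(grep [A-Z]K test.txt) || true

# Quoted: grep receives the bracket expression itself.
quoted=$(grep '[A-Z]K' test.txt) || true

printf 'unquoted: "%s"\n' "$unquoted"
printf 'quoted:   "%s"\n' "$quoted"
```

When no file in the current directory matches the glob, most shells
pass the unquoted pattern through unchanged, which is why the unquoted
form often appears to work.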
Re: possible grep bug: -i switch with character class

On Tue, Nov 3, 2009 at 1:42 PM, Bob Proulx <bob@...> wrote:
> Thank you for your report. I tried to reproduce the issue but believe
> you have insufficient quoting to know for sure exactly what the
> problem is in this case. It is either insufficient quoting or it is
> your choice of locale which changes the sort ordering.

Thanks for your help, Bob.

> Try this:
>
> $ echo TK | grep "[A-Z]K"

The quoting doesn't seem related to the issue in this case. The case
insensitive search still picks up fewer results--

  $ echo TK | grep "[A-Z]K"
  TK
  $ echo TK | grep -i "[A-Z]K"
  $

> Also note that the value of LANG, LC_COLLATE, and LC_ALL affect your
> character collation sequence within your locale.

Ah, okay. Here were my locale settings:

  $ locale
  LANG="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_CTYPE="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_ALL=

> Try setting LC_ALL=C

Setting LC_ALL=C does seem to give the expected behavior. Thanks!

> In en_US.UTF-8 for example "[A-Z]" matches "aAbBcC...zZ" and doesn't
> match lower case 'a' because that is the defined collation order for
> that locale.

Okay. But shouldn't a case insensitive grep search for "[A-Z]" still
pick up "T"? It looks like it only picks up "t" and not "T":

  $ echo "tK" | grep -i "[A-Z]K"
  tK
  $ echo "TK" | grep -i "[A-Z]K"
  $

Or does grep officially not support locales other than "C"?

In your example above with en_US.UTF-8, if the ordering is aAbB..., it
seems like "[A-Z]" should match "A" but not "a" when case sensitive.
And a case insensitive search should pick up strictly more -- i.e.
both "a" and "A". On the other hand, "[a-z]" should match both "a" and
"A" when case sensitive.

> The shell can demonstrate the issue this way:
> ...
> $ touch a A b B c C X x Y y Z z

Hmm, it doesn't seem like your example worked (even with
LC_ALL="en_US.UTF-8"). "[A-Z]" picked up just the upper case files,
and "[a-z]" just the lower case. Also, Mac seems to treat "a" and "A"
as the same file. So the above command created only 6 files instead
of 12 -- using the first filename in each case rather than the second.

Thanks,
--Chris

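To separate the locale effect from everything else, the same quoted
test case can be run under each locale in turn. This is a minimal
sketch rather than anything posted in the thread: match_under is a
hypothetical helper name, and whether en_US.UTF-8 is installed (and
what grep prints under it) depends on the system and grep version.

```shell
#!/bin/sh
# Run the thread's test case under the locale named in $1 and print
# whatever grep matched (nothing if there is no match).
match_under() {
    printf 'TK\n' | LC_ALL="$1" grep -i '[A-Z]K'
}

# Compare the C locale with the locale from the report. The second
# name is taken from the thread and may not exist on every system.
for loc in C en_US.UTF-8; do
    # grep exits 1 on no match, hence the "|| true".
    result=$(match_under "$loc" 2>/dev/null) || true
    printf '%-12s -> "%s"\n' "$loc" "$result"
done
```

An empty result for one locale and "TK" for the other would reproduce
the behavior described above; identical results would suggest the grep
being tested does not have the problem.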
Re: possible grep bug: -i switch with character class

Chris Jerdonek wrote:
> > $ echo TK | grep "[A-Z]K"
>
> The quoting doesn't seem related to the issue in this case. The case
> insensitive search still picks up fewer results--

Hmm... Curiouser and curiouser.

> $ echo TK | grep "[A-Z]K"
> TK
> $ echo TK | grep -i "[A-Z]K"
> $

That does seem incorrect. Since you reported that you were using the
Mac OS X version 2.5.1, I would report this problem to them. This may
be something Mac specific. It doesn't seem to me to be a problem in
the upstream source.

Alternatively I would grab the latest stable grep sources and compile
them yourself. Grep compiles easily and shouldn't be a problem to
compile from source. (Although I don't use a Mac and so have no first
hand knowledge there.) Searching the web shows me a darwinports page
specifically for this. The GNU grep source is here:

  ftp://ftp.gnu.org/gnu/grep/

If it works for you compiled from upstream source and not from the
stock Mac version then it is probably a problem specific to the Mac
version that we wouldn't know about here. And if it doesn't work that
way then we have something that we could work through.

> > Try setting LC_ALL=C
>
> Setting LC_ALL=C does seem to give the expected behavior. Thanks!

Oh good. At least you have some progress to show. You will see that
in use a lot these days.

> > In en_US.UTF-8 for example "[A-Z]" matches "aAbBcC...zZ" and doesn't
> > match lower case 'a' because that is the defined collation order for
> > that locale.
>
> Okay. But shouldn't a case insensitive grep search for "[A-Z]" still
> pick up "T"? It looks like it only picks up "t" and not "T":

I would think so yes. I was simply trying to describe the sequence.
Otherwise it is just too confusing.

> $ echo "tK" | grep -i "[A-Z]K"
> tK
> $ echo "TK" | grep -i "[A-Z]K"
> $

I can't recreate that behavior. (Works for me.)

> > The shell can demonstrate the issue this way:
> > ...
> > $ touch a A b B c C X x Y y Z z
>
> Hmm, it doesn't seem like your example worked (even with
> LC_ALL="en_US.UTF-8"). "[A-Z]" picked up just the upper case files,
> and "[a-z]" just the lower case. Also, Mac seems to treat "a" and "A"
> as the same file. So the above command created only 6 files instead
> of 12 -- using the first filename in each case rather than the second.

Oh wow. I didn't expect that. Good information for me. But at least
you figured out the point I was attempting to make.

I am sorry but I think you will need to do some debugging in order to
get to the root cause of the problem. It seems to work correctly on
GNU systems. This problem seems to be something specific to the Mac.

Also posting to bug-gnu-utils for grep is okay. It is a generic
catchall for all things gnu. You will have people like me here
helping. But for grep there is a specific mailing list just for grep
bugs. For more follow-up you would reach the grep maintainers
directly at bug-grep@... instead of here. That is the bug
reporting address listed in the grep --help output. More folks there
know the internals of the source code more intimately and should be
able to provide even more help.

You can search archives of bug-grep here:

  http://lists.gnu.org/archive/html/bug-grep/

And in fact looking just now I find a copy of your bug report there.
Good stuff! So whether you realized it or not you had already posted
there. :-)

Hope this helps,
Bob

Re: possible grep bug: -i switch with character class

On Wed, Nov 4, 2009 at 12:50 AM, Bob Proulx <bob@...> wrote:
> Alternatively I would grab the latest stable grep sources and compile
> them yourself. Grep compiles easy and shouldn't be a problem to
> compile from source.

I had trouble compiling directly from source by following the INSTALL
file. For example, I got the following error when running "make":

  search.c:44:19: error: pcre.h: No such file or directory

What I did instead was get the latest version v2.5.4 using MacPorts (
http://www.macports.org/ ), which seems to have supplanted
DarwinPorts.

I'm not seeing the issue in v2.5.4 even with the UTF8 locale, which is
good. You can see more info I posted in a follow-up to my bug report.

> Also posting to bug-gnu-utils for grep is okay. It is a generic
> catchall for all things gnu. You will have people like me here
> helping. But for grep there is a specific mailing list just for grep
> bugs. For more follow-up you would reach the grep maintainers
> directly at bug-grep@... instead of here. That is the bug
> reporting address listed in the grep --help output.

I apologize for posting here instead of to the grep bug list. The
man page listed this e-mail address (though the man page dates to
2002). I did learn a lot from you though, so I appreciate all your
help.

All of this makes me wonder how Apple decides what version of software
to include in its Mac OS X versions. Let me know if you have any
insight!

Thanks again!
--Chris

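With two grep builds on one machine, the stock and the newer version
can be compared side by side. The sketch below is an illustration
only: check_grep is a hypothetical helper, and both binary paths are
assumptions (/usr/bin/grep for the system copy, /opt/local/bin/grep
for a MacPorts install); paths that do not exist are skipped.

```shell
#!/bin/sh
# Return success if the grep binary $1 matches the thread's test case
# under locale $2, failure otherwise.
check_grep() {
    printf 'TK\n' | LC_ALL="$2" "$1" -i '[A-Z]K' >/dev/null 2>&1
}

# Both paths are assumptions; adjust them to wherever the binaries live.
for bin in /usr/bin/grep /opt/local/bin/grep; do
    [ -x "$bin" ] || continue
    version=$("$bin" --version 2>/dev/null | head -n 1)
    for loc in C en_US.UTF-8; do
        if check_grep "$bin" "$loc"; then
            status='match'
        else
            status='no match'
        fi
        printf '%s (%s) LC_ALL=%s: %s\n' "$bin" "$version" "$loc" "$status"
    done
done
```

A "no match" line for one binary and "match" for the other under the
same locale points at the version difference rather than at the
environment.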
Re: possible grep bug: -i switch with character class

Chris Jerdonek wrote:
> I had trouble compiling directly from source by following the INSTALL
> file.

That is a generic INSTALL file that covers building packages that use
the autotools. Instead of every package having different installation
procedures it is easier if packages all share the same procedure and
then modify themselves to fit.

> For example, I got the following error when running "make":
> search.c:44:19: error: pcre.h: No such file or directory

Well, you will need to install any dependent libraries needed by the
package. In my case that would be installing that library on the
system with "apt-get install libpcre-dev" but I have no idea on the
Mac. You can manually configure the package not to use the pcre (perl
compatible regular expressions).

Since I have always had the pcre library installed I had never run
into the case of trying to compile without it. It always just built
easily for me and that was why I suggested it.

A cursory glance through the configure.ac for grep leads me to believe
that it will try to detect if you do not have libpcre available and in
that case automatically disable building with that library. If that
isn't working then that is certainly a valid problem to report
upstream. It is something that perhaps isn't tested very well since
that is a library that will most likely always be installed. So it
might actually be broken without it and no one would realize it.

> What I did instead was get the latest version v2.5.4 using MacPorts (
> http://www.macports.org/ ), which seems to have supplanted
> DarwinPorts.

Good to know.

> I'm not seeing the issue in v2.5.4 even with the UTF8 locale, which is
> good. You can see more info I posted in a follow-up to my bug report.

Thanks for following up on the report with the subsequent information.

> > Also posting to bug-gnu-utils for grep is okay. It is a generic
> > catchall for all things gnu. You will have people like me here
> > helping. But for grep there is a specific mailing list just for grep
> > bugs. For more follow-up you would reach the grep maintainers
> > directly at bug-grep@... instead of here. That is the bug
> > reporting address listed in the grep --help output.
>
> I apologize for posting here instead of to the greps bug list. The
> man page listed this e-mail address (though the man page dates to
> 2002). I did learn a lot from you though, so I appreciate all your
> help.

It is all good so no worries. Things had just progressed to the limit
of what I could do to help and so it was time to direct things to the
better group for it. I am glad I was able to provide some useful
help.

> All of this makes me wonder how Apple decides what version of software
> to include in its Mac OS X versions. Let me know if you have any
> insight!

I have no idea about Apple in particular. But there are a lot of
software distributions using ports of GNU Project code. Obviously the
emphasis for the GNU Project is to help GNU but when possible it is
good to try to help others too. It exposes a wider audience to Free
Software and perhaps will win some more advocates for it.

But GNU is currently a collection of many different projects. Each
operates in the way that their respective maintainers want them to
operate. Some do things one way and some another way. Most
importantly each one releases on their own independent schedule. It
is like having ten thousand independent conveyor belts each running at
a different speed and each producing packages at a different rate.

This isn't too much of a problem for source based GNU systems since
GNU is all about the source. For a large number of users this is the
best model for them to use. This has been called the software bazaar
model.

But the software distributions have a problem to manage this type of
bazaar. They will need to release a complete bundle of software all
together. They need to turn the bazaar model into the cathedral
model. This meets a different set of users' needs.

Usually this means picking the latest stable known good version of a
package and including it in a distribution. This goes through a time
of testing that may be months and months. During the time of testing
a new version of any particular package may be released. Do you throw
away all of the previous testing, update to it, and start the clock
ticking again? Obviously you would never release a full distribution
if you did that and therefore software distributions always release
with at least some software that has already been replaced upstream
with a newer version.

Many upstream projects only want to discuss problems in their current
top of trunk source code. You can see why. It can take a lot of
energy to track older versions. And who knows what versions the many
hundreds of software distributions are shipping at any given time. An
upstream can't track everyone. Personally I have more sympathy for
people using recent known good stable releases even if they have been
superseded by newer upstream releases. But unfortunately some others
are somewhat contemptuous of anything but today's source code. The
end user has a hard time knowing the difference.

A good plan for end users who are wishing to report bugs is to report
them to their software distribution vendor. The maintainer there will
know the upstream processes. They can test the bug against the latest
upstream code and deduce if the bug still exists. They can tell if
the bug is something they did in the distribution. So reporting to
the distributor is never wrong.

You can report it upstream too. You just personally went through this
and hopefully it wasn't too unpleasant. :-) But it looks like in your
case it is something that has already been addressed and will
hopefully trickle down to your system in the future. It just takes
some time. In which case poking your vendor may encourage them to
trickle it down sooner.

More aggressive users will get the latest current stable and compile
it themselves. You just did that. If the problem still exists then
they will report that upstream. That is great! It is all about the
source after all. But that is also more work for the end user who may
not be as skilled for it. However learning more about how things work
is never bad either and with the knowledge gained everyone wins.

And often during this test the upstream compile works fine. In which
case they have another item they could report to their vendor. Or the
problem could be worse there because of a turn in direction that
upstream has taken. In which case that clearly needs to be reported.
Many eyes make all bugs transparent.

Also the vendor may have had to make a compromise. You might be
running into something that is causing you a problem but is necessary
in order to prevent a different worse problem. The Cygwin project is
a good example because the underlying system is so very limited that
many compromises must be made.

Well that is enough of reading me talk about the current practice of
software distribution. I am glad you were able to make progress on
your problem. I look forward to reading any future reports you make.

Good luck!
Bob