|
View:
New views
12 Messages
—
Rating Filter:
Alert me
|
|
|
Sort order bug in GNU sortHi,
The following is the output of GNU sort (without any switches) on an unsorted file. Numerous errors (of the same variety) seem present in the ordering. I am using coreutils-7.2-4.fc11.x86_64. Problems are shown in red. Additionally, there probably needs to be a switch added to sort that uses the entire line as the sort key, not blank-to-non-blank transition (the default), so that lines are always compared character-by-character in the same character position, so that corresponding characters are always compared within the same sort field, and so that sorting always creates the correct lexicographical ordering of lines based on line content, not based upon any interpretation of separator boundary. (I think parsing blank/non-blank separator boundaries does not always produce the same thing?) Please confirm that you have received this email, and let me know if I need to file this bug elsewhere. Thanks, Luke Hutchison -- sampleId-1000,1.0 sampleId-100,0.125 sampleId-1002,0.25 sampleId-1002,0.5 sampleId-1004,0.25 sampleId-1005,0.0625 sampleId-1005,0.125 sampleId-1006,0.125 sampleId-1007,0.125 sampleId-1008,1.0 sampleId-1010,0.0625 sampleId-101,0.0625 sampleId-1010,1.0 sampleId-1011,0.0625 sampleId-1011,0.125 sampleId-1012,0.125 sampleId-1012,0.25 sampleId-1013,1.0 sampleId-1014,1.0 sampleId-1015,1.0 sampleId-1017,1.0 sampleId-1018,0.0625 sampleId-1018,0.125 sampleId-1019,1.0 sampleId-102,0.0625 sampleId-1020,1.0 sampleId-102,0.5 sampleId-1023,1.0 sampleId-1024,0.125 sampleId-978,1.0 sampleId-979,1.0 sampleId-980,1.0 sampleId-98,1.0 sampleId-981,0.0625 sampleId-981,0.25 sampleId-982,1.0 sampleId-984,0.125 sampleId-984,0.5 sampleId-985,0.0625 sampleId-985,0.5 sampleId-986,1.0 sampleId-987,1.0 sampleId-988,1.0 sampleId-990,1.0 sampleId-99,1.0 sampleId-991,0.25 sampleId-992,0.125 sampleId-992,0.25 sampleId-995,0.0625 sampleId-995,0.25 sampleId-996,0.125 sampleId-996,0.25 sampleId-997,0.125 sampleId-997,0.5 |
|
|
Re: Sort order bug in GNU sortLuke Hutchison wrote:
> Hi, > > The following is the output of GNU sort (without any switches) on an > unsorted file. Numerous errors (of the same variety) seem present in the > ordering. I am using coreutils-7.2-4.fc11.x86_64. Problems are shown in > red. You need to specify the sort command you used. Does this sort your data correctly? sort -t, -k1,1V > Additionally, there probably needs to be a switch added to sort that uses > the entire line as the sort key, It does that by default > not blank-to-non-blank transition Note also the 'b' option. cheers, Pádraig. |
|
|
Re: Sort order bug in GNU sortHi Pádraig,
As stated, "The following is the output of GNU sort (without any switches)" -- i.e. I used the defaults, and did not specify any commandline switches. If as you say, by default the whole line is the sort key, and if default sorting is lexicographic order, how are the following snippets from the sorted output possibly correct? sampleId-1010,0.0625 sampleId-101,0.0625 sampleId-1010,1.0 sampleId-980,1.0 sampleId-98,1.0 sampleId-981,0.0625 sampleId-990,1.0 sampleId-99,1.0 sampleId-991,0.25 Based on ASCII encoding (',' < '0' < '1'), I believe these should be: sampleId-101,0.0625 sampleId-1010,0.0625 sampleId-1010,1.0 sampleId-98,1.0 sampleId-980,1.0 sampleId-981,0.0625 sampleId-99,1.0 sampleId-990,1.0 sampleId-991,0.25 Even if in some weird locale, ',' > '0', or some other weird thing were true, the two lines "sampleId-1010,0.0625" and "sampleId-1010,1.0" should be grouped together either before or after "sampleId-101,0.0625", because they share a common prefix "sampleId-1010" -- but they are separated. Similarly, "sampleId-990,1.0" and "sampleId-991,0.25" absolutely should not be separated by "sampleId-99,1.0", because there is no way in any locale that '0' < ',' < '1'. I was led to think that sorting happened field-wise (not line-wise) by default by the man page, which says, "-t , --field-separator=SEP : use SEP instead of non-blank to blank transition". It would be helpful to explicitly add to the description of "-k" that "If no key is given, the whole line is used as the key". Thanks, Luke 2009/10/29 Pádraig Brady <P@...> > > Luke Hutchison wrote: > > Hi, > > > > The following is the output of GNU sort (without any switches) on an > > unsorted file. Numerous errors (of the same variety) seem present in the > > ordering. I am using coreutils-7.2-4.fc11.x86_64. Problems are shown in > > red. > > You need to specify the sort command you used. > Does this sort your data correctly? > > sort -t, -k1,1V > > > Additionally, there probably needs to be a switch added to sort that uses > > the entire line as the sort key, > > It does that by default > > > not blank-to-non-blank transition > > Note also the 'b' option. > > cheers, > Pádraig. |
|
|
Re: Sort order bug in GNU sortOK, it turns out that LANG=en_US.UTF-8 by default. Setting LANG=C or
LC_ALL=C, I get the correct expected sort order as shown below. Where does that indicate the bug lies? Where is the locale-specific comparison code? sampleId-100,0.125 sampleId-1000,1.0 sampleId-1002,0.25 sampleId-1002,0.5 sampleId-1004,0.25 sampleId-1005,0.0625 sampleId-1005,0.125 sampleId-1006,0.125 sampleId-1007,0.125 sampleId-1008,1.0 sampleId-101,0.0625 sampleId-1010,0.0625 sampleId-1010,1.0 sampleId-1011,0.0625 sampleId-1011,0.125 sampleId-1012,0.125 sampleId-1012,0.25 sampleId-1013,1.0 sampleId-1014,1.0 sampleId-1015,1.0 sampleId-1017,1.0 sampleId-1018,0.0625 sampleId-1018,0.125 sampleId-1019,1.0 sampleId-102,0.0625 sampleId-102,0.5 sampleId-1020,1.0 sampleId-1023,1.0 sampleId-1024,0.125 sampleId-978,1.0 sampleId-979,1.0 sampleId-98,1.0 sampleId-980,1.0 sampleId-981,0.0625 sampleId-981,0.25 sampleId-982,1.0 sampleId-984,0.125 sampleId-984,0.5 sampleId-985,0.0625 sampleId-985,0.5 sampleId-986,1.0 sampleId-987,1.0 sampleId-988,1.0 sampleId-99,1.0 sampleId-990,1.0 sampleId-991,0.25 sampleId-992,0.125 sampleId-992,0.25 sampleId-995,0.0625 sampleId-995,0.25 sampleId-996,0.125 sampleId-996,0.25 sampleId-997,0.125 sampleId-997,0.5 2009/10/29 Luke Hutchison <luke.hutch@...>: > Hi Pádraig, > As stated, "The following is the output of GNU sort (without any > switches)" -- i.e. I used the defaults, and did not specify any > commandline switches. If as you say, by default the whole line is the > sort key, and if default sorting is lexicographic order, how are the > following snippets from the sorted output possibly correct? > > sampleId-1010,0.0625 > sampleId-101,0.0625 > sampleId-1010,1.0 > > sampleId-980,1.0 > sampleId-98,1.0 > sampleId-981,0.0625 > > sampleId-990,1.0 > sampleId-99,1.0 > sampleId-991,0.25 > > Based on ASCII encoding (',' < '0' < '1'), I believe these should be: > > sampleId-101,0.0625 > sampleId-1010,0.0625 > sampleId-1010,1.0 > > sampleId-98,1.0 > sampleId-980,1.0 > sampleId-981,0.0625 > > sampleId-99,1.0 > sampleId-990,1.0 > sampleId-991,0.25 > > Even if in some weird locale, ',' > '0', or some other weird thing > were true, the two lines "sampleId-1010,0.0625" and > "sampleId-1010,1.0" should be grouped together either before or after > "sampleId-101,0.0625", because they share a common prefix > "sampleId-1010" -- but they are separated. Similarly, > "sampleId-990,1.0" and "sampleId-991,0.25" absolutely should not be > separated by "sampleId-99,1.0", because there is no way in any locale > that '0' < ',' < '1'. > > I was led to think that sorting happened field-wise (not line-wise) by > default by the man page, which says, "-t , --field-separator=SEP : use > SEP instead of non-blank to blank transition". It would be helpful to > explicitly add to the description of "-k" that "If no key is given, > the whole line is used as the key". > > Thanks, > Luke > > > 2009/10/29 Pádraig Brady <P@...> >> >> Luke Hutchison wrote: >> > Hi, >> > >> > The following is the output of GNU sort (without any switches) on an >> > unsorted file. Numerous errors (of the same variety) seem present in the >> > ordering. I am using coreutils-7.2-4.fc11.x86_64. Problems are shown in >> > red. >> >> You need to specify the sort command you used. >> Does this sort your data correctly? >> >> sort -t, -k1,1V >> >> > Additionally, there probably needs to be a switch added to sort that uses >> > the entire line as the sort key, >> >> It does that by default >> >> > not blank-to-non-blank transition >> >> Note also the 'b' option. >> >> cheers, >> Pádraig. > |
|
|
Re: Sort order bug in GNU sort-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 [please don't top-post on technical lists] According to Luke Hutchison on 10/29/2009 6:43 PM: > Hi Pádraig, > As stated, "The following is the output of GNU sort (without any > switches)" -- i.e. I used the defaults, and did not specify any > commandline switches. If as you say, by default the whole line is the > sort key, and if default sorting is lexicographic order, how are the > following snippets from the sorted output possibly correct? > > sampleId-1010,0.0625 > sampleId-101,0.0625 > sampleId-1010,1.0 Well, that looks correct to me, if your current locale specifies that punctuation is ignored during collation (that is, you are getting: 101000 < 101006 < 101010, after ignoring , and .). http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021 Try 'LC_ALL=C sort' to see the difference. > Even if in some weird locale, ',' > '0', or some other weird thing > were true, the two lines "sampleId-1010,0.0625" and > "sampleId-1010,1.0" should be grouped together either before or after > "sampleId-101,0.0625", because they share a common prefix Nope. And the locale is not that weird. Many locales ignore punctuation during collation, in order to get dictionary sorting (rather than byte-wise prefix sorting). - -- Don't work too hard, make some time for fun as well! Eric Blake ebb9@... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkrqOHYACgkQ84KuGfSFAYCNhQCeMHBDVREcrM+QAlsSRJRGSTkd 3lYAoIIbWaNZvleYo1jKDoDfQ1mpi5aE =Uhof -----END PGP SIGNATURE----- |
|
|
Re: Sort order bug in GNU sort-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 According to Luke Hutchison on 10/29/2009 6:48 PM: > OK, it turns out that LANG=en_US.UTF-8 by default. Setting LANG=C or > LC_ALL=C, I get the correct expected sort order as shown below. Where > does that indicate the bug lies? The bug lies in your expectations. Sort is doing exactly what the en_US.UTF-8 locale told it to do. - -- Don't work too hard, make some time for fun as well! Eric Blake ebb9@... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkrqOKwACgkQ84KuGfSFAYCs+gCggd68VByNhevFdQCU97YU7iiS 54kAoIhiS2Absqf2l9KK2BHnAejvyYpZ =nMnX -----END PGP SIGNATURE----- |
|
|
Re: Sort order bug in GNU sortLuke Hutchison wrote:
> The following is the output of GNU sort (without any switches) on an > unsorted file. Numerous errors (of the same variety) seem present in the > ordering. I am using coreutils-7.2-4.fc11.x86_64. Please say what locale you are using. $ locale Try sorting with a standard locale. LC_ALL=C Frequently this is a user misunderstanding of how the user specified locale effects the collation sequence and sort ordering. It appears to me that your specified locale is en_US.UTF-8 or similar. In that locale punctuation is ignored and case is ignored. IIRC it is dictionary ordering. Here is a reference. http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021 > Problems are shown in red. Please, no HTML to mailing lists. We can't see your coloring. http://www.gnu.org/software/coreutils/ Personally I set the following to achieve my desired result. YMMV. export LANG=en_US.UTF-8 export LC_COLLATE=C Bob |
|
|
Re: Sort order bug in GNU sortOn Thu, Oct 29, 2009 at 5:51 PM, Eric Blake <ebb9@...> wrote:
> [please don't top-post on technical lists] Sorry about the lack of mailing list etiquette, the sort manpage doesn't make it clear that bug-coreutils@... is a mailing list... > Well, that looks correct to me, if your current locale specifies that > punctuation is ignored during collation (that is, you are getting: 101000 > < 101006 < 101010, after ignoring , and .). > > http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021 > > Try 'LC_ALL=C sort' to see the difference. I don't know why punctuation is not treated as a space en the en_US locale, or for that matter why the decision was made to ignore spaces in en_US (I would love to see the background thinking that went into that decision, the sorted order "San Juan, Santa Clara, San Teodoro" doesn't make intuitive sense to me). I note that the Wikipedia page on Collation says that sorting is done both ways (with or without spaces) but that ignoring spaces is supposedly more common. Anyway, thanks for explaining and sorry that I didn't see the explanation in the FAQ. Given that (according to the FAQ) "This one question arises almost more often than any other", and given the inconvenience of changing locales in a script just so sort will work right, wouldn't it make sense to just add an optional switch that effectively sets LC_ALL=C for the sort? I note now the warning in the man page: "*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values." I had no idea this would affect non-accented characters before hitting this. Could the manpage please be extended to give a simple example comparing the sort order in the en_US locale with the C locale, to make this much clearer? Thanks, Luke |
|
|
Re: Sort order bug in GNU sort-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 According to Luke Hutchison on 10/29/2009 8:30 PM: > given the inconvenience of changing > locales in a script just so sort will work right, wouldn't it make > sense to just add an optional switch that effectively sets LC_ALL=C > for the sort? It's already there, in the form of a per-command environment variable change. The FAQ mentions changing LC_ALL or LC_COLLATE up front, because that is easiest, but you can fine-tune it down to the per-process level. For example, this command runs two different sorts in two different locales, in order to sort the first field by English dictionary order and the second by byte order: LC_ALL=en_US.UTF-8 sort -k1,1 | LC_ALL=C sort -k2,2 > Could the manpage please be extended to give a simple example > comparing the sort order in the en_US locale with the C locale, to > make this much clearer? The man page comes from 'sort --help', and both try to be as concise as possible. But the info pages would be a reasonable place to add another example, if it would help make it obvious what consequences are associated with alternate locales. Would you like to try your hand at writing the patch, or at least propose the wording to use? - -- Don't work too hard, make some time for fun as well! Eric Blake ebb9@... -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (Cygwin) Comment: Public key at home.comcast.net/~ericblake/eblake.gpg Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkrqVSQACgkQ84KuGfSFAYAb+wCgv/uje2VKJs7BN3oltsmZxnfa onMAnRqfxmNY/5VXeL/Bh6yVAYbL9m9/ =Bkoq -----END PGP SIGNATURE----- |
|
|
Re: Sort order bug in GNU sortOn Thu, Oct 29, 2009 at 7:53 PM, Eric Blake <ebb9@...> wrote:
>> Could the manpage please be extended to give a simple example >> comparing the sort order in the en_US locale with the C locale, to >> make this much clearer? > > The man page comes from 'sort --help', and both try to be as concise as > possible. But the info pages would be a reasonable place to add another > example, if it would help make it obvious what consequences are associated > with alternate locales. Would you like to try your hand at writing the > patch, or at least propose the wording to use? I understand that --help output needs to be relatively short, nevertheless adding a short example will probably save lots of similar questions on the mailing lists if the FAQ is correct in stating how frequently this question is raised. Original: *** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values. Suggested change: *** WARNING *** The locale specified by the environment affects sort order. Assuming any given locale may cause unexpected or unportable behavior. For example, the en_US locale is case-insensitive and ignores spaces and punctuation, while the C locale uses native byte values and does not ignore either punctuation or spaces. Set LC_ALL=C before calling sort to get the traditional sort order. An example of different sort orders in two different locales using default switch values: en_US locale sort order: C locale sort order: 100,0,San Diego 10,1,San Juan 10,1,San Juan 10,1,San Tomas 10,1,Santa Monica 10,1,Santa Monica 10,1,San Tomas 100,0,San Diego 102,0,Santa Cruz 102,0,Santa Cruz |
|
|
Re: Sort order bug in GNU sortOn Thu, Oct 29, 2009 at 8:28 PM, Luke Hutchison <luke.hutch@...> wrote:
> On Thu, Oct 29, 2009 at 7:53 PM, Eric Blake <ebb9@...> wrote: >>> Could the manpage please be extended to give a simple example >>> comparing the sort order in the en_US locale with the C locale, to >>> make this much clearer? >> >> The man page comes from 'sort --help', and both try to be as concise as >> possible. But the info pages would be a reasonable place to add another >> example, if it would help make it obvious what consequences are associated >> with alternate locales. Would you like to try your hand at writing the >> patch, or at least propose the wording to use? > > I understand that --help output needs to be relatively short, > nevertheless adding a short example will probably save lots of similar > questions on the mailing lists if the FAQ is correct in stating how > frequently this question is raised. I also submit another suggested change for consideration: Original: -k, --key=POS1[,POS2] start a key at POS1 (origin 1), end it at POS2 (default end of line) Suggested change: -k, --key=POS1[,POS2] start a key at POS1 (origin 1), end it at POS2 (default end of line); if no key specified, the whole input line is used as the key Reasoning for needing this additional wording: as mentioned in a previous email, the following text made me think sort used blank/non-blank field-based sorting if no delimiter character was given, and in fact the exact opposite is true in en_US, where blanks are ignored entirely: -t, --field-separator=SEP use SEP instead of non-blank to blank transition Thanks. |
|
|
Re: Sort order bug in GNU sortLuke Hutchison wrote:
> I understand that --help output needs to be relatively short, Let me vote here too, --help needs to be concise. Longer explaination ant tutelage should go in the full info documentation. > nevertheless adding a short example will probably save lots of similar > questions on the mailing lists if the FAQ is correct in stating how > frequently this question is raised. The number of questions about it has definitely fallen off in the last couple of years. During the initial transition from the old libc to the new libc with locale based collating (it is really a libc thing) there were many. But not so many more recently. Mostly because by now most users have already made the transition. And now new users are arriving only having ever seen the new behavior and are learning it for the first time and are now coming to expect locale sorting by default. And the idiom to correct this problem of setting the local to C/POSIX is getting to be much more widely known. Bob |
| Free embeddable forum powered by Nabble | Forum Help |