Sort order bug in GNU sort

View: New views
12 Messages — Rating Filter:   Alert me  

Sort order bug in GNU sort

by Luke Hutchison :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

The following is the output of GNU sort (without any switches) on an
unsorted file.  Numerous errors (of the same variety) seem present in the
ordering.  I am using coreutils-7.2-4.fc11.x86_64.  Problems are shown in
red.

Additionally, there probably needs to be a switch added to sort that uses
the entire line as the sort key, not blank-to-non-blank transition (the
default), so that lines are always compared character-by-character in the
same character position, so that corresponding characters are always
compared within the same sort field, and so that sorting always creates the
correct lexicographical ordering of lines based on line content, not based
upon any interpretation of separator boundary.  (I think parsing
blank/non-blank separator boundaries does not always produce the same
thing?)

Please confirm that you have received this email, and let me know if I need
to file this bug elsewhere.

Thanks,
Luke Hutchison

--

sampleId-1000,1.0
sampleId-100,0.125
sampleId-1002,0.25
sampleId-1002,0.5
sampleId-1004,0.25
sampleId-1005,0.0625
sampleId-1005,0.125
sampleId-1006,0.125
sampleId-1007,0.125
sampleId-1008,1.0
sampleId-1010,0.0625
sampleId-101,0.0625
sampleId-1010,1.0
sampleId-1011,0.0625
sampleId-1011,0.125
sampleId-1012,0.125
sampleId-1012,0.25
sampleId-1013,1.0
sampleId-1014,1.0
sampleId-1015,1.0
sampleId-1017,1.0
sampleId-1018,0.0625
sampleId-1018,0.125
sampleId-1019,1.0
sampleId-102,0.0625
sampleId-1020,1.0
sampleId-102,0.5
sampleId-1023,1.0
sampleId-1024,0.125
sampleId-978,1.0
sampleId-979,1.0
sampleId-980,1.0
sampleId-98,1.0
sampleId-981,0.0625
sampleId-981,0.25
sampleId-982,1.0
sampleId-984,0.125
sampleId-984,0.5
sampleId-985,0.0625
sampleId-985,0.5
sampleId-986,1.0
sampleId-987,1.0
sampleId-988,1.0
sampleId-990,1.0
sampleId-99,1.0
sampleId-991,0.25
sampleId-992,0.125
sampleId-992,0.25
sampleId-995,0.0625
sampleId-995,0.25
sampleId-996,0.125
sampleId-996,0.25
sampleId-997,0.125
sampleId-997,0.5

Re: Sort order bug in GNU sort

by Pádraig Brady :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Luke Hutchison wrote:
> Hi,
>
> The following is the output of GNU sort (without any switches) on an
> unsorted file.  Numerous errors (of the same variety) seem present in the
> ordering.  I am using coreutils-7.2-4.fc11.x86_64.  Problems are shown in
> red.

You need to specify the sort command you used.
Does this sort your data correctly?

sort -t, -k1,1V

> Additionally, there probably needs to be a switch added to sort that uses
> the entire line as the sort key,

It does that by default

> not blank-to-non-blank transition

Note also the 'b' option.

cheers,
Pádraig.



Re: Sort order bug in GNU sort

by Luke Hutchison :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Pádraig,
As stated, "The following is the output of GNU sort (without any
switches)" -- i.e. I used the defaults, and did not specify any
commandline switches.  If as you say, by default the whole line is the
sort key, and if default sorting is lexicographic order, how are the
following snippets from the sorted output possibly correct?

sampleId-1010,0.0625
sampleId-101,0.0625
sampleId-1010,1.0

sampleId-980,1.0
sampleId-98,1.0
sampleId-981,0.0625

sampleId-990,1.0
sampleId-99,1.0
sampleId-991,0.25

Based on ASCII encoding (',' < '0' < '1'), I believe these should be:

sampleId-101,0.0625
sampleId-1010,0.0625
sampleId-1010,1.0

sampleId-98,1.0
sampleId-980,1.0
sampleId-981,0.0625

sampleId-99,1.0
sampleId-990,1.0
sampleId-991,0.25

Even if in some weird locale, ',' > '0', or some other weird thing
were true, the two lines "sampleId-1010,0.0625" and
"sampleId-1010,1.0" should be grouped together either before or after
"sampleId-101,0.0625", because they share a common prefix
"sampleId-1010" -- but they are separated.  Similarly,
"sampleId-990,1.0" and "sampleId-991,0.25" absolutely should not be
separated by "sampleId-99,1.0", because there is no way in any locale
that '0' < ',' < '1'.

I was led to think that sorting happened field-wise (not line-wise) by
default by the man page, which says, "-t , --field-separator=SEP : use
SEP instead of non-blank to blank transition".  It would be helpful to
explicitly add to the description of "-k" that "If no key is given,
the whole line is used as the key".

Thanks,
Luke


2009/10/29 Pádraig Brady <P@...>

>
> Luke Hutchison wrote:
> > Hi,
> >
> > The following is the output of GNU sort (without any switches) on an
> > unsorted file.  Numerous errors (of the same variety) seem present in the
> > ordering.  I am using coreutils-7.2-4.fc11.x86_64.  Problems are shown in
> > red.
>
> You need to specify the sort command you used.
> Does this sort your data correctly?
>
> sort -t, -k1,1V
>
> > Additionally, there probably needs to be a switch added to sort that uses
> > the entire line as the sort key,
>
> It does that by default
>
> > not blank-to-non-blank transition
>
> Note also the 'b' option.
>
> cheers,
> Pádraig.



Re: Sort order bug in GNU sort

by Luke Hutchison :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

OK, it turns out that LANG=en_US.UTF-8 by default.  Setting LANG=C or
LC_ALL=C, I get the correct expected sort order as shown below.  Where
does that indicate the bug lies?  Where is the locale-specific
comparison code?

sampleId-100,0.125
sampleId-1000,1.0
sampleId-1002,0.25
sampleId-1002,0.5
sampleId-1004,0.25
sampleId-1005,0.0625
sampleId-1005,0.125
sampleId-1006,0.125
sampleId-1007,0.125
sampleId-1008,1.0
sampleId-101,0.0625
sampleId-1010,0.0625
sampleId-1010,1.0
sampleId-1011,0.0625
sampleId-1011,0.125
sampleId-1012,0.125
sampleId-1012,0.25
sampleId-1013,1.0
sampleId-1014,1.0
sampleId-1015,1.0
sampleId-1017,1.0
sampleId-1018,0.0625
sampleId-1018,0.125
sampleId-1019,1.0
sampleId-102,0.0625
sampleId-102,0.5
sampleId-1020,1.0
sampleId-1023,1.0
sampleId-1024,0.125
sampleId-978,1.0
sampleId-979,1.0
sampleId-98,1.0
sampleId-980,1.0
sampleId-981,0.0625
sampleId-981,0.25
sampleId-982,1.0
sampleId-984,0.125
sampleId-984,0.5
sampleId-985,0.0625
sampleId-985,0.5
sampleId-986,1.0
sampleId-987,1.0
sampleId-988,1.0
sampleId-99,1.0
sampleId-990,1.0
sampleId-991,0.25
sampleId-992,0.125
sampleId-992,0.25
sampleId-995,0.0625
sampleId-995,0.25
sampleId-996,0.125
sampleId-996,0.25
sampleId-997,0.125
sampleId-997,0.5


2009/10/29 Luke Hutchison <luke.hutch@...>:

> Hi Pádraig,
> As stated, "The following is the output of GNU sort (without any
> switches)" -- i.e. I used the defaults, and did not specify any
> commandline switches.  If as you say, by default the whole line is the
> sort key, and if default sorting is lexicographic order, how are the
> following snippets from the sorted output possibly correct?
>
> sampleId-1010,0.0625
> sampleId-101,0.0625
> sampleId-1010,1.0
>
> sampleId-980,1.0
> sampleId-98,1.0
> sampleId-981,0.0625
>
> sampleId-990,1.0
> sampleId-99,1.0
> sampleId-991,0.25
>
> Based on ASCII encoding (',' < '0' < '1'), I believe these should be:
>
> sampleId-101,0.0625
> sampleId-1010,0.0625
> sampleId-1010,1.0
>
> sampleId-98,1.0
> sampleId-980,1.0
> sampleId-981,0.0625
>
> sampleId-99,1.0
> sampleId-990,1.0
> sampleId-991,0.25
>
> Even if in some weird locale, ',' > '0', or some other weird thing
> were true, the two lines "sampleId-1010,0.0625" and
> "sampleId-1010,1.0" should be grouped together either before or after
> "sampleId-101,0.0625", because they share a common prefix
> "sampleId-1010" -- but they are separated.  Similarly,
> "sampleId-990,1.0" and "sampleId-991,0.25" absolutely should not be
> separated by "sampleId-99,1.0", because there is no way in any locale
> that '0' < ',' < '1'.
>
> I was led to think that sorting happened field-wise (not line-wise) by
> default by the man page, which says, "-t , --field-separator=SEP : use
> SEP instead of non-blank to blank transition".  It would be helpful to
> explicitly add to the description of "-k" that "If no key is given,
> the whole line is used as the key".
>
> Thanks,
> Luke
>
>
> 2009/10/29 Pádraig Brady <P@...>
>>
>> Luke Hutchison wrote:
>> > Hi,
>> >
>> > The following is the output of GNU sort (without any switches) on an
>> > unsorted file.  Numerous errors (of the same variety) seem present in the
>> > ordering.  I am using coreutils-7.2-4.fc11.x86_64.  Problems are shown in
>> > red.
>>
>> You need to specify the sort command you used.
>> Does this sort your data correctly?
>>
>> sort -t, -k1,1V
>>
>> > Additionally, there probably needs to be a switch added to sort that uses
>> > the entire line as the sort key,
>>
>> It does that by default
>>
>> > not blank-to-non-blank transition
>>
>> Note also the 'b' option.
>>
>> cheers,
>> Pádraig.
>



Re: Sort order bug in GNU sort

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[please don't top-post on technical lists]

According to Luke Hutchison on 10/29/2009 6:43 PM:

> Hi Pádraig,
> As stated, "The following is the output of GNU sort (without any
> switches)" -- i.e. I used the defaults, and did not specify any
> commandline switches.  If as you say, by default the whole line is the
> sort key, and if default sorting is lexicographic order, how are the
> following snippets from the sorted output possibly correct?
>
> sampleId-1010,0.0625
> sampleId-101,0.0625
> sampleId-1010,1.0

Well, that looks correct to me, if your current locale specifies that
punctuation is ignored during collation (that is, you are getting: 101000
< 101006 < 101010, after ignoring , and .).

http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

Try 'LC_ALL=C sort' to see the difference.

> Even if in some weird locale, ',' > '0', or some other weird thing
> were true, the two lines "sampleId-1010,0.0625" and
> "sampleId-1010,1.0" should be grouped together either before or after
> "sampleId-101,0.0625", because they share a common prefix

Nope.  And the locale is not that weird.  Many locales ignore punctuation
during collation, in order to get dictionary sorting (rather than
byte-wise prefix sorting).

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrqOHYACgkQ84KuGfSFAYCNhQCeMHBDVREcrM+QAlsSRJRGSTkd
3lYAoIIbWaNZvleYo1jKDoDfQ1mpi5aE
=Uhof
-----END PGP SIGNATURE-----



Re: Sort order bug in GNU sort

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Luke Hutchison on 10/29/2009 6:48 PM:
> OK, it turns out that LANG=en_US.UTF-8 by default.  Setting LANG=C or
> LC_ALL=C, I get the correct expected sort order as shown below.  Where
> does that indicate the bug lies?

The bug lies in your expectations.  Sort is doing exactly what the
en_US.UTF-8 locale told it to do.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrqOKwACgkQ84KuGfSFAYCs+gCggd68VByNhevFdQCU97YU7iiS
54kAoIhiS2Absqf2l9KK2BHnAejvyYpZ
=nMnX
-----END PGP SIGNATURE-----



Re: Sort order bug in GNU sort

by Bob Proulx :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Luke Hutchison wrote:
> The following is the output of GNU sort (without any switches) on an
> unsorted file.  Numerous errors (of the same variety) seem present in the
> ordering.  I am using coreutils-7.2-4.fc11.x86_64.

Please say what locale you are using.

  $ locale

Try sorting with a standard locale.

  LC_ALL=C

Frequently this is a user misunderstanding of how the user specified
locale effects the collation sequence and sort ordering.  It appears
to me that your specified locale is en_US.UTF-8 or similar.  In that
locale punctuation is ignored and case is ignored.  IIRC it is
dictionary ordering.

Here is a reference.

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

> Problems are shown in red.

Please, no HTML to mailing lists.  We can't see your coloring.

  http://www.gnu.org/software/coreutils/

Personally I set the following to achieve my desired result.  YMMV.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

Bob



Re: Sort order bug in GNU sort

by Luke Hutchison :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Oct 29, 2009 at 5:51 PM, Eric Blake <ebb9@...> wrote:
> [please don't top-post on technical lists]

Sorry about the lack of mailing list etiquette, the sort manpage
doesn't make it clear that bug-coreutils@... is a mailing list...

> Well, that looks correct to me, if your current locale specifies that
> punctuation is ignored during collation (that is, you are getting: 101000
> < 101006 < 101010, after ignoring , and .).
>
> http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
>
> Try 'LC_ALL=C sort' to see the difference.

I don't know why punctuation is not treated as a space en the en_US
locale, or for that matter why the decision was made to ignore spaces
in en_US (I would love to see the background thinking that went into
that decision, the sorted order "San Juan, Santa Clara, San Teodoro"
doesn't make intuitive sense to me).  I note that the Wikipedia page
on Collation says that sorting is done both ways (with or without
spaces) but that ignoring spaces is supposedly more common.  Anyway,
thanks for explaining and sorry that I didn't see the explanation in
the FAQ.

Given that (according to the FAQ) "This one question arises almost
more often than any other", and given the inconvenience of changing
locales in a script just so sort will work right, wouldn't it make
sense to just add an optional switch that effectively sets LC_ALL=C
for the sort?

I note now the warning in the man page: "*** WARNING *** The locale
specified by the environment affects sort order.  Set LC_ALL=C to get
the traditional sort order that  uses native byte values."  I had no
idea this would affect non-accented characters before hitting this.
Could the manpage please be extended to give a simple example
comparing the sort order in the en_US locale with the C locale, to
make this much clearer?

Thanks,
Luke



Re: Sort order bug in GNU sort

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Luke Hutchison on 10/29/2009 8:30 PM:
> given the inconvenience of changing
> locales in a script just so sort will work right, wouldn't it make
> sense to just add an optional switch that effectively sets LC_ALL=C
> for the sort?

It's already there, in the form of a per-command environment variable
change.  The FAQ mentions changing LC_ALL or LC_COLLATE up front, because
that is easiest, but you can fine-tune it down to the per-process level.
For example, this command runs two different sorts in two different
locales, in order to sort the first field by English dictionary order and
the second by byte order:

LC_ALL=en_US.UTF-8 sort -k1,1 | LC_ALL=C sort -k2,2

> Could the manpage please be extended to give a simple example
> comparing the sort order in the en_US locale with the C locale, to
> make this much clearer?

The man page comes from 'sort --help', and both try to be as concise as
possible.  But the info pages would be a reasonable place to add another
example, if it would help make it obvious what consequences are associated
with alternate locales.  Would you like to try your hand at writing the
patch, or at least propose the wording to use?

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9@...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrqVSQACgkQ84KuGfSFAYAb+wCgv/uje2VKJs7BN3oltsmZxnfa
onMAnRqfxmNY/5VXeL/Bh6yVAYbL9m9/
=Bkoq
-----END PGP SIGNATURE-----



Re: Sort order bug in GNU sort

by Luke Hutchison :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Oct 29, 2009 at 7:53 PM, Eric Blake <ebb9@...> wrote:
>> Could the manpage please be extended to give a simple example
>> comparing the sort order in the en_US locale with the C locale, to
>> make this much clearer?
>
> The man page comes from 'sort --help', and both try to be as concise as
> possible.  But the info pages would be a reasonable place to add another
> example, if it would help make it obvious what consequences are associated
> with alternate locales.  Would you like to try your hand at writing the
> patch, or at least propose the wording to use?

I understand that --help output needs to be relatively short,
nevertheless adding a short example will probably save lots of similar
questions on the mailing lists if the FAQ is correct in stating how
frequently this question is raised.

Original:

*** WARNING *** The locale specified by the environment affects sort
order.  Set LC_ALL=C to get the traditional sort order that uses
native byte values.

Suggested change:

*** WARNING *** The locale specified by the environment affects sort
order.  Assuming any given locale may cause unexpected or unportable
behavior.  For example, the en_US locale is case-insensitive and
ignores spaces and punctuation, while the C locale uses native byte
values and does not ignore either punctuation or spaces.  Set LC_ALL=C
before calling sort to get the traditional sort order.  An example of
different sort orders in two different locales using default switch
values:

       en_US locale sort order:       C locale sort order:
       100,0,San Diego                10,1,San Juan
       10,1,San Juan                  10,1,San Tomas
       10,1,Santa Monica              10,1,Santa Monica
       10,1,San Tomas                 100,0,San Diego
       102,0,Santa Cruz               102,0,Santa Cruz



Re: Sort order bug in GNU sort

by Luke Hutchison :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Oct 29, 2009 at 8:28 PM, Luke Hutchison <luke.hutch@...> wrote:

> On Thu, Oct 29, 2009 at 7:53 PM, Eric Blake <ebb9@...> wrote:
>>> Could the manpage please be extended to give a simple example
>>> comparing the sort order in the en_US locale with the C locale, to
>>> make this much clearer?
>>
>> The man page comes from 'sort --help', and both try to be as concise as
>> possible.  But the info pages would be a reasonable place to add another
>> example, if it would help make it obvious what consequences are associated
>> with alternate locales.  Would you like to try your hand at writing the
>> patch, or at least propose the wording to use?
>
> I understand that --help output needs to be relatively short,
> nevertheless adding a short example will probably save lots of similar
> questions on the mailing lists if the FAQ is correct in stating how
> frequently this question is raised.

I also submit another suggested change for consideration:

Original:

    -k, --key=POS1[,POS2]
         start a key at POS1 (origin 1), end it at POS2 (default end of line)

 Suggested change:

    -k, --key=POS1[,POS2]
         start a key at POS1 (origin 1), end it at POS2 (default end of line);
         if no key specified, the whole input line is used as the key

Reasoning for needing this additional wording: as mentioned in a
previous email, the following text made me think sort used
blank/non-blank field-based sorting if no delimiter character was
given, and in fact the exact opposite is true in en_US, where blanks
are ignored entirely:

    -t, --field-separator=SEP
          use SEP instead of non-blank to blank transition

Thanks.



Re: Sort order bug in GNU sort

by Bob Proulx :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Luke Hutchison wrote:
> I understand that --help output needs to be relatively short,

Let me vote here too, --help needs to be concise.  Longer explaination
ant tutelage should go in the full info documentation.

> nevertheless adding a short example will probably save lots of similar
> questions on the mailing lists if the FAQ is correct in stating how
> frequently this question is raised.

The number of questions about it has definitely fallen off in the last
couple of years.  During the initial transition from the old libc to
the new libc with locale based collating (it is really a libc thing)
there were many.  But not so many more recently.  Mostly because by
now most users have already made the transition.  And now new users
are arriving only having ever seen the new behavior and are learning
it for the first time and are now coming to expect locale sorting by
default.  And the idiom to correct this problem of setting the local
to C/POSIX is getting to be much more widely known.

Bob