unicode issue with QuickLook on Leopard

by Vincent Noel

Hello,

I've been seeing a weird issue with non-ASCII characters appearing
messed up when QuickLooking a utf-8 file saved from TextMate. Here is
a screenshot:
http://vnoel.files.wordpress.com/2008/06/ql-bug.png
The same problem appears when opening the file with TextEdit --
non-ASCII characters are messed up. Unix tools (file, cat) show the
file correctly. After digging around I found a thread [1] on the vim
mailing list that says it is related to extended attributes that are
set by TextEdit and that need to be present for QL to correctly
recognize the utf8 encoding.

If you set the extended attribute manually on the utf8 file, non-ASCII
characters appear fine in QL. This looks like a bug in QL which, given
its lack of publicity, is rarely triggered and probably due to
something specific to my setup. I'm not sure what to do next, apart
from applying 'xattr' to every text file I save with TextMate :-) Does
anyone have any advice?
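
For readers who want to try the workaround described above, here is a minimal sketch in Python. The attribute name `com.apple.TextEncoding` and the value `utf-8;134217984` are assumptions on my part (neither is named in this thread); they are what TextEdit-style tagging is commonly reported to write. The helper names are hypothetical, and `tag_as_utf8` only makes sense on Mac OS X, where the `xattr` command-line tool is available.

```python
import subprocess

# Assumed attribute name and value; neither is stated in the thread itself.
TEXT_ENCODING_ATTR = "com.apple.TextEncoding"
UTF8_VALUE = "utf-8;134217984"  # encoding name, then the CFStringEncoding number (0x08000100)


def xattr_command(path):
    """Build the argv list that would tag `path` as UTF-8 using the xattr tool."""
    return ["xattr", "-w", TEXT_ENCODING_ATTR, UTF8_VALUE, path]


def tag_as_utf8(path):
    """Run the command built above; raises CalledProcessError if xattr fails."""
    subprocess.run(xattr_command(path), check=True)
```

Calling `tag_as_utf8("notes.txt")` would be the scripted equivalent of running `xattr` by hand on each saved file.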

I'm sorry if this issue has already been discussed previously on this
list or elsewhere, but I couldn't find any other reference using
Google.

Thanks !

[1] http://www.nabble.com/MacVim-file-encoding-and-Quicklook-td17289501.html

Cheers,
Vincent Noel

_______________________________________________
textmate mailing list
textmate@...
http://lists.macromates.com/listinfo/textmate

Re: unicode issue with QuickLook on Leopard

by Vincent Noel

Hello,
I'm ashamed to be replying to myself, but I thought maybe this had
slipped through the cracks.

Am I the only one with this issue, or am I doing something really
stupid and obvious?

The inability to properly display UTF-8 files saved from TextMate
using the OS's standard tools seems like a reasonable concern to
me...

Thanks
Cheers,
Vincent Noel


Re: unicode issue with QuickLook on Leopard

by Allan Odgaard

On 29 Jun 2008, at 21:50, Vincent Noel wrote:

> I'm ashamed to be replying to myself, but I thought maybe this had
> slipped through the cracks.
>
> Am I the only one with this issue, or am I doing something really
> stupid and obvious?

Sadly I find that some parts of the OS still assume stuff to be in
MacRoman¹ if not explicitly told otherwise (and some stuff can’t even
safely be told otherwise, like pbcopy/pbpaste).

I’d encourage you to file an enhancement report at
http://bugreport.apple.com/ -- Leopard has moved several things to
UTF-8 (like osascript), but some stuff is still lacking. Explicitly
tagging text files with extended attributes to tell the system that
they are UTF-8 is IMO very wrong, especially given that UTF-8 can be
recognized with 99.999999% certainty (so even if one disagrees about
making it the standard encoding, it can still be at least detected
safely without mistakenly treating e.g. a MacRoman file as UTF-8).


¹ Really the “system encoding” which for US/Western systems will be  
MacRoman, a thing that comes from Classic.
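
The detection claim above can be illustrated with a minimal Python sketch. It relies on strict UTF-8 decoding failing on text that was really written in MacRoman; the function name `looks_like_utf8` is hypothetical, and Python's `mac_roman` codec stands in for the system encoding. This shows the idea only and is not how any Apple framework actually does it.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 from start to finish."""
    try:
        data.decode("utf-8")  # error handling is strict by default
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "déjà vu".encode("utf-8")
macroman_bytes = "déjà vu".encode("mac_roman")

print(looks_like_utf8(utf8_bytes))      # True
print(looks_like_utf8(macroman_bytes))  # False
```

Pure ASCII input also returns True, which is harmless because ASCII bytes decode identically under UTF-8 and MacRoman.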


Re: unicode issue with QuickLook on Leopard

by Adam R. Maxwell


On Jun 29, 2008, at 3:07 PM, Allan Odgaard wrote:

> Explicitly tagging text files with
> extended attributes to tell the system that they are UTF-8 is IMO very
> wrong, especially given that UTF-8 can be recognized with 99.999999%
> certainty (so even if one disagrees about making it the standard
> encoding, it can still be at least detected safely without mistakenly
> treating e.g. a MacRoman file as UTF-8).

According to the Foundation release notes, 10.5 now tries UTF-8 as a  
fallback if -[NSString initWithContentsOfURL:usedEncoding:error:] is
used (presumably if xattrs and BOM are absent).  Does that answer your  
suggestion?  Personally, I like the idea of using xattrs for this, but  
we were using them for that purpose before Leopard.

MacRoman is a good fallback encoding if UTF-8 fails, but QL is  
evidently using it too early.  In fact, it looks like  
NSAttributedString doesn't try UTF-8 and goes right to MacRoman, which  
probably explains the OP's problem with TextEdit.  Very unfortunate.
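
The fallback order discussed above (explicit tag, then BOM, then UTF-8, then MacRoman) can be sketched in Python as follows. This is only an illustration of the ordering, not a description of what Foundation does internally; the function name `decode_text` and the `declared` parameter (standing in for an encoding read from an extended attribute) are hypothetical, and only the UTF-8 BOM is handled.

```python
import codecs
from typing import Optional, Tuple


def decode_text(data: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used), trying the most reliable hint first."""
    # 1. An explicit declaration (e.g. taken from an extended attribute) wins.
    if declared:
        return data.decode(declared), declared
    # 2. A UTF-8 byte order mark, if present, is stripped and trusted.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    # 3. Try strict UTF-8 before any legacy encoding.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 4. Last resort: the legacy system encoding.
    return data.decode("mac_roman"), "mac_roman"
```

Skipping step 3 and going straight to step 4 corresponds to the behaviour described above for QL and NSAttributedString.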

--
Adam

Re: unicode issue with QuickLook on Leopard

by Allan Odgaard

On 30 Jun 2008, at 00:57, Adam R. Maxwell wrote:

> [...] According to the Foundation release notes, 10.5 now tries  
> UTF-8 as a fallback if -[NSString  
> initWithContentsOfURL:usedEncoding:error:] is used (presumably if
> xattrs and BOM are absent).  Does that answer your suggestion?

Answer my suggestion? Not sure what you mean by that.

> Personally, I like the idea of using xattrs for this, but we were  
> using them for that purpose before Leopard.

Using it to make UTF-8 work is bad! As said, UTF-8 can be safely
detected without such an attribute, so the attribute is 100% redundant.

Also, UTF-8 is the only sane encoding to use for new stuff, so text
files should be UTF-8 today. Having to do special tricks to make the
system detect them as such sucks, especially since the majority of text
files are not generated in CoreFoundation-using applications -- even
on a Mac they might come from shell redirects, be generated by
scripts, or similar. So the majority of files will not have this
attribute, and never will.



Re: unicode issue with QuickLook on Leopard

by Takaaki Kato

http://x.nest.jp/mac/080109_0217.htm

Looks like this guy made a QuickLook plugin for files that conform to
the public.plain-text UTI.

I don't know what I'm talking about. Please check the source because  
this may help.


Takaaki
--
Takaaki Kato
http://samuraicoder.net

Re: unicode issue with QuickLook on Leopard

by Vincent Noel

On Mon, Jun 30, 2008 at 00:07, Allan Odgaard <mailinglist@...> wrote:

> Sadly I find that some parts of the OS still assume stuff to be in MacRoman¹
> if not explicitly told otherwise (and some stuff can't even safely be told
> otherwise, like pbcopy/pbpaste).
>
> I'd encourage you to file an enhancement report with
> http://bugreport.apple.com/ -- Leopard has moved several things to UTF-8
> (like osascript), but some stuff is still lacking. Explicitly tagging text
> files with extended attributes to tell the system that they are UTF-8 is IMO
> very wrong, especially given that UTF-8 can be recognized with 99.999999%
> certainty (so even if one disagrees about making it the standard encoding,
> it can still be at least detected safely without mistakenly treating e.g. a
> MacRoman file as UTF-8).

I will file a bug. Thanks for the advice.

Still, it seems like a pretty big screwup on Apple's part,
especially in user-visible 'new' features like Quicklook. It's
especially weird that somebody apparently noticed the problem and
decided to fix it using extended attributes when more standard tools
(e.g. the 'file' command) are perfectly able to identify utf8 without
non-standard trickery...

Cheers
V

Re: unicode issue with QuickLook on Leopard

by Hans-Jörg Bibiko


On 30 Jun 2008, at 12:33, Vincent Noel wrote:

> It's
> especially weird that it seems somebody noticed the problem, and
> decided to fix it using extended attributes when more standard tools
> (e.g. the 'file' command) are perfectly able to identify utf8 without
> non-standard trickery...


Using 'file' is a good idea, BUT it only looks at the first (I don't
know how many) characters of a file. I.e. if you have a rather large
UTF-8 file that contains only 'normal' ASCII except for a last character
of e.g. ü, 'file' will output: "test.txt: ASCII text, with very long
lines" or similar.

The only definite way to get the encoding is to parse the ENTIRE file
or parse to the first byte sequence which determines the used encoding
unambiguously. Or one uses the obsolete UTF-8 BOM (byte order mark at
the beginning of a file).
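
To illustrate the point about parsing the entire file, here is a minimal Python sketch that reads a stream to the end in chunks instead of sampling only its beginning. The function name `classify_stream` and the three labels it returns are hypothetical choices for this example. The incremental decoder is used so that a multi-byte sequence split across two chunks is still handled correctly.

```python
import codecs
import io


def classify_stream(stream, chunk_size=64 * 1024):
    """Read a binary stream to the end; return 'ascii', 'utf-8', or 'unknown'."""
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict by default
    saw_non_ascii = False
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            if not chunk.isascii():
                saw_non_ascii = True
            decoder.decode(chunk)  # raises on malformed UTF-8
        decoder.decode(b"", final=True)  # rejects a sequence cut off at end of file
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8" if saw_non_ascii else "ascii"


# A large ASCII file whose very last character is non-ASCII.
data = b"a" * 1_000_000 + "ü".encode("utf-8")
print(classify_stream(io.BytesIO(data)))  # utf-8
```

This sketch always reads to the end; stopping at the first valid multi-byte sequence would be faster but would miss malformed bytes later in the file.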

--Hans

Re: unicode issue with QuickLook on Leopard

by Vincent Noel

On Mon, Jun 30, 2008 at 12:49, Hans-Joerg Bibiko <bibiko@...> wrote:
> The only definite way to get the encoding is to parse the ENTIRE file
> or parse to the first byte sequence which determines the used encoding
> unambiguously. Or one uses the obsolete UTF-8 BOM (byte order mark at
> the beginning of a file).

Ok... So I guess the real bug is that Quicklook and other utilities
decide to fall back on MacRoman instead of utf8.

Thanks for the clarification :-)

Cheers,
V

Re: unicode issue with QuickLook on Leopard

by Hans-Jörg Bibiko


On 30.06.2008, at 13:04, Vincent Noel wrote:

> On Mon, Jun 30, 2008 at 12:49, Hans-Joerg Bibiko  
> <bibiko@...> wrote:
>> The only definite way to get the encoding is to parse the ENTIRE file
>> or parse to the first byte sequence which determines the used encoding
>> unambiguously. Or one uses the obsolete UTF-8 BOM (byte order mark at
>> the beginning of a file).
>
> Ok... So I guess the real bug is that Quicklook and other utilities
> decide to fall back on MacRoman instead of utf8.
>

This would be one possibility. But the whole issue is much more
complicated.
E.g. it is not possible to determine the text encoding if the text
is stored in ISO-8859-1..12. Every byte sequence would be valid, but
each byte represents a different glyph depending on the encoding.
Even for UTF-8 it is very complex. For instance the UTF-8 byte
sequence C3 A4 (ä) could also be ISO-8859-1 (Ã¤) [it could be that
this makes sense].
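
The ambiguity of the byte pair C3 A4 can be checked directly with a short Python sketch. The variable names are only for illustration; the codecs are the standard `utf-8` and `iso-8859-1` ones shipped with Python.

```python
raw = bytes([0xC3, 0xA4])

as_utf8 = raw.decode("utf-8")         # one character: ä
as_latin1 = raw.decode("iso-8859-1")  # two characters: Ã¤

print(as_utf8, len(as_utf8))
print(as_latin1, len(as_latin1))

# ISO-8859-1 assigns a character to every byte value, so decoding never fails
# and therefore cannot be used to rule the encoding out.
all_bytes = bytes(range(256))
print(len(all_bytes.decode("iso-8859-1")))  # 256
```

Both decodings succeed, so only context (or an explicit declaration) can tell which reading was intended.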

My general suggestion to Apple would be to introduce a unique
attribute 'encoding'. That way each application could store the
correct text encoding in that file attribute.

--Hans

Re: unicode issue with QuickLook on Leopard

by Adam R. Maxwell


On Jun 29, 2008, at 10:39 PM, Allan Odgaard wrote:

> On 30 Jun 2008, at 00:57, Adam R. Maxwell wrote:
>
>> [...] According to the Foundation release notes, 10.5 now tries
>> UTF-8 as a fallback if -[NSString
>> initWithContentsOfURL:usedEncoding:error:] is used (presumably if
>> xattrs and BOM are absent).  Does that answer your suggestion?
>
> Answer my suggestion? Not sure what you mean by that.

UTF-8 is now tried as a fallback before giving up completely.  
Unfortunately, it's limited to two methods on NSString AFAIK.

>> Personally, I like the idea of using xattrs for this, but we were
>> using them for that purpose before Leopard.
>
> Using it to make UTF-8 work is bad! As said, UTF-8 can be safely
> detected without such an attribute, so the attribute is 100% redundant.

Yes, it's redundant for files with a BOM and for UTF-8.  I'm curious
as to why you think its entire purpose is to make UTF-8 work.  I would
suggest that its purpose is to make Latin-1, Shift-JIS, and all the
other weird encodings work.  Using UTF-8 is certainly preferable, but
not everyone can do that!

> Also, UTF-8 is the only sane encoding to use for new stuff, so text
> files should be UTF-8 today. Having to do special tricks to make the
> system detect them as such sucks, especially since the majority of text
> files are not generated in CoreFoundation-using applications -- even
> on a Mac they might come from shell redirects, be generated by
> scripts, or similar. So the majority of files will not have this
> attribute, and never will.

I agree; UTF-8 /should/ be used.  If files are loaded into a Cocoa app  
using the method I noted, it'll try UTF-8 before giving up, regardless  
of whether the attribute is present.  I have to deal with other  
encodings fairly regularly, unfortunately.

--
Adam
