Trash Can Question

by Ozan Türkyılmaz

hey,

Which character encoding is used in trash info files? In my opinion,
the specification should say so.



--
Ozan
オザン

Close the world, txEn eht nepO
_______________________________________________
xdg mailing list
xdg@...
http://lists.freedesktop.org/mailman/listinfo/xdg

Re: Trash Can Question

by Bugzilla from faure@kde.org

On Wednesday 05 August 2009, Ozan Türkyılmaz wrote:
> hey,
>
> Which character encoding is used in trash info files? In my opinion,
> the specification should say so.

The same encoding as the file system.
The filename is copied "as is" into the info file.


In practice I would recommend using utf8 everywhere and getting rid
of the whole "filesystem encoding" mess in the first place.

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

Re: Trash Can Question

by Andrea Francia-3

2009/8/5 David Faure <faure@...>
> In practice I would recommend using utf8 everywhere and getting rid
> of the whole "filesystem encoding" mess in the first place.

Who is interested in working on a new draft (a draft) of the spec that solves this and the other problems that have emerged?

--
Andrea Francia
http://andreafrancia.blogspot.com/


Re: Trash Can Question

by Bugzilla from faure@kde.org

On Thursday 06 August 2009, Andrea Francia wrote:
> 2009/8/5 David Faure <faure@...>
> >
> > In practice I would recommend using utf8 everywhere and getting rid
> > of the whole "filesystem encoding" mess in the first place.
>
>
> Who is interested in working on a new draft (a draft) of the spec that
> solves this and the other problems that have emerged?

Well, I am.
But you'll probably need Alex Larsson on board too, for it to make sense.

Anyway, how do you propose to "solve this"?


David (who also has something to suggest in an updated trash spec;
a file containing the size of the trash, as a "cache" of that information;
more details once I have time to formalize this a little bit)

--
David Faure, faure@..., sponsored by Qt Software @ Nokia to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).

Re: Trash Can Question

by Alexander Larsson

On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:
> 2009/8/5 David Faure <faure@...>
> > In practice I would recommend using utf8 everywhere and getting rid
> > of the whole "filesystem encoding" mess in the first place.
>
> Who is interested in working on a new draft (a draft) of the spec that
> solves this and the other problems that have emerged?

This is not a "problem" that should be "solved". It was very
deliberately added to the spec in order to allow all files to be
trashed. How would you trash a file whose name is some non-UTF-8 string
if only UTF-8 is allowed in the format?

Filenames on Linux are zero-terminated arrays of bytes. If you treat them
as anything else, you will fail in some corner cases.
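
To make the objection concrete, here is a small Python sketch that is not part of the original thread. The helper name `is_valid_utf8` and the sample filenames are hypothetical. It only shows that a name written by a Latin-1 application is a perfectly legal byte string on disk, yet a format that accepts UTF-8 only could not store it.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name, as two different byte arrays on disk.
utf8_name = "café.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'
latin1_name = "café.txt".encode("latin-1")  # b'caf\xe9.txt'

for raw in (utf8_name, latin1_name):
    verdict = "storable" if is_valid_utf8(raw) else "NOT storable"
    print(raw, "->", verdict, "in a UTF-8-only format")
```

Both byte arrays are valid names as far as the kernel is concerned; only the second one would be impossible to trash under a UTF-8-only rule.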

Of course, we should all move towards all filenames being in UTF8, avoid
creating non-UTF8 filenames, etc. That is a different issue, and should
not make us limit our specifications to only work on a subset of the
valid filenames.




Re: Trash Can Question

by Andrea Francia-3



2009/8/21 Alexander Larsson <alexl@...>
> On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:
> > 2009/8/5 David Faure <faure@...>
> > > In practice I would recommend using utf8 everywhere and getting rid
> > > of the whole "filesystem encoding" mess in the first place.
> >
> > Who is interested in working on a new draft (a draft) of the spec that
> > solves this and the other problems that have emerged?
>
> This is not a "problem" that should be "solved". It was very
> deliberately added to the spec in order to allow all files to be
> trashed. How would you trash a file named some non-utf8 string if only
> utf8 is allowed in the format?
>
> Filenames on linux are zero terminated arrays of bytes. If you treat it
> like anything else you will just fail in some corner cases.

To me, filenames are lists of Unicode characters. How those filenames are represented as arrays of bytes is a different issue.
As far as I know, the filesystem makes it possible to create a filename containing the zero character ('\0') or the newline ('\n').

> Of course, we should all move towards all filenames being in UTF8, avoid
> creating non-UTF8 filenames, etc.

This sounds strange to me: UTF-8 is an encoding, not a character set.
Maybe there is a small misunderstanding about UTF-8, Unicode, and encoding systems.

It seems to me that you are using the term UTF-8 as if it were a character set.

I see two different aspects:
 1) Which character set should the trash system be able to handle?
 2) How should the trash system handle it?

I think that the trash system should be able to manage filenames and paths expressed in Unicode.
One way to encode Unicode characters is UTF-8, but there are also UTF-16 and others.

I don't see any problem with filesystems whose filenames aren't encoded in UTF-8.
All the pre-Unicode character sets are part of Unicode, and every Unicode character can be represented in UTF-8.
 
> That is a different issue, and should
> not make us limit our specifications to only work on a subset of the
> valid filenames.

That's true, but currently I see the following problems:
 - the supported subset of valid filenames doesn't include filenames with '\n' or '\0' in them
 - it isn't clear (perhaps only to me) which encoding should be used for reading .trashinfo files
 - using a character set like latin1 to encode .trashinfo file contents could lead to a loss of information
 

--
Andrea Francia
http://andreafrancia.blogspot.com/


Re: Trash Can Question

by PCMan-2

On Sat, Aug 22, 2009 at 3:23 AM, Andrea Francia<andrea@...> wrote:

>
>
> 2009/8/21 Alexander Larsson <alexl@...>
>>
>> On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:
>> > 2009/8/5 David Faure <faure@...>
>> > > In practice I would recommend using utf8 everywhere and getting rid
>> > > of the whole "filesystem encoding" mess in the first place.
>> >
>> > Who is interested in working on a new draft (a draft) of the spec that
>> > solves this and the other problems that have emerged?
>>
>> This is not a "problem" that should be "solved". It was very
>> deliberately added to the spec in order to allow all files to be
>> trashed. How would you trash a file named some non-utf8 string if only
>> utf8 is allowed in the format?
>>
>> Filenames on linux are zero terminated arrays of bytes. If you treat it
>> like anything else you will just fail in some corner cases.
>
> For me filenames are a list of unicode characters. The way those filenames
> are represented using array of bytes is a different issue.
> As far I know the filesystem is possible to create filename with the zero
> character '\0' or the newline ('\n') in it.
>>
>> Of course, we should all move towards all filenames being in UTF8, avoid
>> creating non-UTF8 filenames, etc.
This is not a real solution if you're going to support remote filesystems.
On a local machine, you can use any filename encoding you want.
Remote servers, however, sometimes cannot be fully migrated to UTF-8.
So unless your VFS implementation can convert the encodings and only
show UTF-8 to applications, handling non-UTF-8 names will always be needed.

> This sound strange to me, UTF-8 is about encoding not about character set.
> May be there is a little misunderstanding about utf8, unicode and encoding
> system.
> It seems to me that you are using the term utf-8 as character set.
> I see two different aspects:
>  1) which character set the trash system should be able to handle?
>  2) how the trash system handle it?
> I think that the trash system should be able to manage filenames and path
> expressed in unicode.
> One way to encode unicode characters is UTF-8, but there also UTF-16, and
> others.
> I don't see any problem with filesystem whose filenames aren't encoded in
> non-utf8.
> All the pre-unicode character set are part of unicode and all character of
> unicode can be represented in utf8.
No, this assumption is incorrect.
Most of the pre-unicode characters are defined in unicode, but some of
them are not.
So if you convert everything to UTF-8, some data might be lost.
>> That is a different issue, and should
>> not make us limit our specifications to only work on a subset of the
>> valid filenames.
> That's true but currently I see the following problems:
>  - the subset of valid filenames doesn't contains filenames with '\n' or
> '\0' in it
If the stored path is URL-encoded, this is not an issue since they
will be escaped.
Besides, even if the filesystem supports \0 inside a filename, I don't
believe there is a real-world file manager able to handle it. In
theory this should be allowed, but it's an extremely rare use case
that doesn't exist in the real world.

>  - isn't clear (probably only be for me) which encoding should be used for
> reading .trashinfo files.
>  - the uses of character set like latin1 for encoding .trashinfo files
> contents could lead to a loss of information

Re: Trash Can Question

by Alexander Larsson

On Fri, 2009-08-21 at 21:23 +0200, Andrea Francia wrote:

> 2009/8/21 Alexander Larsson <alexl@...>
> > On Thu, 2009-08-06 at 00:05 +0200, Andrea Francia wrote:
> > > 2009/8/5 David Faure <faure@...>
> > > > In practice I would recommend using utf8 everywhere and getting rid
> > > > of the whole "filesystem encoding" mess in the first place.
> > >
> > > Who is interested in working on a new draft (a draft) of the spec that
> > > solves this and the other problems that have emerged?
> >
> > This is not a "problem" that should be "solved". It was very
> > deliberately added to the spec in order to allow all files to be
> > trashed. How would you trash a file named some non-utf8 string if only
> > utf8 is allowed in the format?
> >
> > Filenames on linux are zero terminated arrays of bytes. If you treat it
> > like anything else you will just fail in some corner cases.
>
>
> For me filenames are a list of unicode characters. The way those
> filenames are represented using array of bytes is a different issue.
> As far I know the filesystem is possible to create filename with the
> zero character '\0' or the newline ('\n') in it.

I don't know what operating system you are running, but I'm running
UNIX, and the unix APIs define filenames as a list of bytes, terminated
by a zero byte, where only byte 47 ("/" in ascii) is treated specially.
What you believe filenames are does not really affect things. On the
filesystem a file has a name consisting of an array of bytes, and only
if you can give this exact array of bytes can you open the file. There
is no implicit encoding or character set; there are only bytes.
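
The rule described above can be written down in a few lines. This is a sketch added for illustration, not code from the spec or from any library: `is_valid_component` is a hypothetical helper, and it deliberately ignores length limits and the special names "." and "..".

```python
def is_valid_component(name: bytes) -> bool:
    """Check a single path component the way the paragraph above describes.

    Any non-empty byte string is acceptable as long as it contains
    neither byte 47 ("/") nor the terminating zero byte.
    """
    return len(name) > 0 and b"/" not in name and b"\0" not in name


# Newlines and bytes that are invalid in UTF-8 are fine for the kernel.
print(is_valid_component(b"\xff\xfe new\nline"))  # True
# A slash separates components, so it cannot be inside one.
print(is_valid_component(b"a/b"))                 # False
# A zero byte ends the string, so it cannot be inside one either.
print(is_valid_component(b"a\0b"))                # False
```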

A filename may be a string of bytes that are not valid in any existing
encoding of any existing character set. You still need to be able to
represent this in the trashinfo file. And even if the filename *is* in
some valid encoding, you don't know which one it is. The various locale
settings can give you a hint about what it may be, and should be used
when you *create* a new filename from a known Unicode string. However,
what's on the disk is what's on the disk, and it has no strict connection
to Unicode.
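
As an aside that is not from the thread: a program that insists on holding names as text can still avoid losing bytes if its decoding step is reversible. The sketch below uses Python's "surrogateescape" error handler for that and contrasts it with the lossy "replace" handler. The sample name is made up.

```python
# A made-up on-disk name containing bytes that are not valid UTF-8.
raw = b"report-\xe9\xff.txt"

# Reversible: undecodable bytes become lone surrogate code points,
# and encoding with the same handler restores the exact bytes.
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

# Lossy: undecodable bytes become U+FFFD, so the original bytes are gone
# and the result no longer names the same file.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

print(restored == raw)  # True
print(lossy == raw)     # False
```

The reversible form is still only a carrier for the bytes. It does not tell you which encoding, if any, the name was written in.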

>         Of course, we should all move towards all filenames being in
>         UTF8, avoid
>         creating non-UTF8 filenames, etc.
>
>
> This sound strange to me, UTF-8 is about encoding not about character
> set.
> May be there is a little misunderstanding about utf8, unicode and
> encoding system.

I'm not confused about this.

> It seems to me that you are using the term utf-8 as character set.
>
>
> I see two different aspects:
>  1) which character set the trash system should be able to handle?

The trash system is not about handling any character set at all. It is
about storing the identifier used on the operating system to access the
file. Anything else and there are valid files that the trash system
wouldn't handle.

>  2) how the trash system handle it?
>
> I think that the trash system should be able to manage filenames and
> path expressed in unicode.
> One way to encode unicode characters is UTF-8, but there also UTF-16,
> and others.

So, say you have a filename that is basically a random set of bytes, not
valid utf8, not valid utf16. It's e.g. valid latin-1, because all byte
strings are, but if you view it in latin-1 it's full of unprintable
characters and gobbledygook. This is a valid filename in UNIX, and you
must give exactly this array of bytes in order to access the file (to
open it, rename, delete, etc). How do you propose to "encode this in UTF-8"?
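
A short sketch of that situation, added for illustration with a made-up four-byte name. The helper `unprintable_count` is hypothetical. Latin-1 maps every byte to a character, so decoding never fails and round-trips exactly, but the result is not meaningful text.

```python
def unprintable_count(name: bytes) -> int:
    """Count characters that are unprintable when the bytes are read as Latin-1."""
    text = name.decode("latin-1")  # never raises: every byte maps to a character
    return sum(1 for ch in text if not ch.isprintable())


name = bytes([0x80, 0x01, 0xFF, 0xC3])  # arbitrary bytes, not valid UTF-8

# Latin-1 round-trips any byte string unchanged.
print(name.decode("latin-1").encode("latin-1") == name)

# Some of the resulting "characters" are control codes.
print(unprintable_count(name))

# UTF-8 refuses the same bytes outright.
try:
    name.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc.reason)
```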

> I don't see any problem with filesystem whose filenames aren't encoded
> in non-utf8.
> All the pre-unicode character set are part of unicode and all
> character of unicode can be represented in utf8.
>  
>         That is a different issue, and should
>         not make us limit our specifications to only work on a subset
>         of the
>         valid filenames.
>
>
> That's true but currently I see the following problems:
>  - the subset of valid filenames doesn't contains filenames with '\n'
> or '\0' in it

Oh, the string itself in the file is URI-style encoded, so it can
contain any sort of bytes. (In fact, it can even contain encoded \0s, I
guess, but that is unlikely to work well, as you can't e.g. pass such a
filename to the kernel, since it thinks supplied filenames stop at the
first zero.)
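
For readers who want to see the escaping in action, here is a minimal sketch using Python's standard `urllib.parse` module. The sample path is invented, and treating "/" as the only extra safe character is an assumption of this example rather than a statement about what any particular trash implementation emits.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# An invented path containing a newline and a byte that is not valid UTF-8.
raw_path = b"/home/user/bad\nname\xff.txt"

# Percent-encode every byte that is not an unreserved ASCII character.
# "/" is kept literal so the path structure stays readable.
encoded = quote_from_bytes(raw_path, safe="/")
print(encoded)  # /home/user/bad%0Aname%FF.txt

# Decoding yields the exact original bytes, with no charset involved.
decoded = unquote_to_bytes(encoded)
print(decoded == raw_path)  # True
```

The encoded form is plain ASCII, so it fits on one line of a text file regardless of what the original bytes were.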

>  - isn't clear (probably only be for me) which encoding should be used
> for reading .trashinfo files.

In general, UTF-8 is the encoding of the whole file, but the name key,
when unescaped, is to be treated as an array of bytes representing the
actual filename, which may be in any encoding or none.
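
A minimal reader that follows this rule might look like the sketch below. It is an illustration, not a reference implementation: it assumes the key is spelled `Path` under a `[Trash Info]` header, as in the published Trash specification, and the function name `read_trashed_path` and the sample file contents are invented.

```python
from urllib.parse import unquote_to_bytes


def read_trashed_path(trashinfo: bytes) -> bytes:
    """Return the original path stored in a .trashinfo file as raw bytes.

    The file is handled as bytes throughout, so the unescaped value is
    never forced into any character encoding.
    """
    for line in trashinfo.splitlines():
        if line.startswith(b"Path="):
            return unquote_to_bytes(line[len(b"Path="):])
    raise ValueError("no Path key found")


# Invented example: the trashed file was named in Latin-1, not UTF-8.
sample = (
    b"[Trash Info]\n"
    b"Path=/home/user/caf%E9.txt\n"
    b"DeletionDate=2009-08-22T08:54:00\n"
)

print(read_trashed_path(sample))  # b'/home/user/caf\xe9.txt'
```

The returned bytes can be handed straight to file APIs that accept byte paths. Decoding them is only needed for display.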

>  - the uses of character set like latin1 for encoding .trashinfo files
> contents could lead to a loss of information

Since the filename is uri encoded we can store whatever bytes we want in
the filename. However, once that is decoded we can't expect it to be in
any encoding or character set, because that would as you say lead to a
loss of information.




Re: Trash Can Question

by Alexander Larsson

On Sat, 2009-08-22 at 08:54 +0800, PCMan wrote:

> On Sat, Aug 22, 2009 at 3:23 AM, Andrea Francia<andrea@...> wrote:
> >
> >
> >>
> >> Of course, we should all move towards all filenames being in UTF8, avoid
> >> creating non-UTF8 filenames, etc.
> This is not a real solution if you're going to support remote filesystems.
> On local machine, you can use any filename encoding you want.
> The remote servers, however, cannot be totally migrated to UTF-8 sometimes.
> So unless your vfs implementation can convert the encodings and only
> show UTF-8 to applications, handling non-UTF-8 will always be needed.

I'm of course talking about local files here. Obviously you have to
treat remote systems differently, exactly how depends on the remote
filesystem.

In gvfs for instance this is handled by all filenames being strings of
bytes in undefined encoding, but there are backend-implemented ways to
get the "display name" of a file (and back) so that you can display this
in a user interface.
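
The split between the real name and a display name can be sketched as follows. This is not the gvfs API. `display_name` is a hypothetical helper, and the fallback text it appends is an arbitrary choice made for this example.

```python
def display_name(raw: bytes) -> str:
    """Produce a string suitable for showing in a user interface.

    The raw bytes remain the only identifier used to access the file.
    This string is for display and is not guaranteed to map back.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Show something readable and flag that the name is not valid UTF-8.
        return raw.decode("utf-8", errors="replace") + " (invalid encoding)"


print(display_name("naïve.txt".encode("utf-8")))
print(display_name(b"caf\xe9.txt"))
```

In a real backend the reverse direction matters too, which is why the text above says the mapping is implemented per backend rather than by a single generic rule.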

> Besides, even the filesystem support using \0 inside filename, I don't
> believe that there is a real-world filemanager able to handle this. In
> theory this should be allowed but it's an extremely rare use case
> which doesn't exist in real world.

How could this be allowed? How would you pass such a filename to the
filesystem via e.g. the open() API? Any pathname passed via the POSIX
APIs is a "C string", i.e. it ends at the first zero byte passed in.
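
The same limit is visible from higher-level languages. In the sketch below, which is an illustration and not part of the thread, Python refuses a path with an embedded zero byte before any system call is made, because it could not be passed to the C-level API intact. The helper name `try_stat` and the sample paths are invented.

```python
import os


def try_stat(path: bytes) -> str:
    """Report what happens when a byte path is handed to os.stat()."""
    try:
        os.stat(path)
    except ValueError as exc:
        # Raised by Python itself: the path cannot be represented as a C string.
        return f"rejected before reaching the kernel: {exc}"
    except OSError as exc:
        # The kernel saw the path and answered (for example, "no such file").
        return f"kernel answered: {exc.strerror}"
    return "exists"


print(try_stat(b"foo\0bar"))
print(try_stat(b"surely-missing-\xff\xfe-file"))
```

In C the call would not fail this way. The kernel would simply see the name as ending at the first zero byte, which is the point made above.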



Re: Trash Can Question

by Andrea Francia-3

I reviewed the messages from Larsson and PCMan, and I think I was wrong about what constitutes an identifier for a file in a filesystem. Thanks to you both for helping me to revise my opinions.

I think there are still some things I don't understand; I'll write about my doubts after I have reviewed some more documentation.

--
Andrea Francia
http://andreafrancia.blogspot.com/
