Re: checkout problem with Chinese filenames on linux

by Zsolt Koppany-2

Dear Alexander,

I have verified the following:

- The test filesystem (reiserfs) DOES support UTF-8 file names, and the file
names look equally 'spoiled' whether viewed in a browser or in a directory
listing. So this is NOT the cause of the problem: both our OS and FS
support UTF-8, and everything works fine when the locale is set to en_US.UTF8.

- I also verified this with the native svn/svnadmin tools: svnadmin properly
imports the dump file as UTF-8 and ignores the locale settings, although it
complains about them. So the imported repository will be correct!
When we try to check out this repository using the svn command, it aborts
with the error message shown below. Here is the full printout of the
svnadmin/svn commands:
zluspai@duo:~/tmp/utf8test/locale-test$ ./run.sh
---- creating repository ----
svnadmin: warning: cannot set LC_CTYPE locale
svnadmin: warning: environment variable LC_ALL is en_US
svnadmin: warning: please check that your locale name is correct
svnadmin: warning: cannot set LC_CTYPE locale
svnadmin: warning: environment variable LC_ALL is en_US
svnadmin: warning: please check that your locale name is correct
<<< Started new transaction, based on original revision 1
     * adding path : branches ... done.
     * adding path : tags ... done.
     * adding path : trunk ... done.

------- Committed revision 1 >>>

<<< Started new transaction, based on original revision 2
     * adding path : trunk/?\230?\150?\176?\229?\162?\158?\230?\150?\135?\229?\173?\151?\230?\150?\135?\228?\187?\182.txt ... done.

------- Committed revision 2 >>>

<<< Started new transaction, based on original revision 3
     * adding path : trunk/OrderScheduleMO.java ... done.

------- Committed revision 3 >>>

<<< Started new transaction, based on original revision 4
     * editing path : trunk/OrderScheduleMO.java ... done.

------- Committed revision 4 >>>

---- checking out from repository ----
./run.sh: 21: md: not found
svn: warning: cannot set LC_CTYPE locale
svn: warning: environment variable LC_ALL is en_US
svn: warning: please check that your locale name is correct
A    testdir/trunk
svn: Can't convert string from 'UTF-8' to native encoding:
svn: testdir/trunk/?\230?\150?\176?\229?\162?\158?\230?\150?\135?\229?\173?\151?\230?\150?\135?\228?\187?\182.txt
zluspai@duo:~/tmp/utf8test/locale-test$
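
For reference, the ?\ddd escapes in the svn output above are decimal byte
values. The following is a minimal Java sketch (the class and method names are
made up for illustration) that copies those values from the printout, decodes
them as UTF-8, and prints the resulting code points rather than the characters
themselves, so the console encoding cannot distort the result:

```java
import java.nio.charset.StandardCharsets;

public class DecodeEscapedPath {
    // Decimal byte values copied from the "?\ddd" escapes in the svn output
    // (the file name without its ".txt" suffix).
    public static final int[] ESCAPED = {
        230, 150, 176,  229, 162, 158,  230, 150, 135,
        229, 173, 151,  230, 150, 135,  228, 187, 182
    };

    /** Interprets the given byte values as a UTF-8 sequence. */
    public static String decode(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = decode(ESCAPED);
        System.out.println(name.length() + " characters:");
        for (int i = 0; i < name.length(); i++) {
            // Print code points instead of glyphs to stay independent of the console.
            System.out.printf("U+%04X%n", (int) name.charAt(i));
        }
    }
}
```

The 18 bytes form six three-byte UTF-8 sequences, i.e. six Chinese characters,
which suggests that the path stored in the repository is well-formed UTF-8.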


In conclusion, we are sure that the bug is in SVNKit, in the way it handles
locale settings and character sets. For us the most important thing would be
to have some way to tell the SVNKit classes explicitly which encoding to use.
And if an encoding is set, SVNKit should ignore the OS locale settings!

So when we call this code, we should be able to override the charset:

    SVNAdminClient adminClient = SVNClientManager.newInstance().getAdminClient();
    adminClient.doLoad(repositoryRoot, inputStream);

There should probably be a setCharset(Charset) or setEncoding(String) method
on the ISVNOptions class that would allow us to override the OS/default
settings.

Regards,
Zoltan



-----Original Message-----
From: Alexander Sinyushkin [mailto:Alexander.Sinyushkin@...]
Sent: Monday, April 23, 2007 6:00 PM
To: javasvn-users@...
Subject: Re: checkout problem with Chinese filenames on linux

Hello Zsolt.

After discussing the problem you have encountered, we think there are
the following possible reasons why you've got wrong filenames:

1) it may be the current locale of the OS that makes your filenames look
strange. For example, on my Windows machine I also do not see the Chinese
characters, as I have a different locale. But if I browse my file
system in Firefox I see the correct names (see the attached picture). So, if
this is a locale problem (you may try to look at your files in a
browser just like we did), then it is not an SVNKit problem, since
our users' locales are outside our responsibility.

2) it may be a file system problem: it's possible that a
particular file system stores filenames in an ASCII-only charset (using one
byte per symbol). In this case, if the first point is not the reason, we
would ask you to tell us which file system the problem is reproduced on,
and we will check this hypothesis.

3) if both previous cases turn out to be wrong, there remains the last and
most undesirable case: the problem resides somewhere in SVNKit. In that case
we advise you to locate the source of the problem:
after you get your dump file loaded

/jsvnadmin load testrepo <svnUTF8FileNames.dump

check your repository with a web browser if possible (or try to change
the locale to UTF-8 and browse it with the Subversion command line client).
If jsvnadmin load is not the source, you will see correct Chinese
filenames, and the problem is in the SVNKit update engine.

And one more question, maybe the most important one: how does the
Subversion command line client behave in the same environment where
SVNKit update spoils (?) filenames?

Thank you.

----
Alexander Sinyushkin,
TMate Software,
http://svnkit.com/ - Java [Sub]Versioning Library!

Zsolt Koppany wrote:

Alexander,

the problem happens only on linux and if the environment variables are not
pointing to UTF8. This is a problem for us because our application must NOT
rely on environment variables that we cannot control.

The filenames of working copies are wrong.

Zsolt

   
-----Original Message-----
From: Alexander Sinyushkin [mailto:Alexander.Sinyushkin@...]
Sent: Monday, April 23, 2007 7:45 AM
To: javasvn-users@...
Subject: Re: checkout problem with Chinese filenames on linux

Hello Zsolt,

What is the exact problem: does checkout not work at all, or does it work but
spoil the Chinese filenames? If it doesn't work at all, I suppose there
should be an exception when you run checkout. If so, please send us the stack
trace. Another question: is this problem specific to Linux?
Have you tried to do the checkout on Windows? It would also be very
useful to know which environment variables allow checkout to run
without problems. Thank you.

----
Alexander Sinyushkin,
TMate Software,
http://svnkit.com/ - Java [Sub]Versioning Library!

Zsolt Koppany wrote:

Hi,

We found checkout problems (on linux) if the repository contains Chinese
filenames. We also figured out that the checkout works if some environment
variables are set.

In the attachment you find the shell script and also the repository dump.

We tested with svn-1.1.2.

Zsolt

________________________________________

Zsolt Koppany
Phone: +49-711-67400-679
--
Intland Software, Wankelstrasse 3
D-70563 Stuttgart, Germany
Phone: +49-711-67400-677, e-mail:zsolt.koppany@...
Fax: +49-711-67400-686
Intland GmbH, Amtsgericht Stuttgart HRB 19479
Geschäftsführer Janos Koppany, Zsolt Koppany



Re: checkout problem with Chinese filenames on linux

by Alexander Kitaev-3

Hello Zsolt,

Thank you for the report! I would like to find and fix the cause of this
problem as much as you do, but I'm still not sure the bug is in SVNKit.

1. As you know, dump files are always stored in UTF-8, both by
SVNKit and by the native Subversion client. The repository also stores
everything in UTF-8, which makes it independent of the system locale: one can
transfer repositories between systems with different locales.

 > - Also verified with the normal svn/svnadmin tools; the svnadmin properly
 > imports the dump file as UTF-8, and ignores the LOCALE settings, although it
 > complains about. So the imported repository will be correct!

SVNKit also exports the repository (creates the dump file) in UTF-8 encoding.
This doesn't depend on the system locale: internally, String objects are
kept in Unicode (UTF-16) at JVM runtime, and at the moment we have to
save one to a file we get the UTF-8 byte sequence from the String object:

private void writeDumpData(String data) throws IOException {
   myDumpStream.write(data.getBytes("UTF-8"));
   ...
}

As far as I know, the JDK includes its own UTF-8 encoder and uses it without
calling OS encoding facilities, if any. That is probably why the native client
complains and SVNKit does not. If the JDK lacked a "UTF-8" encoder it would
throw UnsupportedEncodingException, but that cannot happen for UTF-8, as
it is always there.
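
To illustrate the point, here is a minimal sketch (the class and method names
are hypothetical, not SVNKit code) that encodes a file name once with an
explicitly named charset and once with the platform default. The constant is
the six-character name obtained by decoding, as UTF-8, the escaped bytes in
the svn output quoted in this thread, plus the ".txt" suffix:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitUtf8 {
    // Written with \\u escapes so the encoding of this source file does not matter.
    public static final String NAME = "\u65b0\u589e\u6587\u5b57\u6587\u4ef6.txt";

    /** Encoding with a named charset: the result does not depend on the locale. */
    public static byte[] encodeExplicit(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Encoding with the platform default: the result may vary with LC_ALL/LANG. */
    public static byte[] encodeDefault(String s) {
        return s.getBytes();
    }

    public static void main(String[] args) {
        System.out.println("default charset : " + Charset.defaultCharset());
        System.out.println("explicit UTF-8  : " + encodeExplicit(NAME).length + " bytes");
        System.out.println("platform default: " + encodeDefault(NAME).length + " bytes");
    }
}
```

Running this under different LC_ALL values should change only the last line, if
anything; the explicit UTF-8 result stays at 22 bytes (six three-byte
characters plus four ASCII bytes).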

I would suggest, if possible, comparing the dump files produced by svnadmin
and jsvnadmin: they should be exactly the same, both with paths stored
in UTF-8 encoding. If the dump files differ, there is a bug in SVNKit.
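
One possible way to do such a comparison without relying on any locale-aware
tool is a plain byte-by-byte check. The sketch below is only an illustration;
the class name and the file names passed on the command line are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CompareDumps {
    /** Returns the offset of the first differing byte, or -1 if the files are identical. */
    public static long firstDifference(Path a, Path b) throws IOException {
        byte[] x = Files.readAllBytes(a);
        byte[] y = Files.readAllBytes(b);
        int common = Math.min(x.length, y.length);
        for (int i = 0; i < common; i++) {
            if (x[i] != y[i]) {
                return i;
            }
        }
        // Same prefix: identical only if the lengths match as well.
        return x.length == y.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java CompareDumps <first.dump> <second.dump>");
            return;
        }
        long offset = firstDifference(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(offset < 0 ? "identical" : "first difference at byte " + offset);
    }
}
```

It reads both files fully into memory, which should be fine for a small test
dump like the one attached to this thread but not for large repositories.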

2. When it comes to importing (loading) data into the repository, the opposite
operation takes place: the byte sequence is read from the dump file and
decoded as a UTF-8 encoded String, then stored in the repository, again
encoded as UTF-8. Both operations (decoding and encoding) are done by
the JDK encoder, which doesn't depend on the system locale.

To verify this I'd suggest comparing the repositories created by the
'jsvnadmin load' and 'svnadmin load' commands. They should be exactly the same
(unless the svn client is one of the older versions). If the repositories
differ, there is a bug in SVNKit.

Also, after the data is loaded into the repository, correct file names should
be displayed when browsing this repository over the network: file names are
sent in UTF-8 encoding by the mod_dav_svn module, and I think no
intermediate encoding takes place, so file names are just sent as they are
stored in the repository.

3. The checkout operation is where the Subversion client creates files, i.e.
interacts with the file system.

In the case of SVNKit, the UTF-8 byte sequence received from the repository
(either over the network or read directly) is converted to a Java String
object, which stores chars (at runtime each char takes two bytes; this is
UTF-16 encoding). This String object is used to create a java.io.File object,
i.e. no encoding takes place in SVNKit: the java.io.File object is created from
a Unicode String, and there is actually no other way to create a File object.
File creation is then up to the JDK implementation (which most probably calls
one of the POSIX API functions (fopen?) provided by the operating system). Most
probably the JDK internally takes the system locale into account, but from your
description it looks like the native client just fails, while the JDK creates
the file with a 'spoiled' name.
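
The difference between "fails" and "creates a spoiled name" can be modelled
with the two error-handling modes of a charset conversion. The sketch below is
only an analogy, not what svn or the JDK file layer literally do: US-ASCII
stands in for the native encoding (an assumption), and the class and method
names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class NativeEncodingDemo {
    // Six Chinese characters plus ".txt", written with \\u escapes.
    public static final String NAME = "\u65b0\u589e\u6587\u5b57\u6587\u4ef6.txt";

    /** Strict conversion: throws if the target charset cannot represent a character. */
    public static byte[] encodeStrict(String s, Charset target) throws CharacterCodingException {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    /** Lenient conversion: String.getBytes substitutes the charset's replacement byte. */
    public static byte[] encodeLenient(String s, Charset target) {
        return s.getBytes(target);
    }

    public static void main(String[] args) {
        Charset nativeCharset = StandardCharsets.US_ASCII; // stand-in for the native encoding
        byte[] lenient = encodeLenient(NAME, nativeCharset);
        System.out.println("lenient: " + new String(lenient, nativeCharset));
        try {
            encodeStrict(NAME, nativeCharset);
            System.out.println("strict: converted without errors");
        } catch (CharacterCodingException e) {
            System.out.println("strict: conversion failed: " + e);
        }
    }
}
```

The strict mode corresponds to a client that refuses to convert the path, the
lenient mode to one that silently writes replacement characters into the name.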

So far, I think the problem occurs at a level lower than SVNKit can
control.

There are a number of system properties that define the JDK's default
encoding and locale (e.g. file.encoding), but I doubt that changing them would
resolve the problem: SVNKit always specifies the encoding explicitly when
converting bytes to String and vice versa. What could help is upgrading
to a more recent JDK version; also, please make sure that you're using
Sun's JDK and not GCJ (the GNU Java implementation).
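
To see what the JVM has actually picked up from the environment, a small
diagnostic like the following may help. It is a sketch with a hypothetical
class name; note that sun.jnu.encoding is an implementation-specific property
of Sun's JVM and may be null on other implementations:

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class ShowEncodingSettings {
    /** Collects the encoding-related settings visible to this JVM. */
    public static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
        // Implementation-specific; assumed to reflect the encoding used for file names.
        sb.append("sun.jnu.encoding=").append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("default charset=").append(Charset.defaultCharset()).append('\n');
        sb.append("default locale=").append(Locale.getDefault()).append('\n');
        sb.append("LC_ALL=").append(System.getenv("LC_ALL")).append('\n');
        sb.append("LANG=").append(System.getenv("LANG")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it once with LC_ALL=en_US and once with LC_ALL=en_US.UTF8 should show
whether the JVM's view of the encoding changes between the failing and the
working environment.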

Do my explanations make sense?

Alexander Kitaev,
TMate Software,
http://svnkit.com/ - Java [Sub]Versioning Library!
