-1: please revert this change
String.valueOf( '\u00a9' ) is the Copyright symbol as a Java character: no
encoding is involved at this point.
=> the previous code was perfectly correct and did what was expected
I'll try to explain what the new code does:
1. String.valueOf( '\u00a9' ): gets the Copyright symbol (as before)
2. .getBytes(): gets the binary representation of this symbol IN THE PLATFORM
ENCODING (see the API [1]); the result varies depending on your platform
3. new String( ..., "UTF-8" ): interprets that binary data as UTF-8
So the result varies depending on your platform encoding:
- if it is UTF-8 (like my Linux box), you end up with the initial Copyright
symbol: the code just turned the character into bytes, then back into a
character
- if it is another encoding, you get something that is not the Copyright
symbol, and might not even be valid
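
The platform dependence described above can be simulated without changing
machines. The sketch below is only an illustration, not the Doxia code: the
class and method names are hypothetical, the "platform" encoding is passed in
explicitly instead of being taken from the JVM default, and ISO-8859-1 is
assumed as a stand-in for a Western European Windows encoding, since it
encodes the Copyright symbol as the same single byte 0xA9.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Doxia: simulates the committed code
// for an explicitly chosen "platform" encoding.
class PlatformRoundTrip {

    // Encode the Copyright symbol with the given charset (standing in for
    // the platform default), then decode the resulting bytes as UTF-8.
    static String roundTrip(String platformEncoding) {
        Charset platform = Charset.forName(platformEncoding);
        Charset utf8 = Charset.forName("UTF-8");
        byte[] bytes = String.valueOf('\u00a9').getBytes(platform);
        return new String(bytes, utf8);
    }

    public static void main(String[] args) {
        String[] encodings = { "UTF-8", "ISO-8859-1", "US-ASCII" };
        for (String name : encodings) {
            boolean survives = roundTrip(name).equals("\u00a9");
            System.out.println(name + " -> symbol survives: " + survives);
        }
    }
}
```

Only the UTF-8 case gives the symbol back; with the other two encodings the
bytes produced in step 2 are not the UTF-8 encoding of the symbol, so step 3
cannot recover it.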
Let's do the transformation for the case where you're on Windows, in Western
Europe. The platform encoding is then CP-1252.
The Copyright symbol in CP-1252 is 0xA9=10101001 (see [2]).
In UTF-8, a byte starting with binary 10 is a continuation byte of a
multi-byte sequence (see [3], or [4] for a simpler explanation in the French
page, for people who read French ;) ): so we are facing an invalid byte
sequence, since there is no preceding lead byte.
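
The "starts with binary 10" test is plain bit masking, so it can be checked
directly. This is a minimal sketch with hypothetical names; it only looks at
the top two bits of a byte and does not validate whole UTF-8 sequences.

```java
// Hypothetical helper: classifies a single byte by its two highest bits.
class Utf8ByteKind {

    // True when the byte has the form 10xxxxxx, i.e. a UTF-8 continuation byte.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Eight-digit binary rendering of the low 8 bits of b.
    static String toBinary(int b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        while (bits.length() < 8) {
            bits = "0" + bits;
        }
        return bits;
    }

    public static void main(String[] args) {
        // 0xA9 is the single byte for the Copyright symbol in CP-1252.
        System.out.println(toBinary(0xA9) + " continuation: " + isContinuation(0xA9));
        // The UTF-8 encoding of U+00A9 is the two bytes 0xC2 0xA9;
        // 0xC2 is the lead byte that a lone 0xA9 is missing.
        System.out.println(toBinary(0xC2) + " continuation: " + isContinuation(0xC2));
    }
}
```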
Write a little program with:
String copyright = String.valueOf( '\u00a9' );
byte[] b = copyright.getBytes( "CP1252" );
String result = new String( b, "UTF-8" );
byte[] b2 = result.getBytes( "UTF-8" );
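and run it step by step in a debugger, looking at the variables: you'll
see that b contains 1 byte but b2 contains 3 bytes (the invalid byte is
typically decoded as the replacement character U+FFFD, which takes 3 bytes
in UTF-8). And copyright is displayed perfectly while result is not.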
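
For those who prefer a check over a debugger session, here is a rough sketch
(hypothetical class and method names) that feeds the raw bytes to a strict
UTF-8 decoder. It assumes the lone byte 0xA9 from the CP-1252 case above and
uses CodingErrorAction.REPORT so that invalid input raises an exception
instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: strict and lenient views of the same bytes.
class StrictUtf8Check {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Strict decoding: returns false as soon as the input is not valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            UTF8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decoding (invalid input is replaced), then re-encoding as UTF-8.
    static int utf8LengthAfterLenientDecode(byte[] bytes) {
        return new String(bytes, UTF8).getBytes(UTF8).length;
    }

    public static void main(String[] args) {
        byte[] cp1252Copyright = { (byte) 0xA9 };            // one byte, as in b
        byte[] utf8Copyright = { (byte) 0xC2, (byte) 0xA9 }; // real UTF-8 form

        System.out.println("lone 0xA9 valid UTF-8: " + isValidUtf8(cp1252Copyright));
        System.out.println("0xC2 0xA9 valid UTF-8: " + isValidUtf8(utf8Copyright));
        System.out.println("length after lenient round trip: "
            + utf8LengthAfterLenientDecode(cp1252Copyright));
    }
}
```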
I hope these explanations will help to understand.
Regards,
Hervé
[1] http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()
[2] http://fr.wikipedia.org/wiki/Windows-1252
[3] http://en.wikipedia.org/wiki/UTF-8
[4] http://fr.wikipedia.org/wiki/UTF-8

On Friday 07 November 2008, vsiveton@... wrote:
> Author: vsiveton
> Date: Fri Nov 7 07:01:52 2008
> New Revision: 712146
>
> URL: http://svn.apache.org/viewvc?rev=712146&view=rev
> Log:
> o be sure that UTF-8 will be used
>
> Modified:
>     maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
>
> Modified: maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
> URL: http://svn.apache.org/viewvc/maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java?rev=712146&r1=712145&r2=712146&view=diff
> ==============================================================================
> --- maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java (original)
> +++ maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java Fri Nov 7 07:01:52 2008
> @@ -19,6 +19,7 @@
> * under the License.
> */
>
> +import java.io.UnsupportedEncodingException;
>
> /**
> * Static methods to generate standard Doxia sink events.
> @@ -595,7 +596,15 @@
> sink.paragraph_();
>
> sink.paragraph();
> - String copyright = String.valueOf( '\u00a9' );
> + String copyright;
> + try
> + {
> + copyright = new String( String.valueOf( '\u00a9' ).getBytes(), "UTF-8" );
> + }
> + catch ( UnsupportedEncodingException e )
> + {
> + copyright = "";
> + }
> sink.text( "Copyright symbol: " + copyright + ", "
> + copyright + ", " + copyright + "." );
> sink.paragraph_();