Re: svn commit: r712146 - /maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java


by Hervé BOUTEMY

-1: please revert this change

String.valueOf( '\u00a9' ) is the copyright symbol as a Java character: no
encoding is involved at this point.
=> the code was perfectly correct and did what was expected

I'll try to explain what the new code does:
1. String.valueOf( '\u00a9' ): gets the copyright symbol (as before)
2. .getBytes(): gets the binary representation of this symbol IN THE PLATFORM
ENCODING (see [1] API); the result varies depending on your platform
3. new String( ..., "UTF-8" ): interprets that binary data as UTF-8

So the result varies depending on your platform encoding:
- if it is UTF-8 (as on my Linux box), you end up with the initial copyright
symbol: the code just turned the character into bytes, then back into a character
- if it is another encoding, you get something that is not the copyright symbol,
and might not even be anything valid
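
The two cases above can be reproduced on any machine by passing the
"platform" encoding explicitly instead of relying on the default. This is only
a sketch: the class name RoundTripDemo and the helper roundTrip are made up
for illustration and are not part of Doxia.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Mimics the committed code for a given "platform" charset (hypothetical helper):
    // encode the text with that charset, then decode the bytes as UTF-8.
    static String roundTrip(String text, Charset platform) {
        byte[] bytes = text.getBytes(platform);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String copyright = String.valueOf('\u00a9');
        Charset[] platforms = {
            StandardCharsets.UTF_8,          // e.g. a Linux box configured for UTF-8
            StandardCharsets.ISO_8859_1,     // Latin-1 also encodes the symbol as 0xA9
            Charset.forName("windows-1252")  // Windows, Western Europe
        };
        for (Charset platform : platforms) {
            String result = roundTrip(copyright, platform);
            System.out.println(platform.name() + ": symbol preserved = "
                + result.equals(copyright));
        }
    }
}
```

Only the UTF-8 "platform" gives back the original string; with the two
single-byte encodings the decoded result is no longer the copyright symbol.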

Let's do the transformation for the case where you're on Windows in Western Europe.
The platform encoding is then CP-1252.
The copyright symbol in CP-1252 is 0xA9=10101001 (see [2]).
In UTF-8, a byte starting with binary 10 is a continuation byte of a
multi-byte sequence (see [3], or [4] for a simpler explanation in the French page,
for people who read French ;) ): so we're facing an invalid byte sequence, since
there was no preceding lead byte.
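
The bit-pattern argument can be checked with a few lines of Java. Again a
sketch only: Utf8LeadBits and isContinuationByte are illustrative names, not
an existing API.

```java
import java.nio.charset.StandardCharsets;

class Utf8LeadBits {
    // True when the byte has the form 10xxxxxx, i.e. a UTF-8 continuation byte.
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        // The copyright symbol as a single CP-1252 byte.
        byte cp1252Copyright = (byte) 0xA9;
        System.out.println(Integer.toBinaryString(cp1252Copyright & 0xFF)
            + " continuation byte? " + isContinuationByte(cp1252Copyright));

        // The same symbol properly encoded in UTF-8 takes two bytes:
        // a lead byte followed by one continuation byte.
        byte[] utf8 = String.valueOf('\u00a9').getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.println(Integer.toBinaryString(b & 0xFF)
                + " continuation byte? " + isContinuationByte(b));
        }
    }
}
```

In valid UTF-8 the byte 0xA9 only ever appears after a lead byte (0xC2 for the
copyright symbol), which is why it is invalid on its own.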
Write a little program with:
String copyright = String.valueOf( '\u00a9' );
byte[] b = copyright.getBytes( "CP1252" );
String result = new String( b, "UTF-8" );
byte[] b2 = result.getBytes( "UTF-8" );

and run it step by step in a debugger, looking at the variables, and you'll
see that b contains 1 byte but b2 contains 3 bytes. And copyright is
displayed perfectly while result is not.
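
For those who prefer printed output to a debugger, here is a self-contained
variant of the same experiment. It is a sketch: the class name and the helpers
decodeAsUtf8 and hex are invented for illustration, and it assumes the JVM
provides the windows-1252 charset (CP1252). When new String( ..., "UTF-8" )
meets the invalid byte it substitutes the replacement character U+FFFD, and
that character is what takes 3 bytes once re-encoded in UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252ToUtf8Experiment {
    // Decodes bytes as UTF-8; malformed input becomes U+FFFD instead of failing.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = String.valueOf('\u00a9');
        byte[] b = copyright.getBytes(Charset.forName("windows-1252"));
        String result = decodeAsUtf8(b);
        byte[] b2 = result.getBytes(StandardCharsets.UTF_8);

        System.out.println("b      = " + hex(b) + " (" + b.length + " byte)");
        System.out.println("result = U+" + String.format("%04X", (int) result.charAt(0)));
        System.out.println("b2     = " + hex(b2) + " (" + b2.length + " bytes)");
    }
}
```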

I hope this explanation helps.

Regards,

Hervé


[1] http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()

[2] http://fr.wikipedia.org/wiki/Windows-1252

[3] http://en.wikipedia.org/wiki/UTF-8

[4] http://fr.wikipedia.org/wiki/UTF-8

On Friday 07 November 2008, vsiveton@... wrote:

> Author: vsiveton
> Date: Fri Nov  7 07:01:52 2008
> New Revision: 712146
>
> URL: http://svn.apache.org/viewvc?rev=712146&view=rev
> Log:
> o be sure that UTF-8 will be used
>
> Modified:
>     maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
>
> Modified: maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java
> URL: http://svn.apache.org/viewvc/maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java?rev=712146&r1=712145&r2=712146&view=diff
> ==============================================================================
> --- maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java (original)
> +++ maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java Fri Nov  7 07:01:52 2008
> @@ -19,6 +19,7 @@
>   * under the License.
>   */
>
> +import java.io.UnsupportedEncodingException;
>
>  /**
>   * Static methods to generate standard Doxia sink events.
> @@ -595,7 +596,15 @@
>          sink.paragraph_();
>
>          sink.paragraph();
> -        String copyright = String.valueOf( '\u00a9' );
> +        String copyright;
> +        try
> +        {
> +            copyright = new String( String.valueOf( '\u00a9' ).getBytes(), "UTF-8" );
> +        }
> +        catch ( UnsupportedEncodingException e )
> +        {
> +            copyright = "";
> +        }
>          sink.text( "Copyright symbol: " + copyright + ", "
>              + copyright + ", " + copyright + "." );
>          sink.paragraph_();



Re: svn commit: r712146 - /maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java

by Vincent Siveton

Hi,

2008/11/7 Hervé BOUTEMY <herve.boutemy@...>:
> -1: please revert this change
>
> String.valueOf( '\u00a9' ) represents the Copyright symbol: there is no
> encoding notion available here.
> => the code was perfectly correct and did what was expected

Not sure, since I got an exception with the original code (BTW, that is weird
since everything should be in UTF-8...).
You are our encoding expert, so feel free to fix it.

-------------------------------------------------------------------------------
Test set: org.apache.maven.doxia.module.xhtml.XhtmlIdentityTest
-------------------------------------------------------------------------------
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.657 sec <<< FAILURE!
testIdentity(org.apache.maven.doxia.module.xhtml.XhtmlIdentityTest)  Time elapsed: 1.625 sec  <<< ERROR!
org.apache.maven.doxia.parser.ParseException: Error validating the model: Fatal error:
  Public ID: null
  System ID: null
  Line number: 2
  Column number: 1701
  Message: Invalid byte 1 of 1-byte UTF-8 sequence.
        at org.apache.maven.doxia.parser.AbstractXmlParser.validate(AbstractXmlParser.java:554)
        at org.apache.maven.doxia.parser.AbstractXmlParser.parse(AbstractXmlParser.java:105)
        at org.apache.maven.doxia.module.AbstractIdentityTest.testIdentity(AbstractIdentityTest.java:112)
 and so on
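
For readers unfamiliar with this kind of message: a strict UTF-8 decoder
rejects a byte such as 0xA9 when it appears without a lead byte, instead of
silently replacing it. The thread does not say what produced the invalid byte
in the test output, so the following is only a sketch of the general
mechanism, using the standard java.nio.charset API; the class and method names
are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Strict mode: malformed input is reported, not replaced.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] loneByte = { (byte) 0xA9 };              // copyright symbol in CP-1252
        byte[] properUtf8 = { (byte) 0xC2, (byte) 0xA9 }; // copyright symbol in UTF-8
        System.out.println("lone 0xA9 valid UTF-8?  " + isValidUtf8(loneByte));
        System.out.println("0xC2 0xA9 valid UTF-8?  " + isValidUtf8(properUtf8));
    }
}
```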

Cheers

Vincent

Re: svn commit: r712146 - /maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java

by Vincent Siveton

Hi again,

I reverted this in 712235 and fixed the problem of invalid bytes.

Hope that it is correct now.

Vincent

Re: svn commit: r712146 - /maven/doxia/doxia/trunk/doxia-core/src/test/java/org/apache/maven/doxia/sink/SinkTestDocument.java

by Hervé BOUTEMY

+1
yes, that's it

well done :)

Hervé
