Charset trouble, questionmarks

by Magnus Olstad Hansen-2

Hello,

I'm using HttpClient 4.0 to download a webpage the same way as shown in
one of the examples. This is my method to return a webpage as a string:

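        import java.io.IOException;
        import org.apache.http.client.HttpClient;
        import org.apache.http.client.ResponseHandler;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.BasicResponseHandler;
        import org.apache.http.impl.client.DefaultHttpClient;

        protected static String leechUrl(String url) throws IOException {
                HttpClient httpclient = new DefaultHttpClient();
                try {
                        HttpGet httpget = new HttpGet(url);
                        System.out.println("executing request " + httpget.getURI());

                        // The response handler returns the response body as a String
                        ResponseHandler<String> responseHandler = new BasicResponseHandler();
                        return httpclient.execute(httpget, responseHandler);
                } finally {
                        // When the HttpClient instance is no longer needed,
                        // shut down the connection manager to ensure
                        // immediate deallocation of all system resources,
                        // even if the request throws
                        httpclient.getConnectionManager().shutdown();
                }
        }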

However, the responseBody returned here contains ? (question marks) in place of
all Norwegian characters (æøåÆØÅ) on the page. For example, if I
dump "http://www.vg.no" I find the following at line 107:

        <li><a
href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue
*p?* veggen (blogg)</a></li>

...that question mark should have been the character å. To be
sure, I compared with the same page and line fetched with wget:

        <li><a
href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue
på veggen (blogg)</a></li>

My question is simply: what do I need to do to keep the Norwegian letters
intact? So far I've tried:
- Copying BasicResponseHandler and debugging it to check that
EntityUtils.getContentCharset() finds a reasonable charset; it does.
- Hacking EntityUtils.toString() to override both the detected and the default
charset with "ISO-8859-1" and "UTF-8".
- Adding a header to the request with content-type and charset (which
isn't really logical to add to a request, but I tried anyway).

All I've accomplished with this is to sometimes get two ?'s instead of
one for each Norwegian letter. I also tried to dump the response as
directly as I could, by using EntityUtils.toByteArray() and
writing straight to a file. To my surprise the ?'s are
still there, and via hexdump I can see that they are all 0x3F
(question mark), so it's in fact impossible to recover the Norwegian
letters. They must have been replaced with a question mark somewhere.

Please advise, and a thousand thanks for reading my problem!

Regards,
Magnus

Re: Charset trouble, questionmarks

by olegk

On Wed, Sep 02, 2009 at 10:22:16AM +0200, Magnus Olstad Hansen wrote:

> Hello,
>
> I'm using HttpClient 4.0 to download a webpage the same way as shown in  
> one of the examples. This is my method to return a webpage as a string:
>
>        protected static String leechUrl(String url) throws IOException {
>                HttpClient httpclient = new DefaultHttpClient();
>                HttpGet httpget = new HttpGet(url);
>
>                System.out.println("executing request " + httpget.getURI());
>
>                // Create a response handler
>                ResponseHandler<String> responseHandler = new  
> BasicResponseHandler();
>                String responseBody = httpclient.execute(httpget,  
> responseHandler);
>
>                // When HttpClient instance is no longer needed,
>                // shut down the connection manager to ensure
>                // immediate deallocation of all system resources
>                httpclient.getConnectionManager().shutdown();
>                return responseBody;
>        }
>
> However; the responseBody returned here contains ? (questionmarks) for  
> all norwegian characters (æøåÆØÅ) on the page. For example if I try to
> dump "http://www.vg.no" I can find the following at line 107:
>
>        <li><a  
> href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue  
> *p?* veggen (blogg)</a></li>
>
> ...that questionmark there should've been the character å.  For
> certainty I've compared to the same page and line dumped with wget:
>
>        <li><a  
> href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue  
> på veggen (blogg)</a></li>
>
> My question is simply what I need to do to keep the norwegian letters  
> intact? So far I've tried:
> - Copying BasicResponseHandler and debug that  
> EntityUtils.getContentCharset() finds a reasonable charset, it does.
> - Hacking EntityUtils.toString() to override both detected and default  
> charset with "ISO-8859-1" and "UTF-8".
> - Adding header to the request with content-type and charset (which  
> isn't really logical to add to a request, but I tried anyway)
>
> All I've accomplished with this is to sometimes get two ?'s instead of  
> one for the norwegian letters. I also tried to dump the response as  
> directly as I saw possible by using EntityUtils.toByteArray() and  
> writing directly to a file. To my surprise I can see that the ?'s are  
> still there and via hexdump I can see that they are all 3F  
> (questionmark) - so it's infact impossible to recover the norwegian  
> letters. They must have been replaced with a questionmark somewhere.
>
> Please advice, and a thousand thanks for reading my problem!
>
> Regards,
> Magnus

Hi Magnus,

Please use this guide to generate a wire / context log of the HTTP session and
post it to this list.

http://hc.apache.org/httpcomponents-client/logging.html

Oleg




---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@...
For additional commands, e-mail: httpclient-users-help@...


Re: Charset trouble, questionmarks

by Ken Krugler

Hi Magnus,

On Sep 2, 2009, at 1:22am, Magnus Olstad Hansen wrote:

> Hello,
>
> I'm using HttpClient 4.0 to download a webpage the same way as shown  
> in one of the examples. This is my method to return a webpage as a  
> string:
>
>       protected static String leechUrl(String url) throws  
> IOException {
>               HttpClient httpclient = new DefaultHttpClient();
>               HttpGet httpget = new HttpGet(url);
>
>               System.out.println("executing request " +  
> httpget.getURI());
>
>               // Create a response handler
>               ResponseHandler<String> responseHandler = new  
> BasicResponseHandler();
>               String responseBody = httpclient.execute(httpget,  
> responseHandler);
>
>               // When HttpClient instance is no longer needed,
>               // shut down the connection manager to ensure
>               // immediate deallocation of all system resources
>               httpclient.getConnectionManager().shutdown();
>               return responseBody;
>       }
>
> However; the responseBody returned here contains ? (questionmarks)  
> for all norwegian characters (æøåÆØÅ) on the page. For example if I  
> try to dump "http://www.vg.no" I can find the following at line 107:
>
>       <li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/ 
> ">Frue *p?* veggen (blogg)</a></li>
>
> ...that questionmark there should've been the character å.  For  
> certainty I've compared to the same page and line dumped with wget:
>
>       <li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/ 
> ">Frue på veggen (blogg)</a></li>
>
> My question is simply what I need to do to keep the norwegian  
> letters intact? So far I've tried:
> - Copying BasicResponseHandler and debug that  
> EntityUtils.getContentCharset() finds a reasonable charset, it does.
> - Hacking EntityUtils.toString() to override both detected and  
> default charset with "ISO-8859-1" and "UTF-8".
> - Adding header to the request with content-type and charset (which  
> isn't really logical to add to a request, but I tried anyway)
>
> All I've accomplished with this is to sometimes get two ?'s instead  
> of one for the norwegian letters. I also tried to dump the response  
> as directly as I saw possible by using EntityUtils.toByteArray() and  
> writing directly to a file. To my surprise I can see that the ?'s  
> are still there and via hexdump I can see that they are all 3F  
> (questionmark) - so it's infact impossible to recover the norwegian  
> letters. They must have been replaced with a questionmark somewhere.
>
> Please advice, and a thousand thanks for reading my problem!

The basic problem is that determining the character set of a web page  
is complex, and not something that HttpClient is designed to handle.

If you check out (for example) the Nutch source, you'll see that it
has a multi-step process, where it uses the Content-Type in the
response header, the meta http-equiv tag in the HTML, and low-level
charset sniffing code to try to guess at the right encoding, but even
then it sometimes gets it wrong.
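
A minimal sketch of the first two steps of such a process, making no claim about how Nutch actually implements it: look for a charset in the Content-Type header value, then in a meta tag near the top of the body, and otherwise fall back to a default supplied by the caller. The names CharsetGuess, fromDeclaration and guess, and the 2048-byte prefix, are assumptions made for illustration; byte-level sniffing (the third step) is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HttpClient or Nutch.
final class CharsetGuess {
    // Finds "charset=NAME" in a Content-Type header value or inside a meta tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or null if none is declared or the JVM does not support it.
    static Charset fromDeclaration(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    // Step 1: response header. Step 2: meta tags in the body prefix. Step 3: fallback.
    static Charset guess(String contentTypeHeader, byte[] body, Charset fallback) {
        Charset cs = fromDeclaration(contentTypeHeader);
        if (cs != null) {
            return cs;
        }
        // ISO-8859-1 maps every byte to a char, so it is safe for finding ASCII markup.
        int n = Math.min(body.length, 2048);
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(head);
        while (meta.find()) {
            cs = fromDeclaration(meta.group());
            if (cs != null) {
                return cs;
            }
        }
        return fallback;
    }
}
```

For a page like the one discussed here, where the Content-Type header already names UTF-8, the first step decides and the body is never inspected.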

In your case, the page looks pretty clean - all UTF-8.

But when you call httpclient.execute(httpget, responseHandler), the  
BasicResponseHandler will call EntityUtils.toString, and that in turn  
uses ISO-8859-1 as its default charset when converting the bytes it  
receives into the string.

So go one level deeper, and get the HttpEntity from the response, then  
try EntityUtils.toString(entity, "UTF-8") and see what you get back.
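
To see what a wrong default would do to the data, here is a small standard-library sketch (no HttpClient involved; DecodeDemo is a made-up name) that decodes the three bytes for "p" plus a-ring both ways. It suggests that decoding UTF-8 bytes as ISO-8859-1 yields two wrong characters per letter, not a question mark.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: DecodeDemo is a hypothetical name, not an HttpClient class.
final class DecodeDemo {
    // 'p' (0x70) followed by U+00E5 (a with ring above) encoded as UTF-8: 0xC3 0xA5.
    static final byte[] WIRE = { 0x70, (byte) 0xC3, (byte) 0xA5 };

    // Correct decoding: the two bytes 0xC3 0xA5 become the single char U+00E5.
    static String asUtf8() {
        return new String(WIRE, StandardCharsets.UTF_8);
    }

    // Wrong decoding: every byte becomes one char, giving U+00C3 followed by U+00A5.
    static String asLatin1() {
        return new String(WIRE, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 decode length:      " + asUtf8().length());
        System.out.println("ISO-8859-1 decode length: " + asLatin1().length());
    }
}
```

Neither result contains a literal '?' character, so a 0x3F in the output has to be introduced by a later encoding step, for example when the string is written to a file or a terminal.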

-- Ken


Re: Charset trouble, questionmarks

by Florent Blondeau

Hi Magnus,

I don't know exactly where you could plug this in, but this project helped me a
lot with parsing non-ISO charsets:
http://jchardet.sourceforge.net/

Hope that helps

Florent

Pingwy
27, rue des arènes
49100 Angers






Re: Charset trouble, questionmarks

by MaGGE

Thanks for the reply, Ken.
>
> The basic problem is that determining the character set of a web page
> is complex, and not something that HttpClient is designed to handle.
>
> If you check out (for example) the Nutch source, you'll see that it
> has a multi-step process, where it uses the Content-type in the
> response header, the meta http-equiv tag in the HTML, and low-level
> charset sniiffing code to try to guess at the right encoding, but even
> then sometimes it gets it wrong.
Yep - I know it's a complex matter - headers are not always specified,
metas the same... :) However if the page is actually in one specific
charset, and one can force this charset to be used by HttpClient, it
should all work in my opinion.
>
> In your case, the page looks pretty clean - all UTF-8.
>
> But when you call httpclient.execute(httpget, responseHandler), the
> BasicResponseHandler will call EntityUtils.toString, and that in turn
> uses ISO-8859-1 as its default charset when converting the bytes it
> receives into the string.
Well, I've read the source on that and it's not entirely what I saw. The
charset variable used within EntityUtils.toString() takes one of three
values, in order of preference:
1) The result of EntityUtils.getContentCharset().
2) The default charset given as an argument to EntityUtils.toString().
3) The HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct name
of the constant, but you get the point).

I tested EntityUtils.getContentCharset() separately, and it returns UTF-8
(as found in the Content-Type of the given page).
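
The precedence described above can be sketched with the standard library alone. This is not the EntityUtils source; CharsetPrecedence, resolve and decode are hypothetical names, and ISO-8859-1 is used as the last-resort default only because that is the value mentioned earlier in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the three-level precedence; not library code.
final class CharsetPrecedence {
    // Picks the charset declared by the response first, then the caller's default,
    // then a global default (assumed here to be ISO-8859-1).
    static Charset resolve(String declared, String callerDefault) {
        if (declared != null) {
            return Charset.forName(declared);
        }
        if (callerDefault != null) {
            return Charset.forName(callerDefault);
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Decodes the body bytes with whichever charset wins.
    static String decode(byte[] body, String declared, String callerDefault) {
        return new String(body, resolve(declared, callerDefault));
    }
}
```

With a declared charset of UTF-8, the default passed by the caller never comes into play, which matches the observation that overriding the default made no difference.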
>
> So go one level deeper, and get the HttpEntity from the response, then
> try EntityUtils.toString(entity, "UTF-8") and see what you get back.
So the conclusion must be that I have already tried this, without luck ... ?

Regards,
Magnus




Re: Charset trouble, questionmarks

by NBW

What about passing -Dfile.encoding=utf-8?

On Wed, Sep 2, 2009 at 10:58 AM, Magnus Olstad Hansen <magnus@...> wrote:

>
> > But when you call httpclient.execute(httpget, responseHandler), the
> > BasicResponseHandler will call EntityUtils.toString, and that in turn
> > uses ISO-8859-1 as its default charset when converting the bytes it
> > receives into the string.
>
>

Re: Charset trouble, questionmarks

by olegk

On Wed, Sep 02, 2009 at 11:39:35AM -0400, NBW wrote:
> What about passing -Dfile.encoding=utf-8?
>

HttpClient does not use system properties (by design).

Oleg


> On Wed, Sep 2, 2009 at 10:58 AM, Magnus Olstad Hansen <magnus@...>wrote:
>
> >
> > > But when you call httpclient.execute(httpget, responseHandler), the
> > > BasicResponseHandler will call EntityUtils.toString, and that in turn
> > > uses ISO-8859-1 as its default charset when converting the bytes it
> > > receives into the string.
> >
> >



Re: Charset trouble, questionmarks

by NBW

So no use of things like InputStreamReader then, I take it?

On Wed, Sep 2, 2009 at 11:41 AM, Oleg Kalnichevski <olegk@...> wrote:

> On Wed, Sep 02, 2009 at 11:39:35AM -0400, NBW wrote:
> > What about passing -Dfile.encoding=utf-8?
> >
>
> HttpClient does not use system properties (per design)
>
> Oleg
>
>
>

Re: Charset trouble, questionmarks

by olegk

On Wed, Sep 02, 2009 at 11:54:42AM -0400, NBW wrote:
> No use of things like InputStreamReaders then I take it.
>

InputStreamReader is used by one utility method in HttpCore
(EntityUtils#toString()). However, it uses an InputStreamReader constructor
that explicitly takes the name of the charset to be used for byte-to-char
conversion as a parameter.
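
As a rough illustration of that constructor (a standalone sketch, not the HttpCore code; ExplicitReader and readAll are invented names), reading a stream with an explicitly named charset looks like this:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper; it only demonstrates the two-argument InputStreamReader constructor.
final class ExplicitReader {
    // Reads the whole stream as text using the named charset.
    static String readAll(InputStream in, String charsetName) {
        // The charset is named explicitly, so the platform default
        // (and therefore -Dfile.encoding) plays no part in the conversion.
        try (Reader reader = new InputStreamReader(in, charsetName)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            // Includes UnsupportedEncodingException for an unknown charset name.
            throw new UncheckedIOException(e);
        }
    }
}
```

By contrast, the single-argument constructor new InputStreamReader(in) falls back to the platform default charset, which is the case a system property could influence.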

Oleg


> On Wed, Sep 2, 2009 at 11:41 AM, Oleg Kalnichevski <olegk@...> wrote:
>
> > On Wed, Sep 02, 2009 at 11:39:35AM -0400, NBW wrote:
> > > What about passing -Dfile.encoding=utf-8?
> > >
> >
> > HttpClient does not use system properties (per design)
> >
> > Oleg
> >
> >
> >



Re: Charset trouble, questionmarks

by Ken Krugler

Hi Magnus,

I used curl to grab the file, and the bytes at 0x1845...0x1847 are
0x70 0xC3 0xA5: the letter p followed by valid UTF-8 for the U+00E5
code point (Latin small letter a with ring above).
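
The arithmetic behind that byte pair can be checked by hand. A code point between U+0080 and U+07FF is split into its top 5 bits and low 6 bits, which are placed into the patterns 110xxxxx and 10xxxxxx. The sketch below (Utf8TwoByte is a made-up name) does exactly that for the two-byte case only.

```java
// Hypothetical illustration of two-byte UTF-8 encoding; real code should use String.getBytes.
final class Utf8TwoByte {
    // Encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes.
    static byte[] encode(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a two-byte code point: " + codePoint);
        }
        int lead = 0xC0 | (codePoint >> 6);     // 110xxxxx carries the top 5 bits
        int trail = 0x80 | (codePoint & 0x3F);  // 10xxxxxx carries the low 6 bits
        return new byte[] { (byte) lead, (byte) trail };
    }

    public static void main(String[] args) {
        byte[] b = encode(0xE5);
        // 0xE5 is binary 11100101: top bits 00011 give 0xC3, low bits 100101 give 0xA5.
        System.out.println(String.format("%02x %02x", b[0] & 0xFF, b[1] & 0xFF));
    }
}
```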

I also used Bixo (http://bixo.101tec.com) to crawl the same page, and  
wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a  
good test.

Given what you've tried (in your initial email), I've only got one  
weak guess - that your tools are showing you stuff that isn't actually  
there.

If you use HttpClient to dump the string and the byte array to files,
and look at the files with a real, honest-to-gosh hex editor, do you
still see 0x3F right after the p at offset 0x1845, or 0xC3 0xA5?
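
If no hex editor is at hand, the same check can be made from inside the program, before any file or console is involved. This is a small sketch with invented names (HexPeek, hex), assuming the bytes come from something like EntityUtils.toByteArray():

```java
// Hypothetical debugging helper; prints a slice of a byte array as lowercase hex.
final class HexPeek {
    // Formats up to 'length' bytes starting at 'offset' as space-separated hex pairs.
    static String hex(byte[] data, int offset, int length) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, offset + length);
        for (int i = offset; i < end; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that bytes above 0x7F are not printed as negative values.
            sb.append(String.format("%02x", data[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

A call such as HexPeek.hex(bytes, 0x1845, 3) on the raw response would then show either 70 c3 a5 or 70 3f, and the output is pure ASCII, so no terminal setting can distort it.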

-- Ken


On Sep 2, 2009, at 7:58am, Magnus Olstad Hansen wrote:

> Thanks for the reply, Ken.
>>
>> The basic problem is that determining the character set of a web page
>> is complex, and not something that HttpClient is designed to handle.
>>
>> If you check out (for example) the Nutch source, you'll see that it
>> has a multi-step process, where it uses the Content-type in the
>> response header, the meta http-equiv tag in the HTML, and low-level
>> charset sniiffing code to try to guess at the right encoding, but  
>> even
>> then sometimes it gets it wrong.
> Yep - I know it's a complex matter - headers are not always specified,
> metas the same... :) However if the page is actually in one specific
> charset, and one can force this charset to be used by HttpClient, it
> should all work in my opinion.
>>
>> In your case, the page looks pretty clean - all UTF-8.
>>
>> But when you call httpclient.execute(httpget, responseHandler), the
>> BasicResponseHandler will call EntityUtils.toString, and that in turn
>> uses ISO-8859-1 as its default charset when converting the bytes it
>> receives into the string.
> Well, I've read the source on that and it's not entirely what I saw. 1
> of 3 possibilites exists for charset-variable used within
> EntityUtils.toString(), in preferred order :
> 1) Result of EntityUtils.getContentCharset()
> 2) Default charset given as argument to EntityUtils.toString().
> 3) HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct  
> name
> of the constant, but you get the point).
>
> I tested EntityUtils.getContentCharset() separatly, and it returns  
> UTF-8
> (as found in the Content-Type of the given page).
>>
>> So go one level deeper, and get the HttpEntity from the response,  
>> then
>> try EntityUtils.toString(entity, "UTF-8") and see what you get back.
> Conclusion must be I have tried this without luck ... ?
>
> Regards,
> Magnus
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@...
> For additional commands, e-mail: httpclient-users-help@...
>

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378




Re: Charset trouble, questionmarks

by MaGGE

Hello again Ken,

Sorry to lag behind on the replies - work is busy these days... :)

Seems you're right. I'd made a custom ResponseHandler class to be able to dump the raw output from HttpClient. However, I'd used FileWriter/BufferedWriter to write my file. That must have applied a charset conversion of its own, causing the bothersome 0x3F's mentioned before.

Your tip about another HttpClient app returning the content successfully made me look at my method again, and I now write the output via FileOutputStream.write(byte[]) instead. Using hexdump as before, I can confirm that there's no longer a 0x3F but 0xC3 0xA5, as it should be.

(...from wget)
# hexdump -s 0x1845 -C index.html | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

(...from my dump)
# hexdump -s 0x1845 -C raw.txt | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
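
The difference between the two ways of writing can be reproduced in memory. The sketch below is only an approximation of what happened: it uses an OutputStreamWriter with an explicit charset to stand in for a FileWriter running under a platform default that cannot represent the letter (US-ASCII is assumed here), and WriterVsStream and its methods are invented names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical comparison of char-based and byte-based output.
final class WriterVsStream {
    // Char-based path: the Writer encodes the text, substituting '?' (0x3F)
    // for any character the target charset cannot represent.
    static byte[] viaWriter(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Byte-based path: the bytes are copied as they are, with no charset involved.
    static byte[] viaStream(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(raw, 0, raw.length);
        return out.toByteArray();
    }
}
```

Writing "p" plus a-ring through viaWriter with US-ASCII gives the bytes 70 3f, while viaStream returns the original 70 c3 a5 untouched. That would be consistent with the hexdumps above, though it does not prove which default charset was actually in effect.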

What remains a mystery to me, however, is why the string returned from HttpClient.execute(), and thus from EntityUtils.toString(HttpEntity), does not represent the letters correctly. I also tested this with the BasicResponseHandler to be sure it was nothing I'd done.

At least now I can use my custom ResponseHandler and figure out how to treat the intact byte array correctly. So thanks a lot! :)



Re: Charset trouble, questionmarks

by olegk

MaGGE wrote:

> Hello again Ken,
>
> Sorry to lag behind on the replies - work is busy these days... :)
>
> Seems you're right. I've made a custom ResponseHandler class to be able to
> dump the raw output from HttpClient. However, I'd used
> FileWriter/BufferedWriter to dump to my file. This must've tried to
> interpret charset also, causing the bothersome 0x3F's mentioned before.
>
> Your tip about another HttpClient app returning the content successfully
> caused me to look at my method again - and I made the output via
> FileOutputStream,write(byte[]) instead. Using hexdump as before I can now
> confirm that there's no longer a 0x3F but 0xC3 0xA5 as it should be.
>
> (...from wget)
> # hexdump -s 0x1845 -C index.html | head -n 2
> 00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
> 00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
>
> (...from my dump)
> # hexdump -s 0x1845 -C raw.txt | head -n 2
> 00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
> 00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|
>
> What remains a mystery to me is, however, why the string returned from
> HttpClient.execute() and thus EntityUtils.toString(Entity) does not
> represent the letters correctly. I also tested this with the
> BasicResponseHandler to be sure it was nothing I'd done.
>
> Atleast now I can use my custom ResponseHandler and figure out how to treat
> the intact byte-array correctly. So thanks a lot! :)
>
>

If you had only listened and produced a wire / context log when I asked
you to, all this could have been found out much earlier, and I would most
likely also have been able to tell why EntityUtils#toString failed to
detect the charset.

Oleg



>
> Ken Krugler wrote:
>> Hi Magnus,
>>
>> I used curl to grab the file, and the bytes at 0x1845...0x1847 are  
>> 0xC3 0xA5, which is valid UTF-8 for the u00E5 code point (latin small  
>> letter a with ring above).
>>
>> I also used Bixo (http://bixo.101tec.com) to crawl the same page, and  
>> wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a  
>> good test.
>>
>> Given what you've tried (in your initial email), I've only got one  
>> weak guess - that your tools are showing you stuff that isn't actually  
>> there.
>>
>




Re: Charset trouble, questionmarks

by Magnus Olstad Hansen-2

Oleg Kalnichevski wrote:
> If you only listened and produced a wire / context log, when I asked
> you, all this could have been found out much earlier, and I most
> likely would also have been able to tell why EntityUtils#toString
> failed to detect the charset.
>
> Oleg


Hello Oleg,

Sorry, I simply tried what seemed easiest to me first. Today is the
first day I've had time to look into this again.
A wire + context log should be attached to this mail. I hope it can
clarify what is going on.

PS! I'm pretty sure that EntityUtils.getContentCharset() returns the
right charset (UTF-8 in this case), so the confusing point is why
EntityUtils.toString() returns 0x3F for the Norwegian letters. As far as
I could tell from the sources, the charset from getContentCharset() has
top priority in toString()...

Thanks for helping,
Magnus










Attachment: log.txt.gz (48K)

Re: Charset trouble, questionmarks

by olegk

Magnus Olstad Hansen wrote:

>
> Hello Oleg,
>
> Sorry - I only tried what seemed easiest to me first here. Today is the
> first day I've had the time to look into this again.
> A wire + context log should be attached to this mail. Hope this can
> clearify what is going on.
>
> PS! I'm pretty sure that EntityUtils.getContentCharset() returns the
> right charset (UTF-8 in this case) - so the confusing point is why
> EntityUtils.toString() returns 0x3F for the norwegian letters. As far as
> I could tell from the sources the charset from getContentCharset() has
> top priority in toString()...
>
> Thanks for helping,
> Magnus
>

This is what I am seeing in the wire log:

---
FINE: << "[0x9][0x9]<li><a
href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisanett.vgb.no/">Skjermdump
(blogg)</a></li>[\n]"
FINE: << "[0x9][0x9]<li><a
href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue
p[0xc3][0xa5] veggen (blogg)</a></li>[\n]"
---

As far as I can tell, the content is correctly encoded as UTF-8.


This is a snippet of output produced using EntityUtils#toString
---
href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue
på veggen (blogg)</a></li>
---

As far as I can tell, the non-ASCII characters are decoded correctly.

Apparently you were printing the output of your application to a console
that simply could not handle non-ASCII characters.
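
One way to rule the console out when debugging this kind of problem is to print only ASCII. The sketch below (ConsoleSafe and escapeNonAscii are invented names, not part of HttpClient) renders every non-ASCII char as a backslash-u escape, so the true content of the string stays visible on any terminal.

```java
// Hypothetical debugging helper; makes string content visible on ASCII-only consoles.
final class ConsoleSafe {
    // Replaces each char above 0x7F with a backslash-u escape of its UTF-16 value.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }
}
```

Passing the response body through this helper before printing would show an escape for U+00E5 where the letter was decoded correctly, and a literal ? only if the string itself really contained one.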

This whole story was a non-issue from the very beginning.

Oleg


