Charset trouble, questionmarks
Hello,

I'm using HttpClient 4.0 to download a webpage the same way as shown in one of the examples. This is my method to return a webpage as a string:

    protected static String leechUrl(String url) throws IOException {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet(url);

        System.out.println("executing request " + httpget.getURI());

        // Create a response handler
        ResponseHandler<String> responseHandler = new BasicResponseHandler();
        String responseBody = httpclient.execute(httpget, responseHandler);

        // When HttpClient instance is no longer needed,
        // shut down the connection manager to ensure
        // immediate deallocation of all system resources
        httpclient.getConnectionManager().shutdown();
        return responseBody;
    }

However, the responseBody returned here contains ? (question marks) for all Norwegian characters (æøåÆØÅ) on the page. For example, if I try to dump "http://www.vg.no" I can find the following at line 107:

    <li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue *p?* veggen (blogg)</a></li>

...that question mark there should've been the character å. To be certain, I've compared it to the same page and line dumped with wget:

    <li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue på veggen (blogg)</a></li>

My question is simply what I need to do to keep the Norwegian letters intact? So far I've tried:
- Copying BasicResponseHandler and debugging to check that EntityUtils.getContentCharset() finds a reasonable charset; it does.
- Hacking EntityUtils.toString() to override both the detected and the default charset with "ISO-8859-1" and "UTF-8".
- Adding a header to the request with content-type and charset (which isn't really logical to add to a request, but I tried anyway).

All I've accomplished with this is to sometimes get two ?'s instead of one for the Norwegian letters. I also tried to dump the response as directly as I saw possible by using EntityUtils.toByteArray() and writing directly to a file. To my surprise I can see that the ?'s are still there, and via hexdump I can see that they are all 3F (question mark) - so it's in fact impossible to recover the Norwegian letters. They must have been replaced with a question mark somewhere.

Please advise, and a thousand thanks for reading my problem!

Regards,
Magnus
|
|
Re: Charset trouble, questionmarks
On Wed, Sep 02, 2009 at 10:22:16AM +0200, Magnus Olstad Hansen wrote:
> My question is simply what I need to do to keep the Norwegian letters
> intact?

Hi Magnus,

Please use this guide to generate a wire / context log of the HTTP session and post it to this list.

http://hc.apache.org/httpcomponents-client/logging.html

Oleg

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@...
For additional commands, e-mail: httpclient-users-help@...
|
|
Re: Charset trouble, questionmarks
Hi Magnus,

On Sep 2, 2009, at 1:22am, Magnus Olstad Hansen wrote:
> My question is simply what I need to do to keep the Norwegian
> letters intact?

The basic problem is that determining the character set of a web page is complex, and not something that HttpClient is designed to handle.

If you check out (for example) the Nutch source, you'll see that it has a multi-step process, where it uses the Content-type in the response header, the meta http-equiv tag in the HTML, and low-level charset sniffing code to try to guess at the right encoding, but even then sometimes it gets it wrong.

In your case, the page looks pretty clean - all UTF-8.

But when you call httpclient.execute(httpget, responseHandler), the BasicResponseHandler will call EntityUtils.toString, and that in turn uses ISO-8859-1 as its default charset when converting the bytes it receives into the string.

So go one level deeper, and get the HttpEntity from the response, then try EntityUtils.toString(entity, "UTF-8") and see what you get back.

-- Ken
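To make the default-charset point concrete, here is a minimal JDK-only sketch (no HttpClient involved) that decodes the same three bytes, the UTF-8 encoding of "på", once as UTF-8 and once as ISO-8859-1. The class name DecodeDemo and its members are made up for illustration. Decoding as ISO-8859-1 yields two characters in place of å, which may be what lies behind the "two ?'s instead of one" symptom in the original post.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HttpClient.
final class DecodeDemo {
    // 'p' followed by 0xC3 0xA5, the UTF-8 encoding of U+00E5 (a with ring above).
    static final byte[] PAA_UTF8 = {0x70, (byte) 0xC3, (byte) 0xA5};

    // Decodes raw bytes using the charset the caller names explicitly.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String right = decode(PAA_UTF8, StandardCharsets.UTF_8);
        String wrong = decode(PAA_UTF8, StandardCharsets.ISO_8859_1);
        System.out.println(right.length()); // 2: 'p' and U+00E5
        System.out.println(wrong.length()); // 3: 'p', U+00C3 and U+00A5
    }
}
```

The bytes are identical in both calls; only the charset chosen at decode time differs.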
|
|
Re: Charset trouble, questionmarks
Hi Magnus,

I don't know exactly where you can plug this in, but this project helped me a lot with parsing non-ISO charsets:

http://jchardet.sourceforge.net/

Hope that helps,
Florent

Pingwy
27, rue des arènes
49100 Angers
|
|
Re: Charset trouble, questionmarks
Thanks for the reply, Ken.

> The basic problem is that determining the character set of a web page
> is complex, and not something that HttpClient is designed to handle.
>
> If you check out (for example) the Nutch source, you'll see that it
> has a multi-step process, where it uses the Content-type in the
> response header, the meta http-equiv tag in the HTML, and low-level
> charset sniffing code to try to guess at the right encoding, but even
> then sometimes it gets it wrong.

Yep - I know it's a complex matter - headers are not always specified, metas the same... :) However, if the page is actually in one specific charset, and one can force this charset to be used by HttpClient, it should all work in my opinion.

> In your case, the page looks pretty clean - all UTF-8.
>
> But when you call httpclient.execute(httpget, responseHandler), the
> BasicResponseHandler will call EntityUtils.toString, and that in turn
> uses ISO-8859-1 as its default charset when converting the bytes it
> receives into the string.

Well, I've read the source on that and it's not entirely what I saw. One of three possibilities is used for the charset variable within EntityUtils.toString(), in order of preference:
1) Result of EntityUtils.getContentCharset()
2) Default charset given as argument to EntityUtils.toString()
3) HttpClient-global HTTP.DEFAULT_CHARSET (probably not the correct name of the constant, but you get the point)

I tested EntityUtils.getContentCharset() separately, and it returns UTF-8 (as found in the Content-Type of the given page).

> So go one level deeper, and get the HttpEntity from the response, then
> try EntityUtils.toString(entity, "UTF-8") and see what you get back.

So the conclusion must be that I have already tried this, without luck ... ?

Regards,
Magnus
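The three-step order of preference described above can be sketched in plain Java. This is only an illustration of the fallback idea, not HttpCore's actual code: the class CharsetFallback, both method names, and the GLOBAL_DEFAULT constant are hypothetical, and the ISO-8859-1 value is taken from Ken's description of the default.

```java
// Hypothetical sketch of a charset fallback chain; not the HttpCore implementation.
final class CharsetFallback {
    // Assumed stand-in for the library-wide default mentioned in the thread.
    static final String GLOBAL_DEFAULT = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the parameter name "charset=".
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // 1) declared charset, 2) caller-supplied default, 3) global default.
    static String resolve(String contentType, String callerDefault) {
        String declared = charsetFromContentType(contentType);
        if (declared != null) {
            return declared;
        }
        return callerDefault != null ? callerDefault : GLOBAL_DEFAULT;
    }
}
```

With a header of "text/html; charset=UTF-8" the declared value wins regardless of the default passed in, which matches the observation that forcing a different default changed nothing useful.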
|
|
Re: Charset trouble, questionmarks
What about passing -Dfile.encoding=utf-8?

On Wed, Sep 2, 2009 at 10:58 AM, Magnus Olstad Hansen <magnus@...> wrote:
> > But when you call httpclient.execute(httpget, responseHandler), the
> > BasicResponseHandler will call EntityUtils.toString, and that in turn
> > uses ISO-8859-1 as its default charset when converting the bytes it
> > receives into the string.
|
|
Re: Charset trouble, questionmarks
On Wed, Sep 02, 2009 at 11:39:35AM -0400, NBW wrote:
> What about passing -Dfile.encoding=utf-8?

HttpClient does not use system properties (by design).

Oleg
|
|
Re: Charset trouble, questionmarks
No use of things like InputStreamReaders then I take it.

On Wed, Sep 2, 2009 at 11:41 AM, Oleg Kalnichevski <olegk@...> wrote:
> HttpClient does not use system properties (per design)
|
|
Re: Charset trouble, questionmarks
On Wed, Sep 02, 2009 at 11:54:42AM -0400, NBW wrote:
> No use of things like InputStreamReaders then I take it.

InputStreamReader is used by one utility method in HttpCore (EntityUtils#toString()). However, it uses an InputStreamReader constructor that explicitly takes the charset name to be used for byte to char conversion as a parameter.

Oleg
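As a rough illustration of that point, the JDK-only sketch below reads a stream through the InputStreamReader(InputStream, String) constructor, so the result depends on the charset name passed in rather than on -Dfile.encoding. The class ExplicitReader and the method readAll are hypothetical names; this is not the EntityUtils source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper; shows byte-to-char conversion with an explicit charset name.
final class ExplicitReader {
    // Reads the whole stream, decoding with the named charset instead of
    // the platform default.
    static String readAll(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {0x70, (byte) 0xC3, (byte) 0xA5}; // "på" encoded as UTF-8
        String decoded = readAll(new ByteArrayInputStream(raw), "UTF-8");
        System.out.println(decoded.length()); // 2
    }
}
```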
|
|
Re: Charset trouble, questionmarks
Hi Magnus,

I used curl to grab the file, and the bytes at 0x1845...0x1847 are 0xC3 0xA5, which is valid UTF-8 for the U+00E5 code point (Latin small letter a with ring above).

I also used Bixo (http://bixo.101tec.com) to crawl the same page, and wound up with the same raw data. Bixo uses HttpClient 4.0, so it's a good test.

Given what you've tried (in your initial email), I've only got one weak guess - that your tools are showing you stuff that isn't actually there.

If you use HttpClient to dump the string and the byte array to files, and look at the files with a real, honest-to-gosh hex editor, do you still see 0x3F at offset 0x1845, or 0xC3 0xA5?

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
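The byte values quoted above can be checked without any external tool. The following is a small sketch, using only the JDK, that prints the UTF-8 encoding of "på" as hex; HexCheck and toHex are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for eyeballing raw bytes, similar in spirit to a hex dump.
final class HexCheck {
    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+00E5 is written as an escape so the source file encoding does not matter.
        byte[] utf8 = "p\u00e5".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // 70 c3 a5
    }
}
```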
|
|
Re: Charset trouble, questionmarks
Hello again Ken,

Sorry to lag behind on the replies - work is busy these days... :)

Seems you're right. I've made a custom ResponseHandler class to be able to dump the raw output from HttpClient. However, I'd used FileWriter/BufferedWriter to dump to my file. This must have tried to interpret the charset as well, causing the bothersome 0x3F's mentioned before.

Your tip about another HttpClient app returning the content successfully caused me to look at my method again - and I made the output via FileOutputStream.write(byte[]) instead. Using hexdump as before I can now confirm that there's no longer a 0x3F but 0xC3 0xA5 as it should be.

(...from wget)
# hexdump -s 0x1845 -C index.html | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

(...from my dump)
# hexdump -s 0x1845 -C raw.txt | head -n 2
00001845  70 c3 a5 20 76 65 67 67  65 6e 20 28 62 6c 6f 67  |p.. veggen (blog|
00001855  67 29 3c 2f 61 3e 3c 2f  6c 69 3e 0a 09 09 3c 6c  |g)</a></li>...<l|

What remains a mystery to me, however, is why the string returned from HttpClient.execute() and thus EntityUtils.toString(Entity) does not represent the letters correctly. I also tested this with the BasicResponseHandler to be sure it was nothing I'd done.

At least now I can use my custom ResponseHandler and figure out how to treat the intact byte array correctly. So thanks a lot! :)
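The difference between the two dump approaches can be reproduced in memory. The sketch below is an assumption-laden illustration rather than the code from the post: it uses an OutputStreamWriter with an explicit charset name in place of FileWriter (which picks up the platform default), and DumpDemo, dumpRaw and dumpViaWriter are made-up names. A writer whose charset cannot represent å emits 0x3F, while a raw byte copy leaves 0xC3 0xA5 untouched.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo of byte-level versus character-level dumping.
final class DumpDemo {
    // Copies the bytes unchanged, as FileOutputStream.write(byte[]) would.
    static byte[] dumpRaw(byte[] body) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        sink.write(body, 0, body.length);
        return sink.toByteArray();
    }

    // Encodes text through a Writer; characters the charset cannot
    // represent are replaced during encoding.
    static byte[] dumpViaWriter(String text, String charsetName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charsetName)) {
            writer.write(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] ascii = dumpViaWriter("p\u00e5", "US-ASCII");
        System.out.printf("%02x %02x%n", ascii[0], ascii[1]); // 70 3f
    }
}
```

If the platform default charset at the time was one that cannot encode å, a FileWriter-based dump would show the same 0x3F replacement; that is a plausible reading of the symptom, not something the thread confirms.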
|
|
|
Re: Charset trouble, questionmarks
MaGGE wrote:
> What remains a mystery to me is, however, why the string returned from
> HttpClient.execute() and thus EntityUtils.toString(Entity) does not
> represent the letters correctly. I also tested this with the
> BasicResponseHandler to be sure it was nothing I'd done.
>
> Atleast now I can use my custom ResponseHandler and figure out how to treat
> the intact byte-array correctly. So thanks a lot! :)

If you had only listened and produced a wire / context log when I asked you, all this could have been found out much earlier, and I most likely would also have been able to tell why EntityUtils#toString failed to detect the charset.

Oleg
|
|
Re: Charset trouble, questionmarks
Oleg Kalnichevski wrote:
> If you only listened and produced a wire / context log, when I asked
> you, all this could have been found out much earlier, and I most
> likely would also have been able to tell why EntityUtils#toString
> failed to detect the charset.
>
> Oleg

Hello Oleg,

Sorry - I only tried what seemed easiest to me first here. Today is the first day I've had the time to look into this again.
A wire + context log should be attached to this mail. Hope this can clarify what is going on.

PS! I'm pretty sure that EntityUtils.getContentCharset() returns the right charset (UTF-8 in this case) - so the confusing point is why EntityUtils.toString() returns 0x3F for the Norwegian letters. As far as I could tell from the sources, the charset from getContentCharset() has top priority in toString()...

Thanks for helping,
Magnus
|
|
Re: Charset trouble, questionmarks
Magnus Olstad Hansen wrote:
> Hello Oleg,
>
> Sorry - I only tried what seemed easiest to me first here. Today is the
> first day I've had the time to look into this again.
> A wire + context log should be attached to this mail. Hope this can
> clearify what is going on.
>
> PS! I'm pretty sure that EntityUtils.getContentCharset() returns the
> right charset (UTF-8 in this case) - so the confusing point is why
> EntityUtils.toString() returns 0x3F for the norwegian letters. As far as
> I could tell from the sources the charset from getContentCharset() has
> top priority in toString()...
>
> Thanks for helping,
> Magnus

That's what I am seeing in the wire log:

---
FINE: << "[0x9][0x9]<li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisanett.vgb.no/">Skjermdump (blogg)</a></li>[\n]"
FINE: << "[0x9][0x9]<li><a href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue p[0xc3][0xa5] veggen (blogg)</a></li>[\n]"
---

As far as I can tell the content is correctly encoded as UTF-8.

This is a snippet of output produced using EntityUtils#toString:

---
href="http://go.vg.no/cgi-bin/go.cgi/meny/http://elisabeth.vgb.no/">Frue på veggen (blogg)</a></li>
---

As far as I can tell non-ASCII characters appear correctly decoded.

Apparently you were printing the output of your application to a console that simply could not handle non-ASCII characters. This whole story was a non-issue from the very beginning.

Oleg
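For readers who hit the same symptom, one way to test the console explanation is to compare what a PrintStream emits under different encodings. This is a sketch using only the JDK; ConsoleDemo and printed are hypothetical names, and whether a terminal then renders UTF-8 correctly depends on the terminal itself, which Java cannot control.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo: the same String produces different bytes depending on
// the encoding the PrintStream was created with.
final class ConsoleDemo {
    // Captures the bytes a PrintStream with the given encoding would send
    // to its underlying stream for this text.
    static byte[] printed(String text, String encoding) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream stream = new PrintStream(sink, true, encoding);
        stream.print(text);
        stream.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out so the output bytes are UTF-8 regardless of the default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("Frue p\u00e5 veggen (blogg)");
    }
}
```

Calling printed("p\u00e5", "US-ASCII") yields the bytes 70 3f, so a correctly decoded String can still show up as "p?" once it passes through an output stream with a limited encoding.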