char encoding


by Fadzi Ushewokunze-2

Hi there,

I am having issues with the HTMLParser failing to detect the character
encoding, so lots of non-alphanumeric characters end up as "?".

I don't have any requirement for special characters; I am happy with
the usual UTF-8.

Any suggestions on the best way to configure this correctly? Everything
seems OK looking at the code, so I'm not sure what's missing.

Thanks.

 


Re: char encoding

by Aaron Binns


Are you sure that the HTMLParser is decoding the page incorrectly?

I've seen Nutch deployments where the characters are correctly decoded
by the HTMLParser and are correct in the Lucene index, but then the
webapp is misconfigured such that they are not displayed correctly on
the search results page.

You can use the Lucene toolkit "luke" to open the index and examine
the contents.  If the stuff in the index is good, then the problem
is not the HTMLParser.


Regards,

Aaron

--
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aaron@...

Re: char encoding

by Fadzi Ushewokunze-2

Hi,

It's also in the index. Debugging the content extracted during the fetch
phase also shows words like "someone's" being fetched as "someone?s"; it
seems to be mostly to do with the apostrophe character.

So it's not the webapp at all.

Any suggestions for forcing the parser to recognise UTF-8?


On Thu, 2009-10-29 at 16:08 -0700, Aaron Binns wrote:

> Are you sure that the HTMLParser is decoding the page incorrectly?


Re: char encoding

by Ken Krugler

As Aaron noted, this is often a problem with the search results page
using an incorrect encoding.

I've also seen cases where the pages in question were not tagged
appropriately - e.g. there's a meta tag in the HTML that specifies the
wrong encoding.

I think browsers like Firefox trust this info less than Nutch does, so
they do more "sniffing" to determine whether the specified encoding is
wrong, and ignore it if so.
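
One inexpensive kind of sniffing is to check whether the bytes are
well-formed UTF-8 at all. The Java sketch below only illustrates that
idea; the class and method names are hypothetical, and it is not what
Firefox or Nutch actually does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
class Utf8Check {
    /** Returns true if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A page that is tagged as UTF-8 but fails this check is probably
mis-tagged. The reverse is weaker evidence: pure ASCII passes the check
whatever the page claims to be.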

-- Ken

On Oct 29, 2009, at 4:05pm, Fadzi Ushewokunze wrote:

> hi there,
>
> i am having issues with the HTMLParser failing to detect the char
> encoding. so lots of non alpha-numeric chars end up as "?" ;

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


RE: char encoding

by Funtick

Is it "?" or "¿" (inverted question mark)?

I ask because "¿" is the replacement for character codes that have no
representation in a specific encoding scheme. You may get it, for
instance, if the binary stream is UTF-8 encoded and Nutch treats it as
Windows-1252: all bytes that have no representation in Windows-1252 will
be shown as "¿".
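
To check which symptom you actually have, a small Java sketch like the
one below may help. The class and method names are made up for
illustration and are not Nutch code, and it assumes the problem
character is a curly apostrophe (U+2019), as in "someone’s". It decodes
the same UTF-8 bytes with two charsets, and separately shows how a
literal "?" can appear when text is encoded into a charset that lacks
the character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class CharsetMismatchDemo {
    /** Decodes the given bytes with the named charset. */
    static String decodeAs(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Encodes text to ISO-8859-1 and back; unmappable characters become '?'. */
    static String roundTripLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "someone\u2019s"; // curly apostrophe
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the apostrophe survives.
        System.out.println(decodeAs(utf8, "UTF-8"));
        // Wrong charset: the three UTF-8 bytes become three separate characters.
        System.out.println(decodeAs(utf8, "windows-1252"));
        // Encoding into a charset without U+2019 substitutes '?'.
        System.out.println(roundTripLatin1(original));
    }
}
```

In this sketch the wrong-charset decode produces mojibake (three stray
characters in place of the apostrophe), while the literal "?" comes from
the encoding step, so seeing "?" may point at a later conversion as well
as at the parser.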

Nutch makes a best effort; however, it can't dedicate CPU to this the
way browsers do. I agree with Ken: browsers may fully ignore the
headers/meta and sniff and analyze the byte array to find the correct
encoding (for instance, when the byte stream is UTF-8 but the HTTP
header or meta tag says windows-1252). Nutch can't do that (it requires
a lot of CPU).

Windows-1252 is the default scheme for the html-parser when Nutch can't
find a correct HTTP header or META tag.


From HtmlParser API:
   * We need to do something similar to what's done by mozilla
   * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
   * See also http://www.w3.org/TR/REC-xml/#sec-guessing


private static String sniffCharacterEncoding(byte[] content) {...}

- it doesn't currently use HTTP Headers.
- it tries to find META tag in first 2000 bytes.


So, for instance, some unusual sites (such as AJAX-heavy pages or
portals) may have a lot of generated JavaScript before the META tag, and
2000 bytes could be too few.
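
To make the 2000-byte limitation concrete, here is a rough Java sketch
of META sniffing over a fixed window. It is not the actual
sniffCharacterEncoding implementation: the class name, the regular
expression and the window parameter are all assumptions made for
illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Nutch HtmlParser code.
class MetaCharsetSniffer {
    // Matches charset=... inside a <meta ...> tag, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a META tag within the first `window` bytes, or null. */
    static String sniff(byte[] content, int window) {
        int length = Math.min(content.length, window);
        // Tag syntax and charset names are ASCII, so a single-byte decode is enough here.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

With a window of 2000, a page that puts 3000 bytes of script ahead of
its META tag yields null, and the caller falls back to other clues or
to the default; the same page sniffed with a larger window returns the
declared charset.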

Then, EncodingDetector is called:
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

- but it doesn't make sense...


  public String guessEncoding(Content content, String defaultValue) {
    /*
     * This algorithm could be replaced by something more sophisticated;
     * ideally we would gather a bunch of data on where various clues
     * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
     * the correct answer, and use machine learning/some statistical method
     * to generate a better heuristic.
     */



That is still on the TODO list... As a workaround, please check that,
for this site, the META tag can be found in the first 2000 bytes.



-Fuad
http://www.linkedin.com/in/liferay






RE: char encoding

by Funtick

> > i dont have any special requirement for any special characters, i am
> > happy with usual utf-8
> >
> > any suggestion on the best way to configure this correctly; everything
> > seems quite ok looking at the code not sure whats missing.


Try setting UTF-8 in the configuration file:
parser.character.encoding.default = UTF-8
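
As a sketch, and assuming the override goes in conf/nutch-site.xml
(where overrides of nutch-default.xml usually live), the entry would
look something like this:

```xml
<property>
  <name>parser.character.encoding.default</name>
  <value>UTF-8</value>
</property>
```

This only changes the fallback value; the exact file location may
differ in your install.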



> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@...]
> Sent: October-29-09 8:19 PM
> To: nutch-user@...; fadzi@...
> Subject: RE: char encoding
>
> Is it "?" or "¿" (Inverted Question Mark)?




RE: char encoding

by Fadzi Ushewokunze-2

Interesting - I will try this and let you know, because it was set to
the Windows encoding (why on earth!?).



On Thu, 2009-10-29 at 20:35 -0400, Fuad Efendi wrote:

> Try to set UTF-8 in configuration file:
> parser.character.encoding.default = UTF-8


Re: char encoding

by Ken Krugler


On Oct 30, 2009, at 12:26am, Fadzi Ushewokunze wrote:

> interesting - i will try this and let you know because it was set to
> windows encoding (why on earth!?)

Because this encoding is only used when neither the server nor the  
HTML meta specifies an encoding.

And in those situations, the most common encoding being used is CP-1252.
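
The fallback order described here can be sketched in Java as below.
This is a deliberately simplified priority scheme assumed for
illustration (HTTP header first, then META tag, then the configured
default); the class and method names are hypothetical, and Nutch's own
EncodingDetector weighs its clues differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a "first clue wins" fallback, not Nutch code.
class EncodingFallback {
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Extracts the charset parameter from a Content-Type header value, or null. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Picks the first available clue: HTTP header, then META tag, then the default. */
    static String choose(String contentTypeHeader, String metaCharset, String defaultCharset) {
        String fromHeader = charsetFromContentType(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        if (metaCharset != null) {
            return metaCharset;
        }
        // Only reached when neither the server nor the page declares an encoding.
        return defaultCharset;
    }
}
```

For example, choose("text/html", null, "windows-1252") falls through to
the default, which is the only case where changing the default to UTF-8
has any effect.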

-- Ken



--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: char encoding

by Fadzi Ushewokunze-2

Thanks, I thought it would default to something like this.

Anyway, changing the default to UTF-8 seems to do the trick for what I
need.



On Fri, 2009-10-30 at 05:11 -0700, Ken Krugler wrote:

> On Oct 30, 2009, at 12:26am, Fadzi Ushewokunze wrote:
>
> > interesting - i will try this and let you know because it was set to
> > windows encoding (why on earth!?)
>
> Because this encoding is only used when neither the server nor the  
> HTML meta specifies an encoding.
>
> And in those situations, the most common encoding being used is CP-1252.