char encoding

Hi there,

I am having issues with the HTMLParser failing to detect the character encoding, so lots of non-alphanumeric characters end up as "?".

I don't have any requirement for special characters; I am happy with plain UTF-8.

Any suggestions on the best way to configure this correctly? Everything seems OK looking at the code, so I'm not sure what's missing.

Thanks.
|
|
Re: char encoding

Are you sure that the HTMLParser is decoding the page incorrectly?

I've seen Nutch deployments where the characters are correctly decoded by the HTMLParser and are correct in the Lucene index, but then the webapp is misconfigured such that they are not displayed correctly on the search results page.

You can use the Lucene toolkit "luke" to open the index and examine the contents. If the stuff in the index is good, then the problem is not the HTMLParser.

Regards,

Aaron

--
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aaron@...
|
|
Re: char encoding

Hi,

It's also in the index. Debugging the content extracted during the fetch phase also shows words like "someone's" being fetched as "someone?s"; it seems to be mostly to do with the apostrophe character, so it's not the webapp at all.

Any suggestions to maybe force the parser to recognise UTF-8?

On Thu, 2009-10-29 at 16:08 -0700, Aaron Binns wrote:
> Are you sure that the HTMLParser is decoding the page incorrectly?
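One way a typographic apostrophe can turn into "?" can be sketched in plain Java. This is only an illustration, not a diagnosis of the crawl above: the class and method names are hypothetical, the sample bytes are made up, and it assumes the JDK ships the `windows-1252` charset (standard builds do). It decodes a byte sequence containing `0x92` (the curly apostrophe in windows-1252) two ways, then pushes the text through a charset that cannot represent the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ApostropheDemo {

    /** Decodes raw bytes with the named charset, as a parser would after choosing an encoding. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Encodes text to ISO-8859-1 and reads it back, mimicking a lossy output or storage step. */
    static String throughLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "someone" + 0x92 + "s"; 0x92 is U+2019 (right single quotation mark) in windows-1252.
        byte[] page = {0x73, 0x6F, 0x6D, 0x65, 0x6F, 0x6E, 0x65, (byte) 0x92, 0x73};

        // Correct guess: the apostrophe survives decoding as U+2019.
        String asCp1252 = decode(page, "windows-1252");
        // Wrong guess: a lone 0x92 is malformed UTF-8 and becomes U+FFFD.
        String asUtf8 = decode(page, "UTF-8");

        // Neither U+2019 nor U+FFFD exists in ISO-8859-1, so both are written as '?'.
        System.out.println(throughLatin1(asCp1252)); // someone?s
        System.out.println(throughLatin1(asUtf8));   // someone?s
    }
}
```

Because both paths end in the same "?", the symptom alone does not say which step lost the character; inspecting the raw fetched bytes is what separates a decoding problem from an output problem.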
|
|
Re: char encoding

As Aaron noted, this is often a problem with the search results page using an incorrect encoding.

I've also seen cases where the pages in question were not tagged appropriately, e.g. there's a meta tag in the HTML that specifies the wrong encoding. I think browsers like Firefox trust this info less than Nutch, so they do more "sniffing" to determine if the specified encoding is wrong, and ignore it if so.

-- Ken

On Oct 29, 2009, at 4:05pm, Fadzi Ushewokunze wrote:

> i am having issues with the HTMLParser failing to detect the char
> encoding. so lots of non alpha-numeric chars end up as "?" ;

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
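A very small part of the "sniffing" idea can be sketched with the JDK's strict decoder: try the declared encoding and see whether the bytes are even valid in it. This is an illustrative assumption, not what Firefox or Nutch actually does; `DeclaredCharsetCheck` and `decodesCleanly` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class DeclaredCharsetCheck {

    /**
     * Returns true if the bytes decode without errors in the declared charset.
     * Charset.forName throws an unchecked exception for unknown charset names.
     */
    static boolean decodesCleanly(byte[] content, String declaredCharset) {
        try {
            Charset.forName(declaredCharset)
                .newDecoder()
                // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 's', 0x92, 's': fine as windows-1252, malformed as UTF-8.
        byte[] cp1252Bytes = {0x73, (byte) 0x92, 0x73};
        System.out.println(decodesCleanly(cp1252Bytes, "UTF-8"));        // false
        System.out.println(decodesCleanly(cp1252Bytes, "windows-1252")); // true
    }
}
```

The check is mainly useful for UTF-8, which rejects most byte sequences that were written in another encoding. Single-byte encodings accept almost anything, so a clean decode there proves little.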
|
|
RE: char encoding

Is it "?" or "¿" (inverted question mark)?

Because "¿" is the replacement for character codes that have no representation in a specific encoding scheme; you may get it, for instance, if the binary stream is UTF-8 encoded and Nutch treats it as Windows-1252. All bytes that have no representation in Windows-1252 will be represented as "¿".

Nutch works on a best-effort basis; however, it can't dedicate CPU to this the way browsers do. I agree with Ken. Browsers may fully ignore headers/meta and sniff and analyze the byte array to find the correct encoding (for instance, if the byte stream is UTF-8 and HTTP/meta says windows-1252). Nutch can't do that (it requires a lot of CPU).

Windows-1252 is the default scheme for the HTML parser if Nutch can't find a correct HTTP/META...

From the HtmlParser API:

 * We need to do something similar to what's done by mozilla
 * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
 * See also http://www.w3.org/TR/REC-xml/#sec-guessing

private static String sniffCharacterEncoding(byte[] content) {...}

- it doesn't currently use HTTP headers.
- it tries to find the META tag in the first 2000 bytes.

So, for instance, some weird sites (such as AJAX/portals) may have a lot of generated JavaScript before the META tag; 2000 could be too small.

Then EncodingDetector is called:

detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

- but it doesn't make sense...

public String guessEncoding(Content content, String defaultValue) {
  /*
   * This algorithm could be replaced by something more sophisticated;
   * ideally we would gather a bunch of data on where various clues
   * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
   * the correct answer, and use machine learning/some statistical method
   * to generate a better heuristic.
   */

TODO list... As a workaround, please check for this site that META can be found in the first 2000 bytes...

-Fuad
http://www.linkedin.com/in/liferay

> -----Original Message-----
> From: Fadzi Ushewokunze [mailto:fadzi@...]
> Sent: October-29-09 7:05 PM
> To: nutch-user@...
> Subject: char encoding
>
> i am having issues with the HTMLParser failing to detect the char
> encoding. so lots of non alpha-numeric chars end up as "?" ;
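The 2000-byte limit mentioned above is easy to reproduce in isolation. The sketch below is not Nutch's `sniffCharacterEncoding`; it is a simplified stand-in with a hypothetical class name, a deliberately loose regular expression, and a made-up page, meant only to show how a META tag that sits behind a long script block falls outside the inspected window.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaSniffSketch {

    // Simplified pattern (an assumption); real-world META tags vary more than this handles.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a META tag within the first windowSize bytes, or null. */
    static String sniff(byte[] content, int windowSize) {
        int length = Math.min(content.length, windowSize);
        // The tag itself is ASCII, so an ASCII view of the prefix is enough to find it.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Build a page whose META tag comes after roughly 3000 bytes of script.
        StringBuilder page = new StringBuilder("<html><head><script>");
        while (page.length() < 3000) {
            page.append("var x = 1; ");
        }
        page.append("</script><meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=utf-8\"></head></html>");
        byte[] bytes = page.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println(sniff(bytes, 2000));         // null
        System.out.println(sniff(bytes, bytes.length)); // utf-8
    }
}
```

When the first call returns null for a real page, the parser has no META clue at all and the configured default encoding decides the outcome.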
|
|
RE: char encoding

> > i dont have any special requirement for any special characters, i am
> > happy with usual utf-8
> >
> > any suggestion on the best way to configure this correctly; everything
> > seems quite ok looking at the code not sure whats missing.

Try setting UTF-8 in the configuration file:

parser.character.encoding.default = UTF-8

> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@...]
> Sent: October-29-09 8:19 PM
> To: nutch-user@...; fadzi@...
> Subject: RE: char encoding
>
> Is it "?" or "¿" (Inverted Question Mark)?
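Nutch reads its settings from Hadoop-style XML files, so the `parser.character.encoding.default = UTF-8` suggestion would normally be written as a property override. The fragment below is a sketch that assumes the usual layout, where local overrides go in `conf/nutch-site.xml` and `conf/nutch-default.xml` is left untouched; check the file locations against your own installation.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Fallback encoding used when neither the HTTP headers nor a META tag
       yields a usable charset for a page. -->
  <property>
    <name>parser.character.encoding.default</name>
    <value>UTF-8</value>
  </property>
</configuration>
```

The setting only changes the fallback. Pages that declare an encoding, rightly or wrongly, are still parsed according to that declaration, and already-fetched content has to be parsed again before the change shows up in the index.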
|
|
RE: char encoding

Interesting - I will try this and let you know, because it was set to the Windows encoding (why on earth!?)

On Thu, 2009-10-29 at 20:35 -0400, Fuad Efendi wrote:
> Try to set UTF-8 in configuration file:
> parser.character.encoding.default = UTF-8
|
|
Re: char encoding

On Oct 30, 2009, at 12:26am, Fadzi Ushewokunze wrote:

> interesting - i will try this and let you know because it was set to
> windows encoding (why on earth!?)

Because this encoding is only used when neither the server nor the HTML meta specifies an encoding.

And in those situations, the most common encoding being used is CP-1252.

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
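The "only used when nothing else specifies an encoding" rule amounts to a fallback chain, sketched below. This is not Nutch's EncodingDetector: the class name, the method, and the header-before-META ordering are all assumptions made for illustration.

```java
public class CharsetFallbackSketch {

    /**
     * Picks the first non-blank candidate: the HTTP header value, then the META tag value,
     * and only then the configured default. The ordering is an assumption of this sketch.
     */
    static String resolve(String fromHttpHeader, String fromMetaTag, String configuredDefault) {
        if (fromHttpHeader != null && !fromHttpHeader.trim().isEmpty()) {
            return fromHttpHeader.trim();
        }
        if (fromMetaTag != null && !fromMetaTag.trim().isEmpty()) {
            return fromMetaTag.trim();
        }
        return configuredDefault;
    }

    public static void main(String[] args) {
        // The server declares a charset: the default is never consulted.
        System.out.println(resolve("ISO-8859-1", null, "windows-1252")); // ISO-8859-1
        // Only the META tag declares one.
        System.out.println(resolve(null, "utf-8", "windows-1252"));      // utf-8
        // Nothing is declared: the configured default decides.
        System.out.println(resolve(null, null, "windows-1252"));         // windows-1252
    }
}
```

Seen this way, switching the default to UTF-8 helps only with pages in the third case, which is consistent with the explanation above.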
|
|
Re: char encoding

Thanks, I thought it would default to something like this.

Anyway, changing the default to UTF-8 seems to do the trick for what I need.

On Fri, 2009-10-30 at 05:11 -0700, Ken Krugler wrote:
> Because this encoding is only used when neither the server nor the
> HTML meta specifies an encoding.
>
> And in those situations, the most common encoding being used is CP-1252.