Html parser questions

Hi all,
OK - my inline image got stripped out, but at least the email went through. Here's the HTML again; let's see if this works.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Untitled</title>
<base href="http://newdomain.com">
</head>
<body>
<a href="link" target="_blank">link1</a>
<a href="http://domain.com/link" target="_blank">link2</a>
</body>
</html>

-- Ken

========================================================================

[Resending with an image instead of the HTML example - previous attempt was rejected by Apache.org as being spam...weird]

I'm doing a comparison of the Tika HtmlParser with the original Nutch HTML parsing code. I've run into some issues, and wanted input before filing any Jira requests/bugs.

Using the HTML above as the test document:

1. The handler's startElement() never gets called with the <base> tag. I'm assuming this is because <base> isn't part of the SAFE_ELEMENTS set.

But without the base tag, you can't correctly resolve relative URLs in anchor tags.

Seems like <base> should be part of the SAFE_ELEMENTS set. How was this set of tags derived?

2. The handler's characters() method gets called with the following text:

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before endElement("body") is called, and this is unexpected.

From the offset in the buffer passed to characters(), these are the returns _after_ the </body> tag. If I put any number of returns in between the </body> and </html>, they all get passed to characters() before the endElement("body") call. This seems like a bug.

Has anybody else noticed this?

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
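For reference, the relative-link resolution that point 1 depends on can be sketched with the JDK alone. This is only an illustration and not Tika code: the class and method names (BaseHrefResolver, resolve) are made up for the example, and the trailing-slash workaround assumes the base href carries no query or fragment.

```java
import java.net.URI;

final class BaseHrefResolver {
    // Resolve an href against the document's <base href="...">.
    // Hypothetical helper for illustration; not part of Tika or Nutch.
    static String resolve(String baseHref, String href) {
        URI base = URI.create(baseHref);
        // java.net.URI joins paths incorrectly when the base has an
        // authority but an empty path ("http://newdomain.com"), so give
        // it a root path first. Assumes no query/fragment on the base.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = URI.create(baseHref + "/");
        }
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://newdomain.com";
        // Relative link from the test document
        System.out.println(resolve(base, "link"));                   // http://newdomain.com/link
        // Absolute link is returned unchanged
        System.out.println(resolve(base, "http://domain.com/link")); // http://domain.com/link
    }
}
```

Without the <base> value, a client would presumably resolve "link" against the URL the page was fetched from, which is why dropping the tag changes the result.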

Re: Html parser questions

Hi,
On Fri, Sep 25, 2009 at 8:21 PM, Ken Krugler <kkrugler_lists@...> wrote:
> 1. The handler's startElement() never gets called with the <base> tag. I'm
> assuming this is because <base> isn't part of the SAFE_ELEMENTS set.
>
> But without the base tag, you can't correctly resolve relative URLs in
> anchor tags.
>
> Seems like <base> should be part of the SAFE_ELEMENTS set.

Instead of passing the <base> element to the client (and thus requiring it to keep track of it to correctly resolve local links), I'd rather use it inside Tika to automatically turn local links to absolute URLs in the parse output. Such a mechanism could also leverage out-of-band information like a base URL given in the RESOURCE_NAME_KEY metadata entry.

> How was this set of tags derived?

The purpose of the SAFE_ELEMENTS set is to only pass out tags that may be useful in inferring the semantic structure of the incoming HTML document and to ensure that the output conforms to XHTML 1.0 Strict.

Note that in the future we may even start doing something like limited JavaScript and CSS processing to "render" the incoming page to better decide what outgoing XHTML structure best matches the content seen by someone reading the page on a browser. This will help general purpose Tika-based web crawlers to avoid SEO tricks like "display: none;" or JavaScript DOM reorganization. So as a general rule clients should not assume a one-to-one mapping between the HTML input and XHTML output tags in Tika.

> 2. The handler's characters() method gets called with the following text
>
> Untitled
> \n\n
> link1
> \n
> link2
> \n\n
> \n
> \n
>
> The first six calls make sense to me.
>
> The last two calls (with a single \n) happen just before endElement("body")
> is called, and this is unexpected.
>
> From the offset in the buffer, passed to characters(), these are the return
> _after_ the </body> tag. If I put any number of returns in between the
> </body> and </html>, they all get passed to characters() before the
> endElement("body") call. This seems like a bug.
>
> Has anybody else noticed this?

No, but you're right in that it seems like a bug.

BR,

Jukka Zitting
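The whitelist idea described above can be sketched with a plain SAX handler from the JDK. This is an assumption-laden illustration, not Tika's implementation: the SAFE set below is a made-up subset, the class name is hypothetical, and the input is well-formed XHTML because the JDK parser (unlike Tika's HTML parser) does not accept tag soup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SafeElementFilterSketch {
    // Hypothetical whitelist for illustration; NOT Tika's actual SAFE_ELEMENTS.
    static final Set<String> SAFE =
            Set.of("html", "head", "title", "body", "a", "p", "h1", "h2", "h3");

    // Parse the document and return only the element events that pass the whitelist.
    static List<String> events(String xhtml) {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (SAFE.contains(qName)) {
                    out.add("start:" + qName);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (SAFE.contains(qName)) {
                    out.add("end:" + qName);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>Untitled</title>"
                + "<base href=\"http://newdomain.com\"/></head>"
                + "<body><a href=\"link\">link1</a></body></html>";
        // <base> never shows up in the output because it is not whitelisted.
        events(doc).forEach(System.out::println);
    }
}
```

A client sitting behind such a filter sees <a> but never <base>, which matches the behavior reported in point 1.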

Re: Html parser questions

Hi Jukka,
On Sep 27, 2009, at 3:31am, Jukka Zitting wrote:
> On Fri, Sep 25, 2009 at 8:21 PM, Ken Krugler <kkrugler_lists@...> wrote:
>> 1. The handler's startElement() never gets called with the <base> tag. I'm
>> assuming this is because <base> isn't part of the SAFE_ELEMENTS set.
>>
>> But without the base tag, you can't correctly resolve relative URLs in
>> anchor tags.
>>
>> Seems like <base> should be part of the SAFE_ELEMENTS set.
>
> Instead of passing the <base> element to the client (and thus
> requiring it to keep track of it to correctly resolve local links),
> I'd rather use it inside Tika to automatically turn local links to
> absolute URLs in the parse output. Such a mechanism could also
> leverage out-of-band information like a base URL given in the
> RESOURCE_NAME_KEY metadata entry.

Makes sense. I'll file an issue and create a patch.

>> How was this set of tags derived?
>
> The purpose of the SAFE_ELEMENTS set is to only pass out tags that may
> be useful in inferring the semantic structure of the incoming HTML
> document and to ensure that the output conforms to XHTML 1.0 Strict.
>
> Note that in the future we may even start doing something like limited
> JavaScript and CSS processing to "render" the incoming page to better
> decide what outgoing XHTML structure best matches the content seen by
> someone reading the page on a browser. This will help general purpose
> Tika-based web crawlers to avoid SEO tricks like "display: none;" or
> JavaScript DOM reorganization. So as a general rule clients should not
> assume a one-to-one mapping between the HTML input and XHTML output
> tags in Tika.

So recently I had to derive URLs from a web page, where the information I needed was in the attribute of a <span> element.

It sounds like the direction Tika is going would mean I couldn't use it for this use case.

I understand why the output might not match the input, which is fine. But what's the benefit of stripping arbitrary tags?

And given the heavy use of CSS/Ajax, how is it possible to decide what might or might not be useful for determining "semantic structure"? E.g. "ignore footers" implies layout, and layout is often done using CSS.

>> 2. The handler's characters() method gets called with the following text
>>
>> Untitled
>> \n\n
>> link1
>> \n
>> link2
>> \n\n
>> \n
>> \n
>>
>> The first six calls make sense to me.
>>
>> The last two calls (with a single \n) happen just before endElement("body")
>> is called, and this is unexpected.
>>
>> From the offset in the buffer, passed to characters(), these are the return
>> _after_ the </body> tag. If I put any number of returns in between the
>> </body> and </html>, they all get passed to characters() before the
>> endElement("body") call. This seems like a bug.
>>
>> Has anybody else noticed this?
>
> No, but you're right in that it seems like a bug.

OK, I'll file a Jira issue.

Thanks,

-- Ken
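For comparison when writing up the Jira issue for point 2, the event order a strict XML SAX parser produces can be recorded with the JDK alone. This is a sketch under stated assumptions: the class name is made up, the input is well-formed XHTML rather than the HTML 4.01 test document, and it shows the JDK parser's behavior, not Tika's.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventOrderSketch {
    // Record every element and text event in the order the parser reports them.
    static List<String> record(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Make newlines visible, as in the listing in the original post.
                events.add("chars:" + new String(ch, start, length).replace("\n", "\\n"));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Two returns sit between </body> and </html>.
        String xml = "<html><body><a href=\"link\">link1</a></body>\n\n</html>";
        record(xml).forEach(System.out::println);
    }
}
```

Here the whitespace after </body> is reported after endElement("body"), which is the order the original post expected from HtmlParser as well.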

Re: Html parser questions

Hi Jukka,
After writing this email, while on my morning jog, I realized that you were right about the issue of "tag fidelity".

It makes total sense for Tika to manipulate tags to provide a better (more consistent/accurate) representation of the document.

In my use case, I depended on getting back exactly the tag data that was in the original document.

Since these two are in opposition, it's valid and appropriate for Tika to significantly restrict the set of returned tags to those that arguably help with the abstract representation.

Where it can get very fuzzy is with things like <b> tags - e.g. if you pass through an <h3>, why not <b>, since these are often used interchangeably? Or should Tika look for the case of:

<b>some text</b>

and remap it to use <h3> tags?

In any case, I'm going to file an issue w/a patch for AutoDetectParser to have an alternative constructor that takes a Map<class, Parser> so I can explicitly override HtmlParser with my version, to handle cases like <b>.

-- Ken
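The override-by-map constructor proposed above could look roughly like the following. Everything here is hypothetical and self-contained: the Parser interface, both parser classes, and AutoParser are stand-ins invented for the sketch and are not Tika's AutoDetectParser or Parser API, and keying the map by the parser class being replaced is only one reading of "Map<class, Parser>".

```java
import java.util.HashMap;
import java.util.Map;

final class ParserOverrideSketch {
    /** Hypothetical stand-in for a parser; not Tika's Parser interface. */
    interface Parser {
        String parse(String input);
    }

    /** Stand-in for the stock HTML parser. */
    static final class DefaultHtmlParser implements Parser {
        @Override
        public String parse(String input) {
            return "default:" + input;
        }
    }

    /** Stand-in for a custom parser that would also pass through <b>. */
    static final class BoldAwareHtmlParser implements Parser {
        @Override
        public String parse(String input) {
            return "custom:" + input;
        }
    }

    /** Hypothetical auto-detecting parser with an override constructor. */
    static final class AutoParser {
        private final Map<Class<? extends Parser>, Parser> parsers = new HashMap<>();

        AutoParser() {
            parsers.put(DefaultHtmlParser.class, new DefaultHtmlParser());
        }

        // Entries in 'overrides' replace the default registered under the same key.
        AutoParser(Map<Class<? extends Parser>, Parser> overrides) {
            this();
            parsers.putAll(overrides);
        }

        String parseHtml(String input) {
            return parsers.get(DefaultHtmlParser.class).parse(input);
        }
    }

    public static void main(String[] args) {
        Map<Class<? extends Parser>, Parser> overrides = new HashMap<>();
        overrides.put(DefaultHtmlParser.class, new BoldAwareHtmlParser());

        System.out.println(new AutoParser().parseHtml("<b>x</b>"));          // default:<b>x</b>
        System.out.println(new AutoParser(overrides).parseHtml("<b>x</b>")); // custom:<b>x</b>
    }
}
```

The point of the design is that callers keep the auto-detection logic and swap out only the one parser they need to customize.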